Training from scratch on AMD's R9700

Picked up a Radeon AI PRO R9700 to train language models from scratch, partly because 32GB of VRAM for the price is hard to argue with, and partly because somebody has to actually use AMD for this or the software never gets better. Hope this helps the three other people considering the same thing.

Short version: ROCm is in far better shape than its reputation and almost everything I assumed about why turned out wrong.

The card:

GPU: Radeon AI PRO R9700 (RDNA4, Navi 48, gfx1201)
VRAM: 32GB GDDR6, 256-bit
Runtime: ROCm 6.3 via the PyTorch wheel
Box: Fedora 44, paired with an RTX 3070 Ti for comparison

The bring-up was the easy part

After a genuinely painful NVIDIA driver saga on the same box (Secure Boot, locally signing the kmod, MOK enrollment at the physical console, the works), I braced for ROCm to be worse. It was the opposite.

Two things make it easy on RDNA4.

The amdgpu kernel driver is mainline and Fedora-signed. No Secure Boot fight, no MOK, no signing your own module. The kernel already trusts it. That was ninety percent of the NVIDIA pain, just gone.

The PyTorch ROCm wheel bundles its own userspace runtime. No system ROCm install, no sudo, no HSA_OVERRIDE_GFX_VERSION incantation. The whole recipe is a fresh venv and one install:

# Python 3.12 (3.14 still lacks some wheels). uv, but plain venv + pip works too.
uv venv --python 3.12 --seed ~/venvs/rocm
source ~/venvs/rocm/bin/activate

# torch built against ROCm 6.3, then the usual training deps
uv pip install torch --index-url https://download.pytorch.org/whl/rocm6.3
uv pip install tokenizers numpy huggingface_hub

That pulls torch 2.9.1+rocm6.3. One check to confirm the card is really there:

python -c "import torch; print(torch.__version__, torch.cuda.is_available(), torch.cuda.get_device_name(0))"
# 2.9.1+rocm6.3 True AMD Radeon Graphics

Two things to notice. cuda.is_available() returns True, because ROCm masquerades as CUDA in the torch API, so most code written for CUDA runs unchanged. And the device name is the generic “AMD Radeon Graphics” rather than the model, which is cosmetic (a missing name database) and does not affect anything.
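Because ROCm answers to the CUDA API, is_available() alone cannot tell you which backend you are on. A minimal sketch for telling them apart, relying on torch.version.hip being a version string on ROCm builds and None otherwise. The helper name backend_name is mine, not a torch API, and the "no-torch" branch is only there so the snippet runs in a venv without torch:

```python
import importlib.util


def backend_name() -> str:
    """Return 'rocm', 'cuda', 'cpu', or 'no-torch' for this environment."""
    if importlib.util.find_spec("torch") is None:
        return "no-torch"
    import torch

    if not torch.cuda.is_available():
        return "cpu"
    # torch.version.hip is set on ROCm builds and None on CUDA builds.
    return "rocm" if torch.version.hip else "cuda"


print(backend_name())
```

Handy for the rare spot where code really does need to branch, such as skipping a CUDA-only kernel library.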

Then a ten-second smoke test that the math actually runs, forward and backward, in bf16:

import torch

# "cuda" is ROCm here. No code change from an NVIDIA box.
x = torch.randn(4096, 4096, device="cuda", dtype=torch.bfloat16)
w = torch.randn(4096, 4096, device="cuda", dtype=torch.bfloat16, requires_grad=True)
(x @ w).sum().backward()

print(w.grad.norm().item())   # a number => fwd + bwd run on gfx1201

From there, a full from-scratch Llama trained natively on gfx1201 with no overrides.

If you came in expecting a weekend of yak-shaving, budget an hour.

The sharp edges

Not all free. Three things cost me time:

TunableOp will hang you at step zero.

export PYTORCH_TUNABLEOP_ENABLED=1   # don't, at least not on an MoE

This is the first “free speedup” knob you will find suggested for ROCm. On a mixture-of-experts model it hung the GPU at step zero for an hour: 97 percent utilization, never finished a single step, no error. TunableOp tries to tune every GEMM shape it sees, and an MoE routes tokens to experts with variable per-call shapes, so it tunes forever. Dense static-shape models might tolerate it, but I would not turn it on without a clean before-and-after probe and a timeout. It fails silent, which is the worst way to fail.

hipBLASLt is unsupported on gfx1201.

export TORCH_BLAS_PREFER_HIPBLASLT=1   # no-op at best, silent stall at worst

Leave it off.

expandable_segments does nothing. You will see this on startup:

UserWarning: expandable_segments not supported on this platform

It is benign. The allocator just ignores the setting. (Also note PYTORCH_HIP_ALLOC_CONF is now deprecated in favor of PYTORCH_ALLOC_CONF.)

The lesson under all three: do not paste ROCm performance env vars off a forum without measuring. On gfx1201 the “obvious” knobs range from useless to a step-zero hang.
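The "clean before-and-after probe and a timeout" idea from the TunableOp item can be sketched with nothing but the standard library: run the same short probe with and without the env var, and give up if it does not finish. The probe command here is a stand-in one-liner so the snippet runs anywhere; in practice it would be a script of your own that does a few training steps and exits. The 120-second timeout is an arbitrary assumption.

```python
import os
import subprocess
import sys


def run_with_timeout(cmd, timeout_s, extra_env=None):
    """Run cmd; return (finished, returncode). finished is False on timeout."""
    env = {**os.environ, **(extra_env or {})}
    try:
        done = subprocess.run(cmd, env=env, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising.
        return False, None
    return True, done.returncode


# Stand-in probe (hypothetical): replace with a script that runs a few steps.
probe = [sys.executable, "-c", "print('probe finished')"]

baseline = run_with_timeout(probe, timeout_s=120)
tuned = run_with_timeout(
    probe, timeout_s=120, extra_env={"PYTORCH_TUNABLEOP_ENABLED": "1"}
)
print("baseline:", baseline, "tunableop:", tuned)
```

One caveat: a process wedged on the GPU may not die promptly when killed, so treat a timeout here as a signal to go look, not as a guaranteed cleanup.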

The env I keep set, and the ones I keep unset:

# harmless, basically a no-op on gfx1201, but it shuts up cleanly
export PYTORCH_ALLOC_CONF=expandable_segments:True

# leave these OFF on gfx1201:
#   PYTORCH_TUNABLEOP_ENABLED      -> step-zero hang on an MoE
#   TORCH_BLAS_PREFER_HIPBLASLT    -> unsupported

One more note: rocm-smi is not installed by default. You don’t really need it, because sysfs has what matters (adjust the card index if you have more than one GPU):

# utilization, percent
cat /sys/class/drm/card1/device/gpu_busy_percent

# power draw in microwatts (divide by 1e6 for watts)
cat /sys/class/drm/card1/device/hwmon/hwmon*/power1_average
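
If you would rather log those two numbers from inside a training script, here is a small Python sketch of the same reads. The function name read_gpu_stats is mine, and the demo builds a fake directory tree with made-up values so it runs on a machine without the card:

```python
import tempfile
from pathlib import Path


def read_gpu_stats(device_dir):
    """Return (busy_percent, watts) from an amdgpu sysfs device directory.

    Either value is None if the corresponding file is missing.
    """
    device = Path(device_dir)
    busy_file = device / "gpu_busy_percent"
    busy = int(busy_file.read_text()) if busy_file.exists() else None

    # The hwmon index is not stable, so glob instead of hard-coding it.
    power_files = sorted(device.glob("hwmon/hwmon*/power1_average"))
    watts = int(power_files[0].read_text()) / 1e6 if power_files else None
    return busy, watts


# Fake sysfs tree with made-up values, only to show the call shape.
with tempfile.TemporaryDirectory() as fake:
    hwmon = Path(fake) / "hwmon" / "hwmon3"
    hwmon.mkdir(parents=True)
    (Path(fake) / "gpu_busy_percent").write_text("97\n")
    (hwmon / "power1_average").write_text("210000000\n")
    print(read_gpu_stats(fake))
```

On the real box the argument would be /sys/class/drm/card1/device, with the same card-index caveat as above.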

The throughput surprise

Here is the part I got wrong, and I got it wrong twice.

Trained a ~100M dense Llama from scratch, then put the R9700 head to head against the RTX 3070 Ti sitting in the same machine. On paper this is not close. The R9700 is the bigger, newer, far more expensive card. So when it landed at about 31k tokens per second against the 3070 Ti’s ~32k, my first instinct was that something was wrong with my setup.

Nothing was wrong. So I went looking for the missing performance, and guessed wrong twice.

Guess one: not enough batch. Surely the 32GB card is just under-fed. Doubled the batch from 8 to 16:

config      tok/s
batch 8     31.4k
batch 16    31.1k

Flat. If the card were simply under-fed, a bigger batch would have raised tok/s. It did not, so it isn’t.
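
For anyone reproducing a comparison like this, a framework-agnostic sketch of how a tok/s number can be measured: warm up, then time a fixed number of steps. This is not the exact harness behind the tables here. The names are mine, the fake step is a CPU stand-in so the snippet runs anywhere, and the 1024-token sequence length is an assumption.

```python
import time


def tokens_per_second(step_fn, tokens_per_step, warmup=3, iters=10, sync=None):
    """Time step_fn and return throughput in tokens per second.

    sync, if given, is called before each clock read so queued GPU work
    has finished before the timer starts and stops.
    """
    for _ in range(warmup):
        step_fn()
    if sync:
        sync()
    start = time.perf_counter()
    for _ in range(iters):
        step_fn()
    if sync:
        sync()
    elapsed = time.perf_counter() - start
    return tokens_per_step * iters / elapsed


def fake_step():
    # CPU stand-in for one training step; replace with the real thing.
    sum(i * i for i in range(50_000))


rate = tokens_per_second(fake_step, tokens_per_step=8 * 1024)
print(f"{rate / 1e3:.1f}k tok/s")
```

With torch, passing sync=torch.cuda.synchronize matters: GPU kernels launch asynchronously, so without it the timer mostly measures how fast Python can queue work.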

Guess two: it needs torch.compile. Built clean on gfx1201 this time (no graph breaks on a dense model) and delivered exactly nothing:

config           tok/s
eager            31.4k
torch.compile    31.4k

At that point the answer is the only thing left standing: training a small dense model is memory-bandwidth bound, not compute bound. The GEMMs at this size (768 by 768, 768 by 2048) are too small to saturate either card’s tensor units, so peak TFLOPS is irrelevant. What matters is how fast you move weights and activations, and there the two cards are nearly identical:

card           memory            bandwidth
R9700          256-bit GDDR6     ~640 GB/s
RTX 3070 Ti    256-bit GDDR6X    608 GB/s

Within noise. So they tie, exactly as the bandwidth numbers predict and the TFLOPS numbers do not.
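
Those bandwidth figures fall straight out of bus width times per-pin data rate. A quick worked check follows. The per-pin rates (20 Gbps for the R9700's GDDR6, 19 Gbps for the 3070 Ti's GDDR6X) are my assumption, picked to match the table, so check them against the spec sheets before leaning on them.

```python
def bandwidth_gb_s(bus_bits, gbps_per_pin):
    """Peak memory bandwidth in GB/s: (bus width in bytes) x (Gbps per pin)."""
    return bus_bits / 8 * gbps_per_pin


r9700 = bandwidth_gb_s(256, 20.0)     # assumed 20 Gbps GDDR6
rtx3070ti = bandwidth_gb_s(256, 19.0) # assumed 19 Gbps GDDR6X

bandwidth_ratio = r9700 / rtx3070ti
measured_ratio = 31.4 / 32.0          # tok/s figures from this post

print(r9700, rtx3070ti)
print(f"bandwidth ratio {bandwidth_ratio:.3f}, measured tok/s ratio {measured_ratio:.3f}")
```

Both ratios sit within a few percent of 1.0, which is the whole point: same bus width, nearly the same data rate, nearly the same throughput.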

A note on power, because the cap matters more than the uncap. The R9700 draws up to 300W stock, and my training box sits in the room I work in, so I had it capped down to 210W (its floor) just to keep the office bearable. Naturally I wanted to know what that cap was costing me, so I let it back up to 300W. It bought 11 percent.

Think about that the other way around. A 43 percent power increase for 11 percent more throughput is a bad trade going up, which makes it a great trade going down. Capping a home or office training card to its floor costs almost nothing in speed and takes 90W of heat out of the room. The card is not compute-starved, so the extra watts mostly turn into heat you sit next to.
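
The arithmetic behind that trade, spelled out. This assumes the card actually draws at its cap in both configurations, which I did not verify to the watt, so read the efficiency figure as approximate.

```python
def power_trade(floor_w, ceiling_w, speedup_at_ceiling):
    """Compare running at the power floor vs the ceiling.

    speedup_at_ceiling is the fractional throughput gain at the ceiling,
    e.g. 0.11 for 11 percent.
    """
    power_ratio = ceiling_w / floor_w
    return {
        "extra_power_pct": (power_ratio - 1) * 100,
        "watts_saved": ceiling_w - floor_w,
        # Throughput given up by capping, relative to uncapped speed.
        "speed_lost_pct": (1 - 1 / (1 + speedup_at_ceiling)) * 100,
        # Tokens per joule at the floor relative to the ceiling.
        "efficiency_gain_pct": (power_ratio / (1 + speedup_at_ceiling) - 1) * 100,
    }


trade = power_trade(floor_w=210, ceiling_w=300, speedup_at_ceiling=0.11)
for name, value in trade.items():
    print(f"{name}: {value:.1f}")
```

Under that assumption, capping gives up roughly a tenth of the speed and gets noticeably more tokens per joule in return.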

The cap lives in sysfs in microwatts (root to write). On gfx1201 the floor is 210W and the ceiling is 300W:

# current cap, and the allowed range
cat /sys/class/drm/card1/device/hwmon/hwmon*/power1_cap
cat /sys/class/drm/card1/device/hwmon/hwmon*/power1_cap_{min,max}

# set 250W
echo 250000000 | sudo tee /sys/class/drm/card1/device/hwmon/hwmon*/power1_cap
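
Since the unit is microwatts, it is easy to drop a zero. A small sketch that converts watts and checks the result against the range the driver advertises before anything gets written. The function name is mine, and the demo uses a fake hwmon directory seeded with the 210W and 300W limits mentioned above so it runs without the card or root:

```python
import tempfile
from pathlib import Path


def cap_microwatts(hwmon_dir, watts):
    """Convert watts to microwatts, validated against power1_cap_min/max."""
    hwmon = Path(hwmon_dir)
    lo = int((hwmon / "power1_cap_min").read_text())
    hi = int((hwmon / "power1_cap_max").read_text())
    value = round(watts * 1_000_000)
    if not lo <= value <= hi:
        raise ValueError(
            f"{watts} W is outside the allowed range "
            f"{lo / 1e6:.0f} to {hi / 1e6:.0f} W"
        )
    return value


# Fake hwmon directory, only to show the behavior.
with tempfile.TemporaryDirectory() as fake:
    Path(fake, "power1_cap_min").write_text("210000000\n")
    Path(fake, "power1_cap_max").write_text("300000000\n")
    print(cap_microwatts(fake, 250))
    try:
        cap_microwatts(fake, 150)
    except ValueError as err:
        print("rejected:", err)
```

The returned integer is what goes into power1_cap, still as root.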

So what is it actually for

If you are training small models and you only care about throughput, the R9700 will not beat a cheap used NVIDIA card. They tie, and the NVIDIA card runs cooler doing it.

But throughput was never the reason to buy this card. The reason is the 32GB, the thing the 8GB 3070 Ti physically cannot do:

The R9700 ran a 16-expert MoE that the 3070 Ti cannot load at all. Capacity is the product. Speed is a tie you should stop optimizing.
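
A back-of-envelope sketch of why capacity bites so early in training. It assumes a common mixed-precision AdamW layout of 16 bytes per parameter (bf16 weights and gradients, plus fp32 master weights and two fp32 Adam moments) and ignores activations entirely. The 1B-parameter MoE is a hypothetical size for illustration, not the model from this post.

```python
def training_state_gb(n_params, bytes_per_param=16):
    """Rough VRAM for weights, grads, and optimizer state, in GB.

    16 bytes/param assumes: bf16 weights (2) + bf16 grads (2)
    + fp32 master weights (4) + fp32 Adam m and v (4 + 4).
    Activations and allocator overhead are not included.
    """
    return n_params * bytes_per_param / 1e9


models = {
    "100M dense": 100e6,
    "1B-total MoE (hypothetical)": 1e9,
}
for name, n_params in models.items():
    need = training_state_gb(n_params)
    print(f"{name}: ~{need:.1f} GB state, "
          f"fits 8GB: {need < 8}, fits 32GB: {need < 32}")
```

The catch with an MoE is that every expert has to be resident even though each token only visits a few of them, so memory scales with total parameters while compute scales with active ones.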

And the meta-point, the one that made me buy it in the first place: ROCm on RDNA4 is genuinely usable for from-scratch training today. Not “usable with an asterisk and a patched fork,” usable. The bring-up is easier than NVIDIA’s, the card does real work, and the rough edges are a short list. The only way that list gets shorter is more people running real workloads and writing down what broke.

So here is mine. If you have hit something on gfx1201 I haven’t, I want to hear it.