Running AMD ROCm AI Workloads locally
Brief introduction
- System-Level Attributes
- Agent 1 — CPU
- Agent 2 — GPU ✅ (The important one for ROCm)
- Agent 3 — NPU (Neural Processing Unit)
- Install PyTorch for ROCm
- MIGraphX for Radeon GPUs
- ONNX
The AI X1 Pro-470 form factor1 pairs the Zen 5 12-core architecture, the 86 TOPS NPU, and the Radeon 890M iGPU, all of which directly contextualise why the OEM kernel requirement and the ROCm APU-specific setup described in this post are necessary.
AMD ROCm 7.2 was announced at CES 2026, delivering seamless support for Ryzen AI 400 Series processors, including the Ryzen AI 9 HX 470. ROCm 7.2 introduces initial PyTorch support for Ryzen APUs as a preview, enabling cost-effective local ML development and inference, with up to 128 GB of shared memory available to laptop users. AMD recommends that Ryzen users run the inbox graphics drivers of Ubuntu 24.04.3 alongside ROCm 7.2. On Windows, PyTorch is updated with ROCm 7.2 components for Ryzen AI processors, though the full ROCm stack is not yet supported there. This makes the HX 470 a capable entry point for local AI workflows.
I have installed ROCm as explained in the documentation2.
ROCm on Ryzen requires the 6.14-1018 OEM kernel or newer.
The kernel currently installed on my machine is:
uname -r
6.17.0-14-generic
The install guide requires --no-dkms when running amdgpu-install, because inbox drivers (drivers built directly into the kernel) are required for ROCm on Ryzen APUs3.
The OEM kernel flavour is maintained by Canonical’s OEM team and contains AMD-specific patches and driver backports that are not present in the mainline generic kernel. These include:
- iGPU/APU-specific KFD (Kernel Fusion Driver) patches for Ryzen compute workloads
- TTM shared memory management improvements AMD contributes to OEM kernels ahead of mainline
- amdgpu inbox driver tuned for integrated graphics compute (gfx1100/gfx1151 series)
The generic kernel at 6.17 may not have these backports, even though it’s a newer base version.
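As a quick sanity check, the running kernel flavour can be read from Python; a minimal sketch, assuming the Ubuntu convention of encoding the flavour as the release suffix:

```python
import platform

# Ubuntu encodes the kernel flavour as the release suffix,
# e.g. "6.17.0-1012-oem" vs "6.17.0-14-generic".
release = platform.release()
flavour = release.rsplit("-", 1)[-1]
print(f"kernel {release} (flavour: {flavour})")
if flavour != "oem":
    print("Not an OEM kernel; ROCm on Ryzen expects the linux-oem flavour")
```

The same suffix logic applies to both kernels shown in this post.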
Check if the GPU is listed as an agent:
rocminfo
ROCk module is loaded
=====================
HSA System Attributes
=====================
Runtime Version: 1.18
Runtime Ext Version: 1.15
System Timestamp Freq.: 1000.000000MHz
Sig. Max Wait Duration: 18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model: LARGE
System Endianness: LITTLE
Mwaitx: DISABLED
XNACK enabled: NO
DMAbuf Support: YES
VMM Support: YES
==========
HSA Agents
==========
*******
Agent 1
*******
Name: AMD Ryzen AI 9 HX 470 w/ Radeon 890M
Uuid: CPU-XX
Marketing Name: AMD Ryzen AI 9 HX 470 w/ Radeon 890M
Vendor Name: CPU
Feature: None specified
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 0(0x0)
Queue Min Size: 0(0x0)
Queue Max Size: 0(0x0)
Queue Type: MULTI
Node: 0
Device Type: CPU
Cache Info:
L1: 49152(0xc000) KB
Chip ID: 0(0x0)
ASIC Revision: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 5297
BDFID: 0
Internal Node ID: 0
Compute Unit: 24
SIMDs per CU: 0
Shader Engines: 0
Shader Arrs. per Eng.: 0
WatchPts on Addr. Ranges:1
Memory Properties:
Features: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: FINE GRAINED
Size: 57217068(0x369102c) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 2
Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED
Size: 57217068(0x369102c) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 3
Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED
Size: 57217068(0x369102c) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 4
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 57217068(0x369102c) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
ISA Info:
*******
Agent 2
*******
Name: gfx1150
Uuid: GPU-XX
Marketing Name: AMD Radeon Graphics
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 64(0x40)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 1
Device Type: GPU
Cache Info:
L1: 32(0x20) KB
L2: 2048(0x800) KB
Chip ID: 5390(0x150e)
ASIC Revision: 5(0x5)
Cacheline Size: 128(0x80)
Max Clock Freq. (MHz): 3100
BDFID: 50432
Internal Node ID: 1
Compute Unit: 16
SIMDs per CU: 2
Shader Engines: 1
Shader Arrs. per Eng.: 2
WatchPts on Addr. Ranges:4
Coherent Host Access: FALSE
Memory Properties: APU
Features: KERNEL_DISPATCH
Fast F16 Operation: TRUE
Wavefront Size: 32(0x20)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 32(0x20)
Max Work-item Per CU: 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 2147483647(0x7fffffff)
y 65535(0xffff)
z 65535(0xffff)
Max fbarriers/Workgrp: 32
Packet Processor uCode:: 31
SDMA engine uCode:: 14
IOMMU Support:: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 12582912(0xc00000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 2
Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED
Size: 12582912(0xc00000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 3
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Recommended Granule:0KB
Alloc Alignment: 0KB
Accessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx1150
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 2147483647(0x7fffffff)
y 65535(0xffff)
z 65535(0xffff)
FBarrier Max Size: 32
ISA 2
Name: amdgcn-amd-amdhsa--gfx11-generic
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 2147483647(0x7fffffff)
y 65535(0xffff)
z 65535(0xffff)
FBarrier Max Size: 32
*******
Agent 3
*******
Name: aie2p
Uuid: AIE-XX
Marketing Name: RyzenAI-npu4
Vendor Name: AMD
Feature: AGENT_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 1(0x1)
Queue Min Size: 64(0x40)
Queue Max Size: 64(0x40)
Queue Type: SINGLE
Node: 0
Device Type: DSP
Cache Info:
L2: 2048(0x800) KB
Chip ID: 0(0x0)
ASIC Revision: 0(0x0)
Cacheline Size: 0(0x0)
Max Clock Freq. (MHz): 0
BDFID: 0
Internal Node ID: 0
Compute Unit: 0
SIMDs per CU: 0
Shader Engines: 0
Shader Arrs. per Eng.: 0
WatchPts on Addr. Ranges:0
Memory Properties:
Features: AGENT_DISPATCH
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: KERNARG, COARSE GRAINED
Size: 57217068(0x369102c) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 2
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 65536(0x10000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:0KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 3
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 57217068(0x369102c) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
ISA Info:
*** Done ***
The system reports three HSA agents: a CPU, a GPU, and an NPU. Here is a breakdown:
System-Level Attributes
| Attribute | Value | Meaning |
|---|---|---|
| Runtime Version | 1.18 | ROCm HSA runtime installed and working |
| XNACK enabled | NO | Page fault retry for GPU memory is off (normal for APUs) |
| DMAbuf Support | YES | GPU can share buffers with other devices via Linux DMA-buf |
| VMM Support | YES | Virtual Memory Management is supported |
Agent 1 — CPU
AMD Ryzen AI 9 HX 470 w/ Radeon 890M
- 24 compute units, max clock 5297 MHz
- Has 4 memory pools totalling ~54.6 GB (the system RAM), all accessible by all agents: the unified memory characteristic of an APU
- Pool types cover Fine Grained, Extended Fine Grained, Kernarg, and Coarse Grained — used for different levels of CPU/GPU memory coherency
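The pool sizes are reported in KiB; converting confirms the total, as a quick arithmetic check:

```python
# rocminfo reports pool sizes in KiB; each of the four CPU pools
# shows the same backing store (system RAM).
pool_kib = 57217068
pool_gib = pool_kib / 1024 / 1024
print(f"{pool_gib:.2f} GiB")  # 54.57, matching the total amd-ttm reports later
```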
Agent 2 — GPU ✅ (The important one for ROCm)
AMD Radeon Graphics — gfx1150
This is your Radeon 890M iGPU. Key specs:
| Property | Value |
|---|---|
| ISA Target | amdgcn-amd-amdhsa--gfx1150 |
| Compute Units | 16 |
| SIMDs per CU | 2 |
| Max Clock | 3100 MHz |
| Wavefront Size | 32 |
| Max Workgroup Size | 1024 |
| Fast F16 | TRUE |
| Memory Pool | ~12 GB (shared from system RAM via TTM) |
A few important things to note:
- `Memory Properties: APU` confirms this is a unified memory APU, not a discrete GPU
- Two ISAs are registered: `gfx1150` (exact target) and `gfx11-generic` (fallback). This means ROCm kernels compiled for the generic gfx11 family will also run on this chip
- `Accessible by all: FALSE` on the GPU pools: GPU VRAM pools are not directly CPU-accessible without explicit mapping (normal APU behaviour despite unified memory)
Agent 3 — NPU (Neural Processing Unit)
RyzenAI-npu4 (aie2p)
- This is the AMD XDNA NPU (AI Engine 2+) embedded in the Ryzen AI 9 HX 470
- Device type is `DSP`; its feature is `AGENT_DISPATCH` (not `KERNEL_DISPATCH` like the GPU)
- It uses system RAM pools but is not a general-purpose compute target for ROCm; it is used via the separate ONNX/Vitis AI runtime path
- Max queue size is only 64 (single queue), very different from the GPU’s 128 queues
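The agent listing above can be summarised programmatically. A minimal sketch that pulls the per-agent `Name` and `Device Type` fields out of `rocminfo`-style output using only the standard library (the sample lines are taken from the capture above):

```python
def parse_agents(text: str) -> list[dict]:
    """Collect the first Name/Device Type pair of each 'Agent N' block."""
    agents, current = [], None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("Agent "):
            current = {}
            agents.append(current)
        elif current is not None and ":" in line:
            key, _, value = line.partition(":")
            key = key.strip()
            # keep only the first occurrence, so ISA 'Name:' lines are ignored
            if key in ("Name", "Device Type") and key not in current:
                current[key] = value.strip()
    return agents

sample = """\
Agent 1
Name: AMD Ryzen AI 9 HX 470 w/ Radeon 890M
Device Type: CPU
Agent 2
Name: gfx1150
Device Type: GPU
Agent 3
Name: aie2p
Device Type: DSP
"""
for agent in parse_agents(sample):
    print(agent)
```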
ROCm utilises a shared system memory pool, configured by default to half of the system memory.
AMD recommends setting the dedicated VRAM in the BIOS to the minimum (0.5 GB) and raising the TTM limit to a larger amount.
Use `amd-ttm --set <NUM>` to set the shared memory limit. The current value is:
amd-ttm
💻 Current TTM pages limit: 3145728 pages (12.00 GB)
💻 Total system memory: 54.57 GB
You can clear the above setting with `amd-ttm --clear`.
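The page count reported by `amd-ttm` converts directly to the limit in GiB, assuming the usual 4 KiB page size on x86-64:

```python
PAGE_SIZE = 4096          # bytes per page on x86-64
pages = 3145728           # "Current TTM pages limit" from amd-ttm above
limit_gib = pages * PAGE_SIZE / 2**30
print(f"TTM limit: {limit_gib:.2f} GiB")  # 12.00, matching the tool's output
```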
I’ve updated the kernel to the Canonical OEM flavour: `sudo apt update && sudo apt install linux-oem-24.04c`.
Verify the new OEM kernel version:
uname -r
6.17.0-1012-oem
Install PyTorch for ROCm
I used the Docker approach4 in this case:
cat > docker-compose.yaml << 'EOF'
services:
rocm-pytorch:
image: rocm/pytorch:rocm7.2_ubuntu24.04_py3.12_pytorch_release_2.9.1
stdin_open: true
tty: true
cap_add:
- SYS_PTRACE
security_opt:
- seccomp=unconfined
devices:
- /dev/kfd
- /dev/dri
group_add:
- video
ipc: host
shm_size: 8g
EOF
The key mappings from `docker run` flags:

| `docker run` flag | `docker-compose` key |
|---|---|
| `-it` | `stdin_open: true` + `tty: true` |
| `--cap-add` | `cap_add` |
| `--security-opt` | `security_opt` |
| `--device` | `devices` |
| `--group-add` | `group_add` |
| `--ipc` | `ipc` |
| `--shm-size` | `shm_size` |
You can run PyTorch using docker compose run --rm rocm-pytorch.
Verify that PyTorch is installed and detects the GPU compute device:
docker run --rm \
--cap-add=SYS_PTRACE \
--security-opt seccomp=unconfined \
--device=/dev/kfd \
--device=/dev/dri \
--group-add video \
--ipc=host \
--shm-size 8G \
rocm/pytorch:rocm7.2_ubuntu24.04_py3.12_pytorch_release_2.9.1 \
bash -c "python3 -c 'import torch' 2> /dev/null && echo 'Success' || echo 'Failure'"
Success
Test if the GPU is available:
docker run --rm \
--cap-add=SYS_PTRACE \
--security-opt seccomp=unconfined \
--device=/dev/kfd \
--device=/dev/dri \
--group-add video \
--ipc=host \
--shm-size 8G \
rocm/pytorch:rocm7.2_ubuntu24.04_py3.12_pytorch_release_2.9.1 \
bash -c "python3 -c 'import torch; print(torch.cuda.is_available())'"
True
Display the installed GPU device name:
docker run --rm \
--cap-add=SYS_PTRACE \
--security-opt seccomp=unconfined \
--device=/dev/kfd \
--device=/dev/dri \
--group-add video \
--ipc=host \
--shm-size 8G \
rocm/pytorch:rocm7.2_ubuntu24.04_py3.12_pytorch_release_2.9.1 \
bash -c 'python3 -c "import torch; print(f\"device name [0]:\", torch.cuda.get_device_name(0))"'
device name [0]: AMD Radeon Graphics
Display component information for the current PyTorch environment:
docker run --rm \
--cap-add=SYS_PTRACE \
--security-opt seccomp=unconfined \
--device=/dev/kfd \
--device=/dev/dri \
--group-add video \
--ipc=host \
--shm-size 8G \
rocm/pytorch:rocm7.2_ubuntu24.04_py3.12_pytorch_release_2.9.1 \
bash -c 'python3 -m torch.utils.collect_env'
<frozen runpy>:128: RuntimeWarning: 'torch.utils.collect_env' found in sys.modules after import of package 'torch.utils', but prior to execution of 'torch.utils.collect_env'; this may result in unpredictable behaviour
Collecting environment information...
PyTorch version: 2.9.1+rocm7.2.0.git7e1940d4
Is debug build: False
CUDA used to build PyTorch: N/A
ROCM used to build PyTorch: 7.2.26015-fc0010cf6a
OS: Ubuntu 24.04.3 LTS (x86_64)
GCC version: (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.39
Python version: 3.12.3 (main, Jan 8 2026, 11:30:50) [GCC 13.3.0] (64-bit runtime)
Python platform: Linux-6.17.0-1012-oem-x86_64-with-glibc2.39
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to:
GPU models and configuration: AMD Radeon Graphics (gfx1150)
Nvidia driver version: Could not collect
cuDNN version: Could not collect
Is XPU available: False
HIP runtime version: 7.2.26015
MIOpen runtime version: 3.5.1
Is XNNPACK available: True
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 48 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 24
On-line CPU(s) list: 0-23
Vendor ID: AuthenticAMD
Model name: AMD Ryzen AI 9 HX 470 w/ Radeon 890M
CPU family: 26
Model: 36
Thread(s) per core: 2
Core(s) per socket: 12
Socket(s): 1
Stepping: 0
Frequency boost: enabled
CPU(s) scaling MHz: 47%
CPU max MHz: 5297.2979
CPU min MHz: 621.6220
BogoMIPS: 3992.31
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good amd_lbr_v2 nopl xtopology nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpuid_fault cpb cat_l3 cdp_l3 hw_pstate ssbd mba perfmon_v2 ibrs ibpb stibp ibrs_enhanced vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local user_shstk avx_vnni avx512_bf16 clzero irperf xsaveerptr rdpru wbnoinvd cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif x2avic v_spec_ctrl vnmi avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid bus_lock_detect movdiri movdir64b overflow_recov succor smca fsrm avx512_vp2intersect flush_l1d amd_lbr_pmc_freeze
Virtualization: AMD-V
L1d cache: 576 KiB (12 instances)
L1i cache: 384 KiB (12 instances)
L2 cache: 12 MiB (12 instances)
L3 cache: 24 MiB (2 instances)
NUMA node(s): 1
NUMA node0 CPU(s): 0-23
Vulnerability Gather data sampling: Not affected
Vulnerability Ghostwrite: Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Old microcode: Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Mitigation; IBPB on VMEXIT only
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Enhanced / Automatic IBRS; IBPB conditional; STIBP always-on; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsa: Not affected
Vulnerability Tsx async abort: Not affected
Vulnerability Vmscape: Mitigation; IBPB on VMEXIT
Versions of relevant libraries:
[pip3] numpy==2.4.1
[pip3] torch==2.9.1+rocm7.2.0.lw.git7e1940d4
[pip3] torchaudio==2.9.0+rocm7.2.0.gite3c6ee2b
[pip3] torchvision==0.24.0+rocm7.2.0.gitb919bd0c
[pip3] triton==3.5.1+rocm7.2.0.gita272dfa8
[conda] Could not collect
MIGraphX for Radeon GPUs
After ROCm installation (verified with PyTorch), the next step is to install MIGraphX, AMD’s graph inference engine. It accelerates machine learning model inference and can be used as a backend for both Torch-MIGraphX and ONNX Runtime.
sudo apt install migraphx
Verify the installation:
dpkg -l | grep migraphx
ii migraphx 2.15.0.70200-43~24.04 amd64 AMD graph optimizer
ii migraphx-dev 2.15.0.70200-43~24.04 amd64 AMD graph optimizer
dpkg -l | grep half
ii half 1.12.0.70200-43~24.04 amd64 HALF-PRECISION FLOATING POINT LIBRARY
Perform a simple inference with MIGraphX to verify the installation:
/opt/rocm-7.2.0/bin/migraphx-driver perf --test
Running [ MIGraphX Version: 2.15.0.20250912-17-195-g1afd1b89c ]: /opt/rocm-7.2.0/bin/migraphx-driver perf --test
[2026-03-08 11:26:14]
Compiling ...
module: "main"
@0 = check_context::migraphx::gpu::context -> float_type, {}, {}
main:#output_0 = @param:main:#output_0 -> float_type, {4, 3}, {3, 1}
b = @param:b -> float_type, {5, 3}, {3, 1}
a = @param:a -> float_type, {4, 5}, {5, 1}
@4 = gpu::code_object[code_object=4656,symbol_name=mlir_dot,global=256,local=256,output_arg=2,](a,b,main:#output_0) -> float_type, {4, 3}, {3, 1}
Allocating params ...
Running performance report ...
@0 = check_context::migraphx::gpu::context -> float_type, {}, {}: 0.00028068ms, 3%
main:#output_0 = @param:main:#output_0 -> float_type, {4, 3}, {3, 1}: 0.00022842ms, 3%
b = @param:b -> float_type, {5, 3}, {3, 1}: 0.0002194ms, 3%
a = @param:a -> float_type, {4, 5}, {5, 1}: 0.00021078ms, 3%
@4 = gpu::code_object[code_object=4656,symbol_name=mlir_dot,global=256,local=256,output_arg=2,](a,b,main:#output_0) -> float_type, {4, 3}, {3, 1}: 0.00880674ms, 91%
Summary:
gpu::code_object::mlir_dot: 0.00880674ms / 1 = 0.00880674ms, 91%
@param: 0.0006586ms / 3 = 0.000219533ms, 7%
check_context::migraphx::gpu::context: 0.00028068ms / 1 = 0.00028068ms, 3%
Batch size: 1
Rate: 83845.5 inferences/sec
Total time: 0.0119267ms (Min: 0.011309ms, Max: 0.022467ms, Mean: 0.0125111ms, Median: 0.0119195ms)
Percentiles (90%, 95%, 99%): (0.013382ms, 0.016197ms, 0.021636ms)
Total instructions time: 0.00974602ms
Overhead time: 0.00045574ms, 0.00218068ms
Overhead: 4%, 18%
[2026-03-08 11:26:15]
[ MIGraphX Version: 2.15.0.20250912-17-195-g1afd1b89c ] Complete(0.945651s): /opt/rocm-7.2.0/bin/migraphx-driver perf --test
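The reported rate is simply the inverse of the mean per-inference time; a quick sanity check against the "Total time" value from the report above:

```python
total_time_ms = 0.0119267     # "Total time" from the perf report above
rate_per_sec = 1000.0 / total_time_ms
print(f"{rate_per_sec:.1f} inferences/sec")  # ~83845.5, matching the report
```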
ONNX
Open Neural Network Exchange (ONNX)5 is an open standard format for representing machine learning models, originally created by Microsoft and Meta in 2017 and now governed by the Linux Foundation. The core problem it solves:
Every ML framework (PyTorch, TensorFlow, JAX, scikit-learn) has its own internal model format. Without ONNX, a model trained in PyTorch cannot be directly deployed in a TensorFlow serving environment. ONNX acts as a universal interchange format — train anywhere, deploy anywhere.
PyTorch ──┐
TensorFlow─┤──► ONNX model (.onnx) ──► Any runtime / hardware
JAX ───────┘
Key Concepts
- ONNX Model — a .onnx file that encodes the model’s computation graph (nodes, edges, operators) in a hardware-agnostic way using Google’s Protocol Buffers.
- ONNX Operators — a standardised set of operations (Conv, MatMul, ReLU, etc.) that all compliant frameworks must support.
- ONNX Runtime (ORT) — Microsoft’s high-performance inference engine for .onnx models. It is separate from the ONNX format itself and is what you install via pip install onnxruntime.
- Execution Providers (EPs) — ORT’s plugin system for hardware acceleration. Rather than running on the CPU by default, EPs delegate computation to specific hardware:

| Execution Provider | Hardware |
|---|---|
| `CPUExecutionProvider` | Any CPU |
| `MIGraphXExecutionProvider` | AMD ROCm GPUs (via MIGraphX) |
| `ROCMExecutionProvider` | AMD ROCm GPUs (direct) |
| `CUDAExecutionProvider` | NVIDIA CUDA GPUs |
| `TensorrtExecutionProvider` | NVIDIA TensorRT |
| `CoreMLExecutionProvider` | Apple Silicon |
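ORT walks the providers list in order and falls back to the CPU when a preferred EP is unavailable. The selection idea can be sketched in plain Python (an illustration of the behaviour, not ORT's internal code):

```python
def pick_provider(available: list[str], preferred: list[str]) -> str:
    """Return the first preferred execution provider that is available."""
    for provider in preferred:
        if provider in available:
            return provider
    return "CPUExecutionProvider"  # ORT's universal fallback

# On this machine, onnxruntime-migraphx reports these providers:
available = ["MIGraphXExecutionProvider", "CPUExecutionProvider"]
# Asking for ROCm first still lands on MIGraphX rather than the CPU:
print(pick_provider(available, ["ROCMExecutionProvider", "MIGraphXExecutionProvider"]))
```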
Typical workflow:

- Train the model in PyTorch
- Export to ONNX: `torch.onnx.export(model, dummy_input, "model.onnx")`
- Run inference via ONNX Runtime:

sess = ort.InferenceSession("model.onnx",
                            providers=["MIGraphXExecutionProvider"])
output = sess.run(None, {"input": data})
On the AMD Radeon 890M (`gfx1150`), the MIGraphX EP compiles the ONNX graph using AMD’s MIGraphX graph optimiser at inference time, fusing operators and generating GPU-optimised kernels, giving significantly better performance than the CPU fallback.
Here is the Dockerfile:
cat > Dockerfile << 'EOF'
FROM rocm/pytorch:rocm7.2_ubuntu24.04_py3.12_pytorch_release_2.9.1
# Install MIGraphX (prerequisite for ONNX Runtime MIGraphX EP)
RUN apt-get update && apt-get install -y \
migraphx \
migraphx-dev \
half \
&& rm -rf /var/lib/apt/lists/*
# Downgrade numpy (v2.0 is incompatible with onnxruntime-migraphx wheels)
RUN pip3 install numpy==1.26.4
# Install ONNX Runtime with MIGraphX Execution Provider
RUN pip3 uninstall -y onnxruntime-migraphx || true && \
pip3 install onnxruntime-migraphx -f https://repo.radeon.com/rocm/manylinux/rocm-rel-7.2/
WORKDIR /workspace
EOF
Create a Docker Compose file to make things easier:
cat > docker-compose.yaml << 'EOF'
services:
rocm-onnx:
build: .
image: rocm-onnx:rocm7.2
cap_add:
- SYS_PTRACE
security_opt:
- seccomp=unconfined
devices:
- /dev/kfd
- /dev/dri
group_add:
- video
ipc: host
shm_size: 8g
volumes:
- ./workspace:/workspace
EOF
The volumes mount maps a local ./workspace directory into the container so you can drop .onnx model files there and run inference without rebuilding the image.
Build the Docker image using the above files: `docker compose build`.
To verify:
docker compose run --rm rocm-onnx \
python3 -c "import onnxruntime as ort; print(ort.get_available_providers())"
✔ Network blogs_default Created
Container blogs-rocm-onnx-run-315e1fcece42 Creating
Container blogs-rocm-onnx-run-315e1fcece42 Created
['MIGraphXExecutionProvider', 'CPUExecutionProvider']
Verify MIGraphX works:
docker compose run --rm rocm-onnx \
/opt/rocm-7.2.0/bin/migraphx-driver perf --test
Container blogs-rocm-onnx-run-9472d6713026 Creating
Container blogs-rocm-onnx-run-9472d6713026 Created
Running [ MIGraphX Version: 2.15.0.20250912-17-195-g1afd1b89c ]: /opt/rocm-7.2.0/bin/migraphx-driver perf --test
[2026-03-08 00:36:46]
Compiling ...
module: "main"
@0 = check_context::migraphx::gpu::context -> float_type, {}, {}
main:#output_0 = @param:main:#output_0 -> float_type, {4, 3}, {3, 1}
b = @param:b -> float_type, {5, 3}, {3, 1}
a = @param:a -> float_type, {4, 5}, {5, 1}
@4 = gpu::code_object[code_object=4656,symbol_name=mlir_dot,global=256,local=256,output_arg=2,](a,b,main:#output_0) -> float_type, {4, 3}, {3, 1}
Allocating params ...
Running performance report ...
@0 = check_context::migraphx::gpu::context -> float_type, {}, {}: 0.00028054ms, 3%
main:#output_0 = @param:main:#output_0 -> float_type, {4, 3}, {3, 1}: 0.00022022ms, 3%
b = @param:b -> float_type, {5, 3}, {3, 1}: 0.0002111ms, 3%
a = @param:a -> float_type, {4, 5}, {5, 1}: 0.00020764ms, 3%
@4 = gpu::code_object[code_object=4656,symbol_name=mlir_dot,global=256,local=256,output_arg=2,](a,b,main:#output_0) -> float_type, {4, 3}, {3, 1}: 0.0089923ms, 91%
Summary:
gpu::code_object::mlir_dot: 0.0089923ms / 1 = 0.0089923ms, 91%
@param: 0.00063896ms / 3 = 0.000212987ms, 7%
check_context::migraphx::gpu::context: 0.00028054ms / 1 = 0.00028054ms, 3%
Batch size: 1
Rate: 99657.6 inferences/sec
Total time: 0.0100344ms (Min: 0.009097ms, Max: 0.03133ms, Mean: 0.0133908ms, Median: 0.009989ms)
Percentiles (90%, 95%, 99%): (0.025579ms, 0.026391ms, 0.029907ms)
Total instructions time: 0.0099118ms
Overhead time: 0.00045108ms, 0.00012256ms
Overhead: 4%, 1%
[2026-03-08 00:36:46]
[ MIGraphX Version: 2.15.0.20250912-17-195-g1afd1b89c ] Complete(0.666102s): /opt/rocm-7.2.0/bin/migraphx-driver perf --test