Running Gemma 4 E4B on AMD ROCm
This guide covers Gemma 4 E4B, the 4-billion-effective-parameter instruction-tuned variant (google/gemma-4-E4B-it)1. It grounds the Gemma 4 model card in the Ryzen AI 9 HX 470 machine — a Minisforum AI X1 Pro with a Ryzen AI 9 HX 470, Radeon 890M (gfx1150 / RDNA 3.5), XDNA 2 NPU, 64 GiB UMA, ROCm 7.2.0, and MIGraphX 2.15.0.dev — and walks from “what is this hardware” through to “how do I tune the KV cache for 128 K context”2. The Docker command in Part III is the verified-working configuration, validated on this exact hardware.
- How to read this guide
- Part I — Foundations
- Part II — Architecture
- Part III — Serving with vLLM on Strix Point
- Part IV — Advanced topics
- Part V — Operations
How to read this guide
The guide is layered. Part I is for newcomers, Part II introduces the architecture with math, Part III is the vLLM operating manual including the verified Docker command, and Part IV is for engineers who want to reason about quantisation, NPU offload, and long-context economics. Part V covers operations, benchmarking, and where to go next.
Part I — Foundations
The Gemma 4 family
Gemma 4 is a family of four open-weights models released by Google DeepMind on 2 April 2026 under Apache 2.03 4. All four are decoder-only Transformers built from the same research as Gemini 3, all four are multimodal (text + image), and the two smallest also accept audio5.
flowchart LR
classDef edge fill:#fef3c7,stroke:#d97706,color:#000
classDef desktop fill:#dbeafe,stroke:#2563eb,color:#000
classDef server fill:#fce7f3,stroke:#db2777,color:#000
F["Gemma 4 family<br/>Apache 2.0 · multimodal · 140+ languages"]
F --> E2B["E2B<br/>~2.3B effective<br/>128K ctx · text/image/audio<br/>Phones, Pi, browsers"]:::edge
F --> E4B["<b>E4B</b> ← <b>this guide</b><br/>~4.5B effective · PLE<br/>128K ctx · text/image/audio<br/>Laptops, mini-PCs"]:::edge
F --> M26["26B A4B<br/>MoE · 4B active of 26B<br/>256K ctx · text/image<br/>Single consumer GPU"]:::desktop
F --> M31["31B Dense<br/>all params active<br/>256K ctx · text/image<br/>Workstations, servers"]:::server
E4B sits in the middle of that range: a dense model with Per-Layer Embeddings (PLE), a 128 K context window, native system role, native function calling, and a configurable thinking mode5 6. On the Ryzen AI 9 HX 470 machine it is the right size — large enough for serious work, small enough that the iGPU’s UMA pool is not the bottleneck.
What “Effective 4B” means
The “E” in E4B stands for effective parameters. The model uses Per-Layer Embeddings (PLE): each decoder layer carries its own small token-embedding table that is consulted by lookup, not multiplied7 8.
That distinction matters because it splits memory and compute apart:
- Compute — only the active matmul-bearing parameters count. ~4.5 B.
- Static memory — the PLE tables push the on-disk and resident weight size higher than 4.5 B × 2 bytes (FP16) would suggest. Plan for ~8–10 GB FP16 or ~4–5 GB Q4_K_M / INT47.
flowchart TB
classDef tab fill:#f3f4f6,stroke:#6b7280,color:#000
classDef proj fill:#dbeafe,stroke:#2563eb,color:#000
T["Token id"] --> L0["Layer 0<br/>own PLE table"]:::tab
T --> L1["Layer 1<br/>own PLE table"]:::tab
T --> Ld["Layer L-1<br/>own PLE table"]:::tab
L0 --> A0["\+ Layer 0 attn / FFN"]:::proj
L1 --> A1["\+ Layer 1 attn / FFN"]:::proj
Ld --> Ad["\+ Layer L-1 attn / FFN"]:::proj
That is the trick that lets a 4.5 B-effective model beat older 7–8 B baselines on most reasoning suites9.
Why the Ryzen AI 9 HX 470 machine can run Gemma 4
The rocminfo output shows three HSA agents — CPU, GPU (gfx1150), and NPU (aie2p) — sharing one 64 GiB pool2. The mapping to the Gemma 4 serving stack:
| Layer | Ryzen AI 9 HX 470 machine | Role for Gemma 4 |
|---|---|---|
| CPU | Ryzen AI 9 HX 470, 12 Zen 5 cores @ 5.30 GHz | vLLM scheduler, tokenizer, audio preprocessing |
| iGPU | Radeon 890M, gfx1150, 16 CUs, wave32 | Where Gemma 4 matmuls run under HIP + hipBLASLt |
| NPU | XDNA 2 / aie2p / RyzenAI-npu4, 86 TOPS | Not used by vLLM; reserved for ONNX-RT or MIGraphX sidecar |
| Memory | 64 GiB UMA (your amd-smi shows 2.68/65.5) | Both static weights and KV cache live here |
| Kernel | 6.17.0-1012-oem | Required path for Strix Point IOMMU / amdkfd10 |
| ROCm | 7.2.0 | First stable release for gfx1150 production serving11 |
| MIGraphX | 2.15.0.dev (g1afd1b89c) | Optional ONNX router that can target the NPU |
You have what you need. The NPU is a separate opportunity, not a prerequisite.
Part II — Architecture
Decoder topology and PLE
Stripped to essentials, the Gemma 4 forward pass per token is:
flowchart TB
X["token / patch / audio frame"] --> E["embed via PLE<br/>(text) or vision/audio encoder"]
E --> H0["Layer 0<br/>SWA attn → FFN"]
H0 --> H1["Layer 1<br/>SWA attn → FFN"]
H1 --> Hg["Layer k<br/>GLOBAL attn → FFN"]
Hg --> Hn["… interleaved …"]
Hn --> HL["Last layer<br/><b>GLOBAL</b> attn → FFN"]
HL --> O["LM head → next-token logits"]
Two facts about the Gemma 4 decoder are non-negotiable for a serving operator:
- Most layers run sliding-window attention (SWA); periodic layers and the last layer run full global attention5 8.
- The vision encoder is small (~150 M for E2B/E4B) and runs on the same iGPU as the LM8.
Hybrid attention, with math
The complexity argument
For a transformer layer with hidden dim $d$ and sequence length $n$, full self-attention costs
\[\mathcal{O}_{\text{full}}(n) \;=\; \Theta\!\left(n^{2} d\right)\]
per layer, and the KV cache for that layer occupies
\[\text{KV}_{\text{full}}(n) \;=\; 2 \cdot n \cdot h_{kv} \cdot d_{h} \cdot b \quad \text{bytes}\]
where $h_{kv}$ is the number of key/value heads, $d_{h}$ is the head dimension, and $b$ is bytes per element (2 for FP16/BF16).
Sliding-window attention with window $w$ replaces the $n^{2}$ term by $n \cdot w$:
\[\mathcal{O}_{\text{SWA}}(n) \;=\; \Theta\!\left(n \cdot w \cdot d\right), \qquad \text{KV}_{\text{SWA}}(n) \;=\; 2 \cdot \min(n, w) \cdot h_{kv} \cdot d_{h} \cdot b\]
For Gemma 4 (E4B), $w = 512$ tokens8. Once $n > 512$, every SWA layer’s KV cost is constant in $n$ — only the global layers grow.
The hybrid total
If the model has $L_{s}$ SWA layers and $L_{g}$ global layers, total KV bytes are
\[\text{KV}_{\text{hybrid}}(n) \;=\; 2\,h_{kv}\,d_{h}\,b \,\Big[\,L_{s}\cdot \min(n, w) \;+\; L_{g}\cdot n \,\Big]\]
For large $n$, this is linear in $n$ with slope $L_{g}$ rather than $L_{s} + L_{g}$. That is the entire reason 128 K context is feasible for Gemma 4 on a 16 GB-class budget — most of Gemma 4’s layers stop paying KV per token once you cross the window.
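A quick sanity check with illustrative numbers (the head geometry $h_{kv} = 8$, $d_{h} = 256$, $b = 2$ is an assumption chosen only for the arithmetic, not the published E4B config), at $n = 131072$ and $w = 512$:
\[\underbrace{2 \cdot 512 \cdot 8 \cdot 256 \cdot 2}_{\text{one SWA layer}} = 4\ \text{MiB}, \qquad \underbrace{2 \cdot 131072 \cdot 8 \cdot 256 \cdot 2}_{\text{one global layer}} = 1\ \text{GiB}.\]
Under those assumed numbers, every SWA layer is capped at a few MiB regardless of context length; only the global layers carry gigabyte-scale cost.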
flowchart TB
classDef swa fill:#dcfce7,stroke:#16a34a,color:#000
classDef glo fill:#fee2e2,stroke:#dc2626,color:#000
L0["Layer 0 · SWA (w=512)"]:::swa
L1["Layer 1 · SWA"]:::swa
L2["Layer 2 · SWA"]:::swa
L3["Layer 3 · SWA"]:::swa
L4["Layer 4 · SWA"]:::swa
L5["Layer 5 · GLOBAL (unified K=V, p-RoPE)"]:::glo
Ldot["…"]
LL["Last layer · <b>GLOBAL</b> (always)"]:::glo
L0 --> L1 --> L2 --> L3 --> L4 --> L5 --> Ldot --> LL
Global-layer optimisations
The global layers — the ones that do grow with $n$ — apply two extra savings:
- Unified K and V projections (sometimes written $W_{K} = W_{V}$): the same projection matrix produces both keys and values, halving the KV cache for those layers5.
- Proportional RoPE (p-RoPE): only a fraction $p \in (0, 1]$ of head dimensions are rotated by RoPE; the rest pass through unrotated. This improves long-context generalisation past the training length5.
Concretely, ordinary RoPE on a head of dimension $d_{h}$ rotates each pair of dimensions $(2i, 2i+1)$ at frequency $\theta_{i} = \theta_{\text{base}}^{-2i/d_{h}}$:
\[\text{RoPE}(\mathbf{x}, m) = \begin{pmatrix} \cos(m\theta_{i}) & -\sin(m\theta_{i}) \\ \sin(m\theta_{i}) & \cos(m\theta_{i}) \end{pmatrix} \begin{pmatrix} x_{2i} \\ x_{2i+1} \end{pmatrix}\]
p-RoPE applies that rotation to only the first $p \cdot d_{h}$ dimensions and leaves the rest unchanged.
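As a concrete reference, here is a minimal NumPy sketch of that partial rotation. The values of $p$, $\theta_{\text{base}}$, and the even/odd pairing are assumptions chosen to mirror the equations above, not values read out of the Gemma 4 weights:

```python
import numpy as np

def p_rope(x: np.ndarray, m: int, p: float = 0.5, theta_base: float = 10000.0) -> np.ndarray:
    """Rotate only the first p*d_h dimensions of one head vector x at position m;
    the remaining dimensions pass through unrotated (illustrative sketch)."""
    d_h = x.shape[-1]
    d_rot = int(p * d_h)
    d_rot -= d_rot % 2                 # rotate an even number of dimensions
    out = x.copy()
    for i in range(d_rot // 2):
        theta = theta_base ** (-2 * i / d_h)
        c, s = np.cos(m * theta), np.sin(m * theta)
        x0, x1 = x[2 * i], x[2 * i + 1]
        out[2 * i] = c * x0 - s * x1
        out[2 * i + 1] = s * x0 + c * x1
    return out

q = np.random.randn(256).astype(np.float32)   # one head, d_h = 256
q_rot = p_rope(q, m=100_000, p=0.5)           # only the first 128 dims are rotated
```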
Multimodal pipeline
flowchart LR
classDef enc fill:#dbeafe,stroke:#2563eb,color:#000
classDef tok fill:#fef3c7,stroke:#d97706,color:#000
IMG["Image<br/>(any aspect ratio)"] --> VE["Vision encoder<br/>~150 M params"]:::enc
AUD["Audio<br/>(E2B/E4B only)"] --> AE["Mel-spec → 2× Conv2D<br/>downsample"]:::enc
TXT["Text"] --> TT["BPE tokenize → PLE lookup"]:::tok
VE --> P["Linear projection<br/>into LM embedding space"]
AE --> P
P --> S["Unified token stream"]:::tok
TT --> S
S --> DEC["Decoder stack<br/>(SWA + global)"]
Visual token budget
Image inputs are converted into a configurable number of tokens. Pick the budget to match the task5:
| Budget | Use for |
|---|---|
| 70 | Classification, captioning, dense video (many frames) |
| 140 | Light captioning + simple charts |
| 280 | General VQA, screen understanding |
| 560 | OCR, document parsing, detailed charts |
| 1120 | Fine-grained pointing, dense detection, small text |
Critical rule: fine-tune at the same budget you intend to serve at. Training at 1120 and serving at 280 (or vice versa) measurably degrades Gemma 4 output quality12.
Where to put media in a prompt
Always place media before text in a single user message: [image | audio][text], not the reverse5. This is a constraint of how Gemma 4 multimodal training data was structured.
Part III — Serving with vLLM on Strix Point
ROCm topology on the Ryzen AI 9 HX 470 machine
flowchart TB
classDef user fill:#fef3c7,stroke:#d97706,color:#000
classDef vllm fill:#e0e7ff,stroke:#4f46e5,color:#000
classDef rocm fill:#fce7f3,stroke:#db2777,color:#000
classDef hw fill:#dcfce7,stroke:#16a34a,color:#000
U["Client<br/>(curl, openai-py, agent loop)"]:::user
API["vLLM OpenAI API<br/>:8000"]:::vllm
SCH["vLLM scheduler<br/>+ paged-attention KV"]:::vllm
EXE["GPU executor<br/>(PyTorch + custom HIP)"]:::vllm
PT["PyTorch ROCm wheel"]:::rocm
HBL["hipBLASLt + rocBLAS"]:::rocm
AOT["AOTriton flash attn"]:::rocm
HIP["HIP runtime + amdkfd"]:::rocm
KER["Kernel 6.17.0-1012-oem<br/>amdkfd / amdgpu"]:::hw
GPU["Radeon 890M · gfx1150<br/>16 CUs · wave32 · 16 GB UMA"]:::hw
NPU["XDNA 2 · aie2p<br/>86 TOPS · NOT used by vLLM"]:::hw
U --> API --> SCH --> EXE
EXE --> PT --> HBL --> HIP
EXE --> AOT --> HIP
HIP --> KER --> GPU
KER -.-> NPU
The NPU sits there available for an ONNX Runtime or MIGraphX sidecar — see Part IV.
Container launch (verified-working)
This is the single canonical command for the Ryzen AI 9 HX 470 machine. It has been validated end-to-end against vLLM v0.20.1 on ROCm 7.2.0 with kernel 6.17.0-1012-oem, including model load, KV cache provisioning, multimodal warmup, and tool-call parsing.
Choice of image:
`vllm/vllm-openai-rocm:v0.20.1` is the upstream-built image, post-PR #25908, which added gfx1150/gfx1151 to the build matrix13. Do not use `rocm/vllm-dev` — that one targets AMD Instinct accelerators14, not Strix Point iGPUs.
Listing 1: vLLM Docker command to run Gemma 4 locally
#!/usr/bin/env bash
set -euo pipefail
IMAGE=vllm/vllm-openai-rocm:v0.20.1
MODEL=google/gemma-4-E4B-it
mkdir -p "$HOME/.cache/vllm" "$HOME/.cache/huggingface" "$HOME/models"
docker run --rm -it \
--name vllm-gemma4 \
--network=host \
--ipc=host \
--shm-size=16G \
\
--device=/dev/kfd \
--device=/dev/dri \
--group-add=video \
--group-add=render \
--cap-add=SYS_PTRACE \
--security-opt seccomp=unconfined \
\
-e HSA_OVERRIDE_GFX_VERSION=11.5.0 \
-e HSA_ENABLE_SDMA=0 \
-e ROCBLAS_USE_HIPBLASLT=1 \
-e HIP_FORCE_DEV_KERNARG=1 \
-e SAFETENSORS_FAST_GPU=1 \
-e TOKENIZERS_PARALLELISM=false \
-e TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1 \
-e HF_TOKEN="${HF_TOKEN:-}" \
\
-v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
-v "$HOME/.cache/vllm:/root/.cache/vllm" \
-v "$HOME/models:/app/models" \
\
"$IMAGE" \
"$MODEL" \
--dtype float16 \
--max-model-len 131072 \
--gpu-memory-utilization 0.85 \
--enable-prefix-caching \
--enable-auto-tool-choice \
--tool-call-parser gemma4 \
--safetensors-load-strategy=prefetch \
--host 0.0.0.0 \
--port 8000
Why each flag is the way it is
| Variable / flag | Why on gfx1150 | Source |
|---|---|---|
| `HSA_OVERRIDE_GFX_VERSION=11.5.0` | RDNA 3.5 ISA. 11.0.0 (gfx1100) silently mismatches. | 10 |
| `HSA_ENABLE_SDMA=0` | Disables SDMA copy engines that race the CPU on the shared bus and cause hangs. | 10 |
| `ROCBLAS_USE_HIPBLASLT=1` | Switches GEMM to hipBLASLt; ~10–15% throughput on small batch sizes. | 10 |
| `HIP_FORCE_DEV_KERNARG=1` | Keeps kernel arguments in device memory; prevents rare SIGBUS faults on UMA APUs. | 10 |
| `SAFETENSORS_FAST_GPU=1` | Faster safetensors-to-GPU transfer. Already baked into the upstream Dockerfile; setting it explicitly is harmless. | 15 |
| `TOKENIZERS_PARALLELISM=false` | Avoids HuggingFace tokenizer thread storm under load. | (HF docs) |
| `TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1` | Enables memory-efficient SDPA in the vision encoder (still experimental on AMD). | (vLLM log) |
| `--group-add=render` | On Ubuntu 24.04 + kernel 6.17, `/dev/dri/renderD128` is owned by `render`, not `video`. | (kernel docs) |
| `--tool-call-parser gemma4` | Gemma 4-specific parser. The legacy name `gemma` does not exist in v0.20.1. | (vLLM 0.20.1) |
| `--max-model-len 131072` | Full 128 K context. Possible because hybrid SWA caps the per-layer KV growth (see Part IV). | This guide |
| `--gpu-memory-utilization 0.85` | Leaves ~10 GiB UMA for the OS / desktop session. | This guide |
| Compile-cache volume mount | Persists torch.compile artifacts across `--rm` restarts. Saves ~40 s on every cold start. | (vLLM log) |
Flags that look like they should be there, but aren’t:
`--enable-chunked-prefill` (default in the V1 engine), `--attention-backend TRITON_ATTN` (auto-forced by vLLM because Gemma 4 has heterogeneous head dims of 256/512), and `PYTORCH_HIP_ALLOC_CONF=expandable_segments:True` (silently ignored — that flag is CUDA-only).
Run these once before launching:
%%bash
# 1. UMA must be a fixed size in BIOS, not "Auto"
# Recommended: 32 GiB minimum. Otherwise vLLM may die with
# "amdgpu version file missing"[^ollama-issue-11451].
cat /sys/module/amdgpu/version 2>/dev/null || echo "UMA likely set to Auto — fix in BIOS"
UMA likely set to Auto — fix in BIOS
%%bash
# 2. Confirm gfx1150 is what ROCm sees
rocminfo | grep -A1 "Name:.*gfx"
Name: gfx1150
Uuid: GPU-XX
--
Name: amdgcn-amd-amdhsa--gfx1150
Machine Models: HSA_MACHINE_MODEL_LARGE
--
Name: amdgcn-amd-amdhsa--gfx11-generic
Machine Models: HSA_MACHINE_MODEL_LARGE
%%bash
# 3. Make sure your user is in the right groups (host side)
id | tr ',' '\n' | grep -E 'video|render'
44(video)
992(render)
%%bash
# 4. Verify the image has gfx1150 compiled in
docker exec vllm-gemma4 bash -c 'rocminfo | grep -A1 "Name:.*gfx"'
Name: gfx1150
Uuid: GPU-XX
--
Name: amdgcn-amd-amdhsa--gfx1150
Machine Models: HSA_MACHINE_MODEL_LARGE
--
Name: amdgcn-amd-amdhsa--gfx11-generic
Machine Models: HSA_MACHINE_MODEL_LARGE
Smoke test
Health check:
%%bash
curl -s http://localhost:8000/v1/models | jq
{
"object": "list",
"data": [
{
"id": "google/gemma-4-E4B-it",
"object": "model",
"created": 1777974627,
"owned_by": "vllm",
"root": "google/gemma-4-E4B-it",
"parent": null,
"max_model_len": 131072,
"permission": [
{
"id": "modelperm-b1991f33b5a7c34c",
"object": "model_permission",
"created": 1777974627,
"allow_create_engine": false,
"allow_sampling": true,
"allow_logprobs": true,
"allow_search_indices": false,
"allow_view": true,
"allow_fine_tuning": false,
"organization": "*",
"group": null,
"is_blocking": false
}
]
}
]
}
Self introduction
%%bash
curl -s http://localhost:8000/v1/chat/completions \
-H 'content-type: application/json' \
-d '{"model":"google/gemma-4-E4B-it",
"messages":[{"role":"user","content":"Identify yourself in one sentence."}]}' \
| jq -r '.choices[0].message.content'
I am Gemma 4, a Large Language Model developed by Google DeepMind.
If the second call returns a self-introduction like the one above, your stack is healthy.
Capabilities, with examples
All examples assume OPENAI_BASE_URL=http://localhost:8000/v1 and the openai>=1.0 client.
Native system role
from openai import OpenAI
c = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = c.chat.completions.create(
model="google/gemma-4-E4B-it",
messages=[
{"role": "system",
"content": "You are a terse, technically precise assistant. "
"Respond in <=2 sentences."},
{"role": "user",
"content": "Explain RDNA 3.5 wave32 and why it matters for LLM kernels."},
],
temperature=1.0, top_p=0.95,
)
print(resp.choices[0].message.content)
RDNA 3.5 wave32 is a hardware feature that allows more granular thread management and execution grouping on AMD GPUs. This finer-grained control improves resource utilization and parallelism, benefiting LLM kernels that require massive concurrent computations.
The system role is native in Gemma 4 and persists across the entire multi-turn conversation — no more “user-prefix” workaround from Gemma 35.
Configurable thinking
# Thinking ON — math, multi-step reasoning, code review
sys = {"role": "system", "content": "thinking: on"}
# Thinking OFF — chat, summarisation, low-latency UX
sys = {"role": "system", "content": "thinking: off"}
When thinking is on, the response stream contains <|channel|>thought ... </|channel|> followed by the final answer16.
Strip the thought block before replaying history. Leaving prior thoughts in the conversation degrades the next turn — this is the single most common multi-turn bug reported on E4B5.
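A minimal sketch of that stripping step, assuming the channel markers are exactly the ones quoted above (verify against what your build actually emits):

```python
import re

# The channel markers come from the description above; treat them as an
# assumption and adjust the pattern to match your server's actual output.
THOUGHT_RE = re.compile(r"<\|channel\|>thought.*?</\|channel\|>", re.DOTALL)

def strip_thoughts(text: str) -> str:
    """Drop the thinking block before appending an assistant turn to history."""
    return THOUGHT_RE.sub("", text).strip()

raw = "<|channel|>thought 512 * 4 = 2048, check units...</|channel|> The answer is 2048."
print(strip_thoughts(raw))        # -> "The answer is 2048."
# history.append({"role": "assistant", "content": strip_thoughts(raw)})
```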
Image understanding
import base64
from openai import OpenAI
c = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
img_b64 = base64.b64encode(open("/home/ojitha/workspace/data/invoice.jpg", "rb").read()).decode()
resp = c.chat.completions.create(
model="google/gemma-4-E4B-it",
messages=[{
"role": "user",
"content": [
{"type": "image_url",
"image_url": {"url": f"data:image/png;base64,{img_b64}"}},
{"type": "text",
"text": "Extract every line item as row to display in Markdwon table: "
"[{description, qty, unit_price, total}]. "
"Also return a bounding box for the totals row "
"in [y1,x1,y2,x2] normalized to 1000."},
],
}],
extra_body={"mm_processor_kwargs": {"image_token_budget": 560}},
)
print(resp.choices[0].message.content)
| Description | Qty | Unit Price | Total |
|---|---|---|---|
| CLEARANCE! Fast Dell Desktop Computer PC DUAL CORE WINDOWS 10 4/8/16GB RAM | 3.00 | 209.00 | 627.00 |
| HP T520 Thin Client Computer AMD GX-212C 1.2GHz 4GB RAM TESTED !!READ BELOW!! | 5.00 | 37.75 | 188.75 |
| gaming pc desktop computer | 1.00 | 400.00 | 400.00 |
| 12-Core Gaming Computer Desktop PC Tower Affordable GAMING PC 8GB AMD Vega RGB | 3.00 | 464.89 | 1,394.67 |
| Custom Build Dell Optiplex 9020 MT i5-4570 3.20GHz Desktop Computer PC | 5.00 | 221.99 | 1,109.95 |
| Dell Optiplex 990 MT Computer PC Quad Core i7 3.4GHz 16GB 2TB WD Windows 10 Pro | 4.00 | 269.95 | 1,079.80 |
| Dell Core 2 Duo Desktop Computer | Windows XP Pro | 4GB | 5.00 | 168.00 | 840.00 |
| **Total** | | | |
Bounding Box for Totals Row: [794, 381, 815, 887]
Gemma 4 emits bounding boxes as [y1, x1, y2, x2] integers in $[0, 1000]$ — the canonical Gemma 4 detection format12.
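A small helper to map those normalised coordinates back onto the source image; the page size in the usage line is an illustrative assumption, not the real dimensions of the invoice scan:

```python
def denorm_box(box, width, height):
    """Map a Gemma 4 [y1, x1, y2, x2] box, normalised to [0, 1000], onto pixels."""
    y1, x1, y2, x2 = box
    return (round(y1 / 1000 * height), round(x1 / 1000 * width),
            round(y2 / 1000 * height), round(x2 / 1000 * width))

# Totals-row box from the example above; 1654x2339 is an assumed page size --
# use the real invoice.jpg dimensions in practice.
print(denorm_box([794, 381, 815, 887], width=1654, height=2339))
```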
Audio understanding (E4B native)
Audio is one of E4B’s distinguishing capabilities3. The vLLM ROCm wheel may trail the CUDA wheel by a release on multi-modal audio. If `audio_url` is rejected, run audio through the transformers any-to-any pipeline on the same machine4:
from transformers import pipeline
pipe = pipeline("any-to-any", model="google/gemma-4-E4B-it",
device_map="auto", torch_dtype="float16")
out = pipe([{"role": "user",
"content": [{"type": "audio", "audio": "speech.wav"},
{"type": "text", "text": "Transcribe and translate to English."}]}])
Gemma 4 does well at ASR across 100+ languages, speech-to-text translation, speaker-turn detection, and audio captioning. It does not synthesise audio.
Function calling (native, structured)
from openai import OpenAI
c = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
tools = [{
"type": "function",
"function": {
"name": "get_rocm_smi",
"description": "Return current GPU utilisation and memory for gfx1150.",
"parameters": {
"type": "object",
"properties": {"verbose": {"type": "boolean"}},
"required": [],
},
},
}]
resp = c.chat.completions.create(
model="google/gemma-4-E4B-it",
messages=[{"role": "user",
"content": "What is my GPU doing right now? Use a tool."}],
tools=tools, tool_choice="auto",
)
resp
ChatCompletion(id='chatcmpl-ba4debfaf19a1ac0', choices=[Choice(finish_reason='tool_calls', index=0, logprobs=None, message=ChatCompletionMessage(content=None, refusal=None, role='assistant', annotations=None, audio=None, function_call=None, tool_calls=[ChatCompletionMessageFunctionToolCall(id='chatcmpl-tool-bab0a39d1c90fb58', function=Function(arguments='{}', name='get_rocm_smi'), type='function')], reasoning=None), stop_reason=None, token_ids=None)], created=1777986405, model='google/gemma-4-E4B-it', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=13, prompt_tokens=75, total_tokens=88, completion_tokens_details=None, prompt_tokens_details=None), prompt_logprobs=None, prompt_token_ids=None, kv_transfer_params=None)
The agentic loop:
sequenceDiagram
participant User
participant Agent as Your code
participant vLLM as vLLM (Gemma 4)
participant Tool as get_rocm_smi
User->>Agent: "What is my GPU doing?"
Agent->>vLLM: chat.completions w/ tools
vLLM-->>Agent: tool_calls=[get_rocm_smi]
Agent->>Tool: exec amd-smi
Tool-->>Agent: {gfx_util, mem_used}
Agent->>vLLM: history + tool result
vLLM-->>Agent: natural-language answer
Agent-->>User: "GPU is at 73% with 7.2/16 GiB used."
vLLM’s `--tool-call-parser gemma4` rewrites Gemma 4’s native channel format into the OpenAI `tool_calls` shape automatically.
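A minimal sketch of that loop against the endpoint above. The local get_rocm_smi body and the exact amd-smi invocation are placeholders for illustration; only the OpenAI-client calls mirror the earlier example:

```python
import json
import subprocess
from openai import OpenAI

c = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

tools = [{"type": "function", "function": {
    "name": "get_rocm_smi",
    "description": "Return current GPU utilisation and memory for gfx1150.",
    "parameters": {"type": "object",
                   "properties": {"verbose": {"type": "boolean"}},
                   "required": []}}}]

def get_rocm_smi(verbose: bool = False) -> dict:
    # Placeholder tool body: shell out to amd-smi and return its raw output.
    # The exact amd-smi subcommand/flags are an assumption; adapt to your install.
    out = subprocess.run(["amd-smi", "metric", "--json"],
                         capture_output=True, text=True)
    return {"raw": out.stdout[:2000]}

messages = [{"role": "user",
             "content": "What is my GPU doing right now? Use a tool."}]
first = c.chat.completions.create(model="google/gemma-4-E4B-it",
                                  messages=messages, tools=tools,
                                  tool_choice="auto")
msg = first.choices[0].message

# Replay the assistant turn (with its tool_calls), then one tool result per call.
messages.append({"role": "assistant", "content": msg.content,
                 "tool_calls": [{"id": t.id, "type": "function",
                                 "function": {"name": t.function.name,
                                              "arguments": t.function.arguments}}
                                for t in (msg.tool_calls or [])]})
for call in msg.tool_calls or []:
    result = get_rocm_smi(**json.loads(call.function.arguments or "{}"))
    messages.append({"role": "tool", "tool_call_id": call.id,
                     "content": json.dumps(result)})

final = c.chat.completions.create(model="google/gemma-4-E4B-it", messages=messages,
                                  tools=tools)
print(final.choices[0].message.content)
```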
Long context (128 K)
from openai import OpenAI
c = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
src = open("/path/to/large_repo_concat.py").read() # ~250 KB ≈ 70K tokens
resp = c.chat.completions.create(
model="google/gemma-4-E4B-it",
messages=[
{"role": "system", "content": "thinking: on"},
{"role": "user",
"content": f"Here is a Python module:\n\n```python\n{src}\n```\n\n"
f"Find every `# TODO`, group by function, propose a fix."},
],
max_tokens=4096,
)
`--enable-prefix-caching` is already in the launch script — subsequent queries against the same prefix amortise prefill to near zero, a meaningful saving when iterating on the same repo.
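A follow-up request that reuses the same `c` and `src` from the listing above illustrates the saving. Only the trailing question changes, so the shared system-plus-source prefix is served from the cache rather than re-prefilled:

```python
# Second query over the same prefix. With --enable-prefix-caching the shared
# prefix tokens are reused, so time-to-first-token drops sharply.
resp2 = c.chat.completions.create(
    model="google/gemma-4-E4B-it",
    messages=[
        {"role": "system", "content": "thinking: on"},
        {"role": "user",
         "content": "Here is a Python module:\n\n" + src + "\n\n"
                    "List every function that opens a file without a context manager."},
    ],
    max_tokens=2048,
)
print(resp2.choices[0].message.content)
```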
Multilingual
from openai import OpenAI
c = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = c.chat.completions.create(
model="google/gemma-4-E4B-it",
messages=[{"role": "user",
"content": "ඔයාට සිංහල කථා කරන්න පුළුවන් ද? "
"Sri Lankan කන්දේගල් තැන් 5ක් යෝජනා කරන්න."}],
)
print(resp.choices[0].message.content)
ඔව්, මට සිංහල කතා කරන්න පුළුවන්. 😊
ශ්රී ලංකාවේ සංචාරක කටයුතු සඳහා සුන්දර සහ රසවත් කන්ද සහිත ස්ථාන 5ක් මම ඔබට යෝජනා කරන්නම්. ඔබේ කැමැත්ත (ස්වභාවික සුන්දරත්වය, ඓතිහාසික වැදගත්කම, නැගීම තරමක් අපහසු වීම වැනි) අනුව මේවා තෝරා ගත හැකියි.
---
### ශ්රී ලංකාවේ කඳු සහිත ස්ථාන 5ක්:
**1. නුවරඑළිය (Nuwara Eliya)**
* **ඇයි මෙතනට යන්න ඕනේ:** මෙය ශ්රී ලංකාවේ වඩාත්ම ජනප්රිය කඳුකර නගරයයි. තේ වතු, සිසිල් දේශගුණය, ලස්සන දර්ශන සහ විවිධ ගමනාන්තයන් මෙහි තිබේ.
* **විශේෂත්වය:** තේ කර්මාන්තය, කඳු නැගීමේ සුන්දරත්වය, සහ ප්රසන්න ග්රාමීය පරිසරය.
* **කඳුකර අත්දැකීම:** විවිධ උසවල තේ වතුවලින් යුත් භූ දර්ශන දැකගත හැකිය.
**2. සීගිරිය (Sigiriya) - (කන්දක් ලෙස සැලකිය හැක)**
* **ඇයි මෙතනට යන්න ඕනේ:** මෙය ලෝක උරුමයක් වන අතර, අතිශය නාටකාකාර ලෙස ඉදිකර ඇති පර්වතයක් මත පිහිටා ඇත. එහි ඉතිහාසය හා වාස්තු විද්යාව විශ්මයජනකයි.
* **විශේෂත්වය:** සිංහල ශිෂ්ටාචාරයේ උච්චතම අවස්ථාවක් නියෝජනය කරයි. කඳු මුදුනට නැගීමත් සමඟ ලැබෙන දර්ශන අසමසමයි.
* **කඳුකර අත්දැකීම:** පැරණි රාජකීය බලකොටුවක් මත ඇති අභියෝගාත්මක ගමන.
**3. එල්ලේ (Ella)**
* **ඇයි මෙතනට යන්න ඕනේ:** මෑතකදී ප්රසිද්ධියට පත් වූ මෙම ප්රදේශය, කඳුකරයේ තරුණ හා ස්වභාවික සුන්දරත්වය නියෝජනය කරයි.
* **විශේෂත්වය:** ඇල්ලේට ආසන්නයේ පිහිටි **තොටගල වසන්ත උද්යානය**, **ඩොලිනීස් (Dodiyawala)** සහ කඳුකරයේ ඇති කුඩා ගම්මාන දැකගත හැකිය. මෙහි සිට ජේස්ට් පාලම (Little Adam's Peak) වෙත යන මාර්ගය ඉතා සුන්දරය.
* **කඳුකර අත්දැකීම:** සන්සුන්, තරුණ සහ ඓතිහාසික නොවන, නමුත් ඉතා සුවිශේෂී කඳුකර අත්දැකීමක්.
**4. නුවරඑළිය සහ මාතලේ ප්රදේශයේ කඳු (Matale Hills)**
* **ඇයි මෙතනට යන්න ඕනේ:** නුවරඑළියට වඩා විවිධත්වය සහිත, අඩු සංචාරක ජනතාවක් සිටින කඳුකර ප්රදේශ කිහිපයකි.
* **විශේෂත්වය:** මෙම ප්රදේශවල ඔබට තේ වතුවලට අමතරව කුඩා ගම්මාන, ස්වාභාවික ජල ඇලි සහ ග්රාමීය ජීවිතය දැකගත හැකිය.
* **කඳුකර අත්දැකීම:** නිස්කලංක හා සැබෑ ශ්රී ලාංකේය කඳුකර ජීවිතය අත්විඳීම.
**5. බදුල්ල/හඹානගල ප්රදේශයේ කඳුකරය (Badulla Area - Horton Plains/Hidden Valley)**
* **ඇයි මෙතනට යන්න ඕනේ:** ඔබට සැබෑ, උස් සහ වියළි කඳුකර භූ දර්ශනයක් අවශ්ය නම් මෙය සුදුසුයි.
* **විශේෂත්වය:** මෙම ප්රදේශවල ශාක විද්යාත්මකව ඉතා වැදගත්, උස් බිම් සහිත කලාප ඇත. (උදා: හෝර්ටන් තැන්න නම් වන ස්ථාන). මෙහි දේශගුණය නුවරඑළියට වඩා වෙනස්, වඩා තද සහ විෂමතාවයක් ඇත.
* **කඳුකර අත්දැකීම:** මීදුම් සහිත, විද්යාත්මකව වැදගත් සහ අභියෝගාත්මක කඳුකර ගමනක්.
---
**📌 කෙටි සාරාංශය (ඔබේ අවශ්යතාවය අනුව තෝරාගන්න):**
* **ලස්සන දර්ශන සහ සුවපහසු බව:** නුවරඑළිය
* **ඉතිහාසය සහ අභියෝගය:** සීගිරිය
* **තරුණ සහ සැහැල්ලු අත්දැකීම:** එල්ලේ
* **සැබෑ, අභියෝගාත්මක කඳුකරය:** බදුල්ල/හෝර්ටන් තැන්න
ඔබට මේවායින් කුමන ආකාරයේ අත්දැකීමක්ද අවශ්ය වන්නේ? මට තවදුරටත් තොරතුරු ලබා දිය හැක!
Gemma 4 is pre-trained on 140+ languages and instruction-tuned on 35+5. Code-switching between Sinhala / English / Tamil within one prompt is fine; pin the target register in the system role for stability.
Coding
For diff-style edits, drop temperature:
from openai import OpenAI
c = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = c.chat.completions.create(
model="google/gemma-4-E4B-it",
messages=[
{"role": "system",
"content": "Output unified diffs only. No prose."},
{"role": "user",
"content": "Refactor this loop to a comprehension:\n"
"```python\nout=[]\nfor x in xs:\n if x>0:\n out.append(x*x)\n```"}],
temperature=0.2, top_p=0.9,
)
print(resp.choices[0].message.content)
```diff
--- a/script.py
+++ b/script.py
@@ -1,5 +1,3 @@
-out=[]
-for x in xs:
- if x>0:
- out.append(x*x)
+out = [x*x for x in xs if x > 0]
```
Visual token budget tuning (operational)
Switching budgets per-request is allowed and is exactly how production multimodal pipelines on E4B economise compute12.
# Hot path: per-frame "is anyone there"
extra = {"mm_processor_kwargs": {"image_token_budget": 70}}
# Cold path: alarm fired, read the badge
extra = {"mm_processor_kwargs": {"image_token_budget": 1120}}
Part IV — Advanced topics
KV cache mathematics for hybrid attention
Take the formula from Part II:
\[\text{KV}_{\text{hybrid}}(n) \;=\; 2\,h_{kv}\,d_{h}\,b \,\Big[\,L_{s}\cdot \min(n, w) \;+\; L_{g}\cdot n \,\Big]\]
For the global layers, E4B additionally applies unified $K = V$, which divides their contribution by 2:
\[\text{KV}_{\text{E4B}}(n) \;=\; h_{kv}\,d_{h}\,b \,\Big[\,2L_{s}\cdot \min(n, w) \;+\; L_{g}\cdot n \,\Big]\]
Two regimes worth remembering:
- Short context ($n \le w = 512$): cache scales with $n \cdot (L_{s} + L_{g})$. Same as a fully-global model.
- Long context ($n \gg w$): SWA term saturates; cache scales with $2 L_{s} w + L_{g} n$. Effectively only the global layers pay.
For 128 K context the SWA contribution is bounded; the global term dominates. This is why the only knobs that meaningfully shrink long-context KV are reducing $L_{g}$ (architectural, fixed for E4B) and quantising the cache itself — see the quantisation section below.
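A back-of-envelope calculator for the E4B formula above. The layer counts and head geometry are illustrative assumptions, not the published config; substitute the values from the model's config.json before trusting the absolute numbers:

```python
# Sketch of KV_E4B(n) with assumed layer counts and head geometry.
def kv_e4b_bytes(n, w=512, L_s=25, L_g=5, h_kv=8, d_h=256, b=2):
    swa_term = 2 * L_s * min(n, w)   # bounded once n > w
    glo_term = L_g * n               # unified K=V already folded in (no factor 2)
    return h_kv * d_h * b * (swa_term + glo_term)

for n in (512, 8192, 32768, 131072):
    print(f"n={n:>6}: {kv_e4b_bytes(n) / 2**30:5.2f} GiB")
```

With these assumed numbers the SWA term is a fixed ~0.1 GiB above the window, while the global term grows to a few GiB at 131072 tokens, which is the shape of the argument even if the constants differ.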
p-RoPE and unified K/V — why both exist
Long-context generalisation has two failure modes:
- Attention dilution — attention scores spread thin over many tokens; the model loses focus.
- RoPE wraparound — at positions far past training, RoPE phases become indistinguishable, breaking distance encoding.
p-RoPE addresses (2) by leaving a fraction $1 - p$ of head dimensions unrotated. Those untouched dimensions provide a positional-invariant identity channel that the model can lean on at extreme distances. Unified $K = V$ in global layers addresses cache pressure that arises because of (1): when you must attend over many tokens, every byte of KV per token matters.
The two design choices compose: unified KV makes 256 K affordable on the medium models, p-RoPE makes the resulting attention well-behaved out there5 8.
Quantisation trade-offs
The options that work on gfx1150 today:
| Scheme | Resident weights | Prefill speedup vs FP16 | Quality cost | vLLM flag |
|---|---|---|---|---|
| FP16 | ~9 GB | 1.0× | baseline | (default) |
| Q4_K_M | ~4.5 GB | ~1.6× (llama.cpp) | small loss on math, negligible on chat | not via vLLM — use llama.cpp17 |
| AWQ-INT4 | ~4 GB | ~1.8× | small loss across the board | --quantization awq |
| GPTQ-INT4 | ~4 GB | ~1.7× | similar to AWQ; sometimes worse on long context | --quantization gptq |
| FP8 (KV) | weights FP16, KV FP8 | minor | small KV recall loss | --kv-cache-dtype fp8 (where supported) |
On bandwidth-bound iGPUs, AWQ-INT4 is roughly a 2× decode-throughput upgrade. If you measured 6–11 tok/s in FP16 (see Part V benchmarking), expect 12–18 tok/s after switching to AWQ. KV-cache FP8 stacks on top.
NPU as a sidecar via MIGraphX
vLLM does not route ops to the XDNA 2 NPU. Your installed MIGraphX 2.15.0.dev (g1afd1b89c) is the bridge. The recommended split:
flowchart LR
classDef vllm fill:#e0e7ff,stroke:#4f46e5,color:#000
classDef npu fill:#fce7f3,stroke:#db2777,color:#000
C["Client"]
C -->|interactive chat<br/>:8000| V["vLLM<br/>Gemma 4 E4B FP16<br/>iGPU gfx1150"]:::vllm
C -->|batch / classification<br/>:8001| M["ONNX-RT or MIGraphX<br/>quantised E4B INT8/BF16<br/>NPU aie2p"]:::npu
Why split:
- Latency-sensitive interactive paths (chat, agents) want vLLM’s continuous batching and KV reuse — that lives on the iGPU.
- High-throughput one-shot inference (image classification, sentiment, short-context Q&A) wants the NPU’s INT8/BF16 compute density. Quantise the model to ONNX once, route via MIGraphX, and expose it on a separate port.
The two endpoints share one weight cache on disk but run in two processes — they never compete for the same HIP allocator.
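A minimal routing sketch for that split, assuming the sidecar also exposes an OpenAI-compatible endpoint on :8001 (the sidecar port and model id here are hypothetical until you have exported and served a quantised ONNX build yourself):

```python
from openai import OpenAI

# Chat path: vLLM on the iGPU. Batch path: the assumed sidecar on :8001.
IGPU = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
NPU = OpenAI(base_url="http://localhost:8001/v1", api_key="EMPTY")

def route(task: str, messages: list[dict]):
    """Latency-sensitive chat goes to the iGPU; one-shot batch work to the NPU."""
    if task == "chat":
        client, model = IGPU, "google/gemma-4-E4B-it"
    else:
        client, model = NPU, "gemma-4-E4B-int8-onnx"   # hypothetical sidecar model id
    return client.chat.completions.create(model=model, messages=messages)

# Example: a one-shot classification request goes to the NPU endpoint.
out = route("classify", [{"role": "user",
                          "content": "Sentiment of: 'great battery life'"}])
```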
Part V — Operations
Troubleshooting
| Symptom | Cause / fix |
|---|---|
| `Aborted (core dumped)` on first request | KV cache too large. Drop `--max-model-len` to 16384, climb back up. |
| Throughput < 5 tok/s | hipBLASLt not active. Confirm `ROCBLAS_USE_HIPBLASLT=1`10. |
| Random kernel panic under sustained load | SDMA contention. `HSA_ENABLE_SDMA=0` (already in the launch script)10. |
| `gfx1150` not in `supported_archs` | Older ROCm. You’re on 7.2.0 — fine. If forced to downgrade, accept the perf hit and keep `HSA_OVERRIDE_GFX_VERSION=11.5.0`. |
| `amdgpu version file missing` on vLLM startup | UMA = Auto in BIOS hides `/sys/module/amdgpu/version`. Set UMA fixed (≥32 GiB)18. |
| Image inference much slower than text | Vision encoder runs on the iGPU too. Lower `image_token_budget`, or pre-resize host-side. |
| OOM at 128 K context | Either use AWQ-INT4 weights or set `--kv-cache-dtype fp8`. |
| Multi-turn quality decays | You replayed thinking blocks. Strip them from history5. |
| `KeyError: 'invalid tool call parser: gemma'` | Use `--tool-call-parser gemma4` (the v4-family parser). The legacy name `gemma` does not exist in vLLM 0.20.1. |
| `expandable_segments not supported on this platform` | The CUDA-only env var leaked into the ROCm path. Remove `PYTORCH_HIP_ALLOC_CONF=expandable_segments:True`. Harmless but noisy. |
| 40-second startup every restart | The torch.compile cache is in the container, not on disk. Mount `$HOME/.cache/vllm:/root/.cache/vllm` (already in the launch script). |
Benchmarking
The vLLM usage object only contains token counts, not timing — you must measure wall clock independently.
Quick decode benchmark
%%bash
START=$(date +%s.%N); \
R=$(curl -s http://localhost:8000/v1/chat/completions \
-H 'content-type: application/json' \
-d '{"model":"google/gemma-4-E4B-it",
"messages":[{"role":"user","content":"Write a 500-word essay about the history of compilers."}],
"max_tokens":600, "temperature":0.0}'); \
END=$(date +%s.%N); \
TOK=$(echo "$R" | jq -r '.usage.completion_tokens'); \
ELAPSED=$(echo "$END - $START" | bc); \
printf 'tokens=%s elapsed=%.2fs tok/s=%.2f\n' \
"$TOK" "$ELAPSED" "$(echo "scale=4; $TOK/$ELAPSED" | bc)"
tokens=600 elapsed=87.98s tok/s=6.82
Prometheus metrics (more accurate)
%%bash
curl -s http://localhost:8000/metrics | grep -E \
'^vllm:(time_to_first_token_seconds|time_per_output_token_seconds|generation_tokens|prompt_tokens)' \
| grep -v '#'
vllm:prompt_tokens_total{engine="0",model_name="google/gemma-4-E4B-it"} 2972.0
vllm:prompt_tokens_created{engine="0",model_name="google/gemma-4-E4B-it"} 1.7779838226789353e+09
vllm:prompt_tokens_by_source_total{engine="0",model_name="google/gemma-4-E4B-it",source="local_compute"} 1020.0
vllm:prompt_tokens_by_source_total{engine="0",model_name="google/gemma-4-E4B-it",source="local_cache_hit"} 1952.0
vllm:prompt_tokens_by_source_total{engine="0",model_name="google/gemma-4-E4B-it",source="external_kv_transfer"} 0.0
vllm:prompt_tokens_by_source_created{engine="0",model_name="google/gemma-4-E4B-it",source="local_compute"} 1.7779838226789443e+09
vllm:prompt_tokens_by_source_created{engine="0",model_name="google/gemma-4-E4B-it",source="local_cache_hit"} 1.7779838226789477e+09
vllm:prompt_tokens_by_source_created{engine="0",model_name="google/gemma-4-E4B-it",source="external_kv_transfer"} 1.777983822678951e+09
vllm:prompt_tokens_cached_total{engine="0",model_name="google/gemma-4-E4B-it"} 1952.0
vllm:prompt_tokens_cached_created{engine="0",model_name="google/gemma-4-E4B-it"} 1.7779838226789572e+09
vllm:generation_tokens_total{engine="0",model_name="google/gemma-4-E4B-it"} 5779.0
vllm:generation_tokens_created{engine="0",model_name="google/gemma-4-E4B-it"} 1.777983822678964e+09
vllm:time_to_first_token_seconds_bucket{engine="0",le="0.001",model_name="google/gemma-4-E4B-it"} 0.0
vllm:time_to_first_token_seconds_bucket{engine="0",le="0.005",model_name="google/gemma-4-E4B-it"} 0.0
vllm:time_to_first_token_seconds_bucket{engine="0",le="0.01",model_name="google/gemma-4-E4B-it"} 0.0
vllm:time_to_first_token_seconds_bucket{engine="0",le="0.02",model_name="google/gemma-4-E4B-it"} 0.0
vllm:time_to_first_token_seconds_bucket{engine="0",le="0.04",model_name="google/gemma-4-E4B-it"} 0.0
vllm:time_to_first_token_seconds_bucket{engine="0",le="0.06",model_name="google/gemma-4-E4B-it"} 0.0
vllm:time_to_first_token_seconds_bucket{engine="0",le="0.08",model_name="google/gemma-4-E4B-it"} 0.0
vllm:time_to_first_token_seconds_bucket{engine="0",le="0.1",model_name="google/gemma-4-E4B-it"} 0.0
vllm:time_to_first_token_seconds_bucket{engine="0",le="0.25",model_name="google/gemma-4-E4B-it"} 10.0
vllm:time_to_first_token_seconds_bucket{engine="0",le="0.5",model_name="google/gemma-4-E4B-it"} 12.0
vllm:time_to_first_token_seconds_bucket{engine="0",le="0.75",model_name="google/gemma-4-E4B-it"} 12.0
vllm:time_to_first_token_seconds_bucket{engine="0",le="1.0",model_name="google/gemma-4-E4B-it"} 12.0
vllm:time_to_first_token_seconds_bucket{engine="0",le="2.5",model_name="google/gemma-4-E4B-it"} 13.0
vllm:time_to_first_token_seconds_bucket{engine="0",le="5.0",model_name="google/gemma-4-E4B-it"} 13.0
vllm:time_to_first_token_seconds_bucket{engine="0",le="7.5",model_name="google/gemma-4-E4B-it"} 13.0
vllm:time_to_first_token_seconds_bucket{engine="0",le="10.0",model_name="google/gemma-4-E4B-it"} 13.0
vllm:time_to_first_token_seconds_bucket{engine="0",le="20.0",model_name="google/gemma-4-E4B-it"} 14.0
vllm:time_to_first_token_seconds_bucket{engine="0",le="40.0",model_name="google/gemma-4-E4B-it"} 14.0
vllm:time_to_first_token_seconds_bucket{engine="0",le="80.0",model_name="google/gemma-4-E4B-it"} 14.0
vllm:time_to_first_token_seconds_bucket{engine="0",le="160.0",model_name="google/gemma-4-E4B-it"} 14.0
vllm:time_to_first_token_seconds_bucket{engine="0",le="640.0",model_name="google/gemma-4-E4B-it"} 14.0
vllm:time_to_first_token_seconds_bucket{engine="0",le="2560.0",model_name="google/gemma-4-E4B-it"} 14.0
vllm:time_to_first_token_seconds_bucket{engine="0",le="+Inf",model_name="google/gemma-4-E4B-it"} 14.0
vllm:time_to_first_token_seconds_count{engine="0",model_name="google/gemma-4-E4B-it"} 14.0
vllm:time_to_first_token_seconds_sum{engine="0",model_name="google/gemma-4-E4B-it"} 18.429856061935425
vllm:time_to_first_token_seconds_created{engine="0",model_name="google/gemma-4-E4B-it"} 1.7779838226792686e+09
Look at `vllm:time_per_output_token_seconds_sum` / `vllm:generation_tokens_total` for the true average decode time per token.
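A small helper that computes that ratio from a live scrape; it relies only on the metric names referenced above and in the output of the curl command:

```python
import re
import urllib.request

# Scrape /metrics and derive the average decode time per output token.
text = urllib.request.urlopen("http://localhost:8000/metrics").read().decode()

def scrape(name: str) -> float:
    pattern = r"^" + re.escape(name) + r"\{[^}]*\}\s+([0-9.eE+-]+)$"
    m = re.search(pattern, text, re.MULTILINE)
    return float(m.group(1)) if m else 0.0

tpot_sum = scrape("vllm:time_per_output_token_seconds_sum")
gen_total = scrape("vllm:generation_tokens_total")
if tpot_sum and gen_total:
    print(f"avg decode: {tpot_sum / gen_total * 1000:.1f} ms/token "
          f"(~{gen_total / tpot_sum:.2f} tok/s)")
```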
Expected throughput on the Ryzen AI 9 HX 470
Decode is memory-bandwidth-bound on this iGPU. The theoretical ceiling is
\[\text{tok/s}_{\max} \;=\; \frac{\text{memory bandwidth}}{\text{bytes read per decode step}} \;\approx\; \frac{128\;\text{GB/s}}{\sim\!10\;\text{GB}} \;\approx\; 12\text{–}13\;\text{tok/s}\]
(The ~10 GB figure includes the PLE tables, which consume bandwidth on every token even though they’re lookups, not matmuls.)
| Decode tok/s | Verdict | What it means |
|---|---|---|
| < 5 | Broken | hipBLASLt not engaging, or running on CPU fallback |
| 5–7 | Sub-optimal | Some env var missing or kernel mis-selection |
| 7–11 | Healthy FP16 | Typical for well-tuned E4B FP16 on gfx1150 |
| 11–13 | Excellent FP16 | Hitting the bandwidth ceiling |
| 12–18 | Healthy AWQ-INT4 | After switching to quantised weights |
| > 20 | Suspicious | Likely measurement error |
A measured 6.82 tok/s decode on E4B FP16 with all the launch-script tuning applied sits just below the healthy band. To break through to 12–18 tok/s, the only meaningful lever on this hardware is AWQ-INT4 quantisation — kernel tuning alone cannot exceed the FP16 bandwidth ceiling.
Prefill (prompt-processing) is compute-bound and runs much faster — expect 200–500+ tok/s on a 24-token prompt.
Where to go next
- Quantise to AWQ-INT4. The single biggest improvement available on this hardware: ~2× decode throughput, ~30% more headroom for KV cache. Pull a vetted community AWQ build of `google/gemma-4-E4B-it`, or produce one yourself with `autoawq` against the FP16 weights you already have cached. Add `--quantization awq` to the launch script.
- Activate the NPU. Quantise to ONNX INT8, route via MIGraphX, expose on port 8001. Keep vLLM as the chat path. Start with classification or sentiment workloads where the NPU’s INT8 compute density wins.
- Fine-tune E4B with QLoRA on your own data. The 4 B-effective size and PLE architecture are friendly to a single-GPU LoRA run on this very box. Train at the same image token budget you plan to serve at12.
- Pin a context strategy. For 90% of single-user workloads on this APU, `--max-model-len 32768 --enable-prefix-caching` is the sweet spot. The launch script ships with 131072 because the hybrid SWA architecture makes it cheap, but reach for 128 K only when the workload genuinely needs it.
1. Hugging Face model page. google/gemma-4-E4B-it. https://huggingface.co/google/gemma-4-E4B-it
2. Ojitha Hewa Kumanayaka. Running AMD ROCm AI Workloads locally. 7 Mar 2026. https://ojitha.github.io/ai/2026/03/07/ContainerRocm.html
3. Google. Gemma 4: Byte for byte, the most capable open models. The Keyword, 2 Apr 2026. https://blog.google/technology/developers/gemma-4/
4. Hugging Face. Welcome Gemma 4: Frontier multimodal intelligence on device. 2 Apr 2026. https://huggingface.co/blog/gemma4
5. Google AI for Developers. Gemma 4 model card. https://ai.google.dev/gemma/docs/core/model_card_4
6. LM Studio. Gemma 4 — supports tool use, vision input, and reasoning. https://lmstudio.ai/models/gemma-4
7. Google AI for Developers. Gemma 4 model overview. https://ai.google.dev/gemma/docs/core
8. Maarten Grootendorst. A Visual Guide to Gemma 4. Apr 2026. https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-gemma-4
9. Aurigai. Gemma 4 by Google: Specs, Benchmarks, and How to Run It Locally (2026 Guide). https://aurigait.com/blog/gemma-4-features-benchmarks-guide/
10. FoxEgregore. rdna35-llm-baremetal — RDNA 3.5 (gfx1150) bare-metal ROCm setup with annotated env vars. GitHub. https://github.com/FoxEgregore/rdna35-llm-baremetal
11. AMD. vLLM Linux Docker Image — Use ROCm on Radeon and Ryzen. https://rocm.docs.amd.com/projects/radeon-ryzen/en/latest/docs/advanced/advancedryz/linux/llm/build-docker-image.html
12. Datature. Gemma 4: What Computer Vision Engineers Actually Need to Know. https://datature.io/blog/gemma-4-what-computer-vision-engineers-actually-need-to-know
13. Hosang Yoon (AMD). PR #25908 — Add support for AMD Ryzen AI MAX / AI 300 Series (gfx1150 and gfx1151). https://github.com/vllm-project/vllm/pull/25908
14. AMD. rocm/vllm-dev — Docker Hub overview. https://hub.docker.com/r/rocm/vllm-dev
15. vLLM project. Dockerfile.rocm — sets `HIP_FORCE_DEV_KERNARG=1` and `SAFETENSORS_FAST_GPU=1` by default. https://github.com/vllm-project/vllm/blob/main/docker/Dockerfile.rocm
16. Ollama library. gemma4:e4b — chat template and disabled-thinking behaviour. https://ollama.com/library/gemma4:e4b
17. llm-tracker. AMD GPUs — community notes on RDNA3/3.5 inference paths. https://llm-tracker.info/howto/AMD-GPUs
18. ollama/ollama GitHub issue #11451. GPU not detected on Ryzen AI 300 (gfx1150) with Dynamic VRAM, but works with Fixed VRAM. https://github.com/ollama/ollama/issues/11451