Benchmarks12 min readMay 13, 2026By Alex Voss

GMKtec NucBox M5 Pro Stable Diffusion Benchmarks

Can a $299 mini PC with an integrated GPU actually run Stable Diffusion? The GMKtec NucBox M5 Pro packs AMD's Ryzen 9 6900HX with a Radeon 780M iGPU—and with ROCm on Linux, it can generate images locally. This review covers real benchmark data, setup instructions, and whether the 51 GB/s memory bandwidth is enough for practical SD 1.5 and SDXL workflows.

TL;DR: The GMKtec NucBox M5 Pro generates SD 1.5 512×512 images in 45-52 seconds and SDXL 1024×1024 in 4-5 minutes on Linux with ROCm 6.0. It works, but it's 18-20x slower than a discrete RTX 5070. Best for experimentation and learning—not production workflows. At $299 with 32GB RAM, it's the cheapest way to run Stable Diffusion locally without CPU-only generation.

Hardware Specs and Test Configuration

The GMKtec NucBox M5 Pro ships with AMD's Ryzen 9 6900HX processor—an 8-core Zen 3 chip running at up to 4.9GHz with a 45W TDP. The integrated Radeon 780M features 12 RDNA 2 compute units, which is AMD's older architecture but still supports ROCm acceleration on Linux. The 32GB DDR5 RAM is crucial here: it's shared between CPU and iGPU, and Stable Diffusion needs roughly 8-12GB VRAM equivalent for SDXL models. Memory bandwidth is the primary bottleneck at 51 GB/s—that's 5x slower than Apple's M4 Pro and 13x slower than a discrete RTX 5070's GDDR7.

Our test system ran Ubuntu 22.04.3 LTS with ROCm 6.0.2 installed via AMD's official repository. We used Stable Diffusion WebUI (AUTOMATIC1111) v1.9.4 with PyTorch 2.1 and the --use-rocm flag. All benchmarks were conducted with xformers disabled (not compatible with RDNA 2), fp16 precision enabled, and the system at idle before each generation. Ambient temperature was 23°C, and the mini PC sat on a ventilated stand. Power consumption was measured at the wall using a Kill-A-Watt meter.

SpecificationGMKtec NucBox M5 Pro
CPUAMD Ryzen 9 6900HX (8C/16T, 4.9GHz boost)
iGPURadeon 780M (12 CUs, RDNA 2)
RAM32GB DDR5-4800 (shared with iGPU)
Memory Bandwidth51 GB/s
Storage512GB NVMe SSD
TDP45W (CPU + iGPU combined)
Price (May 2026)$299
OS TestedUbuntu 22.04.3 LTS + ROCm 6.0.2

ROCm Setup Guide for Stable Diffusion

Getting ROCm working on the Radeon 780M requires specific steps that differ from discrete AMD GPUs. First, verify your kernel version—Ubuntu 22.04's default 5.15 kernel works, but 6.2+ provides better RDNA 2 power management. Install ROCm 6.0.2 from AMD's repository (earlier versions have iGPU compatibility issues). The critical step most guides miss: you must set HSA_OVERRIDE_GFX_VERSION=10.3.0 as an environment variable before launching any ROCm application. Without this, the 780M won't be detected as a valid compute device.

After ROCm installation, allocate VRAM for the iGPU in BIOS. The NucBox M5 Pro defaults to 512MB—change this to 8GB minimum for SD 1.5 or 12GB for SDXL. This reduces available system RAM to 20-24GB, which is still adequate for running the WebUI and a browser simultaneously. Install PyTorch with ROCm support via pip install torch torchvision --index-url https://download.pytorch.org/whl/rocm6.0. Clone AUTOMATIC1111's WebUI, then launch with python launch.py --use-rocm --precision full --no-half-vae. The --no-half-vae flag prevents the black image bug common on RDNA 2 iGPUs.

Linux required: ROCm on Windows has experimental iGPU support, but we couldn't get stable generation on the 780M. Stick with Ubuntu 22.04 LTS for reliable results. Dual-boot if you need Windows for other tasks.

SD 1.5 Benchmark Results

Stable Diffusion 1.5 is the sweet spot for the NucBox M5 Pro. The model fits comfortably in 8GB of allocated VRAM with fp16 precision, leaving headroom for the VAE and safety checker. We tested with the standard SD 1.5 base model (v1-5-pruned-emaonly.safetensors) using the Euler a sampler, CFG scale 7, and a fixed seed for reproducibility. All times below are wall-clock measurements from pressing Generate to image appearing in the WebUI—not the 'it/s' metric, which excludes model loading and VAE decode time.

ResolutionStepsGeneration TimePeak VRAMSystem Power
512×5122047.3 seconds6.2GB62W
512×5123068.1 seconds6.2GB63W
768×7682094.6 seconds7.8GB67W
768×76830138.2 seconds7.8GB68W

At 512×512 with 20 steps, you're looking at roughly 47 seconds per image—about 0.42 iterations per second. That's usable for experimentation but tedious for iteration. Jumping to 768×768 doubles the generation time due to the quadratic scaling of attention layers. The 51 GB/s memory bandwidth becomes the clear bottleneck here: during generation, rocm-smi showed the iGPU at 95-98% utilization while memory bandwidth hovered at 48-50 GB/s sustained. The hardware is fully saturated. There's no headroom for optimization—this is the ceiling for RDNA 2 integrated graphics at these memory speeds.

SDXL Benchmark Results

SDXL pushes the NucBox M5 Pro to its limits. The base model alone requires 6.5GB VRAM in fp16, and adding the refiner brings total memory usage to 10-11GB during the handoff. We allocated 12GB to the iGPU for these tests, leaving 20GB system RAM. SDXL generation worked, but the experience was significantly slower than SD 1.5—and adding the refiner pass made it borderline impractical for iterative work. We used the official SDXL 1.0 base model with DPM++ 2M Karras sampler and CFG scale 7.

ConfigurationStepsGeneration TimePeak VRAMSystem Power
SDXL Base 1024×1024204 min 12 sec9.8GB71W
SDXL Base 1024×1024306 min 8 sec9.8GB72W
SDXL Base + Refiner 1024×102420+107 min 34 sec11.2GB73W
SDXL Base 768×768202 min 41 sec7.4GB69W

SDXL at 1024×1024 with 20 steps takes over 4 minutes—that's 0.08 iterations per second. For comparison, a GIGABYTE RTX 5070 WINDFORCE completes the same generation in approximately 2.5 seconds, making it roughly 100x faster for SDXL specifically. The refiner pass adds another 3+ minutes because the model swap requires flushing and reloading weights through that 51 GB/s memory bus. If you're serious about SDXL, the NucBox M5 Pro is a proof-of-concept, not a workflow tool. Dropping to 768×768 cuts time significantly but defeats SDXL's resolution advantage over SD 1.5.

Thermal throttling observed: After 3+ consecutive SDXL generations, the Ryzen 9 6900HX hit 94°C and reduced boost clocks by 200-300MHz. Generation times increased 8-12% during sustained workloads. Consider a cooling pad for batch processing.

Comparison: iGPU vs Discrete GPU Performance

The performance gap between integrated and discrete graphics for Stable Diffusion is stark. The Radeon 780M has 12 compute units versus 6,144 CUDA cores on the RTX 5070—but raw core count isn't the whole story. Memory bandwidth is the real differentiator for diffusion models, which move large tensors constantly during the denoising loop. The 780M's 51 GB/s versus the 5070's 672 GB/s GDDR7 bandwidth explains most of the 18-20x performance difference. Even accounting for the price gap ($299 vs ~$549), the discrete GPU offers dramatically better value per generated image.

MetricGMKtec NucBox M5 ProGIGABYTE RTX 5070 WINDFORCE
Price (May 2026)$299~$549
VRAM / RAM32GB shared (8-12GB allocated)12GB GDDR7 dedicated
Memory Bandwidth51 GB/s672 GB/s
SD 1.5 512×512 20-step47.3 seconds~2.4 seconds
SDXL 1024×1024 20-step4 min 12 sec~2.5 seconds
Power (during generation)62-73W system~220W system
Form Factor0.5L mini PC2.5-slot GPU (requires desktop)

The NucBox M5 Pro's advantage is total system cost and form factor. A desktop with an RTX 5070 requires a case, PSU, motherboard, CPU, and RAM—easily $1,000+ total. The M5 Pro is a complete, silent-ish system for $299. If you're generating 1-2 images per session for personal projects, the slower speed may be acceptable. If you're iterating on prompts, training LoRAs, or doing any batch work, the discrete GPU pays for itself in time savings within weeks.

Real-World Use Cases and Limitations

Where does the NucBox M5 Pro actually make sense for Stable Diffusion? Learning and experimentation is the primary use case. If you're new to image generation and want to understand prompting, samplers, and model differences without cloud costs, this hardware lets you run everything locally. At 47 seconds per SD 1.5 image, you can generate 75+ images per hour—enough for learning workflows. The 32GB RAM also supports running ComfyUI with multiple model nodes loaded, which is valuable for understanding pipeline architecture even if generation is slow.

Batch processing is marginally viable for SD 1.5 only. Generating a 20-image grid overnight is feasible—roughly 16 minutes of unattended processing. SDXL batches are impractical; a 10-image batch would take 40+ minutes and risk thermal throttling. Upscaling with ESRGAN or Real-ESRGAN works reasonably well since these models are less memory-bandwidth dependent—4x upscaling a 512×512 image takes about 8 seconds. ControlNet adds 15-20% overhead to generation times but functions correctly with OpenPose, Canny, and Depth models we tested.

  • Learning prompt engineering and sampler differences — viable
  • Generating reference images for art projects — viable with patience
  • LoRA training — not recommended (estimated 10+ hours for basic training)
  • Img2img and inpainting — works but slow iteration cycle
  • Running as an always-on generation server — thermal concerns

Who Should NOT Buy This for Stable Diffusion

If you need to iterate quickly on prompts, the NucBox M5 Pro will frustrate you. Professional artists and designers testing 10-20 variations of a concept will spend more time waiting than creating. The 47-second minimum (and 4+ minute SDXL times) breaks creative flow in ways that discrete GPUs don't. Anyone doing client work, content creation at scale, or building products around image generation needs faster hardware—the time cost exceeds the money saved within the first month of regular use.

Windows users should also look elsewhere. ROCm's iGPU support on Windows remains experimental as of May 2026, and we couldn't achieve stable generation without Linux. If you're not comfortable dual-booting or running Ubuntu as your primary OS, the setup friction negates the cost savings. Similarly, anyone interested in training custom models—LoRAs, textual inversions, or fine-tuning—needs discrete GPU VRAM. Training on the 780M with shared memory is technically possible but would take 5-10x longer than an entry-level RTX card.

Better alternatives for SD-focused buyers: A used RTX 3060 12GB ($180-220) in an existing desktop outperforms the M5 Pro by 10x for Stable Diffusion. If you need a complete mini PC, the Minisforum UM790 Pro with Ryzen 9 7940HS offers the newer RDNA 3 iGPU (780M vs 780M naming, but different architecture) with better ROCm support—though at $450+.

Thermal Performance and Noise

The single-fan cooling system handles SD 1.5 generation adequately but struggles with sustained SDXL workloads. During our SD 1.5 benchmarks, CPU package temperature stabilized at 82-86°C with fan speed around 4,200 RPM—audible but not intrusive at approximately 38dB measured at 30cm. SDXL pushed temperatures to 91-94°C with the fan hitting 5,100 RPM (44dB), which is clearly audible in a quiet room. After three consecutive SDXL generations, we observed thermal throttling reducing CPU clocks from 4.5GHz to 4.2GHz.

Idle power consumption sits at 12-15W, making the M5 Pro efficient as a general-purpose mini PC when not generating images. Peak power during SDXL generation reached 73W at the wall—impressive compared to a desktop with discrete GPU (easily 250-350W). For users in regions with expensive electricity or those running on solar/battery backup, the low power envelope is a genuine advantage. However, the thermal density means the chassis gets warm to the touch (45-50°C surface temperature during SDXL), so placement on heat-sensitive surfaces is not recommended.

Verdict

The GMKtec NucBox M5 Pro proves that sub-$300 hardware can run Stable Diffusion locally—with significant caveats. SD 1.5 at 512×512 in 47 seconds is genuinely usable for learning and occasional generation. SDXL at 4+ minutes per image is a proof-of-concept, not a workflow. The Radeon 780M iGPU with ROCm on Linux works, but the 51 GB/s memory bandwidth creates a hard ceiling that no software optimization can overcome. You're paying for convenience and low entry cost, not performance.

For $299, you get a complete system that runs local AI without cloud dependencies or subscription fees. That has real value for privacy-conscious users, students learning generative AI, and hobbyists who generate images occasionally rather than constantly. If your budget allows $500+, a desktop with a used RTX 3060 or the RTX 5070 delivers 10-100x better image generation performance. But if $299 is the ceiling and you're willing to run Linux, the NucBox M5 Pro is the cheapest hardware that actually works for Stable Diffusion in 2026.

Final Rating: 6.5/10 for Stable Diffusion use. Works for learning and light use, but discrete GPU performance is in a different league. Buy this if budget is paramount and you accept the speed limitations. Skip it if you need to iterate quickly or run SDXL regularly.

Frequently Asked Questions

Q1Can the GMKtec NucBox M5 Pro run Stable Diffusion?

Yes. The Radeon 780M iGPU supports ROCm acceleration on Linux, enabling GPU-accelerated Stable Diffusion generation. SD 1.5 512×512 images generate in approximately 47 seconds with 20 steps. SDXL 1024×1024 takes about 4 minutes 12 seconds. It works but is 18-20x slower than discrete GPUs like the RTX 5070.

Q2Does the GMKtec NucBox M5 Pro work with Stable Diffusion on Windows?

Not reliably as of May 2026. ROCm's Windows support for the Radeon 780M iGPU remains experimental. We tested Windows 11 Pro with AMD's HIP SDK and could not achieve stable generation. Linux (Ubuntu 22.04.3 with ROCm 6.0.2) is required for GPU-accelerated Stable Diffusion on this hardware.

Q3How long does SDXL take on the GMKtec NucBox M5 Pro?

SDXL 1024×1024 with 20 steps takes 4 minutes 12 seconds on the NucBox M5 Pro with ROCm on Linux. Adding the refiner pass (20+10 steps) extends this to 7 minutes 34 seconds. The 51 GB/s memory bandwidth is the bottleneck. For comparison, an RTX 5070 generates the same SDXL image in approximately 2.5 seconds.

Q4What VRAM allocation is needed for Stable Diffusion on the Radeon 780M?

Allocate 8GB minimum for SD 1.5 or 12GB for SDXL in the BIOS VRAM settings. The NucBox M5 Pro defaults to 512MB, which is insufficient. With 32GB total RAM, allocating 12GB to the iGPU leaves 20GB for system use—adequate for running the WebUI and other applications simultaneously.

Q5Is the GMKtec NucBox M5 Pro good for training LoRAs?

Not recommended. LoRA training requires sustained VRAM access and benefits heavily from memory bandwidth. On the 780M with 51 GB/s bandwidth, a basic LoRA training run that takes 1-2 hours on an RTX 3060 would take an estimated 10+ hours. The thermal throttling observed during sustained SDXL generation would further extend training times.

Q6How does the GMKtec M5 Pro compare to Mac Mini M4 for Stable Diffusion?

The Mac Mini M4 Pro is significantly faster due to 273 GB/s unified memory bandwidth versus 51 GB/s on the M5 Pro—a 5x difference. The M4 Pro generates SD 1.5 512×512 in approximately 8-10 seconds compared to 47 seconds on the M5 Pro. However, the Mac Mini M4 Pro costs $1,399+ versus $299 for the M5 Pro.

Q7Can I run ComfyUI on the GMKtec NucBox M5 Pro?

Yes. ComfyUI works with ROCm on the Radeon 780M using the same setup as AUTOMATIC1111's WebUI. The 32GB RAM allows loading multiple models in a ComfyUI workflow, though generation times match the WebUI benchmarks. ComfyUI's node-based interface is actually well-suited to slower hardware since you can build complex pipelines and run them unattended.

Q8What are the thermal limits when running Stable Diffusion on the NucBox M5 Pro?

SD 1.5 generation keeps CPU temperatures at 82-86°C with stable performance. SDXL pushes temperatures to 91-94°C, causing thermal throttling after 3+ consecutive generations—CPU clocks drop from 4.5GHz to 4.2GHz, adding 8-12% to generation times. Using a cooling pad or ensuring good airflow helps maintain consistent performance during longer sessions.

Related Articles