NJannasch.Dev

MiniMax-M2.5 on 128GB RAM + OCuLink GPU: Can a Mini PC Run a 230B Model?

Tags: AI, Homelab, llama.cpp, Benchmarking

After getting GLM-4.7-Flash running at 85 t/s locally, I got curious. MiniMax just released M2.5, a 230B-parameter model reported to match Claude Opus on coding benchmarks. Could I run it on my mini PC?

Spoiler: Yes, but you probably shouldn’t.

The Setup

MiniMax-M2.5 at Q3_K_XL quantization weighs 101GB. My 16GB RTX 5060 Ti can’t hold that, but with 128GB of system RAM, I could split it between CPU and GPU.

I compiled llama.cpp from source for full control:

cmake -B build -DGGML_CUDA=ON -DGGML_NATIVE=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j8

The GGML_NATIVE=ON flag enables AVX2 and other SIMD instructions specific to my Ryzen.

The Results

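| Configuration                      | Speed    |
|------------------------------------|----------|
| MiniMax-M2.5, GPU offload (-ngl 9) | 1.5 t/s  |
| MiniMax-M2.5, CPU only             | 1.3 t/s  |
| GLM-4.7-Flash (for comparison)     | 84.5 t/s |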

At 1.5 t/s, the 230B model runs about 56x slower than the 30B GLM-4.7-Flash at 84.5 t/s.
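To put that ratio in wall-clock terms, here is a small sketch using the measured speeds from the table. The 500-token reply length is a hypothetical figure chosen for illustration, not something I measured.

```python
def slowdown(fast_tps: float, slow_tps: float) -> float:
    """How many times slower the slow configuration is."""
    return fast_tps / slow_tps


def seconds_for(tokens: int, tokens_per_s: float) -> float:
    """Time to generate a reply of the given length at a steady rate."""
    return tokens / tokens_per_s


GLM_TPS = 84.5       # GLM-4.7-Flash, from the table
MINIMAX_TPS = 1.5    # MiniMax-M2.5 with -ngl 9, from the table
REPLY_TOKENS = 500   # hypothetical reply length (assumption)

print(f"slowdown: {slowdown(GLM_TPS, MINIMAX_TPS):.1f}x")
print(f"GLM-4.7-Flash: {seconds_for(REPLY_TOKENS, GLM_TPS):.1f} s")
print(f"MiniMax-M2.5:  {seconds_for(REPLY_TOKENS, MINIMAX_TPS) / 60:.1f} min")
```

Under that assumption, a reply that takes a few seconds on the small model takes several minutes on the large one.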

Optimizing Proxmox

Before blaming hardware, I optimized the VM config:

cpu: host          # Was: x86-64-v2-AES (hides AVX2 from guest)
machine: q35       # Better PCIe passthrough
numa: 1            # NUMA-aware memory
balloon: 0         # Don't reclaim memory mid-inference
hugepages: 2       # 2MB hugepages

One gotcha: q35 renamed my network interface from ens18 to enp6s18.
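A quick way to confirm that the CPU type change actually exposes AVX2 inside the guest is to read the flags line from /proc/cpuinfo. This is a minimal sketch that assumes an x86 Linux guest; the helper name `has_cpu_flag` is my own, not part of any tool.

```python
def has_cpu_flag(cpuinfo_text: str, flag: str) -> bool:
    """Return True if `flag` appears in the first 'flags' line of cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            # Line looks like: "flags\t\t: fpu vme ... avx avx2 ..."
            _, _, values = line.partition(":")
            return flag in values.split()
    return False


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            text = f.read()
    except OSError:
        text = ""  # not on Linux, or /proc is unavailable
    for flag in ("avx2", "fma", "f16c"):
        status = "yes" if has_cpu_flag(text, flag) else "no"
        print(f"{flag}: {status}")
```

If `avx2` reports "no" inside the VM but "yes" on the host, the guest CPU type is still masking it.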

Finding the Bottleneck

I benchmarked memory to rule it out. Single-threaded tests showed ~18 GB/s, but multi-threaded STREAM reached 64 GB/s, roughly 70% of the theoretical peak for DDR5-5600. The RAM is fine.
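The 70% figure can be reproduced with a back-of-the-envelope calculation. The sketch below assumes a dual-channel configuration with a 64-bit (8-byte) bus per channel, which is typical for this class of machine but is an assumption here.

```python
def ddr_peak_gbs(mega_transfers_per_s: float, channels: int = 2,
                 bytes_per_transfer: int = 8) -> float:
    """Theoretical peak memory bandwidth in GB/s (1 GB = 1e9 bytes)."""
    return mega_transfers_per_s * 1e6 * bytes_per_transfer * channels / 1e9


MEASURED_GBS = 64.0              # multi-threaded STREAM result from above
peak = ddr_peak_gbs(5600)        # dual-channel assumption

print(f"theoretical peak: {peak:.1f} GB/s")
print(f"efficiency: {MEASURED_GBS / peak:.0%}")
```

With those assumptions the peak works out to 89.6 GB/s, so 64 GB/s is about 71% of it.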

The real culprit: OCuLink runs at PCIe 4.0 x4, just ~8 GB/s. MoE models constantly shuffle expert weights between CPU RAM and GPU VRAM. That narrow pipe is the ceiling.
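The ~8 GB/s figure follows from the PCIe 4.0 signaling rate: 16 GT/s per lane with 128b/130b encoding, times four lanes. The sketch below computes that per-direction ceiling and compares it with the measured RAM bandwidth. The "time to move the whole model once" line is purely illustrative, since it ignores protocol overhead and does not model what llama.cpp actually transfers per token.

```python
def pcie_gbs(gt_per_s: float, lanes: int, encoding: float = 128 / 130) -> float:
    """Per-direction PCIe bandwidth in GB/s before protocol overhead."""
    return gt_per_s * encoding / 8 * lanes


link = pcie_gbs(16.0, lanes=4)   # PCIe 4.0 x4, as used by OCuLink here
RAM_GBS = 64.0                   # measured STREAM bandwidth
MODEL_GB = 101.0                 # model size at Q3_K_XL

print(f"OCuLink ceiling: {link:.2f} GB/s")
print(f"RAM is {RAM_GBS / link:.1f}x faster than the link")
print(f"moving {MODEL_GB:.0f} GB once over the link: {MODEL_GB / link:.1f} s")
```

For comparison, calling `pcie_gbs(16.0, lanes=16)` gives the ceiling of a full x16 slot, four times the OCuLink figure.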

This experiment taught me more about Proxmox tuning than expected. cpu: host and hugepages are worth enabling for any compute-heavy VM.

But the main lesson: bottlenecks aren’t always where you expect. It’s not the CPU, not the RAM, not the GPU. It’s the interconnect between them.

For models this large, you need unified memory (Mac Studio, Strix Halo), or you can just use the API at $0.15/M tokens. For daily local use, 30B MoE models fit entirely in VRAM and run over 50x faster.

Total cost: A few hours of benchmarking and one very patient GPU.

The views and opinions expressed here are my own and do not reflect those of my employer.