Running a Local AI Homelab: Mini PC, OCuLink, and a 5060 Ti
In my last post, I built an eBPF-based monitor to watch what my AI bot is doing on a server, tracking API calls, shell commands, and file access at the kernel level. That project made one thing clear: I’m sending a lot of prompts to cloud APIs. So I’ve been wanting a local AI setup for a while. Something private, fast, and independent from cloud APIs. Not to replace Claude or Gemini for hard problems, but to have a capable model available locally for everyday tasks, quick questions, and experimentation.
After weeks of waiting for the GPU to arrive, it finally showed up today. Here’s what I ended up building.
The Hardware
The base is a GMKtec mini PC running Proxmox as a hypervisor. It packs an AMD Ryzen 7 H 255 (16 threads) and 128GB of RAM into a tiny form factor. Small, quiet, and power-efficient. Perfect for always-on home infrastructure. But mini PCs don’t have PCIe slots for a GPU.
That’s where OCuLink comes in. It’s essentially an external PCIe connection: think of it as Thunderbolt, but carrying native PCIe straight to the GPU. The mini PC has an OCuLink port, and I’m using a Minisforum DEG1 external GPU dock to connect an NVIDIA RTX 5060 Ti (16GB). The dock exposes a full-size PCIe x16 slot, but the link behind it is PCIe 4.0 x4. It supports standard ATX/SFX power supplies and has a force power-on button, so it runs independently from the mini PC’s power cycle.
The result: a tiny, quiet box with full GPU acceleration.
Why GLM-4.7-Flash
Choosing a model for 16GB of VRAM is a balancing act between intelligence and fit. After testing several options, I landed on GLM-4.7-Flash, a 30B-parameter Mixture of Experts (MoE) model from Z.ai / Tsinghua University.
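The MoE architecture is the key: while the model has 30 billion parameters in total, only ~3 billion are active for each token generated. All 30 billion still have to sit in memory, which is why VRAM is the constraint, but the per-token compute is that of a ~3B model. The result is output quality closer to a large model at speeds closer to a small one.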
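To make the "only some parameters are active" idea concrete, here is a toy sketch of generic top-k expert routing. It is not GLM-4.7-Flash's actual router: the expert count, the `top_k` value, and the random logits are all made up for illustration. Only the 30B / 3B figures come from this post.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(router_logits, top_k):
    """Pick the top_k experts for one token.

    Returns a list of (expert_index, weight) pairs, with the weights
    renormalized so they sum to 1 across the chosen experts.
    """
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    mass = sum(probs[i] for i in chosen)
    return [(i, probs[i] / mass) for i in chosen]


# Hypothetical sizes, for illustration only (not the model's real config).
NUM_EXPERTS = 32
TOP_K = 4

random.seed(0)
logits = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
selected = route_token(logits, TOP_K)

print("experts used for this token:", [i for i, _ in selected])
print(f"fraction of experts touched: {TOP_K / NUM_EXPERTS:.1%}")

# Figures from the post: ~3B of 30B parameters are active per token.
print(f"active share of parameters: {3e9 / 30e9:.0%}")
```

Every expert's weights must still be resident in VRAM, because a different token may be routed to a different set of experts.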
Running it at Q3_K_M quantization via Ollama:
ollama run hf.co/unsloth/GLM-4.7-Flash-GGUF:Q3_K_M --verbose
The Numbers
| Metric | Value |
|---|---|
| Model size (quantized) | ~13.5 GB |
| VRAM used | ~14.8 GB (92% utilization) |
| Generation speed | 84.53 tokens/second |
| Active parameters | ~3B per token |
| Context window | 128K max, ~16-32K practical on GPU |
Cloud models on massive GPU clusters can generate faster in raw throughput. Google’s Flash models hit 200+ t/s. But where local inference wins is end-to-end latency: no network round-trip, no API queue, no waiting for a datacenter on another continent. Time to first token is ~43ms locally versus 1-3 seconds through a cloud API. For interactive use, that difference is what you actually feel.
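A rough way to see where that trade-off flips is to model a response as time-to-first-token plus generation time. The sketch below plugs in the numbers from this post (43 ms and 84.53 t/s locally, 1-3 s and 200 t/s for the cloud). It is a simplification that ignores prompt length and network jitter, and the function names are mine.

```python
def response_time(ttft_s, tokens_per_s, n_tokens):
    """Seconds until the last token arrives: startup delay plus generation."""
    return ttft_s + n_tokens / tokens_per_s


def break_even_tokens(local_ttft, local_tps, cloud_ttft, cloud_tps):
    """Response length at which the cloud's throughput cancels its slower start.

    Returns None if the cloud is not faster per token (local always wins
    or ties), and 0.0 if the cloud also starts at least as quickly.
    """
    if cloud_tps <= local_tps:
        return None
    n = (cloud_ttft - local_ttft) / (1 / local_tps - 1 / cloud_tps)
    return max(0.0, n)


LOCAL_TTFT, LOCAL_TPS = 0.043, 84.53  # measured values from the post
CLOUD_TPS = 200.0                     # the "200+ t/s" figure from the post

for cloud_ttft in (1.0, 3.0):         # the 1-3 s range from the post
    n = break_even_tokens(LOCAL_TTFT, LOCAL_TPS, cloud_ttft, CLOUD_TPS)
    print(f"cloud TTFT {cloud_ttft:.0f}s: local finishes first below ~{n:.0f} tokens")
```

Under these assumptions the local setup finishes first for replies shorter than roughly 140 to 430 tokens, depending on the cloud's startup delay. That covers most quick questions, while long generations favor the faster cloud decoder.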
Q3 Quantization: Is It Too Aggressive?
Running a 30B model at 3-bit quantization sounds aggressive, but the trade-off usually works in your favor. For a fixed VRAM budget, a heavily quantized large model tends to outperform a high-precision small one: a Q3 of 30B parameters generally retains more capability than a Q8 of 8B parameters.
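The back-of-the-envelope arithmetic looks like this. The sketch assumes 1 GB = 10^9 bytes and treats the Q8 model as exactly 8 bits per weight, which slightly understates real GGUF files since they also store per-block scales. The helper names are mine.

```python
def bits_per_weight(file_size_gb, n_params):
    """Average bits stored per parameter for a quantized model file."""
    return file_size_gb * 1e9 * 8 / n_params


def weights_size_gb(n_params, bits):
    """Approximate size of the weights alone at a given bit width."""
    return n_params * bits / 8 / 1e9


# Figures from the post: ~13.5 GB on disk for 30B parameters.
q3_bits = bits_per_weight(13.5, 30e9)
print(f"30B at Q3_K_M: ~{q3_bits:.1f} bits per weight")

# Nominal comparison point: an 8B model at 8 bits per weight.
print(f"8B at 8-bit:   ~{weights_size_gb(8e9, 8.0):.1f} GB")

# Same 30B model at 8 bits would not fit in 16 GB of VRAM.
print(f"30B at 8-bit:  ~{weights_size_gb(30e9, 8.0):.1f} GB")
```

This only shows what fits, not what performs better. The quality claim rests on benchmarks and experience rather than on this arithmetic.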
In practice, I haven’t noticed meaningful quality degradation for the tasks I use it for: writing scripts, answering technical questions, drafting text, and quick code reviews. Where it occasionally stumbles is in very strictly formatted outputs like JSON tool calls, where a missed bracket can break things. But for general use, it’s surprisingly solid.
What It’s Good For
This isn’t a replacement for Claude or Gemini on hard problems. It’s a daily driver for the 80% of tasks that don’t need a frontier model:
- Quick coding questions and boilerplate generation
- Drafting emails and documentation
- Exploring ideas without API costs
- Working with private or sensitive content that shouldn’t leave my network
The privacy aspect is underrated. Everything stays on my hardware, and for anything involving NDA work, credentials, or personal information, that matters.
The Stack
The full setup is straightforward:
- Proxmox as the hypervisor on the mini PC, managing VMs and containers
- GPU passthrough to hand the 5060 Ti to a dedicated VM
- Ollama serving the model with a simple API on localhost:11434
- Open WebUI for a ChatGPT-like browser interface
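For scripting against the model without the web UI, a minimal client for Ollama's `/api/generate` endpoint could look like the sketch below. It assumes Ollama is listening on the default port and that the model tag matches the one pulled earlier. The helper names are mine, and error handling is left out.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL = "hf.co/unsloth/GLM-4.7-Flash-GGUF:Q3_K_M"  # tag used earlier in this post


def build_request(prompt, model=MODEL):
    """Build a non-streaming generate request (POST, JSON body)."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False})
    return urllib.request.Request(
        OLLAMA_URL,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )


def tokens_per_second(reply):
    """Generation speed from Ollama's timing fields (eval_duration is in ns)."""
    return reply["eval_count"] / (reply["eval_duration"] / 1e9)


def ask(prompt):
    """Send the prompt and return the parsed JSON reply."""
    with urllib.request.urlopen(build_request(prompt), timeout=300) as resp:
        return json.loads(resp.read())


# Example usage (requires a running Ollama instance):
# reply = ask("Explain OCuLink in one sentence.")
# print(reply["response"])
# print(f"{tokens_per_second(reply):.1f} tokens/s")
```

Setting `"stream": False` returns one JSON object instead of a stream of chunks, which keeps the example short. An interactive client would normally stream.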
One thing I particularly like: when idle, the GPU fans spin down to 0% (completely off) and power draw drops to just 4W:
| 0 NVIDIA GeForce RTX 5060 Ti Off | 00000000:00:10.0 Off |
| 0% 34C P8 4W / 180W | 2MiB / 16311MiB |
For an always-on homelab, this matters. The mini PC is already silent, and with the GPU essentially sleeping when not in use, the whole setup draws minimal power and produces zero noise. It only ramps up when you actually send a prompt.
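To keep an eye on this over time, the same readings can be pulled in machine-readable form with `nvidia-smi --query-gpu`. The sketch below parses one CSV line and estimates yearly idle energy. The sample line simply restates the idle values shown above, and the function names are mine.

```python
import subprocess

FIELDS = ["fan.speed", "temperature.gpu", "power.draw", "memory.used", "memory.total"]


def _to_number(text):
    """Convert a CSV cell to float; nvidia-smi prints e.g. [N/A] when unsupported."""
    try:
        return float(text)
    except ValueError:
        return None


def parse_gpu_line(line):
    """Map one 'csv,noheader,nounits' output line onto FIELDS."""
    cells = [cell.strip() for cell in line.split(",")]
    return dict(zip(FIELDS, (_to_number(c) for c in cells)))


def read_gpu():
    """Query the first GPU; requires nvidia-smi on the PATH."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={','.join(FIELDS)}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_line(result.stdout.splitlines()[0])


def idle_kwh_per_year(watts):
    """Energy used in a year at a constant draw."""
    return watts * 24 * 365 / 1000


# Sample line restating the idle readings from the post (fan %, C, W, MiB, MiB).
sample = parse_gpu_line("0, 34, 4.00, 2, 16311")
print(sample)
print(f"idle GPU energy: ~{idle_kwh_per_year(sample['power.draw']):.0f} kWh/year")
```

At a constant 4W the GPU alone would use about 35 kWh per year while idle. The mini PC and the dock's power supply add their own draw on top, which is not measured here.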
Beyond AI inference, the Proxmox host also serves as my general homelab, running Home Assistant, various containers, and on-prem environment emulations. Having 128GB of RAM and 16 threads means I can spin up VMs that mimic customer environments or test deployment scenarios locally before touching any cloud infrastructure.
Total cost after the initial hardware investment: $0/month in subscriptions and API fees. Only the electricity bill remains.
Conclusion
You don’t need a server rack to run capable AI locally. A mini PC with an OCuLink GPU gives you a quiet, compact, always-on inference server that fits on a shelf. The combination of MoE models, which punch above their weight in VRAM-constrained setups, and fast consumer GPUs has made local AI genuinely practical. Not just a novelty, but a useful daily tool.
The views and opinions expressed here are my own and do not reflect those of my employer.