Own Your Context: The Case for Local-First AI

After exploring what happens when inference gets fast and cheap, I kept coming back to one conclusion: the bottleneck shifts from compute to context. Getting data becomes slower than processing it.

That got me thinking about where data actually lives and why it matters.

The latency argument

Look at the numbers:

Operation	Latency
Gmail API call	150-400ms
Google Drive fetch	100-300ms
Slack search	200-500ms
Notion API	150-350ms
Local NVMe read	0.1-1ms

If you need context from five services and fetch them one after another, you’re spending 1-2 seconds just on round trips, before the model sees a single token.
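To make the arithmetic concrete, here is a small sketch that adds up the ranges from the table above. The numbers are copied from the table; the parallel case is my own assumption (fetching everything concurrently), included only to show that even the best case is bounded by the slowest call.

```python
# Latency ranges in milliseconds, copied from the table above.
REMOTE_MS = {
    "Gmail API call": (150, 400),
    "Google Drive fetch": (100, 300),
    "Slack search": (200, 500),
    "Notion API": (150, 350),
}
LOCAL_NVME_MS = (0.1, 1.0)


def sequential_total(ranges):
    """Sum best-case and worst-case latencies when calls run one after another."""
    lows, highs = zip(*ranges)
    return sum(lows), sum(highs)


def parallel_total(ranges):
    """Assumption: with fully concurrent calls, the slowest call bounds the total."""
    lows, highs = zip(*ranges)
    return max(lows), max(highs)


remote_seq = sequential_total(REMOTE_MS.values())
remote_par = parallel_total(REMOTE_MS.values())
local_seq = sequential_total([LOCAL_NVME_MS] * len(REMOTE_MS))

print(f"remote, sequential: {remote_seq[0]}-{remote_seq[1]} ms")
print(f"remote, parallel:   {remote_par[0]}-{remote_par[1]} ms")
print(f"local, sequential:  {local_seq[0]:.1f}-{local_seq[1]:.1f} ms")
```

The four services listed already add up to 600-1550 ms when fetched sequentially, so a fifth call of similar size lands in the 1-2 second range. The same number of local reads stays in the low single-digit milliseconds.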

What if all your data were already local? Not mirrored. Not cached. Actually yours, on your hardware. Email via IMAP sync. Documents via rclone. Messages exported or bridged. Everything lands on local disk, indexed once, queryable instantly.
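As a rough illustration of "indexed once, queryable instantly", here is a toy inverted index over a directory of text files. It is a sketch, not a recommendation: the directory layout, the `.txt` filter, and the function names `build_index` and `search` are all made up for this example, and a real setup would more likely use a proper full-text or vector index.

```python
import re
import tempfile
from collections import defaultdict
from pathlib import Path


def build_index(root: Path) -> dict[str, set[Path]]:
    """Map each lowercase word to the set of text files that contain it."""
    index: dict[str, set[Path]] = defaultdict(set)
    for path in root.rglob("*.txt"):
        text = path.read_text(encoding="utf-8", errors="ignore")
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(path)
    return index


def search(index: dict[str, set[Path]], query: str) -> set[Path]:
    """Return the files that contain every word of the query."""
    query_words = re.findall(r"\w+", query.lower())
    if not query_words:
        return set()
    hits = [index.get(word, set()) for word in query_words]
    return set.intersection(*hits)


# Demo on a throwaway directory standing in for synced mail and documents.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "mail").mkdir()
    (root / "mail" / "acme.txt").write_text(
        "Acme contract renewal due Friday", encoding="utf-8"
    )
    (root / "notes.txt").write_text("Homelab NVMe upgrade notes", encoding="utf-8")

    index = build_index(root)
    results = sorted(p.name for p in search(index, "acme renewal"))
    print(results)
```

The index is built once when the data is synced. After that, a query is a handful of dictionary lookups and a set intersection, with no network involved.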

Architecture	Context Assembly	Inference	Total
Cloud-native (API calls)	1-2s	500ms	1.5-2.5s
Local-first (NVMe)	5-10ms	500ms	~510ms

Context assembly becomes a rounding error.

Data should live where inference happens

Every network hop is a penalty. The winning architecture minimizes hops.

If inference happens on your phone, hot context belongs on your phone. If inference happens on your homelab, data belongs on your homelab. If inference happens in a datacenter, data belongs in that datacenter.

I run local inference on my mini PC at 85-150 tokens per second. The data I query most (notes, project files, configs) is already on the same machine. That’s not an accident.

I’ve always been too cautious to send sensitive information to SaaS APIs for inference: NDA work, credentials, personal data, anything I wouldn’t paste into a web form. With private inference, that could change. Everything stays on my hardware, and I can throw whatever context I want at the model without thinking twice. Disclaimer: private use only :-)

You’re not always at home

There’s a hole in the local-first dream: you’re not always next to your NVMe array. You’re on a train, at a client site, behind a corporate firewall.

Scenario	Round trip to home server
Same city, good LTE	30-80ms
Roaming, mediocre signal	150-400ms
Airplane WiFi	500-2000ms
Client site with firewall	blocked

The answer is a tiered setup:

Tier 1: On-device. Small model (Qwen 3B, Phi, Gemma). Recent context: last week of emails, current project docs, active conversations. Works offline, works on an airplane, works in dead zones.

Tier 2: Home server. Full model (32B+, or fast inference hardware). Complete context: years of email, all documents, full history. Requires connectivity, but you own it.

Tier 3: Cloud API. When you need frontier capability or when home is unreachable. Pay per token, but always available.
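One possible routing policy for these three tiers, sketched in Python. Everything here is an assumption for illustration: the `Request` fields, the `choose_tier` function, and the 400 ms cutoff are hypothetical, and a real router would weigh cost and privacy as well.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Request:
    needs_full_history: bool  # needs context that is not synced to the device
    needs_frontier: bool      # needs more capability than the home model offers


def choose_tier(
    req: Request,
    home_rtt_ms: Optional[float],
    online: bool = True,
    max_home_rtt_ms: float = 400.0,  # hypothetical cutoff, not a measured value
) -> str:
    """Return "device", "home", or "cloud" for a request.

    home_rtt_ms is the measured round trip to the home server,
    or None when it is unreachable (for example behind a firewall).
    """
    if not online:
        return "device"  # Tier 1 is the only option offline
    if req.needs_frontier:
        return "cloud"
    home_ok = home_rtt_ms is not None and home_rtt_ms <= max_home_rtt_ms
    if req.needs_full_history:
        return "home" if home_ok else "cloud"
    return "device"


history_request = Request(needs_full_history=True, needs_frontier=False)
print(choose_tier(history_request, home_rtt_ms=50))    # good LTE: home server
print(choose_tier(history_request, home_rtt_ms=None))  # firewall: fall back to cloud
```

Note that the cloud fallback only helps with capability, not with context: the cloud tier does not have your full history unless you send it.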

Your phone doesn’t need your 2019 tax documents. It needs this week’s emails, today’s meeting context, and the project you’re actively working on. Intelligent prefetching helps: if your calendar says you have a meeting with Acme Corp, sync recent Acme emails and docs before you need them.
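A naive version of that prefetch logic could look like the sketch below. The `Event` and `Item` types, the stopword list, and the 24-hour horizon are all hypothetical; the matching is plain keyword overlap, which is the simplest thing that illustrates the idea.

```python
import re
from dataclasses import dataclass
from datetime import datetime, timedelta

STOPWORDS = {"meeting", "with", "call", "sync", "the", "and"}  # illustrative only


@dataclass
class Event:
    title: str
    start: datetime


@dataclass
class Item:
    path: str
    text: str


def words(text: str) -> set[str]:
    """Lowercase word set of a string."""
    return set(re.findall(r"\w+", text.lower()))


def prefetch_candidates(events, items, now, horizon=timedelta(hours=24)):
    """Return paths of items sharing a keyword with an event inside the horizon."""
    keywords = set()
    for event in events:
        if now <= event.start <= now + horizon:
            keywords |= words(event.title) - STOPWORDS
    return [item.path for item in items if words(item.text) & keywords]


now = datetime(2025, 1, 6, 9, 0)
events = [
    Event("Meeting with Acme Corp", now + timedelta(hours=2)),
    Event("Dentist appointment", now + timedelta(days=3)),  # outside the horizon
]
items = [
    Item("mail/acme-renewal.eml", "Acme contract renewal"),
    Item("docs/taxes-2019.pdf", "2019 tax return"),
    Item("mail/dentist-invoice.eml", "Dentist invoice"),
]
to_sync = prefetch_candidates(events, items, now)
print(to_sync)
```

Only the Acme email is selected here: the tax document matches nothing, and the dentist event is too far out to trigger a sync yet.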

MCP is the bridge we have today

While I’m dreaming about local data lakes and tiered sync, there’s something shipping right now: Anthropic’s Model Context Protocol.

MCP is a standardized way for LLMs to connect to external data sources and tools. One protocol, many data sources. The key insight: MCP servers can run locally. Your homelab becomes an MCP endpoint.

┌────────────────────────────────────────┐
│            Home Server                 │
│  ┌──────────┐ ┌──────────┐ ┌────────┐ │
│  │MCP: Email│ │MCP: Files│ │MCP:    │ │
│  │(local    │ │(local FS)│ │Notes   │ │
│  │ IMAP)    │ │          │ │        │ │
│  └────┬─────┘ └────┬─────┘ └───┬────┘ │
│       └─────────────┼───────────┘      │
│                     ↓                  │
│          MCP Aggregator                │
└─────────────────────┼──────────────────┘
                      ↓
               Claude / Any Client

MCP Today	The Dream
Pull-based (model requests data)	Push-based (context pre-assembled)
Round trip per tool call	All context local
Structured tool responses	Raw context acceptable

MCP is the right abstraction. I’ve been building custom MCP servers with Python and FastMCP to get familiar with the protocol, and the developer experience is surprisingly smooth. I’m not running a full local email-to-notes pipeline through MCP yet, but the building blocks are there. If you’re interested in this direction, MCP is where I’d start. Build the integrations now, optimize later.
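For a sense of what such a server looks like, here is a minimal sketch of a local notes search exposed as an MCP tool. It assumes the official MCP Python SDK is installed (`pip install mcp`) and that notes live as Markdown files under `~/notes`; the directory, the server name, and the function names are all hypothetical, and this is not the pipeline described above.

```python
from pathlib import Path

try:
    # Official MCP Python SDK; assumed to be installed via `pip install mcp`.
    from mcp.server.fastmcp import FastMCP
except ImportError:
    FastMCP = None

NOTES_DIR = Path.home() / "notes"  # hypothetical location of local notes


def find_notes(root: Path, query: str, limit: int = 5) -> list[str]:
    """Return paths of Markdown files under root whose text contains the query."""
    matches: list[str] = []
    if not root.is_dir():
        return matches
    for path in sorted(root.rglob("*.md")):
        text = path.read_text(encoding="utf-8", errors="ignore")
        if query.lower() in text.lower():
            matches.append(str(path))
            if len(matches) >= limit:
                break
    return matches


def search_notes(query: str, limit: int = 5) -> list[str]:
    """Search local notes for a phrase and return matching file paths."""
    return find_notes(NOTES_DIR, query, limit)


def main() -> None:
    """Expose search_notes as an MCP tool over stdio."""
    if FastMCP is None:
        raise SystemExit("The 'mcp' package is not installed.")
    server = FastMCP("local-notes")
    server.tool()(search_notes)  # register the plain function as a tool
    server.run()                 # stdio transport by default
```

Calling `main()` starts the server. The search itself is a linear scan, which is fine for a few thousand notes; the point is that the tool call never leaves the machine.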

The homelab of 2030 is a context server

Models are becoming a commodity. A cheap API call. A Taalas chip. A local GPU. Pick one.

But your personal context layer (your email, your documents, your history, all indexed and instantly queryable) is not a commodity. That’s yours.

The homelab project that matters might not be “run Llama locally.” It might be “build a personal context layer that makes any model actually useful for your life.” Sync everything. Index everything. Own the context.

I don’t have this fully built yet. But it’s the project I can’t stop thinking about.

Of course, consolidating all your data in one place has serious security implications. That’s the uncomfortable tradeoff I explore next.

This is part 2 of a series. Start with “What If Inference Was Free?” or skip ahead to “AI Security at Machine Speed.”

The views and opinions expressed here are my own and do not reflect those of my employer.