What If Inference Was Free?
I studied electrical engineering with a focus on information technology, and I’ve kept a soft spot for hardware ever since. ESP32 projects, Raspberry Pis, the occasional deep dive into how silicon actually moves bits around. I have a habit of falling down technical rabbit holes that most people would find deeply boring.
This week, I found a new rabbit hole.
Taalas emerged from stealth with the HC1, a chip with Llama 3.1 8B hardcoded directly into the silicon. Not loaded from memory. Etched into transistors. The result: 17,000 tokens per second at 200 watts.
That number broke something in my mental model of how we build software. So I started asking “what if” questions. This post is where they led me.
Baking the model into hardware
The conventional wisdom is that AI hardware needs to be flexible. Models change. Architectures evolve. You need general-purpose silicon.
Taalas asked a different question: what if you don’t?
Their approach: take a finished model, convert its weights to mask ROM, and etch the entire thing into a chip. No memory bandwidth bottleneck. No weight loading. The model is the chip.
The tradeoffs are brutal but honest:
- One chip, one model
- Can’t update weights (except for small fine-tuning layers in SRAM)
- New model = new tapeout
But they’ve partnered with TSMC to change only two metal layers per model. Turnaround: two months. Cost: roughly 1/100th of training the model in the first place.
We don’t redesign x86 every year. Maybe Llama 3.1 8B becomes the “8086 of language models,” frozen in silicon, deployed for a decade.
The death of templates
Here’s the math that got me:
| Operation | Latency |
|---|---|
| Database query | 5-50ms |
| Redis cache hit | 0.5-2ms |
| HC1 generating 500 tokens | ~30ms |
A typical webpage is 2-5K tokens of HTML. At 17,000 tok/s, that’s 120-300ms. Comparable to a cold page load from a traditional stack.
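The arithmetic here is just token count divided by throughput. A minimal sketch, assuming the 17,000 tok/s headline figure holds as a steady generation rate (the function and constant names are mine, purely for illustration):

```python
# Assumption: a constant 17,000 tokens/second generation rate, ignoring
# prompt processing, network transfer, and browser rendering time.
HC1_TOKENS_PER_SECOND = 17_000


def generation_ms(tokens: int, tokens_per_second: float = HC1_TOKENS_PER_SECOND) -> float:
    """Milliseconds needed to generate `tokens` at a fixed throughput."""
    return tokens / tokens_per_second * 1000


for label, tokens in [("500-token fragment", 500),
                      ("2K-token page", 2_000),
                      ("5K-token page", 5_000)]:
    print(f"{label}: {generation_ms(tokens):.0f} ms")
```

This gives roughly 29, 118, and 294 ms, which I've rounded to ~30ms and 120-300ms above.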
The entire web architecture (CDNs, template engines, static site generators) exists because generating content was expensive. What if it wasn’t?
Why cache when regeneration is instant and every page could be unique? Not “users in segment A see variant 3.” Actually personalized. Your reading level. Your context. Your language. Documentation that references your actual configuration. Forms that show only the fields you need.
The template becomes a prompt. The CMS becomes a context store.
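One way to picture that shift, as a rough sketch rather than a real design: the "CMS" holds facts about the reader, and the "template" is a function that turns those facts into a prompt. Everything here (`ContextStore`, `build_page_prompt`, the example reader) is hypothetical, and the actual model call is left out.

```python
from dataclasses import dataclass, field


@dataclass
class ContextStore:
    """Hypothetical stand-in for a CMS: facts keyed by reader, not rendered pages."""
    facts: dict[str, dict[str, str]] = field(default_factory=dict)

    def for_reader(self, reader_id: str) -> dict[str, str]:
        # Unknown readers simply get no personalization.
        return self.facts.get(reader_id, {})


def build_page_prompt(topic: str, reader: dict[str, str]) -> str:
    """Assemble the prompt that stands in for a template."""
    lines = [f"Write an HTML documentation page about: {topic}"]
    for key, value in sorted(reader.items()):
        lines.append(f"- Reader {key}: {value}")
    return "\n".join(lines)


store = ContextStore({"alice": {"language": "German", "reading level": "beginner"}})
print(build_page_prompt("rotating API keys", store.for_reader("alice")))
```

The point of the split is that the stored data stays small and structured, while the page itself is regenerated on every request.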
Data access becomes the bottleneck
Even if inference is essentially free, the model still needs context. Where does context come from?
- Email: 200ms to fetch
- Calendar: 50ms
- CRM: 100ms
- Documents: varies
Getting data becomes slower than processing it. At 17K tok/s, processing 50K tokens takes 3 seconds. Why carefully curate when you can include everything relevant and let the model sort it out?
# Old thinking
context = carefully_select_relevant_data(query, token_budget=2000)

# New thinking
context = grab_everything_related(query)
if len(context) < window_limit:
    ship_it()
You’re no longer asking “what’s the minimum context needed?” but “what’s the maximum context available?” I explore what this means for data architecture and local-first AI in the next post.
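For a version of the "new thinking" that actually runs, here is a small sketch. The 4-characters-per-token estimate and the 128K-token window are my assumptions, not HC1 specifications, and `build_context` is a made-up helper; swap in a real tokenizer and your model's actual limit.

```python
# Assumptions (not from Taalas): a crude 4-characters-per-token estimate
# and a 128K-token context window.
CHARS_PER_TOKEN = 4
WINDOW_LIMIT_TOKENS = 128_000


def estimate_tokens(text: str) -> int:
    """Very rough token estimate; a real tokenizer will differ."""
    return len(text) // CHARS_PER_TOKEN + 1


def build_context(documents: list[str], window_limit: int = WINDOW_LIMIT_TOKENS) -> list[str]:
    """Include documents in order until the next one would overflow the window."""
    included: list[str] = []
    used = 0
    for doc in documents:
        cost = estimate_tokens(doc)
        if used + cost > window_limit:
            break
        included.append(doc)
        used += cost
    return included
```

Passing the documents in rough order of relevance means the only curation left is deciding what gets dropped when the window is full.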
Not all architectures fit
I’ve been running MiniMax M2.5 on my homelab. It’s a 230B-parameter mixture-of-experts model with 256 experts that activates only ~10B parameters per forward pass. On a GPU, this is clever: you only load the active expert weights from memory, saving bandwidth.
On hardcoded silicon, the advantage disappears. All 230B parameters would be etched into the chip. But only a small fraction of those 256 experts fire per token. That’s ~96% of your die area sitting idle every cycle. The router decides at runtime which experts to activate, and that dynamic routing breaks the pure dataflow pipeline that makes hardcoded inference fast.
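The ~96% figure is a back-of-the-envelope estimate. A quick sketch of where it comes from, assuming die area scales with parameter count (a simplification that ignores the router and any weights shared across experts):

```python
def idle_fraction(total_params_b: float, active_params_b: float) -> float:
    """Fraction of etched parameters that do no work for a given token."""
    return 1 - active_params_b / total_params_b


# Figures from above: ~230B parameters total, ~10B active per forward pass.
print(f"{idle_fraction(230, 10):.1%}")  # prints 95.7%
```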
| Architecture | Fit for Hardcoded Silicon |
|---|---|
| Dense (Llama, Qwen) | Excellent: fixed dataflow, full utilization |
| MoE (Mixtral, MiniMax) | Poor: dynamic routing, wasted area |
| SSM (Mamba) | Excellent: even simpler than transformers |
The “best” model depends on where it runs. Train with MoE for efficiency. Distill to dense for deployment on hardcoded silicon.
Agents that think in real-time
You’ve probably seen demos of Claude or GPT-4 using a computer. Browse the web, click buttons, write code, run it, debug. Impressive, but slow. Each action requires a full inference cycle. A task that takes a human 30 seconds takes an agent 5 minutes.
| | Current speeds | At 17,000 tok/s |
|---|---|---|
| Observation | 50ms | 50ms |
| Inference | 2-5 seconds | ~30ms |
| Execute action | 100ms | 100ms |
| Per action | ~2-5 seconds | ~180ms |
A 20-step task goes from 40-100 seconds to 3.6 seconds. Approaching human speed.
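To make the per-action sums explicit, here is a small sketch using the table's figures. The helper names are mine, and the 30 ms inference time assumes around 500 generated tokens per step.

```python
def step_ms(observe_ms: float, infer_ms: float, act_ms: float) -> float:
    """One observe -> think -> act cycle, in milliseconds."""
    return observe_ms + infer_ms + act_ms


def task_seconds(steps: int, observe_ms: float, infer_ms: float, act_ms: float) -> float:
    """Total wall-clock time for a task of `steps` sequential actions."""
    return steps * step_ms(observe_ms, infer_ms, act_ms) / 1000


# Figures from the table; 30 ms assumes ~500 generated tokens per step.
fast = task_seconds(20, observe_ms=50, infer_ms=30, act_ms=100)
slow_low = task_seconds(20, observe_ms=50, infer_ms=2_000, act_ms=100)
slow_high = task_seconds(20, observe_ms=50, infer_ms=5_000, act_ms=100)
print(f"hardcoded silicon: {fast:.1f} s, today: {slow_low:.0f}-{slow_high:.0f} s")
```

Summed exactly, today's range comes out at 43-103 seconds; the 40-100 above rounds each action to 2-5 seconds.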
But here’s the kicker: SaaS becomes the new bottleneck.
Agent loop at 17K tok/s with SaaS tools:
Think: 30ms → Call Stripe → [waiting 300ms]
Think: 30ms → Call Slack → [waiting 200ms]
Think: 30ms → Call Jira → [waiting 400ms]
Actual: 90ms thinking, 900ms waiting
Inference is no longer the bottleneck. The internet is. The agent that can act without round-tripping to the cloud will outperform the one that can’t. That’s one of the strongest arguments for local-first architectures.
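The same loop as a few lines of Python, to show how lopsided the split is. The API latencies are the illustrative numbers from the trace above, not measurements, and `loop_breakdown` is a made-up helper.

```python
THINK_MS = 30  # assumed inference time per step at 17K tok/s

# Illustrative round-trip times from the trace above, in milliseconds.
calls = [("Stripe", 300), ("Slack", 200), ("Jira", 400)]


def loop_breakdown(calls: list[tuple[str, int]], think_ms: int) -> tuple[int, int, float]:
    """Return (thinking ms, waiting ms, share of wall-clock time spent waiting)."""
    thinking = think_ms * len(calls)
    waiting = sum(latency for _, latency in calls)
    return thinking, waiting, waiting / (thinking + waiting)


thinking, waiting, share = loop_breakdown(calls, THINK_MS)
print(f"{thinking} ms thinking, {waiting} ms waiting ({share:.0%} of the loop)")
```

With these numbers, about 91% of the loop is spent waiting on the network.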
Where this leads
I started thinking about a chip and ended up questioning data architecture, caching, and whether agents will be bottlenecked by Stripe’s API latency.
The interesting questions shift from “can we run the model?” to “what should we do with context?” and “how do we secure it?” I dig into the first in Own Your Context and the second in AI Security at Machine Speed.
This post started as a conversation about whether you could build a Taalas-style chip as a hobbyist. You can’t: they raised $219M and have a TSMC partnership. But the implications led me down this rabbit hole.
The views and opinions expressed here are my own and do not reflect those of my employer.