From Vibe Coding to AI Agent: My Local Qwen 3.6 Now Runs 24/7
TL;DR: Same Qwen 3.6 MoE I benchmarked at 144 t/s with MTP now runs a 24/7 autonomous agent. Hermes Agent monitors RSS feeds, curates news digests, builds persistent memory, and reports via Telegram. with Signal as a private, E2E-encrypted channel on a separate profile. The model I started by benchmarking became the model I vibe code with, and now the model that acts without me watching. Total API cost: zero.
Three months ago I was running benchmarks to see how fast Qwen 3.6 could go on a 16 GB GPU. Then I started using it to write code. Now the same model runs an autonomous agent on a daily cron. This post is about how each step made the next one obvious.
The Path: Benchmark → Vibe Code → Agent
It started with a straightforward question: can a 35B MoE run at full context on 16 GB? It could. 262K context, 98 t/s. Then MTP speculative decoding landed and pushed it to 144 t/s burst, 125 t/s sustained with TurboQuant at 98K context. Fast enough that waiting for the model stopped being a thing.
That speed changed what local models are useful for. I was already handing entire tasks to Claude Code. vibe coding an eBPF monitor in Go without being a Go developer, reviving a dead 3D printer through SSH. But those run on cloud APIs. Pointing OpenCode at the same local llama-server and getting the same kind of iteration loop at 125 t/s with zero latency and zero cost. that was new.
But vibe coding still requires me at the keyboard. The next step was letting the model work without me.
The Architecture
A small Proxmox VM (4 vCPU, 8 GB RAM, no GPU) running Hermes Agent by Nous Research. It connects over the network to the same Qwen 3.6 MoE + MTP llama-server that I use for vibe coding and benchmarking. The model runs on my RTX 5060 Ti at 125 t/s with TurboQuant (98K context) or 144 t/s with standard KV cache (65K context). Connected to Telegram and Signal as messaging gateways and to various data sources via MCP. Total API cost: zero.
Why speed matters for agents
A chat interaction is one question, one answer. An agent runs multi-step reasoning loops: fetch data → compare to memory → detect changes → compose a report → decide if action is needed. At 30 t/s (the dense 27B variant), a daily monitoring run takes uncomfortably long. At 125 t/s with the MTP-enabled MoE, the same run finishes before I notice it started. The speed difference between “toy” and “useful” for an always-on agent is real.
| Claude Code | Hermes Agent | |
|---|---|---|
| Strength | Deep coding, multi-file refactors | Persistent automation, learning over time |
| Model | Claude Opus/Sonnet (best-in-class) | My local Qwen 3.6 MTP (free, private) |
| Speed | Cloud-dependent | 125 t/s on my hardware |
| Session | Starts and ends | Runs 24/7 |
| Memory | Static CLAUDE.md | Active memory + skills that improve |
| Cost | API credits | Electricity |
The server behind it
The same llama-server config I use for everything else:
llama-server \
-m Qwen3.6-35B-A3B-MTP-UD-IQ3_S.gguf \
-ngl 99 -c 98304 \
-ctk turbo3 -ctv turbo3 \
--spec-type draft-mtp --spec-draft-n-max 2 \
-np 1 --host 0.0.0.0 --port 11433
One server, multiple clients. Hermes Agent, OpenCode, and direct API calls all hit the same endpoint. When the agent isn’t running, the GPU is available for coding. When it is, inference finishes fast enough that contention doesn’t matter.
Connecting to Data Sources via MCP
Hermes connects to external services through MCP endpoints. One config block per data source, and the agent gets tools it can call during normal reasoning. RSS feeds, APIs, monitoring endpoints. whatever exposes an MCP interface becomes part of the agent’s toolkit.
The Daily Briefing
Here is what I told the bot on Telegram:
Set up a daily cron at 8:00 AM: Check RSS for new posts. Pull data via MCP. Compare to your memory of previous days. Detect trends and changes. Update memory with today’s findings. Send me a short briefing.
That’s it. One Telegram message. Hermes created the cron job, and now every morning I get a structured briefing on my phone. All processed by a model running on hardware I own.
Where the Memory Changes Things
The first daily report is basic. Raw data, no context. But Hermes writes what it learns to a persistent memory file that gets injected into every future session.
After a week, the memory accumulates patterns:
RSS monitor: 2 new posts this week, both published Tuesday morning
News digest: 3 VC rounds > $50M, 2 acquisitions in AI infra
Server health: no anomalies, uptime 100%
Now the daily report is not just data. It is “two funding rounds in your watchlist sector this week, both in AI infrastructure. The last acquisition in this space was 3 weeks ago. activity is accelerating.”
After enough runs, Hermes automatically creates a skill: a reusable procedure it can load in future sessions. The skill encodes which tools to call, what format I prefer, what patterns matter. It gets refined each time it runs.
The compounding effect
A one-shot query gives you a snapshot. A daily cron with persistent memory gives you a narrative.
Week 1: “Here are the headlines.” Week 4: “Funding rounds in AI infra doubled compared to last month.” Week 8: “The three companies you flagged as interesting all announced follow-on rounds. The pattern matches what happened in the observability space last year.”
Nobody programmed those insights. The agent discovered them by comparing today’s data against accumulated memory.
Two Gateways, Two Purposes
Hermes supports multiple messaging platforms simultaneously. I use two:
Telegram handles structured, recurring output. I created a supergroup with forum topics enabled. each topic gets its own conversation context:
| Topic | Purpose |
|---|---|
| Monitoring | Daily RSS + data source briefing (cron, 8 AM + 8 PM) |
| News | Startup & VC digest from CTech (cron, 9 AM) |
| General | Ad-hoc requests |
Cron jobs deliver to their designated topic. Monitoring reports don’t clutter the news feed. Each topic maintains a separate session, so asking a follow-up question picks up where the last briefing left off.
Signal handles everything private. It runs on a separate Hermes profile with its own isolated memory. nothing crosses over from the Telegram sessions. The bot connects via signal-cli in HTTP daemon mode. End-to-end encrypted by default, no metadata leakage.
The distinction is intentional: Telegram for organized delivery where topic separation and rich formatting matter. Signal for anything I wouldn’t want on a server I don’t control.
Both gateways talk to the same local Qwen 3.6 at 125 t/s. The only difference is which door you walk through. Hermes also supports WhatsApp, Discord, and Slack. useful if you want an agent reachable from airplane Wi-Fi where only messaging apps work. That’s a setup for another post.
Setup Reference
The commands that matter if you want to replicate this.
Hermes + local model
# Point Hermes at your llama-server in ~/.hermes/config.yaml
model:
default: Qwen3.6-35B-A3B-MTP-UD-IQ3_S.gguf
provider: custom
base_url: http://<your-server-ip>:11433/v1
api_key: not-needed
context_length: 98304
Telegram with topics
# Enable Topics in your Telegram group:
# Group Settings → Permissions → Toggle "Topics"
# Give the bot "Manage Topics" permission
# Get a topic's thread_id from Telegram Web URL:
# https://t.me/c/1234567890/5 → thread_id = 5
# Route a cron job to a specific topic (in jobs.json):
"origin": {
"chat_id": "-100<group_id>",
"thread_id": "5"
}
Signal on a separate profile
# Create an isolated profile
hermes profile create personal
# Install signal-cli (needs Java 17+)
sudo apt install openjdk-25-jre-headless
curl -L -O https://github.com/AsamK/signal-cli/releases/latest/...
sudo mv signal-cli-<version> /opt/signal-cli
sudo ln -sf /opt/signal-cli/bin/signal-cli /usr/local/bin/
# Register a dedicated number
# Solve the captcha at https://signalcaptchas.org/registration/generate
# Right-click "Open Signal" → copy link. it contains the token
signal-cli -u +49XXXXXXXXX register --captcha 'signalcaptcha://signal-hcaptcha...'
signal-cli -u +49XXXXXXXXX verify <SMS_CODE>
# Start the daemon (or use systemd)
signal-cli -u +49XXXXXXXXX daemon --http 127.0.0.1:8080
# Add to ~/.hermes/.env
SIGNAL_HTTP_URL=http://127.0.0.1:8080
SIGNAL_ACCOUNT=+49XXXXXXXXX
SIGNAL_ALLOWED_USERS=+49YOUR_NUMBER,<your-signal-uuid>
# Start the personal profile gateway
hermes --profile personal gateway run
The Signal allowlist needs both your phone number and your UUID. you’ll see it in the gateway log on first denied message.
Security: Sandboxing an Autonomous Agent
An agent with terminal access running 24/7 unattended needs more than a framework-level tool allowlist. I wrote a separate deep-dive on sandboxing AI agents covering OS-level profile isolation, kernel sandboxing with nono/Landlock, network allowlists, audit trails, and the honest gaps that remain with in-process tools. The short version: least privilege beats prompt injection detection every time.
What I’d Do Differently
Don’t over-prompt the cron job. My first version was a detailed multi-step instruction. Hermes works better with a clear goal and freedom to figure out the steps. The skill system optimizes the procedure over time anyway.
Use the MoE, not the dense model. I initially tried the 27B dense variant, but the MoE runs 3x faster with much more context. The quality difference is negligible for daily briefings and agentic workflows where speed compounds.
The Bigger Picture
I started by measuring how fast a local model could go. Then I used that speed to build things faster. Now the model does work while I sleep. Each step only made sense because the previous one worked.
The pieces are not new individually. Cron jobs are decades old. MCP is a protocol. Qwen is an open model. Telegram bots are trivial. What is new is an agent framework that ties them together with persistent memory and self-improving skills, running entirely on hardware you control, at a speed that makes it practical.
For now, it sends me a Telegram briefing every morning and waits on Signal for anything private. And every morning, it knows a little more than the day before.
The views and opinions expressed here are my own and do not reflect those of my employer.