MODELS

Tiers 1/2/3. Usage signals + hands-on testing. What @soyjavi uses day to day.

05 / 14 · guide · v0.2.81

alpi works with any model that speaks the OpenAI tool-calling protocol via LiteLLM, but not every model is a good agent. Tool-calling fluency, system-prompt adherence, memory-tool triggering, and tolerance for long tool chains vary wildly — and token cost / latency vary just as much. This page is the distilled recommendation so you don't have to learn the hard way.
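"Speaks the OpenAI tool-calling protocol" concretely means: the model accepts a `tools` array of JSON-schema function definitions and replies with structured `tool_calls` whose arguments parse as JSON. A minimal sketch of that contract below — the `read_file` tool, its schema, and the `dispatch` helper are hypothetical illustrations, not alpi's actual tools; in practice the schema would be passed to the model via something like `litellm.completion(model=..., messages=..., tools=TOOLS)`:

```python
import json

# OpenAI-format tool schema: any model alpi drives must emit calls
# that match this shape (function name + JSON-encoded arguments).
TOOLS = [{
    "type": "function",
    "function": {
        "name": "read_file",  # hypothetical tool, for illustration only
        "description": "Read a file from the workspace.",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}]

def dispatch(tool_call: dict) -> str:
    """Execute one tool_call dict as it appears in an OpenAI-style response."""
    name = tool_call["function"]["name"]
    args = json.loads(tool_call["function"]["arguments"])
    if name == "read_file":
        return f"<contents of {args['path']}>"  # stubbed result
    raise ValueError(f"model requested unknown tool: {name}")

# Shape of a compliant model's tool call:
call = {"function": {"name": "read_file", "arguments": '{"path": "AGENT.md"}'}}
print(dispatch(call))  # -> <contents of AGENT.md>
```

"Tool-calling fluency" in the tiers below is exactly this: how reliably a model fills `name` and `arguments` with valid, schema-conforming JSON across a long chain of such calls.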

How this list was built

Signals combined, in order of weight:

- Usage signals: OpenRouter's public app leaderboards for [Hermes Agent][h] and [OpenClaw][oc].
- Hands-on testing: what @soyjavi runs day to day.

[h]: https://openrouter.ai/apps/hermes-agent
[oc]: https://openrouter.ai/apps/openclaw

Last updated: 2026-04-23. Re-check every 2-3 months — rankings shift fast and new releases reset the bar. Prices are list from OpenRouter, per 1M tokens in / out; reasoning tokens, retries, and agent loops can push real spend well above list.

Tier 1 — best quality

Pick when the cost of a wrong tool call or a missed refactor beats the cost of API tokens. These adopt the persona from AGENT.md on turn 1, call memory proactively, respect the tool schema, and hold coherence across long sessions.

| Model | OpenRouter ID | Input ($/1M) | Output ($/1M) | Notes |
| --- | --- | --- | --- | --- |
| MiMo-V2-Pro | `xiaomi/mimo-v2-pro` | $1.00 | $3.00 | Default premium recommendation. #1 on Hermes Agent globally. 1M context, built for agent frameworks, feels close to Opus 4.6 in perceived quality at a fraction of the cost. |
| Claude Opus 4.6 | `anthropic/claude-opus-4.6` | $5.00 | $25.00 | Ceiling when debugging long chains or doing multi-step refactors matters more than the bill. OpenRouter positions it as their strongest coding model. |
| Claude Sonnet 4.6 | `anthropic/claude-sonnet-4.6` | $3.00 | $15.00 | Most sensible daily premium. 1M context, strong coding + computer-use story, clearly cheaper than Opus for frequent sessions. |
| Qwen3.6 Plus | `qwen/qwen3.6-plus` | $0.325 | $1.95 | Challenger that wins on price-for-quality. 1M context, 78.8 on SWE-bench Verified, field reports in agentic coding noticeably better than Qwen 3.5. |
| GLM 5.1 | `z-ai/glm-5.1` | $1.05 | $3.50 | Frontier open-weight for long-horizon engineering. OpenRouter's copy talks about >8h tasks; community rates it especially high on code review. Slight wrapper lag — pin a newer version when available. |

If you can only run one Tier 1 model across every profile, pick MiMo-V2-Pro. If the cost of mistakes dwarfs the API bill, go Opus 4.6. For the best middle ground, Sonnet 4.6.

Tier 2 — cost / service

Best return per token when there are tools, long context, and repeated calls. Not the cheapest available — the cheapest that still respects the tool schema and doesn't derail past turn 20.

| Model | OpenRouter ID | Input ($/1M) | Output ($/1M) | Notes |
| --- | --- | --- | --- | --- |
| Step 3.5 Flash | `stepfun-ai/step-3.5-flash` | $0.10 | $0.30 | Best cheap workhorse. Open-source reasoning, 400K context, heavy real-use adoption. Ideal for heartbeats, status checks, low-stakes turns. |
| MiniMax M2.7 | `minimax/minimax-m2.7` | $0.30 | $1.20 | Favourite mid-tier. Agentic-serious, strong benchmarks on real workflows, community reports excellent tool-parallelism vs M2.5. Occasional slip on tool-parameter filling. |
| MiMo-V2-Flash | `xiaomi/mimo-v2-flash` | $0.09 | $0.29 | Best Tier 1/2 budget pick. 1M context, #1 open-source multimodal on SWE-bench Verified per OpenRouter, comparable to Sonnet 4.5 at ~3.5% of the cost. |
| DeepSeek V3.2 | `deepseek/deepseek-v3.2` | $0.252 | $0.378 | Price floor that's still hard to beat; consistently described as strong on reasoning and agentic tool use. |
| Nemotron 3 Super | `nvidia/nemotron-3-super-120b-a12b:free` | free | free | Value wildcard. 1M context, explicit multi-agent focus. Don't rank it above M2.7 or MiMo on raw quality, but on dollars-per-turn it's unbeatable. Expect rate limits on free tier. |

Reality check on pricing: at list prices, Step 3.5 Flash runs ~30× cheaper on input and ~50× cheaper on output than Sonnet 4.6. MiMo-V2-Flash lands ~33× / ~52× cheaper. DeepSeek V3.2 is ~12× / ~40× cheaper. For a profile that runs a gateway daemon hitting Telegram inbound all day, those multipliers compound into real savings — many of those turns don't need Tier 1 reasoning.
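The multipliers above are plain list-price ratios against Sonnet 4.6; a quick sketch to recompute them (model keys here are shorthand labels, not OpenRouter IDs):

```python
# List prices per 1M tokens (input, output), from the tables above.
PRICES = {
    "sonnet-4.6":     (3.00, 15.00),
    "step-3.5-flash": (0.10, 0.30),
    "mimo-v2-flash":  (0.09, 0.29),
    "deepseek-v3.2":  (0.252, 0.378),
}

def multipliers(cheap: str, ref: str = "sonnet-4.6") -> tuple[float, float]:
    """How many times cheaper `cheap` is than `ref`, on input and output."""
    (ci, co), (ri, ro) = PRICES[cheap], PRICES[ref]
    return ri / ci, ro / co

for m in ("step-3.5-flash", "mimo-v2-flash", "deepseek-v3.2"):
    i, o = multipliers(m)
    print(f"{m}: ~{i:.0f}x cheaper input, ~{o:.0f}x cheaper output")
```

Remember these are list-price ratios only: reasoning tokens, retries, and loop depth (per L15's caveat) shift real spend, but the relative ordering rarely flips.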

Tier 3 — best on Ollama

"Best" here means best inside the Ollama ecosystem specifically. Explicit OpenClaw / coding-agent launches, useful context windows, sensible sizes, and recent updates weighed more than raw benchmark numbers. Each row is marked local-only or local + :cloud variant.

| Model | Mode | Notes |
| --- | --- | --- |
| Qwen3.6 | local — 27B/35B, 17–24GB, 256K ctx | Default local pick. Ollama positions it for agentic coding with thinking preservation, ships direct OpenClaw launch integration, clear step-up from 3.5. Community preference leans 27B over 35B for long-session stability. |
| Gemma4 | local + cloud — 7.2GB–20GB, 128K–256K ctx | Best "sovereign" local family if you want multimodal + native function calling. Official OpenClaw launch documented; some field reports of crashes on very long coding-agent loops. |
| qwen3-coder-next | local + cloud — 52GB q4, 256K ctx | Pure coding specialist. Ollama page covers OpenClaw launch, 800K executable tasks, 3B active-per-token, production-ready tool calling. Use as a dedicated coder rather than a general brain. |
| devstral-small-2 | local + cloud — 15GB q4, 384K ctx | Best repo-tooling option at moderate size. Official Ollama page lists OpenClaw integration and 65.8% on SWE-bench Verified. Built for cross-file editing in large codebases. |
| Kimi K2.6 | cloud on Ollama — 256K ctx | High-end expert within the Ollama ecosystem, not a true local. Ollama page frames it for long-horizon coding, swarms, 24/7 agents. Field reports still call out over-thinking and token burn on hard debugging. |

If your constraint is truly zero cloud, the realistic order is Qwen3.6 > Gemma4 > qwen3-coder-next > devstral-small-2, with Kimi K2.6 excluded (it's :cloud-tagged on Ollama's own page).

alpi supports a primary model with automatic fallbacks, so mixing tiers by role is cheap to wire up.
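The primary-plus-fallbacks pattern can be sketched as follows. This is an illustration of the control flow only, not alpi's actual implementation or config surface; the chain mixes a Tier 1 brain with Tier 2 fallbacks, and `fake_call` stands in for a real provider call (e.g. via LiteLLM):

```python
from typing import Callable

# Hypothetical tier mix: premium brain first, cheaper models as fallbacks.
CHAIN = ["xiaomi/mimo-v2-pro", "minimax/minimax-m2.7", "stepfun-ai/step-3.5-flash"]

def complete_with_fallback(chain: list[str],
                           call: Callable[[str], str]) -> tuple[str, str]:
    """Try each model in order. `call(model_id)` raises on provider errors
    (rate limits, timeouts) and returns the completion text on success."""
    last_err: Exception | None = None
    for model in chain:
        try:
            return model, call(model)
        except Exception as err:  # real code would narrow the exception types
            last_err = err
    raise RuntimeError(f"all models in chain failed: {last_err}")

# Stubbed transport: pretend the premium model is rate-limited right now.
def fake_call(model: str) -> str:
    if model == "xiaomi/mimo-v2-pro":
        raise TimeoutError("429 from provider")
    return f"ok from {model}"

model, text = complete_with_fallback(CHAIN, fake_call)
print(model)  # -> minimax/minimax-m2.7
```

The design point: the caller never sees the 429; the turn silently degrades one tier instead of failing, which is exactly why mixing tiers by role is cheap.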

Free tier reality check

OpenRouter's :free variants exist to funnel users into the ecosystem. For agent use, treat them as test drives: rate limits make them unreliable for anything long-running.

Switching model

Three ways, any of them works:

The choice is per-profile. alpi -p work can run Sonnet 4.6 while alpi -p personal runs MiMo-V2-Flash — no interference.