Most model comparisons test chatbot performance: benchmarks, vibes, writing quality in a conversation window. Agent workloads stress different capabilities, and the results surprised me.
I tested sonnet, gpt4o, and gemini as the backend for the same openclaw setup, running identical tasks through each.
Instruction following: gave each model a chained task with four steps and a conditional branch. Sonnet completed all four steps in sequence every time. Gpt4o dropped the last step about 30% of the time. Gemini completed every step but occasionally fabricated input data it didn't actually have.
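The chained-task check can be sketched like this. This is a minimal harness, not the actual test setup: `run_step` is a stand-in for whatever the agent framework does to invoke the backend, and the step names are hypothetical.

```python
# Minimal sketch of a four-step chain with a conditional branch, recording
# which steps actually ran so the transcript can be scored afterward.
# run_step is stubbed here; a real run would call the model backend.

def run_chain(run_step, payload):
    """Run the chain and return the ordered list of completed steps."""
    completed = []

    fetched = run_step("fetch", payload)       # step 1: retrieve input
    completed.append("fetch")

    parsed = run_step("parse", fetched)        # step 2: structure it
    completed.append("parse")

    # conditional branch: only summarize when the parse flagged something
    if parsed.get("flagged"):
        run_step("summarize", parsed)          # step 3 (conditional)
        completed.append("summarize")

    run_step("report", parsed)                 # step 4: the one gpt4o dropped
    completed.append("report")
    return completed


# Stub backend for illustration; swap in each model to compare.
def stub_step(name, payload):
    return {"flagged": True} if name == "parse" else payload

steps = run_chain(stub_step, {"task": "demo"})
```

Scoring is then just comparing `steps` against the expected sequence; a dropped final step shows up as a missing `"report"`.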
Hallucination risk: this matters way more for agents than chatbots. If gemini hallucinates in a chat window you see wrong text and move on. If it hallucinates in an agent context it drafts emails referencing meetings that didn’t happen or cites data that doesn’t exist, and then acts on it. Sonnet’s tendency to say “I don’t have that information” instead of fabricating something is an actual safety property when the model has execution authority.
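One cheap mitigation when the model does have execution authority is to validate drafts against ground truth before acting. A sketch, assuming emails reference meetings by an ID like `MTG-123` and that you can pull the set of real meeting IDs from the calendar (both are assumptions for illustration):

```python
import re

def references_only_known_meetings(draft: str, known_meetings: set) -> bool:
    """Return False if the draft cites a meeting ID not in the calendar.

    A gate like this doesn't stop hallucination, but it stops the agent
    from *acting* on a fabricated meeting reference.
    """
    mentioned = set(re.findall(r"MTG-\d+", draft))
    return mentioned <= known_meetings  # subset check: nothing unknown cited

known = {"MTG-101", "MTG-102"}
ok_draft = "Following up on MTG-101, here are the notes."
bad_draft = "As discussed in MTG-999, the budget is approved."
```

The same pattern generalizes: any fact the draft asserts that can be checked against a source of truth should be, before the agent sends anything.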
Voice matching: after about two weeks of conversation history sonnet matched my writing style closely enough that colleagues couldn’t distinguish agent-drafted emails from mine. Gpt4o was decent but had a consistent “AI-ish” formality it couldn’t shake. Gemini was the weakest here.
Cost: sonnet is expensive at volume. The fix is model routing: haiku for retrieval tasks (email checks, lookups, scheduling), sonnet only when the task requires reasoning or writing quality. Cut my monthly API spend from ~$35 to ~$20.
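The routing itself can be a few lines. The task categories and model labels below are assumptions for the sketch, not a real API surface:

```python
# Route cheap retrieval-style tasks to the small model, everything else
# to the stronger one. Task-type strings and model names are illustrative.

RETRIEVAL_TASKS = {"email_check", "lookup", "scheduling"}

def pick_model(task_type: str) -> str:
    """Return the model to use for a given task category."""
    if task_type in RETRIEVAL_TASKS:
        return "haiku"      # cheap, good enough for structured retrieval
    return "sonnet"         # reasoning and writing quality

model_for_lookup = pick_model("lookup")
model_for_draft = pick_model("draft_reply")
```

The hard part isn't the routing code, it's classifying tasks honestly; if everything gets labeled "reasoning," the savings disappear.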
If you’re already using claude and haven’t tried it as an agent backend, the difference from the chat interface is significant.