I just finished a project that talks to Anthropic, OpenAI, and Google’s APIs simultaneously — a debate platform where AI agents powered by different providers argue with each other in real time. The codebase touches all three SDKs (@anthropic-ai/sdk, openai, @google/genai) and each provider has completely different patterns for things like streaming, structured output, and tool use.
I used AI coding tools heavily throughout (Cursor + Codex for different parts), and the experience taught me a lot about where these tools shine and where they’ll confidently lead you off a cliff.
Where AI coding tools were reliable:
- Boilerplate and scaffolding. Express routes, React components, TypeScript interfaces, database schemas — all fast and accurate.
- Pattern replication. Once I had one LLM provider integration working, the tools could replicate the pattern for the next provider with minimal correction.
- Type definitions. Writing shared types between frontend and backend was nearly flawless.
Where they hallucinated or broke things:
- Model identifiers. This was the worst one. The tools would confidently use model IDs that don’t exist, like gemini-3-flash instead of gemini-3-flash-preview, or suggest web_search_preview as a tool type on models that don’t support it. These caused silent failures where the agent simply dropped out of the debate with no error. Every single model ID had to be verified manually against the provider’s actual documentation.
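One cheap guard against this class of bug is to fail fast at startup on any model ID that hasn’t been hand-checked. A minimal TypeScript sketch, assuming a hardcoded allowlist (the IDs and the `assertKnownModel` helper below are illustrative, not an authoritative list):

```typescript
// Allowlist of model IDs verified by hand against each provider's docs.
// The entries below are illustrative placeholders -- maintain your own.
const VERIFIED_MODELS: Record<string, Set<string>> = {
  anthropic: new Set(["claude-sonnet-4-5"]),
  openai: new Set(["gpt-4o"]),
  google: new Set(["gemini-2.0-flash"]),
};

// Throw loudly at startup instead of silently dropping an agent mid-debate.
function assertKnownModel(provider: string, modelId: string): void {
  const known = VERIFIED_MODELS[provider];
  if (!known || !known.has(modelId)) {
    throw new Error(
      `Unverified model "${modelId}" for provider "${provider}". ` +
        `Add it to VERIFIED_MODELS only after checking the provider docs.`
    );
  }
}
```

Running this once when the debate config loads turns a silent drop-out into an immediate, named failure.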
- API pattern mixing. OpenAI has two different APIs — Chat Completions for GPT-4o and the Responses API for newer models like GPT-5. The coding tools would constantly use the wrong one, or mix parameters from both in the same call. Anthropic’s streaming format is different from OpenAI’s, which is different from Google’s. The tools would apply patterns from one provider to another.
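One way to contain that mixing (a sketch, not the post’s actual architecture; all names are mine) is to confine each SDK behind a tiny adapter with one shared return shape, so provider-specific parameter names never reach the shared debate loop:

```typescript
// Shared result shape; provider-specific fields never leave the adapter.
// All names here are illustrative, not from the actual codebase.
interface DebateTurn {
  text: string;
  stopReason: "complete" | "length" | "error";
}

interface ProviderAdapter {
  name: string;
  generate(prompt: string, maxTokens: number): Promise<DebateTurn>;
}

// Stub adapter showing the pattern: the real version for each provider
// would hold that provider's SDK call (Chat Completions vs Responses vs
// Messages) internally, so the two OpenAI APIs can never be mixed at a
// call site.
function makeStubAdapter(name: string): ProviderAdapter {
  return {
    name,
    async generate(prompt: string, maxTokens: number): Promise<DebateTurn> {
      const text = `[${name}] ${prompt}`;
      return {
        text: text.slice(0, maxTokens),
        stopReason: text.length > maxTokens ? "length" : "complete",
      };
    },
  };
}
```

The point of the stub is the boundary, not the body: each provider’s quirks get exactly one home, which also gives the AI tools a single correct pattern to replicate.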
- Token limits and structured output. I had a bug where the consensus evaluator was truncating its JSON output because the max_tokens was set too low. The coding tools set a “reasonable” default that was fine for text but way too small for a structured JSON response with five scoring dimensions. This caused a silent fallback to a hardcoded score that took me days to track down.
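A defensive pattern for that failure mode is to treat a length-limited stop as an error instead of parsing whatever partial JSON came back. A sketch, where `"length"` stands in for whichever truncation signal your SDK actually reports (each provider names it differently, so check the docs):

```typescript
// Parse an evaluator's structured JSON reply, refusing silent fallbacks.
// "length" is a placeholder for the SDK's own truncation stop reason.
function parseScores(raw: string, stopReason: string): Record<string, number> {
  if (stopReason === "length") {
    throw new Error("Structured output truncated: raise max_tokens.");
  }
  try {
    return JSON.parse(raw) as Record<string, number>;
  } catch {
    throw new Error(`Evaluator returned invalid JSON: ${raw.slice(0, 80)}`);
  }
}
```

With this in place, an undersized max_tokens surfaces as a thrown error on the first run instead of a hardcoded score days later.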
- Streaming and concurrency. SSE implementation, race conditions between concurrent LLM calls, and memory management across debate rounds — these all needed manual work. The tools would suggest solutions that looked correct but failed under real concurrent load.
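For the concurrency piece, one pattern worth sketching (timeout value illustrative) is a per-call timeout plus Promise.allSettled, so a hung or failed provider degrades to an explicit per-agent failure instead of stalling or rejecting the whole round:

```typescript
// Race a provider call against a timeout so one hung SDK can't stall
// the round. The timeout value is illustrative.
async function withTimeout<T>(p: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout>;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms);
  });
  try {
    return await Promise.race([p, timeout]);
  } finally {
    clearTimeout(timer!); // always clean up the pending timer
  }
}

// Run one debate round's LLM calls concurrently. allSettled means one
// agent failing or timing out doesn't reject the entire round.
async function runRound(
  calls: Array<() => Promise<string>>,
  ms = 30_000
): Promise<PromiseSettledResult<string>[]> {
  return Promise.allSettled(calls.map((call) => withTimeout(call(), ms)));
}
```

Inspecting each settled result per agent also makes the "agent silently drops out" failure visible: a rejection carries the reason instead of vanishing.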
My takeaway: AI coding tools are genuinely 3-5x multipliers for a solo developer, but the multiplier only holds if you verify every external integration point manually. The tools are great at code structure and terrible at API specifics. If your project talks to external services, budget time for verification that the AI won’t do for you.
Curious if others have found good strategies for keeping AI coding tools accurate when working across multiple external APIs.