TL;DR: ChatGPT’s agentic search got stuck in a loop for ~30 minutes on a straightforward news-retrieval task. The root cause wasn’t lack of information — it was failure to transition from vague natural-language queries to named-entity disambiguation. The model eventually produced a perfect self-diagnosis of the failure, an analysis it could not act on during the task itself.
The Task
Simple enough: I saw a BBC video of an Israeli missile strike, caught on camera, landing near RT reporter Steve Sweeney in southern Lebanon. I asked ChatGPT whether Fox News covered it, and then which right-leaning outlets did.
What Should Have Happened
- Identify the specific incident (reporter name, outlet, location, date)
- Search target outlets using those named entities
- Report findings
Three steps. Maybe 2–3 minutes with a few well-formed queries.
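The three steps above can be sketched in a few lines. This is a toy illustration, not ChatGPT's actual pipeline: the `TOY_INDEX` corpus, outlet names, and helper functions are all made up, and the entity resolution is hard-coded to the answer the agent should have committed to.

```python
# Toy stand-in for a search index; a real agent would call a search tool.
TOY_INDEX = [
    {"outlet": "foxnews.com", "title": "Missile lands near RT reporter Steve Sweeney in Lebanon"},
    {"outlet": "foxnews.com", "title": "Auto workers strike enters second week"},
    {"outlet": "nypost.com",  "title": "Strike caught on video near reporter in southern Lebanon"},
]

def identify_entities(description):
    # Step 1: commit to a specific incident before querying outlets.
    # Hard-coded here; in practice this comes from an initial broad search.
    return {"reporter": "Steve Sweeney", "outlet": "RT", "location": "Lebanon"}

def search_outlet(outlet, entities):
    # Step 2: site-restricted search keyed on named entities,
    # not vague phrases like "reporter strike caught on video".
    return [d["title"] for d in TOY_INDEX
            if d["outlet"] == outlet and entities["reporter"] in d["title"]]

def report(target_outlets, description):
    # Step 3: report per-outlet findings.
    entities = identify_entities(description)
    return {o: search_outlet(o, entities) for o in target_outlets}

coverage = report(["foxnews.com", "nypost.com"], "missile strike near reporter, Lebanon")
```

Note that the entity-keyed query skips the "Auto workers strike" story entirely — the false positive that vague phrasing invites.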
What Actually Happened
ChatGPT ran for approximately 30 minutes. It issued repeated vague searches using phrases like “reporter strike caught on video” — which collided with labor strike stories and generic war coverage. It never committed to identifying the reporter or outlet first. I provided multiple clarifying details (missile strike, today, near a reporter, caught on video), but the system appended them loosely to prior bad query framing rather than rewriting the search state around them.
At one point the interface appeared to lock up entirely. I had to prompt it to rewrite its own queries, at which point it produced an excellent set of entity-based, site-restricted searches — the exact thing it should have done 25 minutes earlier.
The Interesting Part: Self-Diagnosis
After I pointed out the failure, ChatGPT produced a genuinely solid technical breakdown of what went wrong. The highlights:
- Failed entity resolution: It never locked onto the specific incident identity (reporter name, outlet, location). Without those anchors, searches drift into unrelated results.
- Weak query reformulation: Instead of pivoting from descriptive phrases to named entities, it kept generating semantically similar bad queries.
- No stopping criteria: There was no hard rule like “if no precise entity found after N attempts, switch to incident-identification mode.” Without that, the system just loops.
- Planner-executor disconnect: The model clearly understood the correct approach (its post-hoc query suggestions were excellent) but couldn’t course-correct during execution. It sounded like it understood the task while operationally repeating low-value actions.
- Missing confidence calibration: It should have surfaced its uncertainty early — “I have multiple candidate incidents, I need to anchor on a reporter name” — instead of projecting progress while not actually reducing uncertainty.
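The "no stopping criteria" point is the most mechanical of these, and the easiest to sketch. A minimal version of the hard rule the model described, with `MAX_ATTEMPTS`, `query_fn`, and `found_entity_fn` as hypothetical placeholders for the agent's real tools:

```python
# Hard stopping rule: after MAX_ATTEMPTS searches that fail to surface
# a named entity, switch modes instead of reformulating and looping.
MAX_ATTEMPTS = 3

def run_search(query_fn, found_entity_fn, queries):
    for attempt, query in enumerate(queries, start=1):
        results = query_fn(query)
        entity = found_entity_fn(results)
        if entity:
            return ("entity_found", entity, attempt)
        if attempt >= MAX_ATTEMPTS:
            # Hard pivot: stop searching, enter incident-identification
            # mode (infer the incident, or ask the user to anchor it).
            return ("disambiguation_mode", None, attempt)
    return ("exhausted", None, len(queries))
```

Without the `attempt >= MAX_ATTEMPTS` branch, this loop is exactly the observed behavior: semantically similar queries issued until the query list, or the user's patience, runs out.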
The Comparison
I noted during the conversation that Claude handled the same question without issue. To be fair about what this means: the task wasn’t hard. It required making an early inference about the likely incident, committing to that hypothesis, and searching with specific entities. The difference isn’t raw capability — it’s how aggressively the system collapses ambiguity and pivots its search strategy when initial queries return noise.
This is a search orchestration and query-planning difference, not a fundamental intelligence gap.
Architectural Takeaways
For anyone building or thinking about agentic search systems, the failure modes here are instructive:
- Entity resolution before exhaustive search. The correct order is: identify the event → identify named entities → search target outlets. Reversing this creates a combinatorial mess.
- Hard pivot rules. If N search attempts haven’t converged on a specific entity, the system should stop searching and switch to disambiguation mode — either by inference or by asking the user.
- Query contamination. Once an initial misparse takes hold (e.g., “reporter strike” → labor strike), later queries stay contaminated. Systems need explicit mechanisms to detect and break out of these ruts.
- Metacognitive checkpoints. The system should periodically evaluate whether it’s actually reducing uncertainty or just generating activity. If search result quality isn’t improving across iterations, that’s a signal to change strategy, not repeat it.
- The planner-executor gap is real. The model’s post-hoc analysis was better than its real-time performance. That disconnect — knowing the right approach but not executing it — is an underexplored failure mode in current agentic architectures.
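The contamination and metacognitive-checkpoint points suggest one concrete signal: overlap between successive result sets. If a reformulated query returns essentially the same results as the last one, the system is generating activity, not reducing uncertainty. A sketch under that assumption, with the 0.8 threshold chosen arbitrarily:

```python
def jaccard(a, b):
    # Jaccard similarity between two result sets (0 = disjoint, 1 = identical).
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

def should_pivot(result_history, threshold=0.8):
    # Compare the two most recent result sets; high overlap means the
    # new query added no information and the strategy should change.
    if len(result_history) < 2:
        return False
    return jaccard(result_history[-1], result_history[-2]) >= threshold
```

A check like this is cheap, sits outside the model, and would have flagged the "reporter strike" rut within two iterations rather than thirty minutes.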