I am interested in the body of research that addresses what I believe is the fundamental and ultimately fatal limitation of transformer-based AI models. The issue is often described as “hallucination,” but I think that term understates the problem. The deeper limitation is that these models are inherently probabilistic. They do not reason from first principles in the way the industry suggests; rather, they operate as highly sophisticated guessing machines.
What AI companies consistently emphasize is what currently works. They point to benchmarks, demonstrate incremental gains, and highlight systems approaching 80%, 90%, or even near-100% accuracy on selected evaluations. But these results are often achieved on narrow slices of reality: shallow problems, constrained domains, trivial question sets, or tasks whose answers are already well represented in training data. Whether the questions are simple or highly advanced is not the main issue. The key issue is that they are usually limited in depth, complexity, or novelty. Under those conditions, it is unsurprising that accuracy can approach perfection.
A model will perform well when it is effectively doing retrieval, pattern matching, or high-confidence interpolation over familiar territory. It can answer straightforward factual questions, perform obvious lookups, or complete tasks that are close enough to its training distribution. In those cases, 100% accuracy is possible, or at least the appearance of it. But the real problem emerges when one moves away from this shallow surface and scales the task along a different axis: the axis of depth and complexity.
We often hear about scaling laws in terms of model size, compute, and performance improvement. My concern is that there is another scaling law that receives far less attention: as the depth and complexity of a task increase, accuracy may decline. In other words, the more uncertainty a task contains due to novelty, interdependence, hidden constraints, and layered complexity, the more these systems regress toward guesswork. My hypothesis is that there are mathematical bounds here, and that performance under genuine complexity trends toward something much closer to chance: effectively a random guess, which for a binary decision means 50%.
This issue becomes especially clear in domains where the answer is not explicitly present in the training data, not because the domain is obscure, but because the problem is genuinely novel in its complexity. Consider engineering or software development in proprietary environments: deeply layered architectures, large interconnected systems, millions of lines of code, and countless hidden dependencies accumulated over time. In such settings, the model cannot simply retrieve a known answer. It must actually converge on a correct solution across many interacting layers. This is where these systems appear to hit a wall.
What often happens instead is non-convergence. The model fixes shallow problems, introduces new ones, then attempts to repair those new failures, generating an endless loop of partial corrections and fresh defects. This is what people often call “AI slop.” In essence, slop is the visible form of accumulated guessing. The model can appear productive at first, but as depth increases, unresolved uncertainty compounds and manifests as instability, inconsistency, and degradation.
That is why I am skeptical of the broader claims being made by the AI industry. These tools are useful in some applications, but their usefulness becomes far less impressive when one accounts for the cost of training and inference, especially relative to the ambitious problems they are supposed to solve. The promise is not merely better autocomplete or faster search. The promise is job replacement, autonomous agents, and expert-level production work. That is where I believe the claims break down.
In practice, most of the impressive demonstrations remain surface-level: mock-ups, MVPs, prototypes, or narrowly scoped implementations. The systems can often produce something that looks convincing in a demo, but that is very different from delivering enterprise-grade, production-ready work that is maintainable, reliable, and capable of converging toward correctness under real constraints. For software engineering in particular, this matters enormously. Generating code is not the same as producing robust systems. Code review, long-term maintainability, architectural coherence, and complete bug elimination remain the true tests, and that is precisely where these models appear fundamentally inadequate.
My argument is that this is not a temporary engineering problem but a structural one. There may be a hard scaling limitation on the dimension of depth and complexity, even if progress continues on narrow benchmarked tasks. What companies showcase is the shallow slice, because that is where the systems appear strongest. What they do not emphasize is how quickly those gains may collapse when tasks become more novel, more interconnected, and more demanding.
The dynamic resembles repeated compounding of small inaccuracies. A model that is 80–90% correct on any individual step may still fail catastrophically across a long enough chain of dependent steps, because each gap in accuracy compounds over time. The result is similar to repeatedly regenerating an image until it gradually degrades into visual nonsense: the errors accumulate, structure breaks down, and the output drifts into slop. That, in my view, is not incidental. It is a consequence of the mathematical nature of these systems.
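The compounding described above can be made concrete with a small calculation. The sketch below rests on a simplifying assumption of mine, not something established in the argument itself: that every step succeeds with the same probability and that step errors are independent, so the chance of an entire chain being correct is the per-step accuracy raised to the number of steps. The function names are hypothetical and chosen only for illustration.

```python
import math


def chain_success_probability(per_step_accuracy: float, steps: int) -> float:
    """Probability that every step in a chain is correct.

    Assumption (illustrative only): each step succeeds independently with
    the same probability, so the chain succeeds with probability p ** n.
    """
    if not 0.0 <= per_step_accuracy <= 1.0:
        raise ValueError("per_step_accuracy must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step_accuracy ** steps


def max_steps_for_target(per_step_accuracy: float, target: float) -> int:
    """Longest chain whose overall success probability is still >= target.

    Solves p ** n >= target for the largest whole n, under the same
    independence assumption as above.
    """
    if not 0.0 < per_step_accuracy < 1.0:
        raise ValueError("per_step_accuracy must be strictly between 0 and 1")
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    return math.floor(math.log(target) / math.log(per_step_accuracy))


if __name__ == "__main__":
    for accuracy in (0.8, 0.9):
        for steps in (10, 20):
            prob = chain_success_probability(accuracy, steps)
            print(f"accuracy={accuracy:.0%} steps={steps:>2} -> chain success {prob:.1%}")
    print("steps before a 90% model drops below 50%:", max_steps_for_target(0.9, 0.5))
```

Under these assumptions, a model that is 90% accurate per step completes a 10-step chain correctly about 35% of the time and a 20-step chain about 12% of the time, and it falls below even odds after 6 steps. Real tasks will not match this exactly, since errors can be correlated or caught and corrected along the way, but it shows why high per-step accuracy does not by itself guarantee reliable long chains.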
For that reason, I believe the current AI narrative is deeply misleading. While these models may evolve into useful tools for search, retrieval, summarization, and limited assistance, I do not believe they will ever be sufficient for true senior-level or expert-level autonomous work in complex domains. The appearance of progress is real, but it is confined to a narrow layer of task space. Beyond that layer, the limitations become dominant.
My view, therefore, is that the AI industry is being valued and marketed on a false premise. It presents benchmark saturation and polished demos as evidence of general capability, when in reality those results may be masking a deeper mathematical ceiling. Many people will reject that conclusion today. I believe that within the next five years, it will become increasingly difficult to ignore.