Why AI Isn't a Bubble - It's a Testing Crisis

AI, as a tool for general automated cognition, is a massively transformative technology. We should want to roll it out across the economy and government as quickly and safely as possible. McKinsey projects that generative AI could deliver $2.6-4.4 trillion annually to the global economy [1]. I have no idea if that number is right, but the direction is clear: the benefits of general automated cognition are enormous, and we should want them.

The problem is that we are not getting them yet. In 2025, MIT researchers reported that 95% of businesses that had tried using AI saw zero measurable value from their pilots [2]. We can do better than that.

Nobody expects all pilots to succeed. But even improving that 5% pass rate to 10% would be transformative. Given the potential value of AI deployments, even this small, achievable improvement means capturing an additional 5% of a very large number indeed. I believe the key to closing that gap is quality automated testing at scale.
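As a rough back-of-the-envelope sketch, the arithmetic looks like this. It assumes, purely for illustration, that captured value scales linearly with the share of pilots that succeed, and it uses the McKinsey range quoted above as the "very large number." Neither assumption is a forecast.

```python
# Illustrative arithmetic only: assumes value scales linearly with the
# share of pilots that succeed, which is a deliberate simplification.
LOW_TRILLIONS, HIGH_TRILLIONS = 2.6, 4.4  # annual range cited in [1]


def added_value_trillions(total, old_rate, new_rate):
    """Extra value captured when the pilot success rate rises."""
    return total * (new_rate - old_rate)


low = added_value_trillions(LOW_TRILLIONS, 0.05, 0.10)
high = added_value_trillions(HIGH_TRILLIONS, 0.05, 0.10)
print(f"${low * 1000:.0f}-{high * 1000:.0f} billion per year")
```

Under those assumptions, doubling the pass rate is worth something on the order of $130-220 billion a year.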

Why pilots fail

This reflects our experience at Advai, working with many organisations that are keen to deploy AI but struggle to do so with confidence. McKinsey's State of AI report confirms this is not an isolated pattern: over half of organisations surveyed reported negative consequences from AI deployments, with inaccuracy cited as the most common problem, and nearly two-thirds have yet to begin scaling AI across the enterprise [3].

Current pre-deployment evaluation fails in two critical ways:

First, it is arbitrary. AI benchmarks have been described by Stanford researchers as "the Wild West," characterised by inconsistent metrics and a near-total absence of sector-specific relevance [4]. A benchmark that tells you a model scores well on general reasoning says nothing about whether it will perform reliably in a specific regulatory jurisdiction or operational context.

Second, it is unscalable. AI systems are general-purpose and improving fast. Manual testing simply cannot keep pace with AI's generality or its velocity of change. MIT's FutureTech database found that 90% of AI problems emerge post-deployment [5]. This is not surprising. In production, you accumulate real-world data at scale, so you can suddenly test the things that matter. Under the current paradigm, accessing this knowledge before deployment would be prohibitively expensive. The issue is not that organisations do not test - it is that pre-deployment testing cannot cover enough ground.

These two problems combine to create "pilot purgatory." Organisations can build AI that works in controlled settings but cannot generate sufficient evidence that it will work in the real world.

The uncomfortable answer

If manual testing cannot scale, the testing itself must be automated - and that means, at least in part, powered by AI.

This sounds like asking a student to mark their own homework. But consider what we already accept. We use automated software tests to validate software before deployment. No one objects that a unit test written in Python is invalid because the software it tests also runs on Python. We also trust unverifiable humans to write evaluations, with no way of knowing how or why they did what they did. The principle that matters is not technological simplicity or difference - it is an adversarial relationship between tester and tested.

The most well-known technical objection is that AI-generated data suffers from progressive diversity loss compared to human-generated data [6]. If this were unavoidable, it would indeed be a problem - tests that lack diversity are tests that miss edge cases. Fortunately, subsequent research has shown that it is perfectly possible to control the variation in data produced by AI, avoiding this problem entirely. Retaining real human-generated data alongside synthetic data prevents degradation [7]. Grounding outputs in external knowledge stores through retrieval-augmented generation significantly reduces hallucination [8]. And utilising diverse ensembles of AI models produces higher-quality data than any single model alone [9].
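To make the first and third of those mitigations concrete, here is a minimal sketch of how a test suite might pool human-written cases with cases from several generators and check that diversity has not collapsed. The function names (`distinct_ngram_ratio`, `build_test_pool`), the bigram metric, the 20% threshold, and the example cases are all illustrative assumptions, not taken from the cited papers.

```python
from itertools import chain


def distinct_ngram_ratio(texts, n=2):
    """Share of n-grams that are unique across texts (1.0 means no repeats)."""
    ngrams = []
    for text in texts:
        tokens = text.lower().split()
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0


def build_test_pool(human_cases, synthetic_by_model, min_human_fraction=0.2):
    """Pool human cases with de-duplicated synthetic cases from several models.

    Raises ValueError if human-written cases fall below the chosen share.
    """
    # dict.fromkeys drops duplicates while keeping first-seen order.
    synthetic = list(dict.fromkeys(chain.from_iterable(synthetic_by_model.values())))
    pool = list(human_cases) + [c for c in synthetic if c not in human_cases]
    if not pool or len(human_cases) / len(pool) < min_human_fraction:
        raise ValueError("too few human-written cases in the pool")
    return pool


# Hypothetical test prompts for a customer-service assistant.
human = ["refund request for a cancelled flight"]
synthetic = {
    "model_a": ["refund request for a delayed train", "invoice dispute over late fees"],
    "model_b": ["refund request for a delayed train", "complaint about a missing parcel"],
}
pool = build_test_pool(human, synthetic)
print(len(pool), round(distinct_ngram_ratio(pool), 2))
```

In practice the metric would likely be something richer than bigram overlap, such as an embedding-based distance. The point is only that diversity can be measured and enforced rather than assumed.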

What is at stake

Compliance costs under frameworks like the EU AI Act disproportionately affect smaller organisations, creating barriers to market entry [10]. When testing requires armies of human evaluators and bespoke evaluation frameworks, only large organisations can afford it. When testing can be substantially automated, the barrier drops - and more organisations can deploy AI safely.

Quality automated testing at scale is the key - not to making every pilot succeed, but to making enough of them succeed that AI's potential stops being theoretical and starts being real.

References

[1] MIT Technology Review Insights (2023) "Finding value in generative AI for financial services", MIT Technology Review. https://www.technologyreview.com/2023/11/26/1083841/finding-value-in-generative-ai-for-financial-services/

[2] Heaven, W.D. (2025) "The great AI hype correction of 2025", MIT Technology Review. https://www.technologyreview.com/2025/12/15/1129174/the-great-ai-hype-correction-of-2025/

[3] McKinsey & Company (2025) "The state of AI in 2025: Agents, innovation, and transformation". https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai

[4] Mulligan, S.J. (2024) "The way we measure progress in AI is terrible", MIT Technology Review. https://www.technologyreview.com/2024/11/26/1107346/the-way-we-measure-progress-in-ai-is-terrible/

[5] Mulligan, S.J. (2024) "A new public database lists all the ways AI could go wrong", MIT Technology Review. https://www.technologyreview.com/2024/08/14/1096455/new-database-lists-ways-ai-go-wrong/

[6] Shumailov, I. et al. (2024) "AI models collapse when trained on recursively generated data", Nature, 631, pp. 755-759. https://www.nature.com/articles/s41586-024-07566-y

[7] Kazdan, J. et al. (2025) "Collapse or Thrive? Perils and Promises of Synthetic Data in a Self-Generating World", Proceedings of ICML 2025, PMLR 267:29469-29494. https://proceedings.mlr.press/v267/kazdan25a.html

[8] Li, Y. et al. (2025) "Towards Agentic RAG with Deep Reasoning: A Survey of RAG-Reasoning Systems in LLMs", Findings of EMNLP 2025, pp. 12120-12145. https://aclanthology.org/2025.findings-emnlp.648.pdf

[9] Wang, J. et al. (2025) "Mixture-of-Agents Enhances Large Language Model Capabilities", Proceedings of ICLR 2025 (Spotlight). https://arxiv.org/abs/2406.04692

[10] Cors, P. (2025) "Artificial intelligence and the impact of the EU AI Act in business organizations", AI Magazine. https://onlinelibrary.wiley.com/doi/10.1002/aaai.70039