Agentic AI systems are spreading quickly. But our ability to properly test and understand them is lagging behind. This gap exists both before and after these systems are deployed.
One reason is simple: it is hard to define what “good behaviour” looks like for an autonomous system. It is even harder to test whether the system behaves that way. As AI agents become more capable, they are improving faster than the methods we use to evaluate them.
Unlike traditional AI systems that perform narrow tasks, AI agents interact with tools, environments, people, and sometimes other agents. Their behaviour unfolds over time and depends on memory, context, planning, and the systems around them. Because of this, evaluation can no longer rely only on static benchmarks. It must capture behaviour that is dynamic, interactive, and adaptive.
Jack-of-all-Trades, Master of Some?
Evaluating AI agents is harder than evaluating traditional AI because they perform a wide range of tasks. They can reason, plan, act, and reflect in order to complete complex goals at machine speed. This autonomy is powerful, but it also increases risk. The more capable an agent becomes, the more behaviours we need to test.
Reliability remains the foundation of evaluation. An agent must perform its task consistently and correctly. But defining “correct” is not always straightforward, as we have found in many of our customers’ business problems at Advai. In many real-world problems there is no single right answer. A research assistant agent or a hospital triage system, for example, could reach good outcomes through many different paths.
Research on AI agents shows that systems often discover unexpected ways to complete tasks. Sometimes this improves performance, but it also makes behaviour harder to predict. Testing every possible path is impossible, especially in complex environments with huge numbers of possible states, the rarest of which may also carry the highest risks.
Ideally, agents could be compared against clear baselines or optimal solutions. But encoding expert knowledge into evaluation systems is expensive and time-consuming. This makes it difficult to scale testing as new use cases appear.
Reliability is only one part of the challenge. Agents must also be tested for robustness. They must handle changing data, unusual conditions, and adversarial inputs. Safety and compliance must also be checked, including whether the agent stays within its intended role [4].
Security adds another layer of difficulty. In addition to traditional software vulnerabilities, agentic AI systems introduce new risks. Every component in the system - APIs, tools, retrieval systems, and user interfaces - creates a new potential attack surface.
Many agents also rely on planning and multi-step reasoning [2]. Success depends not just on individual actions but on the quality of the entire decision path. When something goes wrong, it can be difficult to trace which decision caused the problem, so explainability and interpretability methods become crucial for uncovering these causes. Some agents can also learn and adapt over time. These systems must be evaluated for their ability to handle new situations without losing previous capabilities.
Another complication comes from large language models (LLMs), which power many agents. Unlike traditional software, LLM outputs are probabilistic. The same input may produce different results, or even errors that propagate through decision paths. This variability makes testing and reproducibility more difficult and requires statistical approaches to evaluation.
Observability is another challenge. Evaluators may not always see an agent’s internal reasoning or system state. Even when they can, the architecture of the system matters. Agents may operate in parallel, delegate tasks, or collaborate in groups. Each structure requires different evaluation methods.
Multi-agent systems raise additional questions. Stakeholders want transparency into how decisions were made. Agents must coordinate with each other using shared protocols. And systems that include humans in the loop require evaluation of human-AI interaction.
Memory management is also critical. Many agents rely on short- and long-term memory to maintain context across interactions. Failures in memory can cause subtle errors that only appear later, when mistakes are more costly.
Fairness and bias must also be considered, especially when agents make decisions that affect people. At the same time, practical limits such as cost, computing power, data availability, and scalability also shape how these systems operate.
To complicate matters further, there are many types of AI agents. Some rely on simple rules. Others use planning algorithms, reinforcement learning, large language models, or combinations of these approaches. Each design introduces different failure modes, making a single universal evaluation method insufficient.
Finally, evaluation must match the specific use case. Different domains place different weight on safety, accuracy, speed, or cost. In our work at Advai, we have learned that determining the right balance is often one of the hardest and most important parts of designing an evaluation.
What’s Needed in Evaluation Frameworks?
These challenges mean we need new ways to evaluate AI agents. Recent research on agentic evaluations at Advai highlights several key elements that future evaluation frameworks should include.
1. Hybrid scoring
Automated metrics alone cannot capture the full behaviour of AI agents. Evaluation should combine quantitative metrics, such as task success or constraint violations, with human judgement about reasoning quality, safety, and context.
2. Systematic red-teaming
Adversarial testing should be built into evaluation. Red-teaming exposes agents to malicious inputs and edge cases in order to reveal vulnerabilities before deployment.
3. Trajectory-based analysis
Instead of evaluating only final outputs, evaluators should analyse the full sequence of decisions an agent makes. This helps identify where errors start, how they spread through the system, and ensure that agents cannot “cheat” by finding loopholes to complete a task.
4. Causal analysis
Understanding why an agent behaved in a certain way is essential. Controlled experiments, similar to randomised controlled trials (RCTs), can help identify which inputs, system components, or environmental factors caused a decision.
5. Multi-objective evaluation
Agent performance cannot be reduced to a single metric. Evaluations must consider trade-offs between accuracy, safety, fairness, efficiency, and cost.
6. Multi-agent evaluation
As more systems rely on teams of agents, evaluation must also measure coordination, communication, and collective behaviour.
7. Dynamic benchmarks
Static test sets quickly become outdated. Evaluation environments should evolve over time to continue challenging increasingly capable agents [1].
8. Continual learning assessment
Agents that adapt and update themselves must be evaluated over time. This ensures that improvements do not introduce new risks or degrade existing capabilities.
9. Meta-evaluation
Evaluation methods themselves must be tested. Metrics and benchmarks should be checked to ensure they actually measure the behaviours they claim to assess [3].
As AI agents become more autonomous and are used in critical systems, strong evaluation methods will become essential. Without them, the capabilities of these systems may advance faster than our ability to ensure they are safe, reliable, and aligned with real-world needs.
References
[1] Eval awareness in Claude Opus 4.6’s BrowseComp performance \ Anthropic
[2] [2503.16416] Survey on Evaluation of LLM-based Agents
[3] [2507.02825] Establishing Best Practices for Building Rigorous Agentic Benchmarks
[4] Agentic AI security: Risks & governance for enterprises | McKinsey
