ARFBench: AI trails on-call engineers in outage tests

Datadog and Carnegie Mellon University built ARFBench from 63 production incidents; GPT-5 scored 62.7% on its 750 verified outage questions, while on-call engineers scored 72.7%.

Datadog and Carnegie Mellon University released ARFBench, a benchmark built from real production outages to test whether AI can answer the time-series questions engineers use during incidents.

The dataset includes 63 production incidents, 750 multiple-choice questions, 142 monitoring metrics and 5.38 million extracted data points. Questions were drawn from engineers’ Slack threads during live emergencies and were hand-verified. The authors write that no synthetic or textbook scenarios were included.

ARFBench groups questions into three tiers. Tier I asks whether an anomaly exists in a chart. Tier II asks when an anomaly started, how severe it is or what type it is. Tier III requires cross-metric reasoning, such as whether one chart is causing a problem seen in another chart.

Datadog ran evaluations on the ARFBench set. GPT-5 led general-purpose models with 62.7% overall accuracy. Gemini 3 Pro scored 58.1%, Claude Opus 4.6 scored 54.8% and Claude Sonnet 4.5 scored 47.2%. Random guessing yields about 24.5% accuracy.
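For context on the random-guess figure: on a multiple-choice set, the expected accuracy of uniform guessing is the average of 1/k over all questions, where k is the number of answer options for each question. A minimal sketch follows; the option counts below are made up for illustration, since the article does not give ARFBench's actual distribution of options per question.

```python
from fractions import Fraction


def random_guess_accuracy(option_counts):
    """Expected accuracy when guessing uniformly at random.

    option_counts: one entry per question, giving how many answer
    options that question offers.
    """
    if not option_counts:
        raise ValueError("need at least one question")
    total = sum(Fraction(1, k) for k in option_counts)
    return float(total / len(option_counts))


# Hypothetical mix of questions (NOT the real ARFBench distribution):
# three questions with 4 options and one with 5.
example_counts = [4, 4, 4, 5]
baseline = random_guess_accuracy(example_counts)
print(f"random-guess baseline: {baseline:.2%}")
```

A baseline that is not exactly 25% suggests the questions do not all have the same number of options, which is why a per-question average is used here rather than a single 1/k.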

On Tier III, the hardest tasks, GPT-5’s F1 score dropped to 47.5%. The authors flag cross-metric linking as a significant weakness for current models.

Human baselines outperformed the tested models. On-call engineers with observability experience scored 72.7% accuracy. Time-series researchers without deep observability backgrounds scored 69.7%. No standalone AI model outscored either human group.

The public leaderboard shows a hybrid entry at the top. Datadog combined its internal time-series model, Toto, with Qwen3-VL 32B. The resulting entry, Toto-1.0-QA-Experimental, reached 63.9% accuracy, edging past GPT-5 while using fewer parameters.

The authors report different error profiles for humans and models. Models tend to hallucinate, omit or misread metadata and lose domain context. Humans are more likely to misread exact timestamps or fail on complex instructions. The paper states the mistakes rarely overlap.

To estimate the potential of collaboration, the researchers built a theoretical Model-Expert Oracle that picks the correct answer whenever either the AI or the human got it right. That oracle scored 87.2% accuracy and 82.8% F1. The authors present that result as a ceiling for combined performance, not as a product claim.
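The oracle idea can be illustrated with a few lines of Python. This is a sketch under assumptions, not the authors' evaluation code: the function names and the toy answers are hypothetical, and only accuracy is computed (the paper's F1 calculation is not reproduced here). A question counts as correct for the oracle if at least one of the two respondents answered it correctly.

```python
def accuracy(predictions, answers):
    """Fraction of questions answered correctly."""
    if len(predictions) != len(answers) or not answers:
        raise ValueError("inputs must be non-empty and the same length")
    return sum(p == a for p, a in zip(predictions, answers)) / len(answers)


def oracle_accuracy(model_preds, human_preds, answers):
    """Upper bound for a model-human pair: a question is counted as
    correct if either the model or the human answered it correctly."""
    if not (len(model_preds) == len(human_preds) == len(answers)) or not answers:
        raise ValueError("inputs must be non-empty and the same length")
    hits = sum(
        m == a or h == a
        for m, h, a in zip(model_preds, human_preds, answers)
    )
    return hits / len(answers)


# Toy data for illustration only (not from ARFBench).
answers = ["A", "B", "C", "D", "A"]
model = ["A", "B", "D", "D", "B"]  # wrong on questions 3 and 5
human = ["B", "B", "C", "A", "B"]  # wrong on questions 1, 4 and 5

model_acc = accuracy(model, answers)
human_acc = accuracy(human, answers)
oracle_acc = oracle_accuracy(model, human, answers)
print(model_acc, human_acc, oracle_acc)
```

In the toy data the two respondents share only one mistake, so the oracle beats both of them. The less the errors overlap, the larger the gap between the oracle and the better individual score.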

The benchmark and leaderboard are hosted on Hugging Face. The authors note that trillions of dollars are lost each year to system outages, and say they created ARFBench to test whether AI can reliably support incident response.

The paper notes some AI offerings are marketed as autonomous site reliability engineer agents that investigate production incidents in place of humans. On the ARFBench tests, no single AI model matched the performance of the on-call engineers used as a human baseline.
