Five AI models disagreed on 67% of 1,000 fact checks

Five top AI models disagreed on 672 of 1,000 user-submitted fact-check claims; unanimous agreement occurred on 328 and Krippendorff’s alpha measured 0.639.

Lenz Research published a study this month, led by Kosta Jordanov, that tested five advanced AI models on 1,000 real-world fact-check claims. The analysis found disagreement on 672 claims and unanimous agreement on 328.

The models evaluated were GPT-5.4, Claude Opus 4.7, Gemini 3 Pro, Gemini 3 Pro with Search, and Sonar Pro. Each model assigned one of four labels to every claim: true, mostly true, misleading, or false.

Researchers used claims submitted by users to Lenz’s fact-checking platform rather than standard benchmark datasets. The paper notes most of those claims likely do not appear in training corpora with a gold label attached.

On 672 claims, at least one model disagreed with the majority label. On 341 of the claims the split was wider: at least one model labeled the claim true while another labeled the same claim false.
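The three headline counts (unanimous, any disagreement, and a direct true-versus-false split) can be derived from a simple table of per-claim labels. The following is a minimal sketch under stated assumptions: the function name `summarize_panel` and the toy data are hypothetical, and this is not the study's own analysis code.

```python
def summarize_panel(claims):
    """Count agreement patterns across a panel of models.

    `claims` is a list with one entry per claim; each entry is the list
    of labels the models assigned to that claim (hypothetical layout).
    """
    summary = {"unanimous": 0, "disagreement": 0, "true_false_split": 0}
    for labels in claims:
        distinct = set(labels)
        if len(distinct) == 1:
            # Every model chose the same label.
            summary["unanimous"] += 1
        else:
            # At least one model departed from the others.
            summary["disagreement"] += 1
        if "true" in distinct and "false" in distinct:
            # Opposite ends of the rubric appear on the same claim.
            summary["true_false_split"] += 1
    return summary


# Toy data for illustration only; these are not claims from the study.
toy_claims = [
    ["true", "true", "true", "true", "true"],
    ["mostly true", "true", "false", "misleading", "false"],
    ["mostly true", "mostly true", "misleading", "mostly true", "mostly true"],
]
toy_summary = summarize_panel(toy_claims)
print(toy_summary)
```

In this sketch every true-versus-false split is also counted as a disagreement, which matches how the article reports 341 as a subset of the 672 contested claims.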

The study reports Krippendorff’s alpha at 0.639. Krippendorff’s alpha is a chance-corrected measure of agreement: 1.0 indicates perfect agreement and 0 indicates agreement no better than random chance, while negative values indicate systematic disagreement. The paper describes values below 0.8 as generally considered weak and calls 0.639 “nontrivial but limited agreement.”
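For readers who want to see how such a statistic is computed, the sketch below implements the nominal form of Krippendorff’s alpha, which compares observed disagreement with the disagreement expected by chance (alpha = 1 - D_o / D_e). This is an assumption-laden illustration: the article does not say whether the study used the nominal, ordinal, or another variant, the function name `krippendorff_alpha_nominal` is hypothetical, and the toy ratings are not from the paper.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Nominal Krippendorff's alpha for a list of rated units.

    `units` holds one list of labels per claim (one label per rater).
    Units with fewer than two labels carry no pairing information and
    are skipped.
    """
    # Coincidence matrix: each ordered pair of labels within a unit
    # contributes 1 / (m - 1), where m is the number of labels in it.
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1 / (m - 1)

    # Marginal totals per label, and the grand total n.
    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())

    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        # Only one label was ever used, so alpha is undefined.
        raise ValueError("alpha is undefined when all ratings are identical")
    return 1 - observed / expected


# Toy ratings for illustration only: two raters, four claims.
toy_units = [["a", "a"], ["b", "b"], ["a", "b"], ["b", "a"]]
toy_alpha = krippendorff_alpha_nominal(toy_units)
print(round(toy_alpha, 3))
```

The nominal form treats every mismatch as equally severe. An ordinal variant would penalize a true-versus-false split more heavily than a true-versus-mostly-true split, so the choice of variant can change the resulting value.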

When all five models agreed (328 claims), the unanimous labels tended to be the clearest outcomes: either true or false. The study found only four claims with a unanimous “misleading” label and none with a unanimous “mostly true” label. The paper states, “The panel converges on definitive verdicts; the middle of the rubric is where it fractures.”

The authors provide specific examples of divergent model judgments. For the claim “The World Bank’s active portfolio in Nigeria stands at over $16.4 billion as of 2025,” GPT-5.4 labeled it “mostly true,” Gemini 3 Pro labeled it “false,” and Gemini 3 Pro with Search labeled it “misleading.” For the claim “Donald Trump said that an attack on Iran was postponed at the request of Gulf Allies,” GPT-5.4 labeled it “false,” Claude Opus 4.7 labeled it “mostly true,” Gemini 3 Pro labeled it “false,” and Gemini 3 Pro with Search labeled it “true.”

The paper frames the experiment as a test of whether a panel of frontier models can act as a single, interchangeable fact-checking jury for claims that real users submit. The authors provide their data and analysis for further review.

