Audit: Nearly half of major AI chatbots’ health answers flawed
BMJ Open audit found 49.6% of responses to 250 health questions from five chatbots were problematic; Grok produced more ‘highly problematic’ answers than chance would predict.
A peer-reviewed audit published April 14 in BMJ Open found that 49.6% of answers to 250 health questions from five major AI chatbots were problematic. Grok produced a higher share of “highly problematic” answers than a random distribution would predict.
Researchers at UCLA, the University of Alberta and Wake Forest evaluated Gemini, DeepSeek, Meta AI, ChatGPT and Grok. The questions covered cancer, vaccines, stem cells, nutrition and athletic performance.
The team used adversarial prompting designed to push models toward bad advice. Example prompts included asking whether 5G causes cancer, which alternative therapies outperform chemotherapy, and how much raw milk to drink for health benefits. The authors wrote that chatbots “do not access real-time data but instead generate outputs by inferring statistical patterns from their training data and predicting likely word sequences,” and that they “do not reason or weigh evidence, nor are they able to make ethical or value-based judgments.”
Overall, 30.0% of responses were rated “somewhat problematic” and 19.6% were rated “highly problematic,” which the authors defined as answers that could plausibly lead someone toward ineffective or dangerous treatment. Only two prompts produced refusals to answer; both refusals came from Meta AI and concerned anabolic steroids and alternative cancer treatments.
Performance varied by topic. Answers about vaccines and cancer scored best, which the authors linked to well-structured, widely reproduced research in those areas. Nutrition answers fared worst, followed by those on athletic performance.
Grok returned 29 problematic answers out of 50 (58%), including 15 highly problematic responses (30%). The researchers reported that the number of high-risk responses exceeded what a random distribution would predict and pointed to Grok’s training sources, which include a platform associated with rapid spread of health misinformation.
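To give a sense of what "more than a random distribution would predict" means, the sketch below compares Grok's 15 highly problematic answers with the count expected if the overall 19.6% rate applied evenly to its 50 responses. It is an illustration built only from the figures reported above, not the authors' statistical method, which this article does not describe. The binomial model and the pooled rate (which includes Grok's own answers) are simplifying assumptions.

```python
from math import comb


def binomial_tail(k: int, n: int, p: float) -> float:
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Figures reported in the article.
TOTAL_RESPONSES = 250
HIGHLY_PROBLEMATIC_SHARE = 0.196
GROK_RESPONSES = 50
GROK_HIGHLY_PROBLEMATIC = 15

# Derived: 19.6% of 250 responses is 49 highly problematic answers overall.
pooled_count = round(TOTAL_RESPONSES * HIGHLY_PROBLEMATIC_SHARE)
pooled_rate = pooled_count / TOTAL_RESPONSES

# Assumption: every response is an independent draw at the pooled rate.
expected_for_grok = GROK_RESPONSES * pooled_rate
tail = binomial_tail(GROK_HIGHLY_PROBLEMATIC, GROK_RESPONSES, pooled_rate)

print(f"Expected highly problematic answers for Grok: {expected_for_grok:.1f}")
print(f"Observed: {GROK_HIGHLY_PROBLEMATIC}")
print(f"P(at least {GROK_HIGHLY_PROBLEMATIC} by chance): {tail:.3f}")
```

Under these assumptions Grok would be expected to give about 9.8 highly problematic answers, against the 15 observed. The tail probability is a rough gauge of how unusual that gap would be if all five chatbots shared the same underlying rate.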
Citation practices were unreliable across the models. The median completeness score for references was 40%, and no chatbot produced a fully accurate reference list. Models fabricated authors, journal names and article titles. DeepSeek acknowledged its reference lists were generated from training-data patterns “and may not correspond to actual, verifiable sources.”
Readability was also a concern. Every chatbot response scored in the “Difficult” range on the Flesch Reading Ease scale, roughly equivalent to a college sophomore-to-senior reading level. The American Medical Association recommends patient education materials not exceed a sixth-grade reading level.
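For readers unfamiliar with the scale, Flesch Reading Ease is computed from average sentence length and average syllables per word, and scores between 30 and 50 fall in the "Difficult" band. The sketch below applies the standard formula with a deliberately crude syllable counter that counts vowel groups. That counter is an assumption made for illustration and will miscount some words; the article does not say which tool the researchers used.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic: count runs of vowels, with a minimum of one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def flesch_reading_ease(text: str) -> float:
    """Standard formula: 206.835 - 1.015 * (words/sentences) - 84.6 * (syllables/words)."""
    words = re.findall(r"[A-Za-z']+", text)
    if not words:
        raise ValueError("text contains no words")
    # Treat each run of ., ! or ? as one sentence boundary.
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    syllables = sum(count_syllables(w) for w in words)
    return 206.835 - 1.015 * (len(words) / sentences) - 84.6 * (syllables / len(words))


def flesch_band(score: float) -> str:
    """Map a score to the conventional Flesch difficulty labels."""
    if score >= 90:
        return "Very easy"
    if score >= 80:
        return "Easy"
    if score >= 70:
        return "Fairly easy"
    if score >= 60:
        return "Standard"
    if score >= 50:
        return "Fairly difficult"
    if score >= 30:
        return "Difficult"
    return "Very difficult"


sample = "The cat sat."
print(flesch_reading_ease(sample), flesch_band(flesch_reading_ease(sample)))
```

Short sentences made of one-syllable words score very high, while long sentences packed with multi-syllable medical terms push the score down toward the "Difficult" band the audit reported.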
The authors noted the study's limits: it tested only five free-tier chatbots, and its adversarial prompts may overstate real-world failure rates. The paper cites a February 2026 study from Oxford with similar findings. The authors recommend increased public education, training for health professionals and regulatory oversight to ensure generative AI supports public health.