Researchers urge caution as diagnostic AI outpaces regulation
The latest generations of medical AI are capable of “seemingly human‑like clinical reasoning”
Artificial intelligence systems are rapidly improving at clinical reasoning tasks, but researchers say the technology is moving far faster than the safety frameworks needed to govern its use in real‑world healthcare.
A new commentary in Science, authored by Flinders University experts, warns that while advanced AI models are now matching – and in some cases exceeding – physician‑level diagnostic performance in controlled studies, this does not mean they are ready for deployment in hospitals or clinics.
The commentary accompanies a major international study showing that OpenAI’s first reasoning model, o1‑preview, can work through diagnostic scenarios step by step and achieve accuracy levels comparable to experienced clinicians.
“Health care decisions are complex, high stakes, and deeply human, and accuracy alone, particularly on just text‑based cases, does not make a system safe for patients,” co‑author and PhD candidate Erik Cornelisse said.
The latest generation of large language models represents a shift from simple question‑answering tools to algorithms capable of “seemingly human‑like clinical reasoning,” Mr Cornelisse explained, but he warned that this reasoning remains limited to narrow, artificial conditions.
In one experiment, o1‑preview reached exact or near‑exact diagnoses in 88.6 per cent of published clinicopathological conference cases, outperforming GPT‑4 and surpassing physicians on several measures. In real emergency department cases, it achieved 67.1 per cent diagnostic accuracy at triage, higher than that achieved by two attending physicians.
But Flinders researchers say these results must be interpreted with caution.
Physical examination, visual and auditory cues, patient histories, social context and accountability for outcomes remain central to safe care, and are areas where AI cannot yet operate independently. Even as newer models such as GPT‑5.3 and Gemini 3.1 Pro incorporate multimodal inputs like images, audio and video, the authors reiterated that rigorous evaluation must keep pace.
Senior author Associate Professor Ash Hopkins said the promise of AI is real, but so are the risks.
“AI systems have demonstrated that they can reason through clinical problems with similar performance to doctors, notably on the same scenarios used to train clinicians themselves. This presents genuine opportunities to support clinicians in the future,” he said.
“But there is a critical need for deliberate and controlled integration into clinical care.”
The commentary highlights well‑documented dangers associated with poorly evaluated systems, including bias, inequitable care and unintended patient harm.
Past examples include algorithms that worsened racial disparities in health expenditure and consumer‑facing AI tools that under‑triaged emergencies. The authors argue that independent evaluation must be rigorous, transparent and benchmarked against human clinicians to ensure developers are held accountable.
They also warn that deployment is outpacing oversight. In January 2026, OpenAI launched ChatGPT Health, a consumer tool marketed as a personalised health information service.
Although not designed for triage, it attempted triage tasks and under‑triaged more than half of emergencies in an independent evaluation. Without clear task definitions and human comparators, the researchers say it is impossible to determine whether such systems improve or endanger patient care.
Looking ahead, the Flinders researchers argue that enthusiasm for medical AI must be matched by strong governance, clearer evaluation standards and a focus on real‑world patient outcomes rather than benchmark performance.
“We do not allow doctors to practise without supervision and evaluation, and AI should be held to comparable standards,” Mr Cornelisse said.
Associate Professor Hopkins said the goal should be technology that improves care, not tools that simply perform well in studies.
“Patients deserve technology that improves care in the real world, not systems that only look impressive in studies,” he said.
With careful design, strong oversight and rigorous testing, he added, AI could become “a powerful tool to deliver safer, fairer, and more effective care across health systems worldwide.”
Email: rebecca.cox@news.com.au



