New research investigates how large language models perform in a variety of medical situations, including real-life emergency room cases. There, at least one model appears to be more accurate than human doctors.
The study, published this week in the journal Science, is the work of a research team led by doctors and computer scientists from Harvard Medical School and Beth Israel Deaconess Medical Center. The researchers said they conducted various experiments to measure how OpenAI’s models compared to human doctors.
In one experiment, researchers focused on 76 patients who came to Beth Israel’s emergency room and compared the diagnoses provided by two attending internists with the diagnoses generated by OpenAI’s o1 and 4o models. These diagnoses were evaluated by two other primary care physicians, who were blinded to whether each diagnosis came from a human or from the AI.
“At each diagnostic touchpoint, o1 performed nominally better than or equal to two primary care physicians and 4o,” the study said, adding that the difference was “particularly pronounced at the first diagnostic touchpoint (early ER triage), when the least information is available about the patient and making the right decision is most urgent.”
In a press release from Harvard Medical School about the study, the researchers emphasized that “no data preprocessing was performed.” The AI model was presented with the same information that was available in the electronic medical record at the time of each diagnosis.
Armed with that information, the o1 model was able to provide “accurate or very close diagnoses” in 67% of triage cases. Meanwhile, one doctor was correct or very close to the diagnosis 55% of the time, and the other doctor was right 50% of the time.
“We tested our AI model against nearly every benchmark, and it outperformed both previous models and physician baselines,” Arjun Manraj, director of the AI Lab at Harvard Medical School and one of the study’s lead authors, said in a press release.
To be clear, this study does not claim that AI is ready to make real life-or-death decisions in emergency rooms. Instead, it said the findings demonstrate “an urgent need for prospective clinical trials to evaluate these technologies in real-world patient care settings.”
The researchers also noted that they only studied how the model behaves when provided with text-based information, and that “existing research suggests that current underlying models are more limited in their inferences to non-text inputs.”
Adam Rodman, a Beth Israel physician and one of the study’s lead authors, warned in the Guardian that there is “currently no formal framework for accountability” for AI diagnostics, and that patients still “want humans to guide them through life-and-death decisions and difficult treatment decisions.”
In a post about the study, emergency physician Kristen Pantagani said it was an “interesting AI study that led to some very hyped headlines,” especially because it compared AI diagnoses to those of internists rather than ER doctors.
“If you want to compare an AI tool to a doctor’s clinical capabilities, you should start by comparing it to a doctor who actually practices that specialty,” Pantagani said. “I wouldn’t be surprised if an LLM could beat a dermatologist on the neurosurgery board exam, but that’s not particularly helpful to know.”
“My main goal as an ER doctor seeing a patient for the first time is not to guess the final diagnosis. My main goal is to determine whether you have a potentially fatal disease,” she added.
This post and headline have been updated to reflect the fact that the diagnoses in the study came from attending internal medicine physicians, and to include comments from Kristen Pantagani.
