
Benchmarks in medicine: the promise and pitfalls of evaluating AI tools with mismatched yardsticks
Headlines in the recent past have trumpeted that AI 'outperforms doctors' or 'aces medical exams.' The impression that's coming through is these models are smarter, faster, and perhaps even safer. But this hype masks a deeper truth. To put it plainly, the benchmarks used to arrive at these claims are based on exams built for evaluating human memory retention from classroom teachings. They reward fact recall, not clinical judgment.
A calculator problem
A calculator can multiply two six-digit numbers within seconds. Impressive, no doubt. But does this mean calculators are better than, and understand maths more than mathematics experts ? Or better even than an ordinary person who takes a few minutes to do the calculation with a pen and paper?
Language models are celebrated because they can churn out textbook-style answers to MCQs and fill in the blanks for medical facts and questions faster than medical professors. But the practice of medicine is not a quiz. Real doctors deal with ambiguity, emotion, and decision-making under uncertainty. They listen, observe, and adapt.
The irony is that while AI beats doctors in answering questions, it still struggles to generate the very case vignettes that form the basis of those questions. Writing a good clinical scenario from real patients in clinical practice requires understanding human suffering, filtering irrelevant details, and framing the diagnostic dilemma with context. So far, that remains a deeply human ability.
Also Read: Why AI in healthcare needs stringent safety protocols
What existing benchmarks miss
Most widely-used benchmarks—MedQA, PubMedQA, MultiMedQA—pose structured questions with one 'correct' answer or have fill in the blanks questions. They evaluate factual accuracy but overlook human nuance. A patient doesn't say, 'I have been using a faulty chair and sitting in the wrong posture for long hours and have a non-specific backache ever since I bought it. So please choose the best diagnosis and give appropriate treatment.' They just say, 'Doctor, I'm tired. I don't feel like myself.' That is where the real work begins.
Clinical environments are messy. Doctors deal with overlapping illnesses, vague symptoms, incomplete notes, and patients who may be unable—or unwilling—to tell the full story. Communication gaps, emotional distress, and even socio-cultural factors influence how care unfolds. And yet, our evaluation metrics continue to look for precision, clarity, and correctness—things that the real world rarely provides.
Benchmarking vs reality
It can be easy to decide who the best batter in the world is, by only counting runs scored. Similarly, bowlers can be ranked by the number of wickets taken. But answering the question 'Who is the best fielder?' might not be as simple. Measuring fielding is very subjective and evades simple numbers. The number of runs outs assisted or catches taken only tells part of the story. The efforts made at the boundary line to reduce runs or mere intimidation through the presence of the fielders (like Jonty Rhodes or R. Jadeja) preventing runs at covers or points can't be measured easily.
Healthcare is like fielding: it is qualitative, often invisible, deeply contextual, and hard to quantify. Any benchmark that pretends otherwise will mislead more than it illuminates.
This is not a new problem. In 1946, the civil servant Sir Joseph Bhore, when consulted to reform India's healthcare said, 'If it were possible to evaluate the loss, which this country annually suffers through the avoidable waste of valuable human material and the lowering of human efficiency through malnutrition and preventable morbidity, we feel that the result would be so startling that the whole country would be aroused and would not rest until a radical change had been brought about'. This quote reflects a longstanding dilemma—how to measure what truly matters in health systems. Even after 80 years, we have not found perfect evaluation metrics.
What HealthBench does
HealthBench at least acknowledges this disconnect. Developed by OpenAI in collaboration with clinicians, it moves away from traditional multiple-choice formats. It is also the first benchmark to explicitly score responses using 48,562 unique rubric criteriaranging from minus 10 to plus 10, reflecting some aspects of real-world stakes of clinical decision-making. A dangerously wrong answer must be punished more harshly than a mildly useful one. This, finally, mirrors medicine's moral landscape.
Even so, HealthBench has limitations. It evaluates performance across just 5,000 'simulated' clinical cases, of which only 1,000 are classified as 'difficult.' That is a vanishingly small slice of clinical complexity. Though commendably global, its doctor-rater pool includes just 262 physicians from 60 countries in 52 languages, with varying professional experience and cultural backgrounds (three Physicians from India participated, and simulations from 11 Indian languages were generated). HealthBench Hard, a challenging subset of 1,000 cases, revealed that many existing models scored zero—highlighting their inability to handle complex clinical reasoning. Moreover, these cases are still simulations. Thus, the benchmark is an improvement, not a revolution.
Also Read: Artificial Intelligence in healthcare: what lies ahead
Predictive AI's collapse in the real world
This is not just about LLMs. Predictive models have faced similar failures. The sepsis prediction tool, developed by EPIC to flag early signs of sepsis, showed initial promise a few years ago. However, once deployed, it could not meaningfully improve outcomes. Another company that claimed to have developed a detection algorithm for liver transplantation recipients folded quietly after its model showed bias against young patients in Britain. It failed in the real world despite glowing performances on benchmark datasets. Why? Because predicting rare/critical events requires context-aware decision-making. A seemingly unknown determinant may lead to wrong predictions and unnecessary ICU admissions. The cost of error is high—and humans often bear it.
What makes a good benchmark?
A robust medical benchmark should meet four criteria:
Represent reality: Include incomplete records, contradictory symptoms, and noisy environments.
Test communication: Measure how well a model explains its reasoning, not just what answer it gives.
Handle edge cases: Evaluate performance on rare, ethically complex, or emotionally charged scenarios.
Reward safety over certainty: Penalise overconfident wrong answers more than humble uncertainty.
Currently, most benchmarks miss these criteria. And without these elements, we risk trusting technically smart but clinically naïve models.
Red teaming the models
One way forward is red teaming—a method borrowed from cybersecurity, where systems are tested against ambiguous, edge-case, or morally complex scenarios. For example: a patient in mental distress whose symptoms may be somatic; an undocumented illegal immigrant fearful of disclosing travel history; a child with vague neurological symptoms and an anxious parent pushing for a CT scan; a pregnant woman with religious objections to blood transfusion; a terminal cancer patient is unsure whether to pursue aggressive treatment or palliative care; a patient feigning for personal gain.
In these edge cases, models must go beyond knowledge. They must display judgment—or, at the very least, know when they don't know. Red teaming does not replace benchmarks. But it adds a deeper layer, exposing overconfidence, unsafe logic, or lack of cultural sensitivity. These flaws matter more than ticking the right answer box in real-world medicine. Red teaming forces models to reveal what they know and how they think. It uncovers these aspects, which may be hidden in benchmark scores.
Why this matters
The core tension is this: medicine is not just about getting answers right. It is about getting people right. Doctors are trained to deal with doubts, handle exceptions, and recognise cultural patterns not taught in books (doctors also miss a lot). AI, by contrast, is only as good as the data it has seen and the questions it has been trained on. HealthBench, for all its flaws, is a small but vital course correction. It recognises that evaluation needs to change. It introduces a better scoring rubric. It asks harder questions. That makes it better. But we must remain cautious. Healthcare is not like image recognition or language translation. A single incorrect model output can mean a lost life and a ripple effect—misdiagnoses, lawsuits, data breaches, and even health crises. In the age of data poisoning and model hallucination, the stakes are existential.
The road ahead
We must stop asking if AI is better than doctors. That is not the right question. Instead, we should ask: Where is AI safe, useful, and ethical to deploy—and where is it not? Benchmarks, if thoughtfully redesigned, can help answer that. AI in healthcare is not a competition to win. It is a responsibility to share. We must stop treating model performance as a leaderboard sport and start thinking of it as a safety checklist. Until then, AI can assist. It can summarise. It can remind. However, it cannot replace clinical judgment's moral and emotional weight. It certainly cannot sit beside a dying patient and know when to speak and when to stay silent.
(Dr. C. Aravinda is an academic and public health physician. The views expressed are personal. aravindaaiimsjr10@hotmail.com)

Try Our AI Features
Explore what Daily8 AI can do for you:
Comments
No comments yet...
Related Articles


Hindustan Times
29 minutes ago
- Hindustan Times
The High-Schoolers Who Just Beat the World's Smartest AI Models
The smartest AI models ever made just went to the most prestigious competition for young mathematicians and managed to achieve the kind of breakthrough that once seemed miraculous. They still got beat by the world's brightest teenagers. Every year, a few hundred elite high-school students from all over the planet gather at the International Mathematical Olympiad. This year, those brilliant minds were joined by Google DeepMind and other companies in the business of artificial intelligence. They had all come for one of the ultimate tests of reasoning, logic and creativity. The famously grueling IMO exam is held over two days and gives students three increasingly difficult problems a day and more than four hours to solve them. The questions span algebra, geometry, number theory and combinatorics—and you can forget about answering them if you're not a math whiz. You'll give your brain a workout just trying to understand them. Because those problems are both complex and unconventional, the annual math test has become a useful benchmark for measuring AI progress from one year to the next. In this age of rapid development, the leading research labs dreamed of a day their systems would be powerful enough to meet the standard for an IMO gold medal, which became the AI equivalent of a four-minute mile. But nobody knew when they would reach that milestone or if they ever would—until now. This year's International Mathematical Olympiad attracted high-school students from all over the world. The unthinkable occurred earlier this month when an AI model from Google DeepMind earned a gold-medal score at IMO by perfectly solving five of the six problems. In another dramatic twist, OpenAI also claimed gold despite not participating in the official event. The companies described their feats as giant leaps toward the future—even if they're not quite there yet. In fact, the most remarkable part of this memorable event is that 26 students got higher scores on the IMO exam than the AI systems. Among them were four stars of the U.S. team, including Qiao (Tiger) Zhang, a two-time gold medalist from California, and Alexander Wang, who brought his third straight gold back to New Jersey. That makes him one of the most decorated young mathematicians of all time—and he's a high-school senior who can go for another gold at IMO next year. But in a year, he might be dealing with a different equation altogether. 'I think it's really likely that AI is going to be able to get a perfect score next year,' Wang said. 'That would be insane progress,' Zhang said. 'I'm 50-50 on it.' So given those odds, will this be remembered as the last IMO when humans outperformed AI? 'It might well be,' said Thang Luong, the leader of Google DeepMind's team. DeepMind vs. OpenAI Until very recently, what happened in Australia would have sounded about as likely as koalas doing calculus. But the inconceivable began to feel almost inevitable last year, when DeepMind's models built for math solved four problems and racked up 28 points for a silver medal, just one point short of gold. This year, the IMO officially invited a select group of tech companies to their own competition, giving them the same problems as the students and having coordinators grade their solutions with the same rubric. They were eager for the challenge. AI models are trained on unfathomable amounts of information—so if anything has been done before, the chances are they can figure out how to do it again. But they can struggle with problems they have never seen before. As it happens, the IMO process is specifically designed to come up with those original and unconventional problems. In addition to being novel, the problems also have to be interesting and beautiful, said IMO president Gregor Dolinar. If a problem under consideration is similar to 'any other problem published anywhere in the world,' he said, it gets tossed. By the time students take the exam, the list of a few hundred suggested problems has been whittled down to six. Meanwhile, the DeepMind team kept improving the AI system it would bring to IMO, an unreleased version of Google's advanced reasoning model Gemini Deep Think, and it was still making tweaks in the days leading up to the competition. The effort was led by Thang Luong, a senior staff research scientist who narrowly missed getting to IMO in high school with Vietnam's team. He finally made it to IMO last year—with Google. Before he returned this year, DeepMind executives asked about the possibility of gold. He told them to expect bronze or silver again. He adjusted his expectations when DeepMind's model nailed all three problems on the first day. The simplicity, elegance and sheer readability of those solutions astonished mathematicians. The next day, as soon as Luong and his colleagues realized their AI creation had crushed two more proofs, they also realized that would be enough for gold. They celebrated their monumental accomplishment by doing one thing the other medalists couldn't: They cracked open a bottle of whiskey. Key members of Google DeepMind's gold-medal-winning team, including Thang Luong, second from left. To keep the focus on students, the companies at IMO agreed not to release their results until later this month. But as soon as the Olympiad's closing ceremony ended, one company declared that its AI model had struck gold—and it wasn't DeepMind. It was OpenAI. The company wasn't a part of the IMO event, but OpenAI gave its latest experimental reasoning model all six problems and enlisted former medalists to grade the proofs. Like DeepMind's, OpenAI's system flawlessly solved five and scored 35 out of 42 points to meet the gold standard. After the OpenAI victory lap on social media, the embargo was lifted and DeepMind told the world about its own triumph—and that its performance was certified by the IMO. Not long ago, it was hard to imagine AI rivals dueling for glory like this. In 2021, a Ph.D. student named Alexander Wei was part of a study that asked him to predict the state of AI math by July 2025—that is, right now. When he looked at the other forecasts, he thought they were much too optimistic. As it turned out, they weren't nearly optimistic enough. Now he's living proof of just how wrong he was: Wei is the research scientist who led the IMO project for OpenAI. The only thing more impressive than what the AI systems did was how they did it. Google called its result a major advance, though not because DeepMind won gold instead of silver. Last year, the model needed the problems to be translated into a computer programming language for math proofs. This year, it operated entirely in 'natural language' without any human intervention. DeepMind also crushed the exam within the IMO time limit of 4 ½ hours after taking several days of computation just a year ago. You might find all of this completely terrifying—and think of AI as competition. The humans behind the models see them as complementary. 'This could perhaps be a new calculator,' Luong said, 'that powers the next generation of mathematicians.' The problem of Problem 6 Speaking of that next generation, the IMO gold medalists have already been overshadowed by AI. So let's put them back in the spotlight. Team USA at the International Mathematical Olympiad, including Alexander Wang, fourth from right, and Tiger Zhang, with the stuffed red panda on his head. Qiao Zhang is a 17-year-old student in Los Angeles on his way to MIT to study math and computer science. As a young boy, his family moved to the U.S. from China and his parents gave him a choice of two American names. He picked Tiger over Elephant. His career in competitive math began in second grade, when he entered a contest called the Math Kangaroo. It ended this month at the math Olympics next to a hotel in Australia with actual kangaroos. When he sat down at his desk with a pen and lots of scratch paper, Zhang spent the longest amount of time during the exam on Problem 6. It was a problem in the notoriously tricky field of combinatorics, the branch of mathematics that deals with counting, arranging and combining discrete objects, and it was easily the hardest on this year's test. The solution required the ingenuity, creativity and intuition that humans can muster but machines cannot—at least not yet. 'I would actually be a bit scared if the AI models could do stuff on Problem 6,' he said. Problem 6 did stump DeepMind and OpenAI's models, but it wasn't just problematic for AI. Of the 630 student contestants, 569 also received zero points. Only six received the full credit of seven points. Zhang was proud of his partial solution that earned four points—which was four more than almost everyone else. At this year's IMO, 72 contestants went home with gold. But for some, a medal wasn't their only prize. Zhang was among those who left with another keepsake: victory over the AI models. (As if it weren't enough that he can bend numbers to his will, he also has a way with words and wrote this about his IMO experience.) In the end, the six members of the U.S. team piled up five golds and one silver, finishing second overall behind the Chinese after knocking them off the top spot last year. There was once a time when such precocious math students grew up to become professors. (Or presidents—the recently elected president of Romania was a two-time IMO gold medalist with perfect scores.) While many still choose academia, others get recruited by algorithmic trading firms and hedge funds, where their quantitative brains have never been so highly valued. This year, the U.S. team was supported by Jane Street while XTX Markets sponsored the whole event. After all, they will soon be competing with each other—and with the richest tech companies—for their intellectual talents. By then, AI might be destroying mere humans at math. But not if you ask Junehyuk Jung. A former IMO gold medalist himself, Jung is now an associate professor at Brown University and visiting researcher at DeepMind who worked on its gold-medal model. He doesn't believe this was humanity's last stand, though. He thinks problems like Problem 6 will flummox AI for at least another decade. And he walked away from perhaps the most significant math contest in history feeling bullish on all kinds of intelligence. 'There are things AI will do very well,' he said. 'There are still going to be things that humans can do better.' Write to Ben Cohen at The High-Schoolers Who Just Beat the World's Smartest AI Models The High-Schoolers Who Just Beat the World's Smartest AI Models


India Today
2 hours ago
- India Today
Should you double-check your doctor with ChatGPT? Yes, you absolutely should
First, there was Google. Or rather Doctor Google, as it is mockingly called by the men and women in white coats, the ones who come in one hour late to see their patients and those who brush off every little query from patients brusquely and sometimes with unwarranted there is a new foe in town, and it is only now that doctors are beginning to realise it. This is ChatGPT, or Gemini, or something like DeepSeek, the AI systems that are coherent and powerful enough to act like medical guides. Doctors are, obviously, not happy about it. Just the way they enrage patients for trying to discuss with them what the ailing person finds after Googling symptoms, now they are fuming against advice that ChatGPT can dish problem is that no one likes to be double-checked. And Indian doctors, in particular, hate it. They want their word to be the gospel. Bhagwan ka roop or something like that. But frustratingly for them, the capabilities of new AI systems are such that anyone can now re-check their doctor's prescription, or can read diagnostic films and observations, using tools like ChatGPT. The question, however, is: should you do it? Absolutely yes. The benefits outweigh the harms. Let me tell you a story. This is from around 15 years ago. A person whom I know well went to a doctor for an ear infection. This was a much-celebrated doctor, leading the ENT department in a hospital chain which has a name starting with the letter F. The doctor charged the patient a princely sum and poked and probed the ear in question. After a few days of tests and consultations, a surgery — rather complex one — was recommended. It was at this time, when the patient was submitting the consent forms for the surgery that was scheduled for a few days later, that the doctor discovered some new information. He found that the patient was a journalist in a large media group, the name of which starts with the letter new information, although not related to the patient's ear, quickly changed the tune the doctor was whistling. He became coy and cautious. He started having second thoughts about the surgery. So, he recommended a second opinion, writing a reference for another senior doctor, who was the head of the ENT at a hospital chain which has a name starting with the letter A. The doctor at this new hospital carried out his own observations. The ear was probed and poked again, and within minutes he declared, 'No, surgery needed. Absolutely, no surgery needed.'What happened? I have no way of confirming this. But I believe here is what happened. The doctor at hospital F was pushing for an unnecessary and complex surgery, the one where chances of something going wrong were minimal but not zero. However, once he realised that the patient was a journalist, he decided not to risk it and to get out of the situation, relied on the doctor at hospital is a story I know, but I am sure almost everyone in this country will have similar anecdotes. At one time or another, we have all had a feeling that this doctor or that was probably pushing for some procedure, some diagnostic test, or some advice that did not sit well with us. And in many unfortunate cases, people actually underwent some procedure or some treatment that harmed them more than it helped. Medical negligence in India flies under the radar of 'doctor is bhagwan ka roop' and other other countries where medical negligence is something that can have serious repercussions for doctors and hospitals, in India, people in white coats get flexibility in almost everything that they do. A lot of it is due to the reverence that society has for doctors, the savers of life. Some of it is also because, in India, we have far fewer doctors than are needed. This is not to say that doctors in India are incompetent. In general, they are not, largely thanks to the scholastic nature of modern medicine and procedures. Most of them also work crazy long hours, under conditions that are extremely frugal in terms of equipment and highly stressful in terms of this is exactly why we should use ChatGPT to double-check our doctors in India. Because there is a huge supply-demand mismatch, it is safe to say that we have doctors in the country who are not up for the task, whether these are doctors with dodgy degrees or those who have little to no background in modern medicine, and yet they put Dr in front of their name and run clinics where they deal with most complex is precisely because doctors are overworked in India that their patients should use AI to double-check their diagnostic opinions and suggested treatments. Doctors, irrespective of what we feel about them and how we revere them, are humans at the end of the day. They are prone to making the same mistakes that any human would make in a challenging work finally, because many doctors in India — not all, but many — tend to overdo their treatment and diagnostic tests, we should double-check them with AI. Next time, when you get a CT scan, also show it to ChatGPT and then discuss with your doctor if the AI is telling you something different. In the last one year, again and again, research has highlighted that AI is extremely good at diagnosis. Just earlier this month, a new study by a team at Microsoft found that their MAI-DxO — a specially-tuned AI system for medical diagnosis — outperformed human doctors. Compared to 21 doctors who were part of the study and who were correct in only 20 per cent of cases, MAI-DxO was correct in 85 per cent of cases in its none of this is to say that you should replace your doctor with ChatGPT. Absolutely not. Good doctors are indeed precious and their consultation is priceless. They will also be better with subtleties of the human body compared to any AI system. But in the coming months and years, I have a feeling that doctors in India will launch a tirade against AI, similar to how they once fought Dr they will shame and harangue their patients for using ChatGPT for a second opinion. When that happens, we should push back. Indian doctors are not used to questions, they don't like to explain, they don't want to be second-guessed or double-checked. And that is exactly why we should ask them questions, seek explanations and double-check them, if needed, even with the help of ChatGPT.(Javed Anwer is Technology Editor, India Today Group Digital. Latent Space is a weekly column on tech, world, and everything in between. The name comes from the science of AI and to reflect it, Latent Space functions in the same way: by simplifying the world of tech and giving it a context)- Ends(Views expressed in this opinion piece are those of the author)Trending Reel


Time of India
2 hours ago
- Time of India
AI may beat doctors at diagnosis, but trust still wins: Sam Altman
Synopsis OpenAI CEO Sam Altman says AI can now diagnose illnesses better than most doctors, but people still prefer human care for trust and connection. He warned about risks like AI-driven fraud and privacy issues, stressing the need for stronger protections for sensitive conversations users have with tools like ChatGPT.