Medical LLMs Give Different Answers to the Same Question. This Matters for Clinical Practice.
A groundbreaking study published in the Journal of General Internal Medicine has revealed a critical vulnerability in large language models used for medical decision-making: they produce inconsistent recommendations when presented with the same clinical scenario multiple times. This finding has immediate implications for every physician using AI tools at the bedside.
The Study Design
The research team tested six prominent LLMs on four common inpatient management scenarios deliberately chosen from medicine’s “gray zone” - situations where there is no single correct answer and clinical judgment is required:
Models tested:
GPT-4o
GPT-o1
Claude 3.7 Sonnet
Grok 3
Gemini 2.0 Flash
OpenEvidence (a domain-specific medical LLM)
Clinical scenarios:
Blood transfusion at borderline hemoglobin
Resumption of anticoagulation after gastrointestinal bleeding
Discharge readiness despite modest creatinine rise
Peri-procedural anticoagulation bridging
Each model was queried five times with identical prompts to assess internal consistency - a crucial test that most LLM research overlooks.
Key Findings
𝗕𝗹𝗼𝗼𝗱 𝘁𝗿𝗮𝗻𝘀𝗳𝘂𝘀𝗶𝗼𝗻 𝗮𝘁 𝗯𝗼𝗿𝗱𝗲𝗿𝗹𝗶𝗻𝗲 𝗵𝗲𝗺𝗼𝗴𝗹𝗼𝗯𝗶𝗻
When faced with a patient at borderline hemoglobin levels:
4 of 6 models recommended transfusion
2 models favored observation
Pro-transfusion responses were delivered with definitive language
Pro-observation responses showed hesitancy
This split reveals a fundamental issue: the same clinical data produced opposite recommendations depending on which LLM a physician happened to consult.
𝗥𝗲𝘀𝘁𝗮𝗿𝘁𝗶𝗻𝗴 𝗮𝗻𝘁𝗶𝗰𝗼𝗮𝗴𝘂𝗹𝗮𝘁𝗶𝗼𝗻
For patients who experienced gastrointestinal bleeding while on anticoagulation:
Models split evenly on whether to restart anticoagulation or wait
Timing recommendations varied dramatically: from 4 to 14 days
OpenEvidence missed stroke risk entirely - a potentially dangerous omission
The wide variation in timing recommendations is particularly concerning. Restarting at 4 days rather than 14 shifts the balance between preventing a stroke and provoking a recurrent bleed.
𝗔𝗻𝘁𝗶𝗰𝗼𝗮𝗴𝘂𝗹𝗮𝘁𝗶𝗼𝗻 𝗯𝗿𝗶𝗱𝗴𝗶𝗻𝗴
When asked about bridging anticoagulation for high-risk patients:
5 of 6 models recommended against bridging
Only Grok noted that the patient’s characteristics differed from the population enrolled in the landmark BRIDGE trial
Gemini uniquely warned about rare complications
This scenario highlights how different models incorporate - or fail to incorporate - nuanced evidence-based medicine principles into their recommendations.
The Critical Problem: Internal Inconsistency
Perhaps most alarming: models changed their own recommendations up to 40% of the time when given identical queries.
This internal inconsistency creates clinically significant flips in management:
“Restart anticoagulation” vs. “Don’t restart”
“Transfuse now” vs. “Observe”
“Bridge with heparin” vs. “No bridging needed”
None of the models explicitly acknowledged clinical uncertainty in their responses; they delivered recommendations with varying degrees of confidence, regardless of the actual evidence base or the complexity of the decision.
Why This Happens: Understanding Stochastic LLMs
Large language models are fundamentally probabilistic text generators. They work by:
Processing input through neural networks trained on vast amounts of text
Calculating probabilities for what word should come next
Sampling from a distribution of possible outputs
Generating text that seems coherent and authoritative
This probabilistic nature means that running the same query multiple times can produce different outputs. Temperature settings control the degree of randomness, but even at low temperatures, variation persists.
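To make that concrete, here is a minimal sketch of temperature-scaled sampling. The candidate tokens, logits, and temperature value are invented for illustration and are not taken from the study or from any of the models tested.

```python
import numpy as np

def sample_next_token(logits, temperature=0.7, rng=None):
    """Sample one token index from a temperature-scaled softmax distribution."""
    rng = rng or np.random.default_rng()
    scaled = np.array(logits, dtype=float) / max(temperature, 1e-6)  # lower T -> sharper distribution
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)

# Hypothetical next-token candidates for a borderline-transfusion prompt.
tokens = ["transfuse", "observe"]
logits = [1.2, 1.0]  # nearly tied -- the "gray zone"

counts = {t: 0 for t in tokens}
for _ in range(1000):
    counts[tokens[sample_next_token(logits)]] += 1
print(counts)  # roughly 570 vs 430: identical input, split recommendations
```

Because the two options are nearly tied, repeated runs keep flipping between them - the same mechanism that produces opposite recommendations on gray-zone clinical questions.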
In domains with clear right answers (like “What is the capital of France?”), this variation is minimal. But in medicine’s gray zone, where legitimate clinical disagreement exists and evidence is nuanced, LLMs amplify rather than resolve uncertainty.
Different Models, Different Philosophies
The study revealed distinct patterns in how different LLMs approach clinical recommendations:
Grok: Generated lengthy, detailed responses but showed the lowest internal consistency. More verbose does not mean more reliable.
OpenEvidence: Delivered brief, authoritative directives that masked underlying uncertainty. The domain-specific training didn’t eliminate the fundamental problem and may have made it worse by increasing false confidence.
GPT-4o and Claude 3.7 Sonnet: Showed better consistency but still exhibited clinically meaningful variation in recommendations.
What This Means for Medical Practice
The False Confidence Problem
Physicians who treat LLMs as deterministic calculators - like a creatinine clearance formula or a CHADS2 score calculator - risk developing false confidence in recommendations that are actually samples drawn from a probability distribution.
Traditional clinical decision tools are deterministic: the same inputs always produce the same outputs. LLMs fundamentally aren’t. Yet their confident, articulate responses create an illusion of reliability.
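To see why the calculator comparison breaks down, here is a hedged sketch contrasting a deterministic Cockcroft-Gault creatinine clearance estimate with a stand-in for a sampling-based LLM call; the llm_recommendation function is a hypothetical placeholder, not a real vendor API, and the clinical values are invented.

```python
import random

def creatinine_clearance(age, weight_kg, serum_cr_mg_dl, female=False):
    """Cockcroft-Gault estimate: deterministic, so identical inputs always give identical output."""
    crcl = ((140 - age) * weight_kg) / (72 * serum_cr_mg_dl)
    return round(crcl * 0.85, 1) if female else round(crcl, 1)

print(creatinine_clearance(70, 80, 1.4))  # 55.6
print(creatinine_clearance(70, 80, 1.4))  # 55.6 again -- every time

def llm_recommendation(prompt):
    """Hypothetical stand-in for a sampling-based LLM call (not a real vendor API)."""
    return random.choice(["Transfuse one unit now", "Observe and recheck hemoglobin in the morning"])

print(llm_recommendation("Hgb 7.2 g/dL, asymptomatic inpatient. Transfuse?"))  # may differ on each run
```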
Why This Differs from Standard LLM Research
Most LLM research in medicine uses questions with clear, verifiable answers:
Medical board exam questions
Factual queries about diseases
Interpretation of guidelines with definitive statements
The study authors deliberately tested the gray zone - precisely where clinicians are most likely to turn to AI for help. These are the scenarios where:
Evidence is incomplete or conflicting
Multiple reasonable approaches exist
Clinical judgment is paramount
Experience and pattern recognition matter
This is also where LLM unreliability is most dangerous.
Practical Implications and Recommendations
For Individual Physicians
Query models multiple times: Don’t accept the first response as definitive. Run the same query 3-5 times and look for variation. If you get different answers, that’s telling you something important about the underlying uncertainty.
Compare different models: Different LLMs are trained on different datasets and use different architectures. Cross-checking recommendations across models can reveal where consensus exists and where it doesn’t (a minimal sketch of this repeat-and-compare workflow follows these points).
Maintain final responsibility: LLMs should inform, not determine, clinical decisions. The phrase “AI-assisted” should mean exactly that - assistance, not replacement of clinical judgment.
Human-in-the-loop is mandatory: No LLM output should go directly from generation to clinical action without human review and validation. This isn’t about AI capability - it’s about the fundamental nature of probabilistic systems in high-stakes decisions.
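A minimal sketch of that repeat-and-compare workflow, assuming you already have wrapper functions for calling each model (the ask_gpt4o and ask_claude names in the usage comment are hypothetical placeholders):

```python
from collections import Counter

def consistency_check(query_fn, prompt, n=5):
    """Ask the same model the same question n times and tally the answers.

    query_fn is whatever wrapper you use to call a given LLM (hypothetical here);
    it should accept a prompt string and return the model's answer as a string.
    """
    answers = [query_fn(prompt) for _ in range(n)]
    tally = Counter(answers)
    top_answer, top_count = tally.most_common(1)[0]
    if top_count < n:
        print(f"Note: answers varied across {n} runs -- treat this as a gray-zone question")
    return top_answer, top_count / n, tally

# Hypothetical usage, assuming ask_gpt4o and ask_claude are your own API wrappers:
# prompt = "AF patient, GI bleed 5 days ago, now stable. Restart anticoagulation? Answer briefly."
# for name, fn in [("GPT-4o", ask_gpt4o), ("Claude 3.7 Sonnet", ask_claude)]:
#     print(name, consistency_check(fn, prompt))
```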
For Healthcare Systems
Avoid deterministic trust: Policy and workflow designs should not assume LLM outputs are reproducible or consistent. Decision-support systems need human verification steps.
Training and education: Clinicians need to understand that LLMs work fundamentally differently from traditional clinical calculators. The same professional who knows how to use a risk calculator needs a different mental model for using an LLM.
Documentation standards: When LLMs inform clinical decisions, documentation should reflect:
Which model was used
Whether multiple queries were performed
How recommendations were validated
Why the final decision was made
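One hedged way to capture those four points is a small structured record attached to the note; the field names and example values below are illustrative assumptions, not a proposed documentation standard.

```python
from dataclasses import dataclass, field, asdict
from datetime import date

@dataclass
class LLMConsultRecord:
    """Illustrative documentation fields -- hypothetical names, not a formal standard."""
    model: str                     # which model was used
    query_count: int               # how many repeated queries were run
    answers_observed: list         # the range of recommendations returned
    validation: str                # how the output was checked before acting on it
    final_decision_rationale: str  # why the clinician decided as they did
    date_used: str = field(default_factory=lambda: date.today().isoformat())

record = LLMConsultRecord(
    model="GPT-4o",
    query_count=5,
    answers_observed=["restart at day 7", "restart at day 14"],
    validation="Cross-checked against institutional GI-bleed/anticoagulation pathway",
    final_decision_rationale="High stroke risk; restarted at day 7 after GI consultation",
)
print(asdict(record))
```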
The Bigger Picture: AI in Medicine’s Gray Zone
This study illuminates a crucial truth: AI systems trained on medicine’s documented knowledge still struggle with medicine’s essential skill - navigating uncertainty with wisdom.
The best clinicians don’t just know facts; they know when facts don’t fully apply, when guidelines have limitations, and when patient-specific factors trump population-level evidence. They’re comfortable saying “I don’t know, but here’s how we’ll approach this together.”
Current LLMs, by contrast, confidently generate recommendations without acknowledging when they’re operating in epistemic murk. They don’t say “this is a gray area with reasonable disagreement.” They say “here’s what you should do” - even when “what you should do” might be different if you ask again five minutes later.
Looking Forward
These tools have genuine utility:
Synthesizing large amounts of information quickly
Generating differential diagnoses
Explaining complex concepts to patients
Drafting documentation
Literature searches and evidence summaries
But we need realistic expectations. LLMs are powerful text processors, not clinical reasoning engines. They can augment human judgment but not replace it.
What Needs to Change
Transparency about uncertainty: Future medical LLMs should explicitly communicate when recommendations exist in gray zones and when evidence is limited.
Consistency improvements: While perfect consistency may be impossible for probabilistic systems, better calibration and uncertainty quantification could help (a rough sketch of one such measure follows these points).
Validation frameworks: Medical AI needs evaluation standards that test behavior in gray zones, not just accuracy on clear-cut questions.
User interface design: Clinical decision support tools using LLMs should display confidence intervals, alternative recommendations, and evidence quality, not just a single, confident-sounding suggestion.
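As a rough illustration of the kind of uncertainty quantification and interface change described above, the sketch below condenses repeated answers into an agreement score and a normalized entropy that a decision-support tool could display alongside the modal recommendation; the 0.8 flag threshold and example answers are arbitrary placeholders.

```python
import math
from collections import Counter

def uncertainty_summary(answers):
    """Summarize repeated LLM answers as modal agreement plus normalized entropy."""
    tally = Counter(answers)
    n = len(answers)
    probs = [count / n for count in tally.values()]
    agreement = max(probs)  # share of runs that match the most common answer
    entropy = -sum(p * math.log2(p) for p in probs)
    max_entropy = math.log2(len(tally)) if len(tally) > 1 else 1.0
    return {
        "modal_answer": tally.most_common(1)[0][0],
        "agreement": round(agreement, 2),
        "normalized_entropy": round(entropy / max_entropy, 2),
        "flag_gray_zone": agreement < 0.8,  # arbitrary placeholder threshold
    }

print(uncertainty_summary([
    "restart day 7", "restart day 7", "do not restart yet", "restart day 14", "restart day 7",
]))
# Agreement of 0.6 across five runs: show the spread, not just the single modal answer.
```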
Conclusion
The JGIM study delivers a sobering message: medical LLMs are not yet ready to be trusted as autonomous clinical decision-makers, particularly in the complex, judgment-dependent scenarios where physicians most need help.
The stochastic nature of these systems - their fundamental randomness - creates clinically significant inconsistency that can’t be ignored. A tool that recommends restarting anticoagulation on one query and waiting another week on the next query isn’t malfunctioning; it’s working exactly as probabilistic language models work.
For now, the answer is clear:
Query multiple times
Compare multiple models
Maintain human oversight
Never treat LLM outputs as deterministic
Discussion Questions
What are your experiences with LLMs in clinical practice? Have you noticed inconsistency in recommendations? How do you currently validate AI-generated suggestions before applying them to patient care?
Share your thoughts in the comments below.
Thanks for your review - always professionally written and interesting. Do this or other studies compare LLM inconsistency with the rates of misdiagnosis or therapeutic misalignment in routine hospital practice - for example, diagnostic drift, iatrogenic harm, and the need for second opinions? Should the “check again” principle we apply to human judgment now formally extend to LLMs? And does their inconsistency simply mirror, amplify, or potentially correct human diagnostic uncertainty, bearing in mind that LLMs are very new and still “learning”?