★ VENTURE TAKES

The OpenEvidence Plot Twist

A Nature Medicine study found that general-purpose frontier models outperformed specialised clinical AI tools. The real lesson is not that vertical AI is doomed — it is that specialisation only matters when it makes adoption easier and outcomes better.

1P · JUDY DUONG · JUNE 26, 2026 · 5 MIN READ

In 2026, a team led by NYU Langone published an independent study in Nature Medicine comparing two specialised clinical AI tools — OpenEvidence and UpToDate Expert AI — with three general-purpose frontier models: GPT-5.2, Gemini 3.1 Pro, and Claude Opus 4.6.

The result was uncomfortable.

The general-purpose models outperformed the specialised tools across all three evaluations.

On MedQA, a set of 500 USMLE-style medical knowledge questions, Gemini scored 97.4% and GPT scored 94.2%, compared with OpenEvidence at 89.6% and UpToDate at 88.4%.

On HealthBench, GPT scored 88.0, while OpenEvidence scored 62.6 and UpToDate scored 61.3.

Most importantly, on real clinical queries — 100 de-identified questions drawn from live physician use and reviewed blindly by 12 clinicians — the frontier models formed the top performance tier. OpenEvidence and UpToDate formed the lower tier.

The awkward detail is that Google AI Overview, a free search-embedded AI feature, performed similarly to OpenEvidence and UpToDate on the real clinical query benchmark.
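To make the size of the gap concrete, here is a minimal sketch that computes the lead of the best general-purpose model over the best specialised tool, using only the MedQA and HealthBench figures quoted above. "GPT" and "Gemini" stand for GPT-5.2 and Gemini 3.1 Pro; scores not quoted in this article (such as Claude's) are left out, and the function name `tier_gap` is my own label, not something from the study.

```python
# Scores as quoted in this article (MedQA in percent, HealthBench in points).
# Only figures stated in the text are included; nothing is filled in.
scores = {
    "MedQA": {"Gemini": 97.4, "GPT": 94.2, "OpenEvidence": 89.6, "UpToDate": 88.4},
    "HealthBench": {"GPT": 88.0, "OpenEvidence": 62.6, "UpToDate": 61.3},
}

SPECIALISED = {"OpenEvidence", "UpToDate"}


def tier_gap(benchmark_scores):
    """Lead of the best general-purpose score over the best specialised score."""
    general = [s for name, s in benchmark_scores.items() if name not in SPECIALISED]
    special = [s for name, s in benchmark_scores.items() if name in SPECIALISED]
    # Round to one decimal place to avoid floating-point noise.
    return round(max(general) - max(special), 1)


gaps = {name: tier_gap(vals) for name, vals in scores.items()}
for name, gap in gaps.items():
    print(f"{name}: best generalist leads best specialist by {gap} points")
```

On these numbers the lead is 7.8 points on MedQA and 25.4 points on HealthBench, so the gap is far wider on the benchmark that grades full responses than on multiple-choice knowledge questions.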

That matters because OpenEvidence is not positioned as a casual chatbot. It is positioned as a specialised medical AI product. Its value proposition is not only convenience. It is clinical trust.

Yet in this study, the specialist did not outperform the generalist. It trailed it on all three evaluations.

The simple takeaway would be: "ChatGPT beat the medical tool."

But I think the more important takeaway is this:

Vertical AI is not automatically better just because it is vertical.

The wrapper can become the weakness

The wrong conclusion is that specialist medical AI is useless.

The better conclusion is that specialisation only matters when it improves the outcome.

That is what makes the OpenEvidence result interesting. The issue was not mainly a lack of knowledge. Its weaker responses were associated less with simple factual errors and more with incomplete clinical content, safety-critical omissions, and disorganised answers.

For a clinician, that distinction matters. A medical AI product does not only need access to trusted literature. It needs to turn that information into a clear, complete, usable answer at the point of care. A technically informed answer can still be a poor product if it buries the practical point, omits an important caveat, or makes the clinician work harder to understand what matters.

This is where the irony sits.

Many specialised AI tools are still built on top of the same class of general-purpose foundation models. They add retrieval, citations, safety rules, workflow design, and interface formatting around the model. Those layers are supposed to make the product more trustworthy and useful. But each layer can also introduce friction: a retrieval layer can surface the wrong context, a citation layer can make an incomplete answer look more authoritative, a safety layer can over-constrain the response.

So the study is not simply saying:

The generalist beat the specialist.

It is saying something more uncomfortable:

A frontier model, used more directly, outperformed a specialised product wrapper that was supposed to make it better.

Retrieval is not a moat by itself. Citations are not a moat by themselves. Domain-specific branding is not a moat by itself. They only matter if they improve the answer, the workflow, or the trust layer in a way the general-purpose model cannot easily replicate.

OpenEvidence can still be a strong business

A benchmark loss does not automatically mean a weak company.

OpenEvidence has reportedly grown quickly, attracted major investors, reached significant physician adoption, and built a business model around a high-value healthcare audience. That momentum may not depend entirely on having the best answer quality.

The real moat may sit somewhere else: physician distribution, brand trust, workflow presence, usage data, citation behaviour, healthcare-specific positioning, institutional legitimacy, and pharmaceutical advertising economics. Those are real business advantages.

A company can be commercially strong even if its model is not the best-performing system in a benchmark. The mistake is not believing OpenEvidence can be a good business. The mistake is misunderstanding why it is good.

Different moats carry different risks. A model-quality moat is threatened by better frontier models. A workflow moat is threatened by integration shifts. A distribution moat is threatened by cheaper or more trusted channels. An advertising moat is threatened by regulatory scrutiny and buyer trust. Investors need to know which moat is actually load-bearing.

That is the uncomfortable but useful distinction: OpenEvidence may still be valuable, but not necessarily for the reason people think.

The real risk is perception repricing

OpenEvidence may be perceived as safer and more clinically appropriate because of its branding, citations, and physician-focused interface. That perception has value, especially in healthcare. But perception is not permanently separate from performance — over time, evidence, competing products, procurement reviews, litigation, and regulation can reprice it.

So the better question is not "is the product overrated," but: how long can perception stay ahead of measured performance, and what could force that gap to close?

In healthcare, the gap can persist for a long time. Physicians may not directly bear the cost of lower-quality outputs, a free product creates less price pressure, and citations can create confidence even when an answer is incomplete. So OpenEvidence could keep growing despite the study. But the gap is not risk-free — three forcing functions matter.

First, this study creates a credible external reference point. A peer-reviewed Nature Medicine paper finding lower clarity and safety-critical omissions is different from a casual online critique. It can be cited in procurement reviews, compliance discussions, and risk assessments.

Second, the business model may attract regulatory attention. If a platform influences what physicians read during clinical decision-making and monetises attention through pharmaceutical advertising, regulators may eventually examine the relationship between content, incentives, and prescribing behaviour.

Third, frontier AI companies are moving closer to healthcare workflows. If OpenAI, Anthropic, Google, or another major platform offers a HIPAA-compliant clinical product that is cheaper, clearer, and easier to integrate, specialised tools will need to defend their position more directly.

One paper will not destroy OpenEvidence. But it gives buyers, competitors, and regulators a credible reason to ask harder questions. That is how perception starts to reprice.

My take

I have made this argument many times: the job of a specialised AI product is not to make the product look more specialised. It is to make adoption easier.

A good vertical product should reduce friction. It should take a powerful general-purpose model and make it easier, safer, clearer, and more useful for a specific user in a specific workflow. That is the whole point of specialisation.

If the specialised product adds citations, constraints, terminology, or workflow steps but makes the answer harder to use, then it is not making adoption easier. It is creating noise.

That is why this study matters beyond OpenEvidence. It challenges the easy assumption that vertical AI plus domain-specific retrieval equals defensibility. Sometimes it does. Automatically, it does not.

For founders, "ChatGPT for X" is not enough. The product has to own something deeper than the interface: workflow, proprietary data, evaluation, distribution, compliance, or a user relationship the frontier labs cannot easily replicate.

For investors, the diligence question should be simple: what is actually load-bearing?

If the moat is distribution, say it is distribution. If the moat is workflow, prove the workflow is hard to replace. If the moat is data, show why the data compounds. If the moat is model quality, then it needs to beat the frontier models on real tasks. The danger is underwriting one moat while the company is actually relying on another. That said, if an investor's only goal is upside, the bet may work anyway: OpenEvidence has reached its customers and has strong traction, so I honestly don't know. :p

The OpenEvidence study is not just about whether ChatGPT is better than a medical AI tool. It is about what actually creates defensibility in applied AI.

Specialised AI is not a moat unless it makes adoption easier and outcomes better.

OpenEvidence may still become a very strong company. But if it wins, it may be because of distribution, workflow, trust, data, and monetisation — not because its answer quality is clearly superior to frontier models.

In applied AI, the model is often the most replaceable part of the stack. The moat is what surrounds it.
