The UK clinical evaluation layer for medical AI

We’re building the Sovereign UK Clinical AI Evaluation Dataset: a clinician-graded, regulator-aligned evidence base for how medical AI performs against UK clinical practice — independent, vendor-neutral, and designed to sit underneath the regulatory framework the MHRA and CERSI-AI are shaping.

<1%

Of medical AI studies meet rigorous evaluation standards

Aug 2027

EU AI Act high-risk enforcement

01  /  The Window

The frontier of medical AI is being trained right now — on a vanishing window of clinical reasoning.

Whoever sits inside that loop sets the global standard for clinical AI safety, encodes their values into deployed systems, and earns the regulatory pole position when the EU AI Act enforces high-risk obligations on clinical AI in August 2027.

01.A / Problem
01

Evaluation is the bottleneck — not data

Frontier labs already have terabytes of medical text. What they lack is high-fidelity signal on whether their model is actually safe at the bedside. That signal can only come from senior clinicians, working under a methodology rigorous enough that regulators can cite the result.

01.B / Problem
02

Most medical AI benchmarks do not measure what matters

A 2019 systematic review in The Lancet Digital Health (Liu et al.) screened more than 20,000 medical imaging AI studies and found fewer than 1% were sufficiently rigorous to constitute a trustworthy evaluation of the algorithm. The bar is still being set by exam-style multiple choice and single-rater ground truth — neither of which describes how clinicians actually reason.

01.C / Problem
03

The window to build it sovereign is closing

Every week, the world's most regulated medical workforce labels training data that disappears into foreign-owned corpora. The strategic asset is the dataset, not the labour. Once the dataset is embedded in someone else's model weights, you cannot get it back.

Thesis

Frontier medical AI is being trained on the reasoning of UK-regulated clinicians — and Britain has no asset, no benchmark, and no sovereign leverage to show for it. We’re building the layer that reverses the leakage: clinician-graded, vendor-neutral, refreshed against live UK guidance, and governed independently of any vendor — including ourselves. The window to define the standard closes when the EU AI Act sandboxes open in August 2026.

02  /  The Vision

Three things the next generation of medical AI evaluation must do that today’s does not.

This is what we believe a serious evaluation substrate looks like — and what we’re building toward, in partnership with senior UK consultants across every major specialty.

02.A / Pillar 01

Disagreement-aware ground truth

A textbook gives one answer. Senior consultants reasoning about a real patient often do not. The most informative signal in clinical reasoning is the shape of expert disagreement — and today's medical AI is trained to throw it away.

Next-generation medical models should be evaluated against the full distribution of senior-clinician answers, not a single canonical reply. Where reasonable clinicians disagree, the model should know to disagree too — and say so.
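A minimal sketch of what scoring against an answer distribution could look like, in Python. Everything here is an illustrative assumption rather than the project's methodology: the function names, the five-consultant panel, the "admit" / "discharge" labels, and the choice of normalised entropy and total variation distance as measures.

```python
from collections import Counter
from math import log


def answer_distribution(panel_answers):
    """Empirical distribution over the answers given by a clinician panel."""
    counts = Counter(panel_answers)
    total = len(panel_answers)
    return {answer: n / total for answer, n in counts.items()}


def disagreement(dist):
    """Normalised Shannon entropy: 0.0 = unanimous, 1.0 = evenly split."""
    if len(dist) < 2:
        return 0.0
    entropy = -sum(p * log(p) for p in dist.values() if p > 0)
    return entropy / log(len(dist))


def total_variation(panel_dist, model_dist):
    """Distance between two answer distributions (0 = identical, 1 = disjoint)."""
    answers = set(panel_dist) | set(model_dist)
    return 0.5 * sum(
        abs(panel_dist.get(a, 0.0) - model_dist.get(a, 0.0)) for a in answers
    )


# Hypothetical panel of five consultants on one scenario (made-up labels).
panel = ["admit", "admit", "admit", "discharge", "discharge"]
panel_dist = answer_distribution(panel)

# A model that commits to one answer vs. one that mirrors the panel's split.
single_answer_model = {"admit": 1.0}
hedged_model = {"admit": 0.6, "discharge": 0.4}

print("panel disagreement:", round(disagreement(panel_dist), 3))
print("single-answer distance:", total_variation(panel_dist, single_answer_model))
print("hedged distance:", total_variation(panel_dist, hedged_model))
```

Under these assumptions, the single-answer model is penalised for ignoring a split that the panel actually showed, while the model that reproduces the split scores a distance of zero.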

Figure: Ground truth · Reasoning. Today (single answer): “The correct answer is…”, with the disagreement signal collapsed to a point. Sovereign (answer distribution): answers from 5 senior consultants (C1 to C5) shown with a disagreement zone, so reasonable disagreement is preserved.

02.B / Pillar 02

Longitudinally calibrated

Clinical guidelines update constantly. A model that was state-of-the-art when NICE NG136 was on v3.2 may be silently unsafe the day v4.0 ships.

Evaluation cannot be a one-shot exercise frozen at a model's launch date. Sovereign AI evaluation tracks live guideline versions, refreshes against them, and surfaces drift — so the score on the box still means something twelve months later.
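One way to picture drift tracking is a repeated evaluation whose runs are tagged with the guideline version in force, compared against the launch baseline. The sketch below is hypothetical: the `EvalRun` record, the 10-percentage-point threshold, and the accuracy figures are made up for illustration, and the version labels simply reuse the ones mentioned above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalRun:
    period: str             # e.g. "Q1"
    guideline_version: str  # guideline version the graders scored against
    correct: int            # scenarios graded as correct
    total: int              # scenarios graded in this run

    @property
    def accuracy(self) -> float:
        return self.correct / self.total


def detect_drift(runs, threshold_pp=10.0):
    """Flag runs whose accuracy fell more than threshold_pp percentage
    points below the first (launch) run."""
    baseline = runs[0].accuracy
    flagged = []
    for run in runs[1:]:
        drop_pp = (baseline - run.accuracy) * 100
        if drop_pp > threshold_pp:
            flagged.append((run.period, run.guideline_version, round(drop_pp, 1)))
    return flagged


# Made-up numbers: accuracy holds until the guideline version changes.
runs = [
    EvalRun("Q1", "v3.2", 90, 100),
    EvalRun("Q2", "v3.2", 89, 100),
    EvalRun("Q3", "v4.0", 62, 100),
]
flagged = detect_drift(runs)
print(flagged)
```

A one-shot evaluation would only ever see the first run; the later drop is visible only because the same scenarios are re-graded after the guideline changes.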

Figure: Calibration · Over time. Model accuracy (0% to 100%) vs UK guideline version, by quarter (Q1 to Q7). After the NICE NG136 v4.0 guideline update, drift is detected: −28pp, missed by a one-shot eval.

02.C / Pillar 03

Built on tacit specialist knowledge

The reasoning a consultant builds across a career — pattern recognition for red flags, instinct on prescribing safety, judgement on consent and capacity — does not appear in any textbook or guideline. It cannot be scraped.

It must be authored. By the consultants who hold it. Captured at scale, anchored to UK practice, and turned into training signal for the next generation of medical LLMs.

Figure: What gets written down · what doesn't. Three overlapping regions, labelled from ARTICULABLE to EXPERIENTIAL: TEXTBOOK (what’s written down), CAPTURED (guidelines · protocols), and TACIT (what 20 years builds).

Models train on the overlap. The rest is what a consultant cannot quite write down.

The Manifesto  /  03

We're building the evaluation substrate the next decade of medical AI will need.

Authored by senior UK consultants. Graded by a verified UK clinician network. Refreshed against live UK guidelines. Governed independently of any vendor — including ourselves.

Figure: three stacked layers. GOVERNANCE (NHS · MHRA · NICE · GMC), EVALUATION (vendor-neutral panel), and SCENARIO CORPUS (authored by consultants).

04.A / Three Layers

A sovereign asset, not a private benchmark.

Three layers, each independent of the layer above it. A lab cannot buy its way onto the public scoreboard. A regulator does not have to take our word that the evaluation is fair.

01

Scenario layer

Real UK clinical decision points, authored by senior consultants and academic clinicians. Reference-linked to the live UK guideline corpus.

02

Evaluation layer

A vendor-neutral panel of frontier APIs, open-weights models, and UK-deployed clinical AI products, graded by the verified UK clinician network.

03

Governance layer

An independent board with NHS, MHRA, NICE, and GMC representation. Versioning, refresh, access tiering, and release sign-off sit here — not inside any vendor.

05  /  Specialty Coverage

Depth-led, not breadth-led.

The first wave covers ten high-leverage specialties, each anchored to a Royal College partner, with enough volume per specialty that inter-rater reliability, demographic stratification, and model comparison can all be estimated with adequate statistical power.

GP · General Practice
EM · Emergency Medicine
RAD · Radiology
ONC · Oncology
PSY · Mental Health
PED · Paediatrics
O&G · Obstetrics & Gynae
ANA · Anaesthesia
AIM · Acute Medicine
RX · Prescribing Safety
01

10 specialties

GP, EM, radiology, oncology, psychiatry, paediatrics, O&G, anaesthesia, acute medicine, prescribing safety.

02

10 Royal College partners

RCGP, RCEM, RCR, RCP/RCPath, RCPsych, RCPCH, RCOG, RCoA, RCP, Royal Pharmaceutical Society.

03

Statistical depth, not surface area

Enough per-specialty volume for peer-reviewed inter-rater reliability papers and meaningful model-difference detection.
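Inter-rater reliability for a panel of more than two graders is commonly summarised with Fleiss' kappa. The sketch below is a plain-Python implementation of that standard statistic. Choosing Fleiss' kappa at all is an assumption for illustration, since the text does not name a reliability measure, and the three-scenario rating table is made up.

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for a table where ratings[i][j] is the number of raters
    who assigned item i to category j. Every item needs the same number of
    raters, and there must be at least two raters."""
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    if n_raters < 2:
        raise ValueError("need at least two raters per item")
    if any(sum(row) != n_raters for row in ratings):
        raise ValueError("every item must be rated by the same number of raters")
    n_categories = len(ratings[0])

    # Share of all assignments that fell into each category.
    p_j = [
        sum(row[j] for row in ratings) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    # Observed agreement on each item, averaged across items.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ]
    p_bar = sum(p_i) / n_items
    # Agreement expected by chance.
    p_e = sum(p * p for p in p_j)

    if p_e == 1.0:
        # Every rating is in a single category: agreement is trivially complete.
        return 1.0
    return (p_bar - p_e) / (1 - p_e)


# Made-up table: 3 scenarios, 5 graders, 2 categories (e.g. safe / unsafe).
example = [
    [5, 0],  # unanimous
    [3, 2],  # split panel
    [0, 5],  # unanimous
]
kappa = fleiss_kappa(example)
print(round(kappa, 3))
```

With only a handful of scenarios per specialty, the estimate moves sharply when a single item changes, which is one reason per-specialty volume matters for a publishable reliability figure.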

06  /  Sovereignty
06.A / Where the value flows

The strategic asset is the dataset — not the labour.

Every week, senior UK clinicians label evaluation data for frontier model trainers. Today, the labour is paid and the dataset compounds in foreign-owned model corpora. Once embedded, you cannot get it back.

Sovereign AI evaluation flips the flow. UK clinical reasoning is captured into a UK-owned public asset first, governed independently, and then made available to the world under sovereign terms. The labour is still paid — but the asset stays.

Today · 100% leaves: UK clinicians → value out → US frontier labs

The dataset itself sits abroad — and compounds there.

Sovereign · Asset retained: UK clinicians → authoring → UK sovereign asset (UK-owned) → licensed → Global AI

UK clinical reasoning becomes a public asset that compounds — captured here, licensed out under sovereign terms.
