The UK clinical evaluation layer for medical AI
We’re building the Sovereign UK Clinical AI Evaluation Dataset: a clinician-graded, regulator-aligned evidence base for how medical AI performs against UK clinical practice — independent, vendor-neutral, and designed to sit underneath the regulatory framework the MHRA and CERSI-AI are shaping.
Statutorily regulated UK clinicians
<1% of medical AI studies meet rigorous evaluation standards
Aug 2027: EU AI Act high-risk enforcement
Window: Sovereign AI Fund
The frontier of medical AI is being trained right now — on a vanishing window of clinical reasoning.
Whoever sits inside that training loop sets the global standard for clinical AI safety, encodes their values into deployed systems, and earns regulatory pole position when the EU AI Act's high-risk obligations begin to apply to clinical AI in August 2027.
Evaluation is the bottleneck — not data
Frontier labs already have terabytes of medical text. What they lack is high-fidelity signal on whether their model is actually safe at the bedside. That signal can only come from senior clinicians, working under a methodology rigorous enough that regulators can cite the result.
Most medical AI benchmarks do not measure what matters
A 2019 systematic review in The Lancet Digital Health (Liu et al.) screened more than 20,000 medical imaging AI studies and found fewer than 1% were sufficiently rigorous to constitute a trustworthy evaluation of the algorithm. The bar is still being set by exam-style multiple choice and single-rater ground truth — neither of which describes how clinicians actually reason.
The window to build it sovereign is closing
Every week, the world's most regulated medical workforce labels training data that disappears into foreign-owned corpora. The strategic asset is the dataset, not the labour. Once the dataset is embedded in someone else's model weights, you cannot get it back.
Frontier medical AI is being trained on the reasoning of UK-regulated clinicians — and Britain has no asset, no benchmark, and no sovereign leverage to show for it. We’re building the layer that reverses the leakage: clinician-graded, vendor-neutral, refreshed against live UK guidance, and governed independently of any vendor — including ourselves. The window to define the standard closes when the EU AI Act sandboxes open in August 2026.
Three things the next generation of medical AI evaluation must do that today’s does not.
This is what we believe a serious evaluation substrate looks like — and what we’re building toward, in partnership with senior UK consultants across every major specialty.
Disagreement-aware ground truth
A textbook gives one answer. Senior consultants reasoning about a real patient often do not. The most informative signal in clinical reasoning is the shape of expert disagreement — and today's medical AI is trained to throw it away.
Next-generation medical models should be evaluated against the full distribution of senior-clinician answers, not a single canonical reply. Where reasonable clinicians disagree, the model should know to disagree too — and say so.
Figure: a single “The correct answer is…” reply collapses the consultants' disagreement signal to a point.
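One way to make "evaluated against the full distribution" concrete is to compare the model's answer distribution with the panel's vote distribution instead of checking it against one reference answer. The sketch below is illustrative only: the option labels, the vote counts, and the choice of Jensen-Shannon divergence as the distance are assumptions made for the example, not the project's published scoring method.

```python
from collections import Counter
from math import log2


def panel_distribution(votes, options):
    """Turn a list of consultant votes into a probability per option."""
    counts = Counter(votes)
    total = len(votes)
    return {opt: counts.get(opt, 0) / total for opt in options}


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 for identical distributions, 1 for disjoint ones.

    Both arguments must be dicts over the same set of options.
    """
    def kl(a, m):
        # Terms with a[k] == 0 contribute nothing to the sum.
        return sum(a[k] * log2(a[k] / m[k]) for k in a if a[k] > 0)

    mixture = {k: 0.5 * (p[k] + q[k]) for k in p}
    return 0.5 * kl(p, mixture) + 0.5 * kl(q, mixture)


# Hypothetical scenario: six consultants choose a disposition for one patient.
options = ["admit", "discharge", "observe"]
votes = ["admit", "admit", "observe", "admit", "observe", "discharge"]
panel = panel_distribution(votes, options)

# Two hypothetical model outputs: one collapses to a single answer, one hedges.
confident_model = {"admit": 1.0, "discharge": 0.0, "observe": 0.0}
hedged_model = {"admit": 0.5, "discharge": 0.2, "observe": 0.3}

print("confident:", round(js_divergence(panel, confident_model), 3))
print("hedged:   ", round(js_divergence(panel, hedged_model), 3))
```

Under this metric the hedged model scores closer to the panel than the model that commits to the majority answer, which is the behaviour a disagreement-aware benchmark would want to reward.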
Longitudinally calibrated
Clinical guidelines update constantly. A model that was state-of-the-art when NICE NG136 was on v3.2 may be silently unsafe the day v4.0 ships.
Evaluation cannot be a one-shot exercise frozen at a model's launch date. Sovereign AI evaluation tracks live guideline versions, refreshes against them, and surfaces drift — so the score on the box still means something twelve months later.
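As a rough illustration of what tracking live guideline versions could involve, the sketch below records which guideline version each graded scenario was checked against and flags scenarios whose guideline has since moved on. The record fields, the version-string format, and the registry dictionary are all hypothetical; the NG136 version numbers simply reuse the example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GradedScenario:
    scenario_id: str
    guideline_id: str
    graded_against: str  # guideline version in force when clinicians graded it


def parse_version(version):
    """Parse a string such as 'v3.2' or '4.0' into a comparable tuple of ints."""
    return tuple(int(part) for part in version.lstrip("v").split("."))


def stale_scenarios(scenarios, live_versions):
    """Return scenarios graded against an older version than the live one.

    Scenarios whose guideline is missing from the registry are left alone.
    """
    stale = []
    for scenario in scenarios:
        live = live_versions.get(scenario.guideline_id)
        if live is not None and parse_version(live) > parse_version(scenario.graded_against):
            stale.append(scenario)
    return stale


# Hypothetical data: two scenarios graded on v3.2, one already on v4.0.
scenarios = [
    GradedScenario("htn-001", "NG136", "v3.2"),
    GradedScenario("htn-002", "NG136", "v4.0"),
    GradedScenario("htn-003", "NG136", "v3.2"),
]
live_versions = {"NG136": "v4.0"}

for scenario in stale_scenarios(scenarios, live_versions):
    print(f"{scenario.scenario_id}: regrade against {live_versions[scenario.guideline_id]}")
```

A production system would also need to decide whether a version bump actually changes the recommendation a scenario tests; this sketch only surfaces candidates for clinician review.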
Built on tacit specialist knowledge
The reasoning a consultant builds across a career — pattern recognition for red flags, instinct on prescribing safety, judgement on consent and capacity — does not appear in any textbook or guideline. It cannot be scraped.
It must be authored. By the consultants who hold it. Captured at scale, anchored to UK practice, and turned into training signal for the next generation of medical LLMs.
Models train on the overlap. The rest is what a consultant cannot quite write down.
We're building the evaluation substrate the next decade of medical AI will need.
Authored by senior UK consultants. Graded by a verified UK clinician network. Refreshed against live UK guidelines. Governed independently of any vendor — including ourselves.
Corpus
A sovereign asset, not a private benchmark.
Three layers, each independent of the layer above it. A lab cannot buy its way onto the public scoreboard. A regulator does not have to take our word that the evaluation is fair.
Scenario layer
Real UK clinical decision points, authored by senior consultants and academic clinicians. Reference-linked to the live UK guideline corpus.
Evaluation layer
A vendor-neutral panel of frontier APIs, open-weights models, and UK-deployed clinical AI products, graded by the verified UK clinician network.
Governance layer
An independent board with NHS, MHRA, NICE, and GMC representation. Versioning, refresh, access tiering, and release sign-off sit here — not inside any vendor.
Depth-led, not breadth-led.
The first wave covers ten high-leverage specialties, each anchored to a Royal College partner, with enough statistical depth that per-specialty inter-rater reliability, demographic stratification, and model comparison are all adequately powered.
10 specialties
GP, EM, radiology, oncology, psychiatry, paediatrics, O&G, anaesthesia, acute medicine, prescribing safety.
10 Royal College partners
RCGP, RCEM, RCR, RCP/RCPath, RCPsych, RCPCH, RCOG, RCoA, RCP, Royal Pharmaceutical Society.
Statistical depth, not surface area
Enough per-specialty volume for peer-reviewed inter-rater reliability papers and meaningful model-difference detection.
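For readers unfamiliar with inter-rater reliability, the sketch below computes Fleiss' kappa, one standard agreement statistic for several raters assigning categorical grades. It is shown as an assumption about the kind of measure such a paper might report, not as the statistic this programme has committed to, and the grade counts are invented for illustration.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table of rating counts.

    counts[i][j] is the number of raters who put subject i in category j.
    Every subject must be rated by the same number of raters (at least 2).
    """
    n_subjects = len(counts)
    n_categories = len(counts[0])
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each subject needs the same number (>= 2) of ratings")

    # Observed agreement per subject, then averaged over subjects.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_i) / n_subjects

    # Agreement expected by chance from the overall category proportions.
    p_j = [
        sum(row[j] for row in counts) / (n_subjects * n_raters)
        for j in range(n_categories)
    ]
    p_e = sum(p * p for p in p_j)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when every rating is in one category")

    return (p_bar - p_e) / (1 - p_e)


# Hypothetical grades: four clinicians mark five model answers as [safe, unsafe].
grades = [[4, 0], [3, 1], [0, 4], [1, 3], [4, 0]]
print("Fleiss' kappa:", round(fleiss_kappa(grades), 3))
```

Values near 1 indicate strong agreement and values near 0 indicate agreement no better than chance, so a per-specialty kappa needs enough graded scenarios for its confidence interval to be informative.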
The strategic asset is the dataset — not the labour.
Every week, senior UK clinicians label evaluation data for frontier model trainers. Today, the labour is paid and the dataset compounds in foreign-owned model corpora. Once embedded, you cannot get it back.
Sovereign AI evaluation flips the flow. UK clinical reasoning is captured into a UK-owned public asset first, governed independently, and then made available to the world under sovereign terms. The labour is still paid — but the asset stays.
The dataset itself sits abroad — and compounds there.
UK clinical reasoning becomes a public asset that compounds — captured here, licensed out under sovereign terms.