Methodology
Evaluation Methodology
How we ensure every clinical AI evaluation is statistically rigorous, reproducible, and trustworthy. From evaluator calibration through to production quality control — every stage is designed to produce reliable results.
Two-Phase
Calibration process
CI-Backed
Every metric
5
Quality control layers
Clinician-Led
All evaluations
Calibration
Two-Phase Calibration
Every evaluator goes through a two-phase calibration process before they can assess your AI system. This ensures their judgements are statistically aligned with expert consensus.
Phase 1: Quick Assessment
A focused 10-task assessment using gold-standard items with known expert consensus. This initial screen identifies evaluators whose clinical judgement aligns with the reference standard and flags those who need additional training before proceeding.
Phase 2: Sequential Stopping
Evaluators continue until we have statistical confidence in their calibration score, up to a maximum of 30 tasks. This adaptive approach is more efficient than fixed-length tests while providing the same statistical guarantees.
Result
Stopped at task 23 (CI narrowed below 5%)
Score
91% ± 3.2%
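To make the stopping rule concrete, here is a minimal sketch in Python. It assumes each gold task yields a continuous calibration score in [0, 1] and stops once the 95% confidence interval's half-width falls below five points, or after 30 tasks; the names and thresholds are illustrative, not our production code.

```python
# Illustrative sketch of a Phase 2 sequential stopping rule (not production
# code). Assumes each gold task yields a score in [0, 1] and stops when the
# 95% CI half-width drops below 5 points, or after 30 tasks.
from math import sqrt
from statistics import mean, stdev

MAX_TASKS = 30
HALF_WIDTH_TARGET = 0.05  # "CI narrowed below 5%"
Z_95 = 1.96               # normal approximation, kept simple on purpose

def run_phase_two(next_task_score):
    """next_task_score() returns the evaluator's score on one more gold task."""
    scores, half_width = [], float("inf")
    for n in range(1, MAX_TASKS + 1):
        scores.append(next_task_score())
        if n >= 2:  # need at least two scores to estimate spread
            half_width = Z_95 * stdev(scores) / sqrt(n)
            if half_width < HALF_WIDTH_TARGET:
                break
    return mean(scores), half_width, len(scores)
```

An evaluator whose per-task scores are consistent stops early, as in the "stopped at task 23" result above; an erratic evaluator runs to the 30-task cap.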
Scoring
Proper Scoring Rules
We use mathematically proper scoring rules that incentivise honest, well-calibrated evaluations rather than gaming strategies.
Brier Score
Measures the accuracy of probabilistic predictions. A proper scoring rule that rewards evaluators for well-calibrated confidence, penalising overconfidence and underconfidence alike.
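As one concrete illustration, here is a minimal Brier computation in Python; the probabilities and outcomes are invented for illustration.

```python
# Illustrative Brier score for binary judgements: the mean squared gap
# between stated probability and actual outcome. Lower is better; a
# constant 50% guess scores 0.25.
def brier_score(probs, outcomes):
    """probs: stated probabilities of correctness; outcomes: 1 correct, 0 not."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)

print(brier_score([0.90, 0.80, 0.95], [1, 1, 1]))  # ≈ 0.018 (well calibrated)
print(brier_score([0.99, 0.90, 0.99], [1, 1, 0]))  # ≈ 0.330 (overconfident)
```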
Log Score
Provides sharper discrimination between good and poor calibration than Brier scores. Particularly useful for detecting evaluators who are systematically overconfident in their clinical assessments.
99% confidence on wrong answer → 10× penalty vs Brier
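A minimal sketch of the contrast: the log score grows without bound as a wrong answer's stated confidence approaches 100%, while the Brier penalty is capped at 1. The exact ratio depends on how the scores are normalised, so treat these numbers as illustrative.

```python
# Illustrative comparison of log score vs Brier on a confidently wrong answer.
from math import log

def log_score(p, outcome):
    """Negative log-likelihood of the outcome; lower is better."""
    return -log(p if outcome == 1 else 1.0 - p)

p, outcome = 0.99, 0          # 99% confident, but wrong
print((p - outcome) ** 2)     # Brier penalty ≈ 0.98 (bounded at 1)
print(log_score(p, outcome))  # log penalty ≈ 4.61 (unbounded as p → 1)
```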
Confidence Intervals
Every calibration score includes statistical confidence intervals. We report lower confidence bounds (LCB) to ensure certification decisions are based on the worst-case plausible performance, not just point estimates.
C-02: 86% point estimate but LCB 80%, below threshold
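One standard way to compute such a bound is the Wilson score interval, sketched below; the item count and the comparison threshold are illustrative assumptions.

```python
# Illustrative 95% Wilson lower confidence bound on an accuracy proportion.
from math import sqrt

def wilson_lcb(successes, n, z=1.96):
    p = successes / n
    centre = p + z**2 / (2 * n)
    margin = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (centre - margin) / (1 + z**2 / n)

# 86% observed accuracy on 130 gold items, but the LCB is what gets
# compared against the certification threshold.
print(round(wilson_lcb(112, 130), 3))  # ≈ 0.792, despite the 86% point estimate
```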
Agreement
Inter-Annotator Agreement
We measure agreement between evaluators to ensure consistency and identify tasks where clinical judgement genuinely differs.
Cohen’s Kappa (κ)
Measures pairwise agreement between two evaluators, correcting for chance agreement. Used for direct comparison tasks and pairwise annotation projects where each item is assessed by exactly two evaluators.
Observed agreement
75% (6/8 items)
Cohen’s κ
0.71 (substantial)
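A minimal sketch of the computation on invented labels follows. The toy data reproduce the 75% (6/8) observed agreement above, but with only two categories chance agreement is high, so κ comes out lower than it would across more categories.

```python
# Illustrative Cohen's kappa for two evaluators over categorical labels.
from collections import Counter

def cohens_kappa(a, b):
    """a, b: equal-length lists of category labels from two evaluators."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    freq_a, freq_b = Counter(a), Counter(b)
    # Chance agreement: probability both pick the same label independently.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / n**2
    return (observed - expected) / (1 - expected)

rater_1 = ["safe", "safe", "unsafe", "safe", "unsafe", "safe", "safe", "unsafe"]
rater_2 = ["safe", "safe", "unsafe", "unsafe", "unsafe", "safe", "safe", "safe"]
print(round(cohens_kappa(rater_1, rater_2), 2))  # 0.47 on this toy data
```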
Fleiss’ Kappa (κ)
Extends agreement measurement to multiple evaluators. Essential for production evaluation tasks where each item is assessed by the same number of evaluators, who need not be the same individuals, and we need to assess the overall reliability of the evaluation process.
Fleiss’ κ
0.62 (substantial)
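A minimal sketch follows, using an invented ratings matrix where each row counts how many of three evaluators assigned an item to each category.

```python
# Illustrative Fleiss' kappa. Each row counts how many of the n evaluators
# assigned an item to each category (n fixed per item; the individuals may
# differ across items).
def fleiss_kappa(counts):
    """counts: per-item category counts, each row summing to n raters."""
    n = sum(counts[0])                       # raters per item
    total = sum(sum(row) for row in counts)  # all ratings
    # Per-item agreement: fraction of rater pairs that agree.
    p_items = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in counts]
    p_bar = sum(p_items) / len(counts)
    # Chance agreement from overall category proportions.
    k = len(counts[0])
    p_cat = [sum(row[j] for row in counts) / total for j in range(k)]
    p_e = sum(p * p for p in p_cat)
    return (p_bar - p_e) / (1 - p_e)

# Four items, three evaluators each; categories: helpful, harmful, unclear.
ratings = [[3, 0, 0], [2, 1, 0], [0, 3, 0], [1, 1, 1]]
print(round(fleiss_kappa(ratings), 2))  # ≈ 0.27 on this toy data
```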
Production
Quality Control in Production
Five layers of quality control ensure evaluation reliability is maintained throughout production work — not just during calibration.
Gold-Standard Injection
Known-answer tasks are randomly injected into production queues. Evaluators don't know which tasks are gold items, ensuring they maintain quality on every response.
Drift Detection
Statistical monitoring detects when an evaluator's performance begins to drift from their calibration baseline, catching fatigue, disengagement, or changing standards early (one possible monitor is sketched after these five layers).
Attention Checks
Strategically placed verification items confirm evaluators are reading and engaging with content rather than pattern-matching or rushing through tasks.
Skip Budget
Evaluators can skip tasks outside their expertise, but within a managed budget. This prevents gaming while allowing honest acknowledgement of knowledge limits.
Justification Quality
Multi-signal analysis of evaluator justifications assesses depth, clinical reasoning, and consistency — catching thin or formulaic explanations that suggest disengagement.
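As one illustration of the drift layer, the sketch below watches gold-item results against a calibration baseline using a rolling one-sided z-test. The baseline, window size, and alert threshold are illustrative assumptions, not our production configuration.

```python
# Illustrative drift monitor: flags when rolling gold-item accuracy falls
# significantly below the evaluator's calibration baseline.
from collections import deque
from math import sqrt

def make_drift_monitor(baseline=0.91, window=50, z_alert=2.0):
    recent = deque(maxlen=window)

    def observe(correct):
        """Feed one gold-item result (1 = correct); returns True on drift."""
        recent.append(correct)
        n = len(recent)
        if n < 20:  # wait for a minimally informative sample
            return False
        rate = sum(recent) / n
        # Standard error under the baseline accuracy (a simple design choice).
        se = sqrt(baseline * (1 - baseline) / n)
        return (baseline - rate) / se > z_alert

    return observe

monitor = make_drift_monitor(baseline=0.91)
# If monitor(result) returns True, the evaluator is routed for review
# before receiving further production work.
```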
Readiness
Readiness Scoring
Before matching evaluators with paid work, we compute a readiness score across three categories, ensuring the right evaluator is matched to the right task (one possible aggregation is sketched after the list).
RLHF Evaluation
Pairwise ranking, text correction, and response assessment — the core tasks of clinical RLHF work.
Safety & Red-Teaming
Adversarial testing across failure mode categories — identifying dangerous outputs before they reach patients.
Data & Annotation
Clinical data labelling, document classification, and structured annotation for AI training datasets.
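To illustrate, a readiness score could be a weighted aggregate of per-category calibration scores, as sketched below. The category keys, weights, and threshold are invented for illustration and are not our production formula.

```python
# Illustrative readiness score: a weighted average over the three task
# categories, gated by a minimum threshold.
READINESS_WEIGHTS = {"rlhf": 0.4, "safety": 0.3, "annotation": 0.3}
READY_THRESHOLD = 0.8

def readiness(scores):
    """scores: per-category calibration scores in [0, 1], keyed as above."""
    total = sum(READINESS_WEIGHTS[c] * scores[c] for c in READINESS_WEIGHTS)
    return total, total >= READY_THRESHOLD

print(readiness({"rlhf": 0.92, "safety": 0.85, "annotation": 0.78}))
# (0.857, True): ready for matching on this toy input
```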
See Our Methodology in Action
Whether you need clinical AI evaluation for your product or want to become a calibrated evaluator — our methodology ensures quality at every stage.