Evaluation Methodology
How we ensure every clinical AI evaluation is statistically rigorous, reproducible, and trustworthy. From evaluator calibration through to production quality control — every stage is designed to produce reliable results.
2-Phase
Calibration process
CI-Backed
Every metric
5
Quality control layers
Clinician-Led
All evaluations
Two-phase calibration
Every evaluator goes through a two-phase calibration process before they can assess your AI system. This ensures their judgements are statistically aligned with expert consensus.
Phase 1 — Quick Screen
An 11-task sequential screen on gold-standard items with known expert consensus. Uses anti-luck guardrails and per-item agreement thresholds to confirm an evaluator's clinical judgement is aligned with the reference standard before they commit to the full calibration. Typically takes ~20 minutes.
Result
Stopped at task 23 — CI narrowed below 5%
Score
91% ± 3.2%
Phase 2 — Full Calibration
A complete statistical calibration across all eight RLHF task types (rating, comparison, ranking, rubric, correction, annotation, justification, and red-team), running 60 to 90 minutes. It produces the evaluator's reliability score, with Beta-Binomial or bootstrap confidence intervals per metric, and forms the basis of the Reliability Report.
Proper scoring rules
We use mathematically proper scoring rules that incentivise honest, well-calibrated evaluations rather than gaming strategies.
Brier Score
Measures the accuracy of probabilistic predictions. A proper scoring rule that rewards evaluators for well-calibrated confidence — penalising both overconfidence and underconfidence equally.
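As a rough illustration of the idea, the sketch below computes a Brier score for binary judgements. The function name and the sample data are assumptions made for this example, not the platform's implementation.

```python
def brier_score(confidences, outcomes):
    """Mean squared gap between stated confidence and what actually happened.

    confidences: probabilities in [0, 1] that the evaluator's answer is correct.
    outcomes:    1 if the answer matched the gold standard, else 0.
    Lower is better; 0.0 is a perfect score.
    """
    if len(confidences) != len(outcomes) or not confidences:
        raise ValueError("need two non-empty sequences of equal length")
    total = sum((p - y) ** 2 for p, y in zip(confidences, outcomes))
    return total / len(confidences)


# Hypothetical evaluator who is right on three of four items (75% accuracy).
outcomes = [1, 1, 1, 0]
calibrated = brier_score([0.75] * 4, outcomes)      # confidence matches accuracy
overconfident = brier_score([0.99] * 4, outcomes)   # confidence too high
underconfident = brier_score([0.51] * 4, outcomes)  # confidence too low
```

With these made-up inputs the calibrated evaluator gets the lowest (best) score, and both miscalibrated variants score worse, which is the behaviour a proper scoring rule is meant to produce.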
Log Score
Provides sharper discrimination between good and poor calibration than Brier scores. Particularly useful for detecting evaluators who are systematically overconfident in their clinical assessments.
Example: 99% confidence on a wrong answer → 10× penalty vs Brier
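To make the contrast with the Brier score concrete, here is a small sketch that compares the two penalties for a single wrong answer at increasing confidence. It assumes the natural logarithm and the one-outcome form of the Brier penalty; the exact multiple between the two depends on the log base and on which Brier variant is used, so the numbers this sketch prints are not meant to reproduce any specific figure.

```python
import math


def log_penalty(confidence, correct):
    """Negative natural-log likelihood of the outcome that actually occurred."""
    p = confidence if correct else 1.0 - confidence
    p = min(max(p, 1e-12), 1.0)  # clip so a 100%-confident miss stays finite
    return -math.log(p)


def brier_penalty(confidence, correct):
    """Squared gap between stated confidence and the 0/1 outcome."""
    y = 1.0 if correct else 0.0
    return (confidence - y) ** 2


# The Brier penalty is capped at 1.0; the log penalty keeps growing as
# confidence in a wrong answer approaches 100%.
for conf in (0.70, 0.90, 0.99):
    lp = log_penalty(conf, correct=False)
    bp = brier_penalty(conf, correct=False)
    print(f"confidence {conf:.2f} on a wrong answer: log {lp:.2f}, Brier {bp:.2f}")
```

The unbounded growth of the log penalty is what makes it sensitive to systematic overconfidence.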
Confidence Intervals
Beta-Binomial intervals for proportions, bootstrap intervals for continuous metrics, and lower confidence bounds (LCB) for certification decisions, so every score reflects worst-case plausible performance, not just point estimates.
Example: C-02: 86% point estimate but LCB 80%, below threshold
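A minimal sketch of both interval types follows. It assumes a uniform Beta(1, 1) prior for the proportion, estimates the Beta quantile by sampling so that only the standard library is needed, and uses a percentile bootstrap for the mean. The counts, scores, and the 0.85 threshold are hypothetical values chosen for illustration.

```python
import random


def beta_lower_bound(successes, trials, level=0.95, draws=20000, seed=0):
    """One-sided lower bound for a proportion from a Beta posterior.

    Assumes a uniform Beta(1, 1) prior, so the posterior is
    Beta(successes + 1, failures + 1). The quantile is estimated by
    sampling to keep this sketch dependency-free.
    """
    rng = random.Random(seed)
    a, b = successes + 1, trials - successes + 1
    samples = sorted(rng.betavariate(a, b) for _ in range(draws))
    return samples[int((1.0 - level) * draws)]


def bootstrap_interval(values, level=0.95, resamples=5000, seed=0):
    """Percentile bootstrap interval for the mean of a continuous metric."""
    rng = random.Random(seed)
    n = len(values)
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(resamples))
    tail = (1.0 - level) / 2.0
    return means[int(tail * resamples)], means[int((1.0 - tail) * resamples) - 1]


# Hypothetical evaluator: 86 of 100 gold items correct.
lcb = beta_lower_bound(successes=86, trials=100)
passes = lcb >= 0.85  # hypothetical certification threshold

# Hypothetical continuous scores, e.g. rubric ratings scaled to [0, 1].
low, high = bootstrap_interval([0.62, 0.71, 0.80, 0.68, 0.75, 0.90, 0.58, 0.77])
```

Deciding on the lower bound rather than the point estimate means an evaluator with a good headline score but too little evidence behind it does not pass.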
Inter-annotator agreement
We measure agreement between evaluators to ensure consistency and identify tasks where clinical judgement genuinely differs.
Observed agreement
75% (6/8 items)
Cohen's κ
0.71 — Substantial
Cohen’s Kappa (κ)
Measures pairwise agreement between two evaluators, correcting for chance agreement. Used for direct comparison tasks and pairwise annotation projects where each item is assessed by exactly two evaluators.
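The chance correction can be sketched in a few lines. The labels below are invented for the example; the function follows the standard definition κ = (p_o − p_e) / (1 − p_e), where p_o is observed agreement and p_e is the agreement expected from each evaluator's label frequencies.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: probability both raters pick the same label
    # if each labelled at random with their own label frequencies.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical two-category task: the raters agree on 6 of 8 items.
a = ["safe", "safe", "unsafe", "unsafe", "safe", "unsafe", "safe", "unsafe"]
b = ["safe", "safe", "unsafe", "unsafe", "safe", "unsafe", "unsafe", "safe"]
kappa = cohens_kappa(a, b)  # 0.5: observed 0.75, chance 0.5
```

The same observed agreement can yield a different κ when the label distribution changes, because chance agreement depends on how many categories are in use and how evenly they are used.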
Categories
Fleiss' κ
0.62 — Substantial
Fleiss’ Kappa (κ)
Extends agreement measurement to more than two evaluators. Essential for production evaluation tasks where each item is assessed by the same number of evaluators (not necessarily the same individuals), and we need to assess the overall reliability of the evaluation process.
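A compact sketch of the standard Fleiss' kappa calculation is shown below. Note that the standard formula assumes every item receives the same number of ratings; the count table is a made-up example with four items, three ratings each, and two categories.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa from a table of rating counts.

    counts[i][j] is the number of raters who assigned item i to category j.
    Every row must sum to the same number of ratings (at least two).
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number (>= 2) of ratings")
    n_cats = len(counts[0])
    # Share of all ratings that fell into each category.
    p_cat = [sum(row[j] for row in counts) / (n_items * n_raters)
             for j in range(n_cats)]
    # Agreement within each item: fraction of rater pairs that agree.
    p_item = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
              for row in counts]
    p_bar = sum(p_item) / n_items
    p_exp = sum(p * p for p in p_cat)
    if p_exp == 1.0:
        return 1.0  # every rating fell into a single category
    return (p_bar - p_exp) / (1.0 - p_exp)


# Hypothetical table: rows are items, columns are [category A, category B].
table = [[3, 0], [0, 3], [2, 1], [3, 0]]
kappa = fleiss_kappa(table)  # 0.625
```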
Quality control in production
Five layers of quality control ensure evaluation reliability is maintained throughout production work — not just during calibration.
Gold-Standard Injection
Known-answer tasks are randomly injected into production queues. Evaluators don't know which tasks are gold items, ensuring they maintain quality on every response.
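One way such injection could work is sketched below. The function names, the 10% injection rate, and the queue structure are all hypothetical; the point is only that the gold flag is kept on the server side and stripped from what the evaluator sees.

```python
import random


def inject_gold(production_tasks, gold_tasks, gold_rate=0.1, seed=None):
    """Return a shuffled queue with gold items mixed in at roughly gold_rate."""
    rng = random.Random(seed)
    n_gold = max(1, round(len(production_tasks) * gold_rate))
    chosen = rng.sample(gold_tasks, min(n_gold, len(gold_tasks)))
    queue = [{"task": t, "is_gold": False} for t in production_tasks]
    queue += [{"task": t, "is_gold": True} for t in chosen]
    rng.shuffle(queue)  # gold items land at unpredictable positions
    return queue


def evaluator_view(queue):
    """Strip the gold flag so gold and production tasks look identical."""
    return [entry["task"] for entry in queue]


# Hypothetical task identifiers.
production = [f"prod-{i}" for i in range(20)]
gold = [f"gold-{i}" for i in range(5)]
queue = inject_gold(production, gold, gold_rate=0.1, seed=1)
visible = evaluator_view(queue)
```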
Drift Detection
Statistical monitoring detects when an evaluator's performance begins to drift from their calibration baseline — catching fatigue, disengagement, or changing standards early.
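As an illustration only, drift could be flagged with a simple one-sided z-test that compares recent gold-item accuracy against the calibration baseline. The thresholds and names here are assumptions; a production monitor might use a different technique, such as a CUSUM chart.

```python
import math


def drift_alert(baseline_acc, recent_results, z_threshold=2.0, min_items=10):
    """Flag a drop in gold-item accuracy relative to the calibration baseline.

    baseline_acc:   accuracy measured during calibration, in (0, 1).
    recent_results: 1/0 outcomes on the most recent gold items.
    Returns True when the drop is larger than chance alone would explain.
    """
    n = len(recent_results)
    if n < min_items:
        return False  # too little evidence to judge either way
    recent_acc = sum(recent_results) / n
    # Standard error of the accuracy if the evaluator were still at baseline.
    se = math.sqrt(baseline_acc * (1.0 - baseline_acc) / n)
    if se == 0.0:
        return recent_acc < baseline_acc
    z = (baseline_acc - recent_acc) / se
    return z > z_threshold


# Hypothetical evaluator calibrated at 90% accuracy.
steady = drift_alert(0.90, [1] * 18 + [0] * 2)   # 90% recent accuracy
drifting = drift_alert(0.90, [1] * 14 + [0] * 6)  # 70% recent accuracy
```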
Attention Checks
Strategically placed verification items confirm evaluators are reading and engaging with content rather than pattern-matching or rushing through tasks.
Skip Budget
Evaluators can skip tasks outside their expertise, but within a managed budget. This prevents gaming while allowing honest acknowledgement of knowledge limits.
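A skip budget can be modelled as a cap on the share of tasks an evaluator may skip. The sketch below is one possible design under assumed values (a 10% cap with at least one skip always allowed), not a description of the actual rules.

```python
from dataclasses import dataclass


@dataclass
class SkipBudget:
    max_skip_rate: float = 0.10  # hypothetical cap: 10% of tasks seen
    min_allowance: int = 1       # always allow at least this many skips
    seen: int = 0
    skipped: int = 0

    def allowance(self):
        """Total skips permitted once the current task is counted."""
        return max(self.min_allowance, int(self.max_skip_rate * (self.seen + 1)))

    def can_skip(self):
        return self.skipped < self.allowance()

    def record(self, skip):
        """Register the evaluator's decision on the current task."""
        if skip and not self.can_skip():
            raise ValueError("skip budget exhausted")
        self.seen += 1
        self.skipped += int(skip)


budget = SkipBudget()
budget.record(skip=True)          # first skip uses the minimum allowance
blocked = not budget.can_skip()   # no further skips until more work is done
```

Because the allowance grows with the number of tasks seen, an evaluator earns further skips by completing work, which keeps honest skipping possible without making it a way to avoid hard tasks.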
Justification Quality
Multi-signal analysis of evaluator justifications assesses depth, clinical reasoning, and consistency — catching thin or formulaic explanations that suggest disengagement.
Every engagement ends in a Reliability Report.
A procurement-grade artifact you can take to clinical safety reviews, regulators, and compliance teams. Issued as both a human-readable PDF and a machine-readable JSON — with the full methodology referenced throughout.
PDF + JSON
Human-readable for safety reviews; machine-readable for ingestion into your QMS or evidence pipeline.
Per-metric CIs
Beta-Binomial intervals for proportions; bootstrap intervals for continuous scores. Lower confidence bounds reported on every metric.
Readiness scores
Overall, safety-critical, and annotation-heavy readiness — each backed by the underlying calibration data.
Audit trail
Evaluator IDs, gold-task agreement, attention-check history, and methodology version recorded against every result.
Readiness scoring
Before matching evaluators with paid work, we compute a readiness score across three categories — ensuring the right evaluator is matched to the right task.
Ranking & Correction
Pairwise ranking, text correction, and response assessment — the core RLHF task families.
Safety & Red-Teaming
Adversarial testing across failure mode categories — identifying dangerous outputs before they reach patients.
Data & Annotation
Clinical data labelling, document classification, and structured annotation for AI training datasets.