Evaluation Methodology

How we make every clinical AI evaluation statistically rigorous, reproducible, and trustworthy. Every stage, from evaluator calibration through to production quality control, is designed to produce reliable results.

2-Phase

Calibration process

CI-Backed

Every metric

5

Quality control layers

Clinician-Led

All evaluations

01  /  Calibration

Two-phase calibration

Every evaluator goes through a two-phase calibration process before they can assess your AI system. This ensures their judgements are statistically aligned with expert consensus.

01.A / Phase 1 — Quick Screen

An 11-task sequential screen on gold-standard items with known expert consensus. Uses anti-luck guardrails and per-item agreement thresholds to confirm an evaluator's clinical judgement is aligned with the reference standard before they commit to the full calibration. Typically takes ~20 minutes.
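To see why a short screen needs anti-luck guardrails, it helps to estimate how often an evaluator guessing at random would clear a pass mark. The sketch below is illustrative only: the pass mark of 9 out of 11 and the 50% guess rate are assumptions chosen for the example, not the thresholds used in the actual screen, and `chance_pass_probability` is a hypothetical helper.

```python
from math import comb


def chance_pass_probability(n_tasks: int, min_correct: int, guess_rate: float) -> float:
    """Probability that pure guessing yields at least `min_correct` agreements."""
    return sum(
        comb(n_tasks, k) * guess_rate**k * (1 - guess_rate) ** (n_tasks - k)
        for k in range(min_correct, n_tasks + 1)
    )


# Hypothetical numbers: 11 tasks, pass at 9 or more, binary items guessed at 50%.
p_luck = chance_pass_probability(n_tasks=11, min_correct=9, guess_rate=0.5)
print(f"{p_luck:.4f}")  # 0.0327
```

Under these assumed numbers, roughly 3 in 100 random guessers would pass on the total alone, which is the kind of gap that per-item thresholds are meant to close.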

01.B / Phase 2 — Full Calibration

A complete statistical calibration across all eight RLHF task types — rating, comparison, ranking, rubric, correction, annotation, justification, and red-team — running 60 to 90 minutes. Produces the evaluator's reliability score with Beta-Binomial / Bootstrap confidence intervals per metric, and forms the basis of the Reliability Report.

02  /  Scoring

Proper scoring rules

We use mathematically proper scoring rules that incentivise honest, well-calibrated evaluations rather than gaming strategies.

02.A / Brier Score

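Measures the accuracy of probabilistic predictions as the mean squared difference between an evaluator's stated confidence and the actual outcome. Because it is a proper scoring rule, an evaluator gets the best expected score only by reporting what they actually believe, so both overconfidence and underconfidence are penalised.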
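A minimal sketch of the Brier score for binary outcomes, assuming each evaluation is recorded as a stated probability plus a 0/1 outcome. The function name and input layout are illustrative, not part of the platform.

```python
def brier_score(confidences: list[float], outcomes: list[int]) -> float:
    """Mean squared error between stated probabilities and 0/1 outcomes (lower is better)."""
    if len(confidences) != len(outcomes) or not confidences:
        raise ValueError("need equal-length, non-empty inputs")
    return sum((p - y) ** 2 for p, y in zip(confidences, outcomes)) / len(confidences)


# Two reasonably confident, correct judgements: about 0.025.
calibrated = brier_score([0.9, 0.2], [1, 0])

# One certain hit and one certain miss: 0.5.
overconfident = brier_score([1.0, 0.0], [1, 1])
```

A score of 0 is perfect, and always answering 0.5 scores 0.25, which gives a useful reference point when reading results.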

02.B / Log Score

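Penalises confident errors far more heavily than the Brier score, because the penalty grows without bound as the probability assigned to the true outcome approaches zero. This makes it particularly useful for detecting evaluators who are systematically overconfident in their clinical assessments.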
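A minimal sketch of the log score, written here as a mean negative log-likelihood so that lower is better. The clipping constant `eps` is an assumption added to keep the score finite when an evaluator states exactly 0 or 1.

```python
import math


def log_score(confidences: list[float], outcomes: list[int], eps: float = 1e-12) -> float:
    """Mean negative log-likelihood of the observed 0/1 outcomes (lower is better)."""
    if len(confidences) != len(outcomes) or not confidences:
        raise ValueError("need equal-length, non-empty inputs")
    total = 0.0
    for p, y in zip(confidences, outcomes):
        p = min(max(p, eps), 1 - eps)  # clip so log(0) never occurs
        total += -math.log(p if y == 1 else 1 - p)
    return total / len(confidences)


mild_miss = log_score([0.6], [0])        # -ln(0.4), about 0.92
confident_miss = log_score([0.99], [0])  # -ln(0.01), about 4.61
```

The confident miss costs about five times as much as the mild one here, whereas the Brier penalties for the same two predictions (0.36 and 0.98) differ by less than a factor of three.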

02.C / Confidence Intervals

Beta-Binomial intervals for proportions, bootstrap intervals for continuous metrics, and lower confidence bounds (LCB) for certification decisions, so that certification rests on a conservative, worst-case plausible estimate of performance rather than a point estimate alone.
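The two interval types can be sketched with the standard library alone. This is an illustration under stated assumptions rather than the production method: the proportion bound uses a uniform Beta(1, 1) prior (the actual prior is not specified here), and the bootstrap is the simple percentile variant with a fixed seed. All function names are hypothetical.

```python
import random
from math import comb


def beta_cdf_int(x: float, a: int, b: int) -> float:
    """CDF of Beta(a, b) for positive integers a, b, via the binomial-tail identity."""
    n = a + b - 1
    return sum(comb(n, j) * x**j * (1 - x) ** (n - j) for j in range(a, n + 1))


def beta_lower_bound(successes: int, trials: int, confidence: float = 0.95) -> float:
    """One-sided lower bound on a proportion under an assumed uniform Beta(1, 1) prior."""
    a, b = successes + 1, trials - successes + 1
    target = 1 - confidence
    lo, hi = 0.0, 1.0
    for _ in range(60):  # bisection: the CDF is monotone in x
        mid = (lo + hi) / 2
        if beta_cdf_int(mid, a, b) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def bootstrap_interval(values, confidence=0.95, n_resamples=2000, seed=0):
    """Percentile bootstrap interval for the mean of a continuous metric."""
    rng = random.Random(seed)
    n = len(values)
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(n_resamples))
    tail = (1 - confidence) / 2
    return means[int(tail * n_resamples)], means[int((1 - tail) * n_resamples) - 1]


# 9 of 10 gold items correct: point estimate 0.90, lower bound roughly 0.64.
lcb = beta_lower_bound(successes=9, trials=10)
low, high = bootstrap_interval([0.62, 0.71, 0.80, 0.55, 0.90, 0.77, 0.68, 0.74])
```

The gap between the 0.90 point estimate and the much lower bound shows why a certification decision based on the LCB is more conservative when only a few items have been scored.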

03  /  Agreement

Inter-annotator agreement

We measure agreement between evaluators to ensure consistency and identify tasks where clinical judgement genuinely differs.

03.A / Cohen’s Kappa (κ)

Measures pairwise agreement between two evaluators, correcting for chance agreement. Used for direct comparison tasks and pairwise annotation projects where each item is assessed by exactly two evaluators.
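A minimal sketch of Cohen's kappa for two evaluators labelling the same items. The label values are made up for the example, and returning 1.0 when chance agreement is already total is a convention chosen here (kappa is formally undefined in that case).

```python
from collections import Counter


def cohens_kappa(labels_a: list, labels_b: list) -> float:
    """Chance-corrected agreement between two raters over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need equal-length, non-empty label lists")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected if each rater labelled independently at their own base rates.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


kappa = cohens_kappa(
    ["safe", "safe", "unsafe", "unsafe"],
    ["safe", "unsafe", "unsafe", "unsafe"],
)  # observed 0.75, expected 0.50, kappa 0.5
```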

03.B / Fleiss’ Kappa (κ)

Extends agreement measurement to three or more evaluators. Essential for production evaluation tasks where each item is assessed by a fixed number of evaluators, who need not be the same individuals from item to item, and we need to assess the overall reliability of the evaluation process.
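As a minimal sketch, Fleiss' kappa can be computed from a table of per-item category counts. The standard formulation assumes every item receives the same number of ratings, so the hypothetical function below rejects tables where that count varies. The input layout is an assumption for illustration.

```python
def fleiss_kappa(counts: list[list[int]]) -> float:
    """counts[i][j] = number of evaluators who assigned item i to category j."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number of ratings (at least 2)")
    n_categories = len(counts[0])

    # Share of agreeing rater pairs on each item, averaged over items.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement from the overall share of ratings in each category.
    p_category = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_expected = sum(p * p for p in p_category)

    if p_expected == 1.0:
        return 1.0  # only one category was ever used
    return (p_observed - p_expected) / (1 - p_expected)


# Three raters, two categories, two items with unanimous ratings.
perfect = fleiss_kappa([[3, 0], [0, 3]])  # 1.0
```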

04  /  Production

Quality control in production

Five layers of quality control ensure evaluation reliability is maintained throughout production work — not just during calibration.

04.A / Gold-Standard Injection

Known-answer tasks are randomly injected into production queues. Evaluators don't know which tasks are gold items, ensuring they maintain quality on every response.
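One simple way to picture random injection is to choose random slots in the combined queue for the gold items while keeping production tasks in their original order. This is a sketch under that assumption; `inject_gold` and the task identifiers are hypothetical.

```python
import random


def inject_gold(production: list, gold: list, seed=None) -> list:
    """Return a new queue with gold items at random positions; production order is kept."""
    rng = random.Random(seed)
    total = len(production) + len(gold)
    gold_slots = set(rng.sample(range(total), len(gold)))
    shuffled_gold = rng.sample(gold, len(gold))  # random order, input list untouched
    prod_iter, gold_iter = iter(production), iter(shuffled_gold)
    return [
        next(gold_iter) if i in gold_slots else next(prod_iter)
        for i in range(total)
    ]


queue = inject_gold(["t1", "t2", "t3", "t4", "t5"], ["g1", "g2"], seed=42)
```

For the blind to hold, whatever marks an item as gold would have to stay server-side and never reach the evaluator's interface.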

04.B / Drift Detection

Statistical monitoring detects when an evaluator's performance begins to drift from their calibration baseline — catching fatigue, disengagement, or changing standards early.
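The specific monitoring statistic is not stated here, so the sketch below swaps in one plausible choice: a one-sided exact binomial test that asks how surprising an evaluator's recent gold-task accuracy would be if their calibration baseline still held. The 0.05 threshold, the window of 20 tasks, and the function names are assumptions.

```python
from math import comb


def drift_p_value(correct: int, total: int, baseline_rate: float) -> float:
    """P(X <= correct) for X ~ Binomial(total, baseline_rate)."""
    return sum(
        comb(total, k) * baseline_rate**k * (1 - baseline_rate) ** (total - k)
        for k in range(correct + 1)
    )


def has_drifted(correct: int, total: int, baseline_rate: float, alpha: float = 0.05) -> bool:
    """Flag drift when recent accuracy is implausibly low under the baseline."""
    return drift_p_value(correct, total, baseline_rate) < alpha


# Baseline accuracy 0.90, most recent 20 gold tasks.
flag_low = has_drifted(correct=14, total=20, baseline_rate=0.90)  # True, p is about 0.011
flag_ok = has_drifted(correct=18, total=20, baseline_rate=0.90)   # False
```

Running a check like this repeatedly on overlapping windows inflates false alarms, so a real monitor would need some correction for repeated testing or a sequential method.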

04.C / Attention Checks

Strategically placed verification items confirm evaluators are reading and engaging with content rather than pattern-matching or rushing through tasks.

04.D / Skip Budget

Evaluators can skip tasks outside their expertise, but within a managed budget. This prevents gaming while allowing honest acknowledgement of knowledge limits.
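A managed budget could be as simple as a per-batch allowance. The sketch below assumes a fixed batch size and a maximum skip rate of 10%; both numbers and the `SkipBudget` class are hypothetical, since the actual budget rules are not given here.

```python
from dataclasses import dataclass


@dataclass
class SkipBudget:
    batch_size: int
    max_skip_rate: float = 0.10  # assumed value, for illustration only
    skips_used: int = 0

    @property
    def allowance(self) -> int:
        """Whole number of skips permitted in this batch."""
        return int(self.batch_size * self.max_skip_rate)

    def try_skip(self) -> bool:
        """Consume one skip if any remain; return whether the skip was allowed."""
        if self.skips_used >= self.allowance:
            return False
        self.skips_used += 1
        return True


budget = SkipBudget(batch_size=20)
results = [budget.try_skip() for _ in range(3)]  # [True, True, False]
```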

04.E / Justification Quality

Multi-signal analysis of evaluator justifications assesses depth, clinical reasoning, and consistency — catching thin or formulaic explanations that suggest disengagement.

05  /  The Deliverable

Every engagement ends in a Reliability Report.

A procurement-grade artifact you can take to clinical safety reviews, regulators, and compliance teams. Issued as both a human-readable PDF and a machine-readable JSON — with the full methodology referenced throughout.

PDF + JSON

Human-readable for safety reviews; machine-readable for ingestion into your QMS or evidence pipeline.

Per-metric CIs

Beta-Binomial intervals for proportions; bootstrap intervals for continuous scores. Lower confidence bounds reported on every metric.

Readiness scores

Overall, safety-critical, and annotation-heavy readiness — each backed by the underlying calibration data.

Audit trail

Evaluator IDs, gold-task agreement, attention-check history, and methodology version recorded against every result.

06  /  Readiness

Readiness scoring

Before matching evaluators with paid work, we compute a readiness score across three categories — ensuring the right evaluator is matched to the right task.

06.A / Ranking & Correction

Pairwise ranking, text correction, and response assessment — the core RLHF task families.

06.B / Safety & Red-Teaming

Adversarial testing across failure mode categories — identifying dangerous outputs before they reach patients.

06.C / Data & Annotation

Clinical data labelling, document classification, and structured annotation for AI training datasets.
