Evaluation Methodology

How we ensure every clinical AI evaluation is statistically rigorous, reproducible, and trustworthy. Every stage, from evaluator calibration through to production quality control, is designed to produce reliable results.

2-Phase calibration process
CI-backed: every metric
5 quality control layers
Clinician-led: all evaluations

01 / Calibration

Two-phase calibration

Every evaluator goes through a two-phase calibration process before they can assess your AI system. This ensures their judgements are statistically aligned with expert consensus.

01.A / Phase 1 — Quick Screen

Gold-Standard Screen (11 tasks)

G-01   Drug interaction check
G-02   Dosage boundary test
G-03   Contraindication ID
G-04   Guideline alignment
G-05   Scope boundary
G-06   Risk severity rating
G-07   Triage priority
G-08   Lab interpretation
G-09   Side effect flagging
G-10   Clinical summary

Expert alignment: 9/10 (Pass)

An 11-task sequential screen on gold-standard items with known expert consensus. Uses anti-luck guardrails and per-item agreement thresholds to confirm an evaluator's clinical judgement is aligned with the reference standard before they commit to the full calibration. Typically takes ~20 minutes.
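One way to reason about an anti-luck guardrail is to ask how often an evaluator who is simply guessing would clear the screen. The sketch below is illustrative only: it assumes a pass mark of 9 agreements out of 10 scored items (as in the panel above), and the per-item guess rates and the function name are hypothetical, not part of the documented methodology.

```python
from math import comb

def chance_pass_probability(n_items: int, pass_mark: int, p_guess: float) -> float:
    """Probability that a pure guesser agrees on at least pass_mark of n_items."""
    return sum(
        comb(n_items, k) * p_guess**k * (1 - p_guess) ** (n_items - k)
        for k in range(pass_mark, n_items + 1)
    )

# Hypothetical guess rates: a three-way choice (1/3) and a binary choice (1/2).
for p_guess in (1 / 3, 1 / 2):
    prob = chance_pass_probability(n_items=10, pass_mark=9, p_guess=p_guess)
    print(f"guess rate {p_guess:.2f}: chance of passing by luck = {prob:.5f}")
```

Under these assumptions a binary guesser passes about 1% of the time, which is why a screen might pair a high pass mark with per-item thresholds rather than rely on the total alone.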

01.B / Phase 2 — Full Calibration

Sequential Analysis (Converged)

Tasks completed   Score   CI half-width
5                 72%     ±18%
10                81%     ±12%
15                86%     ±8%
20                89%     ±5%
23                91%     ±3.2%

Result: stopped at task 23, CI narrowed below 5%
Score: 91% ± 3.2%

A complete statistical calibration across all eight RLHF task types — rating, comparison, ranking, rubric, correction, annotation, justification, and red-team — running 60 to 90 minutes. Produces the evaluator's reliability score with Beta-Binomial / Bootstrap confidence intervals per metric, and forms the basis of the Reliability Report.
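The idea of stopping once the interval is narrow enough can be sketched as follows. This is a minimal illustration, not the production procedure: it assumes a uniform Beta(1, 1) prior on a simple agree/disagree proportion, approximates the posterior quantiles by Monte Carlo sampling, and uses hypothetical names (`beta_interval`, `run_until_converged`) and thresholds.

```python
import random

def beta_interval(successes, trials, level=0.95, draws=10000, seed=0):
    """Approximate equal-tailed interval for a proportion.

    Assumes a uniform Beta(1, 1) prior, so the posterior is
    Beta(successes + 1, failures + 1); quantiles are read off sorted draws.
    """
    rng = random.Random(seed)
    a = successes + 1
    b = trials - successes + 1
    samples = sorted(rng.betavariate(a, b) for _ in range(draws))
    lo = samples[int((1 - level) / 2 * draws)]
    hi = samples[int((1 + level) / 2 * draws) - 1]
    return lo, hi

def run_until_converged(outcomes, max_half_width=0.05, min_tasks=5):
    """Score tasks in order and stop once the interval half-width is small."""
    if not outcomes:
        raise ValueError("need at least one scored task")
    successes = 0
    for n, agreed in enumerate(outcomes, start=1):
        successes += int(agreed)
        lo, hi = beta_interval(successes, n)
        if n >= min_tasks and (hi - lo) / 2 < max_half_width:
            break
    return n, successes / n, (lo, hi)

# Hypothetical record: 1 = agreed with expert consensus, 0 = did not.
record = [1] * 60
n, score, (lo, hi) = run_until_converged(record)
print(n, round(score, 2), round(lo, 3), round(hi, 3))
```

The stopping point depends heavily on the interval model and threshold chosen, so the numbers this sketch produces are not expected to match the illustrative panel above.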

02 / Scoring

Proper scoring rules

We use mathematically proper scoring rules that incentivise honest, well-calibrated evaluations rather than gaming strategies.

02.A / Brier Score

Brier Score (0 = perfect)

Evaluator   Confidence → accuracy   Brier score
E-01        90% → 85%               0.12
E-02        70% → 72%               0.08
E-03        95% → 60%               0.42
E-04        80% → 78%               0.09

E-03 flagged: Overconfident

Measures the accuracy of probabilistic predictions. A proper scoring rule that rewards evaluators for well-calibrated confidence — penalising both overconfidence and underconfidence equally.
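As a minimal sketch of the calculation, the Brier score is the mean squared gap between the stated probability and the 0/1 outcome. The evaluator data below is hypothetical and chosen only to show that honest confidence scores better than inflated confidence.

```python
def brier_score(confidences, outcomes):
    """Mean squared difference between stated probability and the 0/1 outcome."""
    if len(confidences) != len(outcomes) or not confidences:
        raise ValueError("need equally sized, non-empty inputs")
    return sum((p - o) ** 2 for p, o in zip(confidences, outcomes)) / len(confidences)

# Hypothetical evaluator who is right on 3 of 4 items (true hit rate 75%).
outcomes = [1, 1, 1, 0]
honest = brier_score([0.75] * 4, outcomes)    # reports the true hit rate
inflated = brier_score([0.90] * 4, outcomes)  # overstates confidence
print(round(honest, 4), round(inflated, 4))
```

Lower is better, so the evaluator who reports 75% scores better than the one who reports 90% on the same answers. This is the sense in which the rule rewards calibrated confidence.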

02.B / Log Score

Log Score (sharp penalty)

Calibration       Confidence   Log score
Well-calibrated   80%          -0.22
Slightly over     92%          -0.51
Overconfident     99%          -2.30

99% confidence on wrong answer → 10× penalty vs Brier

Provides sharper discrimination between good and poor calibration than Brier scores. Particularly useful for detecting evaluators who are systematically overconfident in their clinical assessments.
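A minimal sketch of the log score for a single judgement, assuming the natural logarithm and a small clipping constant (`eps`, a hypothetical safeguard) so that a stated probability of exactly 0 or 1 does not produce an undefined value:

```python
from math import log

def log_score(confidence, correct, eps=1e-12):
    """Natural log of the probability assigned to what actually happened."""
    p = confidence if correct else 1.0 - confidence
    return log(min(max(p, eps), 1.0))

def brier_penalty(confidence, correct):
    """Squared error for one judgement, shown for comparison."""
    return (confidence - (1.0 if correct else 0.0)) ** 2

print(round(log_score(0.80, True), 2))       # 80% confident and right
print(round(log_score(0.99, False), 2))      # 99% confident and wrong
print(round(brier_penalty(0.99, False), 2))  # same judgement under Brier
```

The squared-error penalty for one judgement can never exceed 1, whereas the log penalty grows without bound as confidence in a wrong answer approaches 100%. That unbounded growth is what makes the log score the sharper detector of overconfidence.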

02.C / Confidence Intervals

95% CI (LCB ≥ 85%)

ID     Score   95% CI    Result
C-01   91%     87–95%    Pass
C-02   86%     80–92%    Fail
C-03   93%     90–96%    Pass

C-02: 86% point estimate but LCB 80%, below threshold

Beta-Binomial intervals for proportions, bootstrap intervals for continuous metrics, and lower confidence bounds (LCB) for certification decisions — so every score reflects worst-case plausible performance, not just point estimates.
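The sketch below illustrates two of the pieces described here: a percentile bootstrap interval for a continuous metric, and a certification decision made on the lower bound rather than the point estimate. The function names, the resample count, and the sample scores are assumptions for illustration; the 85% threshold is taken from the panel above.

```python
import random

def bootstrap_interval(values, level=0.95, resamples=5000, seed=0):
    """Percentile bootstrap interval for the mean of a continuous metric."""
    rng = random.Random(seed)
    n = len(values)
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(resamples))
    lo = means[int((1 - level) / 2 * resamples)]
    hi = means[int((1 + level) / 2 * resamples) - 1]
    return lo, hi

def certify(lower_bound, threshold=0.85):
    """Pass only if the lower confidence bound clears the threshold."""
    return lower_bound >= threshold

# Lower bounds from the panel above: C-01, C-02, C-03.
for name, lcb in (("C-01", 0.87), ("C-02", 0.80), ("C-03", 0.90)):
    print(name, "Pass" if certify(lcb) else "Fail")

# Hypothetical per-task rubric scores for one evaluator.
scores = [0.80, 0.90, 0.85, 0.95, 0.90, 0.88, 0.92, 0.86]
print(bootstrap_interval(scores))
```

Deciding on the lower bound means an evaluator with a good point estimate but few or noisy observations does not pass until more evidence narrows the interval.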

03 / Agreement

Inter-annotator agreement

We measure agreement between evaluators to ensure consistency and identify tasks where clinical judgement genuinely differs.

03.A / Cohen’s Kappa (κ)

Pairwise Agreement (2 evaluators)

Item   Eval A       Eval B       Match
#1     Safe         Safe         Yes
#2     Unsafe       Unsafe       Yes
#3     Safe         Borderline   No
#4     Unsafe       Unsafe       Yes
#5     Borderline   Borderline   Yes
#6     Safe         Safe         Yes
#7     Unsafe       Safe         No
#8     Safe         Safe         Yes

Observed agreement: 75% (6/8 items)
Cohen's κ: 0.60 (Moderate)
Scale: 0 (chance) to 1 (perfect)

Measures pairwise agreement between two evaluators, correcting for chance agreement. Used for direct comparison tasks and pairwise annotation projects where each item is assessed by exactly two evaluators.
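The chance correction can be made concrete with a short sketch. It uses the standard unweighted Cohen's κ formula, (observed − expected) / (1 − expected), applied to the eight labels from the panel above; the function name is illustrative.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Unweighted Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need equally sized, non-empty label lists")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: product of each rater's marginal rate, per category.
    expected = sum(count_a[c] * count_b[c] for c in count_a) / n**2
    return (observed - expected) / (1 - expected)

eval_a = ["Safe", "Unsafe", "Safe", "Unsafe", "Borderline", "Safe", "Unsafe", "Safe"]
eval_b = ["Safe", "Unsafe", "Borderline", "Unsafe", "Borderline", "Safe", "Safe", "Safe"]
print(round(cohens_kappa(eval_a, eval_b), 2))
```

For these eight items, observed agreement is 0.75 and chance agreement is 0.375, so κ evaluates to 0.60. The formula is undefined when chance agreement equals 1 (both raters always use the same single label), a case this sketch does not handle.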

03.B / Fleiss’ Kappa (κ)

Multi-Rater Agreement (4 evaluators)

Item   R1   R2   R3   R4   Agree
1      A    A    A    B    83%
2      B    B    B    B    100%
3      A    B    A    A    83%
4      C    C    B    C    83%
5      A    B    C    A    50%
6      B    B    B    A    83%

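Extends agreement measurement to three or more evaluators. Essential for production evaluation tasks where each item is assessed by several evaluators drawn from a larger pool, and we need to assess the overall reliability of the evaluation process. The standard formula assumes the same number of ratings per item, although the individual evaluators can differ from item to item.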
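A minimal sketch of the classic Fleiss' κ calculation, applied to the six rows of ratings from the panel above. Note that the classic formula requires every item to carry the same number of ratings, so the sketch rejects ragged input; the function name is illustrative.

```python
from collections import Counter

def fleiss_kappa(ratings):
    """Fleiss' kappa; ratings is a list of items, each a list of labels."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")
    totals = Counter()
    per_item = []
    for item in ratings:
        counts = Counter(item)
        totals.update(counts)
        # Share of rater pairs on this item that agree with each other.
        agreeing_pairs = sum(c * (c - 1) for c in counts.values())
        per_item.append(agreeing_pairs / (n_raters * (n_raters - 1)))
    p_bar = sum(per_item) / n_items
    p_e = sum((t / (n_items * n_raters)) ** 2 for t in totals.values())
    return (p_bar - p_e) / (1 - p_e)

rows = [list(r) for r in ("AAAB", "BBBB", "ABAA", "CCBC", "ABCA", "BBBA")]
print(round(fleiss_kappa(rows), 3))
```

The per-item value here is the share of agreeing rater pairs, which is a different quantity from the "Agree" percentages shown in the panel, so the two should not be compared directly.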

04 / Production

Quality control in production

Five layers of quality control ensure evaluation reliability is maintained throughout production work — not just during calibration.

04.A / Gold-Standard Injection

Known-answer tasks are randomly injected into production queues. Evaluators don't know which tasks are gold items, ensuring they maintain quality on every response.
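Random injection can be sketched as below. This is an illustration under stated assumptions: the 10% gold rate, the function name, and the ID formats are hypothetical, and in practice gold items would carry IDs that are indistinguishable from production ones.

```python
import random

def build_queue(production_ids, gold_ids, gold_rate=0.1, seed=None):
    """Mix a random subset of gold items into a production queue.

    Returns the shuffled queue plus the set of gold IDs, which stays
    server-side and is never shown to the evaluator.
    """
    rng = random.Random(seed)
    n_gold = min(len(gold_ids), round(gold_rate * len(production_ids)))
    chosen = rng.sample(list(gold_ids), n_gold)
    queue = list(production_ids) + chosen
    rng.shuffle(queue)
    return queue, set(chosen)

# Readable prefixes are used here only so the example is easy to follow.
production = [f"P-{i}" for i in range(50)]
gold = [f"G-{i}" for i in range(10)]
queue, answer_key = build_queue(production, gold, seed=1)
print(len(queue), len(answer_key))
```

Shuffling after the merge means gold items have no fixed position or spacing that an evaluator could learn to anticipate.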

04.B / Drift Detection

Statistical monitoring detects when an evaluator's performance begins to drift from their calibration baseline — catching fatigue, disengagement, or changing standards early.
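The page does not name the monitoring statistic, so the sketch below uses a one-sided CUSUM on gold-task outcomes as one plausible choice. The `slack` and `threshold` values and the function name are hypothetical tuning assumptions, not documented settings.

```python
def cusum_drift(outcomes, baseline, slack=0.05, threshold=2.0):
    """Return the 1-based task index where downward drift is flagged, else None.

    outcomes: iterable of 1 (matched the gold answer) or 0 (did not).
    baseline: the evaluator's calibration accuracy, e.g. 0.9.
    """
    shortfall = 0.0
    for index, correct in enumerate(outcomes, start=1):
        # Each miss adds to the running shortfall; each hit pays it down.
        shortfall = max(0.0, shortfall + (baseline - slack) - float(correct))
        if shortfall > threshold:
            return index
    return None

steady = ([1] * 9 + [0]) * 10   # holds 90% accuracy throughout
slipping = [1] * 20 + [0] * 3   # three misses in a row after a clean run
print(cusum_drift(steady, baseline=0.9))
print(cusum_drift(slipping, baseline=0.9))
```

With these settings, isolated misses at the baseline rate are absorbed, while a cluster of misses accumulates quickly enough to raise a flag.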

04.C / Attention Checks

Strategically placed verification items confirm evaluators are reading and engaging with content rather than pattern-matching or rushing through tasks.

04.D / Skip Budget

Evaluators can skip tasks outside their expertise, but within a managed budget. This prevents gaming while allowing honest acknowledgement of knowledge limits.
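A managed budget could be tracked along the following lines. Everything here is a hypothetical sketch: the 10% maximum skip rate, the 20-task grace period, and the class and method names are assumptions, not documented parameters.

```python
from dataclasses import dataclass

@dataclass
class SkipBudget:
    max_skip_rate: float = 0.10  # hypothetical cap on the share of skipped tasks
    grace_tasks: int = 20        # treat early sessions as if this many tasks were seen
    seen: int = 0
    skipped: int = 0

    def record(self, was_skipped: bool) -> None:
        """Log one presented task and whether the evaluator skipped it."""
        self.seen += 1
        self.skipped += int(was_skipped)

    def can_skip(self) -> bool:
        """True if skipping the next task keeps the evaluator within budget."""
        allowance = self.max_skip_rate * max(self.seen + 1, self.grace_tasks)
        return self.skipped + 1 <= allowance

budget = SkipBudget()
budget.record(True)
budget.record(True)
print(budget.can_skip())  # budget used up during the grace period
for _ in range(28):
    budget.record(False)
print(budget.can_skip())  # completed work has earned back one skip
```

Tying the allowance to completed work lets evaluators decline the occasional out-of-scope task without making skipping a viable way to avoid difficult items.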

04.E / Justification Quality

Multi-signal analysis of evaluator justifications assesses depth, clinical reasoning, and consistency — catching thin or formulaic explanations that suggest disengagement.

05 / The Deliverable

Every engagement ends in a Reliability Report.

A procurement-grade artifact you can take to clinical safety reviews, regulators, and compliance teams. Issued as both a human-readable PDF and a machine-readable JSON — with the full methodology referenced throughout.

PDF + JSON

Human-readable for safety reviews; machine-readable for ingestion into your QMS or evidence pipeline.

Per-metric CIs

Beta-Binomial intervals for proportions; bootstrap intervals for continuous scores. Lower confidence bounds reported on every metric.

Readiness scores

Overall, safety-critical, and annotation-heavy readiness — each backed by the underlying calibration data.

Audit trail

Evaluator IDs, gold-task agreement, attention-check history, and methodology version recorded against every result.

06 / Readiness

Readiness scoring

Before matching evaluators with paid work, we compute a readiness score across three categories — ensuring the right evaluator is matched to the right task.

06.A / Ranking & Correction

Readiness Profile: 87%

Pairwise ranking, text correction, and response assessment — the core RLHF task families.

06.B / Safety & Red-Teaming

Threat Coverage: 86%

Adversarial testing across failure mode categories — identifying dangerous outputs before they reach patients.

06.C / Data & Annotation

Annotation Skills: 88%

Clinical data labelling, document classification, and structured annotation for AI training datasets.
