
Evaluation Methodology

How we ensure every clinical AI evaluation is statistically rigorous, reproducible, and trustworthy. From evaluator calibration through to production quality control — every stage is designed to produce reliable results.

2-Phase

Calibration process

CI-Backed

Every metric

5

Quality control layers

Clinician-Led

All evaluations

Two-Phase Calibration

Every evaluator goes through a two-phase calibration process before they can assess your AI system. This ensures their judgements are statistically aligned with expert consensus.

Phase 1: Quick Assessment

A focused 10-task assessment using gold-standard items with known expert consensus. This initial screen identifies evaluators whose clinical judgement aligns with the reference standard and flags those who need additional training before proceeding.
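
As a simplified illustration (the function name and the 0.8 pass mark below are placeholders, not our production values), the Phase 1 screen reduces to scoring agreement against the gold consensus:

    def phase_one_screen(answers, gold, threshold=0.8):
        # Fraction of gold-standard items where the evaluator's answer
        # matches the known expert consensus. Threshold is illustrative.
        agreement = sum(a == g for a, g in zip(answers, gold)) / len(gold)
        return agreement >= threshold  # True -> proceed to Phase 2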

Phase 2: Sequential Stopping

Phase 2 uses sequential stopping rules — evaluators continue until we have statistical confidence in their calibration score, up to 30 tasks maximum. This adaptive approach is more efficient than fixed-length tests while providing the same statistical guarantees.
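
A minimal sketch of one way such a stopping rule can be implemented, here with a 95% Wilson score interval and an assumed pass mark of 0.75 (both illustrative choices rather than our exact procedure):

    import math

    def wilson_bounds(successes, n, z=1.96):
        # 95% Wilson score interval for a proportion of correct tasks.
        p = successes / n
        denom = 1 + z**2 / n
        centre = (p + z**2 / (2 * n)) / denom
        half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
        return centre - half, centre + half

    def sequential_calibration(results, pass_mark=0.75, max_tasks=30):
        # Stop as soon as the whole interval clears (or misses) the pass
        # mark; otherwise continue, up to the 30-task maximum.
        successes = 0
        for n, correct in enumerate(results[:max_tasks], start=1):
            successes += correct
            if n >= 5:  # minimum sample before testing
                lo, hi = wilson_bounds(successes, n)
                if lo >= pass_mark:
                    return "pass", n
                if hi < pass_mark:
                    return "fail", n
        return "inconclusive", min(len(results), max_tasks)

An evaluator who is clearly above or below the pass mark stops early; only borderline cases need the full 30 tasks.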

Proper Scoring Rules

We use mathematically proper scoring rules that incentivise honest, well-calibrated evaluations rather than gaming strategies.

Brier Score

Measures the accuracy of probabilistic predictions. A proper scoring rule that rewards evaluators for well-calibrated confidence — penalising both overconfidence and underconfidence equally.
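
For a single item with outcome y in {0, 1} and forecast probability p, the Brier score is (p − y)², averaged over items. A minimal sketch:

    def brier_score(probs, outcomes):
        # Mean squared error between forecast probabilities and 0/1
        # outcomes. Lower is better; a constant p = 0.5 scores 0.25.
        return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(outcomes)

    brier_score([0.9, 0.2, 0.7], [1, 0, 1])  # ~= 0.047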

Log Score

Provides sharper discrimination between good and poor calibration than Brier scores. Particularly useful for detecting evaluators who are systematically overconfident in their clinical assessments.
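
A matching sketch of the log score (mean negative log-likelihood of the realised outcomes); the epsilon clamp is a standard numerical guard, not a methodological choice:

    import math

    def log_score(probs, outcomes, eps=1e-12):
        # A confident wrong answer (p near 0 when y = 1) is punished far
        # more harshly here than under the Brier score.
        return -sum(math.log(max(p if y else 1 - p, eps))
                    for p, y in zip(probs, outcomes)) / len(outcomes)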

Confidence Intervals

Every calibration score includes statistical confidence intervals. We report lower confidence bounds (LCB) to ensure certification decisions are based on the worst-case plausible performance, not just point estimates.
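
As an illustration of the idea (the normal approximation and one-sided 95% level here are assumptions, not our exact interval construction), a lower confidence bound on a mean calibration score looks like:

    import math
    import statistics

    def lower_confidence_bound(scores, z=1.645):
        # One-sided 95% normal-approximation lower bound on the mean.
        # Certification decisions use this bound, not the point estimate.
        m = statistics.mean(scores)
        s = statistics.stdev(scores)  # needs at least two scores
        return m - z * s / math.sqrt(len(scores))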

Inter-Annotator Agreement

We measure agreement between evaluators to ensure consistency and identify tasks where clinical judgement genuinely differs.

Cohen’s Kappa (κ)

Measures pairwise agreement between two evaluators, correcting for chance agreement. Used for direct comparison tasks and pairwise annotation projects where each item is assessed by exactly two evaluators.
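
The statistic itself is standard: κ = (p_o − p_e) / (1 − p_e), where p_o is observed agreement and p_e is the agreement expected by chance. A minimal sketch:

    from collections import Counter

    def cohens_kappa(rater_a, rater_b):
        # Chance-corrected agreement between two raters' label lists.
        n = len(rater_a)
        p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
        ca, cb = Counter(rater_a), Counter(rater_b)
        p_e = sum(ca[k] * cb[k] for k in ca) / n**2
        return (p_o - p_e) / (1 - p_e)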

Fleiss’ Kappa (κ)

Extends agreement measurement to multiple evaluators. Essential for production evaluation tasks where each item is assessed by several evaluators (not necessarily the same individuals) and we need to assess the overall reliability of the evaluation process.
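
A sketch of the standard Fleiss computation, which takes a per-item count matrix and assumes the same number of ratings per item, though not the same raters:

    def fleiss_kappa(counts):
        # counts[i][j]: number of raters assigning item i to category j.
        N = len(counts)                  # items
        n = sum(counts[0])               # ratings per item
        k = len(counts[0])               # categories
        p_j = [sum(row[j] for row in counts) / (N * n) for j in range(k)]
        p_i = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in counts]
        p_bar = sum(p_i) / N             # mean observed agreement
        p_e = sum(p * p for p in p_j)    # chance agreement
        return (p_bar - p_e) / (1 - p_e)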

Quality Control in Production

Five layers of quality control ensure evaluation reliability is maintained throughout production work — not just during calibration.

Gold-Standard Injection

Known-answer tasks are randomly injected into production queues. Evaluators don't know which tasks are gold items, ensuring they maintain quality on every response.
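
A toy sketch of the injection step (the 10% rate and function names are placeholders; it also assumes the gold pool is large enough to sample from):

    import random

    def inject_gold(queue, gold_items, rate=0.1, seed=None):
        # Mix known-answer items into the production queue, then shuffle
        # so evaluators cannot tell which tasks are gold.
        rng = random.Random(seed)
        k = max(1, int(len(queue) * rate))
        mixed = list(queue) + rng.sample(gold_items, k)
        rng.shuffle(mixed)
        return mixed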

Drift Detection

Statistical monitoring detects when an evaluator's performance begins to drift from their calibration baseline — catching fatigue, disengagement, or changing standards early.
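
One simple form such monitoring can take is a one-sided z-test of recent scores against the calibration baseline; the test and the 3-sigma threshold are illustrative, not our exact detector:

    import math
    import statistics

    def drifted(recent_scores, baseline_mean, baseline_sd, z_crit=3.0):
        # Flag drift when the recent mean sits more than z_crit standard
        # errors below the evaluator's calibration baseline.
        n = len(recent_scores)
        z = (statistics.mean(recent_scores) - baseline_mean) / (baseline_sd / math.sqrt(n))
        return z < -z_crit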

Attention Checks

Strategically placed verification items confirm evaluators are reading and engaging with content rather than pattern-matching or rushing through tasks.

Skip Budget

Evaluators can skip tasks outside their expertise, but within a managed budget. This prevents gaming while allowing honest acknowledgement of knowledge limits.
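
In sketch form (the free-skip floor and 15% rate are placeholder numbers):

    def may_skip(skips_used, tasks_seen, budget_rate=0.15, min_free=3):
        # A few free skips, then skips are capped at a fixed fraction of
        # tasks seen: honest "outside my expertise" stays cheap,
        # wholesale skipping does not.
        return skips_used < max(min_free, int(tasks_seen * budget_rate))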

Justification Quality

Multi-signal analysis of evaluator justifications assesses depth, clinical reasoning, and consistency — catching thin or formulaic explanations that suggest disengagement.
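
A deliberately toy sketch of the kinds of surface signals involved; real scoring would draw on much richer features than these placeholders:

    def justification_signals(text, recent_texts):
        # Crude proxies only: depth, formulaic wording, and copy-paste.
        words = text.split()
        return {
            "long_enough": len(words) >= 20,
            "lexical_diversity": len(set(words)) / max(len(words), 1),
            "duplicate": text in recent_texts,
        }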

Readiness Scoring

Before matching evaluators with paid work, we compute a readiness score across three categories — ensuring the right evaluator is matched to the right task. A sketch of this scoring follows the category list below.

RLHF Evaluation

Pairwise ranking, text correction, and response assessment — the core tasks of clinical RLHF work.

Safety & Red-Teaming

Adversarial testing across failure mode categories — identifying dangerous outputs before they reach patients.

Data & Annotation

Clinical data labelling, document classification, and structured annotation for AI training datasets.
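
As a minimal sketch of how per-category results might roll up into a single readiness score (the category keys and equal weights are illustrative assumptions):

    def readiness(scores, weights=None):
        # scores, e.g. {"rlhf": 0.82, "safety": 0.74, "annotation": 0.91}
        weights = weights or {k: 1.0 for k in scores}
        total = sum(weights.values())
        return sum(scores[k] * weights[k] for k in scores) / total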

See Our Methodology in Action

Whether you need clinical AI evaluation for your product or want to become a calibrated evaluator — our methodology ensures quality at every stage.