Phase 2 · Full Calibration

The Full Calibration.

A complete statistical calibration across all eight RLHF task types. It produces your reliability score, the number AI companies see when you are matched to paid work, with confidence intervals on every metric.


At a glance

Duration: 60–90 min
Task types: 8
Output: Reliability Report
Price: Free

Why it matters

The score on the record.

Phase 1 is a private diagnostic; Phase 2 is the procurement-grade measurement. It produces your reliability score with statistical confidence intervals, plus the buyer-facing Reliability Report: the evidence used to match you to paid evaluation work.

Coverage

All eight task types.

Real RLHF work spans more than rating answers. Phase 2 exercises the full range, so your score reflects how you actually perform across the work buyers need.

Rating

Score a single AI response against a quality scale.

Comparison

Choose the better of two responses, and say why.

Ranking

Order several responses from best to worst.

Rubric

Assess a response against structured criteria across several dimensions.

Correction

Fix an unsafe or incorrect response.

Annotation

Label spans and structure in clinical text.

Justification

Write the clinical reasoning behind a judgement.

Red-team

Probe for failure modes across the safety taxonomy.

The statistics

Measured, not asserted.

Every metric is reported with its uncertainty rather than as a bare number, so you can see how precisely each score is known.

Per-metric confidence intervals

Beta-Binomial intervals for proportions and bootstrap intervals for continuous metrics, so every score carries a quantified uncertainty.

Lower confidence bounds

Certification reads the lower bound, not the point estimate, so your score reflects worst-case plausible performance.

Per-category coverage

Performance broken down across task types and the safety taxonomy, recorded in the report.
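The three ideas above can be illustrated with a short sketch. The page does not state the prior, the confidence level, or the resampling scheme, so this reads "Beta-Binomial" as a Beta prior combined with a Binomial likelihood and assumes a uniform Beta(1, 1) prior, a 95% equal-tailed interval, and a percentile bootstrap. The function names, the 0.80 threshold, and the sample data are all hypothetical. To stay within the standard library it approximates Beta quantiles from Monte Carlo draws; an exact version would use a Beta quantile function such as `scipy.stats.beta.ppf`.

```python
import random
import statistics


def beta_binomial_interval(successes, trials, level=0.95, prior=(1.0, 1.0),
                           draws=20_000, seed=0):
    """Posterior mean and equal-tailed interval for a proportion.

    Assumption: a Beta(a, b) prior (uniform by default) plus a Binomial
    likelihood gives a Beta(a + successes, b + failures) posterior.
    Quantiles are approximated from Monte Carlo draws (stdlib only).
    """
    a = prior[0] + successes
    b = prior[1] + (trials - successes)
    rng = random.Random(seed)
    samples = sorted(rng.betavariate(a, b) for _ in range(draws))
    tail = (1.0 - level) / 2.0
    lower = samples[int(tail * (draws - 1))]
    upper = samples[int((1.0 - tail) * (draws - 1))]
    return a / (a + b), lower, upper


def bootstrap_interval(values, level=0.95, resamples=5_000, seed=0,
                       stat=statistics.fmean):
    """Point estimate and percentile-bootstrap interval for a continuous metric."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and recompute the statistic each time.
    estimates = sorted(stat(rng.choices(values, k=n)) for _ in range(resamples))
    tail = (1.0 - level) / 2.0
    lower = estimates[int(tail * (resamples - 1))]
    upper = estimates[int((1.0 - tail) * (resamples - 1))]
    return stat(values), lower, upper


def certify(lower_bound, threshold):
    """Hypothetical rule: pass only if the lower bound clears the threshold."""
    return lower_bound >= threshold


if __name__ == "__main__":
    THRESHOLD = 0.80  # hypothetical certification threshold

    # Hypothetical rater: 18 of 20 items agree with the reference label.
    mean, lo, hi = beta_binomial_interval(18, 20)
    print(f"agreement {mean:.3f}, 95% interval [{lo:.3f}, {hi:.3f}]")
    print("certified on lower bound:", certify(lo, THRESHOLD))

    # Hypothetical continuous per-item scores.
    scores = [0.62, 0.71, 0.68, 0.80, 0.75, 0.66, 0.73, 0.70]
    m, blo, bhi = bootstrap_interval(scores)
    print(f"mean score {m:.3f}, 95% interval [{blo:.3f}, {bhi:.3f}]")
```

With these made-up numbers, 18 of 20 gives a point estimate of about 0.86, which clears 0.80, but the lower end of the interval falls below 0.80, so a lower-bound rule would not certify on this evidence alone. More items narrow the interval and let the lower bound rise toward the point estimate.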

The deliverable

A Reliability Report on record.

You finish Phase 2 with a procurement-grade Reliability Report — PDF for people, JSON for systems — that you can share with employers and that backs every match we make.


Get calibrated.

Work through Foundation and the Quick Screen, then complete the Full Calibration to put a reliability score on your profile — all free.

entertheloop

Clinicians powering AI alignment, training & safety.

Verified against

GMC · NMC · GPhC · HCPC