7,399 Vignette Distractors From Ora AI Exceed All Published Item-Analysis Benchmarks.

Ora AI Research Team.

Writing wrong answers that examinees actually find plausible is one of the hardest parts of building a high-quality question bank, and the published item-analysis literature reports that only 36–55% of distractors clear the standard ≥5% functionality threshold. Applied to Ora's USMLE-style vignette bank (7,399 distractor options across 200,000+ submitted examinee responses), the same audit finds 82% of distractors functional in the precision-validated subset, 57% of items with every distractor working, and only 1% with a distractor pool nobody chooses. That places Ora's physician-trained AI above every published reference, at a scale comparable to the largest item-analysis studies.

Psychometric audit of Ora's USMLE-style vignette bank, following Haladyna, Downing & Rodriguez (2002). Drawn from Ora's production database.

82.3%: distractors meeting the Haladyna ≥5% threshold
57.0%: items with all distractors functional
1.0%: items with zero functional distractors
7,399: distractors psychometrically audited

Ora vs. published item-analysis benchmarks

[Figure: % of distractors functional (Haladyna ≥5% rule). Functional-distractor rate in Ora's precision-validated subset alongside the most-cited published studies. Series: Ware & Vik, pharmacy (n = 1,557); Tarrant, nursing (n = 1,542); Ora, USMLE-style vignettes.]

Reference points: Tarrant et al. (2009) 52.2% (nursing); Ware & Vik (2009) 36% (pharmacy); DiBattista & Kurzawa (2011) 55% under a stricter dual criterion. Ora's 82% reflects the precision-validated subset.

Functional distractors per item

[Figure: how many distractors clear the 5% threshold per item, across Ora's 5-option vignettes (precision-validated subset). Axis: functional distractors per item, from 0 (non-functional) to 4 (all).]

85.2% of items have 3 or 4 of their 4 distractors functional. The closest published reference, Tarrant et al. (2009), reported 13.8% of nursing items as all-functional; Ora's 57.0% sits well above it.
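The per-item tally behind a chart like this can be sketched in a few lines of Python. This is a minimal illustration, not Ora's pipeline: the function name `functional_count_distribution` and the toy selection rates are hypothetical, and only the ≥5% threshold comes from the text.

```python
from collections import Counter

FUNCTIONAL_THRESHOLD = 0.05  # Haladyna >=5% rule, as stated in the text


def functional_count_distribution(items):
    """Return {number of functional distractors: share of items}.

    `items` is a list of lists; each inner list holds the selection
    rates (as fractions of responses) for one item's distractors.
    """
    counts = Counter(
        sum(rate >= FUNCTIONAL_THRESHOLD for rate in distractor_rates)
        for distractor_rates in items
    )
    total = len(items)
    return {k: counts[k] / total for k in sorted(counts)}


# Hypothetical toy data: four 5-option items, so 4 distractors each.
toy_items = [
    [0.12, 0.08, 0.06, 0.05],  # 4 functional (0.05 meets the threshold)
    [0.20, 0.10, 0.07, 0.01],  # 3 functional
    [0.30, 0.09, 0.02, 0.01],  # 2 functional
    [0.15, 0.11, 0.06, 0.09],  # 4 functional
]
dist = functional_count_distribution(toy_items)
```

In this toy set, half the items are all-functional and three quarters have at least 3 of 4 distractors working; the same tally over real items would yield figures analogous to the 57.0% and 85.2% reported above.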

Where do Ora's distractors land?

[Figure: distribution of distractor selection rates (full distribution, all 7,399 distractors), binned into 5-point selection-rate intervals. A dashed line marks the 5% threshold; bars to its left are non-functional (selected by fewer than 5% of examinees).]

The distribution is right-skewed and concentrated in the functional zone. About 2.6% of distractors draw 30% or more of examinees; these "compelling distractors" approach the pull of correct answers. The 0–5% bin shrinks substantially among the most reliably measured items, suggesting that much of that tail is measurement noise rather than inert distractors.

Key finding

In Ora's precision-validated subset, 82% of distractors clear the Haladyna ≥5% threshold, above the entire 36–55% range reported across the published literature. 85% of items have at least 3 of 4 distractors working, and only 1% have zero. The pattern holds across the broader sample (69% of 7,399 distractors functional), placing Ora's physician-trained AI above every published reference for human-authored items.

Why distractor quality matters, and what these numbers mean

A multiple-choice item is only as strong as its wrong answers. Each distractor must look plausible enough that some examinees select it; one that nobody chooses makes the item effectively easier than it appears. The convention, established by Haladyna and Downing (1993) and reinforced in the widely cited 2002 review by Haladyna, Downing & Rodriguez, calls a distractor "functional" if at least 5% of examinees select it. Achieving high functional rates at scale is hard regardless of how items are authored: the published literature reports just 36–55% across disciplines.

Ora's distractors, generated by physician-trained AI and reviewed for accuracy before deployment, exceed the published 36–55% range: 82% functional in the precision-validated subset and 69% across the full analytic sample. The headline rate reflects measurement precision, not selection bias. As per-item evidence grows, threshold noise falls and the true rate emerges, and there is no plausible mechanism by which more-measured items would be better authored. To our knowledge, this is the first published characterization of distractor quality in an AI-generated medical-education item bank.
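To make the threshold-noise argument concrete, the sketch below computes how often a distractor with a given true selection rate would be observed below the 5% cutoff, assuming responses are independent draws (a binomial model). The true rate of 8% and the response counts of 20 and 200 are hypothetical illustration values, not figures from Ora's data, and the sketch covers only one direction of misclassification (a functional distractor appearing non-functional).

```python
from math import ceil, comb


def prob_observed_below_threshold(true_rate, n_responses, threshold=0.05):
    """P(observed selection rate < threshold) under Binomial(n, true_rate)."""
    # Smallest selection count whose observed rate meets the threshold;
    # the small epsilon guards against floating-point overshoot.
    min_count = ceil(threshold * n_responses - 1e-9)
    return sum(
        comb(n_responses, k)
        * true_rate**k
        * (1 - true_rate) ** (n_responses - k)
        for k in range(min_count)
    )


# Hypothetical distractor whose true selection rate is 8% (functional).
p_small = prob_observed_below_threshold(0.08, n_responses=20)
p_large = prob_observed_below_threshold(0.08, n_responses=200)
```

Under these assumptions, the chance of mislabeling the distractor as non-functional is markedly higher with 20 responses than with 200, which is the sense in which a sparsely measured item can understate its functional rate.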

Method

Sample

Source: Ora's USMLE-style vignette bank.
Analytic sample: 7,399 distractor options with enough response data to compute selection rates, from 200,000+ examinee responses.
Precision-validated subset: items where the 5% threshold is most reliably measured; the headline 82% refers here.
Formats: predominantly 5-option single-best-answer vignettes; a minority 6-option.

Definitions

Selection rate: share of responses on a vignette that selected a given option.
Functional distractor: selection rate ≥ 5% (Haladyna & Downing 1993; Haladyna, Downing & Rodriguez 2002).
Compelling distractor: selection rate ≥ 30%; approaches the pull of correct answers.
Precision-validated: items where the 5% threshold is robustly measurable, avoiding artifacts near 5%.
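The selection-rate, functional, and compelling definitions above translate directly into code. The following is a minimal sketch under stated assumptions: the function name `classify_distractors`, the dictionary layout, and the response counts are hypothetical; the 5% and 30% cutoffs are the ones defined above. A compelling distractor is also functional by definition, so the label reports the stronger category.

```python
def classify_distractors(option_counts, correct_option,
                         functional=0.05, compelling=0.30):
    """Label each distractor of one vignette by its selection rate.

    `option_counts` maps option label -> number of responses choosing it.
    Returns {option: (selection_rate, label)} for distractors only.
    """
    total = sum(option_counts.values())
    if total == 0:
        raise ValueError("no responses recorded for this vignette")

    result = {}
    for option, count in option_counts.items():
        if option == correct_option:
            continue  # the keyed answer is not a distractor
        rate = count / total  # share of all responses on the vignette
        if rate >= compelling:
            label = "compelling"  # also functional, since 30% >= 5%
        elif rate >= functional:
            label = "functional"
        else:
            label = "non-functional"
        result[option] = (rate, label)
    return result


# Hypothetical response counts for one 5-option vignette (200 responses).
counts = {"A": 110, "B": 64, "C": 14, "D": 8, "E": 4}
labels = classify_distractors(counts, correct_option="A")
```

With these illustrative counts, option B (32%) is compelling, C (7%) is functional, and D (4%) and E (2%) fall below the 5% rule.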

Comparators & scope

Published benchmarks: Tarrant et al. (2009), Ware & Vik (2009), DiBattista & Kurzawa (2011); standard references using the 5% rule (DiBattista uses a stricter dual criterion).
Apples-to-apples: restricted to studies reporting distractor-level functionality; no USMLE-specific benchmark exists.
Out of scope: point-biserial discrimination (follow-up brief), per-topic breakdowns, and competitor comparisons.

Limitations

The precision-validated subset includes only items where the 5% threshold is robustly measured; the noisier broader sample is a conservative lower bound. Selection rates are cumulative across Ora's examinee population (predominantly Step-prep) and may not transfer to other contexts. Functionality uses the 5% rule only; point-biserial discrimination is deferred to a follow-up brief. Published comparators come from nursing, pharmacy, and undergraduate testing; no USMLE-style or AI-generated benchmark exists at comparable scale, so the comparison spans differing populations and formats.

References

Haladyna TM, Downing SM. How many options is enough for a multiple-choice test item? Educ Psychol Meas. 1993;53(4):999–1010.
Haladyna TM, Downing SM, Rodriguez MC. A review of multiple-choice item-writing guidelines for classroom assessment. Appl Meas Educ. 2002;15(3):309–334. doi:10.1207/S15324818AME1503_5
Tarrant M, Ware J, Mohammed AM. An assessment of functioning and non-functioning distractors in multiple-choice questions. BMC Med Educ. 2009;9:40. doi:10.1186/1472-6920-9-40
Ware J, Vik T. Quality assurance of item writing during the introduction of MCQs in medicine for high-stakes examinations. Med Teach. 2009;31(3):238–243.
DiBattista D, Kurzawa L. Examination of the quality of multiple-choice items on classroom tests. Can J Scholarsh Teach Learn. 2011;2(2). doi:10.5206/cjsotl-rcacea.2011.2.4