Ora AI Outperforms UWorld in the First Multi-Institutional RCT of a USMLE Question Bank

Ora AI Research


Phelps R, et al. Toward AI-Powered Precision Medical Education: A Multi-Institutional Randomized Controlled Trial of an Adaptive Question Bank for the USMLE. Preprint, pending peer review. June 2026.

In a preregistered, blinded-analysis, multi-institutional RCT of 155 medical students, at-risk students improved 2.4× more with Ora than with UWorld over the 14-day intervention (+4.4 vs +1.8 questions on a 60-item assessment, p < .001). At the 10-month real-exam follow-up, 100% of Ora-assigned Step 1 respondents passed on the first attempt (vs 91% for UWorld, n.s.), and students with predominant Ora exposure scored 11.0 points higher on first-attempt Step 2 CK (p = .027).

Preregistered: AsPredicted #246348. Statistical analysis conducted blinded to arm assignment by the Stanford Department of Statistics.

100% Step 1 first-attempt pass rate (vs 91%, n.s.)

+11.0 points on Step 2 CK vs. UWorld

+2.6 Q advantage for at-risk students (p < .001)

155 enrolled in RCT / 51 at 10-mo follow-up

Long-term outcomes: actual USMLE performance

10 months post-trial · n = 51 follow-up

USMLE Step 1

First-attempt pass rate (10-mo follow-up; n = 24)

[Bar chart: National, UWorld, Ora]

100% (13/13) of Ora-assigned vs 91% (10/11) of UWorld-assigned respondents passed on the first attempt. Fisher's exact p = .46 (not significant; the sample is small). National average first-time pass rate for MD students: ~91% (USMLE.org).
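The reported Fisher's exact p-value can be reproduced by hand from the 2×2 counts above. The following is a minimal sketch in plain Python rather than the authors' R code; the function name `fisher_exact_two_sided` is illustrative, and it uses the common two-sided convention of summing the probabilities of all tables no more likely than the observed one.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins that is no more likely than the observed table.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2
    denom = comb(n, col1)

    def prob(x):
        # Probability that the top-left cell equals x given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    p_obs = prob(a)
    # A small relative tolerance guards against floating-point ties.
    return sum(prob(x) for x in range(lo, hi + 1)
               if prob(x) <= p_obs * (1 + 1e-9))


# Rows: Ora, UWorld. Columns: passed, did not pass (counts from the text above).
p_value = fisher_exact_two_sided(13, 0, 10, 1)
print(round(p_value, 2))  # 0.46
```

With only one non-passing respondent among 24, the p-value reduces to the chance that this respondent falls in the UWorld group, 11/24 ≈ .46, which is why the comparison cannot reach significance at this sample size.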

USMLE Step 2 CK

First-attempt mean score (10-mo follow-up; n = 23)

[Bar chart: National, UWorld, Ora]

Ora +11.0 pts vs UWorld (263.4 vs 252.4), approximately 0.7 SD on the Step 2 CK scoring scale. National 2024-25 mean: 250 (USMLE Score Interpretation Guidelines).

Short-term outcome: 14-day primary analysis

Preregistered primary outcome · n = 121 per-protocol

14-Day Improvement: All Students and At-Risk Students

Pretest → posttest gain on the 60-item NBME-derived assessment over the 14-day intervention.

[Bar chart: 14-day gain by arm (UWorld, Ora). Groups shown: all students (n = 60) and at-risk students (n = 33); all students (n = 61) and at-risk students (n = 38).]

UWorld's gain is essentially flat across populations (+1.9 → +1.8), while Ora's gain rises with student need (+3.2 → +4.4), the pattern expected from adaptive targeting. Adjusted Ora − UWorld effect among at-risk students from the per-protocol ANCOVA: +2.61 questions (95% CI 1.09 to 4.13, p < .001). At-risk is defined as a pretest score below 36/60 (the approximate NBME first-attempt passing threshold); at-risk students are a subset of the all-students population.
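As a rough consistency check, the reported confidence interval can be converted back into a standard error and an approximate p-value. This sketch assumes a normal approximation (z = 1.96); the actual ANCOVA would use a t distribution with the model's residual degrees of freedom, which are not stated here, so the result is indicative only.

```python
from math import erfc, sqrt

estimate = 2.61               # adjusted Ora - UWorld difference, in questions
ci_low, ci_high = 1.09, 4.13  # reported 95% confidence interval

# Assumption: normal approximation, so the half-width is 1.96 standard errors.
Z_95 = 1.96
std_error = (ci_high - ci_low) / (2 * Z_95)
z_stat = estimate / std_error

# Two-sided p-value under the normal approximation.
p_approx = erfc(z_stat / sqrt(2))

# The point estimate should sit at the centre of a symmetric interval.
midpoint = (ci_low + ci_high) / 2

print(f"SE ~ {std_error:.2f}, z ~ {z_stat:.2f}, p < .001: {p_approx < 0.001}")
```

The interval is symmetric around +2.61, and the implied test statistic is large enough that the approximate p-value falls below .001, in line with the reported result.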

Study design

Design

Two-arm parallel-group RCT comparing Ora AI vs UWorld.
1:1 randomization, computer-generated, stratified by exam level (Step 1 / Step 2 CK).
Preregistered on AsPredicted (#246348).
Blinded analysis: arm-coded (A/B) dataset analyzed by Stanford Department of Statistics co-authors, blinded to arm assignment.
IRB exempt under 45 CFR 46.104(d)(1)-(2); all participants provided electronic consent.
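One common way to implement computer-generated 1:1 randomization stratified by exam level is permuted blocks within each stratum. The sketch below is illustrative only: the block size, the seed, the participant IDs, and the function name `stratified_block_randomize` are assumptions, and the summary above does not say which allocation algorithm the trial used.

```python
import random


def stratified_block_randomize(participants, block_size=4, seed=2025):
    """Assign arms 1:1 within each stratum using permuted blocks.

    participants: list of (participant_id, stratum) tuples, in enrollment order.
    Returns a dict mapping participant_id -> "Ora" or "UWorld".
    block_size and seed are hypothetical choices, not taken from the trial.
    """
    if block_size % 2:
        raise ValueError("block_size must be even for 1:1 allocation")
    rng = random.Random(seed)
    pending = {}     # stratum -> arms remaining in the current block
    assignment = {}
    for pid, stratum in participants:
        block = pending.setdefault(stratum, [])
        if not block:
            # Start a new balanced block and shuffle its order.
            block.extend(["Ora", "UWorld"] * (block_size // 2))
            rng.shuffle(block)
        assignment[pid] = block.pop()
    return assignment


# Hypothetical roster: 8 Step 1 and 8 Step 2 CK students.
roster = [(f"S1-{i}", "Step 1") for i in range(8)] + \
         [(f"S2-{i}", "Step 2 CK") for i in range(8)]
arms = stratified_block_randomize(roster)
```

Because each stratum draws from its own balanced blocks, the arms stay close to 1:1 within Step 1 and within Step 2 CK, not just overall.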

Participants & protocol

155 US medical students enrolled (Ora: 77; UWorld: 78); 121 per-protocol; 51 long-term respondents at 10 months.
14-day intervention period during summer 2025.
Engagement criteria: ≥10 study days and ≥400 questions; verified via platform logs (Ora) and self-report (UWorld).
Baseline characteristics were well matched across pretest score, exam cohort, assessment order, and URiM proportion (all p > .38).

Outcomes & analysis

Short-term outcome: 60-item NBME Free 120 posttest (counterbalanced; Block A↔B).
Long-term outcome: actual first-attempt USMLE Step 1 pass and Step 2 CK score, self-reported at 10 months.
Primary analysis: ANCOVA adjusted for pretest, exam cohort, and assessment order; arm × pretest interaction tested per protocol.
Conducted in R 4.5.3 (emmeans, car, effectsize, mediation).
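The ANCOVA described above estimates the arm effect on posttest score after adjusting for pretest, exam cohort, and assessment order. The sketch below shows the idea as an ordinary least-squares fit in plain Python. It is not the authors' R analysis: the data are synthetic and noiseless, built with a known arm effect of 2.5 purely so the recovered coefficient can be checked, and the helper name `ols` is illustrative.

```python
def ols(X, y):
    """Least-squares coefficients via the normal equations (X'X) b = X'y."""
    k = len(X[0])
    # Build the augmented matrix [X'X | X'y].
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)] +
         [sum(r[i] * yi for r, yi in zip(X, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda row: abs(A[row][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for row in range(k):
            if row != col:
                f = A[row][col] / A[col][col]
                A[row] = [a - f * b for a, b in zip(A[row], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


# Synthetic, noiseless data (NOT trial data): true arm effect is 2.5.
rows = []
for i in range(16):
    arm, cohort, order = i % 2, (i // 2) % 2, (i // 4) % 2
    pretest = 30 + i
    posttest = 4 + 0.9 * pretest + 2.5 * arm + 1.0 * cohort - 0.5 * order
    rows.append((arm, pretest, cohort, order, posttest))

# Design matrix columns: intercept, arm (1 = Ora), pretest, cohort, order.
X = [[1.0, arm, pre, cohort, order] for arm, pre, cohort, order, _ in rows]
y = [post for *_, post in rows]

intercept, arm_effect, pretest_slope, cohort_effect, order_effect = ols(X, y)
print(f"adjusted arm effect: {arm_effect:.2f}")
```

The coefficient on the arm indicator is the covariate-adjusted difference between arms, the same kind of quantity as the adjusted Ora − UWorld effect reported for at-risk students. Testing an arm × pretest interaction would add one more column to the design matrix, the product of the arm and pretest columns.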

Limitations

The 14-day intervention represents a fraction of a typical board prep cycle. Long-term follow-up sample (n = 51) was modest and self-reported on an anonymous survey. UWorld-arm engagement was self-reported while Ora-arm engagement was platform-logged, an asymmetry that may affect engagement-related interpretations. The study was not adequately powered for subgroup analyses (URiM students showed a directional but non-significant larger benefit). Findings warrant confirmatory replication in a larger trial. Preprint pending peer review.