Ora AI Outperforms UWorld in the First Multi-Institutional RCT of a USMLE Question Bank

Phelps R, et al. Toward AI-Powered Precision Medical Education: A Multi-Institutional Randomized Controlled Trial of an Adaptive Question Bank for the USMLE. Preprint, pending peer review. June 2026.

In a preregistered, multi-institutional RCT of 155 medical students with blinded analysis, at-risk students gained 2.4× as much with Ora as with UWorld over the 14-day intervention (+4.4 vs +1.8 questions, p < .001). At 10-month real-exam follow-up, 100% of Ora-assigned Step 1 respondents passed on first attempt (vs 91% UWorld, n.s.), and students with predominant Ora exposure scored 11.0 points higher on first-attempt Step 2 CK (p = .027).

Preregistered: AsPredicted #246348. Statistical analysis conducted blinded to arm assignment by the Stanford Department of Statistics.
  • 100% Step 1 first-attempt pass rate (vs 91%, n.s.)
  • +11.0 points on Step 2 CK vs. UWorld
  • +2.6-question advantage for at-risk students (p < .001)
  • 155 enrolled in the RCT; 51 respondents at 10-month follow-up
Long-term outcomes: actual USMLE performance
10 months post-trial · n = 51 follow-up
USMLE Step 1
[Chart: first-attempt pass rate at 10-month follow-up (n = 24) for National, UWorld, and Ora]

100% (13/13) Ora-assigned vs 91% (10/11) UWorld-assigned passed on first attempt. Fisher's exact p = .46 (n.s. given small n). National avg first-time MD pass rate ~91% (USMLE.org).
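
The Fisher's exact p-value quoted above can be reproduced from the four counts alone. The following is a minimal sketch using only the Python standard library. It is not the study's analysis code (the study reports using R), and the function name is illustrative.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins that is no more likely than the observed table.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Probability that row 1 holds x of the col1 first-column counts.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = prob(a)
    # A small relative tolerance guards against floating-point ties.
    return sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-9)
    )


# Rows: Ora, UWorld. Columns: passed, did not pass (counts from the text).
p_value = fisher_exact_two_sided(13, 0, 10, 1)
print(round(p_value, 2))  # 0.46
```

With a single non-pass among 24 respondents, only two tables share these margins, so the p-value reduces to 11/24, the chance that the one non-pass falls in the UWorld arm.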

USMLE Step 2 CK
[Chart: first-attempt mean score at 10-month follow-up (n = 23) for National, UWorld, and Ora]

Ora +11.0 pts vs UWorld (263.4 vs 252.4). Approximately 0.7 SD on the Step 2 CK scoring scale. National 2024-25 mean: 250 (USMLE Score Interpretation Guidelines).
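
The "approximately 0.7 SD" figure is the raw score difference divided by a scale standard deviation. The sketch below assumes an SD of 15 points, which the text does not state. Treat it as an illustrative assumption and substitute the value from the USMLE Score Interpretation Guidelines if it differs.

```python
ora_mean = 263.4      # from the text
uworld_mean = 252.4   # from the text
assumed_sd = 15.0     # ASSUMPTION: scale SD is not given in the text

difference = ora_mean - uworld_mean
standardized_difference = difference / assumed_sd

print(round(difference, 1), round(standardized_difference, 2))
```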

Short-term outcome: 14-day primary analysis
Preregistered primary outcome · n = 121 per-protocol
14-Day Improvement: All Students and At-Risk Students
Pretest → posttest gain on the 60-item NBME-derived assessment over the 14-day intervention. UWorld's gain is flat across populations; Ora's gain rises with student need.
[Chart: mean gain by arm (UWorld, Ora) for all students and for at-risk students; per-protocol arm sizes n = 60 and n = 61, of which n = 33 and n = 38 were at-risk]

UWorld's gain is essentially flat across populations (+1.9 → +1.8); Ora's gain rises with student need (+3.2 → +4.4), a pattern consistent with adaptive targeting. Adjusted Ora − UWorld effect among at-risk students from per-protocol ANCOVA: +2.61 questions, 95% CI 1.09 to 4.13, p < .001. At-risk = pretest < 36/60 (the approximate NBME first-attempt passing threshold); at-risk students are a subset of the all-students population.
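
The at-risk rule and the headline ratio are simple arithmetic on the figures above. A minimal sketch follows, with hypothetical function names. The unadjusted gap it computes (+2.6) sits close to, but is not the same quantity as, the ANCOVA-adjusted +2.61.

```python
AT_RISK_CUTOFF = 36  # pretest below 36 of 60 items counts as at-risk


def is_at_risk(pretest_correct):
    """Return True when a pretest score falls below the at-risk cutoff."""
    return pretest_correct < AT_RISK_CUTOFF


# Mean 14-day gains among at-risk students, as reported in the text.
ora_gain_at_risk = 4.4
uworld_gain_at_risk = 1.8

gain_ratio = ora_gain_at_risk / uworld_gain_at_risk   # about 2.4
raw_gap = ora_gain_at_risk - uworld_gain_at_risk      # about 2.6 questions

print(is_at_risk(35), is_at_risk(36))  # True False
print(round(gain_ratio, 1), round(raw_gap, 1))  # 2.4 2.6
```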

Study design

Design
  • Two-arm parallel-group RCT comparing Ora AI vs UWorld.
  • 1:1 randomization, computer-generated, stratified by exam level (Step 1 / Step 2 CK).
  • Preregistered on AsPredicted (#246348).
  • Blinded analysis: arm-coded (A/B) dataset analyzed by Stanford Department of Statistics co-authors, blinded to arm assignment.
  • IRB exempt under 45 CFR 46.104(d)(1)-(2); all participants provided electronic consent.
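
One way to produce a computer-generated, stratified 1:1 allocation is sketched below. The text does not describe the actual scheme, so the shuffle-then-alternate approach, the seed, and the roster are all assumptions made for illustration. Real trials that enroll sequentially typically use permuted blocks instead.

```python
import random
from collections import defaultdict


def stratified_randomize(participants, seed=0):
    """Assign arms 1:1 within each stratum.

    participants: list of (participant_id, stratum) tuples.
    Returns a dict mapping participant_id to arm code "A" or "B".
    """
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for participant_id, stratum in participants:
        by_stratum[stratum].append(participant_id)

    assignment = {}
    for stratum in sorted(by_stratum):
        ids = by_stratum[stratum]
        rng.shuffle(ids)  # random order within the stratum
        for position, participant_id in enumerate(ids):
            # Alternating arms keeps each stratum balanced to within one.
            assignment[participant_id] = "A" if position % 2 == 0 else "B"
    return assignment


# Hypothetical roster: IDs and stratum membership are made up.
roster = [
    (f"P{i:03d}", "Step 1" if i % 3 else "Step 2 CK") for i in range(20)
]
arms = stratified_randomize(roster, seed=42)
```

Coding the arms as A/B rather than by product name matches the blinded-analysis setup described in the list above.
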
Participants & protocol
  • 155 US medical students enrolled (Ora: 77; UWorld: 78); 121 per-protocol; 51 long-term respondents at 10 months.
  • 14-day intervention period during summer 2025.
  • Engagement criteria: ≥10 study days and ≥400 questions; verified via platform logs (Ora) and self-report (UWorld).
  • Baseline well-matched across pretest score, exam cohort, assessment order, and URiM proportion (all p > .38).
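
The engagement criteria above translate into a simple filter. The sketch below assumes these two thresholds are what define the per-protocol set. The record type, field names, and sample records are hypothetical.

```python
from dataclasses import dataclass

MIN_STUDY_DAYS = 10   # from the text: at least 10 study days
MIN_QUESTIONS = 400   # from the text: at least 400 questions


@dataclass
class Engagement:
    participant_id: str
    study_days: int
    questions_answered: int


def meets_engagement_criteria(record):
    """Both thresholds must be met; each bound is inclusive."""
    return (
        record.study_days >= MIN_STUDY_DAYS
        and record.questions_answered >= MIN_QUESTIONS
    )


# Made-up records for illustration only.
records = [
    Engagement("P001", 12, 450),
    Engagement("P002", 9, 600),    # too few study days
    Engagement("P003", 14, 399),   # too few questions
    Engagement("P004", 10, 400),   # exactly at both thresholds
]
per_protocol_ids = [
    r.participant_id for r in records if meets_engagement_criteria(r)
]
print(per_protocol_ids)  # ['P001', 'P004']
```
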
Outcomes & analysis
  • Short-term outcome: 60-item NBME Free 120 posttest (counterbalanced; Block A↔B).
  • Long-term outcome: actual first-attempt USMLE Step 1 pass and Step 2 CK score, self-reported at 10 months.
  • Primary analysis: ANCOVA adjusted for pretest, exam cohort, and assessment order; arm × pretest interaction tested per protocol.
  • Conducted in R 4.5.3 (emmeans, car, effectsize, mediation).
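
To show what the ANCOVA adjustment estimates, the sketch below fits posttest on arm, pretest, exam cohort, and assessment order by ordinary least squares, using only the Python standard library. It is not the study's R code, and the eight rows are synthetic and noise-free, built so the true arm effect is 2.5. The arm × pretest interaction named above would be one more column (arm multiplied by pretest).

```python
from itertools import product


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def ols(design, y):
    """Least-squares coefficients from the normal equations."""
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# SYNTHETIC data: every (arm, cohort, order) combination, made-up pretests.
design, posttest = [], []
for idx, (arm, cohort, order) in enumerate(product([0, 1], repeat=3)):
    pretest = 30 + (idx * 7) % 11
    # Columns: intercept, arm, pretest, cohort, assessment order.
    design.append([1.0, arm, pretest, cohort, order])
    posttest.append(5.0 + 2.5 * arm + 0.9 * pretest + 1.0 * cohort - 0.5 * order)

coefficients = ols(design, posttest)
adjusted_arm_effect = coefficients[1]
print(round(adjusted_arm_effect, 2))
```

The coefficient on the arm column is the adjusted between-arm difference, which is the quantity the +2.61-question estimate reports for at-risk students.
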
Limitations

The 14-day intervention represents a fraction of a typical board prep cycle. The long-term follow-up sample (n = 51) was modest, and its outcomes were self-reported on an anonymous survey. UWorld-arm engagement was self-reported while Ora-arm engagement was platform-logged, an asymmetry that may affect engagement-related interpretations. The study was not adequately powered for subgroup analyses (URiM students showed a directional but non-significant larger benefit). Findings warrant confirmatory replication in a larger trial. Preprint pending peer review.