Students Who Fail a Vignette on Ora AI Recover the Concept 80% of the Time Within Three Spaced Variants.

Ora AI Research Team. Internal empirical analysis.

Ora operates the world's first spaced-repetition clinical-vignette question bank. On the Ebbinghaus curve, failure is supposed to be a setback. On Ora it's a setup. Across 25,184 cleaned student-vignette failure chains, 79.9% of chains that began with a failed vignette ended with the concept answered correctly once the scheduler re-served a different variant of it within the first 3 encounters: 64.1% recovered on the next variant alone, another 15.8% on the third. Only 10.3% failed all three variants; the remaining 9.8% failed the second variant and are censored because the third had not yet been served inside the analytic window. The within-user paired test (each student as their own control, n = 193 users) gives a +3.10 percentage-point lift over the same student's typical first-attempt accuracy (95% CI ±1.40, p < 0.001), identical to the between-subjects estimate, so the recovery effect isn't a user-ability artifact. Variant order within every vignette is randomized at authoring time, ruling out systematic content-difficulty confounding.

Substrate: 17,316 three-variant vignettes · 105,075 baseline first encounters · 25,184 cleaned single-failure recovery chains. Drawn from Ora's production database. Companion to F-2 (flashcard FSRS validation).
  • 79.9%: failed concepts recovered within 3 variants
  • 64.1%: first-recovery rate at E2 (next variant)
  • +3.1 pp: within-user causal lift vs own baseline (p < 0.001)
  • 25,184: cleaned recovery chains (of 40,601 E1 failures)

What happens after a student fails a vignette on Ora
N = 25,184 cleaned E1-failure chains · 17,316 three-variant vignettes
Outcome breakdown of every cleaned student-vignette failure chain, i.e., every case where a student answered an Ora vignette incorrectly on their first encounter and the scheduler subsequently served them at least one different variant of the same concept. The dark + light accent segments together (80%) are the recovered group.
  • 64.1%: right on next variant (E2), n = 16,138
  • 15.8%: failed E2, right on E3, n = 3,985
  • 9.8%: failed E2, E3 not yet served (censored), n = 2,474
  • 10.3%: failed all three variants, n = 2,593

Failing on Ora isn't a setback. It's the setup. 4 out of 5 cleaned failure chains end in mastery within the three variants of a concept. Because of the 9.8% censored slice, the 80% figure is a lower bound: those students failed E2 and were still inside our analytic window before a third variant had been served, and they are counted as not recovered, so any eventual recovery can only push the 80% higher. Cleaned dataset: vignettes with exactly 3 author-randomized variants; same-variant retakes removed; “right then served again” pairs (n = 329, scheduler bugs) excluded.
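
The slice percentages and the lower-bound reading can be checked with a few lines of arithmetic. The sketch below uses the counts reported in the chart with the stated N of 25,184 as the denominator. The variable names are illustrative, and the upper bound is a hypothetical best case (every censored chain eventually recovering), not a reported result.

```python
# Counts as reported in the outcome breakdown; the names are illustrative.
N_CHAINS = 25_184  # cleaned E1-failure chains (reported denominator)
counts = {
    "right_at_e2": 16_138,
    "right_at_e3": 3_985,
    "censored": 2_474,
    "failed_all_three": 2_593,
}


def pct(n: int, total: int = N_CHAINS) -> float:
    """Share of chains as a percentage, rounded to one decimal place."""
    return round(100 * n / total, 1)


shares = {name: pct(n) for name, n in counts.items()}

recovered = counts["right_at_e2"] + counts["right_at_e3"]
# Censored chains are counted as not recovered, so this is a lower bound.
lower_bound = pct(recovered)
# Hypothetical best case: every censored chain eventually recovers.
upper_bound = pct(recovered + counts["censored"])

print(shares)
print(f"recovered: at least {lower_bound}%, at most {upper_bound}%")
```

Under these assumptions the true recovery share sits somewhere between the two printed bounds, depending on how the censored chains resolve.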

Within-user paired test: the causal anchor

A between-subjects comparison can in principle be inflated by user-ability differences (do better students happen to land in the recovery cohort?). To rule this out, we computed two numbers for each of the 193 users who met the within-user threshold: their own first-encounter baseline across all eligible vignettes, and their own recovery rate on vignettes they initially failed. Mean own baseline: 57.9%. Mean own recovery at the next variant: 61.0%. Within-user lift: +3.10 percentage points (95% CI ±1.40, p < 0.001), identical to the between-subjects estimate. The same student, on concepts they failed once, performs measurably better on the next variant than they typically do on a fresh first encounter. The effect is the scheduler, not the cohort.
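
A paired comparison of this kind can be sketched as follows. The brief does not state which paired test or interval formula was used, so the normal-approximation interval (z = 1.96) is an assumption, as are the function name `paired_lift` and the toy inputs, which are not Ora data.

```python
import math
import statistics


def paired_lift(baseline: list[float], recovery: list[float]) -> tuple[float, float]:
    """Mean within-user difference (recovery - baseline) and 95% CI half-width.

    Each list holds one accuracy in [0, 1] per user, in the same user order.
    The interval uses a normal approximation (z = 1.96); that choice is an
    assumption of this sketch, not something the brief specifies.
    """
    if len(baseline) != len(recovery) or len(baseline) < 2:
        raise ValueError("need two equal-length lists with at least 2 users")
    diffs = [r - b for b, r in zip(baseline, recovery)]
    mean_diff = statistics.fmean(diffs)
    std_err = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return mean_diff, 1.96 * std_err


# Toy data for three hypothetical users (not Ora data).
lift, half_width = paired_lift([0.50, 0.60, 0.55], [0.60, 0.65, 0.60])
print(f"lift = {100 * lift:+.2f} pp, 95% CI ±{100 * half_width:.2f}")
```

Because each user contributes a single difference, stable between-user ability differences cancel out of the estimate, which is the point of the paired design.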

The spacing window: recovery beats baseline out to about 7 days
N = 25,359 cleaned single-failure recovery pairs, bucketed by Δt between E1 and E2
Recovery accuracy on the second variant, conditional on failing the first, broken out by the time gap between the two encounters. Dashed reference line is the 61.0% peer first-time baseline (i.e., the average accuracy of everyone on a fresh first encounter). Inside the scheduler's intended re-exposure window (under 7 days), recovery beats baseline by 3 to 19 points. Beyond two weeks, forgetting dominates and recovery falls below baseline, which is exactly why a scheduler is needed.
  • <12 h: n = 177
  • 12–24 h: n = 1,356
  • 1–3 d: n = 8,745
  • 3–7 d: n = 10,458
  • 7–14 d: n = 2,971
  • 14–30 d: n = 1,092
  • 30–60 d: n = 461
  • 60+ d: n = 99

Same cleaned cohort as the headline chart. The crossover at ~7 days isn't a flaw; it's the empirical signal that defines the optimal review window for clinical-vignette content. Bars above baseline (under 7 d) are the scheduler doing its job. Bars below baseline (over 14 d) are concepts the scheduler hasn't yet looped back to, and they provide the dose-response evidence that the timing of a re-exposure matters as much as the re-exposure itself. The 60+ d bucket has a small n (99), and the curve there flattens with cohort attrition rather than continuing to decay smoothly.
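
The bucketing behind this chart can be sketched as below. The bucket labels mirror the chart, but treating each upper edge as exclusive is an assumption, and `gap_bucket`, `recovery_by_bucket`, and the toy pairs are hypothetical names and data used only for illustration.

```python
# Upper edges in hours, paired with the bucket labels used in the chart.
# Treating each upper edge as exclusive is an assumption of this sketch.
BUCKETS = [
    (12, "<12 h"),
    (24, "12-24 h"),
    (72, "1-3 d"),
    (168, "3-7 d"),
    (336, "7-14 d"),
    (720, "14-30 d"),
    (1440, "30-60 d"),
]


def gap_bucket(gap_hours: float) -> str:
    """Map the E1-to-E2 time gap (in hours) to its chart bucket."""
    for upper, label in BUCKETS:
        if gap_hours < upper:
            return label
    return "60+ d"


def recovery_by_bucket(pairs):
    """pairs: iterable of (gap_hours, e2_correct). Returns {label: (rate, n)}."""
    tally = {}
    for gap, correct in pairs:
        label = gap_bucket(gap)
        hits, n = tally.get(label, (0, 0))
        tally[label] = (hits + int(correct), n + 1)
    return {label: (hits / n, n) for label, (hits, n) in tally.items()}


# Toy recovery pairs (not Ora data): (hours between E1 and E2, E2 correct?).
toy_pairs = [(6, True), (30, True), (30, False), (400, False)]
print(recovery_by_bucket(toy_pairs))
```

Comparing each bucket's rate against the first-encounter baseline then gives the above-or-below-baseline reading described in the chart caption.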

Why this matters: the spaced-rep QBank thesis, validated

Every other major USMLE-prep QBank (UWorld, AMBOSS, Kaplan, NBME self-assessments) treats vignettes as a library: students click through, get a result, and the system forgets. Ora is the first QBank that treats each vignette concept as a memory trace to be maintained on a spacing schedule, with multiple author-randomized variants per concept so the second and third exposures aren't trivial recognition retests. This brief is the first empirical evidence that the thesis works: not just “students who fail recover to mediocre,” but 4 out of 5 failed concepts end in mastery within 3 spaced variants, and the next-variant recovery rate (64.1%) is itself above the peer first-time baseline (61.0%). The textbook spacing-effect literature (Cepeda 2008 [1], Karpicke & Roediger 2008 [2], Bjork's “desirable difficulties” [3]) is overwhelmingly built on atomic facts and flashcards. We are reporting it for the first time on multi-step clinical-reasoning items at scale.

How we measured it

Cohort + filtering
  • Restricted to the 17,316 vignettes with exactly 3 non-suspended variants. Vignettes with 1 or 2 variants excluded because they can't generate the full failure-recovery chain.
  • Variant order randomized at authoring. Within a vignette, the three variants are content-equivalent draws, eliminating any systematic difficulty bias between “variant 1” and “variant 2.”
  • Same-variant retakes removed (user manually re-doing the same question, not spaced re-exposure).
  • “Right then served again” pairs excluded (n = 329): the scheduler should never re-serve a passed concept; those pairs are operationally bugs.
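
The filtering rules above can be sketched for a single student-vignette encounter history. The `Encounter` record and `clean_chain` function are hypothetical; Ora's production schema is not described in this brief. The sketch also assumes the history is already sorted by time and that any encounter served after a correct answer marks the chain as a "right then served again" case.

```python
from typing import NamedTuple, Optional


class Encounter(NamedTuple):
    """Hypothetical record of one served question."""
    variant_id: str  # which variant of the vignette was served
    correct: bool    # whether the student answered it correctly


def clean_chain(encounters: list[Encounter], n_variants: int) -> Optional[list[bool]]:
    """Return the correctness sequence [E1, E2, ...] or None if excluded.

    `encounters` is one student's history on one vignette, assumed to be
    sorted by time. `n_variants` is the vignette's non-suspended variant count.
    """
    if n_variants != 3:
        return None  # only three-variant vignettes are eligible
    seen: set[str] = set()
    chain: list[bool] = []
    for enc in encounters:
        if enc.variant_id in seen:
            continue  # same-variant retake, not a spaced re-exposure
        seen.add(enc.variant_id)
        chain.append(enc.correct)
    if any(chain[:-1]):
        return None  # "right then served again": treated as a scheduler bug
    if len(chain) < 2 or chain[0]:
        return None  # not an E1 failure with at least one further variant
    return chain


history = [Encounter("a", False), Encounter("a", True), Encounter("b", True)]
print(clean_chain(history, n_variants=3))
```

In this toy history the second encounter repeats variant "a" and is dropped, leaving a two-encounter chain: a failure on E1 followed by a correct answer on E2.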
Estimators
  • Headline 79.9%: of the 25,184 cleaned E1-wrong chains with at least one further encounter, the share where any subsequent variant (E2 or E3) was answered correctly.
  • Between-subjects E2 recovery (64.1%): compared against the global E1 baseline (61.0%, n = 105,075). The +3.1 pp gap is the cohort-level lift.
  • Within-subjects (paired): each of 193 users (those with ≥10 first encounters AND ≥10 recovery events) compared to themselves; mean within-user difference and 95% CI reported.
  • Dose-response: the E2 recovery rate further bucketed by the time gap between E1 and E2, showing the spacing window inside which recovery exceeds baseline.
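
The headline estimator can be sketched as a classification over cleaned chains. Each chain is assumed to be a list of booleans starting with a failed E1; the `classify` and `headline_recovery` names, the outcome labels, and the toy chains are illustrative rather than taken from Ora's codebase.

```python
from collections import Counter


def classify(chain: list[bool]) -> str:
    """Label one cleaned chain; chain[0] is E1, later entries are E2 and E3."""
    if len(chain) < 2 or chain[0]:
        raise ValueError("expected an E1-wrong chain with at least one further encounter")
    if chain[1]:
        return "right_at_e2"
    if len(chain) == 2:
        return "censored"  # failed E2, E3 not yet served
    return "right_at_e3" if chain[2] else "failed_all_three"


def headline_recovery(chains: list[list[bool]]) -> float:
    """Share of chains where E2 or E3 was answered correctly.

    Censored chains stay in the denominator and count as not recovered.
    """
    outcomes = Counter(classify(c) for c in chains)
    recovered = outcomes["right_at_e2"] + outcomes["right_at_e3"]
    return recovered / len(chains)


# Toy chains (not Ora data): three of the five recover.
toy_chains = [
    [False, True],
    [False, False, True],
    [False, False],
    [False, False, False],
    [False, True],
]
print(headline_recovery(toy_chains))
```

Keeping censored chains in the denominator is what makes the resulting share a conservative figure.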
Honest scope
  • Recognition-format multiple-choice items; not numerically comparable to free-recall flashcard or Ebbinghaus values.
  • Within-vignette concept, not cross-topic transfer. A separate brief addresses whether re-exposure to one concept lifts performance on adjacent ones.
  • Drawn from Ora's production database; the analytic window censors the largest spacing-gap buckets and the “E3 not yet served” slice of the headline chart.
Limitations

The headline 79.9% recovery, the 64.1% vs 61.0% between-subjects lift, and the within-user paired +3.10 pp estimate are derived from the same cleaned cohort and corroborate each other but are not independent samples. The 9.8% censored slice of the headline chart represents student-vignette pairs where the learner failed E2 and was still inside our analytic window when the third variant hadn't yet been served; we treat these as “not recovered” in the 80% number, which makes 80% a lower bound. The within-user analysis requires ≥10 first encounters AND ≥10 recovery events per user, restricting that estimator to engaged users; the between-subjects estimator does not require this and shows the same effect. Variant content is conceptually equivalent by design (same learning objective, different clinical vignette) and variant order was randomized at authoring time, but human authoring introduces small idiosyncratic variation that no scheduler-controlled experiment can fully eliminate; a future controlled randomization at serve time (rather than at authoring) would let us measure this residual. The recovery effect documented here is recovery to peer-baseline-or-better on the next encounter; the long-horizon durability of that recovery is the subject of the flashcard-substrate companion (F-2).

References

  1. Cepeda NJ, Vul E, Rohrer D, Wixted JT, Pashler H. Spacing effects in learning: a temporal ridgeline of optimal retention. Psychological Science. 2008;19(11):1095-1102. doi:10.1111/j.1467-9280.2008.02209.x.
  2. Karpicke JD, Roediger HL III. The critical importance of retrieval for learning. Science. 2008;319(5865):966-968. doi:10.1126/science.1152408.
  3. Bjork RA. Memory and metamemory considerations in the training of human beings. In: Metcalfe J, Shimamura A, eds. Metacognition: Knowing about Knowing. MIT Press; 1994:185-205. (The “desirable difficulties” framework.)
  4. Roediger HL III, Butler AC. The critical role of retrieval practice in long-term retention. Trends in Cognitive Sciences. 2011;15(1):20-27. doi:10.1016/j.tics.2010.09.003.
  5. Ebbinghaus H. Über das Gedächtnis. Leipzig: Duncker & Humblot; 1885. English translation: Memory: A Contribution to Experimental Psychology (Ruger & Bussenius, 1913). archive.org.
  6. Open Spaced Repetition Project. FSRS: Free Spaced Repetition Scheduler. github.com/open-spaced-repetition.