Ora's AI Tutor Answers Medical Student Questions More Accurately than Frontier Models GPT 5.5 (OpenAI), Claude Opus 4.8 (Anthropic), and Gemini 3.1 Pro (Google).
Ora AI Research Team. Blinded multi-model evaluation of production chat traffic.
We pulled 200 real, anonymized student questions (each about the specific vignette, flashcard, article, or video the student had open) together with the Ora replies those students actually received, and re-ran the same questions against three ungrounded frontier models (GPT 5.5, Claude Opus 4.8, Gemini 3.1 Pro). A blinded three-model panel then scored all 800 responses (200 questions × 4 systems) on five dimensions. Ora scored 4.5/5 on contextual relevance vs 3.5 for the ungrounded models, and 4.6 vs 4.2 on pedagogical fit; factual accuracy was near-ceiling and indistinguishable across all four systems; none of the 800 responses had a confirmed safety issue. Grounding's benefit lies in addressing the student's actual item, not in raw correctness.
Figure: Ora vs ungrounded LLMs (/5), across all 800 responses (200 questions × 4 systems × 3 graders × 5 dimensions).
What grounding changes
The procurement question for medical-education AI is "is this safe in our students' hands?" The consumer question is "why pay for Ora's AI when ChatGPT is free?" Both come down to one question: does grounding a response in a curated, physician-built corpus [1] beat an unaugmented frontier model on the same student query? The honest answer is nuanced.
Frontier models are already strong on direct medical knowledge. Our three ungrounded arms scored a mean 4.8/5 on factual accuracy, indistinguishable from Ora and consistent with the published ceiling for these models [2,4]. No system produced a single confirmed safety issue across 800 responses, after a blinded adjudicator applied an expert-physician rubric plus guideline search to every flagged case. For the institutional buyer, that is the reassuring headline: on this corpus, grounded and ungrounded answers alike were clinically safe.
Grounding's measurable edge is elsewhere. Asked about a specific item, the ungrounded models gave correct but generic answers that talked past it; Ora's reply engaged the actual vignette stem, flashcard, article, or video. All three graders independently scored Ora about a full point higher on contextual relevance (4.5 vs 3.5) and higher on pedagogical fit (4.6 vs 4.2), with strong agreement (quadratic-weighted kappa 0.80). The gap held across all four content types and was largest for vignettes and videos, the case or clip an ungrounded model cannot see.
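For readers unfamiliar with the agreement statistic quoted above, the following is a minimal sketch of quadratic-weighted Cohen's kappa for two graders on a 1 to 5 scale. The function names are illustrative, not Ora's code. The source does not say how the three graders' scores were pooled into the single 0.80 figure; averaging the three pairwise kappas, as `mean_pairwise_qwk` does, is one common choice and is an assumption here.

```python
from collections import Counter
from itertools import combinations


def quadratic_weighted_kappa(a, b, min_score=1, max_score=5):
    """Cohen's kappa with quadratic weights for two raters on an ordinal scale."""
    if len(a) != len(b) or not a:
        raise ValueError("ratings must be non-empty and the same length")
    k = max_score - min_score + 1
    n = len(a)
    observed = Counter(zip(a, b))        # joint counts of (rater A, rater B)
    marg_a, marg_b = Counter(a), Counter(b)  # each rater's marginal counts
    num = 0.0  # weighted disagreement actually observed
    den = 0.0  # weighted disagreement expected by chance
    for i in range(min_score, max_score + 1):
        for j in range(min_score, max_score + 1):
            # Quadratic weight: 0 on the diagonal, 1 at maximum distance.
            w = ((i - j) ** 2) / ((k - 1) ** 2)
            num += w * observed[(i, j)] / n
            den += w * (marg_a[i] / n) * (marg_b[j] / n)
    if den == 0:
        # Both raters gave the same constant score; kappa is formally
        # undefined, and this sketch reports it as perfect agreement.
        return 1.0
    return 1.0 - num / den


def mean_pairwise_qwk(graders):
    """Average kappa over every pair of graders (assumed pooling rule)."""
    pairs = list(combinations(graders, 2))
    return sum(quadratic_weighted_kappa(a, b) for a, b in pairs) / len(pairs)
```

Because the weights grow with the square of the distance between two scores, a 5-vs-1 disagreement is penalized far more than a 5-vs-4 one, which suits ordinal rubric scores.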
On real student queries, ungrounded frontier LLMs are factually strong and, in this evaluation, clinically safe. What they cannot do is see the specific item a student is studying. Grounding raises contextual relevance from 3.5 to 4.5 and pedagogical fit from 4.2 to 4.6, across every content type. Ora's grounded chat earns its keep by answering the student's actual question, not by out-scoring a frontier model on facts it already knows.
Method
- Source. 200 first-turn student questions, sampled deterministically from 8,631 content-grounded conversations, 50 each across the four content types.
- Anonymization. Names, schools, dates, URLs, and first-person clinical referents stripped; all 200 manually reviewed before any text left Ora.
- Ora arm. The verbatim production reply each student received, grounded in the linked item.
- Comparator arms. Three frontier models (GPT 5.5, Claude Opus 4.8, Gemini 3.1 Pro) answered each query with no system prompt and no grounded content: the unaugmented use a student gets from a chatbot.
- Access. Via Cursor-bundled models, not vendor APIs; behavior may differ slightly. Slugs snapshotted at evaluation time.
- Layer 1. A panel of three models, blinded to source, scored all 800 responses on five dimensions (1 to 5 each) plus issue flags.
- Layer 2. Flagged factual, safety, and citation cases went to a separate blinded adjudicator (Claude Opus 4.8 with literature search) for a binary ruling; relevance and fit are reported as raw consensus scores.
- Rubric. Pre-registered and locked before scoring.
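The sampling and blinding steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not Ora's pipeline: the record fields (`id`, `content_type`, `system`, `question_id`, `text`), the system labels, the hash salt, and the shuffle seed are all hypothetical, and the source does not say which deterministic rule was used. Ranking by a salted hash is one simple way to get a reproducible, order-independent sample.

```python
import hashlib
import random
from collections import defaultdict

CONTENT_TYPES = ("vignette", "flashcard", "article", "video")
SYSTEMS = ("ora", "gpt", "claude", "gemini")  # hypothetical labels


def stable_key(conversation_id, salt="eval-v1"):
    """Hash-based sort key: the same ID and salt always give the same rank."""
    return hashlib.sha256(f"{salt}:{conversation_id}".encode()).hexdigest()


def sample_stratified(conversations, per_type=50):
    """Pick `per_type` conversations per content type, deterministically."""
    by_type = defaultdict(list)
    for conv in conversations:
        by_type[conv["content_type"]].append(conv)
    sample = []
    for ctype in CONTENT_TYPES:
        ranked = sorted(by_type[ctype], key=lambda c: stable_key(c["id"]))
        if len(ranked) < per_type:
            raise ValueError(f"not enough {ctype} conversations")
        sample.extend(ranked[:per_type])
    return sample


def blind(responses, seed=0):
    """Shuffle responses and swap the system name for an opaque ID.

    Returns the blinded records for graders and a separate unblinding key.
    """
    rng = random.Random(seed)  # fixed seed keeps the shuffle reproducible
    shuffled = list(responses)
    rng.shuffle(shuffled)
    blinded, key = [], {}
    for n, resp in enumerate(shuffled):
        blind_id = f"R{n:04d}"
        key[blind_id] = resp["system"]
        blinded.append({
            "blind_id": blind_id,
            "question_id": resp["question_id"],
            "text": resp["text"],
        })
    return blinded, key
```

With 200 sampled questions and four systems, `blind` receives 800 responses; three graders scoring five dimensions each then yield 800 × 3 × 5 = 12,000 individual scores. The unblinding key stays out of the graders' inputs and is joined back only after scoring.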
This compares Ora's full production stack against unaugmented API-style calls, not against an alternative grounding method. Citation quality was uniformly weak across all four systems (1.9/5) with near-zero grader agreement, so it is not a differentiator. Safety was assessed by rubric, not student outcomes; "no confirmed issue" is not "no possible harm" [3]. The eval set consists of Ora users' queries and may not represent all medical-student AI use, and comparator behavior may change as frontier models advance.
References
1. Zakka C, Shad R, Chaurasia A, et al. Almanac: Retrieval-Augmented Language Models for Clinical Medicine. NEJM AI. 2024;1(2). doi:10.1056/AIoa2300068
2. Goodman RS, Patrinely JR, Stone CA Jr, et al. Accuracy and Reliability of Chatbot Responses to Physician Questions. JAMA Netw Open. 2023;6(10):e2336483. doi:10.1001/jamanetworkopen.2023.36483
3. Lee P, Bubeck S, Petro J. Benefits, Limits, and Risks of GPT-4 as an AI Chatbot for Medicine. N Engl J Med. 2023;388(13):1233-1239. doi:10.1056/NEJMsr2214184
4. Singhal K, Tu T, Gottweis J, et al. Toward expert-level medical question answering with large language models. Nat Med. 2025. doi:10.1038/s41591-024-03423-7