How studies are rated
Ratings use a simplified four-tier scheme (High, Moderate, Low, Very Low) informed by the GRADE working group's methodology. A rating reflects how confidently a study's findings can be applied - not the quality of the study authors or their work. A Very Low rating does not mean the study is bad; it often means the study is a pilot or case series, which is exactly what you want early in a field.
Current distribution
What each rating means
High certainty
Very confident that the true effect lies close to the estimate. Expected where multiple high-quality randomized controlled trials converge, across different sites and research groups, with minimal risk of bias, inconsistency, or indirectness. A single study rarely reaches High on its own - the exception is a large, multi-site, low-risk-of-bias trial; otherwise High reflects a body of evidence, not one paper.
Moderate certainty
Reasonably confident in the estimated effect; the true effect is likely close to the estimate but could plausibly differ. Typical for well-designed single RCTs with adequate samples, and for systematic reviews of heterogeneous primary studies.
Low certainty
Limited confidence. The true effect may be substantially different from the estimate. Common for small-sample RCTs, quasi-experimental designs, and qualitative studies - all of which contribute real knowledge, but cannot alone support firm conclusions.
Very low certainty
Very little confidence in any estimate of effect. Case studies, small pilots, and narrative or conceptual papers sit here. These studies are still valuable - they establish feasibility, surface questions, and ground later controlled work - but they are not evidence of effect.
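As a rough reading aid, the typical associations between study design and tier described above can be written as a lookup. This is only a sketch: the design labels and the `typical_tier` function are hypothetical names introduced here, not part of the hub, and actual ratings are assigned editorially, so a study can sit above or below the tier its design usually lands in.

```python
# Hypothetical lookup of the *typical* tier per study design, taken from
# the tier descriptions above. The labels are illustrative assumptions;
# real ratings are editorial and may differ from these defaults.
TYPICAL_TIER = {
    "converging_high_quality_rcts": "High",
    "well_designed_single_rct": "Moderate",
    "systematic_review_heterogeneous": "Moderate",
    "small_sample_rct": "Low",
    "quasi_experimental": "Low",
    "qualitative": "Low",
    "case_study": "Very Low",
    "small_pilot": "Very Low",
    "narrative_or_conceptual": "Very Low",
}


def typical_tier(design: str) -> str:
    """Return the tier a design typically lands in; raise if the label is unknown."""
    try:
        return TYPICAL_TIER[design]
    except KeyError:
        raise ValueError(f"unknown design label: {design!r}") from None


print(typical_tier("well_designed_single_rct"))  # Moderate
print(typical_tier("small_pilot"))               # Very Low
```

The lookup deliberately returns a default rather than a verdict: it captures where a design usually starts, not the judgment about sample size, population, and limitations that sets the final rating.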
Where High ratings come from
Of the 111 studies currently in the hub, 9 (8%) are rated High. High certainty almost always reflects a converging body of evidence rather than a single paper, so most of these are systematic reviews and meta-analyses that pool findings across many studies. A single primary study reaches High only as a rare exception - a large, multi-site, low-risk-of-bias trial; otherwise the most a standalone study earns here is Moderate.
That is why the spread sits where it does: 8% High, 36% Moderate, 35% Low, and 21% Very Low. VR for speech, voice, and communication work is still a young research field, so most of the evidence is made up of small, early, single-site studies - exactly what you would expect, and exactly what later controlled trials and reviews are built on. An honest distribution like this is more useful, and more trustworthy, than inflated ratings that would be easier on the eye.
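The published shares can be sanity-checked with a little arithmetic. The sketch below uses only figures quoted in the text (9 High studies out of 111, and the four percentages); the variable names are illustrative, and the per-tier counts for Moderate, Low, and Very Low are not stated on this page, so they are not assumed here.

```python
# Figures quoted in the text above.
total_studies = 111
high_count = 9
published_shares = {"High": 8, "Moderate": 36, "Low": 35, "Very Low": 21}

# 9 / 111 is about 8.1%, which rounds to the published 8%.
high_share = round(100 * high_count / total_studies)
print(high_share)  # 8

# The four rounded shares together account for the whole hub.
print(sum(published_shares.values()))  # 100
```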
What it takes to reach High
For a specific clinical question, a High rating needs a converging body of high-quality evidence - typically:
- A Cochrane review or equivalent systematic review of multiple high-quality RCTs in an in-scope area (VR for stuttering, voice work, aphasia, swallowing, social communication) with consistent effects across studies.
- Meta-analyses that synthesize several pre-registered RCTs with adequate sample sizes, minimal heterogeneity, and findings that converge in direction and magnitude.
- Multiple large multi-site RCTs (n > 200 per arm) that replicate a specific effect of VR-based practice on a communication outcome that matters to clients.
For most specific clinical claims in VR speech therapy, this depth of evidence is still building. One question on which the evidence is converging steadily is whether virtual audiences produce communicative responses comparable to real ones: new work is replicating earlier findings, and a formal systematic review of that question would be welcome.
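The third route in the list above (multiple large multi-site RCTs) is the only one with an explicit numeric threshold, so it can be sketched as a simple check. Everything named here is hypothetical: the `Trial` fields, the `meets_large_rct_route` function, and the reading of "multiple" as "at least two" are assumptions for illustration, and passing the check would not by itself earn a High rating, which remains an editorial decision.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    """Hypothetical summary of one RCT; field names are illustrative."""
    sites: int               # number of recruiting sites
    smallest_arm_n: int      # participants in the smallest arm
    replicates_effect: bool  # reproduces the specific effect in question


def meets_large_rct_route(trials: list[Trial], minimum_trials: int = 2) -> bool:
    """Check the 'multiple large multi-site RCTs (n > 200 per arm)' route.

    Assumption: 'multiple' is read as at least `minimum_trials` (default 2).
    """
    qualifying = [
        t for t in trials
        if t.sites > 1 and t.smallest_arm_n > 200 and t.replicates_effect
    ]
    return len(qualifying) >= minimum_trials


# Two multi-site trials above the threshold that both replicate the effect.
print(meets_large_rct_route([Trial(4, 250, True), Trial(3, 210, True)]))  # True
# Exactly 200 per arm does not satisfy n > 200.
print(meets_large_rct_route([Trial(4, 250, True), Trial(3, 200, True)]))  # False
```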
How ratings are decided
Each study's rating is assigned editorially by withVR, drawing on the paper's design (RCT / quasi-experimental / case / review), sample size, population, and the paper's own stated limitations. A short rationale accompanies each rating, visible on the study page by expanding "How this was rated."
Ratings reflect editorial judgment, not a formal GRADE assessment process of the kind used in Cochrane reviews. The scheme is simplified deliberately: four tiers are enough to signal how confidently clinicians and researchers should apply a finding, without implying a precision the editorial process does not have.
Corrections and suggestions welcome
If you believe a study has been rated incorrectly, or a study has been missed that warrants inclusion, send a note to hello@withvr.app. The scheme is intended to be transparent and correctable.
Further reading
- GRADE working group - the international collaboration that developed the methodology this scheme is informed by.
- GRADE: an emerging consensus on rating quality of evidence and strength of recommendations (Guyatt et al., BMJ 2008) - the foundational paper.
- Evidence Hub glossary - definitions of related terms (levels of evidence, risk of bias, PEDro, PICO).
Know of research that should be here? If a peer-reviewed study on VR in speech, voice, hearing, or communication work is not listed, send the reference to hello@withvr.app.