A paper lands in your inbox. Someone on your team says “look at this VR study, it sounds useful.” You want to know what to make of it before your next session or your next commissioning meeting. Where do you even start?
This is a short guide to reading a VR speech therapy study with a critical eye. Not a research methods course. Not a statistics primer. Just a practical set of questions a speech-language professional can hold in mind to tell the difference between a study that supports a clinical decision and a study that is interesting but not ready to change what you do.
Start with who, not what
Before anything else, read the Participants section. Who was in this study?
- How many participants? Five is a pilot. Fifteen is a small study. At around fifty, findings start to generalize. These are rough guides, not absolute rules.
- What population? Non-clinical university students? Adults who stutter recruited from a clinic? Children with language differences? The population shapes what the findings can tell you.
- Were participants paid or unpaid volunteers? How were they recruited and selected?
If the population in the study is very different from the people you see in clinic, the findings do not necessarily transfer. This is not a criticism of the study. It is a reminder that no single study answers every question, and evidence needs to be matched to the population you care about.
Understand what they actually compared
The next section worth reading is Design. What did the researchers compare?
- Within-subjects: every participant did every condition. Good for controlling individual differences. Can be exhausting for participants.
- Between-subjects: different participants did different conditions. Needs larger samples. Random assignment is important.
- Pre-post: participants measured before and after an intervention. Useful but vulnerable to practice effects, expectation effects, and regression to the mean unless there is a control.
- Randomized controlled trial: participants randomly assigned to intervention or control. Strongest design for causal claims, but rarer in early-stage work.
Ask yourself: if the intervention had no effect at all, is there any other reason the outcomes might have changed across conditions? If the answer is “yes, many reasons,” then the design is weak for a causal claim. A good study design rules out most alternatives.
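One such alternative reason is regression to the mean, listed above as a weakness of pre-post designs. It is also the least intuitive, so a small simulation may help. The Python sketch below is illustrative only: the score scale, the enrolment cutoff, and the function name are assumptions, and no intervention is modelled at all. Participants are enrolled because they scored high on a noisy screening measure, then simply measured again.

```python
import random
import statistics

def simulate_pre_post(n_screened=2000, cutoff=60.0, seed=0):
    """Enrol people who score high at screening, then re-measure them.

    Nothing happens between the two measurements, so any change in the
    group average is produced by selection plus measurement noise.
    """
    rng = random.Random(seed)
    pre_scores, post_scores = [], []
    for _ in range(n_screened):
        true_level = rng.gauss(50, 10)        # stable underlying anxiety (hypothetical scale)
        pre = true_level + rng.gauss(0, 10)   # noisy score at screening
        post = true_level + rng.gauss(0, 10)  # noisy score later, no intervention
        if pre >= cutoff:                     # only high scorers are enrolled
            pre_scores.append(pre)
            post_scores.append(post)
    return statistics.mean(pre_scores), statistics.mean(post_scores), len(pre_scores)

pre_mean, post_mean, n_enrolled = simulate_pre_post()
print(f"enrolled: {n_enrolled}")
print(f"mean before: {pre_mean:.1f}, mean after: {post_mean:.1f}")
```

The enrolled group's average falls on the second measurement even though nothing was done to them. A control group selected the same way would show the same drop, which is why a pre-post study without one cannot separate this artifact from a real effect.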
Look at what they measured
The Outcome measures section tells you what the researchers decided counted as evidence. This matters because different measures tell different stories.
- Self-report (questionnaires, Subjective Units of Distress Scale (SUDS) ratings, confidence ratings) captures the participant’s experience. High ecological meaning, but sensitive to expectations and demand characteristics.
- Observed behavior (conversational turns, speaking time) is closer to objective but still requires interpretation and often relies on human raters.
- Physiological (heart rate, skin conductance) is harder to fake but does not always map neatly onto felt experience.
- Acoustic (fundamental frequency, intensity, variability) measures voice signal properties directly, independent of self-report.
The most convincing VR validation studies combine measures. If anxiety goes up on SUDS and heart rate and voice measures shift consistently, that is stronger evidence than any single measure alone. Watch for studies that report only one type of measure - they tell a partial story.
Check whether the effect is actually large
A finding can be statistically significant and practically meaningless. This is a hard lesson. It happens because statistical significance depends on sample size: a tiny difference will be statistically significant if the sample is large enough.
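To make that concrete, here is a minimal Python sketch. It assumes two equal-sized groups and uses a large-sample normal approximation rather than an exact t-test, so the p-values are approximate. The function name and the sample sizes are illustrative and do not come from any study.

```python
import math
from statistics import NormalDist

def approx_p_value(d, n_per_group):
    """Approximate two-sided p-value for a standardized difference d
    between two equal-sized groups (normal approximation, an assumption)."""
    z = d * math.sqrt(n_per_group / 2)  # the test statistic grows with sample size
    return 2 * (1 - NormalDist().cdf(abs(z)))

# The same tiny effect (d = 0.1) at three hypothetical sample sizes
for n in (20, 200, 2000):
    p = approx_p_value(0.1, n)
    verdict = "significant" if p < 0.05 else "not significant"
    print(f"n per group = {n:>4}: p approx {p:.3f} ({verdict} at 0.05)")
```

The effect is identical in all three runs. Only the sample size changes, and with it the verdict.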
What you want is an effect size. Common ones in this literature:
- Cohen’s d: roughly, 0.2 is small, 0.5 is medium, 0.8 is large. Tiny d values (< 0.1) mean the effect is barely there even if “significant.”
- Correlation r: 0.1 small, 0.3 medium, 0.5 large. Values above 0.7 are striking.
- Partial eta squared (η²ₚ): 0.01 small, 0.06 medium, 0.14 large.
If a paper reports only p-values without effect sizes, that is a weakness. If it reports effect sizes, check them. A small p-value paired with a small effect size can still be clinically uninteresting even if the statistics are legitimate.
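When a paper gives group means and standard deviations but no effect size, Cohen's d can be worked out by hand. The Python sketch below uses the pooled-standard-deviation form for two independent groups. The ratings are made up for illustration, and `rough_label` is a hypothetical helper that reads the benchmarks above as range boundaries, which is an interpretive choice rather than a rule.

```python
import math
import statistics

def cohens_d(group_a, group_b):
    """Cohen's d for two independent groups, using the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(group_b)
    pooled_sd = math.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd

def rough_label(d):
    """Map |d| onto the rough 0.2 / 0.5 / 0.8 benchmarks (guides, not rules)."""
    size = abs(d)
    if size < 0.2:
        return "below small"
    if size < 0.5:
        return "small"
    if size < 0.8:
        return "medium"
    return "large"

# Made-up anxiety ratings on a 0-100 scale, for illustration only
control = [62, 58, 70, 65, 55, 68, 60, 63]
vr_practice = [57, 52, 66, 61, 50, 64, 58, 56]

d = cohens_d(control, vr_practice)
print(f"d = {d:.2f} ({rough_label(d)})")
```

With only eight people per group, the estimate of d is itself very uncertain, so a large value from a sample this small is a reason for a bigger study rather than a settled finding.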
Read the limitations section (seriously)
Authors know their own studies’ limitations better than you do. Read what they say. A good limitations section will tell you:
- What the sample size limits
- What the population limits (who the findings may not apply to)
- What the design cannot rule out
- What the follow-up period does or does not tell you about long-term effects
If a paper’s limitations section is a single throwaway paragraph, treat the findings cautiously. If the authors have thought carefully about what their study can and cannot tell us, give the paper more weight.
Distinguish feasibility from effect
A lot of early VR research is about feasibility rather than effect. A feasibility study asks: “can this be done at all? Will participants tolerate it? Does the equipment work as intended?” These are legitimate research questions, and the findings can be informative - but they are not evidence that the intervention works.
A feasibility study with five participants showing anxiety decreased across a week tells you that a week of practice is feasible. It does not tell you that VR caused the change. Other things could have - practice effects, expectation, the researcher’s attention, regression to the mean.
When you see a small-sample pre-post VR study with favorable results, ask: “is this a pilot telling me the idea is worth a bigger study, or is this being presented as evidence of effect?” The first is useful. The second would be overclaiming.
Ask about generalization honestly
Most VR studies measure responses inside the virtual environment. Fewer measure whether gains transfer to real-world situations. And yet what clients usually want is change in real life, not in a virtual room.
Questions to hold:
- Did the study measure anything outside the VR setting?
- Were there follow-up measurements after the VR sessions ended?
- Did participants report changes in their everyday speaking experiences?
If none of these are present, the study cannot tell you much about real-world transfer. That is not a flaw - it is a limitation of scope. But it matters when you are deciding what a study supports.
Check who funded the study
The Funding and Conflicts of Interest declarations are worth reading. Independent funding from research councils, universities, or government bodies is different from industry funding or a study conducted by a company on its own product.
Neither kind of funding automatically invalidates a study. But knowing who paid for it and who has a financial stake in its results helps you weigh the findings. A study on virtual audiences funded by a research council carries different weight than a study on a specific VR product conducted by that product’s company.
A short checklist
If a VR speech therapy study comes across your desk, these six questions will get you most of the way:
- Who was studied? Sample size and population. 5 = pilot, 15 = small, 50+ = starting to generalize.
- What was the design? Within / between / pre-post / RCT. What alternatives does it rule out?
- What was measured? Self-report, behavior, physiology, acoustics. Multiple measures = stronger.
- How big is the effect? Cohen's d, r, or partial eta squared - not just the p-value.
- What did authors flag? Read the limitations section seriously. A thin one is itself a signal.
- Did it test transfer? Was anything measured outside the VR setting? Real-world transfer is the clinical question.
None of this requires a statistics background. It requires slowing down and asking the questions authors usually answer in plain language somewhere in the paper.
Further reading
- Evidence Hub - peer-reviewed research on VR in speech therapy, with plain-language summaries
- How studies are rated - the certainty scheme used across the Evidence Hub
- Evidence Hub glossary - definitions of research terms used in these studies
- Further reading - books and communities that shape current practice
- Technology Checklist for SLPs - broader framework for evaluating new technology
