Three-participant feasibility case study of an Arabic-language VR public-speaking system with an automated stuttering-event detector
How this was rated
Case study with three participants in a single experimental session. The study makes a feasibility/proof-of-concept claim about Arabic-language VR + automated speech analysis, not a clinical-effect claim. The speech analyzer's prolongation-detection threshold was computed from a corpus of only three fluent female Saudi speakers, so it may not generalize across genders or dialects. No control condition; no comparison with clinician-rated stuttering severity; no longitudinal follow-up. The paper has no explicit funding disclosure or COI declaration.
Ratings use a simplified four-tier scheme (High, Moderate, Low, Very Low) informed by the GRADE working group.
A three-participant feasibility case study (two female, one male, ages 30-34) of an Arabic-language VR public-speaking system on a Samsung Gear VR + S6 phone, paired with an automated stuttering-event detector. Each participant completed one session reading from a virtual podium facing a virtual audience. Setup took 2-3 minutes; the number of automatically detected stuttering events correlated strongly with session length (R=0.95). The detector's output was not compared with manual clinician counts.
A 3-participant single-session feasibility case study of an Arabic-language VR public-speaking environment with an automated speech-analyzer module that detects prolongations, blockages, and repetitions via Google Cloud Speech-to-Text API. Useful as proof-of-concept for VR in an under-served language context (Arabic) and for the integration of automated speech analysis with VR; the sample (n=3, single session, single environment) cannot establish clinical effect. The mild-stuttering participant showing the highest detected stuttering rate raises questions about the speech-analyzer's calibration with respect to clinician-rated severity that the authors flag for future study.
Key findings
- Three participants (two female, one male; ages 30-34, M=32 SD=1.6) each completed a single session, with no repeat sessions; the system supports three audience-size levels (5, 8, 11 avatars) but the experiment used a single configuration per participant
- Strong positive correlation (R=0.95) between session length and the number of automatically detected stuttering events
- Participants reported anxiety and presence comparable to real-world public speaking; they also reported a 'mild uncanny valley effect' with the avatar characters
- Setup and preparation took 2-3 minutes per participant; session length ranged 1:40-2:25 minutes (participants exceeded the mean fluent recitation duration of 44.7±2.4 seconds by ~1:15 min)
- Counterintuitive finding flagged by the authors: the participant rated as MILD stuttering severity by the supervising SLP exhibited the HIGHEST detected stuttering-event rate (20.8%) while the SEVERE participant showed the LOWEST (4.8%); the moderate participant showed 8.6%. The authors note this 'suggests that VR may suit only individuals with higher stuttering severity. Additional data are required to validate this theory'
- Speech analyzer detected three disfluency types: prolongations (word duration exceeding a threshold derived from three fluent female speakers, who read the 74-word Arabic script aloud in a mean of 44.7±2.4 seconds), blockages (when the speech API returns null for an utterance, interpreted as non-speech vocal sounds), and repetitions (when the API transcribes a word more times than expected)
- Hardware/software: Samsung Gear VR headset on a Samsung S6 phone (Oculus-compatible Android VR glasses); Blender 3D modeling tool for scene characters; Mixamo + Unity 3D for animation and placement; Google Cloud Speech-to-Text Python client library with synchronous recognition (selected for its accuracy with under-resourced languages and Arabic-dialect support); Audacity for recording capture; Sony ICD-AX412F digital recorder with lavalier microphone
Background
Assessing speech fluency typically requires a clinician to manually count and classify each moment of stuttering during a conversation or reading task. This process is time-consuming, subjective, and can vary between observers. For people who stutter, the awareness of being closely monitored may also change how they speak. A second challenge is access: most stuttering-VR research has been conducted with English-speaking populations, with very limited equivalent work in Arabic. Al-Nafjan, Alghamdi, and Almudhi - working across three Saudi universities (Imam Muhammad bin Saud, King Saud, and King Khalid) - set out to address both challenges by developing an Arabic-language VR public-speaking environment with an integrated automated speech-analyzer.
What the researchers did
The team built a two-component system: (1) a VR component that places the participant at a virtual podium facing a virtual audience, supporting three audience-size configurations (5, 8, and 11 avatars at levels 1, 2, and 3 respectively), built in Blender for character modeling, Mixamo for animation, and Unity 3D for scene assembly, and rendered on a Samsung Gear VR headset (Oculus-compatible) running on a Samsung S6 Android phone; and (2) a speech-analyzer component that records the participant’s reading via a Sony ICD-AX412F digital recorder with lavalier microphone, segments the audio using Audacity by thresholding signal energy and spectral centroid, and transcribes each segment using the Google Cloud Speech-to-Text Python client library with synchronous recognition. The speech-analyzer flags three disfluency types:
- Prolongation: when a participant’s word duration exceeds a per-word threshold computed by averaging the same word’s duration across three fluent female reference speakers (74 Arabic words read in 44.7±2.4 seconds).
- Blockage: when the speech API returns a null transcription for an utterance, interpreted as a non-speech vocal sound produced during a stuttering block.
- Repetition: when the API transcribes a word more times than expected from the reference script.
The Stuttering Screening (SS) score is the sum of these three counts.
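The three rules and the SS sum can be illustrated with a minimal Python sketch. This is not the authors' code: the `Segment` structure, the function names, and the placeholder tokens and durations are all hypothetical, and the sketch assumes that a per-word duration is already available from the transcription step, which the summary above does not describe in detail. It also keys thresholds by word, so a word that occurs more than once in the script shares one threshold.

```python
from collections import Counter
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List, Optional


@dataclass
class Segment:
    """One transcribed audio segment (hypothetical structure, not from the paper)."""
    word: Optional[str]   # None stands in for a null transcription from the speech API
    duration_s: float     # segment duration in seconds


def per_word_thresholds(reference_readings: List[Dict[str, float]]) -> Dict[str, float]:
    """Average each word's duration across the fluent reference speakers."""
    words = reference_readings[0].keys()
    return {w: mean(reading[w] for reading in reference_readings) for w in words}


def stuttering_screening(segments: List[Segment],
                         thresholds: Dict[str, float],
                         script_words: List[str]) -> Dict[str, int]:
    """Count the three disfluency types and sum them into an SS score."""
    # Blockage: the API returned no transcription for the segment.
    blockages = sum(1 for s in segments if s.word is None)

    # Prolongation: a recognised word lasted longer than its reference threshold.
    prolongations = sum(
        1 for s in segments
        if s.word is not None
        and s.word in thresholds
        and s.duration_s > thresholds[s.word]
    )

    # Repetition: a script word was transcribed more often than the script contains it.
    expected = Counter(script_words)
    heard = Counter(s.word for s in segments if s.word is not None)
    repetitions = sum(max(0, heard[w] - expected[w]) for w in expected)

    return {
        "prolongations": prolongations,
        "blockages": blockages,
        "repetitions": repetitions,
        "ss_score": prolongations + blockages + repetitions,
    }


# Placeholder tokens and durations for illustration only (not study data).
references = [{"w1": 0.4, "w2": 0.5}, {"w1": 0.5, "w2": 0.6}, {"w1": 0.6, "w2": 0.7}]
thresholds = per_word_thresholds(references)
segments = [Segment("w1", 0.9), Segment(None, 0.4), Segment("w2", 0.3), Segment("w2", 0.3)]
result = stuttering_screening(segments, thresholds, ["w1", "w2"])
print(result)
```

In this sketch a single spoken word can contribute to more than one count (for example, a repeated word that is also prolonged); whether the published analyzer handles such overlaps the same way is not stated in the summary.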
Participants. Three Arabic-speaking adults who stutter were recruited from the supervising SLP’s (co-author Almudhi) clinical practice. Demographics: two female, one male; ages 30, 32, and 34 (mean 32, SD 1.6). Stuttering severity was rated by the SLP: P1 moderate (age 32), P2 mild (age 34), P3 severe (age 30). All were healthy with normal eyesight and no prior VR experience.
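As a quick check on the reported age statistics, the short Python sketch below recomputes the mean and standard deviation from the three ages listed above. The inference that the reported SD of 1.6 corresponds to the population formula (dividing by n) rather than the sample formula (dividing by n-1) is ours, not a statement from the paper.

```python
from statistics import mean, pstdev, stdev

ages = [30, 32, 34]  # ages of P3, P1, and P2 as listed in the summary

age_mean = mean(ages)          # arithmetic mean of the three ages
population_sd = pstdev(ages)   # divides by n; sqrt(8/3)
sample_sd = stdev(ages)        # divides by n-1; sqrt(8/2)

print(f"mean={age_mean}, population SD={population_sd:.2f}, sample SD={sample_sd:.2f}")
```

With only three values the two formulas differ noticeably, so the choice of formula matters when comparing this SD with other studies.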
Procedure. The experiment was a single session in an isolated room under the supervisor’s oversight. Participants put on a lavalier microphone connected to a Sony ICD-AX412F digital recorder and the Samsung Gear VR headset, adjusted their position until the text on the virtual podium was readable, and read the 74-word Arabic script aloud while facing the virtual audience. Setup/preparation took 2-3 minutes per participant; the actual reading session lasted 1:40-2:25 minutes. After the recording, the audio was segmented, transcribed, and analyzed; participants were then interviewed for subjective feedback.
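To put the session lengths in context, the sketch below converts the reported durations to seconds and compares them with the fluent reference reading of the same 74-word script. It assumes, for illustration only, that the whole reported session length was spent reading the script; the helper names are hypothetical.

```python
SCRIPT_WORDS = 74        # length of the Arabic script, from the summary
FLUENT_MEAN_S = 44.7     # mean fluent reading time in seconds, from the summary


def to_seconds(mm_ss: str) -> int:
    """Convert a 'minutes:seconds' string such as '1:40' to whole seconds."""
    minutes, seconds = mm_ss.split(":")
    return int(minutes) * 60 + int(seconds)


def words_per_minute(words: int, seconds: float) -> float:
    """Reading rate in words per minute."""
    return words / seconds * 60


shortest_s = to_seconds("1:40")
longest_s = to_seconds("2:25")

for label, secs in [("fluent reference", FLUENT_MEAN_S),
                    ("shortest session", shortest_s),
                    ("longest session", longest_s)]:
    extra = secs - FLUENT_MEAN_S
    print(f"{label}: {secs} s, {words_per_minute(SCRIPT_WORDS, secs):.1f} words/min, "
          f"{extra:.1f} s over the fluent mean")
```

Under that assumption, the participants' reading rates fall well below half of the fluent reference rate, which is the gap the prolongation threshold is designed to pick up word by word.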
What they found
Acceptability and presence (qualitative). Participants positively rated their VR experiences across aesthetic design, character design, and immersion. They reported acceptable resemblance between the VR scene and a real conference room, a “mild uncanny valley effect” with the avatar characters (a noted limitation of the character design), and similar emotional reactions (fear, anxiety) to those experienced in real-world public speaking activities. Subjectively, the supervising SLP observed no significant difference in participants’ speech prosody when using VR vs outside VR.
Speech-analyzer performance. A strong positive correlation was found between session length and automatically detected stuttering events (R=0.95). The authors interpret this as evidence of “acceptable performance of the speech analyzer in detecting stuttering events, especially prolongation instances.”
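For readers who want to see what such a coefficient measures, here is a minimal sketch of a Pearson correlation between session length and detected event count. The summary does not state which correlation statistic the authors used, so Pearson's r is an assumption, and the three data pairs are made-up placeholders, not values from the paper.

```python
from math import sqrt
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and the two root sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if dev_x == 0 or dev_y == 0:
        raise ValueError("correlation is undefined when one sequence is constant")
    return cov / (dev_x * dev_y)


# Placeholder values for illustration only (NOT data from the study).
session_lengths_s = [100, 120, 145]
detected_events = [4, 9, 11]
print(f"r = {pearson_r(session_lengths_s, detected_events):.2f}")
```

With three points, almost any monotonic pattern yields a coefficient close to 1, so a high value on n=3 says little about how well the detector agrees with a clinician's count.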
Counterintuitive severity-vs-detection result. Table 2 of the paper shows the participant-by-participant detected stuttering event percentages: P1 (moderate, 32y) 8.6%, P2 (mild, 34y) 20.8%, P3 (severe, 30y) 4.8%. That is, the participant rated as MILD by the clinician showed the HIGHEST detected stuttering rate, while the SEVERE participant showed the LOWEST. The authors flag this directly: “An interesting observation is that the participant with a mild stuttering severity exhibited a higher percentage of stuttering events. This observation suggests that VR may suit only individuals with higher stuttering severity. Additional data are required to validate this theory.” A reader might equally interpret this as a calibration/validity question about the automated detector vs the clinician rating, but the authors interpret it as a population-suitability question.
Setup feasibility. The 2-3 minute setup time per participant is offered as evidence that the system is feasible for clinical use.
Why this matters
This is among the very few VR-stuttering studies conducted in Arabic, addressing a significant under-representation in the field. It is also one of the relatively few studies that explicitly integrates an off-the-shelf cloud speech-recognition API with a VR environment to automatically detect stuttering events. The integration concept - reducing the manual-counting burden during stuttering assessment - is a real clinical need; whether the implementation works robustly is what this small case study can hint at (R=0.95 correlation with session length) but cannot establish (n=3, no comparison with clinician event-counts).
The severity-vs-detection observation is the most clinically interesting finding. With only 3 participants it is hypothesis-generating, not conclusive. It could reflect: (a) genuine population variation in how stuttering manifests during VR-based reading; (b) calibration issues with the prolongation threshold (derived from three fluent female speakers, applied across mixed-gender participants and varying severities); (c) test-retest variability that a single session cannot quantify; (d) statistical noise from n=3. Subsequent work would need to disentangle these.
For Therapy withVR: this study did not use, test, or evaluate Therapy withVR. The system was custom research software built by the authors. The Al-Nafjan paper is included in the Evidence Hub because it adds to the broader immersive-VR-for-stuttering evidence base and represents a rare Arabic-language contribution, not because it relates to Therapy withVR.
Limitations
The paper acknowledges some of these directly; others are inherent to the design:
- Sample size n=3, single session, single audience configuration per participant. The system supports three audience-size levels (5/8/11 avatars) but the experiment did not vary audience size within or between participants; the ‘graded hierarchy’ aspect of the system was not tested.
- No comparison condition. No non-VR baseline, no comparison with manual clinician event-counts, no test-retest.
- No longitudinal follow-up. Single session only.
- Speech-analyzer threshold derived from three fluent FEMALE speakers. Applied across mixed-gender participants; may not generalize across genders, dialects, or speech tempos.
- Counterintuitive severity-vs-detection finding (mild participant: highest detected rate; severe: lowest) raises the question of whether the automated detector tracks clinician judgment of severity; the authors note “additional data are required to validate this theory.”
- Mild uncanny valley effect reported by participants in the qualitative debrief - a flag for the avatar design.
- No explicit funding disclosure or COI declaration in the paper.
- VR hardware is the original Samsung Gear VR (2015-era mobile VR). Modern Quest-class hardware offers materially better visual fidelity and tracking.
Implications for practice
For Arabic-speaking clinicians considering technology-assisted stuttering assessment: this paper provides feasibility evidence that an off-the-shelf cloud speech-recognition API (Google Cloud Speech-to-Text) can be combined with a VR public-speaking environment to detect prolongations, blockages, and repetitions in Arabic-language stuttering assessment. The unexpected finding that the participant with the lowest clinician-rated severity showed the highest automated-detection rate is a caution against using such systems for severity rating without further calibration. Clinicians should treat the study as proof-of-concept for the technical pipeline (Arabic-language VR + automated speech analysis), not as evidence that VR reduces stuttering or that automated detection matches clinician judgment.
Where this connects to Therapy withVR
The study above is independent research and does not endorse any product. The notes below are commentary from withVR on how the themes in this research relate to features of Therapy withVR. The research findings are not claims about Therapy withVR.
Speech analysis integration (editorial parallel only)
The Al-Nafjan study integrated an off-the-shelf automated speech recognizer (Google Cloud Speech-to-Text) with the VR environment to detect prolongations, blockages, and repetitions in Arabic. The conceptual goal - reducing the burden of manual stuttering-event counting during sessions - is one that Therapy withVR's session logging can support in a different way (within its own design). Editorial parallel only; the studied system is custom research software, not Therapy withVR.
Adjustable audience size (editorial parallel only)
The Al-Nafjan VR system supports three audience-size configurations (5, 8, 11 avatars). The experiment used a single configuration per participant, but the system's hierarchy concept aligns with Therapy withVR's clinician-adjustable audience controls within its own design. Editorial parallel only.
Cite this study
If you reference this study in your work, the canonical citation formats are:
@article{alnafjan2021,
author = {Al-Nafjan, A. and Alghamdi, N. and Almudhi, A.},
title = {Virtual Reality Technology and Speech Analysis for People Who Stutter},
journal = {EMITTER International Journal of Engineering Technology},
year = {2021},
doi = {10.24003/emitter.v9i2.649},
url = {https://withvr.app/evidence/studies/al-nafjan-2021}
}
TY - JOUR
AU - Al-Nafjan, A.
AU - Alghamdi, N.
AU - Almudhi, A.
TI - Virtual Reality Technology and Speech Analysis for People Who Stutter
JO - EMITTER International Journal of Engineering Technology
PY - 2021
DO - 10.24003/emitter.v9i2.649
UR - https://withvr.app/evidence/studies/al-nafjan-2021
ER -
Know of research that should be in this hub? If a relevant peer-reviewed study is not listed here, send the reference to hello@withvr.app. The hub is kept up to date as the literature grows.
Funding & independence
The paper does NOT disclose any external funding source - there is no 'Funding' section in the paper. The Acknowledgments thank three project team members (Asmaa Albasha, Maryam Alghalban, Ola Semsemiah) 'for their hard work and dedication' along with the participating subjects. No COI declaration is included in the paper. Author affiliations: Abeer Al-Nafjan (Department of Computer Sciences, College of Computer and Information Sciences, Imam Muhammad bin Saud Islamic University, Riyadh, Saudi Arabia); Najwa Alghamdi (Department of Information Technology, College of Computer and Information Sciences, King Saud University, Riyadh, Saudi Arabia); Abdulaziz Almudhi (Department of Medical Rehabilitation Sciences, College of Applied Medical Sciences, and Speech Language Pathology Unit, King Khalid University, Abha, Saudi Arabia). The VR system was custom-developed by the authors using Blender, Unity 3D, and Mixamo, running on a Samsung Gear VR headset (Oculus-compatible) with a Samsung S6 phone; this is NOT Therapy withVR. The speech-analyzer used the Google Cloud Speech-to-Text Python client library. No withVR BV involvement in funding, study design, or authorship. Summary prepared independently by withVR using the published paper.