


Three-participant feasibility case study of an Arabic-language VR public-speaking system with an automated stuttering-event detector

Al-Nafjan A et al. · 2021 · EMITTER International Journal of Engineering Technology · Case Study · n = 3 · Arabic-speaking adults who stutter · DOI: 10.24003/emitter.v9i2.649

Evidence certainty: Very low certainty

How this was rated

Case study with three participants in a single experimental session. The study makes a feasibility/proof-of-concept claim about Arabic-language VR + automated speech analysis, not a clinical-effect claim. The speech-analyzer's threshold for prolongation detection was computed from a corpus of only three fluent speakers, all female and Saudi, so it may not generalize across genders or dialects. No control condition; no comparison with manual clinician event counts; no longitudinal follow-up. The paper has no explicit funding disclosure or COI declaration.

Ratings use a simplified four-tier scheme (High, Moderate, Low, Very Low) informed by the GRADE working group.

A three-participant feasibility case study (two female, one male, ages 30-34) of an Arabic-language VR public-speaking system on a Samsung Gear VR + S6 phone, paired with an automated stuttering-event detector. Each participant completed one session reading from a virtual podium facing a virtual audience. Setup time was 2-3 minutes; the number of automatically detected stuttering events correlated strongly with session length (R=0.95). The detector was not compared with manual clinician event counts.

Clinical bottom line

A 3-participant single-session feasibility case study of an Arabic-language VR public-speaking environment with an automated speech-analyzer module that detects prolongations, blockages, and repetitions using the Google Cloud Speech-to-Text API. It is useful as proof-of-concept for VR in an under-served language context (Arabic) and for the integration of automated speech analysis with VR; the sample (n=3, single session, single environment) cannot establish clinical effect. The mild-stuttering participant showed the highest detected stuttering rate, which raises questions about how the speech-analyzer's output relates to clinician-rated severity; the authors state that additional data are required.

Key findings

Three participants (two female, one male; ages 30-34, M=32 SD=1.6) each completed a single session; the system supports three audience-size levels (5, 8, 11 avatars), but the experiment used a single configuration per participant
Strong positive correlation (R=0.95) between session length and the number of automatically detected stuttering events
Participants reported anxiety and presence comparable to real-world public speaking; they also reported a 'mild uncanny valley effect' with the avatar characters
Setup and preparation took 2-3 minutes per participant; session length ranged 1:40-2:25 minutes (participants exceeded the mean fluent recitation duration of 44.7±2.4 seconds by ~1:15 min)
Counterintuitive finding flagged by the authors: the participant rated as MILD stuttering severity by the supervising SLP exhibited the HIGHEST detected stuttering-event rate (20.8%) while the SEVERE participant showed the LOWEST (4.8%); the moderate participant showed 8.6%. The authors note this 'suggests that VR may suit only individuals with higher stuttering severity. Additional data are required to validate this theory'
Speech analyzer detected three disfluency types: prolongations (word duration exceeding a threshold derived from three fluent female speakers averaging 74 Arabic words read aloud in 44.7±2.4 seconds), blockages (when the speech API returns null for an utterance, interpreted as non-speech vocal sounds), and repetitions (when the API transcribes a word more times than expected)
Hardware/software: Samsung Gear VR headset on a Samsung S6 phone (Oculus-compatible Android VR glasses); Blender 3D modeling tool for scene characters; Mixamo + Unity 3D for animation and placement; Google Cloud Speech-to-Text Python client library with synchronous recognition (selected for its accuracy with under-resourced languages and Arabic-dialect support); Audacity for recording capture; Sony ICD-AX412F digital recorder with lavalier microphone

Background

Assessing speech fluency typically requires a clinician to manually count and classify each moment of stuttering during a conversation or reading task. This process is time-consuming, subjective, and can vary between observers. For people who stutter, the awareness of being closely monitored may also change how they speak. A second challenge is access: most stuttering-VR research has been conducted with English-speaking populations, with very limited equivalent work in Arabic. Al-Nafjan, Alghamdi, and Almudhi - working across three Saudi universities (Imam Muhammad bin Saud, King Saud, and King Khalid) - set out to address both challenges by developing an Arabic-language VR public-speaking environment with an integrated automated speech-analyzer.

What the researchers did

The team built a two-component system. (1) A VR component places the participant at a virtual podium facing a virtual audience and supports three audience-size configurations (5, 8, and 11 avatars at levels 1, 2, and 3 respectively). It was built with Blender for character modeling, Mixamo for animation, and Unity 3D for scene assembly, and rendered on a Samsung Gear VR headset (Oculus-compatible) running on a Samsung S6 Android phone. (2) A speech-analyzer component records the participant’s reading via a Sony ICD-AX412F digital recorder with lavalier microphone, segments the audio using Audacity by thresholding signal energy and spectral centroid, and transcribes each segment using the Google Cloud Speech-to-Text Python client library with synchronous recognition. The speech-analyzer flags three disfluency types:

Prolongation: when a participant’s word duration exceeds a per-word threshold computed by averaging the same word’s duration across three fluent female reference speakers (74 Arabic words read in 44.7±2.4 seconds).
Blockage: when the speech API returns a null transcription for an utterance, interpreted as a non-speech vocal sound produced during a stuttering block.
Repetition: when the API transcribes a word more times than expected from the reference script.

The Stuttering Screening (SS) score is the sum of these three counts.
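The three rules and the SS score described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the Segment structure, the function names, and the toy English words and durations are all hypothetical, and the choice to count a prolongation and a repetition independently on the same segment is an assumption the summary does not confirm.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class Segment:
    """One transcribed audio segment (hypothetical structure)."""
    word: Optional[str]  # None models a null transcription from the speech API
    duration: float      # segment duration in seconds


def reference_thresholds(reference_durations):
    """Per-word threshold: mean duration of that word across fluent speakers."""
    return {w: sum(ds) / len(ds) for w, ds in reference_durations.items()}


def screening_counts(segments, thresholds, script):
    """Count prolongations, blockages, and repetitions; SS is their sum."""
    expected = Counter(script)  # how often each word appears in the script
    seen = Counter()
    prolongations = blockages = repetitions = 0
    for seg in segments:
        if seg.word is None:
            blockages += 1  # null transcription is treated as a block
            continue
        seen[seg.word] += 1
        if seg.word in expected and seen[seg.word] > expected[seg.word]:
            repetitions += 1  # word transcribed more often than scripted
        if seg.duration > thresholds.get(seg.word, float("inf")):
            prolongations += 1  # longer than the fluent reference mean
    return {
        "prolongations": prolongations,
        "blockages": blockages,
        "repetitions": repetitions,
        "ss_score": prolongations + blockages + repetitions,
    }


# Toy data (invented for illustration; the study used a 74-word Arabic script).
script = ["the", "quick", "fox"]
thresholds = reference_thresholds({
    "the": [0.20, 0.22, 0.24],
    "quick": [0.40, 0.42, 0.44],
    "fox": [0.30, 0.35, 0.40],
})
segments = [
    Segment("the", 0.21),
    Segment(None, 0.50),     # blockage
    Segment("quick", 0.90),  # prolongation
    Segment("quick", 0.40),  # repetition
    Segment("fox", 0.30),
]
counts = screening_counts(segments, thresholds, script)
print(counts)
```

Because the threshold is a plain mean of fluent durations, any reader who is slower than the reference speakers on a given word is flagged, which is one reason the choice of reference speakers matters.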

Participants. Three Arabic-speaking adults who stutter were recruited from the supervising SLP’s (co-author Almudhi) clinical practice. Demographics: two female, one male; ages 30, 32, and 34 (mean 32, SD 1.6). Stuttering severity was rated by the SLP: P1 moderate (age 32), P2 mild (age 34), P3 severe (age 30). All were healthy with normal eyesight and no prior VR experience.
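As a quick check on the reported descriptive statistics, the three ages can be run through both standard-deviation formulas. The sketch below uses only the ages given above; the observation that SD 1.6 corresponds to the population formula (dividing by n) rather than the sample formula (dividing by n-1) is arithmetic, not a statement from the paper.

```python
import statistics

ages = [30, 32, 34]  # P3, P1, P2 as listed above

mean_age = statistics.mean(ages)
population_sd = statistics.pstdev(ages)  # divides by n
sample_sd = statistics.stdev(ages)       # divides by n - 1

print(f"mean = {mean_age}")
print(f"population SD = {population_sd:.2f}")
print(f"sample SD = {sample_sd:.2f}")
```

The population formula gives about 1.63, matching the reported 1.6, while the sample formula gives 2.0.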

Procedure. The experiment was a single session in an isolated room under the supervisor’s oversight. Participants put on a lavalier microphone connected to a Sony ICD-AX412F digital recorder, put on the Samsung Gear VR headset, adjusted their position until the text on the virtual podium was readable, and read the 74-word Arabic script aloud while facing the virtual audience. Setup/preparation took 2-3 minutes per participant; the actual reading session lasted 1:40-2:25 minutes. After the recording, the audio was segmented, transcribed, and analyzed; participants were then interviewed for subjective feedback.
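The timing figures can be put on a common scale with a little arithmetic. The sketch below assumes the whole reported session length was spent reading the 74-word script, which the summary implies but does not state; the helper names are hypothetical.

```python
SCRIPT_WORDS = 74
FLUENT_SECONDS = 44.7  # mean fluent reference reading time


def to_seconds(mmss: str) -> int:
    """Convert a 'minutes:seconds' string such as '1:40' to seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


def words_per_minute(words: int, seconds: float) -> float:
    return words / seconds * 60


shortest = to_seconds("1:40")
longest = to_seconds("2:25")

fluent_wpm = words_per_minute(SCRIPT_WORDS, FLUENT_SECONDS)
slowest_wpm = words_per_minute(SCRIPT_WORDS, longest)
fastest_wpm = words_per_minute(SCRIPT_WORDS, shortest)

# How much longer than the fluent reference each extreme ran, in seconds.
excess_range = (shortest - FLUENT_SECONDS, longest - FLUENT_SECONDS)

print(f"fluent reference: {fluent_wpm:.1f} words/min")
print(f"participants: {slowest_wpm:.1f} to {fastest_wpm:.1f} words/min")
print(f"excess over reference: {excess_range[0]:.1f} to {excess_range[1]:.1f} s")
```

Under that assumption the sessions ran roughly 55 to 100 seconds longer than the fluent reference, or about 2.2 to 3.2 times its duration.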

What they found

Acceptability and presence (qualitative). Participants positively rated their VR experiences across aesthetic design, character design, and immersion. They reported acceptable resemblance between the VR scene and a real conference room, a “mild uncanny valley effect” with the avatar characters (a noted limitation of the character design), and similar emotional reactions (fear, anxiety) to those experienced in real-world public speaking activities. Subjectively, the supervising SLP observed no significant difference in participants’ speech prosody when using VR vs outside VR.

Speech-analyzer performance. A strong positive correlation was found between session length and automatically detected stuttering events (R=0.95). The authors interpret this as evidence of “acceptable performance of the speech analyzer in detecting stuttering events, especially prolongation instances.”
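To gauge how much weight a correlation from three participants can bear, one can compute its two-sided p-value. The sketch below assumes R is a Pearson coefficient computed over the three participants, which the summary does not specify. With n = 3 the usual test statistic has one degree of freedom, and Student's t with one degree of freedom is the standard Cauchy distribution, so the p-value has a closed form.

```python
import math


def pearson_p_value_n3(r: float) -> float:
    """Two-sided p-value for a Pearson r from exactly three data pairs.

    t = r * sqrt(n - 2) / sqrt(1 - r^2) with n = 3, so df = 1.
    For df = 1 the t distribution is standard Cauchy, whose CDF is
    0.5 + atan(t) / pi, giving p = 1 - (2 / pi) * atan(|t|).
    """
    n = 3
    t = abs(r) * math.sqrt(n - 2) / math.sqrt(1 - r * r)
    return 1 - (2 / math.pi) * math.atan(t)


# Assumption: the reported R = 0.95 is a Pearson r over n = 3 participants.
p = pearson_p_value_n3(0.95)
print(f"p = {p:.3f}")
```

Under these assumptions the p-value is about 0.20, so even a coefficient this high is compatible with chance when only three points are available.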

Counterintuitive severity-vs-detection result. Table 2 of the paper shows the participant-by-participant detected stuttering event percentages: P1 (moderate, 32y) 8.6%, P2 (mild, 34y) 20.8%, P3 (severe, 30y) 4.8%. That is, the participant rated as MILD by the clinician showed the HIGHEST detected stuttering rate, while the SEVERE participant showed the LOWEST. The authors flag this directly: “An interesting observation is that the participant with a mild stuttering severity exhibited a higher percentage of stuttering events. This observation suggests that VR may suit only individuals with higher stuttering severity. Additional data are required to validate this theory.” A reader might equally interpret this as a calibration/validity question about the automated detector vs the clinician rating, but the authors interpret it as a population-suitability question.
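The inversion can be made explicit with a rank correlation over the three reported values. This is an illustrative sketch only: coding severity as 1, 2, 3 and using Spearman's rho are choices made here, not analyses from the paper.

```python
# Severity coded as ordinal ranks (an assumption made for illustration).
SEVERITY_RANK = {"mild": 1, "moderate": 2, "severe": 3}

# Detected stuttering-event percentages as reported in the summary above.
participants = {
    "P1": ("moderate", 8.6),
    "P2": ("mild", 20.8),
    "P3": ("severe", 4.8),
}


def ranks(values):
    """Rank values from 1 (smallest) upward; assumes no ties."""
    order = sorted(values)
    return [order.index(v) + 1 for v in values]


def spearman_rho(x, y):
    """Spearman rank correlation for tie-free data."""
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n * n - 1))


severity = [SEVERITY_RANK[s] for s, _ in participants.values()]
detected = [rate for _, rate in participants.values()]
rho = spearman_rho(severity, detected)
print(f"Spearman rho = {rho}")
```

The result is a perfect inverse ranking (rho = -1). With three participants there are only six possible orderings, so a perfect inverse ranking would arise by chance one time in six, which is why the pattern is hypothesis-generating rather than conclusive.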

Setup feasibility. The 2-3 minute setup time per participant is offered as evidence that the system is feasible for clinical use.

Why this matters

This is among the very few VR-stuttering studies conducted in Arabic, addressing a significant under-representation in the field. It is also one of the relatively few studies that explicitly integrate an off-the-shelf cloud speech-recognition API with a VR environment to automatically detect stuttering events. The integration concept - reducing the manual-counting burden during stuttering assessment - addresses a real clinical need. Whether the implementation works robustly is something this small case study can hint at (R=0.95 correlation between session length and detected events) but cannot establish (n=3, no comparison with clinician event-counts).

The severity-vs-detection observation is the most clinically interesting finding. With only 3 participants it is hypothesis-generating, not conclusive. It could reflect: (a) genuine population variation in how stuttering manifests during VR-based reading; (b) calibration issues with the prolongation threshold (derived from three fluent female speakers, applied across mixed-gender participants and varying severities); (c) test-retest variability that a single session cannot quantify; (d) statistical noise from n=3. Subsequent work would need to disentangle these.

For Therapy withVR: this study did not use, test, or evaluate Therapy withVR. The system was custom research software built by the authors. The Al-Nafjan paper is included in the Evidence Hub because it adds to the broader immersive-VR-for-stuttering evidence base and represents a rare Arabic-language contribution, not because it relates to Therapy withVR.

Limitations

The paper acknowledges some of these directly; others are inherent to the design:

Sample size n=3, single session, single audience configuration per participant. The system supports three audience-size levels (5/8/11 avatars) but the experiment did not vary audience size within or between participants; the ‘graded hierarchy’ aspect of the system was not tested.
No comparison condition. No non-VR baseline, no comparison with manual clinician event-counts, no test-retest.
No longitudinal follow-up. Single session only.
Speech-analyzer threshold derived from three fluent FEMALE speakers. Applied across mixed-gender participants; may not generalize across genders, dialects, or speech tempos.
Counterintuitive severity-vs-detection finding (mild participant: highest detected rate; severe: lowest) raises the question of whether the automated detector tracks clinician judgment of severity; the authors note “additional data are required to validate this theory.”
Mild uncanny valley effect reported by participants in the qualitative debrief - a flag for the avatar design.
No explicit funding disclosure or COI declaration in the paper.
VR hardware is the original Samsung Gear VR (2015-era mobile VR). Modern Quest-class hardware offers materially better visual fidelity and tracking.

Implications for practice

For Arabic-speaking clinicians considering technology-assisted stuttering assessment: this paper provides feasibility evidence that an off-the-shelf cloud speech-recognition API (Google Cloud Speech-to-Text) can be combined with a VR public-speaking environment to detect prolongations, blockages, and repetitions in Arabic-language stuttering assessment. The unexpected finding that the participant with the lowest clinician-rated severity showed the highest automated-detection rate is a caution against using such systems for severity rating without further calibration. Clinicians should treat the study as proof-of-concept for the technical pipeline (Arabic-language VR + automated speech analysis), not as evidence that VR reduces stuttering or that automated detection matches clinician judgment.

Editorial notes from withVR

Where this connects to Therapy withVR

The study above is independent research and does not endorse any product. The notes below are commentary from withVR on how the themes in this research relate to features of Therapy withVR. The research findings are not claims about Therapy withVR.

Speech analysis integration (editorial parallel only)

The Al-Nafjan study integrated an off-the-shelf automated speech recognizer (Google Cloud Speech-to-Text) with the VR environment to detect prolongations, blockages, and repetitions in Arabic. The conceptual goal - reducing the burden of manual stuttering-event counting during sessions - is one that Therapy withVR's session logging can support in a different way (within its own design). Editorial parallel only; the studied system is custom research software, not Therapy withVR.

Adjustable audience size (editorial parallel only)

The Al-Nafjan VR system supports three audience-size configurations (5, 8, 11 avatars). The experiment used a single configuration per participant, but the system's hierarchy concept aligns with Therapy withVR's clinician-adjustable audience controls within its own design. Editorial parallel only.

Cite this study

If you reference this study in your work, the canonical citation formats are:

APA 7th

Al-Nafjan, A., Alghamdi, N., & Almudhi, A. (2021). Virtual Reality Technology and Speech Analysis for People Who Stutter. EMITTER International Journal of Engineering Technology. https://doi.org/10.24003/emitter.v9i2.649.

AMA 11th

Al-Nafjan A, Alghamdi N, Almudhi A. Virtual Reality Technology and Speech Analysis for People Who Stutter. EMITTER International Journal of Engineering Technology. 2021. doi:10.24003/emitter.v9i2.649.

BibTeX

@article{alnafjan2021,
  author = {Al-Nafjan, A. and Alghamdi, N. and Almudhi, A.},
  title = {Virtual Reality Technology and Speech Analysis for People Who Stutter},
  journal = {EMITTER International Journal of Engineering Technology},
  year = {2021},
  doi = {10.24003/emitter.v9i2.649},
  url = {https://withvr.app/evidence/studies/al-nafjan-2021}
}

RIS

TY  - JOUR
AU  - Al-Nafjan, A.
AU  - Alghamdi, N.
AU  - Almudhi, A.
TI  - Virtual Reality Technology and Speech Analysis for People Who Stutter
JO  - EMITTER International Journal of Engineering Technology
PY  - 2021
DO  - 10.24003/emitter.v9i2.649
UR  - https://withvr.app/evidence/studies/al-nafjan-2021
ER  -


Funding & independence

The paper does NOT disclose any external funding source - there is no 'Funding' section in the paper. The Acknowledgments thank three project team members (Asmaa Albasha, Maryam Alghalban, Ola Semsemiah) 'for their hard work and dedication' along with the participating subjects. No COI declaration is included in the paper. Author affiliations: Abeer Al-Nafjan (Department of Computer Sciences, College of Computer and Information Sciences, Imam Muhammad bin Saud Islamic University, Riyadh, Saudi Arabia); Najwa Alghamdi (Department of Information Technology, College of Computer and Information Sciences, King Saud University, Riyadh, Saudi Arabia); Abdulaziz Almudhi (Department of Medical Rehabilitation Sciences, College of Applied Medical Sciences AND Speech Language Pathology Unit, King Khalid University, Abha, Saudi Arabia). The VR system was custom-developed by the authors using Blender, Unity 3D, and Mixamo, running on a Samsung Gear VR headset (Oculus-compatible) with a Samsung S6 phone; this is NOT Therapy withVR. The speech-analyzer used the Google Cloud Speech-to-Text Python client library. No withVR BV involvement in funding, study design, or authorship. Summary prepared independently by withVR using the published paper.

Last reviewed: 2026-05-12 Next review due: 2027-05-12 Reviewed by: Gareth Walkom


Related Studies

VR job interviews show interviewer style affects stuttering frequency; %SS in VR correlates strongly with %SS in a clinical SSI-3 interview

Stuttering and anxiety responses in virtual audiences closely correspond to those in live audiences

VR audiences raise subjective distress but not physiological arousal or stuttering frequency in adult males who stutter

Bachelor's pilot of an early Samsung Gear VR public-speaking prototype with 6 adults who stutter: mixed results
