Pilot of Immersive VoiceSpace VR (N=17, vocally healthy plus people with dysphonia) - participants scaled loudness and pitch across graded virtual restaurant conditions

Daşdöğen Ü · 2026 · Journal of Voice · Experimental · n = 17 · Seventeen adults recruited at Mount Sinai... · DOI
Evidence certainty: Low certainty
How this was rated

Peer-reviewed in Journal of Voice (Elsevier), IRB-approved (Mount Sinai STUDY-25-01418), linear mixed-effects analysis with random intercept for subject and Kenward-Roger degrees of freedom - a defensible analytic frame for a pilot. Strengths: includes a clinical population (dysphonia) rather than vocally healthy adults only; per-participant baseline-relative dB thresholds remove absolute-SPL confounding; behavioral pattern was consistent across both groups for SPL. Limitations that hold certainty low: small total N (17) with only 7 in the atypical group; single session and single context (a lightly populated virtual restaurant); no control group or comparator condition; baseline collected outside the headset, which confounds VR-exposure with task-demand effects; restaurant ambient audio was deliberately muted, limiting ecological realism and external validity; primary feasibility instrument was author-developed and not yet validated; sole-author study with no inter-rater reliability work reported; significant conflict of interest - the author invented IVS and holds a US patent application on the technology (sole listed inventor). The work establishes feasibility and signal, not efficacy. Replication in larger multisite samples with control comparators is needed before clinical-decision use.

Ratings use a simplified four-tier scheme (High, Moderate, Low, Very Low) informed by the GRADE working group.

A within-subjects pilot of Immersive VoiceSpace (IVS), a custom VR voice-training platform developed by the sole author. Seventeen adults (10 vocally healthy speakers and 7 people with dysphonia) completed a menu-ordering task in a virtual restaurant under four conditions - a baseline plus three graded IVS levels manipulating avatar distance, voice-activation thresholds, and walkaway timeouts. Sound pressure level and mean speaking f0 increased significantly across IVS levels in both groups; pitch flexibility was more constrained in the dysphonia group. Feasibility ratings were good overall (4.0/5), with comfort and safety excellent (4.5/5) and no cybersickness reported.

Clinical bottom line

First published feasibility and proof-of-concept evidence for Immersive VoiceSpace (IVS), a custom voice-responsive VR platform invented and patented by the sole author at Mount Sinai. In a single-session, within-subjects pilot with 17 adults (10 vocally healthy plus 7 people with dysphonia, including 2 trans women in gender-affirming voice care), graded virtual restaurant conditions produced systematic, progressive increases in sound pressure level (SPL) and mean speaking f0. Both groups followed the same SPL pattern; the dysphonia group showed flatter pitch scaling as task demands rose. Participants rated comfort and safety as excellent; no cybersickness, no adverse events. The study is limited by small sample (N=17, atypical n=7), single context (restaurant), single session, sole author with significant conflict of interest as inventor and patent holder, and a deliberately muted audio scene that constrains ecological validity. The findings support feasibility and preliminary construct validity for voice-responsive VR as a contextualized practice tool, but do not yet establish therapeutic efficacy or generalization to real-world voice use - both of which require follow-on multisession studies in clinical populations with control comparators.

Key findings

  • 17 adults (10 vocally healthy, 7 with dysphonia: presbyphonia, vocal fold polyp, vocal fold paresis, muscle tension dysphonia, and 2 trans women in gender-affirming voice care) completed a single-session within-subjects protocol
  • Equipment: Oculus Quest 3 head-mounted display running the IVS application; AKG C520 condenser microphone at 7 cm from the mouth (calibrated to 30 cm reference); recordings via Computerized Speech Lab (CSL) at 44.1 kHz / 16 bit
  • Four conditions: Baseline (research-team listener at ~2 m in the clinical room) plus three IVS levels in a virtual restaurant, presented in randomized order - Normal (waiter at 5 m, +3 dB above each participant's own baseline, 5 s timeout), Effortful (10 m, +5 dB, 10 s), Calling (15 m, +10 dB, 20 s). Restaurant ambient audio was muted to isolate visual-spatial effects
  • SPL main effect of IVS Level was significant: F(3, 48) = 33.94, p < 0.001. Relative to Baseline, SPL increased by 3.83 dB (Normal), 7.41 dB (Effortful), and 9.04 dB (Calling), all p < 0.001
  • Mean speaking f0 main effect of IVS Level was significant: F(3, 45) = 17.63, p < 0.001. Stepwise increases from Baseline of approximately 36 Hz (Normal, p = 0.008), 66.6 Hz (Effortful, p < 0.001), and 103.9 Hz (Calling, p < 0.001)
  • Group main effects: people with dysphonia produced lower SPL overall (estimate -6.88 dB, p = 0.001) and lower mean f0 overall (p = 0.002) than vocally healthy speakers
  • Significant IVS Level x Group interaction for mean f0 only: F(3, 45) = 3.94, p = 0.014. Pitch scaling diverged in the more demanding conditions - the gap between groups was non-significant at Baseline (p = 0.102), approached significance at Normal (p = 0.055), and was significant at Effortful (p = 0.003) and Calling (p < 0.001). SPL interaction was non-significant and dropped from the final model - both groups raised loudness in parallel
  • Feasibility (1-5 Likert): Usability & Interaction 3.9 (moderate-good), Immersion & Realism 3.4 (moderate, lowest domain), Engagement & Perceived Benefit 4.0 (good), Comfort & Safety 4.5 (excellent). Overall 4.0 (good)
  • No adverse events. No cybersickness reported. No technical interruptions across the protocol. Average ~2 minutes to reconfigure difficulty parameters between trials. Full session including instructions and questionnaires lasted ~20 minutes per participant
  • Open-text feedback flagged the limited avatar responsiveness as a key constraint - participants asked for verbal responses, facial expressions, and conversational gestures to deepen interaction realism

Background

Voice change is a motor-learning problem, not just a knowledge problem. Behavioral voice therapy is effective for many voice conditions, but gains in the clinic often fail to carry over to everyday communication. The motor-learning literature is clear about why: durable change depends on practicing under conditions that resemble the target context, not just performing the behavior in a structured session. The Specificity of Learning Principle, Transfer-Appropriate Processing, and Encoding Specificity all converge on the same point - when the sensory and contextual demands of practice match the demands of real use, transfer is stronger.

Real-world voice use happens under layered demands: communicative intent, listener distance, social-emotional pressure, room size, background acoustics, and visual-spatial cues that signal how much voice is needed before a person even speaks. Conventional clinic rooms intentionally minimize these variables, which serves initial acquisition but underrepresents the very cues that learning theory says generalization depends on.

Immersive virtual reality offers a controlled way to put those cues back in. Daşdöğen’s 2023 multisensory study (in this Hub) established that visual and audiovisual VR cues drive measurable vocal adaptations in vocally healthy adults, beyond what acoustic simulation alone produces. The 2026 trained-singers study (also in this Hub) extended that work by comparing expert and untrained speakers. The present study takes the next step and asks two questions: does the same effect hold up in a clinical voice population, and is a custom voice-responsive VR platform feasible to use in that population?

What the researchers did

A within-subjects pilot at Mount Sinai with 17 adults: 10 vocally healthy speakers recruited from otolaryngology clinic and hospital staff, and 7 people with dysphonia recruited during routine voice-assessment visits (diagnoses included presbyphonia, vocal fold polyp, vocal fold paresis, muscle tension dysphonia, and gender-affirming voice care).

The intervention was Immersive VoiceSpace (IVS) - a custom VR platform developed by the sole author. IVS rendered a lightly populated virtual restaurant on an Oculus Quest 3 headset. A waiter non-player character served as the listener target. The waiter responded in real time to the participant’s voice: if voice intensity met a preset threshold, the waiter approached and stayed in a listening posture; if it dropped below threshold for longer than a set timeout, the waiter walked away.
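The threshold-and-timeout behavior described above can be sketched as a small state machine. This is a minimal illustration only, not the IVS implementation: the class name, state labels, frame-based update, and the 65 dB baseline in the usage note are all hypothetical, and the sketch assumes the waiter resumes listening whenever the voice meets the threshold again.

```python
from dataclasses import dataclass


@dataclass
class WaiterGate:
    """Hypothetical listener gate; names and structure are illustrative, not IVS code."""

    baseline_db: float        # participant's own baseline SPL
    offset_db: float          # required level above baseline (e.g. +3 dB)
    timeout_s: float          # time allowed below threshold before walking away
    below_for_s: float = 0.0  # how long the voice has been below threshold
    state: str = "idle"       # "idle", "listening", or "walked_away"

    @property
    def threshold_db(self) -> float:
        return self.baseline_db + self.offset_db

    def update(self, level_db: float, dt_s: float) -> str:
        """Advance the gate by one audio frame lasting dt_s seconds."""
        if level_db >= self.threshold_db:
            # Voice meets the threshold: waiter approaches or keeps listening.
            self.below_for_s = 0.0
            self.state = "listening"
        elif self.state == "listening":
            # Voice dropped: count time below threshold against the timeout.
            self.below_for_s += dt_s
            if self.below_for_s > self.timeout_s:
                self.state = "walked_away"
        return self.state
```

For example, `WaiterGate(baseline_db=65.0, offset_db=3.0, timeout_s=5.0)` would require 68 dB to trigger the listening posture and would switch to `"walked_away"` once the voice had stayed below 68 dB for more than 5 seconds.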

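Three parameters were graded across conditions:

  • Avatar distance: waiter at 5 m (Normal), 10 m (Effortful), 15 m (Calling)
  • Voice-activation threshold: +3 dB (Normal), +5 dB (Effortful), +10 dB (Calling) above each participant's own baseline
  • Walkaway timeout: 5 s (Normal), 10 s (Effortful), 20 s (Calling)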

The speech task across all four conditions was the same: “Order a drink, an appetizer, an entrée, and a dessert.” The Baseline condition was performed with a research-team member acting as the listener in the clinical room at ~2 m. The three IVS conditions were performed in the virtual restaurant in randomized order.

To isolate visual-spatial effects, the restaurant’s ambient audio (background conversation and utensils, which IVS can play) was muted across all experimental conditions. Acoustic recording was made through an AKG C520 head-mounted condenser microphone at 7 cm from the mouth, calibrated to a 30 cm reference, captured at 44.1 kHz / 16 bit via Computerized Speech Lab (CSL).
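To give a sense of scale for the 7 cm to 30 cm calibration, the sketch below computes the level difference that the inverse-distance law would predict between those two distances. The summary does not say how the calibration was actually performed, so the free-field point-source assumption and the function name are illustrative only.

```python
import math


def distance_correction_db(mic_cm: float, ref_cm: float) -> float:
    """dB to add to a level measured at mic_cm to express it at ref_cm.

    Assumes a free-field point source (level falls 6 dB per doubling of distance).
    """
    return 20.0 * math.log10(mic_cm / ref_cm)


correction = distance_correction_db(7.0, 30.0)
print(f"{correction:.2f} dB")  # about -12.64 dB under this assumption
```

Under that assumption a reading at 7 cm sits roughly 12.6 dB above the equivalent 30 cm level. In practice, calibration is often done empirically against a sound level meter rather than by formula.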

Outcomes: sound pressure level (SPL, dB) and mean speaking fundamental frequency (mean f0, Hz), each extracted from CSL and analyzed in separate linear mixed-effects models with a random intercept for subject. Fixed effects were Group (Typical, Atypical) and Task Condition (Baseline, Normal, Effortful, Calling). The Group × Task Condition interaction was retained for mean f0 (significant) and dropped from the final SPL model (non-significant). Fixed effects were evaluated with Type III sums of squares and Kenward-Roger approximated degrees of freedom; pairwise contrasts used estimated marginal means with Tukey correction.
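The model structure described above can be written out as follows. This is a reconstruction from the wording of this summary, not an equation taken from the paper; the symbols (G for the Atypical-group indicator, C for the three non-Baseline condition indicators) are assumptions chosen for illustration.

```latex
% i = participant (1..17), j = condition (Baseline, Normal, Effortful, Calling)
y_{ij} = \beta_0 + \beta_G\,G_i + \sum_{k=1}^{3} \beta_k\,C_{jk}
       + \sum_{k=1}^{3} \gamma_k\,G_i\,C_{jk} + u_i + \varepsilon_{ij},
\qquad u_i \sim \mathcal{N}(0,\sigma_u^2),\quad
\varepsilon_{ij} \sim \mathcal{N}(0,\sigma^2)

% Denominator df if every participant contributed all four conditions:
% SPL (interaction dropped, all \gamma_k = 0):   (17 - 1)(4 - 1) = 48
% mean f0 (interaction retained):                (17 - 2)(4 - 1) = 45
```

The two denominator values match the reported F(3, 48) for SPL and F(3, 45) for mean f0, which is what a complete, balanced dataset would give. That reading is an inference from the reported statistics, not something the summary states.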

A 5-point Likert questionnaire (author-developed, not yet validated) captured four domains after the session: Usability and Interaction, Immersion and Realism, Engagement and Perceived Benefit, Comfort and Safety. Domain scores were averaged; an overall feasibility index was the mean of the four domains. Open-text feedback was reviewed descriptively.
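The index calculation is simple enough to reproduce from the reported domain means. The sketch below uses the one-decimal domain scores given in this summary, so it only approximates what the unrounded participant data would yield; the variable names are illustrative.

```python
from statistics import mean

# Domain means as reported in this summary (1-5 Likert, rounded to one decimal).
domain_scores = {
    "Usability and Interaction": 3.9,
    "Immersion and Realism": 3.4,
    "Engagement and Perceived Benefit": 4.0,
    "Comfort and Safety": 4.5,
}

# Overall feasibility index: unweighted mean of the four domain scores.
overall_index = mean(domain_scores.values())
lowest_domain = min(domain_scores, key=domain_scores.get)

print(f"Overall feasibility index: {overall_index:.2f}")
print(f"Lowest domain: {lowest_domain}")
```

The unweighted mean of the rounded domain scores is about 3.95, in line with the reported overall index of 4.0.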

What they found

Sound pressure level. A significant main effect of IVS Level: F(3, 48) = 33.94, p < 0.001. Relative to Baseline, SPL increased by 3.83 dB at Normal, 7.41 dB at Effortful, and 9.04 dB at Calling (all p < 0.001). Normal-to-Effortful and Normal-to-Calling pairwise contrasts were significant; the 1.63 dB step from Effortful to Calling was not (p = 0.450), suggesting a ceiling-like pattern at the highest demand level. The Group main effect was also significant: people with dysphonia produced about 6.88 dB lower SPL on average than vocally healthy speakers. The Group × Level interaction was non-significant and was therefore dropped from the final SPL model - both groups raised loudness in parallel as task demands escalated.
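The step sizes between IVS levels follow directly from the baseline-relative estimates, and converting the decibel gains to sound-pressure ratios helps show how large they are physically. This is a worked-arithmetic sketch using only the figures reported above; the function names are illustrative, and the differences are simple subtractions of the point estimates, not a re-run of the Tukey-corrected contrasts.

```python
# Baseline-relative SPL estimates reported in this summary (dB).
spl_gain_db = {"Normal": 3.83, "Effortful": 7.41, "Calling": 9.04}


def pressure_ratio(delta_db: float) -> float:
    """Sound-pressure ratio implied by a level difference in dB."""
    return 10 ** (delta_db / 20)


def step_db(lower: str, upper: str) -> float:
    """Difference between two baseline-relative estimates, in dB."""
    return round(spl_gain_db[upper] - spl_gain_db[lower], 2)


pairs = [("Normal", "Effortful"), ("Normal", "Calling"), ("Effortful", "Calling")]
for lower, upper in pairs:
    print(f"{lower} -> {upper}: {step_db(lower, upper):.2f} dB")

for level, gain in spl_gain_db.items():
    print(f"{level}: x{pressure_ratio(gain):.2f} sound pressure vs Baseline")
```

The steps come out at 3.58 dB, 5.21 dB, and 1.63 dB. The last of these is the Effortful-to-Calling difference that did not reach significance.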

Mean speaking f0. A significant main effect of IVS Level: F(3, 45) = 17.63, p < 0.001. Stepwise increases relative to Baseline (intercept ≈ 201.8 Hz for the typical group) of about 36 Hz at Normal (p = 0.008), 66.6 Hz at Effortful (p < 0.001), and 103.9 Hz at Calling (p < 0.001). The Group main effect was significant, but the Level × Group interaction was also significant: F(3, 45) = 3.94, p = 0.014. Decomposing the interaction: at Baseline the groups did not differ in mean f0 (p = 0.102); at Normal the difference approached but did not reach significance (p = 0.055); at Effortful (p = 0.003) and Calling (p < 0.001) the gap was significant and grew with task demand. The dysphonia group raised pitch with task demands, but to a smaller extent than the vocally healthy group.
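Because pitch is perceived on a logarithmic scale, it can help to express the Hz increases in semitones. The sketch below does this for the vocally healthy (typical) group, taking the reported intercept of about 201.8 Hz as the Baseline value. It treats the reported increases as applying to that reference group, which is an assumption about how the model was coded. It does not attempt the dysphonia group, whose increases were smaller and are not itemized in this summary.

```python
import math

BASELINE_HZ = 201.8  # reported intercept, typical group at Baseline

# Reported increases in mean speaking f0 relative to Baseline (Hz).
f0_gain_hz = {"Normal": 36.0, "Effortful": 66.6, "Calling": 103.9}


def semitones(f_from_hz: float, f_to_hz: float) -> float:
    """Musical interval between two frequencies, in semitones."""
    return 12.0 * math.log2(f_to_hz / f_from_hz)


f0_gain_st = {
    level: semitones(BASELINE_HZ, BASELINE_HZ + gain)
    for level, gain in f0_gain_hz.items()
}

for level, st in f0_gain_st.items():
    print(f"{level}: +{st:.1f} semitones above Baseline")
```

On these figures the rise at Calling is a little over seven semitones, roughly a musical fifth, for the reference group.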

Feasibility. Domain scores (out of 5): Usability and Interaction 3.9 (moderate-good), Immersion and Realism 3.4 (moderate, the lowest domain), Engagement and Perceived Benefit 4.0 (good), Comfort and Safety 4.5 (excellent). Overall feasibility index 4.0 (good). No adverse events, no cybersickness, no technical interruptions across the protocol. Average parameter-reconfiguration time between trials was about 2 minutes. Total session length was about 20 minutes per participant.

Qualitative feedback. Participants described the experience as “fun,” “like a video game,” and “a realistic way to practice voice use.” They highlighted the live, responsive behavior of the waiter as the most engaging element. The most consistent piece of negative feedback was the limited interactional behavior of the waiter - participants wanted verbal responses, facial expressions, and gestures during listening turns to make the interaction feel more natural.

Why this matters

For the Evidence Hub, three things are important about this paper:

For Therapy withVR specifically: this work tested IVS, not Therapy withVR. The broader principle it supports (graded visual-spatial demands elicit functional vocal adaptation) is consistent with the rationale clinicians already use when choosing scenes in Therapy withVR for voice work. Direct equivalence of the avatar-threshold trigger mechanism between platforms has not been studied.

Limitations

The paper is explicit about what this trial does and does not establish:

How this fits with the wider Evidence Hub

This study is part of a growing thread of immersive-VR voice work centered on Mount Sinai / Daşdöğen and adjacent voice labs:

The broader landscape: voice-VR is moving from “does the simulation feel real enough to change behavior” (mostly answered: yes) to “does practicing in the simulation transfer to real-world voice use” (largely unanswered, pending multisession longitudinal work). This study sits at the boundary - feasibility and immediate behavioral signal are established for a custom voice-responsive platform; transfer is the next test.

Note on the Immersive VoiceSpace platform. IVS is distinct from Therapy withVR. It is a single-scene, voice-threshold-responsive system invented and patented by the study author. The Mount Sinai institutional report (May 2026, “Hypophonia”) describes ongoing work extending IVS to people with Parkinson’s disease hypophonia, with planned modules for vocal feminization and additional contexts. The IP status of IVS could not be independently verified at the time of this review (see funding/COI field).

Implications for practice

For voice clinicians using or evaluating immersive VR for voice work: this study extends previous lab-based VR voice findings (Daşdöğen 2023, Daşdöğen 2026 trained-singers paper) by showing that graded virtual cues also elicit vocal scaling in a clinical population (people with dysphonia), not just in vocally healthy adults. Both groups raised loudness in line with graded distance and threshold cues; pitch scaling was more constrained for people with dysphonia, consistent with reduced phonatory flexibility documented in the wider voice literature. Practically: contextualized practice in virtual environments can elicit functional vocal output without explicit clinician cueing, which speaks to the generalization-and-transfer problem that has long limited carryover from clinic to daily communication. This study tested Immersive VoiceSpace specifically, not Therapy withVR - clinicians using Therapy withVR can take from this work the same broader principle (graded visual-spatial demands elicit vocal scaling) but should not assume direct equivalence of the avatar-threshold trigger mechanism without separate validation. The findings are consistent with the social model of communication: barriers to functional voice use sit in the contexts where voice is needed, and rehearsing in those contexts (rather than in stripped-down clinic rooms) is what this early evidence points toward.

Implications for research

Replication and extension are needed in: (a) larger samples with sufficient power for subgroup analysis by voice diagnosis; (b) multisession protocols that test learning, retention, and real-world generalization (the central claim of the IVS theoretical frame is transfer-appropriate processing, which requires longitudinal data to test); (c) controlled comparator conditions, including imagery-based control tasks to isolate the unique contribution of immersive visual-spatial cues from VR-exposure novelty effects; (d) Parkinson's hypophonia, which is the lead clinical application of IVS per Mount Sinai institutional reporting; (e) gender-affirming voice care, where IVS feminization modules are reportedly in development; (f) the avatar interaction limitation flagged by participants - whether richer verbal/nonverbal avatar responses (potentially AI-driven) materially improve outcomes. Independent replication outside the inventing institution would substantially strengthen the evidence base.

Cite this study

If you reference this study in your work, the canonical citation formats are:

APA 7th
Daşdöğen, Ü. (2026). Immersive VoiceSpace: Development and pilot testing of a virtual reality system for contextualized vocal training. Journal of Voice. https://doi.org/10.1016/j.jvoice.2026.04.047
AMA 11th
Daşdöğen Ü. Immersive VoiceSpace: Development and Pilot Testing of a Virtual Reality System for Contextualized Vocal Training. Journal of Voice. 2026. doi:10.1016/j.jvoice.2026.04.047.
BibTeX
@article{dasdogen2026,
  author = {Daşdöğen, Ümit},
  title = {Immersive VoiceSpace: Development and Pilot Testing of a Virtual Reality System for Contextualized Vocal Training},
  journal = {Journal of Voice},
  year = {2026},
  doi = {10.1016/j.jvoice.2026.04.047},
  url = {https://withvr.app/evidence/studies/dasdogen-2026-ivs}
}
RIS
TY  - JOUR
AU  - Daşdöğen, Ümit
TI  - Immersive VoiceSpace: Development and Pilot Testing of a Virtual Reality System for Contextualized Vocal Training
JO  - Journal of Voice
PY  - 2026
DO  - 10.1016/j.jvoice.2026.04.047
UR  - https://withvr.app/evidence/studies/dasdogen-2026-ivs
ER  - 


Funding & independence

Sole-author study by Ümit Daşdöğen (Research Director, Speech and Language Pathology, The Grabscheid Voice and Swallowing Center; Assistant Professor of Otolaryngology, Icahn School of Medicine at Mount Sinai). No external funders, grants, or sponsors named in the manuscript. IRB approval: Mount Sinai STUDY-25-01418. Significant conflict of interest: the author invented the Immersive VoiceSpace (IVS) platform and is identified in the published manuscript as the holder of a US patent application on the technology (USPTO Application No. 63/987 per the manuscript - this appears to be a truncated provisional-application number; the full number was not given in the published paper, and was not independently locatable via USPTO Patent Public Search or Google Patents at the time of this review, consistent with provisional-application confidentiality). The Immersive VoiceSpace® mark appears with the federal-registration symbol in Mount Sinai institutional reporting; a USPTO TESS trademark search returned no matching live registration at the time of review. These IP claims are reported as the author's own representations and could not be independently verified. These overlapping roles (investigator, author, inventor, IP holder, questionnaire designer) are common in early-stage academic platform development and are flagged here for transparency; readers should weigh the feasibility and acceptability outcomes specifically with this context in mind. Therapy withVR (withVR BV, Belgium) had no role in funding, design, conduct, analysis, or reporting of this study; this Evidence Hub entry was prepared independently from the published peer-reviewed paper and the publicly available Mount Sinai institutional report. Daşdöğen has separately published a 2026 Journal of Voice paper using the Rooms module of Therapy withVR (see dasdogen-2026 in this Hub), and uses Therapy withVR in other research work.

Last reviewed: 2026-05-23 · Next review due: 2027-05-23 · Reviewed by: Gareth Walkom