Gaze-contingent auditory displays for improved spatial attention in virtual reality

From National Research Council Canada

DOI	https://doi.org/10.1145/3067822
Author	Vinnikov, Margarita¹; Allison, Robert S.; Fernandes, Suzette
Affiliation	National Research Council of Canada. Aerospace
Format	Text, Article
Abstract	Virtual reality simulations of group social interactions are important for many applications, including the virtual treatment of social phobias, crowd and group simulation, collaborative virtual environments (VEs), and entertainment. In such scenarios, when compared to the real world, audio cues are often impoverished. As a result, users cannot rely on subtle spatial audio-visual cues that guide attention and enable effective social interactions in real-world situations. We explored whether gaze-contingent audio enhancement techniques driven by inferring audio-visual attention in virtual displays could be used to enable effective communication in cluttered audio VEs. In all of our experiments, we hypothesized that visual attention could be used as a tool to modulate the quality and intensity of sounds from multiple sources to efficiently and naturally select spatial sound sources. For this purpose, we built a gaze-contingent display (GCD) that allowed tracking of a user’s gaze in real-time and modifying the volume of the speakers’ voices contingent on the current region of overt attention. We compared six different techniques for sound modulation with a base condition providing no attentional modulation of sound. The techniques were compared in terms of source recognition and preference in a set of user studies. Overall, we observed that users liked the ability to control the sounds with their eyes. They felt that a rapid change in attenuation with attention but not the elimination of competing sounds (partial rather than absolute selection) was most natural. In conclusion, audio GCDs offer potential for simulating rich, natural social, and other interactions in VEs. They should be considered for improving both performance and fidelity in applications related to social behaviour scenarios or when the user needs to work with multiple audio sources of information.
Publication date	2017-07-22
Publisher	Association for Computing Machinery
In	ACM Transactions on Computer-Human Interaction 24, no. 3, 19.
Language	English
Peer reviewed	Yes
NPARC number	23002401
Record identifier	332b2654-5599-4c64-bbe3-f379dc631e24
Record created	2017-10-27
Record modified	2020-03-16

Date modified: 2025-05-09