IEEE International Workshop on Haptic Audio Visual Environments and their Applications (HAVE'2004), October 2-3, 2004, Ottawa, Ontario, Canada
Videoconferencing systems in use today typically rely on fixed or pan/tilt/zoom cameras for image acquisition and on close-talking microphones for good-quality audio capture. These sensors are unsuitable for scenarios involving multiple users seated at a meeting table, or for non-stationary users. In such situations, the focus of attention should shift from one talker to the next and, where possible, track moving users. This paper describes a multi-modal perception system that uses both video and audio signals for such a videoconferencing system. An omnidirectional video camera and a beamforming microphone array are combined into a single device placed at the center of a meeting table. The video and audio streams are processed to determine the direction of the active talker, and a virtual perspective view and a directional audio beam are then created. Computer vision algorithms locate people by motion and by face and marker detection. The audio beamformer merges the signals from a circular array of microphones to provide audio power measurements in multiple directions simultaneously. The video and audio cues are combined to decide on the talker's location. The system has been integrated with OpenH.323 and serves as a node interoperating with Microsoft NetMeeting.
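The directional power measurement described above is characteristic of delay-and-sum beamforming: for each candidate direction, the microphone signals are time-aligned according to the array geometry and summed, and the power of the summed signal peaks when the steering direction matches the talker. The sketch below illustrates this in NumPy; the array geometry (8 microphones on a 0.1 m ring), sample rate, and all function names are illustrative assumptions, not parameters taken from the paper.

```python
import numpy as np

def steering_delays(n_mics, radius, angle, c=343.0, fs=16000):
    """Per-microphone delays (in samples) for a circular array steered
    toward `angle`. Hypothetical geometry: mics evenly spaced on a ring
    of `radius` meters; far-field (plane-wave) assumption."""
    mic_angles = 2 * np.pi * np.arange(n_mics) / n_mics
    # Projection of each mic position onto the look direction,
    # converted from meters to samples.
    proj = radius * np.cos(mic_angles - angle)
    return proj / c * fs  # fractional sample delays

def beam_power(signals, delays):
    """Delay-and-sum output power for one look direction.
    `signals` is (n_mics, n_samples); fractional delays are applied
    as phase shifts in the frequency domain."""
    n = signals.shape[1]
    freqs = np.fft.rfftfreq(n)            # cycles per sample
    spectra = np.fft.rfft(signals, axis=1)
    shifted = spectra * np.exp(2j * np.pi * freqs[None, :] * delays[:, None])
    beam = np.fft.irfft(shifted.sum(axis=0), n) / signals.shape[0]
    return float(np.mean(beam ** 2))

# Toy demo: a 500 Hz tone arriving from 90 degrees on an 8-mic array.
fs, n_mics, radius = 16000, 8, 0.1
t = np.arange(2048) / fs
src_angle = np.pi / 2
d_true = steering_delays(n_mics, radius, src_angle, fs=fs)
signals = np.stack([np.sin(2 * np.pi * 500 * (t - d / fs)) for d in d_true])

# Scan 36 candidate directions; the power maximum indicates the talker.
angles = np.linspace(0, 2 * np.pi, 36, endpoint=False)
powers = {a: beam_power(signals, steering_delays(n_mics, radius, a, fs=fs))
          for a in angles}
best = max(powers, key=powers.get)
```

In a live system this scan would run continuously on short audio frames, and the resulting power-versus-direction profile would then be fused with the vision cues to select the talker direction.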