Spatial audio and spatial audio-visual learning



Bibliographic Details
Main Author: He, Y
Other Authors: Markham, A
Format: Thesis
Language: English
Published: 2024
Description
Summary: As humans, we depend extensively on multimodal signals to perceive, interact with, and analyze our surrounding 3D spatial environment, so as to accomplish various complex tasks. Among all our senses, sound and vision are the two most ubiquitous signals in real-world scenarios. Equipping machines with such spatial audio-visual inference and learning capability, or human-level audio-visual intelligence, is a vital yet challenging task. Despite the widely accepted importance of spatial audio-visual intelligence, the research community's focus has been heavily biased towards vision, with far less attention paid to spatial audio. Research in spatial audio-visual learning typically takes spatial audio as an auxiliary input to assist vision-centred tasks, ignoring the intricate and multifaceted interactions between audio and vision.

These observations and limitations motivate the two main focuses of my DPhil research: spatial audio learning and spatial audio-visual learning. The first serves as a preparatory exploration that aims to fill the gap between well-explored visual learning and underexplored spatial audio learning; the second raises new modality-related challenges and exhibits strong applicability in real-world scenarios. In summary, my DPhil research is driven by the following three main questions:

1. In the modern deep neural network era, how can classic signal-processing-based sound waveform feature extraction methods benefit from the powerful expressiveness of deep neural networks?

2. Is the widely adopted process of spectrogram extraction, which treats sound as an ordinary 2D image, an optimal representation? If not, can we design novel neural networks specifically tailored for sound? (A sketch of this conventional pipeline follows this summary.)

3. How can we design robust audio-visual multimodal learning frameworks in embodied settings where sound and vision are only weakly associated?

Exploring these questions requires treating sound as equally important as vision, yet fundamentally distinct. To accomplish these aims, first, we systematically explore the fundamentals of classical acoustic and visual signal processing, and further study the modality differences between acoustic and visual signals. Second, building on this exploration, we investigate how these fundamentals can guide the design of deep neural networks that learn from raw sound waveforms. Third, we show how sound can be processed efficiently and effectively together with vision, providing an initial attempt towards spatial audio-visual intelligence under conditions of weak sound-vision correlation.
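To make the conventional pipeline questioned in point 2 concrete, here is a minimal illustrative sketch, not taken from the thesis: a waveform is converted to a log-mel spectrogram and then fed to a vision-style 2D convolution as if it were a single-channel image. The sample rate, window parameters, and random waveform are placeholder assumptions, and the example uses PyTorch/torchaudio.

```python
import torch
import torchaudio

# Placeholder: a 1-second mono waveform at 16 kHz standing in for real audio.
sample_rate = 16000
waveform = torch.randn(1, sample_rate)

# The conventional pipeline: STFT -> mel filterbank -> log amplitude,
# yielding a 2D time-frequency "image" of the sound.
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=sample_rate, n_fft=1024, hop_length=256, n_mels=64
)
log_mel = torchaudio.transforms.AmplitudeToDB()(mel(waveform))  # (1, 64, frames)

# Treating the spectrogram as an ordinary single-channel image, it can be fed
# directly to a 2D CNN designed for vision -- the practice the thesis questions,
# since the frequency and time axes of a spectrogram are not interchangeable
# the way the two spatial axes of a photograph are.
cnn = torch.nn.Conv2d(in_channels=1, out_channels=16, kernel_size=3)
features = cnn(log_mel.unsqueeze(0))  # (1, 16, 62, frames - 2)
print(log_mel.shape, features.shape)
```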