While rendering and animation of photorealistic 3D human body models have matured and reached an impressive quality over the past years, modeling the spatial audio associated with such full-body models has been largely ignored so far. In this work, we present a framework that allows for high-quality spatial audio generation, capable of rendering the full 3D soundfield generated by a human body, including speech, footsteps, hand-body interactions, and other sounds. Given a basic audio-visual representation of the body in the form of 3D body pose and audio from a head-mounted microphone, we demonstrate that we can render the full acoustic scene at any point in 3D space efficiently and accurately. To enable near-field and real-time rendering of sound, we borrow the idea of volumetric primitives from graphical neural rendering and transfer them into the acoustic domain. Our acoustic primitives result in soundfield representations that are an order of magnitude smaller and overcome deficiencies in near-field rendering compared to previous approaches.
We translate the task of modeling the 3D soundfield of the visual body into learning a set of small acoustic primitives \( \{\mathcal{S}_i\}_{i=1}^{K} \), where each primitive \( \mathcal{S}_i \) is represented by soundfield coefficients of order \( N=2 \) and \( K \) denotes the number of acoustic primitives. We decompose the entire learning process into two sub-steps:
1) First, we design a neural network \( \mathcal{F} \) that takes audio and pose data as input and outputs the soundfield representations, along with primitive weights that reweight the importance of the different acoustic primitives and offsets that adjust the sound source locations
2) With the learned acoustic primitives \( \{\mathcal{S}_i\}_{i=1}^{K} \), we leverage a differentiable rendering function \( \mathcal{R} \) (Eq. (4) in the paper) to generate the audio waveform received at the target position; a minimal sketch of both steps follows this list
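To make the two-step structure concrete, below is a minimal, hypothetical PyTorch sketch of a primitive predictor standing in for \( \mathcal{F} \) and a rendering function standing in for \( \mathcal{R} \). All module names, tensor shapes, the number of primitives, the pose dimensionality, and the naive 1/r attenuation in the renderer are illustrative assumptions, not the authors' implementation (the actual renderer follows Eq. (4) in the paper).

```python
# Minimal sketch of the two-step pipeline described above.
# All names, shapes, and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn

K = 7            # assumed number of acoustic primitives (e.g. head, shoulders, hands, hips, feet)
N = 2            # spherical-harmonic order per primitive -> (N + 1)^2 coefficient channels
C = (N + 1) ** 2 # soundfield coefficient channels per primitive
T = 16000        # waveform length in samples (assumed)

class PrimitivePredictor(nn.Module):
    """Stand-in for the network F: maps input audio + body pose to per-primitive
    soundfield coefficients, importance weights, and positional offsets."""
    def __init__(self, pose_dim=63, hidden=256):
        super().__init__()
        self.audio_enc = nn.Conv1d(1, hidden, kernel_size=65, padding=32)
        self.pose_enc = nn.Linear(pose_dim, hidden)
        self.coeff_head = nn.Conv1d(hidden, K * C, kernel_size=1)
        self.weight_head = nn.Linear(hidden, K)
        self.offset_head = nn.Linear(hidden, K * 3)

    def forward(self, audio, pose):
        # audio: (B, 1, T) head-mounted microphone signal
        # pose:  (B, pose_dim) 3D body pose for the current frame
        a = self.audio_enc(audio)                      # (B, hidden, T)
        p = self.pose_enc(pose)                        # (B, hidden)
        h = a + p.unsqueeze(-1)                        # fuse audio and pose features
        coeffs = self.coeff_head(h).view(-1, K, C, T)  # per-primitive soundfield coefficients
        weights = torch.sigmoid(self.weight_head(p))   # (B, K) primitive importance
        offsets = self.offset_head(p).view(-1, K, 3)   # (B, K, 3) source-location offsets
        return coeffs, weights, offsets

def render(coeffs, weights, primitive_pos, offsets, listener_pos):
    """Stand-in for the differentiable renderer R: combines the K primitive
    soundfields into the waveform heard at listener_pos. Only a distance-weighted
    sum of the order-0 channel is sketched here (an assumption, not Eq. (4))."""
    src = primitive_pos + offsets                                            # (B, K, 3)
    dist = (src - listener_pos.unsqueeze(1)).norm(dim=-1).clamp(min=1e-3)    # (B, K)
    gain = weights / dist                                                    # naive 1/r attenuation
    return (gain.unsqueeze(-1) * coeffs[:, :, 0, :]).sum(dim=1)              # (B, T)

# Example usage with random data (shapes are assumptions):
model = PrimitivePredictor()
audio, pose = torch.randn(2, 1, T), torch.randn(2, 63)
coeffs, weights, offsets = model(audio, pose)
primitive_pos = torch.zeros(2, K, 3)                  # e.g. body-joint positions in world space
listener = torch.tensor([[1.0, 0.0, 1.5]]).expand(2, -1)
waveform = render(coeffs, weights, primitive_pos, offsets, listener)  # (2, T)
```

In this sketch, the predicted weights gate how much each primitive contributes at render time and the offsets shift the effective source positions, mirroring the roles described in the two steps above; where exactly these terms enter the real pipeline is an assumption.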
Soundfield visualizations for four different types of sounds are shown, with the main soundfield in the center and the individual primitive contributions around it. The method correctly assigns acoustic energy to the relevant primitives: for example, speech originates primarily from the head with minimal contribution from the shoulders, and its directivity matches the head's orientation. In each visualization, the left and right primitives, from bottom to top, are labeled foot, hip, hand, and shoulder, with the head in the middle.
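As one possible way to produce the per-primitive panels described above, the hypothetical sketch below isolates a single primitive by zeroing the weights of all others and evaluates the resulting RMS energy on a grid of listener positions. It reuses the `render` stand-in from the previous sketch and is likewise an assumption, not the authors' visualization code.

```python
# Sketch of isolating one primitive's contribution for visualization.
# Assumes the render() stand-in defined in the sketch above is in scope.
import torch

def primitive_energy_map(coeffs, weights, primitive_pos, offsets, grid, k):
    """grid: (G, 3) listener positions; returns per-point RMS energy of primitive k."""
    masked = torch.zeros_like(weights)
    masked[:, k] = weights[:, k]                          # keep only primitive k
    energies = []
    for g in grid:                                        # evaluate one grid point at a time
        listener = g.unsqueeze(0).expand(weights.shape[0], -1)               # (B, 3)
        wave = render(coeffs, masked, primitive_pos, offsets, listener)      # (B, T)
        energies.append(wave.pow(2).mean(dim=-1).sqrt())                     # RMS per batch item
    return torch.stack(energies, dim=-1)                  # (B, G) energy map over the grid
```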
@inproceedings{huang2024modeling,
  author    = {Huang, Chao and Markovic, Dejan and Xu, Chenliang and Richard, Alexander},
  title     = {Modeling and Driving Human Body Soundfields through Acoustic Primitives},
  booktitle = {European Conference on Computer Vision},
  year      = {2024},
}