DAVIS: High-Quality Audio-Visual Separation with Generative Diffusion Models

Chao Huang1, Susan Liang1, Yapeng Tian2, Anurag Kumar3, Chenliang Xu1

1University of Rochester, 2University of Texas at Dallas, 3Meta Reality Labs Research

Abstract

We propose DAVIS, a Diffusion model-based Audio-VIsual Separation framework that solves the audio-visual sound source separation task in a generative manner.

While existing discriminative methods that perform mask regression have made remarkable progress in this field, they face limitations in capturing the complex data distribution required for high-quality separation of sounds from diverse categories. In contrast, DAVIS leverages a generative diffusion model and a Separation U-Net to synthesize separated magnitudes starting from Gaussian noise, conditioned on both the audio mixture and the visual footage. With its generative objective, DAVIS is better suited to achieving the goal of high-quality sound separation across diverse categories. We compare DAVIS to existing state-of-the-art discriminative audio-visual separation methods on the domain-specific MUSIC dataset and the open-domain AVE dataset, and results show that DAVIS outperforms other methods in separation quality, demonstrating the advantages of our framework for tackling the audio-visual source separation task.

Iterative Synthesis

Here, we animate the synthesis process by showing how the separated magnitude spectrogram changes from time step T=100 to T=0. Specifically, given the visual frames from two videos V(1) and V(2) and the corresponding audio mixture, our model synthesizes the separated spectrograms A(1) and A(2) iteratively and finally outputs clean separated spectrograms. Use the slider to step linearly through the iterative synthesis.
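The iterative synthesis described above can be sketched as a generic DDPM-style reverse loop. This is a minimal illustration under stated assumptions, not the DAVIS implementation: the linear noise schedule, its `beta_start`/`beta_end` values, the noise-prediction parameterization, and the names `make_schedule`, `toy_denoiser`, and `sample` are all hypothetical. In particular, `toy_denoiser` is a placeholder standing in for the Separation U-Net and returns zeros, so the output here is not a meaningful separation.

```python
import numpy as np

def make_schedule(T=100, beta_start=1e-4, beta_end=0.02):
    """Linear noise schedule (an assumption; the page does not specify one)."""
    betas = np.linspace(beta_start, beta_end, T)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars

def toy_denoiser(x_t, t, mixture, visual):
    """Hypothetical stand-in for the Separation U-Net.

    A real model would predict the noise in x_t given the time step,
    the mixture spectrogram, and visual features. This one returns zeros.
    """
    return np.zeros_like(x_t)

def sample(mixture, visual, denoiser, T=100, seed=0):
    """Run T reverse steps, starting from Gaussian noise.

    Returns the final estimate and every intermediate state, which is
    what a slider over the synthesis process would display.
    """
    rng = np.random.default_rng(seed)
    betas, alphas, alpha_bars = make_schedule(T)
    x = rng.standard_normal(mixture.shape)  # inference start: pure noise
    trajectory = [x.copy()]
    for t in range(T - 1, -1, -1):
        eps = denoiser(x, t, mixture, visual)
        # Posterior mean of the previous state under the predicted noise.
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        # No noise is added on the final step.
        noise = rng.standard_normal(x.shape) if t > 0 else np.zeros_like(x)
        x = mean + np.sqrt(betas[t]) * noise
        trajectory.append(x.copy())
    return x, trajectory

# Usage with a random placeholder "mixture" magnitude of shape (freq, time).
mixture = np.abs(np.random.default_rng(1).standard_normal((8, 16)))
separated, states = sample(mixture, visual=None, denoiser=toy_denoiser, T=100)
```

Keeping the whole trajectory makes it easy to inspect any intermediate step between the noisy start and the final output.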

Inference start: noisy spectrogram.

Conditions

Drag the button to see the synthesis process...
Inference end: separated spectrogram.

Output


Example results on MUSIC dataset

Settings: We take audio samples of different categories in the MUSIC dataset and mix the two sources. Then we use their corresponding frames to separate the sources respectively. Comparisons between Ground Truth, DAVIS, and iQuery are provided. DAVIS clearly achieves better separation quality. Note that for each example, the two mixtures are the same.
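The mix-and-separate setup described above can be sketched as follows. This is a minimal illustration, assuming synthetic sine tones in place of real MUSIC clips; the sample rate, FFT size, hop length, and the helper name `magnitude_spectrogram` are hypothetical choices and are not taken from the page.

```python
import numpy as np

def magnitude_spectrogram(wave, n_fft=1024, hop=256):
    """Hann-windowed STFT magnitude, returned as (freq_bins, frames)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wave) - n_fft) // hop
    frames = np.stack(
        [wave[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.abs(np.fft.rfft(frames, axis=1)).T

# Two synthetic "sources" standing in for clips from different categories.
sr = 11025                      # assumed sample rate
t = np.arange(2 * sr) / sr      # two seconds of audio
source_1 = 0.5 * np.sin(2 * np.pi * 440.0 * t)
source_2 = 0.5 * np.sin(2 * np.pi * 660.0 * t)

# The same mixture is used when separating either source.
mixture = source_1 + source_2

mix_mag = magnitude_spectrogram(mixture)    # model input
gt_mag_1 = magnitude_spectrogram(source_1)  # ground truth for source 1
gt_mag_2 = magnitude_spectrogram(source_2)  # ground truth for source 2
```

Because the mixture is built from known sources, the ground-truth magnitudes are available for comparing each prediction against its target.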


Example1: "Accordion" + "Violin"

Mixture Frame Ground Truth Prediction (DAVIS) Prediction (iQuery)
mix.jpg frame1.jpg gtamp.png predmap.png predmap.png
mix.jpg frame1.jpg gtamp.png predmap.png predmap.png


Example2: "Cello" + "Trumpet"

Mixture Frame Ground Truth Prediction (DAVIS) Prediction (iQuery)
mix.jpg frame1.jpg gtamp.png predmap.png predmap.png
mix.jpg frame1.jpg gtamp.png predmap.png predmap.png

Example3: "Clarinet" + "Tuba"

Mixture Frame Ground Truth Prediction (DAVIS) Prediction (iQuery)
mix.jpg frame1.jpg gtamp.png predmap.png predmap.png
mix.jpg frame1.jpg gtamp.png predmap.png predmap.png

Example4: "Erhu" + "Saxophone"

Mixture Frame Ground Truth Prediction (DAVIS) Prediction (iQuery)
mix.jpg frame1.jpg gtamp.png predmap.png predmap.png
mix.jpg frame1.jpg gtamp.png predmap.png predmap.png


Example results on AVE dataset

Settings: We take audio samples of different categories in the AVE dataset and mix the two sources. DAVIS clearly achieves better separation quality across diverse categories. Note that for each example, the two mixtures are the same.


Example1: "Clock" + "Woman speaking"

The background noise in the "Clock" video is also dropped by the model, as no related visual information is present in the frame.

Mixture Frame Ground Truth Prediction (DAVIS) Prediction (iQuery)
mix.jpg frame1.jpg gtamp.png predmap.png predmap.png
mix.jpg frame1.jpg gtamp.png predmap.png predmap.png

Example2: "Rats" + "Motorcycle"

The background speech in the "Rats" video is also dropped by the model, as no human is present in the frame.

Mixture Frame Ground Truth Prediction (DAVIS) Prediction (iQuery)
mix.jpg frame1.jpg gtamp.png predmap.png predmap.png
mix.jpg frame1.jpg gtamp.png predmap.png predmap.png

Example3: "Race car" + "Church bell"

Mixture Frame Ground Truth Prediction (DAVIS) Prediction (iQuery)
mix.jpg frame1.jpg gtamp.png predmap.png predmap.png
mix.jpg frame1.jpg gtamp.png predmap.png predmap.png

Example4: "Helicopter" + "Bark"

Mixture Frame Ground Truth Prediction (DAVIS) Prediction (iQuery)
mix.jpg frame1.jpg gtamp.png predmap.png predmap.png
mix.jpg frame1.jpg gtamp.png predmap.png predmap.png