We propose DAVIS, a Diffusion model-based Audio-Visual Separation framework that solves the audio-visual sound source separation task in a generative manner.
While existing discriminative methods that perform mask regression have made remarkable progress in this field, they face limitations in capturing the complex data distribution required for high-quality separation of sounds from diverse categories. In contrast, DAVIS leverages a generative diffusion model and a Separation U-Net to synthesize separated magnitudes starting from Gaussian noise, conditioned on both the audio mixture and the visual footage. With its generative objective, DAVIS is better suited to achieving the goal of high-quality sound separation across diverse categories. We compare DAVIS to existing state-of-the-art discriminative audio-visual separation methods on the domain-specific MUSIC dataset and the open-domain AVE dataset; the results show that DAVIS outperforms these methods in separation quality, demonstrating the advantages of our framework for tackling the audio-visual source separation task.
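For a concrete picture of the generative objective, the sketch below shows one conditional diffusion training step in PyTorch: the Separation U-Net sees a noised version of the clean separated magnitude plus the mixture and visual conditions, and regresses the injected noise. The `unet` call, its conditioning keywords, and the linear noise schedule are illustrative assumptions, not DAVIS's actual interface.

```python
import torch
import torch.nn.functional as F

def diffusion_training_step(unet, x0_mag, mix_mag, visual_feat, num_steps=100):
    """One training step of the generative objective (a minimal sketch).

    x0_mag:      clean separated magnitude spectrogram, (B, 1, F, T)
    mix_mag:     mixture magnitude spectrogram,          (B, 1, F, T)
    visual_feat: visual embedding of the footage,        (B, D)

    `unet` stands in for the Separation U-Net; its exact conditioning
    interface here is an assumption.
    """
    B = x0_mag.size(0)
    # Linear beta schedule -> cumulative signal rates (standard DDPM bookkeeping).
    betas = torch.linspace(1e-4, 2e-2, num_steps)
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)

    # Sample a random timestep and corrupt the clean magnitude with Gaussian noise.
    t = torch.randint(0, num_steps, (B,))
    noise = torch.randn_like(x0_mag)
    a = alpha_bar[t].view(B, 1, 1, 1)
    x_t = a.sqrt() * x0_mag + (1.0 - a).sqrt() * noise

    # The network is conditioned on the mixture and the visual footage,
    # and regresses the injected noise (epsilon-prediction objective).
    pred_noise = unet(x_t, t, mix=mix_mag, visual=visual_feat)
    return F.mse_loss(pred_noise, noise)
```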
Here, we animate the synthesis process by showing how the separated magnitude spectrogram changes from time step T=100 to T=0. Specifically, given the visual frames from two videos V(1) and V(2) and the corresponding audio mixture, our model synthesizes the separated spectrograms A(1) and A(2) iteratively and finally outputs clean separated spectrograms. Use the slider here to control the iterative synthesis.
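The animation corresponds to the reverse diffusion trajectory. Below is a minimal DDPM-style sampling loop that starts from Gaussian noise and records every intermediate magnitude so a slider can scrub from T=100 to T=0; the `unet` signature and noise schedule are assumed as in the training sketch above, not taken from the paper.

```python
import torch

@torch.no_grad()
def sample_with_snapshots(unet, mix_mag, visual_feat, num_steps=100):
    """Reverse diffusion from t=T down to t=0, keeping every intermediate
    magnitude so a slider can scrub through the synthesis (minimal sketch)."""
    betas = torch.linspace(1e-4, 2e-2, num_steps)
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)

    x = torch.randn_like(mix_mag)          # start from pure Gaussian noise
    snapshots = [x.clone()]                # the t = T frame of the animation
    for t in reversed(range(num_steps)):
        t_batch = torch.full((x.size(0),), t, dtype=torch.long)
        eps = unet(x, t_batch, mix=mix_mag, visual=visual_feat)
        # Standard DDPM posterior-mean update.
        coef = (1.0 - alphas[t]) / (1.0 - alpha_bar[t]).sqrt()
        x = (x - coef * eps) / alphas[t].sqrt()
        if t > 0:
            x = x + betas[t].sqrt() * torch.randn_like(x)
        snapshots.append(x.clone())
    return snapshots                        # snapshots[-1] is the clean estimate
```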
Settings: We take audio samples from different categories in the MUSIC dataset and mix the two sources. We then use their corresponding frames to separate each source. Comparisons among Ground Truth, DAVIS, and iQuery are provided; DAVIS clearly achieves better separation quality. Note that within each example, the mixture is the same for both sources.
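As a reference for this setup, here is a minimal sketch of the mix-and-separate protocol: two waveforms are summed, and the mixture's magnitude spectrogram is extracted as the model's audio condition. Paths, STFT parameters, and the phase-reuse note are placeholders rather than the paper's exact configuration.

```python
import torch
import torchaudio

def make_mixture(path_a, path_b, n_fft=1024, hop=256):
    """Build a mix-and-separate test case (sketch; parameters are assumed)."""
    wav_a, sr = torchaudio.load(path_a)
    wav_b, _ = torchaudio.load(path_b)
    n = min(wav_a.size(-1), wav_b.size(-1))
    mix = wav_a[..., :n] + wav_b[..., :n]   # the same mixture for both sources

    window = torch.hann_window(n_fft)
    spec = torch.stft(mix.mean(0), n_fft, hop, window=window, return_complex=True)
    mix_mag, mix_phase = spec.abs(), spec.angle()
    # The model predicts each source's magnitude; the mixture phase is
    # typically reused when inverting the prediction back to a waveform.
    return mix_mag, mix_phase, sr
```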
*(Four MUSIC examples, each with audio/spectrogram demos for: Mixture | Frame | Ground Truth | Prediction (DAVIS) | Prediction (iQuery).)*
Settings: We take audio samples from different categories in the AVE dataset and mix the two sources. DAVIS clearly achieves better separation quality across diverse categories. Note that within each example, the mixture is the same for both sources.
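Separation quality in comparisons like these is commonly quantified with scale-invariant SDR (SI-SDR) against the ground-truth source. The sketch below computes that standard metric; it is illustrative only and not necessarily the exact metric reported for DAVIS.

```python
import torch

def si_sdr(est: torch.Tensor, ref: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Scale-invariant SDR in dB between 1-D estimated and reference waveforms."""
    est = est - est.mean()
    ref = ref - ref.mean()
    # Project the estimate onto the reference to obtain the scaled target.
    target = (torch.dot(est, ref) / (torch.dot(ref, ref) + eps)) * ref
    noise = est - target
    return 10 * torch.log10(target.pow(2).sum() / (noise.pow(2).sum() + eps))
```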
The background noise in the "Clock" video is also suppressed by the model, as no related visual information is present in the frame.
*("Clock" example: Mixture | Frame | Ground Truth | Prediction (DAVIS) | Prediction (iQuery).)*
The background speech in the "Rats" video is also suppressed by the model, as no human is present in the frame.
*("Rats" example: Mixture | Frame | Ground Truth | Prediction (DAVIS) | Prediction (iQuery).)*
*(Two additional AVE examples, each with: Mixture | Frame | Ground Truth | Prediction (DAVIS) | Prediction (iQuery).)*