We propose DAVIS, a Diffusion model-based Audio-Visual Separation framework that solves the audio-visual sound source separation task in a generative manner.
Existing methods typically frame sound separation as a mask-based regression problem, and this formulation has driven significant progress. However, they face limitations in capturing the complex data distribution required for high-quality separation of sounds from diverse categories. In contrast, DAVIS leverages a generative diffusion model and a Separation U-Net to synthesize separated sounds directly from Gaussian noise, conditioned on both the audio mixture and the visual information. With its generative objective, DAVIS is better suited to achieving high-quality sound separation across diverse sound categories. We compare DAVIS against existing state-of-the-art discriminative audio-visual separation methods on the AVE and MUSIC datasets, and the results show that DAVIS outperforms them in separation quality, demonstrating the advantages of our framework for tackling the audio-visual source separation task. We will make our source code and pre-trained models publicly available.
Here, we animate the synthesis process by showing how the separated magnitude spectrogram changes over 100 sampling steps. Specifically, given the visual frames from two videos V(1) and V(2) and the corresponding audio mixture, our model synthesizes the separated spectrograms A(1) and A(2) iteratively and finally outputs clean separated spectrograms. Use the slider here to linearly control the iterative synthesis.
Conditions
Output
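To make the iterative synthesis above concrete, below is a minimal sketch of a DDPM-style conditional sampling loop over magnitude spectrograms. The `separation_unet` callable, the linear noise schedule, and the argument names are illustrative assumptions, not our released implementation.

```python
import torch

def sample_separated_spectrogram(separation_unet, mix_spec, visual_emb, num_steps=100):
    """Iteratively denoise Gaussian noise into a separated magnitude spectrogram,
    conditioned on the audio mixture spectrogram and a visual embedding."""
    betas = torch.linspace(1e-4, 0.02, num_steps)      # linear noise schedule (assumed)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    x = torch.randn_like(mix_spec)                      # start from pure Gaussian noise
    for t in reversed(range(num_steps)):
        # The Separation U-Net predicts the noise in x_t given the mixture,
        # the visual condition, and the current timestep.
        eps = separation_unet(x, mix_spec, visual_emb, torch.tensor([t]))
        mean = (x - betas[t] / torch.sqrt(1.0 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise         # one reverse diffusion step
    return x                                            # separated magnitude spectrogram
```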
We show DAVIS's separation results across diverse categories.
Mixture | Frame | Ground Truth | Prediction |
---|---|---|---|
Settings: We take audio samples from different categories in the MUSIC dataset and mix two sources. We then use the corresponding video frames to separate each source. Comparisons between the Ground Truth, DAVIS, and iQuery are provided; DAVIS clearly achieves better separation quality. Note that for each example, the two mixtures are the same.
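As a reference for this mix-and-separate setup, the sketch below mixes two single-source clips and computes magnitude spectrograms with an STFT; the file paths, STFT parameters, and helper name are assumptions for illustration, not the exact evaluation protocol.

```python
import torch
import torchaudio

def make_mixture_spectrogram(wav_path_1, wav_path_2, n_fft=1024, hop_length=256):
    """Mix two single-source clips and return the mixture magnitude spectrogram
    together with the ground-truth spectrogram of each source."""
    wav1, sr1 = torchaudio.load(wav_path_1)
    wav2, sr2 = torchaudio.load(wav_path_2)
    assert sr1 == sr2, "both clips should share one sample rate"
    length = min(wav1.shape[-1], wav2.shape[-1])        # trim to a common length
    wav1, wav2 = wav1[..., :length], wav2[..., :length]
    mixture = wav1 + wav2                                # linear mixing of the two sources

    window = torch.hann_window(n_fft)

    def mag_spec(wav):
        return torch.stft(wav, n_fft, hop_length, window=window,
                          return_complex=True).abs()

    return mag_spec(mixture), mag_spec(wav1), mag_spec(wav2)
```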
Mixture | Frame | Ground Truth | Prediction (DAVIS) | Prediction (iQuery) |
---|---|---|---|---|
Settings: We take audio samples from different categories in the AVE dataset and mix two sources. DAVIS clearly achieves better separation quality across diverse categories. Note that for each example, the two mixtures are the same.
The background noise in the "Clock" video is also dropped by the model, as no related visual information is present in the frame.
Mixture | Frame | Ground Truth | Prediction (DAVIS) | Prediction (iQuery) |
---|---|---|---|---|
The background speech in the "Rats" video is also dropped by the model, as no human is present in the frame.
Mixture | Frame | Ground Truth | Prediction (DAVIS) | Prediction (iQuery) |
---|---|---|---|---|
Mixture | Frame | Ground Truth | Prediction (DAVIS) | Prediction (iQuery) |
---|---|---|---|---|
We show an application: Zero-Shot Text-Guided Source Separation, which uses our trained audio-visual separation model to enable text queries at inference. During training, we use the CLIP image encoder to extract visual embeddings as conditions. Thanks to the strong capabilities of the CLIP model, which aligns corresponding image and text features into closely matched embeddings, we can use the text features generated by the CLIP text encoder from a text query in a zero-shot setting. Combined with the parsing capability of Large Language Models, our model can further allow for user-instructed sound separation, i.e., users can use our audio-visual model to separate sounds by providing instructions in the form of text.
During testing, text prompts are used as conditions for the DAVIS model, which was trained with images as conditions.
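A minimal sketch of this swap is shown below, assuming a standard Hugging Face CLIP checkpoint and the hypothetical `sample_separated_spectrogram` sampler sketched earlier; the actual conditioning interface in DAVIS may differ.

```python
import torch
from transformers import CLIPModel, CLIPTokenizer

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

def text_condition(prompt):
    """Encode a text prompt into the embedding space shared with CLIP image features."""
    tokens = tokenizer([prompt], padding=True, return_tensors="pt")
    with torch.no_grad():
        return clip.get_text_features(**tokens)          # (1, 512) for ViT-B/32

# E.g., to pull the helicopter sound out of a Helicopter + Train Horn mixture,
# substitute the text embedding for the visual embedding at sampling time:
# text_emb = text_condition("A photo of Helicopter.")
# spec = sample_separated_spectrogram(separation_unet, mix_spec, text_emb)
```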
Mixture | Prompt (Source 1) | Prompt (Source 2) |
---|---|---|
Helicopter + Train Horn | "A photo of Helicopter." | "A photo of Train Horn." |
Shofar + Motorcycle | "A photo of Shofar." | "A photo of Motorcycle." |
Church bell + Female speech | "A photo of Church bell." | "A photo of Female speech." |