We propose DAVIS, a Diffusion model-based Audio-Visual Separation framework that solves the audio-visual sound source separation task in a generative manner.
Existing methods typically frame sound separation as a mask-based regression problem, and this formulation has driven significant progress. However, they face limitations in capturing the complex data distribution required for high-quality separation of sounds from diverse categories. In contrast, DAVIS leverages a generative diffusion model and a Separation U-Net to synthesize separated sounds directly from Gaussian noise, conditioned on both the audio mixture and the visual information. With its generative objective, DAVIS is better suited to achieving high-quality sound separation across diverse sound categories. We compare DAVIS against existing state-of-the-art discriminative audio-visual separation methods on the AVE and MUSIC datasets, and the results show that DAVIS outperforms them in separation quality, demonstrating the advantages of our framework for tackling the audio-visual source separation task. We will make our source code and pre-trained models publicly available.
Here, we animate the synthesis process by showing how the separated magnitude spectrogram changes over 100 sampling steps. Specifically, given the visual frames from two videos V(1) and V(2) and the corresponding audio mixture, our model synthesizes the separated spectrograms A(1) and A(2) iteratively and finally outputs clean separated spectrograms. Use the slider here to linearly control the iterative synthesis.
Conditions
Output
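To make the iterative synthesis above concrete, below is a minimal sketch of a DDPM-style conditional sampling loop over magnitude spectrograms. The `separation_unet` callable, the linear noise schedule, and the argument names are illustrative assumptions, not our released implementation.

```python
import torch

def sample_separated_spectrogram(separation_unet, mix_spec, visual_emb, num_steps=100):
    """Iteratively denoise Gaussian noise into a separated magnitude spectrogram,
    conditioned on the audio mixture spectrogram and a visual embedding."""
    betas = torch.linspace(1e-4, 0.02, num_steps)      # linear noise schedule (assumed)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    x = torch.randn_like(mix_spec)                      # start from pure Gaussian noise
    for t in reversed(range(num_steps)):
        # The Separation U-Net predicts the noise in x_t given the mixture,
        # the visual condition, and the current timestep.
        eps = separation_unet(x, mix_spec, visual_emb, torch.tensor([t]))
        mean = (x - betas[t] / torch.sqrt(1.0 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise         # one reverse diffusion step
    return x                                            # separated magnitude spectrogram
```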
We show DAVIS's separation results across diverse categories.
Mixture | Frame | Ground Truth | Prediction |
---|---|---|---|
Settings: We take audio samples from different categories in the MUSIC dataset and mix two sources. We then use the corresponding video frames to separate each source. Comparisons between the Ground Truth, DAVIS, and iQuery are provided; DAVIS clearly achieves better separation quality. Note that for each example, the two mixtures are the same.
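As a reference for this mix-and-separate setup, the sketch below mixes two single-source clips and computes magnitude spectrograms with an STFT; the file paths, STFT parameters, and helper name are assumptions for illustration, not the exact evaluation protocol.

```python
import torch
import torchaudio

def make_mixture_spectrogram(wav_path_1, wav_path_2, n_fft=1024, hop_length=256):
    """Mix two single-source clips and return the mixture magnitude spectrogram
    together with the ground-truth spectrogram of each source."""
    wav1, sr1 = torchaudio.load(wav_path_1)
    wav2, sr2 = torchaudio.load(wav_path_2)
    assert sr1 == sr2, "both clips should share one sample rate"
    length = min(wav1.shape[-1], wav2.shape[-1])        # trim to a common length
    wav1, wav2 = wav1[..., :length], wav2[..., :length]
    mixture = wav1 + wav2                                # linear mixing of the two sources

    window = torch.hann_window(n_fft)

    def mag_spec(wav):
        return torch.stft(wav, n_fft, hop_length, window=window,
                          return_complex=True).abs()

    return mag_spec(mixture), mag_spec(wav1), mag_spec(wav2)
```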
Mixture | Frame | Ground Truth | Prediction (DAVIS) | Prediction (iQuery) |
---|---|---|---|---|
Settings: We take audio samples from different categories in the AVE dataset and mix two sources. DAVIS clearly achieves better separation quality across diverse categories. Note that for each example, the two mixtures are the same.
The background noise in the "Clock" video is also dropped by the model, as no related visual information is present in the frame.
Mixture | Frame | Ground Truth | Prediction (DAVIS) | Prediction (iQuery) |
---|---|---|---|---|
The background speech in the "Rats" video is also dropped by the model, as no human is present in the frame.
Mixture | Frame | Ground Truth | Prediction (DAVIS) | Prediction (iQuery) |
---|---|---|---|---|
Mixture | Frame | Ground Truth | Prediction (DAVIS) | Prediction (iQuery) |
---|---|---|---|---|
We show an application: Zero-Shot Text-Guided Source Separation, which uses our trained audio-visual separation model to enable text queries at inference. During training, we use the CLIP image encoder to extract visual embeddings as conditions. Thanks to the strong capabilities of the CLIP model, which aligns corresponding image and text features into closely matched embeddings, we can use the text features generated by the CLIP text encoder from a text query in a zero-shot setting. Combined with the parsing capability of Large Language Models, our model can further allow for user-instructed sound separation, i.e., users can use our audio-visual model to separate sounds by providing instructions in the form of text.
During testing, text prompts are used as conditions for the DAVIS model, which was trained with images as conditions.
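A minimal sketch of this swap is shown below, assuming a standard Hugging Face CLIP checkpoint and the hypothetical `sample_separated_spectrogram` sampler sketched earlier; the actual conditioning interface in DAVIS may differ.

```python
import torch
from transformers import CLIPModel, CLIPTokenizer

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

def text_condition(prompt):
    """Encode a text prompt into the embedding space shared with CLIP image features."""
    tokens = tokenizer([prompt], padding=True, return_tensors="pt")
    with torch.no_grad():
        return clip.get_text_features(**tokens)          # (1, 512) for ViT-B/32

# E.g., to pull the helicopter sound out of a Helicopter + Train Horn mixture,
# substitute the text embedding for the visual embedding at sampling time:
# text_emb = text_condition("A photo of Helicopter.")
# spec = sample_separated_spectrogram(separation_unet, mix_spec, text_emb)
```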
Mixture | Prompt (Source 1) | Prompt (Source 2) |
---|---|---|
Helicopter + Train Horn | "A photo of Helicopter." | "A photo of Train Horn." |
Shofar + Motorcycle | "A photo of Shofar." | "A photo of Motorcycle." |
Church bell + Female speech | "A photo of Church bell." | "A photo of Female speech." |