ZeroSep: Separate Anything in Audio with Zero Training

Chao Huang1 Yuesheng Ma2 Junxuan Huang3 Susan Liang1 Yunlong Tang1 Jing Bi1 Wenqiang Liu3 Nima Mesgarani2 Chenliang Xu1
1 University of Rochester 2 Columbia University 3 Tencent America

Overview

ZeroSep is an approach to audio source separation that requires zero training. Users specify any sound source they want to isolate with a natural language prompt, and our method leverages pretrained audio diffusion models to perform the separation without specialized training data or fine-tuning.
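As a rough illustration of the general recipe this overview describes, one plausible way to realize training-free, prompt-guided separation is to invert the mixture into a pretrained text-conditioned audio diffusion model's noise space and then denoise it under the target-source prompt. The sketch below shows that idea end to end with a toy stand-in denoiser; the `TextCondDenoiser` class, the DDIM loops, the noise schedule, and all tensor shapes are illustrative assumptions, not ZeroSep's released implementation.

```python
# Hedged sketch: one way text-guided, training-free separation could be wired around a
# pretrained text-conditioned audio diffusion model. Everything below (model interface,
# inversion/denoising loops, hyperparameters) is illustrative, not the authors' code.
import torch


class TextCondDenoiser(torch.nn.Module):
    """Stand-in for a pretrained text-conditioned noise predictor (e.g. an audio latent
    diffusion UNet). A real model predicts the noise present at step t; this toy version
    just lets the sketch execute end to end."""

    def forward(self, latent, t, text_embedding):
        return torch.zeros_like(latent)


def ddim_invert(denoiser, latent, text_emb, alphas):
    """Deterministically map an encoded mixture to the model's noise space by running the
    DDIM update in reverse, conditioned on a neutral (e.g. empty-prompt) embedding."""
    for t in range(len(alphas) - 1):
        eps = denoiser(latent, t, text_emb)
        a_t, a_next = alphas[t], alphas[t + 1]
        x0 = (latent - (1 - a_t).sqrt() * eps) / a_t.sqrt()      # predicted clean latent
        latent = a_next.sqrt() * x0 + (1 - a_next).sqrt() * eps  # step toward more noise
    return latent


def ddim_denoise(denoiser, latent, text_emb, alphas):
    """Run the DDIM sampler conditioned on the target-source prompt, steering the
    reconstruction toward audio consistent with that prompt."""
    for t in reversed(range(1, len(alphas))):
        eps = denoiser(latent, t, text_emb)
        a_t, a_prev = alphas[t], alphas[t - 1]
        x0 = (latent - (1 - a_t).sqrt() * eps) / a_t.sqrt()
        latent = a_prev.sqrt() * x0 + (1 - a_prev).sqrt() * eps  # step toward less noise
    return latent


if __name__ == "__main__":
    denoiser = TextCondDenoiser()
    alphas = torch.linspace(0.9999, 0.05, 50)   # illustrative noise schedule
    mixture_latent = torch.randn(1, 8, 256)     # stand-in for an encoded mixture
    neutral_emb = torch.zeros(1, 512)           # e.g. empty-prompt text embedding
    target_emb = torch.randn(1, 512)            # e.g. text embedding of "cello"

    noisy = ddim_invert(denoiser, mixture_latent, neutral_emb, alphas)
    separated_latent = ddim_denoise(denoiser, noisy, target_emb, alphas)
    print(separated_latent.shape)               # decode with the model's VAE/vocoder to obtain audio
```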

Demo Video

The demo video showcases our interface for real-world audio source separation: users isolate any sound source with a natural language prompt, and ZeroSep performs the separation zero-shot, with no training required. All code and the demo implementation are available in our GitHub repository.

ZeroSep interface demonstration: separating audio sources using natural language prompts
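For concreteness, a usage snippet like the one below conveys the prompt-driven workflow shown in the demo. The `zerosep` package name, the `separate` function, and its arguments are hypothetical placeholders (only `soundfile`, a standard library for WAV I/O, is real); consult the GitHub repository for the actual interface.

```python
# Hypothetical usage sketch only: the package name, function, and arguments are
# placeholders, not the repository's actual API. See the GitHub README for real usage.
import soundfile as sf           # real library for reading/writing WAV files
from zerosep import separate     # HYPOTHETICAL import path

mixture, sr = sf.read("street_mixture.wav")                    # load a real-world recording
estimate = separate(mixture, sample_rate=sr, prompt="speech")  # isolate the prompted source
sf.write("speech_estimate.wav", estimate, sr)                  # save the separated track
```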

Comparison with State-of-the-Art Methods

Below are comparison results on the MUSIC and AVE datasets. We compare ZeroSep against training-based methods (LASS-Net, AudioSep, FlowSep) and the training-free method AudioEdit. The natural language prompt used to specify the target source is shown at the beginning of each row.

Results

Each row of the results table pairs a natural language prompt with audio clips (playable on the project page): the input mixture, the ground-truth source, and the separations produced by LASS-Net, AudioSep, and FlowSep (training-based), AudioEdit (training-free), and ZeroSep (ours, training-free). The prompts are: "tuba", "cello", "accordion", "flute", "xylophone", "chainsaw", "frying food", "Banjo", "Speech" (two examples), and "Church Bell".