ZeroSep: Separate Anything in Audio with Zero Training

Chao Huang1 Yuesheng Ma2 Junxuan Huang3 Susan Liang1 Yunlong Tang1 Jing Bi1 Wenqiang Liu3 Nima Mesgarani2 Chenliang Xu1
1 University of Rochester 2 Columbia University 3 Tencent America

Overview

ZeroSep is an approach to audio source separation that requires zero training. Users specify any sound source they want to isolate with a natural language prompt, and our method leverages pretrained audio diffusion models to perform the separation without specialized training data or fine-tuning.
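As a rough illustration of the general recipe this overview describes, one plausible way to realize training-free, prompt-guided separation is to invert the mixture into a pretrained text-conditioned audio diffusion model's noise space and then denoise it under the target-source prompt. The sketch below shows that idea end to end with a toy stand-in denoiser; the `TextCondDenoiser` class, the DDIM loops, the noise schedule, and all tensor shapes are illustrative assumptions, not ZeroSep's released implementation.

```python
# Hedged sketch: one way text-guided, training-free separation could be wired around a
# pretrained text-conditioned audio diffusion model. Everything below (model interface,
# inversion/denoising loops, hyperparameters) is illustrative, not the authors' code.
import torch


class TextCondDenoiser(torch.nn.Module):
    """Stand-in for a pretrained text-conditioned noise predictor (e.g. an audio latent
    diffusion UNet). A real model predicts the noise present at step t; this toy version
    just lets the sketch execute end to end."""

    def forward(self, latent, t, text_embedding):
        return torch.zeros_like(latent)


def ddim_invert(denoiser, latent, text_emb, alphas):
    """Deterministically map an encoded mixture to the model's noise space by running the
    DDIM update in reverse, conditioned on a neutral (e.g. empty-prompt) embedding."""
    for t in range(len(alphas) - 1):
        eps = denoiser(latent, t, text_emb)
        a_t, a_next = alphas[t], alphas[t + 1]
        x0 = (latent - (1 - a_t).sqrt() * eps) / a_t.sqrt()      # predicted clean latent
        latent = a_next.sqrt() * x0 + (1 - a_next).sqrt() * eps  # step toward more noise
    return latent


def ddim_denoise(denoiser, latent, text_emb, alphas):
    """Run the DDIM sampler conditioned on the target-source prompt, steering the
    reconstruction toward audio consistent with that prompt."""
    for t in reversed(range(1, len(alphas))):
        eps = denoiser(latent, t, text_emb)
        a_t, a_prev = alphas[t], alphas[t - 1]
        x0 = (latent - (1 - a_t).sqrt() * eps) / a_t.sqrt()
        latent = a_prev.sqrt() * x0 + (1 - a_prev).sqrt() * eps  # step toward less noise
    return latent


if __name__ == "__main__":
    denoiser = TextCondDenoiser()
    alphas = torch.linspace(0.9999, 0.05, 50)   # illustrative noise schedule
    mixture_latent = torch.randn(1, 8, 256)     # stand-in for an encoded mixture
    neutral_emb = torch.zeros(1, 512)           # e.g. empty-prompt text embedding
    target_emb = torch.randn(1, 512)            # e.g. text embedding of "cello"

    noisy = ddim_invert(denoiser, mixture_latent, neutral_emb, alphas)
    separated_latent = ddim_denoise(denoiser, noisy, target_emb, alphas)
    print(separated_latent.shape)               # decode with the model's VAE/vocoder to obtain audio
```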

Demo Video

The demo video showcases our interface for real-world audio source separation: users isolate any sound source with a natural language prompt, and ZeroSep performs the separation zero-shot, with no training required. All code and the demo implementation are available in our GitHub repository.

ZeroSep interface demonstration: separating audio sources using natural language prompts
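For concreteness, a usage snippet like the one below conveys the prompt-driven workflow shown in the demo. The `zerosep` package name, the `separate` function, and its arguments are hypothetical placeholders (only `soundfile`, a standard library for WAV I/O, is real); consult the GitHub repository for the actual interface.

```python
# Hypothetical usage sketch only: the package name, function, and arguments are
# placeholders, not the repository's actual API. See the GitHub README for real usage.
import soundfile as sf           # real library for reading/writing WAV files
from zerosep import separate     # HYPOTHETICAL import path

mixture, sr = sf.read("street_mixture.wav")                    # load a real-world recording
estimate = separate(mixture, sample_rate=sr, prompt="speech")  # isolate the prompted source
sf.write("speech_estimate.wav", estimate, sr)                  # save the separated track
```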

Comparison with State-of-the-Art Methods

Below are comparison results on the MUSIC and AVE datasets. We compare ZeroSep against training-based methods (LASS-Net, AudioSep, FlowSep) and the training-free method AudioEdit. The natural language prompt used to specify the target source is shown at the beginning of each row.

Results

Each row of the results table pairs a natural language prompt with audio clips (playable on the project page): the input mixture, the ground-truth source, and the separations produced by LASS-Net, AudioSep, and FlowSep (training-based), AudioEdit (training-free), and ZeroSep (ours, training-free). The prompts are: "tuba", "cello", "accordion", "flute", "xylophone", "chainsaw", "frying food", "Banjo", "Speech" (two examples), and "Church Bell".