I am a fifth-year Ph.D. candidate in the Department of Computer Science at the University of Rochester, advised by Prof. Chenliang Xu. Previously, I spent one wonderful year as a research assistant at the Chinese University of Hong Kong, working with Prof. Chi-Wing Fu on 3D vision. I received my B.Eng. from the Department of Electronic Science and Engineering (ESE) at Nanjing University in 2019. As an undergraduate, I worked with Prof. Zhan Ma on image compression.
I am broadly interested in developing machine learning models to understand how humans perceive surrounding scenes from multi-modal inputs and utilize that perception for action. Specifically, I work on multimodal video understanding and generation.
Research opportunities: I am open to collaborating on research projects. Shoot me an email if you are interested.
✉️ I'm currently seeking full-time opportunities. Please feel free to reach out if you have any openings!
Email / CV / Google Scholar
⭐ ZeroSep: Separate Anything in Audio with Zero Training
Chao Huang, Yuesheng Ma, Junxuan Huang, Susan Liang, Yunlong Tang, Jing Bi, Wenqiang Liu, Nima Mesgarani, Chenliang Xu
arXiv preprint, 2025
Paper /
Project Page /
Code
No fine-tuning, no task-specific data, just latent inversion + text-conditioned denoising to isolate any sound you describe.
⭐ FreSca: Scaling in Frequency Space Enhances Diffusion Models
Chao Huang, Susan Liang, Yunlong Tang, Li Ma, Yapeng Tian, Chenliang Xu
CVPR GMCV, 2025
Paper /
Project Page /
Code
Where and why you should care about frequency space in diffusion models.
🔥 Learning to Highlight Audio by Watching Movies
Chao Huang, Ruohan Gao, J. M. F. Tsang, Jan Kurcius, Cagdas Bilen, Chenliang Xu, Anurag Kumar, Sanjeel Parekh
CVPR, 2025
Paper /
Project Page /
Code /
Dataset
We learn from movies how to transform audio, delivering appropriate highlighting effects guided by the accompanying video.
Video Understanding with Large Language Models: A Survey
Yunlong Tang*, ... , Chao Huang, ... , Ping Luo, Jiebo Luo, Chenliang Xu
IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), 2025
Paper /
Project Page
A survey of recent large language models for video understanding.
VidComposition: Can MLLMs Analyze Compositions in Compiled Videos?
Yunlong Tang*, Junjia Guo*, Hang Hua, Susan Liang, Mingqian Feng, Xinyang Li, Rui Mao, Chao Huang, Jing Bi, Zeliang Zhang, Pooyan Fazli, Chenliang Xu
CVPR, 2025
Paper /
Project Page /
Code
We introduce VidComposition, a benchmark designed to assess MLLMs' understanding of video compositions.
Scaling Concept with Text-Guided Diffusion Models
Chao Huang, Susan Liang, Yunlong Tang, Yapeng Tian, Anurag Kumar, Chenliang Xu
arXiv preprint, 2024
Paper /
Project Page /
Code
We use pretrained text-guided diffusion models to scale concepts up or down in images and audio.
DAVIS: High-Quality Audio-Visual Separation with Generative Diffusion Models
Chao Huang, Susan Liang, Yapeng Tian, Anurag Kumar, Chenliang Xu
ACCV, 2024 🏆 Best Paper Award, Honorable Mention
Paper /
Project Page / Code
A new take on the audio-visual separation problem using recent generative diffusion models.
Language-Guided Joint Audio-Visual Editing Via One-Shot Adaptation
Susan Liang, Chao Huang, Yapeng Tian, Anurag Kumar, Chenliang Xu
ACCV, 2024
Paper /
Project Page /
Dataset
We achieve joint audio-visual editing under language guidance.
Modeling and Driving Human Body Soundfields through Acoustic Primitives
Chao Huang, Dejan Markovic, Chenliang Xu, Alexander Richard
ECCV, 2024
Paper /
Project Page
Thinking of the equivalent of 3D Gaussian Splatting and volumetric primitives for the human body soundfield? Here, we introduce Acoustic Primitives.
AV-NeRF: Learning Neural Fields for Real-World Audio-Visual Scene Synthesis
Susan Liang, Chao Huang, Yapeng Tian, Anurag Kumar, Chenliang Xu
NeurIPS, 2023
Paper /
Project Page /
Code
We propose a method for synthesizing real-world audio-visual scenes at novel positions and viewing directions.
Egocentric Audio-Visual Object Localization
Chao Huang, Yapeng Tian, Anurag Kumar, Chenliang Xu
CVPR, 2023
Paper /
Code
We explore visual localization of sound sources in egocentric videos, propose a new localization method, and establish a benchmark for evaluation.
Non-Local Part-Aware Point Cloud Denoising
Chao Huang*, Ruihui Li*, Xianzhi Li, Chi-Wing Fu
arXiv preprint, 2020
A non-local attention-based method for point cloud denoising in both synthetic and real scenes.
Extreme Image Compression via Multiscale Autoencoders With Generative Adversarial Optimization
Chao Huang, Haojie Liu, Tong Chen, Qiu Shen, Zhan Ma
IEEE Visual Communications and Image Processing (VCIP), 2019   (Oral Presentation)
An image compression system for extreme conditions, e.g., below 0.05 bits per pixel (bpp).
University of Rochester, NY, USA
Ph.D. in Computer Science
Jan. 2021 - Present
Advisor: Chenliang Xu
Nanjing University, Nanjing, China
B.Eng. in Electronic Science and Engineering
Sept. 2015 - Jun. 2019
Meta Reality Labs Research, Cambridge, UK
Research Scientist Intern
May 2024 - Aug. 2024
Mentors: Sanjeel Parekh, Ruohan Gao, Anurag Kumar
Codec Avatars Lab, Meta, Pittsburgh, PA, USA
Research Scientist Intern
May 2023 - Nov. 2023
Mentors: Dejan Markovic, Alexander Richard
The Chinese University of Hong Kong, Shatin, Hong Kong
Research Assistant
Jul. 2019 - Dec. 2020
Advisor: Chi-Wing Fu