I am a fifth-year Ph.D. candidate in the Department of Computer Science at the University of Rochester, advised by Prof. Chenliang Xu. Previously, I spent a wonderful year as a research assistant at The Chinese University of Hong Kong, working with Prof. Chi-Wing Fu on 3D vision. I received my B.Eng. from the Department of Electronic Science and Engineering at Nanjing University in 2019. During my undergraduate studies, I worked with Prof. Zhan Ma on image compression.
                
I work on multimodal learning and generation. Recently, I have been particularly interested in leveraging the power of large language models (LLMs) to enhance multimodal understanding and generation.
Research opportunities: I am open to collaborating on research projects. Shoot me an email if you are interested.
              
                 
I'm currently seeking full-time opportunities. Please feel free to reach out if you have any openings!
Email / CV / Google Scholar
          
DRIFT: Directional Reasoning Injection for Fine-Tuning MLLMs
Chao Huang, Zeliang Zhang, Jiang Liu, Ximeng Sun, Jialian Wu, Xiaodong Yu, Ze Wang, Chenliang Xu, Emad Barsoum, Zicheng Liu
arXiv preprint, 2025
Paper / Project Page / Code
DRIFT transfers reasoning from DeepSeek-R1 into QwenVL via gradient-space guidance, improving multimodal reasoning without destabilizing alignment or requiring expensive RL.
XModBench: Benchmarking Cross-Modal Capabilities and Consistency in Omni-Language Models
Xingrui Wang, Jiang Liu, Chao Huang, Xiaodong Yu, Ze Wang, Ximeng Sun, Jialian Wu, Alan Yuille, Emad Barsoum, Zicheng Liu
arXiv preprint, 2025
Paper / Project Page / Code / Data
A benchmark for evaluating cross-modal capabilities and consistency in omni-language models.
High-Quality Sound Separation Across Diverse Categories via Visually-Guided Generative Modeling
Chao Huang, Susan Liang, Yapeng Tian, Anurag Kumar, Chenliang Xu
IJCV, 2025
Paper / Project Page / Code
How generative models can improve sound separation across diverse categories with visually-guided training.
ZeroSep: Separate Anything in Audio with Zero Training
Chao Huang, Yuesheng Ma, Junxuan Huang, Susan Liang, Yunlong Tang, Jing Bi, Wenqiang Liu, Nima Mesgarani, Chenliang Xu
NeurIPS, 2025
Paper / Project Page / Code
No fine-tuning, no task-specific data: just latent inversion and text-conditioned denoising to isolate any sound you describe.
Harnessing the Computation Redundancy in ViTs to Boost Adversarial Transferability
Jiani Liu*, Zhiyuan Wang*, Zeliang Zhang*, Chao Huang, Susan Liang, Yunlong Tang, Chenliang Xu
NeurIPS, 2025
Paper
We propose a bag of tricks to boost the adversarial transferability of ViT-based attacks.
MMPerspective: Do MLLMs Understand Perspective? A Comprehensive Benchmark for Perspective Perception, Reasoning, and Robustness
Yunlong Tang, Pinxin Liu, Mingqian Feng, Zhangyun Tan, Rui Mao, Chao Huang, Jing Bi, Yunzhong Xiao, Susan Liang, Hang Hua, Ali Vosoughi, Luchuan Song, Zeliang Zhang, Chenliang Xu
NeurIPS Datasets and Benchmarks, 2025
Paper / Project Page / Code
We introduce MMPerspective, a comprehensive benchmark for evaluating MLLMs' perspective understanding.
π-AVAS: Can Physics-Integrated Audio-Visual Modeling Boost Neural Acoustic Synthesis?
Susan Liang, Chao Huang, Yunlong Tang, Zeliang Zhang, Chenliang Xu
ICCV, 2025
π-AVAS is a two-stage framework that combines physics-based, vision-guided audio simulation for generalization with flow-matching audio refinement for realism.
FreSca: Scaling in Frequency Space Enhances Diffusion Models
Chao Huang, Susan Liang, Yunlong Tang, Li Ma, Yapeng Tian, Chenliang Xu
CVPR GMCV, 2025
Paper / Project Page / Code
Where and why you should care about frequency space in diffusion models.
Learning to Highlight Audio by Watching Movies
Chao Huang, Ruohan Gao, J. M. F. Tsang, Jan Kurcius, Cagdas Bilen, Chenliang Xu, Anurag Kumar, Sanjeel Parekh
CVPR, 2025
Paper / Project Page / Code / Dataset
We learn from movies to transform audio, delivering appropriate highlighting effects guided by the accompanying video.
Video Understanding with Large Language Models: A Survey
Yunlong Tang*, ..., Chao Huang, ..., Ping Luo, Jiebo Luo, Chenliang Xu
IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), 2025
Paper / Project Page
A survey of recent large language models for video understanding.
VidComposition: Can MLLMs Analyze Compositions in Compiled Videos?
Yunlong Tang*, Junjia Guo*, Hang Hua, Susan Liang, Mingqian Feng, Xinyang Li, Rui Mao, Chao Huang, Jing Bi, Zeliang Zhang, Pooyan Fazli, Chenliang Xu
CVPR, 2025
Paper / Project Page / Code
We introduce VidComposition, a benchmark designed to assess MLLMs' understanding of video compositions.
Scaling Concept with Text-Guided Diffusion Models
Chao Huang, Susan Liang, Yunlong Tang, Yapeng Tian, Anurag Kumar, Chenliang Xu
arXiv preprint, 2024
Paper / Project Page / Code
We use pretrained text-guided diffusion models to scale concepts up or down in images and audio.
DAVIS: High-Quality Audio-Visual Separation with Generative Diffusion Models
Chao Huang, Susan Liang, Yapeng Tian, Anurag Kumar, Chenliang Xu
ACCV, 2024 (Best Paper Award, Honorable Mention)
Paper / Project Page / Code
A new take on the audio-visual separation problem using recent generative diffusion models.
Language-Guided Joint Audio-Visual Editing via One-Shot Adaptation
Susan Liang, Chao Huang, Yapeng Tian, Anurag Kumar, Chenliang Xu
ACCV, 2024
Paper / Project Page / Dataset
We achieve joint audio-visual editing under language guidance.
Modeling and Driving Human Body Soundfields through Acoustic Primitives
Chao Huang, Dejan Markovic, Chenliang Xu, Alexander Richard
ECCV, 2024
Paper / Project Page
Looking for the equivalent of 3D Gaussian Splatting and volumetric primitives for human body soundfields? Here, we introduce acoustic primitives.
AV-NeRF: Learning Neural Fields for Real-World Audio-Visual Scene Synthesis
Susan Liang, Chao Huang, Yapeng Tian, Anurag Kumar, Chenliang Xu
NeurIPS, 2023
Paper / Project Page / Code
We propose a novel method for synthesizing real-world audio-visual scenes at novel positions and directions.
Egocentric Audio-Visual Object Localization
Chao Huang, Yapeng Tian, Anurag Kumar, Chenliang Xu
CVPR, 2023
Paper / Code
We explore sound source visual localization in egocentric videos, propose a new localization method, and establish a benchmark for evaluation.
Non-Local Part-Aware Point Cloud Denoising
Chao Huang*, Ruihui Li*, Xianzhi Li, Chi-Wing Fu
arXiv preprint, 2020
A non-local, attention-based method for point cloud denoising in both synthetic and real scenes.
Extreme Image Compression via Multiscale Autoencoders with Generative Adversarial Optimization
Chao Huang, Haojie Liu, Tong Chen, Qiu Shen, Zhan Ma
IEEE Visual Communications and Image Processing (VCIP), 2019 (Oral Presentation)
An image compression system for extreme conditions, e.g., below 0.05 bits per pixel (bpp).
          
University of Rochester, NY, USA
Ph.D. in Computer Science
Jan. 2021 - Present
Advisor: Chenliang Xu
Nanjing University, Nanjing, China
B.Eng. in Electronic Science and Engineering
Sept. 2015 - Jun. 2019
          
AMD Research, Remote
Research Scientist Intern
May 2025 - Aug. 2025
Mentors: Jiang Liu, Zicheng Liu
Meta Reality Labs Research, Meta, Cambridge, UK
Research Scientist Intern
May 2024 - Aug. 2024
Mentors: Sanjeel Parekh, Ruohan Gao, Anurag Kumar
Codec Avatars Lab, Meta, Pittsburgh, USA
Research Scientist Intern
May 2023 - Nov. 2023
Mentors: Dejan Markovic, Alexander Richard
The Chinese University of Hong Kong, Shatin, Hong Kong
Research Assistant
Jul. 2019 - Dec. 2020
Advisor: Chi-Wing Fu
          
2025: ICCV 2025 Doctoral Consortium, Selected Participant
2024: ACCV 2024 Best Paper Award, Honorable Mention, for "DAVIS: High-Quality Audio-Visual Separation with Generative Diffusion Models"
          
          
Conference Reviewing: CVPR (2023-2025), ICCV (2025), AAAI (2023-2025), ACM MM (2023-2025), NeurIPS (2025)
Journal Reviewing: IEEE Transactions on Multimedia (TMM), IEEE Transactions on Image Processing (TIP), SIGGRAPH (ACM)