Exploring visually-guided acoustic highlighting to transform audio experiences
VisAH is a novel approach that transforms audio to deliver appropriate highlighting effects guided by the accompanying video. This gallery showcases examples of VisAH in action, comparing our results with other methods and demonstrating applications.
Examples from our Muddy Mixed Dataset, showcasing the poorly mixed input videos, LCE highlighting results, VisAH model outputs, and the original movie clips for comparison.
In this example, the speech is not highlighted properly in the input, and our model resolves this issue.
In this video, our model highlights the sound effect properly.
In this video, our model highlights the speech properly.
Our VisAH model can refine video-to-audio generation by rebalancing audio sources in alignment with the video, resulting in improved audio-visual coherence.
Note: Videos are sourced from the MovieGen website. All videos are adjusted to the same loudness level.
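As a reference for how this kind of loudness matching can be done, the sketch below measures ITU-R BS.1770 integrated loudness and rescales a clip to a common target. The choice of the pyloudnorm and soundfile Python packages, and the -23 LUFS target, are illustrative assumptions rather than the exact pipeline used for these videos.

```python
import soundfile as sf
import pyloudnorm as pyln

def match_loudness(in_path: str, out_path: str, target_lufs: float = -23.0) -> None:
    """Measure integrated loudness (ITU-R BS.1770) and rescale to a target level."""
    audio, rate = sf.read(in_path)
    meter = pyln.Meter(rate)                      # BS.1770 loudness meter
    loudness = meter.integrated_loudness(audio)   # current loudness in LUFS
    leveled = pyln.normalize.loudness(audio, loudness, target_lufs)
    sf.write(out_path, leveled, rate)

# e.g. match_loudness("clip_raw.wav", "clip_leveled.wav")
```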
The generated videos are from OpenAI Sora, and the corresponding audio is generated by Seeing-and-Hearing.
Our VisAH model can also be applied to real-world videos, where audio is often recorded with suboptimal quality and may require rebalancing.
The videos are sourced from the AudioCaps dataset.
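To make the idea of rebalancing audio sources (as mentioned for the video-to-audio and real-world examples above) concrete, here is a minimal, purely illustrative sketch of remixing separated stems with per-stem gains. This is not the VisAH model itself; the stem names and gain values are hypothetical placeholders.

```python
import numpy as np

def rebalance(stems: dict[str, np.ndarray], gains: dict[str, float]) -> np.ndarray:
    """Remix separated audio stems with per-stem gains, then guard against clipping."""
    mix = sum(audio * gains.get(name, 1.0) for name, audio in stems.items())
    peak = np.max(np.abs(mix))
    return mix / peak if peak > 1.0 else mix

# Hypothetical usage: boost dialogue, pull back music, leave effects untouched.
# remixed = rebalance(
#     {"speech": speech_wav, "music": music_wav, "effects": fx_wav},
#     {"speech": 1.5, "music": 0.5, "effects": 1.0},
# )
```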