We introduce a new task, Scene-Aware Visually-Driven Speech Synthesis, which addresses the limitations of existing speech generation models in creating immersive auditory experiences that align with the real physical world. To tackle its two core challenges, data scarcity and modality decoupling, we propose VividVoice, a unified generative framework. First, we construct Vivid-210K, a large-scale, high-quality hybrid multimodal dataset whose programmatic pipeline establishes, for the first time, a strong correlation between visual scenes, speaker identity, and audio. Second, we design a core alignment module, D-MSVA, which uses a decoupled memory bank architecture and a cross-modal hybrid supervision strategy to achieve fine-grained alignment from visual scenes to timbre and environmental acoustic features. Subjective and objective experiments show that VividVoice significantly outperforms existing baseline models in audio fidelity, content clarity, and multimodal consistency. Our demo is available at https://chengyuann.github.io/VividVoice/.
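This page does not describe the internals of D-MSVA, so the following is only a toy illustration of the general idea behind a decoupled memory bank: one bank is queried for timbre and a separate bank for environmental acoustics, so changing the scene cue leaves the timbre feature untouched. Every name, dimension, and value here (`timbre_bank`, `env_bank`, `read_memory`, `decoupled_read`) is a hypothetical assumption, not the authors' implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def read_memory(query, bank):
    """Attention read: weight each slot by softmax(query . slot), then mix."""
    scores = [sum(q * s for q, s in zip(query, slot)) for slot in bank]
    weights = softmax(scores)
    dim = len(bank[0])
    return [sum(w * slot[i] for w, slot in zip(weights, bank)) for i in range(dim)]


# Hypothetical toy banks; in a real model these would be learned parameters.
timbre_bank = [[1.0, 0.0], [0.0, 1.0]]
env_bank = [[1.0, 1.0], [-1.0, -1.0], [0.5, -0.5]]


def decoupled_read(speaker_query, scene_query):
    """Query each bank with its own visual cue, keeping the two reads independent."""
    return {
        "timbre": read_memory(speaker_query, timbre_bank),
        "environment": read_memory(scene_query, env_bank),
    }


features = decoupled_read(speaker_query=[2.0, 0.0], scene_query=[0.0, 1.0])
```

Because the two reads share no state, swapping `scene_query` changes only `features["environment"]`, which mirrors the fixed-character, changing-environment demos further down the page.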
Generating speech that matches the acoustic properties of different visual scenes.
Environment: Sea
Prompt: [Please add your text prompt here]
Environment: Rain
Prompt: It's raining so hard. I feel like I can't go home.
Environment: Animals
Prompt: I finally entered the embrace of nature.
In each example, we provide the visual scene, text prompt, and audio from both the baseline model (VoiceLDM) and our proposed model (VividVoice).
Visual Scene | Text Prompt | VoiceLDM (Baseline) | VividVoice (Ours) |
---|---|---|---|
![]() | "I messed things up, it's all my fault. I'm really sorry."<br>A woman speaks on the rough coast | (audio) | (audio) |
 | "I can't believe my eyes. It's so beautiful."<br>A woman speaks on the rough coast | (audio) | (audio) |
![]() | "I can't believe my eyes. It's so beautiful."<br>A man speaks on the rough coast | (audio) | (audio) |
 | "Try this, it's a new flavor I just attempted to make."<br>A man speaks on the rough coast | (audio) | (audio) |
Visual Scene | Text Prompt | VoiceLDM (Baseline) | VividVoice (Ours) |
---|---|---|---|
![]() | "I can't believe my eyes. It's so beautiful."<br>A woman talking in the forest | (audio) | (audio) |
 | "The noise of the city is far away, all I hear is the wind."<br>A woman talking in the forest | (audio) | (audio) |
![]() | "The noise of the city is far away, all I hear is the wind."<br>An old man talking in the forest | (audio) | (audio) |
Visual Scene | Text Prompt | VoiceLDM (Baseline) | VividVoice (Ours) |
---|---|---|---|
![]() | "I'm a little tired today, and I can finally get off work."<br>A man talking on a noisy street | (audio) | (audio) |
Visual Scene | Text Prompt | VoiceLDM (Baseline) | VividVoice (Ours) |
---|---|---|---|
![]() | "The wind whistles through the crevices, emitting a low mourn."<br>An old man speaking hoarsely in the valley | (audio) | (audio) |
The visual environment remains a sea coast, while the character's visual identity changes, resulting in a different voice timbre. The speech content is fixed.
Character 1
Prompt: The breath of the sea is one of the most fascinating sounds in the world.
Character 2
Prompt: The breath of the sea is one of the most fascinating sounds in the world.
The character's visual identity is fixed, while the visual environment changes, resulting in a different background acoustic scene. The speech content is fixed.
Environment 1: Forest
Prompt: The weather is terrible. I've been in a bad mood all day. How can I forget it?
Environment 2: Rain
Prompt: The weather is terrible. I've been in a bad mood all day. How can I forget it?