🔊 VividVoice: A Unified Framework for Scene-Aware Visually-Driven Speech Synthesis

Abstract

We introduce and define a novel task, Scene-Aware Visually-Driven Speech Synthesis, which aims to address the limitations of existing speech generation models in creating immersive auditory experiences that align with the real physical world. To tackle the two core challenges of data scarcity and modality decoupling, we propose VividVoice, a unified generative framework. First, we construct Vivid-210K, a large-scale, high-quality hybrid multimodal dataset that, through a programmatic pipeline, establishes a strong correlation between visual scenes, speaker identity, and audio for the first time. Second, we design a core alignment module, D-MSVA, which uses a decoupled memory bank architecture and a cross-modal hybrid supervision strategy to achieve fine-grained alignment from visual scenes to timbre and environmental acoustic features. Both subjective and objective experimental results indicate that VividVoice significantly outperforms existing baseline models in audio fidelity, content clarity, and multimodal consistency. Our demo is available at https://chengyuann.github.io/VividVoice/.
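To make the phrase "decoupled memory bank" more concrete, here is a minimal sketch of one way such a lookup could work. It is not the D-MSVA module itself: the bank sizes, the feature dimension, the attention-style read, and every name in the code (`read_memory`, `timbre_bank`, `scene_bank`) are assumptions made for illustration only. The sketch queries two separate banks with the same visual feature, so the timbre lookup and the environment lookup never share memory slots.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def read_memory(query, bank):
    """Attention-style read: weight each slot by its scaled dot product with the query."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, slot)) / scale for slot in bank]
    weights = softmax(scores)
    dim = len(bank[0])
    return [sum(w * slot[d] for w, slot in zip(weights, bank)) for d in range(dim)]


def make_bank(num_slots, dim, rng):
    """Random stand-in for a learned memory bank (hypothetical initialization)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_slots)]


rng = random.Random(0)
DIM = 8  # assumed feature size, chosen only to keep the example small

# Two separate banks: one for speaker timbre, one for environmental acoustics.
timbre_bank = make_bank(4, DIM, rng)
scene_bank = make_bank(6, DIM, rng)

# Stand-in for an image embedding produced by some visual encoder.
visual_feature = [rng.gauss(0.0, 1.0) for _ in range(DIM)]

timbre_cond = read_memory(visual_feature, timbre_bank)
scene_cond = read_memory(visual_feature, scene_bank)
```

In this toy version, the two conditioning vectors could be passed to a speech generator as separate inputs. A trained system would learn the bank contents and would likely project the visual feature differently for each bank.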

💡 Model Architecture

✨ VividVoice Showcase: Visual Control

Generating speech that matches the acoustic properties of different visual scenes.

Environment: Sea

A person looking at the sea

Prompt: [Please add your text prompt here]

Environment: Rain

A rainy scene

Prompt: It's raining so hard. I feel like I can't go home.

Environment: Animals

A grassland scene with animals

Prompt: I finally entered the embrace of nature.

🎧 Audio Demos: Comparison with Baselines

In each example, we provide the visual scene, text prompt, and audio from both the baseline model (VoiceLDM) and our proposed model (VividVoice).

Environment: Sea

Visual Scene | Text Prompt | VoiceLDM (Baseline) | VividVoice (Ours)
A rough coast | "I messed things up, it's all my fault. I'm really sorry."

A woman speaks on the rough coast

"I can't believe my eyes. It's so beautiful."

A woman speaks on the rough coast

A rough coast with a man | "I can't believe my eyes. It's so beautiful."

A man speaks on the rough coast

"Try this, it's a new flavor I just attempted to make."

A man speaks on the rough coast

Environment: Forest

Visual Scene | Text Prompt | VoiceLDM (Baseline) | VividVoice (Ours)
A woman in a forest | "I can't believe my eyes. It's so beautiful."

A woman talking in the forest

"The noise of the city is far away, all I hear is the wind."

A woman talking in the forest

An old man in a forest | "The noise of the city is far away, all I hear is the wind."

An old man talking in the forest

Environment: Street

Visual Scene | Text Prompt | VoiceLDM (Baseline) | VividVoice (Ours)
A man on a noisy street | "I'm a little tired today, and I can finally get off work."

A man talking on a noisy street

Environment: Mountain

Visual Scene | Text Prompt | VoiceLDM (Baseline) | VividVoice (Ours)
An old man speaking hoarsely in the valley | "The wind whistles through the crevices, emitting a low mourn."

An old man speaking hoarsely in the valley

🔬 Evaluation of Decoupling Ability

Fixed Environment, Varying Character (FE-VC)

The visual environment remains a sea coast, while the character's visual identity changes, resulting in a different voice timbre. The speech content is fixed.

Character 1

Character 1 at the sea

Prompt: The breath of the sea is one of the most fascinating sounds in the world.

Character 2

Character 2 at the sea

Prompt: The breath of the sea is one of the most fascinating sounds in the world.

Fixed Character, Varying Environment (FC-VE)

The character's visual identity is fixed, while the visual environment changes, resulting in a different background acoustic scene. The speech content is fixed.

Environment 1: Forest

Character in a forest

Prompt: The weather is terrible. I've been in a bad mood all day. How can I forget it?

Environment 2: Rain

Character in the rain

Prompt: The weather is terrible. I've been in a bad mood all day. How can I forget it?
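
The two protocols above can be written down as a small trial builder. This is only a sketch of the experimental design, not the evaluation code behind this page: the `Trial` record, the function names, and the string labels for characters and environments are hypothetical. In each protocol exactly one visual factor varies while the text stays fixed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    character: str    # label for the speaker shown in the image (hypothetical)
    environment: str  # label for the visual scene (hypothetical)
    text: str         # speech content to synthesize


def fe_vc(environment, characters, text):
    """Fixed Environment, Varying Character: only the character changes."""
    return [Trial(c, environment, text) for c in characters]


def fc_ve(character, environments, text):
    """Fixed Character, Varying Environment: only the environment changes."""
    return [Trial(character, e, text) for e in environments]


def varying_fields(trials):
    """Return the names of the fields that differ across a set of trials."""
    names = ("character", "environment", "text")
    return {n for n in names if len({getattr(t, n) for t in trials}) > 1}


fe_vc_trials = fe_vc(
    "sea",
    ["character_1", "character_2"],
    "The breath of the sea is one of the most fascinating sounds in the world.",
)
fc_ve_trials = fc_ve(
    "character",
    ["forest", "rain"],
    "The weather is terrible. I've been in a bad mood all day. How can I forget it?",
)
```

Checking `varying_fields` on each set confirms the design. If the model decouples the two factors, a timbre change in the FE-VC outputs should follow the character alone, and a background change in the FC-VE outputs should follow the environment alone.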