VividVoice: A Unified Framework for Scene-Aware Visually-Driven Speech Synthesis

We introduce and define a new task, Scene-Aware Visually-Driven Speech Synthesis, which targets the limitations of existing speech generation models in creating immersive auditory experiences that align with the real physical world. To tackle the two core challenges of data scarcity and modality decoupling, we propose VividVoice, a unified generative framework. First, we construct Vivid-210K, a large-scale, high-quality hybrid multimodal dataset that, through a programmatic pipeline, establishes for the first time a strong correlation between visual scenes, speaker identity, and audio. Second, we design a core alignment module, D-MSVA, which uses a decoupled memory bank architecture and a cross-modal hybrid supervision strategy to achieve fine-grained alignment from visual scenes to timbre and environmental acoustic features. Subjective and objective experiments show that VividVoice significantly outperforms existing baseline models in audio fidelity, content clarity, and multimodal consistency. Our demo is available at https://chengyuann.github.io/VividVoice/.
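To make the idea of a decoupled memory bank more concrete, the following is a minimal sketch and not the actual D-MSVA implementation, whose details are not given on this page. It assumes two separate banks of learned slots (one for timbre, one for environmental acoustics), each read by scaled dot-product attention from a visual query. The bank sizes, the embedding dimension, the random values, and every name in the code are hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def read_memory(query, bank):
    """Attention readout: weight each slot by its scaled dot product with the query."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, slot)) / scale for slot in bank]
    weights = softmax(scores)
    dim = len(bank[0])
    readout = [sum(w * slot[d] for w, slot in zip(weights, bank)) for d in range(dim)]
    return readout, weights


def make_bank(num_slots, dim, rng):
    """Random stand-in for a learned memory bank (assumption: slots are plain vectors)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_slots)]


rng = random.Random(0)
DIM = 8  # hypothetical embedding size

# Two separate banks, so timbre and environment are retrieved independently.
timbre_bank = make_bank(4, DIM, rng)       # hypothetical: speaker timbre slots
environment_bank = make_bank(6, DIM, rng)  # hypothetical: scene acoustics slots

# Hypothetical visual embeddings; a real system would obtain these from an image encoder.
speaker_query = [rng.gauss(0.0, 1.0) for _ in range(DIM)]
scene_query = [rng.gauss(0.0, 1.0) for _ in range(DIM)]

timbre_vec, timbre_w = read_memory(speaker_query, timbre_bank)
env_vec, env_w = read_memory(scene_query, environment_bank)

# Concatenate both readouts into one conditioning vector for a downstream generator.
condition = timbre_vec + env_vec
print(len(condition))
```

Because each bank is read with its own query in this sketch, changing the scene query leaves the timbre readout untouched. That is the kind of independence the decoupling evaluations on this page (fixed environment with varying character, and fixed character with varying environment) are meant to probe.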

Visual Scene	Text Prompt	VoiceLDM (Baseline)	VividVoice (Ours)
	"I messed things up, it's all my fault. I'm really sorry."	A woman speaks on the rough coast
	"I can't believe my eyes. It's so beautiful."	A woman speaks on the rough coast
	"I can't believe my eyes. It's so beautiful."	A man speaks on the rough coast
	"Try this, it's a new flavor I just attempted to make."	A man speaks on the rough coast

Visual Scene	Text Prompt	VoiceLDM (Baseline)	VividVoice (Ours)
	"I can't believe my eyes. It's so beautiful."	A woman talking in the forest
	"The noise of the city is far away, all I hear is the wind."	A woman talking in the forest
	"The noise of the city is far away, all I hear is the wind."	An old man talking in the forest

Abstract

💡 Model Architecture

✨ VividVoice Showcase: Visual Control

🎧 Audio Demos: Comparison with Baselines

Environment: Sea

Environment: Forest

Environment: Street

Environment: Mountain

🔬 Evaluation of Decoupling Ability

Fixed Environment, Varying Character (FE-VC)

Fixed Character, Varying Environment (FC-VE)