VTalker: Text-Driven Synthesis of Talking Heads with a Vision Diffusion Transformer
Text-driven Talking Head Generation (THG) marks a significant advance for video production by enabling the creation of realistic talking head videos from minimal input. While prior work has explored few-shot talking head synthesis, existing methods often fall short in lip-sync consistency and expression diversity, both of which are essential for practical applications. In this paper, we present VTalker, a novel multi-modal framework for synthesizing talking heads with specific vocal tones and speaking styles. First, a Text-to-Speech (T2S) model generates speech with a given vocal tone from the input text. Second, we design a novel multi-modal fusion module that effectively integrates the speech and text features. Third, we propose V-DiT (Video Diffusion Transformer) as the backbone, framing talking head generation as a temporally iterative denoising task. To further enhance performance, appearance and temporal conditions are injected into the backbone as tokens alongside the video patches, improving image fidelity and ensuring smooth spatio-temporal motion. This design enables our model to generate high-fidelity, text-synchronized talking head videos that generalize across diverse identities. Extensive experiments demonstrate the effectiveness of our approach in generating high-quality text-driven talking head videos.
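The abstract does not give implementation details, but the token-conditioning idea it describes can be illustrated. Below is a minimal PyTorch sketch, under stated assumptions, of one way appearance, temporal, and fused speech-text features could be projected to tokens and concatenated with noisy video patch tokens before the transformer blocks of a diffusion backbone. All class names, dimensions, and the block structure here are hypothetical and chosen for illustration; this is not the authors' V-DiT.

```python
import torch
import torch.nn as nn


class DiTBlockSketch(nn.Module):
    """Hypothetical transformer block: self-attention over the joint
    sequence of condition tokens and spatio-temporal patch tokens."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.norm2(x))


class VDiTSketch(nn.Module):
    """Sketch of a denoising backbone: appearance and temporal conditions
    (and fused speech+text features) become tokens that are concatenated
    with the noisy video patch tokens; the head predicts noise per patch."""

    def __init__(self, dim=512, patch_dim=768, cond_dim=256, depth=4):
        super().__init__()
        self.patch_proj = nn.Linear(patch_dim, dim)     # noisy video patches -> tokens
        self.appear_proj = nn.Linear(cond_dim, dim)     # appearance (identity) condition -> token
        self.temporal_proj = nn.Linear(cond_dim, dim)   # temporal (motion) condition -> token
        self.fused_proj = nn.Linear(cond_dim, dim)      # fused speech+text features -> tokens
        self.blocks = nn.ModuleList(DiTBlockSketch(dim) for _ in range(depth))
        self.head = nn.Linear(dim, patch_dim)           # noise prediction per patch

    def forward(self, patches, appear, temporal, fused):
        # patches: (B, N, patch_dim); appear/temporal: (B, cond_dim); fused: (B, T, cond_dim)
        tokens = torch.cat(
            [
                self.appear_proj(appear).unsqueeze(1),    # 1 appearance token
                self.temporal_proj(temporal).unsqueeze(1),# 1 temporal token
                self.fused_proj(fused),                   # T fused speech+text tokens
                self.patch_proj(patches),                 # N patch tokens
            ],
            dim=1,
        )
        for blk in self.blocks:
            tokens = blk(tokens)
        n_cond = 2 + fused.shape[1]
        return self.head(tokens[:, n_cond:])  # keep only the patch positions


# Toy usage (all shapes are illustrative only).
model = VDiTSketch()
noise_pred = model(
    torch.randn(2, 196, 768),  # noisy video patch tokens
    torch.randn(2, 256),       # appearance condition
    torch.randn(2, 256),       # temporal condition
    torch.randn(2, 16, 256),   # fused speech+text features
)
print(noise_pred.shape)  # (2, 196, 768)
```

In a "temporally iterative denoising" setup of this kind, such a backbone would be applied repeatedly across diffusion timesteps, with the condition tokens held fixed while the patch tokens are progressively denoised; the concatenation-as-tokens route lets every patch attend directly to the identity and motion conditions at every layer.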