
A New Approach to Joint Audio-Video Generation
Generating synchronized audio and video from a single input remains one of the hardest problems in multimodal AI. Most current methods either align the two modalities after each is generated independently (dual-tower with posterior alignment) or dump text, audio, and video into a single shared latent space. Both approaches have known weaknesses: the former loses fine-grained co-evolution between sound and image, while the latter conflates high-level semantic conditioning with low-level synchronization tasks.
In a paper published on May 28, 2026, researchers from Baidu's ERNIE team propose a third path. Their framework, called NAVA (Native Audio-Visual Alignment), first establishes a dedicated interaction space where audio and video features learn to correspond, then uses an external textual context to guide the joint denoising process. The result is a 6.3-billion-parameter model that produces temporally precise audio-visual pairs with optional control over speech timbre.
NAVA's architecture is named Align-then-Fuse MMDiT (Multimodal Diffusion Transformer). It transitions from modality-aware alignment to modality-shared joint denoising. According to the team's project page, the model is built on a context-conditioned native alignment principle: audio-visual correspondence is learned in its own subspace before the diffusion process begins, rather than as an afterthought.
Why Existing Architectures Fall Short
The paper identifies two dominant design patterns in open-source joint-generation models. Dual-tower approaches keep separate towers for audio and video, aligning them only after each has been generated on its own. This posterior alignment cannot enforce tight synchronization at fine timescales because the modalities never inform each other during generation. Unified tri-modal designs, by contrast, mix text, audio, and video tokens into one space, which forces the model to handle semantic understanding (what the scene means) and low-level synchronization (when the sound should occur) at the same time. The team argues that overloading the same representation with both tasks degrades overall quality.

NAVA instead uses a two-phase pipeline. In the first phase, a lightweight alignment module learns cross-modal attention between audio and video features without any text conditioning. This builds a native correspondence that captures physical relationships—for example, that a hand clap must coincide with a palm impact. In the second phase, the aligned features are passed into a shared MMDiT backbone alongside text embeddings. The text provides high-level semantic guidance (e.g., “a person speaking with a calm tone in a rainy street”), but it does not interfere with the low-level temporal coupling already established.
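The two-phase flow described above can be sketched in a few lines of plain Python. This is a toy illustration, not the authors' implementation: the function names (`attend`, `align`, `fuse`), the identity projections, the residual add, and the choice to feed text only as extra attention context are all assumptions made for the example. What it shows is the ordering: audio and video exchange information first without text, and text enters only in the shared pass.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, context):
    """Scaled dot-product attention; identity projections stand in for learned ones."""
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in context]
        weights = softmax(scores)
        outputs.append([sum(w * v[d] for w, v in zip(weights, context)) for d in range(dim)])
    return outputs

def add_residual(tokens, updates):
    return [[t + u for t, u in zip(tok, upd)] for tok, upd in zip(tokens, updates)]

def align(audio, video):
    """Phase 1 (hypothetical): audio and video attend to each other; no text is involved."""
    aligned_audio = add_residual(audio, attend(audio, video))
    aligned_video = add_residual(video, attend(video, audio))
    return aligned_audio, aligned_video

def fuse(aligned_audio, aligned_video, text):
    """Phase 2 (hypothetical): one shared pass over audio+video tokens, text as extra context."""
    av_tokens = aligned_audio + aligned_video
    return attend(av_tokens, av_tokens + text)

audio = [[1.0, 0.0], [0.0, 1.0]]              # 2 toy audio tokens
video = [[1.0, 1.0], [0.5, 0.0], [0.0, 0.5]]  # 3 toy video tokens
text = [[0.2, 0.8]]                           # 1 toy text token

aligned_audio, aligned_video = align(audio, video)
fused = fuse(aligned_audio, aligned_video, text)
print(len(fused), len(fused[0]))
```

Because `align` never receives the text tokens, nothing in the prompt can change the audio-video coupling computed in the first phase, which is the separation the article attributes to NAVA.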
Controllable Speech Timbre via In-Context Conditioning
Beyond synchronization, NAVA introduces a feature called Timbre-in-Context Conditioning. Many video-generation use cases—such as automated dubbing, virtual avatars, or film post-production—require that the generated voice match a specific reference speaker’s timbre. Previous methods either treated timbre as a fixed global attribute or ignored it altogether.
NAVA associates a short reference audio clip with corresponding speech spans in the generated output. The model learns to transfer timbral characteristics from the reference to the target speech while preserving natural prosody and lip-sync. In the paper, the team reports experiments on Verse-Bench and Seed-TTS, along with a user study, showing that NAVA achieves stronger reference-timbre controllability compared to baseline open-source models, without sacrificing video quality or synchronization accuracy.
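One plausible way to tie a reference clip to specific speech spans is a span-restricted attention mask, sketched below. The article does not describe the actual mechanism, so everything here is an assumption for illustration: the function name `timbre_mask`, the half-open frame ranges, and the idea that only frames inside a speech span may attend to the reference tokens.

```python
def timbre_mask(num_frames, speech_spans, num_ref_tokens):
    """Return mask[frame][ref] = True where an audio frame may attend to a reference token.

    speech_spans is a list of half-open (start, end) frame ranges (hypothetical format).
    """
    in_speech = [False] * num_frames
    for start, end in speech_spans:
        # Clamp each span to the valid frame range before marking it.
        for frame in range(max(0, start), min(end, num_frames)):
            in_speech[frame] = True
    return [[flag] * num_ref_tokens for flag in in_speech]

mask = timbre_mask(num_frames=8, speech_spans=[(1, 3), (5, 7)], num_ref_tokens=2)
speech_frames = [i for i, row in enumerate(mask) if any(row)]
print(speech_frames)  # [1, 2, 5, 6]
```

Under this sketch, non-speech frames (ambient sound, silence) never see the reference clip, so the timbre cue would affect only the spans where a voice is expected.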
The researchers note that the Timbre-in-Context module operates entirely within the alignment phase, meaning it does not increase the number of diffusion steps or the model's inference budget. The design also keeps the total parameter count at 6.3B, which is relatively lean for a multimodal generative model.
Performance Benchmarks and Open-Source Release

NAVA was evaluated on two established benchmarks. On Verse-Bench, which tests video quality and audio-visual sync across diverse scenes, the framework achieved superior video quality scores and precise temporal alignment. On Seed-TTS, a popular dataset for text-to-speech evaluation, NAVA produced competitive audio quality while maintaining synchronization metrics that outperformed dual-tower baselines by a wide margin.
A user study involving 50 participants assessed subjective quality: participants rated NAVA-generated clips higher for both naturalness and lip-sync accuracy compared to outputs from prior open-source systems. The model runs on a single A100 GPU for inference, and the authors have released the code and project page on GitHub under the ERNIE Research organization. The repository had already accumulated 64 stars at the time of publication.
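As a rough sanity check on the single-GPU claim, the weight footprint of a 6.3B-parameter model can be estimated directly. The numeric precisions below are assumptions (the article does not say how the weights are stored), and the estimate covers weights only, not activations, latents, or any auxiliary encoders.

```python
PARAMS = 6_300_000_000  # 6.3B parameters, as reported in the article

def weight_gb(params, bytes_per_param):
    """Weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params * bytes_per_param / 1e9

fp16_gb = weight_gb(PARAMS, 2)  # assumed 16-bit weights (fp16 or bf16)
fp32_gb = weight_gb(PARAMS, 4)  # assumed 32-bit weights
print(f"16-bit: {fp16_gb:.1f} GB, 32-bit: {fp32_gb:.1f} GB")
```

Either figure is below the 40 GB and 80 GB memory sizes in which the A100 ships, which is consistent with single-card inference, though actual headroom depends on clip length and resolution.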
One limitation acknowledged by the team is that NAVA currently focuses on single-speaker scenarios and does not handle overlapping voices or complex background sounds. Extending the framework to multi-source audio-video generation remains future work.
Implications for Practical Deployment and the Broader Field
Joint audio-video generation has long been considered a “moonshot” capability for video production, gaming, and real-time avatar systems. The fact that NAVA achieves state-of-the-art results with only 6.3B parameters—significantly smaller than many large-scale multimodal models—makes it immediately more viable for real-world applications where latency and compute cost matter. The decoupling of alignment from semantic conditioning also provides a cleaner architectural template that other research groups can build upon.
For developers working on video dubbing, automated content creation, or digital humans, NAVA offers a concrete open-source option that addresses two pain points: temporal synchronization without artifacts, and speaker timbre control without external fine-tuning. The Baidu team has signaled that future releases may include multi-speaker support and longer context windows, which would further broaden its use cases.
As the generative AI community moves beyond static text and image generation, frameworks like NAVA that tackle the harder problem of synchronized multimodal content will likely define the next wave of production tools. Baidu’s focus on native alignment rather than post-hoc correction sets a new precedent for how researchers think about audio-video co-generation.