Subject: ByteDance LatentSync — architecture, environment, inference requirements, and local optimizations
Context: Local setup on Linux with NVIDIA GPU, Gradio UI, and Hugging Face checkpoints
Date: April 2026
1. Executive summary
LatentSync is an end-to-end audio-conditioned lip synchronization system. It maps a driving audio signal to mouth motion in a reference video using a latent diffusion stack (Stable Diffusion–style VAE + U-Net) with Whisper-derived audio embeddings and optional DeepCache acceleration. This report documents how the open-source implementation behaves in practice: hardware limits, checkpoint variants (v1.5 vs v1.6), input/output contracts, memory optimizations applied in this workspace, and operational limits (duration, language, still images).
2. System architecture
2.1 High-level data flow
- Video ingest: Frames are read from disk (FFmpeg may normalize to 25 FPS).
- Audio ingest: A separate waveform is loaded at 16 kHz, mono.
- Whisper encoder: Audio is converted to a sequence of embedding vectors (e.g. 384-dim with tiny.pt, 768-dim with small.pt, depending on the U-Net cross_attention_dim).
- Face processing: Each frame is passed through face detection and affine alignment; the face crop is normalized to the model spatial resolution (256 or 512).
- Masking: A fixed mouth mask defines the editable region; reference and masked pixels are VAE-encoded to latents.
- Diffusion: A 3D U-Net denoises latents conditioned on per-frame audio embeddings (and optional classifier-free guidance).
- Decode & composite: Latents are VAE-decoded; the mouth region is composited back onto the aligned face and warped into the original frame geometry.
- Mux: Video and audio are combined with FFmpeg to produce the final MP4.
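As a concrete illustration of the shapes flowing through this loop, the following sketch mirrors one 16-frame chunk on the 256 px (v1.5) path. The 4-channel, 8× downsampled latent layout is the standard Stable Diffusion VAE convention; the per-frame audio token count is only an assumed placeholder, not read from the code.

```python
import torch

# Illustrative tensor shapes for one temporal chunk (v1.5, 256 px path).
# The SD VAE downsamples 8x spatially and uses 4 latent channels; the
# per-frame audio token count below is an assumption for illustration.
num_frames, res, audio_dim, tokens_per_frame = 16, 256, 384, 10

face_crops   = torch.randn(num_frames, 3, res, res)                  # aligned face crops
masked_crops = torch.randn(num_frames, 3, res, res)                  # same crops with the mouth region masked
audio_embeds = torch.randn(num_frames, tokens_per_frame, audio_dim)  # Whisper features per video frame
latents      = torch.randn(1, 4, num_frames, res // 8, res // 8)     # (B, C, T, H/8, W/8) seen by the 3D U-Net

print(latents.shape)  # torch.Size([1, 4, 16, 32, 32])
```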
2.2 Major components
| Component | Role |
|---|---|
| Whisper (checkpoints/whisper/*.pt) | Audio → sequence of features aligned to video timing |
| VAE (stabilityai/sd-vae-ft-mse) | Pixel ↔ latent space |
| UNet3DConditionModel | Spatiotemporal denoising with cross-attention to audio |
| DDIM scheduler | Noise schedule (configs/ scheduler assets) |
| InsightFace / MediaPipe stack | Face detection and landmarks for alignment |
| Gradio | Web UI for video + audio upload |
3. Model variants and checkpoints
3.1 LatentSync 1.5 (256 × 256)
- Typical VRAM: On the order of ~8 GB for inference (project README).
- U-Net config: e.g. configs/unet/stage2.yaml with resolution: 256.
- Weights: latentsync_unet_1.5.pt from Hugging Face ByteDance/LatentSync-1.5 (recommended filename when co-installing with v1.6).
- Whisper: cross_attention_dim: 384 → checkpoints/whisper/tiny.pt.
3.2 LatentSync 1.6 (512 × 512)
- Typical VRAM: On the order of ~18 GB for inference (project README).
- U-Net config: e.g. configs/unet/stage2_512.yaml with resolution: 512.
- Weights: latentsync_unet.pt from ByteDance/LatentSync-1.6.
- Rationale (per changelog): Trained at higher resolution to reduce mouth blurriness relative to v1.5.
3.3 Checkpoint / resolution pairing
Per project documentation, the checkpoint and the resolution in the U-Net config must correspond (the architecture is the same; only the training resolution differs). Mixing a v1.6 checkpoint with a 256 config, or the reverse, is not supported and will not produce correct results.
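A minimal sketch of keeping this pairing explicit, assuming the file layout described in this report (the helper itself is hypothetical, not part of the repository):

```python
# Hypothetical profile table; paths follow the layout described in this report.
PROFILES = {
    "1.5": {"config": "configs/unet/stage2.yaml",
            "unet": "checkpoints/latentsync_unet_1.5.pt",
            "whisper": "checkpoints/whisper/tiny.pt"},    # cross_attention_dim: 384
    "1.6": {"config": "configs/unet/stage2_512.yaml",
            "unet": "checkpoints/latentsync_unet.pt",
            "whisper": "checkpoints/whisper/small.pt"},   # cross_attention_dim: 768
}

def resolve_profile(name: str) -> dict:
    """Return a matched config/checkpoint set, refusing mixed combinations."""
    if name not in PROFILES:
        raise ValueError(f"Unknown profile {name!r}; expected one of {sorted(PROFILES)}")
    return PROFILES[name]
```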
4. Software environment
4.1 Runtime
- Python: 3.10.x (Conda example in upstream setup_env.sh; this deployment used venv).
- PyTorch: 2.5.1 with CUDA 12.1 wheels (see requirements.txt).
- Key libraries: diffusers, transformers, opencv-python, decord, kornia, gradio, onnxruntime-gpu (InsightFace path), ffmpeg (system + ffmpeg-python).
4.2 Checkpoint acquisition
Large artifacts should be downloaded with huggingface-cli; setting HF_HUB_ENABLE_HF_TRANSFER=1 (with the optional hf_transfer / hf_xet packages installed) improves reliability when pulling large, Xet-backed blobs.
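An equivalent Python form of the download step, assuming the huggingface_hub client; the local_dir and file filter are local choices for this workspace, not upstream requirements:

```python
import os
os.environ.setdefault("HF_HUB_ENABLE_HF_TRANSFER", "1")  # needs hf_transfer installed; set before importing huggingface_hub

from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="ByteDance/LatentSync-1.6",
    local_dir="checkpoints",                       # local choice for this workspace
    allow_patterns=["*.pt", "*.yaml", "*.onnx"],   # illustrative filter; widen if other files are needed
)
```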
5. Input and output specifications
5.1 Video (MP4)
- Decoding: OpenCV or Decord; an optional FFmpeg pass forces 25 FPS (read_video(..., change_fps=True)).
- Codec: H.264 MP4 is typical; anything FFmpeg can decode is generally acceptable.
- Content: One primary face visible enough for landmark-based alignment. Failures surface as “Face not detected” or similar.
- Resolution: Arbitrary; pipeline crops and warps the face to 256 or 512.
5.2 Audio
- Gradio / CLI: Driving signal is typically a separate file (e.g. WAV). Internal resampling to 16 kHz mono.
- Duration vs video: If audio is longer than video, the implementation can loop the video (including reversed segments) to match chunk count; trimming behavior should be validated per use case.
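A minimal sketch of the looping idea described above, assuming a simple "boomerang" index over the available frames; the project's actual trimming/looping logic may differ:

```python
# Extend a frame index to cover a longer audio track by bouncing forward
# and backward over the available frames ("boomerang" looping).
def loop_frame_indices(num_video_frames: int, num_needed: int) -> list[int]:
    forward = list(range(num_video_frames))
    cycle = forward + forward[-2:0:-1]   # 0..N-1 then N-2..1, avoiding repeated endpoints
    return [cycle[i % len(cycle)] for i in range(num_needed)]

print(loop_frame_indices(5, 12))  # [0, 1, 2, 3, 4, 3, 2, 1, 0, 1, 2, 3]
```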
5.3 Still images
- Not a first-class input. Practical approach: convert image + audio to a synthetic MP4 (same frame repeated at 25 FPS for the audio length), then run normal inference.
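A minimal sketch of that conversion, assuming placeholder file names and standard FFmpeg flags:

```python
import subprocess

# Turn a still image + audio into a 25 FPS MP4 the normal pipeline can consume.
# File names are placeholders; only standard ffmpeg flags are used.
subprocess.run([
    "ffmpeg", "-y",
    "-loop", "1", "-framerate", "25", "-i", "portrait.png",   # repeat the image at 25 FPS
    "-i", "speech.wav",
    "-c:v", "libx264", "-tune", "stillimage", "-pix_fmt", "yuv420p",
    "-shortest",                                               # stop when the audio ends
    "synthetic_input.mp4",
], check=True)
```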
5.4 Output
- Video: H.264 via imageio / FFmpeg mux with AAC audio (see lipsync_pipeline and util.write_video patterns).
6. Performance and scalability
6.1 Temporal chunking
Inference iterates over chunks of num_frames (commonly 16 in bundled configs). Total runtime grows linearly with the number of chunks (and with num_inference_steps).
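A back-of-the-envelope example (illustrative numbers, not a benchmark):

```python
import math

# Chunk-count arithmetic: runtime scales with chunks * denoising steps.
fps, clip_seconds, num_frames, num_inference_steps = 25, 20, 16, 20
total_frames = fps * clip_seconds                 # 500 frames
chunks = math.ceil(total_frames / num_frames)     # 32 chunks
unet_passes = chunks * num_inference_steps        # 640 denoising passes (doubled if CFG is on)
print(total_frames, chunks, unet_passes)          # 500 32 640
```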
6.2 Memory (RAM vs VRAM)
- VRAM: Dominated by the U-Net forward pass, VAE encode/decode, and classifier-free guidance (duplicated batch when guidance_scale > 1).
- System RAM: Stores all per-frame face crops for the clip; long clips increase host memory roughly linearly with frame count.
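As a rough illustration of the host-RAM point, assuming float32 RGB crops at the 512 px path (real usage also holds the original frames and other intermediates):

```python
# Back-of-the-envelope host-RAM estimate for staged face crops.
frames = 25 * 60                               # one minute of video at 25 FPS
bytes_per_crop = 512 * 512 * 3 * 4             # float32 RGB crop, ~3.1 MB
print(frames * bytes_per_crop / 2**30, "GiB")  # ~4.4 GiB
```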
6.3 Language / audio content
- No explicit language switch in inference code; features come from Whisper, which is multilingual in principle.
- Lip quality is dataset-dependent; README highlights English and improved Chinese behavior in v1.5. Other languages are best-effort.
7. Local memory optimizations (this workspace)
The following changes reduce peak VRAM and improve long-clip stability:
| Measure | Description |
|---|---|
| VAE slicing | enable_vae_slicing() on the pipeline (see the sketch after this table) |
| VAE tiling | vae.enable_tiling() for spatial tiling in the VAE |
| Micro-batched VAE encode/decode | Chunked encode of masked/reference pixels and chunked decode of latents; auto-tuned by GPU size or overridden via env |
| Per-chunk latent init | Initial noise allocated per temporal chunk instead of for the entire sequence |
| CPU staging of decoded frames | Decoded face tensors moved to CPU float32 between chunks to cap GPU growth |
| DeepCache policy | Disabled on GPUs under ~18 GiB unless forced via environment (DeepCache trades speed vs extra memory) |
| Optional aggressive cleanup | LATENTSYNC_EMPTY_CACHE_EACH_CHUNK for gc / torch.cuda.empty_cache() between chunks |
| Device fix in AlignRestore.restore_img | Ensures face tensors are moved to the restorer’s CUDA device when inputs were staged on CPU |
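The VAE-side measures map onto standard diffusers APIs. The following is a sketch under that assumption: enable_slicing / enable_tiling are called on the VAE directly here (the table mentions the pipeline-level wrapper), and the chunked decode loop is a generic pattern rather than the project's exact code.

```python
import torch
from diffusers import AutoencoderKL

# Sketch of the VAE-side measures above using standard diffusers APIs.
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse", torch_dtype=dtype).to(device)
vae.enable_slicing()   # decode batch items one at a time
vae.enable_tiling()    # tile large spatial dimensions inside the VAE

def decode_in_chunks(latents: torch.Tensor, chunk: int = 4) -> torch.Tensor:
    """Decode a few frames at a time and stage the results on the CPU."""
    outs = []
    for i in range(0, latents.shape[0], chunk):
        with torch.no_grad():
            frames = vae.decode(latents[i:i + chunk] / vae.config.scaling_factor).sample
        outs.append(frames.float().cpu())   # CPU staging caps GPU memory growth
    return torch.cat(outs)
```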
7.1 Environment variables (memory tuning)
- LATENTSYNC_VAE_ENCODE_CHUNK: max frames per VAE encode pass (integer).
- LATENTSYNC_VAE_DECODE_CHUNK: max frames per VAE decode pass.
- LATENTSYNC_DEEPCACHE / LATENTSYNC_NO_DEEPCACHE: force DeepCache on/off.
- LATENTSYNC_EMPTY_CACHE_EACH_CHUNK: aggressive CUDA cache flush between chunks.
- LATENTSYNC_PROFILE: 1.5 / 1.6 style profile selection in Gradio (path resolution).
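A minimal sketch of reading these overrides; the helper name and defaults are assumptions, and the workspace code's exact parsing may differ:

```python
import os

def _env_int(name: str, default: int | None = None) -> int | None:
    """Parse an integer override from the environment, falling back to a default."""
    value = os.environ.get(name)
    return int(value) if value else default

vae_encode_chunk = _env_int("LATENTSYNC_VAE_ENCODE_CHUNK")      # None -> auto-tune by GPU size
vae_decode_chunk = _env_int("LATENTSYNC_VAE_DECODE_CHUNK")
force_deepcache = "LATENTSYNC_DEEPCACHE" in os.environ          # force-enable
disable_deepcache = "LATENTSYNC_NO_DEEPCACHE" in os.environ     # force-disable
empty_cache_each_chunk = os.environ.get("LATENTSYNC_EMPTY_CACHE_EACH_CHUNK", "0") == "1"
profile = os.environ.get("LATENTSYNC_PROFILE", "1.6")           # assumed default
```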
7.2 Operational tuning
- guidance_scale = 1.0: Disables classifier-free guidance, halving the effective U-Net batch and yielding major VRAM savings at some quality cost (see the sketch after this list).
- PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True: PyTorch's suggested mitigation for allocator fragmentation, applied here after OOM diagnostics.
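A sketch of why guidance_scale matters for memory, following the standard classifier-free guidance batching pattern; the conditioning shape below is illustrative only:

```python
import torch

# Classifier-free guidance doubles the batch through the U-Net; disabling it
# (guidance_scale = 1.0) runs a single pass and roughly halves activation memory.
latents = torch.randn(1, 4, 16, 32, 32)          # (B, C, T, H/8, W/8), illustrative
audio_cond = torch.randn(1, 16, 10, 384)         # illustrative audio-conditioning shape
guidance_scale = 1.5

if guidance_scale > 1.0:
    latent_in = torch.cat([latents, latents])                          # 2x batch: uncond + cond
    cond_in = torch.cat([torch.zeros_like(audio_cond), audio_cond])    # empty vs. real audio
else:
    latent_in, cond_in = latents, audio_cond                           # single pass
print(latent_in.shape[0], cond_in.shape[0])
```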
8. Risks and limitations
- Hardware floor: v1.6 @ 512 is impractical on ~8 GB GPUs without unacceptable compromise; v1.5 @ 256 is the appropriate tier.
- Face dependency: Occlusion, profile views, or small faces may fail alignment.
- Long videos: Feasible but slow; host RAM for face tensors can become the bottleneck before VRAM.
- Legal / ethical: Lip-sync can be misused; deploy only with consent and compliant policies.
9. References
- Repository: github.com/bytedance/LatentSync
- Paper: Li et al., LatentSync: Taming Audio-Conditioned Latent Diffusion Models for Lip Sync with SyncNet Supervision, arXiv:2412.09262
- Weights: ByteDance/LatentSync-1.5, ByteDance/LatentSync-1.6
- Upstream components: AnimateDiff, MuseTalk, Whisper, Stable Diffusion VAE, SyncNet lineage (per project acknowledgements)