Omni-Conditions Training

Omni-Conditions Training Framework

  • Stage 0: Initial T2V (Text-to-Video) Pre-training — focuses on aligning text and image for base-level generation.

  • Stages 1–3: Progressive fine-tuning that integrates additional modalities such as audio, pose, and noise.

  • As training progresses, the system is exposed to increasingly strong motion-related patterns to improve realism in lip sync, body posture, and emotional delivery.

  • Training ratios vary per modality, emphasizing more common interactions like text-image, while still optimizing less frequent but expressive modalities like pose and noise.

This progressive structure is intended to give each generated video not only visual fidelity but also dynamic, context-aware motion and human-like responsiveness.
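
The staged schedule and per-modality ratios described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `STAGE_RATIOS` table, the order in which modalities are introduced, the numeric ratios, and the function names `sample_conditions` and `condition_counts` are hypothetical and do not come from the actual training configuration.

```python
import random

# Hypothetical per-stage probabilities that a condition is enabled for a
# training sample. The numbers and the stage at which each modality first
# appears are illustrative placeholders, not published values.
STAGE_RATIOS = {
    0: {"text": 1.0},
    1: {"text": 1.0, "audio": 0.5},
    2: {"text": 1.0, "audio": 0.5, "pose": 0.25},
    3: {"text": 1.0, "audio": 0.5, "pose": 0.25, "noise": 0.25},
}


def sample_conditions(stage, rng):
    """Return the set of conditions active for one training sample."""
    active = set()
    for name, ratio in STAGE_RATIOS[stage].items():
        # Each condition is switched on independently with its own ratio.
        if rng.random() < ratio:
            active.add(name)
    return active


def condition_counts(stage, n_samples, seed=0):
    """Count how often each condition is active over n_samples draws."""
    rng = random.Random(seed)
    counts = {name: 0 for name in STAGE_RATIOS[stage]}
    for _ in range(n_samples):
        for name in sample_conditions(stage, rng):
            counts[name] += 1
    return counts
```

Under these assumed ratios, `condition_counts(3, 10_000)` reports text as active for every sample, while the less frequent modalities such as pose and noise appear in only a fraction of them.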
