OmniHuman-1 Model

The model architecture is designed to fuse multiple input modalities through stacked transformer blocks, as sketched in the code after this list:

  • Inputs: Text, Image, Noise, Audio, and Pose

  • Transformer Stack: Multimodal encoders feed into a transformer sequence model composed of stacked attention blocks.

  • Heatmap & Frame-Level Features: Intermediate layers compute facial heatmaps and audio alignment features to guide visual synthesis.

  • Prediction Module: Generates frame-by-frame video output, ensuring coherence and emotional consistency.
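
As a rough illustration of this layout, the following PyTorch sketch wires per-modality encoders into a shared transformer stack and attaches a frame-level prediction head. All names, feature dimensions, and the one-token-per-frame layout (e.g. `OmniHumanSketch`, the placeholder latent sizes) are illustrative assumptions, not the published OmniHuman-1 implementation.

```python
import torch
import torch.nn as nn

class OmniHumanSketch(nn.Module):
    """Hypothetical sketch: per-modality encoders feed a shared transformer
    stack whose frame-aligned outputs drive a video prediction head."""

    def __init__(self, d_model=512, n_heads=8, n_layers=6, n_frames=16):
        super().__init__()
        # Per-modality encoders projecting each input into a shared token space.
        # Input dimensions are placeholders, not OmniHuman-1 values.
        self.text_enc = nn.Linear(768, d_model)     # pooled text embeddings
        self.image_enc = nn.Linear(1024, d_model)   # reference-image features
        self.audio_enc = nn.Linear(128, d_model)    # per-frame audio features
        self.pose_enc = nn.Linear(64, d_model)      # per-frame pose keypoints
        self.noise_enc = nn.Linear(d_model, d_model)  # per-frame noise latents

        # Shared transformer stack over the concatenated multimodal tokens.
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=n_layers)

        # Frame-level prediction head: maps each frame token to a video latent.
        self.frame_head = nn.Linear(d_model, 4 * 32 * 32)  # placeholder latent size
        self.n_frames = n_frames

    def forward(self, text, image, audio, pose, noise):
        # Encode each modality into tokens of shape (batch, tokens, d_model).
        tokens = torch.cat([
            self.text_enc(text),
            self.image_enc(image),
            self.audio_enc(audio),
            self.pose_enc(pose),
            self.noise_enc(noise),
        ], dim=1)

        fused = self.backbone(tokens)                # cross-modal attention
        frame_tokens = fused[:, -self.n_frames:, :]  # noise tokens, one per frame
        return self.frame_head(frame_tokens)         # per-frame video latents


if __name__ == "__main__":
    model = OmniHumanSketch()
    out = model(
        text=torch.randn(1, 4, 768),    # 4 text tokens
        image=torch.randn(1, 1, 1024),  # 1 reference-image token
        audio=torch.randn(1, 16, 128),  # one audio feature per frame
        pose=torch.randn(1, 16, 64),    # one pose vector per frame
        noise=torch.randn(1, 16, 512),  # one noise latent per frame
    )
    print(out.shape)  # torch.Size([1, 16, 4096])
```

Placing the per-frame noise tokens at the end of the sequence makes it easy to select frame-aligned outputs for the prediction head; in the real system, the facial heatmaps and audio alignment features described above would additionally condition these tokens.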

This deep learning pipeline enables OmniHuman-1 to generate expressive, high-quality video from minimal input.
