NVIDIA Cosmos: The Makings of a World Foundation Model

World foundation models are neural networks that simulate real-world environments and predict accurate outcomes based on text, image, or video input.

At CES, NVIDIA announced Cosmos, a foundation model trained on 20M hours of video data that learned physical dynamics from real-world observations. The model architecture supports both diffusion and autoregressive approaches, with the key innovation being its ability to predict and simulate physical interactions without explicit physics rules.

The implementation details are available in their 75-page technical paper:

https://d1qx31qr3h6wln.cloudfront.net/publications/NVIDIA%20Cosmos_4.pdf

What makes this technically interesting is the scale of video processing required - the training pipeline handled 20 million hours of video, extracting 100 million distinct clips through a multi-stage filtering process. The model differs from previous video-based approaches by operating directly in the wavelet space rather than pixel space, allowing for more efficient compression while preserving temporal dynamics.

💡 Wavelet space is a mathematical representation where signals (in this case, video) are decomposed into different frequency components at different scales, similar to how a prism breaks light into its component colors.
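To make the idea concrete, here is a minimal sketch of a single-level 3D Haar wavelet decomposition of a toy video clip using PyWavelets. This is purely illustrative of what "wavelet space" looks like; it is not the Cosmos tokenizer, and the clip dimensions and wavelet choice are arbitrary.

```python
# Illustrative only: decompose a (time, height, width) clip into wavelet
# sub-bands, then reconstruct it. Not the Cosmos tokenizer.
import numpy as np
import pywt  # PyWavelets

clip = np.random.rand(16, 128, 128).astype(np.float32)  # toy grayscale clip (T, H, W)

# dwtn transforms along every axis; each key marks approximation ('a') or
# detail ('d') per axis, e.g. 'aaa' = low-frequency content in time and space.
coeffs = pywt.dwtn(clip, wavelet="haar")
for band, arr in coeffs.items():
    print(band, arr.shape)  # eight sub-bands, each (8, 64, 64)

# The transform is invertible, so no information is lost by working in this space.
recon = pywt.idwtn(coeffs, wavelet="haar")
print(np.allclose(recon, clip, atol=1e-5))
```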

The technical approach centers on learning physics implicitly through observation rather than explicit simulation, making it particularly relevant for robotics and autonomous systems where traditional physics engines struggle with real-world complexity.

A breakdown of how they've trained this model:

Video Curation

The training dataset consists of 100M video clips extracted from 20M hours of raw footage through a five-stage data processing pipeline:

  1. Shot Detection: Videos are segmented using TransNetV2, which improves on traditional methods by handling complex transitions and motion blur. TransNetV2 uses a CNN-based architecture with temporal convolutions to detect shot boundaries with higher precision than frame-difference approaches.
  2. Quality Filtering: Clips are filtered using a multi-stage scoring system:
    1. Motion content scoring using optical flow magnitude, computed with Farneback optical flow (a minimal scoring sketch appears after this list)
    2. Visual quality assessment (blur detection, compression artifacts)
    3. Text overlay detection with OCR to eliminate tutorial/screencast content
    4. Dynamic range thresholding to remove static scenes
  3. Semantic Annotation: A Visual Language Model generates descriptive captions for each clip. This differs from standard image captioning by incorporating temporal context and physical interactions. They found VILA to be most effective.
  4. Deduplication: Near-duplicate detection uses perceptual hashing and temporal feature matching. The pipeline computes frame-level fingerprints and applies sequence alignment to identify similar clips, even when they differ in resolution or encoding. This stage also included k-means clustering using the RAPIDS cuVS framework.
  5. Storage Optimization: Clips are organized into webdatasets (TAR archives) optimized for streaming performance. The sharding strategy groups clips by technical characteristics:
    • Resolution buckets (e.g., 720p, 1080p)
    • Aspect ratio clustering
    • Temporal length binning

The resulting dataset achieves a 5:1 compression ratio from raw footage to curated clips while maintaining high information density per sample.
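The motion-content scoring step from the pipeline above can be illustrated with a small OpenCV sketch. The frame sampling, Farneback parameters, and threshold below are placeholder choices, not values from the Cosmos paper.

```python
# Score a clip by the average Farneback optical-flow magnitude between
# consecutive frames; low scores indicate static scenes worth discarding.
import cv2
import numpy as np

def motion_score(video_path: str, max_pairs: int = 64) -> float:
    cap = cv2.VideoCapture(video_path)
    ok, prev = cap.read()
    if not ok:
        return 0.0
    prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
    magnitudes = []
    for _ in range(max_pairs):
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        flow = cv2.calcOpticalFlowFarneback(
            prev_gray, gray, None,
            pyr_scale=0.5, levels=3, winsize=15,
            iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
        mag, _ = cv2.cartToPolar(flow[..., 0], flow[..., 1])
        magnitudes.append(float(mag.mean()))
        prev_gray = gray
    cap.release()
    return float(np.mean(magnitudes)) if magnitudes else 0.0

# Arbitrary threshold: drop clips with almost no motion.
if motion_score("clip_000001.mp4") < 0.5:
    print("discard: static scene")
```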

Tokenizer

The core innovation in Cosmos's architecture is its wavelet-based video tokenization system, which implements both continuous and discrete token representations in a unified framework.

  • Continuous and Discrete Tokenizers: Cosmos offers both continuous (vector-based) and discrete (integer-based) tokenizers to accommodate different types of models. Continuous tokens are suited to diffusion models, while discrete tokens are used for autoregressive models (a toy quantization sketch appears after the diagram below).
  • High Compression Rates: The tokenizers achieve high compression rates while preserving the visual information in the video, which improves training efficiency.
  • Causal and Temporally Length-Agnostic: The tokenizer processes video frames in temporal order, encoding current and past frames independently of future frames. It is also length-agnostic, so it can handle videos of variable duration.
[Diagram: input video → wavelet transform → continuous tokens (for diffusion models) and discrete tokens (for autoregressive models); high compression, causal architecture, temporally agnostic]
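A toy PyTorch sketch of the continuous-vs-discrete split: continuous tokens are just latent vectors, while discrete tokens come from snapping each vector to its nearest codebook entry. The codebook size and dimensions are arbitrary, and this is not the actual Cosmos tokenizer.

```python
# Continuous tokens: latent vectors consumed by diffusion WFMs.
# Discrete tokens: integer ids (nearest codebook entries) consumed by AR WFMs.
import torch

torch.manual_seed(0)
codebook = torch.randn(1024, 16)             # 1024 codes of dimension 16 (arbitrary)
continuous_tokens = torch.randn(8, 40, 16)   # (clips, tokens per clip, dim), toy data

flat = continuous_tokens.reshape(-1, 16)     # flatten for pairwise distances
dists = torch.cdist(flat, codebook)          # (320, 1024) Euclidean distances
discrete_tokens = dists.argmin(dim=-1).reshape(8, 40)  # integer token ids
quantized = codebook[discrete_tokens]        # vectors recovered from the ids

print(continuous_tokens.shape)  # torch.Size([8, 40, 16]) -> diffusion models
print(discrete_tokens.shape)    # torch.Size([8, 40])     -> autoregressive models
```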

Diffusion and Autoregressive WFMs

The Cosmos platform leverages two distinct approaches for building its World Foundation Models:

  • Diffusion WFMs: These models are trained to generate videos by iteratively denoising a Gaussian noise input. The model learns to reverse the process of adding noise to video, effectively creating realistic and physically plausible scenes. These models use continuous tokens.
  • Autoregressive WFMs: These models predict future frames by generating video tokens sequentially, similar to how Large Language Models (LLMs) generate text. They use discrete tokens and accept video clips of at most 8 seconds, returning embeddings; by contrast, Google's Vertex accepts clips of up to 120 seconds.

Both approaches have their strengths and weaknesses. Diffusion models currently produce higher visual quality and can be fine-tuned with a variety of control signals. Autoregressive models, while still evolving, can potentially leverage advancements from the LLM community.

[Diagram: diffusion WFM starts from Gaussian noise and iteratively denoises continuous (vector-based) tokens into generated video with high visual quality; autoregressive WFM takes input frames and predicts discrete (integer-based) tokens sequentially, LLM-like, with fast inference and a maximum clip length of 8 s]
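The two generation loops can be contrasted schematically. The stand-in functions below are placeholders for trained networks, and the step counts and vocabulary size are arbitrary; this only illustrates the control flow, not the actual Cosmos models.

```python
# Schematic contrast between the two WFM generation styles.
import torch

def denoiser(x, t):                 # placeholder for a trained diffusion network
    return 0.9 * x                  # pretend each step removes a little noise

def next_token_logits(tokens):      # placeholder for a trained AR transformer
    return torch.randn(8192)        # scores over a discrete token vocabulary

# Diffusion WFM: start from Gaussian noise, iteratively refine continuous tokens.
x = torch.randn(40, 16)             # (tokens, dim) continuous latents
for t in reversed(range(50)):       # arbitrary number of denoising steps
    x = denoiser(x, t)

# Autoregressive WFM: emit discrete tokens one at a time, like an LLM.
tokens = [0]                        # arbitrary start token id
for _ in range(40):
    tokens.append(int(next_token_logits(torch.tensor(tokens)).argmax()))
```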

Guardrails

Given the potential risks, the Cosmos platform includes a robust guardrail system.

  • Pre-Guard: This system blocks harmful inputs using a keyword blocklist and a content safety model (Aegis-AI-Content-Safety-LlamaGuard-LLM-Defensive-1.0).
  • Post-Guard: This filters the generated video content for safety, including a face blur filter (a toy pre/post-guard sketch follows this list).
  • Red Teaming: A dedicated team actively probes the system with adversarial examples to uncover weaknesses and refine the guardrails.
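As a toy illustration of the two guard stages, the sketch below checks prompts against a keyword blocklist and blurs detected faces in generated frames with OpenCV. Cosmos's actual Pre-Guard also runs the Aegis content-safety model, which is not reproduced here; the blocklist and detector choices are placeholders.

```python
# Toy pre/post-guard: keyword blocklist on the prompt, face blur on output frames.
import cv2

BLOCKLIST = {"example_banned_term"}  # placeholder terms

def pre_guard(prompt: str) -> bool:
    """Return True if the prompt passes the keyword check."""
    return not any(term in prompt.lower() for term in BLOCKLIST)

_face_detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def post_guard(frame):
    """Blur any detected faces in a generated frame (BGR numpy array)."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    for (x, y, w, h) in _face_detector.detectMultiScale(gray, 1.1, 5):
        face = frame[y:y + h, x:x + w]
        frame[y:y + h, x:x + w] = cv2.GaussianBlur(face, (51, 51), 0)
    return frame
```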

Physical AI

The pre-trained models and tokenizers are being released under an open-source license, with implementation available through the NeMo framework. Core use cases demonstrating practical applications:

  • Policy Evaluation: Assessing the quality of an AI policy by allowing it to interact with the WFM in a simulated environment, which is faster and less risky than real-world testing (see the sketch after this list).
  • Policy Initialization: Using a WFM to pre-initialize a policy model, which mitigates the challenges of data scarcity.
  • Planning and Control: Predicting future states based on different action sequences for better decision making by the Physical AI.
  • Synthetic Data Generation: Creating synthetic data for training and fine-tuning, particularly useful for bridging the gap between simulation and real-world implementation (Sim2Real).
  • Robotics, Autonomous Driving, and More: The models can be fine-tuned for specific applications like robotic manipulation, camera control and autonomous driving.
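
A schematic of the policy-evaluation loop: instead of acting in the real world, the policy is rolled out inside the world model. Every class and reward below is a toy stand-in; the real Cosmos/NeMo interfaces are not shown.

```python
# Roll a candidate policy out inside a world model and score the imagined rollout.
import random

class ToyWorldModel:
    def predict(self, state, action):    # the WFM imagines the next state/frames
        return state + action

class ToyPolicy:
    def act(self, state):                # the policy picks an action from the state
        return random.choice([-1, 1])

def evaluate_policy(policy, world_model, state=0, horizon=50):
    total_reward = 0.0
    for _ in range(horizon):
        action = policy.act(state)
        state = world_model.predict(state, action)
        total_reward += -abs(state)      # toy reward: stay near zero
    return total_reward

print(evaluate_policy(ToyPolicy(), ToyWorldModel()))
```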
About the author
Ethan Steininger

Probably outside.
