Quick Hits (TL;DR)
AI learns through action, not observation.
Traditional AI watches and classifies. Pelican-VL's DPPO training lets robots practice and self-correct, SIMA 2 maintains goals across gaming sessions, and Holo2 navigates interfaces conceptually. All three learn by doing.
Vision AI groups by meaning, not appearance.
DeepMind's odd-one-out method stops AI from grouping bananas with yellow cars. OmniVinci fuses vision, audio, and language in one space with 6x less training data. Both understand what things are, not just how they look.
Single images generate complete 3D worlds.
Fei-Fei Li's World Labs Marble creates walkable environments from one photo. Depth Anything 3 extracts depth from any 2D image. PAN simulates nested physical interactions. Every image can now train spatial AI.
Research Highlights
UniVA: Universal Video Agent
UniVA works like LEGO for video AI: you plug in whatever tools you need. The demo shows it tracking objects, editing footage, and understanding complex scenes all in one system.
Phys2Real: Sim-to-Real Transfer
This method trains robots in simulation then transfers that knowledge to the real world by accounting for real-world messiness. The robot learns what it doesn't know and adapts accordingly.
Links: Project Page | Paper | Twitter
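The "learns what it doesn't know" idea can be sketched with an ensemble trick: train several dynamics models in simulation, and treat their disagreement on a proposed action as an uncertainty signal. This is a toy illustration of that principle, not Phys2Real's actual method; the linear "models" and the `act` helper are invented for the example.

```python
import numpy as np

# Stand-in for an ensemble of dynamics models trained in simulation:
# each "model" is a slightly different linear predictor of the next state.
ensemble = [lambda s, a, w=w: w * s + a for w in (0.9, 1.0, 1.1)]

def act(state, bold_action, safe_action, threshold=0.05):
    """Pick bold_action unless the ensemble disagrees about its outcome."""
    preds = np.array([m(state, bold_action) for m in ensemble])
    uncertainty = preds.std()  # disagreement = "what the robot doesn't know"
    return safe_action if uncertainty > threshold else bold_action

# Near state 0 the models agree, so the robot acts boldly; far from it,
# disagreement grows and it falls back to the conservative action.
print(act(state=0.0, bold_action=1.0, safe_action=0.1))  # → 1.0
print(act(state=5.0, bold_action=1.0, safe_action=0.1))  # → 0.1
```

Real sim-to-real systems use learned neural dynamics models and richer uncertainty estimates, but the adapt-where-uncertain logic is the same.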
Pelican-VL 1.0: The Embodied Intelligence Brain
Beijing's Pelican-VL converts what robots see directly into 3D movement commands. Their DPPO training method works like human practice: make mistakes, reflect, improve.
Links: Project Page | Paper | GitHub | Hugging Face
OmniVinci: Omni-Modal Understanding LLM
NVIDIA's OmniVinci processes vision, audio, and language in one unified space. It beats Qwen2.5-Omni by 19% while using 6x less training data.

Links: Project Page | Paper | Model
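The "one unified space" idea means each modality gets its own encoder whose output lands in a shared embedding dimension, so any pair of modalities can be compared directly. Here is a minimal sketch of that pattern with random linear projections standing in for learned encoders; the sizes and the `embed` helper are invented for illustration, not OmniVinci's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # shared embedding dimension

# Stand-in encoders: random projections from each modality's native
# feature size into one shared space (real models learn these jointly).
proj = {"vision": rng.normal(size=(16, D)),
        "audio":  rng.normal(size=(12, D)),
        "text":   rng.normal(size=(10, D))}

def embed(modality, features):
    z = features @ proj[modality]
    return z / np.linalg.norm(z)  # unit-normalise for cosine similarity

v = embed("vision", rng.normal(size=16))
a = embed("audio",  rng.normal(size=12))
t = embed("text",   rng.normal(size=10))

# Once everything lives in one space, any pair can be compared directly.
print(float(v @ a), float(v @ t))
```

Training aligns these projections so that, say, the sound of a dog barking and a photo of a dog end up close together; the random projections here only show the plumbing.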
Teaching AI to See the World More Like We Do
DeepMind used an "odd-one-out" test to show how differently AI sees things compared to humans. Their three-step alignment method fixes this, making AI group concepts the way you naturally would.
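The odd-one-out test itself is simple to state in code: given three items, pick the one least similar to the other two, then check whether the model's choice matches the human's. This toy version uses hand-made 3-d vectors and cosine similarity; the vectors and the banana/car framing are illustrative, not DeepMind's actual embeddings.

```python
import numpy as np

def odd_one_out(embeddings):
    """Return the index of the item least similar to the other two.

    embeddings: (3, d) array of item vectors. The odd one out is the
    item with the lowest summed cosine similarity to its companions.
    """
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = e @ e.T               # pairwise cosine similarities
    np.fill_diagonal(sim, 0.0)  # ignore self-similarity
    return int(np.argmin(sim.sum(axis=1)))

# Toy vectors: two "fruit-like" items and one "vehicle-like" item.
# A colour-driven model might group banana with yellow_car instead.
banana     = np.array([0.9, 0.1, 0.0])
apple      = np.array([0.8, 0.2, 0.1])
yellow_car = np.array([0.1, 0.1, 0.95])

print(odd_one_out(np.stack([banana, apple, yellow_car])))  # → 2
```

Alignment, in this framing, means adjusting the embedding space until the model's odd-one-out choices agree with human ones.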
RF-DETR: First real-time segmentation model to beat YOLO models.
Links: Paper | GitHub | Hugging Face
Meta Omnilingual ASR: Speech recognition for 1,600+ languages in one model. Links: Blog Post | GitHub | Twitter
The Value of Personalized Recommendations: Netflix data shows how recommendation algorithms actually work. Links: Paper
Tools, Models and Techniques
SIMA 2
Google's SIMA 2 plays games with you, learns through trial and error, and actually reasons about what to do. Talk to it through text, voice, or images; it understands high-level goals and figures out how to achieve them.
Why it matters: Your next gaming buddy will be an AI that actually understands the game.
Links: Blog Post | Twitter
Depth Anything 3 (DA3)
DA3 generates depth maps from regular images with unprecedented accuracy. The demo shows it working on everything from selfies to satellite imagery.
Why it matters: Every 2D image can now become 3D data for your applications.
Links: Project Page | Paper | GitHub | Hugging Face | Twitter
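Turning a predicted depth map into usable 3D data is a one-step back-projection with the standard pinhole camera model. This sketch is generic geometry, not DA3's API; the focal lengths and principal point here are made-up placeholder intrinsics.

```python
import numpy as np

def depth_to_points(depth, fx, fy, cx, cy):
    """Back-project a depth map (H, W) into 3D points with a pinhole model.

    fx, fy: focal lengths in pixels; cx, cy: principal point.
    Returns an (H*W, 3) array of (X, Y, Z) camera-frame coordinates.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)

# A flat 2x2 depth map, every pixel one metre from the camera.
pts = depth_to_points(np.ones((2, 2)), fx=1.0, fy=1.0, cx=0.5, cy=0.5)
print(pts.shape)  # → (4, 3)
```

Feed in a depth map from any monocular depth model plus your camera's real intrinsics and the output is a point cloud ready for meshing or SLAM.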
Marble
World Labs' Marble creates persistent 3D worlds from a single image, video, or text prompt. Upload a photo of your living room, get a walkable 3D space.
Why it matters: 3D content creation just became as simple as taking a photo.
Links: Website | Blog Post | Twitter
Holo2
H-Company's Holo2 tops computer-use benchmarks across web, desktop, and mobile. Drop it into your existing Holo setup; it works immediately on Ubuntu, Android, or Chrome.
Links: Blog Post | GitHub | Hugging Face
Music Flamingo
NVIDIA's Music Flamingo understands full songs, not just clips. It analyzes music structure, identifies instruments, and reasons about compositions.
Why it matters: AI finally understands music the way musicians do.
Links: Project Page | Paper | Hugging Face | Demo
PAN: General world model simulates physical, agentic, and nested worlds. Links: Demo | Twitter
ERNIE-4.5-VL-28B-A3B-Thinking: Baidu's natively omni-modal foundation model. Links: Hugging Face | Demo | Twitter
Llama-Embed-Nemotron-8B: NVIDIA's universal text embedding for 100+ languages. Links: Paper | Hugging Face
DeepEyesV2: Multimodal agent that writes and runs code while searching the web. Links: Project Page | Paper | Hugging Face
Maya1: Create any voice from text. Links: Demo
Trends & Predictions
The Perception-to-Action Gap Closes
This week shows three distinct approaches to the same problem: how do you get AI to actually do things, not just understand them?
Pelican-VL tackles this for robotics with its DPPO training method: the model practices tasks, fails, analyzes what went wrong, then adjusts. Think of it like teaching a robot to play piano: it doesn't just memorize finger positions; it learns the relationship between what it sees and how to move. The Beijing team tested this on real humanoid robots doing manipulation tasks, and the results show genuine spatial reasoning emerging from visual input alone.
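The practice-fail-reflect-adjust loop above can be sketched in a few lines. This is a deliberately minimal caricature, not the DPPO algorithm: a "robot" tunes one motor gain by acting, measuring how wrong the attempt was, and nudging the gain against the error. The `practice` function and the linear outcome model are invented for the example.

```python
# A toy "practice, fail, reflect, adjust" loop in the spirit of
# self-correcting training (not the actual DPPO algorithm).
def practice(target=0.8, gain=0.0, lr=0.5, rounds=20):
    for _ in range(rounds):
        outcome = gain            # act: simplistic world where outcome = gain
        error = outcome - target  # reflect: how wrong was the attempt?
        gain -= lr * error        # adjust: move against the error
    return gain

print(round(practice(), 3))  # → 0.8 (converges toward the target)
```

The real method operates on high-dimensional visuomotor policies rather than a single scalar, but the loop structure (act, evaluate, correct, repeat) is the shared idea.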
SIMA 2 solves this in virtual environments. Google's agent doesn't just execute commands; it maintains persistent goals across gaming sessions, reasons about cause and effect, and learns new skills without being explicitly programmed. When you tell it "build a house," it figures out it needs to gather materials first, find a good location, and plan the structure. This kind of multi-step reasoning with environmental feedback is new.
Holo2 brings this to computer interfaces. It's not using predefined clicking patterns or UI maps. The model understands interface elements conceptually: it knows what a button does, not just where it is. H-Company's benchmarks show it handling complex workflows across different operating systems without specific training for each one.
What connects these three? They're all moving beyond the traditional pipeline of "perceive → classify → decide → act" toward integrated systems where perception and action inform each other continuously. The models learn by doing, not just by observing. This feedback loop between action and understanding is what makes these systems actually useful in unpredictable real-world scenarios.
The technical breakthrough here is handling uncertainty through interaction. Instead of needing perfect understanding before acting, these systems act to improve their understanding. That's fundamentally different from how we've built AI systems until now.
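"Acting to improve understanding" is the classic exploration idea from bandit problems: an agent that tries its options a few times learns which is better, instead of needing a perfect model up front. This is a generic illustration of the principle; the `explore_then_commit` function and its numbers are invented for the sketch.

```python
import random

random.seed(0)

def explore_then_commit(true_means, trials_each=50):
    """Try each option repeatedly, then commit to the best-looking one."""
    estimates = []
    for mean in true_means:
        # Each trial is a noisy observation; understanding improves with action.
        samples = [random.gauss(mean, 0.1) for _ in range(trials_each)]
        estimates.append(sum(samples) / trials_each)
    return estimates.index(max(estimates))

print(explore_then_commit([0.2, 0.9]))  # → 1 (the better option)
```

Robot manipulation, game agents, and computer-use models face far richer versions of the same trade-off, but the core move is identical: spend some actions on reducing uncertainty, then exploit what you learned.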
Community + Shoutouts
dLLM
Zhanhui Zhou turned BERT into a chatbot using diffusion. Yes, you read that right: BERT can now chat.
Links: GitHub | Report | Hugging Face
Next Scene LoRA
OdinLovis built a LoRA that adds camera movement to image generation. Type "Next Scene" and watch your static image become a cinematic sequence.
Links: Hugging Face
That's a wrap for Multimodal Monday #33! From robots that understand space with Pelican-VL, to AI that sees concepts like humans via DeepMind, to instant 3D worlds through Marble, this week redefined what multimodal means. Add Meta's 1,600-language ASR and NVIDIA's Music Flamingo understanding full songs, and you're looking at AI systems that perceive, reason, and act across every modality.