🧠 Multimodal Monday #1 - State of the Stack

This week in multimodal AI:

  • Researchers introducing new methods to replace embeddings with discrete IDs for faster cross-modal search
  • New video retrieval systems showing improvement by integrating audio with visual cues
  • Tutorials to build Multimodal RAG applications
  • Multimodal use cases expanding across healthcare and e-commerce
  • Cool demos showcasing visual reasoning and multimodal agents

🧠 Research Highlights

# Multimodal Retrieval with BGE-VL
# Demonstrates how to use BGE-VL for image + text queries

import torch
from transformers import AutoModel

# Load the model - choose from base or large versions
model = AutoModel.from_pretrained("BAAI/BGE-VL-base", trust_remote_code=True)
model.set_processor("BAAI/BGE-VL-base")  # Initialize the processor
model.eval()  # Set to evaluation mode

# Example: Combined image + text query
# The power of multimodal retrieval is combining both modalities
with torch.no_grad():
    # Encode a query using both image and text instruction
    query_embedding = model.encode(
        images="./product_image.jpg", 
        text="Find this in blue color with leather material"
    )
    
    # Encode candidate images from your database
    candidate_embeddings = model.encode(
        images=["./candidate1.jpg", "./candidate2.jpg", "./candidate3.jpg"]
    )
    
    # Calculate similarity scores
    similarity_scores = query_embedding @ candidate_embeddings.T
    
    # Get the most similar item
    best_match_idx = torch.argmax(similarity_scores).item()
    print(f"Best match: candidate{best_match_idx+1}.jpg with score {similarity_scores[0][best_match_idx]:.4f}")

🛠️ Tools & Techniques

  • RAP (Retrieval-Augmented Personalization) – A new open-source library (with a CVPR’25 paper) that lets you inject personalized knowledge into a multimodal LLM via retrieval (GitHub - Hoar012/RAP-MLLM: [CVPR 2025] RAP: Retrieval-Augmented Personalization). How you might use it: have a vision-language model “remember” custom concepts (e.g. who your family members are in photos) by feeding it a private image-text database, enabling personalized Q&A or content generation without retraining.
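
The personalization flow described for RAP (store private concepts, retrieve the closest ones for a query, and hand them to the model as context) can be sketched in a few lines. This is a minimal illustration, not RAP's actual API: the `Concept` schema, the `retrieve` and `build_prompt` helpers, and the hand-written 3-dimensional embeddings are all assumptions made for the example. In practice the embeddings would come from an image or text encoder, and the prompt would be sent to a multimodal LLM.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Concept:
    """One entry in the private database (hypothetical schema)."""
    name: str
    description: str
    embedding: List[float]  # assumed to come from an image/text encoder


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_embedding, database, k=2):
    """Return the k concepts most similar to the query embedding."""
    ranked = sorted(
        database,
        key=lambda c: cosine(query_embedding, c.embedding),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question, concepts):
    """Prepend the retrieved concept notes to the user's question."""
    notes = "\n".join(f"- {c.name}: {c.description}" for c in concepts)
    return f"Known personal concepts:\n{notes}\n\nQuestion: {question}"


# Toy private database with made-up embeddings.
database = [
    Concept("Anna", "my sister, usually wears red glasses", [0.9, 0.1, 0.0]),
    Concept("Bo", "our family dog, a brown labrador", [0.1, 0.9, 0.1]),
    Concept("Cabin", "the lake house we visit in summer", [0.0, 0.2, 0.9]),
]

# Stand-in for the embedding of an uploaded photo.
query_embedding = [0.8, 0.2, 0.1]
top = retrieve(query_embedding, database, k=1)
prompt = build_prompt("Who is in this photo?", top)
print(prompt)
```

Because the personal knowledge lives in the database rather than in the model weights, adding or removing a concept only means editing `database`; nothing is retrained.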

🏗️ Real-World Applications

  • “All-in-One” models: There’s a clear movement toward unified models that handle many modalities simultaneously. The debut of Qwen2.5-Omni (vision + audio + text in one) and other efforts (e.g. Google’s Gemini vision-language upgrades) suggest that future AI stacks will favor integrated multimodal understanding over siloed models, all while keeping model size manageable (7B parameters in Qwen’s case, optimized for edge deployment) (Alibaba Cloud Releases Qwen2.5-Omni-7B An End-to-end Multimodal AI Model - Alibaba Cloud Community).
  • Retrieval-augmented multimodality: Many updates this week highlight retrieval as a crucial component of multimodal systems. From using external knowledge to ground image answers, to personalization via retrieved user data (RAP), to generative retrieval replacing indexes – combining search with multimodal models is becoming a standard strategy to boost accuracy and controllability.
  • Efficiency and scalability: A growing theme is making multimodal models and searches more efficient. Approaches like GENIUS avoid costly nearest-neighbor search by generating IDs on the fly (GENIUS: A Generative Framework for Universal Multimodal Search), and new architectures (e.g. Qwen’s blockwise encoders and Thinker-Talker design) enable streaming inputs/outputs without giant compute overhead ([2503.20215] Qwen2.5-Omni Technical Report). We foresee a push toward resource-friendly multimodal AI that can run in real-time and at scale.
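
To make the "discrete IDs instead of nearest-neighbor search" idea concrete, here is a toy sketch. It is not the GENIUS method: GENIUS uses a trained model to generate the ID for a query, whereas this example swaps in plain residual quantization with hand-written codebooks. The codebooks, file names, and 2-dimensional embeddings are all hypothetical. The point it illustrates is that once every item has a short discrete code, lookup becomes a dictionary hit rather than a distance scan over the whole collection.

```python
def nearest(vec, codebook):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((v - c) ** 2 for v, c in zip(vec, codebook[i])),
    )


def to_discrete_id(vec, codebooks):
    """Residual quantization: each level encodes what the previous level left over."""
    residual = list(vec)
    code = []
    for codebook in codebooks:
        idx = nearest(residual, codebook)
        code.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tuple(code)


# Hypothetical two-level codebooks (a real system would learn these).
codebooks = [
    [[1.0, 0.0], [0.0, 1.0]],                              # coarse level
    [[0.1, 0.0], [-0.1, 0.0], [0.0, 0.1], [0.0, -0.1]],    # fine level
]

# Made-up item embeddings, standing in for encoder outputs.
items = {
    "red_shoe.jpg": [1.1, 0.0],
    "blue_shoe.jpg": [0.9, 0.0],
    "green_hat.jpg": [0.0, 1.1],
}

# Build the index once: discrete ID -> list of items sharing that ID.
index = {}
for name, embedding in items.items():
    index.setdefault(to_discrete_id(embedding, codebooks), []).append(name)

# At query time, compute the query's ID and look it up directly.
query = [1.08, 0.01]
query_id = to_discrete_id(query, codebooks)
matches = index.get(query_id, [])
print(query_id, matches)
```

The cost of a query here depends on the codebook sizes, not on the number of indexed items, which is the efficiency argument behind ID-based retrieval. The trade-off is that a query landing in a different code than its true neighbor is simply missed, so the quality of the codes matters a great deal.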

🧩 Community + Shoutouts

  • QVQ-Max visual reasoning demo: The Alibaba AI team unveiled QVQ-Max, a visual reasoning module for their Qwen chat that lets users upload an image or video and then see the model’s step-by-step “thinking” process as it answers a question (Multimodal: AI News Week Ending 03/28/2025 - Ethan B. Holland). This peek under the hood of a multimodal LLM got the community excited about new ways to interpret and trust AI’s visual answers.
  • Together Chat (multimodal agent): Startup Together released a free web demo that combines several open-source models to handle diverse tasks. Dubbed Together Chat, it can perform web search, code writing, image generation, and even image analysis in one interface (Multimodal: AI News Week Ending 03/28/2025 - Ethan B. Holland). It showcases the power of connecting multimodal tools (like a language model + vision model) in a free product, and reflects the community’s push toward open, all-in-one AI assistants.
About the author
Ethan Steininger

Probably outside.

Multimodal Makers | Mixpeek

Learn best practices, reference architectures and follow example tutorials to build multimodal AI applications
