Mixpeek Engineering Blog

The 3072-Dimension Problem

A 3072-dimensional embedding encodes everything about a video and distinguishes nothing. Decomposing content into named, measurable features, then placing them in a queryable hierarchy, is how multimodal search actually works at scale.

Architecture, Multimodal

Why Vector Search Alone Can't Find What's in Your Videos

Text-only RAG pipelines miss 80% of what is in your content. A video contains faces, dialogue, on-screen text, background music, and brand logos. No single embedding captures all of that. The solution is multi-stage retrieval.

Multimodal AI, Vector Search, Retrieval

Multimodal Taxonomies: How to Classify Video, Images, Audio, and Text with One Category System

Traditional taxonomies classify one content type at a time. Multimodal taxonomies unify classification across every format using embedding similarity, the missing layer between raw AI features and structured, searchable metadata.

Multimodal AI, Taxonomies, Content Classification

Object Storage Comparison 2026: 21 Providers, Real Pricing, and the Gotchas Nobody Tells You

We compared 21 S3-compatible object storage providers across pricing, egress, features, and fine print. AWS S3 costs 15x more than the cheapest alternative for the same workload. Here's everything we found.

Object Storage, Cloud Infrastructure, S3

Building a Kalshi Trading Bot with Semantic Search and LLM Extraction

How we built an autonomous Kalshi trading bot using the Kalshi API and Mixpeek's video transcription, semantic search, and LLM data extraction, with no external tools required.

Engineering, Use Cases, LLM

The Multimodal Data Warehouse: Why Unstructured Data Needs Its Own Snowflake

We are drowning in unstructured data — video, audio, images, documents, IoT — but our infrastructure still assumes everything is a row or a vector. The multimodal data warehouse is the missing layer: object decomposition, tiered storage, and multi-stage retrieval pipelines for the AI era.

Multimodal AI, Data Warehouse, Architecture

ColQwen2 + MUVERA: Multimodal Late Interaction Retrieval That Actually Scales

We benchmarked every viable approach to multimodal document retrieval on financial tables (ViDoRe/TabFQuAD) and found a combination that hasn't...

Engineering, Research, Multimodal

We Built a Pre-Publication IP Clearance Pipeline. Here's What We Learned.

Every major IP enforcement tool finds violations after they're live. We built one that catches them before publication. Here's the architecture, the models, and what we learned.

Engineering, IP Safety, Computer Vision

I Benchmarked 5 Video Embedding Models So You Don't Have To

We tested Gemini, Twelve Labs Marengo, X-CLIP, SigLIP 2, and InternVideo2 on text-to-video retrieval with graded relevance. The results surprised us.

Benchmark, Video Embeddings, Multi-Modal AI

Gemini Embedding 2 is Live: embed multiple files into one vector

Google's Gemini Embedding 2 embeds images, PDFs, and text together in a single API call. Here's how we integrated it into Mixpeek's feature extractor pipeline, the production numbers, and where multi-file embedding beats single-chunk approaches.

Engineering, Embeddings, Gemini

Query Preprocessing: Semantic Search With Large Files

How we built query preprocessing into Mixpeek's feature_search stage: decompose a 500MB video into chunks, embed them in parallel, and fuse the results. Zero API surface change for callers.

Engineering, Vector Search, Retrieval

AI Video Analysis for Sports: Build Automated Highlight Reels, Archive Search, and Performance Analytics

Sports broadcasters cut 4-8 hour editing sessions to 15 minutes using AI video analysis. Learn how to build automated highlight detection, archive search, and performance analytics pipelines for any sport.

Video, Video AI, Sports

How Mixpeek runs distributed multimodal ML on Ray: architecture, patterns, and production lessons

We run 20+ ML models in parallel across video, image, and document pipelines. Here's the Ray architecture behind it: custom resource isolation, flexible actor pools, distributed Qdrant writes, and the lessons we learned the hard way.

Engineering, Infrastructure, Machine Learning

Political Ad Disclaimers: How ZIP+4 Targeting Creates Jurisdiction Conflicts

6,000+ ZIP codes straddle congressional district lines. At ZIP+4 precision, federal, state, and local disclaimer requirements can all apply simultaneously. Here's how multimodal AI solves what static rules engines can't.

Advertising, AdTech, Political Advertising

Multimodal Monday #45: Birds, Whales, and the End of Latency

Your Weekly Multimodal AI Roundup (Feb 9 - Feb 16)

Multimodal Monday

IAB Contextual Classifier: Taxonomies for Videos and Images

Classify text, images, and video into 700+ IAB Content Taxonomy categories using multimodal AI. Learn how it works under the hood and how to extend it for your contextual targeting needs.

Industry, Tutorials

Semantic Crons: Replace LLM Polling with Vector-Based Alerts

Instead of polling with an LLM on a cron schedule, Retriever Alerts evaluate semantic conditions at ingestion time. Vector math instead of inference calls. Event-driven instead of scheduled. Three API calls to set up.

Infrastructure, Search

Why SNF Documentation Is a Multimodal AI Problem

Nurses spend 40% of their time on documentation. MDS coordinators abstract charts for 3-4 hours per assessment. PDPM revenue goes uncaptured. The root cause is that clinical documentation is inherently multimodal — and most tools only handle text.

Healthcare, Research, hidden

Semantic Chunking Strategies for Multimodal RAG

How semantic chunking improves RAG quality by splitting content at natural boundaries rather than fixed token counts. Covers text, documents, video, and audio.

Comparisons, RAG, Tutorials

What is Agentic Retrieval? The Next Evolution of RAG

How agentic retrieval goes beyond traditional RAG by letting AI agents dynamically plan and execute multi-step search strategies with tool calling.

Comparisons, Research, RAG

Video Intelligence: From Raw Footage to Searchable Data

How AI-powered video intelligence extracts structured, searchable information from raw footage — covering scene detection, transcription, face recognition, and temporal indexing.

Comparisons, Video, Research

Keyword Search vs Semantic Search vs Hybrid Search: A Developer's Guide

A clear comparison of keyword, semantic, and hybrid search with practical guidance on when to use each approach in production systems.

Comparisons, Search

How to Build a Multimodal Search Engine in 2025

A practical guide to building search that works across text, images, video, and audio using shared embedding spaces and retrieval pipelines.

Comparisons, Tutorials, Research

Multimodal Monday #44: Agents Ship Code, Robots Skateboard

Feb 2 - 9: GPT-5.3-Codex and Claude Opus 4.6 handle full software lifecycles from debug to deploy, MiniCPM-o 4.5 beats GPT-4o on vision tasks at 9B parameters running on-device, HUSKY skateboards using physics-based control, and TinyLoRA fine-tunes models with a single parameter.

Multimodal Monday