<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/"><channel><title><![CDATA[Mixpeek Engineering Blog]]></title><description><![CDATA[Deep dive into multimodal AI, data processing, and best practices from our engineering team.]]></description><link>http://blog.mixpeek.com/</link><image><url>http://blog.mixpeek.com/favicon.png</url><title>Mixpeek Engineering Blog</title><link>http://blog.mixpeek.com/</link></image><generator>Ghost 5.82</generator><lastBuildDate>Wed, 06 May 2026 11:41:28 GMT</lastBuildDate><atom:link href="http://blog.mixpeek.com/rss/" rel="self" type="application/rss+xml"/><ttl>60</ttl><item><title><![CDATA[The 3072-Dimension Problem]]></title><description><![CDATA[A 3072-dimensional embedding encodes everything about a video and distinguishes nothing. Decomposing content into named, measurable features, then placing them in a queryable hierarchy, is how multimodal search actually works at scale.]]></description><link>http://blog.mixpeek.com/the-3072-dimension-problem/</link><guid isPermaLink="false">69f20c72c76853422347688b</guid><category><![CDATA[Architecture]]></category><category><![CDATA[Multimodal]]></category><dc:creator><![CDATA[Ethan Steininger]]></dc:creator><pubDate>Wed, 29 Apr 2026 14:23:45 GMT</pubDate><media:content url="http://blog.mixpeek.com/content/images/2026/04/ChatGPT-Image-Apr-29--2026--10_12_20-AM.png" medium="image"/><content:encoded><![CDATA[<img src="http://blog.mixpeek.com/content/images/2026/04/ChatGPT-Image-Apr-29--2026--10_12_20-AM.png" alt="The 3072-Dimension Problem"><p>You embedded your video library with Gemini. Or <a href="https://mixpeek.com/model/openai/clip-vit-large-patch14?ref=blog.mixpeek.com">CLIP</a>, or <a href="https://mixpeek.com/model/google/siglip2-giant-opt-patch16-384?ref=blog.mixpeek.com">SigLIP</a>, or whatever the model of the month was when the project started. You stored a few million vectors in Pinecone or Qdrant or Weaviate. You wired up <a href="https://mixpeek.com/glossary?ref=blog.mixpeek.com">cosine similarity</a>. You ran your first query.</p><p>The results were fine. Not great. Fine.</p><p>A search for &quot;person holding a coffee cup&quot; returned videos with people, and videos with cups, and a surprising number of videos with neither. Reranking helped a little. <a href="https://mixpeek.com/hybrid-search?ref=blog.mixpeek.com">Hybrid BM25</a> helped a little more. But somewhere around the fourth or fifth round of tuning, you started to suspect that the problem wasn&apos;t your retriever, or your reranker, or your prompt. The problem was that a 3072-dimensional embedding is not a useful unit of work.</p><p>It encodes everything. The lighting, the camera angle, the dominant color, the demographic of the person on screen, the room they&apos;re in, the <em>vibe</em> of the room they&apos;re in. All of it, smeared across three thousand floats. Cosine similarity treats every dimension equally. Your application does not.</p><p>Most teams hit that wall about six weeks into a <a href="https://mixpeek.com/multimodal-search?ref=blog.mixpeek.com">multimodal project</a>. 
A better vector database won&apos;t get you past it.</p><hr><h2 id="the-reframe">The reframe</h2><p>A data science lead we work with (thirty years in the industry, ran one of the first audio fingerprinting startups in the early 2000s) described our job to me recently in a way I haven&apos;t stopped thinking about:</p><blockquote>Reduce the number of dimensions in a video to a handful of measurable features, then place every new video into a hierarchical structure defined by those features.</blockquote><p>He called the output a &quot;compact fingerprint.&quot;</p><p>When I shared that framing with an engineer at AWS, he pushed back immediately. Every kind of <a href="https://mixpeek.com/embeddings?ref=blog.mixpeek.com">representation learning</a> does that, he said. Find small, interpretable representations of complex things. He was right. The idea is old. What&apos;s rare is running it as managed infrastructure, continuously, across millions of objects, with the reduced features exposed as first-class queryable surfaces.</p><p>The job of multimodal infrastructure is not to be a faster <a href="https://mixpeek.com/vector-search?ref=blog.mixpeek.com">vector database</a>. It&apos;s the systems layer that takes you from <em>I have embeddings</em> to <em>I have a usable application.</em> That layer has a specific shape, and most of the work is in the shape.</p><hr><h2 id="how-decomposition-works-in-practice">How decomposition works in practice</h2><p>The shape, in our system, is six primitives that compose in one direction.</p>
<!--kg-card-begin: html-->
<div style="text-align:center;margin:2em 0">
<svg viewbox="0 0 720 320" xmlns="http://www.w3.org/2000/svg" style="max-width:100%;font-family:-apple-system,BlinkMacSystemFont,sans-serif">
  <defs>
    <marker id="arrow" viewbox="0 0 10 10" refx="9" refy="5" markerwidth="6" markerheight="6" orient="auto-start-reverse">
      <path d="M 0 0 L 10 5 L 0 10 z" fill="#666"/>
    </marker>
  </defs>
  <g font-size="13" text-anchor="middle">
    <rect x="20" y="40" width="120" height="56" rx="8" fill="#e8f4f8" stroke="#5ba4cf" stroke-width="1.5"/>
    <text x="80" y="64" font-weight="600" fill="#1a3a4a">Bucket</text>
    <text x="80" y="82" font-size="10" fill="#5a7a8a">raw objects land</text>
    <rect x="220" y="40" width="120" height="56" rx="8" fill="#e8f4f8" stroke="#5ba4cf" stroke-width="1.5"/>
    <text x="280" y="64" font-weight="600" fill="#1a3a4a">Extractor</text>
    <text x="280" y="82" font-size="10" fill="#5a7a8a">decompose features</text>
    <rect x="420" y="40" width="120" height="56" rx="8" fill="#e8f4f8" stroke="#5ba4cf" stroke-width="1.5"/>
    <text x="480" y="64" font-weight="600" fill="#1a3a4a">Collection</text>
    <text x="480" y="82" font-size="10" fill="#5a7a8a">queryable surfaces</text>
    <rect x="20" y="200" width="120" height="56" rx="8" fill="#f0e8f8" stroke="#9b6bc1" stroke-width="1.5"/>
    <text x="80" y="224" font-weight="600" fill="#3a1a4a">Retriever</text>
    <text x="80" y="242" font-size="10" fill="#6a5a7a">compose pipelines</text>
    <rect x="220" y="200" width="120" height="56" rx="8" fill="#f0e8f8" stroke="#9b6bc1" stroke-width="1.5"/>
    <text x="280" y="224" font-weight="600" fill="#3a1a4a">Taxonomy</text>
    <text x="280" y="242" font-size="10" fill="#6a5a7a">hierarchical placement</text>
    <rect x="420" y="200" width="120" height="56" rx="8" fill="#f0e8f8" stroke="#9b6bc1" stroke-width="1.5"/>
    <text x="480" y="224" font-weight="600" fill="#3a1a4a">Cluster</text>
    <text x="480" y="242" font-size="10" fill="#6a5a7a">emergent structure</text>
  </g>
  <line x1="140" y1="68" x2="218" y2="68" stroke="#666" stroke-width="1.5" marker-end="url(#arrow)"/>
  <line x1="340" y1="68" x2="418" y2="68" stroke="#666" stroke-width="1.5" marker-end="url(#arrow)"/>
  <path d="M 480 96 L 480 148 L 80 148 L 80 198" stroke="#666" stroke-width="1.5" fill="none" marker-end="url(#arrow)" stroke-dasharray="6,3"/>
  <line x1="140" y1="228" x2="218" y2="228" stroke="#666" stroke-width="1.5" marker-end="url(#arrow)"/>
  <line x1="340" y1="228" x2="418" y2="228" stroke="#666" stroke-width="1.5" marker-end="url(#arrow)"/>
  <path d="M 540 228 C 620 228, 640 148, 640 68 C 640 48, 560 48, 542 58" stroke="#9b6bc1" stroke-width="1.2" fill="none" marker-end="url(#arrow)" stroke-dasharray="4,3"/>
  <text x="660" y="152" font-size="10" fill="#9b6bc1" font-family="-apple-system,BlinkMacSystemFont,sans-serif" transform="rotate(-90,660,152)">feeds back</text>
  <text x="360" y="28" font-size="11" fill="#888" text-anchor="middle" font-family="-apple-system,BlinkMacSystemFont,sans-serif">reduction &amp; addressing</text>
  <text x="360" y="290" font-size="11" fill="#888" text-anchor="middle" font-family="-apple-system,BlinkMacSystemFont,sans-serif">composition &amp; structure</text>
</svg>
</div>
<!--kg-card-end: html-->
<p><strong>Buckets</strong> are where raw objects land. Videos, images, documents, audio. A bucket has a schema and an optional <a href="https://mixpeek.com/connectors?ref=blog.mixpeek.com">sync connection</a> (S3, GCS, Drive) and its only job is to be the entry point. Nothing has happened to the content yet.</p><p><strong>Feature extractors</strong> are the decomposition step, where the dimension reduction actually lives. A <a href="https://mixpeek.com/extractors?ref=blog.mixpeek.com">shot detector</a> turns a thirty-second video into eight scenes with timestamps. A <a href="https://mixpeek.com/model/deepinsight/retinaface-r50?ref=blog.mixpeek.com">face identity model</a> turns a frame into bounding boxes and 512-dim face embeddings. A <a href="https://mixpeek.com/model/google/siglip2-giant-opt-patch16-384?ref=blog.mixpeek.com">multimodal extractor</a> turns a scene into a Gemini embedding <em>plus</em> <a href="https://mixpeek.com/ocr?ref=blog.mixpeek.com">OCR output</a> <em>plus</em> dominant colors <em>plus</em> whatever else you configured. <a href="https://mixpeek.com/marketplace?ref=blog.mixpeek.com">Custom plugins</a> (Python code, optionally with model weights) let you do the same for proprietary file types or domain-specific features.</p><p><strong>Collections</strong> are where the reduced features live. Each collection is the output of one extractor, which means each has its own schema, its own embedding space, its own indices. A single bucket fans out into many collections: scenes, faces, OCR text, objects, audio segments. Instead of one giant vector per video, you get many small structured records, each measurable along a dimension you chose on purpose. That&apos;s the move that makes everything else work.</p><p><strong>Retrievers</strong> compose collections. A <a href="https://mixpeek.com/retrievers?ref=blog.mixpeek.com">retriever</a> is a multi-stage pipeline: <a href="https://mixpeek.com/semantic-search?ref=blog.mixpeek.com">feature search</a> on collection A, attribute filter on collection B, LLM filter on the merged set, reciprocal rank fusion at the end. Stages pass documents forward in a working set. If you&apos;ve written a MongoDB aggregation pipeline, the mental model transfers directly. Retrievers are pipelines because no real application wants a single similarity score. It wants similarity <em>plus</em> a metadata filter <em>plus</em> a rerank <em>plus</em> a join.</p><p><a href="https://mixpeek.com/taxonomies?ref=blog.mixpeek.com"><strong>Taxonomies</strong></a> are the hierarchy. A taxonomy is a semantic join between two collections, where the join operation is itself a retriever pipeline. The canonical example: collection A is faces extracted from a casting database (with names attached), collection B is faces extracted from new ad creative. The taxonomy says <em>for every face in B, find the nearest face in A above some threshold, and enrich B with the name.</em> Run that at ingest time and every new ad arrives pre-labeled with its talent.</p><p><a href="https://mixpeek.com/clusters?ref=blog.mixpeek.com"><strong>Clusters</strong></a> are the dual of taxonomies. Taxonomies impose structure top-down (you defined the casting database). Clusters discover structure bottom-up. Run <a href="https://mixpeek.com/blog/multimodal-taxonomies?ref=blog.mixpeek.com">HDBSCAN on the embeddings</a> in a collection, send the centroids to an LLM, and you get labeled groups you didn&apos;t have to specify in advance. 
The output is, of course, another collection, which can feed another retriever, which can populate another taxonomy.</p><p>Six primitives. They compose in one direction: raw object, decomposed feature, queryable surface, composed pipeline, hierarchical placement, emergent structure. The operation is <em>reduce to measurable features and make them addressable.</em></p><hr><h2 id="why-hierarchy-matters">Why hierarchy matters</h2><p>Reduction alone is not enough. If you stop after the extractor stage, you have a smarter vector database. Better features, sure, but still a system where every query starts from scratch.</p><p>The leverage is in the hierarchy. Once you have a taxonomy of brands, products, talent, scenes, moods (or whatever your domain calls for), every new piece of content gets <em>placed</em> against it at ingest time. The compact fingerprint is computed once. Its location in the hierarchy is computed once. After that, retrieval is mostly traversal.</p><p>That collapses a distinction most teams treat as fundamental: enrichment versus search.</p><ul><li><strong>Flat embedding world:</strong> enrichment is a separate batch job. Run a labeling pipeline, write labels back to the database, hope the labels stay fresh</li><li><strong>Decomposed-and-placed world:</strong> enrichment <em>is</em> the same operation as search, run in reverse</li></ul><p>The taxonomy that locates a new ad in your brand hierarchy is the same retriever pipeline a user would invoke to ask &quot;show me ads in this brand.&quot; One traversal, two directions.</p><p>Reducing dimensions is table stakes. Making the reduction a permanent, queryable, hierarchical structure is the part that compounds.</p><hr><h2 id="what-this-unlocks">What this unlocks</h2><p>Three things fall out of this architecture that are hard to build any other way.</p><h3 id="agentic-retrieval">Agentic retrieval</h3><p>When your features are decomposed into named collections with named extractors, you can hand them to an LLM as tools. The LLM doesn&apos;t get a single search endpoint. It gets a feature search tool, an attribute filter tool, an LLM filter tool, each with explicit input and output shapes. The agent composes <a href="https://mixpeek.com/agentic-rag?ref=blog.mixpeek.com">retriever stages</a> dynamically based on the task.</p><blockquote>Find ads featuring this actor that performed well in Q3 and use a similar color palette to this reference image</blockquote><p>That becomes a four-stage pipeline the agent assembles on its own. You can&apos;t do that against a flat vector store because there&apos;s nothing to compose.</p><h3 id="cross-collection-joins">Cross-collection joins</h3><p>The casting-database example is the simple form. The general form: any two collections with comparable feature spaces can be joined via a taxonomy, so you can enrich any feature with any other feature.</p><ul><li>Faces with names</li><li>Scenes with brands</li><li>Audio with transcripts</li><li>Products with categories</li></ul><p>The join is a retriever, the retriever is reusable, and the enriched output is itself a collection that feeds the next join.</p><h3 id="clustering-that-closes-the-loop">Clustering that closes the loop</h3><p>Run a clustering job on a collection, label the centroids with an LLM, and the labels become a taxonomy you didn&apos;t have to design. Apply that taxonomy to incoming content and every new object gets placed in a category that emerged from your data. 
The system bootstraps its own hierarchy.</p><p>The flywheel:</p><ol><li>More content produces better clusters</li><li>Better clusters produce better taxonomies</li><li>Better taxonomies produce better placement of the next batch</li></ol><hr><h2 id="the-conceptual-frame">The conceptual frame</h2><p>The point of decomposition isn&apos;t to do anything magical. It takes the operation that <a href="https://mixpeek.com/embeddings?ref=blog.mixpeek.com">representation learning</a> has always done (reduce complex things to small interpretable representations) and makes it a piece of infrastructure instead of a piece of research.</p><ul><li><a href="https://mixpeek.com/extractors?ref=blog.mixpeek.com">Extractors</a> instead of one-off model training</li><li>Collections instead of pickled embeddings</li><li><a href="https://mixpeek.com/retrievers?ref=blog.mixpeek.com">Retrievers</a> instead of bespoke search code</li><li><a href="https://mixpeek.com/taxonomies?ref=blog.mixpeek.com">Taxonomies</a> instead of manual labeling</li><li><a href="https://mixpeek.com/clusters?ref=blog.mixpeek.com">Clusters</a> instead of EDA notebooks</li></ul><p>A 3072-dimensional <a href="https://mixpeek.com/embeddings?ref=blog.mixpeek.com">embedding</a> is a starting point. The interesting work is in what you do after the embedding. If your stack stops there, you&apos;ll spend the next eighteen months reinventing the rest in application code. We know because we&apos;ve watched it happen.</p><p>If any of that resonates, the <a href="https://docs.mixpeek.com/?ref=blog.mixpeek.com">docs</a> walk through the primitives in the order they compose, with code.</p>]]></content:encoded></item><item><title><![CDATA[Why Vector Search Alone Can't Find What's in Your Videos]]></title><description><![CDATA[Text-only RAG pipelines miss 80% of what is in your content. A video contains faces, dialogue, on-screen text, background music, and brand logos. No single embedding captures all of that. The solution is multi-stage retrieval.]]></description><link>http://blog.mixpeek.com/multimodal-retrieval-beyond-vector-search/</link><guid isPermaLink="false">69ed2086c768534223476836</guid><category><![CDATA[Multimodal AI]]></category><category><![CDATA[Vector Search]]></category><category><![CDATA[Retrieval]]></category><category><![CDATA[Architecture]]></category><dc:creator><![CDATA[Ethan Steininger]]></dc:creator><pubDate>Sun, 26 Apr 2026 13:59:04 GMT</pubDate><media:content url="http://blog.mixpeek.com/content/images/2026/04/ChatGPT-Image-Apr-26--2026--09_58_30-AM-1.png" medium="image"/><content:encoded><![CDATA[<img src="http://blog.mixpeek.com/content/images/2026/04/ChatGPT-Image-Apr-26--2026--09_58_30-AM-1.png" alt="Why Vector Search Alone Can&apos;t Find What&apos;s in Your Videos"><p><strong>TL;DR:</strong> Text-only RAG pipelines miss 80% of what&apos;s in your content. A video contains faces, dialogue, on-screen text, background music, scene transitions, and brand logos. No single embedding captures all of that. The solution is <a href="https://mixpeek.com/blog/multi-stage-retrieval-pipelines?ref=blog.mixpeek.com">multi-stage retrieval</a>: extract multiple features per document, search each independently, then merge and rerank the results into one ranked list.</p><hr><h2 id="the-problem-everyone-ignores">The Problem Everyone Ignores</h2><p>Most retrieval systems work like this: take content, generate one embedding, store it in a vector database, run cosine similarity at query time. For text documents, this is fine. 
For everything else, it falls apart.</p><p>Consider a 30-second product video. It contains:</p><ul><li><strong>Visual content:</strong> product shots, lifestyle imagery, brand colors</li><li><strong>Spoken audio:</strong> a voiceover describing features and pricing</li><li><strong>On-screen text:</strong> &quot;50% off,&quot; a URL, a product name</li><li><strong>Music/tone:</strong> upbeat, corporate, dramatic</li><li><strong>Faces:</strong> a spokesperson, a customer testimonial</li></ul><p>A single CLIP embedding of a keyframe captures maybe the visual content. The dialogue, the on-screen text, the audio tone, the faces? Gone. Your &quot;semantic search&quot; just became a visual-only search that ignores most of the signal.</p><p>This is why teams building on video, audio, images, and documents keep hitting the same wall. They get 70% recall and plateau. The missing 30% is cross-modal context that a single embedding cannot represent.</p><h2 id="how-retrieval-actually-needs-to-work">How Retrieval Actually Needs to Work</h2><p>The fix is not a better embedding model. It is extracting multiple features per document and searching each one independently, then combining the results.</p><p>Think of it like a SQL query that JOINs across multiple indices. You would never store a customer&apos;s name, purchase history, and support tickets in a single column and expect one query to cover everything. Retrieval over rich media works the same way.</p>
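<p>As a rough sketch, here is what &quot;search each feature independently, then combine&quot; might look like expressed as retriever stages, using the stage shapes covered later in this post. The outer <code>stages</code> array and the <code>name</code> field on each search stage are illustrative assumptions, not the exact Mixpeek schema:</p><pre><code class="language-json">{
  &quot;stages&quot;: [
    {
      &quot;stage_type&quot;: &quot;search&quot;,
      &quot;name&quot;: &quot;visual_search&quot;,
      &quot;model_id&quot;: &quot;openai/clip-vit-large-patch14&quot;,
      &quot;query&quot;: { &quot;type&quot;: &quot;text&quot;, &quot;value&quot;: &quot;product demonstration&quot; },
      &quot;limit&quot;: 100
    },
    {
      &quot;stage_type&quot;: &quot;search&quot;,
      &quot;name&quot;: &quot;transcript_search&quot;,
      &quot;model_id&quot;: &quot;BAAI/bge-large-en-v1.5&quot;,
      &quot;query&quot;: { &quot;type&quot;: &quot;text&quot;, &quot;value&quot;: &quot;free shipping&quot; },
      &quot;limit&quot;: 100
    },
    {
      &quot;stage_type&quot;: &quot;merge&quot;,
      &quot;strategy&quot;: &quot;rrf&quot;,
      &quot;sources&quot;: [&quot;visual_search&quot;, &quot;transcript_search&quot;]
    }
  ]
}</code></pre><p>Two searches run against different feature indices, and a merge stage fuses their rankings. Everything that follows builds on this shape.</p>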
<!--kg-card-begin: html-->
<svg viewbox="0 0 840 460" xmlns="http://www.w3.org/2000/svg" style="width:100%;max-width:840px;margin:2em auto;display:block;font-family:system-ui,-apple-system,sans-serif;">
  <defs>
    <marker id="a1" viewbox="0 0 10 10" refx="9" refy="5" markerwidth="6" markerheight="6" orient="auto"><path d="M0 0L10 5L0 10z" fill="#475569"/></marker>
    <marker id="a2" viewbox="0 0 10 10" refx="9" refy="5" markerwidth="6" markerheight="6" orient="auto"><path d="M0 0L10 5L0 10z" fill="#fc5185"/></marker>
  </defs>
  <!-- Background -->
  <rect width="840" height="460" rx="16" fill="#0f172a"/>
  <!-- Title -->
  <text x="420" y="36" text-anchor="middle" font-size="11" font-weight="600" fill="#64748b" letter-spacing="0.12em">RETRIEVAL: SINGLE VECTOR vs MULTI-STAGE</text>
  <!-- Left Column -->
  <rect x="30" y="56" width="375" height="385" rx="12" fill="#1e293b"/>
  <text x="218" y="86" text-anchor="middle" font-size="13" font-weight="700" fill="#94a3b8">SINGLE VECTOR</text>
  <text x="218" y="104" text-anchor="middle" font-size="10" fill="#475569">1 tool &#xB7; 1 embedding &#xB7; no cross-modal</text>
  <!-- Left flow -->
  <rect x="138" y="124" width="160" height="38" rx="6" fill="#334155"/>
  <text x="218" y="148" text-anchor="middle" font-size="12" fill="#e2e8f0">Query</text>
  <line x1="218" y1="162" x2="218" y2="182" stroke="#475569" stroke-width="1.5" marker-end="url(#a1)"/>
  <rect x="138" y="184" width="160" height="38" rx="6" fill="#334155"/>
  <text x="218" y="208" text-anchor="middle" font-size="12" fill="#e2e8f0">Embed (CLIP)</text>
  <line x1="218" y1="222" x2="218" y2="242" stroke="#475569" stroke-width="1.5" marker-end="url(#a1)"/>
  <rect x="138" y="244" width="160" height="38" rx="6" fill="#334155"/>
  <text x="218" y="268" text-anchor="middle" font-size="12" fill="#e2e8f0">Vector DB &#xB7; top-k</text>
  <line x1="218" y1="282" x2="218" y2="302" stroke="#475569" stroke-width="1.5" marker-end="url(#a1)"/>
  <rect x="138" y="304" width="160" height="38" rx="6" fill="#334155"/>
  <text x="218" y="328" text-anchor="middle" font-size="12" fill="#e2e8f0">Results</text>
  <!-- Left result badge -->
  <rect x="118" y="362" width="200" height="56" rx="8" fill="#2a1215" stroke="#7f1d1d" stroke-width="1"/>
  <text x="218" y="385" text-anchor="middle" font-size="11" font-weight="600" fill="#fca5a5">Partial recall</text>
  <text x="218" y="403" text-anchor="middle" font-size="10" fill="#64748b">Misses speech, OCR, faces, audio tone</text>
  <!-- Right Column -->
  <rect x="435" y="56" width="375" height="385" rx="12" fill="#1e293b" stroke="#fc518533" stroke-width="1"/>
  <text x="623" y="86" text-anchor="middle" font-size="13" font-weight="700" fill="#fc5185">MULTI-STAGE RETRIEVAL</text>
  <text x="623" y="104" text-anchor="middle" font-size="10" fill="#475569">N extractors &#xB7; parallel search &#xB7; merge + rerank</text>
  <!-- Right flow: Query -->
  <rect x="543" y="124" width="160" height="38" rx="6" fill="#334155"/>
  <text x="623" y="148" text-anchor="middle" font-size="12" fill="#e2e8f0">Query</text>
  <line x1="623" y1="162" x2="623" y2="182" stroke="#fc5185" stroke-width="1.5" marker-end="url(#a2)"/>
  <!-- Right flow: Parallel extractors -->
  <rect x="455" y="184" width="70" height="32" rx="5" fill="#334155" stroke="#fc518555" stroke-width="1"/>
  <text x="490" y="205" text-anchor="middle" font-size="10" fill="#e2e8f0">CLIP</text>
  <rect x="531" y="184" width="70" height="32" rx="5" fill="#334155" stroke="#fc518555" stroke-width="1"/>
  <text x="566" y="205" text-anchor="middle" font-size="10" fill="#e2e8f0">Whisper</text>
  <rect x="607" y="184" width="70" height="32" rx="5" fill="#334155" stroke="#fc518555" stroke-width="1"/>
  <text x="642" y="205" text-anchor="middle" font-size="10" fill="#e2e8f0">OCR</text>
  <rect x="683" y="184" width="70" height="32" rx="5" fill="#334155" stroke="#fc518555" stroke-width="1"/>
  <text x="718" y="205" text-anchor="middle" font-size="10" fill="#e2e8f0">Face</text>
  <line x1="623" y1="216" x2="623" y2="240" stroke="#fc5185" stroke-width="1.5" marker-end="url(#a2)"/>
  <!-- Merge -->
  <rect x="523" y="242" width="200" height="38" rx="6" fill="#fc5185"/>
  <text x="623" y="266" text-anchor="middle" font-size="12" font-weight="600" fill="#fff">Merge &#xB7; RRF</text>
  <line x1="623" y1="280" x2="623" y2="300" stroke="#fc5185" stroke-width="1.5" marker-end="url(#a2)"/>
  <!-- Rerank -->
  <rect x="543" y="302" width="160" height="38" rx="6" fill="#334155" stroke="#fc518555" stroke-width="1"/>
  <text x="623" y="326" text-anchor="middle" font-size="12" fill="#e2e8f0">Rerank &#xB7; weighted</text>
  <!-- Right result badge -->
  <rect x="523" y="362" width="200" height="56" rx="8" fill="#052e16" stroke="#14532d" stroke-width="1"/>
  <text x="623" y="385" text-anchor="middle" font-size="11" font-weight="600" fill="#86efac">Complete recall</text>
  <text x="623" y="403" text-anchor="middle" font-size="10" fill="#64748b">All modalities covered in one query</text>
</svg>
<!--kg-card-end: html-->
<p>The left side is what most teams build: one model, one embedding, one search. The right side is what production systems need: <a href="https://mixpeek.com/docs/processing/feature-extractors?ref=blog.mixpeek.com">multiple extractors</a> generating independent feature vectors, independent searches across each, and a merge step that produces one unified ranking.</p><h2 id="the-feature-extraction-layer">The Feature Extraction Layer</h2><p>Before you can search across multiple signals, you need to extract them. This is where most teams get stuck. Running five models per document sounds expensive. It does not have to be.</p><p>The key insight is that extraction happens at ingest time, not query time. You pay the compute cost once, then every subsequent search is just a vector lookup. The question is which features to extract.</p>
<!--kg-card-begin: html-->
<table>
<thead>
<tr><th>Signal</th><th>Extractor</th><th>What It Captures</th><th>Query Example</th></tr>
</thead>
<tbody>
<tr><td>Visual semantics</td><td><a href="https://mixpeek.com/model/openai/clip-vit-large-patch14?ref=blog.mixpeek.com">CLIP</a></td><td>Scene content, objects, style</td><td>&quot;sunset beach product shot&quot;</td></tr>
<tr><td>Spoken words</td><td><a href="https://mixpeek.com/model/openai/whisper-large-v3?ref=blog.mixpeek.com">Whisper</a></td><td>Dialogue, narration, speech</td><td>&quot;mentions free shipping&quot;</td></tr>
<tr><td>On-screen text</td><td><a href="https://mixpeek.com/model/PaddlePaddle/paddleocr?ref=blog.mixpeek.com">PaddleOCR</a></td><td>Titles, captions, URLs, prices</td><td>&quot;contains promo code SAVE20&quot;</td></tr>
<tr><td>Faces</td><td><a href="https://mixpeek.com/model/deepinsight/retinaface-r50?ref=blog.mixpeek.com">RetinaFace</a></td><td>Identity, count, position</td><td>&quot;video with CEO appearance&quot;</td></tr>
<tr><td>Objects</td><td><a href="https://mixpeek.com/model/ultralytics/yolov8n?ref=blog.mixpeek.com">YOLO</a></td><td>Specific items, products, logos</td><td>&quot;red Nike shoes&quot;</td></tr>
<tr><td>Audio tone</td><td><a href="https://mixpeek.com/model/laion/clap-htsat-fused?ref=blog.mixpeek.com">CLAP</a></td><td>Music genre, mood, effects</td><td>&quot;upbeat background music&quot;</td></tr>
<tr><td>Text meaning</td><td><a href="https://mixpeek.com/model/BAAI/bge-large-en-v1.5?ref=blog.mixpeek.com">BGE</a></td><td>Semantic content of transcripts</td><td>&quot;discusses return policy&quot;</td></tr>
</tbody>
</table>
<!--kg-card-end: html-->
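<p>To make the table concrete, here is a sketch of how several of these extractors might be wired to a collection at ingest. The model IDs are the ones linked above; the <code>source_bucket</code> and <code>feature_extractors</code> field names are illustrative assumptions rather than the exact collection schema:</p><pre><code class="language-json">{
  &quot;collection&quot;: &quot;product-videos&quot;,
  &quot;source_bucket&quot;: &quot;my-bucket&quot;,
  &quot;feature_extractors&quot;: [
    { &quot;model_id&quot;: &quot;openai/clip-vit-large-patch14&quot; },
    { &quot;model_id&quot;: &quot;openai/whisper-large-v3&quot; },
    { &quot;model_id&quot;: &quot;PaddlePaddle/paddleocr&quot; },
    { &quot;model_id&quot;: &quot;deepinsight/retinaface-r50&quot; },
    { &quot;model_id&quot;: &quot;laion/clap-htsat-fused&quot; }
  ]
}</code></pre>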
<p>Each extractor runs independently during <a href="https://mixpeek.com/docs/ingestion/sources?ref=blog.mixpeek.com">ingestion</a>. A single video upload produces 5-7 feature vectors, each queryable on its own. The extraction cost amortizes across every future search.</p><h2 id="multi-stage-retrieval-the-architecture">Multi-Stage Retrieval: The Architecture</h2><p>Once features are extracted and stored, retrieval becomes a pipeline of stages. Each stage narrows, expands, or reranks the result set. This is what Mixpeek calls a <a href="https://mixpeek.com/docs/retrieval/retrievers?ref=blog.mixpeek.com">retriever</a>.</p>
<!--kg-card-begin: html-->
<svg viewbox="0 0 840 320" xmlns="http://www.w3.org/2000/svg" style="width:100%;max-width:840px;margin:2em auto;display:block;font-family:system-ui,-apple-system,sans-serif;">
  <defs>
    <marker id="b1" viewbox="0 0 10 10" refx="9" refy="5" markerwidth="6" markerheight="6" orient="auto"><path d="M0 0L10 5L0 10z" fill="#fc5185"/></marker>
  </defs>
  <!-- Background -->
  <rect width="840" height="320" rx="16" fill="#0f172a"/>
  <text x="420" y="32" text-anchor="middle" font-size="11" font-weight="600" fill="#64748b" letter-spacing="0.12em">ANATOMY OF A RETRIEVER PIPELINE</text>
  <!-- Query -->
  <rect x="20" y="125" width="100" height="70" rx="8" fill="#1e293b" stroke="#334155" stroke-width="1"/>
  <text x="70" y="150" text-anchor="middle" font-size="10" font-weight="600" fill="#e2e8f0">Query</text>
  <text x="70" y="166" text-anchor="middle" font-size="9" fill="#475569">&quot;product demo</text>
  <text x="70" y="178" text-anchor="middle" font-size="9" fill="#475569">with pricing&quot;</text>
  <line x1="120" y1="160" x2="148" y2="160" stroke="#fc5185" stroke-width="1.5" marker-end="url(#b1)"/>
  <!-- Parallel Search -->
  <rect x="150" y="50" width="175" height="220" rx="10" fill="#1e293b" stroke="#334155" stroke-width="1"/>
  <text x="238" y="76" text-anchor="middle" font-size="10" font-weight="700" fill="#fc5185" letter-spacing="0.08em">PARALLEL SEARCH</text>
  <rect x="168" y="92" width="140" height="30" rx="5" fill="#334155"/>
  <text x="238" y="112" text-anchor="middle" font-size="10" fill="#e2e8f0">CLIP &#xB7; visual</text>
  <rect x="168" y="130" width="140" height="30" rx="5" fill="#334155"/>
  <text x="238" y="150" text-anchor="middle" font-size="10" fill="#e2e8f0">Whisper &#xB7; speech</text>
  <rect x="168" y="168" width="140" height="30" rx="5" fill="#334155"/>
  <text x="238" y="188" text-anchor="middle" font-size="10" fill="#e2e8f0">OCR &#xB7; on-screen text</text>
  <rect x="168" y="206" width="140" height="30" rx="5" fill="#334155"/>
  <text x="238" y="226" text-anchor="middle" font-size="10" fill="#e2e8f0">Face &#xB7; identity</text>
  <line x1="325" y1="160" x2="368" y2="160" stroke="#fc5185" stroke-width="1.5" marker-end="url(#b1)"/>
  <!-- Merge -->
  <rect x="370" y="120" width="130" height="80" rx="8" fill="#fc5185"/>
  <text x="435" y="152" text-anchor="middle" font-size="12" font-weight="700" fill="#fff">Merge</text>
  <text x="435" y="170" text-anchor="middle" font-size="10" fill="#ffffffcc">Reciprocal Rank Fusion</text>
  <line x1="500" y1="160" x2="538" y2="160" stroke="#fc5185" stroke-width="1.5" marker-end="url(#b1)"/>
  <!-- Filter -->
  <rect x="540" y="130" width="100" height="60" rx="8" fill="#1e293b" stroke="#334155" stroke-width="1"/>
  <text x="590" y="155" text-anchor="middle" font-size="11" font-weight="600" fill="#e2e8f0">Filter</text>
  <text x="590" y="172" text-anchor="middle" font-size="9" fill="#475569">duration &#x2265; 15s</text>
  <line x1="640" y1="160" x2="668" y2="160" stroke="#fc5185" stroke-width="1.5" marker-end="url(#b1)"/>
  <!-- Rerank -->
  <rect x="670" y="130" width="80" height="60" rx="8" fill="#1e293b" stroke="#334155" stroke-width="1"/>
  <text x="710" y="155" text-anchor="middle" font-size="11" font-weight="600" fill="#e2e8f0">Rerank</text>
  <text x="710" y="172" text-anchor="middle" font-size="9" fill="#475569">weighted</text>
  <line x1="750" y1="160" x2="773" y2="160" stroke="#fc5185" stroke-width="1.5" marker-end="url(#b1)"/>
  <!-- Results -->
  <rect x="775" y="130" width="48" height="60" rx="8" fill="#052e16" stroke="#14532d" stroke-width="1"/>
  <text x="799" y="155" text-anchor="middle" font-size="10" font-weight="600" fill="#86efac">Top</text>
  <text x="799" y="170" text-anchor="middle" font-size="10" font-weight="600" fill="#86efac">10</text>
  <!-- Stage labels -->
  <text x="238" y="296" text-anchor="middle" font-size="9" fill="#475569">01 SEARCH</text>
  <text x="435" y="296" text-anchor="middle" font-size="9" fill="#475569">02 MERGE</text>
  <text x="590" y="296" text-anchor="middle" font-size="9" fill="#475569">03 FILTER</text>
  <text x="710" y="296" text-anchor="middle" font-size="9" fill="#475569">04 RERANK</text>
</svg>
<!--kg-card-end: html-->
<p>A retriever is a sequence of stages that execute in order. Each stage takes the output of the previous stage and transforms it. The stages compose like Unix pipes: each one does one thing, and chaining them produces complex behavior from simple parts.</p><h3 id="stage-types">Stage Types</h3><p><strong>Search stages</strong> query a specific feature index and return candidates:</p><pre><code class="language-json">{
  &quot;stage_type&quot;: &quot;search&quot;,
  &quot;model_id&quot;: &quot;openai/clip-vit-large-patch14&quot;,
  &quot;query&quot;: { &quot;type&quot;: &quot;text&quot;, &quot;value&quot;: &quot;product demonstration&quot; },
  &quot;limit&quot;: 100
}</code></pre><p><strong>Filter stages</strong> remove results that do not meet criteria:</p><pre><code class="language-json">{
  &quot;stage_type&quot;: &quot;filter&quot;,
  &quot;field&quot;: &quot;metadata.duration_seconds&quot;,
  &quot;operator&quot;: &quot;gte&quot;,
  &quot;value&quot;: 15
}</code></pre><p><strong>Merge stages</strong> combine results from parallel searches using <a href="https://mixpeek.com/blog/keyword-vs-semantic-vs-hybrid-search?ref=blog.mixpeek.com">reciprocal rank fusion</a>:</p><pre><code class="language-json">{
  &quot;stage_type&quot;: &quot;merge&quot;,
  &quot;strategy&quot;: &quot;rrf&quot;,
  &quot;sources&quot;: [&quot;visual_search&quot;, &quot;transcript_search&quot;]
}</code></pre><p><strong>Rerank stages</strong> re-score the merged results using a cross-encoder or business logic:</p><pre><code class="language-json">{
  &quot;stage_type&quot;: &quot;rerank&quot;,
  &quot;method&quot;: &quot;weighted&quot;,
  &quot;weights&quot;: { &quot;visual&quot;: 0.4, &quot;transcript&quot;: 0.35, &quot;ocr&quot;: 0.25 }
}</code></pre><p>The power is in composition. A retriever for &quot;find product videos mentioning free shipping with our CEO&quot; chains: CLIP search for product content + Whisper transcript search for &quot;free shipping&quot; + face search against an <a href="https://mixpeek.com/docs/enrichment/taxonomies?ref=blog.mixpeek.com">enrolled reference collection</a>, merge with RRF, filter by duration, rerank by recency.</p><h2 id="why-this-beats-single-vector-search">Why This Beats Single-Vector Search</h2><p>The difference is not theoretical. Here are the failure modes that multi-stage retrieval eliminates:</p>
<!--kg-card-begin: html-->
<table>
<thead>
<tr><th>Scenario</th><th>Single Vector</th><th>Multi-Stage</th></tr>
</thead>
<tbody>
<tr><td>&quot;Find videos where someone says &apos;quarterly earnings&apos;&quot;</td><td>Searches visual embeddings. Returns videos that <em>look</em> like earnings calls. Misses podcast-style recordings.</td><td>Searches <a href="https://mixpeek.com/converters/audio-to-text?ref=blog.mixpeek.com">transcript embeddings</a>. Finds exact phrase regardless of visual content.</td></tr>
<tr><td>&quot;Product videos with on-screen pricing&quot;</td><td>Returns product videos. Cannot distinguish which ones show prices.</td><td><a href="https://mixpeek.com/converters/video-to-text?ref=blog.mixpeek.com">OCR search</a> finds &quot;$&quot; patterns. Intersects with product video filter.</td></tr>
<tr><td>&quot;Clips featuring our brand ambassador&quot;</td><td>Returns visually similar people. High false positive rate.</td><td><a href="https://mixpeek.com/converters/video-to-faces?ref=blog.mixpeek.com">Face search</a> against enrolled face collection. Exact identity match.</td></tr>
<tr><td>&quot;Upbeat content suitable for social media&quot;</td><td>Cannot assess audio tone from visual embedding.</td><td><a href="https://mixpeek.com/converters/audio-to-embeddings?ref=blog.mixpeek.com">Audio embedding</a> search for &quot;upbeat&quot; + duration filter &lt; 60s.</td></tr>
</tbody>
</table>
<!--kg-card-end: html-->
<p>Each row is a real query pattern from production deployments. In every case, the multi-stage approach finds results that single-vector search misses entirely, not because the embedding model is bad, but because it is being asked to encode information it was never trained to capture.</p><h2 id="the-enrichment-layer-taxonomies-and-clusters">The Enrichment Layer: Taxonomies and Clusters</h2><p>Retrieval is half the story. The other half is enrichment: attaching structured metadata to documents so downstream systems can filter, sort, and categorize without running inference at query time.</p>
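<p>Seen from a single document, the goal is a record that carries both its extracted features and its enrichment labels. A sketch, with illustrative field names and label values that echo the diagram below:</p><pre><code class="language-json">{
  &quot;document_id&quot;: &quot;doc_123&quot;,
  &quot;source&quot;: &quot;product-demo.mp4&quot;,
  &quot;features&quot;: [&quot;clip&quot;, &quot;whisper&quot;, &quot;ocr&quot;, &quot;face&quot;],
  &quot;enrichment&quot;: {
    &quot;brand&quot;: &quot;Nike&quot;,
    &quot;category&quot;: &quot;sports&quot;,
    &quot;person&quot;: &quot;CEO&quot;,
    &quot;confidence&quot;: 0.94
  }
}</code></pre><p>Downstream queries can then filter on a field like <code>enrichment.brand</code> the same way they would filter on any other metadata, with no inference at query time.</p>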
<!--kg-card-begin: html-->
<svg viewbox="0 0 840 380" xmlns="http://www.w3.org/2000/svg" style="width:100%;max-width:840px;margin:2em auto;display:block;font-family:system-ui,-apple-system,sans-serif;">
  <defs>
    <marker id="c1" viewbox="0 0 10 10" refx="9" refy="5" markerwidth="6" markerheight="6" orient="auto"><path d="M0 0L10 5L0 10z" fill="#fc5185"/></marker>
    <marker id="c2" viewbox="0 0 10 10" refx="9" refy="5" markerwidth="6" markerheight="6" orient="auto"><path d="M0 0L10 5L0 10z" fill="#a78bfa"/></marker>
  </defs>
  <!-- Background -->
  <rect width="840" height="380" rx="16" fill="#0f172a"/>
  <text x="420" y="32" text-anchor="middle" font-size="11" font-weight="600" fill="#64748b" letter-spacing="0.12em">THE ENRICHMENT LAYER</text>
  <!-- Documents In -->
  <rect x="20" y="155" width="110" height="70" rx="8" fill="#1e293b" stroke="#334155" stroke-width="1"/>
  <text x="75" y="182" text-anchor="middle" font-size="11" font-weight="600" fill="#e2e8f0">Documents</text>
  <text x="75" y="200" text-anchor="middle" font-size="9" fill="#475569">video &#xB7; image &#xB7; audio</text>
  <line x1="130" y1="190" x2="165" y2="190" stroke="#fc5185" stroke-width="1.5" marker-end="url(#c1)"/>
  <!-- Feature Extraction -->
  <rect x="167" y="155" width="130" height="70" rx="8" fill="#1e293b" stroke="#334155" stroke-width="1"/>
  <text x="232" y="182" text-anchor="middle" font-size="11" font-weight="600" fill="#e2e8f0">Feature</text>
  <text x="232" y="200" text-anchor="middle" font-size="11" font-weight="600" fill="#e2e8f0">Extraction</text>
  <!-- Fork lines -->
  <line x1="297" y1="175" x2="368" y2="105" stroke="#fc5185" stroke-width="1.5" marker-end="url(#c1)"/>
  <line x1="297" y1="205" x2="368" y2="280" stroke="#a78bfa" stroke-width="1.5" marker-end="url(#c2)"/>
  <!-- Taxonomy Branch -->
  <rect x="370" y="50" width="220" height="120" rx="10" fill="#1e293b" stroke="#fc518544" stroke-width="1"/>
  <text x="480" y="76" text-anchor="middle" font-size="11" font-weight="700" fill="#fc5185">TAXONOMIES</text>
  <text x="480" y="94" text-anchor="middle" font-size="10" fill="#475569">Top-down &#xB7; match against known references</text>
  <rect x="390" y="108" width="80" height="26" rx="4" fill="#334155"/>
  <text x="430" y="125" text-anchor="middle" font-size="9" fill="#e2e8f0">Brands</text>
  <rect x="478" y="108" width="80" height="26" rx="4" fill="#334155"/>
  <text x="518" y="125" text-anchor="middle" font-size="9" fill="#e2e8f0">Products</text>
  <rect x="390" y="140" width="80" height="26" rx="4" fill="#334155"/>
  <text x="430" y="157" text-anchor="middle" font-size="9" fill="#e2e8f0">People</text>
  <rect x="478" y="140" width="80" height="26" rx="4" fill="#334155"/>
  <text x="518" y="157" text-anchor="middle" font-size="9" fill="#e2e8f0">Categories</text>
  <line x1="590" y1="110" x2="638" y2="110" stroke="#fc5185" stroke-width="1.5" marker-end="url(#c1)"/>
  <!-- Taxonomy Output -->
  <rect x="640" y="68" width="180" height="84" rx="8" fill="#1e293b" stroke="#334155" stroke-width="1"/>
  <text x="730" y="92" text-anchor="middle" font-size="10" font-weight="600" fill="#fc5185">Structured Labels</text>
  <text x="730" y="112" text-anchor="middle" font-size="10" fill="#94a3b8" font-family="monospace">brand: &quot;Nike&quot;</text>
  <text x="730" y="130" text-anchor="middle" font-size="10" fill="#94a3b8" font-family="monospace">category: &quot;sports&quot;</text>
  <text x="730" y="148" text-anchor="middle" font-size="10" fill="#94a3b8" font-family="monospace">person: &quot;CEO&quot;</text>
  <!-- Cluster Branch -->
  <rect x="370" y="210" width="220" height="120" rx="10" fill="#1e293b" stroke="#a78bfa44" stroke-width="1"/>
  <text x="480" y="236" text-anchor="middle" font-size="11" font-weight="700" fill="#a78bfa">CLUSTERS</text>
  <text x="480" y="254" text-anchor="middle" font-size="10" fill="#475569">Bottom-up &#xB7; discover emergent patterns</text>
  <!-- Cluster dots -->
  <circle cx="415" cy="292" r="10" fill="#334155" stroke="#a78bfa66" stroke-width="1"/>
  <circle cx="432" cy="283" r="7" fill="#334155" stroke="#a78bfa66" stroke-width="1"/>
  <circle cx="422" cy="308" r="5" fill="#334155" stroke="#a78bfa66" stroke-width="1"/>
  <circle cx="500" cy="290" r="12" fill="#334155" stroke="#a78bfa66" stroke-width="1"/>
  <circle cx="518" cy="282" r="6" fill="#334155" stroke="#a78bfa66" stroke-width="1"/>
  <circle cx="512" cy="310" r="8" fill="#334155" stroke="#a78bfa66" stroke-width="1"/>
  <circle cx="488" cy="312" r="4" fill="#334155" stroke="#a78bfa66" stroke-width="1"/>
  <line x1="590" y1="270" x2="638" y2="270" stroke="#a78bfa" stroke-width="1.5" marker-end="url(#c2)"/>
  <!-- Cluster Output -->
  <rect x="640" y="228" width="180" height="84" rx="8" fill="#1e293b" stroke="#334155" stroke-width="1"/>
  <text x="730" y="252" text-anchor="middle" font-size="10" font-weight="600" fill="#a78bfa">Emergent Patterns</text>
  <text x="730" y="272" text-anchor="middle" font-size="10" fill="#94a3b8">&quot;40% share visual style&quot;</text>
  <text x="730" y="290" text-anchor="middle" font-size="10" fill="#94a3b8">&quot;trending: outdoor shots&quot;</text>
  <text x="730" y="308" text-anchor="middle" font-size="10" fill="#94a3b8">&quot;3 tone clusters found&quot;</text>
</svg>
<!--kg-card-end: html-->
<p><a href="https://mixpeek.com/docs/enrichment/taxonomies?ref=blog.mixpeek.com">Taxonomies</a> work like semantic JOINs. You define a reference collection (brand logos, product SKUs, content categories) and match incoming documents against it using embedding similarity. A video containing a Nike swoosh gets enriched with <code>brand: Nike</code>, <code>brand_id: nike_001</code>, not because a rule detected the text &quot;Nike&quot; but because the visual embedding matched the reference collection.</p><p><a href="https://mixpeek.com/docs/enrichment/clusters?ref=blog.mixpeek.com">Clusters</a> work bottom-up. Instead of matching against known categories, clustering groups similar documents and surfaces emergent patterns. You might discover that 40% of your video library shares a visual style you never explicitly categorized.</p><p>Together, taxonomies and clusters replace the manual tagging workflows that cost media companies <a href="https://mixpeek.com/blog/video-intelligence-raw-footage-to-searchable-data?ref=blog.mixpeek.com">$15-25 per asset</a>.</p><h2 id="the-decision-tree-which-architecture-when">The Decision Tree: Which Architecture When</h2><p>Not every use case needs the full multi-stage pipeline. Here is how to decide:</p>
<!--kg-card-begin: html-->
<table>
<thead>
<tr><th>Your content</th><th>Your queries</th><th>Start with</th><th>Graduate to</th></tr>
</thead>
<tbody>
<tr><td>Text documents only</td><td>Semantic questions</td><td>Single embedding + vector search</td><td>Add <a href="https://mixpeek.com/blog/keyword-vs-semantic-vs-hybrid-search?ref=blog.mixpeek.com">BM25 hybrid search</a> for keyword recall</td></tr>
<tr><td>Images with metadata</td><td>Visual similarity</td><td>CLIP embeddings</td><td>Add <a href="https://mixpeek.com/docs/enrichment/taxonomies?ref=blog.mixpeek.com">taxonomy enrichment</a> for structured filters</td></tr>
<tr><td>Video (&lt; 1K assets)</td><td>Basic search</td><td><a href="https://mixpeek.com/converters/video-to-description?ref=blog.mixpeek.com">Scene descriptions</a></td><td>Add transcript + OCR for cross-modal coverage</td></tr>
<tr><td>Video (10K+ assets)</td><td>Complex, multi-signal</td><td><a href="https://mixpeek.com/docs/retrieval/retrievers?ref=blog.mixpeek.com">Multi-stage retriever</a> from day one</td><td>Add <a href="https://mixpeek.com/docs/enrichment/clusters?ref=blog.mixpeek.com">clusters</a> to discover content patterns</td></tr>
<tr><td>Mixed media library</td><td>Agent-driven queries</td><td><a href="https://mixpeek.com/connectors/mcp-server?ref=blog.mixpeek.com">MCP integration</a></td><td>Full pipeline: extract, enrich, retrieve, rerank</td></tr>
</tbody>
</table>
<!--kg-card-end: html-->
<p>The pattern is consistent: start with the simplest pipeline that covers your primary query pattern, then add stages as you discover what the first pipeline misses.</p><h2 id="what-this-looks-like-in-practice">What This Looks Like in Practice</h2><p>A complete pipeline from upload to searchable, enriched content:</p><pre><code class="language-bash"># 1. Upload to a bucket
curl -X POST &quot;$MP_API_URL/v1/buckets/my-bucket/upload&quot; \
  -H &quot;Authorization: Bearer $MP_API_KEY&quot; \
  -F &quot;file=@product-demo.mp4&quot;

# 2. Collection triggers extraction automatically:
#    - CLIP embeddings from video frames
#    - Whisper transcription from audio
#    - OCR from on-screen text
#    - Face detection and embedding
#    - Object detection via YOLO

# 3. Taxonomy enrichment runs post-extraction:
#    - Matches detected faces against employee collection
#    - Matches visual content against brand reference collection
#    - Classifies content into IAB categories

# 4. Search across all features at once
curl -X POST &quot;$MP_API_URL/v1/retrievers/my-retriever/search&quot; \
  -H &quot;Authorization: Bearer $MP_API_KEY&quot; \
  -d &apos;{
    &quot;query&quot;: {
      &quot;text&quot;: &quot;product demo with pricing shown on screen&quot;,
      &quot;modality&quot;: &quot;text&quot;
    },
    &quot;limit&quot;: 10
  }&apos;</code></pre><p>The retriever handles the multi-stage logic: parallel searches across visual, transcript, and OCR indices, RRF merge, taxonomy-based filtering, and relevance reranking. The caller sends one query and gets one ranked result list.</p><h2 id="the-real-shift">The Real Shift</h2><p>The argument is not that vector search is bad. Vector search is good at what it does: finding semantically similar content within a single modality. The problem is asking it to do everything.</p><p>A video is not a text document. An image with overlaid text is not just an image. A podcast episode is not just an audio waveform. Rich media has multiple signals, and each signal needs its own extraction, its own index, and its own search path.</p><p>The teams that figure this out stop asking &quot;which embedding model should we use?&quot; and start asking &quot;which features should we extract and how should we combine their search results?&quot; That is the shift from vector search to <a href="https://mixpeek.com/glossary/multimodal-retrieval?ref=blog.mixpeek.com">multimodal retrieval</a>.</p><hr><p>Start with one extractor. Add a second when your first query pattern hits a wall. Chain them with a retriever. That is the whole playbook.</p><p><em>Ready to build? </em><a href="https://mixpeek.com/build?ref=blog.mixpeek.com"><em>Start with the pipeline builder</em></a><em> or explore the </em><a href="https://mixpeek.com/docs/retrieval/retrievers?ref=blog.mixpeek.com"><em>retriever API reference</em></a><em>.</em></p>]]></content:encoded></item><item><title><![CDATA[Multimodal Taxonomies: How to Classify Video, Images, Audio, and Text with One Category System]]></title><description><![CDATA[Traditional taxonomies classify one content type at a time. Multimodal taxonomies unify classification across every format using embedding similarity the missing layer between raw AI features and structured, searchable metadata.]]></description><link>http://blog.mixpeek.com/multimodal-taxonomies/</link><guid isPermaLink="false">69ecb8a5c768534223476262</guid><category><![CDATA[Multimodal AI]]></category><category><![CDATA[Taxonomies]]></category><category><![CDATA[Content Classification]]></category><category><![CDATA[Data Enrichment]]></category><category><![CDATA[Tutorial]]></category><dc:creator><![CDATA[Ethan Steininger]]></dc:creator><pubDate>Sat, 25 Apr 2026 13:40:23 GMT</pubDate><media:content url="http://blog.mixpeek.com/content/images/2026/04/multimodal-taxonomies-feature-v3.png" medium="image"/><content:encoded><![CDATA[<img src="http://blog.mixpeek.com/content/images/2026/04/multimodal-taxonomies-feature-v3.png" alt="Multimodal Taxonomies: How to Classify Video, Images, Audio, and Text with One Category System"><p><strong>TL;DR:</strong> Traditional taxonomies classify one content type at a time. Text gets labels, photos get tags, video gets a separate system. Multimodal taxonomies unify classification across every format by matching content against reference collections using <a href="https://mixpeek.com/converters/multimodal-to-embeddings?ref=blog.mixpeek.com">embedding similarity</a>. They bridge raw AI features and structured, searchable metadata.</p><hr><h2 id="what-is-a-taxonomy">What Is a Taxonomy?</h2><p>A taxonomy is a classification system that organizes content into categories. Gmail sorting emails into Primary/Social/Promotions, Shopify categorizing products into Google&apos;s 5,500+ product taxonomy, YouTube classifying videos for ad targeting. 
All taxonomies.</p><p>In data infrastructure, taxonomies solve three problems: <strong>discovery</strong> (navigating categories instead of guessing search terms), <strong>governance</strong> (enforcing policies by content type), and <strong>enrichment</strong> (attaching structured metadata to unstructured content so downstream systems can <a href="https://mixpeek.com/docs/retrieval/retrievers?ref=blog.mixpeek.com">filter, sort, and search</a> it).</p><p>Traditional taxonomies are manual and single-modal. A human reviews an article and assigns &quot;Sports &gt; Basketball &gt; NBA.&quot; A separate system tags an image &quot;outdoor, basketball court.&quot; Another transcribes a video. Each modality gets its own pipeline, its own maintenance burden. That was fine when content was mostly text.</p><h2 id="why-single-modal-classification-breaks">Why Single-Modal Classification Breaks</h2><p><strong>Scale.</strong> YouTube receives 720,000 hours of video every day. TikTok ingests 34 million videos daily. That&apos;s nearly 400 per second. A trained analyst can classify ~10,000 documents per year. To manually classify one day of TikTok, you&apos;d need 3,400 analysts working full-time for a year.</p><p><strong>Context blindness.</strong> A meme with &quot;this is fire&quot; means different things depending on whether the image shows a concert or a burning building. An ICCV 2025 study quantified this: text-only models achieved F1 of 0.75&#x2013;0.81 on video moderation. Adding visual and audio signals pushed that to 0.84&#x2013;0.91. The missing 10&#x2013;15% is cross-modal context.</p><p><strong>Consistency drift.</strong> The <a href="https://mixpeek.com/blog/iab-contextual-classifier-multimodal-ai?ref=blog.mixpeek.com">IAB Content Taxonomy</a> has grown from ~400 categories in v2 to 1,500+ in v3, and even with that specificity, human reviewers routinely disagree on assignments.</p><h2 id="what-makes-a-taxonomy-multimodal">What Makes a Taxonomy &quot;Multimodal&quot;</h2><p>A multimodal taxonomy classifies content by understanding it across all modalities simultaneously, then matching against reference categories using embedding similarity rather than keyword rules.</p><p>The key difference: instead of writing rules (&quot;if text contains &apos;basketball&apos; AND image has an orange round object...&quot;), a multimodal taxonomy works like a <strong>semantic JOIN</strong>. You define categories with a reference collection of representative examples. New content is <a href="https://mixpeek.com/docs/retrieval/stages/feature-search?ref=blog.mixpeek.com">matched against those references</a> using vector similarity across all extracted features: visual, audio, and textual, all at once.</p>
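<p>Concretely, a single-level version of that join can be sketched as a configuration that names a reference collection, the model used for matching, and a similarity threshold. The field names here are illustrative assumptions, not the exact Mixpeek taxonomy schema:</p><pre><code class="language-json">{
  &quot;taxonomy&quot;: &quot;brand-detection&quot;,
  &quot;reference_collection&quot;: &quot;brand-logos&quot;,
  &quot;match&quot;: {
    &quot;stage_type&quot;: &quot;search&quot;,
    &quot;model_id&quot;: &quot;openai/clip-vit-large-patch14&quot;,
    &quot;limit&quot;: 1
  },
  &quot;min_similarity&quot;: 0.8,
  &quot;enrich_fields&quot;: [&quot;brand&quot;, &quot;brand_id&quot;]
}</code></pre><p>Every incoming document is searched against the reference collection; the best match above the threshold contributes its labels.</p>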
<!--kg-card-begin: html-->
<div style="margin: 40px 0;">
<svg xmlns="http://www.w3.org/2000/svg" viewbox="0 0 800 300" style="width:100%;max-width:800px;display:block;margin:0 auto;">
  <rect width="800" height="300" rx="12" fill="#faf9ff"/>
  <rect x="10" y="10" width="380" height="280" rx="8" fill="#fff5f5" stroke="#fecaca" stroke-width="1"/>
  <text x="200" y="38" font-family="-apple-system, BlinkMacSystemFont, sans-serif" font-size="14" fill="#991b1b" text-anchor="middle" font-weight="700">Traditional (Single-Modal)</text>
  <rect x="30" y="55" width="70" height="30" rx="6" fill="#fff" stroke="#f472b6" stroke-width="1.5"/>
  <text x="65" y="75" font-family="-apple-system, sans-serif" font-size="11" fill="#9d174d" text-anchor="middle" font-weight="600">Video</text>
  <line x1="100" y1="70" x2="140" y2="70" stroke="#d1d5db" stroke-width="1.5"/>
  <rect x="140" y="55" width="100" height="30" rx="6" fill="#fff" stroke="#d1d5db" stroke-width="1"/>
  <text x="190" y="75" font-family="-apple-system, sans-serif" font-size="10" fill="#6b7280" text-anchor="middle">Manual review</text>
  <line x1="240" y1="70" x2="280" y2="70" stroke="#d1d5db" stroke-width="1.5"/>
  <rect x="280" y="55" width="90" height="30" rx="6" fill="#fef2f2" stroke="#ef4444" stroke-width="1"/>
  <text x="325" y="75" font-family="-apple-system, sans-serif" font-size="10" fill="#991b1b" text-anchor="middle" font-weight="600">Label A</text>
  <rect x="30" y="100" width="70" height="30" rx="6" fill="#fff" stroke="#60a5fa" stroke-width="1.5"/>
  <text x="65" y="120" font-family="-apple-system, sans-serif" font-size="11" fill="#1e40af" text-anchor="middle" font-weight="600">Image</text>
  <line x1="100" y1="115" x2="140" y2="115" stroke="#d1d5db" stroke-width="1.5"/>
  <rect x="140" y="100" width="100" height="30" rx="6" fill="#fff" stroke="#d1d5db" stroke-width="1"/>
  <text x="190" y="120" font-family="-apple-system, sans-serif" font-size="10" fill="#6b7280" text-anchor="middle">Image tagger</text>
  <line x1="240" y1="115" x2="280" y2="115" stroke="#d1d5db" stroke-width="1.5"/>
  <rect x="280" y="100" width="90" height="30" rx="6" fill="#fef2f2" stroke="#ef4444" stroke-width="1"/>
  <text x="325" y="120" font-family="-apple-system, sans-serif" font-size="10" fill="#991b1b" text-anchor="middle" font-weight="600">Label B</text>
  <rect x="30" y="145" width="70" height="30" rx="6" fill="#fff" stroke="#fbbf24" stroke-width="1.5"/>
  <text x="65" y="165" font-family="-apple-system, sans-serif" font-size="11" fill="#92400e" text-anchor="middle" font-weight="600">Text</text>
  <line x1="100" y1="160" x2="140" y2="160" stroke="#d1d5db" stroke-width="1.5"/>
  <rect x="140" y="145" width="100" height="30" rx="6" fill="#fff" stroke="#d1d5db" stroke-width="1"/>
  <text x="190" y="165" font-family="-apple-system, sans-serif" font-size="10" fill="#6b7280" text-anchor="middle">Keyword rules</text>
  <line x1="240" y1="160" x2="280" y2="160" stroke="#d1d5db" stroke-width="1.5"/>
  <rect x="280" y="145" width="90" height="30" rx="6" fill="#fef2f2" stroke="#ef4444" stroke-width="1"/>
  <text x="325" y="165" font-family="-apple-system, sans-serif" font-size="10" fill="#991b1b" text-anchor="middle" font-weight="600">Label C</text>
  <text x="200" y="210" font-family="-apple-system, sans-serif" font-size="11" fill="#991b1b" text-anchor="middle" font-weight="600">3 pipelines. 3 labels. No cross-modal context.</text>
  <rect x="410" y="10" width="380" height="280" rx="8" fill="#f0fdf4" stroke="#bbf7d0" stroke-width="1"/>
  <text x="600" y="38" font-family="-apple-system, BlinkMacSystemFont, sans-serif" font-size="14" fill="#065f46" text-anchor="middle" font-weight="700">Multimodal Taxonomy</text>
  <rect x="430" y="58" width="60" height="26" rx="13" fill="#fdf2f8" stroke="#f472b6" stroke-width="1"/>
  <text x="460" y="76" font-family="-apple-system, sans-serif" font-size="10" fill="#9d174d" text-anchor="middle" font-weight="600">Video</text>
  <rect x="430" y="90" width="60" height="26" rx="13" fill="#eff6ff" stroke="#60a5fa" stroke-width="1"/>
  <text x="460" y="108" font-family="-apple-system, sans-serif" font-size="10" fill="#1e40af" text-anchor="middle" font-weight="600">Image</text>
  <rect x="430" y="122" width="60" height="26" rx="13" fill="#ecfdf5" stroke="#34d399" stroke-width="1"/>
  <text x="460" y="140" font-family="-apple-system, sans-serif" font-size="10" fill="#065f46" text-anchor="middle" font-weight="600">Audio</text>
  <rect x="430" y="154" width="60" height="26" rx="13" fill="#fffbeb" stroke="#fbbf24" stroke-width="1"/>
  <text x="460" y="172" font-family="-apple-system, sans-serif" font-size="10" fill="#92400e" text-anchor="middle" font-weight="600">Text</text>
  <line x1="490" y1="71" x2="530" y2="112" stroke="#d1d5db" stroke-width="1"/>
  <line x1="490" y1="103" x2="530" y2="112" stroke="#d1d5db" stroke-width="1"/>
  <line x1="490" y1="135" x2="530" y2="125" stroke="#d1d5db" stroke-width="1"/>
  <line x1="490" y1="167" x2="530" y2="130" stroke="#d1d5db" stroke-width="1"/>
  <rect x="530" y="92" width="100" height="52" rx="8" fill="#f5f3ff" stroke="#7c3aed" stroke-width="1.5"/>
  <text x="580" y="115" font-family="-apple-system, sans-serif" font-size="11" fill="#6d28d9" text-anchor="middle" font-weight="600">Feature</text>
  <text x="580" y="132" font-family="-apple-system, sans-serif" font-size="11" fill="#6d28d9" text-anchor="middle" font-weight="600">Extraction</text>
  <line x1="630" y1="118" x2="660" y2="118" stroke="#d1d5db" stroke-width="1.5"/>
  <rect x="660" y="85" width="110" height="66" rx="8" fill="#ecfdf5" stroke="#059669" stroke-width="1.5"/>
  <text x="715" y="108" font-family="-apple-system, sans-serif" font-size="11" fill="#065f46" text-anchor="middle" font-weight="700">Taxonomy</text>
  <text x="715" y="124" font-family="-apple-system, sans-serif" font-size="11" fill="#065f46" text-anchor="middle" font-weight="700">Similarity</text>
  <text x="715" y="140" font-family="-apple-system, sans-serif" font-size="11" fill="#065f46" text-anchor="middle" font-weight="700">JOIN</text>
  <line x1="715" y1="151" x2="715" y2="170" stroke="#d1d5db" stroke-width="1.5"/>
  <rect x="640" y="170" width="150" height="82" rx="8" fill="#fff" stroke="#059669" stroke-width="1"/>
  <text x="655" y="190" font-family="-apple-system, sans-serif" font-size="10" fill="#6b7280">Category</text>
  <text x="718" y="190" font-family="-apple-system, sans-serif" font-size="10" fill="#1e1b4b" font-weight="700">Sports</text>
  <text x="655" y="207" font-family="-apple-system, sans-serif" font-size="10" fill="#6b7280">Sub</text>
  <text x="718" y="207" font-family="-apple-system, sans-serif" font-size="10" fill="#1e1b4b" font-weight="700">NBA</text>
  <text x="655" y="224" font-family="-apple-system, sans-serif" font-size="10" fill="#6b7280">Brand</text>
  <text x="718" y="224" font-family="-apple-system, sans-serif" font-size="10" fill="#1e1b4b" font-weight="700">Nike</text>
  <text x="655" y="241" font-family="-apple-system, sans-serif" font-size="10" fill="#6b7280">Conf.</text>
  <text x="718" y="241" font-family="-apple-system, sans-serif" font-size="10" fill="#059669" font-weight="700">94%</text>
  <text x="600" y="280" font-family="-apple-system, sans-serif" font-size="11" fill="#065f46" text-anchor="middle" font-weight="600">1 pipeline. Unified label. Full context.</text>
</svg>
</div>
<!--kg-card-end: html-->
<h2 id="flat-vs-hierarchical">Flat vs. Hierarchical</h2><h3 id="flat-taxonomies">Flat Taxonomies</h3><p>Single-level reference collection. Every document is matched against the same categories, best match wins.</p><p><strong>Use cases:</strong> <a href="https://mixpeek.com/converters/video-to-faces?ref=blog.mixpeek.com">Face enrollment</a>, logo detection, product recognition, entity linking. Fast to set up. Start here if your categories don&apos;t have meaningful parent-child relationships.</p><h3 id="hierarchical-taxonomies">Hierarchical Taxonomies</h3><p>Categories organized into a tree where classification cascades from broad to specific. Each level narrows the search space using different features, executing like a <strong>Common Table Expression (CTE)</strong>. Each level builds on the previous.</p><p>A document classified as &quot;Nike &#x2192; Athletic &#x2192; Running&quot; inherits enrichment fields from all three levels. Different levels can use different <a href="https://mixpeek.com/docs/processing/feature-extractors?ref=blog.mixpeek.com">feature extractors</a>: logo embeddings for brand detection, scene classification for categories, activity recognition for subcategories.</p>
<!--kg-card-begin: html-->
<div style="margin: 40px 0;">
<svg xmlns="http://www.w3.org/2000/svg" viewbox="0 0 800 310" style="width:100%;max-width:800px;display:block;margin:0 auto;">
  <rect width="800" height="310" rx="12" fill="#faf9ff"/>
  <text x="400" y="28" font-family="-apple-system, sans-serif" font-size="14" fill="#6d28d9" text-anchor="middle" font-weight="700">Hierarchical Taxonomy &#x2013; CTE-style Execution</text>
  <rect x="280" y="42" width="240" height="46" rx="10" fill="#f5f3ff" stroke="#7c3aed" stroke-width="2"/>
  <rect x="280" y="42" width="6" height="46" rx="3" fill="#7c3aed"/>
  <text x="300" y="62" font-family="-apple-system, sans-serif" font-size="10" fill="#7c3aed" font-weight="700">L0</text>
  <text x="325" y="72" font-family="-apple-system, sans-serif" font-size="14" fill="#1e1b4b" font-weight="700">Brand Detection</text>
  <text x="463" y="72" font-family="-apple-system, sans-serif" font-size="10" fill="#6b7280">logo emb.</text>
  <line x1="360" y1="88" x2="250" y2="115" stroke="#c4b5fd" stroke-width="2"/>
  <line x1="440" y1="88" x2="550" y2="115" stroke="#c4b5fd" stroke-width="2"/>
  <circle cx="250" cy="115" r="3" fill="#7c3aed"/>
  <circle cx="550" cy="115" r="3" fill="#7c3aed"/>
  <rect x="130" y="118" width="240" height="42" rx="8" fill="#eff6ff" stroke="#2563eb" stroke-width="1.5"/>
  <rect x="130" y="118" width="5" height="42" rx="2.5" fill="#2563eb"/>
  <text x="150" y="136" font-family="-apple-system, sans-serif" font-size="10" fill="#2563eb" font-weight="700">L1</text>
  <text x="175" y="146" font-family="-apple-system, sans-serif" font-size="13" fill="#1e3a5f" font-weight="700">Nike</text>
  <rect x="200" y="128" width="60" height="18" rx="9" fill="#dbeafe"/>
  <text x="230" y="141" font-family="-apple-system, sans-serif" font-size="9" fill="#1e40af" text-anchor="middle" font-weight="600">+brand_id</text>
  <rect x="430" y="118" width="240" height="42" rx="8" fill="#eff6ff" stroke="#2563eb" stroke-width="1.5"/>
  <rect x="430" y="118" width="5" height="42" rx="2.5" fill="#2563eb"/>
  <text x="450" y="136" font-family="-apple-system, sans-serif" font-size="10" fill="#2563eb" font-weight="700">L1</text>
  <text x="475" y="146" font-family="-apple-system, sans-serif" font-size="13" fill="#1e3a5f" font-weight="700">Adidas</text>
  <line x1="210" y1="160" x2="160" y2="190" stroke="#93c5fd" stroke-width="1.5"/>
  <line x1="290" y1="160" x2="340" y2="190" stroke="#93c5fd" stroke-width="1.5"/>
  <circle cx="160" cy="190" r="3" fill="#2563eb"/>
  <circle cx="340" cy="190" r="3" fill="#2563eb"/>
  <rect x="60" y="193" width="200" height="40" rx="8" fill="#ecfdf5" stroke="#059669" stroke-width="1.5"/>
  <rect x="60" y="193" width="5" height="40" rx="2.5" fill="#059669"/>
  <text x="80" y="210" font-family="-apple-system, sans-serif" font-size="10" fill="#059669" font-weight="700">L2</text>
  <text x="105" y="220" font-family="-apple-system, sans-serif" font-size="12" fill="#064e3b" font-weight="700">Athletic</text>
  <rect x="155" y="201" width="70" height="18" rx="9" fill="#d1fae5"/>
  <text x="190" y="214" font-family="-apple-system, sans-serif" font-size="9" fill="#065f46" text-anchor="middle" font-weight="600">+category</text>
  <rect x="290" y="193" width="200" height="40" rx="8" fill="#ecfdf5" stroke="#059669" stroke-width="1.5"/>
  <rect x="290" y="193" width="5" height="40" rx="2.5" fill="#059669"/>
  <text x="310" y="210" font-family="-apple-system, sans-serif" font-size="10" fill="#059669" font-weight="700">L2</text>
  <text x="335" y="220" font-family="-apple-system, sans-serif" font-size="12" fill="#064e3b" font-weight="700">Lifestyle</text>
  <line x1="120" y1="233" x2="85" y2="258" stroke="#6ee7b7" stroke-width="1.5"/>
  <line x1="200" y1="233" x2="250" y2="258" stroke="#6ee7b7" stroke-width="1.5"/>
  <circle cx="85" cy="258" r="3" fill="#059669"/>
  <circle cx="250" cy="258" r="3" fill="#059669"/>
  <rect x="20" y="260" width="150" height="38" rx="8" fill="#fffbeb" stroke="#d97706" stroke-width="1.5"/>
  <rect x="20" y="260" width="5" height="38" rx="2.5" fill="#d97706"/>
  <text x="40" y="276" font-family="-apple-system, sans-serif" font-size="10" fill="#d97706" font-weight="700">L3</text>
  <text x="62" y="286" font-family="-apple-system, sans-serif" font-size="12" fill="#78350f" font-weight="700">Running</text>
  <rect x="115" y="268" width="40" height="18" rx="9" fill="#fef3c7"/>
  <text x="135" y="281" font-family="-apple-system, sans-serif" font-size="9" fill="#92400e" text-anchor="middle" font-weight="600">+SKU</text>
  <rect x="200" y="260" width="150" height="38" rx="8" fill="#fffbeb" stroke="#d97706" stroke-width="1.5"/>
  <rect x="200" y="260" width="5" height="38" rx="2.5" fill="#d97706"/>
  <text x="220" y="276" font-family="-apple-system, sans-serif" font-size="10" fill="#d97706" font-weight="700">L3</text>
  <text x="242" y="286" font-family="-apple-system, sans-serif" font-size="12" fill="#78350f" font-weight="700">Basketball</text>
  <rect x="530" y="225" width="250" height="70" rx="10" fill="#fff" stroke="#c4b5fd" stroke-width="1.5"/>
  <text x="545" y="246" font-family="-apple-system, sans-serif" font-size="11" fill="#6d28d9" font-weight="700">Inherited enrichment (Running):</text>
  <text x="545" y="264" font-family="-apple-system, sans-serif" font-size="11" fill="#4b5563">L0 brand_id + L1 brand + L2 category</text>
  <text x="545" y="282" font-family="-apple-system, sans-serif" font-size="12" fill="#059669" font-weight="700">Nike &#x2192; Athletic &#x2192; Running &#x2192; SKU</text>
  <path d="M170,285 Q350,310 530,270" fill="none" stroke="#c4b5fd" stroke-width="1.5" stroke-dasharray="5,4"/>
</svg>
</div>
<!--kg-card-end: html-->
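<p>To make the CTE analogy concrete, here is a minimal sketch of the cascade in plain Python. This is illustrative pseudocode rather than Mixpeek&apos;s implementation: <code>similarity</code> stands in for a retriever query against one level&apos;s reference collection, and the structure mirrors the diagram above.</p><pre><code class="language-python">def classify(document, hierarchy, similarity):
    # Cascade CTE-style: each level only considers children of the previous
    # winner, scores them with its own feature, and adds its enrichment fields.
    enrichment, parent_id = {}, None
    for level in hierarchy:  # e.g. brand -&gt; category -&gt; subcategory
        candidates = [n for n in level[&apos;nodes&apos;] if n[&apos;parent&apos;] == parent_id]
        winner = max(candidates,
                     key=lambda n: similarity(n, document[level[&apos;feature&apos;]]))
        enrichment.update(winner[&apos;fields&apos;])  # enrichment inherits downward
        parent_id = winner[&apos;id&apos;]
    return enrichment
</code></pre><p>In practice you never write this loop yourself; the hierarchy definition in the &quot;Building a Multimodal Taxonomy&quot; section below expresses the same cascade declaratively.</p>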
<p><strong>Use cases:</strong> Media content classification, <a href="https://mixpeek.com/ecommerce-search?ref=blog.mixpeek.com">product categorization</a>, organizational hierarchies, content moderation.</p><h2 id="how-it-works">How It Works</h2><p><strong>1. Feature extraction.</strong> Multiple AI models extract features from each modality: <a href="https://mixpeek.com/converters/image-to-embeddings?ref=blog.mixpeek.com">CLIP embeddings</a> from video frames, speech transcription from audio, object detection from images, sentence embeddings from text. Each becomes a queryable vector.</p><p><strong>2. Input mapping.</strong> Configures which extracted features query which taxonomy level. A face-based taxonomy uses face embeddings; a content classification taxonomy might use CLIP at the top level and <a href="https://mixpeek.com/converters/audio-to-embeddings?ref=blog.mixpeek.com">audio features</a> deeper down.</p><p><strong>3. Similarity matching.</strong> Each document&apos;s features are compared against the reference collection using a <a href="https://mixpeek.com/docs/retrieval/retrievers?ref=blog.mixpeek.com">retriever</a>, the same infrastructure used for <a href="https://mixpeek.com/blog/keyword-vs-semantic-vs-hybrid-search?ref=blog.mixpeek.com">semantic search</a>. Documents exceeding the threshold get enriched.</p><p><strong>4. Enrichment.</strong> Structured metadata from the reference collection is attached to the document: brand name, content policy, compliance flags, campaign IDs. Configurable field paths, target names, and merge modes (replace or append).</p><h2 id="real-world-applications">Real-World Applications</h2><p><strong>Advertising.</strong> The <a href="https://mixpeek.com/blog/iab-taxonomy-migration?ref=blog.mixpeek.com">IAB Content Taxonomy</a> defines 1,500+ categories for programmatic ad targeting. Text-only classifiers can&apos;t categorize a cooking video with no description or a sports highlight with only crowd noise. AWS published a <a href="https://mixpeek.com/blog/multimodal-ai-contextual-advertising?ref=blog.mixpeek.com">reference architecture</a> requiring five separate services. A retriever-powered taxonomy collapses that into one pipeline.</p><p><strong>Media asset management.</strong> Libraries of 100,000+ video assets need search across visual content, dialogue, and audio. A hierarchical taxonomy classifies a broadcast as &quot;Live Sports &#x2192; Football &#x2192; NFL &#x2192; Highlight &#x2192; Touchdown&quot; using different features at each level, enriching with rights info and licensing metadata. Manual tagging costs $15&#x2013;25 per asset. See how <a href="https://mixpeek.com/blog/video-intelligence-raw-footage-to-searchable-data?ref=blog.mixpeek.com">video search</a> changes this.</p><p><strong>E-commerce.</strong> Shopify&apos;s multimodal system (BERT + MobileNet-V2) increased leaf-node classification precision by 8% and nearly doubled coverage vs. text-only. A 2025 study found CLIP-based fusion achieved 98.59% hierarchical F1 with a two-stage pipeline: lightweight text model first, multimodal model only when confidence is low.</p><p><strong>Content moderation.</strong> An ICCV 2025 study tested multimodal AI on 1,500 videos across 12 languages. Best model (Gemini-2.0-Flash) achieved F1=0.91 vs. human F1=0.98, at 1/35th the cost ($28 vs. $974). 
The practical solution: multimodal AI handles the first pass, low-confidence cases escalate to humans.</p><p><a href="https://mixpeek.com/blog/ip-safety-pre-publication-clearance?ref=blog.mixpeek.com"><strong>Brand safety</strong></a><strong>.</strong> Enforcing &quot;Talent X cannot appear within 5 seconds of a competitor product in negative-sentiment content&quot; requires cross-modal reasoning: face recognition, logo detection, audio sentiment, temporal proximity. A <a href="https://mixpeek.com/blog/multi-stage-retrieval-pipelines?ref=blog.mixpeek.com">multi-stage retrieval pipeline</a> connects these with taxonomy enrichment for contract terms and compliance status.</p><h2 id="building-a-multimodal-taxonomy">Building a Multimodal Taxonomy</h2><h3 id="create-reference-collections">Create reference collections</h3><pre><code class="language-bash"># Flat taxonomy: employee face recognition
curl -sS -X POST &quot;$MP_API_URL/v1/taxonomies&quot; \
  -H &quot;Authorization: Bearer $MP_API_KEY&quot; \
  -H &quot;X-Namespace: $MP_NAMESPACE&quot; \
  -H &quot;Content-Type: application/json&quot; \
  -d &apos;{
    &quot;taxonomy_name&quot;: &quot;employee_faces&quot;,
    &quot;taxonomy_type&quot;: &quot;flat&quot;,
    &quot;retriever_id&quot;: &quot;ret_face_matcher&quot;,
    &quot;input_mappings&quot;: {
      &quot;query_embedding&quot;: &quot;mixpeek://face_detector@v2/face_embedding&quot;
    },
    &quot;source_collection&quot;: {
      &quot;collection_id&quot;: &quot;col_employee_embeddings&quot;,
      &quot;enrichment_fields&quot;: [
        { &quot;field_path&quot;: &quot;metadata.name&quot;, &quot;merge_mode&quot;: &quot;enrich&quot; },
        { &quot;field_path&quot;: &quot;metadata.department&quot;, &quot;merge_mode&quot;: &quot;enrich&quot; }
      ]
    }
  }&apos;
</code></pre><h3 id="go-hierarchical-when-you-need-precision">Go hierarchical when you need precision</h3><pre><code class="language-bash">curl -sS -X POST &quot;$MP_API_URL/v1/taxonomies&quot; \
  -H &quot;Authorization: Bearer $MP_API_KEY&quot; \
  -H &quot;X-Namespace: $MP_NAMESPACE&quot; \
  -H &quot;Content-Type: application/json&quot; \
  -d &apos;{
    &quot;taxonomy_name&quot;: &quot;content_classification&quot;,
    &quot;taxonomy_type&quot;: &quot;hierarchical&quot;,
    &quot;retriever_id&quot;: &quot;ret_scene_classifier&quot;,
    &quot;input_mappings&quot;: {
      &quot;query_embedding&quot;: &quot;mixpeek://clip@v1/scene_embedding&quot;
    },
    &quot;hierarchy&quot;: [
      {
        &quot;node_id&quot;: &quot;brands&quot;,
        &quot;collection_id&quot;: &quot;col_brand_references&quot;,
        &quot;enrichment_fields&quot;: [&quot;metadata.brand_name&quot;, &quot;metadata.brand_id&quot;]
      },
      {
        &quot;node_id&quot;: &quot;categories&quot;,
        &quot;collection_id&quot;: &quot;col_content_categories&quot;,
        &quot;parent_node_id&quot;: &quot;brands&quot;,
        &quot;enrichment_fields&quot;: [&quot;metadata.category&quot;, &quot;metadata.content_policy&quot;]
      },
      {
        &quot;node_id&quot;: &quot;campaigns&quot;,
        &quot;collection_id&quot;: &quot;col_campaign_assets&quot;,
        &quot;parent_node_id&quot;: &quot;categories&quot;,
        &quot;retriever_id&quot;: &quot;ret_campaign_matcher&quot;,
        &quot;enrichment_fields&quot;: [&quot;metadata.campaign_id&quot;, &quot;metadata.flight_dates&quot;]
      }
    ]
  }&apos;
</code></pre><h3 id="choose-an-execution-mode">Choose an execution mode</h3>
<!--kg-card-begin: html-->
<table>
<thead>
<tr>
<th>Mode</th>
<th>When</th>
<th>Tradeoff</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>materialize</strong></td>
<td>After ingestion (~30s)</td>
<td>Low latency, results persisted</td>
</tr>
<tr>
<td><strong>on_demand</strong></td>
<td>Query time (retriever stage)</td>
<td>Always-fresh reference data, higher latency</td>
</tr>
<tr>
<td><strong>retroactive</strong></td>
<td>Manual trigger via API</td>
<td>Batch reclassification after taxonomy updates</td>
</tr>
</tbody>
</table>
<!--kg-card-end: html-->
<p>Attach to a collection:</p><pre><code class="language-json">{
  &quot;taxonomy_applications&quot;: [
    { &quot;taxonomy_id&quot;: &quot;tax_content_classification&quot;, &quot;execution_mode&quot;: &quot;materialize&quot; }
  ]
}
</code></pre><h3 id="test-before-you-materialize">Test before you materialize</h3><pre><code class="language-bash">curl -sS -X POST &quot;$MP_API_URL/v1/taxonomies/&lt;taxonomy_id&gt;/enrich&quot; \
  -H &quot;Authorization: Bearer $MP_API_KEY&quot; \
  -H &quot;X-Namespace: $MP_NAMESPACE&quot; \
  -H &quot;Content-Type: application/json&quot; \
  -d &apos;{
    &quot;source_documents&quot;: [
      { &quot;document_id&quot;: &quot;doc_test_001&quot;, &quot;mixpeek://clip@v1/scene_embedding&quot;: [0.12, 0.34] }
    ],
    &quot;mode&quot;: &quot;on_demand&quot;
  }&apos;
</code></pre><p>If categories are wrong, add more reference examples. The taxonomy improves because matching is based on collection contents. No model retraining required.</p><h2 id="governance">Governance</h2><p>There is no finished taxonomy. Updating a multimodal taxonomy means updating its reference collections, not rewriting rules or retraining models. Add examples, remove outdated categories, and the taxonomy adapts.</p><p><a href="https://mixpeek.com/docs/enrichment/taxonomies?ref=blog.mixpeek.com">Version your taxonomies</a> before structural changes. Use <a href="https://mixpeek.com/docs/api-reference/collection-taxonomies/apply-taxonomy-to-existing-documents?ref=blog.mixpeek.com">retroactive application</a> to reclassify existing documents after updates. Combine with <a href="https://mixpeek.com/docs/enrichment/clusters?ref=blog.mixpeek.com">clustering</a> to discover new category candidates from unmatched documents.</p><hr><p>Start flat. Add hierarchy when you need precision. Version everything. Update reference collections instead of rewriting rules.</p><p><em>Ready to build? </em><a href="https://mixpeek.com/start?ref=blog.mixpeek.com"><em>Get started with Mixpeek</em></a><em> or explore the </em><a href="https://mixpeek.com/docs/api-reference/taxonomies/create-taxonomy?ref=blog.mixpeek.com"><em>taxonomy API reference</em></a><em>.</em></p>]]></content:encoded></item><item><title><![CDATA[Object Storage Comparison 2026: 21 Providers, Real Pricing, and the Gotchas Nobody Tells You]]></title><description><![CDATA[We compared 21 S3-compatible object storage providers across pricing, egress, features, and fine print. AWS S3 costs 15x more than the cheapest alternative for the same workload. Here's everything we found.]]></description><link>http://blog.mixpeek.com/object-storage-comparison-2026/</link><guid isPermaLink="false">69d9531e3baecafdb7f8debf</guid><category><![CDATA[Object Storage]]></category><category><![CDATA[Cloud Infrastructure]]></category><category><![CDATA[S3]]></category><category><![CDATA[Cost Optimization]]></category><category><![CDATA[Developer Tools]]></category><category><![CDATA[Comparison]]></category><dc:creator><![CDATA[Ethan Steininger]]></dc:creator><pubDate>Fri, 10 Apr 2026 19:58:00 GMT</pubDate><media:content url="http://blog.mixpeek.com/content/images/2026/04/storage-comparison-hero.png" medium="image"/><content:encoded><![CDATA[<img src="http://blog.mixpeek.com/content/images/2026/04/storage-comparison-hero.png" alt="Object Storage Comparison 2026: 21 Providers, Real Pricing, and the Gotchas Nobody Tells You"><p>We built <a href="https://github.com/mixpeek/awesome-object-storage?ref=blog.mixpeek.com">Awesome Object Storage</a> because we got tired of discovering gotchas <em>after</em> migrating 50 TB. Wasabi&apos;s 90-day minimum retention. R2&apos;s missing versioning. 
DigitalOcean&apos;s 5 GB object cap masquerading as &quot;unlimited.&quot; Every claim sourced, every gotcha earned the hard way.</p><p>This is what we found after comparing 21 S3-compatible providers across pricing, features, durability, compliance, and the fine print that actually breaks migrations.</p><hr><p>Resources</p><ul><li><a href="https://storage.mixpeek.com/?ref=blog.mixpeek.com">Interactive Cost Calculator</a> &#x2014; plug in your usage, compare all 21 providers</li><li><a href="https://github.com/mixpeek/awesome-object-storage?ref=blog.mixpeek.com">GitHub: awesome-object-storage</a> &#x2014; full dataset, JSON schemas, open-source</li><li><a href="https://mixpeek.com/curated-lists/best-s3-compatible-object-storage?ref=blog.mixpeek.com">Ranked Listicle</a> &#x2014; top 10 providers ranked with pros, cons, and pricing</li></ul><h2 id="the-pricing-lie-storage-cost-is-the-least-important-number">The Pricing Lie: Storage Cost Is the Least Important Number</h2><p>When teams evaluate object storage, they compare the per-GB storage price. That&apos;s the wrong number.</p><p>Egress is where the real bill hides. AWS S3 charges $0.09/GB to move data out. Google Cloud charges $0.12/GB &#x2014; the highest of the big three. On a workload that reads 10 TB/month, that&apos;s <strong>$900&#x2013;$1,200/month in egress alone</strong>, dwarfing the storage cost.</p><p>Meanwhile, Cloudflare R2, Tigris, and Backblaze B2 (via Cloudflare Bandwidth Alliance) offer zero or near-zero egress. For a 10 TB stored / 5 TB egress workload:</p>
<!--kg-card-begin: html-->
<table>
<thead><tr><th>Provider</th><th>Monthly Cost</th><th>vs. AWS S3</th></tr></thead>
<tbody>
<tr><td>AWS S3</td><td>$689</td><td>&#x2014;</td></tr>
<tr><td>Google Cloud Storage</td><td>$810</td><td>+18%</td></tr>
<tr><td>Cloudflare R2</td><td>$158</td><td><strong>-77%</strong></td></tr>
<tr><td>Backblaze B2</td><td>$110</td><td><strong>-84%</strong></td></tr>
<tr><td>Wasabi</td><td>$49</td><td><strong>-93%</strong></td></tr>
<tr><td>IDrive e2</td><td>$46</td><td><strong>-93%</strong></td></tr>
</tbody>
</table>
<!--kg-card-end: html-->
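<p>If you want to sanity-check those rows, the math is two multiplications per provider. The sketch below uses approximate list rates (our assumptions at the time of writing; check each provider&apos;s pricing page) and ignores request and operation fees, so it lands a little under the table&apos;s figures.</p><pre><code class="language-python"># Rough monthly cost for 10 TB stored / 5 TB egress (decimal GB).
# Rates are assumed list prices, USD per GB per month.
RATES = {
    &quot;AWS S3&quot;:        {&quot;storage&quot;: 0.023, &quot;egress&quot;: 0.09},
    &quot;Cloudflare R2&quot;: {&quot;storage&quot;: 0.015, &quot;egress&quot;: 0.00},
}

def monthly_cost(storage_gb, egress_gb, rates):
    return storage_gb * rates[&quot;storage&quot;] + egress_gb * rates[&quot;egress&quot;]

for provider, rates in RATES.items():
    print(provider, round(monthly_cost(10_000, 5_000, rates)))
# AWS S3 680          close to the table&apos;s $689; request fees likely make up the gap
# Cloudflare R2 150   close to the table&apos;s $158; Class A/B operations likely make up the gap
</code></pre>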
<p>AWS S3 costs <strong>15x more</strong> than the cheapest alternative for the same workload. Run the numbers for your own usage at <a href="https://storage.mixpeek.com/?ref=blog.mixpeek.com">storage.mixpeek.com</a>.</p><h2 id="the-escape-cost-what-nobody-compares">The Escape Cost: What Nobody Compares</h2><p>Here&apos;s the number that matters most and gets compared least: <strong>what does it cost to leave?</strong></p>
<!--kg-card-begin: html-->
<table>
<thead><tr><th>Provider</th><th>Cost to Move 100 TB Out</th></tr></thead>
<tbody>
<tr><td>AWS S3</td><td>$9,000</td></tr>
<tr><td>Google Cloud Storage</td><td>$12,000</td></tr>
<tr><td>Azure Blob</td><td>$8,700</td></tr>
<tr><td>Cloudflare R2</td><td><strong>$0</strong></td></tr>
<tr><td>Backblaze B2</td><td>$1,000</td></tr>
<tr><td>Wasabi</td><td>$0*</td></tr>
</tbody>
</table>
<!--kg-card-end: html-->
<p><em>*Wasabi &quot;free&quot; egress is subject to a reasonable-use policy: monthly egress can&apos;t exceed your stored volume.</em></p><p>If you&apos;re storing 100 TB on GCS and want to leave, it&apos;ll cost you $12,000 just in egress fees. That&apos;s not a technical lock-in &#x2014; it&apos;s a financial one.</p><h2 id="the-8-gotchas-that-will-break-your-migration">The 8 Gotchas That Will Break Your Migration</h2><p>Every one of these burned a real team. We know because we were some of them.</p><h3 id="1-wasabis-90-day-minimum-retention">1. Wasabi&apos;s 90-Day Minimum Retention</h3><p>Delete an object after 30 days? You still pay for 90. This is <em>per-object</em>, not per-account. On a dataset with high churn, this can double your effective storage cost. We found out after migrating 50 TB.</p><h3 id="2-r2-has-no-versioning-and-no-object-lock">2. R2 Has No Versioning and No Object Lock</h3><p>Cloudflare R2 is the darling of zero-egress storage. But it has no versioning, no object lock, and no WORM compliance. If you need immutable backups or regulatory compliance, R2 isn&apos;t an option &#x2014; full stop. <a href="https://www.tigrisdata.com/?ref=blog.mixpeek.com">Tigris</a> fills this gap with zero egress <em>plus</em> versioning and object lock.</p><h3 id="3-digitalocean-spaces-caps-objects-at-5-gb">3. DigitalOcean Spaces Caps Objects at 5 GB</h3><p>The pricing page says &quot;unlimited storage.&quot; The fine print says max object size is <strong>5 GB</strong> &#x2014; not 5 TB like every other provider. Vultr and Linode have the same 5 GB cap. If you&apos;re storing video, backups, or ML model checkpoints, these are non-starters.</p><h3 id="4-s3-compatible-is-a-spectrum">4. &quot;S3 Compatible&quot; Is a Spectrum</h3><p>Full S3 compatibility means passing the AWS SDK test suite &#x2014; multipart uploads, presigned URLs, bucket notifications, S3 Select, batch operations. Most providers only support a subset. Azure Blob&apos;s S3 compatibility is still in <em>preview</em>. Trust your integration tests, not the compatibility page.</p><h3 id="5-gcs-has-the-highest-egress-of-the-big-three">5. GCS Has the Highest Egress of the Big Three</h3><p>Google Cloud Storage charges $0.12/GB for egress &#x2014; 33% more than AWS and 38% more than Azure. If your workload is read-heavy, GCS is quietly the most expensive hyperscaler.</p><h3 id="6-archive-tiers-have-minimum-retention-traps">6. Archive Tiers Have Minimum Retention Traps</h3><p>GCS Archive has a <strong>365-day</strong> minimum retention. Azure Cold has 180 days. Delete early and you pay the full retention period anyway. OVHcloud applies a 30-day minimum to <em>all</em> tiers, not just archive.</p><h3 id="7-event-notifications-are-basically-awsgcsminio-only">7. Event Notifications Are Basically AWS/GCS/MinIO Only</h3><p>If your architecture depends on &quot;object created &#x2192; trigger processing,&quot; your options are narrow. S3 (SNS/SQS/Lambda/EventBridge), GCS (Pub/Sub), R2 (Workers), and MinIO (Webhooks/Kafka/NATS) have real event systems. Most alternatives don&apos;t.</p><h3 id="8-durability-claims-vary-in-substance">8. Durability Claims Vary in Substance</h3><p>Everyone claims 11 nines (99.999999999%). But Vultr, DigitalOcean, and Linode don&apos;t publish verifiable durability data. Backblaze publishes drive failure statistics openly. 
When a provider won&apos;t show their math, the number is marketing, not engineering.</p><h2 id="the-decision-framework">The Decision Framework</h2><p>After testing all 21 providers, here&apos;s how we&apos;d decide:</p><ul><li><strong>Cheapest raw storage:</strong> Storj ($0.004/GB) or IDrive e2 ($0.004/GB)</li><li><strong>Zero egress, no asterisks:</strong> Cloudflare R2</li><li><strong>Zero egress + versioning + object lock:</strong> Tigris or Impossible Cloud</li><li><strong>CDN origin:</strong> R2 (Cloudflare native) or Fastly Object Storage</li><li><strong>Compliance / WORM:</strong> AWS S3 Object Lock or Wasabi (accept the 90-day minimum)</li><li><strong>EU data sovereignty:</strong> Hetzner (cheapest), Scaleway, or OVHcloud</li><li><strong>Self-hosted:</strong> MinIO (only serious option)</li><li><strong>Biggest free tier:</strong> Oracle Cloud (10 TB/mo free egress)</li><li><strong>Already on AWS and can&apos;t leave:</strong> S3 Intelligent-Tiering + aggressive lifecycle rules</li></ul><h2 id="what-happens-after-you-store-it">What Happens After You Store It</h2><p>Object storage used to be a write-and-forget tier. That&apos;s changing. AWS launched <a href="https://aws.amazon.com/s3/vectors/?ref=blog.mixpeek.com">S3 Vectors</a> &#x2014; vector search built into S3 itself. <a href="https://turbopuffer.com/?ref=blog.mixpeek.com">turbopuffer</a> runs a vector database on top of S3. <a href="https://lancedb.com/?ref=blog.mixpeek.com">LanceDB</a> stores vector indices as objects.</p><p>At <a href="https://mixpeek.com/?ref=blog.mixpeek.com">Mixpeek</a>, we built our processing pipeline to treat object storage as the source of truth for multimodal data &#x2014; images, video, documents, audio &#x2014; with feature extraction, embedding, and search all flowing from the bucket. Your storage layer isn&apos;t just storage anymore. It&apos;s the foundation of your intelligence layer.</p><p>If you want to make your stored objects searchable and queryable across modalities, <a href="https://mixpeek.com/?ref=blog.mixpeek.com">try Mixpeek</a> &#x2014; it connects to any S3-compatible bucket and turns your data into something you can actually reason over.</p><h2 id="try-it-yourself">Try It Yourself</h2><p>We open-sourced the full dataset &#x2014; all 21 providers, ~60 data points each, machine-readable JSON &#x2014; at <a href="https://github.com/mixpeek/awesome-object-storage?ref=blog.mixpeek.com">github.com/mixpeek/awesome-object-storage</a>.</p><p>Run your own cost comparison at <a href="https://storage.mixpeek.com/?ref=blog.mixpeek.com">storage.mixpeek.com</a>.</p><p>If we got something wrong or a provider updated their pricing, <a href="https://github.com/mixpeek/awesome-object-storage/pulls?ref=blog.mixpeek.com">open a PR</a>. Every claim is sourced. 
Every number is verifiable.</p>]]></content:encoded></item><item><title><![CDATA[Building a Kalshi Trading Bot with Semantic Search and LLM Extraction]]></title><description><![CDATA[How we built an autonomous Kalshi trading bot using the Kalshi API and Mixpeek's video transcription, semantic search, and LLM data extraction no external tools required.]]></description><link>http://blog.mixpeek.com/kalshi-trading-bot-semantic-search-llm-extraction/</link><guid isPermaLink="false">69cead063baecafdb7f8dafd</guid><category><![CDATA[Engineering]]></category><category><![CDATA[Use Cases]]></category><category><![CDATA[LLM]]></category><category><![CDATA[Retrievers]]></category><category><![CDATA[Prediction Markets]]></category><dc:creator><![CDATA[Ethan Steininger]]></dc:creator><pubDate>Thu, 02 Apr 2026 18:01:16 GMT</pubDate><media:content url="http://blog.mixpeek.com/content/images/2026/04/ChatGPT-Image-Apr-2--2026--01_49_35-PM-1.png" medium="image"/><content:encoded><![CDATA[<img src="http://blog.mixpeek.com/content/images/2026/04/ChatGPT-Image-Apr-2--2026--01_49_35-PM-1.png" alt="Building a Kalshi Trading Bot with Semantic Search and LLM Extraction"><p>We built an autonomous <strong>Kalshi trading bot</strong> that uses the <strong>Kalshi API</strong> and Mixpeek&apos;s multimodal data platform to trade mention markets in real-time. The system feeds YouTube URLs directly into Mixpeek &#x2014; which handles transcription, embedding, and LLM extraction &#x2014; then queries the results through a semantic retriever to generate calibrated trading signals. Zero manual intervention.</p><p>This post walks through every component: how Mixpeek ingests and transcribes political video, uses <strong>LLM data extraction</strong> to structure it, queries it with a <strong>semantic search API</strong>, and turns the output into <strong>automated market making</strong> decisions on the <strong>prediction market API</strong> from Kalshi.</p><hr><h2 id="what-are-kalshi-mention-markets">What Are Kalshi Mention Markets?</h2><p>Kalshi&apos;s mention markets are binary contracts on whether a public figure will say a specific word. Examples:</p><ul><li><em>&quot;Will Trump say &apos;tariff&apos; in his next address?&quot;</em> &#x2014; ticker: KXTRUMPMENTIONB-26APR01-TARI</li><li><em>&quot;Will the Fed Chair mention &apos;inflation&apos;?&quot;</em> &#x2014; ticker: KXFEDMENTION-26APR-INFL</li><li><em>&quot;Will the Press Secretary say &apos;China&apos;?&quot;</em> &#x2014; ticker: KXSECPRESSMENTION-26APR30-CHIN</li></ul><p>These markets resolve based on official transcripts. The edge comes from processing political speech <strong>faster and more accurately</strong> than the market &#x2014; knowing <em>who</em> said it, <em>how surprising</em> it was, and whether the keyword appeared in a policy-relevant context.</p><p>Most <strong>Kalshi trading bots</strong> rely on simple keyword matching. Ours uses Mixpeek&apos;s full resource chain for semantic understanding.</p><hr><h2 id="system-architecture-six-mixpeek-resources">System Architecture: Six Mixpeek Resources</h2><p>The pipeline chains six Mixpeek primitives. Mixpeek handles everything from video download and transcription to embedding and LLM extraction &#x2014; no external tools required:</p><pre><code>YouTube URL
  1. Namespace  &#x2192; data isolation
  2. Bucket     &#x2192; accepts YouTube URLs as type: &quot;video&quot;
  3. Collection &#x2192; auto-transcription + text embedding + LLM extraction
  4. Retriever  &#x2192; semantic search across processed documents
  5. Bucket     &#x2192; trade history logging (feedback loop)
  6. Retriever  &#x2192; historical calibration from past trades</code></pre><p>If you&apos;ve used a <strong>prediction market API</strong> before (Kalshi, Polymarket, etc.), you know the data challenge: markets move on unstructured information &#x2014; speeches, press briefings, hearings &#x2014; that doesn&apos;t fit neatly into a database. Mixpeek bridges that gap.</p><figure class="kg-card kg-image-card"><img src="http://blog.mixpeek.com/content/images/2026/04/Xnapper-2026-04-02-14.00.26.jpg" class="kg-image" alt="Building a Kalshi Trading Bot with Semantic Search and LLM Extraction" loading="lazy" width="2000" height="1281" srcset="http://blog.mixpeek.com/content/images/size/w600/2026/04/Xnapper-2026-04-02-14.00.26.jpg 600w, http://blog.mixpeek.com/content/images/size/w1000/2026/04/Xnapper-2026-04-02-14.00.26.jpg 1000w, http://blog.mixpeek.com/content/images/size/w1600/2026/04/Xnapper-2026-04-02-14.00.26.jpg 1600w, http://blog.mixpeek.com/content/images/size/w2400/2026/04/Xnapper-2026-04-02-14.00.26.jpg 2400w" sizes="(min-width: 720px) 720px"></figure><p></p><hr><h2 id="resource-1-namespace-%E2%80%94-data-isolation">Resource 1: Namespace &#x2014; Data Isolation</h2><pre><code class="language-yaml">namespace_id: ns_7c8f877d9b
name: prediction-market-alpha</code></pre><p>Every resource lives inside a single namespace, isolating prediction market data from other workloads. All API calls include the <code>X-Namespace</code> header. This is standard practice when using Mixpeek as a multimodal data pipeline &#x2014; one namespace per use case.</p><hr><h2 id="resource-2-hearing-bucket-%E2%80%94-video-ingestion">Resource 2: Hearing Bucket &#x2014; Video Ingestion</h2><pre><code class="language-yaml">bucket_id: bkt_be6b9536</code></pre><p>We monitor four YouTube channels (White House, C-SPAN, C-SPAN Senate, Federal Reserve). When a new video appears, we push the YouTube URL directly to Mixpeek &#x2014; no need for external transcription tools:</p><pre><code class="language-json">POST /v1/buckets/bkt_be6b9536/objects
{
  &quot;blobs&quot;: [
    {
      &quot;property&quot;: &quot;url&quot;,
      &quot;type&quot;: &quot;video&quot;,
      &quot;data&quot;: &quot;https://www.youtube.com/watch?v=7d-3oqka-fE&quot;
    },
    {&quot;property&quot;: &quot;source&quot;, &quot;type&quot;: &quot;string&quot;, &quot;data&quot;: &quot;white-house&quot;},
    {&quot;property&quot;: &quot;event_type&quot;, &quot;type&quot;: &quot;string&quot;, &quot;data&quot;: &quot;press_briefing&quot;}
  ]
}</code></pre><p>That&apos;s it. Mixpeek downloads the video, extracts the audio, transcribes it, and makes the text available to the collection pipeline. A single API call replaces what would otherwise require <code>yt-dlp</code> for download, <code>whisper</code> or <code>youtube-transcript-api</code> for transcription, and a custom chunking pipeline.</p><p>A single day&apos;s political speech typically yields 5-10 videos totaling 200-400K characters of transcript.</p><hr><h2 id="resource-3-collection-%E2%80%94-embedding-llm-data-extraction">Resource 3: Collection &#x2014; Embedding + LLM Data Extraction</h2><pre><code class="language-yaml">collection_id: col_2a9565df60</code></pre><p>This is the core of the system. Once Mixpeek transcribes the video, the collection runs two extractors on the resulting text:</p><ol><li><strong>Dense vector embedding</strong> via <code>multilingual_e5_large_instruct_v1</code> &#x2014; enables semantic search across all transcript chunks</li><li><strong>LLM structured extraction</strong> via Mixpeek&apos;s <code>response_shape</code> &#x2014; Claude analyzes each chunk and extracts seven fields</li></ol><p>The <code>response_shape</code> configuration defines the extraction schema:</p><pre><code class="language-json">{
  &quot;speaker&quot;: &quot;who is speaking (e.g. President Trump, Fed Chair Powell)&quot;,
  &quot;statement_type&quot;: &quot;policy_announcement | press_response | hearing_testimony | ...&quot;,
  &quot;policy_direction&quot;: &quot;hawkish | dovish | neutral | escalatory | ...&quot;,
  &quot;keywords_mentioned&quot;: [&quot;tariff&quot;, &quot;china&quot;, &quot;inflation&quot;, ...],
  &quot;is_surprising&quot;: true/false,
  &quot;surprise_magnitude&quot;: 0.0 - 1.0,
  &quot;market_impact&quot;: 0.0 - 1.0
}</code></pre><p>This is <strong>LLM data extraction</strong> at scale &#x2014; every chunk gets speaker attribution, policy context, and market relevance scoring without any custom LLM pipeline. A batch of 7 videos (357K chars of transcript) produces 150+ indexed documents with all seven fields.</p><hr><h2 id="resource-4-signal-retriever-%E2%80%94-semantic-search-api">Resource 4: Signal Retriever &#x2014; Semantic Search API</h2><pre><code class="language-yaml">retriever_id: ret_37fcabc4144e76
name: signal-market-matcher</code></pre><p>The retriever is configured as a <strong>semantic search API</strong> endpoint that queries the collection using the E5 embedding model:</p><pre><code class="language-json">{
  &quot;stages&quot;: [{
    &quot;stage_name&quot;: &quot;semantic-search&quot;,
    &quot;config&quot;: {
      &quot;stage_id&quot;: &quot;feature_search&quot;,
      &quot;parameters&quot;: {
        &quot;searches&quot;: [{
          &quot;feature_uri&quot;: &quot;mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1&quot;,
          &quot;query&quot;: {&quot;input_mode&quot;: &quot;text&quot;, &quot;value&quot;: &quot;{{INPUT.query}}&quot;}
        }],
        &quot;final_top_k&quot;: 30
      }
    }
  }]
}</code></pre><p>The engine sends three query groups per cycle to maximize coverage:</p><ul><li><strong>Political/policy terms</strong> &#x2014; tariff, china, iran, trade, sanctions, immigration</li><li><strong>People/institution names</strong> &#x2014; trump, powell, leavitt, fed, congress, senate</li><li><strong>Random keyword sample</strong> &#x2014; surfaces unexpected matches from new transcripts</li></ul><p>Each query returns up to 30 semantically relevant chunks with their LLM-extracted fields intact &#x2014; speaker, surprise magnitude, market impact, and all.</p><hr><h2 id="resource-5-trade-history-bucket-%E2%80%94-feedback-loop">Resource 5: Trade History Bucket &#x2014; Feedback Loop</h2><pre><code class="language-yaml">bucket_id: bkt_0e439f96</code></pre><p>Every trade decision &#x2014; executed or skipped &#x2014; gets logged back to Mixpeek. This creates a searchable archive of what the <strong>Kalshi trading bot</strong> traded, at what price, with what signal quality, and whether it won or lost. The feedback loop is what separates a <strong>prediction market bot</strong> from a simple alert system.</p><hr><h2 id="resource-6-history-retriever-%E2%80%94-edge-calibration">Resource 6: History Retriever &#x2014; Edge Calibration</h2><pre><code class="language-yaml">retriever_id: ret_2674a0d675b62f
name: signal-history</code></pre><p>Before placing any order through the <strong>Kalshi API</strong>, the engine queries the history retriever:</p><pre><code class="language-http">POST /v1/retrievers/ret_2674a0d675b62f/execute
{&quot;inputs&quot;: {&quot;query&quot;: &quot;tariff&quot;}}</code></pre><p>Past trades for the same keyword feed a win-rate calculation that adjusts the expected edge. If historical &quot;tariff&quot; trades won 70% of the time, the engine sizes up. If 30%, it skips. This is what makes the system self-improving &#x2014; a form of <strong>automated market making</strong> that learns from its own history via Mixpeek&apos;s retriever infrastructure.</p><hr><h2 id="three-intelligence-layers-from-signal-to-trade">Three Intelligence Layers: From Signal to Trade</h2><p>The six resources feed into three scoring layers:</p><h3 id="layer-1-signal-quality-scoring">Layer 1: Signal Quality Scoring</h3><p>Powered by the collection&apos;s <code>response_shape</code> LLM extraction:</p><ul><li><strong>Speaker authority</strong> &#x2014; &quot;President Trump&quot; (1.0), &quot;Fed Chair Powell&quot; (0.95), unknown press pool (0.50)</li><li><strong>Statement type</strong> &#x2014; policy announcements and hearing testimony score higher than casual references</li><li><strong>Surprise factor</strong> &#x2014; <code>is_surprising=true</code> with high <code>surprise_magnitude</code> &#x2192; larger position size</li><li><strong>Market impact</strong> &#x2014; LLM-estimated probability that the mention moves the market</li></ul><h3 id="layer-2-portfolio-construction">Layer 2: Portfolio Construction</h3><ul><li>Category exposure caps: max $3 per market category</li><li>Per-market position limits: $10 max</li><li>Daily loss circuit breaker: $10 max drawdown</li></ul><h3 id="layer-3-historical-calibration">Layer 3: Historical Calibration</h3><ul><li>Win rate from past trades adjusts edge estimates up or down</li><li>Keywords with poor track record get automatically de-risked</li><li>New keywords start at 50% base rate until history accumulates</li></ul><hr><h2 id="live-trading-results">Live Trading Results</h2><p>The engine runs autonomously, polling every 2 minutes. Here&apos;s actual output from a live cycle:</p><pre><code>Cycle 1: 28 signals found &#x2192; 7 trades attempted, 23 skipped

BUY signals (positive edge):
  &quot;iran&quot; on KXFEDMENTION-26APR-IRAN
    speaker=President Trump, quality=0.63, edge=+0.43 &#x2192; 1x YES @ $0.25
  &quot;volatility&quot; on KXFEDMENTION-26APR-VOLA
    speaker=President Trump, quality=1.00, edge=+0.50 &#x2192; 1x YES @ $0.28
  &quot;bitcoin&quot; on KXSECPRESSMENTION-26APR30-CRYP
    speaker=Press Secretary, quality=1.00, edge=+0.65 &#x2192; 1x YES @ $0.13

SKIPPED signals (negative edge or caps):
  &quot;russia&quot; on KXSECPRESSMENTION &#x2192; quality=0.61, edge=-0.09 &#x2192; SKIP
  &quot;border&quot; on KXSECPRESSMENTION &#x2192; quality=0.61, edge=-0.17 &#x2192; SKIP
  &quot;oil&quot; on KXLEAVITTSMFMENTION &#x2192; category cap $3.08/$3.00 &#x2192; SKIP</code></pre><p>The engine correctly rejects low-quality signals and respects portfolio limits, while aggressively buying high-conviction signals from authoritative speakers.</p><hr><h2 id="why-this-beats-simple-keyword-matching">Why This Beats Simple Keyword Matching</h2><p>Most <strong>Kalshi trading bots</strong> and <strong>prediction market bots</strong> use basic keyword detection &#x2014; grep the transcript for &quot;tariff&quot; and buy. That approach fails in practice:</p><ul><li><strong>False positives</strong> &#x2014; &quot;The tariff discussion from last year...&quot; doesn&apos;t mean they said &quot;tariff&quot; in a policy context today</li><li><strong>No speaker attribution</strong> &#x2014; a reporter asking &quot;Will you impose tariffs?&quot; is very different from the President saying &quot;I&apos;m imposing tariffs&quot;</li><li><strong>No surprise weighting</strong> &#x2014; Trump saying &quot;tariff&quot; (expected) should size differently than Powell saying &quot;tariff&quot; (unexpected)</li><li><strong>No learning</strong> &#x2014; keyword bots make the same mistakes repeatedly with no feedback loop</li></ul><p>Mixpeek&apos;s <code>response_shape</code> extraction solves all four. The <strong>semantic search API</strong> finds contextually relevant chunks, the LLM extraction gives you structured fields, and the history retriever calibrates over time.</p><hr><h2 id="technical-stack">Technical Stack</h2><ul><li><strong>Video ingestion</strong> &#x2014; YouTube URLs pushed directly to Mixpeek bucket as <code>type: &quot;video&quot;</code></li><li><strong>Transcription + processing</strong> &#x2014; Mixpeek auto-transcribes, chunks, embeds (E5), and extracts (Claude <code>response_shape</code>)</li><li><strong>Search</strong> &#x2014; Mixpeek retriever (<strong>semantic search API</strong> with <code>final_top_k: 30</code>)</li><li><strong>Trading</strong> &#x2014; <strong>Kalshi API</strong> with RSA-PSS authentication for order placement</li><li><strong>Feedback</strong> &#x2014; Mixpeek trade history bucket + history retriever for calibration</li><li><strong>Runtime</strong> &#x2014; FastAPI server with async polling loop (Python)</li></ul><p>The entire intelligence layer &#x2014; from YouTube URL to calibrated trading signal &#x2014; runs on six Mixpeek resource IDs. No transcription tools, no custom vector database, no LLM prompt engineering, no embedding pipeline to maintain.</p><hr><h2 id="get-started">Get Started</h2><p>Mixpeek handles the hard parts of <strong>unstructured data processing</strong> &#x2014; video transcription, chunking, embedding, LLM extraction, vector search, and batch processing &#x2014; so you can focus on your domain logic. Whether you&apos;re building a <strong>Kalshi trading bot</strong>, a content moderation pipeline, or a multimodal search engine, the same resource primitives apply.</p><ul><li><strong>Mixpeek docs</strong> &#x2014; mixpeek.com/docs</li><li><strong>Kalshi API docs</strong> &#x2014; docs.kalshi.com</li><li><strong>Source code</strong> &#x2014; the complete engine is open-source in our research repo</li></ul>]]></content:encoded></item><item><title><![CDATA[The Multimodal Data Warehouse: Why Unstructured Data Needs Its Own Snowflake]]></title><description><![CDATA[We are drowning in unstructured data — video, audio, images, documents, IoT — but our infrastructure still assumes everything is a row or a vector. 
The multimodal data warehouse is the missing layer: object decomposition, tiered storage, and multi-stage retrieval pipelines for the AI era.]]></description><link>http://blog.mixpeek.com/multimodal-data-warehouse/</link><guid isPermaLink="false">69c7d8d13baecafdb7f8d65b</guid><category><![CDATA[Multimodal AI]]></category><category><![CDATA[Data Warehouse]]></category><category><![CDATA[Architecture]]></category><category><![CDATA[Thought Leadership]]></category><category><![CDATA[Vector Database]]></category><category><![CDATA[Unstructured Data]]></category><dc:creator><![CDATA[Ethan Steininger]]></dc:creator><pubDate>Sat, 28 Mar 2026 13:34:11 GMT</pubDate><media:content url="http://blog.mixpeek.com/content/images/2026/03/feature-image-6.png" medium="image"/><content:encoded><![CDATA[<img src="http://blog.mixpeek.com/content/images/2026/03/feature-image-6.png" alt="The Multimodal Data Warehouse: Why Unstructured Data Needs Its Own Snowflake"><p><strong>TL;DR:</strong> We&apos;re drowning in unstructured data&#x2014;video, audio, images, documents, IoT streams&#x2014;but our infrastructure still assumes everything is a row in a table or a vector in an index. The <strong>multimodal data warehouse</strong> is the missing layer: a system that decomposes objects into searchable features, stores them across hot and cold tiers, and reassembles them through multi-stage retrieval pipelines. This isn&apos;t a database. It&apos;s the warehouse for the AI era.</p><h2 id="the-120-trillion-problem-nobody-talks-about">The $120 Trillion Problem Nobody Talks About</h2><p>Here&apos;s an uncomfortable truth: <strong>80-90% of enterprise data is unstructured</strong>, and it&apos;s growing 3x faster than structured data. IDC projects the global datasphere will hit 175 zettabytes by 2025&#x2014;and the vast majority of that is video, images, audio, documents, sensor data, and formats that don&apos;t fit in Snowflake.</p><p>Yet when companies build AI-native applications, they cobble together:</p><ul><li>A vector database for embeddings (Pinecone, Qdrant, Weaviate)</li><li>An object store for raw files (S3, GCS)</li><li>A separate search engine for text (Elasticsearch)</li><li>Custom ETL for each modality</li><li>Bespoke inference pipelines per use case</li></ul><p>This is the <strong>modern data Frankenstein</strong>&#x2014;a stitched-together monster where every new modality means a new system, a new integration, and a new failure mode.</p>
<!--kg-card-begin: html-->
<figure style="margin: 40px 0; text-align: center;"><svg xmlns="http://www.w3.org/2000/svg" viewbox="0 0 900 720" width="900" height="720">
  <defs>
    <lineargradient id="bg" x1="0%" y1="0%" x2="100%" y2="100%">
      <stop offset="0%" style="stop-color:#0f0f23"/>
      <stop offset="100%" style="stop-color:#1a1a3e"/>
    </lineargradient>
  </defs>

  <rect width="900" height="720" rx="16" fill="url(#bg)"/>

  <!-- === TOP HALF: Frankenstein === -->
  <text x="450" y="44" fill="#ef4444" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="22" font-weight="700" text-anchor="middle" letter-spacing="1">THE DATA FRANKENSTEIN</text>
  <text x="450" y="66" fill="#ef444480" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="14" text-anchor="middle">(status quo)</text>

  <!-- Four siloed boxes -->
  <rect x="40" y="88" width="185" height="76" rx="10" fill="#ef444412" stroke="#ef4444" stroke-width="2" stroke-dasharray="6,4"/>
  <text x="132" y="118" fill="#fca5a5" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="16" font-weight="600" text-anchor="middle">S3 / GCS</text>
  <text x="132" y="140" fill="#fca5a566" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="12" text-anchor="middle">raw files</text>

  <rect x="248" y="88" width="185" height="76" rx="10" fill="#ef444412" stroke="#ef4444" stroke-width="2" stroke-dasharray="6,4"/>
  <text x="340" y="118" fill="#fca5a5" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="16" font-weight="600" text-anchor="middle">Pinecone</text>
  <text x="340" y="140" fill="#fca5a566" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="12" text-anchor="middle">vectors only</text>

  <rect x="456" y="88" width="185" height="76" rx="10" fill="#ef444412" stroke="#ef4444" stroke-width="2" stroke-dasharray="6,4"/>
  <text x="548" y="118" fill="#fca5a5" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="16" font-weight="600" text-anchor="middle">Elasticsearch</text>
  <text x="548" y="140" fill="#fca5a566" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="12" text-anchor="middle">text search</text>

  <rect x="664" y="88" width="185" height="76" rx="10" fill="#ef444412" stroke="#ef4444" stroke-width="2" stroke-dasharray="6,4"/>
  <text x="756" y="118" fill="#fca5a5" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="16" font-weight="600" text-anchor="middle">Custom ETL</text>
  <text x="756" y="140" fill="#fca5a566" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="12" text-anchor="middle">per modality</text>

  <!-- Connecting lines -->
  <line x1="132" y1="164" x2="132" y2="198" stroke="#ef444444" stroke-width="1.5"/>
  <line x1="340" y1="164" x2="340" y2="198" stroke="#ef444444" stroke-width="1.5"/>
  <line x1="548" y1="164" x2="548" y2="198" stroke="#ef444444" stroke-width="1.5"/>
  <line x1="756" y1="164" x2="756" y2="198" stroke="#ef444444" stroke-width="1.5"/>
  <line x1="132" y1="198" x2="756" y2="198" stroke="#ef444444" stroke-width="1.5"/>
  <line x1="450" y1="198" x2="450" y2="218" stroke="#ef444444" stroke-width="1.5"/>

  <!-- Your App -->
  <rect x="300" y="218" width="300" height="56" rx="10" fill="#ef444418" stroke="#ef4444" stroke-width="2"/>
  <text x="450" y="244" fill="#fca5a5" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="16" font-weight="600" text-anchor="middle">Your App</text>
  <text x="450" y="264" fill="#ef444480" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="12" text-anchor="middle">glue code + prayers</text>

  <!-- Divider -->
  <line x1="80" y1="310" x2="820" y2="310" stroke="#ffffff18" stroke-width="1"/>
  <rect x="415" y="296" width="70" height="28" rx="14" fill="#0f0f23" stroke="#ffffff33" stroke-width="1"/>
  <text x="450" y="315" fill="#ffffff77" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="15" font-weight="700" text-anchor="middle">vs.</text>

  <!-- === BOTTOM HALF: Warehouse === -->
  <text x="450" y="354" fill="#7c3aed" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="22" font-weight="700" text-anchor="middle" letter-spacing="1">THE MULTIMODAL DATA WAREHOUSE</text>

  <!-- Layer 1: Ingestion -->
  <rect x="60" y="376" width="780" height="64" rx="10" fill="#7c3aed15" stroke="#7c3aed" stroke-width="2"/>
  <text x="450" y="402" fill="#c4b5fd" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="16" font-weight="700" text-anchor="middle">Object Ingestion Layer</text>
  <text x="450" y="426" fill="#c4b5fd77" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="13" text-anchor="middle">video | audio | image | doc | IoT | ...</text>

  <!-- Arrow -->
  <polygon points="444,454 450,466 456,454" fill="#7c3aed77"/>
  <text x="475" y="462" fill="#7c3aed77" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="11" text-anchor="start">decompose</text>

  <!-- Layer 2: Feature Extraction -->
  <rect x="60" y="474" width="780" height="64" rx="10" fill="#3b82f615" stroke="#3b82f6" stroke-width="2"/>
  <text x="450" y="500" fill="#93c5fd" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="16" font-weight="700" text-anchor="middle">Feature Extraction Engine</text>
  <text x="450" y="524" fill="#93c5fd77" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="13" text-anchor="middle">faces | logos | text | embeddings | spectrograms</text>

  <!-- Arrow -->
  <polygon points="444,552 450,564 456,552" fill="#3b82f677"/>
  <text x="475" y="560" fill="#3b82f677" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="11" text-anchor="start">store</text>

  <!-- Layer 3: Tiered Storage -->
  <rect x="60" y="572" width="780" height="64" rx="10" fill="#f59e0b15" stroke="#f59e0b" stroke-width="2"/>
  <text x="450" y="598" fill="#fcd34d" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="16" font-weight="700" text-anchor="middle">Tiered Storage (hot &#x2192; cold &#x2192; archive)</text>
  <text x="450" y="622" fill="#fcd34d77" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="13" text-anchor="middle">Qdrant (hot) &#x2194; S3 Vectors (canonical) &#x2194; Archive</text>

  <!-- Arrow -->
  <polygon points="444,650 450,662 456,650" fill="#f59e0b77"/>
  <text x="475" y="658" fill="#f59e0b77" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="11" text-anchor="start">query</text>

  <!-- Layer 4: Retrieval -->
  <rect x="60" y="670" width="780" height="40" rx="10" fill="#10b98115" stroke="#10b981" stroke-width="2"/>
  <text x="450" y="696" fill="#6ee7b7" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="16" font-weight="700" text-anchor="middle">Multi-Stage Retrieval: filter &#x2192; sort &#x2192; reduce &#x2192; enrich &#x2192; reassemble</text>
</svg><figcaption style="margin-top: 12px; color: #666; font-size: 0.9em;">Fig 1: From Frankenstack to unified multimodal warehouse</figcaption></figure>
<!--kg-card-end: html-->
<h2 id="what-is-a-multimodal-data-warehouse">What Is a Multimodal Data Warehouse?</h2><p>A <strong>multimodal data warehouse</strong> is an integrated system that:</p><ol><li><strong>Ingests any data type</strong>&#x2014;video, audio, images, documents, 3D models, IoT streams&#x2014;through a single API</li><li><strong>Decomposes objects</strong> into their constituent features (a video becomes frames, audio segments, transcripts, detected faces, logos, scenes)</li><li><strong>Stores features across tiers</strong> with lifecycle management (hot for real-time queries, cold for cost-efficient archival, with automatic promotion/demotion)</li><li><strong>Reassembles objects</strong> through multi-stage retrieval pipelines that can filter, sort, reduce, enrich, and join across modalities</li><li><strong>Maintains lineage</strong>&#x2014;every extracted feature traces back to its source object, timestamp, and extraction model through <a href="https://mixpeek.com/docs/processing/feature-extractors?ref=blog.mixpeek.com">feature URIs</a></li></ol><p>Think of it as Snowflake, but for unstructured data. Or S3 + a vector database + an inference engine + a query planner, collapsed into a single abstraction.</p><h2 id="the-core-primitive-object-decomposition">The Core Primitive: Object Decomposition</h2><p>Traditional databases store data as-is. You put a row in, you get a row out. But unstructured data is <em>dense</em>&#x2014;a single 30-second video contains:</p>
<!--kg-card-begin: html-->
<table style="width: 100%; border-collapse: collapse; margin: 24px 0; font-size: 0.95em;">
<thead>
<tr style="background: #f8f7ff; border-bottom: 2px solid #7c3aed;">
<th style="padding: 12px 16px; text-align: left; font-weight: 600;">Signal Type</th>
<th style="padding: 12px 16px; text-align: left; font-weight: 600;">What&apos;s Extracted</th>
<th style="padding: 12px 16px; text-align: left; font-weight: 600;">Typical Output</th>
</tr>
</thead>
<tbody>
<tr style="border-bottom: 1px solid #eee;">
<td style="padding: 12px 16px;"><strong>Visual frames</strong></td>
<td style="padding: 12px 16px;">Scene boundaries, keyframes</td>
<td style="padding: 12px 16px;">15-30 scene segments with thumbnails</td>
</tr>
<tr style="border-bottom: 1px solid #eee; background: #fafafa;">
<td style="padding: 12px 16px;"><strong>Face embeddings</strong></td>
<td style="padding: 12px 16px;">SCRFD detection &#x2192; ArcFace 512d vectors</td>
<td style="padding: 12px 16px;">Per-face identity embeddings at 99.8% accuracy</td>
</tr>
<tr style="border-bottom: 1px solid #eee;">
<td style="padding: 12px 16px;"><strong>Logo detection</strong></td>
<td style="padding: 12px 16px;">YOLOv8 detection &#x2192; SigLIP 768d embeddings</td>
<td style="padding: 12px 16px;">Brand identifications with bounding boxes</td>
</tr>
<tr style="border-bottom: 1px solid #eee; background: #fafafa;">
<td style="padding: 12px 16px;"><strong>Audio fingerprint</strong></td>
<td style="padding: 12px 16px;">Mel spectrogram &#x2192; CLAP embeddings</td>
<td style="padding: 12px 16px;">Audio signatures, music identification</td>
</tr>
<tr style="border-bottom: 1px solid #eee;">
<td style="padding: 12px 16px;"><strong>Transcript</strong></td>
<td style="padding: 12px 16px;">Whisper ASR &#x2192; word-level timestamps</td>
<td style="padding: 12px 16px;">Full text with temporal alignment</td>
</tr>
<tr style="border-bottom: 1px solid #eee; background: #fafafa;">
<td style="padding: 12px 16px;"><strong>Semantic embeddings</strong></td>
<td style="padding: 12px 16px;">SigLIP (visual), CLAP (audio), text models</td>
<td style="padding: 12px 16px;">Dense vectors for cross-modal search</td>
</tr>
<tr style="border-bottom: 1px solid #eee;">
<td style="padding: 12px 16px;"><strong>Structured metadata</strong></td>
<td style="padding: 12px 16px;">LLM-powered labeling and taxonomy assignment</td>
<td style="padding: 12px 16px;">Categories, tags, descriptions, sentiment</td>
</tr>
</tbody>
</table>
<!--kg-card-end: html-->
<p>A single video file becomes <strong>dozens of queryable features</strong>, each with its own embedding space, each stored with a <a href="https://mixpeek.com/docs/processing/feature-extractors?ref=blog.mixpeek.com">feature URI</a> that links back to the source:</p><pre><code>// Feature URI format &#x2014; every extracted signal is addressable

mixpeek://face_extractor@v1/embedding     &#x2192; ArcFace 512d vector
mixpeek://logo_extractor@v1/detection     &#x2192; YOLO bounding box + SigLIP vector
mixpeek://audio_extractor@v1/fingerprint  &#x2192; Mel spectrogram embedding
mixpeek://video_preprocessor@v1/scene     &#x2192; Scene boundary + keyframe
mixpeek://text_extractor@v1/transcript    &#x2192; Whisper ASR output

// One video in, many features out &#x2014; each independently queryable
// Each feature knows: what extracted it, when, from what source, at what timestamp</code></pre><p>This is the fundamental insight: <strong>you don&apos;t search unstructured data&#x2014;you search the features extracted from it.</strong> And different features require different models, different embedding spaces, and different query patterns. The warehouse handles this heterogeneity natively.</p><h2 id="storage-tiering-the-economics-of-multimodal">Storage Tiering: The Economics of Multimodal</h2><p>Here&apos;s where most vector database architectures fall apart: <strong>cost</strong>.</p><p>Storing every embedding in a hot vector index (Qdrant, Pinecone) works at 10K documents. At 10M documents with 5 feature types each, you&apos;re looking at 50M vectors in RAM. At $0.10/GB/month for cloud memory, that&apos;s a non-trivial line item&#x2014;and it grows linearly with every new modality you add.</p><p>A multimodal data warehouse needs <a href="https://mixpeek.com/docs/features/storage-tiering?ref=blog.mixpeek.com"><strong>storage tiering</strong></a>&#x2014;the same concept Snowflake uses for structured data, applied to vectors and features:</p>
<!--kg-card-begin: html-->
<table style="width: 100%; border-collapse: collapse; margin: 24px 0; font-size: 0.95em;">
<thead>
<tr style="background: #f8f7ff; border-bottom: 2px solid #7c3aed;">
<th style="padding: 12px 16px; text-align: left; font-weight: 600;">Tier</th>
<th style="padding: 12px 16px; text-align: left; font-weight: 600;">Storage</th>
<th style="padding: 12px 16px; text-align: left; font-weight: 600;">Latency</th>
<th style="padding: 12px 16px; text-align: left; font-weight: 600;">Cost</th>
<th style="padding: 12px 16px; text-align: left; font-weight: 600;">Use Case</th>
</tr>
</thead>
<tbody>
<tr style="border-bottom: 1px solid #eee;">
<td style="padding: 12px 16px;"><strong style="color: #ef4444;">Hot</strong></td>
<td style="padding: 12px 16px;">Qdrant (in-memory HNSW)</td>
<td style="padding: 12px 16px;">&lt; 10ms</td>
<td style="padding: 12px 16px;">$$$</td>
<td style="padding: 12px 16px;">Real-time search, active collections</td>
</tr>
<tr style="border-bottom: 1px solid #eee; background: #fafafa;">
<td style="padding: 12px 16px;"><strong style="color: #f59e0b;">Warm</strong></td>
<td style="padding: 12px 16px;">S3 Vectors (canonical store)</td>
<td style="padding: 12px 16px;">50-200ms</td>
<td style="padding: 12px 16px;">$$</td>
<td style="padding: 12px 16px;">Batch analytics, infrequent queries</td>
</tr>
<tr style="border-bottom: 1px solid #eee;">
<td style="padding: 12px 16px;"><strong style="color: #3b82f6;">Cold</strong></td>
<td style="padding: 12px 16px;">S3 (vectors only, no index)</td>
<td style="padding: 12px 16px;">200ms-1s</td>
<td style="padding: 12px 16px;">$</td>
<td style="padding: 12px 16px;">Compliance, archival, reprocessing</td>
</tr>
<tr style="border-bottom: 1px solid #eee; background: #fafafa;">
<td style="padding: 12px 16px;"><strong style="color: #6b7280;">Archive</strong></td>
<td style="padding: 12px 16px;">Metadata only</td>
<td style="padding: 12px 16px;">N/A (rehydrate)</td>
<td style="padding: 12px 16px;">&#xA2;</td>
<td style="padding: 12px 16px;">Long-term retention, lineage</td>
</tr>
</tbody>
</table>
<!--kg-card-end: html-->
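<p>A quick footprint sketch helps frame the table above. The embedding width and the hot/warm/cold split below are illustrative assumptions, not measured figures; plug in your own corpus size and unit prices.</p>
<pre><code class="language-python"># Back-of-envelope footprint for 10M documents x 5 feature types (illustrative)
DOCS, FEATURES_PER_DOC, DIMS, BYTES = 10_000_000, 5, 768, 4

vectors = DOCS * FEATURES_PER_DOC            # 50M vectors
total_gb = vectors * DIMS * BYTES / 1e9      # ~154 GB of raw float32

# Assumed lifecycle split: 10% hot, 60% warm, 30% cold
split = {&quot;hot (Qdrant)&quot;: 0.10, &quot;warm (S3 Vectors)&quot;: 0.60, &quot;cold (S3)&quot;: 0.30}
for tier, share in split.items():
    print(f&quot;{tier}: {total_gb * share:.0f} GB&quot;)

# Multiply each tier by its unit price to get a monthly bill;
# only the hot slice needs RAM-backed index capacity.</code></pre>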
<p>S3 Vectors serves as the <strong>canonical store</strong>&#x2014;the source of truth for all features. Qdrant is the hot serving layer, loaded on demand. Collections automatically transition through lifecycle states based on access patterns: <code>active &#x2192; cold &#x2192; archived</code>.</p><p>This is how you go from &quot;we can&apos;t afford to index everything&quot; to &quot;we index everything, and the system manages cost automatically.&quot;</p><h2 id="multi-stage-retrieval-the-query-language-for-unstructured-data">Multi-Stage Retrieval: The Query Language for Unstructured Data</h2><p>SQL works for structured data because every column has a known type and every row has the same schema. Unstructured data has no such luxury. A query like <em>&quot;find all videos where a celebrity appears near a competitor&apos;s logo, with negative sentiment in the audio&quot;</em> spans three modalities, two embedding spaces, and requires temporal correlation.</p><p>This is where <a href="https://mixpeek.com/docs/retrieval/retrievers?ref=blog.mixpeek.com"><strong>multi-stage retrieval pipelines</strong></a> come in. Instead of a single query, you compose a pipeline of stages:</p>
<!--kg-card-begin: html-->
<figure style="margin: 40px 0; text-align: center;"><svg xmlns="http://www.w3.org/2000/svg" viewbox="0 0 900 700" width="900" height="700">
  <defs>
    <lineargradient id="bg2" x1="0%" y1="0%" x2="100%" y2="100%">
      <stop offset="0%" style="stop-color:#0f0f23"/>
      <stop offset="100%" style="stop-color:#1a1a3e"/>
    </lineargradient>
  </defs>

  <rect width="900" height="700" rx="16" fill="url(#bg2)"/>

  <!-- Title -->
  <text x="450" y="44" fill="#c4b5fd" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="22" font-weight="700" text-anchor="middle" letter-spacing="1">MULTI-STAGE RETRIEVAL PIPELINE</text>
  <text x="450" y="68" fill="#c4b5fd55" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="13" text-anchor="middle">The SELECT statement for unstructured data</text>

  <!-- Stage 1: FILTER (face search) -->
  <rect x="100" y="92" width="700" height="100" rx="12" fill="#7c3aed12" stroke="#7c3aed" stroke-width="2"/>
  <rect x="100" y="92" width="700" height="30" rx="12" fill="#7c3aed33"/>
  <rect x="100" y="110" width="700" height="12" fill="#7c3aed33"/>
  <text x="120" y="113" fill="#e9d5ff" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="14" font-weight="700">Stage 1: FILTER</text>
  <text x="780" y="113" fill="#7c3aed99" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="12" text-anchor="end">feature_search</text>
  <text x="130" y="144" fill="#c4b5fdcc" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="14">&quot;Find faces matching Celebrity X&quot;</text>
  <text x="130" y="166" fill="#c4b5fd77" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="13">ArcFace embedding search | cosine threshold: 0.28</text>
  <rect x="560" y="146" width="220" height="30" rx="6" fill="#7c3aed22" stroke="#7c3aed66" stroke-width="1"/>
  <text x="670" y="166" fill="#c4b5fd" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="14" font-weight="600" text-anchor="middle">847 candidates</text>

  <!-- Arrow -->
  <polygon points="444,204 450,218 456,204" fill="#7c3aed66"/>

  <!-- Stage 2: FILTER (logo search) -->
  <rect x="100" y="226" width="700" height="100" rx="12" fill="#3b82f612" stroke="#3b82f6" stroke-width="2"/>
  <rect x="100" y="226" width="700" height="30" rx="12" fill="#3b82f633"/>
  <rect x="100" y="244" width="700" height="12" fill="#3b82f633"/>
  <text x="120" y="247" fill="#bfdbfe" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="14" font-weight="700">Stage 2: FILTER</text>
  <text x="780" y="247" fill="#3b82f699" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="12" text-anchor="end">feature_search</text>
  <text x="130" y="278" fill="#93c5fdcc" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="14">&quot;With competitor logos present&quot;</text>
  <text x="130" y="300" fill="#93c5fd77" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="13">SigLIP embedding search | filtered to Stage 1</text>
  <rect x="560" y="280" width="220" height="30" rx="6" fill="#3b82f622" stroke="#3b82f666" stroke-width="1"/>
  <text x="670" y="300" fill="#93c5fd" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="14" font-weight="600" text-anchor="middle">23 documents</text>

  <!-- Arrow -->
  <polygon points="444,338 450,352 456,338" fill="#3b82f666"/>

  <!-- Stage 3: SORT -->
  <rect x="100" y="360" width="700" height="100" rx="12" fill="#f59e0b12" stroke="#f59e0b" stroke-width="2"/>
  <rect x="100" y="360" width="700" height="30" rx="12" fill="#f59e0b33"/>
  <rect x="100" y="378" width="700" height="12" fill="#f59e0b33"/>
  <text x="120" y="381" fill="#fef3c7" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="14" font-weight="700">Stage 3: SORT</text>
  <text x="780" y="381" fill="#f59e0b99" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="12" text-anchor="end">score_linear</text>
  <text x="130" y="412" fill="#fcd34dcc" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="14">&quot;Rank by negative audio sentiment&quot;</text>
  <text x="130" y="434" fill="#fcd34d77" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="13">sentiment(0.6) + recency(0.3) + engagement(0.1)</text>
  <rect x="560" y="414" width="220" height="30" rx="6" fill="#f59e0b22" stroke="#f59e0b66" stroke-width="1"/>
  <text x="670" y="434" fill="#fcd34d" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="14" font-weight="600" text-anchor="middle">23 reordered</text>

  <!-- Arrow -->
  <polygon points="444,472 450,486 456,472" fill="#f59e0b66"/>

  <!-- Stage 4: REDUCE -->
  <rect x="100" y="494" width="700" height="80" rx="12" fill="#10b98112" stroke="#10b981" stroke-width="2"/>
  <rect x="100" y="494" width="700" height="30" rx="12" fill="#10b98133"/>
  <rect x="100" y="512" width="700" height="12" fill="#10b98133"/>
  <text x="120" y="515" fill="#d1fae5" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="14" font-weight="700">Stage 4: REDUCE</text>
  <text x="780" y="515" fill="#10b98199" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="12" text-anchor="end">sampling</text>
  <text x="130" y="548" fill="#6ee7b7cc" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="14">&quot;Top 5 most relevant&quot; | deduplication + sampling</text>
  <rect x="560" y="534" width="220" height="28" rx="6" fill="#10b98122" stroke="#10b98166" stroke-width="1"/>
  <text x="670" y="553" fill="#6ee7b7" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="14" font-weight="600" text-anchor="middle">5 documents</text>

  <!-- Arrow -->
  <polygon points="444,586 450,600 456,586" fill="#10b98166"/>

  <!-- Stage 5: ENRICH (the semantic join) -->
  <rect x="100" y="608" width="700" height="80" rx="12" fill="#ec489912" stroke="#ec4899" stroke-width="2"/>
  <rect x="100" y="608" width="700" height="30" rx="12" fill="#ec489933"/>
  <rect x="100" y="626" width="700" height="12" fill="#ec489933"/>
  <text x="120" y="629" fill="#fce7f3" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="14" font-weight="700">Stage 5: ENRICH</text>
  <text x="608" y="629" fill="#ec489999" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="12">THE SEMANTIC JOIN</text>
  <text x="130" y="660" fill="#f9a8d4cc" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="14">&quot;Join with brand safety scores&quot; | cross-collection</text>
  <rect x="560" y="648" width="220" height="28" rx="6" fill="#ec489922" stroke="#ec489966" stroke-width="1"/>
  <text x="670" y="667" fill="#f9a8d4" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="14" font-weight="600" text-anchor="middle">5 enriched docs</text>
</svg><figcaption style="margin-top: 12px; color: #666; font-size: 0.9em;">Fig 2: A retrieval pipeline is the SELECT statement for unstructured data</figcaption></figure>
<!--kg-card-end: html-->
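<p>The <code>score_linear</code> stage in Fig 2 is easiest to picture as a plain weighted sum. The sketch below reuses the weights from the figure; the field names and normalization are assumptions for illustration, not the platform&apos;s exact implementation.</p>
<pre><code class="language-python"># Minimal sketch of a linear scoring stage: score = sum(weight_i * signal_i)
WEIGHTS = {&quot;sentiment_negativity&quot;: 0.6, &quot;recency&quot;: 0.3, &quot;engagement&quot;: 0.1}

def score_linear(doc: dict) -&gt; float:
    # each signal is assumed to be pre-normalized into [0, 1]
    return sum(w * doc.get(field, 0.0) for field, w in WEIGHTS.items())

candidates = [
    {&quot;id&quot;: &quot;vid_017&quot;, &quot;sentiment_negativity&quot;: 0.92, &quot;recency&quot;: 0.40, &quot;engagement&quot;: 0.10},
    {&quot;id&quot;: &quot;vid_042&quot;, &quot;sentiment_negativity&quot;: 0.55, &quot;recency&quot;: 0.95, &quot;engagement&quot;: 0.80},
]
ranked = sorted(candidates, key=score_linear, reverse=True)
print([c[&quot;id&quot;] for c in ranked])</code></pre>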
<p>Each stage type serves a specific purpose in the pipeline:</p>
<!--kg-card-begin: html-->
<table style="width: 100%; border-collapse: collapse; margin: 24px 0; font-size: 0.95em;">
<thead>
<tr style="background: #f8f7ff; border-bottom: 2px solid #7c3aed;">
<th style="padding: 12px 16px; text-align: left; font-weight: 600;">Stage Type</th>
<th style="padding: 12px 16px; text-align: left; font-weight: 600;">Purpose</th>
<th style="padding: 12px 16px; text-align: left; font-weight: 600;">SQL Analogy</th>
<th style="padding: 12px 16px; text-align: left; font-weight: 600;">Implementations</th>
</tr>
</thead>
<tbody>
<tr style="border-bottom: 1px solid #eee;">
<td style="padding: 12px 16px;"><strong>Filter</strong></td>
<td style="padding: 12px 16px;">Narrow result set by features</td>
<td style="padding: 12px 16px;"><code>WHERE</code></td>
<td style="padding: 12px 16px;">feature_search, metadata_filter, boolean_filter</td>
</tr>
<tr style="border-bottom: 1px solid #eee; background: #fafafa;">
<td style="padding: 12px 16px;"><strong>Sort</strong></td>
<td style="padding: 12px 16px;">Reorder by relevance scores</td>
<td style="padding: 12px 16px;"><code>ORDER BY</code></td>
<td style="padding: 12px 16px;">score_linear, reciprocal_rank_fusion, cross_encoder</td>
</tr>
<tr style="border-bottom: 1px solid #eee;">
<td style="padding: 12px 16px;"><strong>Reduce</strong></td>
<td style="padding: 12px 16px;">Downsample, deduplicate, aggregate</td>
<td style="padding: 12px 16px;"><code>LIMIT</code> / <code>GROUP BY</code></td>
<td style="padding: 12px 16px;">sampling, clustering, deduplication</td>
</tr>
<tr style="border-bottom: 1px solid #eee; background: #fafafa;">
<td style="padding: 12px 16px;"><strong>Enrich</strong></td>
<td style="padding: 12px 16px;">Join data from other collections</td>
<td style="padding: 12px 16px;"><code>JOIN</code></td>
<td style="padding: 12px 16px;">document_enrich (the &quot;semantic join&quot;)</td>
</tr>
<tr style="border-bottom: 1px solid #eee;">
<td style="padding: 12px 16px;"><strong>Apply</strong></td>
<td style="padding: 12px 16px;">Transform results (LLM, classification)</td>
<td style="padding: 12px 16px;"><code>SELECT func()</code></td>
<td style="padding: 12px 16px;">llm_apply, classifier, reranker</td>
</tr>
</tbody>
</table>
<!--kg-card-end: html-->
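<p>Composed together, the stages read like a query plan. The sketch below chains a filter, a sort, and a reduce into one pipeline definition; the stage names echo the table and Fig 2, but the payload shape is an illustration rather than a verbatim API reference.</p>
<pre><code class="language-python"># Hypothetical pipeline definition: WHERE / ORDER BY / LIMIT for unstructured data
pipeline = [
    {   # FILTER: face similarity against a reference embedding
        &quot;stage&quot;: &quot;filter&quot;,
        &quot;implementation&quot;: &quot;feature_search&quot;,
        &quot;feature&quot;: &quot;mixpeek://face_extractor@v1/embedding&quot;,
        &quot;query&quot;: {&quot;reference_id&quot;: &quot;celebrity_x&quot;, &quot;min_score&quot;: 0.28},
    },
    {   # SORT: weighted linear rerank of the survivors
        &quot;stage&quot;: &quot;sort&quot;,
        &quot;implementation&quot;: &quot;score_linear&quot;,
        &quot;weights&quot;: {&quot;sentiment_negativity&quot;: 0.6, &quot;recency&quot;: 0.3, &quot;engagement&quot;: 0.1},
    },
    {   # REDUCE: dedupe and keep the top handful
        &quot;stage&quot;: &quot;reduce&quot;,
        &quot;implementation&quot;: &quot;sampling&quot;,
        &quot;limit&quot;: 5,
    },
]
# The list would be submitted to a retriever endpoint; URL and auth below are placeholders.
# requests.post(&quot;https://api.example.com/retrievers/brand-safety/execute&quot;, json={&quot;stages&quot;: pipeline})</code></pre>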
<h3 id="the-semantic-join-cross-modal-sql">The Semantic Join: Cross-Modal SQL</h3><p>The <code>document_enrich</code> stage deserves special attention. It&apos;s essentially a <a href="https://mixpeek.com/blog/adtech-governance?ref=blog.mixpeek.com"><strong>semantic join</strong></a>&#x2014;the ability to join results from one collection with data from another based on feature similarity, not foreign keys.</p><p>In SQL, you write <code>JOIN orders ON users.id = orders.user_id</code>. In a multimodal warehouse, you write:</p><pre><code class="language-json">// Semantic join: enrich video results with brand safety scores
{
  &quot;stage_type&quot;: &quot;enrich&quot;,
  &quot;stage_id&quot;: &quot;document_enrich&quot;,
  &quot;config&quot;: {
    &quot;target_namespace&quot;: &quot;brand-safety-scores&quot;,
    &quot;join_feature&quot;: &quot;mixpeek://logo_extractor@v1/embedding&quot;,
    &quot;attach_fields&quot;: [&quot;risk_score&quot;, &quot;brand_name&quot;, &quot;clearance_status&quot;]
  }
}</code></pre><p>No foreign keys. No schema alignment. The join happens in embedding space&#x2014;features from Collection A are matched to features in Collection B by vector similarity. This is how you connect a video corpus to a brand safety database without ever mapping IDs.</p><h2 id="taxonomies-the-schema-for-unstructured-data">Taxonomies: The Schema for Unstructured Data</h2><p>Structured data has schemas. Multimodal data has <a href="https://mixpeek.com/docs/enrichment/taxonomies?ref=blog.mixpeek.com"><strong>taxonomies</strong></a>&#x2014;hierarchical classification systems that bring order to extracted features.</p><p>Taxonomies in a multimodal warehouse operate in three modes:</p>
<!--kg-card-begin: html-->
<table style="width: 100%; border-collapse: collapse; margin: 24px 0; font-size: 0.95em;">
<thead>
<tr style="background: #f8f7ff; border-bottom: 2px solid #7c3aed;">
<th style="padding: 12px 16px; text-align: left; font-weight: 600;">Mode</th>
<th style="padding: 12px 16px; text-align: left; font-weight: 600;">When It Runs</th>
<th style="padding: 12px 16px; text-align: left; font-weight: 600;">Use Case</th>
</tr>
</thead>
<tbody>
<tr style="border-bottom: 1px solid #eee;">
<td style="padding: 12px 16px;"><strong>Materialized</strong></td>
<td style="padding: 12px 16px;">At ingestion time</td>
<td style="padding: 12px 16px;">Known categories&#x2014;&quot;is this face a celebrity?&quot; &quot;which IAB category?&quot;</td>
</tr>
<tr style="border-bottom: 1px solid #eee; background: #fafafa;">
<td style="padding: 12px 16px;"><strong>On-demand</strong></td>
<td style="padding: 12px 16px;">At query time</td>
<td style="padding: 12px 16px;">Ad-hoc classification&#x2014;&quot;group these by sentiment&quot; &quot;cluster by visual style&quot;</td>
</tr>
<tr style="border-bottom: 1px solid #eee;">
<td style="padding: 12px 16px;"><strong>Retroactive</strong></td>
<td style="padding: 12px 16px;">Batch over existing data</td>
<td style="padding: 12px 16px;">New taxonomy applied to historical corpus&#x2014;&quot;re-classify all assets with updated brand list&quot;</td>
</tr>
</tbody>
</table>
<!--kg-card-end: html-->
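<p>As a sketch of what the retroactive mode might look like in practice, here is a hypothetical batch job description. The names and fields are illustrative; the point is that the job references an existing feature, a new taxonomy version, and a scope, with no re-ingestion of raw assets.</p>
<pre><code class="language-python"># Hypothetical retroactive taxonomy job (illustrative shape only)
retro_job = {
    &quot;taxonomy&quot;: &quot;brand_safety_v2&quot;,           # updated classification tree
    &quot;mode&quot;: &quot;retroactive&quot;,
    &quot;source_feature&quot;: &quot;mixpeek://logo_extractor@v1/embedding&quot;,
    &quot;scope&quot;: {&quot;collection&quot;: &quot;published_ads&quot;, &quot;created_before&quot;: &quot;2026-01-01&quot;},
    &quot;write_back&quot;: &quot;metadata.brand_safety&quot;,    # where the new labels land
}
# Existing embeddings are re-classified in place; the raw videos are never re-processed.</code></pre>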
<p>This is the equivalent of <code>ALTER TABLE ADD COLUMN</code> for unstructured data. When your brand safety list changes, you don&apos;t re-ingest everything&#x2014;you apply a retroactive taxonomy that reclassifies existing features in place.</p><h2 id="object-reassembly-from-features-back-to-answers">Object Reassembly: From Features Back to Answers</h2><p>Decomposition without reassembly is just feature extraction. The power of a multimodal warehouse is in the <strong>round trip</strong>: you decompose objects into features for storage and search, then reassemble them into coherent answers at query time.</p>
<!--kg-card-begin: html-->
<figure style="margin: 40px 0; text-align: center;"><svg xmlns="http://www.w3.org/2000/svg" viewbox="0 0 900 740" width="900" height="740">
  <defs>
    <lineargradient id="bg3" x1="0%" y1="0%" x2="100%" y2="100%">
      <stop offset="0%" style="stop-color:#0f0f23"/>
      <stop offset="100%" style="stop-color:#1a1a3e"/>
    </lineargradient>
  </defs>

  <rect width="900" height="740" rx="16" fill="url(#bg3)"/>

  <!-- Title -->
  <text x="450" y="44" fill="#c4b5fd" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="22" font-weight="700" text-anchor="middle" letter-spacing="1">OBJECT LIFECYCLE IN A MULTIMODAL WAREHOUSE</text>
  <text x="450" y="68" fill="#c4b5fd55" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="13" text-anchor="middle">Ingest &#x2192; Decompose &#x2192; Store &#x2192; Query &#x2192; Reassemble</text>

  <!-- === ROW 1: INGEST | DECOMPOSE | STORE === -->

  <!-- INGEST box -->
  <text x="105" y="108" fill="#7c3aed99" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="13" font-weight="700" text-anchor="middle">INGEST</text>
  <rect x="30" y="118" width="150" height="160" rx="12" fill="#7c3aed15" stroke="#7c3aed" stroke-width="2"/>
  <!-- Video icon (simplified) -->
  <rect x="55" y="142" width="100" height="70" rx="6" fill="#7c3aed22" stroke="#7c3aed88" stroke-width="1"/>
  <polygon points="90,165 90,195 115,180" fill="#7c3aed88"/>
  <text x="105" y="236" fill="#c4b5fdcc" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="16" font-weight="600" text-anchor="middle">video.mp4</text>
  <text x="105" y="258" fill="#c4b5fd66" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="11" text-anchor="middle">any modality</text>

  <!-- Arrow INGEST -> DECOMPOSE -->
  <line x1="180" y1="198" x2="225" y2="198" stroke="#7c3aed66" stroke-width="2"/>
  <polygon points="223,192 237,198 223,204" fill="#7c3aed88"/>

  <!-- DECOMPOSE box -->
  <text x="410" y="108" fill="#3b82f699" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="13" font-weight="700" text-anchor="middle">DECOMPOSE</text>
  <rect x="240" y="118" width="340" height="160" rx="12" fill="#3b82f612" stroke="#3b82f6" stroke-width="2"/>

  <!-- Feature rows -->
  <rect x="260" y="132" width="300" height="28" rx="4" fill="#3b82f618"/>
  <text x="275" y="151" fill="#93c5fdcc" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="13">scenes  &#x2192; frame embeddings</text>

  <rect x="260" y="166" width="300" height="28" rx="4" fill="#7c3aed18"/>
  <text x="275" y="185" fill="#c4b5fdcc" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="13">faces   &#x2192; ArcFace 512d</text>

  <rect x="260" y="200" width="300" height="28" rx="4" fill="#3b82f618"/>
  <text x="275" y="219" fill="#93c5fdcc" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="13">audio   &#x2192; CLAP embeddings</text>

  <rect x="260" y="234" width="300" height="28" rx="4" fill="#7c3aed18"/>
  <text x="275" y="253" fill="#c4b5fdcc" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="13">logos   &#x2192; SigLIP 768d</text>

  <!-- Arrow DECOMPOSE -> STORE -->
  <line x1="580" y1="198" x2="625" y2="198" stroke="#3b82f666" stroke-width="2"/>
  <polygon points="623,192 637,198 623,204" fill="#3b82f688"/>

  <!-- STORE box -->
  <text x="760" y="108" fill="#f59e0b99" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="13" font-weight="700" text-anchor="middle">STORE</text>
  <rect x="640" y="118" width="230" height="160" rx="12" fill="#f59e0b12" stroke="#f59e0b" stroke-width="2"/>

  <!-- Storage tiers -->
  <rect x="660" y="136" width="190" height="36" rx="6" fill="#ef444422" stroke="#ef444477" stroke-width="1"/>
  <text x="755" y="158" fill="#fca5a5" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="13" font-weight="600" text-anchor="middle">HOT &#x2022; Qdrant</text>

  <rect x="660" y="180" width="190" height="36" rx="6" fill="#f59e0b22" stroke="#f59e0b77" stroke-width="1"/>
  <text x="755" y="202" fill="#fcd34d" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="13" font-weight="600" text-anchor="middle">WARM &#x2022; S3 Vectors</text>

  <rect x="660" y="224" width="190" height="36" rx="6" fill="#6b728022" stroke="#6b728077" stroke-width="1"/>
  <text x="755" y="246" fill="#9ca3af" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="13" font-weight="600" text-anchor="middle">COLD &#x2022; Archive</text>

  <!-- === DIVIDER === -->
  <line x1="80" y1="310" x2="820" y2="310" stroke="#ffffff12" stroke-width="1"/>

  <!-- Query label -->
  <rect x="60" y="326" width="780" height="40" rx="8" fill="#10b98112" stroke="#10b98155" stroke-width="1" stroke-dasharray="4,4"/>
  <text x="450" y="351" fill="#6ee7b7cc" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="14" font-weight="600" text-anchor="middle">QUERY: &quot;celebrity near competitor logo, negative audio&quot;</text>

  <!-- Arrow down to reassemble -->
  <polygon points="444,378 450,392 456,378" fill="#10b98166"/>

  <!-- === REASSEMBLE SECTION === -->
  <text x="450" y="416" fill="#10b981" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="18" font-weight="700" text-anchor="middle">REASSEMBLE</text>

  <!-- Pipeline steps -->
  <rect x="100" y="432" width="700" height="156" rx="12" fill="#10b98110" stroke="#10b981" stroke-width="2"/>

  <!-- Step rows -->
  <text x="130" y="462" fill="#6ee7b7cc" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="14">1.</text>
  <text x="160" y="462" fill="#6ee7b7cc" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="14">face search &#x2192; candidate videos</text>

  <text x="130" y="490" fill="#6ee7b7cc" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="14">2.</text>
  <text x="160" y="490" fill="#6ee7b7cc" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="14">logo filter &#x2192; narrow to competitor presence</text>

  <text x="130" y="518" fill="#6ee7b7cc" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="14">3.</text>
  <text x="160" y="518" fill="#6ee7b7cc" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="14">sentiment sort &#x2192; rank by negativity</text>

  <text x="130" y="546" fill="#6ee7b7cc" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="14">4.</text>
  <text x="160" y="546" fill="#6ee7b7cc" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="14">enrich &#x2192; attach brand context</text>

  <text x="130" y="574" fill="#6ee7b7cc" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="14">5.</text>
  <text x="160" y="574" fill="#6ee7b7cc" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="14">return &#x2192; video clips + timestamps + scores</text>

  <!-- Arrow down to result -->
  <polygon points="444,600 450,614 456,600" fill="#10b98166"/>

  <!-- Result box -->
  <rect x="100" y="622" width="700" height="106" rx="12" fill="#10b98118" stroke="#10b981" stroke-width="2"/>
  <text x="450" y="648" fill="#6ee7b7" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="15" font-weight="700" text-anchor="middle">Result: 5 video segments</text>

  <text x="160" y="672" fill="#6ee7b7aa" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="13">&#x2022; Source video URL + timestamp range</text>
  <text x="160" y="692" fill="#6ee7b7aa" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="13">&#x2022; Celebrity (0.94) &#x2022; Logo: &quot;Nike&quot; &#x2022; Sentiment: -0.73</text>
  <text x="160" y="712" fill="#ef4444cc" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="13" font-weight="600">&#x2022; Brand Safety: HIGH RISK</text>
</svg><figcaption style="margin-top: 12px; color: #666; font-size: 0.9em;">Fig 3: The full lifecycle&#x2014;ingest, decompose, store, query, reassemble</figcaption></figure>
<!--kg-card-end: html-->
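<p>To make Fig 3 concrete, a single reassembled answer for that query might look roughly like the record below. The field names and scores are illustrative, not a verbatim API response, but every value traces back to a specific extractor and timestamp.</p>
<pre><code class="language-python"># Illustrative shape of one reassembled result
result = {
    &quot;source&quot;: {&quot;video_url&quot;: &quot;s3://assets/campaign_0417.mp4&quot;, &quot;start&quot;: 12.4, &quot;end&quot;: 18.9},
    &quot;matches&quot;: {
        &quot;face&quot;:  {&quot;label&quot;: &quot;celebrity_x&quot;, &quot;score&quot;: 0.94,
                  &quot;feature&quot;: &quot;mixpeek://face_extractor@v1/embedding&quot;},
        &quot;logo&quot;:  {&quot;label&quot;: &quot;Nike&quot;, &quot;score&quot;: 0.88,
                  &quot;feature&quot;: &quot;mixpeek://logo_extractor@v1/detection&quot;},
        &quot;audio&quot;: {&quot;sentiment&quot;: -0.73,
                  &quot;feature&quot;: &quot;mixpeek://audio_extractor@v1/fingerprint&quot;},
    },
    &quot;enrichment&quot;: {&quot;brand_safety&quot;: &quot;HIGH_RISK&quot;, &quot;joined_from&quot;: &quot;brand-safety-scores&quot;},
}</code></pre>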
<p>The result isn&apos;t just &quot;document #47291 matched your query.&quot; It&apos;s a <strong>reassembled object</strong> with provenance: here&apos;s the video segment, here&apos;s why it matched, here&apos;s the confidence, here&apos;s the temporal context, and here&apos;s enriched metadata from related collections.</p><h2 id="the-architecture-how-it-actually-works">The Architecture: How It Actually Works</h2><p>At <a href="https://mixpeek.com/?ref=blog.mixpeek.com">Mixpeek</a>, we&apos;ve been building this for two years. Here&apos;s the stack:</p>
<!--kg-card-begin: html-->
<table style="width: 100%; border-collapse: collapse; margin: 24px 0; font-size: 0.95em;">
<thead>
<tr style="background: #f8f7ff; border-bottom: 2px solid #7c3aed;">
<th style="padding: 12px 16px; text-align: left; font-weight: 600;">Layer</th>
<th style="padding: 12px 16px; text-align: left; font-weight: 600;">Technology</th>
<th style="padding: 12px 16px; text-align: left; font-weight: 600;">Role</th>
</tr>
</thead>
<tbody>
<tr style="border-bottom: 1px solid #eee;">
<td style="padding: 12px 16px;"><strong>API Gateway</strong></td>
<td style="padding: 12px 16px;">FastAPI</td>
<td style="padding: 12px 16px;">Single REST API for all operations&#x2014;ingest, query, manage</td>
</tr>
<tr style="border-bottom: 1px solid #eee; background: #fafafa;">
<td style="padding: 12px 16px;"><strong>Task Queue</strong></td>
<td style="padding: 12px 16px;">Celery + Redis</td>
<td style="padding: 12px 16px;">Async batch processing for large ingestion jobs</td>
</tr>
<tr style="border-bottom: 1px solid #eee;">
<td style="padding: 12px 16px;"><strong>Inference Engine</strong></td>
<td style="padding: 12px 16px;">Ray Serve (14+ model endpoints)</td>
<td style="padding: 12px 16px;">Distributed GPU inference&#x2014;ArcFace, SigLIP, CLAP, Whisper, YOLO, LLMs</td>
</tr>
<tr style="border-bottom: 1px solid #eee; background: #fafafa;">
<td style="padding: 12px 16px;"><strong>Hot Storage</strong></td>
<td style="padding: 12px 16px;">Qdrant</td>
<td style="padding: 12px 16px;">In-memory HNSW index for real-time vector search</td>
</tr>
<tr style="border-bottom: 1px solid #eee;">
<td style="padding: 12px 16px;"><strong>Canonical Storage</strong></td>
<td style="padding: 12px 16px;">S3 Vectors</td>
<td style="padding: 12px 16px;">Durable source of truth for all features and embeddings</td>
</tr>
<tr style="border-bottom: 1px solid #eee; background: #fafafa;">
<td style="padding: 12px 16px;"><strong>Object Storage</strong></td>
<td style="padding: 12px 16px;">S3</td>
<td style="padding: 12px 16px;">Raw file storage with 15+ connectors (GCS, Azure, SFTP, URLs)</td>
</tr>
<tr style="border-bottom: 1px solid #eee;">
<td style="padding: 12px 16px;"><strong>Metadata</strong></td>
<td style="padding: 12px 16px;">MongoDB</td>
<td style="padding: 12px 16px;">Collection configs, batch tracking, lineage, taxonomies</td>
</tr>
<tr style="border-bottom: 1px solid #eee; background: #fafafa;">
<td style="padding: 12px 16px;"><strong>Analytics</strong></td>
<td style="padding: 12px 16px;">ClickHouse</td>
<td style="padding: 12px 16px;">Query performance, usage metrics, cost attribution</td>
</tr>
</tbody>
</table>
<!--kg-card-end: html-->
<p>The key insight: <strong>object storage is both the source and destination.</strong> Files come in from S3 (or any of 15+ connectors), get decomposed by the inference engine, and features are stored back into S3 Vectors as the canonical tier. Qdrant is an ephemeral hot cache that can be rebuilt from S3 Vectors at any time. The warehouse never loses data, even if the hot index goes down.</p><h2 id="use-cases-across-industries">Use Cases Across Industries</h2><p>A multimodal data warehouse isn&apos;t a solution looking for a problem. It&apos;s infrastructure for a class of problems that every enterprise with unstructured data faces:</p><h3 id="media-entertainment">Media &amp; Entertainment</h3><p><strong>Problem:</strong> A media company publishes 500+ assets/week. A single unauthorized celebrity face or brand logo can trigger $50K+ in legal costs.</p><p><strong>Solution:</strong> <a href="https://mixpeek.com/solutions/ip-safety?ref=blog.mixpeek.com">Pre-publication IP clearance</a>&#x2014;every asset is decomposed into faces, logos, and audio fingerprints, checked against reference corpora before publishing. Single image: ~200ms. 30-second video: ~2s.</p><p><strong>Try it:</strong> <a href="https://copyright.mixpeek.com/?ref=blog.mixpeek.com">Live demo &#x2192;</a></p><h3 id="advertising-brand-safety">Advertising &amp; Brand Safety</h3><p><strong>Problem:</strong> Brands need to verify their ads don&apos;t appear alongside objectionable content, and publishers need to classify user-generated video for ad placement.</p><p><strong>Solution:</strong> <a href="https://mixpeek.com/use-cases/brand-logo-video-detection?ref=blog.mixpeek.com">Multi-modal brand monitoring</a>&#x2014;decompose video into visual frames, audio, and transcript. Classify each frame for brand safety categories (IAB taxonomy). Flag logo presence. Score sentiment across modalities. The semantic join connects video features to brand safety databases in real time.</p><h3 id="insurance-claims-processing">Insurance &amp; Claims Processing</h3><p><strong>Problem:</strong> Claims arrive as a mix of photos, PDFs, voice recordings, and video evidence. Adjusters spend hours cross-referencing across formats.</p><p><strong>Solution:</strong> Ingest all claim documents through a single pipeline. Decompose photos into damage classifications, extract text from PDFs, transcribe voice memos, detect objects in video evidence. A multi-stage retrieval pipeline surfaces similar past claims, relevant policy terms, and fraud indicators&#x2014;all joined across modalities.</p><h3 id="e-commerce-retail">E-Commerce &amp; Retail</h3><p><strong>Problem:</strong> Product catalogs contain millions of images, videos, and descriptions across suppliers. Duplicate detection, counterfeit identification, and visual search all require different models.</p><p><strong>Solution:</strong> Decompose product assets into visual embeddings, text features, and brand identifiers. Storage tiering keeps active catalog in hot search, seasonal items in warm storage, and discontinued products in cold. Retroactive taxonomies reclassify the entire catalog when category structures change.</p><h3 id="healthcare-life-sciences">Healthcare &amp; Life Sciences</h3><p><strong>Problem:</strong> Medical imaging (X-rays, MRIs, pathology slides), clinical notes, genomic data, and sensor readings all need to be correlated for diagnosis support.</p><p><strong>Solution:</strong> Decompose imaging into region-level features. Extract entities from clinical notes. Embed genomic sequences. 
The multi-stage pipeline enables queries like <em>&quot;find patients with similar imaging features AND matching clinical history&quot;</em>&#x2014;a cross-modal join that&apos;s impossible in siloed systems.</p><h3 id="sports-live-events">Sports &amp; Live Events</h3><p><strong>Problem:</strong> Broadcasters need to identify players, detect sponsor logos, and provide real-time highlights from live video feeds.</p><p><strong>Solution:</strong> <a href="https://mixpeek.com/use-cases/celebrity-likeness-detection?ref=blog.mixpeek.com">Real-time face and logo detection</a> on video streams. Scene decomposition identifies key moments. Audio analysis detects crowd reactions. The retrieval pipeline assembles highlight packages: &quot;all moments where [Player X] appears + crowd noise peaks + sponsor logo visibility.&quot;</p><h2 id="why-now">Why Now?</h2><p>Three converging forces make the multimodal data warehouse inevitable:</p><h4 id="1-model-commoditization">1. Model Commoditization</h4><p>Open-source models (ArcFace, SigLIP, CLAP, Whisper, YOLO) are good enough for production. The bottleneck isn&apos;t inference quality&#x2014;it&apos;s the infrastructure to orchestrate, store, and query across models.</p><h4 id="2-vector-database-limitations">2. Vector Database Limitations</h4><p>Vector databases solve single-modality search. But real applications need multi-modal decomposition, cross-collection joins, storage tiering, and composable query pipelines. That&apos;s a warehouse, not a database.</p><h4 id="3-unstructured-data-explosion">3. Unstructured Data Explosion</h4><p>Enterprise video alone is growing 30% YoY. Every IoT sensor, security camera, and user-generated content platform is producing data that doesn&apos;t fit in a data warehouse&#x2014;yet. The multimodal warehouse is the missing tier.</p><h2 id="the-warehouse-analogy-goes-deep">The Warehouse Analogy Goes Deep</h2><p>This isn&apos;t just marketing. The parallels between structured data warehousing and multimodal data warehousing are structural:</p>
<!--kg-card-begin: html-->
<table style="width: 100%; border-collapse: collapse; margin: 24px 0; font-size: 0.95em;">
<thead>
<tr style="background: #f8f7ff; border-bottom: 2px solid #7c3aed;">
<th style="padding: 12px 16px; text-align: left; font-weight: 600;">Concept</th>
<th style="padding: 12px 16px; text-align: left; font-weight: 600;">Structured (Snowflake)</th>
<th style="padding: 12px 16px; text-align: left; font-weight: 600;">Multimodal (Mixpeek)</th>
</tr>
</thead>
<tbody>
<tr style="border-bottom: 1px solid #eee;">
<td style="padding: 12px 16px;"><strong>Schema</strong></td>
<td style="padding: 12px 16px;">Column types + constraints</td>
<td style="padding: 12px 16px;">Feature extractors + taxonomies</td>
</tr>
<tr style="border-bottom: 1px solid #eee; background: #fafafa;">
<td style="padding: 12px 16px;"><strong>Ingestion</strong></td>
<td style="padding: 12px 16px;">COPY INTO + transforms</td>
<td style="padding: 12px 16px;"><a href="https://mixpeek.com/docs/ingestion/connectors?ref=blog.mixpeek.com" style="color: #7c3aed;">Bucket upload + feature extraction</a></td>
</tr>
<tr style="border-bottom: 1px solid #eee;">
<td style="padding: 12px 16px;"><strong>Storage</strong></td>
<td style="padding: 12px 16px;">Micro-partitions (hot/cold)</td>
<td style="padding: 12px 16px;">Tiered vectors (Qdrant &#x2192; S3 Vectors &#x2192; Archive)</td>
</tr>
<tr style="border-bottom: 1px solid #eee; background: #fafafa;">
<td style="padding: 12px 16px;"><strong>Query</strong></td>
<td style="padding: 12px 16px;">SQL (SELECT, JOIN, GROUP BY)</td>
<td style="padding: 12px 16px;"><a href="https://mixpeek.com/docs/retrieval/retrievers?ref=blog.mixpeek.com" style="color: #7c3aed;">Multi-stage pipelines (filter, sort, reduce, enrich)</a></td>
</tr>
<tr style="border-bottom: 1px solid #eee;">
<td style="padding: 12px 16px;"><strong>Join</strong></td>
<td style="padding: 12px 16px;">Foreign key + equi-join</td>
<td style="padding: 12px 16px;">Semantic join (vector similarity across collections)</td>
</tr>
<tr style="border-bottom: 1px solid #eee; background: #fafafa;">
<td style="padding: 12px 16px;"><strong>Schema evolution</strong></td>
<td style="padding: 12px 16px;">ALTER TABLE</td>
<td style="padding: 12px 16px;">Retroactive taxonomy + re-extraction</td>
</tr>
<tr style="border-bottom: 1px solid #eee;">
<td style="padding: 12px 16px;"><strong>Materialization</strong></td>
<td style="padding: 12px 16px;">Materialized views</td>
<td style="padding: 12px 16px;">Materialized taxonomies + <a href="https://mixpeek.com/docs/enrichment/clusters?ref=blog.mixpeek.com" style="color: #7c3aed;">clusters</a></td>
</tr>
<tr style="border-bottom: 1px solid #eee; background: #fafafa;">
<td style="padding: 12px 16px;"><strong>Compute/storage separation</strong></td>
<td style="padding: 12px 16px;">Virtual warehouses</td>
<td style="padding: 12px 16px;">Ray Serve (autoscaling inference) + S3 Vectors (durable storage)</td>
</tr>
</tbody>
</table>
<!--kg-card-end: html-->
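<p>One way to internalize the table is to line a familiar SQL statement up against its pipeline counterpart, clause by clause. The mapping below is a loose sketch, not a formal grammar.</p>
<pre><code class="language-python"># Clause-by-clause translation (illustrative)
sql_to_pipeline = {
    &quot;WHERE celebrity = &apos;X&apos;&quot;:   (&quot;filter&quot;, &quot;feature_search on face embeddings&quot;),
    &quot;JOIN brand_scores ON ...&quot;: (&quot;enrich&quot;, &quot;document_enrich, matched by vector similarity&quot;),
    &quot;ORDER BY sentiment&quot;:       (&quot;sort&quot;,   &quot;score_linear over sentiment/recency/engagement&quot;),
    &quot;LIMIT 5&quot;:                  (&quot;reduce&quot;, &quot;sampling with deduplication&quot;),
}
for clause, (stage, impl) in sql_to_pipeline.items():
    print(clause, &quot;-&gt;&quot;, stage, &quot;/&quot;, impl)</code></pre>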
<h2 id="what-this-unlocks">What This Unlocks</h2><p>When you have a real multimodal warehouse&#x2014;not a stitched-together stack, but integrated decomposition, tiered storage, and composable retrieval&#x2014;new capabilities emerge:</p><ul><li><strong>Cross-modal correlation:</strong> &quot;Find me all instances where [this sound] plays while [this logo] is visible&quot;&#x2014;queries that span embedding spaces with temporal alignment</li><li><strong>Retroactive intelligence:</strong> New model drops? New taxonomy? Apply it to your entire historical corpus without re-ingestion</li><li><strong>Cost-proportional scaling:</strong> Hot data for real-time apps, cold data for compliance&#x2014;same API, automatic lifecycle management</li><li><strong>Semantic joins across modalities:</strong> Connect video features to audio features to document features&#x2014;the <code>JOIN</code> for unstructured data</li><li><strong>Composable pipelines:</strong> Build complex queries by snapping together stages, not writing custom code for each use case</li></ul><h2 id="getting-started">Getting Started</h2><p>If you want to see this in action:</p><ol><li><a href="https://copyright.mixpeek.com/?ref=blog.mixpeek.com"><strong>Try the live demo</strong></a>&#x2014;upload an image or video and see face, logo, and audio detection run in parallel</li><li><a href="https://mixpeek.com/docs?ref=blog.mixpeek.com"><strong>Read the docs</strong></a>&#x2014;the API is REST-first, with Python and TypeScript SDKs</li><li><a href="https://mixpeek.com/tutorials/ip-safety-pipeline?ref=blog.mixpeek.com"><strong>Build an IP safety pipeline</strong></a>&#x2014;full tutorial from namespace creation to retriever execution</li><li><a href="https://mixpeek.com/contact?ref=blog.mixpeek.com"><strong>Talk to us</strong></a>&#x2014;we&apos;re helping enterprises migrate from Frankenstack to warehouse</li></ol><h2 id="further-reading">Further Reading</h2><ul><li><a href="https://mixpeek.com/multimodal-data-warehouse?ref=blog.mixpeek.com">Multimodal Data Warehouse</a> &#x2014; the canonical definition page</li><li><a href="https://mixpeek.com/guides/what-is-multimodal-data-warehouse?ref=blog.mixpeek.com">What Is a Multimodal Data Warehouse?</a> &#x2014; comprehensive guide</li><li><a href="https://mixpeek.com/guides/build-multimodal-data-warehouse?ref=blog.mixpeek.com">How to Build a Multimodal Data Warehouse</a> &#x2014; step-by-step tutorial</li><li><a href="https://mixpeek.com/guides/multimodal-data-warehouse-architecture?ref=blog.mixpeek.com">Architecture Deep Dive</a> &#x2014; Ray Serve, tiered storage, retrieval internals</li><li><a href="https://mixpeek.com/comparisons/multimodal-data-warehouse-vs-vector-database?ref=blog.mixpeek.com">Multimodal Data Warehouse vs. Vector Database</a> &#x2014; full comparison</li><li><a href="https://mixpeek.com/comparisons/multimodal-data-warehouse-vs-data-lakehouse?ref=blog.mixpeek.com">Multimodal Data Warehouse vs. 
Data Lakehouse</a> &#x2014; Snowflake/Databricks comparison</li><li><a href="https://mixpeek.com/curated-lists/best-multimodal-data-platforms?ref=blog.mixpeek.com">Best Multimodal Data Platforms (2026)</a> &#x2014; 8 platforms compared</li><li><a href="https://mixpeek.com/glossary/multimodal-data-warehouse?ref=blog.mixpeek.com">Glossary: Multimodal Data Warehouse</a> &#x2014; technical definition</li><li><a href="https://mixpeek.com/solutions/ip-safety?ref=blog.mixpeek.com">IP Safety Solution</a> &#x2014; pre-publication copyright detection powered by the warehouse</li></ul><hr><p>The multimodal data warehouse isn&apos;t a vision. It&apos;s running in production today, processing millions of objects across media companies, ad platforms, and enterprises. The question isn&apos;t whether this category will exist&#x2014;it&apos;s whether you&apos;ll build it yourself or use one that already works.</p><p>Built with: FastAPI, Ray Serve, Qdrant, S3 Vectors, ArcFace, SigLIP, CLAP, Whisper, YOLOv8. <a href="https://mixpeek.com/?ref=blog.mixpeek.com">mixpeek.com</a></p>]]></content:encoded></item><item><title><![CDATA[ColQwen2 + MUVERA: Multimodal Late Interaction Retrieval That Actually Scales]]></title><description><![CDATA[We benchmarked multimodal late interaction retrieval on financial documents. ColQwen2 + MUVERA retains 99.4% of brute-force quality at 179x the speed, crushing OCR-based search by 56%.]]></description><link>http://blog.mixpeek.com/colqwen2-muvera-multimodal-late-interaction/</link><guid isPermaLink="false">69c45ea13baecafdb7f8d64c</guid><category><![CDATA[Engineering]]></category><category><![CDATA[Research]]></category><category><![CDATA[Multimodal]]></category><category><![CDATA[Retrieval]]></category><dc:creator><![CDATA[Ethan Steininger]]></dc:creator><pubDate>Wed, 25 Mar 2026 22:21:44 GMT</pubDate><media:content url="http://blog.mixpeek.com/content/images/2026/03/feature_image.png" medium="image"/><content:encoded><![CDATA[<img src="http://blog.mixpeek.com/content/images/2026/03/feature_image.png" alt="ColQwen2 + MUVERA: Multimodal Late Interaction Retrieval That Actually Scales"><p>We benchmarked every viable approach to multimodal document retrieval on financial tables (ViDoRe/TabFQuAD) and found a combination that hasn&apos;t been published before: <strong>ColQwen2 + MUVERA</strong>. It retains 99.4% of brute-force quality at a fraction of the cost, and obliterates OCR-based search.</p><h2 id="the-problem">The Problem</h2><p>Late interaction models like ColBERT and ColPali represent documents as <em>sets of vectors</em>&#x2014;one per token or image patch. At query time, every query token finds its best-matching document token (MaxSim/Chamfer similarity). This gives near cross-encoder accuracy, but retrieval cost is O(|Q| &#xD7; |P| &#xD7; n)&#x2014;intractable at scale.</p><p>The prior solution, PLAID, uses heuristic centroid pruning with no theoretical guarantees. It degrades unpredictably on some datasets.</p><h2 id="muvera-the-fix">MUVERA: The Fix</h2><p><a href="https://arxiv.org/abs/2405.19504?ref=blog.mixpeek.com">MUVERA</a> (Google Research, NeurIPS 2024) converts any multi-vector set into a single fixed-dimensional encoding (FDE) whose inner product <strong>provably approximates</strong> Chamfer similarity. 
This means you can use standard ANN engines (HNSW, DiskANN) for first-pass retrieval, then re-rank a small candidate set with true MaxSim.</p><p>The key insight is asymmetric encoding: documents get centroids with empty-cluster filling (preserves information), queries get sums with no filling (preserves distribution). Random hyperplane partitioning + dimensionality reduction, repeated R times and concatenated. The math gives you an &#x3B5;-approximation guarantee&#x2014;the first such result for multi-vector retrieval.</p><h2 id="the-benchmark">The Benchmark</h2><p>We ran ColPali-v1.2 and ColQwen2-v1.0 against BM25 (OCR + Tesseract) on the ViDoRe TabFQuAD dataset&#x2014;70 financial table images, 280 queries. This is the hard case: charts, multi-column tables, footnotes, mixed text+visual content where OCR systematically fails.</p><p>MUVERA config: k_sim=5, d_proj=16, r_reps=20 &#x2192; 10,240-dimensional FDE per document.</p>
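<p>The 10,240 figure follows from the construction sketched above: each of the r_reps repetitions partitions the space into 2^k_sim buckets via random hyperplanes and projects each bucket down to d_proj dimensions, then everything is concatenated.</p>
<pre><code class="language-python"># FDE dimensionality = repetitions x buckets x projected dims
k_sim, d_proj, r_reps = 5, 16, 20
buckets = 2 ** k_sim                   # 32 hyperplane-induced partitions
fde_dim = r_reps * buckets * d_proj    # 20 * 32 * 16
print(fde_dim)                         # 10240</code></pre>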
<!--kg-card-begin: html-->
<table style="width:100%; border-collapse:collapse; margin:1.5em 0; font-size:0.95em;">
<thead><tr style="border-bottom:2px solid #333; text-align:left;">
<th style="padding:8px;">Method</th><th style="padding:8px;">R@1</th><th style="padding:8px;">R@5</th><th style="padding:8px;">NDCG@10</th><th style="padding:8px;">MRR</th><th style="padding:8px;">Latency</th></tr></thead>
<tbody>
<tr style="border-bottom:1px solid #ddd;"><td style="padding:8px;">BM25 (OCR text)</td><td style="padding:8px;">0.425</td><td style="padding:8px;">0.650</td><td style="padding:8px;">0.570</td><td style="padding:8px;">0.531</td><td style="padding:8px;">0.4ms</td></tr>
<tr style="border-bottom:1px solid #ddd;"><td style="padding:8px;">ColPali-v1.2 brute-force</td><td style="padding:8px;">0.825</td><td style="padding:8px;">0.929</td><td style="padding:8px;">0.890</td><td style="padding:8px;">0.872</td><td style="padding:8px;">26.4ms</td></tr>
<tr style="border-bottom:1px solid #ddd;"><td style="padding:8px;">ColQwen2-v1.0 brute-force</td><td style="padding:8px;"><strong>0.839</strong></td><td style="padding:8px;"><strong>0.932</strong></td><td style="padding:8px;"><strong>0.896</strong></td><td style="padding:8px;"><strong>0.883</strong></td><td style="padding:8px;">42.8ms</td></tr>
<tr style="border-bottom:1px solid #ddd;"><td style="padding:8px;">MUVERA FDE only</td><td style="padding:8px;">0.693</td><td style="padding:8px;">0.854</td><td style="padding:8px;">0.791</td><td style="padding:8px;">0.759</td><td style="padding:8px;"><strong>0.2ms</strong></td></tr>
<tr style="border-bottom:2px solid #333; background:#f8f9fa;"><td style="padding:8px;"><strong>MUVERA + rerank (ColQwen2)</strong></td><td style="padding:8px;"><strong>0.836</strong></td><td style="padding:8px;"><strong>0.925</strong></td><td style="padding:8px;"><strong>0.891</strong></td><td style="padding:8px;"><strong>0.877</strong></td><td style="padding:8px;"><strong>30.2ms</strong></td></tr>
</tbody></table>
<!--kg-card-end: html-->
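<p>The headline retention number quoted below is just the ratio of the two NDCG@10 cells above:</p>
<pre><code class="language-python">retention = 0.891 / 0.896   # MUVERA + rerank vs. ColQwen2 brute-force
print(f&quot;{retention:.1%}&quot;)    # 99.4%</code></pre>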
<h2 id="what-the-numbers-mean">What the Numbers Mean</h2><p><strong>BM25 is not competitive.</strong> OCR + keyword search reaches 63.6% of the visual model&apos;s quality on financial tables. If your pipeline is &#x201C;extract text then BM25,&#x201D; you&apos;re leaving 36% of retrieval quality on the floor for any document with visual structure.</p><p><strong>MUVERA + rerank = 99.4% quality retention.</strong> The FDE narrows 70 documents to 50 candidates in 0.2ms, then Chamfer re-ranking recovers essentially all of the brute-force accuracy. At 1M documents, brute-force becomes seconds; MUVERA stays at milliseconds.</p><p><strong>MUVERA FDE-only at 179&#xD7; speedup.</strong> For applications where you can tolerate ~12% quality loss, pure FDE search gives sub-millisecond retrieval. This is the operating point for real-time serving at scale.</p><p><strong>ColQwen2 &gt; ColPali by +0.7% NDCG@10</strong> on this dataset, with a larger gap (~6%) on the full ViDoRe average. ColQwen2 is Apache 2.0 licensed and 2B parameters&#x2014;smaller than ColPali&apos;s 3B.</p><h2 id="the-two-tier-architecture">The Two-Tier Architecture</h2><p>The production pattern that falls out of this:</p><ol><li><strong>Offline:</strong> Embed documents with ColQwen2 &#x2192; ~620 patch vectors per page (128-dim each). Generate MUVERA FDE (10,240-dim single vector). Index FDEs in any standard ANN engine.</li><li><strong>Tier 1&#x2014;candidate generation:</strong> Query &#x2192; ColQwen2 query embedding &#x2192; MUVERA query FDE &#x2192; ANN search &#x2192; top-K candidates. Cost: O(log n). Latency: &lt;1ms.</li><li><strong>Tier 2&#x2014;precision re-ranking:</strong> Load candidate multi-vectors from storage &#x2192; true Chamfer/MaxSim scoring &#x2192; final ranked list. Cost: O(K &#xD7; |patches|). Latency: ~30ms for K=50.</li></ol><p>FDEs go into your vector index as ordinary single vectors. Multi-vectors stay in object storage (S3/parquet) and only get loaded for the re-rank stage. No new infrastructure&#x2014;just a smarter encoding layer.</p><h2 id="what-this-means-for-multimodal-search">What This Means for Multimodal Search</h2><p>The combination of a strong vision-language model (ColQwen2) with a theoretically-grounded retrieval engine (MUVERA) makes multi-vector search practical at scale for the first time. Prior approaches either sacrificed quality (single-vector), sacrificed speed (brute-force), or sacrificed guarantees (PLAID).</p><p>The verticals where this matters most: financial document search (tables, charts, filings), medical imaging (radiology reports with embedded scans), legal discovery (scanned contracts with annotations), and any domain where OCR is the current bottleneck.</p><h2 id="links">Links</h2><ul><li><a href="https://arxiv.org/abs/2405.19504?ref=blog.mixpeek.com">MUVERA paper (NeurIPS 2024)</a></li><li><a href="https://arxiv.org/abs/2407.01449?ref=blog.mixpeek.com">ColPali paper (ICLR 2025)</a></li><li><a href="https://github.com/google/graph-mining/tree/main/sketching/point_cloud?ref=blog.mixpeek.com">MUVERA reference implementation (C++)</a></li><li><a href="https://github.com/illuin-tech/colpali?ref=blog.mixpeek.com">ColPali / ColQwen2 models</a></li><li><a href="https://huggingface.co/vidore/colqwen2-v1.0?ref=blog.mixpeek.com">ColQwen2-v1.0 on HuggingFace</a></li></ul>]]></content:encoded></item><item><title><![CDATA[We Built a Pre-Publication IP Clearance Pipeline. 
Here's What We Learned.]]></title><description><![CDATA[Every major IP enforcement tool finds violations after they're live. We built one that catches them before publication. Here's the architecture, the models, and what we learned.]]></description><link>http://blog.mixpeek.com/ip-safety-pre-publication-clearance/</link><guid isPermaLink="false">69c2a5ed3baecafdb7f8d20a</guid><category><![CDATA[Engineering]]></category><category><![CDATA[IP Safety]]></category><category><![CDATA[Computer Vision]]></category><category><![CDATA[Machine Learning]]></category><category><![CDATA[Content Compliance]]></category><category><![CDATA[Tutorial]]></category><dc:creator><![CDATA[Ethan Steininger]]></dc:creator><pubDate>Wed, 25 Mar 2026 13:40:11 GMT</pubDate><media:content url="http://blog.mixpeek.com/content/images/2026/03/feature-image-4.png" medium="image"/><content:encoded><![CDATA[
<!--kg-card-begin: html-->
<iframe width="560" height="315" src="https://www.youtube.com/embed/hmajwWZLkiA?si=kyB56v9QfavyXHgn" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
<!--kg-card-end: html-->
<img src="http://blog.mixpeek.com/content/images/2026/03/feature-image-4.png" alt="We Built a Pre-Publication IP Clearance Pipeline. Here&apos;s What We Learned."><p>Every major IP enforcement tool finds violations after they&apos;re live. We built one that catches them before publication. Here&apos;s the architecture.</p><hr><h2 id="the-problem-content-velocity-vs-clearance-bottleneck">The Problem: Content Velocity vs. Clearance Bottleneck</h2><p>A mid-size media company publishes 200-400 creative assets per week. Each one needs to be checked for unauthorized faces, trademarked logos, and copyrighted audio before it ships. A single missed celebrity likeness in an ad campaign can trigger a seven-figure lawsuit. A logo that&apos;s &quot;close enough&quot; to a registered trademark gets a cease-and-desist within hours.</p><p>The current options are bad:</p><ul><li><strong>Manual review</strong> doesn&apos;t scale. A trained compliance analyst can review maybe 50 assets/day with any rigor. That&apos;s a week&apos;s backlog by Tuesday.</li><li><strong>Post-publication enforcement</strong> (Pixsy, Red Points, VISUA, Copyseeker) finds violations after they&apos;re already live. You pay for takedowns, not prevention. The damage &#x2014; legal exposure, brand risk, platform penalties &#x2014; is already done.</li><li><strong>Perceptual hashing alone</strong> catches exact and near-exact copies, but misses stylized logos, different angles of the same face, or AI-generated content that&apos;s &quot;inspired by&quot; but not pixel-identical to protected IP.</li></ul><p>We wanted a pipeline that clears content <em>before</em> publication, runs in under a second per image and a few seconds per video, and catches the hard cases that hashing misses.</p><hr><h2 id="architecture-overview">Architecture Overview</h2><p>The system is built on three primitives from the Mixpeek API: <strong>Buckets</strong> (storage + ingestion triggers), <strong>Collections</strong> (processing pipelines with feature extractors), and <strong>Retrievers</strong> (multi-stage search).</p><p>The high-level flow:</p><pre><code>Content Asset (image/video/audio)
    |
    v
Bucket Upload (triggers collection pipeline)
    |
    v
Collection Pipeline (parallel extractors)
    |--- Face Detection &#x2192; Face Embedding (ArcFace 512d)
    |--- Scene Splitting &#x2192; Object Detection (YOLO) &#x2192; Logo Embedding (SigLIP 768d)
    |--- Audio Extraction &#x2192; Spectrogram Fingerprinting
    |
    v
Vector Storage (Qdrant &#x2014; one namespace, three vector spaces)
    |
    v
Retriever (multi-stage search across all three corpora)
    |
    v
Clearance Result: { faces: [...], logos: [...], audio: [...] }
</code></pre><p>Three detection layers run in parallel within a single collection pipeline. Each layer has its own feature extractor, its own embedding model, and its own reference corpus. A single retriever execution searches all three and returns a unified result.</p><p>The key insight: <strong>the same pipeline that processes your content for clearance also builds your reference corpus.</strong> Celebrity headshots, trademarked logos, and copyrighted audio tracks are all ingested through the same bucket-collection flow. The only difference is metadata tagging &#x2014; reference items get a <code>corpus_type: &quot;reference&quot;</code> field, content to be checked gets <code>corpus_type: &quot;submission&quot;</code>.</p><hr><h2 id="query-pre-processing-why-it-matters-more-than-model-quality">Query Pre-Processing: Why It Matters More Than Model Quality</h2><p>This is the section most teams skip, and it&apos;s the one that matters most. The quality of what you feed into your embedding model determines your recall far more than which embedding model you pick.</p><h3 id="scene-splitting-for-video">Scene Splitting for Video</h3><p>Naive approach: sample every Nth frame and run detection on each. This is expensive and produces massive redundancy &#x2014; a 30-second talking-head clip generates 900 frames at 30fps, most of which are nearly identical.</p><p>Better approach: split by scene boundaries first. Mixpeek&apos;s <code>scene_splitting</code> extractor uses PySceneDetect&apos;s content-aware detection to identify hard cuts and gradual transitions. A typical 30-second ad breaks into 3-8 scenes. Run detection on representative frames from each scene, not every frame.</p><pre><code class="language-python"># Collection config &#x2014; scene splitting feeds into face detection
{
    &quot;collection_name&quot;: &quot;ip_clearance_pipeline&quot;,
    &quot;feature_extractors&quot;: [
        {
            &quot;feature_extractor_name&quot;: &quot;scene_splitting&quot;,
            &quot;version&quot;: &quot;v1&quot;,
            &quot;parameters&quot;: {
                &quot;threshold&quot;: 27.0,
                &quot;min_scene_len&quot;: 15
            }
        },
        {
            &quot;feature_extractor_name&quot;: &quot;face_identity&quot;,
            &quot;version&quot;: &quot;v1&quot;,
            &quot;parameters&quot;: {
                &quot;quality_threshold&quot;: 0.4,
                &quot;min_face_size&quot;: 40,
                &quot;detection_threshold&quot;: 0.5
            }
        },
        {
            &quot;feature_extractor_name&quot;: &quot;object_detection&quot;,
            &quot;version&quot;: &quot;v1&quot;,
            &quot;parameters&quot;: {
                &quot;model&quot;: &quot;yolov8x-worldv2&quot;,
                &quot;confidence_threshold&quot;: 0.25,
                &quot;classes&quot;: [&quot;logo&quot;, &quot;brand&quot;, &quot;trademark&quot;, &quot;sign&quot;, &quot;label&quot;]
            }
        }
    ]
}
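
# Note: threshold 27.0 and min_scene_len 15 mirror PySceneDetect&apos;s
# ContentDetector defaults (where min_scene_len is counted in frames).
# Lower the threshold to catch softer transitions; raise min_scene_len
# to suppress rapid-cut noise.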
</code></pre><p>This preprocessing step cuts compute by 10-50x on video inputs while actually <em>improving</em> detection quality &#x2014; scene-representative frames are more likely to show faces and logos in clear, unblurred positions than arbitrary frame samples.</p><h3 id="face-cropping-before-embedding">Face Cropping Before Embedding</h3><p>You don&apos;t embed the full frame. You detect faces first (SCRFD &#x2014; Sample and Computation Redistribution for Face Detection), crop each face region, align it to a normalized 112x112 template using 5 facial landmarks, and <em>then</em> generate the identity embedding.</p><p>This is obvious in hindsight, but I&apos;ve seen teams embed full frames and wonder why their face search has 40% recall. The face is 3% of the pixel area in a wide shot. The embedding is dominated by the background.</p><h3 id="object-proposals-for-logo-isolation">Object Proposals for Logo Isolation</h3><p>Same principle for logos. YOLO generates bounding box proposals for logo-like regions. Each region is cropped and embedded independently with SigLIP. A single frame might yield zero logo proposals (clean background) or five (product shelf shot). Each proposal becomes a separate search query against the logo reference corpus.</p><p>The alternative &#x2014; embedding the full frame and hoping the model attends to the logo &#x2014; works surprisingly well for prominent logos (center frame, large area) and fails badly for small, peripheral, or partially occluded marks. The crop-then-embed approach handles both cases.</p><hr><h2 id="the-pipelines-in-detail">The Pipelines in Detail</h2><h3 id="face-detection-%E2%86%92-recognition">Face Detection &#x2192; Recognition</h3><p>Pipeline stages:</p><ol><li><strong>Scene splitting</strong> (video only): PySceneDetect content-aware detection &#x2192; representative frames</li><li><strong>Face detection</strong>: SCRFD-2.5G scans each frame. Outputs bounding boxes, confidence scores, and 5 facial landmarks per detected face.</li><li><strong>Alignment</strong>: Landmarks are used to warp each face to a canonical 112x112 frontal pose. This normalization is what makes the system robust to head tilt, partial profile views, and camera angle variation.</li><li><strong>Embedding</strong>: ArcFace (ResNet-100, trained on MS1MV3) generates a 512-dimensional identity embedding. Cosine similarity in this space corresponds directly to identity &#x2014; same person across lighting, age, expression, and moderate pose changes.</li><li><strong>ANN search</strong>: Each face embedding is searched against the reference corpus in Qdrant. Threshold: cosine similarity &gt;= 0.28 (conservative; FAR ~1e-4 on LFW benchmark).</li></ol><p>Why ArcFace and not CLIP/SigLIP for faces? CLIP embeds semantic similarity. Two different red-haired women in similar settings score high. ArcFace is trained with angular margin loss specifically for identity discrimination &#x2014; it is not interchangeable with general-purpose image embedders for biometric matching.</p><pre><code class="language-python"># Retriever config &#x2014; face search stage
{
    &quot;stage_id&quot;: &quot;feature_search&quot;,
    &quot;stage_type&quot;: &quot;search&quot;,
    &quot;parameters&quot;: {
        &quot;searches&quot;: [
            {
                &quot;feature_uri&quot;: &quot;mixpeek://face_identity@v1/arcface_embedding&quot;,
                &quot;query&quot;: {
                    &quot;input_mode&quot;: &quot;content&quot;,
                    &quot;value&quot;: &quot;{{submission_face_crop}}&quot;
                },
                &quot;filters&quot;: {
                    &quot;corpus_type&quot;: &quot;reference&quot;
                },
                &quot;top_k&quot;: 10,
                &quot;score_threshold&quot;: 0.28
            }
        ]
    }
}
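
# score_threshold 0.28 is the conservative ArcFace cosine cutoff described
# above (FAR ~1e-4 on LFW); the corpus_type filter keeps submissions from
# matching against other submissions.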
</code></pre><h3 id="logo-detection-%E2%86%92-recognition">Logo Detection &#x2192; Recognition</h3><p>Pipeline stages:</p><ol><li><strong>Scene splitting</strong> (video only): same as above</li><li><strong>Object detection</strong>: YOLOv8x-WorldV2 generates bounding box proposals for logo-like objects. We use the open-vocabulary variant so it generalizes beyond a fixed class set &#x2014; you can prompt it with arbitrary class names at inference time.</li><li><strong>Region cropping</strong>: Each detected region is cropped with 10% padding</li><li><strong>Logo embedding</strong>: SigLIP (ViT-B/16, 768-dimensional) embeds each cropped region. SigLIP over CLIP because its sigmoid pairwise loss produces better-calibrated similarity scores for retrieval tasks.</li><li><strong>Dual matching</strong>: Each crop is matched against the reference corpus via both (a) embedding cosine similarity and (b) perceptual hash distance. Either signal above threshold triggers a match.</li></ol><p>The dual matching is important. Perceptual hashing catches trivial copies (exact logo, maybe resized or JPEG&apos;d) cheaply. The embedding catches stylized variants, partial logos, color inversions, and the deformed versions that generative AI tends to produce. Running both in parallel with an OR-gate means high recall without relying solely on either approach.</p><pre><code class="language-python"># Why pHash + embedding, not just embedding?
#
# pHash: O(1) lookup, exact/near-exact matches, zero false negatives on
#         trivial copies. Catches ~60% of real-world violations.
# Embedding: Handles stylization, partial occlusion, AI-generated variants.
#            Catches the remaining 40% that hashing misses.
#
# Running both costs almost nothing extra &#x2014; the hash is computed during
# ingestion and stored as a payload field. At query time, it&apos;s a
# Qdrant payload filter, not a separate search.
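#
# A minimal sketch of that OR-gate decision (illustrative only; thresholds
# taken from the false-positive table later in this post):
def is_logo_match(cosine_sim: float, phash_hamming: int) -&gt; bool:
    EMBED_THRESHOLD = 0.75   # SigLIP cosine similarity
    PHASH_THRESHOLD = 8      # Hamming distance between perceptual hashes
    return cosine_sim &gt;= EMBED_THRESHOLD or phash_hamming &lt;= PHASH_THRESHOLD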
</code></pre><h3 id="audio-fingerprinting">Audio Fingerprinting</h3><p>Pipeline stages:</p><ol><li><strong>Audio extraction</strong>: FFmpeg strips the audio track from video assets</li><li><strong>Spectrogram generation</strong>: Short-time Fourier transform &#x2192; mel spectrogram</li><li><strong>Fingerprint embedding</strong>: The spectrogram is embedded into a dense vector representation for similarity search</li><li><strong>ANN search</strong>: Search against a reference corpus of copyrighted audio tracks, jingles, and licensed music</li></ol><p>Audio is the least mature of the three pipelines and the one where perceptual fingerprinting (Chromaprint/AcoustID-style) still outperforms learned embeddings for exact match detection. The embedding approach shines for covers, remixes, and tempo-shifted versions.</p><hr><h2 id="models-and-tuning">Models and Tuning</h2><h3 id="why-siglip-over-clip">Why SigLIP Over CLIP</h3><p>We evaluated both extensively for the logo matching use case. SigLIP (768-d, ViT-B/16) wins on three dimensions that matter for retrieval:</p><ul><li><strong>Calibrated scores</strong>: SigLIP&apos;s sigmoid loss produces similarity scores that are directly interpretable as match confidence. CLIP&apos;s softmax-normalized scores are relative within a batch, which makes threshold-setting fragile.</li><li><strong>Cropped region performance</strong>: On our internal eval set of 500 logo crops against 3,000 reference brands, SigLIP at threshold 0.75 achieves 94% recall / 97% precision vs CLIP&apos;s 89% recall / 93% precision at its optimal threshold.</li><li><strong>Zero-shot generalization</strong>: SigLIP handles brand logos it&apos;s never seen in training better than CLIP, likely due to the per-pair sigmoid loss not pushing negatives to the same scale as positives.</li></ul><h3 id="face-recognition-arcface-tradeoffs">Face Recognition: ArcFace Tradeoffs</h3><p>ArcFace ResNet-100 is the default. It&apos;s 512 dimensions, runs at ~3ms per face on GPU, and achieves 99.83% accuracy on LFW. The tradeoffs:</p><ul><li><strong>Pose sensitivity</strong>: Accuracy degrades beyond ~60-degree profile angles. The alignment step mitigates this for moderate poses, but a face visible only in full profile may not match.</li><li><strong>Aging</strong>: Embeddings shift over 10+ year spans. A reference photo from 2010 may not match the same person in 2026 at conservative thresholds. Mitigation: include multiple reference images spanning different time periods.</li><li><strong>Low resolution</strong>: Faces below ~40px wide after detection don&apos;t produce reliable embeddings. We set <code>min_face_size: 40</code> as a hard floor.</li></ul><h3 id="custom-yolo-models">Custom YOLO Models</h3><p>The object detection stage supports custom model deployment. If your reference corpus is highly specialized &#x2014; say, you need to detect pharmaceutical packaging marks or specific regulatory symbols &#x2014; you can train a custom YOLO model and upload it as a ZIP file to the platform. The inference service loads it on-demand.</p><p>For generic logo detection, YOLOv8x-WorldV2&apos;s open-vocabulary capability is sufficient. You specify the classes you care about at query time:</p><pre><code class="language-python"># Open-vocabulary object detection &#x2014; no retraining needed
{
    &quot;feature_extractor_name&quot;: &quot;object_detection&quot;,
    &quot;parameters&quot;: {
        &quot;model&quot;: &quot;yolov8x-worldv2&quot;,
        &quot;classes&quot;: [&quot;Nike swoosh&quot;, &quot;McDonald&apos;s arches&quot;, &quot;Apple logo&quot;, &quot;brand logo&quot;]
    }
}
</code></pre><h3 id="the-reranker">The Reranker</h3><p>ANN search returns top-K candidates fast but approximate. For the final ranking, we run a cross-encoder reranker on the top results. The cross-encoder sees both the query and candidate simultaneously (not independently embedded), which allows it to capture fine-grained differences that dual-encoder models miss.</p><p>This is especially valuable for logos where the top-10 ANN results might include 3 genuine matches and 7 visually similar but legally distinct marks. The cross-encoder precision on this disambiguation step is what separates &quot;useful tool&quot; from &quot;alert fatigue generator.&quot;</p><hr><h2 id="the-dataset-challenge">The Dataset Challenge</h2><h3 id="building-the-reference-corpus">Building the Reference Corpus</h3><p>This is where we spent most of our time and where most teams underinvest.</p><p><strong>Faces:</strong> How many reference images per identity do you need? Our empirical finding: 5-10 quality reference images per person covers enough pose/lighting variation to achieve &gt;95% recall at FAR 1e-4. With only 1 reference image, recall drops to ~70%. Below 3, it&apos;s unreliable.</p><p>We built our initial corpus from FaceScrub (~530 identities, ~2,900 images after URL death) supplemented with Wikipedia Commons portraits. The URL attrition on FaceScrub is brutal &#x2014; it&apos;s a 2014 dataset and ~85% of the original URLs are dead. For production, you need a maintained reference database, not a research dataset.</p><pre><code class="language-python"># Corpus quality matters more than quantity
# These are our empirical numbers on the FaceScrub + Wikipedia corpus:
#
# References/identity | Recall@FAR=1e-4 | Notes
# ------------------- | ---------------- | -----
# 1                   | ~70%             | Single frontal portrait
# 3                   | ~88%             | Frontal + 2 varied poses
# 5                   | ~94%             | Diverse lighting/angle
# 10                  | ~97%             | Diminishing returns here
# 20+                 | ~98%             | Not worth the curation cost
</code></pre><p><strong>Logos:</strong> We use LogoDet-3K (158,654 images, 3,000 brands, MIT license) as the base. The critical preprocessing step: LogoDet-3K uses numeric company IDs, not brand names. You must resolve the ID-to-brand mapping before ingestion, or your metadata says &quot;Food/12345&quot; instead of &quot;McDonald&apos;s.&quot; We burned half a day on this.</p><p>For logos, variation coverage matters: you need the logo on white, on dark backgrounds, in color, in grayscale, at multiple scales, and ideally in real-world context (storefront, product packaging, screen captures). A single clean vector logo file is insufficient as a reference.</p><h3 id="the-cold-start-problem">The Cold Start Problem</h3><p>When you first deploy, your reference corpus is whatever you curated. It doesn&apos;t cover edge cases &#x2014; unusual lighting conditions, rare logo variants, faces that are only partially visible. The system&apos;s recall on day one is measurably worse than on day 30.</p><p>The fix is interaction feedback, which feeds directly into the continuous learning loop described in the next section.</p><hr><h2 id="continuous-learning-the-interaction-loop">Continuous Learning: The Interaction Loop</h2><p>This is the part that actually matters long-term, and it&apos;s the part most blog posts about ML pipelines skip.</p><p>Every user interaction with the system generates a signal:</p><ul><li><strong>Click</strong> on a result &#x2192; implicit positive signal</li><li><strong>Skip</strong> a result &#x2192; weak negative signal</li><li><strong>Long view</strong> (&gt;3s on a result) &#x2192; implicit positive signal</li><li><strong>Explicit feedback</strong> &#x2192; &quot;correct match&quot; / &quot;false positive&quot; / &quot;missed match&quot;</li><li><strong>Threshold override</strong> &#x2192; analyst manually approves/rejects at a specific confidence level</li></ul><p>These signals are captured by Mixpeek&apos;s interaction tracking and stored in ClickHouse for analytics. The analytics endpoints expose confidence distributions and signal patterns per retriever:</p><pre><code class="language-python"># Analyzing signal patterns for a retriever
import requests
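# Assumes api_key and retriever_id are already defined for your namespace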

# Get confidence distribution &#x2014; where are matches landing?
response = requests.get(
    f&quot;https://api.mixpeek.com/v1/analytics/retrievers/{retriever_id}/confidence&quot;,
    headers={&quot;Authorization&quot;: f&quot;Bearer {api_key}&quot;}
)

# Returns histogram of match confidence scores
# If there&apos;s a bimodal distribution with a gap at 0.35,
# that&apos;s your natural threshold &#x2014; not the default 0.28.

# Get signal breakdown &#x2014; what&apos;s being confirmed vs. rejected?
signals = requests.get(
    f&quot;https://api.mixpeek.com/v1/analytics/retrievers/{retriever_id}/signals&quot;,
    headers={&quot;Authorization&quot;: f&quot;Bearer {api_key}&quot;}
)
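
# Hypothetical follow-up: derive a candidate threshold from the histogram.
# Assumes the confidence response exposes parallel &quot;bins&quot; and &quot;counts&quot;
# arrays; check the actual payload shape in the analytics docs before using.
hist = response.json()
bins, counts = hist[&quot;bins&quot;], hist[&quot;counts&quot;]

# The emptiest interior bin of a bimodal score distribution marks the natural
# cutoff described above (e.g. a gap at 0.35 rather than the default 0.28).
valley = min(range(1, len(counts) - 1), key=lambda i: counts[i])
candidate_threshold = bins[valley]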
</code></pre><p>Over time, these signals inform three adjustments:</p><ol><li><strong>Threshold tuning</strong>: If analysts are consistently rejecting matches above 0.28 but below 0.35, raise the threshold. If they&apos;re marking missed matches in the 0.22-0.28 range, consider lowering it for specific corpora.</li><li><strong>Fusion weight adjustment</strong>: The retriever combines face, logo, and audio signals with configurable weights. If logo matches have a higher false-positive rate than face matches in your domain, down-weight them.</li><li><strong>Reference corpus expansion</strong>: When a new face or logo is confirmed as a genuine match but wasn&apos;t in the reference corpus, add it. This is how the system improves its recall over time without model retraining.</li></ol><p>The system doesn&apos;t auto-adjust thresholds (that would be terrifying for a compliance tool). It surfaces the data; a human makes the call. But having the data surface automatically, instead of discovering your false-positive rate through legal complaints, is the difference between a proactive tool and an expensive audit.</p><hr><h2 id="performance-numbers">Performance Numbers</h2><p>Measured on our production deployment with a corpus of ~3,000 brand logos and ~530 face identities:</p>
<!--kg-card-begin: html-->
<table>
<thead>
<tr><th>Input Type</th><th>Latency (p50)</th><th>Latency (p95)</th><th>Notes</th></tr>
</thead>
<tbody>
<tr><td>Single image</td><td>220ms</td><td>480ms</td><td>Face + logo detection + search</td></tr>
<tr><td>Video (30s, ~5 scenes)</td><td>2.1s</td><td>3.8s</td><td>Includes scene splitting</td></tr>
<tr><td>Video (60s, ~12 scenes)</td><td>4.3s</td><td>7.2s</td><td>Scales linearly with scenes</td></tr>
<tr><td>Audio-only check</td><td>180ms</td><td>350ms</td><td>30s clip fingerprint</td></tr>
<tr><td>Batch (1000 images)</td><td>~4 min</td><td>~7 min</td><td>Parallelized across workers</td></tr>
</tbody>
</table>
<!--kg-card-end: html-->
<p>False positive rates at the default thresholds:</p>
<!--kg-card-begin: html-->
<table>
<thead>
<tr><th>Detection Layer</th><th>Threshold</th><th>FPR</th><th>Recall</th></tr>
</thead>
<tbody>
<tr><td>Face (ArcFace)</td><td>cosine &gt;= 0.28</td><td>~0.01%</td><td>~94%</td></tr>
<tr><td>Logo (SigLIP)</td><td>cosine &gt;= 0.75</td><td>~3%</td><td>~94%</td></tr>
<tr><td>Logo (pHash)</td><td>hamming &lt;= 8</td><td>~0.1%</td><td>~60%</td></tr>
<tr><td>Logo (combined)</td><td>either above</td><td>~3.1%</td><td>~97%</td></tr>
</tbody>
</table>
<!--kg-card-end: html-->
<p>The logo FPR is higher because visually similar but legally distinct marks are common (think: any swoosh-like shape near the Nike threshold). The reranker brings this down to ~0.5% in practice, but we report the pre-reranker numbers because that&apos;s what you&apos;ll see before adding the cross-encoder stage.</p><hr><h2 id="what-wed-do-differently">What We&apos;d Do Differently</h2><p>Having built this and deployed it, here&apos;s what we&apos;d change if starting over:</p><h3 id="1-start-with-perceptual-hashing-add-embeddings-second">1. Start with perceptual hashing, add embeddings second</h3><p>pHash is trivial to implement, runs in microseconds, and catches ~60% of real-world logo violations (exact copies, resizes, JPEG re-compressions, minor crops). If you&apos;re building an MVP, ship pHash first and add the embedding pipeline when you need to catch stylized variants. We built the embedding pipeline first because it was more interesting, which was a mistake.</p><h3 id="2-invest-in-reference-corpus-quality-before-model-selection">2. Invest in reference corpus quality before model selection</h3><p>We spent weeks evaluating ArcFace vs. AdaFace vs. CosFace when the actual bottleneck was that 40% of our FaceScrub reference URLs were dead. 10 good reference images per identity with a mediocre model beats 1 reference image with SOTA. This isn&apos;t a theoretical observation &#x2014; our recall jumped 12 percentage points from corpus cleanup alone, without touching the model.</p><h3 id="3-build-the-feedback-loop-from-day-one">3. Build the feedback loop from day one</h3><p>We added interaction tracking as an afterthought after the first deployment. This meant three weeks of production data with no signal capture &#x2014; three weeks of analyst corrections that were lost. The feedback loop is what makes the system improve over time. Bolt it in from the start, even if the analytics dashboard comes later.</p><h3 id="4-dont-underestimate-the-logo-disambiguation-problem">4. Don&apos;t underestimate the logo disambiguation problem</h3><p>There are thousands of logos that are vaguely circular with a swoosh-like element. At embedding similarity 0.7, the Nike swoosh matches dozens of unrelated marks. The cross-encoder reranker exists because of this problem. If your use case involves logos at all, budget for a reranking stage &#x2014; you will need it.</p><h3 id="5-video-is-a-multiplier-not-a-different-problem">5. Video is a multiplier, not a different problem</h3><p>The per-frame detection is identical to image detection. The only video-specific work is scene splitting and deduplication (the same face appearing across multiple frames of the same scene should not generate multiple alerts). We initially over-engineered the video pipeline before realizing it&apos;s just &quot;image pipeline + scene splitting + dedup.&quot;</p><hr><h2 id="try-it">Try It</h2><p>The demo is live at <a href="https://copyright.mixpeek.com/?ref=blog.mixpeek.com">copyright.mixpeek.com</a>. Upload an image or video and see the face/logo/audio detection results in real time.</p><p>The API is documented at <a href="https://mixpeek.com/docs?ref=blog.mixpeek.com">mixpeek.com/docs</a>. 
If you&apos;re building a pre-publication clearance pipeline, the key endpoints are:</p><ul><li><strong>Buckets</strong> for ingesting both reference corpora and content submissions</li><li><strong>Collections</strong> with face_identity, object_detection, and audio extractors for processing</li><li><strong>Retrievers</strong> with multi-stage search for running clearance checks</li></ul><p>If you&apos;re building something similar and hit a wall, the <a href="https://mixpeek.com/docs/tutorials?ref=blog.mixpeek.com">tutorial</a> walks through the full pipeline setup from corpus ingestion to retriever configuration.</p><p>We also open-sourced the demo frontend &#x2014; it&apos;s a React app that calls the Mixpeek API. Clone it, point it at your namespace, and you have a working IP clearance UI.</p>]]></content:encoded></item><item><title><![CDATA[I Benchmarked 5 Video Embedding Models So You Don't Have To]]></title><description><![CDATA[We tested Gemini, Twelve Labs Marengo, X-CLIP, SigLIP 2, and InternVideo2 on text-to-video retrieval with graded relevance. The results surprised us.]]></description><link>http://blog.mixpeek.com/video-embedding-benchmark-2026/</link><guid isPermaLink="false">69b30e973baecafdb7f8cd58</guid><category><![CDATA[Benchmark]]></category><category><![CDATA[Video Embeddings]]></category><category><![CDATA[Multi-Modal AI]]></category><category><![CDATA[Information Retrieval]]></category><category><![CDATA[Research]]></category><dc:creator><![CDATA[Ethan Steininger]]></dc:creator><pubDate>Thu, 12 Mar 2026 21:40:15 GMT</pubDate><media:content url="http://blog.mixpeek.com/content/images/2026/03/video-embedding-benchmark-2026.png" medium="image"/><content:encoded><![CDATA[<img src="http://blog.mixpeek.com/content/images/2026/03/video-embedding-benchmark-2026.png" alt="I Benchmarked 5 Video Embedding Models So You Don&apos;t Have To"><p>At <a href="https://mixpeek.com/?ref=blog.mixpeek.com">Mixpeek</a>, we process video embeddings at scale for our customers&apos; retrieval pipelines. When Google dropped <a href="https://ai.google.dev/gemini-api/docs/embeddings?ref=blog.mixpeek.com">Gemini Embedding 2</a> &#x2014; claiming to unify text, image, video, and audio in one embedding space &#x2014; we needed to know: does it actually work for video retrieval? And how does it compare to purpose-built alternatives?</p><p>So we built a benchmark. Not a synthetic one with cherry-picked examples &#x2014; a proper IR evaluation with graded relevance &#x2014; dataset, code, and results all <a href="https://github.com/mixpeek/video-embedding-benchmark?ref=blog.mixpeek.com">open-sourced on GitHub</a> &#x2014; following the same methodology as <a href="https://github.com/beir-cellar/beir?ref=blog.mixpeek.com">BEIR</a> and <a href="https://huggingface.co/spaces/mteb/leaderboard?ref=blog.mixpeek.com">MTEB</a>. Twenty CC0 videos, sixty queries, six models, reproducible results.</p><p>Here&apos;s what we found.</p><h2 id="the-results">The Results</h2>
<!--kg-card-begin: html-->
<table>
<thead>
<tr><th>Model</th><th>Dims</th><th>NDCG@5</th><th>NDCG@10</th><th>R@1</th><th>R@5</th><th>MRR</th><th>Latency</th><th>Type</th></tr>
</thead>
<tbody>
<tr><td><strong>Gemini Embedding 2</strong></td><td>3072</td><td>0.697</td><td><strong>0.769</strong></td><td>0.200</td><td>0.717</td><td>0.896</td><td>2,458ms</td><td>API</td></tr>
<tr><td><strong>Marengo 2.7</strong>*</td><td>1024</td><td><strong>0.721</strong></td><td>0.760</td><td><strong>0.250</strong></td><td><strong>0.743</strong></td><td><strong>1.000</strong></td><td>18,148ms</td><td>API</td></tr>
<tr><td><strong>Mixedbread Wholembed v3</strong>**</td><td>ColBERT</td><td>0.644</td><td>0.757</td><td>0.216</td><td>0.649</td><td>0.932</td><td>500ms</td><td>API (Stores)</td></tr>
<tr><td>X-CLIP Base</td><td>512</td><td>0.327</td><td>0.470</td><td>0.067</td><td>0.367</td><td>0.520</td><td>192ms</td><td>Local</td></tr>
<tr><td>SigLIP 2 SO400M</td><td>1152</td><td>0.202</td><td>0.325</td><td>0.075</td><td>0.237</td><td>0.466</td><td>636ms</td><td>Local</td></tr>
<tr><td>InternVideo2 6B</td><td>768</td><td>0.186</td><td>0.302</td><td>0.046</td><td>0.237</td><td>0.405</td><td>24,817ms</td><td>Local</td></tr>
</tbody>
</table>
<!--kg-card-end: html-->
<p><em>* Marengo results based on 34/60 queries due to Twelve Labs free-tier rate limits. ** Mixedbread results based on 37/60 queries due to rate limiting. Uses Stores API (ColBERT-style late interaction).</em></p><h2 id="what-these-metrics-mean-and-why-you-should-care">What These Metrics Mean (and Why You Should Care)</h2><p>If you&apos;re building any kind of search or retrieval system over video, these numbers tell you how often your users will find what they&apos;re looking for. Let me break them down:</p><p><strong>NDCG@K (Normalized Discounted Cumulative Gain)</strong> is the primary metric. It measures ranking quality with graded relevance &#x2014; not just &quot;did you find it?&quot; but &quot;did you rank the best match highest?&quot; An NDCG@10 of 0.769 means Gemini gets the ranking mostly right across the top 10 results. An NDCG@10 of 0.302 (InternVideo2) means the ranking is barely better than random.</p><p><strong>MRR (Mean Reciprocal Rank)</strong> answers: &quot;Where does the first correct result appear?&quot; Marengo&apos;s perfect 1.000 means the right video was <em>always</em> ranked #1. Every single query. Gemini&apos;s 0.896 means the right answer is usually in the top 2. X-CLIP&apos;s 0.520 means you&apos;re typically scrolling to position 2-3 before finding something relevant.</p><p><strong>Recall@K</strong> tells you what fraction of relevant results appear in the top K. At R@5, Marengo retrieves 74% of relevant videos in the top 5 results. InternVideo2 only gets 24%. If you&apos;re building a search UI that shows 5 results per page, that&apos;s the difference between useful and useless.</p><p><strong>Latency</strong> is wall-clock time to embed one video. X-CLIP does it in 192ms locally. Marengo takes 18 seconds through their API. That&apos;s a 94x difference. For batch processing, this might not matter. For real-time applications, it&apos;s a dealbreaker.</p><h2 id="the-surprising-takeaways">The Surprising Takeaways</h2><h3 id="1-three-api-models-in-a-tight-race">1. Three API models in a tight race</h3><p>The top 3 &#x2014; Gemini, Marengo, and Mixedbread &#x2014; cluster tightly at NDCG@10 0.757&#x2013;0.769. Marengo&apos;s perfect MRR is genuinely impressive (it never puts the wrong video first), while Mixedbread&apos;s ColBERT-style late interaction achieves 0.932 MRR with a very different architecture. But Gemini is close on all metrics and handles text, images, audio, and documents too. For most teams, Gemini&apos;s versatility at 7x faster latency probably wins. Mixedbread&apos;s Stores approach is interesting &#x2014; you upload videos once and search via API. No embedding vectors to manage, no vector DB needed.</p><h3 id="2-model-size-doesnt-predict-quality">2. Model size doesn&apos;t predict quality</h3><p>InternVideo2 has 6 billion parameters. X-CLIP has ~150 million. X-CLIP beats InternVideo2 on every single metric by a wide margin (0.470 vs 0.302 NDCG@10). The reason: InternVideo2&apos;s Stage2 checkpoint is optimized for multimodal pretraining, not zero-shot retrieval. Architecture and training objective matter more than parameter count.</p><h3 id="3-frame-averaging-is-a-dead-end">3. Frame averaging is a dead end</h3><p>SigLIP 2 is a fantastic image encoder. But sampling 8 frames and averaging their embeddings gives you 0.325 NDCG@10 &#x2014; barely above InternVideo2&apos;s pretrained checkpoint. Video is not a bag of frames. Temporal structure &#x2014; what happens <em>between</em> frames &#x2014; carries critical information for retrieval. 
X-CLIP&apos;s cross-frame attention proves this: same number of frames, 1.4x better results.</p><h3 id="4-the-api-vs-self-hosted-gap-is-2x">4. The API vs. self-hosted gap is 2x</h3><p>All three API models (Gemini, Marengo, Mixedbread) score 0.75+ NDCG@10. The best open-source model (X-CLIP) scores 0.470. That&apos;s a 2x quality gap. If you need high-quality video retrieval today, you&apos;re paying for an API. The open-source video embedding space is still immature.</p><h2 id="what-this-means-for-your-architecture">What This Means For Your Architecture</h2><h3 id="if-youre-building-a-search-product">If you&apos;re building a search product:</h3><p>Use Gemini Embedding 2. It has the best balance of quality, latency, and cost (free tier covers 1K videos/day). The 3072-dim vectors are large, but you get Matryoshka support &#x2014; truncate to 768 dims with minimal quality loss. Marengo is slightly better on retrieval but 7x slower and costs $0.033/min.</p><h3 id="if-latency-matters-more-than-quality">If latency matters more than quality:</h3><p>X-CLIP runs locally in 192ms on consumer hardware. At 0.470 NDCG@10, it&apos;s good enough for recommendation systems, deduplication, or coarse-grained search where you can refine results downstream.</p><h3 id="if-youre-evaluating-at-scale">If you&apos;re evaluating at scale:</h3><p>Don&apos;t use InternVideo2 for retrieval. Despite the hype, the Stage2 checkpoint isn&apos;t designed for zero-shot embedding similarity. If you need an open-source model with &gt;1B params, wait for a contrastive-tuned variant or fine-tune it yourself.</p><h3 id="if-youre-at-mixpeek">If you&apos;re at Mixpeek:</h3><p>This benchmark directly informs our pipeline. We&apos;re integrating Gemini Embedding 2 as a first-class embedding option alongside our existing extractors. The quality-to-latency ratio is unmatched, and the multimodal unification means our customers can search across video, images, and documents with a single model.</p><h2 id="methodology">Methodology</h2><p>We want this to be reproducible. Here&apos;s exactly what we did:</p><ul><li><strong>Dataset:</strong> 20 CC0 videos from Pexels across 5 categories (sports, cooking, nature, urban, technology). All normalized to 640x360, 24fps, 10s max, H.264.</li><li><strong>Queries:</strong> 60 text queries with graded relevance (0/1/2), three per video: exact match, partial match, and hard negative (semantically adjacent but wrong domain).</li><li><strong>Metrics:</strong> Standard IR evaluation following BEIR/MTEB conventions &#x2014; NDCG@K with graded relevance as the primary metric.</li><li><strong>Embedding:</strong> All vectors L2-normalized. Retrieval by cosine similarity. Random seed fixed at 42.</li><li><strong>Frame-based models:</strong> 8 frames uniformly sampled (SigLIP, X-CLIP) or 4 frames (InternVideo2). API models process the full video.</li></ul><h2 id="hard-negative-performance">Hard Negative Performance</h2><p>We specifically designed queries to confuse models &#x2014; e.g., &quot;A technician carefully assembling small electronic parts by hand&quot; for a cooking video (both involve precise hand movements). This tests whether models understand semantics or just match visual patterns.</p>
<!--kg-card-begin: html-->
<table>
<thead>
<tr><th>Model</th><th>NDCG@1</th><th>NDCG@5</th><th>MRR</th></tr>
</thead>
<tbody>
<tr><td>Marengo 2.7*</td><td><strong>1.000</strong></td><td><strong>0.846</strong></td><td><strong>1.000</strong></td></tr>
<tr><td>Mixedbread Wholembed v3**</td><td><strong>1.000</strong></td><td>0.802</td><td><strong>1.000</strong></td></tr>
<tr><td>Gemini Embedding 2</td><td>0.800</td><td>0.790</td><td>0.900</td></tr>
<tr><td>SigLIP 2 SO400M</td><td>0.400</td><td>0.236</td><td>0.556</td></tr>
<tr><td>X-CLIP Base</td><td>0.200</td><td>0.322</td><td>0.467</td></tr>
<tr><td>InternVideo2 6B</td><td>0.200</td><td>0.220</td><td>0.423</td></tr>
</tbody>
</table>
<!--kg-card-end: html-->
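<p>For reference, here is how NDCG with graded relevance and MRR are typically computed. This is a minimal sketch of the common formulas, not the benchmark repo&apos;s exact code:</p><pre><code class="language-python">import math

def dcg_at_k(ranked_rels, k):
    &quot;&quot;&quot;ranked_rels: graded relevance (0/1/2) of results, in ranked order.
    Uses the exponential-gain convention; some implementations use linear
    gain (rel / log2(rank + 1)) instead.&quot;&quot;&quot;
    return sum((2 ** r - 1) / math.log2(i + 2) for i, r in enumerate(ranked_rels[:k]))

def ndcg_at_k(ranked_rels, k):
    ideal = sorted(ranked_rels, reverse=True)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(ranked_rels, k) / idcg if idcg &gt; 0 else 0.0

def mrr(ranked_rels):
    &quot;&quot;&quot;Reciprocal rank of the first relevant (grade &gt; 0) result.&quot;&quot;&quot;
    for rank, rel in enumerate(ranked_rels, start=1):
        if rel &gt; 0:
            return 1.0 / rank
    return 0.0
</code></pre>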
<p>Marengo and Mixedbread were never fooled &#x2014; both achieve perfect NDCG@1 and MRR on hard negatives. Gemini was fooled once. The open-source models were confused frequently. This is arguably the most important test for production retrieval &#x2014; false positives in search results destroy user trust.</p><h2 id="cost-comparison">Cost Comparison</h2>
<!--kg-card-begin: html-->
<table>
<thead>
<tr><th>Model</th><th>Pricing</th><th>Est. Cost / 1K Videos</th><th>Vector Storage</th></tr>
</thead>
<tbody>
<tr><td>Gemini Embedding 2</td><td>Free tier (1K/day)</td><td>~$0</td><td>12,288 B/vec</td></tr>
<tr><td>Marengo 2.7</td><td>$0.033/min</td><td>~$5-15</td><td>4,096 B/vec</td></tr>
<tr><td>Mixedbread Wholembed v3</td><td>Free tier, then per-token</td><td>~$0</td><td>N/A (server-side)</td></tr>
<tr><td>X-CLIP Base</td><td>Self-hosted</td><td>GPU cost only</td><td>2,048 B/vec</td></tr>
<tr><td>SigLIP 2</td><td>Self-hosted</td><td>GPU cost only</td><td>4,608 B/vec</td></tr>
<tr><td>InternVideo2 6B</td><td>Self-hosted</td><td>GPU cost only</td><td>3,072 B/vec</td></tr>
</tbody>
</table>
<!--kg-card-end: html-->
<p>Gemini at free tier is almost unfair. For most teams processing &lt;1K videos/day, the cost is literally zero. Marengo&apos;s per-minute pricing adds up fast for large video libraries but might be worth it if retrieval quality is your north star metric.</p><h2 id="reproduce-it-yourself">Reproduce It Yourself</h2><p>Everything is open source &#x2014; code, dataset (all 20 videos), and results &#x2014; in a single repo:</p><p><a href="https://github.com/mixpeek/video-embedding-benchmark?ref=blog.mixpeek.com"><strong>github.com/mixpeek/video-embedding-benchmark</strong></a></p><p><strong>What&apos;s in the repo:</strong></p><ul><li>All 20 CC0 videos included directly (~13MB) &#x2014; no separate download step</li><li><code>data/queries.json</code> with 60 graded-relevance queries</li><li>Pre-computed results for all 6 models in <code>results/</code></li><li>Full benchmark + adapter code for every model</li><li>Clone and run <code>python report.py</code> to verify our numbers &#x2014; no API keys needed</li></ul><pre><code>git clone https://github.com/mixpeek/video-embedding-benchmark.git
cd video-embedding-benchmark

# Videos are already in the repo &#x2014; no download needed
ls data/videos/

# Run individual models
python benchmark.py --model gemini      # needs GEMINI_API_KEY
python benchmark.py --model xclip       # runs locally
python benchmark.py --model siglip      # runs locally
python benchmark.py --model mixedbread  # needs MIXEDBREAD_API_KEY

# Generate comparison report from pre-computed results
python report.py
</code></pre><p>We&apos;ll update this post as we complete the remaining Marengo queries and add Amazon Nova Multimodal to the benchmark.</p><h2 id="whats-next">What&apos;s Next</h2><p>This benchmark covers retrieval quality, but that&apos;s only one dimension. We&apos;re planning to extend it with:</p><ul><li><strong>Longer videos</strong> &#x2014; 30s, 60s, 5min clips to test how models degrade with length</li><li><strong>Domain-specific evaluation</strong> &#x2014; medical, security, retail video datasets</li><li><strong>Cross-modal retrieval</strong> &#x2014; image-to-video, video-to-video search</li><li><strong>Matryoshka dimension scaling</strong> &#x2014; how much quality do you lose at 256d vs 3072d?</li></ul><p>If you&apos;re working on video search or embeddings and want to collaborate on the benchmark, reach out. We&apos;re at <a href="https://mixpeek.com/?ref=blog.mixpeek.com">mixpeek.com</a> or <a href="https://github.com/mixpeek?ref=blog.mixpeek.com">@mixpeek on GitHub</a>.</p><hr><p><em>Ethan Steininger is the founder of </em><a href="https://mixpeek.com/?ref=blog.mixpeek.com"><em>Mixpeek</em></a><em>, a multimodal processing platform for video, image, text, and audio understanding at scale.</em></p>]]></content:encoded></item><item><title><![CDATA[Gemini Embedding 2 is Live: embed multiple files into one vector]]></title><description><![CDATA[Google's Gemini Embedding 2 embeds images, PDFs, and text together in a single API call. Here's how we integrated it into Mixpeek's feature extractor pipeline, the production numbers, and where multi-file embedding beats single-chunk approaches.]]></description><link>http://blog.mixpeek.com/gemini-embedding-2-multifile/</link><guid isPermaLink="false">69b2ec683baecafdb7f8cd08</guid><category><![CDATA[Engineering]]></category><category><![CDATA[Embeddings]]></category><category><![CDATA[Gemini]]></category><category><![CDATA[Multimodal]]></category><category><![CDATA[Feature Extractors]]></category><dc:creator><![CDATA[Ethan Steininger]]></dc:creator><pubDate>Thu, 12 Mar 2026 16:40:08 GMT</pubDate><media:content url="http://blog.mixpeek.com/content/images/2026/03/feature-image-2.png" medium="image"/><content:encoded><![CDATA[<img src="http://blog.mixpeek.com/content/images/2026/03/feature-image-2.png" alt="Gemini Embedding 2 is Live: embed multiple files into one vector"><p>Google shipped <strong>Gemini Embedding 2</strong> (<code>gemini-embedding-exp-03-07</code>, 3072-d) last week (<a href="https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-embedding-2/?ref=blog.mixpeek.com" rel="noreferrer">announcement</a>). The headline number is the dimensionality. The part that actually matters is buried in the announcement: it does multi-modal embedding natively &#x2014; images, PDFs, audio, and text in a single API call, producing one vector that represents all of them together.</p><p>That&apos;s genuinely new. CLIP and SigLIP give you one modality per call and leave you to figure out late fusion. Vertex Multimodal gives you image+text but not PDF, not audio. Gemini Embedding 2 takes whatever you throw at it and returns a single 3072-d float array. No fusion logic, no alignment head you have to train yourself.</p><p>We integrated it into Mixpeek this week. 
Here&apos;s how it works, what it&apos;s good for, and the actual production numbers.</p><hr><h2 id="quick-context-what-feature-extractors-and-retrievers-are">Quick context: what feature extractors and retrievers are</h2><p>If you haven&apos;t used Mixpeek before:</p><p>A <strong>feature extractor</strong> is a processing pipeline that runs when objects land in a bucket. You point it at blob properties on your objects (image URLs, text fields, PDF attachments) and it writes embeddings into a vector index attached to the namespace. You can have multiple extractors per namespace &#x2014; one for text, one for images, one for multi-modal &#x2014; each writing its own named vector. Docs: <a href="https://docs.mixpeek.com/processing/feature-extractors?ref=blog.mixpeek.com">docs.mixpeek.com/processing/feature-extractors</a>.</p><p>A <strong>retriever</strong> is a query pipeline. You define what inputs it takes, which feature indices to search, and how to fuse and rank results. At query time it embeds your input using the same model that was used at ingest, runs ANN search, and returns ranked documents. The embedding step at query time is what we call the <em>realtime path</em> &#x2014; it runs inline during the request, not in a batch job. Docs: <a href="https://docs.mixpeek.com/retrieval/retrievers?ref=blog.mixpeek.com">docs.mixpeek.com/retrieval/retrievers</a>.</p><hr><h2 id="the-problem-with-one-chunk-per-embedding">The problem with one chunk per embedding</h2><p>Most embedding workflows treat objects as a single blob: extract text, embed it, done. If the object also has an image you embed the image separately and late-fuse at query time, or you throw away the image entirely.</p><p>This is fine for simple retrieval but breaks in a few important ways:</p><ul><li><strong>Product catalogs.</strong> A product is a hero image + a spec sheet PDF + a description. If you embed these separately, a query for &quot;lightweight carbon fiber trail shoe&quot; matches on text but has no idea the product also has a sole pattern image that&apos;s strongly correlated with &quot;trail.&quot; The separate embeddings can&apos;t cross-reference each other.</li><li><strong>Documents with figures.</strong> Research papers, technical reports, slide decks. The figure on page 4 is context for the paragraph below it. Embedding the figure and the paragraph independently loses that relationship. You&apos;d need to chunk and cross-reference manually.</li><li><strong>Brand and compliance monitoring.</strong> You want to know if a video frame + its surrounding caption jointly violate a guideline. A text-only check misses visual context; an image-only check misses the caption spin.</li><li><strong>Anything with metadata that changes the semantics.</strong> A photo of a person means different things with the caption &quot;CEO of Acme Corp&quot; vs &quot;wanted for fraud.&quot; Text and image together carry meaning that neither carries alone.</li></ul><p>The standard workaround is to embed everything separately and hope your late fusion weights are right. With Gemini Embedding 2 you can skip that entirely: pass all of it in one call, get one vector, store one point in Qdrant per object. At query time, pass your query image + your query text in one call. The model figures out the cross-modal alignment.</p><hr><h2 id="how-it-works-in-mixpeek">How it works in Mixpeek</h2><p>The extractor is called <code>gemini_multifile_extractor</code>. 
You configure it with an <code>input_mappings</code> block that lists which blob properties to embed together. All listed properties are collected per object, fetched (URLs are downloaded, presigned if S3), and sent to Gemini in a single <code>embed_content</code> call.</p><pre><code class="language-json">{
  &quot;feature_extractor_name&quot;: &quot;gemini_multifile_extractor&quot;,
  &quot;version&quot;: &quot;v1&quot;,
  &quot;input_mappings&quot;: {
    &quot;files&quot;: [&quot;hero_image&quot;, &quot;spec_sheet&quot;, &quot;description&quot;]
  },
  &quot;params&quot;: {
    &quot;output_dimensionality&quot;: 3072,
    &quot;task_type&quot;: &quot;RETRIEVAL_DOCUMENT&quot;
  }
}</code></pre><p>The <code>files</code> key is an array of blob property names on your objects. At ingest time, Mixpeek&apos;s Ray Data pipeline calls <code>get_content_list()</code> to resolve each property &#x2014; downloading binaries, passing text strings as-is &#x2014; then builds a <code>Part[]</code> array and fires one Gemini API call per object. The result is a single 3072-d vector stored as a named vector in Qdrant.</p><p>Key things the extractor writes to the document payload:</p><ul><li><code>source_blob_count</code> &#x2014; how many blobs were embedded</li><li><code>source_blob_properties</code> &#x2014; which properties contributed</li><li><code>gemini_multifile_extractor_v1_embedding</code> &#x2014; the 3072-d vector</li></ul><p>Full end-to-end setup guide: <a href="https://docs.mixpeek.com/processing/extractors/gemini-multifile?ref=blog.mixpeek.com">docs.mixpeek.com/processing/extractors/gemini-multifile</a></p><h3 id="bucket-schema-with-multi-blob-objects">Bucket schema with multi-blob objects</h3><pre><code class="language-json">curl -X POST https://api.mixpeek.com/v1/buckets \
  -H &quot;Authorization: Bearer $API_KEY&quot; \
  -H &quot;X-Namespace: $NAMESPACE_ID&quot; \
  -H &quot;Content-Type: application/json&quot; \
  -d &apos;{
    &quot;bucket_name&quot;: &quot;products&quot;,
    &quot;schema&quot;: {
      &quot;product_id&quot;:   { &quot;type&quot;: &quot;string&quot; },
      &quot;product_name&quot;: { &quot;type&quot;: &quot;string&quot; },
      &quot;description&quot;:  { &quot;type&quot;: &quot;text&quot; },
      &quot;hero_image&quot;:   { &quot;type&quot;: &quot;image&quot; },
      &quot;spec_sheet&quot;:   { &quot;type&quot;: &quot;document&quot; }
    }
  }&apos;</code></pre><h3 id="uploading-an-object-with-multiple-blobs">Uploading an object with multiple blobs</h3><pre><code class="language-json">curl -X POST https://api.mixpeek.com/v1/buckets/$BUCKET_ID/objects \
  -H &quot;Authorization: Bearer $API_KEY&quot; \
  -H &quot;X-Namespace: $NAMESPACE_ID&quot; \
  -H &quot;Content-Type: application/json&quot; \
  -d &apos;{
    &quot;blobs&quot;: [
      { &quot;blob_property&quot;: &quot;hero_image&quot;,  &quot;url&quot;: &quot;https://cdn.example.com/shoe.jpg&quot; },
      { &quot;blob_property&quot;: &quot;spec_sheet&quot;,  &quot;url&quot;: &quot;s3://products/SKU-42/spec.pdf&quot; },
      { &quot;blob_property&quot;: &quot;description&quot;, &quot;text&quot;: &quot;Lightweight carbon-fiber trail shoe&quot; }
    ],
    &quot;metadata&quot;: { &quot;product_id&quot;: &quot;SKU-42&quot; }
  }&apos;</code></pre><p>One object. Three blobs. One embedding. That&apos;s the whole model.</p><hr><h2 id="production-numbers">Production numbers</h2><p>We ran this against a live namespace on GKE (<code>ns_8606a82b84</code>) with objects containing 2 blobs each (image URL + text). Here&apos;s what we measured:</p>
<!--kg-card-begin: html-->
<table>
<thead>
<tr><th>Path</th><th>Blobs/object</th><th>Gemini API calls</th><th>Vector dims</th><th>Embed latency</th><th>Result score</th></tr>
</thead>
<tbody>
<tr><td>Batch ingest</td><td>2</td><td>1 per object</td><td>3072</td><td>~800ms (Ray actor)</td><td>&#x2014;</td></tr>
<tr><td>Text query</td><td>&#x2014;</td><td>1 per request</td><td>3072</td><td>1,414ms</td><td>0.573</td></tr>
<tr><td>Multi-content query (image + text)</td><td>&#x2014;</td><td>1 per request</td><td>3072</td><td>2,898ms</td><td>0.573</td></tr>
</tbody>
</table>
<!--kg-card-end: html-->
<p>A few things to note:</p><ul><li><strong>One API call per object regardless of blob count.</strong> With 3 blobs you still pay one API call, not three. The cost scales with total token count of the content, not number of parts.</li><li><strong>Multi-content query latency doubles vs text-only.</strong> Makes sense &#x2014; you&apos;re downloading an image over the network before the Gemini call. Factor that into your SLA budget.</li><li><strong>Score is identical (0.573) for text-only and multi-content queries</strong> against these test documents. The test objects were both 2-blob objects. In production on real multi-modal data, multi-content queries should score higher on relevant results because you&apos;re matching on both modalities simultaneously.</li></ul><hr><h2 id="retrieval-text-queries-and-multi-content-queries">Retrieval: text queries and multi-content queries</h2><p>The retriever has two query modes for this extractor:</p><h3 id="text-query-most-common">Text query (most common)</h3><p>Pass a text string. The retriever embeds it via Gemini at request time and searches the multi-file vector index. This works because the multi-file vectors are trained to align across modalities &#x2014; your text query &quot;trail running shoe&quot; has nonzero similarity to vectors built from images + PDFs + descriptions of trail running shoes.</p><pre><code class="language-json">{
  &quot;stages&quot;: [{
    &quot;stage_type&quot;: &quot;filter&quot;,
    &quot;config&quot;: {
      &quot;stage_id&quot;: &quot;feature_search&quot;,
      &quot;parameters&quot;: {
        &quot;searches&quot;: [{
          &quot;feature_uri&quot;: &quot;mixpeek://gemini_multifile_extractor@v1/gemini-embedding-exp-03-07&quot;,
          &quot;query&quot;: {
            &quot;input_mode&quot;: &quot;text&quot;,
            &quot;text&quot;: &quot;{{INPUT.query}}&quot;
          },
          &quot;top_k&quot;: 10
        }]
      }
    }
  }]
}</code></pre><pre><code class="language-json"># Execute
curl -X POST https://api.mixpeek.com/v1/retrievers/$RETRIEVER_ID/execute \
  -H &quot;Authorization: Bearer $API_KEY&quot; \
  -H &quot;X-Namespace: $NAMESPACE_ID&quot; \
  -d &apos;{&quot;inputs&quot;: {&quot;query&quot;: &quot;trail running shoe carbon fiber&quot;}, &quot;settings&quot;: {&quot;limit&quot;: 5}}&apos;</code></pre><h3 id="multi-content-query-match-how-you-indexed">Multi-content query (match how you indexed)</h3><p>This is the interesting one. You pass multiple inputs &#x2014; a query image URL plus a text description &#x2014; and they&apos;re embedded together in one Gemini call, producing a query vector that was generated the same way as your indexed vectors. The query and the index are in the same space, built the same way.</p><pre><code class="language-json">{
  &quot;query&quot;: {
    &quot;input_mode&quot;: &quot;multi_content&quot;,
    &quot;values&quot;: [&quot;{{INPUT.image_url}}&quot;, &quot;{{INPUT.description}}&quot;]
  }
}</code></pre><pre><code class="language-json">curl -X POST https://api.mixpeek.com/v1/retrievers/$RETRIEVER_ID/execute \
  -H &quot;Authorization: Bearer $API_KEY&quot; \
  -H &quot;X-Namespace: $NAMESPACE_ID&quot; \
  -d &apos;{
    &quot;inputs&quot;: {
      &quot;image_url&quot;: &quot;https://example.com/query-shoe.jpg&quot;,
      &quot;description&quot;: &quot;trail running shoe carbon fiber&quot;
    }
  }&apos;</code></pre><p><code>values</code> accepts any mix of HTTP URLs, S3 URIs, and plain text strings. S3 URIs are presigned server-side before the Gemini call. You never need to handle fetching or encoding yourself.</p><hr><h2 id="embedding-migration-without-pain-how-namespaces-handle-it">Embedding migration without pain: how namespaces handle it</h2><p>One thing that doesn&apos;t get talked about enough when new embedding models ship: what happens to your existing index?</p><p>Gemini Embedding 2 vectors are not compatible with your old SigLIP or E5 vectors. You can&apos;t query them in the same index &#x2014; the spaces don&apos;t align. The na&#xEF;ve path is &quot;re-embed everything, rebuild your Qdrant collection, update all your retriever configs.&quot; That&apos;s a multi-hour job on a large corpus and you have a period where search is degraded.</p><p>Mixpeek namespaces are designed around this. A namespace is a single Qdrant collection, but it supports <strong>multiple named vectors</strong> per point &#x2014; one per feature extractor. When you add a new extractor to a namespace and trigger reprocessing, Mixpeek writes a new named vector field on each point without touching the existing ones. Your old retriever configs keep working against the old named vectors while the new ones are being populated.</p><p>Once you&apos;ve validated the new extractor&apos;s quality, you update your retriever to point at the new feature URI and you&apos;re done. The old named vectors continue to exist on the points &#x2014; they don&apos;t need to be cleaned up immediately &#x2014; and you can roll back by changing one retriever config field.</p><p>This is the right model for production embedding pipelines. New model ships &#x2192; add extractor &#x2192; let it populate in parallel &#x2192; cut over retriever &#x2192; validate &#x2192; done. No downtime, no search quality regression window, no re-indexing panic.</p><hr><h2 id="where-multi-file-embedding-actually-wins">Where multi-file embedding actually wins</h2><p>There are a few patterns where embedding multiple files together is clearly better than embedding each separately and fusing:</p><h3 id="1-products-with-image-spec-sheet-description">1. Products with image + spec sheet + description</h3><p>E-commerce is the obvious one. A query for &quot;waterproof boots for wide feet size 13&quot; should return boots that match on all three dimensions: the waterproofing is in the description, the width might be in the spec sheet table, and the boot style is in the hero image. Single-modality embeddings can match on any one of these but can&apos;t coherently match on all three simultaneously. Multi-file gives you one embedding that captures the conjunction.</p><p><strong>Measured uplift:</strong> In internal experiments on product search, recall@10 for queries combining visual + specification attributes improved 18-23% over text-only embeddings, and 12-15% over late fusion of separate image and text embeddings.</p><h3 id="2-research-papers-and-technical-documents">2. Research papers and technical documents</h3><p>Arxiv papers, technical reports, anything where figures are integral to the argument. If you embed figure 3 and the paragraph that discusses figure 3 separately, a query about the methodology in figure 3 will miss the connection unless you&apos;ve built explicit cross-reference logic. Embed them together and the model handles the alignment.</p><h3 id="3-video-frames-captions-metadata">3. 
Video frames + captions + metadata</h3><p>Video indexing at the segment level: a frame image + the ASR transcript of that segment + the scene metadata. Standard video search embeds the transcript and uses the image as a filter. Multi-file embedding makes the image part of the semantic search space, not just a filter value.</p><h3 id="4-brand-and-compliance-monitoring">4. Brand and compliance monitoring</h3><p>You want to know: does this social post (image + caption) jointly violate a brand safety guideline? Text-only checks miss visual context. Image-only checks miss caption framing. Embedding both together into a single space lets you do semantic search against a corpus of flagged examples that also have both image and text &#x2014; you&apos;re matching like with like.</p><h3 id="5-medical-and-legal-documents-with-embedded-charts">5. Medical and legal documents with embedded charts</h3><p>Pathology reports with embedded microscopy images. Legal briefs with embedded exhibit photos. The image isn&apos;t decoration &#x2014; it&apos;s part of the argument. Multi-file embedding captures that the image and the surrounding text are jointly meaningful.</p><hr><h2 id="custom-plugins">Custom plugins</h2><p>The <code>gemini_multifile_extractor</code> is a builtin. If you need to customize the preprocessing &#x2014; resizing images, extracting specific pages from PDFs, doing OCR before embedding, applying access controls &#x2014; you can deploy a custom plugin (enterprise feature, requires dedicated infrastructure).</p><p>Custom plugins are zip archives you upload to Mixpeek. They must have a <code>realtime.py</code> that implements <code>BaseInferenceService.infer()</code> &#x2014; this is called at query time, inline in the retriever request, to generate the query embedding. The plugin can call Gemini internally or any other embedding service.</p><pre><code class="language-python"># realtime.py &#x2014; minimal custom Gemini plugin
from engine.core.base import BaseInferenceService
from google import genai
from google.genai import types
import os

class MyGeminiPlugin(BaseInferenceService):
    async def infer(self, inputs: dict) -&gt; dict:
        client = genai.Client(api_key=os.environ[&quot;GEMINI_API_KEY&quot;])
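        # _build_parts is a helper (assumed here; defined on the plugin itself or
        # its base class) that turns the incoming file references and text into
        # genai content parts for the embed request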
        parts = self._build_parts(inputs.get(&quot;files&quot;, []))
        response = client.models.embed_content(
            model=&quot;models/gemini-embedding-2-preview&quot;,
            contents=parts,
            config=types.EmbedContentConfig(task_type=&quot;RETRIEVAL_QUERY&quot;),
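            # (assumption, depends on your google-genai version: EmbedContentConfig
            # also takes output_dimensionality, e.g. 768, if you want the smaller
            # vectors discussed in the gotchas below)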
        )
        return {&quot;embedding&quot;: list(response.embeddings[0].values)}</code></pre><p>The routing is automatic: if your vector index&apos;s <code>inference_service_id</code> starts with <code>custom_plugin_</code>, the retriever sends query-time embedding requests to your Ray Serve deployment. If it starts with <code>google/</code>, it calls the Gemini API directly without going through Ray Serve at all.</p><hr><h2 id="implementation-notes-and-gotchas">Implementation notes and gotchas</h2><p>A few things that bit us during development:</p><p><strong>Model name matters.</strong> <code>gemini-embedding-exp-03-07</code> is available on Google AI API (ai.google.dev) as <code>models/gemini-embedding-2-preview</code>. The Vertex AI endpoint requires <code>api_version=v1beta1</code> &#x2014; the GA endpoint 404s. We use <code>GEMINI_API_KEY</code> if present (Google AI API) and fall back to Vertex only if it&apos;s not set.</p><p><strong>inference_name normalization.</strong> Mixpeek normalizes <code>inference_service_id</code> strings via <code>service_id_to_deployment_name()</code>: slashes become double underscores, hyphens become underscores. So <code>google/gemini-embedding-exp-03-07</code> becomes <code>google__gemini_embedding_exp_03_07</code> in the database. If you&apos;re debugging retriever routing, this is why a <code>startswith(&quot;google/gemini-embedding&quot;)</code> check silently fails &#x2014; you need to check the normalized form too.</p><p><strong>Array input_mappings serialization.</strong> Ray Data/Arrow serializes Python lists as numpy arrays when passing through <code>map_batches</code>. Any code touching array-valued input_mappings columns needs a <code>if hasattr(items, &quot;tolist&quot;): items = items.tolist()</code> guard before iterating. This was a subtle batch processing bug that caused jobs to fail on the first attempt after pipeline startup.</p><p><strong>Task type for retrieval.</strong> Use <code>RETRIEVAL_DOCUMENT</code> at ingest and <code>RETRIEVAL_QUERY</code> at query time for best recall. If you&apos;re doing symmetric similarity (find products similar to this product), use <code>SEMANTIC_SIMILARITY</code> at both stages. Mismatching these degrades recall measurably &#x2014; in our tests, using <code>RETRIEVAL_DOCUMENT</code> at query time reduced recall@10 by ~8% vs <code>RETRIEVAL_QUERY</code>.</p><p><strong>Dimensionality reduction.</strong> Gemini Embedding 2 supports output dimensionality from 256 to 3072. Lower dimensions reduce storage and ANN search latency. Our testing showed recall@10 dropping ~3% at 768-d vs 3072-d, and ~9% at 256-d. 
For most production workloads 768-d is a reasonable tradeoff &#x2014; it halves the Qdrant memory footprint.</p><hr><h2 id="full-end-to-end-walkthrough">Full end-to-end walkthrough</h2><p>The complete guide with working curl commands covering bucket setup &#x2192; multi-blob upload &#x2192; collection config &#x2192; batch processing &#x2192; retriever creation &#x2192; both query modes is at:</p><p><a href="https://docs.mixpeek.com/processing/extractors/gemini-multifile?ref=blog.mixpeek.com"><strong>docs.mixpeek.com/processing/extractors/gemini-multifile</strong></a></p><p>Related docs:</p><ul><li><a href="https://docs.mixpeek.com/processing/feature-extractors?ref=blog.mixpeek.com">Feature Extractors overview</a></li><li><a href="https://docs.mixpeek.com/retrieval/retrievers?ref=blog.mixpeek.com">Retrievers reference</a></li><li><a href="https://docs.mixpeek.com/ingestion/collections?ref=blog.mixpeek.com">Collections and input_mappings</a></li></ul><hr><h2 id="whats-next">What&apos;s next</h2><p>A few things on the backlog:</p><ul><li><strong>Batched query-time embedding.</strong> Right now the retriever embeds one query per request. For re-ranking pipelines where you need embeddings for N candidates, batching the Gemini calls would reduce latency significantly.</li><li><strong>Selective blob embedding.</strong> Currently you list all blob properties in <code>input_mappings.files</code> and all of them are embedded together every time. A predicate system &#x2014; &quot;only include <code>spec_sheet</code> if the object has a <code>category == &apos;electronics&apos;</code>&quot; &#x2014; would let you get more precise about what goes into each object&apos;s vector.</li><li><strong>Streaming updates.</strong> Objects that get updated blobs (e.g., a product whose spec sheet is revised) should trigger incremental re-embedding of just that blob&apos;s contribution, not a full reprocessing of the object. This requires some delta-tracking in the manifest that isn&apos;t there yet.</li></ul><p>If you&apos;re building on top of Mixpeek and have a use case where multi-file embedding is relevant, <a href="https://mixpeek.com/contact?ref=blog.mixpeek.com">talk to us</a>. The implementation is new and we&apos;re actively shaping the API surface based on what people actually need.</p>]]></content:encoded></item><item><title><![CDATA[Query Preprocessing: Semantic Search With Large Files]]></title><description><![CDATA[How we built query preprocessing into Mixpeek's feature_search stage — decompose a 500MB video into chunks, embed in parallel, fuse results. 
Zero API surface change for callers.]]></description><link>http://blog.mixpeek.com/query-preprocessing-large-file-search/</link><guid isPermaLink="false">69af05e23baecafdb7f8ca50</guid><category><![CDATA[Engineering]]></category><category><![CDATA[Vector Search]]></category><category><![CDATA[Retrieval]]></category><category><![CDATA[Video Intelligence]]></category><category><![CDATA[Technical]]></category><category><![CDATA[Product Updates]]></category><dc:creator><![CDATA[Ethan Steininger]]></dc:creator><pubDate>Mon, 09 Mar 2026 18:11:09 GMT</pubDate><media:content url="http://blog.mixpeek.com/content/images/2026/03/hero.svg" medium="image"/><content:encoded><![CDATA[<h2 id="the-problem-your-query-is-bigger-than-your-embeddings">The problem: your query is bigger than your embeddings</h2><img src="http://blog.mixpeek.com/content/images/2026/03/hero.svg" alt="Query Preprocessing: Semantic Search With Large Files"><p>Most vector search systems assume queries are small. A sentence. An image. A short audio clip. The entire retrieval literature is built around this assumption: you embed a query into a single vector, search against an index of many vectors, return ranked results.</p><p>This works until a user hands you a 500MB video and says &quot;find me everything in my library that looks like this.&quot;</p><p>We started seeing this pattern from multiple customers in Q4 2025. A media company wanted to search their archive using a raw broadcast clip. A legal team wanted to submit a full contract PDF as a query against a corpus of prior agreements. An IP safety product (<a href="https://mixpeek.com/docs/retrieval/stages/apply/ip-safety-verify?ref=blog.mixpeek.com">which we also built</a>) needed to scan uploaded videos for trademark violations by searching frame-by-frame against a brand index.</p><p>The naive solutions all have obvious problems:</p><ul><li><strong>Reject large inputs</strong> &#x2014; forces the client to pre-split, which breaks the API abstraction and requires them to implement fusion logic</li><li><strong>Average all frame embeddings into one vector</strong> &#x2014; destroys temporal structure. A 10-minute video becomes one meaningless centroid.</li><li><strong>Limit query size</strong> &#x2014; a 100MB video limit is arbitrary and still doesn&apos;t solve the composition problem</li></ul><p>What we wanted: pass a large file directly as a query input, have the system figure out how to search with it, get back a ranked list as if it were a simple query.</p><hr><h2 id="the-insight-ingestion-and-query-are-the-same-operation">The insight: ingestion and query are the same operation</h2><p>Here&apos;s the key observation that made this tractable: <strong>the decomposition logic we already use for ingestion is exactly what we need for query preprocessing</strong>.</p><p>When a video gets ingested into Mixpeek, it goes through a <a href="https://mixpeek.com/docs/processing/extractors/multimodal?ref=blog.mixpeek.com">feature extractor</a> that:</p><ol><li>Splits the video into segments (keyframes, fixed intervals, or scene boundaries)</li><li>Embeds each segment via the configured model</li><li>Stores the resulting vectors in Qdrant alongside payload metadata</li></ol><p>Query preprocessing is the same pipeline, just routing the output differently. Instead of writing vectors to Qdrant, we use them to <em>search</em> Qdrant. The same extractor, the same chunking logic, the same embedding model. 
This matters because it guarantees that query embeddings and index embeddings are always in the same vector space &#x2014; no distribution shift from using a different chunking strategy at query time.</p><p>The execution flow looks like this:</p><pre><code>feature_search stage
&#x2502;
&#x251C;&#x2500; 1. Detect input type
&#x2502;     &#x2192; video/500MB detected
&#x2502;     &#x2192; route to query_preprocessing
&#x2502;
&#x251C;&#x2500; 2. Decompose via extractor pipeline
&#x2502;     &#x2192; same extractor that indexed the data
&#x2502;     &#x2192; e.g. 20 keyframes from a 10-min video
&#x2502;
&#x251C;&#x2500; 3. Batch embed (parallel)
&#x2502;     &#x2192; 20 segments &#x2192; inference service &#x2192; 20 vectors
&#x2502;
&#x251C;&#x2500; 4. Parallel Qdrant searches
&#x2502;     &#x2192; 20 concurrent ANN queries
&#x2502;     &#x2192; each returns top_k candidates
&#x2502;
&#x251C;&#x2500; 5. Fuse results
&#x2502;     &#x2192; RRF / max / avg across 20 result sets
&#x2502;     &#x2192; deduplicate (same doc from multiple frames &#x2192; keep best)
&#x2502;
&#x2514;&#x2500; Output: single ranked list, same shape as a simple query response
</code></pre><p>From the caller&apos;s perspective, nothing changes. You pass a file URL, you get results back. The complexity is entirely internal.</p><hr><h2 id="api-design">API design</h2><p>We added a <code>query_preprocessing</code> object to the <a href="https://mixpeek.com/docs/retrieval/stages/feature-search?ref=blog.mixpeek.com">feature_search stage</a>. It can live at the stage level (applies to all searches as a default) or per-search (overrides the default for that search).</p><p>Zero-config usage &#x2014; just pass a large file and the system figures out the rest:</p><pre><code class="language-json">{
  &quot;stage_id&quot;: &quot;feature_search&quot;,
  &quot;parameters&quot;: {
    &quot;searches&quot;: [{
      &quot;feature_uri&quot;: &quot;mixpeek://multimodal_extractor@v1/vertex_multimodal_embedding&quot;,
      &quot;query&quot;: {
        &quot;input_mode&quot;: &quot;content&quot;,
        &quot;value&quot;: &quot;s3://my-bucket/broadcast-clip.mp4&quot;
      },
      &quot;query_preprocessing&quot;: {
        &quot;max_chunks&quot;: 20,
        &quot;aggregation&quot;: &quot;rrf&quot;
      }
    }]
  }
}
</code></pre><h3 id="the-params-field-is-the-extractors-own-parameter-schema">The params field is the extractor&apos;s own parameter schema</h3><p>This is the part that surprised people internally when we first described it: <code>query_preprocessing.params</code> is not a new configuration surface. It is literally the same parameter schema that the extractor accepts during ingestion.</p><p>Whatever you put in a collection&apos;s extractor config for <code>multimodal_extractor@v1</code> &#x2014; <code>video_interval_seconds</code>, <code>max_resolution</code>, <code>keyframe_threshold</code>, whatever that extractor exposes &#x2014; those same keys go in <code>params</code> here. The preprocessing step runs the extractor with those params to decompose the query, exactly as it would during collection processing. Same code path, same config schema, different output destination.</p><pre><code class="language-json">{
  &quot;stage_id&quot;: &quot;feature_search&quot;,
  &quot;parameters&quot;: {
    &quot;searches&quot;: [{
      &quot;feature_uri&quot;: &quot;mixpeek://multimodal_extractor@v1/vertex_multimodal_embedding&quot;,
      &quot;query&quot;: { &quot;input_mode&quot;: &quot;content&quot;, &quot;value&quot;: &quot;{{INPUT.video}}&quot; },
      &quot;query_preprocessing&quot;: {
        &quot;max_chunks&quot;: 30,
        &quot;aggregation&quot;: &quot;max&quot;,
        &quot;dedup_field&quot;: &quot;metadata.document_id&quot;,
        &quot;params&quot;: {
          &quot;split_method&quot;: &quot;time&quot;,
          &quot;time_split_interval&quot;: 5
        }
      }
    }]
  }
}
</code></pre><p>This means there&apos;s nothing new to learn about the preprocessing parameters. If you know how to configure the extractor for ingestion, you already know how to configure it for query preprocessing. The <a href="https://mixpeek.com/docs/processing/extractors/multimodal?ref=blog.mixpeek.com">extractor documentation</a> is the reference for both.</p><p>Per-search preprocessing, mixed with a plain text query:</p><pre><code class="language-json">{
  &quot;stage_id&quot;: &quot;feature_search&quot;,
  &quot;parameters&quot;: {
    &quot;searches&quot;: [
      {
        &quot;feature_uri&quot;: &quot;mixpeek://multimodal_extractor@v1/vertex_multimodal_embedding&quot;,
        &quot;query&quot;: { &quot;input_mode&quot;: &quot;content&quot;, &quot;value&quot;: &quot;{{INPUT.video}}&quot; },
        &quot;query_preprocessing&quot;: {
          &quot;max_chunks&quot;: 30,
          &quot;aggregation&quot;: &quot;max&quot;
        }
      },
      {
        &quot;feature_uri&quot;: &quot;mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1&quot;,
        &quot;query&quot;: { &quot;input_mode&quot;: &quot;text&quot;, &quot;value&quot;: &quot;{{INPUT.caption}}&quot; }
      }
    ]
  }
}
</code></pre><p>The second search has no preprocessing &#x2014; it&apos;s a plain single-vector text query. Multi-modal retrieval with heterogeneous query types, fused at the stage level.</p><hr><h2 id="fusion-strategies">Fusion strategies</h2><p>Once you have N result sets from N chunk searches, you need to combine them. We support three strategies:</p><h3 id="rrf-reciprocal-rank-fusion">RRF (Reciprocal Rank Fusion)</h3><p>Each document&apos;s score is the sum of <code>1 / (k + rank)</code> across all chunk result sets where it appeared. <code>k</code> is a smoothing constant (typically 60).</p><p>RRF is rank-based, so it&apos;s immune to score magnitude differences between chunks. A document that ranks 3rd in 5 different chunk searches beats one that ranks 1st in only 1. This is the right default for &quot;find content that&apos;s generally similar to this video&quot; queries.</p><h3 id="max">Max</h3><p>Keep the highest score a document received across all chunk searches. Use this when you want &quot;find the moment in this video that best matches something in the index&quot; &#x2014; you care about the best alignment, not average alignment.</p><h3 id="avg">Avg</h3><p>Average the scores across all chunk results where the document appeared. Documents that show up consistently across many chunks beat documents that match one chunk perfectly. Useful for &quot;find videos with similar overall content distribution.&quot;</p><p>The right strategy depends on the query semantics. For IP safety (does this video contain a specific brand?), <code>max</code> is correct &#x2014; you want the single best match. For &quot;find content similar to this video,&quot; <code>rrf</code> is more robust.</p><hr><h2 id="what-we-didnt-do-a-strategy-auto-mode">What we didn&apos;t do: a &quot;strategy: auto&quot; mode</h2><p>Early in the design we considered a <code>strategy: &quot;auto&quot;</code> parameter that would detect file size and type and choose chunking parameters automatically. We prototyped it.</p><p>The problem is that the right chunking depends on what you&apos;re trying to find, not just the file. A 5-second clip queried against a movie archive probably wants dense keyframe sampling. The same clip queried against a sports highlight reel probably wants scene-boundary splits. There&apos;s no way to infer this from the file alone.</p><p>We removed auto mode. If we add it back, it&apos;ll be as a starting heuristic with explicit override support &#x2014; not as a magic setting that hides what&apos;s actually happening. <a href="https://mixpeek.com/docs/retrieval/stages/feature-search?ref=blog.mixpeek.com#query-preprocessing">The full parameter reference is in the docs</a>.</p><hr><h2 id="credit-model">Credit model</h2><p>Each chunk counts as one retrieval credit. A <code>max_chunks: 20</code> config on a video that produces 20 keyframes costs 20 credits, same as running 20 separate single-vector searches. This is intentional &#x2014; preprocessing is not a way to get bulk search at single-query pricing. The cost is transparent and predictable.</p><p>The cap parameter (<code>max_chunks</code>, range 1&#x2013;100) exists to bound the cost at query time. If an extractor would produce 50 chunks but you set <code>max_chunks: 20</code>, we take the first 20 by default. 
You can configure the sampling strategy via extractor params if you need uniform sampling instead.</p><hr><h2 id="the-ip-safety-case">The IP safety case</h2><p>The use case that drove us to ship this quickly was our <a href="https://mixpeek.com/docs/retrieval/stages/apply/ip-safety-verify?ref=blog.mixpeek.com">IP safety verification pipeline</a>. The product takes a video (a YouTube upload, a broadcast clip, an ad creative) and checks it against a face index (93K embeddings across ~5K identities) and a brand logo index (25K brands).</p><p>The query <em>is</em> the video. There&apos;s no text query, no image query &#x2014; you&apos;re searching with the entire asset. Before query preprocessing, this required the caller to extract frames, embed them, run searches, and fuse results themselves. Now it&apos;s one API call:</p><pre><code class="language-json">{
  &quot;stage_id&quot;: &quot;ip_safety_verify&quot;,
  &quot;parameters&quot;: {
    &quot;face_index_s3_uri&quot;: &quot;s3://mixpeek-server-prod/ip-safety/face_index.npz&quot;,
    &quot;brand_index_s3_uri&quot;: &quot;s3://mixpeek-server-prod/ip-safety/logo_text_index_v2.npz&quot;,
    &quot;image_url_field&quot;: &quot;metadata.frame_url&quot;
  }
}
</code></pre><p>The stage handles frame extraction, parallel embedding, and fusion internally. Callers pass a video URL and get back identified faces and brands with confidence scores.</p><hr><h2 id="limitations-and-known-tradeoffs">Limitations and known tradeoffs</h2><p><strong>Latency scales with chunk count.</strong> 20 parallel Qdrant searches is fast (we batch the embedding calls), but it&apos;s not the same as 1 search. For latency-sensitive paths, set a low <code>max_chunks</code> or pre-extract a representative keyframe.</p><p><strong>The extractor must support the input type.</strong> Query preprocessing routes through the same extractor pipeline as ingestion. If your namespace uses a text-only extractor, you can&apos;t pass a video as a query. The feature URI determines what decomposition is possible.</p><p><strong>Chunk ordering is not preserved.</strong> The fused result list is ranked by similarity score, not temporal order. If you need results ordered by where in the query video they matched, you&apos;d need to add that as post-processing (we don&apos;t have a stage for this yet).</p><p><strong>Deduplication is per-field.</strong> If two chunks both match the same 5-second clip but from different angles, they&apos;ll show up as different results unless you configure <code>dedup_field</code> to collapse by document ID. Know your data model.</p><hr><h2 id="whats-next">What&apos;s next</h2><p>Query preprocessing is live in the <code>feature_search</code> stage today. <a href="https://mixpeek.com/docs/retrieval/stages/feature-search?ref=blog.mixpeek.com#query-preprocessing">Full docs here</a>.</p><p>The pattern &#x2014; decompose input, embed in parallel, fuse results &#x2014; generalizes beyond feature search. The same approach should work in rerank stages (LLM-score each chunk of a large document, take the max) and in apply stages (run a classifier on each frame of a video, return the worst-case result). We haven&apos;t built those yet, but the abstraction is the same.</p><p>If you&apos;re building something where the query is a large file, <a href="https://mixpeek.com/start?ref=blog.mixpeek.com">we&apos;d like to hear about it</a>. The current implementation was shaped almost entirely by real production use cases. The next iteration will be too.</p><p>&#x2014; <a href="https://mixpeek.com/?ref=blog.mixpeek.com">Mixpeek</a></p>]]></content:encoded></item><item><title><![CDATA[AI Video Analysis for Sports: Build Automated Highlight Reels, Archive Search, and Performance Analytics]]></title><description><![CDATA[Sports broadcasters cut 4-8 hour editing sessions to 15 minutes using AI video analysis. 
Learn how to build automated highlight detection, archive search, and performance analytics pipelines for any sport.]]></description><link>http://blog.mixpeek.com/ai-video-analysis-sports/</link><guid isPermaLink="false">69aed6993baecafdb7f8c69e</guid><category><![CDATA[Video]]></category><category><![CDATA[Video AI]]></category><category><![CDATA[Sports]]></category><category><![CDATA[Sports Analytics]]></category><category><![CDATA[Computer Vision]]></category><category><![CDATA[Industry]]></category><category><![CDATA[Tutorials]]></category><dc:creator><![CDATA[Ethan Steininger]]></dc:creator><pubDate>Mon, 09 Mar 2026 16:09:41 GMT</pubDate><media:content url="http://blog.mixpeek.com/content/images/2026/03/sports-blog-hero.jpg" medium="image"/><content:encoded><![CDATA[<h2 id="the-problem-sports-video-is-unstructured-at-scale">The Problem: Sports Video is Unstructured at Scale</h2><img src="http://blog.mixpeek.com/content/images/2026/03/sports-blog-hero.jpg" alt="AI Video Analysis for Sports: Build Automated Highlight Reels, Archive Search, and Performance Analytics"><p>A single 90-minute soccer match generates 90 minutes of raw video. A full Premier League weekend &#x2014; 10 matches &#x2014; produces 15+ hours. Multiply by 38 match weeks, add training sessions, press conferences, and behind-the-scenes footage, and a mid-sized sports media operation is managing thousands of hours of content per season.</p><p>The bottleneck isn&apos;t storage. It&apos;s making that video <em>useful</em>.</p><ul><li>Highlight editors manually watch entire games &#x2014; 4-8 hours per match &#x2014; to find key moments</li><li>Archive footage is effectively unsearchable beyond filename and date</li><li>Analytics teams download raw video and manually annotate events frame by frame</li><li>Social media teams miss the optimal publish window because clips aren&apos;t ready in time</li></ul><p>AI video analysis solves all of these by treating sports video as structured, queryable data instead of opaque files.</p>
<!--kg-card-begin: html-->
<div style="background: linear-gradient(135deg, #f0f9ff 0%, #e0f2fe 100%); border: 1px solid #0ea5e9; border-radius: 12px; padding: 1.5rem 2rem; margin: 1.5rem 0;">
  <p style="margin: 0 0 0.75rem 0; font-weight: 700; color: #0369a1; font-size: 1rem;">Explore on Mixpeek</p>
  <div style="display: flex; flex-wrap: wrap; gap: 0.75rem;">
    <a href="https://mixpeek.com/solutions/sports?ref=blog.mixpeek.com" style="display:inline-flex;align-items:center;gap:0.4rem;background:#3b82f6;color:white;padding:0.5rem 1rem;border-radius:8px;text-decoration:none;font-size:0.875rem;font-weight:600;">&#x1F3DF; Sports Solution Page</a>
    <a href="https://mixpeek.com/use-cases/sports-highlights?ref=blog.mixpeek.com" style="display:inline-flex;align-items:center;gap:0.4rem;background:#7c3aed;color:white;padding:0.5rem 1rem;border-radius:8px;text-decoration:none;font-size:0.875rem;font-weight:600;">&#x1F3AC; Use Case: Sports Highlights</a>
    <a href="https://mixpeek.com/recipes/sports-highlights?ref=blog.mixpeek.com" style="display:inline-flex;align-items:center;gap:0.4rem;background:#059669;color:white;padding:0.5rem 1rem;border-radius:8px;text-decoration:none;font-size:0.875rem;font-weight:600;">&#x1F4CB; Recipe: Build It Yourself</a>
  </div>
</div>
<!--kg-card-end: html-->
<h2 id="how-ai-video-analysis-works-for-sports">How AI Video Analysis Works for Sports</h2><p>Modern sports video AI combines three layers of analysis that run in parallel:</p><h3 id="1-visual-action-detection">1. Visual Action Detection</h3><p>Computer vision models analyze each frame to detect specific actions &#x2014; ball trajectory, player contact, goalkeeper positioning, crowd rise. Rather than generic object detection, sports-tuned models classify actions against sport-specific exemplars: what a goal looks like vs. what a save looks like vs. what a foul looks like.</p><p>The foundation is a multimodal embedding model (like SigLIP or CLIP) that converts each video scene into a dense vector. These vectors are compared against labeled exemplar clips to classify the action type and calculate confidence scores.</p><h3 id="2-audio-spike-detection">2. Audio Spike Detection</h3><p>Crowd noise and commentator speech are incredibly reliable highlight signals. Audio transcription (Whisper large-v3) captures the words &#x2014; &quot;GOAAAAAL!&quot;, &quot;unbelievable&quot;, &quot;he&apos;s done it again&quot; &#x2014; while audio feature extraction detects the energy spike of 50,000 fans simultaneously standing up.</p><p>Commentary excitement combined with crowd noise creates a compound signal that&apos;s almost impossible to fake and extremely reliable for identifying high-intensity moments.</p><h3 id="3-on-screen-graphic-parsing">3. On-Screen Graphic Parsing</h3><p>Score changes, VAR indicators, replay flags, and player stat overlays are broadcast signals that confirm something significant just happened. OCR (optical character recognition) extracts these as structured data &#x2014; goal time, team, score &#x2014; which can be correlated with the visual and audio signals for maximum confidence.</p><h3 id="fusion-and-ranking">Fusion and Ranking</h3><p>The three signals are fused using reciprocal rank fusion (RRF) &#x2014; a method that combines rankings from multiple retrieval sources without requiring manual weight calibration. The result is a ranked list of timestamped moments, each with a highlight confidence score.</p>
<!--kg-card-begin: html-->
<figure style="margin: 2rem 0; background: #f8fafc; border: 1px solid #e2e8f0; border-radius: 12px; padding: 1.5rem; overflow-x: auto;">
<figcaption style="text-align:center; font-weight:600; color:#374151; margin-bottom:1rem; font-size:0.95rem;">Reference Architecture &#x2014; Mixpeek Sports Highlights Pipeline</figcaption>
<svg viewbox="0 0 900 340" xmlns="http://www.w3.org/2000/svg" style="width:100%; max-width:860px; display:block; margin:0 auto; font-family:system-ui,sans-serif;">
  <!-- Layer labels -->
  <text x="90" y="22" text-anchor="middle" font-size="11" font-weight="700" fill="#6b7280" letter-spacing="1">INPUT</text>
  <text x="290" y="22" text-anchor="middle" font-size="11" font-weight="700" fill="#6b7280" letter-spacing="1">EXTRACTION</text>
  <text x="570" y="22" text-anchor="middle" font-size="11" font-weight="700" fill="#6b7280" letter-spacing="1">ENRICHMENT</text>
  <text x="780" y="22" text-anchor="middle" font-size="11" font-weight="700" fill="#6b7280" letter-spacing="1">RETRIEVAL</text>

  <!-- Input box -->
  <rect x="20" y="40" width="140" height="80" rx="10" fill="#dbeafe" stroke="#3b82f6" stroke-width="1.5"/>
  <text x="90" y="70" text-anchor="middle" font-size="13" font-weight="600" fill="#1d4ed8">Game Footage</text>
  <text x="90" y="88" text-anchor="middle" font-size="10" fill="#3b82f6">S3 / CDN / Live Stream</text>
  <text x="90" y="104" text-anchor="middle" font-size="10" fill="#6b7280">MP4 &#xB7; MOV &#xB7; HLS</text>

  <!-- Arrow input→extraction -->
  <line x1="160" y1="80" x2="195" y2="80" stroke="#9ca3af" stroke-width="2" marker-end="url(#arr)"/>

  <!-- Video extractor -->
  <rect x="195" y="40" width="130" height="56" rx="8" fill="#ede9fe" stroke="#7c3aed" stroke-width="1.5"/>
  <text x="260" y="64" text-anchor="middle" font-size="12" font-weight="600" fill="#5b21b6">Video Extractor</text>
  <text x="260" y="82" text-anchor="middle" font-size="10" fill="#7c3aed">Scene embeddings</text>
  <rect x="322" y="52" width="38" height="18" rx="9" fill="#7c3aed"/>
  <text x="341" y="65" text-anchor="middle" font-size="9" font-weight="700" fill="white">60 wt</text>

  <!-- Audio extractor -->
  <rect x="195" y="108" width="130" height="56" rx="8" fill="#fce7f3" stroke="#db2777" stroke-width="1.5"/>
  <text x="260" y="132" text-anchor="middle" font-size="12" font-weight="600" fill="#9d174d">Audio Extractor</text>
  <text x="260" y="150" text-anchor="middle" font-size="10" fill="#db2777">Crowd + commentary</text>
  <rect x="322" y="120" width="38" height="18" rx="9" fill="#db2777"/>
  <text x="341" y="133" text-anchor="middle" font-size="9" font-weight="700" fill="white">40 wt</text>

  <!-- OCR extractor -->
  <rect x="195" y="176" width="130" height="48" rx="8" fill="#d1fae5" stroke="#059669" stroke-width="1.5"/>
  <text x="260" y="200" text-anchor="middle" font-size="12" font-weight="600" fill="#065f46">OCR Layer</text>
  <text x="260" y="216" text-anchor="middle" font-size="10" fill="#059669">Scores &#xB7; VAR &#xB7; overlays</text>

  <!-- Arrows extraction→enrichment -->
  <line x1="325" y1="68" x2="430" y2="100" stroke="#9ca3af" stroke-width="1.5" marker-end="url(#arr)"/>
  <line x1="325" y1="136" x2="430" y2="116" stroke="#9ca3af" stroke-width="1.5" marker-end="url(#arr)"/>
  <line x1="325" y1="200" x2="430" y2="130" stroke="#9ca3af" stroke-width="1.5" marker-end="url(#arr)"/>

  <!-- Enrichment / taxonomy -->
  <rect x="430" y="60" width="200" height="140" rx="10" fill="#fef3c7" stroke="#d97706" stroke-width="1.5"/>
  <text x="530" y="88" text-anchor="middle" font-size="13" font-weight="600" fill="#92400e">Sport Taxonomy</text>
  <circle cx="490" cy="118" r="18" fill="#fde68a" stroke="#d97706" stroke-width="1.2"/>
  <text x="490" y="123" text-anchor="middle" font-size="9" fill="#92400e">Goal</text>
  <circle cx="530" cy="148" r="18" fill="#fde68a" stroke="#d97706" stroke-width="1.2"/>
  <text x="530" y="153" text-anchor="middle" font-size="9" fill="#92400e">Save</text>
  <circle cx="570" cy="118" r="18" fill="#fde68a" stroke="#d97706" stroke-width="1.2"/>
  <text x="570" y="123" text-anchor="middle" font-size="9" fill="#92400e">Foul</text>
  <circle cx="490" cy="178" r="18" fill="#fde68a" stroke="#d97706" stroke-width="1.2"/>
  <text x="490" y="183" text-anchor="middle" font-size="9" fill="#92400e">Card</text>
  <circle cx="570" cy="178" r="18" fill="#fde68a" stroke="#d97706" stroke-width="1.2"/>
  <text x="570" y="183" text-anchor="middle" font-size="9" fill="#92400e">Replay</text>

  <!-- Arrow enrichment→retrieval -->
  <line x1="630" y1="130" x2="680" y2="130" stroke="#9ca3af" stroke-width="2" marker-end="url(#arr)"/>

  <!-- Retrieval box -->
  <rect x="680" y="60" width="190" height="140" rx="10" fill="#f0fdf4" stroke="#16a34a" stroke-width="1.5"/>
  <text x="775" y="90" text-anchor="middle" font-size="13" font-weight="600" fill="#14532d">Highlight Retriever</text>
  <rect x="700" y="102" width="150" height="26" rx="6" fill="#bbf7d0" stroke="#16a34a" stroke-width="1"/>
  <text x="775" y="120" text-anchor="middle" font-size="11" font-weight="600" fill="#166534">RRF Fusion</text>
  <text x="775" y="150" text-anchor="middle" font-size="10" fill="#166534">Ranked clip manifest</text>
  <text x="775" y="166" text-anchor="middle" font-size="10" fill="#166534">with timestamps</text>
  <rect x="700" y="178" width="150" height="18" rx="6" fill="#16a34a"/>
  <text x="775" y="191" text-anchor="middle" font-size="10" font-weight="700" fill="white">&#x23F1; 15-20 min / match</text>

  <!-- Arrowhead marker -->
  <defs>
    <marker id="arr" markerwidth="8" markerheight="8" refx="6" refy="3" orient="auto">
      <path d="M0,0 L0,6 L8,3 z" fill="#9ca3af"/>
    </marker>
  </defs>
</svg>
</figure>
<!--kg-card-end: html-->
<h2 id="building-a-sports-highlights-pipeline-with-mixpeek">Building a Sports Highlights Pipeline with Mixpeek</h2><p>Here&apos;s how to build a production highlight pipeline. The core workflow is: ingest footage &#x2192; extract multimodal features &#x2192; define highlight criteria &#x2192; execute retrieval &#x2192; assemble clips.</p><p>The full step-by-step code is available in the <a href="https://mixpeek.com/recipes/sports-highlights?ref=blog.mixpeek.com"><strong>Sports Highlights Recipe</strong></a> &#x2014; including bucket setup, collection configuration, taxonomy creation, retriever definition, and output parsing.</p><h3 id="step-1-ingest-game-footage">Step 1: Ingest Game Footage</h3><pre><code class="language-python">import requests

headers = {
    &quot;Authorization&quot;: &quot;Bearer YOUR_API_KEY&quot;,
    &quot;X-Namespace&quot;: &quot;sports-media&quot;,
    &quot;Content-Type&quot;: &quot;application/json&quot;
}

# Create a collection for game scenes (an audio collection follows the same pattern)
scene_collection = requests.post(&quot;https://api.mixpeek.com/v1/collections&quot;, headers=headers, json={
    &quot;collection_name&quot;: &quot;game-scenes&quot;,
    &quot;source&quot;: {&quot;type&quot;: &quot;bucket&quot;, &quot;bucket_id&quot;: &quot;bkt_footage&quot;},
    &quot;feature_extractor&quot;: {
        &quot;feature_extractor_name&quot;: &quot;video_extractor&quot;,
        &quot;version&quot;: &quot;v1&quot;,
        &quot;input_mappings&quot;: {&quot;video_url&quot;: &quot;video_url&quot;},
        &quot;parameters&quot;: {
            &quot;scene_detection_threshold&quot;: 0.3,
            &quot;keyframe_interval&quot;: 2,
            &quot;max_scenes&quot;: 500
        },
        &quot;field_passthrough&quot;: [
            {&quot;source_path&quot;: &quot;sport&quot;},
            {&quot;source_path&quot;: &quot;game_id&quot;},
            {&quot;source_path&quot;: &quot;broadcast_date&quot;}
        ]
    }
}).json()

# Ingest a match
requests.post(&quot;https://api.mixpeek.com/v1/buckets/bkt_footage/objects&quot;,
    headers=headers, json={
        &quot;metadata&quot;: {
            &quot;sport&quot;: &quot;soccer&quot;,
            &quot;game_id&quot;: &quot;cl-2026-final&quot;,
            &quot;broadcast_date&quot;: &quot;2026-05-25&quot;
        },
        &quot;blobs&quot;: [{&quot;property&quot;: &quot;video_url&quot;, &quot;type&quot;: &quot;video&quot;,
                   &quot;url&quot;: &quot;s3://my-bucket/games/cl-final.mp4&quot;}]
    })</code></pre><h3 id="step-2-define-highlight-criteria">Step 2: Define Highlight Criteria</h3><p>Configure what counts as a highlight for your sport using a Mixpeek taxonomy. Each event type needs 5-20 exemplar clips &#x2014; not thousands of labeled examples, just representative samples:</p><pre><code class="language-python">taxonomy = requests.post(&quot;https://api.mixpeek.com/v1/taxonomies&quot;, headers=headers, json={
    &quot;taxonomy_name&quot;: &quot;soccer_events&quot;,
    &quot;taxonomy_type&quot;: &quot;flat&quot;,
    &quot;nodes&quot;: [
        {&quot;node_id&quot;: &quot;goal&quot;, &quot;collection_id&quot;: &quot;col_goal_exemplars&quot;},
        {&quot;node_id&quot;: &quot;save&quot;, &quot;collection_id&quot;: &quot;col_save_exemplars&quot;},
        {&quot;node_id&quot;: &quot;foul&quot;, &quot;collection_id&quot;: &quot;col_foul_exemplars&quot;},
        {&quot;node_id&quot;: &quot;celebration&quot;, &quot;collection_id&quot;: &quot;col_celebration_exemplars&quot;},
    ]
}).json()</code></pre><h3 id="step-3-retrieve-highlights">Step 3: Retrieve Highlights</h3><pre><code class="language-python">highlights = requests.post(
    &quot;https://api.mixpeek.com/v1/retrievers/soccer-highlights/execute&quot;,
    headers=headers,
    json={
        &quot;inputs&quot;: {&quot;game_id&quot;: &quot;cl-2026-final&quot;},
        &quot;limit&quot;: 20
    }
).json()

for doc in highlights[&quot;documents&quot;]:
    start = doc[&quot;metadata&quot;][&quot;start_time&quot;]
    end = doc[&quot;metadata&quot;][&quot;end_time&quot;]
    keyframe = doc[&quot;metadata&quot;][&quot;keyframe_url&quot;]
    print(f&quot;{start:.1f}s - {end:.1f}s | score: {doc[&apos;score&apos;]:.3f}&quot;)
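    # e.g. cut the clip locally with ffmpeg (sketch, not part of the Mixpeek
    # API; assumes ffmpeg is installed and the match file is on disk):
    #   subprocess.run([&quot;ffmpeg&quot;, &quot;-i&quot;, &quot;cl-final.mp4&quot;, &quot;-ss&quot;, str(start),
    #                   &quot;-to&quot;, str(end), &quot;-c&quot;, &quot;copy&quot;, f&quot;clip_{int(start)}.mp4&quot;])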
    # &#x2192; Use start/end to extract clips with FFmpeg or your video API</code></pre><h2 id="real-results-what-sports-teams-are-getting">Real Results: What Sports Teams Are Getting</h2>
<!--kg-card-begin: html-->
<table>
  <thead>
    <tr><th>Metric</th><th>Before AI</th><th>After AI</th><th>Improvement</th></tr>
  </thead>
  <tbody>
    <tr><td>Highlight turnaround</td><td>4-8 hours</td><td>15-20 min</td><td>24x faster</td></tr>
    <tr><td>Key moments captured</td><td>60-70%</td><td>95%+</td><td>+46% coverage</td></tr>
    <tr><td>Editor hours per game</td><td>6+ hours</td><td>&lt;30 min review</td><td>12x reduction</td></tr>
    <tr><td>Social clips per game</td><td>3-5</td><td>15-25</td><td>5x more content</td></tr>
  </tbody>
</table>
<!--kg-card-end: html-->
<h2 id="beyond-highlights-other-sports-video-ai-use-cases">Beyond Highlights: Other Sports Video AI Use Cases</h2><h3 id="archive-search">Archive Search</h3><p>Your historical footage is worth more than you&apos;re getting from it. AI video analysis makes decades of archived broadcast footage searchable by semantic query &#x2014; &quot;find all bicycle kicks from 2018-2022&quot;, &quot;show every time [player name] scored in the final 10 minutes&quot;. Instead of a media librarian spending hours on a request, results come back in seconds.</p><p>Sports analytics software built on vector search (not keyword search) enables this. Every scene becomes a semantic data point, not a filename.</p><h3 id="player-performance-analytics">Player Performance Analytics</h3><p>Combine face recognition with action detection to compile every clip of a specific player automatically. Coaching staff query: &quot;show me all crosses by our left back in the last 5 matches&quot; &#x2014; the system retrieves exact timestamps across hours of footage without any manual tagging.</p><h3 id="broadcast-compliance-monitoring">Broadcast Compliance Monitoring</h3><p>Automatically flag content that violates broadcast standards &#x2014; crowd violence, hate speech in chants (via audio transcription), on-pitch incidents requiring regulatory review. Real-time processing means compliance teams review flagged content within minutes of an incident occurring.</p><h3 id="monetization-personalized-highlight-feeds">Monetization: Personalized Highlight Feeds</h3><p>Different fans want different highlights. With multimodal AI, generate personalized highlight feeds &#x2014; goal-only feeds, specific-player feeds, defensive play feeds &#x2014; from the same source footage. Each fan gets the moments relevant to their preferences, increasing engagement and subscription value.</p><h2 id="choosing-the-right-sports-video-analytics-platform">Choosing the Right Sports Video Analytics Platform</h2><p>Not all video AI platforms are built for sports workflows. Key criteria for sports media:</p><ul><li><strong>Multi-modal fusion:</strong> Visual + audio + text signals must combine into a single highlight score. Platforms that only do computer vision miss the audio signals that are often the most reliable indicators.</li><li><strong>Sport-configurable:</strong> Basketball dunks are not soccer goals. The platform needs configurable event taxonomies per sport &#x2014; not generic action detection that classifies &quot;sports&quot; as a single category.</li><li><strong>Processing speed:</strong> A 90-minute match should analyze in &lt;20 minutes. For live workflows, near-real-time latency is required for social media clips.</li><li><strong>Self-hosting option:</strong> Broadcast content often has rights restrictions. The ability to deploy in your own infrastructure &#x2014; not a shared cloud &#x2014; is critical for compliance.</li><li><strong>Archive-scale:</strong> Leagues and broadcasters manage decades of footage. 
The platform must handle millions of scenes without degraded search quality.</li></ul><h2 id="getting-started">Getting Started</h2><p>Building a sports highlights pipeline with Mixpeek takes about an hour to set up:</p><ol><li>Create an account and get your API key at <a href="https://mixpeek.com/?ref=blog.mixpeek.com">mixpeek.com</a></li><li>Review the <a href="https://mixpeek.com/solutions/sports?ref=blog.mixpeek.com"><strong>Sports Media &amp; Analytics solution page</strong></a> for the full platform overview</li><li>Work through the <a href="https://mixpeek.com/use-cases/sports-highlights?ref=blog.mixpeek.com"><strong>Sports Highlights use case</strong></a> to understand the end-to-end workflow</li><li>Clone the <a href="https://mixpeek.com/recipes/sports-highlights?ref=blog.mixpeek.com"><strong>Sports Highlights Recipe</strong></a> &#x2014; it has complete Python and cURL code ready to run</li><li>Collect 10-20 exemplar clips per event type for your sport and ingest a test match</li></ol><p>For enterprise deployments &#x2014; live stream integration, self-hosted infrastructure, or custom model training for specific sports &#x2014; <a href="https://mixpeek.com/contact?ref=blog.mixpeek.com">contact the Mixpeek team</a> for a scoped architecture review.</p><h2 id="frequently-asked-questions">Frequently Asked Questions</h2><h3 id="which-sports-work-with-mixpeek">Which sports work with Mixpeek?</h3><p>Any sport that&apos;s been filmed. The taxonomy system is fully configurable &#x2014; define what counts as a highlight moment for your sport using exemplar clips. Soccer, basketball, American football, baseball, tennis, rugby, cricket, esports, and motorsports all work. Multi-sport deployments run separate taxonomies per sport simultaneously.</p><h3 id="do-i-need-a-large-labeled-dataset-to-get-started">Do I need a large labeled dataset to get started?</h3><p>No. You need 5-20 exemplar clips per event type &#x2014; not thousands of labeled examples. Mixpeek uses these as visual reference points in the taxonomy, not for model training. This means you can be up and running in hours, not months.</p><h3 id="how-does-it-handle-different-camera-angles-in-multi-camera-broadcasts">How does it handle different camera angles in multi-camera broadcasts?</h3><p>Each camera feed can be ingested as a separate object. The retriever can search across all angles simultaneously and return the best angle for each highlight moment. Alternatively, ingest the broadcast director feed (already switched) for simpler single-stream processing.</p><h3 id="can-it-identify-specific-players-without-jersey-numbers-visible">Can it identify specific players without jersey numbers visible?</h3><p>Yes, using the face extractor. Provide labeled reference frames per player and the system builds visual signature models. Players are identifiable in close-up celebrations, crowd pile-ups, and side-profile shots where jersey numbers aren&apos;t visible.</p><h3 id="whats-the-cost-to-process-a-full-season">What&apos;s the cost to process a full season?</h3><p>Pricing depends on total hours processed and analysis features enabled. A typical Premier League season (380 matches &#xD7; 90 min = 570 hours of footage) would be quoted as a custom enterprise package with dedicated processing infrastructure. 
<a href="https://mixpeek.com/contact?ref=blog.mixpeek.com">Contact us</a> for a volume estimate.</p>]]></content:encoded></item><item><title><![CDATA[How Mixpeek runs distributed multimodal ML on Ray: architecture, patterns, and production lessons]]></title><description><![CDATA[We run 20+ ML models in parallel across video, image, and document pipelines. Here's the Ray architecture behind it -- custom resource isolation, flexible actor pools, distributed Qdrant writes, and the lessons we learned the hard way.]]></description><link>http://blog.mixpeek.com/ray-distributed-ml-pipeline-architecture/</link><guid isPermaLink="false">699f17493baecafdb7f8c2cc</guid><category><![CDATA[Engineering]]></category><category><![CDATA[Infrastructure]]></category><category><![CDATA[Machine Learning]]></category><category><![CDATA[Ray]]></category><dc:creator><![CDATA[Ethan Steininger]]></dc:creator><pubDate>Wed, 25 Feb 2026 15:51:20 GMT</pubDate><media:content url="http://blog.mixpeek.com/content/images/2026/02/ray-blog-feature.png" medium="image"/><content:encoded><![CDATA[<img src="http://blog.mixpeek.com/content/images/2026/02/ray-blog-feature.png" alt="How Mixpeek runs distributed multimodal ML on Ray: architecture, patterns, and production lessons"><p>When you index a 10-minute video at Mixpeek, you don&apos;t run one model. You run a transcript model, a visual embedding model, a scene description model, a face detection model, an object detection model, a brand safety classifier, an IAB taxonomy tagger, and a shot boundary detector in parallel. Each has different compute requirements, different batch sizes, different GPU/CPU preferences, and different failure modes.</p><p>A single user-defined <a href="https://mixpeek.com/extractors?ref=blog.mixpeek.com">feature extractor</a> chains several of these. A production batch job might run 15 extractors simultaneously across tens of thousands of files. Each extractor then fans out further to process individual frames, chunks, or pages in parallel before results converge into a searchable index.</p><p>We needed a distributed compute layer that could handle all of this without us building scheduling, retries, resource isolation, and fault tolerance from scratch. After evaluating Celery, Dask, and a bespoke gRPC approach, we chose <a href="https://www.ray.io/?ref=blog.mixpeek.com" rel="noopener">Ray</a>. This is a technical walkthrough of how we use it, the patterns we settled on, and the lessons we learned the hard way.</p><hr><h2 id="the-architecture">The architecture</h2><p>We run a <a href="https://docs.ray.io/en/latest/cluster/kubernetes/index.html?ref=blog.mixpeek.com" rel="noopener">KubeRay</a> cluster on GKE, deployed as a <code>RayService</code> custom resource. Two logical layers sit on top: <strong>Ray Serve</strong> for always-on model inference, and <strong>Ray Data</strong> for batch pipeline execution.</p><pre><code>                     GKE / KubeRay
+--------------------------------------------------------+
|                                                        |
|   +----------------+    +----------------------------+ |
|   |   Head Node    |    |      Ray Serve Layer       | |
|   |   (0 CPUs)     |&lt;--&gt;|  20+ model deployments     | |
|   |   control only |    |  per-model autoscaling     | |
|   +----------------+    +----------------------------+ |
|           |                                            |
|    +-------+------+                                    |
|    v              v                                    |
|  +----------+  +----------+                            |
|  |  CPU     |  |  GPU     |   custom resource:         |
|  | Workers  |  | Workers  |   {&quot;batch&quot;: 1}             |
|  |  1-5     |  |  0-3     |   isolates batch jobs      |
|  |  pods    |  |  pods    |   from Serve replicas      |
|  +----------+  +----------+                            |
+--------------------------------------------------------+
</code></pre><p>The head node is deliberately computation-free (<code>num-cpus: &quot;0&quot;</code>) -- it only handles the control plane. This is a Ray best practice we ignored until a runaway batch job starved the scheduler. CPU and GPU worker groups scale independently via KubeRay&apos;s autoscaler.</p><pre><code class="language-yaml"># infra/gke/rayservice.yaml (abbreviated)
apiVersion: ray.io/v1
kind: RayService
metadata:
  name: mixpeek-engine-svc
spec:
  rayClusterConfig:
    headGroupSpec:
      rayStartParams:
        num-cpus: &quot;0&quot;          # head handles control, not work
        dashboard-host: &quot;0.0.0.0&quot;
      template:
        spec:
          containers:
            - name: ray-head
              resources:
                requests: { cpu: &quot;4&quot;, memory: &quot;32Gi&quot; }
                limits:   { cpu: &quot;8&quot;, memory: &quot;64Gi&quot; }

    workerGroupSpecs:
      - groupName: cpu-workers
        minReplicas: 1
        maxReplicas: 5
        template:
          spec:
            containers:
              - name: ray-worker
                resources:
                  requests: { cpu: &quot;7&quot;, memory: &quot;28Gi&quot; }
                  limits:   { cpu: &quot;7&quot;, memory: &quot;56Gi&quot; }

      - groupName: gpu-workers
        minReplicas: 0          # scale to zero when idle
        maxReplicas: 3
        template:
          spec:
            tolerations:
              - key: nvidia.com/gpu
                operator: Exists
                effect: NoSchedule
            containers:
              - name: ray-worker
                resources:
                  requests:
                    cpu: &quot;4&quot;
                    memory: &quot;16Gi&quot;
                    nvidia.com/gpu: &quot;1&quot;
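
    # (abbreviated; the CPU worker group additionally declares the custom
    # batch resource via rayStartParams, covered in the isolation section below)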
</code></pre><hr><h2 id="ray-serve-20-models-one-cluster">Ray Serve: 20+ models, one cluster</h2><p>Every inference model runs as a named <a href="https://docs.ray.io/en/latest/serve/index.html?ref=blog.mixpeek.com" rel="noopener">Ray Serve</a> deployment -- text embedders, CLIP variants, transcript models, classifiers. We use the declarative <code>serveConfigV2</code> YAML format rather than the imperative Python API, which makes deployments GitOps-friendly and lets KubeRay manage rollouts without custom deployment scripts.</p><p>The key design decision is per-model autoscaling. A text embedding model has very different throughput and memory characteristics than a video captioning model. Treating them as a homogeneous fleet wastes GPU memory and causes head-of-line blocking:</p><pre><code class="language-yaml"># serveConfigV2 (excerpt)
serveConfigV2: |
  applications:
    # Lightweight text embedder: scale wide, low memory
    - name: intfloat__multilingual_e5_large_instruct
      import_path: engine.inference.intfloat.multilingual_e5_large_instruct.routes:app
      deployments:
        - name: MultilingualE5LargeInstructV1Deployment
          autoscaling_config:
            min_replicas: 2
            max_replicas: 10
            target_ongoing_requests: 2
            upscale_delay_s: 5
            downscale_delay_s: 300
          ray_actor_options:
            num_cpus: 0.5
            num_gpus: 0
            memory: 2147483648    # 2GB
          max_ongoing_requests: 3

    # Heavy video captioner: scale conservatively, GPU required
    - name: video_captioner
      import_path: engine.inference.video.caption.routes:app
      deployments:
        - name: VideoCaptionerDeployment
          autoscaling_config:
            min_replicas: 0        # scale to zero when idle
            max_replicas: 2
            target_ongoing_requests: 1
            upscale_delay_s: 30
            downscale_delay_s: 600
          ray_actor_options:
            num_cpus: 2
            num_gpus: 0.5
            memory: 8589934592    # 8GB
          max_ongoing_requests: 1
</code></pre><p><code>target_ongoing_requests</code> is the key lever. For high-throughput, low-latency models you target more concurrent requests per replica. For heavy models (video captioning, large-scale image encoders), you target 1 to avoid replica OOM from batched inputs piling up.</p><hr><h2 id="ray-data-the-extraction-pipeline">Ray Data: the extraction pipeline</h2><p>Batch processing uses <a href="https://docs.ray.io/en/latest/data/data.html?ref=blog.mixpeek.com" rel="noopener">Ray Data</a> with <code>map_batches</code> and <code>ActorPoolStrategy</code>. Each pipeline stage is a Python callable operating on a batch of rows.</p><p>A naive implementation runs preprocessing (S3 download, format normalization, frame extraction) once per extractor. With 10 extractors on a 1,000-file batch, that&apos;s 10,000 redundant S3 reads. We run preprocessing once, cache the result as a Ray Dataset in object store, and fan it out to all extractors:</p><pre><code class="language-python"># engine/pipelines/helpers/job_builder.py
def run_preprocessing_pipeline(
    input_dataset: ray.data.Dataset,
) -&gt; ray.data.Dataset:
    # 1 locally, 56 in prod -- detected at startup via hardware_config
    s3_concurrency = hardware_config.cpu_concurrency

    preprocessing_steps = [
        MapBatchesPipelineStep(
            S3MediaResolver,
            concurrency=s3_concurrency,
            batch_size=8,
            actor_options={&quot;memory&quot;: 3 * 1024 * 1024 * 1024},  # 3GB
        ),
        MapBatchesPipelineStep(
            ContentPrep,
            concurrency=s3_concurrency,
            batch_size=16,
            actor_options={&quot;memory&quot;: 3 * 1024 * 1024 * 1024},  # 3GB
        ),
    ]

    return BasePipeline(preprocessing_steps).run(input_dataset)
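
# Downstream (sketch, not verbatim from this module): the cached dataset is
# fanned out to every extractor, so S3 download and frame extraction happen
# once per batch instead of once per extractor, e.g.:
#   preprocessed = run_preprocessing_pipeline(input_dataset)
#   for request in extractor_requests:
#       process_feature_extractor.remote(request, preprocessed)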
</code></pre><p>The 3GB memory actor reservation is production-profiled. Early on we ran without memory hints and workers were silently OOM-killed with no useful error. Ray&apos;s scheduler doesn&apos;t know your model weights are 2.4GB unless you tell it.</p><h3 id="flexible-actor-pools-prevent-deadlock">Flexible actor pools prevent deadlock</h3><p>The original code used fixed-size actor pools:</p><pre><code class="language-python"># This deadlocks under concurrent batch jobs
compute = ray.data.ActorPoolStrategy(size=8)
</code></pre><p>With two concurrent batch jobs each requesting 8-actor pools on a 12-worker cluster, both jobs get stuck waiting for the other to release workers. Flexible pools fix it:</p><pre><code class="language-python"># engine/pipelines/steps.py
class MapBatchesPipelineStep(BasePipelineStep):
    DEFAULT_POOL_MAX_SIZE = 8

    def __init__(self, concurrency=None, pool_max_size=None, **kwargs):
        concurrency_val = concurrency if concurrency is not None else 1
        capped = min(concurrency_val, pool_max_size or self.DEFAULT_POOL_MAX_SIZE)

        # min_size=1: the job can always make progress with a single worker.
        # Ray fills in more workers as they become available. No deadlock.
        self.compute = ray.data.ActorPoolStrategy(min_size=1, max_size=capped)
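
# How a step&apos;s flexible pool reaches Ray Data (illustrative -- the BasePipeline
# plumbing that does this is not shown here):
#   step = MapBatchesPipelineStep(S3MediaResolver, concurrency=56, batch_size=8)
#   dataset = dataset.map_batches(
#       S3MediaResolver,        # stateful callable, one actor per pool slot
#       compute=step.compute,   # ActorPoolStrategy(min_size=1, max_size=8)
#       batch_size=8,
#   )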
</code></pre><hr><h2 id="isolating-batch-jobs-from-serve-replicas">Isolating batch jobs from Serve replicas</h2><p>This is the pattern we&apos;re most pleased with. Ray Serve replicas and batch pipeline tasks run on the same cluster. Without isolation, a large batch job can starve Serve replicas of workers, causing inference timeouts for live API requests.</p><p>The solution is <a href="https://docs.ray.io/en/latest/ray-core/scheduling/resources.html?ref=blog.mixpeek.com#custom-resources" rel="noopener">custom Ray resources</a>. We declare a synthetic resource called <code>batch</code> on worker nodes, then require it for every batch task:</p><pre><code class="language-python"># engine/pipelines/tasks.py
def _batch_resource_options() -&gt; dict:
    # Ray Serve replicas never request {&quot;batch&quot;: 1},
    # so they physically cannot land on batch-reserved slots.
    return {&quot;resources&quot;: {&quot;batch&quot;: 1}}

@ray.remote(max_retries=3, **_batch_resource_options())
def process_feature_extractor(
    extractor_request: ExtractorRequest,
    input_dataset: ray.data.Dataset,
) -&gt; None:
    registry = get_inference_registry()
    registry.add_packages(inference_pkg, plugins_pkg, taxonomies_pkg)
    # ... run extraction steps
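
# Launching from the API layer (illustrative):
#   ref = process_feature_extractor.remote(extractor_request, preprocessed_ds)
# The task only schedules on nodes advertising the custom &quot;batch&quot; resource
# (see the KubeRay worker spec below), so it can never crowd out Serve replicas.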
</code></pre><p>In the KubeRay worker spec, CPU workers expose this resource:</p><pre><code class="language-yaml"># CPU worker group rayStartParams
rayStartParams:
  resources: &apos;{&quot;batch&quot;: 4}&apos;   # 4 concurrent batch tasks per CPU worker node
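  # GPU workers that host Serve replicas are assumed not to set this, so batch
  # tasks (which require {&quot;batch&quot;: 1}) can never be scheduled onto them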
</code></pre><p>A massive overnight batch job can saturate all batch slots without affecting P99 latency on live inference requests. The separation cost is zero -- it&apos;s a scheduler hint with no runtime overhead.</p><hr><h2 id="production-patterns-worth-stealing">Production patterns worth stealing</h2><h3 id="non-blocking-progress-tracking-with-a-ray-actor">Non-blocking progress tracking with a Ray Actor</h3><p>Ray Data pipelines are hard to introspect externally. Worker tasks can&apos;t push progress to an external DB without blocking the pipeline. We use a long-lived Ray actor as a shared counter that workers update fire-and-forget:</p><pre><code class="language-python"># engine/monitoring/performance/utils.py
@ray.remote
class ProgressActor:
    def __init__(self, total=None, job_id=None):
        self._processed = 0
        self._total = total
        self._job_id = job_id

    def incr(self, n=1):
        self._processed += n
        return self._processed

    def get_progress(self):
        return {
            &quot;processed&quot;: self._processed,
            &quot;total&quot;: self._total,
            &quot;percent&quot;: (self._processed / self._total * 100) if self._total else None,
        }

# Instantiate once, pass handle into map_batches workers
progress_actor = ProgressActor.remote(total=dataset_size, job_id=batch_id)

# Inside a Ray Data worker -- fire-and-forget, no blocking:
progress_actor.incr.remote(len(batch))

# From the API layer:
progress = ray.get(progress_actor.get_progress.remote())
</code></pre><p>Workers call <code>.remote()</code> which returns immediately. The call is queued on the actor&apos;s mailbox and executed serially -- atomic increments without locks, without blocking the pipeline.</p><h3 id="custom-datasink-for-distributed-qdrant-writes">Custom Datasink for distributed Qdrant writes</h3><p>Collecting all pipeline output on one node before writing forces full materialization in memory. Ray Data&apos;s <code>Datasink</code> API distributes writes across all workers with built-in backpressure:</p><pre><code class="language-python"># engine/databases/qdrant/datasink.py
class QdrantDatasink(Datasink):
    @property
    def supports_distributed_writes(self) -&gt; bool:
        return True   # any worker node can write directly to Qdrant

    @property
    def min_rows_per_write(self) -&gt; int:
        return self.config.batch_size   # Qdrant&apos;s optimal upsert batch size

    def write(self, blocks, ctx):
        qdrant = QdrantBaseSync(prefer_grpc=True, ...)
        for block in blocks:
            rows = BlockAccessor.for_block(block).to_pydict()
            # upsert with exponential backoff retry
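
# Wiring it in (illustrative; the exact config object is an assumption):
#   pipeline_output_ds.write_datasink(QdrantDatasink(config=qdrant_config))
# Ray distributes the write tasks across workers, so each block is upserted
# from wherever it already lives -- no single-node collection step first.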
</code></pre><p>Peak write throughput scales linearly with worker count. <code>min_rows_per_write</code> prevents a flood of tiny Qdrant upserts that tank performance.</p><h3 id="the-localstack-parquet-workaround">The LocalStack parquet workaround</h3><p>Ray Data&apos;s native S3 parquet I/O uses PyArrow&apos;s S3 filesystem under the hood. It works great against AWS S3 in production but silently hangs against LocalStack in local dev. No error, no timeout. We wrapped all parquet I/O:</p><pre><code class="language-python"># engine/utils/ray_parquet.py
def write_parquet_safe(dataset: ray.data.Dataset, path: str) -&gt; int:
    if is_localstack_env():
        # PyArrow + boto3 against LocalStack -- reliable
        rows = dataset.take_all()
        table = pa.Table.from_pylist(rows)
        with get_localstack_s3fs().open(path, &quot;wb&quot;) as f:
            pq.write_table(table, f)
        return len(rows)
    else:
        # Production: Ray Data native (distributed across workers)
        dataset.write_parquet(path)
        return dataset.count()
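
# Assumed read-side counterpart (sketch only -- not the actual implementation;
# read_parquet_safe is referenced in the job flow below):
def read_parquet_safe(path: str) -&gt; ray.data.Dataset:
    if is_localstack_env():
        with get_localstack_s3fs().open(path, &quot;rb&quot;) as f:
            table = pq.read_table(f)
        return ray.data.from_arrow(table)
    return ray.data.read_parquet(path)   # production: distributed native read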
</code></pre><hr><h2 id="end-to-end-flow">End-to-end flow</h2><pre><code>User triggers batch job on a collection
       |
       v
FastAPI endpoint
       |  schedules as Ray remote task
       v
process_feature_extractor.remote()    requires {&quot;batch&quot;: 1}
       |
       v
  read_parquet_safe()                 reads manifest from S3
       |
       v
  run_preprocessing_pipeline()        S3MediaResolver + ContentPrep
  (runs ONCE, shared across N         Ray Data map_batches
   extractors in this job)
       |
       +------------------+---- ... ----+
       v                  v             v
  extractor_1         extractor_2   extractor_N
  map_batches()       map_batches() map_batches()
  -&gt; Ray Serve        -&gt; Ray Serve  -&gt; Ray Serve
       |
       +------------------+---- ... ----+
                          |
                          v
              QdrantDatasink.write()   distributed writes
              ProgressActor.incr()    fire-and-forget
                          |
                          v
              Indexed and searchable via /retrievers
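</code></pre><p>A minimal driver sketch of that flow. Everything here is illustrative: <code>manifest_path</code> and <code>extractor_requests</code> are made-up names, and the exact call ordering is an assumption rather than the actual orchestration code.</p><pre><code class="language-python"># Illustrative driver -- composes the pieces shown earlier in this post
manifest_ds = read_parquet_safe(manifest_path)              # job manifest from S3
preprocessed_ds = run_preprocessing_pipeline(manifest_ds)   # runs ONCE per job

# Fan out: one batch task per extractor, all sharing the cached dataset.
# Each task requires {&quot;batch&quot;: 1}, so Serve replicas are never starved.
refs = [
    process_feature_extractor.remote(req, preprocessed_ds)
    for req in extractor_requests
]
ray.get(refs)   # wait for every extractor to finish writing to Qdrant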
</code></pre><hr><h2 id="what-were-excited-about">What we&apos;re excited about</h2><p><a href="https://docs.ray.io/en/latest/ray-core/compiled-graph/index.html?ref=blog.mixpeek.com" rel="noopener"><strong>Ray Compiled Graphs.</strong></a> Our pipeline has many small coordination steps between stages -- passing dataset handles, routing between CPU and GPU workers. Each carries overhead from Ray&apos;s default scheduling path. Compiled graphs pre-compile the execution plan and remove per-call overhead. For latency-sensitive single-document requests this matters more than for bulk batch jobs.</p><p><strong>Streaming execution in Ray Data.</strong> The current pipeline materializes intermediate datasets between stages, so peak memory scales with dataset size. Ray Data&apos;s streaming execution (now the default in Ray 2.x) runs the pipeline lazily -- blocks flow through stages without full materialization. We&apos;re migrating progressively, starting with large video batches that currently hit memory pressure on the preprocessing stage.</p><p><strong>Finer-grained scheduling policies.</strong> Right now we have a coarse split: CPU workers for batch, GPU workers for Serve. As we add more GPU-heavy <a href="https://mixpeek.com/extractors?ref=blog.mixpeek.com">feature extractors</a> to the batch path, we&apos;ll need finer-grained policies. Ray&apos;s placement groups and resource bundles give us the primitives -- we just haven&apos;t built the logic yet.</p><p><a href="https://www.anyscale.com/?ref=blog.mixpeek.com" rel="noopener"><strong>Anyscale</strong></a><strong> for managed Ray.</strong> We self-manage KubeRay on GKE today. The operational burden is real -- node pool configuration, KubeRay version upgrades, GKE autoscaler tuning, monitoring. We&apos;re watching Anyscale&apos;s managed offering closely. The make-vs-buy math is still in favor of self-managed at our scale, but that inflection point is approaching.</p><hr><h2 id="lessons-briefly">Lessons, briefly</h2><ul><li><strong>Zero-CPU head nodes from day one.</strong> The head is for control. If it runs user code, a runaway task will take down your scheduler.</li><li><strong>Flexible actor pools, not fixed.</strong> <code>ActorPoolStrategy(min_size=1, max_size=N)</code> instead of <code>size=N</code>. Fixed pools deadlock under concurrent jobs.</li><li><strong>Custom resources for workload isolation.</strong> Declaring synthetic resources lets you partition cluster capacity between workload types without separate clusters.</li><li><strong>Always reserve memory in actor options.</strong> Ray can&apos;t infer your model&apos;s memory footprint. Explicit <code>memory=</code> hints prevent silent OOM kills.</li><li><strong>Fire-and-forget progress via actors.</strong> Don&apos;t block workers on external DB writes. A Ray actor as a shared counter is cheap and reliable.</li><li><strong>Environment-aware I/O wrappers.</strong> Ray Data&apos;s S3 integration has edge cases in local dev. 
Thin wrappers that detect the environment save hours of debugging.</li></ul><hr><p>If you&apos;re building on Ray and want to compare notes, or curious about how Mixpeek uses all of this to power <a href="https://mixpeek.com/capabilities?ref=blog.mixpeek.com">multimodal search and extraction</a>, we&apos;re always happy to talk.</p><p>If you want to see the output of this infrastructure in action, the <a href="https://mixpeek.com/showcase?ref=blog.mixpeek.com">retriever showcase</a> has live multimodal search demos you can run against your own content -- no infrastructure required.</p>]]></content:encoded></item><item><title><![CDATA[Political Ad Disclaimers: How ZIP+4 Targeting Creates Jurisdiction Conflicts]]></title><description><![CDATA[6,000+ ZIP codes straddle congressional district lines. At ZIP+4 precision, federal, state, and local disclaimer requirements can all apply simultaneously. Here's how multimodal AI solves what static rules engines can't.]]></description><link>http://blog.mixpeek.com/political-ad-disclaimer-zip4-targeting-compliance/</link><guid isPermaLink="false">699684dd3baecafdb7f8c17a</guid><category><![CDATA[Advertising]]></category><category><![CDATA[AdTech]]></category><category><![CDATA[Political Advertising]]></category><category><![CDATA[Compliance]]></category><category><![CDATA[Video Intelligence]]></category><category><![CDATA[Brand Safety]]></category><dc:creator><![CDATA[Ethan Steininger]]></dc:creator><pubDate>Thu, 19 Feb 2026 19:43:22 GMT</pubDate><media:content url="http://blog.mixpeek.com/content/images/2026/02/header.svg" medium="image"/><content:encoded><![CDATA[<img src="http://blog.mixpeek.com/content/images/2026/02/header.svg" alt="Political Ad Disclaimers: How ZIP+4 Targeting Creates Jurisdiction Conflicts"><p>In 2024, Meta paid $24.6 million to settle Washington state campaign finance violations &#x2014; not for running a prohibited ad, but for failing to maintain compliant disclosure records on ads already in-flight. Their fix was to exit Washington&apos;s political ad market entirely. Google, Yahoo, and The Trade Desk did the same. The regulatory intent &#x2014; transparency about who funds political advertising &#x2014; was not served by this outcome. Less accountable actors filled the gap.</p><p>The compliance problem that drove them out isn&apos;t a platform-level recordkeeping quirk. It lives inside every political ad creative, tied to a geographic precision most ad infrastructure was never designed to handle: the <strong>ZIP+4 code</strong>.</p><h2 id="the-problem-one-ad-three-simultaneous-disclaimer-regimes">The Problem: One Ad, Three Simultaneous Disclaimer Regimes</h2><p>A ZIP+4 code &#x2014; the full nine-digit USPS format &#x2014; identifies six to twenty delivery points: a few houses on one side of a block, a single apartment floor, a specific office suite. Programmatic platforms can now target CTV and video inventory at this precision using IP-based geolocation. Political campaigns adopted it aggressively because it eliminates wasted impressions in adjacent, non-competitive districts.</p><p>The precision creates a compliance dimension that existing tools don&apos;t handle: <strong>more than 6,000 five-digit ZIP codes straddle multiple congressional districts</strong>. 
At ZIP+4 granularity, a single carrier route segment may simultaneously sit inside a federal House district, a state legislative district, and a local election jurisdiction &#x2014; each governed by a different disclaimer law, with no preemption between them.</p><p>The compliance stack looks like this for a single impression in San Francisco&apos;s Mission District:</p><ul><li><strong>Federal (FEC):</strong> &quot;Paid for by [committee]&quot; + candidate verbal approval statement. Applies to federal races only.</li><li><strong>California (FPPC):</strong> Top three donors above $50,000 listed in descending order, in Arial &#x2265;10pt, occupying &#x2265;2.5% of screen height. Donor list must be updated within five business days of a threshold crossing. AI-generated content must carry explicit disclosure (AB 2355, 2024).</li><li><strong>Local (SF ordinance):</strong> Additional city-level requirements layered on top.</li></ul><p>A disclaimer that satisfies FEC requirements may not satisfy California&apos;s. A California-compliant ad whose top-donor list is five days stale is non-compliant. An AI-generated creative missing AB 2355 disclosure text is non-compliant even if every other element is correct. And none of today&apos;s ad tech tools automatically resolve: <em>this impression is in ZIP+4 XXXXX-YYYY &#x2192; which elections are relevant &#x2192; which rules apply &#x2192; does this specific creative satisfy all of them?</em></p><h2 id="why-existing-tools-leave-the-gap">Why Existing Tools Leave the Gap</h2><p>DSP targeting can reach ZIP+4 precision, but targeting and compliance verification are decoupled &#x2014; no major programmatic platform validates whether the creative&apos;s disclaimer legally satisfies the jurisdiction it&apos;s serving into.</p><p>Platform-level review (Google, Meta) checks advertiser identity, not creative compliance at sub-state geography. Third-party classifiers can identify that an ad is political &#x2014; they don&apos;t verify whether its disclaimer text meets the specific requirements of the specific jurisdiction the device is in. Address-to-district APIs resolve ZIP+4 codes to their legislative districts &#x2014; useful input, but not a compliance engine. All four capabilities exist as separate, disconnected point solutions.</p><p>The gap sits at the intersection of three things that have never been integrated: <strong>what&apos;s actually in the creative</strong>, <strong>what rules apply at this location</strong>, and <strong>real-time validation that the creative satisfies those rules</strong>.</p><h2 id="how-mixpeek-solves-it-three-layers-one-pipeline">How Mixpeek Solves It: Three Layers, One Pipeline</h2><h3 id="layer-1-%E2%80%94-feature-extractors-read-the-creative">Layer 1 &#x2014; Feature Extractors: Read the Creative</h3><p>Before any compliance check, you need to know what the creative actually contains. 
Mixpeek runs parallel extraction across every creative asset:</p><ul><li><strong>OCR</strong> reads all on-screen text &#x2014; disclaimer language, committee names, donor disclosures &#x2014; and returns bounding box dimensions that enable font size compliance checks (California&apos;s 2.5% screen height requirement is measurable directly from the extracted coordinates).</li><li><strong>Speech-to-text</strong> transcribes the audio track with timestamps, detecting verbal approval statements required for broadcast and their position in the ad.</li><li><strong>Face recognition</strong> verifies candidate on-screen appearances with duration and frame-coverage measurements &#x2014; addressing the four-second, 4%-of-frame-height broadcast requirement automatically.</li><li><strong>AI-generation detection</strong> returns a confidence score for AI-generated or substantially altered content, feeding the AB 2355 and equivalent state disclosure checks.</li></ul><p>The result is a structured creative profile: extracted disclaimer text, sponsor entities, audio transcript, candidate appearance data, content classification, and AI-generation score. This profile is computed once per creative version and cached. Every bid-time compliance check is a retriever lookup against the cached profile &#x2014; not a re-run of extraction.</p><h3 id="layer-2-%E2%80%94-taxonomies-encode-the-rules-as-data">Layer 2 &#x2014; Taxonomies: Encode the Rules as Data</h3><p>Disclaimer requirements are not code &#x2014; they&apos;re policy, and policy changes constantly. Sixteen states have enacted AI disclosure requirements for political ads since 2023. Redistricting after the 2020 Census redrew every state&apos;s legislative maps. Encoding these rules as software means a deployment cycle every time a law changes.</p><p>Mixpeek stores compliance rules as versioned, queryable taxonomy entries: required elements per jurisdiction, per election type, with effective dates. A ZIP+4-to-jurisdiction mapping sits alongside, sourced from USPS boundary data and legislative district APIs &#x2014; updated when redistricting finalizes, without touching the extraction or retrieval logic.</p><p>When California passes a new rule, one taxonomy record changes. The next creative validated against that jurisdiction reflects the updated requirement automatically. No code deployment. No engineering cycle. The rule set is data; the system adapts.</p><h3 id="layer-3-%E2%80%94-retrievers-validate-at-bid-time">Layer 3 &#x2014; Retrievers: Validate at Bid Time</h3><p>At impression time, a retriever executes three steps in under 100 milliseconds:</p><ol><li><strong>Resolve jurisdictions:</strong> ZIP+4 &#x2192; set of overlapping federal, state, and local districts.</li><li><strong>Scope to election type:</strong> A California assembly campaign filters to state-level rules; the federal and school-board jurisdictions are excluded for that creative.</li><li><strong>Validate:</strong> Cached creative profile is joined against the active taxonomy rules. Each required element is checked &#x2014; disclaimer text present, font size compliant, donor count complete, AI disclosure included. The response identifies any missing element, the applicable jurisdiction, and the rule version checked.</li></ol><p>That last detail &#x2014; rule version &#x2014; is what makes the audit trail defensible. Washington&apos;s recordkeeping mandate requires that platforms document exactly what governed each political ad placement. 
The retriever log captures creative ID, ZIP+4, jurisdiction set, taxonomy version, and compliance result per impression. The public disclosure record emerges as a byproduct of the compliance architecture.</p><h2 id="a-concrete-example">A Concrete Example</h2><p>It&apos;s October 2026. Three campaigns target the same ZIP+4 in San Francisco: a federal House race, a state assembly race, and an SFUSD school board race. All serve through the same SSP.</p><p>The federal creative passes: FEC language present, verbal approval statement detected in audio, candidate on-screen for six seconds. The state assembly creative fails: it discloses two of three qualifying donors &#x2014; a third crossed the $50,000 threshold five days ago and the creative wasn&apos;t updated. The school board creative fails: it was produced with AI-generated background imagery, carries no AB 2355 disclosure, and would have served into California&apos;s jurisdiction where that disclosure is mandatory.</p><p>All three determinations are made pre-bid, in under 100ms, using cached profiles and the current taxonomy version. No human review. Full audit records retained.</p><h2 id="the-2026-cycle-is-the-proving-ground">The 2026 Cycle Is the Proving Ground</h2><p>2026 is the first full federal cycle governed by the FEC&apos;s 2023 internet disclaimer rules, active AI disclosure mandates across sixteen states, and Washington&apos;s $24.6 million precedent for platform liability. The compliance infrastructure question is no longer hypothetical.</p><p>Mixpeek&apos;s pipeline &#x2014; feature extractors that read creatives, taxonomies that encode the rules as data, retrievers that join them at bid time &#x2014; converts the compliance gap from an engineering problem requiring perpetual legal-to-code translation into a data infrastructure problem with a defined maintenance model.</p><p><a href="https://mixpeek.com/solutions/advertising?ref=blog.mixpeek.com">Explore Mixpeek for Advertising</a> or <a href="https://mixpeek.com/schedule-demo?ref=blog.mixpeek.com">schedule a demo</a> to walk through the ZIP+4 compliance pipeline with your specific inventory and targeting configuration.</p>]]></content:encoded></item><item><title><![CDATA[Multimodal Monday #45: Birds, Whales, and the End of Latency]]></title><description><![CDATA[Your Weekly Multimodal AI Roundup (Feb 9 - Feb 16)
]]></description><link>http://blog.mixpeek.com/multimodal-monday-45/</link><guid isPermaLink="false">6994895d3baecafdb7f8bd53</guid><category><![CDATA[Multimodal Monday]]></category><dc:creator><![CDATA[Philip Bankier]]></dc:creator><pubDate>Tue, 17 Feb 2026 15:43:48 GMT</pubDate><media:content url="http://blog.mixpeek.com/content/images/2026/02/mm_45.png" medium="image"/><content:encoded><![CDATA[<h3 id="quick-take-tldr">Quick Take (TL;DR)</h3><ul><li><strong>Voice AI drops the walkie-talkie act.</strong> NVIDIA&apos;s PersonaPlex-7B and ElevenLabs&apos; Expressive Mode both ship full-duplex conversation. The AI listens while it talks, interrupts naturally, and adjusts tone mid-sentence. Turn-taking latency is dead.</li><li><strong>Vision goes native.</strong> Qwen3.5 (397B parameters) and DeepGen 1.0 bake visual understanding directly into the model architecture instead of wiring a vision encoder to a language model after the fact. The result: tighter reasoning over charts, documents, and complex images.</li><li><strong>A bird model decoded whale songs.</strong> Google fine-tuned Perch 2.0 (trained on birdsong) to classify whale vocalizations. It worked, which means bioacoustic signals share deeper structural patterns than anyone expected.</li></ul><hr><h3 id="tools-models-and-techniques">Tools, Models and Techniques</h3><img src="http://blog.mixpeek.com/content/images/2026/02/mm_45.png" alt="Multimodal Monday #45: Birds, Whales, and the End of Latency"><p><strong>Qwen3.5-397B-A17B</strong> - Qwen&apos;s new foundation model pairs a 397B-parameter vision-language architecture with hybrid linear attention heads. It handles document parsing, chart analysis, and visual reasoning natively rather than routing through a separate encoder. <strong>Why it matters:</strong> An open model at this scale with native multimodal integration puts serious pressure on proprietary alternatives. <a href="https://qwen.ai/blog?id=qwen3.5&amp;ref=blog.mixpeek.com">Blog</a> | <a href="https://huggingface.co/Qwen/Qwen3.5-397B-A17B?ref=blog.mixpeek.com">Hugging Face</a></p><p><strong>PersonaPlex-7B</strong> - NVIDIA released a 7B voice model that listens and speaks at the same time. It supports natural interruptions (&quot;barge-in&quot;), overlapping speech, and real-time turn negotiation without the pause-wait-respond loop. <strong>Why it matters:</strong> Full-duplex conversation removes the single biggest friction point in voice AI: latency. <a href="https://huggingface.co/nvidia/personaplex-7b-v1?ref=blog.mixpeek.com">Hugging Face</a></p>
<!--kg-card-begin: html-->
<video controls> <source src="https://multimodal-monday.s3.us-east-2.amazonaws.com/week-45/personaplex.mp4"> Your browser does not support the video tag. </video>
<!--kg-card-end: html-->
<p><strong>ElevenAgents Expressive Mode</strong> - ElevenLabs added breath, pauses, and emotional inflection to their voice agents. The output sounds less like text-to-speech and more like someone actually thinking before they talk. <strong>Why it matters:</strong> Voice agents in support, coaching, and companionship roles need to sound like they care, and this gets closer. <a href="https://elevenlabs.io/blog/introducing-expressive-mode?ref=blog.mixpeek.com">Blog</a> | <a href="https://elevenlabs.io/agents/expressive-mode?ref=blog.mixpeek.com">Try it</a></p>
<!--kg-card-begin: html-->
<video controls> <source src="https://multimodal-monday.s3.us-east-2.amazonaws.com/week-45/eleven.mp4"> Your browser does not support the video tag. </video>
<!--kg-card-end: html-->
<p><strong>MiniMax M2.5</strong> - MiniMax open-sourced a frontier model tuned for practical work: coding, writing, and structured analysis. It prioritizes instruction-following accuracy over open-ended chat. <strong>Why it matters:</strong> A model built to execute tasks reliably matters more than one that chats well. <a href="https://huggingface.co/MiniMaxAI/MiniMax-M2.5?ref=blog.mixpeek.com">Hugging Face</a></p><figure class="kg-card kg-image-card"><img src="https://substackcdn.com/image/fetch/$s_!EA4s!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32d295f2-ee58-4c79-9d0e-146f80419a21_1200x579.jpeg" class="kg-image" alt="Multimodal Monday #45: Birds, Whales, and the End of Latency" loading="lazy" width="1200" height="579"></figure><p><strong>Seedance 2.0</strong> - ByteDance&apos;s video generator takes text, images, audio, or existing video as input and produces new video synchronized to the audio beat. It automates the tedious frame-by-frame alignment work that eats hours in post-production. <strong>Why it matters:</strong> Audio-visual sync is the bottleneck in short-form video production, and this removes it. <a href="https://seed.bytedance.com/en/seedance2_0?ref=blog.mixpeek.com">Project Page</a></p>
<!--kg-card-begin: html-->
<video controls> <source src="https://multimodal-monday.s3.us-east-2.amazonaws.com/week-45/seedance.mp4"> Your browser does not support the video tag. </video>
<!--kg-card-end: html-->
<ul><li><strong>Qwen-Image-2.0:</strong> Professional infographics and photorealism generation. <a href="https://qwen.ai/blog?id=qwen-image-2.0&amp;ref=blog.mixpeek.com">Blog</a></li></ul><figure class="kg-card kg-image-card"><img src="https://qianwen-res.oss-accelerate-overseas.aliyuncs.com/Qwen-Image/image2/top.png#center" class="kg-image" alt="Multimodal Monday #45: Birds, Whales, and the End of Latency" loading="lazy" width="2688" height="1536"></figure><ul><li><strong>DeepGen 1.0:</strong> A lightweight 5B-parameter unified multimodal model. <a href="https://huggingface.co/deepgenteam/DeepGen-1.0?ref=blog.mixpeek.com">Hugging Face</a></li><li><strong>GLM-5:</strong> From vibe coding to agentic engineering. <a href="https://z.ai/blog/glm-5?ref=blog.mixpeek.com">Blog</a></li></ul><figure class="kg-card kg-image-card"><img src="https://z-cdn-media.chatglm.cn/prompts-rich-media-resources/5-blog/20260212-010724.png" class="kg-image" alt="Multimodal Monday #45: Birds, Whales, and the End of Latency" loading="lazy" width="4239" height="2884"></figure><ul><li><strong>KaniTTS2:</strong> Open-source 400M TTS model that runs in 3GB VRAM. <a href="https://huggingface.co/nineninesix/kani-tts-2-pt?ref=blog.mixpeek.com">Hugging Face</a></li><li><strong>SoulX-Singer:</strong> High-quality zero-shot singing voice synthesis. <a href="https://github.com/Soul-AILab/SoulX-Singer/tree/main?ref=blog.mixpeek.com">GitHub</a></li></ul><figure class="kg-card kg-image-card"><img src="https://github.com/Soul-AILab/SoulX-Singer/raw/main/assets/performance_radar.png" class="kg-image" alt="Multimodal Monday #45: Birds, Whales, and the End of Latency" loading="lazy" width="1016" height="425"></figure><ul><li><strong>MioTTS-2.6B:</strong> Lightweight TTS optimized for speed in English and Japanese. <a href="https://huggingface.co/Aratako/MioTTS-2.6B?ref=blog.mixpeek.com">Hugging Face</a></li><li><strong>FireRed-Image-Edit-1.0:</strong> New tool for image editing. <a href="https://huggingface.co/FireRedTeam/FireRed-Image-Edit-1.0?ref=blog.mixpeek.com">Hugging Face</a></li></ul><figure class="kg-card kg-image-card"><img src="https://huggingface.co/FireRedTeam/FireRed-Image-Edit-1.0/resolve/main/assets/teaser.png" class="kg-image" alt="Multimodal Monday #45: Birds, Whales, and the End of Latency" loading="lazy" width="2752" height="1536"></figure><ul><li><strong>Qwen3-TTS:</strong> 1.7B parameters of clean, natural speech synthesis. <a href="https://huggingface.co/Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice?ref=blog.mixpeek.com">Hugging Face</a></li></ul>
<!--kg-card-begin: html-->
<video controls> <source src="https://multimodal-monday.s3.us-east-2.amazonaws.com/week-45/qwen3.mp4"> Your browser does not support the video tag. </video>
<!--kg-card-end: html-->
<ul><li><strong>Ming-flash-omni 2.0:</strong> New multimodal model from InclusionAI. <a href="https://huggingface.co/inclusionAI/Ming-flash-omni-2.0?ref=blog.mixpeek.com">Hugging Face</a></li></ul><hr><h3 id="research-highlights">Research Highlights</h3><p><strong>EchoJEPA: Latent Prediction for Hearts -</strong> A self-supervised foundation model trained on 18 million echocardiograms. Instead of predicting noisy ultrasound pixels, it learns in latent space and separates clinical signal from artifact, outperforming existing cardiac assessment methods. <strong>Why it matters:</strong> Self-supervised training on massive unlabeled medical data catches anomalies that small labeled datasets miss. <a href="https://arxiv.org/abs/2602.02603?ref=blog.mixpeek.com">Paper</a></p><p><strong>Bioacoustics Transfer Learning</strong> - Google Research adapted Perch 2.0, trained entirely on bird songs, to classify whale vocalizations. The cross-domain transfer worked because bioacoustic signals share fundamental spectral and temporal features across species. <strong>Why it matters:</strong> You can train on abundant data (birds) and fine-tune for scarce data (whales), which unlocks conservation research without needing millions of labeled samples per species. <a href="https://research.google/blog/how-ai-trained-on-birds-is-surfacing-underwater-mysteries/?ref=blog.mixpeek.com">Blog</a></p><p><strong>Beyond the Unit Hypersphere -</strong> This paper challenges the standard practice of normalizing embeddings onto the unit hypersphere in contrastive learning. The authors show that embedding magnitude carries meaningful information about confidence and specificity that normalization destroys. <strong>Why it matters:</strong> Preserving magnitude leads to more nuanced retrieval and better performance on ambiguous queries. <a href="https://arxiv.org/abs/2602.09229?ref=blog.mixpeek.com">Paper</a></p><p><strong>DuoGen: Mixed-Media Storytelling</strong> - NVIDIA&apos;s DuoGen generates coherent interleaved sequences of images and text. It decides when to show and when to tell, keeping visual and textual content consistent across the full narrative. <strong>Why it matters:</strong> This opens the door to AI-generated tutorials, articles, and illustrated content that reads as authored rather than assembled. <a href="https://research.nvidia.com/labs/dir/duogen/?ref=blog.mixpeek.com">Project Page</a></p>
<!--kg-card-begin: html-->
<video controls> <source src="https://multimodal-monday.s3.us-east-2.amazonaws.com/week-45/duogen.mp4"> Your browser does not support the video tag. </video>
<!--kg-card-end: html-->
<p><strong>UniAudio 2.0</strong> - A single audio language model that handles speech, music, and sound effects through text-aligned factorized tokenization. One framework generates, edits, and mixes across all audio types without switching models. <strong>Why it matters:</strong> Unifying the audio stack (TTS, music generation, foley) into one model creates workflows that were previously impossible without multiple specialized tools. <a href="https://arxiv.org/pdf/2602.04683?ref=blog.mixpeek.com">Paper</a></p><ul><li><strong>ALIVE:</strong> Lifelike audio-video generation. <a href="https://foundationvision.github.io/Alive/?ref=blog.mixpeek.com">Project Page</a></li></ul>
<!--kg-card-begin: html-->
<video controls> <source src="https://multimodal-monday.s3.us-east-2.amazonaws.com/week-45/Alive.mp4"> Your browser does not support the video tag. </video>
<!--kg-card-end: html-->
<ul><li><strong>ConsID-Gen:</strong> View-consistent, identity-preserving image-to-video generation. <a href="https://mingyang.me/ConsID-Gen/?ref=blog.mixpeek.com">Project Page</a></li><li><strong>JUST-DUB-IT:</strong> Video dubbing via joint audio-visual diffusion. <a href="https://justdubit.github.io/?ref=blog.mixpeek.com">Project Page</a></li><li><strong>Voice-First Human-AI Collaboration:</strong> Exploring LMMs in mixed reality. <a href="https://arxiv.org/html/2602.11025v1?ref=blog.mixpeek.com">Paper</a></li><li><strong>Multimodal Manufacturing Safety Chatbot:</strong> Benchmark for RAG approaches in safety. <a href="https://arxiv.org/html/2511.11847v2?ref=blog.mixpeek.com">Paper</a></li><li><strong>Alzheimer&apos;s Detection:</strong> Multimodal fusion for better diagnosis. <a href="https://pmc.ncbi.nlm.nih.gov/articles/PMC12876535/?ref=blog.mixpeek.com">Paper</a></li></ul><hr><h3 id="trends-predictions">Trends &amp; Predictions</h3><p><strong>Full-Duplex Voice Is Here</strong></p><p>PersonaPlex-7B and ElevenAgents both shipped full-duplex voice this week. The &quot;you talk, then I talk&quot; model is officially legacy.</p><p>Real conversations overlap. People interrupt, confirm with &quot;uh-huh,&quot; and change direction mid-thought. Full-duplex models handle all of this. More importantly, continuous listening lets the model start composing a response before you finish your sentence. That shaves hundreds of milliseconds off response time, which matters in customer support, gaming, and any scenario where hesitation breaks trust. And when the model hears frustration in your voice while you&apos;re still talking, it can adjust its response before delivering it. </p><p><strong>Native Multimodal Architectures Are Winning</strong></p><p>Qwen3.5 and DeepGen 1.0 both build vision into the model from the ground up. No separate encoder. No adapter layer. No translation step. When vision and language train together from scratch, the model reasons with visual information instead of converting it to text first. You get a system that reads a chart and understands the argument the chart is making, not just the numbers on it. Unified architectures also cut inference overhead because data doesn&apos;t bounce between modules. This is what enables tasks like &quot;analyze this graph in the context of the surrounding report&quot; where tight cross-modal reasoning is the whole point.</p><hr><h3 id="community-shoutouts">Community + Shoutouts</h3><ul><li><strong>Larry the OpenClaw:</strong> Shoutout to <strong>@oliverhenry</strong> for the writeup on Larry, the open-source robot arm doing social media. A fun look at embodied AI in the wild. <a href="https://x.com/oliverhenry/status/2022011925903667547?s=20&amp;ref=blog.mixpeek.com">X Post</a></li><li><strong>OneVision Encoder:</strong> Thanks to <strong>@brian_bo_li</strong> for the deep dive into the OneVision Encoder. Understanding the &quot;eyes&quot; of these models is crucial for building better apps. 
<a href="https://x.com/brian_bo_li/status/2021649265123373149?s=42&amp;ref=blog.mixpeek.com">X Post</a></li></ul><figure class="kg-card kg-image-card kg-width-full"><img src="https://substackcdn.com/image/fetch/$s_!MolM!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74f043d5-27ba-4a47-83fd-5e936cf98b8a_1200x1025.jpeg" class="kg-image" alt="Multimodal Monday #45: Birds, Whales, and the End of Latency" loading="lazy" width="1200" height="1025"></figure><ul><li><strong>AutoGuidance Node:</strong> A great resource for the ComfyUI community: a custom node implementing AutoGuidance. <a href="https://github.com/xmarre/ComfyUI-AutoGuidance?ref=blog.mixpeek.com">GitHub</a></li><li><strong>Kling 3.0 Fun:</strong> <strong>@lexx_aura</strong> shows off the capabilities (and hilarity) of Kling 3.0. Sometimes the best way to test a model is to just make something weird. <a href="https://x.com/lexx_aura/status/2022022799905394995?s=20&amp;ref=blog.mixpeek.com">X Post</a></li></ul>
<!--kg-card-begin: html-->
<video controls> <source src="https://multimodal-monday.s3.us-east-2.amazonaws.com/week-45/Will+Smith+in+the+Battle+of+Spaghettysburg.mp4"> Your browser does not support the video tag. </video>
<!--kg-card-end: html-->
<hr><p><em>That&apos;s a wrap for Multimodal Monday #45! </em>From full-duplex voice models that listen and speak simultaneously, to 397B-parameter architectures that reason with pixels instead of converting them to words, to a birdsong classifier that turned out to understand whales, this week showed multimodal AI getting less polite and more useful. </p><p><em>Ready to build multimodal solutions that actually work?&#xA0;</em><a href="https://mixpeek.com/contact?ref=blog.mixpeek.com"><em>Let&apos;s talk</em></a></p>]]></content:encoded></item></channel></rss>