<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/"><channel><title><![CDATA[Mixpeek Engineering Blog]]></title><description><![CDATA[Deep dive into multimodal AI, data processing, and best practices from our engineering team.]]></description><link>http://blog.mixpeek.com/</link><image><url>http://blog.mixpeek.com/favicon.png</url><title>Mixpeek Engineering Blog</title><link>http://blog.mixpeek.com/</link></image><generator>Ghost 5.82</generator><lastBuildDate>Wed, 06 May 2026 11:41:28 GMT</lastBuildDate><atom:link href="http://blog.mixpeek.com/rss/" rel="self" type="application/rss+xml"/><ttl>60</ttl><item><title><![CDATA[The 3072-Dimension Problem]]></title><description><![CDATA[A 3072-dimensional embedding encodes everything about a video and distinguishes nothing. Decomposing content into named, measurable features, then placing them in a queryable hierarchy, is how multimodal search actually works at scale.]]></description><link>http://blog.mixpeek.com/the-3072-dimension-problem/</link><guid isPermaLink="false">69f20c72c76853422347688b</guid><category><![CDATA[Architecture]]></category><category><![CDATA[Multimodal]]></category><dc:creator><![CDATA[Ethan Steininger]]></dc:creator><pubDate>Wed, 29 Apr 2026 14:23:45 GMT</pubDate><media:content url="http://blog.mixpeek.com/content/images/2026/04/ChatGPT-Image-Apr-29--2026--10_12_20-AM.png" medium="image"/><content:encoded><![CDATA[<img src="http://blog.mixpeek.com/content/images/2026/04/ChatGPT-Image-Apr-29--2026--10_12_20-AM.png" alt="The 3072-Dimension Problem"><p>You embedded your video library with Gemini. Or <a href="https://mixpeek.com/model/openai/clip-vit-large-patch14?ref=blog.mixpeek.com">CLIP</a>, or <a href="https://mixpeek.com/model/google/siglip2-giant-opt-patch16-384?ref=blog.mixpeek.com">SigLIP</a>, or whatever the model of the month was when the project started. You stored a few million vectors in Pinecone or Qdrant or Weaviate. You wired up <a href="https://mixpeek.com/glossary?ref=blog.mixpeek.com">cosine similarity</a>. You ran your first query.</p><p>The results were fine. Not great. Fine.</p><p>A search for &quot;person holding a coffee cup&quot; returned videos with people, and videos with cups, and a surprising number of videos with neither. Reranking helped a little. <a href="https://mixpeek.com/hybrid-search?ref=blog.mixpeek.com">Hybrid BM25</a> helped a little more. But somewhere around the fourth or fifth round of tuning, you started to suspect that the problem wasn&apos;t your retriever, or your reranker, or your prompt. The problem was that a 3072-dimensional embedding is not a useful unit of work.</p><p>It encodes everything. The lighting, the camera angle, the dominant color, the demographic of the person on screen, the room they&apos;re in, the <em>vibe</em> of the room they&apos;re in. All of it, smeared across three thousand floats. Cosine similarity treats every dimension equally. Your application does not.</p><p>Most teams hit that wall about six weeks into a <a href="https://mixpeek.com/multimodal-search?ref=blog.mixpeek.com">multimodal project</a>. 
A better vector database won&apos;t get you past it.</p><hr><h2 id="the-reframe">The reframe</h2><p>A data science lead we work with (thirty years in the industry, ran one of the first audio fingerprinting startups in the early 2000s) described our job to me recently in a way I haven&apos;t stopped thinking about:</p><blockquote>Reduce the number of dimensions in a video to a handful of measurable features, then place every new video into a hierarchical structure defined by those features.</blockquote><p>He called the output a &quot;compact fingerprint.&quot;</p><p>When I shared that framing with an engineer at AWS, he pushed back immediately. Every kind of <a href="https://mixpeek.com/embeddings?ref=blog.mixpeek.com">representation learning</a> does that, he said. Find small, interpretable representations of complex things. He was right. The idea is old. What&apos;s rare is running it as managed infrastructure, continuously, across millions of objects, with the reduced features exposed as first-class queryable surfaces.</p><p>The job of multimodal infrastructure is not to be a faster <a href="https://mixpeek.com/vector-search?ref=blog.mixpeek.com">vector database</a>. It&apos;s the systems layer that takes you from <em>I have embeddings</em> to <em>I have a usable application.</em> That layer has a specific shape, and most of the work is in the shape.</p><hr><h2 id="how-decomposition-works-in-practice">How decomposition works in practice</h2><p>The shape, in our system, is six primitives that compose in one direction.</p>
<!--kg-card-begin: html-->
<div style="text-align:center;margin:2em 0">
<svg viewbox="0 0 720 320" xmlns="http://www.w3.org/2000/svg" style="max-width:100%;font-family:-apple-system,BlinkMacSystemFont,sans-serif">
  <defs>
    <marker id="arrow" viewbox="0 0 10 10" refx="9" refy="5" markerwidth="6" markerheight="6" orient="auto-start-reverse">
      <path d="M 0 0 L 10 5 L 0 10 z" fill="#666"/>
    </marker>
  </defs>
  <g font-size="13" text-anchor="middle">
    <rect x="20" y="40" width="120" height="56" rx="8" fill="#e8f4f8" stroke="#5ba4cf" stroke-width="1.5"/>
    <text x="80" y="64" font-weight="600" fill="#1a3a4a">Bucket</text>
    <text x="80" y="82" font-size="10" fill="#5a7a8a">raw objects land</text>
    <rect x="220" y="40" width="120" height="56" rx="8" fill="#e8f4f8" stroke="#5ba4cf" stroke-width="1.5"/>
    <text x="280" y="64" font-weight="600" fill="#1a3a4a">Extractor</text>
    <text x="280" y="82" font-size="10" fill="#5a7a8a">decompose features</text>
    <rect x="420" y="40" width="120" height="56" rx="8" fill="#e8f4f8" stroke="#5ba4cf" stroke-width="1.5"/>
    <text x="480" y="64" font-weight="600" fill="#1a3a4a">Collection</text>
    <text x="480" y="82" font-size="10" fill="#5a7a8a">queryable surfaces</text>
    <rect x="20" y="200" width="120" height="56" rx="8" fill="#f0e8f8" stroke="#9b6bc1" stroke-width="1.5"/>
    <text x="80" y="224" font-weight="600" fill="#3a1a4a">Retriever</text>
    <text x="80" y="242" font-size="10" fill="#6a5a7a">compose pipelines</text>
    <rect x="220" y="200" width="120" height="56" rx="8" fill="#f0e8f8" stroke="#9b6bc1" stroke-width="1.5"/>
    <text x="280" y="224" font-weight="600" fill="#3a1a4a">Taxonomy</text>
    <text x="280" y="242" font-size="10" fill="#6a5a7a">hierarchical placement</text>
    <rect x="420" y="200" width="120" height="56" rx="8" fill="#f0e8f8" stroke="#9b6bc1" stroke-width="1.5"/>
    <text x="480" y="224" font-weight="600" fill="#3a1a4a">Cluster</text>
    <text x="480" y="242" font-size="10" fill="#6a5a7a">emergent structure</text>
  </g>
  <line x1="140" y1="68" x2="218" y2="68" stroke="#666" stroke-width="1.5" marker-end="url(#arrow)"/>
  <line x1="340" y1="68" x2="418" y2="68" stroke="#666" stroke-width="1.5" marker-end="url(#arrow)"/>
  <path d="M 480 96 L 480 148 L 80 148 L 80 198" stroke="#666" stroke-width="1.5" fill="none" marker-end="url(#arrow)" stroke-dasharray="6,3"/>
  <line x1="140" y1="228" x2="218" y2="228" stroke="#666" stroke-width="1.5" marker-end="url(#arrow)"/>
  <line x1="340" y1="228" x2="418" y2="228" stroke="#666" stroke-width="1.5" marker-end="url(#arrow)"/>
  <path d="M 540 228 C 620 228, 640 148, 640 68 C 640 48, 560 48, 542 58" stroke="#9b6bc1" stroke-width="1.2" fill="none" marker-end="url(#arrow)" stroke-dasharray="4,3"/>
  <text x="660" y="152" font-size="10" fill="#9b6bc1" font-family="-apple-system,BlinkMacSystemFont,sans-serif" transform="rotate(-90,660,152)">feeds back</text>
  <text x="360" y="28" font-size="11" fill="#888" text-anchor="middle" font-family="-apple-system,BlinkMacSystemFont,sans-serif">reduction &amp; addressing</text>
  <text x="360" y="290" font-size="11" fill="#888" text-anchor="middle" font-family="-apple-system,BlinkMacSystemFont,sans-serif">composition &amp; structure</text>
</svg>
</div>
<!--kg-card-end: html-->
<p><strong>Buckets</strong> are where raw objects land. Videos, images, documents, audio. A bucket has a schema and an optional <a href="https://mixpeek.com/connectors?ref=blog.mixpeek.com">sync connection</a> (S3, GCS, Drive) and its only job is to be the entry point. Nothing has happened to the content yet.</p><p><strong>Feature extractors</strong> are the decomposition step, where the dimension reduction actually lives. A <a href="https://mixpeek.com/extractors?ref=blog.mixpeek.com">shot detector</a> turns a thirty-second video into eight scenes with timestamps. A <a href="https://mixpeek.com/model/deepinsight/retinaface-r50?ref=blog.mixpeek.com">face identity model</a> turns a frame into bounding boxes and 512-dim face embeddings. A <a href="https://mixpeek.com/model/google/siglip2-giant-opt-patch16-384?ref=blog.mixpeek.com">multimodal extractor</a> turns a scene into a Gemini embedding <em>plus</em> <a href="https://mixpeek.com/ocr?ref=blog.mixpeek.com">OCR output</a> <em>plus</em> dominant colors <em>plus</em> whatever else you configured. <a href="https://mixpeek.com/marketplace?ref=blog.mixpeek.com">Custom plugins</a> (Python code, optionally with model weights) let you do the same for proprietary file types or domain-specific features.</p><p><strong>Collections</strong> are where the reduced features live. Each collection is the output of one extractor, which means each has its own schema, its own embedding space, its own indices. A single bucket fans out into many collections: scenes, faces, OCR text, objects, audio segments. Instead of one giant vector per video, you get many small structured records, each measurable along a dimension you chose on purpose. That&apos;s the move that makes everything else work.</p><p><strong>Retrievers</strong> compose collections. A <a href="https://mixpeek.com/retrievers?ref=blog.mixpeek.com">retriever</a> is a multi-stage pipeline: <a href="https://mixpeek.com/semantic-search?ref=blog.mixpeek.com">feature search</a> on collection A, attribute filter on collection B, LLM filter on the merged set, reciprocal rank fusion at the end. Stages pass documents forward in a working set. If you&apos;ve written a MongoDB aggregation pipeline, the mental model transfers directly. Retrievers are pipelines because no real application wants a single similarity score. It wants similarity <em>plus</em> a metadata filter <em>plus</em> a rerank <em>plus</em> a join.</p><p><a href="https://mixpeek.com/taxonomies?ref=blog.mixpeek.com"><strong>Taxonomies</strong></a> are the hierarchy. A taxonomy is a semantic join between two collections, where the join operation is itself a retriever pipeline. The canonical example: collection A is faces extracted from a casting database (with names attached), collection B is faces extracted from new ad creative. The taxonomy says <em>for every face in B, find the nearest face in A above some threshold, and enrich B with the name.</em> Run that at ingest time and every new ad arrives pre-labeled with its talent.</p><p><a href="https://mixpeek.com/clusters?ref=blog.mixpeek.com"><strong>Clusters</strong></a> are the dual of taxonomies. Taxonomies impose structure top-down (you defined the casting database). Clusters discover structure bottom-up. Run <a href="https://mixpeek.com/blog/multimodal-taxonomies?ref=blog.mixpeek.com">HDBSCAN on the embeddings</a> in a collection, send the centroids to an LLM, and you get labeled groups you didn&apos;t have to specify in advance. 
The output is, of course, another collection, which can feed another retriever, which can populate another taxonomy.</p><p>Six primitives. They compose in one direction: raw object, decomposed feature, queryable surface, composed pipeline, hierarchical placement, emergent structure. The operation is <em>reduce to measurable features and make them addressable.</em></p><hr><h2 id="why-hierarchy-matters">Why hierarchy matters</h2><p>Reduction alone is not enough. If you stop after the extractor stage, you have a smarter vector database. Better features, sure, but still a system where every query starts from scratch.</p><p>The leverage is in the hierarchy. Once you have a taxonomy of brands, products, talent, scenes, moods (or whatever your domain calls for), every new piece of content gets <em>placed</em> against it at ingest time. The compact fingerprint is computed once. Its location in the hierarchy is computed once. After that, retrieval is mostly traversal.</p><p>That collapses a distinction most teams treat as fundamental: enrichment versus search.</p><ul><li><strong>Flat embedding world:</strong> enrichment is a separate batch job. Run a labeling pipeline, write labels back to the database, hope the labels stay fresh</li><li><strong>Decomposed-and-placed world:</strong> enrichment <em>is</em> the same operation as search, run in reverse</li></ul><p>The taxonomy that locates a new ad in your brand hierarchy is the same retriever pipeline a user would invoke to ask &quot;show me ads in this brand.&quot; One traversal, two directions.</p><p>Reducing dimensions is table stakes. Making the reduction a permanent, queryable, hierarchical structure is the part that compounds.</p><hr><h2 id="what-this-unlocks">What this unlocks</h2><p>Three things fall out of this architecture that are hard to build any other way.</p><h3 id="agentic-retrieval">Agentic retrieval</h3><p>When your features are decomposed into named collections with named extractors, you can hand them to an LLM as tools. The LLM doesn&apos;t get a single search endpoint. It gets a feature search tool, an attribute filter tool, an LLM filter tool, each with explicit input and output shapes. The agent composes <a href="https://mixpeek.com/agentic-rag?ref=blog.mixpeek.com">retriever stages</a> dynamically based on the task.</p><blockquote>Find ads featuring this actor that performed well in Q3 and use a similar color palette to this reference image</blockquote><p>That becomes a four-stage pipeline the agent assembles on its own. You can&apos;t do that against a flat vector store because there&apos;s nothing to compose.</p><h3 id="cross-collection-joins">Cross-collection joins</h3><p>The casting-database example is the simple form. The general form: any two collections with comparable feature spaces can be joined via a taxonomy, so you can enrich any feature with any other feature.</p><ul><li>Faces with names</li><li>Scenes with brands</li><li>Audio with transcripts</li><li>Products with categories</li></ul><p>The join is a retriever, the retriever is reusable, and the enriched output is itself a collection that feeds the next join.</p><h3 id="clustering-that-closes-the-loop">Clustering that closes the loop</h3><p>Run a clustering job on a collection, label the centroids with an LLM, and the labels become a taxonomy you didn&apos;t have to design. Apply that taxonomy to incoming content and every new object gets placed in a category that emerged from your data. 
The system bootstraps its own hierarchy.</p><p>The flywheel:</p><ol><li>More content produces better clusters</li><li>Better clusters produce better taxonomies</li><li>Better taxonomies produce better placement of the next batch</li></ol><hr><h2 id="the-conceptual-frame">The conceptual frame</h2><p>The point of decomposition isn&apos;t to do anything magical. It takes the operation that <a href="https://mixpeek.com/embeddings?ref=blog.mixpeek.com">representation learning</a> has always done (reduce complex things to small interpretable representations) and makes it a piece of infrastructure instead of a piece of research.</p><ul><li><a href="https://mixpeek.com/extractors?ref=blog.mixpeek.com">Extractors</a> instead of one-off model training</li><li>Collections instead of pickled embeddings</li><li><a href="https://mixpeek.com/retrievers?ref=blog.mixpeek.com">Retrievers</a> instead of bespoke search code</li><li><a href="https://mixpeek.com/taxonomies?ref=blog.mixpeek.com">Taxonomies</a> instead of manual labeling</li><li><a href="https://mixpeek.com/clusters?ref=blog.mixpeek.com">Clusters</a> instead of EDA notebooks</li></ul><p>A 3072-dimensional <a href="https://mixpeek.com/embeddings?ref=blog.mixpeek.com">embedding</a> is a starting point. The interesting work is in what you do after the embedding. If your stack stops there, you&apos;ll spend the next eighteen months reinventing the rest in application code. We know because we&apos;ve watched it happen.</p><p>If any of that resonates, the <a href="https://docs.mixpeek.com/?ref=blog.mixpeek.com">docs</a> walk through the primitives in the order they compose, with code.</p>]]></content:encoded></item><item><title><![CDATA[Why Vector Search Alone Can't Find What's in Your Videos]]></title><description><![CDATA[Text-only RAG pipelines miss 80% of what is in your content. A video contains faces, dialogue, on-screen text, background music, and brand logos. No single embedding captures all of that. The solution is multi-stage retrieval.]]></description><link>http://blog.mixpeek.com/multimodal-retrieval-beyond-vector-search/</link><guid isPermaLink="false">69ed2086c768534223476836</guid><category><![CDATA[Multimodal AI]]></category><category><![CDATA[Vector Search]]></category><category><![CDATA[Retrieval]]></category><category><![CDATA[Architecture]]></category><dc:creator><![CDATA[Ethan Steininger]]></dc:creator><pubDate>Sun, 26 Apr 2026 13:59:04 GMT</pubDate><media:content url="http://blog.mixpeek.com/content/images/2026/04/ChatGPT-Image-Apr-26--2026--09_58_30-AM-1.png" medium="image"/><content:encoded><![CDATA[<img src="http://blog.mixpeek.com/content/images/2026/04/ChatGPT-Image-Apr-26--2026--09_58_30-AM-1.png" alt="Why Vector Search Alone Can&apos;t Find What&apos;s in Your Videos"><p><strong>TL;DR:</strong> Text-only RAG pipelines miss 80% of what&apos;s in your content. A video contains faces, dialogue, on-screen text, background music, scene transitions, and brand logos. No single embedding captures all of that. The solution is <a href="https://mixpeek.com/blog/multi-stage-retrieval-pipelines?ref=blog.mixpeek.com">multi-stage retrieval</a>: extract multiple features per document, search each independently, then merge and rerank the results into one ranked list.</p><hr><h2 id="the-problem-everyone-ignores">The Problem Everyone Ignores</h2><p>Most retrieval systems work like this: take content, generate one embedding, store it in a vector database, run cosine similarity at query time. For text documents, this is fine. 
For everything else, it falls apart.</p><p>Consider a 30-second product video. It contains:</p><ul><li><strong>Visual content:</strong> product shots, lifestyle imagery, brand colors</li><li><strong>Spoken audio:</strong> a voiceover describing features and pricing</li><li><strong>On-screen text:</strong> &quot;50% off,&quot; a URL, a product name</li><li><strong>Music/tone:</strong> upbeat, corporate, dramatic</li><li><strong>Faces:</strong> a spokesperson, a customer testimonial</li></ul><p>A single CLIP embedding of a keyframe captures maybe the visual content. The dialogue, the on-screen text, the audio tone, the faces? Gone. Your &quot;semantic search&quot; just became a visual-only search that ignores most of the signal.</p><p>This is why teams building on video, audio, images, and documents keep hitting the same wall. They get 70% recall and plateau. The missing 30% is cross-modal context that a single embedding cannot represent.</p><h2 id="how-retrieval-actually-needs-to-work">How Retrieval Actually Needs to Work</h2><p>The fix is not a better embedding model. It is extracting multiple features per document and searching each one independently, then combining the results.</p><p>Think of it like a SQL query that JOINs across multiple indices. You would never store a customer&apos;s name, purchase history, and support tickets in a single column and expect one query to cover everything. Retrieval over rich media works the same way.</p>
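<p>As a rough sketch, here is what &quot;search each feature independently, then combine&quot; might look like expressed as retriever stages, using the stage shapes covered later in this post. The outer <code>stages</code> array and the <code>name</code> field on each search stage are illustrative assumptions, not the exact Mixpeek schema:</p><pre><code class="language-json">{
  &quot;stages&quot;: [
    {
      &quot;stage_type&quot;: &quot;search&quot;,
      &quot;name&quot;: &quot;visual_search&quot;,
      &quot;model_id&quot;: &quot;openai/clip-vit-large-patch14&quot;,
      &quot;query&quot;: { &quot;type&quot;: &quot;text&quot;, &quot;value&quot;: &quot;product demonstration&quot; },
      &quot;limit&quot;: 100
    },
    {
      &quot;stage_type&quot;: &quot;search&quot;,
      &quot;name&quot;: &quot;transcript_search&quot;,
      &quot;model_id&quot;: &quot;BAAI/bge-large-en-v1.5&quot;,
      &quot;query&quot;: { &quot;type&quot;: &quot;text&quot;, &quot;value&quot;: &quot;free shipping&quot; },
      &quot;limit&quot;: 100
    },
    {
      &quot;stage_type&quot;: &quot;merge&quot;,
      &quot;strategy&quot;: &quot;rrf&quot;,
      &quot;sources&quot;: [&quot;visual_search&quot;, &quot;transcript_search&quot;]
    }
  ]
}</code></pre><p>Two searches run against different feature indices, and a merge stage fuses their rankings. Everything that follows builds on this shape.</p>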
<!--kg-card-begin: html-->
<svg viewbox="0 0 840 460" xmlns="http://www.w3.org/2000/svg" style="width:100%;max-width:840px;margin:2em auto;display:block;font-family:system-ui,-apple-system,sans-serif;">
  <defs>
    <marker id="a1" viewbox="0 0 10 10" refx="9" refy="5" markerwidth="6" markerheight="6" orient="auto"><path d="M0 0L10 5L0 10z" fill="#475569"/></marker>
    <marker id="a2" viewbox="0 0 10 10" refx="9" refy="5" markerwidth="6" markerheight="6" orient="auto"><path d="M0 0L10 5L0 10z" fill="#fc5185"/></marker>
  </defs>
  <!-- Background -->
  <rect width="840" height="460" rx="16" fill="#0f172a"/>
  <!-- Title -->
  <text x="420" y="36" text-anchor="middle" font-size="11" font-weight="600" fill="#64748b" letter-spacing="0.12em">RETRIEVAL: SINGLE VECTOR vs MULTI-STAGE</text>
  <!-- Left Column -->
  <rect x="30" y="56" width="375" height="385" rx="12" fill="#1e293b"/>
  <text x="218" y="86" text-anchor="middle" font-size="13" font-weight="700" fill="#94a3b8">SINGLE VECTOR</text>
  <text x="218" y="104" text-anchor="middle" font-size="10" fill="#475569">1 tool &#xB7; 1 embedding &#xB7; no cross-modal</text>
  <!-- Left flow -->
  <rect x="138" y="124" width="160" height="38" rx="6" fill="#334155"/>
  <text x="218" y="148" text-anchor="middle" font-size="12" fill="#e2e8f0">Query</text>
  <line x1="218" y1="162" x2="218" y2="182" stroke="#475569" stroke-width="1.5" marker-end="url(#a1)"/>
  <rect x="138" y="184" width="160" height="38" rx="6" fill="#334155"/>
  <text x="218" y="208" text-anchor="middle" font-size="12" fill="#e2e8f0">Embed (CLIP)</text>
  <line x1="218" y1="222" x2="218" y2="242" stroke="#475569" stroke-width="1.5" marker-end="url(#a1)"/>
  <rect x="138" y="244" width="160" height="38" rx="6" fill="#334155"/>
  <text x="218" y="268" text-anchor="middle" font-size="12" fill="#e2e8f0">Vector DB &#xB7; top-k</text>
  <line x1="218" y1="282" x2="218" y2="302" stroke="#475569" stroke-width="1.5" marker-end="url(#a1)"/>
  <rect x="138" y="304" width="160" height="38" rx="6" fill="#334155"/>
  <text x="218" y="328" text-anchor="middle" font-size="12" fill="#e2e8f0">Results</text>
  <!-- Left result badge -->
  <rect x="118" y="362" width="200" height="56" rx="8" fill="#2a1215" stroke="#7f1d1d" stroke-width="1"/>
  <text x="218" y="385" text-anchor="middle" font-size="11" font-weight="600" fill="#fca5a5">Partial recall</text>
  <text x="218" y="403" text-anchor="middle" font-size="10" fill="#64748b">Misses speech, OCR, faces, audio tone</text>
  <!-- Right Column -->
  <rect x="435" y="56" width="375" height="385" rx="12" fill="#1e293b" stroke="#fc518533" stroke-width="1"/>
  <text x="623" y="86" text-anchor="middle" font-size="13" font-weight="700" fill="#fc5185">MULTI-STAGE RETRIEVAL</text>
  <text x="623" y="104" text-anchor="middle" font-size="10" fill="#475569">N extractors &#xB7; parallel search &#xB7; merge + rerank</text>
  <!-- Right flow: Query -->
  <rect x="543" y="124" width="160" height="38" rx="6" fill="#334155"/>
  <text x="623" y="148" text-anchor="middle" font-size="12" fill="#e2e8f0">Query</text>
  <line x1="623" y1="162" x2="623" y2="182" stroke="#fc5185" stroke-width="1.5" marker-end="url(#a2)"/>
  <!-- Right flow: Parallel extractors -->
  <rect x="455" y="184" width="70" height="32" rx="5" fill="#334155" stroke="#fc518555" stroke-width="1"/>
  <text x="490" y="205" text-anchor="middle" font-size="10" fill="#e2e8f0">CLIP</text>
  <rect x="531" y="184" width="70" height="32" rx="5" fill="#334155" stroke="#fc518555" stroke-width="1"/>
  <text x="566" y="205" text-anchor="middle" font-size="10" fill="#e2e8f0">Whisper</text>
  <rect x="607" y="184" width="70" height="32" rx="5" fill="#334155" stroke="#fc518555" stroke-width="1"/>
  <text x="642" y="205" text-anchor="middle" font-size="10" fill="#e2e8f0">OCR</text>
  <rect x="683" y="184" width="70" height="32" rx="5" fill="#334155" stroke="#fc518555" stroke-width="1"/>
  <text x="718" y="205" text-anchor="middle" font-size="10" fill="#e2e8f0">Face</text>
  <line x1="623" y1="216" x2="623" y2="240" stroke="#fc5185" stroke-width="1.5" marker-end="url(#a2)"/>
  <!-- Merge -->
  <rect x="523" y="242" width="200" height="38" rx="6" fill="#fc5185"/>
  <text x="623" y="266" text-anchor="middle" font-size="12" font-weight="600" fill="#fff">Merge &#xB7; RRF</text>
  <line x1="623" y1="280" x2="623" y2="300" stroke="#fc5185" stroke-width="1.5" marker-end="url(#a2)"/>
  <!-- Rerank -->
  <rect x="543" y="302" width="160" height="38" rx="6" fill="#334155" stroke="#fc518555" stroke-width="1"/>
  <text x="623" y="326" text-anchor="middle" font-size="12" fill="#e2e8f0">Rerank &#xB7; weighted</text>
  <!-- Right result badge -->
  <rect x="523" y="362" width="200" height="56" rx="8" fill="#052e16" stroke="#14532d" stroke-width="1"/>
  <text x="623" y="385" text-anchor="middle" font-size="11" font-weight="600" fill="#86efac">Complete recall</text>
  <text x="623" y="403" text-anchor="middle" font-size="10" fill="#64748b">All modalities covered in one query</text>
</svg>
<!--kg-card-end: html-->
<p>The left side is what most teams build: one model, one embedding, one search. The right side is what production systems need: <a href="https://mixpeek.com/docs/processing/feature-extractors?ref=blog.mixpeek.com">multiple extractors</a> generating independent feature vectors, independent searches across each, and a merge step that produces one unified ranking.</p><h2 id="the-feature-extraction-layer">The Feature Extraction Layer</h2><p>Before you can search across multiple signals, you need to extract them. This is where most teams get stuck. Running five models per document sounds expensive. It does not have to be.</p><p>The key insight is that extraction happens at ingest time, not query time. You pay the compute cost once, then every subsequent search is just a vector lookup. The question is which features to extract.</p>
<!--kg-card-begin: html-->
<table>
<thead>
<tr><th>Signal</th><th>Extractor</th><th>What It Captures</th><th>Query Example</th></tr>
</thead>
<tbody>
<tr><td>Visual semantics</td><td><a href="https://mixpeek.com/model/openai/clip-vit-large-patch14?ref=blog.mixpeek.com">CLIP</a></td><td>Scene content, objects, style</td><td>&quot;sunset beach product shot&quot;</td></tr>
<tr><td>Spoken words</td><td><a href="https://mixpeek.com/model/openai/whisper-large-v3?ref=blog.mixpeek.com">Whisper</a></td><td>Dialogue, narration, speech</td><td>&quot;mentions free shipping&quot;</td></tr>
<tr><td>On-screen text</td><td><a href="https://mixpeek.com/model/PaddlePaddle/paddleocr?ref=blog.mixpeek.com">PaddleOCR</a></td><td>Titles, captions, URLs, prices</td><td>&quot;contains promo code SAVE20&quot;</td></tr>
<tr><td>Faces</td><td><a href="https://mixpeek.com/model/deepinsight/retinaface-r50?ref=blog.mixpeek.com">RetinaFace</a></td><td>Identity, count, position</td><td>&quot;video with CEO appearance&quot;</td></tr>
<tr><td>Objects</td><td><a href="https://mixpeek.com/model/ultralytics/yolov8n?ref=blog.mixpeek.com">YOLO</a></td><td>Specific items, products, logos</td><td>&quot;red Nike shoes&quot;</td></tr>
<tr><td>Audio tone</td><td><a href="https://mixpeek.com/model/laion/clap-htsat-fused?ref=blog.mixpeek.com">CLAP</a></td><td>Music genre, mood, effects</td><td>&quot;upbeat background music&quot;</td></tr>
<tr><td>Text meaning</td><td><a href="https://mixpeek.com/model/BAAI/bge-large-en-v1.5?ref=blog.mixpeek.com">BGE</a></td><td>Semantic content of transcripts</td><td>&quot;discusses return policy&quot;</td></tr>
</tbody>
</table>
<!--kg-card-end: html-->
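<p>To make the table concrete, here is a sketch of how several of these extractors might be wired to a collection at ingest. The model IDs are the ones linked above; the <code>source_bucket</code> and <code>feature_extractors</code> field names are illustrative assumptions rather than the exact collection schema:</p><pre><code class="language-json">{
  &quot;collection&quot;: &quot;product-videos&quot;,
  &quot;source_bucket&quot;: &quot;my-bucket&quot;,
  &quot;feature_extractors&quot;: [
    { &quot;model_id&quot;: &quot;openai/clip-vit-large-patch14&quot; },
    { &quot;model_id&quot;: &quot;openai/whisper-large-v3&quot; },
    { &quot;model_id&quot;: &quot;PaddlePaddle/paddleocr&quot; },
    { &quot;model_id&quot;: &quot;deepinsight/retinaface-r50&quot; },
    { &quot;model_id&quot;: &quot;laion/clap-htsat-fused&quot; }
  ]
}</code></pre>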
<p>Each extractor runs independently during <a href="https://mixpeek.com/docs/ingestion/sources?ref=blog.mixpeek.com">ingestion</a>. A single video upload produces 5-7 feature vectors, each queryable on its own. The extraction cost amortizes across every future search.</p><h2 id="multi-stage-retrieval-the-architecture">Multi-Stage Retrieval: The Architecture</h2><p>Once features are extracted and stored, retrieval becomes a pipeline of stages. Each stage narrows, expands, or reranks the result set. This is what Mixpeek calls a <a href="https://mixpeek.com/docs/retrieval/retrievers?ref=blog.mixpeek.com">retriever</a>.</p>
<!--kg-card-begin: html-->
<svg viewbox="0 0 840 320" xmlns="http://www.w3.org/2000/svg" style="width:100%;max-width:840px;margin:2em auto;display:block;font-family:system-ui,-apple-system,sans-serif;">
  <defs>
    <marker id="b1" viewbox="0 0 10 10" refx="9" refy="5" markerwidth="6" markerheight="6" orient="auto"><path d="M0 0L10 5L0 10z" fill="#fc5185"/></marker>
  </defs>
  <!-- Background -->
  <rect width="840" height="320" rx="16" fill="#0f172a"/>
  <text x="420" y="32" text-anchor="middle" font-size="11" font-weight="600" fill="#64748b" letter-spacing="0.12em">ANATOMY OF A RETRIEVER PIPELINE</text>
  <!-- Query -->
  <rect x="20" y="125" width="100" height="70" rx="8" fill="#1e293b" stroke="#334155" stroke-width="1"/>
  <text x="70" y="150" text-anchor="middle" font-size="10" font-weight="600" fill="#e2e8f0">Query</text>
  <text x="70" y="166" text-anchor="middle" font-size="9" fill="#475569">&quot;product demo</text>
  <text x="70" y="178" text-anchor="middle" font-size="9" fill="#475569">with pricing&quot;</text>
  <line x1="120" y1="160" x2="148" y2="160" stroke="#fc5185" stroke-width="1.5" marker-end="url(#b1)"/>
  <!-- Parallel Search -->
  <rect x="150" y="50" width="175" height="220" rx="10" fill="#1e293b" stroke="#334155" stroke-width="1"/>
  <text x="238" y="76" text-anchor="middle" font-size="10" font-weight="700" fill="#fc5185" letter-spacing="0.08em">PARALLEL SEARCH</text>
  <rect x="168" y="92" width="140" height="30" rx="5" fill="#334155"/>
  <text x="238" y="112" text-anchor="middle" font-size="10" fill="#e2e8f0">CLIP &#xB7; visual</text>
  <rect x="168" y="130" width="140" height="30" rx="5" fill="#334155"/>
  <text x="238" y="150" text-anchor="middle" font-size="10" fill="#e2e8f0">Whisper &#xB7; speech</text>
  <rect x="168" y="168" width="140" height="30" rx="5" fill="#334155"/>
  <text x="238" y="188" text-anchor="middle" font-size="10" fill="#e2e8f0">OCR &#xB7; on-screen text</text>
  <rect x="168" y="206" width="140" height="30" rx="5" fill="#334155"/>
  <text x="238" y="226" text-anchor="middle" font-size="10" fill="#e2e8f0">Face &#xB7; identity</text>
  <line x1="325" y1="160" x2="368" y2="160" stroke="#fc5185" stroke-width="1.5" marker-end="url(#b1)"/>
  <!-- Merge -->
  <rect x="370" y="120" width="130" height="80" rx="8" fill="#fc5185"/>
  <text x="435" y="152" text-anchor="middle" font-size="12" font-weight="700" fill="#fff">Merge</text>
  <text x="435" y="170" text-anchor="middle" font-size="10" fill="#ffffffcc">Reciprocal Rank Fusion</text>
  <line x1="500" y1="160" x2="538" y2="160" stroke="#fc5185" stroke-width="1.5" marker-end="url(#b1)"/>
  <!-- Filter -->
  <rect x="540" y="130" width="100" height="60" rx="8" fill="#1e293b" stroke="#334155" stroke-width="1"/>
  <text x="590" y="155" text-anchor="middle" font-size="11" font-weight="600" fill="#e2e8f0">Filter</text>
  <text x="590" y="172" text-anchor="middle" font-size="9" fill="#475569">duration &#x2265; 15s</text>
  <line x1="640" y1="160" x2="668" y2="160" stroke="#fc5185" stroke-width="1.5" marker-end="url(#b1)"/>
  <!-- Rerank -->
  <rect x="670" y="130" width="80" height="60" rx="8" fill="#1e293b" stroke="#334155" stroke-width="1"/>
  <text x="710" y="155" text-anchor="middle" font-size="11" font-weight="600" fill="#e2e8f0">Rerank</text>
  <text x="710" y="172" text-anchor="middle" font-size="9" fill="#475569">weighted</text>
  <line x1="750" y1="160" x2="773" y2="160" stroke="#fc5185" stroke-width="1.5" marker-end="url(#b1)"/>
  <!-- Results -->
  <rect x="775" y="130" width="48" height="60" rx="8" fill="#052e16" stroke="#14532d" stroke-width="1"/>
  <text x="799" y="155" text-anchor="middle" font-size="10" font-weight="600" fill="#86efac">Top</text>
  <text x="799" y="170" text-anchor="middle" font-size="10" font-weight="600" fill="#86efac">10</text>
  <!-- Stage labels -->
  <text x="238" y="296" text-anchor="middle" font-size="9" fill="#475569">01 SEARCH</text>
  <text x="435" y="296" text-anchor="middle" font-size="9" fill="#475569">02 MERGE</text>
  <text x="590" y="296" text-anchor="middle" font-size="9" fill="#475569">03 FILTER</text>
  <text x="710" y="296" text-anchor="middle" font-size="9" fill="#475569">04 RERANK</text>
</svg>
<!--kg-card-end: html-->
<p>A retriever is a sequence of stages that execute in order. Each stage takes the output of the previous stage and transforms it. The stages compose like Unix pipes: each one does one thing, and chaining them produces complex behavior from simple parts.</p><h3 id="stage-types">Stage Types</h3><p><strong>Search stages</strong> query a specific feature index and return candidates:</p><pre><code class="language-json">{
  &quot;stage_type&quot;: &quot;search&quot;,
  &quot;model_id&quot;: &quot;openai/clip-vit-large-patch14&quot;,
  &quot;query&quot;: { &quot;type&quot;: &quot;text&quot;, &quot;value&quot;: &quot;product demonstration&quot; },
  &quot;limit&quot;: 100
}</code></pre><p><strong>Filter stages</strong> remove results that do not meet criteria:</p><pre><code class="language-json">{
  &quot;stage_type&quot;: &quot;filter&quot;,
  &quot;field&quot;: &quot;metadata.duration_seconds&quot;,
  &quot;operator&quot;: &quot;gte&quot;,
  &quot;value&quot;: 15
}</code></pre><p><strong>Merge stages</strong> combine results from parallel searches using <a href="https://mixpeek.com/blog/keyword-vs-semantic-vs-hybrid-search?ref=blog.mixpeek.com">reciprocal rank fusion</a>:</p><pre><code class="language-json">{
  &quot;stage_type&quot;: &quot;merge&quot;,
  &quot;strategy&quot;: &quot;rrf&quot;,
  &quot;sources&quot;: [&quot;visual_search&quot;, &quot;transcript_search&quot;]
}</code></pre><p><strong>Rerank stages</strong> re-score the merged results using a cross-encoder or business logic:</p><pre><code class="language-json">{
  &quot;stage_type&quot;: &quot;rerank&quot;,
  &quot;method&quot;: &quot;weighted&quot;,
  &quot;weights&quot;: { &quot;visual&quot;: 0.4, &quot;transcript&quot;: 0.35, &quot;ocr&quot;: 0.25 }
}</code></pre><p>The power is in composition. A retriever for &quot;find product videos mentioning free shipping with our CEO&quot; chains: CLIP search for product content + Whisper transcript search for &quot;free shipping&quot; + face search against an <a href="https://mixpeek.com/docs/enrichment/taxonomies?ref=blog.mixpeek.com">enrolled reference collection</a>, merge with RRF, filter by duration, rerank by recency.</p><h2 id="why-this-beats-single-vector-search">Why This Beats Single-Vector Search</h2><p>The difference is not theoretical. Here are the failure modes that multi-stage retrieval eliminates:</p>
<!--kg-card-begin: html-->
<table>
<thead>
<tr><th>Scenario</th><th>Single Vector</th><th>Multi-Stage</th></tr>
</thead>
<tbody>
<tr><td>&quot;Find videos where someone says &apos;quarterly earnings&apos;&quot;</td><td>Searches visual embeddings. Returns videos that <em>look</em> like earnings calls. Misses podcast-style recordings.</td><td>Searches <a href="https://mixpeek.com/converters/audio-to-text?ref=blog.mixpeek.com">transcript embeddings</a>. Finds exact phrase regardless of visual content.</td></tr>
<tr><td>&quot;Product videos with on-screen pricing&quot;</td><td>Returns product videos. Cannot distinguish which ones show prices.</td><td><a href="https://mixpeek.com/converters/video-to-text?ref=blog.mixpeek.com">OCR search</a> finds &quot;$&quot; patterns. Intersects with product video filter.</td></tr>
<tr><td>&quot;Clips featuring our brand ambassador&quot;</td><td>Returns visually similar people. High false positive rate.</td><td><a href="https://mixpeek.com/converters/video-to-faces?ref=blog.mixpeek.com">Face search</a> against enrolled face collection. Exact identity match.</td></tr>
<tr><td>&quot;Upbeat content suitable for social media&quot;</td><td>Cannot assess audio tone from visual embedding.</td><td><a href="https://mixpeek.com/converters/audio-to-embeddings?ref=blog.mixpeek.com">Audio embedding</a> search for &quot;upbeat&quot; + duration filter &lt; 60s.</td></tr>
</tbody>
</table>
<!--kg-card-end: html-->
<p>Each row is a real query pattern from production deployments. In every case, the multi-stage approach finds results that single-vector search misses entirely, not because the embedding model is bad, but because it is being asked to encode information it was never trained to capture.</p><h2 id="the-enrichment-layer-taxonomies-and-clusters">The Enrichment Layer: Taxonomies and Clusters</h2><p>Retrieval is half the story. The other half is enrichment: attaching structured metadata to documents so downstream systems can filter, sort, and categorize without running inference at query time.</p>
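<p>Seen from a single document, the goal is a record that carries both its extracted features and its enrichment labels. A sketch, with illustrative field names and label values that echo the diagram below:</p><pre><code class="language-json">{
  &quot;document_id&quot;: &quot;doc_123&quot;,
  &quot;source&quot;: &quot;product-demo.mp4&quot;,
  &quot;features&quot;: [&quot;clip&quot;, &quot;whisper&quot;, &quot;ocr&quot;, &quot;face&quot;],
  &quot;enrichment&quot;: {
    &quot;brand&quot;: &quot;Nike&quot;,
    &quot;category&quot;: &quot;sports&quot;,
    &quot;person&quot;: &quot;CEO&quot;,
    &quot;confidence&quot;: 0.94
  }
}</code></pre><p>Downstream queries can then filter on a field like <code>enrichment.brand</code> the same way they would filter on any other metadata, with no inference at query time.</p>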
<!--kg-card-begin: html-->
<svg viewbox="0 0 840 380" xmlns="http://www.w3.org/2000/svg" style="width:100%;max-width:840px;margin:2em auto;display:block;font-family:system-ui,-apple-system,sans-serif;">
  <defs>
    <marker id="c1" viewbox="0 0 10 10" refx="9" refy="5" markerwidth="6" markerheight="6" orient="auto"><path d="M0 0L10 5L0 10z" fill="#fc5185"/></marker>
    <marker id="c2" viewbox="0 0 10 10" refx="9" refy="5" markerwidth="6" markerheight="6" orient="auto"><path d="M0 0L10 5L0 10z" fill="#a78bfa"/></marker>
  </defs>
  <!-- Background -->
  <rect width="840" height="380" rx="16" fill="#0f172a"/>
  <text x="420" y="32" text-anchor="middle" font-size="11" font-weight="600" fill="#64748b" letter-spacing="0.12em">THE ENRICHMENT LAYER</text>
  <!-- Documents In -->
  <rect x="20" y="155" width="110" height="70" rx="8" fill="#1e293b" stroke="#334155" stroke-width="1"/>
  <text x="75" y="182" text-anchor="middle" font-size="11" font-weight="600" fill="#e2e8f0">Documents</text>
  <text x="75" y="200" text-anchor="middle" font-size="9" fill="#475569">video &#xB7; image &#xB7; audio</text>
  <line x1="130" y1="190" x2="165" y2="190" stroke="#fc5185" stroke-width="1.5" marker-end="url(#c1)"/>
  <!-- Feature Extraction -->
  <rect x="167" y="155" width="130" height="70" rx="8" fill="#1e293b" stroke="#334155" stroke-width="1"/>
  <text x="232" y="182" text-anchor="middle" font-size="11" font-weight="600" fill="#e2e8f0">Feature</text>
  <text x="232" y="200" text-anchor="middle" font-size="11" font-weight="600" fill="#e2e8f0">Extraction</text>
  <!-- Fork lines -->
  <line x1="297" y1="175" x2="368" y2="105" stroke="#fc5185" stroke-width="1.5" marker-end="url(#c1)"/>
  <line x1="297" y1="205" x2="368" y2="280" stroke="#a78bfa" stroke-width="1.5" marker-end="url(#c2)"/>
  <!-- Taxonomy Branch -->
  <rect x="370" y="50" width="220" height="120" rx="10" fill="#1e293b" stroke="#fc518544" stroke-width="1"/>
  <text x="480" y="76" text-anchor="middle" font-size="11" font-weight="700" fill="#fc5185">TAXONOMIES</text>
  <text x="480" y="94" text-anchor="middle" font-size="10" fill="#475569">Top-down &#xB7; match against known references</text>
  <rect x="390" y="108" width="80" height="26" rx="4" fill="#334155"/>
  <text x="430" y="125" text-anchor="middle" font-size="9" fill="#e2e8f0">Brands</text>
  <rect x="478" y="108" width="80" height="26" rx="4" fill="#334155"/>
  <text x="518" y="125" text-anchor="middle" font-size="9" fill="#e2e8f0">Products</text>
  <rect x="390" y="140" width="80" height="26" rx="4" fill="#334155"/>
  <text x="430" y="157" text-anchor="middle" font-size="9" fill="#e2e8f0">People</text>
  <rect x="478" y="140" width="80" height="26" rx="4" fill="#334155"/>
  <text x="518" y="157" text-anchor="middle" font-size="9" fill="#e2e8f0">Categories</text>
  <line x1="590" y1="110" x2="638" y2="110" stroke="#fc5185" stroke-width="1.5" marker-end="url(#c1)"/>
  <!-- Taxonomy Output -->
  <rect x="640" y="68" width="180" height="84" rx="8" fill="#1e293b" stroke="#334155" stroke-width="1"/>
  <text x="730" y="92" text-anchor="middle" font-size="10" font-weight="600" fill="#fc5185">Structured Labels</text>
  <text x="730" y="112" text-anchor="middle" font-size="10" fill="#94a3b8" font-family="monospace">brand: &quot;Nike&quot;</text>
  <text x="730" y="130" text-anchor="middle" font-size="10" fill="#94a3b8" font-family="monospace">category: &quot;sports&quot;</text>
  <text x="730" y="148" text-anchor="middle" font-size="10" fill="#94a3b8" font-family="monospace">person: &quot;CEO&quot;</text>
  <!-- Cluster Branch -->
  <rect x="370" y="210" width="220" height="120" rx="10" fill="#1e293b" stroke="#a78bfa44" stroke-width="1"/>
  <text x="480" y="236" text-anchor="middle" font-size="11" font-weight="700" fill="#a78bfa">CLUSTERS</text>
  <text x="480" y="254" text-anchor="middle" font-size="10" fill="#475569">Bottom-up &#xB7; discover emergent patterns</text>
  <!-- Cluster dots -->
  <circle cx="415" cy="292" r="10" fill="#334155" stroke="#a78bfa66" stroke-width="1"/>
  <circle cx="432" cy="283" r="7" fill="#334155" stroke="#a78bfa66" stroke-width="1"/>
  <circle cx="422" cy="308" r="5" fill="#334155" stroke="#a78bfa66" stroke-width="1"/>
  <circle cx="500" cy="290" r="12" fill="#334155" stroke="#a78bfa66" stroke-width="1"/>
  <circle cx="518" cy="282" r="6" fill="#334155" stroke="#a78bfa66" stroke-width="1"/>
  <circle cx="512" cy="310" r="8" fill="#334155" stroke="#a78bfa66" stroke-width="1"/>
  <circle cx="488" cy="312" r="4" fill="#334155" stroke="#a78bfa66" stroke-width="1"/>
  <line x1="590" y1="270" x2="638" y2="270" stroke="#a78bfa" stroke-width="1.5" marker-end="url(#c2)"/>
  <!-- Cluster Output -->
  <rect x="640" y="228" width="180" height="84" rx="8" fill="#1e293b" stroke="#334155" stroke-width="1"/>
  <text x="730" y="252" text-anchor="middle" font-size="10" font-weight="600" fill="#a78bfa">Emergent Patterns</text>
  <text x="730" y="272" text-anchor="middle" font-size="10" fill="#94a3b8">&quot;40% share visual style&quot;</text>
  <text x="730" y="290" text-anchor="middle" font-size="10" fill="#94a3b8">&quot;trending: outdoor shots&quot;</text>
  <text x="730" y="308" text-anchor="middle" font-size="10" fill="#94a3b8">&quot;3 tone clusters found&quot;</text>
</svg>
<!--kg-card-end: html-->
<p><a href="https://mixpeek.com/docs/enrichment/taxonomies?ref=blog.mixpeek.com">Taxonomies</a> work like semantic JOINs. You define a reference collection (brand logos, product SKUs, content categories) and match incoming documents against it using embedding similarity. A video containing a Nike swoosh gets enriched with <code>brand: Nike</code>, <code>brand_id: nike_001</code>, not because a rule detected the text &quot;Nike&quot; but because the visual embedding matched the reference collection.</p><p><a href="https://mixpeek.com/docs/enrichment/clusters?ref=blog.mixpeek.com">Clusters</a> work bottom-up. Instead of matching against known categories, clustering groups similar documents and surfaces emergent patterns. You might discover that 40% of your video library shares a visual style you never explicitly categorized.</p><p>Together, taxonomies and clusters replace the manual tagging workflows that cost media companies <a href="https://mixpeek.com/blog/video-intelligence-raw-footage-to-searchable-data?ref=blog.mixpeek.com">$15-25 per asset</a>.</p><h2 id="the-decision-tree-which-architecture-when">The Decision Tree: Which Architecture When</h2><p>Not every use case needs the full multi-stage pipeline. Here is how to decide:</p>
<!--kg-card-begin: html-->
<table>
<thead>
<tr><th>Your content</th><th>Your queries</th><th>Start with</th><th>Graduate to</th></tr>
</thead>
<tbody>
<tr><td>Text documents only</td><td>Semantic questions</td><td>Single embedding + vector search</td><td>Add <a href="https://mixpeek.com/blog/keyword-vs-semantic-vs-hybrid-search?ref=blog.mixpeek.com">BM25 hybrid search</a> for keyword recall</td></tr>
<tr><td>Images with metadata</td><td>Visual similarity</td><td>CLIP embeddings</td><td>Add <a href="https://mixpeek.com/docs/enrichment/taxonomies?ref=blog.mixpeek.com">taxonomy enrichment</a> for structured filters</td></tr>
<tr><td>Video (&lt; 1K assets)</td><td>Basic search</td><td><a href="https://mixpeek.com/converters/video-to-description?ref=blog.mixpeek.com">Scene descriptions</a></td><td>Add transcript + OCR for cross-modal coverage</td></tr>
<tr><td>Video (10K+ assets)</td><td>Complex, multi-signal</td><td><a href="https://mixpeek.com/docs/retrieval/retrievers?ref=blog.mixpeek.com">Multi-stage retriever</a> from day one</td><td>Add <a href="https://mixpeek.com/docs/enrichment/clusters?ref=blog.mixpeek.com">clusters</a> to discover content patterns</td></tr>
<tr><td>Mixed media library</td><td>Agent-driven queries</td><td><a href="https://mixpeek.com/connectors/mcp-server?ref=blog.mixpeek.com">MCP integration</a></td><td>Full pipeline: extract, enrich, retrieve, rerank</td></tr>
</tbody>
</table>
<!--kg-card-end: html-->
<p>The pattern is consistent: start with the simplest pipeline that covers your primary query pattern, then add stages as you discover what the first pipeline misses.</p><h2 id="what-this-looks-like-in-practice">What This Looks Like in Practice</h2><p>A complete pipeline from upload to searchable, enriched content:</p><pre><code class="language-bash"># 1. Upload to a bucket
curl -X POST &quot;$MP_API_URL/v1/buckets/my-bucket/upload&quot; \
  -H &quot;Authorization: Bearer $MP_API_KEY&quot; \
  -F &quot;file=@product-demo.mp4&quot;

# 2. Collection triggers extraction automatically:
#    - CLIP embeddings from video frames
#    - Whisper transcription from audio
#    - OCR from on-screen text
#    - Face detection and embedding
#    - Object detection via YOLO

# 3. Taxonomy enrichment runs post-extraction:
#    - Matches detected faces against employee collection
#    - Matches visual content against brand reference collection
#    - Classifies content into IAB categories

# 4. Search across all features at once
curl -X POST &quot;$MP_API_URL/v1/retrievers/my-retriever/search&quot; \
  -H &quot;Authorization: Bearer $MP_API_KEY&quot; \
  -d &apos;{
    &quot;query&quot;: {
      &quot;text&quot;: &quot;product demo with pricing shown on screen&quot;,
      &quot;modality&quot;: &quot;text&quot;
    },
    &quot;limit&quot;: 10
  }&apos;</code></pre><p>The retriever handles the multi-stage logic: parallel searches across visual, transcript, and OCR indices, RRF merge, taxonomy-based filtering, and relevance reranking. The caller sends one query and gets one ranked result list.</p><h2 id="the-real-shift">The Real Shift</h2><p>The argument is not that vector search is bad. Vector search is good at what it does: finding semantically similar content within a single modality. The problem is asking it to do everything.</p><p>A video is not a text document. An image with overlaid text is not just an image. A podcast episode is not just an audio waveform. Rich media has multiple signals, and each signal needs its own extraction, its own index, and its own search path.</p><p>The teams that figure this out stop asking &quot;which embedding model should we use?&quot; and start asking &quot;which features should we extract and how should we combine their search results?&quot; That is the shift from vector search to <a href="https://mixpeek.com/glossary/multimodal-retrieval?ref=blog.mixpeek.com">multimodal retrieval</a>.</p><hr><p>Start with one extractor. Add a second when your first query pattern hits a wall. Chain them with a retriever. That is the whole playbook.</p><p><em>Ready to build? </em><a href="https://mixpeek.com/build?ref=blog.mixpeek.com"><em>Start with the pipeline builder</em></a><em> or explore the </em><a href="https://mixpeek.com/docs/retrieval/retrievers?ref=blog.mixpeek.com"><em>retriever API reference</em></a><em>.</em></p>]]></content:encoded></item><item><title><![CDATA[Multimodal Taxonomies: How to Classify Video, Images, Audio, and Text with One Category System]]></title><description><![CDATA[Traditional taxonomies classify one content type at a time. Multimodal taxonomies unify classification across every format using embedding similarity the missing layer between raw AI features and structured, searchable metadata.]]></description><link>http://blog.mixpeek.com/multimodal-taxonomies/</link><guid isPermaLink="false">69ecb8a5c768534223476262</guid><category><![CDATA[Multimodal AI]]></category><category><![CDATA[Taxonomies]]></category><category><![CDATA[Content Classification]]></category><category><![CDATA[Data Enrichment]]></category><category><![CDATA[Tutorial]]></category><dc:creator><![CDATA[Ethan Steininger]]></dc:creator><pubDate>Sat, 25 Apr 2026 13:40:23 GMT</pubDate><media:content url="http://blog.mixpeek.com/content/images/2026/04/multimodal-taxonomies-feature-v3.png" medium="image"/><content:encoded><![CDATA[<img src="http://blog.mixpeek.com/content/images/2026/04/multimodal-taxonomies-feature-v3.png" alt="Multimodal Taxonomies: How to Classify Video, Images, Audio, and Text with One Category System"><p><strong>TL;DR:</strong> Traditional taxonomies classify one content type at a time. Text gets labels, photos get tags, video gets a separate system. Multimodal taxonomies unify classification across every format by matching content against reference collections using <a href="https://mixpeek.com/converters/multimodal-to-embeddings?ref=blog.mixpeek.com">embedding similarity</a>. They bridge raw AI features and structured, searchable metadata.</p><hr><h2 id="what-is-a-taxonomy">What Is a Taxonomy?</h2><p>A taxonomy is a classification system that organizes content into categories. Gmail sorting emails into Primary/Social/Promotions, Shopify categorizing products into Google&apos;s 5,500+ product taxonomy, YouTube classifying videos for ad targeting. 
All taxonomies.</p><p>In data infrastructure, taxonomies solve three problems: <strong>discovery</strong> (navigating categories instead of guessing search terms), <strong>governance</strong> (enforcing policies by content type), and <strong>enrichment</strong> (attaching structured metadata to unstructured content so downstream systems can <a href="https://mixpeek.com/docs/retrieval/retrievers?ref=blog.mixpeek.com">filter, sort, and search</a> it).</p><p>Traditional taxonomies are manual and single-modal. A human reviews an article and assigns &quot;Sports &gt; Basketball &gt; NBA.&quot; A separate system tags an image &quot;outdoor, basketball court.&quot; Another transcribes a video. Each modality gets its own pipeline, its own maintenance burden. That was fine when content was mostly text.</p><h2 id="why-single-modal-classification-breaks">Why Single-Modal Classification Breaks</h2><p><strong>Scale.</strong> YouTube receives 720,000 hours of video every day. TikTok ingests 34 million videos daily. That&apos;s nearly 400 per second. A trained analyst can classify ~10,000 documents per year. To manually classify one day of TikTok, you&apos;d need 3,400 analysts working full-time for a year.</p><p><strong>Context blindness.</strong> A meme with &quot;this is fire&quot; means different things depending on whether the image shows a concert or a burning building. An ICCV 2025 study quantified this: text-only models achieved F1 of 0.75&#x2013;0.81 on video moderation. Adding visual and audio signals pushed that to 0.84&#x2013;0.91. The missing 10&#x2013;15% is cross-modal context.</p><p><strong>Consistency drift.</strong> The <a href="https://mixpeek.com/blog/iab-contextual-classifier-multimodal-ai?ref=blog.mixpeek.com">IAB Content Taxonomy</a> has grown from ~400 categories in v2 to 1,500+ in v3, and even with that specificity, human reviewers routinely disagree on assignments.</p><h2 id="what-makes-a-taxonomy-multimodal">What Makes a Taxonomy &quot;Multimodal&quot;</h2><p>A multimodal taxonomy classifies content by understanding it across all modalities simultaneously, then matching against reference categories using embedding similarity rather than keyword rules.</p><p>The key difference: instead of writing rules (&quot;if text contains &apos;basketball&apos; AND image has an orange round object...&quot;), a multimodal taxonomy works like a <strong>semantic JOIN</strong>. You define categories with a reference collection of representative examples. New content is <a href="https://mixpeek.com/docs/retrieval/stages/feature-search?ref=blog.mixpeek.com">matched against those references</a> using vector similarity across all extracted features: visual, audio, and textual, all at once.</p>
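<p>Concretely, a single-level version of that join can be sketched as a configuration that names a reference collection, the model used for matching, and a similarity threshold. The field names here are illustrative assumptions, not the exact Mixpeek taxonomy schema:</p><pre><code class="language-json">{
  &quot;taxonomy&quot;: &quot;brand-detection&quot;,
  &quot;reference_collection&quot;: &quot;brand-logos&quot;,
  &quot;match&quot;: {
    &quot;stage_type&quot;: &quot;search&quot;,
    &quot;model_id&quot;: &quot;openai/clip-vit-large-patch14&quot;,
    &quot;limit&quot;: 1
  },
  &quot;min_similarity&quot;: 0.8,
  &quot;enrich_fields&quot;: [&quot;brand&quot;, &quot;brand_id&quot;]
}</code></pre><p>Every incoming document is searched against the reference collection; the best match above the threshold contributes its labels.</p>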
<!--kg-card-begin: html-->
<div style="margin: 40px 0;">
<svg xmlns="http://www.w3.org/2000/svg" viewbox="0 0 800 300" style="width:100%;max-width:800px;display:block;margin:0 auto;">
  <rect width="800" height="300" rx="12" fill="#faf9ff"/>
  <rect x="10" y="10" width="380" height="280" rx="8" fill="#fff5f5" stroke="#fecaca" stroke-width="1"/>
  <text x="200" y="38" font-family="-apple-system, BlinkMacSystemFont, sans-serif" font-size="14" fill="#991b1b" text-anchor="middle" font-weight="700">Traditional (Single-Modal)</text>
  <rect x="30" y="55" width="70" height="30" rx="6" fill="#fff" stroke="#f472b6" stroke-width="1.5"/>
  <text x="65" y="75" font-family="-apple-system, sans-serif" font-size="11" fill="#9d174d" text-anchor="middle" font-weight="600">Video</text>
  <line x1="100" y1="70" x2="140" y2="70" stroke="#d1d5db" stroke-width="1.5"/>
  <rect x="140" y="55" width="100" height="30" rx="6" fill="#fff" stroke="#d1d5db" stroke-width="1"/>
  <text x="190" y="75" font-family="-apple-system, sans-serif" font-size="10" fill="#6b7280" text-anchor="middle">Manual review</text>
  <line x1="240" y1="70" x2="280" y2="70" stroke="#d1d5db" stroke-width="1.5"/>
  <rect x="280" y="55" width="90" height="30" rx="6" fill="#fef2f2" stroke="#ef4444" stroke-width="1"/>
  <text x="325" y="75" font-family="-apple-system, sans-serif" font-size="10" fill="#991b1b" text-anchor="middle" font-weight="600">Label A</text>
  <rect x="30" y="100" width="70" height="30" rx="6" fill="#fff" stroke="#60a5fa" stroke-width="1.5"/>
  <text x="65" y="120" font-family="-apple-system, sans-serif" font-size="11" fill="#1e40af" text-anchor="middle" font-weight="600">Image</text>
  <line x1="100" y1="115" x2="140" y2="115" stroke="#d1d5db" stroke-width="1.5"/>
  <rect x="140" y="100" width="100" height="30" rx="6" fill="#fff" stroke="#d1d5db" stroke-width="1"/>
  <text x="190" y="120" font-family="-apple-system, sans-serif" font-size="10" fill="#6b7280" text-anchor="middle">Image tagger</text>
  <line x1="240" y1="115" x2="280" y2="115" stroke="#d1d5db" stroke-width="1.5"/>
  <rect x="280" y="100" width="90" height="30" rx="6" fill="#fef2f2" stroke="#ef4444" stroke-width="1"/>
  <text x="325" y="120" font-family="-apple-system, sans-serif" font-size="10" fill="#991b1b" text-anchor="middle" font-weight="600">Label B</text>
  <rect x="30" y="145" width="70" height="30" rx="6" fill="#fff" stroke="#fbbf24" stroke-width="1.5"/>
  <text x="65" y="165" font-family="-apple-system, sans-serif" font-size="11" fill="#92400e" text-anchor="middle" font-weight="600">Text</text>
  <line x1="100" y1="160" x2="140" y2="160" stroke="#d1d5db" stroke-width="1.5"/>
  <rect x="140" y="145" width="100" height="30" rx="6" fill="#fff" stroke="#d1d5db" stroke-width="1"/>
  <text x="190" y="165" font-family="-apple-system, sans-serif" font-size="10" fill="#6b7280" text-anchor="middle">Keyword rules</text>
  <line x1="240" y1="160" x2="280" y2="160" stroke="#d1d5db" stroke-width="1.5"/>
  <rect x="280" y="145" width="90" height="30" rx="6" fill="#fef2f2" stroke="#ef4444" stroke-width="1"/>
  <text x="325" y="165" font-family="-apple-system, sans-serif" font-size="10" fill="#991b1b" text-anchor="middle" font-weight="600">Label C</text>
  <text x="200" y="210" font-family="-apple-system, sans-serif" font-size="11" fill="#991b1b" text-anchor="middle" font-weight="600">3 pipelines. 3 labels. No cross-modal context.</text>
  <rect x="410" y="10" width="380" height="280" rx="8" fill="#f0fdf4" stroke="#bbf7d0" stroke-width="1"/>
  <text x="600" y="38" font-family="-apple-system, BlinkMacSystemFont, sans-serif" font-size="14" fill="#065f46" text-anchor="middle" font-weight="700">Multimodal Taxonomy</text>
  <rect x="430" y="58" width="60" height="26" rx="13" fill="#fdf2f8" stroke="#f472b6" stroke-width="1"/>
  <text x="460" y="76" font-family="-apple-system, sans-serif" font-size="10" fill="#9d174d" text-anchor="middle" font-weight="600">Video</text>
  <rect x="430" y="90" width="60" height="26" rx="13" fill="#eff6ff" stroke="#60a5fa" stroke-width="1"/>
  <text x="460" y="108" font-family="-apple-system, sans-serif" font-size="10" fill="#1e40af" text-anchor="middle" font-weight="600">Image</text>
  <rect x="430" y="122" width="60" height="26" rx="13" fill="#ecfdf5" stroke="#34d399" stroke-width="1"/>
  <text x="460" y="140" font-family="-apple-system, sans-serif" font-size="10" fill="#065f46" text-anchor="middle" font-weight="600">Audio</text>
  <rect x="430" y="154" width="60" height="26" rx="13" fill="#fffbeb" stroke="#fbbf24" stroke-width="1"/>
  <text x="460" y="172" font-family="-apple-system, sans-serif" font-size="10" fill="#92400e" text-anchor="middle" font-weight="600">Text</text>
  <line x1="490" y1="71" x2="530" y2="112" stroke="#d1d5db" stroke-width="1"/>
  <line x1="490" y1="103" x2="530" y2="112" stroke="#d1d5db" stroke-width="1"/>
  <line x1="490" y1="135" x2="530" y2="125" stroke="#d1d5db" stroke-width="1"/>
  <line x1="490" y1="167" x2="530" y2="130" stroke="#d1d5db" stroke-width="1"/>
  <rect x="530" y="92" width="100" height="52" rx="8" fill="#f5f3ff" stroke="#7c3aed" stroke-width="1.5"/>
  <text x="580" y="115" font-family="-apple-system, sans-serif" font-size="11" fill="#6d28d9" text-anchor="middle" font-weight="600">Feature</text>
  <text x="580" y="132" font-family="-apple-system, sans-serif" font-size="11" fill="#6d28d9" text-anchor="middle" font-weight="600">Extraction</text>
  <line x1="630" y1="118" x2="660" y2="118" stroke="#d1d5db" stroke-width="1.5"/>
  <rect x="660" y="85" width="110" height="66" rx="8" fill="#ecfdf5" stroke="#059669" stroke-width="1.5"/>
  <text x="715" y="108" font-family="-apple-system, sans-serif" font-size="11" fill="#065f46" text-anchor="middle" font-weight="700">Taxonomy</text>
  <text x="715" y="124" font-family="-apple-system, sans-serif" font-size="11" fill="#065f46" text-anchor="middle" font-weight="700">Similarity</text>
  <text x="715" y="140" font-family="-apple-system, sans-serif" font-size="11" fill="#065f46" text-anchor="middle" font-weight="700">JOIN</text>
  <line x1="715" y1="151" x2="715" y2="170" stroke="#d1d5db" stroke-width="1.5"/>
  <rect x="640" y="170" width="150" height="82" rx="8" fill="#fff" stroke="#059669" stroke-width="1"/>
  <text x="655" y="190" font-family="-apple-system, sans-serif" font-size="10" fill="#6b7280">Category</text>
  <text x="718" y="190" font-family="-apple-system, sans-serif" font-size="10" fill="#1e1b4b" font-weight="700">Sports</text>
  <text x="655" y="207" font-family="-apple-system, sans-serif" font-size="10" fill="#6b7280">Sub</text>
  <text x="718" y="207" font-family="-apple-system, sans-serif" font-size="10" fill="#1e1b4b" font-weight="700">NBA</text>
  <text x="655" y="224" font-family="-apple-system, sans-serif" font-size="10" fill="#6b7280">Brand</text>
  <text x="718" y="224" font-family="-apple-system, sans-serif" font-size="10" fill="#1e1b4b" font-weight="700">Nike</text>
  <text x="655" y="241" font-family="-apple-system, sans-serif" font-size="10" fill="#6b7280">Conf.</text>
  <text x="718" y="241" font-family="-apple-system, sans-serif" font-size="10" fill="#059669" font-weight="700">94%</text>
  <text x="600" y="280" font-family="-apple-system, sans-serif" font-size="11" fill="#065f46" text-anchor="middle" font-weight="600">1 pipeline. Unified label. Full context.</text>
</svg>
</div>
<!--kg-card-end: html-->
<h2 id="flat-vs-hierarchical">Flat vs. Hierarchical</h2><h3 id="flat-taxonomies">Flat Taxonomies</h3><p>Single-level reference collection. Every document is matched against the same categories, best match wins.</p><p><strong>Use cases:</strong> <a href="https://mixpeek.com/converters/video-to-faces?ref=blog.mixpeek.com">Face enrollment</a>, logo detection, product recognition, entity linking. Fast to set up. Start here if your categories don&apos;t have meaningful parent-child relationships.</p><h3 id="hierarchical-taxonomies">Hierarchical Taxonomies</h3><p>Categories organized into a tree where classification cascades from broad to specific. Each level narrows the search space using different features, executing like a <strong>Common Table Expression (CTE)</strong>. Each level builds on the previous.</p><p>A document classified as &quot;Nike &#x2192; Athletic &#x2192; Running&quot; inherits enrichment fields from all three levels. Different levels can use different <a href="https://mixpeek.com/docs/processing/feature-extractors?ref=blog.mixpeek.com">feature extractors</a>: logo embeddings for brand detection, scene classification for categories, activity recognition for subcategories.</p>
<!--kg-card-begin: html-->
<div style="margin: 40px 0;">
<svg xmlns="http://www.w3.org/2000/svg" viewbox="0 0 800 310" style="width:100%;max-width:800px;display:block;margin:0 auto;">
  <rect width="800" height="310" rx="12" fill="#faf9ff"/>
  <text x="400" y="28" font-family="-apple-system, sans-serif" font-size="14" fill="#6d28d9" text-anchor="middle" font-weight="700">Hierarchical Taxonomy &#x2013; CTE-style Execution</text>
  <rect x="280" y="42" width="240" height="46" rx="10" fill="#f5f3ff" stroke="#7c3aed" stroke-width="2"/>
  <rect x="280" y="42" width="6" height="46" rx="3" fill="#7c3aed"/>
  <text x="300" y="62" font-family="-apple-system, sans-serif" font-size="10" fill="#7c3aed" font-weight="700">L0</text>
  <text x="325" y="72" font-family="-apple-system, sans-serif" font-size="14" fill="#1e1b4b" font-weight="700">Brand Detection</text>
  <text x="463" y="72" font-family="-apple-system, sans-serif" font-size="10" fill="#6b7280">logo emb.</text>
  <line x1="360" y1="88" x2="250" y2="115" stroke="#c4b5fd" stroke-width="2"/>
  <line x1="440" y1="88" x2="550" y2="115" stroke="#c4b5fd" stroke-width="2"/>
  <circle cx="250" cy="115" r="3" fill="#7c3aed"/>
  <circle cx="550" cy="115" r="3" fill="#7c3aed"/>
  <rect x="130" y="118" width="240" height="42" rx="8" fill="#eff6ff" stroke="#2563eb" stroke-width="1.5"/>
  <rect x="130" y="118" width="5" height="42" rx="2.5" fill="#2563eb"/>
  <text x="150" y="136" font-family="-apple-system, sans-serif" font-size="10" fill="#2563eb" font-weight="700">L1</text>
  <text x="175" y="146" font-family="-apple-system, sans-serif" font-size="13" fill="#1e3a5f" font-weight="700">Nike</text>
  <rect x="200" y="128" width="60" height="18" rx="9" fill="#dbeafe"/>
  <text x="230" y="141" font-family="-apple-system, sans-serif" font-size="9" fill="#1e40af" text-anchor="middle" font-weight="600">+brand_id</text>
  <rect x="430" y="118" width="240" height="42" rx="8" fill="#eff6ff" stroke="#2563eb" stroke-width="1.5"/>
  <rect x="430" y="118" width="5" height="42" rx="2.5" fill="#2563eb"/>
  <text x="450" y="136" font-family="-apple-system, sans-serif" font-size="10" fill="#2563eb" font-weight="700">L1</text>
  <text x="475" y="146" font-family="-apple-system, sans-serif" font-size="13" fill="#1e3a5f" font-weight="700">Adidas</text>
  <line x1="210" y1="160" x2="160" y2="190" stroke="#93c5fd" stroke-width="1.5"/>
  <line x1="290" y1="160" x2="340" y2="190" stroke="#93c5fd" stroke-width="1.5"/>
  <circle cx="160" cy="190" r="3" fill="#2563eb"/>
  <circle cx="340" cy="190" r="3" fill="#2563eb"/>
  <rect x="60" y="193" width="200" height="40" rx="8" fill="#ecfdf5" stroke="#059669" stroke-width="1.5"/>
  <rect x="60" y="193" width="5" height="40" rx="2.5" fill="#059669"/>
  <text x="80" y="210" font-family="-apple-system, sans-serif" font-size="10" fill="#059669" font-weight="700">L2</text>
  <text x="105" y="220" font-family="-apple-system, sans-serif" font-size="12" fill="#064e3b" font-weight="700">Athletic</text>
  <rect x="155" y="201" width="70" height="18" rx="9" fill="#d1fae5"/>
  <text x="190" y="214" font-family="-apple-system, sans-serif" font-size="9" fill="#065f46" text-anchor="middle" font-weight="600">+category</text>
  <rect x="290" y="193" width="200" height="40" rx="8" fill="#ecfdf5" stroke="#059669" stroke-width="1.5"/>
  <rect x="290" y="193" width="5" height="40" rx="2.5" fill="#059669"/>
  <text x="310" y="210" font-family="-apple-system, sans-serif" font-size="10" fill="#059669" font-weight="700">L2</text>
  <text x="335" y="220" font-family="-apple-system, sans-serif" font-size="12" fill="#064e3b" font-weight="700">Lifestyle</text>
  <line x1="120" y1="233" x2="85" y2="258" stroke="#6ee7b7" stroke-width="1.5"/>
  <line x1="200" y1="233" x2="250" y2="258" stroke="#6ee7b7" stroke-width="1.5"/>
  <circle cx="85" cy="258" r="3" fill="#059669"/>
  <circle cx="250" cy="258" r="3" fill="#059669"/>
  <rect x="20" y="260" width="150" height="38" rx="8" fill="#fffbeb" stroke="#d97706" stroke-width="1.5"/>
  <rect x="20" y="260" width="5" height="38" rx="2.5" fill="#d97706"/>
  <text x="40" y="276" font-family="-apple-system, sans-serif" font-size="10" fill="#d97706" font-weight="700">L3</text>
  <text x="62" y="286" font-family="-apple-system, sans-serif" font-size="12" fill="#78350f" font-weight="700">Running</text>
  <rect x="115" y="268" width="40" height="18" rx="9" fill="#fef3c7"/>
  <text x="135" y="281" font-family="-apple-system, sans-serif" font-size="9" fill="#92400e" text-anchor="middle" font-weight="600">+SKU</text>
  <rect x="200" y="260" width="150" height="38" rx="8" fill="#fffbeb" stroke="#d97706" stroke-width="1.5"/>
  <rect x="200" y="260" width="5" height="38" rx="2.5" fill="#d97706"/>
  <text x="220" y="276" font-family="-apple-system, sans-serif" font-size="10" fill="#d97706" font-weight="700">L3</text>
  <text x="242" y="286" font-family="-apple-system, sans-serif" font-size="12" fill="#78350f" font-weight="700">Basketball</text>
  <rect x="530" y="225" width="250" height="70" rx="10" fill="#fff" stroke="#c4b5fd" stroke-width="1.5"/>
  <text x="545" y="246" font-family="-apple-system, sans-serif" font-size="11" fill="#6d28d9" font-weight="700">Inherited enrichment (Running):</text>
  <text x="545" y="264" font-family="-apple-system, sans-serif" font-size="11" fill="#4b5563">L0 brand_id + L1 brand + L2 category</text>
  <text x="545" y="282" font-family="-apple-system, sans-serif" font-size="12" fill="#059669" font-weight="700">Nike &#x2192; Athletic &#x2192; Running &#x2192; SKU</text>
  <path d="M170,285 Q350,310 530,270" fill="none" stroke="#c4b5fd" stroke-width="1.5" stroke-dasharray="5,4"/>
</svg>
</div>
<!--kg-card-end: html-->
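<p>To make the CTE analogy concrete, here is a minimal sketch of the cascade in plain Python. This is illustrative pseudocode rather than Mixpeek&apos;s implementation: <code>similarity</code> stands in for a retriever query against one level&apos;s reference collection, and the structure mirrors the diagram above.</p><pre><code class="language-python">def classify(document, hierarchy, similarity):
    # Cascade CTE-style: each level only considers children of the previous
    # winner, scores them with its own feature, and adds its enrichment fields.
    enrichment, parent_id = {}, None
    for level in hierarchy:  # e.g. brand -&gt; category -&gt; subcategory
        candidates = [n for n in level[&apos;nodes&apos;] if n[&apos;parent&apos;] == parent_id]
        winner = max(candidates,
                     key=lambda n: similarity(n, document[level[&apos;feature&apos;]]))
        enrichment.update(winner[&apos;fields&apos;])  # enrichment inherits downward
        parent_id = winner[&apos;id&apos;]
    return enrichment
</code></pre><p>In practice you never write this loop yourself; the hierarchy definition in the &quot;Building a Multimodal Taxonomy&quot; section below expresses the same cascade declaratively.</p>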
<p><strong>Use cases:</strong> Media content classification, <a href="https://mixpeek.com/ecommerce-search?ref=blog.mixpeek.com">product categorization</a>, organizational hierarchies, content moderation.</p><h2 id="how-it-works">How It Works</h2><p><strong>1. Feature extraction.</strong> Multiple AI models extract features from each modality: <a href="https://mixpeek.com/converters/image-to-embeddings?ref=blog.mixpeek.com">CLIP embeddings</a> from video frames, speech transcription from audio, object detection from images, sentence embeddings from text. Each becomes a queryable vector.</p><p><strong>2. Input mapping.</strong> Configures which extracted features query which taxonomy level. A face-based taxonomy uses face embeddings; a content classification taxonomy might use CLIP at the top level and <a href="https://mixpeek.com/converters/audio-to-embeddings?ref=blog.mixpeek.com">audio features</a> deeper down.</p><p><strong>3. Similarity matching.</strong> Each document&apos;s features are compared against the reference collection using a <a href="https://mixpeek.com/docs/retrieval/retrievers?ref=blog.mixpeek.com">retriever</a>, the same infrastructure used for <a href="https://mixpeek.com/blog/keyword-vs-semantic-vs-hybrid-search?ref=blog.mixpeek.com">semantic search</a>. Documents exceeding the threshold get enriched.</p><p><strong>4. Enrichment.</strong> Structured metadata from the reference collection is attached to the document: brand name, content policy, compliance flags, campaign IDs. Configurable field paths, target names, and merge modes (replace or append).</p><h2 id="real-world-applications">Real-World Applications</h2><p><strong>Advertising.</strong> The <a href="https://mixpeek.com/blog/iab-taxonomy-migration?ref=blog.mixpeek.com">IAB Content Taxonomy</a> defines 1,500+ categories for programmatic ad targeting. Text-only classifiers can&apos;t categorize a cooking video with no description or a sports highlight with only crowd noise. AWS published a <a href="https://mixpeek.com/blog/multimodal-ai-contextual-advertising?ref=blog.mixpeek.com">reference architecture</a> requiring five separate services. A retriever-powered taxonomy collapses that into one pipeline.</p><p><strong>Media asset management.</strong> Libraries of 100,000+ video assets need search across visual content, dialogue, and audio. A hierarchical taxonomy classifies a broadcast as &quot;Live Sports &#x2192; Football &#x2192; NFL &#x2192; Highlight &#x2192; Touchdown&quot; using different features at each level, enriching with rights info and licensing metadata. Manual tagging costs $15&#x2013;25 per asset. See how <a href="https://mixpeek.com/blog/video-intelligence-raw-footage-to-searchable-data?ref=blog.mixpeek.com">video search</a> changes this.</p><p><strong>E-commerce.</strong> Shopify&apos;s multimodal system (BERT + MobileNet-V2) increased leaf-node classification precision by 8% and nearly doubled coverage vs. text-only. A 2025 study found CLIP-based fusion achieved 98.59% hierarchical F1 with a two-stage pipeline: lightweight text model first, multimodal model only when confidence is low.</p><p><strong>Content moderation.</strong> An ICCV 2025 study tested multimodal AI on 1,500 videos across 12 languages. Best model (Gemini-2.0-Flash) achieved F1=0.91 vs. human F1=0.98, at 1/35th the cost ($28 vs. $974). 
The practical solution: multimodal AI handles the first pass, low-confidence cases escalate to humans.</p><p><a href="https://mixpeek.com/blog/ip-safety-pre-publication-clearance?ref=blog.mixpeek.com"><strong>Brand safety</strong></a><strong>.</strong> Enforcing &quot;Talent X cannot appear within 5 seconds of a competitor product in negative-sentiment content&quot; requires cross-modal reasoning: face recognition, logo detection, audio sentiment, temporal proximity. A <a href="https://mixpeek.com/blog/multi-stage-retrieval-pipelines?ref=blog.mixpeek.com">multi-stage retrieval pipeline</a> connects these with taxonomy enrichment for contract terms and compliance status.</p><h2 id="building-a-multimodal-taxonomy">Building a Multimodal Taxonomy</h2><h3 id="create-reference-collections">Create reference collections</h3><pre><code class="language-bash"># Flat taxonomy: employee face recognition
curl -sS -X POST &quot;$MP_API_URL/v1/taxonomies&quot; \
  -H &quot;Authorization: Bearer $MP_API_KEY&quot; \
  -H &quot;X-Namespace: $MP_NAMESPACE&quot; \
  -H &quot;Content-Type: application/json&quot; \
  -d &apos;{
    &quot;taxonomy_name&quot;: &quot;employee_faces&quot;,
    &quot;taxonomy_type&quot;: &quot;flat&quot;,
    &quot;retriever_id&quot;: &quot;ret_face_matcher&quot;,
    &quot;input_mappings&quot;: {
      &quot;query_embedding&quot;: &quot;mixpeek://face_detector@v2/face_embedding&quot;
    },
    &quot;source_collection&quot;: {
      &quot;collection_id&quot;: &quot;col_employee_embeddings&quot;,
      &quot;enrichment_fields&quot;: [
        { &quot;field_path&quot;: &quot;metadata.name&quot;, &quot;merge_mode&quot;: &quot;enrich&quot; },
        { &quot;field_path&quot;: &quot;metadata.department&quot;, &quot;merge_mode&quot;: &quot;enrich&quot; }
      ]
    }
  }&apos;
</code></pre><h3 id="go-hierarchical-when-you-need-precision">Go hierarchical when you need precision</h3><pre><code class="language-bash">curl -sS -X POST &quot;$MP_API_URL/v1/taxonomies&quot; \
  -H &quot;Authorization: Bearer $MP_API_KEY&quot; \
  -H &quot;X-Namespace: $MP_NAMESPACE&quot; \
  -H &quot;Content-Type: application/json&quot; \
  -d &apos;{
    &quot;taxonomy_name&quot;: &quot;content_classification&quot;,
    &quot;taxonomy_type&quot;: &quot;hierarchical&quot;,
    &quot;retriever_id&quot;: &quot;ret_scene_classifier&quot;,
    &quot;input_mappings&quot;: {
      &quot;query_embedding&quot;: &quot;mixpeek://clip@v1/scene_embedding&quot;
    },
    &quot;hierarchy&quot;: [
      {
        &quot;node_id&quot;: &quot;brands&quot;,
        &quot;collection_id&quot;: &quot;col_brand_references&quot;,
        &quot;enrichment_fields&quot;: [&quot;metadata.brand_name&quot;, &quot;metadata.brand_id&quot;]
      },
      {
        &quot;node_id&quot;: &quot;categories&quot;,
        &quot;collection_id&quot;: &quot;col_content_categories&quot;,
        &quot;parent_node_id&quot;: &quot;brands&quot;,
        &quot;enrichment_fields&quot;: [&quot;metadata.category&quot;, &quot;metadata.content_policy&quot;]
      },
      {
        &quot;node_id&quot;: &quot;campaigns&quot;,
        &quot;collection_id&quot;: &quot;col_campaign_assets&quot;,
        &quot;parent_node_id&quot;: &quot;categories&quot;,
        &quot;retriever_id&quot;: &quot;ret_campaign_matcher&quot;,
        &quot;enrichment_fields&quot;: [&quot;metadata.campaign_id&quot;, &quot;metadata.flight_dates&quot;]
      }
    ]
  }&apos;
</code></pre><h3 id="choose-an-execution-mode">Choose an execution mode</h3>
<!--kg-card-begin: html-->
<table>
<thead>
<tr>
<th>Mode</th>
<th>When</th>
<th>Tradeoff</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>materialize</strong></td>
<td>After ingestion (~30s)</td>
<td>Low latency, results persisted</td>
</tr>
<tr>
<td><strong>on_demand</strong></td>
<td>Query time (retriever stage)</td>
<td>Always-fresh reference data, higher latency</td>
</tr>
<tr>
<td><strong>retroactive</strong></td>
<td>Manual trigger via API</td>
<td>Batch reclassification after taxonomy updates</td>
</tr>
</tbody>
</table>
<!--kg-card-end: html-->
<p>Attach to a collection:</p><pre><code class="language-json">{
  &quot;taxonomy_applications&quot;: [
    { &quot;taxonomy_id&quot;: &quot;tax_content_classification&quot;, &quot;execution_mode&quot;: &quot;materialize&quot; }
  ]
}
</code></pre><h3 id="test-before-you-materialize">Test before you materialize</h3><pre><code class="language-bash">curl -sS -X POST &quot;$MP_API_URL/v1/taxonomies/&lt;taxonomy_id&gt;/enrich&quot; \
  -H &quot;Authorization: Bearer $MP_API_KEY&quot; \
  -H &quot;X-Namespace: $MP_NAMESPACE&quot; \
  -H &quot;Content-Type: application/json&quot; \
  -d &apos;{
    &quot;source_documents&quot;: [
      { &quot;document_id&quot;: &quot;doc_test_001&quot;, &quot;mixpeek://clip@v1/scene_embedding&quot;: [0.12, 0.34] }
    ],
    &quot;mode&quot;: &quot;on_demand&quot;
  }&apos;
</code></pre><p>If categories are wrong, add more reference examples. The taxonomy improves because matching is based on collection contents. No model retraining required.</p><h2 id="governance">Governance</h2><p>There is no finished taxonomy. Updating a multimodal taxonomy means updating its reference collections, not rewriting rules or retraining models. Add examples, remove outdated categories, and the taxonomy adapts.</p><p><a href="https://mixpeek.com/docs/enrichment/taxonomies?ref=blog.mixpeek.com">Version your taxonomies</a> before structural changes. Use <a href="https://mixpeek.com/docs/api-reference/collection-taxonomies/apply-taxonomy-to-existing-documents?ref=blog.mixpeek.com">retroactive application</a> to reclassify existing documents after updates. Combine with <a href="https://mixpeek.com/docs/enrichment/clusters?ref=blog.mixpeek.com">clustering</a> to discover new category candidates from unmatched documents.</p><hr><p>Start flat. Add hierarchy when you need precision. Version everything. Update reference collections instead of rewriting rules.</p><p><em>Ready to build? </em><a href="https://mixpeek.com/start?ref=blog.mixpeek.com"><em>Get started with Mixpeek</em></a><em> or explore the </em><a href="https://mixpeek.com/docs/api-reference/taxonomies/create-taxonomy?ref=blog.mixpeek.com"><em>taxonomy API reference</em></a><em>.</em></p>]]></content:encoded></item><item><title><![CDATA[Object Storage Comparison 2026: 21 Providers, Real Pricing, and the Gotchas Nobody Tells You]]></title><description><![CDATA[We compared 21 S3-compatible object storage providers across pricing, egress, features, and fine print. AWS S3 costs 15x more than the cheapest alternative for the same workload. Here's everything we found.]]></description><link>http://blog.mixpeek.com/object-storage-comparison-2026/</link><guid isPermaLink="false">69d9531e3baecafdb7f8debf</guid><category><![CDATA[Object Storage]]></category><category><![CDATA[Cloud Infrastructure]]></category><category><![CDATA[S3]]></category><category><![CDATA[Cost Optimization]]></category><category><![CDATA[Developer Tools]]></category><category><![CDATA[Comparison]]></category><dc:creator><![CDATA[Ethan Steininger]]></dc:creator><pubDate>Fri, 10 Apr 2026 19:58:00 GMT</pubDate><media:content url="http://blog.mixpeek.com/content/images/2026/04/storage-comparison-hero.png" medium="image"/><content:encoded><![CDATA[<img src="http://blog.mixpeek.com/content/images/2026/04/storage-comparison-hero.png" alt="Object Storage Comparison 2026: 21 Providers, Real Pricing, and the Gotchas Nobody Tells You"><p>We built <a href="https://github.com/mixpeek/awesome-object-storage?ref=blog.mixpeek.com">Awesome Object Storage</a> because we got tired of discovering gotchas <em>after</em> migrating 50 TB. Wasabi&apos;s 90-day minimum retention. R2&apos;s missing versioning. 
DigitalOcean&apos;s 5 GB object cap masquerading as &quot;unlimited.&quot; Every claim sourced, every gotcha earned the hard way.</p><p>This is what we found after comparing 21 S3-compatible providers across pricing, features, durability, compliance, and the fine print that actually breaks migrations.</p><hr><p>Resources</p><ul><li><a href="https://storage.mixpeek.com/?ref=blog.mixpeek.com">Interactive Cost Calculator</a> &#x2014; plug in your usage, compare all 21 providers</li><li><a href="https://github.com/mixpeek/awesome-object-storage?ref=blog.mixpeek.com">GitHub: awesome-object-storage</a> &#x2014; full dataset, JSON schemas, open-source</li><li><a href="https://mixpeek.com/curated-lists/best-s3-compatible-object-storage?ref=blog.mixpeek.com">Ranked Listicle</a> &#x2014; top 10 providers ranked with pros, cons, and pricing</li></ul><h2 id="the-pricing-lie-storage-cost-is-the-least-important-number">The Pricing Lie: Storage Cost Is the Least Important Number</h2><p>When teams evaluate object storage, they compare the per-GB storage price. That&apos;s the wrong number.</p><p>Egress is where the real bill hides. AWS S3 charges $0.09/GB to move data out. Google Cloud charges $0.12/GB &#x2014; the highest of the big three. On a workload that reads 10 TB/month, that&apos;s <strong>$900&#x2013;$1,200/month in egress alone</strong>, dwarfing the storage cost.</p><p>Meanwhile, Cloudflare R2, Tigris, and Backblaze B2 (via Cloudflare Bandwidth Alliance) offer zero or near-zero egress. For a 10 TB stored / 5 TB egress workload:</p>
<!--kg-card-begin: html-->
<table>
<thead><tr><th>Provider</th><th>Monthly Cost</th><th>vs. AWS S3</th></tr></thead>
<tbody>
<tr><td>AWS S3</td><td>$689</td><td>&#x2014;</td></tr>
<tr><td>Google Cloud Storage</td><td>$810</td><td>+18%</td></tr>
<tr><td>Cloudflare R2</td><td>$158</td><td><strong>-77%</strong></td></tr>
<tr><td>Backblaze B2</td><td>$110</td><td><strong>-84%</strong></td></tr>
<tr><td>Wasabi</td><td>$49</td><td><strong>-93%</strong></td></tr>
<tr><td>IDrive e2</td><td>$46</td><td><strong>-93%</strong></td></tr>
</tbody>
</table>
<!--kg-card-end: html-->
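<p>If you want to sanity-check those rows, the math is two multiplications per provider. The sketch below uses approximate list rates (our assumptions at the time of writing; check each provider&apos;s pricing page) and ignores request and operation fees, so it lands a little under the table&apos;s figures.</p><pre><code class="language-python"># Rough monthly cost for 10 TB stored / 5 TB egress (decimal GB).
# Rates are assumed list prices, USD per GB per month.
RATES = {
    &quot;AWS S3&quot;:        {&quot;storage&quot;: 0.023, &quot;egress&quot;: 0.09},
    &quot;Cloudflare R2&quot;: {&quot;storage&quot;: 0.015, &quot;egress&quot;: 0.00},
}

def monthly_cost(storage_gb, egress_gb, rates):
    return storage_gb * rates[&quot;storage&quot;] + egress_gb * rates[&quot;egress&quot;]

for provider, rates in RATES.items():
    print(provider, round(monthly_cost(10_000, 5_000, rates)))
# AWS S3 680          close to the table&apos;s $689; request fees likely make up the gap
# Cloudflare R2 150   close to the table&apos;s $158; Class A/B operations likely make up the gap
</code></pre>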
<p>AWS S3 costs <strong>15x more</strong> than the cheapest alternative for the same workload. Run the numbers for your own usage at <a href="https://storage.mixpeek.com/?ref=blog.mixpeek.com">storage.mixpeek.com</a>.</p><h2 id="the-escape-cost-what-nobody-compares">The Escape Cost: What Nobody Compares</h2><p>Here&apos;s the number that matters most and gets compared least: <strong>what does it cost to leave?</strong></p>
<!--kg-card-begin: html-->
<table>
<thead><tr><th>Provider</th><th>Cost to Move 100 TB Out</th></tr></thead>
<tbody>
<tr><td>AWS S3</td><td>$9,000</td></tr>
<tr><td>Google Cloud Storage</td><td>$12,000</td></tr>
<tr><td>Azure Blob</td><td>$8,700</td></tr>
<tr><td>Cloudflare R2</td><td><strong>$0</strong></td></tr>
<tr><td>Backblaze B2</td><td>$1,000</td></tr>
<tr><td>Wasabi</td><td>$0*</td></tr>
</tbody>
</table>
<!--kg-card-end: html-->
<p><em>*Wasabi &quot;free&quot; egress is subject to a reasonable-use policy: monthly egress can&apos;t exceed your stored volume.</em></p><p>If you&apos;re storing 100 TB on GCS and want to leave, it&apos;ll cost you $12,000 just in egress fees. That&apos;s not a technical lock-in &#x2014; it&apos;s a financial one.</p><h2 id="the-8-gotchas-that-will-break-your-migration">The 8 Gotchas That Will Break Your Migration</h2><p>Every one of these burned a real team. We know because we were some of them.</p><h3 id="1-wasabis-90-day-minimum-retention">1. Wasabi&apos;s 90-Day Minimum Retention</h3><p>Delete an object after 30 days? You still pay for 90. This is <em>per-object</em>, not per-account. On a dataset with high churn, this can double your effective storage cost. We found out after migrating 50 TB.</p><h3 id="2-r2-has-no-versioning-and-no-object-lock">2. R2 Has No Versioning and No Object Lock</h3><p>Cloudflare R2 is the darling of zero-egress storage. But it has no versioning, no object lock, and no WORM compliance. If you need immutable backups or regulatory compliance, R2 isn&apos;t an option &#x2014; full stop. <a href="https://www.tigrisdata.com/?ref=blog.mixpeek.com">Tigris</a> fills this gap with zero egress <em>plus</em> versioning and object lock.</p><h3 id="3-digitalocean-spaces-caps-objects-at-5-gb">3. DigitalOcean Spaces Caps Objects at 5 GB</h3><p>The pricing page says &quot;unlimited storage.&quot; The fine print says max object size is <strong>5 GB</strong> &#x2014; not 5 TB like every other provider. Vultr and Linode have the same 5 GB cap. If you&apos;re storing video, backups, or ML model checkpoints, these are non-starters.</p><h3 id="4-s3-compatible-is-a-spectrum">4. &quot;S3 Compatible&quot; Is a Spectrum</h3><p>Full S3 compatibility means passing the AWS SDK test suite &#x2014; multipart uploads, presigned URLs, bucket notifications, S3 Select, batch operations. Most providers only support a subset. Azure Blob&apos;s S3 compatibility is still in <em>preview</em>. Trust your integration tests, not the compatibility page.</p><h3 id="5-gcs-has-the-highest-egress-of-the-big-three">5. GCS Has the Highest Egress of the Big Three</h3><p>Google Cloud Storage charges $0.12/GB for egress &#x2014; 33% more than AWS and 38% more than Azure. If your workload is read-heavy, GCS is quietly the most expensive hyperscaler.</p><h3 id="6-archive-tiers-have-minimum-retention-traps">6. Archive Tiers Have Minimum Retention Traps</h3><p>GCS Archive has a <strong>365-day</strong> minimum retention. Azure Cold has 180 days. Delete early and you pay the full retention period anyway. OVHcloud applies a 30-day minimum to <em>all</em> tiers, not just archive.</p><h3 id="7-event-notifications-are-basically-awsgcsminio-only">7. Event Notifications Are Basically AWS/GCS/MinIO Only</h3><p>If your architecture depends on &quot;object created &#x2192; trigger processing,&quot; your options are narrow. S3 (SNS/SQS/Lambda/EventBridge), GCS (Pub/Sub), R2 (Workers), and MinIO (Webhooks/Kafka/NATS) have real event systems. Most alternatives don&apos;t.</p><h3 id="8-durability-claims-vary-in-substance">8. Durability Claims Vary in Substance</h3><p>Everyone claims 11 nines (99.999999999%). But Vultr, DigitalOcean, and Linode don&apos;t publish verifiable durability data. Backblaze publishes drive failure statistics openly. 
When a provider won&apos;t show their math, the number is marketing, not engineering.</p><h2 id="the-decision-framework">The Decision Framework</h2><p>After testing all 21 providers, here&apos;s how we&apos;d decide:</p><ul><li><strong>Cheapest raw storage:</strong> Storj ($0.004/GB) or IDrive e2 ($0.004/GB)</li><li><strong>Zero egress, no asterisks:</strong> Cloudflare R2</li><li><strong>Zero egress + versioning + object lock:</strong> Tigris or Impossible Cloud</li><li><strong>CDN origin:</strong> R2 (Cloudflare native) or Fastly Object Storage</li><li><strong>Compliance / WORM:</strong> AWS S3 Object Lock or Wasabi (accept the 90-day minimum)</li><li><strong>EU data sovereignty:</strong> Hetzner (cheapest), Scaleway, or OVHcloud</li><li><strong>Self-hosted:</strong> MinIO (only serious option)</li><li><strong>Biggest free tier:</strong> Oracle Cloud (10 TB/mo free egress)</li><li><strong>Already on AWS and can&apos;t leave:</strong> S3 Intelligent-Tiering + aggressive lifecycle rules</li></ul><h2 id="what-happens-after-you-store-it">What Happens After You Store It</h2><p>Object storage used to be a write-and-forget tier. That&apos;s changing. AWS launched <a href="https://aws.amazon.com/s3/vectors/?ref=blog.mixpeek.com">S3 Vectors</a> &#x2014; vector search built into S3 itself. <a href="https://turbopuffer.com/?ref=blog.mixpeek.com">turbopuffer</a> runs a vector database on top of S3. <a href="https://lancedb.com/?ref=blog.mixpeek.com">LanceDB</a> stores vector indices as objects.</p><p>At <a href="https://mixpeek.com/?ref=blog.mixpeek.com">Mixpeek</a>, we built our processing pipeline to treat object storage as the source of truth for multimodal data &#x2014; images, video, documents, audio &#x2014; with feature extraction, embedding, and search all flowing from the bucket. Your storage layer isn&apos;t just storage anymore. It&apos;s the foundation of your intelligence layer.</p><p>If you want to make your stored objects searchable and queryable across modalities, <a href="https://mixpeek.com/?ref=blog.mixpeek.com">try Mixpeek</a> &#x2014; it connects to any S3-compatible bucket and turns your data into something you can actually reason over.</p><h2 id="try-it-yourself">Try It Yourself</h2><p>We open-sourced the full dataset &#x2014; all 21 providers, ~60 data points each, machine-readable JSON &#x2014; at <a href="https://github.com/mixpeek/awesome-object-storage?ref=blog.mixpeek.com">github.com/mixpeek/awesome-object-storage</a>.</p><p>Run your own cost comparison at <a href="https://storage.mixpeek.com/?ref=blog.mixpeek.com">storage.mixpeek.com</a>.</p><p>If we got something wrong or a provider updated their pricing, <a href="https://github.com/mixpeek/awesome-object-storage/pulls?ref=blog.mixpeek.com">open a PR</a>. Every claim is sourced. 
Every number is verifiable.</p>]]></content:encoded></item><item><title><![CDATA[Building a Kalshi Trading Bot with Semantic Search and LLM Extraction]]></title><description><![CDATA[How we built an autonomous Kalshi trading bot using the Kalshi API and Mixpeek's video transcription, semantic search, and LLM data extraction no external tools required.]]></description><link>http://blog.mixpeek.com/kalshi-trading-bot-semantic-search-llm-extraction/</link><guid isPermaLink="false">69cead063baecafdb7f8dafd</guid><category><![CDATA[Engineering]]></category><category><![CDATA[Use Cases]]></category><category><![CDATA[LLM]]></category><category><![CDATA[Retrievers]]></category><category><![CDATA[Prediction Markets]]></category><dc:creator><![CDATA[Ethan Steininger]]></dc:creator><pubDate>Thu, 02 Apr 2026 18:01:16 GMT</pubDate><media:content url="http://blog.mixpeek.com/content/images/2026/04/ChatGPT-Image-Apr-2--2026--01_49_35-PM-1.png" medium="image"/><content:encoded><![CDATA[<img src="http://blog.mixpeek.com/content/images/2026/04/ChatGPT-Image-Apr-2--2026--01_49_35-PM-1.png" alt="Building a Kalshi Trading Bot with Semantic Search and LLM Extraction"><p>We built an autonomous <strong>Kalshi trading bot</strong> that uses the <strong>Kalshi API</strong> and Mixpeek&apos;s multimodal data platform to trade mention markets in real-time. The system feeds YouTube URLs directly into Mixpeek &#x2014; which handles transcription, embedding, and LLM extraction &#x2014; then queries the results through a semantic retriever to generate calibrated trading signals. Zero manual intervention.</p><p>This post walks through every component: how Mixpeek ingests and transcribes political video, uses <strong>LLM data extraction</strong> to structure it, queries it with a <strong>semantic search API</strong>, and turns the output into <strong>automated market making</strong> decisions on the <strong>prediction market API</strong> from Kalshi.</p><hr><h2 id="what-are-kalshi-mention-markets">What Are Kalshi Mention Markets?</h2><p>Kalshi&apos;s mention markets are binary contracts on whether a public figure will say a specific word. Examples:</p><ul><li><em>&quot;Will Trump say &apos;tariff&apos; in his next address?&quot;</em> &#x2014; ticker: KXTRUMPMENTIONB-26APR01-TARI</li><li><em>&quot;Will the Fed Chair mention &apos;inflation&apos;?&quot;</em> &#x2014; ticker: KXFEDMENTION-26APR-INFL</li><li><em>&quot;Will the Press Secretary say &apos;China&apos;?&quot;</em> &#x2014; ticker: KXSECPRESSMENTION-26APR30-CHIN</li></ul><p>These markets resolve based on official transcripts. The edge comes from processing political speech <strong>faster and more accurately</strong> than the market &#x2014; knowing <em>who</em> said it, <em>how surprising</em> it was, and whether the keyword appeared in a policy-relevant context.</p><p>Most <strong>Kalshi trading bots</strong> rely on simple keyword matching. Ours uses Mixpeek&apos;s full resource chain for semantic understanding.</p><hr><h2 id="system-architecture-six-mixpeek-resources">System Architecture: Six Mixpeek Resources</h2><p>The pipeline chains six Mixpeek primitives. Mixpeek handles everything from video download and transcription to embedding and LLM extraction &#x2014; no external tools required:</p><pre><code>YouTube URL
  1. Namespace  &#x2192; data isolation
  2. Bucket     &#x2192; accepts YouTube URLs as type: &quot;video&quot;
  3. Collection &#x2192; auto-transcription + text embedding + LLM extraction
  4. Retriever  &#x2192; semantic search across processed documents
  5. Bucket     &#x2192; trade history logging (feedback loop)
  6. Retriever  &#x2192; historical calibration from past trades</code></pre><p>If you&apos;ve used a <strong>prediction market API</strong> before (Kalshi, Polymarket, etc.), you know the data challenge: markets move on unstructured information &#x2014; speeches, press briefings, hearings &#x2014; that doesn&apos;t fit neatly into a database. Mixpeek bridges that gap.</p><figure class="kg-card kg-image-card"><img src="http://blog.mixpeek.com/content/images/2026/04/Xnapper-2026-04-02-14.00.26.jpg" class="kg-image" alt="Building a Kalshi Trading Bot with Semantic Search and LLM Extraction" loading="lazy" width="2000" height="1281" srcset="http://blog.mixpeek.com/content/images/size/w600/2026/04/Xnapper-2026-04-02-14.00.26.jpg 600w, http://blog.mixpeek.com/content/images/size/w1000/2026/04/Xnapper-2026-04-02-14.00.26.jpg 1000w, http://blog.mixpeek.com/content/images/size/w1600/2026/04/Xnapper-2026-04-02-14.00.26.jpg 1600w, http://blog.mixpeek.com/content/images/size/w2400/2026/04/Xnapper-2026-04-02-14.00.26.jpg 2400w" sizes="(min-width: 720px) 720px"></figure><p></p><hr><h2 id="resource-1-namespace-%E2%80%94-data-isolation">Resource 1: Namespace &#x2014; Data Isolation</h2><pre><code class="language-yaml">namespace_id: ns_7c8f877d9b
name: prediction-market-alpha</code></pre><p>Every resource lives inside a single namespace, isolating prediction market data from other workloads. All API calls include the <code>X-Namespace</code> header. This is standard practice when using Mixpeek as a multimodal data pipeline &#x2014; one namespace per use case.</p><hr><h2 id="resource-2-hearing-bucket-%E2%80%94-video-ingestion">Resource 2: Hearing Bucket &#x2014; Video Ingestion</h2><pre><code class="language-yaml">bucket_id: bkt_be6b9536</code></pre><p>We monitor four YouTube channels (White House, C-SPAN, C-SPAN Senate, Federal Reserve). When a new video appears, we push the YouTube URL directly to Mixpeek &#x2014; no need for external transcription tools:</p><pre><code class="language-json">POST /v1/buckets/bkt_be6b9536/objects
{
  &quot;blobs&quot;: [
    {
      &quot;property&quot;: &quot;url&quot;,
      &quot;type&quot;: &quot;video&quot;,
      &quot;data&quot;: &quot;https://www.youtube.com/watch?v=7d-3oqka-fE&quot;
    },
    {&quot;property&quot;: &quot;source&quot;, &quot;type&quot;: &quot;string&quot;, &quot;data&quot;: &quot;white-house&quot;},
    {&quot;property&quot;: &quot;event_type&quot;, &quot;type&quot;: &quot;string&quot;, &quot;data&quot;: &quot;press_briefing&quot;}
  ]
}</code></pre><p>That&apos;s it. Mixpeek downloads the video, extracts the audio, transcribes it, and makes the text available to the collection pipeline. A single API call replaces what would otherwise require <code>yt-dlp</code> for download, <code>whisper</code> or <code>youtube-transcript-api</code> for transcription, and a custom chunking pipeline.</p><p>A single day&apos;s political speech typically yields 5-10 videos totaling 200-400K characters of transcript.</p><hr><h2 id="resource-3-collection-%E2%80%94-embedding-llm-data-extraction">Resource 3: Collection &#x2014; Embedding + LLM Data Extraction</h2><pre><code class="language-yaml">collection_id: col_2a9565df60</code></pre><p>This is the core of the system. Once Mixpeek transcribes the video, the collection runs two extractors on the resulting text:</p><ol><li><strong>Dense vector embedding</strong> via <code>multilingual_e5_large_instruct_v1</code> &#x2014; enables semantic search across all transcript chunks</li><li><strong>LLM structured extraction</strong> via Mixpeek&apos;s <code>response_shape</code> &#x2014; Claude analyzes each chunk and extracts seven fields</li></ol><p>The <code>response_shape</code> configuration defines the extraction schema:</p><pre><code class="language-json">{
  &quot;speaker&quot;: &quot;who is speaking (e.g. President Trump, Fed Chair Powell)&quot;,
  &quot;statement_type&quot;: &quot;policy_announcement | press_response | hearing_testimony | ...&quot;,
  &quot;policy_direction&quot;: &quot;hawkish | dovish | neutral | escalatory | ...&quot;,
  &quot;keywords_mentioned&quot;: [&quot;tariff&quot;, &quot;china&quot;, &quot;inflation&quot;, ...],
  &quot;is_surprising&quot;: true/false,
  &quot;surprise_magnitude&quot;: 0.0 - 1.0,
  &quot;market_impact&quot;: 0.0 - 1.0
}</code></pre><p>This is <strong>LLM data extraction</strong> at scale &#x2014; every chunk gets speaker attribution, policy context, and market relevance scoring without any custom LLM pipeline. A batch of 7 videos (357K chars of transcript) produces 150+ indexed documents with all seven fields.</p><hr><h2 id="resource-4-signal-retriever-%E2%80%94-semantic-search-api">Resource 4: Signal Retriever &#x2014; Semantic Search API</h2><pre><code class="language-yaml">retriever_id: ret_37fcabc4144e76
name: signal-market-matcher</code></pre><p>The retriever is configured as a <strong>semantic search API</strong> endpoint that queries the collection using the E5 embedding model:</p><pre><code class="language-json">{
  &quot;stages&quot;: [{
    &quot;stage_name&quot;: &quot;semantic-search&quot;,
    &quot;config&quot;: {
      &quot;stage_id&quot;: &quot;feature_search&quot;,
      &quot;parameters&quot;: {
        &quot;searches&quot;: [{
          &quot;feature_uri&quot;: &quot;mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1&quot;,
          &quot;query&quot;: {&quot;input_mode&quot;: &quot;text&quot;, &quot;value&quot;: &quot;{{INPUT.query}}&quot;}
        }],
        &quot;final_top_k&quot;: 30
      }
    }
  }]
}</code></pre><p>The engine sends three query groups per cycle to maximize coverage:</p><ul><li><strong>Political/policy terms</strong> &#x2014; tariff, china, iran, trade, sanctions, immigration</li><li><strong>People/institution names</strong> &#x2014; trump, powell, leavitt, fed, congress, senate</li><li><strong>Random keyword sample</strong> &#x2014; surfaces unexpected matches from new transcripts</li></ul><p>Each query returns up to 30 semantically relevant chunks with their LLM-extracted fields intact &#x2014; speaker, surprise magnitude, market impact, and all.</p><hr><h2 id="resource-5-trade-history-bucket-%E2%80%94-feedback-loop">Resource 5: Trade History Bucket &#x2014; Feedback Loop</h2><pre><code class="language-yaml">bucket_id: bkt_0e439f96</code></pre><p>Every trade decision &#x2014; executed or skipped &#x2014; gets logged back to Mixpeek. This creates a searchable archive of what the <strong>Kalshi trading bot</strong> traded, at what price, with what signal quality, and whether it won or lost. The feedback loop is what separates a <strong>prediction market bot</strong> from a simple alert system.</p><hr><h2 id="resource-6-history-retriever-%E2%80%94-edge-calibration">Resource 6: History Retriever &#x2014; Edge Calibration</h2><pre><code class="language-yaml">retriever_id: ret_2674a0d675b62f
name: signal-history</code></pre><p>Before placing any order through the <strong>Kalshi API</strong>, the engine queries the history retriever:</p><pre><code class="language-http">POST /v1/retrievers/ret_2674a0d675b62f/execute
{&quot;inputs&quot;: {&quot;query&quot;: &quot;tariff&quot;}}</code></pre><p>Past trades for the same keyword feed a win-rate calculation that adjusts the expected edge. If historical &quot;tariff&quot; trades won 70% of the time, the engine sizes up. If 30%, it skips. This is what makes the system self-improving &#x2014; a form of <strong>automated market making</strong> that learns from its own history via Mixpeek&apos;s retriever infrastructure.</p><hr><h2 id="three-intelligence-layers-from-signal-to-trade">Three Intelligence Layers: From Signal to Trade</h2><p>The six resources feed into three scoring layers:</p><h3 id="layer-1-signal-quality-scoring">Layer 1: Signal Quality Scoring</h3><p>Powered by the collection&apos;s <code>response_shape</code> LLM extraction:</p><ul><li><strong>Speaker authority</strong> &#x2014; &quot;President Trump&quot; (1.0), &quot;Fed Chair Powell&quot; (0.95), unknown press pool (0.50)</li><li><strong>Statement type</strong> &#x2014; policy announcements and hearing testimony score higher than casual references</li><li><strong>Surprise factor</strong> &#x2014; <code>is_surprising=true</code> with high <code>surprise_magnitude</code> &#x2192; larger position size</li><li><strong>Market impact</strong> &#x2014; LLM-estimated probability that the mention moves the market</li></ul><h3 id="layer-2-portfolio-construction">Layer 2: Portfolio Construction</h3><ul><li>Category exposure caps: max $3 per market category</li><li>Per-market position limits: $10 max</li><li>Daily loss circuit breaker: $10 max drawdown</li></ul><h3 id="layer-3-historical-calibration">Layer 3: Historical Calibration</h3><ul><li>Win rate from past trades adjusts edge estimates up or down</li><li>Keywords with poor track record get automatically de-risked</li><li>New keywords start at 50% base rate until history accumulates</li></ul><hr><h2 id="live-trading-results">Live Trading Results</h2><p>The engine runs autonomously, polling every 2 minutes. Here&apos;s actual output from a live cycle:</p><pre><code>Cycle 1: 28 signals found &#x2192; 7 trades attempted, 23 skipped

BUY signals (positive edge):
  &quot;iran&quot; on KXFEDMENTION-26APR-IRAN
    speaker=President Trump, quality=0.63, edge=+0.43 &#x2192; 1x YES @ $0.25
  &quot;volatility&quot; on KXFEDMENTION-26APR-VOLA
    speaker=President Trump, quality=1.00, edge=+0.50 &#x2192; 1x YES @ $0.28
  &quot;bitcoin&quot; on KXSECPRESSMENTION-26APR30-CRYP
    speaker=Press Secretary, quality=1.00, edge=+0.65 &#x2192; 1x YES @ $0.13

SKIPPED signals (negative edge or caps):
  &quot;russia&quot; on KXSECPRESSMENTION &#x2192; quality=0.61, edge=-0.09 &#x2192; SKIP
  &quot;border&quot; on KXSECPRESSMENTION &#x2192; quality=0.61, edge=-0.17 &#x2192; SKIP
  &quot;oil&quot; on KXLEAVITTSMFMENTION &#x2192; category cap $3.08/$3.00 &#x2192; SKIP</code></pre><p>The engine correctly rejects low-quality signals and respects portfolio limits, while aggressively buying high-conviction signals from authoritative speakers.</p><hr><h2 id="why-this-beats-simple-keyword-matching">Why This Beats Simple Keyword Matching</h2><p>Most <strong>Kalshi trading bots</strong> and <strong>prediction market bots</strong> use basic keyword detection &#x2014; grep the transcript for &quot;tariff&quot; and buy. That approach fails in practice:</p><ul><li><strong>False positives</strong> &#x2014; &quot;The tariff discussion from last year...&quot; doesn&apos;t mean they said &quot;tariff&quot; in a policy context today</li><li><strong>No speaker attribution</strong> &#x2014; a reporter asking &quot;Will you impose tariffs?&quot; is very different from the President saying &quot;I&apos;m imposing tariffs&quot;</li><li><strong>No surprise weighting</strong> &#x2014; Trump saying &quot;tariff&quot; (expected) should size differently than Powell saying &quot;tariff&quot; (unexpected)</li><li><strong>No learning</strong> &#x2014; keyword bots make the same mistakes repeatedly with no feedback loop</li></ul><p>Mixpeek&apos;s <code>response_shape</code> extraction solves all four. The <strong>semantic search API</strong> finds contextually relevant chunks, the LLM extraction gives you structured fields, and the history retriever calibrates over time.</p><hr><h2 id="technical-stack">Technical Stack</h2><ul><li><strong>Video ingestion</strong> &#x2014; YouTube URLs pushed directly to Mixpeek bucket as <code>type: &quot;video&quot;</code></li><li><strong>Transcription + processing</strong> &#x2014; Mixpeek auto-transcribes, chunks, embeds (E5), and extracts (Claude <code>response_shape</code>)</li><li><strong>Search</strong> &#x2014; Mixpeek retriever (<strong>semantic search API</strong> with <code>final_top_k: 30</code>)</li><li><strong>Trading</strong> &#x2014; <strong>Kalshi API</strong> with RSA-PSS authentication for order placement</li><li><strong>Feedback</strong> &#x2014; Mixpeek trade history bucket + history retriever for calibration</li><li><strong>Runtime</strong> &#x2014; FastAPI server with async polling loop (Python)</li></ul><p>The entire intelligence layer &#x2014; from YouTube URL to calibrated trading signal &#x2014; runs on six Mixpeek resource IDs. No transcription tools, no custom vector database, no LLM prompt engineering, no embedding pipeline to maintain.</p><hr><h2 id="get-started">Get Started</h2><p>Mixpeek handles the hard parts of <strong>unstructured data processing</strong> &#x2014; video transcription, chunking, embedding, LLM extraction, vector search, and batch processing &#x2014; so you can focus on your domain logic. Whether you&apos;re building a <strong>Kalshi trading bot</strong>, a content moderation pipeline, or a multimodal search engine, the same resource primitives apply.</p><ul><li><strong>Mixpeek docs</strong> &#x2014; mixpeek.com/docs</li><li><strong>Kalshi API docs</strong> &#x2014; docs.kalshi.com</li><li><strong>Source code</strong> &#x2014; the complete engine is open-source in our research repo</li></ul>]]></content:encoded></item><item><title><![CDATA[The Multimodal Data Warehouse: Why Unstructured Data Needs Its Own Snowflake]]></title><description><![CDATA[We are drowning in unstructured data — video, audio, images, documents, IoT — but our infrastructure still assumes everything is a row or a vector. 
The multimodal data warehouse is the missing layer: object decomposition, tiered storage, and multi-stage retrieval pipelines for the AI era.]]></description><link>http://blog.mixpeek.com/multimodal-data-warehouse/</link><guid isPermaLink="false">69c7d8d13baecafdb7f8d65b</guid><category><![CDATA[Multimodal AI]]></category><category><![CDATA[Data Warehouse]]></category><category><![CDATA[Architecture]]></category><category><![CDATA[Thought Leadership]]></category><category><![CDATA[Vector Database]]></category><category><![CDATA[Unstructured Data]]></category><dc:creator><![CDATA[Ethan Steininger]]></dc:creator><pubDate>Sat, 28 Mar 2026 13:34:11 GMT</pubDate><media:content url="http://blog.mixpeek.com/content/images/2026/03/feature-image-6.png" medium="image"/><content:encoded><![CDATA[<img src="http://blog.mixpeek.com/content/images/2026/03/feature-image-6.png" alt="The Multimodal Data Warehouse: Why Unstructured Data Needs Its Own Snowflake"><p><strong>TL;DR:</strong> We&apos;re drowning in unstructured data&#x2014;video, audio, images, documents, IoT streams&#x2014;but our infrastructure still assumes everything is a row in a table or a vector in an index. The <strong>multimodal data warehouse</strong> is the missing layer: a system that decomposes objects into searchable features, stores them across hot and cold tiers, and reassembles them through multi-stage retrieval pipelines. This isn&apos;t a database. It&apos;s the warehouse for the AI era.</p><h2 id="the-120-trillion-problem-nobody-talks-about">The $120 Trillion Problem Nobody Talks About</h2><p>Here&apos;s an uncomfortable truth: <strong>80-90% of enterprise data is unstructured</strong>, and it&apos;s growing 3x faster than structured data. IDC projects the global datasphere will hit 175 zettabytes by 2025&#x2014;and the vast majority of that is video, images, audio, documents, sensor data, and formats that don&apos;t fit in Snowflake.</p><p>Yet when companies build AI-native applications, they cobble together:</p><ul><li>A vector database for embeddings (Pinecone, Qdrant, Weaviate)</li><li>An object store for raw files (S3, GCS)</li><li>A separate search engine for text (Elasticsearch)</li><li>Custom ETL for each modality</li><li>Bespoke inference pipelines per use case</li></ul><p>This is the <strong>modern data Frankenstein</strong>&#x2014;a stitched-together monster where every new modality means a new system, a new integration, and a new failure mode.</p>
<!--kg-card-begin: html-->
<figure style="margin: 40px 0; text-align: center;"><svg xmlns="http://www.w3.org/2000/svg" viewbox="0 0 900 720" width="900" height="720">
  <defs>
    <lineargradient id="bg" x1="0%" y1="0%" x2="100%" y2="100%">
      <stop offset="0%" style="stop-color:#0f0f23"/>
      <stop offset="100%" style="stop-color:#1a1a3e"/>
    </lineargradient>
  </defs>

  <rect width="900" height="720" rx="16" fill="url(#bg)"/>

  <!-- === TOP HALF: Frankenstein === -->
  <text x="450" y="44" fill="#ef4444" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="22" font-weight="700" text-anchor="middle" letter-spacing="1">THE DATA FRANKENSTEIN</text>
  <text x="450" y="66" fill="#ef444480" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="14" text-anchor="middle">(status quo)</text>

  <!-- Four siloed boxes -->
  <rect x="40" y="88" width="185" height="76" rx="10" fill="#ef444412" stroke="#ef4444" stroke-width="2" stroke-dasharray="6,4"/>
  <text x="132" y="118" fill="#fca5a5" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="16" font-weight="600" text-anchor="middle">S3 / GCS</text>
  <text x="132" y="140" fill="#fca5a566" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="12" text-anchor="middle">raw files</text>

  <rect x="248" y="88" width="185" height="76" rx="10" fill="#ef444412" stroke="#ef4444" stroke-width="2" stroke-dasharray="6,4"/>
  <text x="340" y="118" fill="#fca5a5" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="16" font-weight="600" text-anchor="middle">Pinecone</text>
  <text x="340" y="140" fill="#fca5a566" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="12" text-anchor="middle">vectors only</text>

  <rect x="456" y="88" width="185" height="76" rx="10" fill="#ef444412" stroke="#ef4444" stroke-width="2" stroke-dasharray="6,4"/>
  <text x="548" y="118" fill="#fca5a5" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="16" font-weight="600" text-anchor="middle">Elasticsearch</text>
  <text x="548" y="140" fill="#fca5a566" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="12" text-anchor="middle">text search</text>

  <rect x="664" y="88" width="185" height="76" rx="10" fill="#ef444412" stroke="#ef4444" stroke-width="2" stroke-dasharray="6,4"/>
  <text x="756" y="118" fill="#fca5a5" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="16" font-weight="600" text-anchor="middle">Custom ETL</text>
  <text x="756" y="140" fill="#fca5a566" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="12" text-anchor="middle">per modality</text>

  <!-- Connecting lines -->
  <line x1="132" y1="164" x2="132" y2="198" stroke="#ef444444" stroke-width="1.5"/>
  <line x1="340" y1="164" x2="340" y2="198" stroke="#ef444444" stroke-width="1.5"/>
  <line x1="548" y1="164" x2="548" y2="198" stroke="#ef444444" stroke-width="1.5"/>
  <line x1="756" y1="164" x2="756" y2="198" stroke="#ef444444" stroke-width="1.5"/>
  <line x1="132" y1="198" x2="756" y2="198" stroke="#ef444444" stroke-width="1.5"/>
  <line x1="450" y1="198" x2="450" y2="218" stroke="#ef444444" stroke-width="1.5"/>

  <!-- Your App -->
  <rect x="300" y="218" width="300" height="56" rx="10" fill="#ef444418" stroke="#ef4444" stroke-width="2"/>
  <text x="450" y="244" fill="#fca5a5" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="16" font-weight="600" text-anchor="middle">Your App</text>
  <text x="450" y="264" fill="#ef444480" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="12" text-anchor="middle">glue code + prayers</text>

  <!-- Divider -->
  <line x1="80" y1="310" x2="820" y2="310" stroke="#ffffff18" stroke-width="1"/>
  <rect x="415" y="296" width="70" height="28" rx="14" fill="#0f0f23" stroke="#ffffff33" stroke-width="1"/>
  <text x="450" y="315" fill="#ffffff77" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="15" font-weight="700" text-anchor="middle">vs.</text>

  <!-- === BOTTOM HALF: Warehouse === -->
  <text x="450" y="354" fill="#7c3aed" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="22" font-weight="700" text-anchor="middle" letter-spacing="1">THE MULTIMODAL DATA WAREHOUSE</text>

  <!-- Layer 1: Ingestion -->
  <rect x="60" y="376" width="780" height="64" rx="10" fill="#7c3aed15" stroke="#7c3aed" stroke-width="2"/>
  <text x="450" y="402" fill="#c4b5fd" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="16" font-weight="700" text-anchor="middle">Object Ingestion Layer</text>
  <text x="450" y="426" fill="#c4b5fd77" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="13" text-anchor="middle">video | audio | image | doc | IoT | ...</text>

  <!-- Arrow -->
  <polygon points="444,454 450,466 456,454" fill="#7c3aed77"/>
  <text x="475" y="462" fill="#7c3aed77" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="11" text-anchor="start">decompose</text>

  <!-- Layer 2: Feature Extraction -->
  <rect x="60" y="474" width="780" height="64" rx="10" fill="#3b82f615" stroke="#3b82f6" stroke-width="2"/>
  <text x="450" y="500" fill="#93c5fd" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="16" font-weight="700" text-anchor="middle">Feature Extraction Engine</text>
  <text x="450" y="524" fill="#93c5fd77" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="13" text-anchor="middle">faces | logos | text | embeddings | spectrograms</text>

  <!-- Arrow -->
  <polygon points="444,552 450,564 456,552" fill="#3b82f677"/>
  <text x="475" y="560" fill="#3b82f677" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="11" text-anchor="start">store</text>

  <!-- Layer 3: Tiered Storage -->
  <rect x="60" y="572" width="780" height="64" rx="10" fill="#f59e0b15" stroke="#f59e0b" stroke-width="2"/>
  <text x="450" y="598" fill="#fcd34d" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="16" font-weight="700" text-anchor="middle">Tiered Storage (hot &#x2192; cold &#x2192; archive)</text>
  <text x="450" y="622" fill="#fcd34d77" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="13" text-anchor="middle">Qdrant (hot) &#x2194; S3 Vectors (canonical) &#x2194; Archive</text>

  <!-- Arrow -->
  <polygon points="444,650 450,662 456,650" fill="#f59e0b77"/>
  <text x="475" y="658" fill="#f59e0b77" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="11" text-anchor="start">query</text>

  <!-- Layer 4: Retrieval -->
  <rect x="60" y="670" width="780" height="40" rx="10" fill="#10b98115" stroke="#10b981" stroke-width="2"/>
  <text x="450" y="696" fill="#6ee7b7" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="16" font-weight="700" text-anchor="middle">Multi-Stage Retrieval: filter &#x2192; sort &#x2192; reduce &#x2192; enrich &#x2192; reassemble</text>
</svg><figcaption style="margin-top: 12px; color: #666; font-size: 0.9em;">Fig 1: From Frankenstack to unified multimodal warehouse</figcaption></figure>
<!--kg-card-end: html-->
<h2 id="what-is-a-multimodal-data-warehouse">What Is a Multimodal Data Warehouse?</h2><p>A <strong>multimodal data warehouse</strong> is an integrated system that:</p><ol><li><strong>Ingests any data type</strong>&#x2014;video, audio, images, documents, 3D models, IoT streams&#x2014;through a single API</li><li><strong>Decomposes objects</strong> into their constituent features (a video becomes frames, audio segments, transcripts, detected faces, logos, scenes)</li><li><strong>Stores features across tiers</strong> with lifecycle management (hot for real-time queries, cold for cost-efficient archival, with automatic promotion/demotion)</li><li><strong>Reassembles objects</strong> through multi-stage retrieval pipelines that can filter, sort, reduce, enrich, and join across modalities</li><li><strong>Maintains lineage</strong>&#x2014;every extracted feature traces back to its source object, timestamp, and extraction model through <a href="https://mixpeek.com/docs/processing/feature-extractors?ref=blog.mixpeek.com">feature URIs</a></li></ol><p>Think of it as Snowflake, but for unstructured data. Or S3 + a vector database + an inference engine + a query planner, collapsed into a single abstraction.</p><h2 id="the-core-primitive-object-decomposition">The Core Primitive: Object Decomposition</h2><p>Traditional databases store data as-is. You put a row in, you get a row out. But unstructured data is <em>dense</em>&#x2014;a single 30-second video contains:</p>
<!--kg-card-begin: html-->
<table style="width: 100%; border-collapse: collapse; margin: 24px 0; font-size: 0.95em;">
<thead>
<tr style="background: #f8f7ff; border-bottom: 2px solid #7c3aed;">
<th style="padding: 12px 16px; text-align: left; font-weight: 600;">Signal Type</th>
<th style="padding: 12px 16px; text-align: left; font-weight: 600;">What&apos;s Extracted</th>
<th style="padding: 12px 16px; text-align: left; font-weight: 600;">Typical Output</th>
</tr>
</thead>
<tbody>
<tr style="border-bottom: 1px solid #eee;">
<td style="padding: 12px 16px;"><strong>Visual frames</strong></td>
<td style="padding: 12px 16px;">Scene boundaries, keyframes</td>
<td style="padding: 12px 16px;">15-30 scene segments with thumbnails</td>
</tr>
<tr style="border-bottom: 1px solid #eee; background: #fafafa;">
<td style="padding: 12px 16px;"><strong>Face embeddings</strong></td>
<td style="padding: 12px 16px;">SCRFD detection &#x2192; ArcFace 512d vectors</td>
<td style="padding: 12px 16px;">Per-face identity embeddings at 99.8% accuracy</td>
</tr>
<tr style="border-bottom: 1px solid #eee;">
<td style="padding: 12px 16px;"><strong>Logo detection</strong></td>
<td style="padding: 12px 16px;">YOLOv8 detection &#x2192; SigLIP 768d embeddings</td>
<td style="padding: 12px 16px;">Brand identifications with bounding boxes</td>
</tr>
<tr style="border-bottom: 1px solid #eee; background: #fafafa;">
<td style="padding: 12px 16px;"><strong>Audio fingerprint</strong></td>
<td style="padding: 12px 16px;">Mel spectrogram &#x2192; CLAP embeddings</td>
<td style="padding: 12px 16px;">Audio signatures, music identification</td>
</tr>
<tr style="border-bottom: 1px solid #eee;">
<td style="padding: 12px 16px;"><strong>Transcript</strong></td>
<td style="padding: 12px 16px;">Whisper ASR &#x2192; word-level timestamps</td>
<td style="padding: 12px 16px;">Full text with temporal alignment</td>
</tr>
<tr style="border-bottom: 1px solid #eee; background: #fafafa;">
<td style="padding: 12px 16px;"><strong>Semantic embeddings</strong></td>
<td style="padding: 12px 16px;">SigLIP (visual), CLAP (audio), text models</td>
<td style="padding: 12px 16px;">Dense vectors for cross-modal search</td>
</tr>
<tr style="border-bottom: 1px solid #eee;">
<td style="padding: 12px 16px;"><strong>Structured metadata</strong></td>
<td style="padding: 12px 16px;">LLM-powered labeling and taxonomy assignment</td>
<td style="padding: 12px 16px;">Categories, tags, descriptions, sentiment</td>
</tr>
</tbody>
</table>
<!--kg-card-end: html-->
<p>A single video file becomes <strong>dozens of queryable features</strong>, each with its own embedding space, each stored with a <a href="https://mixpeek.com/docs/processing/feature-extractors?ref=blog.mixpeek.com">feature URI</a> that links back to the source:</p><pre><code>// Feature URI format &#x2014; every extracted signal is addressable

mixpeek://face_extractor@v1/embedding     &#x2192; ArcFace 512d vector
mixpeek://logo_extractor@v1/detection     &#x2192; YOLO bounding box + SigLIP vector
mixpeek://audio_extractor@v1/fingerprint  &#x2192; Mel spectrogram embedding
mixpeek://video_preprocessor@v1/scene     &#x2192; Scene boundary + keyframe
mixpeek://text_extractor@v1/transcript    &#x2192; Whisper ASR output

// One video in, many features out &#x2014; each independently queryable
// Each feature knows: what extracted it, when, from what source, at what timestamp</code></pre><p>This is the fundamental insight: <strong>you don&apos;t search unstructured data&#x2014;you search the features extracted from it.</strong> And different features require different models, different embedding spaces, and different query patterns. The warehouse handles this heterogeneity natively.</p><h2 id="storage-tiering-the-economics-of-multimodal">Storage Tiering: The Economics of Multimodal</h2><p>Here&apos;s where most vector database architectures fall apart: <strong>cost</strong>.</p><p>Storing every embedding in a hot vector index (Qdrant, Pinecone) works at 10K documents. At 10M documents with 5 feature types each, you&apos;re looking at 50M vectors in RAM. At $0.10/GB/month for cloud memory, that&apos;s a non-trivial line item&#x2014;and it grows linearly with every new modality you add.</p><p>A multimodal data warehouse needs <a href="https://mixpeek.com/docs/features/storage-tiering?ref=blog.mixpeek.com"><strong>storage tiering</strong></a>&#x2014;the same concept Snowflake uses for structured data, applied to vectors and features:</p>
<!--kg-card-begin: html-->
<table style="width: 100%; border-collapse: collapse; margin: 24px 0; font-size: 0.95em;">
<thead>
<tr style="background: #f8f7ff; border-bottom: 2px solid #7c3aed;">
<th style="padding: 12px 16px; text-align: left; font-weight: 600;">Tier</th>
<th style="padding: 12px 16px; text-align: left; font-weight: 600;">Storage</th>
<th style="padding: 12px 16px; text-align: left; font-weight: 600;">Latency</th>
<th style="padding: 12px 16px; text-align: left; font-weight: 600;">Cost</th>
<th style="padding: 12px 16px; text-align: left; font-weight: 600;">Use Case</th>
</tr>
</thead>
<tbody>
<tr style="border-bottom: 1px solid #eee;">
<td style="padding: 12px 16px;"><strong style="color: #ef4444;">Hot</strong></td>
<td style="padding: 12px 16px;">Qdrant (in-memory HNSW)</td>
<td style="padding: 12px 16px;">&lt; 10ms</td>
<td style="padding: 12px 16px;">$$$</td>
<td style="padding: 12px 16px;">Real-time search, active collections</td>
</tr>
<tr style="border-bottom: 1px solid #eee; background: #fafafa;">
<td style="padding: 12px 16px;"><strong style="color: #f59e0b;">Warm</strong></td>
<td style="padding: 12px 16px;">S3 Vectors (canonical store)</td>
<td style="padding: 12px 16px;">50-200ms</td>
<td style="padding: 12px 16px;">$$</td>
<td style="padding: 12px 16px;">Batch analytics, infrequent queries</td>
</tr>
<tr style="border-bottom: 1px solid #eee;">
<td style="padding: 12px 16px;"><strong style="color: #3b82f6;">Cold</strong></td>
<td style="padding: 12px 16px;">S3 (vectors only, no index)</td>
<td style="padding: 12px 16px;">200ms-1s</td>
<td style="padding: 12px 16px;">$</td>
<td style="padding: 12px 16px;">Compliance, archival, reprocessing</td>
</tr>
<tr style="border-bottom: 1px solid #eee; background: #fafafa;">
<td style="padding: 12px 16px;"><strong style="color: #6b7280;">Archive</strong></td>
<td style="padding: 12px 16px;">Metadata only</td>
<td style="padding: 12px 16px;">N/A (rehydrate)</td>
<td style="padding: 12px 16px;">&#xA2;</td>
<td style="padding: 12px 16px;">Long-term retention, lineage</td>
</tr>
</tbody>
</table>
<!--kg-card-end: html-->
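<p>A quick footprint sketch helps frame the table above. The embedding width and the hot/warm/cold split below are illustrative assumptions, not measured figures; plug in your own corpus size and unit prices.</p>
<pre><code class="language-python"># Back-of-envelope footprint for 10M documents x 5 feature types (illustrative)
DOCS, FEATURES_PER_DOC, DIMS, BYTES = 10_000_000, 5, 768, 4

vectors = DOCS * FEATURES_PER_DOC            # 50M vectors
total_gb = vectors * DIMS * BYTES / 1e9      # ~154 GB of raw float32

# Assumed lifecycle split: 10% hot, 60% warm, 30% cold
split = {&quot;hot (Qdrant)&quot;: 0.10, &quot;warm (S3 Vectors)&quot;: 0.60, &quot;cold (S3)&quot;: 0.30}
for tier, share in split.items():
    print(f&quot;{tier}: {total_gb * share:.0f} GB&quot;)

# Multiply each tier by its unit price to get a monthly bill;
# only the hot slice needs RAM-backed index capacity.</code></pre>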
<p>S3 Vectors serves as the <strong>canonical store</strong>&#x2014;the source of truth for all features. Qdrant is the hot serving layer, loaded on demand. Collections automatically transition through lifecycle states based on access patterns: <code>active &#x2192; cold &#x2192; archived</code>.</p><p>This is how you go from &quot;we can&apos;t afford to index everything&quot; to &quot;we index everything, and the system manages cost automatically.&quot;</p><h2 id="multi-stage-retrieval-the-query-language-for-unstructured-data">Multi-Stage Retrieval: The Query Language for Unstructured Data</h2><p>SQL works for structured data because every column has a known type and every row has the same schema. Unstructured data has no such luxury. A query like <em>&quot;find all videos where a celebrity appears near a competitor&apos;s logo, with negative sentiment in the audio&quot;</em> spans three modalities, two embedding spaces, and requires temporal correlation.</p><p>This is where <a href="https://mixpeek.com/docs/retrieval/retrievers?ref=blog.mixpeek.com"><strong>multi-stage retrieval pipelines</strong></a> come in. Instead of a single query, you compose a pipeline of stages:</p>
<!--kg-card-begin: html-->
<figure style="margin: 40px 0; text-align: center;"><svg xmlns="http://www.w3.org/2000/svg" viewbox="0 0 900 700" width="900" height="700">
  <defs>
    <lineargradient id="bg2" x1="0%" y1="0%" x2="100%" y2="100%">
      <stop offset="0%" style="stop-color:#0f0f23"/>
      <stop offset="100%" style="stop-color:#1a1a3e"/>
    </lineargradient>
  </defs>

  <rect width="900" height="700" rx="16" fill="url(#bg2)"/>

  <!-- Title -->
  <text x="450" y="44" fill="#c4b5fd" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="22" font-weight="700" text-anchor="middle" letter-spacing="1">MULTI-STAGE RETRIEVAL PIPELINE</text>
  <text x="450" y="68" fill="#c4b5fd55" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="13" text-anchor="middle">The SELECT statement for unstructured data</text>

  <!-- Stage 1: FILTER (face search) -->
  <rect x="100" y="92" width="700" height="100" rx="12" fill="#7c3aed12" stroke="#7c3aed" stroke-width="2"/>
  <rect x="100" y="92" width="700" height="30" rx="12" fill="#7c3aed33"/>
  <rect x="100" y="110" width="700" height="12" fill="#7c3aed33"/>
  <text x="120" y="113" fill="#e9d5ff" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="14" font-weight="700">Stage 1: FILTER</text>
  <text x="780" y="113" fill="#7c3aed99" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="12" text-anchor="end">feature_search</text>
  <text x="130" y="144" fill="#c4b5fdcc" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="14">&quot;Find faces matching Celebrity X&quot;</text>
  <text x="130" y="166" fill="#c4b5fd77" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="13">ArcFace embedding search | cosine threshold: 0.28</text>
  <rect x="560" y="146" width="220" height="30" rx="6" fill="#7c3aed22" stroke="#7c3aed66" stroke-width="1"/>
  <text x="670" y="166" fill="#c4b5fd" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="14" font-weight="600" text-anchor="middle">847 candidates</text>

  <!-- Arrow -->
  <polygon points="444,204 450,218 456,204" fill="#7c3aed66"/>

  <!-- Stage 2: FILTER (logo search) -->
  <rect x="100" y="226" width="700" height="100" rx="12" fill="#3b82f612" stroke="#3b82f6" stroke-width="2"/>
  <rect x="100" y="226" width="700" height="30" rx="12" fill="#3b82f633"/>
  <rect x="100" y="244" width="700" height="12" fill="#3b82f633"/>
  <text x="120" y="247" fill="#bfdbfe" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="14" font-weight="700">Stage 2: FILTER</text>
  <text x="780" y="247" fill="#3b82f699" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="12" text-anchor="end">feature_search</text>
  <text x="130" y="278" fill="#93c5fdcc" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="14">&quot;With competitor logos present&quot;</text>
  <text x="130" y="300" fill="#93c5fd77" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="13">SigLIP embedding search | filtered to Stage 1</text>
  <rect x="560" y="280" width="220" height="30" rx="6" fill="#3b82f622" stroke="#3b82f666" stroke-width="1"/>
  <text x="670" y="300" fill="#93c5fd" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="14" font-weight="600" text-anchor="middle">23 documents</text>

  <!-- Arrow -->
  <polygon points="444,338 450,352 456,338" fill="#3b82f666"/>

  <!-- Stage 3: SORT -->
  <rect x="100" y="360" width="700" height="100" rx="12" fill="#f59e0b12" stroke="#f59e0b" stroke-width="2"/>
  <rect x="100" y="360" width="700" height="30" rx="12" fill="#f59e0b33"/>
  <rect x="100" y="378" width="700" height="12" fill="#f59e0b33"/>
  <text x="120" y="381" fill="#fef3c7" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="14" font-weight="700">Stage 3: SORT</text>
  <text x="780" y="381" fill="#f59e0b99" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="12" text-anchor="end">score_linear</text>
  <text x="130" y="412" fill="#fcd34dcc" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="14">&quot;Rank by negative audio sentiment&quot;</text>
  <text x="130" y="434" fill="#fcd34d77" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="13">sentiment(0.6) + recency(0.3) + engagement(0.1)</text>
  <rect x="560" y="414" width="220" height="30" rx="6" fill="#f59e0b22" stroke="#f59e0b66" stroke-width="1"/>
  <text x="670" y="434" fill="#fcd34d" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="14" font-weight="600" text-anchor="middle">23 reordered</text>

  <!-- Arrow -->
  <polygon points="444,472 450,486 456,472" fill="#f59e0b66"/>

  <!-- Stage 4: REDUCE -->
  <rect x="100" y="494" width="700" height="80" rx="12" fill="#10b98112" stroke="#10b981" stroke-width="2"/>
  <rect x="100" y="494" width="700" height="30" rx="12" fill="#10b98133"/>
  <rect x="100" y="512" width="700" height="12" fill="#10b98133"/>
  <text x="120" y="515" fill="#d1fae5" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="14" font-weight="700">Stage 4: REDUCE</text>
  <text x="780" y="515" fill="#10b98199" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="12" text-anchor="end">sampling</text>
  <text x="130" y="548" fill="#6ee7b7cc" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="14">&quot;Top 5 most relevant&quot; | deduplication + sampling</text>
  <rect x="560" y="534" width="220" height="28" rx="6" fill="#10b98122" stroke="#10b98166" stroke-width="1"/>
  <text x="670" y="553" fill="#6ee7b7" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="14" font-weight="600" text-anchor="middle">5 documents</text>

  <!-- Arrow -->
  <polygon points="444,586 450,600 456,586" fill="#10b98166"/>

  <!-- Stage 5: ENRICH (the semantic join) -->
  <rect x="100" y="608" width="700" height="80" rx="12" fill="#ec489912" stroke="#ec4899" stroke-width="2"/>
  <rect x="100" y="608" width="700" height="30" rx="12" fill="#ec489933"/>
  <rect x="100" y="626" width="700" height="12" fill="#ec489933"/>
  <text x="120" y="629" fill="#fce7f3" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="14" font-weight="700">Stage 5: ENRICH</text>
  <text x="608" y="629" fill="#ec489999" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="12">THE SEMANTIC JOIN</text>
  <text x="130" y="660" fill="#f9a8d4cc" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="14">&quot;Join with brand safety scores&quot; | cross-collection</text>
  <rect x="560" y="648" width="220" height="28" rx="6" fill="#ec489922" stroke="#ec489966" stroke-width="1"/>
  <text x="670" y="667" fill="#f9a8d4" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="14" font-weight="600" text-anchor="middle">5 enriched docs</text>
</svg><figcaption style="margin-top: 12px; color: #666; font-size: 0.9em;">Fig 2: A retrieval pipeline is the SELECT statement for unstructured data</figcaption></figure>
<!--kg-card-end: html-->
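<p>The <code>score_linear</code> stage in Fig 2 is easiest to picture as a plain weighted sum. The sketch below reuses the weights from the figure; the field names and normalization are assumptions for illustration, not the platform&apos;s exact implementation.</p>
<pre><code class="language-python"># Minimal sketch of a linear scoring stage: score = sum(weight_i * signal_i)
WEIGHTS = {&quot;sentiment_negativity&quot;: 0.6, &quot;recency&quot;: 0.3, &quot;engagement&quot;: 0.1}

def score_linear(doc: dict) -&gt; float:
    # each signal is assumed to be pre-normalized into [0, 1]
    return sum(w * doc.get(field, 0.0) for field, w in WEIGHTS.items())

candidates = [
    {&quot;id&quot;: &quot;vid_017&quot;, &quot;sentiment_negativity&quot;: 0.92, &quot;recency&quot;: 0.40, &quot;engagement&quot;: 0.10},
    {&quot;id&quot;: &quot;vid_042&quot;, &quot;sentiment_negativity&quot;: 0.55, &quot;recency&quot;: 0.95, &quot;engagement&quot;: 0.80},
]
ranked = sorted(candidates, key=score_linear, reverse=True)
print([c[&quot;id&quot;] for c in ranked])</code></pre>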
<p>Each stage type serves a specific purpose in the pipeline:</p>
<!--kg-card-begin: html-->
<table style="width: 100%; border-collapse: collapse; margin: 24px 0; font-size: 0.95em;">
<thead>
<tr style="background: #f8f7ff; border-bottom: 2px solid #7c3aed;">
<th style="padding: 12px 16px; text-align: left; font-weight: 600;">Stage Type</th>
<th style="padding: 12px 16px; text-align: left; font-weight: 600;">Purpose</th>
<th style="padding: 12px 16px; text-align: left; font-weight: 600;">SQL Analogy</th>
<th style="padding: 12px 16px; text-align: left; font-weight: 600;">Implementations</th>
</tr>
</thead>
<tbody>
<tr style="border-bottom: 1px solid #eee;">
<td style="padding: 12px 16px;"><strong>Filter</strong></td>
<td style="padding: 12px 16px;">Narrow result set by features</td>
<td style="padding: 12px 16px;"><code>WHERE</code></td>
<td style="padding: 12px 16px;">feature_search, metadata_filter, boolean_filter</td>
</tr>
<tr style="border-bottom: 1px solid #eee; background: #fafafa;">
<td style="padding: 12px 16px;"><strong>Sort</strong></td>
<td style="padding: 12px 16px;">Reorder by relevance scores</td>
<td style="padding: 12px 16px;"><code>ORDER BY</code></td>
<td style="padding: 12px 16px;">score_linear, reciprocal_rank_fusion, cross_encoder</td>
</tr>
<tr style="border-bottom: 1px solid #eee;">
<td style="padding: 12px 16px;"><strong>Reduce</strong></td>
<td style="padding: 12px 16px;">Downsample, deduplicate, aggregate</td>
<td style="padding: 12px 16px;"><code>LIMIT</code> / <code>GROUP BY</code></td>
<td style="padding: 12px 16px;">sampling, clustering, deduplication</td>
</tr>
<tr style="border-bottom: 1px solid #eee; background: #fafafa;">
<td style="padding: 12px 16px;"><strong>Enrich</strong></td>
<td style="padding: 12px 16px;">Join data from other collections</td>
<td style="padding: 12px 16px;"><code>JOIN</code></td>
<td style="padding: 12px 16px;">document_enrich (the &quot;semantic join&quot;)</td>
</tr>
<tr style="border-bottom: 1px solid #eee;">
<td style="padding: 12px 16px;"><strong>Apply</strong></td>
<td style="padding: 12px 16px;">Transform results (LLM, classification)</td>
<td style="padding: 12px 16px;"><code>SELECT func()</code></td>
<td style="padding: 12px 16px;">llm_apply, classifier, reranker</td>
</tr>
</tbody>
</table>
<!--kg-card-end: html-->
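<p>Composed together, the stages read like a query plan. The sketch below chains a filter, a sort, and a reduce into one pipeline definition; the stage names echo the table and Fig 2, but the payload shape is an illustration rather than a verbatim API reference.</p>
<pre><code class="language-python"># Hypothetical pipeline definition: WHERE / ORDER BY / LIMIT for unstructured data
pipeline = [
    {   # FILTER: face similarity against a reference embedding
        &quot;stage&quot;: &quot;filter&quot;,
        &quot;implementation&quot;: &quot;feature_search&quot;,
        &quot;feature&quot;: &quot;mixpeek://face_extractor@v1/embedding&quot;,
        &quot;query&quot;: {&quot;reference_id&quot;: &quot;celebrity_x&quot;, &quot;min_score&quot;: 0.28},
    },
    {   # SORT: weighted linear rerank of the survivors
        &quot;stage&quot;: &quot;sort&quot;,
        &quot;implementation&quot;: &quot;score_linear&quot;,
        &quot;weights&quot;: {&quot;sentiment_negativity&quot;: 0.6, &quot;recency&quot;: 0.3, &quot;engagement&quot;: 0.1},
    },
    {   # REDUCE: dedupe and keep the top handful
        &quot;stage&quot;: &quot;reduce&quot;,
        &quot;implementation&quot;: &quot;sampling&quot;,
        &quot;limit&quot;: 5,
    },
]
# The list would be submitted to a retriever endpoint; URL and auth below are placeholders.
# requests.post(&quot;https://api.example.com/retrievers/brand-safety/execute&quot;, json={&quot;stages&quot;: pipeline})</code></pre>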
<h3 id="the-semantic-join-cross-modal-sql">The Semantic Join: Cross-Modal SQL</h3><p>The <code>document_enrich</code> stage deserves special attention. It&apos;s essentially a <a href="https://mixpeek.com/blog/adtech-governance?ref=blog.mixpeek.com"><strong>semantic join</strong></a>&#x2014;the ability to join results from one collection with data from another based on feature similarity, not foreign keys.</p><p>In SQL, you write <code>JOIN orders ON users.id = orders.user_id</code>. In a multimodal warehouse, you write:</p><pre><code class="language-json">// Semantic join: enrich video results with brand safety scores
{
  &quot;stage_type&quot;: &quot;enrich&quot;,
  &quot;stage_id&quot;: &quot;document_enrich&quot;,
  &quot;config&quot;: {
    &quot;target_namespace&quot;: &quot;brand-safety-scores&quot;,
    &quot;join_feature&quot;: &quot;mixpeek://logo_extractor@v1/embedding&quot;,
    &quot;attach_fields&quot;: [&quot;risk_score&quot;, &quot;brand_name&quot;, &quot;clearance_status&quot;]
  }
}</code></pre><p>No foreign keys. No schema alignment. The join happens in embedding space&#x2014;features from Collection A are matched to features in Collection B by vector similarity. This is how you connect a video corpus to a brand safety database without ever mapping IDs.</p><h2 id="taxonomies-the-schema-for-unstructured-data">Taxonomies: The Schema for Unstructured Data</h2><p>Structured data has schemas. Multimodal data has <a href="https://mixpeek.com/docs/enrichment/taxonomies?ref=blog.mixpeek.com"><strong>taxonomies</strong></a>&#x2014;hierarchical classification systems that bring order to extracted features.</p><p>Taxonomies in a multimodal warehouse operate in three modes:</p>
<!--kg-card-begin: html-->
<table style="width: 100%; border-collapse: collapse; margin: 24px 0; font-size: 0.95em;">
<thead>
<tr style="background: #f8f7ff; border-bottom: 2px solid #7c3aed;">
<th style="padding: 12px 16px; text-align: left; font-weight: 600;">Mode</th>
<th style="padding: 12px 16px; text-align: left; font-weight: 600;">When It Runs</th>
<th style="padding: 12px 16px; text-align: left; font-weight: 600;">Use Case</th>
</tr>
</thead>
<tbody>
<tr style="border-bottom: 1px solid #eee;">
<td style="padding: 12px 16px;"><strong>Materialized</strong></td>
<td style="padding: 12px 16px;">At ingestion time</td>
<td style="padding: 12px 16px;">Known categories&#x2014;&quot;is this face a celebrity?&quot; &quot;which IAB category?&quot;</td>
</tr>
<tr style="border-bottom: 1px solid #eee; background: #fafafa;">
<td style="padding: 12px 16px;"><strong>On-demand</strong></td>
<td style="padding: 12px 16px;">At query time</td>
<td style="padding: 12px 16px;">Ad-hoc classification&#x2014;&quot;group these by sentiment&quot; &quot;cluster by visual style&quot;</td>
</tr>
<tr style="border-bottom: 1px solid #eee;">
<td style="padding: 12px 16px;"><strong>Retroactive</strong></td>
<td style="padding: 12px 16px;">Batch over existing data</td>
<td style="padding: 12px 16px;">New taxonomy applied to historical corpus&#x2014;&quot;re-classify all assets with updated brand list&quot;</td>
</tr>
</tbody>
</table>
<!--kg-card-end: html-->
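<p>As a sketch of what the retroactive mode might look like in practice, here is a hypothetical batch job description. The names and fields are illustrative; the point is that the job references an existing feature, a new taxonomy version, and a scope, with no re-ingestion of raw assets.</p>
<pre><code class="language-python"># Hypothetical retroactive taxonomy job (illustrative shape only)
retro_job = {
    &quot;taxonomy&quot;: &quot;brand_safety_v2&quot;,           # updated classification tree
    &quot;mode&quot;: &quot;retroactive&quot;,
    &quot;source_feature&quot;: &quot;mixpeek://logo_extractor@v1/embedding&quot;,
    &quot;scope&quot;: {&quot;collection&quot;: &quot;published_ads&quot;, &quot;created_before&quot;: &quot;2026-01-01&quot;},
    &quot;write_back&quot;: &quot;metadata.brand_safety&quot;,    # where the new labels land
}
# Existing embeddings are re-classified in place; the raw videos are never re-processed.</code></pre>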
<p>This is the equivalent of <code>ALTER TABLE ADD COLUMN</code> for unstructured data. When your brand safety list changes, you don&apos;t re-ingest everything&#x2014;you apply a retroactive taxonomy that reclassifies existing features in place.</p><h2 id="object-reassembly-from-features-back-to-answers">Object Reassembly: From Features Back to Answers</h2><p>Decomposition without reassembly is just feature extraction. The power of a multimodal warehouse is in the <strong>round trip</strong>: you decompose objects into features for storage and search, then reassemble them into coherent answers at query time.</p>
<!--kg-card-begin: html-->
<figure style="margin: 40px 0; text-align: center;"><svg xmlns="http://www.w3.org/2000/svg" viewbox="0 0 900 740" width="900" height="740">
  <defs>
    <lineargradient id="bg3" x1="0%" y1="0%" x2="100%" y2="100%">
      <stop offset="0%" style="stop-color:#0f0f23"/>
      <stop offset="100%" style="stop-color:#1a1a3e"/>
    </lineargradient>
  </defs>

  <rect width="900" height="740" rx="16" fill="url(#bg3)"/>

  <!-- Title -->
  <text x="450" y="44" fill="#c4b5fd" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="22" font-weight="700" text-anchor="middle" letter-spacing="1">OBJECT LIFECYCLE IN A MULTIMODAL WAREHOUSE</text>
  <text x="450" y="68" fill="#c4b5fd55" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="13" text-anchor="middle">Ingest &#x2192; Decompose &#x2192; Store &#x2192; Query &#x2192; Reassemble</text>

  <!-- === ROW 1: INGEST | DECOMPOSE | STORE === -->

  <!-- INGEST box -->
  <text x="105" y="108" fill="#7c3aed99" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="13" font-weight="700" text-anchor="middle">INGEST</text>
  <rect x="30" y="118" width="150" height="160" rx="12" fill="#7c3aed15" stroke="#7c3aed" stroke-width="2"/>
  <!-- Video icon (simplified) -->
  <rect x="55" y="142" width="100" height="70" rx="6" fill="#7c3aed22" stroke="#7c3aed88" stroke-width="1"/>
  <polygon points="90,165 90,195 115,180" fill="#7c3aed88"/>
  <text x="105" y="236" fill="#c4b5fdcc" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="16" font-weight="600" text-anchor="middle">video.mp4</text>
  <text x="105" y="258" fill="#c4b5fd66" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="11" text-anchor="middle">any modality</text>

  <!-- Arrow INGEST -> DECOMPOSE -->
  <line x1="180" y1="198" x2="225" y2="198" stroke="#7c3aed66" stroke-width="2"/>
  <polygon points="223,192 237,198 223,204" fill="#7c3aed88"/>

  <!-- DECOMPOSE box -->
  <text x="410" y="108" fill="#3b82f699" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="13" font-weight="700" text-anchor="middle">DECOMPOSE</text>
  <rect x="240" y="118" width="340" height="160" rx="12" fill="#3b82f612" stroke="#3b82f6" stroke-width="2"/>

  <!-- Feature rows -->
  <rect x="260" y="132" width="300" height="28" rx="4" fill="#3b82f618"/>
  <text x="275" y="151" fill="#93c5fdcc" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="13">scenes  &#x2192; frame embeddings</text>

  <rect x="260" y="166" width="300" height="28" rx="4" fill="#7c3aed18"/>
  <text x="275" y="185" fill="#c4b5fdcc" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="13">faces   &#x2192; ArcFace 512d</text>

  <rect x="260" y="200" width="300" height="28" rx="4" fill="#3b82f618"/>
  <text x="275" y="219" fill="#93c5fdcc" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="13">audio   &#x2192; CLAP embeddings</text>

  <rect x="260" y="234" width="300" height="28" rx="4" fill="#7c3aed18"/>
  <text x="275" y="253" fill="#c4b5fdcc" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="13">logos   &#x2192; SigLIP 768d</text>

  <!-- Arrow DECOMPOSE -> STORE -->
  <line x1="580" y1="198" x2="625" y2="198" stroke="#3b82f666" stroke-width="2"/>
  <polygon points="623,192 637,198 623,204" fill="#3b82f688"/>

  <!-- STORE box -->
  <text x="760" y="108" fill="#f59e0b99" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="13" font-weight="700" text-anchor="middle">STORE</text>
  <rect x="640" y="118" width="230" height="160" rx="12" fill="#f59e0b12" stroke="#f59e0b" stroke-width="2"/>

  <!-- Storage tiers -->
  <rect x="660" y="136" width="190" height="36" rx="6" fill="#ef444422" stroke="#ef444477" stroke-width="1"/>
  <text x="755" y="158" fill="#fca5a5" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="13" font-weight="600" text-anchor="middle">HOT &#x2022; Qdrant</text>

  <rect x="660" y="180" width="190" height="36" rx="6" fill="#f59e0b22" stroke="#f59e0b77" stroke-width="1"/>
  <text x="755" y="202" fill="#fcd34d" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="13" font-weight="600" text-anchor="middle">WARM &#x2022; S3 Vectors</text>

  <rect x="660" y="224" width="190" height="36" rx="6" fill="#6b728022" stroke="#6b728077" stroke-width="1"/>
  <text x="755" y="246" fill="#9ca3af" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="13" font-weight="600" text-anchor="middle">COLD &#x2022; Archive</text>

  <!-- === DIVIDER === -->
  <line x1="80" y1="310" x2="820" y2="310" stroke="#ffffff12" stroke-width="1"/>

  <!-- Query label -->
  <rect x="60" y="326" width="780" height="40" rx="8" fill="#10b98112" stroke="#10b98155" stroke-width="1" stroke-dasharray="4,4"/>
  <text x="450" y="351" fill="#6ee7b7cc" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="14" font-weight="600" text-anchor="middle">QUERY: &quot;celebrity near competitor logo, negative audio&quot;</text>

  <!-- Arrow down to reassemble -->
  <polygon points="444,378 450,392 456,378" fill="#10b98166"/>

  <!-- === REASSEMBLE SECTION === -->
  <text x="450" y="416" fill="#10b981" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="18" font-weight="700" text-anchor="middle">REASSEMBLE</text>

  <!-- Pipeline steps -->
  <rect x="100" y="432" width="700" height="156" rx="12" fill="#10b98110" stroke="#10b981" stroke-width="2"/>

  <!-- Step rows -->
  <text x="130" y="462" fill="#6ee7b7cc" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="14">1.</text>
  <text x="160" y="462" fill="#6ee7b7cc" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="14">face search &#x2192; candidate videos</text>

  <text x="130" y="490" fill="#6ee7b7cc" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="14">2.</text>
  <text x="160" y="490" fill="#6ee7b7cc" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="14">logo filter &#x2192; narrow to competitor presence</text>

  <text x="130" y="518" fill="#6ee7b7cc" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="14">3.</text>
  <text x="160" y="518" fill="#6ee7b7cc" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="14">sentiment sort &#x2192; rank by negativity</text>

  <text x="130" y="546" fill="#6ee7b7cc" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="14">4.</text>
  <text x="160" y="546" fill="#6ee7b7cc" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="14">enrich &#x2192; attach brand context</text>

  <text x="130" y="574" fill="#6ee7b7cc" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="14">5.</text>
  <text x="160" y="574" fill="#6ee7b7cc" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="14">return &#x2192; video clips + timestamps + scores</text>

  <!-- Arrow down to result -->
  <polygon points="444,600 450,614 456,600" fill="#10b98166"/>

  <!-- Result box -->
  <rect x="100" y="622" width="700" height="106" rx="12" fill="#10b98118" stroke="#10b981" stroke-width="2"/>
  <text x="450" y="648" fill="#6ee7b7" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="15" font-weight="700" text-anchor="middle">Result: 5 video segments</text>

  <text x="160" y="672" fill="#6ee7b7aa" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="13">&#x2022; Source video URL + timestamp range</text>
  <text x="160" y="692" fill="#6ee7b7aa" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="13">&#x2022; Celebrity (0.94) &#x2022; Logo: &quot;Nike&quot; &#x2022; Sentiment: -0.73</text>
  <text x="160" y="712" fill="#ef4444cc" font-family="&apos;SF Mono&apos;,&apos;Fira Code&apos;,&apos;Consolas&apos;,monospace" font-size="13" font-weight="600">&#x2022; Brand Safety: HIGH RISK</text>
</svg><figcaption style="margin-top: 12px; color: #666; font-size: 0.9em;">Fig 3: The full lifecycle&#x2014;ingest, decompose, store, query, reassemble</figcaption></figure>
<!--kg-card-end: html-->
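<p>To make Fig 3 concrete, a single reassembled answer for that query might look roughly like the record below. The field names and scores are illustrative, not a verbatim API response, but every value traces back to a specific extractor and timestamp.</p>
<pre><code class="language-python"># Illustrative shape of one reassembled result
result = {
    &quot;source&quot;: {&quot;video_url&quot;: &quot;s3://assets/campaign_0417.mp4&quot;, &quot;start&quot;: 12.4, &quot;end&quot;: 18.9},
    &quot;matches&quot;: {
        &quot;face&quot;:  {&quot;label&quot;: &quot;celebrity_x&quot;, &quot;score&quot;: 0.94,
                  &quot;feature&quot;: &quot;mixpeek://face_extractor@v1/embedding&quot;},
        &quot;logo&quot;:  {&quot;label&quot;: &quot;Nike&quot;, &quot;score&quot;: 0.88,
                  &quot;feature&quot;: &quot;mixpeek://logo_extractor@v1/detection&quot;},
        &quot;audio&quot;: {&quot;sentiment&quot;: -0.73,
                  &quot;feature&quot;: &quot;mixpeek://audio_extractor@v1/fingerprint&quot;},
    },
    &quot;enrichment&quot;: {&quot;brand_safety&quot;: &quot;HIGH_RISK&quot;, &quot;joined_from&quot;: &quot;brand-safety-scores&quot;},
}</code></pre>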
<p>The result isn&apos;t just &quot;document #47291 matched your query.&quot; It&apos;s a <strong>reassembled object</strong> with provenance: here&apos;s the video segment, here&apos;s why it matched, here&apos;s the confidence, here&apos;s the temporal context, and here&apos;s enriched metadata from related collections.</p><h2 id="the-architecture-how-it-actually-works">The Architecture: How It Actually Works</h2><p>At <a href="https://mixpeek.com/?ref=blog.mixpeek.com">Mixpeek</a>, we&apos;ve been building this for two years. Here&apos;s the stack:</p>
<!--kg-card-begin: html-->
<table style="width: 100%; border-collapse: collapse; margin: 24px 0; font-size: 0.95em;">
<thead>
<tr style="background: #f8f7ff; border-bottom: 2px solid #7c3aed;">
<th style="padding: 12px 16px; text-align: left; font-weight: 600;">Layer</th>
<th style="padding: 12px 16px; text-align: left; font-weight: 600;">Technology</th>
<th style="padding: 12px 16px; text-align: left; font-weight: 600;">Role</th>
</tr>
</thead>
<tbody>
<tr style="border-bottom: 1px solid #eee;">
<td style="padding: 12px 16px;"><strong>API Gateway</strong></td>
<td style="padding: 12px 16px;">FastAPI</td>
<td style="padding: 12px 16px;">Single REST API for all operations&#x2014;ingest, query, manage</td>
</tr>
<tr style="border-bottom: 1px solid #eee; background: #fafafa;">
<td style="padding: 12px 16px;"><strong>Task Queue</strong></td>
<td style="padding: 12px 16px;">Celery + Redis</td>
<td style="padding: 12px 16px;">Async batch processing for large ingestion jobs</td>
</tr>
<tr style="border-bottom: 1px solid #eee;">
<td style="padding: 12px 16px;"><strong>Inference Engine</strong></td>
<td style="padding: 12px 16px;">Ray Serve (14+ model endpoints)</td>
<td style="padding: 12px 16px;">Distributed GPU inference&#x2014;ArcFace, SigLIP, CLAP, Whisper, YOLO, LLMs</td>
</tr>
<tr style="border-bottom: 1px solid #eee; background: #fafafa;">
<td style="padding: 12px 16px;"><strong>Hot Storage</strong></td>
<td style="padding: 12px 16px;">Qdrant</td>
<td style="padding: 12px 16px;">In-memory HNSW index for real-time vector search</td>
</tr>
<tr style="border-bottom: 1px solid #eee;">
<td style="padding: 12px 16px;"><strong>Canonical Storage</strong></td>
<td style="padding: 12px 16px;">S3 Vectors</td>
<td style="padding: 12px 16px;">Durable source of truth for all features and embeddings</td>
</tr>
<tr style="border-bottom: 1px solid #eee; background: #fafafa;">
<td style="padding: 12px 16px;"><strong>Object Storage</strong></td>
<td style="padding: 12px 16px;">S3</td>
<td style="padding: 12px 16px;">Raw file storage with 15+ connectors (GCS, Azure, SFTP, URLs)</td>
</tr>
<tr style="border-bottom: 1px solid #eee;">
<td style="padding: 12px 16px;"><strong>Metadata</strong></td>
<td style="padding: 12px 16px;">MongoDB</td>
<td style="padding: 12px 16px;">Collection configs, batch tracking, lineage, taxonomies</td>
</tr>
<tr style="border-bottom: 1px solid #eee; background: #fafafa;">
<td style="padding: 12px 16px;"><strong>Analytics</strong></td>
<td style="padding: 12px 16px;">ClickHouse</td>
<td style="padding: 12px 16px;">Query performance, usage metrics, cost attribution</td>
</tr>
</tbody>
</table>
<!--kg-card-end: html-->
<p>The key insight: <strong>object storage is both the source and destination.</strong> Files come in from S3 (or any of 15+ connectors), get decomposed by the inference engine, and features are stored back into S3 Vectors as the canonical tier. Qdrant is an ephemeral hot cache that can be rebuilt from S3 Vectors at any time. The warehouse never loses data, even if the hot index goes down.</p><h2 id="use-cases-across-industries">Use Cases Across Industries</h2><p>A multimodal data warehouse isn&apos;t a solution looking for a problem. It&apos;s infrastructure for a class of problems that every enterprise with unstructured data faces:</p><h3 id="media-entertainment">Media &amp; Entertainment</h3><p><strong>Problem:</strong> A media company publishes 500+ assets/week. A single unauthorized celebrity face or brand logo can trigger $50K+ in legal costs.</p><p><strong>Solution:</strong> <a href="https://mixpeek.com/solutions/ip-safety?ref=blog.mixpeek.com">Pre-publication IP clearance</a>&#x2014;every asset is decomposed into faces, logos, and audio fingerprints, checked against reference corpora before publishing. Single image: ~200ms. 30-second video: ~2s.</p><p><strong>Try it:</strong> <a href="https://copyright.mixpeek.com/?ref=blog.mixpeek.com">Live demo &#x2192;</a></p><h3 id="advertising-brand-safety">Advertising &amp; Brand Safety</h3><p><strong>Problem:</strong> Brands need to verify their ads don&apos;t appear alongside objectionable content, and publishers need to classify user-generated video for ad placement.</p><p><strong>Solution:</strong> <a href="https://mixpeek.com/use-cases/brand-logo-video-detection?ref=blog.mixpeek.com">Multi-modal brand monitoring</a>&#x2014;decompose video into visual frames, audio, and transcript. Classify each frame for brand safety categories (IAB taxonomy). Flag logo presence. Score sentiment across modalities. The semantic join connects video features to brand safety databases in real time.</p><h3 id="insurance-claims-processing">Insurance &amp; Claims Processing</h3><p><strong>Problem:</strong> Claims arrive as a mix of photos, PDFs, voice recordings, and video evidence. Adjusters spend hours cross-referencing across formats.</p><p><strong>Solution:</strong> Ingest all claim documents through a single pipeline. Decompose photos into damage classifications, extract text from PDFs, transcribe voice memos, detect objects in video evidence. A multi-stage retrieval pipeline surfaces similar past claims, relevant policy terms, and fraud indicators&#x2014;all joined across modalities.</p><h3 id="e-commerce-retail">E-Commerce &amp; Retail</h3><p><strong>Problem:</strong> Product catalogs contain millions of images, videos, and descriptions across suppliers. Duplicate detection, counterfeit identification, and visual search all require different models.</p><p><strong>Solution:</strong> Decompose product assets into visual embeddings, text features, and brand identifiers. Storage tiering keeps active catalog in hot search, seasonal items in warm storage, and discontinued products in cold. Retroactive taxonomies reclassify the entire catalog when category structures change.</p><h3 id="healthcare-life-sciences">Healthcare &amp; Life Sciences</h3><p><strong>Problem:</strong> Medical imaging (X-rays, MRIs, pathology slides), clinical notes, genomic data, and sensor readings all need to be correlated for diagnosis support.</p><p><strong>Solution:</strong> Decompose imaging into region-level features. Extract entities from clinical notes. Embed genomic sequences. 
The multi-stage pipeline enables queries like <em>&quot;find patients with similar imaging features AND matching clinical history&quot;</em>&#x2014;a cross-modal join that&apos;s impossible in siloed systems.</p><h3 id="sports-live-events">Sports &amp; Live Events</h3><p><strong>Problem:</strong> Broadcasters need to identify players, detect sponsor logos, and provide real-time highlights from live video feeds.</p><p><strong>Solution:</strong> <a href="https://mixpeek.com/use-cases/celebrity-likeness-detection?ref=blog.mixpeek.com">Real-time face and logo detection</a> on video streams. Scene decomposition identifies key moments. Audio analysis detects crowd reactions. The retrieval pipeline assembles highlight packages: &quot;all moments where [Player X] appears + crowd noise peaks + sponsor logo visibility.&quot;</p><h2 id="why-now">Why Now?</h2><p>Three converging forces make the multimodal data warehouse inevitable:</p><h4 id="1-model-commoditization">1. Model Commoditization</h4><p>Open-source models (ArcFace, SigLIP, CLAP, Whisper, YOLO) are good enough for production. The bottleneck isn&apos;t inference quality&#x2014;it&apos;s the infrastructure to orchestrate, store, and query across models.</p><h4 id="2-vector-database-limitations">2. Vector Database Limitations</h4><p>Vector databases solve single-modality search. But real applications need multi-modal decomposition, cross-collection joins, storage tiering, and composable query pipelines. That&apos;s a warehouse, not a database.</p><h4 id="3-unstructured-data-explosion">3. Unstructured Data Explosion</h4><p>Enterprise video alone is growing 30% YoY. Every IoT sensor, security camera, and user-generated content platform is producing data that doesn&apos;t fit in a data warehouse&#x2014;yet. The multimodal warehouse is the missing tier.</p><h2 id="the-warehouse-analogy-goes-deep">The Warehouse Analogy Goes Deep</h2><p>This isn&apos;t just marketing. The parallels between structured data warehousing and multimodal data warehousing are structural:</p>
<!--kg-card-begin: html-->
<table style="width: 100%; border-collapse: collapse; margin: 24px 0; font-size: 0.95em;">
<thead>
<tr style="background: #f8f7ff; border-bottom: 2px solid #7c3aed;">
<th style="padding: 12px 16px; text-align: left; font-weight: 600;">Concept</th>
<th style="padding: 12px 16px; text-align: left; font-weight: 600;">Structured (Snowflake)</th>
<th style="padding: 12px 16px; text-align: left; font-weight: 600;">Multimodal (Mixpeek)</th>
</tr>
</thead>
<tbody>
<tr style="border-bottom: 1px solid #eee;">
<td style="padding: 12px 16px;"><strong>Schema</strong></td>
<td style="padding: 12px 16px;">Column types + constraints</td>
<td style="padding: 12px 16px;">Feature extractors + taxonomies</td>
</tr>
<tr style="border-bottom: 1px solid #eee; background: #fafafa;">
<td style="padding: 12px 16px;"><strong>Ingestion</strong></td>
<td style="padding: 12px 16px;">COPY INTO + transforms</td>
<td style="padding: 12px 16px;"><a href="https://mixpeek.com/docs/ingestion/connectors?ref=blog.mixpeek.com" style="color: #7c3aed;">Bucket upload + feature extraction</a></td>
</tr>
<tr style="border-bottom: 1px solid #eee;">
<td style="padding: 12px 16px;"><strong>Storage</strong></td>
<td style="padding: 12px 16px;">Micro-partitions (hot/cold)</td>
<td style="padding: 12px 16px;">Tiered vectors (Qdrant &#x2192; S3 Vectors &#x2192; Archive)</td>
</tr>
<tr style="border-bottom: 1px solid #eee; background: #fafafa;">
<td style="padding: 12px 16px;"><strong>Query</strong></td>
<td style="padding: 12px 16px;">SQL (SELECT, JOIN, GROUP BY)</td>
<td style="padding: 12px 16px;"><a href="https://mixpeek.com/docs/retrieval/retrievers?ref=blog.mixpeek.com" style="color: #7c3aed;">Multi-stage pipelines (filter, sort, reduce, enrich)</a></td>
</tr>
<tr style="border-bottom: 1px solid #eee;">
<td style="padding: 12px 16px;"><strong>Join</strong></td>
<td style="padding: 12px 16px;">Foreign key + equi-join</td>
<td style="padding: 12px 16px;">Semantic join (vector similarity across collections)</td>
</tr>
<tr style="border-bottom: 1px solid #eee; background: #fafafa;">
<td style="padding: 12px 16px;"><strong>Schema evolution</strong></td>
<td style="padding: 12px 16px;">ALTER TABLE</td>
<td style="padding: 12px 16px;">Retroactive taxonomy + re-extraction</td>
</tr>
<tr style="border-bottom: 1px solid #eee;">
<td style="padding: 12px 16px;"><strong>Materialization</strong></td>
<td style="padding: 12px 16px;">Materialized views</td>
<td style="padding: 12px 16px;">Materialized taxonomies + <a href="https://mixpeek.com/docs/enrichment/clusters?ref=blog.mixpeek.com" style="color: #7c3aed;">clusters</a></td>
</tr>
<tr style="border-bottom: 1px solid #eee; background: #fafafa;">
<td style="padding: 12px 16px;"><strong>Compute/storage separation</strong></td>
<td style="padding: 12px 16px;">Virtual warehouses</td>
<td style="padding: 12px 16px;">Ray Serve (autoscaling inference) + S3 Vectors (durable storage)</td>
</tr>
</tbody>
</table>
<!--kg-card-end: html-->
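<p>One way to internalize the table is to line a familiar SQL statement up against its pipeline counterpart, clause by clause. The mapping below is a loose sketch, not a formal grammar.</p>
<pre><code class="language-python"># Clause-by-clause translation (illustrative)
sql_to_pipeline = {
    &quot;WHERE celebrity = &apos;X&apos;&quot;:   (&quot;filter&quot;, &quot;feature_search on face embeddings&quot;),
    &quot;JOIN brand_scores ON ...&quot;: (&quot;enrich&quot;, &quot;document_enrich, matched by vector similarity&quot;),
    &quot;ORDER BY sentiment&quot;:       (&quot;sort&quot;,   &quot;score_linear over sentiment/recency/engagement&quot;),
    &quot;LIMIT 5&quot;:                  (&quot;reduce&quot;, &quot;sampling with deduplication&quot;),
}
for clause, (stage, impl) in sql_to_pipeline.items():
    print(clause, &quot;-&gt;&quot;, stage, &quot;/&quot;, impl)</code></pre>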
<h2 id="what-this-unlocks">What This Unlocks</h2><p>When you have a real multimodal warehouse&#x2014;not a stitched-together stack, but integrated decomposition, tiered storage, and composable retrieval&#x2014;new capabilities emerge:</p><ul><li><strong>Cross-modal correlation:</strong> &quot;Find me all instances where [this sound] plays while [this logo] is visible&quot;&#x2014;queries that span embedding spaces with temporal alignment</li><li><strong>Retroactive intelligence:</strong> New model drops? New taxonomy? Apply it to your entire historical corpus without re-ingestion</li><li><strong>Cost-proportional scaling:</strong> Hot data for real-time apps, cold data for compliance&#x2014;same API, automatic lifecycle management</li><li><strong>Semantic joins across modalities:</strong> Connect video features to audio features to document features&#x2014;the <code>JOIN</code> for unstructured data</li><li><strong>Composable pipelines:</strong> Build complex queries by snapping together stages, not writing custom code for each use case</li></ul><h2 id="getting-started">Getting Started</h2><p>If you want to see this in action:</p><ol><li><a href="https://copyright.mixpeek.com/?ref=blog.mixpeek.com"><strong>Try the live demo</strong></a>&#x2014;upload an image or video and see face, logo, and audio detection run in parallel</li><li><a href="https://mixpeek.com/docs?ref=blog.mixpeek.com"><strong>Read the docs</strong></a>&#x2014;the API is REST-first, with Python and TypeScript SDKs</li><li><a href="https://mixpeek.com/tutorials/ip-safety-pipeline?ref=blog.mixpeek.com"><strong>Build an IP safety pipeline</strong></a>&#x2014;full tutorial from namespace creation to retriever execution</li><li><a href="https://mixpeek.com/contact?ref=blog.mixpeek.com"><strong>Talk to us</strong></a>&#x2014;we&apos;re helping enterprises migrate from Frankenstack to warehouse</li></ol><h2 id="further-reading">Further Reading</h2><ul><li><a href="https://mixpeek.com/multimodal-data-warehouse?ref=blog.mixpeek.com">Multimodal Data Warehouse</a> &#x2014; the canonical definition page</li><li><a href="https://mixpeek.com/guides/what-is-multimodal-data-warehouse?ref=blog.mixpeek.com">What Is a Multimodal Data Warehouse?</a> &#x2014; comprehensive guide</li><li><a href="https://mixpeek.com/guides/build-multimodal-data-warehouse?ref=blog.mixpeek.com">How to Build a Multimodal Data Warehouse</a> &#x2014; step-by-step tutorial</li><li><a href="https://mixpeek.com/guides/multimodal-data-warehouse-architecture?ref=blog.mixpeek.com">Architecture Deep Dive</a> &#x2014; Ray Serve, tiered storage, retrieval internals</li><li><a href="https://mixpeek.com/comparisons/multimodal-data-warehouse-vs-vector-database?ref=blog.mixpeek.com">Multimodal Data Warehouse vs. Vector Database</a> &#x2014; full comparison</li><li><a href="https://mixpeek.com/comparisons/multimodal-data-warehouse-vs-data-lakehouse?ref=blog.mixpeek.com">Multimodal Data Warehouse vs. 
Data Lakehouse</a> &#x2014; Snowflake/Databricks comparison</li><li><a href="https://mixpeek.com/curated-lists/best-multimodal-data-platforms?ref=blog.mixpeek.com">Best Multimodal Data Platforms (2026)</a> &#x2014; 8 platforms compared</li><li><a href="https://mixpeek.com/glossary/multimodal-data-warehouse?ref=blog.mixpeek.com">Glossary: Multimodal Data Warehouse</a> &#x2014; technical definition</li><li><a href="https://mixpeek.com/solutions/ip-safety?ref=blog.mixpeek.com">IP Safety Solution</a> &#x2014; pre-publication copyright detection powered by the warehouse</li></ul><hr><p>The multimodal data warehouse isn&apos;t a vision. It&apos;s running in production today, processing millions of objects across media companies, ad platforms, and enterprises. The question isn&apos;t whether this category will exist&#x2014;it&apos;s whether you&apos;ll build it yourself or use one that already works.</p><p>Built with: FastAPI, Ray Serve, Qdrant, S3 Vectors, ArcFace, SigLIP, CLAP, Whisper, YOLOv8. <a href="https://mixpeek.com/?ref=blog.mixpeek.com">mixpeek.com</a></p>]]></content:encoded></item><item><title><![CDATA[ColQwen2 + MUVERA: Multimodal Late Interaction Retrieval That Actually Scales]]></title><description><![CDATA[We benchmarked multimodal late interaction retrieval on financial documents. ColQwen2 + MUVERA retains 99.4% of brute-force quality at 179x the speed, crushing OCR-based search by 56%.]]></description><link>http://blog.mixpeek.com/colqwen2-muvera-multimodal-late-interaction/</link><guid isPermaLink="false">69c45ea13baecafdb7f8d64c</guid><category><![CDATA[Engineering]]></category><category><![CDATA[Research]]></category><category><![CDATA[Multimodal]]></category><category><![CDATA[Retrieval]]></category><dc:creator><![CDATA[Ethan Steininger]]></dc:creator><pubDate>Wed, 25 Mar 2026 22:21:44 GMT</pubDate><media:content url="http://blog.mixpeek.com/content/images/2026/03/feature_image.png" medium="image"/><content:encoded><![CDATA[<img src="http://blog.mixpeek.com/content/images/2026/03/feature_image.png" alt="ColQwen2 + MUVERA: Multimodal Late Interaction Retrieval That Actually Scales"><p>We benchmarked every viable approach to multimodal document retrieval on financial tables (ViDoRe/TabFQuAD) and found a combination that hasn&apos;t been published before: <strong>ColQwen2 + MUVERA</strong>. It retains 99.4% of brute-force quality at a fraction of the cost, and obliterates OCR-based search.</p><h2 id="the-problem">The Problem</h2><p>Late interaction models like ColBERT and ColPali represent documents as <em>sets of vectors</em>&#x2014;one per token or image patch. At query time, every query token finds its best-matching document token (MaxSim/Chamfer similarity). This gives near cross-encoder accuracy, but retrieval cost is O(|Q| &#xD7; |P| &#xD7; n)&#x2014;intractable at scale.</p><p>The prior solution, PLAID, uses heuristic centroid pruning with no theoretical guarantees. It degrades unpredictably on some datasets.</p><h2 id="muvera-the-fix">MUVERA: The Fix</h2><p><a href="https://arxiv.org/abs/2405.19504?ref=blog.mixpeek.com">MUVERA</a> (Google Research, NeurIPS 2024) converts any multi-vector set into a single fixed-dimensional encoding (FDE) whose inner product <strong>provably approximates</strong> Chamfer similarity. 
This means you can use standard ANN engines (HNSW, DiskANN) for first-pass retrieval, then re-rank a small candidate set with true MaxSim.</p><p>The key insight is asymmetric encoding: documents get centroids with empty-cluster filling (preserves information), queries get sums with no filling (preserves distribution). Random hyperplane partitioning + dimensionality reduction, repeated R times and concatenated. The math gives you an &#x3B5;-approximation guarantee&#x2014;the first such result for multi-vector retrieval.</p><h2 id="the-benchmark">The Benchmark</h2><p>We ran ColPali-v1.2 and ColQwen2-v1.0 against BM25 (OCR + Tesseract) on the ViDoRe TabFQuAD dataset&#x2014;70 financial table images, 280 queries. This is the hard case: charts, multi-column tables, footnotes, mixed text+visual content where OCR systematically fails.</p><p>MUVERA config: k_sim=5, d_proj=16, r_reps=20 &#x2192; 10,240-dimensional FDE per document.</p>
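<p>The 10,240 figure follows from the construction sketched above: each of the r_reps repetitions partitions the space into 2^k_sim buckets via random hyperplanes and projects each bucket down to d_proj dimensions, then everything is concatenated.</p>
<pre><code class="language-python"># FDE dimensionality = repetitions x buckets x projected dims
k_sim, d_proj, r_reps = 5, 16, 20
buckets = 2 ** k_sim                   # 32 hyperplane-induced partitions
fde_dim = r_reps * buckets * d_proj    # 20 * 32 * 16
print(fde_dim)                         # 10240</code></pre>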
<!--kg-card-begin: html-->
<table style="width:100%; border-collapse:collapse; margin:1.5em 0; font-size:0.95em;">
<thead><tr style="border-bottom:2px solid #333; text-align:left;">
<th style="padding:8px;">Method</th><th style="padding:8px;">R@1</th><th style="padding:8px;">R@5</th><th style="padding:8px;">NDCG@10</th><th style="padding:8px;">MRR</th><th style="padding:8px;">Latency</th></tr></thead>
<tbody>
<tr style="border-bottom:1px solid #ddd;"><td style="padding:8px;">BM25 (OCR text)</td><td style="padding:8px;">0.425</td><td style="padding:8px;">0.650</td><td style="padding:8px;">0.570</td><td style="padding:8px;">0.531</td><td style="padding:8px;">0.4ms</td></tr>
<tr style="border-bottom:1px solid #ddd;"><td style="padding:8px;">ColPali-v1.2 brute-force</td><td style="padding:8px;">0.825</td><td style="padding:8px;">0.929</td><td style="padding:8px;">0.890</td><td style="padding:8px;">0.872</td><td style="padding:8px;">26.4ms</td></tr>
<tr style="border-bottom:1px solid #ddd;"><td style="padding:8px;">ColQwen2-v1.0 brute-force</td><td style="padding:8px;"><strong>0.839</strong></td><td style="padding:8px;"><strong>0.932</strong></td><td style="padding:8px;"><strong>0.896</strong></td><td style="padding:8px;"><strong>0.883</strong></td><td style="padding:8px;">42.8ms</td></tr>
<tr style="border-bottom:1px solid #ddd;"><td style="padding:8px;">MUVERA FDE only</td><td style="padding:8px;">0.693</td><td style="padding:8px;">0.854</td><td style="padding:8px;">0.791</td><td style="padding:8px;">0.759</td><td style="padding:8px;"><strong>0.2ms</strong></td></tr>
<tr style="border-bottom:2px solid #333; background:#f8f9fa;"><td style="padding:8px;"><strong>MUVERA + rerank (ColQwen2)</strong></td><td style="padding:8px;"><strong>0.836</strong></td><td style="padding:8px;"><strong>0.925</strong></td><td style="padding:8px;"><strong>0.891</strong></td><td style="padding:8px;"><strong>0.877</strong></td><td style="padding:8px;"><strong>30.2ms</strong></td></tr>
</tbody></table>
<!--kg-card-end: html-->
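<p>The headline retention number quoted below is just the ratio of the two NDCG@10 cells above:</p>
<pre><code class="language-python">retention = 0.891 / 0.896   # MUVERA + rerank vs. ColQwen2 brute-force
print(f&quot;{retention:.1%}&quot;)    # 99.4%</code></pre>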
<h2 id="what-the-numbers-mean">What the Numbers Mean</h2><p><strong>BM25 is not competitive.</strong> OCR + keyword search reaches 63.6% of the visual model&apos;s quality on financial tables. If your pipeline is &#x201C;extract text then BM25,&#x201D; you&apos;re leaving 36% of retrieval quality on the floor for any document with visual structure.</p><p><strong>MUVERA + rerank = 99.4% quality retention.</strong> The FDE narrows 70 documents to 50 candidates in 0.2ms, then Chamfer re-ranking recovers essentially all of the brute-force accuracy. At 1M documents, brute-force becomes seconds; MUVERA stays at milliseconds.</p><p><strong>MUVERA FDE-only at 179&#xD7; speedup.</strong> For applications where you can tolerate ~12% quality loss, pure FDE search gives sub-millisecond retrieval. This is the operating point for real-time serving at scale.</p><p><strong>ColQwen2 &gt; ColPali by +0.7% NDCG@10</strong> on this dataset, with a larger gap (~6%) on the full ViDoRe average. ColQwen2 is Apache 2.0 licensed and 2B parameters&#x2014;smaller than ColPali&apos;s 3B.</p><h2 id="the-two-tier-architecture">The Two-Tier Architecture</h2><p>The production pattern that falls out of this:</p><ol><li><strong>Offline:</strong> Embed documents with ColQwen2 &#x2192; ~620 patch vectors per page (128-dim each). Generate MUVERA FDE (10,240-dim single vector). Index FDEs in any standard ANN engine.</li><li><strong>Tier 1&#x2014;candidate generation:</strong> Query &#x2192; ColQwen2 query embedding &#x2192; MUVERA query FDE &#x2192; ANN search &#x2192; top-K candidates. Cost: O(log n). Latency: &lt;1ms.</li><li><strong>Tier 2&#x2014;precision re-ranking:</strong> Load candidate multi-vectors from storage &#x2192; true Chamfer/MaxSim scoring &#x2192; final ranked list. Cost: O(K &#xD7; |patches|). Latency: ~30ms for K=50.</li></ol><p>FDEs go into your vector index as ordinary single vectors. Multi-vectors stay in object storage (S3/parquet) and only get loaded for the re-rank stage. No new infrastructure&#x2014;just a smarter encoding layer.</p><h2 id="what-this-means-for-multimodal-search">What This Means for Multimodal Search</h2><p>The combination of a strong vision-language model (ColQwen2) with a theoretically-grounded retrieval engine (MUVERA) makes multi-vector search practical at scale for the first time. Prior approaches either sacrificed quality (single-vector), sacrificed speed (brute-force), or sacrificed guarantees (PLAID).</p><p>The verticals where this matters most: financial document search (tables, charts, filings), medical imaging (radiology reports with embedded scans), legal discovery (scanned contracts with annotations), and any domain where OCR is the current bottleneck.</p><h2 id="links">Links</h2><ul><li><a href="https://arxiv.org/abs/2405.19504?ref=blog.mixpeek.com">MUVERA paper (NeurIPS 2024)</a></li><li><a href="https://arxiv.org/abs/2407.01449?ref=blog.mixpeek.com">ColPali paper (ICLR 2025)</a></li><li><a href="https://github.com/google/graph-mining/tree/main/sketching/point_cloud?ref=blog.mixpeek.com">MUVERA reference implementation (C++)</a></li><li><a href="https://github.com/illuin-tech/colpali?ref=blog.mixpeek.com">ColPali / ColQwen2 models</a></li><li><a href="https://huggingface.co/vidore/colqwen2-v1.0?ref=blog.mixpeek.com">ColQwen2-v1.0 on HuggingFace</a></li></ul>]]></content:encoded></item><item><title><![CDATA[We Built a Pre-Publication IP Clearance Pipeline. 
Here's What We Learned.]]></title><description><![CDATA[Every major IP enforcement tool finds violations after they're live. We built one that catches them before publication. Here's the architecture, the models, and what we learned.]]></description><link>http://blog.mixpeek.com/ip-safety-pre-publication-clearance/</link><guid isPermaLink="false">69c2a5ed3baecafdb7f8d20a</guid><category><![CDATA[Engineering]]></category><category><![CDATA[IP Safety]]></category><category><![CDATA[Computer Vision]]></category><category><![CDATA[Machine Learning]]></category><category><![CDATA[Content Compliance]]></category><category><![CDATA[Tutorial]]></category><dc:creator><![CDATA[Ethan Steininger]]></dc:creator><pubDate>Wed, 25 Mar 2026 13:40:11 GMT</pubDate><media:content url="http://blog.mixpeek.com/content/images/2026/03/feature-image-4.png" medium="image"/><content:encoded><![CDATA[
<!--kg-card-begin: html-->
<iframe width="560" height="315" src="https://www.youtube.com/embed/hmajwWZLkiA?si=kyB56v9QfavyXHgn" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>
<!--kg-card-end: html-->
<img src="http://blog.mixpeek.com/content/images/2026/03/feature-image-4.png" alt="We Built a Pre-Publication IP Clearance Pipeline. Here&apos;s What We Learned."><p>Every major IP enforcement tool finds violations after they&apos;re live. We built one that catches them before publication. Here&apos;s the architecture.</p><hr><h2 id="the-problem-content-velocity-vs-clearance-bottleneck">The Problem: Content Velocity vs. Clearance Bottleneck</h2><p>A mid-size media company publishes 200-400 creative assets per week. Each one needs to be checked for unauthorized faces, trademarked logos, and copyrighted audio before it ships. A single missed celebrity likeness in an ad campaign can trigger a seven-figure lawsuit. A logo that&apos;s &quot;close enough&quot; to a registered trademark gets a cease-and-desist within hours.</p><p>The current options are bad:</p><ul><li><strong>Manual review</strong> doesn&apos;t scale. A trained compliance analyst can review maybe 50 assets/day with any rigor. That&apos;s a week&apos;s backlog by Tuesday.</li><li><strong>Post-publication enforcement</strong> (Pixsy, Red Points, VISUA, Copyseeker) finds violations after they&apos;re already live. You pay for takedowns, not prevention. The damage &#x2014; legal exposure, brand risk, platform penalties &#x2014; is already done.</li><li><strong>Perceptual hashing alone</strong> catches exact and near-exact copies, but misses stylized logos, different angles of the same face, or AI-generated content that&apos;s &quot;inspired by&quot; but not pixel-identical to protected IP.</li></ul><p>We wanted a pipeline that clears content <em>before</em> publication, runs in under a second per image and a few seconds per video, and catches the hard cases that hashing misses.</p><hr><h2 id="architecture-overview">Architecture Overview</h2><p>The system is built on three primitives from the Mixpeek API: <strong>Buckets</strong> (storage + ingestion triggers), <strong>Collections</strong> (processing pipelines with feature extractors), and <strong>Retrievers</strong> (multi-stage search).</p><p>The high-level flow:</p><pre><code>Content Asset (image/video/audio)
    |
    v
Bucket Upload (triggers collection pipeline)
    |
    v
Collection Pipeline (parallel extractors)
    |--- Face Detection &#x2192; Face Embedding (ArcFace 512d)
    |--- Scene Splitting &#x2192; Object Detection (YOLO) &#x2192; Logo Embedding (SigLIP 768d)
    |--- Audio Extraction &#x2192; Spectrogram Fingerprinting
    |
    v
Vector Storage (Qdrant &#x2014; one namespace, three vector spaces)
    |
    v
Retriever (multi-stage search across all three corpora)
    |
    v
Clearance Result: { faces: [...], logos: [...], audio: [...] }
</code></pre><p>Three detection layers run in parallel within a single collection pipeline. Each layer has its own feature extractor, its own embedding model, and its own reference corpus. A single retriever execution searches all three and returns a unified result.</p><p>The key insight: <strong>the same pipeline that processes your content for clearance also builds your reference corpus.</strong> Celebrity headshots, trademarked logos, and copyrighted audio tracks are all ingested through the same bucket-collection flow. The only difference is metadata tagging &#x2014; reference items get a <code>corpus_type: &quot;reference&quot;</code> field, content to be checked gets <code>corpus_type: &quot;submission&quot;</code>.</p><hr><h2 id="query-pre-processing-why-it-matters-more-than-model-quality">Query Pre-Processing: Why It Matters More Than Model Quality</h2><p>This is the section most teams skip, and it&apos;s the one that matters most. The quality of what you feed into your embedding model determines your recall far more than which embedding model you pick.</p><h3 id="scene-splitting-for-video">Scene Splitting for Video</h3><p>Naive approach: sample every Nth frame and run detection on each. This is expensive and produces massive redundancy &#x2014; a 30-second talking-head clip generates 900 frames at 30fps, most of which are nearly identical.</p><p>Better approach: split by scene boundaries first. Mixpeek&apos;s <code>scene_splitting</code> extractor uses PySceneDetect&apos;s content-aware detection to identify hard cuts and gradual transitions. A typical 30-second ad breaks into 3-8 scenes. Run detection on representative frames from each scene, not every frame.</p><pre><code class="language-python"># Collection config &#x2014; scene splitting feeds into face detection
{
    &quot;collection_name&quot;: &quot;ip_clearance_pipeline&quot;,
    &quot;feature_extractors&quot;: [
        {
            &quot;feature_extractor_name&quot;: &quot;scene_splitting&quot;,
            &quot;version&quot;: &quot;v1&quot;,
            &quot;parameters&quot;: {
                &quot;threshold&quot;: 27.0,
                &quot;min_scene_len&quot;: 15
            }
        },
        {
            &quot;feature_extractor_name&quot;: &quot;face_identity&quot;,
            &quot;version&quot;: &quot;v1&quot;,
            &quot;parameters&quot;: {
                &quot;quality_threshold&quot;: 0.4,
                &quot;min_face_size&quot;: 40,
                &quot;detection_threshold&quot;: 0.5
            }
        },
        {
            &quot;feature_extractor_name&quot;: &quot;object_detection&quot;,
            &quot;version&quot;: &quot;v1&quot;,
            &quot;parameters&quot;: {
                &quot;model&quot;: &quot;yolov8x-worldv2&quot;,
                &quot;confidence_threshold&quot;: 0.25,
                &quot;classes&quot;: [&quot;logo&quot;, &quot;brand&quot;, &quot;trademark&quot;, &quot;sign&quot;, &quot;label&quot;]
            }
        }
    ]
}
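
# Note: threshold 27.0 and min_scene_len 15 mirror PySceneDetect&apos;s
# ContentDetector defaults (where min_scene_len is counted in frames).
# Lower the threshold to catch softer transitions; raise min_scene_len
# to suppress rapid-cut noise.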
</code></pre><p>This preprocessing step cuts compute by 10-50x on video inputs while actually <em>improving</em> detection quality &#x2014; scene-representative frames are more likely to show faces and logos in clear, unblurred positions than arbitrary frame samples.</p><h3 id="face-cropping-before-embedding">Face Cropping Before Embedding</h3><p>You don&apos;t embed the full frame. You detect faces first (SCRFD &#x2014; Sample and Computation Redistribution for Face Detection), crop each face region, align it to a normalized 112x112 template using 5 facial landmarks, and <em>then</em> generate the identity embedding.</p><p>This is obvious in hindsight, but I&apos;ve seen teams embed full frames and wonder why their face search has 40% recall. The face is 3% of the pixel area in a wide shot. The embedding is dominated by the background.</p><h3 id="object-proposals-for-logo-isolation">Object Proposals for Logo Isolation</h3><p>Same principle for logos. YOLO generates bounding box proposals for logo-like regions. Each region is cropped and embedded independently with SigLIP. A single frame might yield zero logo proposals (clean background) or five (product shelf shot). Each proposal becomes a separate search query against the logo reference corpus.</p><p>The alternative &#x2014; embedding the full frame and hoping the model attends to the logo &#x2014; works surprisingly well for prominent logos (center frame, large area) and fails badly for small, peripheral, or partially occluded marks. The crop-then-embed approach handles both cases.</p><hr><h2 id="the-pipelines-in-detail">The Pipelines in Detail</h2><h3 id="face-detection-%E2%86%92-recognition">Face Detection &#x2192; Recognition</h3><p>Pipeline stages:</p><ol><li><strong>Scene splitting</strong> (video only): PySceneDetect content-aware detection &#x2192; representative frames</li><li><strong>Face detection</strong>: SCRFD-2.5G scans each frame. Outputs bounding boxes, confidence scores, and 5 facial landmarks per detected face.</li><li><strong>Alignment</strong>: Landmarks are used to warp each face to a canonical 112x112 frontal pose. This normalization is what makes the system robust to head tilt, partial profile views, and camera angle variation.</li><li><strong>Embedding</strong>: ArcFace (ResNet-100, trained on MS1MV3) generates a 512-dimensional identity embedding. Cosine similarity in this space corresponds directly to identity &#x2014; same person across lighting, age, expression, and moderate pose changes.</li><li><strong>ANN search</strong>: Each face embedding is searched against the reference corpus in Qdrant. Threshold: cosine similarity &gt;= 0.28 (conservative; FAR ~1e-4 on LFW benchmark).</li></ol><p>Why ArcFace and not CLIP/SigLIP for faces? CLIP embeds semantic similarity. Two different red-haired women in similar settings score high. ArcFace is trained with angular margin loss specifically for identity discrimination &#x2014; it is not interchangeable with general-purpose image embedders for biometric matching.</p><pre><code class="language-python"># Retriever config &#x2014; face search stage
{
    &quot;stage_id&quot;: &quot;feature_search&quot;,
    &quot;stage_type&quot;: &quot;search&quot;,
    &quot;parameters&quot;: {
        &quot;searches&quot;: [
            {
                &quot;feature_uri&quot;: &quot;mixpeek://face_identity@v1/arcface_embedding&quot;,
                &quot;query&quot;: {
                    &quot;input_mode&quot;: &quot;content&quot;,
                    &quot;value&quot;: &quot;{{submission_face_crop}}&quot;
                },
                &quot;filters&quot;: {
                    &quot;corpus_type&quot;: &quot;reference&quot;
                },
                &quot;top_k&quot;: 10,
                &quot;score_threshold&quot;: 0.28
            }
        ]
    }
}
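
# score_threshold 0.28 is the conservative ArcFace cosine cutoff described
# above (FAR ~1e-4 on LFW); the corpus_type filter keeps submissions from
# matching against other submissions.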
</code></pre><h3 id="logo-detection-%E2%86%92-recognition">Logo Detection &#x2192; Recognition</h3><p>Pipeline stages:</p><ol><li><strong>Scene splitting</strong> (video only): same as above</li><li><strong>Object detection</strong>: YOLOv8x-WorldV2 generates bounding box proposals for logo-like objects. We use the open-vocabulary variant so it generalizes beyond a fixed class set &#x2014; you can prompt it with arbitrary class names at inference time.</li><li><strong>Region cropping</strong>: Each detected region is cropped with 10% padding</li><li><strong>Logo embedding</strong>: SigLIP (ViT-B/16, 768-dimensional) embeds each cropped region. SigLIP over CLIP because its sigmoid pairwise loss produces better-calibrated similarity scores for retrieval tasks.</li><li><strong>Dual matching</strong>: Each crop is matched against the reference corpus via both (a) embedding cosine similarity and (b) perceptual hash distance. Either signal above threshold triggers a match.</li></ol><p>The dual matching is important. Perceptual hashing catches trivial copies (exact logo, maybe resized or JPEG&apos;d) cheaply. The embedding catches stylized variants, partial logos, color inversions, and the deformed versions that generative AI tends to produce. Running both in parallel with an OR-gate means high recall without relying solely on either approach.</p><pre><code class="language-python"># Why pHash + embedding, not just embedding?
#
# pHash: O(1) lookup, exact/near-exact matches, zero false negatives on
#         trivial copies. Catches ~60% of real-world violations.
# Embedding: Handles stylization, partial occlusion, AI-generated variants.
#            Catches the remaining 40% that hashing misses.
#
# Running both costs almost nothing extra &#x2014; the hash is computed during
# ingestion and stored as a payload field. At query time, it&apos;s a
# Qdrant payload filter, not a separate search.
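#
# A minimal sketch of that OR-gate decision (illustrative only; thresholds
# taken from the false-positive table later in this post):
def is_logo_match(cosine_sim: float, phash_hamming: int) -&gt; bool:
    EMBED_THRESHOLD = 0.75   # SigLIP cosine similarity
    PHASH_THRESHOLD = 8      # Hamming distance between perceptual hashes
    return cosine_sim &gt;= EMBED_THRESHOLD or phash_hamming &lt;= PHASH_THRESHOLD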
</code></pre><h3 id="audio-fingerprinting">Audio Fingerprinting</h3><p>Pipeline stages:</p><ol><li><strong>Audio extraction</strong>: FFmpeg strips the audio track from video assets</li><li><strong>Spectrogram generation</strong>: Short-time Fourier transform &#x2192; mel spectrogram</li><li><strong>Fingerprint embedding</strong>: The spectrogram is embedded into a dense vector representation for similarity search</li><li><strong>ANN search</strong>: Search against a reference corpus of copyrighted audio tracks, jingles, and licensed music</li></ol><p>Audio is the least mature of the three pipelines and the one where perceptual fingerprinting (Chromaprint/AcoustID-style) still outperforms learned embeddings for exact match detection. The embedding approach shines for covers, remixes, and tempo-shifted versions.</p><hr><h2 id="models-and-tuning">Models and Tuning</h2><h3 id="why-siglip-over-clip">Why SigLIP Over CLIP</h3><p>We evaluated both extensively for the logo matching use case. SigLIP (768-d, ViT-B/16) wins on three dimensions that matter for retrieval:</p><ul><li><strong>Calibrated scores</strong>: SigLIP&apos;s sigmoid loss produces similarity scores that are directly interpretable as match confidence. CLIP&apos;s softmax-normalized scores are relative within a batch, which makes threshold-setting fragile.</li><li><strong>Cropped region performance</strong>: On our internal eval set of 500 logo crops against 3,000 reference brands, SigLIP at threshold 0.75 achieves 94% recall / 97% precision vs CLIP&apos;s 89% recall / 93% precision at its optimal threshold.</li><li><strong>Zero-shot generalization</strong>: SigLIP handles brand logos it&apos;s never seen in training better than CLIP, likely due to the per-pair sigmoid loss not pushing negatives to the same scale as positives.</li></ul><h3 id="face-recognition-arcface-tradeoffs">Face Recognition: ArcFace Tradeoffs</h3><p>ArcFace ResNet-100 is the default. It&apos;s 512 dimensions, runs at ~3ms per face on GPU, and achieves 99.83% accuracy on LFW. The tradeoffs:</p><ul><li><strong>Pose sensitivity</strong>: Accuracy degrades beyond ~60-degree profile angles. The alignment step mitigates this for moderate poses, but a face visible only in full profile may not match.</li><li><strong>Aging</strong>: Embeddings shift over 10+ year spans. A reference photo from 2010 may not match the same person in 2026 at conservative thresholds. Mitigation: include multiple reference images spanning different time periods.</li><li><strong>Low resolution</strong>: Faces below ~40px wide after detection don&apos;t produce reliable embeddings. We set <code>min_face_size: 40</code> as a hard floor.</li></ul><h3 id="custom-yolo-models">Custom YOLO Models</h3><p>The object detection stage supports custom model deployment. If your reference corpus is highly specialized &#x2014; say, you need to detect pharmaceutical packaging marks or specific regulatory symbols &#x2014; you can train a custom YOLO model and upload it as a ZIP file to the platform. The inference service loads it on-demand.</p><p>For generic logo detection, YOLOv8x-WorldV2&apos;s open-vocabulary capability is sufficient. You specify the classes you care about at query time:</p><pre><code class="language-python"># Open-vocabulary object detection &#x2014; no retraining needed
{
    &quot;feature_extractor_name&quot;: &quot;object_detection&quot;,
    &quot;parameters&quot;: {
        &quot;model&quot;: &quot;yolov8x-worldv2&quot;,
        &quot;classes&quot;: [&quot;Nike swoosh&quot;, &quot;McDonald&apos;s arches&quot;, &quot;Apple logo&quot;, &quot;brand logo&quot;]
    }
}
</code></pre><h3 id="the-reranker">The Reranker</h3><p>ANN search returns top-K candidates fast but approximate. For the final ranking, we run a cross-encoder reranker on the top results. The cross-encoder sees both the query and candidate simultaneously (not independently embedded), which allows it to capture fine-grained differences that dual-encoder models miss.</p><p>This is especially valuable for logos where the top-10 ANN results might include 3 genuine matches and 7 visually similar but legally distinct marks. The cross-encoder precision on this disambiguation step is what separates &quot;useful tool&quot; from &quot;alert fatigue generator.&quot;</p><hr><h2 id="the-dataset-challenge">The Dataset Challenge</h2><h3 id="building-the-reference-corpus">Building the Reference Corpus</h3><p>This is where we spent most of our time and where most teams underinvest.</p><p><strong>Faces:</strong> How many reference images per identity do you need? Our empirical finding: 5-10 quality reference images per person covers enough pose/lighting variation to achieve &gt;95% recall at FAR 1e-4. With only 1 reference image, recall drops to ~70%. Below 3, it&apos;s unreliable.</p><p>We built our initial corpus from FaceScrub (~530 identities, ~2,900 images after URL death) supplemented with Wikipedia Commons portraits. The URL attrition on FaceScrub is brutal &#x2014; it&apos;s a 2014 dataset and ~85% of the original URLs are dead. For production, you need a maintained reference database, not a research dataset.</p><pre><code class="language-python"># Corpus quality matters more than quantity
# These are our empirical numbers on the FaceScrub + Wikipedia corpus:
#
# References/identity | Recall@FAR=1e-4 | Notes
# ------------------- | ---------------- | -----
# 1                   | ~70%             | Single frontal portrait
# 3                   | ~88%             | Frontal + 2 varied poses
# 5                   | ~94%             | Diverse lighting/angle
# 10                  | ~97%             | Diminishing returns here
# 20+                 | ~98%             | Not worth the curation cost
</code></pre><p><strong>Logos:</strong> We use LogoDet-3K (158,654 images, 3,000 brands, MIT license) as the base. The critical preprocessing step: LogoDet-3K uses numeric company IDs, not brand names. You must resolve the ID-to-brand mapping before ingestion, or your metadata says &quot;Food/12345&quot; instead of &quot;McDonald&apos;s.&quot; We burned half a day on this.</p><p>For logos, variation coverage matters: you need the logo on white, on dark backgrounds, in color, in grayscale, at multiple scales, and ideally in real-world context (storefront, product packaging, screen captures). A single clean vector logo file is insufficient as a reference.</p><h3 id="the-cold-start-problem">The Cold Start Problem</h3><p>When you first deploy, your reference corpus is whatever you curated. It doesn&apos;t cover edge cases &#x2014; unusual lighting conditions, rare logo variants, faces that are only partially visible. The system&apos;s recall on day one is measurably worse than on day 30.</p><p>The fix is interaction feedback, which feeds directly into the continuous learning loop described in the next section.</p><hr><h2 id="continuous-learning-the-interaction-loop">Continuous Learning: The Interaction Loop</h2><p>This is the part that actually matters long-term, and it&apos;s the part most blog posts about ML pipelines skip.</p><p>Every user interaction with the system generates a signal:</p><ul><li><strong>Click</strong> on a result &#x2192; implicit positive signal</li><li><strong>Skip</strong> a result &#x2192; weak negative signal</li><li><strong>Long view</strong> (&gt;3s on a result) &#x2192; implicit positive signal</li><li><strong>Explicit feedback</strong> &#x2192; &quot;correct match&quot; / &quot;false positive&quot; / &quot;missed match&quot;</li><li><strong>Threshold override</strong> &#x2192; analyst manually approves/rejects at a specific confidence level</li></ul><p>These signals are captured by Mixpeek&apos;s interaction tracking and stored in ClickHouse for analytics. The analytics endpoints expose confidence distributions and signal patterns per retriever:</p><pre><code class="language-python"># Analyzing signal patterns for a retriever
import requests
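# Assumes api_key and retriever_id are already defined for your namespace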

# Get confidence distribution &#x2014; where are matches landing?
response = requests.get(
    f&quot;https://api.mixpeek.com/v1/analytics/retrievers/{retriever_id}/confidence&quot;,
    headers={&quot;Authorization&quot;: f&quot;Bearer {api_key}&quot;}
)

# Returns histogram of match confidence scores
# If there&apos;s a bimodal distribution with a gap at 0.35,
# that&apos;s your natural threshold &#x2014; not the default 0.28.

# Get signal breakdown &#x2014; what&apos;s being confirmed vs. rejected?
signals = requests.get(
    f&quot;https://api.mixpeek.com/v1/analytics/retrievers/{retriever_id}/signals&quot;,
    headers={&quot;Authorization&quot;: f&quot;Bearer {api_key}&quot;}
)
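
# Hypothetical follow-up: derive a candidate threshold from the histogram.
# Assumes the confidence response exposes parallel &quot;bins&quot; and &quot;counts&quot;
# arrays; check the actual payload shape in the analytics docs before using.
hist = response.json()
bins, counts = hist[&quot;bins&quot;], hist[&quot;counts&quot;]

# The emptiest interior bin of a bimodal score distribution marks the natural
# cutoff described above (e.g. a gap at 0.35 rather than the default 0.28).
valley = min(range(1, len(counts) - 1), key=lambda i: counts[i])
candidate_threshold = bins[valley]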
</code></pre><p>Over time, these signals inform three adjustments:</p><ol><li><strong>Threshold tuning</strong>: If analysts are consistently rejecting matches above 0.28 but below 0.35, raise the threshold. If they&apos;re marking missed matches in the 0.22-0.28 range, consider lowering it for specific corpora.</li><li><strong>Fusion weight adjustment</strong>: The retriever combines face, logo, and audio signals with configurable weights. If logo matches have a higher false-positive rate than face matches in your domain, down-weight them.</li><li><strong>Reference corpus expansion</strong>: When a new face or logo is confirmed as a genuine match but wasn&apos;t in the reference corpus, add it. This is how the system improves its recall over time without model retraining.</li></ol><p>The system doesn&apos;t auto-adjust thresholds (that would be terrifying for a compliance tool). It surfaces the data; a human makes the call. But having the data surface automatically, instead of discovering your false-positive rate through legal complaints, is the difference between a proactive tool and an expensive audit.</p><hr><h2 id="performance-numbers">Performance Numbers</h2><p>Measured on our production deployment with a corpus of ~3,000 brand logos and ~530 face identities:</p>
<!--kg-card-begin: html-->
<table>
<thead>
<tr><th>Input Type</th><th>Latency (p50)</th><th>Latency (p95)</th><th>Notes</th></tr>
</thead>
<tbody>
<tr><td>Single image</td><td>220ms</td><td>480ms</td><td>Face + logo detection + search</td></tr>
<tr><td>Video (30s, ~5 scenes)</td><td>2.1s</td><td>3.8s</td><td>Includes scene splitting</td></tr>
<tr><td>Video (60s, ~12 scenes)</td><td>4.3s</td><td>7.2s</td><td>Scales linearly with scenes</td></tr>
<tr><td>Audio-only check</td><td>180ms</td><td>350ms</td><td>30s clip fingerprint</td></tr>
<tr><td>Batch (1000 images)</td><td>~4 min</td><td>~7 min</td><td>Parallelized across workers</td></tr>
</tbody>
</table>
<!--kg-card-end: html-->
<p>False positive rates at the default thresholds:</p>
<!--kg-card-begin: html-->
<table>
<thead>
<tr><th>Detection Layer</th><th>Threshold</th><th>FPR</th><th>Recall</th></tr>
</thead>
<tbody>
<tr><td>Face (ArcFace)</td><td>cosine &gt;= 0.28</td><td>~0.01%</td><td>~94%</td></tr>
<tr><td>Logo (SigLIP)</td><td>cosine &gt;= 0.75</td><td>~3%</td><td>~94%</td></tr>
<tr><td>Logo (pHash)</td><td>hamming &lt;= 8</td><td>~0.1%</td><td>~60%</td></tr>
<tr><td>Logo (combined)</td><td>either above</td><td>~3.1%</td><td>~97%</td></tr>
</tbody>
</table>
<!--kg-card-end: html-->
<p>The logo FPR is higher because visually similar but legally distinct marks are common (think: any swoosh-like shape near the Nike threshold). The reranker brings this down to ~0.5% in practice, but we report the pre-reranker numbers because that&apos;s what you&apos;ll see before adding the cross-encoder stage.</p><hr><h2 id="what-wed-do-differently">What We&apos;d Do Differently</h2><p>Having built this and deployed it, here&apos;s what we&apos;d change if starting over:</p><h3 id="1-start-with-perceptual-hashing-add-embeddings-second">1. Start with perceptual hashing, add embeddings second</h3><p>pHash is trivial to implement, runs in microseconds, and catches ~60% of real-world logo violations (exact copies, resizes, JPEG re-compressions, minor crops). If you&apos;re building an MVP, ship pHash first and add the embedding pipeline when you need to catch stylized variants. We built the embedding pipeline first because it was more interesting, which was a mistake.</p><h3 id="2-invest-in-reference-corpus-quality-before-model-selection">2. Invest in reference corpus quality before model selection</h3><p>We spent weeks evaluating ArcFace vs. AdaFace vs. CosFace when the actual bottleneck was that 40% of our FaceScrub reference URLs were dead. 10 good reference images per identity with a mediocre model beats 1 reference image with SOTA. This isn&apos;t a theoretical observation &#x2014; our recall jumped 12 percentage points from corpus cleanup alone, without touching the model.</p><h3 id="3-build-the-feedback-loop-from-day-one">3. Build the feedback loop from day one</h3><p>We added interaction tracking as an afterthought after the first deployment. This meant three weeks of production data with no signal capture &#x2014; three weeks of analyst corrections that were lost. The feedback loop is what makes the system improve over time. Bolt it in from the start, even if the analytics dashboard comes later.</p><h3 id="4-dont-underestimate-the-logo-disambiguation-problem">4. Don&apos;t underestimate the logo disambiguation problem</h3><p>There are thousands of logos that are vaguely circular with a swoosh-like element. At embedding similarity 0.7, the Nike swoosh matches dozens of unrelated marks. The cross-encoder reranker exists because of this problem. If your use case involves logos at all, budget for a reranking stage &#x2014; you will need it.</p><h3 id="5-video-is-a-multiplier-not-a-different-problem">5. Video is a multiplier, not a different problem</h3><p>The per-frame detection is identical to image detection. The only video-specific work is scene splitting and deduplication (the same face appearing across multiple frames of the same scene should not generate multiple alerts). We initially over-engineered the video pipeline before realizing it&apos;s just &quot;image pipeline + scene splitting + dedup.&quot;</p><hr><h2 id="try-it">Try It</h2><p>The demo is live at <a href="https://copyright.mixpeek.com/?ref=blog.mixpeek.com">copyright.mixpeek.com</a>. Upload an image or video and see the face/logo/audio detection results in real time.</p><p>The API is documented at <a href="https://mixpeek.com/docs?ref=blog.mixpeek.com">mixpeek.com/docs</a>. 
If you&apos;re building a pre-publication clearance pipeline, the key endpoints are:</p><ul><li><strong>Buckets</strong> for ingesting both reference corpora and content submissions</li><li><strong>Collections</strong> with face_identity, object_detection, and audio extractors for processing</li><li><strong>Retrievers</strong> with multi-stage search for running clearance checks</li></ul><p>If you&apos;re building something similar and hit a wall, the <a href="https://mixpeek.com/docs/tutorials?ref=blog.mixpeek.com">tutorial</a> walks through the full pipeline setup from corpus ingestion to retriever configuration.</p><p>We also open-sourced the demo frontend &#x2014; it&apos;s a React app that calls the Mixpeek API. Clone it, point it at your namespace, and you have a working IP clearance UI.</p>]]></content:encoded></item><item><title><![CDATA[I Benchmarked 5 Video Embedding Models So You Don't Have To]]></title><description><![CDATA[We tested Gemini, Twelve Labs Marengo, X-CLIP, SigLIP 2, and InternVideo2 on text-to-video retrieval with graded relevance. The results surprised us.]]></description><link>http://blog.mixpeek.com/video-embedding-benchmark-2026/</link><guid isPermaLink="false">69b30e973baecafdb7f8cd58</guid><category><![CDATA[Benchmark]]></category><category><![CDATA[Video Embeddings]]></category><category><![CDATA[Multi-Modal AI]]></category><category><![CDATA[Information Retrieval]]></category><category><![CDATA[Research]]></category><dc:creator><![CDATA[Ethan Steininger]]></dc:creator><pubDate>Thu, 12 Mar 2026 21:40:15 GMT</pubDate><media:content url="http://blog.mixpeek.com/content/images/2026/03/video-embedding-benchmark-2026.png" medium="image"/><content:encoded><![CDATA[<img src="http://blog.mixpeek.com/content/images/2026/03/video-embedding-benchmark-2026.png" alt="I Benchmarked 5 Video Embedding Models So You Don&apos;t Have To"><p>At <a href="https://mixpeek.com/?ref=blog.mixpeek.com">Mixpeek</a>, we process video embeddings at scale for our customers&apos; retrieval pipelines. When Google dropped <a href="https://ai.google.dev/gemini-api/docs/embeddings?ref=blog.mixpeek.com">Gemini Embedding 2</a> &#x2014; claiming to unify text, image, video, and audio in one embedding space &#x2014; we needed to know: does it actually work for video retrieval? And how does it compare to purpose-built alternatives?</p><p>So we built a benchmark. Not a synthetic one with cherry-picked examples &#x2014; a proper IR evaluation with graded relevance &#x2014; dataset, code, and results all <a href="https://github.com/mixpeek/video-embedding-benchmark?ref=blog.mixpeek.com">open-sourced on GitHub</a> &#x2014; following the same methodology as <a href="https://github.com/beir-cellar/beir?ref=blog.mixpeek.com">BEIR</a> and <a href="https://huggingface.co/spaces/mteb/leaderboard?ref=blog.mixpeek.com">MTEB</a>. Twenty CC0 videos, sixty queries, six models, reproducible results.</p><p>Here&apos;s what we found.</p><h2 id="the-results">The Results</h2>
<!--kg-card-begin: html-->
<table>
<thead>
<tr><th>Model</th><th>Dims</th><th>NDCG@5</th><th>NDCG@10</th><th>R@1</th><th>R@5</th><th>MRR</th><th>Latency</th><th>Type</th></tr>
</thead>
<tbody>
<tr><td><strong>Gemini Embedding 2</strong></td><td>3072</td><td>0.697</td><td><strong>0.769</strong></td><td>0.200</td><td>0.717</td><td>0.896</td><td>2,458ms</td><td>API</td></tr>
<tr><td><strong>Marengo 2.7</strong>*</td><td>1024</td><td><strong>0.721</strong></td><td>0.760</td><td><strong>0.250</strong></td><td><strong>0.743</strong></td><td><strong>1.000</strong></td><td>18,148ms</td><td>API</td></tr>
<tr><td><strong>Mixedbread Wholembed v3</strong>**</td><td>ColBERT</td><td>0.644</td><td>0.757</td><td>0.216</td><td>0.649</td><td>0.932</td><td>500ms</td><td>API (Stores)</td></tr>
<tr><td>X-CLIP Base</td><td>512</td><td>0.327</td><td>0.470</td><td>0.067</td><td>0.367</td><td>0.520</td><td>192ms</td><td>Local</td></tr>
<tr><td>SigLIP 2 SO400M</td><td>1152</td><td>0.202</td><td>0.325</td><td>0.075</td><td>0.237</td><td>0.466</td><td>636ms</td><td>Local</td></tr>
<tr><td>InternVideo2 6B</td><td>768</td><td>0.186</td><td>0.302</td><td>0.046</td><td>0.237</td><td>0.405</td><td>24,817ms</td><td>Local</td></tr>
</tbody>
</table>
<!--kg-card-end: html-->
<p><em>* Marengo results based on 34/60 queries due to Twelve Labs free-tier rate limits. ** Mixedbread results based on 37/60 queries due to rate limiting. Uses Stores API (ColBERT-style late interaction).</em></p><h2 id="what-these-metrics-mean-and-why-you-should-care">What These Metrics Mean (and Why You Should Care)</h2><p>If you&apos;re building any kind of search or retrieval system over video, these numbers tell you how often your users will find what they&apos;re looking for. Let me break them down:</p><p><strong>NDCG@K (Normalized Discounted Cumulative Gain)</strong> is the primary metric. It measures ranking quality with graded relevance &#x2014; not just &quot;did you find it?&quot; but &quot;did you rank the best match highest?&quot; An NDCG@10 of 0.769 means Gemini gets the ranking mostly right across the top 10 results. An NDCG@10 of 0.302 (InternVideo2) means the ranking is barely better than random.</p><p><strong>MRR (Mean Reciprocal Rank)</strong> answers: &quot;Where does the first correct result appear?&quot; Marengo&apos;s perfect 1.000 means the right video was <em>always</em> ranked #1. Every single query. Gemini&apos;s 0.896 means the right answer is usually in the top 2. X-CLIP&apos;s 0.520 means you&apos;re typically scrolling to position 2-3 before finding something relevant.</p><p><strong>Recall@K</strong> tells you what fraction of relevant results appear in the top K. At R@5, Marengo retrieves 74% of relevant videos in the top 5 results. InternVideo2 only gets 24%. If you&apos;re building a search UI that shows 5 results per page, that&apos;s the difference between useful and useless.</p><p><strong>Latency</strong> is wall-clock time to embed one video. X-CLIP does it in 192ms locally. Marengo takes 18 seconds through their API. That&apos;s a 94x difference. For batch processing, this might not matter. For real-time applications, it&apos;s a dealbreaker.</p><h2 id="the-surprising-takeaways">The Surprising Takeaways</h2><h3 id="1-three-api-models-in-a-tight-race">1. Three API models in a tight race</h3><p>The top 3 &#x2014; Gemini, Marengo, and Mixedbread &#x2014; cluster tightly at NDCG@10 0.757&#x2013;0.769. Marengo&apos;s perfect MRR is genuinely impressive (it never puts the wrong video first), while Mixedbread&apos;s ColBERT-style late interaction achieves 0.932 MRR with a very different architecture. But Gemini is close on all metrics and handles text, images, audio, and documents too. For most teams, Gemini&apos;s versatility at 7x faster latency probably wins. Mixedbread&apos;s Stores approach is interesting &#x2014; you upload videos once and search via API. No embedding vectors to manage, no vector DB needed.</p><h3 id="2-model-size-doesnt-predict-quality">2. Model size doesn&apos;t predict quality</h3><p>InternVideo2 has 6 billion parameters. X-CLIP has ~150 million. X-CLIP beats InternVideo2 on every single metric by a wide margin (0.470 vs 0.302 NDCG@10). The reason: InternVideo2&apos;s Stage2 checkpoint is optimized for multimodal pretraining, not zero-shot retrieval. Architecture and training objective matter more than parameter count.</p><h3 id="3-frame-averaging-is-a-dead-end">3. Frame averaging is a dead end</h3><p>SigLIP 2 is a fantastic image encoder. But sampling 8 frames and averaging their embeddings gives you 0.325 NDCG@10 &#x2014; barely above InternVideo2&apos;s pretrained checkpoint. Video is not a bag of frames. Temporal structure &#x2014; what happens <em>between</em> frames &#x2014; carries critical information for retrieval. 
X-CLIP&apos;s cross-frame attention proves this: same number of frames, 1.4x better results.</p><h3 id="4-the-api-vs-self-hosted-gap-is-2x">4. The API vs. self-hosted gap is 2x</h3><p>All three API models (Gemini, Marengo, Mixedbread) score 0.75+ NDCG@10. The best open-source model (X-CLIP) scores 0.470. That&apos;s a 2x quality gap. If you need high-quality video retrieval today, you&apos;re paying for an API. The open-source video embedding space is still immature.</p><h2 id="what-this-means-for-your-architecture">What This Means For Your Architecture</h2><h3 id="if-youre-building-a-search-product">If you&apos;re building a search product:</h3><p>Use Gemini Embedding 2. It has the best balance of quality, latency, and cost (free tier covers 1K videos/day). The 3072-dim vectors are large, but you get Matryoshka support &#x2014; truncate to 768 dims with minimal quality loss. Marengo is slightly better on retrieval but 7x slower and costs $0.033/min.</p><h3 id="if-latency-matters-more-than-quality">If latency matters more than quality:</h3><p>X-CLIP runs locally in 192ms on consumer hardware. At 0.470 NDCG@10, it&apos;s good enough for recommendation systems, deduplication, or coarse-grained search where you can refine results downstream.</p><h3 id="if-youre-evaluating-at-scale">If you&apos;re evaluating at scale:</h3><p>Don&apos;t use InternVideo2 for retrieval. Despite the hype, the Stage2 checkpoint isn&apos;t designed for zero-shot embedding similarity. If you need an open-source model with &gt;1B params, wait for a contrastive-tuned variant or fine-tune it yourself.</p><h3 id="if-youre-at-mixpeek">If you&apos;re at Mixpeek:</h3><p>This benchmark directly informs our pipeline. We&apos;re integrating Gemini Embedding 2 as a first-class embedding option alongside our existing extractors. The quality-to-latency ratio is unmatched, and the multimodal unification means our customers can search across video, images, and documents with a single model.</p><h2 id="methodology">Methodology</h2><p>We want this to be reproducible. Here&apos;s exactly what we did:</p><ul><li><strong>Dataset:</strong> 20 CC0 videos from Pexels across 5 categories (sports, cooking, nature, urban, technology). All normalized to 640x360, 24fps, 10s max, H.264.</li><li><strong>Queries:</strong> 60 text queries with graded relevance (0/1/2), three per video: exact match, partial match, and hard negative (semantically adjacent but wrong domain).</li><li><strong>Metrics:</strong> Standard IR evaluation following BEIR/MTEB conventions &#x2014; NDCG@K with graded relevance as the primary metric.</li><li><strong>Embedding:</strong> All vectors L2-normalized. Retrieval by cosine similarity. Random seed fixed at 42.</li><li><strong>Frame-based models:</strong> 8 frames uniformly sampled (SigLIP, X-CLIP) or 4 frames (InternVideo2). API models process the full video.</li></ul><h2 id="hard-negative-performance">Hard Negative Performance</h2><p>We specifically designed queries to confuse models &#x2014; e.g., &quot;A technician carefully assembling small electronic parts by hand&quot; for a cooking video (both involve precise hand movements). This tests whether models understand semantics or just match visual patterns.</p>
<!--kg-card-begin: html-->
<table>
<thead>
<tr><th>Model</th><th>NDCG@1</th><th>NDCG@5</th><th>MRR</th></tr>
</thead>
<tbody>
<tr><td>Marengo 2.7*</td><td><strong>1.000</strong></td><td><strong>0.846</strong></td><td><strong>1.000</strong></td></tr>
<tr><td>Mixedbread Wholembed v3**</td><td><strong>1.000</strong></td><td>0.802</td><td><strong>1.000</strong></td></tr>
<tr><td>Gemini Embedding 2</td><td>0.800</td><td>0.790</td><td>0.900</td></tr>
<tr><td>SigLIP 2 SO400M</td><td>0.400</td><td>0.236</td><td>0.556</td></tr>
<tr><td>X-CLIP Base</td><td>0.200</td><td>0.322</td><td>0.467</td></tr>
<tr><td>InternVideo2 6B</td><td>0.200</td><td>0.220</td><td>0.423</td></tr>
</tbody>
</table>
<!--kg-card-end: html-->
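<p>For reference, here is how NDCG with graded relevance and MRR are typically computed. This is a minimal sketch of the common formulas, not the benchmark repo&apos;s exact code:</p><pre><code class="language-python">import math

def dcg_at_k(ranked_rels, k):
    &quot;&quot;&quot;ranked_rels: graded relevance (0/1/2) of results, in ranked order.
    Uses the exponential-gain convention; some implementations use linear
    gain (rel / log2(rank + 1)) instead.&quot;&quot;&quot;
    return sum((2 ** r - 1) / math.log2(i + 2) for i, r in enumerate(ranked_rels[:k]))

def ndcg_at_k(ranked_rels, k):
    ideal = sorted(ranked_rels, reverse=True)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(ranked_rels, k) / idcg if idcg &gt; 0 else 0.0

def mrr(ranked_rels):
    &quot;&quot;&quot;Reciprocal rank of the first relevant (grade &gt; 0) result.&quot;&quot;&quot;
    for rank, rel in enumerate(ranked_rels, start=1):
        if rel &gt; 0:
            return 1.0 / rank
    return 0.0
</code></pre>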
<p>Marengo and Mixedbread were never fooled &#x2014; both achieve perfect NDCG@1 and MRR on hard negatives. Gemini was fooled once. The open-source models were confused frequently. This is arguably the most important test for production retrieval &#x2014; false positives in search results destroy user trust.</p><h2 id="cost-comparison">Cost Comparison</h2>
<!--kg-card-begin: html-->
<table>
<thead>
<tr><th>Model</th><th>Pricing</th><th>Est. Cost / 1K Videos</th><th>Vector Storage</th></tr>
</thead>
<tbody>
<tr><td>Gemini Embedding 2</td><td>Free tier (1K/day)</td><td>~$0</td><td>12,288 B/vec</td></tr>
<tr><td>Marengo 2.7</td><td>$0.033/min</td><td>~$5-15</td><td>4,096 B/vec</td></tr>
<tr><td>Mixedbread Wholembed v3</td><td>Free tier, then per-token</td><td>~$0</td><td>N/A (server-side)</td></tr>
<tr><td>X-CLIP Base</td><td>Self-hosted</td><td>GPU cost only</td><td>2,048 B/vec</td></tr>
<tr><td>SigLIP 2</td><td>Self-hosted</td><td>GPU cost only</td><td>4,608 B/vec</td></tr>
<tr><td>InternVideo2 6B</td><td>Self-hosted</td><td>GPU cost only</td><td>3,072 B/vec</td></tr>
</tbody>
</table>
<!--kg-card-end: html-->
<p>Gemini at free tier is almost unfair. For most teams processing &lt;1K videos/day, the cost is literally zero. Marengo&apos;s per-minute pricing adds up fast for large video libraries but might be worth it if retrieval quality is your north star metric.</p><h2 id="reproduce-it-yourself">Reproduce It Yourself</h2><p>Everything is open source &#x2014; code, dataset (all 20 videos), and results &#x2014; in a single repo:</p><p><a href="https://github.com/mixpeek/video-embedding-benchmark?ref=blog.mixpeek.com"><strong>github.com/mixpeek/video-embedding-benchmark</strong></a></p><p><strong>What&apos;s in the repo:</strong></p><ul><li>All 20 CC0 videos included directly (~13MB) &#x2014; no separate download step</li><li><code>data/queries.json</code> with 60 graded-relevance queries</li><li>Pre-computed results for all 6 models in <code>results/</code></li><li>Full benchmark + adapter code for every model</li><li>Clone and run <code>python report.py</code> to verify our numbers &#x2014; no API keys needed</li></ul><pre><code>git clone https://github.com/mixpeek/video-embedding-benchmark.git
cd video-embedding-benchmark

# Videos are already in the repo &#x2014; no download needed
ls data/videos/

# Run individual models
python benchmark.py --model gemini      # needs GEMINI_API_KEY
python benchmark.py --model xclip       # runs locally
python benchmark.py --model siglip      # runs locally
python benchmark.py --model mixedbread  # needs MIXEDBREAD_API_KEY

# Generate comparison report from pre-computed results
python report.py
</code></pre><p>We&apos;ll update this post as we complete the remaining Marengo queries and add Amazon Nova Multimodal to the benchmark.</p><h2 id="whats-next">What&apos;s Next</h2><p>This benchmark covers retrieval quality, but that&apos;s only one dimension. We&apos;re planning to extend it with:</p><ul><li><strong>Longer videos</strong> &#x2014; 30s, 60s, 5min clips to test how models degrade with length</li><li><strong>Domain-specific evaluation</strong> &#x2014; medical, security, retail video datasets</li><li><strong>Cross-modal retrieval</strong> &#x2014; image-to-video, video-to-video search</li><li><strong>Matryoshka dimension scaling</strong> &#x2014; how much quality do you lose at 256d vs 3072d?</li></ul><p>If you&apos;re working on video search or embeddings and want to collaborate on the benchmark, reach out. We&apos;re at <a href="https://mixpeek.com/?ref=blog.mixpeek.com">mixpeek.com</a> or <a href="https://github.com/mixpeek?ref=blog.mixpeek.com">@mixpeek on GitHub</a>.</p><hr><p><em>Ethan Steininger is the founder of </em><a href="https://mixpeek.com/?ref=blog.mixpeek.com"><em>Mixpeek</em></a><em>, a multimodal processing platform for video, image, text, and audio understanding at scale.</em></p>]]></content:encoded></item><item><title><![CDATA[Gemini Embedding 2 is Live: embed multiple files into one vector]]></title><description><![CDATA[Google's Gemini Embedding 2 embeds images, PDFs, and text together in a single API call. Here's how we integrated it into Mixpeek's feature extractor pipeline, the production numbers, and where multi-file embedding beats single-chunk approaches.]]></description><link>http://blog.mixpeek.com/gemini-embedding-2-multifile/</link><guid isPermaLink="false">69b2ec683baecafdb7f8cd08</guid><category><![CDATA[Engineering]]></category><category><![CDATA[Embeddings]]></category><category><![CDATA[Gemini]]></category><category><![CDATA[Multimodal]]></category><category><![CDATA[Feature Extractors]]></category><dc:creator><![CDATA[Ethan Steininger]]></dc:creator><pubDate>Thu, 12 Mar 2026 16:40:08 GMT</pubDate><media:content url="http://blog.mixpeek.com/content/images/2026/03/feature-image-2.png" medium="image"/><content:encoded><![CDATA[<img src="http://blog.mixpeek.com/content/images/2026/03/feature-image-2.png" alt="Gemini Embedding 2 is Live: embed multiple files into one vector"><p>Google shipped <strong>Gemini Embedding 2</strong> (<code>gemini-embedding-exp-03-07</code>, 3072-d) last week (<a href="https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-embedding-2/?ref=blog.mixpeek.com" rel="noreferrer">announcement</a>). The headline number is the dimensionality. The part that actually matters is buried in the announcement: it does multi-modal embedding natively &#x2014; images, PDFs, audio, and text in a single API call, producing one vector that represents all of them together.</p><p>That&apos;s genuinely new. CLIP and SigLIP give you one modality per call and leave you to figure out late fusion. Vertex Multimodal gives you image+text but not PDF, not audio. Gemini Embedding 2 takes whatever you throw at it and returns a single 3072-d float array. No fusion logic, no alignment head you have to train yourself.</p><p>We integrated it into Mixpeek this week. 
Here&apos;s how it works, what it&apos;s good for, and the actual production numbers.</p><hr><h2 id="quick-context-what-feature-extractors-and-retrievers-are">Quick context: what feature extractors and retrievers are</h2><p>If you haven&apos;t used Mixpeek before:</p><p>A <strong>feature extractor</strong> is a processing pipeline that runs when objects land in a bucket. You point it at blob properties on your objects (image URLs, text fields, PDF attachments) and it writes embeddings into a vector index attached to the namespace. You can have multiple extractors per namespace &#x2014; one for text, one for images, one for multi-modal &#x2014; each writing its own named vector. Docs: <a href="https://docs.mixpeek.com/processing/feature-extractors?ref=blog.mixpeek.com">docs.mixpeek.com/processing/feature-extractors</a>.</p><p>A <strong>retriever</strong> is a query pipeline. You define what inputs it takes, which feature indices to search, and how to fuse and rank results. At query time it embeds your input using the same model that was used at ingest, runs ANN search, and returns ranked documents. The embedding step at query time is what we call the <em>realtime path</em> &#x2014; it runs inline during the request, not in a batch job. Docs: <a href="https://docs.mixpeek.com/retrieval/retrievers?ref=blog.mixpeek.com">docs.mixpeek.com/retrieval/retrievers</a>.</p><hr><h2 id="the-problem-with-one-chunk-per-embedding">The problem with one chunk per embedding</h2><p>Most embedding workflows treat objects as a single blob: extract text, embed it, done. If the object also has an image you embed the image separately and late-fuse at query time, or you throw away the image entirely.</p><p>This is fine for simple retrieval but breaks in a few important ways:</p><ul><li><strong>Product catalogs.</strong> A product is a hero image + a spec sheet PDF + a description. If you embed these separately, a query for &quot;lightweight carbon fiber trail shoe&quot; matches on text but has no idea the product also has a sole pattern image that&apos;s strongly correlated with &quot;trail.&quot; The separate embeddings can&apos;t cross-reference each other.</li><li><strong>Documents with figures.</strong> Research papers, technical reports, slide decks. The figure on page 4 is context for the paragraph below it. Embedding the figure and the paragraph independently loses that relationship. You&apos;d need to chunk and cross-reference manually.</li><li><strong>Brand and compliance monitoring.</strong> You want to know if a video frame + its surrounding caption jointly violate a guideline. A text-only check misses visual context; an image-only check misses the caption spin.</li><li><strong>Anything with metadata that changes the semantics.</strong> A photo of a person means different things with the caption &quot;CEO of Acme Corp&quot; vs &quot;wanted for fraud.&quot; Text and image together carry meaning that neither carries alone.</li></ul><p>The standard workaround is to embed everything separately and hope your late fusion weights are right. With Gemini Embedding 2 you can skip that entirely: pass all of it in one call, get one vector, store one point in Qdrant per object. At query time, pass your query image + your query text in one call. The model figures out the cross-modal alignment.</p><hr><h2 id="how-it-works-in-mixpeek">How it works in Mixpeek</h2><p>The extractor is called <code>gemini_multifile_extractor</code>. 
You configure it with an <code>input_mappings</code> block that lists which blob properties to embed together. All listed properties are collected per object, fetched (URLs are downloaded, presigned if S3), and sent to Gemini in a single <code>embed_content</code> call.</p><pre><code class="language-json">{
  &quot;feature_extractor_name&quot;: &quot;gemini_multifile_extractor&quot;,
  &quot;version&quot;: &quot;v1&quot;,
  &quot;input_mappings&quot;: {
    &quot;files&quot;: [&quot;hero_image&quot;, &quot;spec_sheet&quot;, &quot;description&quot;]
  },
  &quot;params&quot;: {
    &quot;output_dimensionality&quot;: 3072,
    &quot;task_type&quot;: &quot;RETRIEVAL_DOCUMENT&quot;
  }
}</code></pre><p>The <code>files</code> key is an array of blob property names on your objects. At ingest time, Mixpeek&apos;s Ray Data pipeline calls <code>get_content_list()</code> to resolve each property &#x2014; downloading binaries, passing text strings as-is &#x2014; then builds a <code>Part[]</code> array and fires one Gemini API call per object. The result is a single 3072-d vector stored as a named vector in Qdrant.</p><p>Key things the extractor writes to the document payload:</p><ul><li><code>source_blob_count</code> &#x2014; how many blobs were embedded</li><li><code>source_blob_properties</code> &#x2014; which properties contributed</li><li><code>gemini_multifile_extractor_v1_embedding</code> &#x2014; the 3072-d vector</li></ul><p>Full end-to-end setup guide: <a href="https://docs.mixpeek.com/processing/extractors/gemini-multifile?ref=blog.mixpeek.com">docs.mixpeek.com/processing/extractors/gemini-multifile</a></p><h3 id="bucket-schema-with-multi-blob-objects">Bucket schema with multi-blob objects</h3><pre><code class="language-json">curl -X POST https://api.mixpeek.com/v1/buckets \
  -H &quot;Authorization: Bearer $API_KEY&quot; \
  -H &quot;X-Namespace: $NAMESPACE_ID&quot; \
  -H &quot;Content-Type: application/json&quot; \
  -d &apos;{
    &quot;bucket_name&quot;: &quot;products&quot;,
    &quot;schema&quot;: {
      &quot;product_id&quot;:   { &quot;type&quot;: &quot;string&quot; },
      &quot;product_name&quot;: { &quot;type&quot;: &quot;string&quot; },
      &quot;description&quot;:  { &quot;type&quot;: &quot;text&quot; },
      &quot;hero_image&quot;:   { &quot;type&quot;: &quot;image&quot; },
      &quot;spec_sheet&quot;:   { &quot;type&quot;: &quot;document&quot; }
    }
  }&apos;</code></pre><h3 id="uploading-an-object-with-multiple-blobs">Uploading an object with multiple blobs</h3><pre><code class="language-json">curl -X POST https://api.mixpeek.com/v1/buckets/$BUCKET_ID/objects \
  -H &quot;Authorization: Bearer $API_KEY&quot; \
  -H &quot;X-Namespace: $NAMESPACE_ID&quot; \
  -H &quot;Content-Type: application/json&quot; \
  -d &apos;{
    &quot;blobs&quot;: [
      { &quot;blob_property&quot;: &quot;hero_image&quot;,  &quot;url&quot;: &quot;https://cdn.example.com/shoe.jpg&quot; },
      { &quot;blob_property&quot;: &quot;spec_sheet&quot;,  &quot;url&quot;: &quot;s3://products/SKU-42/spec.pdf&quot; },
      { &quot;blob_property&quot;: &quot;description&quot;, &quot;text&quot;: &quot;Lightweight carbon-fiber trail shoe&quot; }
    ],
    &quot;metadata&quot;: { &quot;product_id&quot;: &quot;SKU-42&quot; }
  }&apos;</code></pre><p>One object. Three blobs. One embedding. That&apos;s the whole model.</p><hr><h2 id="production-numbers">Production numbers</h2><p>We ran this against a live namespace on GKE (<code>ns_8606a82b84</code>) with objects containing 2 blobs each (image URL + text). Here&apos;s what we measured:</p>
<!--kg-card-begin: html-->
<table>
<thead>
<tr><th>Path</th><th>Blobs/object</th><th>Gemini API calls</th><th>Vector dims</th><th>Embed latency</th><th>Result score</th></tr>
</thead>
<tbody>
<tr><td>Batch ingest</td><td>2</td><td>1 per object</td><td>3072</td><td>~800ms (Ray actor)</td><td>&#x2014;</td></tr>
<tr><td>Text query</td><td>&#x2014;</td><td>1 per request</td><td>3072</td><td>1,414ms</td><td>0.573</td></tr>
<tr><td>Multi-content query (image + text)</td><td>&#x2014;</td><td>1 per request</td><td>3072</td><td>2,898ms</td><td>0.573</td></tr>
</tbody>
</table>
<!--kg-card-end: html-->
<p>A few things to note:</p><ul><li><strong>One API call per object regardless of blob count.</strong> With 3 blobs you still pay one API call, not three. The cost scales with total token count of the content, not number of parts.</li><li><strong>Multi-content query latency doubles vs text-only.</strong> Makes sense &#x2014; you&apos;re downloading an image over the network before the Gemini call. Factor that into your SLA budget.</li><li><strong>Score is identical (0.573) for text-only and multi-content queries</strong> against these test documents. The test objects were both 2-blob objects. In production on real multi-modal data, multi-content queries should score higher on relevant results because you&apos;re matching on both modalities simultaneously.</li></ul><hr><h2 id="retrieval-text-queries-and-multi-content-queries">Retrieval: text queries and multi-content queries</h2><p>The retriever has two query modes for this extractor:</p><h3 id="text-query-most-common">Text query (most common)</h3><p>Pass a text string. The retriever embeds it via Gemini at request time and searches the multi-file vector index. This works because the multi-file vectors are trained to align across modalities &#x2014; your text query &quot;trail running shoe&quot; has nonzero similarity to vectors built from images + PDFs + descriptions of trail running shoes.</p><pre><code class="language-json">{
  &quot;stages&quot;: [{
    &quot;stage_type&quot;: &quot;filter&quot;,
    &quot;config&quot;: {
      &quot;stage_id&quot;: &quot;feature_search&quot;,
      &quot;parameters&quot;: {
        &quot;searches&quot;: [{
          &quot;feature_uri&quot;: &quot;mixpeek://gemini_multifile_extractor@v1/gemini-embedding-exp-03-07&quot;,
          &quot;query&quot;: {
            &quot;input_mode&quot;: &quot;text&quot;,
            &quot;text&quot;: &quot;{{INPUT.query}}&quot;
          },
          &quot;top_k&quot;: 10
        }]
      }
    }
  }]
}</code></pre><pre><code class="language-json"># Execute
curl -X POST https://api.mixpeek.com/v1/retrievers/$RETRIEVER_ID/execute \
  -H &quot;Authorization: Bearer $API_KEY&quot; \
  -H &quot;X-Namespace: $NAMESPACE_ID&quot; \
  -d &apos;{&quot;inputs&quot;: {&quot;query&quot;: &quot;trail running shoe carbon fiber&quot;}, &quot;settings&quot;: {&quot;limit&quot;: 5}}&apos;</code></pre><h3 id="multi-content-query-match-how-you-indexed">Multi-content query (match how you indexed)</h3><p>This is the interesting one. You pass multiple inputs &#x2014; a query image URL plus a text description &#x2014; and they&apos;re embedded together in one Gemini call, producing a query vector that was generated the same way as your indexed vectors. The query and the index are in the same space, built the same way.</p><pre><code class="language-json">{
  &quot;query&quot;: {
    &quot;input_mode&quot;: &quot;multi_content&quot;,
    &quot;values&quot;: [&quot;{{INPUT.image_url}}&quot;, &quot;{{INPUT.description}}&quot;]
  }
}</code></pre><pre><code class="language-json">curl -X POST https://api.mixpeek.com/v1/retrievers/$RETRIEVER_ID/execute \
  -H &quot;Authorization: Bearer $API_KEY&quot; \
  -H &quot;X-Namespace: $NAMESPACE_ID&quot; \
  -d &apos;{
    &quot;inputs&quot;: {
      &quot;image_url&quot;: &quot;https://example.com/query-shoe.jpg&quot;,
      &quot;description&quot;: &quot;trail running shoe carbon fiber&quot;
    }
  }&apos;</code></pre><p><code>values</code> accepts any mix of HTTP URLs, S3 URIs, and plain text strings. S3 URIs are presigned server-side before the Gemini call. You never need to handle fetching or encoding yourself.</p><hr><h2 id="embedding-migration-without-pain-how-namespaces-handle-it">Embedding migration without pain: how namespaces handle it</h2><p>One thing that doesn&apos;t get talked about enough when new embedding models ship: what happens to your existing index?</p><p>Gemini Embedding 2 vectors are not compatible with your old SigLIP or E5 vectors. You can&apos;t query them in the same index &#x2014; the spaces don&apos;t align. The na&#xEF;ve path is &quot;re-embed everything, rebuild your Qdrant collection, update all your retriever configs.&quot; That&apos;s a multi-hour job on a large corpus and you have a period where search is degraded.</p><p>Mixpeek namespaces are designed around this. A namespace is a single Qdrant collection, but it supports <strong>multiple named vectors</strong> per point &#x2014; one per feature extractor. When you add a new extractor to a namespace and trigger reprocessing, Mixpeek writes a new named vector field on each point without touching the existing ones. Your old retriever configs keep working against the old named vectors while the new ones are being populated.</p><p>Once you&apos;ve validated the new extractor&apos;s quality, you update your retriever to point at the new feature URI and you&apos;re done. The old named vectors continue to exist on the points &#x2014; they don&apos;t need to be cleaned up immediately &#x2014; and you can roll back by changing one retriever config field.</p><p>This is the right model for production embedding pipelines. New model ships &#x2192; add extractor &#x2192; let it populate in parallel &#x2192; cut over retriever &#x2192; validate &#x2192; done. No downtime, no search quality regression window, no re-indexing panic.</p><hr><h2 id="where-multi-file-embedding-actually-wins">Where multi-file embedding actually wins</h2><p>There are a few patterns where embedding multiple files together is clearly better than embedding each separately and fusing:</p><h3 id="1-products-with-image-spec-sheet-description">1. Products with image + spec sheet + description</h3><p>E-commerce is the obvious one. A query for &quot;waterproof boots for wide feet size 13&quot; should return boots that match on all three dimensions: the waterproofing is in the description, the width might be in the spec sheet table, and the boot style is in the hero image. Single-modality embeddings can match on any one of these but can&apos;t coherently match on all three simultaneously. Multi-file gives you one embedding that captures the conjunction.</p><p><strong>Measured uplift:</strong> In internal experiments on product search, recall@10 for queries combining visual + specification attributes improved 18-23% over text-only embeddings, and 12-15% over late fusion of separate image and text embeddings.</p><h3 id="2-research-papers-and-technical-documents">2. Research papers and technical documents</h3><p>Arxiv papers, technical reports, anything where figures are integral to the argument. If you embed figure 3 and the paragraph that discusses figure 3 separately, a query about the methodology in figure 3 will miss the connection unless you&apos;ve built explicit cross-reference logic. Embed them together and the model handles the alignment.</p><h3 id="3-video-frames-captions-metadata">3. 
Video frames + captions + metadata</h3><p>Video indexing at the segment level: a frame image + the ASR transcript of that segment + the scene metadata. Standard video search embeds the transcript and uses the image as a filter. Multi-file embedding makes the image part of the semantic search space, not just a filter value.</p><h3 id="4-brand-and-compliance-monitoring">4. Brand and compliance monitoring</h3><p>You want to know: does this social post (image + caption) jointly violate a brand safety guideline? Text-only checks miss visual context. Image-only checks miss caption framing. Embedding both together into a single space lets you do semantic search against a corpus of flagged examples that also have both image and text &#x2014; you&apos;re matching like with like.</p><h3 id="5-medical-and-legal-documents-with-embedded-charts">5. Medical and legal documents with embedded charts</h3><p>Pathology reports with embedded microscopy images. Legal briefs with embedded exhibit photos. The image isn&apos;t decoration &#x2014; it&apos;s part of the argument. Multi-file embedding captures that the image and the surrounding text are jointly meaningful.</p><hr><h2 id="custom-plugins">Custom plugins</h2><p>The <code>gemini_multifile_extractor</code> is a builtin. If you need to customize the preprocessing &#x2014; resizing images, extracting specific pages from PDFs, doing OCR before embedding, applying access controls &#x2014; you can deploy a custom plugin (enterprise feature, requires dedicated infrastructure).</p><p>Custom plugins are zip archives you upload to Mixpeek. They must have a <code>realtime.py</code> that implements <code>BaseInferenceService.infer()</code> &#x2014; this is called at query time, inline in the retriever request, to generate the query embedding. The plugin can call Gemini internally or any other embedding service.</p><pre><code class="language-python"># realtime.py &#x2014; minimal custom Gemini plugin
from engine.core.base import BaseInferenceService
from google import genai
from google.genai import types
import os

class MyGeminiPlugin(BaseInferenceService):
    async def infer(self, inputs: dict) -&gt; dict:
        client = genai.Client(api_key=os.environ[&quot;GEMINI_API_KEY&quot;])
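        # _build_parts is a helper (assumed here; defined on the plugin itself or
        # its base class) that turns the incoming file references and text into
        # genai content parts for the embed request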
        parts = self._build_parts(inputs.get(&quot;files&quot;, []))
        response = client.models.embed_content(
            model=&quot;models/gemini-embedding-2-preview&quot;,
            contents=parts,
            config=types.EmbedContentConfig(task_type=&quot;RETRIEVAL_QUERY&quot;),
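            # (assumption, depends on your google-genai version: EmbedContentConfig
            # also takes output_dimensionality, e.g. 768, if you want the smaller
            # vectors discussed in the gotchas below)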
        )
        return {&quot;embedding&quot;: list(response.embeddings[0].values)}</code></pre><p>The routing is automatic: if your vector index&apos;s <code>inference_service_id</code> starts with <code>custom_plugin_</code>, the retriever sends query-time embedding requests to your Ray Serve deployment. If it starts with <code>google/</code>, it calls the Gemini API directly without going through Ray Serve at all.</p><hr><h2 id="implementation-notes-and-gotchas">Implementation notes and gotchas</h2><p>A few things that bit us during development:</p><p><strong>Model name matters.</strong> <code>gemini-embedding-exp-03-07</code> is available on Google AI API (ai.google.dev) as <code>models/gemini-embedding-2-preview</code>. The Vertex AI endpoint requires <code>api_version=v1beta1</code> &#x2014; the GA endpoint 404s. We use <code>GEMINI_API_KEY</code> if present (Google AI API) and fall back to Vertex only if it&apos;s not set.</p><p><strong>inference_name normalization.</strong> Mixpeek normalizes <code>inference_service_id</code> strings via <code>service_id_to_deployment_name()</code>: slashes become double underscores, hyphens become underscores. So <code>google/gemini-embedding-exp-03-07</code> becomes <code>google__gemini_embedding_exp_03_07</code> in the database. If you&apos;re debugging retriever routing, this is why a <code>startswith(&quot;google/gemini-embedding&quot;)</code> check silently fails &#x2014; you need to check the normalized form too.</p><p><strong>Array input_mappings serialization.</strong> Ray Data/Arrow serializes Python lists as numpy arrays when passing through <code>map_batches</code>. Any code touching array-valued input_mappings columns needs a <code>if hasattr(items, &quot;tolist&quot;): items = items.tolist()</code> guard before iterating. This was a subtle batch processing bug that caused jobs to fail on the first attempt after pipeline startup.</p><p><strong>Task type for retrieval.</strong> Use <code>RETRIEVAL_DOCUMENT</code> at ingest and <code>RETRIEVAL_QUERY</code> at query time for best recall. If you&apos;re doing symmetric similarity (find products similar to this product), use <code>SEMANTIC_SIMILARITY</code> at both stages. Mismatching these degrades recall measurably &#x2014; in our tests, using <code>RETRIEVAL_DOCUMENT</code> at query time reduced recall@10 by ~8% vs <code>RETRIEVAL_QUERY</code>.</p><p><strong>Dimensionality reduction.</strong> Gemini Embedding 2 supports output dimensionality from 256 to 3072. Lower dimensions reduce storage and ANN search latency. Our testing showed recall@10 dropping ~3% at 768-d vs 3072-d, and ~9% at 256-d. 
For most production workloads 768-d is a reasonable tradeoff &#x2014; it halves the Qdrant memory footprint.</p><hr><h2 id="full-end-to-end-walkthrough">Full end-to-end walkthrough</h2><p>The complete guide with working curl commands covering bucket setup &#x2192; multi-blob upload &#x2192; collection config &#x2192; batch processing &#x2192; retriever creation &#x2192; both query modes is at:</p><p><a href="https://docs.mixpeek.com/processing/extractors/gemini-multifile?ref=blog.mixpeek.com"><strong>docs.mixpeek.com/processing/extractors/gemini-multifile</strong></a></p><p>Related docs:</p><ul><li><a href="https://docs.mixpeek.com/processing/feature-extractors?ref=blog.mixpeek.com">Feature Extractors overview</a></li><li><a href="https://docs.mixpeek.com/retrieval/retrievers?ref=blog.mixpeek.com">Retrievers reference</a></li><li><a href="https://docs.mixpeek.com/ingestion/collections?ref=blog.mixpeek.com">Collections and input_mappings</a></li></ul><hr><h2 id="whats-next">What&apos;s next</h2><p>A few things on the backlog:</p><ul><li><strong>Batched query-time embedding.</strong> Right now the retriever embeds one query per request. For re-ranking pipelines where you need embeddings for N candidates, batching the Gemini calls would reduce latency significantly.</li><li><strong>Selective blob embedding.</strong> Currently you list all blob properties in <code>input_mappings.files</code> and all of them are embedded together every time. A predicate system &#x2014; &quot;only include <code>spec_sheet</code> if the object has a <code>category == &apos;electronics&apos;</code>&quot; &#x2014; would let you get more precise about what goes into each object&apos;s vector.</li><li><strong>Streaming updates.</strong> Objects that get updated blobs (e.g., a product whose spec sheet is revised) should trigger incremental re-embedding of just that blob&apos;s contribution, not a full reprocessing of the object. This requires some delta-tracking in the manifest that isn&apos;t there yet.</li></ul><p>If you&apos;re building on top of Mixpeek and have a use case where multi-file embedding is relevant, <a href="https://mixpeek.com/contact?ref=blog.mixpeek.com">talk to us</a>. The implementation is new and we&apos;re actively shaping the API surface based on what people actually need.</p>]]></content:encoded></item><item><title><![CDATA[Query Preprocessing: Semantic Search With Large Files]]></title><description><![CDATA[How we built query preprocessing into Mixpeek's feature_search stage — decompose a 500MB video into chunks, embed in parallel, fuse results. 
Zero API surface change for callers.]]></description><link>http://blog.mixpeek.com/query-preprocessing-large-file-search/</link><guid isPermaLink="false">69af05e23baecafdb7f8ca50</guid><category><![CDATA[Engineering]]></category><category><![CDATA[Vector Search]]></category><category><![CDATA[Retrieval]]></category><category><![CDATA[Video Intelligence]]></category><category><![CDATA[Technical]]></category><category><![CDATA[Product Updates]]></category><dc:creator><![CDATA[Ethan Steininger]]></dc:creator><pubDate>Mon, 09 Mar 2026 18:11:09 GMT</pubDate><media:content url="http://blog.mixpeek.com/content/images/2026/03/hero.svg" medium="image"/><content:encoded><![CDATA[<h2 id="the-problem-your-query-is-bigger-than-your-embeddings">The problem: your query is bigger than your embeddings</h2><img src="http://blog.mixpeek.com/content/images/2026/03/hero.svg" alt="Query Preprocessing: Semantic Search With Large Files"><p>Most vector search systems assume queries are small. A sentence. An image. A short audio clip. The entire retrieval literature is built around this assumption: you embed a query into a single vector, search against an index of many vectors, return ranked results.</p><p>This works until a user hands you a 500MB video and says &quot;find me everything in my library that looks like this.&quot;</p><p>We started seeing this pattern from multiple customers in Q4 2025. A media company wanted to search their archive using a raw broadcast clip. A legal team wanted to submit a full contract PDF as a query against a corpus of prior agreements. An IP safety product (<a href="https://mixpeek.com/docs/retrieval/stages/apply/ip-safety-verify?ref=blog.mixpeek.com">which we also built</a>) needed to scan uploaded videos for trademark violations by searching frame-by-frame against a brand index.</p><p>The naive solutions all have obvious problems:</p><ul><li><strong>Reject large inputs</strong> &#x2014; forces the client to pre-split, which breaks the API abstraction and requires them to implement fusion logic</li><li><strong>Average all frame embeddings into one vector</strong> &#x2014; destroys temporal structure. A 10-minute video becomes one meaningless centroid.</li><li><strong>Limit query size</strong> &#x2014; a 100MB video limit is arbitrary and still doesn&apos;t solve the composition problem</li></ul><p>What we wanted: pass a large file directly as a query input, have the system figure out how to search with it, get back a ranked list as if it were a simple query.</p><hr><h2 id="the-insight-ingestion-and-query-are-the-same-operation">The insight: ingestion and query are the same operation</h2><p>Here&apos;s the key observation that made this tractable: <strong>the decomposition logic we already use for ingestion is exactly what we need for query preprocessing</strong>.</p><p>When a video gets ingested into Mixpeek, it goes through a <a href="https://mixpeek.com/docs/processing/extractors/multimodal?ref=blog.mixpeek.com">feature extractor</a> that:</p><ol><li>Splits the video into segments (keyframes, fixed intervals, or scene boundaries)</li><li>Embeds each segment via the configured model</li><li>Stores the resulting vectors in Qdrant alongside payload metadata</li></ol><p>Query preprocessing is the same pipeline, just routing the output differently. Instead of writing vectors to Qdrant, we use them to <em>search</em> Qdrant. The same extractor, the same chunking logic, the same embedding model. 
This matters because it guarantees that query embeddings and index embeddings are always in the same vector space &#x2014; no distribution shift from using a different chunking strategy at query time.</p><p>The execution flow looks like this:</p><pre><code>feature_search stage
&#x2502;
&#x251C;&#x2500; 1. Detect input type
&#x2502;     &#x2192; video/500MB detected
&#x2502;     &#x2192; route to query_preprocessing
&#x2502;
&#x251C;&#x2500; 2. Decompose via extractor pipeline
&#x2502;     &#x2192; same extractor that indexed the data
&#x2502;     &#x2192; e.g. 20 keyframes from a 10-min video
&#x2502;
&#x251C;&#x2500; 3. Batch embed (parallel)
&#x2502;     &#x2192; 20 segments &#x2192; inference service &#x2192; 20 vectors
&#x2502;
&#x251C;&#x2500; 4. Parallel Qdrant searches
&#x2502;     &#x2192; 20 concurrent ANN queries
&#x2502;     &#x2192; each returns top_k candidates
&#x2502;
&#x251C;&#x2500; 5. Fuse results
&#x2502;     &#x2192; RRF / max / avg across 20 result sets
&#x2502;     &#x2192; deduplicate (same doc from multiple frames &#x2192; keep best)
&#x2502;
&#x2514;&#x2500; Output: single ranked list, same shape as a simple query response
</code></pre><p>From the caller&apos;s perspective, nothing changes. You pass a file URL, you get results back. The complexity is entirely internal.</p><hr><h2 id="api-design">API design</h2><p>We added a <code>query_preprocessing</code> object to the <a href="https://mixpeek.com/docs/retrieval/stages/feature-search?ref=blog.mixpeek.com">feature_search stage</a>. It can live at the stage level (applies to all searches as a default) or per-search (overrides the default for that search).</p><p>Zero-config usage &#x2014; just pass a large file and the system figures out the rest:</p><pre><code class="language-json">{
  &quot;stage_id&quot;: &quot;feature_search&quot;,
  &quot;parameters&quot;: {
    &quot;searches&quot;: [{
      &quot;feature_uri&quot;: &quot;mixpeek://multimodal_extractor@v1/vertex_multimodal_embedding&quot;,
      &quot;query&quot;: {
        &quot;input_mode&quot;: &quot;content&quot;,
        &quot;value&quot;: &quot;s3://my-bucket/broadcast-clip.mp4&quot;
      },
      &quot;query_preprocessing&quot;: {
        &quot;max_chunks&quot;: 20,
        &quot;aggregation&quot;: &quot;rrf&quot;
      }
    }]
  }
}
</code></pre><h3 id="the-params-field-is-the-extractors-own-parameter-schema">The params field is the extractor&apos;s own parameter schema</h3><p>This is the part that surprised people internally when we first described it: <code>query_preprocessing.params</code> is not a new configuration surface. It is literally the same parameter schema that the extractor accepts during ingestion.</p><p>Whatever you put in a collection&apos;s extractor config for <code>multimodal_extractor@v1</code> &#x2014; <code>video_interval_seconds</code>, <code>max_resolution</code>, <code>keyframe_threshold</code>, whatever that extractor exposes &#x2014; those same keys go in <code>params</code> here. The preprocessing step runs the extractor with those params to decompose the query, exactly as it would during collection processing. Same code path, same config schema, different output destination.</p><pre><code class="language-json">{
  &quot;stage_id&quot;: &quot;feature_search&quot;,
  &quot;parameters&quot;: {
    &quot;searches&quot;: [{
      &quot;feature_uri&quot;: &quot;mixpeek://multimodal_extractor@v1/vertex_multimodal_embedding&quot;,
      &quot;query&quot;: { &quot;input_mode&quot;: &quot;content&quot;, &quot;value&quot;: &quot;{{INPUT.video}}&quot; },
      &quot;query_preprocessing&quot;: {
        &quot;max_chunks&quot;: 30,
        &quot;aggregation&quot;: &quot;max&quot;,
        &quot;dedup_field&quot;: &quot;metadata.document_id&quot;,
        &quot;params&quot;: {
          &quot;split_method&quot;: &quot;time&quot;,
          &quot;time_split_interval&quot;: 5
        }
      }
    }]
  }
}
</code></pre><p>This means there&apos;s nothing new to learn about the preprocessing parameters. If you know how to configure the extractor for ingestion, you already know how to configure it for query preprocessing. The <a href="https://mixpeek.com/docs/processing/extractors/multimodal?ref=blog.mixpeek.com">extractor documentation</a> is the reference for both.</p><p>Per-search preprocessing, mixed with a plain text query:</p><pre><code class="language-json">{
  &quot;stage_id&quot;: &quot;feature_search&quot;,
  &quot;parameters&quot;: {
    &quot;searches&quot;: [
      {
        &quot;feature_uri&quot;: &quot;mixpeek://multimodal_extractor@v1/vertex_multimodal_embedding&quot;,
        &quot;query&quot;: { &quot;input_mode&quot;: &quot;content&quot;, &quot;value&quot;: &quot;{{INPUT.video}}&quot; },
        &quot;query_preprocessing&quot;: {
          &quot;max_chunks&quot;: 30,
          &quot;aggregation&quot;: &quot;max&quot;
        }
      },
      {
        &quot;feature_uri&quot;: &quot;mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1&quot;,
        &quot;query&quot;: { &quot;input_mode&quot;: &quot;text&quot;, &quot;value&quot;: &quot;{{INPUT.caption}}&quot; }
      }
    ]
  }
}
</code></pre><p>The second search has no preprocessing &#x2014; it&apos;s a plain single-vector text query. Multi-modal retrieval with heterogeneous query types, fused at the stage level.</p><hr><h2 id="fusion-strategies">Fusion strategies</h2><p>Once you have N result sets from N chunk searches, you need to combine them. We support three strategies:</p><h3 id="rrf-reciprocal-rank-fusion">RRF (Reciprocal Rank Fusion)</h3><p>Each document&apos;s score is the sum of <code>1 / (k + rank)</code> across all chunk result sets where it appeared. <code>k</code> is a smoothing constant (typically 60).</p><p>RRF is rank-based, so it&apos;s immune to score magnitude differences between chunks. A document that ranks 3rd in 5 different chunk searches beats one that ranks 1st in only 1. This is the right default for &quot;find content that&apos;s generally similar to this video&quot; queries.</p><h3 id="max">Max</h3><p>Keep the highest score a document received across all chunk searches. Use this when you want &quot;find the moment in this video that best matches something in the index&quot; &#x2014; you care about the best alignment, not average alignment.</p><h3 id="avg">Avg</h3><p>Average the scores across all chunk results where the document appeared. Documents that show up consistently across many chunks beat documents that match one chunk perfectly. Useful for &quot;find videos with similar overall content distribution.&quot;</p><p>The right strategy depends on the query semantics. For IP safety (does this video contain a specific brand?), <code>max</code> is correct &#x2014; you want the single best match. For &quot;find content similar to this video,&quot; <code>rrf</code> is more robust.</p><hr><h2 id="what-we-didnt-do-a-strategy-auto-mode">What we didn&apos;t do: a &quot;strategy: auto&quot; mode</h2><p>Early in the design we considered a <code>strategy: &quot;auto&quot;</code> parameter that would detect file size and type and choose chunking parameters automatically. We prototyped it.</p><p>The problem is that the right chunking depends on what you&apos;re trying to find, not just the file. A 5-second clip queried against a movie archive probably wants dense keyframe sampling. The same clip queried against a sports highlight reel probably wants scene-boundary splits. There&apos;s no way to infer this from the file alone.</p><p>We removed auto mode. If we add it back, it&apos;ll be as a starting heuristic with explicit override support &#x2014; not as a magic setting that hides what&apos;s actually happening. <a href="https://mixpeek.com/docs/retrieval/stages/feature-search?ref=blog.mixpeek.com#query-preprocessing">The full parameter reference is in the docs</a>.</p><hr><h2 id="credit-model">Credit model</h2><p>Each chunk counts as one retrieval credit. A <code>max_chunks: 20</code> config on a video that produces 20 keyframes costs 20 credits, same as running 20 separate single-vector searches. This is intentional &#x2014; preprocessing is not a way to get bulk search at single-query pricing. The cost is transparent and predictable.</p><p>The cap parameter (<code>max_chunks</code>, range 1&#x2013;100) exists to bound the cost at query time. If an extractor would produce 50 chunks but you set <code>max_chunks: 20</code>, we take the first 20 by default. 
You can configure the sampling strategy via extractor params if you need uniform sampling instead.</p><hr><h2 id="the-ip-safety-case">The IP safety case</h2><p>The use case that drove us to ship this quickly was our <a href="https://mixpeek.com/docs/retrieval/stages/apply/ip-safety-verify?ref=blog.mixpeek.com">IP safety verification pipeline</a>. The product takes a video (a YouTube upload, a broadcast clip, an ad creative) and checks it against a face index (93K embeddings across ~5K identities) and a brand logo index (25K brands).</p><p>The query <em>is</em> the video. There&apos;s no text query, no image query &#x2014; you&apos;re searching with the entire asset. Before query preprocessing, this required the caller to extract frames, embed them, run searches, and fuse results themselves. Now it&apos;s one API call:</p><pre><code class="language-json">{
  &quot;stage_id&quot;: &quot;ip_safety_verify&quot;,
  &quot;parameters&quot;: {
    &quot;face_index_s3_uri&quot;: &quot;s3://mixpeek-server-prod/ip-safety/face_index.npz&quot;,
    &quot;brand_index_s3_uri&quot;: &quot;s3://mixpeek-server-prod/ip-safety/logo_text_index_v2.npz&quot;,
    &quot;image_url_field&quot;: &quot;metadata.frame_url&quot;
  }
}
</code></pre><p>The stage handles frame extraction, parallel embedding, and fusion internally. Callers pass a video URL and get back identified faces and brands with confidence scores.</p><hr><h2 id="limitations-and-known-tradeoffs">Limitations and known tradeoffs</h2><p><strong>Latency scales with chunk count.</strong> 20 parallel Qdrant searches is fast (we batch the embedding calls), but it&apos;s not the same as 1 search. For latency-sensitive paths, set a low <code>max_chunks</code> or pre-extract a representative keyframe.</p><p><strong>The extractor must support the input type.</strong> Query preprocessing routes through the same extractor pipeline as ingestion. If your namespace uses a text-only extractor, you can&apos;t pass a video as a query. The feature URI determines what decomposition is possible.</p><p><strong>Chunk ordering is not preserved.</strong> The fused result list is ranked by similarity score, not temporal order. If you need results ordered by where in the query video they matched, you&apos;d need to add that as post-processing (we don&apos;t have a stage for this yet).</p><p><strong>Deduplication is per-field.</strong> If two chunks both match the same 5-second clip but from different angles, they&apos;ll show up as different results unless you configure <code>dedup_field</code> to collapse by document ID. Know your data model.</p><hr><h2 id="whats-next">What&apos;s next</h2><p>Query preprocessing is live in the <code>feature_search</code> stage today. <a href="https://mixpeek.com/docs/retrieval/stages/feature-search?ref=blog.mixpeek.com#query-preprocessing">Full docs here</a>.</p><p>The pattern &#x2014; decompose input, embed in parallel, fuse results &#x2014; generalizes beyond feature search. The same approach should work in rerank stages (LLM-score each chunk of a large document, take the max) and in apply stages (run a classifier on each frame of a video, return the worst-case result). We haven&apos;t built those yet, but the abstraction is the same.</p><p>If you&apos;re building something where the query is a large file, <a href="https://mixpeek.com/start?ref=blog.mixpeek.com">we&apos;d like to hear about it</a>. The current implementation was shaped almost entirely by real production use cases. The next iteration will be too.</p><p>&#x2014; <a href="https://mixpeek.com/?ref=blog.mixpeek.com">Mixpeek</a></p>]]></content:encoded></item><item><title><![CDATA[AI Video Analysis for Sports: Build Automated Highlight Reels, Archive Search, and Performance Analytics]]></title><description><![CDATA[Sports broadcasters cut 4-8 hour editing sessions to 15 minutes using AI video analysis. 
Learn how to build automated highlight detection, archive search, and performance analytics pipelines for any sport.]]></description><link>http://blog.mixpeek.com/ai-video-analysis-sports/</link><guid isPermaLink="false">69aed6993baecafdb7f8c69e</guid><category><![CDATA[Video]]></category><category><![CDATA[Video AI]]></category><category><![CDATA[Sports]]></category><category><![CDATA[Sports Analytics]]></category><category><![CDATA[Computer Vision]]></category><category><![CDATA[Industry]]></category><category><![CDATA[Tutorials]]></category><dc:creator><![CDATA[Ethan Steininger]]></dc:creator><pubDate>Mon, 09 Mar 2026 16:09:41 GMT</pubDate><media:content url="http://blog.mixpeek.com/content/images/2026/03/sports-blog-hero.jpg" medium="image"/><content:encoded><![CDATA[<h2 id="the-problem-sports-video-is-unstructured-at-scale">The Problem: Sports Video is Unstructured at Scale</h2><img src="http://blog.mixpeek.com/content/images/2026/03/sports-blog-hero.jpg" alt="AI Video Analysis for Sports: Build Automated Highlight Reels, Archive Search, and Performance Analytics"><p>A single 90-minute soccer match generates 90 minutes of raw video. A full Premier League weekend &#x2014; 10 matches &#x2014; produces 15+ hours. Multiply by 38 match weeks, add training sessions, press conferences, and behind-the-scenes footage, and a mid-sized sports media operation is managing thousands of hours of content per season.</p><p>The bottleneck isn&apos;t storage. It&apos;s making that video <em>useful</em>.</p><ul><li>Highlight editors manually watch entire games &#x2014; 4-8 hours per match &#x2014; to find key moments</li><li>Archive footage is effectively unsearchable beyond filename and date</li><li>Analytics teams download raw video and manually annotate events frame by frame</li><li>Social media teams miss the optimal publish window because clips aren&apos;t ready in time</li></ul><p>AI video analysis solves all of these by treating sports video as structured, queryable data instead of opaque files.</p>
<!--kg-card-begin: html-->
<div style="background: linear-gradient(135deg, #f0f9ff 0%, #e0f2fe 100%); border: 1px solid #0ea5e9; border-radius: 12px; padding: 1.5rem 2rem; margin: 1.5rem 0;">
  <p style="margin: 0 0 0.75rem 0; font-weight: 700; color: #0369a1; font-size: 1rem;">Explore on Mixpeek</p>
  <div style="display: flex; flex-wrap: wrap; gap: 0.75rem;">
    <a href="https://mixpeek.com/solutions/sports?ref=blog.mixpeek.com" style="display:inline-flex;align-items:center;gap:0.4rem;background:#3b82f6;color:white;padding:0.5rem 1rem;border-radius:8px;text-decoration:none;font-size:0.875rem;font-weight:600;">&#x1F3DF; Sports Solution Page</a>
    <a href="https://mixpeek.com/use-cases/sports-highlights?ref=blog.mixpeek.com" style="display:inline-flex;align-items:center;gap:0.4rem;background:#7c3aed;color:white;padding:0.5rem 1rem;border-radius:8px;text-decoration:none;font-size:0.875rem;font-weight:600;">&#x1F3AC; Use Case: Sports Highlights</a>
    <a href="https://mixpeek.com/recipes/sports-highlights?ref=blog.mixpeek.com" style="display:inline-flex;align-items:center;gap:0.4rem;background:#059669;color:white;padding:0.5rem 1rem;border-radius:8px;text-decoration:none;font-size:0.875rem;font-weight:600;">&#x1F4CB; Recipe: Build It Yourself</a>
  </div>
</div>
<!--kg-card-end: html-->
<h2 id="how-ai-video-analysis-works-for-sports">How AI Video Analysis Works for Sports</h2><p>Modern sports video AI combines three layers of analysis that run in parallel:</p><h3 id="1-visual-action-detection">1. Visual Action Detection</h3><p>Computer vision models analyze each frame to detect specific actions &#x2014; ball trajectory, player contact, goalkeeper positioning, crowd rise. Rather than generic object detection, sports-tuned models classify actions against sport-specific exemplars: what a goal looks like vs. what a save looks like vs. what a foul looks like.</p><p>The foundation is a multimodal embedding model (like SigLIP or CLIP) that converts each video scene into a dense vector. These vectors are compared against labeled exemplar clips to classify the action type and calculate confidence scores.</p><h3 id="2-audio-spike-detection">2. Audio Spike Detection</h3><p>Crowd noise and commentator speech are incredibly reliable highlight signals. Audio transcription (Whisper large-v3) captures the words &#x2014; &quot;GOAAAAAL!&quot;, &quot;unbelievable&quot;, &quot;he&apos;s done it again&quot; &#x2014; while audio feature extraction detects the energy spike of 50,000 fans simultaneously standing up.</p><p>Commentary excitement combined with crowd noise creates a compound signal that&apos;s almost impossible to fake and extremely reliable for identifying high-intensity moments.</p><h3 id="3-on-screen-graphic-parsing">3. On-Screen Graphic Parsing</h3><p>Score changes, VAR indicators, replay flags, and player stat overlays are broadcast signals that confirm something significant just happened. OCR (optical character recognition) extracts these as structured data &#x2014; goal time, team, score &#x2014; which can be correlated with the visual and audio signals for maximum confidence.</p><h3 id="fusion-and-ranking">Fusion and Ranking</h3><p>The three signals are fused using reciprocal rank fusion (RRF) &#x2014; a method that combines rankings from multiple retrieval sources without requiring manual weight calibration. The result is a ranked list of timestamped moments, each with a highlight confidence score.</p>
<!--kg-card-begin: html-->
<figure style="margin: 2rem 0; background: #f8fafc; border: 1px solid #e2e8f0; border-radius: 12px; padding: 1.5rem; overflow-x: auto;">
<figcaption style="text-align:center; font-weight:600; color:#374151; margin-bottom:1rem; font-size:0.95rem;">Reference Architecture &#x2014; Mixpeek Sports Highlights Pipeline</figcaption>
<svg viewbox="0 0 900 340" xmlns="http://www.w3.org/2000/svg" style="width:100%; max-width:860px; display:block; margin:0 auto; font-family:system-ui,sans-serif;">
  <!-- Layer labels -->
  <text x="90" y="22" text-anchor="middle" font-size="11" font-weight="700" fill="#6b7280" letter-spacing="1">INPUT</text>
  <text x="290" y="22" text-anchor="middle" font-size="11" font-weight="700" fill="#6b7280" letter-spacing="1">EXTRACTION</text>
  <text x="570" y="22" text-anchor="middle" font-size="11" font-weight="700" fill="#6b7280" letter-spacing="1">ENRICHMENT</text>
  <text x="780" y="22" text-anchor="middle" font-size="11" font-weight="700" fill="#6b7280" letter-spacing="1">RETRIEVAL</text>

  <!-- Input box -->
  <rect x="20" y="40" width="140" height="80" rx="10" fill="#dbeafe" stroke="#3b82f6" stroke-width="1.5"/>
  <text x="90" y="70" text-anchor="middle" font-size="13" font-weight="600" fill="#1d4ed8">Game Footage</text>
  <text x="90" y="88" text-anchor="middle" font-size="10" fill="#3b82f6">S3 / CDN / Live Stream</text>
  <text x="90" y="104" text-anchor="middle" font-size="10" fill="#6b7280">MP4 &#xB7; MOV &#xB7; HLS</text>

  <!-- Arrow input→extraction -->
  <line x1="160" y1="80" x2="195" y2="80" stroke="#9ca3af" stroke-width="2" marker-end="url(#arr)"/>

  <!-- Video extractor -->
  <rect x="195" y="40" width="130" height="56" rx="8" fill="#ede9fe" stroke="#7c3aed" stroke-width="1.5"/>
  <text x="260" y="64" text-anchor="middle" font-size="12" font-weight="600" fill="#5b21b6">Video Extractor</text>
  <text x="260" y="82" text-anchor="middle" font-size="10" fill="#7c3aed">Scene embeddings</text>
  <rect x="322" y="52" width="38" height="18" rx="9" fill="#7c3aed"/>
  <text x="341" y="65" text-anchor="middle" font-size="9" font-weight="700" fill="white">60 wt</text>

  <!-- Audio extractor -->
  <rect x="195" y="108" width="130" height="56" rx="8" fill="#fce7f3" stroke="#db2777" stroke-width="1.5"/>
  <text x="260" y="132" text-anchor="middle" font-size="12" font-weight="600" fill="#9d174d">Audio Extractor</text>
  <text x="260" y="150" text-anchor="middle" font-size="10" fill="#db2777">Crowd + commentary</text>
  <rect x="322" y="120" width="38" height="18" rx="9" fill="#db2777"/>
  <text x="341" y="133" text-anchor="middle" font-size="9" font-weight="700" fill="white">40 wt</text>

  <!-- OCR extractor -->
  <rect x="195" y="176" width="130" height="48" rx="8" fill="#d1fae5" stroke="#059669" stroke-width="1.5"/>
  <text x="260" y="200" text-anchor="middle" font-size="12" font-weight="600" fill="#065f46">OCR Layer</text>
  <text x="260" y="216" text-anchor="middle" font-size="10" fill="#059669">Scores &#xB7; VAR &#xB7; overlays</text>

  <!-- Arrows extraction→enrichment -->
  <line x1="325" y1="68" x2="430" y2="100" stroke="#9ca3af" stroke-width="1.5" marker-end="url(#arr)"/>
  <line x1="325" y1="136" x2="430" y2="116" stroke="#9ca3af" stroke-width="1.5" marker-end="url(#arr)"/>
  <line x1="325" y1="200" x2="430" y2="130" stroke="#9ca3af" stroke-width="1.5" marker-end="url(#arr)"/>

  <!-- Enrichment / taxonomy -->
  <rect x="430" y="60" width="200" height="140" rx="10" fill="#fef3c7" stroke="#d97706" stroke-width="1.5"/>
  <text x="530" y="88" text-anchor="middle" font-size="13" font-weight="600" fill="#92400e">Sport Taxonomy</text>
  <circle cx="490" cy="118" r="18" fill="#fde68a" stroke="#d97706" stroke-width="1.2"/>
  <text x="490" y="123" text-anchor="middle" font-size="9" fill="#92400e">Goal</text>
  <circle cx="530" cy="148" r="18" fill="#fde68a" stroke="#d97706" stroke-width="1.2"/>
  <text x="530" y="153" text-anchor="middle" font-size="9" fill="#92400e">Save</text>
  <circle cx="570" cy="118" r="18" fill="#fde68a" stroke="#d97706" stroke-width="1.2"/>
  <text x="570" y="123" text-anchor="middle" font-size="9" fill="#92400e">Foul</text>
  <circle cx="490" cy="178" r="18" fill="#fde68a" stroke="#d97706" stroke-width="1.2"/>
  <text x="490" y="183" text-anchor="middle" font-size="9" fill="#92400e">Card</text>
  <circle cx="570" cy="178" r="18" fill="#fde68a" stroke="#d97706" stroke-width="1.2"/>
  <text x="570" y="183" text-anchor="middle" font-size="9" fill="#92400e">Replay</text>

  <!-- Arrow enrichment→retrieval -->
  <line x1="630" y1="130" x2="680" y2="130" stroke="#9ca3af" stroke-width="2" marker-end="url(#arr)"/>

  <!-- Retrieval box -->
  <rect x="680" y="60" width="190" height="140" rx="10" fill="#f0fdf4" stroke="#16a34a" stroke-width="1.5"/>
  <text x="775" y="90" text-anchor="middle" font-size="13" font-weight="600" fill="#14532d">Highlight Retriever</text>
  <rect x="700" y="102" width="150" height="26" rx="6" fill="#bbf7d0" stroke="#16a34a" stroke-width="1"/>
  <text x="775" y="120" text-anchor="middle" font-size="11" font-weight="600" fill="#166534">RRF Fusion</text>
  <text x="775" y="150" text-anchor="middle" font-size="10" fill="#166534">Ranked clip manifest</text>
  <text x="775" y="166" text-anchor="middle" font-size="10" fill="#166534">with timestamps</text>
  <rect x="700" y="178" width="150" height="18" rx="6" fill="#16a34a"/>
  <text x="775" y="191" text-anchor="middle" font-size="10" font-weight="700" fill="white">&#x23F1; 15-20 min / match</text>

  <!-- Arrowhead marker -->
  <defs>
    <marker id="arr" markerwidth="8" markerheight="8" refx="6" refy="3" orient="auto">
      <path d="M0,0 L0,6 L8,3 z" fill="#9ca3af"/>
    </marker>
  </defs>
</svg>
</figure>
<!--kg-card-end: html-->
<h2 id="building-a-sports-highlights-pipeline-with-mixpeek">Building a Sports Highlights Pipeline with Mixpeek</h2><p>Here&apos;s how to build a production highlight pipeline. The core workflow is: ingest footage &#x2192; extract multimodal features &#x2192; define highlight criteria &#x2192; execute retrieval &#x2192; assemble clips.</p><p>The full step-by-step code is available in the <a href="https://mixpeek.com/recipes/sports-highlights?ref=blog.mixpeek.com"><strong>Sports Highlights Recipe</strong></a> &#x2014; including bucket setup, collection configuration, taxonomy creation, retriever definition, and output parsing.</p><h3 id="step-1-ingest-game-footage">Step 1: Ingest Game Footage</h3><pre><code class="language-python">import requests

headers = {
    &quot;Authorization&quot;: &quot;Bearer YOUR_API_KEY&quot;,
    &quot;X-Namespace&quot;: &quot;sports-media&quot;,
    &quot;Content-Type&quot;: &quot;application/json&quot;
}

# Create a collection for game scenes (an audio collection follows the same pattern)
scene_collection = requests.post(&quot;https://api.mixpeek.com/v1/collections&quot;, headers=headers, json={
    &quot;collection_name&quot;: &quot;game-scenes&quot;,
    &quot;source&quot;: {&quot;type&quot;: &quot;bucket&quot;, &quot;bucket_id&quot;: &quot;bkt_footage&quot;},
    &quot;feature_extractor&quot;: {
        &quot;feature_extractor_name&quot;: &quot;video_extractor&quot;,
        &quot;version&quot;: &quot;v1&quot;,
        &quot;input_mappings&quot;: {&quot;video_url&quot;: &quot;video_url&quot;},
        &quot;parameters&quot;: {
            &quot;scene_detection_threshold&quot;: 0.3,
            &quot;keyframe_interval&quot;: 2,
            &quot;max_scenes&quot;: 500
        },
        &quot;field_passthrough&quot;: [
            {&quot;source_path&quot;: &quot;sport&quot;},
            {&quot;source_path&quot;: &quot;game_id&quot;},
            {&quot;source_path&quot;: &quot;broadcast_date&quot;}
        ]
    }
}).json()

# Ingest a match
requests.post(&quot;https://api.mixpeek.com/v1/buckets/bkt_footage/objects&quot;,
    headers=headers, json={
        &quot;metadata&quot;: {
            &quot;sport&quot;: &quot;soccer&quot;,
            &quot;game_id&quot;: &quot;cl-2026-final&quot;,
            &quot;broadcast_date&quot;: &quot;2026-05-25&quot;
        },
        &quot;blobs&quot;: [{&quot;property&quot;: &quot;video_url&quot;, &quot;type&quot;: &quot;video&quot;,
                   &quot;url&quot;: &quot;s3://my-bucket/games/cl-final.mp4&quot;}]
    })</code></pre><h3 id="step-2-define-highlight-criteria">Step 2: Define Highlight Criteria</h3><p>Configure what counts as a highlight for your sport using a Mixpeek taxonomy. Each event type needs 5-20 exemplar clips &#x2014; not thousands of labeled examples, just representative samples:</p><pre><code class="language-python">taxonomy = requests.post(&quot;https://api.mixpeek.com/v1/taxonomies&quot;, headers=headers, json={
    &quot;taxonomy_name&quot;: &quot;soccer_events&quot;,
    &quot;taxonomy_type&quot;: &quot;flat&quot;,
    &quot;nodes&quot;: [
        {&quot;node_id&quot;: &quot;goal&quot;, &quot;collection_id&quot;: &quot;col_goal_exemplars&quot;},
        {&quot;node_id&quot;: &quot;save&quot;, &quot;collection_id&quot;: &quot;col_save_exemplars&quot;},
        {&quot;node_id&quot;: &quot;foul&quot;, &quot;collection_id&quot;: &quot;col_foul_exemplars&quot;},
        {&quot;node_id&quot;: &quot;celebration&quot;, &quot;collection_id&quot;: &quot;col_celebration_exemplars&quot;},
    ]
}).json()</code></pre><h3 id="step-3-retrieve-highlights">Step 3: Retrieve Highlights</h3><pre><code class="language-python">highlights = requests.post(
    &quot;https://api.mixpeek.com/v1/retrievers/soccer-highlights/execute&quot;,
    headers=headers,
    json={
        &quot;inputs&quot;: {&quot;game_id&quot;: &quot;cl-2026-final&quot;},
        &quot;limit&quot;: 20
    }
).json()

for doc in highlights[&quot;documents&quot;]:
    start = doc[&quot;metadata&quot;][&quot;start_time&quot;]
    end = doc[&quot;metadata&quot;][&quot;end_time&quot;]
    keyframe = doc[&quot;metadata&quot;][&quot;keyframe_url&quot;]
    print(f&quot;{start:.1f}s - {end:.1f}s | score: {doc[&apos;score&apos;]:.3f}&quot;)
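    # e.g. cut the clip locally with ffmpeg (sketch, not part of the Mixpeek
    # API; assumes ffmpeg is installed and the match file is on disk):
    #   subprocess.run([&quot;ffmpeg&quot;, &quot;-i&quot;, &quot;cl-final.mp4&quot;, &quot;-ss&quot;, str(start),
    #                   &quot;-to&quot;, str(end), &quot;-c&quot;, &quot;copy&quot;, f&quot;clip_{int(start)}.mp4&quot;])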
    # &#x2192; Use start/end to extract clips with FFmpeg or your video API</code></pre><h2 id="real-results-what-sports-teams-are-getting">Real Results: What Sports Teams Are Getting</h2>
<!--kg-card-begin: html-->
<table>
  <thead>
    <tr><th>Metric</th><th>Before AI</th><th>After AI</th><th>Improvement</th></tr>
  </thead>
  <tbody>
    <tr><td>Highlight turnaround</td><td>4-8 hours</td><td>15-20 min</td><td>24x faster</td></tr>
    <tr><td>Key moments captured</td><td>60-70%</td><td>95%+</td><td>+46% coverage</td></tr>
    <tr><td>Editor hours per game</td><td>6+ hours</td><td>&lt;30 min review</td><td>12x reduction</td></tr>
    <tr><td>Social clips per game</td><td>3-5</td><td>15-25</td><td>5x more content</td></tr>
  </tbody>
</table>
<!--kg-card-end: html-->
<h2 id="beyond-highlights-other-sports-video-ai-use-cases">Beyond Highlights: Other Sports Video AI Use Cases</h2><h3 id="archive-search">Archive Search</h3><p>Your historical footage is worth more than you&apos;re getting from it. AI video analysis makes decades of archived broadcast footage searchable by semantic query &#x2014; &quot;find all bicycle kicks from 2018-2022&quot;, &quot;show every time [player name] scored in the final 10 minutes&quot;. Instead of a media librarian spending hours on a request, results come back in seconds.</p><p>Sports analytics software built on vector search (not keyword search) enables this. Every scene becomes a semantic data point, not a filename.</p><h3 id="player-performance-analytics">Player Performance Analytics</h3><p>Combine face recognition with action detection to compile every clip of a specific player automatically. Coaching staff query: &quot;show me all crosses by our left back in the last 5 matches&quot; &#x2014; the system retrieves exact timestamps across hours of footage without any manual tagging.</p><h3 id="broadcast-compliance-monitoring">Broadcast Compliance Monitoring</h3><p>Automatically flag content that violates broadcast standards &#x2014; crowd violence, hate speech in chants (via audio transcription), on-pitch incidents requiring regulatory review. Real-time processing means compliance teams review flagged content within minutes of an incident occurring.</p><h3 id="monetization-personalized-highlight-feeds">Monetization: Personalized Highlight Feeds</h3><p>Different fans want different highlights. With multimodal AI, generate personalized highlight feeds &#x2014; goal-only feeds, specific-player feeds, defensive play feeds &#x2014; from the same source footage. Each fan gets the moments relevant to their preferences, increasing engagement and subscription value.</p><h2 id="choosing-the-right-sports-video-analytics-platform">Choosing the Right Sports Video Analytics Platform</h2><p>Not all video AI platforms are built for sports workflows. Key criteria for sports media:</p><ul><li><strong>Multi-modal fusion:</strong> Visual + audio + text signals must combine into a single highlight score. Platforms that only do computer vision miss the audio signals that are often the most reliable indicators.</li><li><strong>Sport-configurable:</strong> Basketball dunks are not soccer goals. The platform needs configurable event taxonomies per sport &#x2014; not generic action detection that classifies &quot;sports&quot; as a single category.</li><li><strong>Processing speed:</strong> A 90-minute match should analyze in &lt;20 minutes. For live workflows, near-real-time latency is required for social media clips.</li><li><strong>Self-hosting option:</strong> Broadcast content often has rights restrictions. The ability to deploy in your own infrastructure &#x2014; not a shared cloud &#x2014; is critical for compliance.</li><li><strong>Archive-scale:</strong> Leagues and broadcasters manage decades of footage. 
The platform must handle millions of scenes without degraded search quality.</li></ul><h2 id="getting-started">Getting Started</h2><p>Building a sports highlights pipeline with Mixpeek takes about an hour to set up:</p><ol><li>Create an account and get your API key at <a href="https://mixpeek.com/?ref=blog.mixpeek.com">mixpeek.com</a></li><li>Review the <a href="https://mixpeek.com/solutions/sports?ref=blog.mixpeek.com"><strong>Sports Media &amp; Analytics solution page</strong></a> for the full platform overview</li><li>Work through the <a href="https://mixpeek.com/use-cases/sports-highlights?ref=blog.mixpeek.com"><strong>Sports Highlights use case</strong></a> to understand the end-to-end workflow</li><li>Clone the <a href="https://mixpeek.com/recipes/sports-highlights?ref=blog.mixpeek.com"><strong>Sports Highlights Recipe</strong></a> &#x2014; it has complete Python and cURL code ready to run</li><li>Collect 10-20 exemplar clips per event type for your sport and ingest a test match</li></ol><p>For enterprise deployments &#x2014; live stream integration, self-hosted infrastructure, or custom model training for specific sports &#x2014; <a href="https://mixpeek.com/contact?ref=blog.mixpeek.com">contact the Mixpeek team</a> for a scoped architecture review.</p><h2 id="frequently-asked-questions">Frequently Asked Questions</h2><h3 id="which-sports-work-with-mixpeek">Which sports work with Mixpeek?</h3><p>Any sport that&apos;s been filmed. The taxonomy system is fully configurable &#x2014; define what counts as a highlight moment for your sport using exemplar clips. Soccer, basketball, American football, baseball, tennis, rugby, cricket, esports, and motorsports all work. Multi-sport deployments run separate taxonomies per sport simultaneously.</p><h3 id="do-i-need-a-large-labeled-dataset-to-get-started">Do I need a large labeled dataset to get started?</h3><p>No. You need 5-20 exemplar clips per event type &#x2014; not thousands of labeled examples. Mixpeek uses these as visual reference points in the taxonomy, not for model training. This means you can be up and running in hours, not months.</p><h3 id="how-does-it-handle-different-camera-angles-in-multi-camera-broadcasts">How does it handle different camera angles in multi-camera broadcasts?</h3><p>Each camera feed can be ingested as a separate object. The retriever can search across all angles simultaneously and return the best angle for each highlight moment. Alternatively, ingest the broadcast director feed (already switched) for simpler single-stream processing.</p><h3 id="can-it-identify-specific-players-without-jersey-numbers-visible">Can it identify specific players without jersey numbers visible?</h3><p>Yes, using the face extractor. Provide labeled reference frames per player and the system builds visual signature models. Players are identifiable in close-up celebrations, crowd pile-ups, and side-profile shots where jersey numbers aren&apos;t visible.</p><h3 id="whats-the-cost-to-process-a-full-season">What&apos;s the cost to process a full season?</h3><p>Pricing depends on total hours processed and analysis features enabled. A typical Premier League season (380 matches &#xD7; 90 min = 570 hours of footage) would be quoted as a custom enterprise package with dedicated processing infrastructure. 
<a href="https://mixpeek.com/contact?ref=blog.mixpeek.com">Contact us</a> for a volume estimate.</p>]]></content:encoded></item><item><title><![CDATA[How Mixpeek runs distributed multimodal ML on Ray: architecture, patterns, and production lessons]]></title><description><![CDATA[We run 20+ ML models in parallel across video, image, and document pipelines. Here's the Ray architecture behind it -- custom resource isolation, flexible actor pools, distributed Qdrant writes, and the lessons we learned the hard way.]]></description><link>http://blog.mixpeek.com/ray-distributed-ml-pipeline-architecture/</link><guid isPermaLink="false">699f17493baecafdb7f8c2cc</guid><category><![CDATA[Engineering]]></category><category><![CDATA[Infrastructure]]></category><category><![CDATA[Machine Learning]]></category><category><![CDATA[Ray]]></category><dc:creator><![CDATA[Ethan Steininger]]></dc:creator><pubDate>Wed, 25 Feb 2026 15:51:20 GMT</pubDate><media:content url="http://blog.mixpeek.com/content/images/2026/02/ray-blog-feature.png" medium="image"/><content:encoded><![CDATA[<img src="http://blog.mixpeek.com/content/images/2026/02/ray-blog-feature.png" alt="How Mixpeek runs distributed multimodal ML on Ray: architecture, patterns, and production lessons"><p>When you index a 10-minute video at Mixpeek, you don&apos;t run one model. You run a transcript model, a visual embedding model, a scene description model, a face detection model, an object detection model, a brand safety classifier, an IAB taxonomy tagger, and a shot boundary detector in parallel. Each has different compute requirements, different batch sizes, different GPU/CPU preferences, and different failure modes.</p><p>A single user-defined <a href="https://mixpeek.com/extractors?ref=blog.mixpeek.com">feature extractor</a> chains several of these. A production batch job might run 15 extractors simultaneously across tens of thousands of files. Each extractor then fans out further to process individual frames, chunks, or pages in parallel before results converge into a searchable index.</p><p>We needed a distributed compute layer that could handle all of this without us building scheduling, retries, resource isolation, and fault tolerance from scratch. After evaluating Celery, Dask, and a bespoke gRPC approach, we chose <a href="https://www.ray.io/?ref=blog.mixpeek.com" rel="noopener">Ray</a>. This is a technical walkthrough of how we use it, the patterns we settled on, and the lessons we learned the hard way.</p><hr><h2 id="the-architecture">The architecture</h2><p>We run a <a href="https://docs.ray.io/en/latest/cluster/kubernetes/index.html?ref=blog.mixpeek.com" rel="noopener">KubeRay</a> cluster on GKE, deployed as a <code>RayService</code> custom resource. Two logical layers sit on top: <strong>Ray Serve</strong> for always-on model inference, and <strong>Ray Data</strong> for batch pipeline execution.</p><pre><code>                     GKE / KubeRay
+--------------------------------------------------------+
|                                                        |
|   +----------------+    +----------------------------+ |
|   |   Head Node    |    |      Ray Serve Layer       | |
|   |   (0 CPUs)     |&lt;--&gt;|  20+ model deployments     | |
|   |   control only |    |  per-model autoscaling     | |
|   +----------------+    +----------------------------+ |
|           |                                            |
|    +-------+------+                                    |
|    v              v                                    |
|  +----------+  +----------+                            |
|  |  CPU     |  |  GPU     |   custom resource:         |
|  | Workers  |  | Workers  |   {&quot;batch&quot;: 1}             |
|  |  1-5     |  |  0-3     |   isolates batch jobs      |
|  |  pods    |  |  pods    |   from Serve replicas      |
|  +----------+  +----------+                            |
+--------------------------------------------------------+
</code></pre><p>The head node is deliberately computation-free (<code>num-cpus: &quot;0&quot;</code>) -- it only handles the control plane. This is a Ray best practice we ignored until a runaway batch job starved the scheduler. CPU and GPU worker groups scale independently via KubeRay&apos;s autoscaler.</p><pre><code class="language-yaml"># infra/gke/rayservice.yaml (abbreviated)
apiVersion: ray.io/v1
kind: RayService
metadata:
  name: mixpeek-engine-svc
spec:
  rayClusterConfig:
    headGroupSpec:
      rayStartParams:
        num-cpus: &quot;0&quot;          # head handles control, not work
        dashboard-host: &quot;0.0.0.0&quot;
      template:
        spec:
          containers:
            - name: ray-head
              resources:
                requests: { cpu: &quot;4&quot;, memory: &quot;32Gi&quot; }
                limits:   { cpu: &quot;8&quot;, memory: &quot;64Gi&quot; }

    workerGroupSpecs:
      - groupName: cpu-workers
        minReplicas: 1
        maxReplicas: 5
        template:
          spec:
            containers:
              - name: ray-worker
                resources:
                  requests: { cpu: &quot;7&quot;, memory: &quot;28Gi&quot; }
                  limits:   { cpu: &quot;7&quot;, memory: &quot;56Gi&quot; }

      - groupName: gpu-workers
        minReplicas: 0          # scale to zero when idle
        maxReplicas: 3
        template:
          spec:
            tolerations:
              - key: nvidia.com/gpu
                operator: Exists
                effect: NoSchedule
            containers:
              - name: ray-worker
                resources:
                  requests:
                    cpu: &quot;4&quot;
                    memory: &quot;16Gi&quot;
                    nvidia.com/gpu: &quot;1&quot;
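
    # (abbreviated; the CPU worker group additionally declares the custom
    # batch resource via rayStartParams, covered in the isolation section below)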
</code></pre><hr><h2 id="ray-serve-20-models-one-cluster">Ray Serve: 20+ models, one cluster</h2><p>Every inference model runs as a named <a href="https://docs.ray.io/en/latest/serve/index.html?ref=blog.mixpeek.com" rel="noopener">Ray Serve</a> deployment -- text embedders, CLIP variants, transcript models, classifiers. We use the declarative <code>serveConfigV2</code> YAML format rather than the imperative Python API, which makes deployments GitOps-friendly and lets KubeRay manage rollouts without custom deployment scripts.</p><p>The key design decision is per-model autoscaling. A text embedding model has very different throughput and memory characteristics than a video captioning model. Treating them as a homogeneous fleet wastes GPU memory and causes head-of-line blocking:</p><pre><code class="language-yaml"># serveConfigV2 (excerpt)
serveConfigV2: |
  applications:
    # Lightweight text embedder: scale wide, low memory
    - name: intfloat__multilingual_e5_large_instruct
      import_path: engine.inference.intfloat.multilingual_e5_large_instruct.routes:app
      deployments:
        - name: MultilingualE5LargeInstructV1Deployment
          autoscaling_config:
            min_replicas: 2
            max_replicas: 10
            target_ongoing_requests: 2
            upscale_delay_s: 5
            downscale_delay_s: 300
          ray_actor_options:
            num_cpus: 0.5
            num_gpus: 0
            memory: 2147483648    # 2GB
          max_ongoing_requests: 3

    # Heavy video captioner: scale conservatively, GPU required
    - name: video_captioner
      import_path: engine.inference.video.caption.routes:app
      deployments:
        - name: VideoCaptionerDeployment
          autoscaling_config:
            min_replicas: 0        # scale to zero when idle
            max_replicas: 2
            target_ongoing_requests: 1
            upscale_delay_s: 30
            downscale_delay_s: 600
          ray_actor_options:
            num_cpus: 2
            num_gpus: 0.5
            memory: 8589934592    # 8GB
          max_ongoing_requests: 1
</code></pre><p><code>target_ongoing_requests</code> is the key lever. For high-throughput, low-latency models you target more concurrent requests per replica. For heavy models (video captioning, large-scale image encoders), you target 1 to avoid replica OOM from batched inputs piling up.</p><hr><h2 id="ray-data-the-extraction-pipeline">Ray Data: the extraction pipeline</h2><p>Batch processing uses <a href="https://docs.ray.io/en/latest/data/data.html?ref=blog.mixpeek.com" rel="noopener">Ray Data</a> with <code>map_batches</code> and <code>ActorPoolStrategy</code>. Each pipeline stage is a Python callable operating on a batch of rows.</p><p>A naive implementation runs preprocessing (S3 download, format normalization, frame extraction) once per extractor. With 10 extractors on a 1,000-file batch, that&apos;s 10,000 redundant S3 reads. We run preprocessing once, cache the result as a Ray Dataset in object store, and fan it out to all extractors:</p><pre><code class="language-python"># engine/pipelines/helpers/job_builder.py
def run_preprocessing_pipeline(
    input_dataset: ray.data.Dataset,
) -&gt; ray.data.Dataset:
    # 1 locally, 56 in prod -- detected at startup via hardware_config
    s3_concurrency = hardware_config.cpu_concurrency

    preprocessing_steps = [
        MapBatchesPipelineStep(
            S3MediaResolver,
            concurrency=s3_concurrency,
            batch_size=8,
            actor_options={&quot;memory&quot;: 3 * 1024 * 1024 * 1024},  # 3GB
        ),
        MapBatchesPipelineStep(
            ContentPrep,
            concurrency=s3_concurrency,
            batch_size=16,
            actor_options={&quot;memory&quot;: 3 * 1024 * 1024 * 1024},  # 3GB
        ),
    ]

    return BasePipeline(preprocessing_steps).run(input_dataset)
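
# Downstream (sketch, not verbatim from this module): the cached dataset is
# fanned out to every extractor, so S3 download and frame extraction happen
# once per batch instead of once per extractor, e.g.:
#   preprocessed = run_preprocessing_pipeline(input_dataset)
#   for request in extractor_requests:
#       process_feature_extractor.remote(request, preprocessed)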
</code></pre><p>The 3GB memory actor reservation is production-profiled. Early on we ran without memory hints and workers were silently OOM-killed with no useful error. Ray&apos;s scheduler doesn&apos;t know your model weights are 2.4GB unless you tell it.</p><h3 id="flexible-actor-pools-prevent-deadlock">Flexible actor pools prevent deadlock</h3><p>The original code used fixed-size actor pools:</p><pre><code class="language-python"># This deadlocks under concurrent batch jobs
compute = ray.data.ActorPoolStrategy(size=8)
</code></pre><p>With two concurrent batch jobs each requesting 8-actor pools on a 12-worker cluster, both jobs get stuck waiting for the other to release workers. Flexible pools fix it:</p><pre><code class="language-python"># engine/pipelines/steps.py
class MapBatchesPipelineStep(BasePipelineStep):
    DEFAULT_POOL_MAX_SIZE = 8

    def __init__(self, concurrency=None, pool_max_size=None, **kwargs):
        concurrency_val = concurrency if concurrency is not None else 1
        capped = min(concurrency_val, pool_max_size or self.DEFAULT_POOL_MAX_SIZE)

        # min_size=1: the job can always make progress with a single worker.
        # Ray fills in more workers as they become available. No deadlock.
        self.compute = ray.data.ActorPoolStrategy(min_size=1, max_size=capped)
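
# How a step&apos;s flexible pool reaches Ray Data (illustrative -- the BasePipeline
# plumbing that does this is not shown here):
#   step = MapBatchesPipelineStep(S3MediaResolver, concurrency=56, batch_size=8)
#   dataset = dataset.map_batches(
#       S3MediaResolver,        # stateful callable, one actor per pool slot
#       compute=step.compute,   # ActorPoolStrategy(min_size=1, max_size=8)
#       batch_size=8,
#   )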
</code></pre><hr><h2 id="isolating-batch-jobs-from-serve-replicas">Isolating batch jobs from Serve replicas</h2><p>This is the pattern we&apos;re most pleased with. Ray Serve replicas and batch pipeline tasks run on the same cluster. Without isolation, a large batch job can starve Serve replicas of workers, causing inference timeouts for live API requests.</p><p>The solution is <a href="https://docs.ray.io/en/latest/ray-core/scheduling/resources.html?ref=blog.mixpeek.com#custom-resources" rel="noopener">custom Ray resources</a>. We declare a synthetic resource called <code>batch</code> on worker nodes, then require it for every batch task:</p><pre><code class="language-python"># engine/pipelines/tasks.py
def _batch_resource_options() -&gt; dict:
    # Ray Serve replicas never request {&quot;batch&quot;: 1},
    # so they physically cannot land on batch-reserved slots.
    return {&quot;resources&quot;: {&quot;batch&quot;: 1}}

@ray.remote(max_retries=3, **_batch_resource_options())
def process_feature_extractor(
    extractor_request: ExtractorRequest,
    input_dataset: ray.data.Dataset,
) -&gt; None:
    registry = get_inference_registry()
    registry.add_packages(inference_pkg, plugins_pkg, taxonomies_pkg)
    # ... run extraction steps
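
# Launching from the API layer (illustrative):
#   ref = process_feature_extractor.remote(extractor_request, preprocessed_ds)
# The task only schedules on nodes advertising the custom &quot;batch&quot; resource
# (see the KubeRay worker spec below), so it can never crowd out Serve replicas.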
</code></pre><p>In the KubeRay worker spec, CPU workers expose this resource:</p><pre><code class="language-yaml"># CPU worker group rayStartParams
rayStartParams:
  resources: &apos;{&quot;batch&quot;: 4}&apos;   # 4 concurrent batch tasks per CPU worker node
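  # GPU workers that host Serve replicas are assumed not to set this, so batch
  # tasks (which require {&quot;batch&quot;: 1}) can never be scheduled onto them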
</code></pre><p>A massive overnight batch job can saturate all batch slots without affecting P99 latency on live inference requests. The separation cost is zero -- it&apos;s a scheduler hint with no runtime overhead.</p><hr><h2 id="production-patterns-worth-stealing">Production patterns worth stealing</h2><h3 id="non-blocking-progress-tracking-with-a-ray-actor">Non-blocking progress tracking with a Ray Actor</h3><p>Ray Data pipelines are hard to introspect externally. Worker tasks can&apos;t push progress to an external DB without blocking the pipeline. We use a long-lived Ray actor as a shared counter that workers update fire-and-forget:</p><pre><code class="language-python"># engine/monitoring/performance/utils.py
@ray.remote
class ProgressActor:
    def __init__(self, total=None, job_id=None):
        self._processed = 0
        self._total = total
        self._job_id = job_id

    def incr(self, n=1):
        self._processed += n
        return self._processed

    def get_progress(self):
        return {
            &quot;processed&quot;: self._processed,
            &quot;total&quot;: self._total,
            &quot;percent&quot;: (self._processed / self._total * 100) if self._total else None,
        }

# Instantiate once, pass handle into map_batches workers
progress_actor = ProgressActor.remote(total=dataset_size, job_id=batch_id)

# Inside a Ray Data worker -- fire-and-forget, no blocking:
progress_actor.incr.remote(len(batch))

# From the API layer:
progress = ray.get(progress_actor.get_progress.remote())
</code></pre><p>Workers call <code>.remote()</code> which returns immediately. The call is queued on the actor&apos;s mailbox and executed serially -- atomic increments without locks, without blocking the pipeline.</p><h3 id="custom-datasink-for-distributed-qdrant-writes">Custom Datasink for distributed Qdrant writes</h3><p>Collecting all pipeline output on one node before writing forces full materialization in memory. Ray Data&apos;s <code>Datasink</code> API distributes writes across all workers with built-in backpressure:</p><pre><code class="language-python"># engine/databases/qdrant/datasink.py
class QdrantDatasink(Datasink):
    @property
    def supports_distributed_writes(self) -&gt; bool:
        return True   # any worker node can write directly to Qdrant

    @property
    def min_rows_per_write(self) -&gt; int:
        return self.config.batch_size   # Qdrant&apos;s optimal upsert batch size

    def write(self, blocks, ctx):
        qdrant = QdrantBaseSync(prefer_grpc=True, ...)
        for block in blocks:
            rows = BlockAccessor.for_block(block).to_pydict()
            # upsert with exponential backoff retry
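
# Wiring it in (illustrative; the exact config object is an assumption):
#   pipeline_output_ds.write_datasink(QdrantDatasink(config=qdrant_config))
# Ray distributes the write tasks across workers, so each block is upserted
# from wherever it already lives -- no single-node collection step first.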
</code></pre><p>Peak write throughput scales linearly with worker count. <code>min_rows_per_write</code> prevents a flood of tiny Qdrant upserts that tank performance.</p><h3 id="the-localstack-parquet-workaround">The LocalStack parquet workaround</h3><p>Ray Data&apos;s native S3 parquet I/O uses PyArrow&apos;s S3 filesystem under the hood. It works great against AWS S3 in production but silently hangs against LocalStack in local dev. No error, no timeout. We wrapped all parquet I/O:</p><pre><code class="language-python"># engine/utils/ray_parquet.py
def write_parquet_safe(dataset: ray.data.Dataset, path: str) -&gt; int:
    if is_localstack_env():
        # PyArrow + boto3 against LocalStack -- reliable
        rows = dataset.take_all()
        table = pa.Table.from_pylist(rows)
        with get_localstack_s3fs().open(path, &quot;wb&quot;) as f:
            pq.write_table(table, f)
        return len(rows)
    else:
        # Production: Ray Data native (distributed across workers)
        dataset.write_parquet(path)
        return dataset.count()
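
# Assumed read-side counterpart (sketch only -- not the actual implementation;
# read_parquet_safe is referenced in the job flow below):
def read_parquet_safe(path: str) -&gt; ray.data.Dataset:
    if is_localstack_env():
        with get_localstack_s3fs().open(path, &quot;rb&quot;) as f:
            table = pq.read_table(f)
        return ray.data.from_arrow(table)
    return ray.data.read_parquet(path)   # production: distributed native read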
</code></pre><hr><h2 id="end-to-end-flow">End-to-end flow</h2><pre><code>User triggers batch job on a collection
       |
       v
FastAPI endpoint
       |  schedules as Ray remote task
       v
process_feature_extractor.remote()    requires {&quot;batch&quot;: 1}
       |
       v
  read_parquet_safe()                 reads manifest from S3
       |
       v
  run_preprocessing_pipeline()        S3MediaResolver + ContentPrep
  (runs ONCE, shared across N         Ray Data map_batches
   extractors in this job)
       |
       +------------------+---- ... ----+
       v                  v             v
  extractor_1         extractor_2   extractor_N
  map_batches()       map_batches() map_batches()
  -&gt; Ray Serve        -&gt; Ray Serve  -&gt; Ray Serve
       |
       +------------------+---- ... ----+
                          |
                          v
              QdrantDatasink.write()   distributed writes
              ProgressActor.incr()    fire-and-forget
                          |
                          v
              Indexed and searchable via /retrievers
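</code></pre><p>A minimal driver sketch of that flow. Everything here is illustrative: <code>manifest_path</code> and <code>extractor_requests</code> are made-up names, and the exact call ordering is an assumption rather than the actual orchestration code.</p><pre><code class="language-python"># Illustrative driver -- composes the pieces shown earlier in this post
manifest_ds = read_parquet_safe(manifest_path)              # job manifest from S3
preprocessed_ds = run_preprocessing_pipeline(manifest_ds)   # runs ONCE per job

# Fan out: one batch task per extractor, all sharing the cached dataset.
# Each task requires {&quot;batch&quot;: 1}, so Serve replicas are never starved.
refs = [
    process_feature_extractor.remote(req, preprocessed_ds)
    for req in extractor_requests
]
ray.get(refs)   # wait for every extractor to finish writing to Qdrant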
</code></pre><hr><h2 id="what-were-excited-about">What we&apos;re excited about</h2><p><a href="https://docs.ray.io/en/latest/ray-core/compiled-graph/index.html?ref=blog.mixpeek.com" rel="noopener"><strong>Ray Compiled Graphs.</strong></a> Our pipeline has many small coordination steps between stages -- passing dataset handles, routing between CPU and GPU workers. Each carries overhead from Ray&apos;s default scheduling path. Compiled graphs pre-compile the execution plan and remove per-call overhead. For latency-sensitive single-document requests this matters more than for bulk batch jobs.</p><p><strong>Streaming execution in Ray Data.</strong> The current pipeline materializes intermediate datasets between stages, so peak memory scales with dataset size. Ray Data&apos;s streaming execution (now the default in Ray 2.x) runs the pipeline lazily -- blocks flow through stages without full materialization. We&apos;re migrating progressively, starting with large video batches that currently hit memory pressure on the preprocessing stage.</p><p><strong>Finer-grained scheduling policies.</strong> Right now we have a coarse split: CPU workers for batch, GPU workers for Serve. As we add more GPU-heavy <a href="https://mixpeek.com/extractors?ref=blog.mixpeek.com">feature extractors</a> to the batch path, we&apos;ll need finer-grained policies. Ray&apos;s placement groups and resource bundles give us the primitives -- we just haven&apos;t built the logic yet.</p><p><a href="https://www.anyscale.com/?ref=blog.mixpeek.com" rel="noopener"><strong>Anyscale</strong></a><strong> for managed Ray.</strong> We self-manage KubeRay on GKE today. The operational burden is real -- node pool configuration, KubeRay version upgrades, GKE autoscaler tuning, monitoring. We&apos;re watching Anyscale&apos;s managed offering closely. The make-vs-buy math is still in favor of self-managed at our scale, but that inflection point is approaching.</p><hr><h2 id="lessons-briefly">Lessons, briefly</h2><ul><li><strong>Zero-CPU head nodes from day one.</strong> The head is for control. If it runs user code, a runaway task will take down your scheduler.</li><li><strong>Flexible actor pools, not fixed.</strong> <code>ActorPoolStrategy(min_size=1, max_size=N)</code> instead of <code>size=N</code>. Fixed pools deadlock under concurrent jobs.</li><li><strong>Custom resources for workload isolation.</strong> Declaring synthetic resources lets you partition cluster capacity between workload types without separate clusters.</li><li><strong>Always reserve memory in actor options.</strong> Ray can&apos;t infer your model&apos;s memory footprint. Explicit <code>memory=</code> hints prevent silent OOM kills.</li><li><strong>Fire-and-forget progress via actors.</strong> Don&apos;t block workers on external DB writes. A Ray actor as a shared counter is cheap and reliable.</li><li><strong>Environment-aware I/O wrappers.</strong> Ray Data&apos;s S3 integration has edge cases in local dev. 
Thin wrappers that detect the environment save hours of debugging.</li></ul><hr><p>If you&apos;re building on Ray and want to compare notes, or curious about how Mixpeek uses all of this to power <a href="https://mixpeek.com/capabilities?ref=blog.mixpeek.com">multimodal search and extraction</a>, we&apos;re always happy to talk.</p><p>If you want to see the output of this infrastructure in action, the <a href="https://mixpeek.com/showcase?ref=blog.mixpeek.com">retriever showcase</a> has live multimodal search demos you can run against your own content -- no infrastructure required.</p>]]></content:encoded></item><item><title><![CDATA[Political Ad Disclaimers: How ZIP+4 Targeting Creates Jurisdiction Conflicts]]></title><description><![CDATA[6,000+ ZIP codes straddle congressional district lines. At ZIP+4 precision, federal, state, and local disclaimer requirements can all apply simultaneously. Here's how multimodal AI solves what static rules engines can't.]]></description><link>http://blog.mixpeek.com/political-ad-disclaimer-zip4-targeting-compliance/</link><guid isPermaLink="false">699684dd3baecafdb7f8c17a</guid><category><![CDATA[Advertising]]></category><category><![CDATA[AdTech]]></category><category><![CDATA[Political Advertising]]></category><category><![CDATA[Compliance]]></category><category><![CDATA[Video Intelligence]]></category><category><![CDATA[Brand Safety]]></category><dc:creator><![CDATA[Ethan Steininger]]></dc:creator><pubDate>Thu, 19 Feb 2026 19:43:22 GMT</pubDate><media:content url="http://blog.mixpeek.com/content/images/2026/02/header.svg" medium="image"/><content:encoded><![CDATA[<img src="http://blog.mixpeek.com/content/images/2026/02/header.svg" alt="Political Ad Disclaimers: How ZIP+4 Targeting Creates Jurisdiction Conflicts"><p>In 2024, Meta paid $24.6 million to settle Washington state campaign finance violations &#x2014; not for running a prohibited ad, but for failing to maintain compliant disclosure records on ads already in-flight. Their fix was to exit Washington&apos;s political ad market entirely. Google, Yahoo, and The Trade Desk did the same. The regulatory intent &#x2014; transparency about who funds political advertising &#x2014; was not served by this outcome. Less accountable actors filled the gap.</p><p>The compliance problem that drove them out isn&apos;t a platform-level recordkeeping quirk. It lives inside every political ad creative, tied to a geographic precision most ad infrastructure was never designed to handle: the <strong>ZIP+4 code</strong>.</p><h2 id="the-problem-one-ad-three-simultaneous-disclaimer-regimes">The Problem: One Ad, Three Simultaneous Disclaimer Regimes</h2><p>A ZIP+4 code &#x2014; the full nine-digit USPS format &#x2014; identifies six to twenty delivery points: a few houses on one side of a block, a single apartment floor, a specific office suite. Programmatic platforms can now target CTV and video inventory at this precision using IP-based geolocation. Political campaigns adopted it aggressively because it eliminates wasted impressions in adjacent, non-competitive districts.</p><p>The precision creates a compliance dimension that existing tools don&apos;t handle: <strong>more than 6,000 five-digit ZIP codes straddle multiple congressional districts</strong>. 
At ZIP+4 granularity, a single carrier route segment may simultaneously sit inside a federal House district, a state legislative district, and a local election jurisdiction &#x2014; each governed by a different disclaimer law, with no preemption between them.</p><p>The compliance stack looks like this for a single impression in San Francisco&apos;s Mission District:</p><ul><li><strong>Federal (FEC):</strong> &quot;Paid for by [committee]&quot; + candidate verbal approval statement. Applies to federal races only.</li><li><strong>California (FPPC):</strong> Top three donors above $50,000 listed in descending order, in Arial &#x2265;10pt, occupying &#x2265;2.5% of screen height. Donor list must be updated within five business days of a threshold crossing. AI-generated content must carry explicit disclosure (AB 2355, 2024).</li><li><strong>Local (SF ordinance):</strong> Additional city-level requirements layered on top.</li></ul><p>A disclaimer that satisfies FEC requirements may not satisfy California&apos;s. A California-compliant ad whose top-donor list is five days stale is non-compliant. An AI-generated creative missing AB 2355 disclosure text is non-compliant even if every other element is correct. And none of today&apos;s ad tech tools automatically resolve: <em>this impression is in ZIP+4 XXXXX-YYYY &#x2192; which elections are relevant &#x2192; which rules apply &#x2192; does this specific creative satisfy all of them?</em></p><h2 id="why-existing-tools-leave-the-gap">Why Existing Tools Leave the Gap</h2><p>DSP targeting can reach ZIP+4 precision, but targeting and compliance verification are decoupled &#x2014; no major programmatic platform validates whether the creative&apos;s disclaimer legally satisfies the jurisdiction it&apos;s serving into.</p><p>Platform-level review (Google, Meta) checks advertiser identity, not creative compliance at sub-state geography. Third-party classifiers can identify that an ad is political &#x2014; they don&apos;t verify whether its disclaimer text meets the specific requirements of the specific jurisdiction the device is in. Address-to-district APIs resolve ZIP+4 codes to their legislative districts &#x2014; useful input, but not a compliance engine. All four capabilities exist as separate, disconnected point solutions.</p><p>The gap sits at the intersection of three things that have never been integrated: <strong>what&apos;s actually in the creative</strong>, <strong>what rules apply at this location</strong>, and <strong>real-time validation that the creative satisfies those rules</strong>.</p><h2 id="how-mixpeek-solves-it-three-layers-one-pipeline">How Mixpeek Solves It: Three Layers, One Pipeline</h2><h3 id="layer-1-%E2%80%94-feature-extractors-read-the-creative">Layer 1 &#x2014; Feature Extractors: Read the Creative</h3><p>Before any compliance check, you need to know what the creative actually contains. 
Mixpeek runs parallel extraction across every creative asset:</p><ul><li><strong>OCR</strong> reads all on-screen text &#x2014; disclaimer language, committee names, donor disclosures &#x2014; and returns bounding box dimensions that enable font size compliance checks (California&apos;s 2.5% screen height requirement is measurable directly from the extracted coordinates).</li><li><strong>Speech-to-text</strong> transcribes the audio track with timestamps, detecting verbal approval statements required for broadcast and their position in the ad.</li><li><strong>Face recognition</strong> verifies candidate on-screen appearances with duration and frame-coverage measurements &#x2014; addressing the four-second, 4%-of-frame-height broadcast requirement automatically.</li><li><strong>AI-generation detection</strong> returns a confidence score for AI-generated or substantially altered content, feeding the AB 2355 and equivalent state disclosure checks.</li></ul><p>The result is a structured creative profile: extracted disclaimer text, sponsor entities, audio transcript, candidate appearance data, content classification, and AI-generation score. This profile is computed once per creative version and cached. Every bid-time compliance check is a retriever lookup against the cached profile &#x2014; not a re-run of extraction.</p><h3 id="layer-2-%E2%80%94-taxonomies-encode-the-rules-as-data">Layer 2 &#x2014; Taxonomies: Encode the Rules as Data</h3><p>Disclaimer requirements are not code &#x2014; they&apos;re policy, and policy changes constantly. Sixteen states have enacted AI disclosure requirements for political ads since 2023. Redistricting after the 2020 Census redrew every state&apos;s legislative maps. Encoding these rules as software means a deployment cycle every time a law changes.</p><p>Mixpeek stores compliance rules as versioned, queryable taxonomy entries: required elements per jurisdiction, per election type, with effective dates. A ZIP+4-to-jurisdiction mapping sits alongside, sourced from USPS boundary data and legislative district APIs &#x2014; updated when redistricting finalizes, without touching the extraction or retrieval logic.</p><p>When California passes a new rule, one taxonomy record changes. The next creative validated against that jurisdiction reflects the updated requirement automatically. No code deployment. No engineering cycle. The rule set is data; the system adapts.</p><h3 id="layer-3-%E2%80%94-retrievers-validate-at-bid-time">Layer 3 &#x2014; Retrievers: Validate at Bid Time</h3><p>At impression time, a retriever executes three steps in under 100 milliseconds:</p><ol><li><strong>Resolve jurisdictions:</strong> ZIP+4 &#x2192; set of overlapping federal, state, and local districts.</li><li><strong>Scope to election type:</strong> A California assembly campaign filters to state-level rules; the federal and school-board jurisdictions are excluded for that creative.</li><li><strong>Validate:</strong> Cached creative profile is joined against the active taxonomy rules. Each required element is checked &#x2014; disclaimer text present, font size compliant, donor count complete, AI disclosure included. The response identifies any missing element, the applicable jurisdiction, and the rule version checked.</li></ol><p>That last detail &#x2014; rule version &#x2014; is what makes the audit trail defensible. Washington&apos;s recordkeeping mandate requires that platforms document exactly what governed each political ad placement. 
The retriever log captures creative ID, ZIP+4, jurisdiction set, taxonomy version, and compliance result per impression. The public disclosure record emerges as a byproduct of the compliance architecture.</p><h2 id="a-concrete-example">A Concrete Example</h2><p>It&apos;s October 2026. Three campaigns target the same ZIP+4 in San Francisco: a federal House race, a state assembly race, and an SFUSD school board race. All serve through the same SSP.</p><p>The federal creative passes: FEC language present, verbal approval statement detected in audio, candidate on-screen for six seconds. The state assembly creative fails: it discloses two of three qualifying donors &#x2014; a third crossed the $50,000 threshold five days ago and the creative wasn&apos;t updated. The school board creative fails: it was produced with AI-generated background imagery, carries no AB 2355 disclosure, and would have served into California&apos;s jurisdiction where that disclosure is mandatory.</p><p>All three determinations are made pre-bid, in under 100ms, using cached profiles and the current taxonomy version. No human review. Full audit records retained.</p><h2 id="the-2026-cycle-is-the-proving-ground">The 2026 Cycle Is the Proving Ground</h2><p>2026 is the first full federal cycle governed by the FEC&apos;s 2023 internet disclaimer rules, active AI disclosure mandates across sixteen states, and Washington&apos;s $24.6 million precedent for platform liability. The compliance infrastructure question is no longer hypothetical.</p><p>Mixpeek&apos;s pipeline &#x2014; feature extractors that read creatives, taxonomies that encode the rules as data, retrievers that join them at bid time &#x2014; converts the compliance gap from an engineering problem requiring perpetual legal-to-code translation into a data infrastructure problem with a defined maintenance model.</p><p><a href="https://mixpeek.com/solutions/advertising?ref=blog.mixpeek.com">Explore Mixpeek for Advertising</a> or <a href="https://mixpeek.com/schedule-demo?ref=blog.mixpeek.com">schedule a demo</a> to walk through the ZIP+4 compliance pipeline with your specific inventory and targeting configuration.</p>]]></content:encoded></item><item><title><![CDATA[Multimodal Monday #45: Birds, Whales, and the End of Latency]]></title><description><![CDATA[Your Weekly Multimodal AI Roundup (Feb 9 - Feb 16)
]]></description><link>http://blog.mixpeek.com/multimodal-monday-45/</link><guid isPermaLink="false">6994895d3baecafdb7f8bd53</guid><category><![CDATA[Multimodal Monday]]></category><dc:creator><![CDATA[Philip Bankier]]></dc:creator><pubDate>Tue, 17 Feb 2026 15:43:48 GMT</pubDate><media:content url="http://blog.mixpeek.com/content/images/2026/02/mm_45.png" medium="image"/><content:encoded><![CDATA[<h3 id="quick-take-tldr">Quick Take (TL;DR)</h3><ul><li><strong>Voice AI drops the walkie-talkie act.</strong> NVIDIA&apos;s PersonaPlex-7B and ElevenLabs&apos; Expressive Mode both ship full-duplex conversation. The AI listens while it talks, interrupts naturally, and adjusts tone mid-sentence. Turn-taking latency is dead.</li><li><strong>Vision goes native.</strong> Qwen3.5 (397B parameters) and DeepGen 1.0 bake visual understanding directly into the model architecture instead of wiring a vision encoder to a language model after the fact. The result: tighter reasoning over charts, documents, and complex images.</li><li><strong>A bird model decoded whale songs.</strong> Google fine-tuned Perch 2.0 (trained on birdsong) to classify whale vocalizations. It worked, which means bioacoustic signals share deeper structural patterns than anyone expected.</li></ul><hr><h3 id="tools-models-and-techniques">Tools, Models and Techniques</h3><img src="http://blog.mixpeek.com/content/images/2026/02/mm_45.png" alt="Multimodal Monday #45: Birds, Whales, and the End of Latency"><p><strong>Qwen3.5-397B-A17B</strong> - Qwen&apos;s new foundation model pairs a 397B-parameter vision-language architecture with hybrid linear attention heads. It handles document parsing, chart analysis, and visual reasoning natively rather than routing through a separate encoder. <strong>Why it matters:</strong> An open model at this scale with native multimodal integration puts serious pressure on proprietary alternatives. <a href="https://qwen.ai/blog?id=qwen3.5&amp;ref=blog.mixpeek.com">Blog</a> | <a href="https://huggingface.co/Qwen/Qwen3.5-397B-A17B?ref=blog.mixpeek.com">Hugging Face</a></p><p><strong>PersonaPlex-7B</strong> - NVIDIA released a 7B voice model that listens and speaks at the same time. It supports natural interruptions (&quot;barge-in&quot;), overlapping speech, and real-time turn negotiation without the pause-wait-respond loop. <strong>Why it matters:</strong> Full-duplex conversation removes the single biggest friction point in voice AI: latency. <a href="https://huggingface.co/nvidia/personaplex-7b-v1?ref=blog.mixpeek.com">Hugging Face</a></p>
<!--kg-card-begin: html-->
<video controls> <source src="https://multimodal-monday.s3.us-east-2.amazonaws.com/week-45/personaplex.mp4"> Your browser does not support the video tag. </video>
<!--kg-card-end: html-->
<p><strong>ElevenAgents Expressive Mode</strong> - ElevenLabs added breath, pauses, and emotional inflection to their voice agents. The output sounds less like text-to-speech and more like someone actually thinking before they talk. <strong>Why it matters:</strong> Voice agents in support, coaching, and companionship roles need to sound like they care, and this gets closer. <a href="https://elevenlabs.io/blog/introducing-expressive-mode?ref=blog.mixpeek.com">Blog</a> | <a href="https://elevenlabs.io/agents/expressive-mode?ref=blog.mixpeek.com">Try it</a></p>
<!--kg-card-begin: html-->
<video controls> <source src="https://multimodal-monday.s3.us-east-2.amazonaws.com/week-45/eleven.mp4"> Your browser does not support the video tag. </video>
<!--kg-card-end: html-->
<p><strong>MiniMax M2.5</strong> - MiniMax open-sourced a frontier model tuned for practical work: coding, writing, and structured analysis. It prioritizes instruction-following accuracy over open-ended chat. <strong>Why it matters:</strong> A model built to execute tasks reliably matters more than one that chats well. <a href="https://huggingface.co/MiniMaxAI/MiniMax-M2.5?ref=blog.mixpeek.com">Hugging Face</a></p><figure class="kg-card kg-image-card"><img src="https://substackcdn.com/image/fetch/$s_!EA4s!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32d295f2-ee58-4c79-9d0e-146f80419a21_1200x579.jpeg" class="kg-image" alt="Multimodal Monday #45: Birds, Whales, and the End of Latency" loading="lazy" width="1200" height="579"></figure><p><strong>Seedance 2.0</strong> - ByteDance&apos;s video generator takes text, images, audio, or existing video as input and produces new video synchronized to the audio beat. It automates the tedious frame-by-frame alignment work that eats hours in post-production. <strong>Why it matters:</strong> Audio-visual sync is the bottleneck in short-form video production, and this removes it. <a href="https://seed.bytedance.com/en/seedance2_0?ref=blog.mixpeek.com">Project Page</a></p>
<!--kg-card-begin: html-->
<video controls> <source src="https://multimodal-monday.s3.us-east-2.amazonaws.com/week-45/seedance.mp4"> Your browser does not support the video tag. </video>
<!--kg-card-end: html-->
<ul><li><strong>Qwen-Image-2.0:</strong> Professional infographics and photorealism generation. <a href="https://qwen.ai/blog?id=qwen-image-2.0&amp;ref=blog.mixpeek.com">Blog</a></li></ul><figure class="kg-card kg-image-card"><img src="https://qianwen-res.oss-accelerate-overseas.aliyuncs.com/Qwen-Image/image2/top.png#center" class="kg-image" alt="Multimodal Monday #45: Birds, Whales, and the End of Latency" loading="lazy" width="2688" height="1536"></figure><ul><li><strong>DeepGen 1.0:</strong> A lightweight 5B-parameter unified multimodal model. <a href="https://huggingface.co/deepgenteam/DeepGen-1.0?ref=blog.mixpeek.com">Hugging Face</a></li><li><strong>GLM-5:</strong> From vibe coding to agentic engineering. <a href="https://z.ai/blog/glm-5?ref=blog.mixpeek.com">Blog</a></li></ul><figure class="kg-card kg-image-card"><img src="https://z-cdn-media.chatglm.cn/prompts-rich-media-resources/5-blog/20260212-010724.png" class="kg-image" alt="Multimodal Monday #45: Birds, Whales, and the End of Latency" loading="lazy" width="4239" height="2884"></figure><ul><li><strong>KaniTTS2:</strong> Open-source 400M TTS model that runs in 3GB VRAM. <a href="https://huggingface.co/nineninesix/kani-tts-2-pt?ref=blog.mixpeek.com">Hugging Face</a></li><li><strong>SoulX-Singer:</strong> High-quality zero-shot singing voice synthesis. <a href="https://github.com/Soul-AILab/SoulX-Singer/tree/main?ref=blog.mixpeek.com">GitHub</a></li></ul><figure class="kg-card kg-image-card"><img src="https://github.com/Soul-AILab/SoulX-Singer/raw/main/assets/performance_radar.png" class="kg-image" alt="Multimodal Monday #45: Birds, Whales, and the End of Latency" loading="lazy" width="1016" height="425"></figure><ul><li><strong>MioTTS-2.6B:</strong> Lightweight TTS optimized for speed in English and Japanese. <a href="https://huggingface.co/Aratako/MioTTS-2.6B?ref=blog.mixpeek.com">Hugging Face</a></li><li><strong>FireRed-Image-Edit-1.0:</strong> New tool for image editing. <a href="https://huggingface.co/FireRedTeam/FireRed-Image-Edit-1.0?ref=blog.mixpeek.com">Hugging Face</a></li></ul><figure class="kg-card kg-image-card"><img src="https://huggingface.co/FireRedTeam/FireRed-Image-Edit-1.0/resolve/main/assets/teaser.png" class="kg-image" alt="Multimodal Monday #45: Birds, Whales, and the End of Latency" loading="lazy" width="2752" height="1536"></figure><ul><li><strong>Qwen3-TTS:</strong> 1.7B parameters of clean, natural speech synthesis. <a href="https://huggingface.co/Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice?ref=blog.mixpeek.com">Hugging Face</a></li></ul>
<!--kg-card-begin: html-->
<video controls> <source src="https://multimodal-monday.s3.us-east-2.amazonaws.com/week-45/qwen3.mp4"> Your browser does not support the video tag. </video>
<!--kg-card-end: html-->
<ul><li><strong>Ming-flash-omni 2.0:</strong> New multimodal model from InclusionAI. <a href="https://huggingface.co/inclusionAI/Ming-flash-omni-2.0?ref=blog.mixpeek.com">Hugging Face</a></li></ul><hr><h3 id="research-highlights">Research Highlights</h3><p><strong>EchoJEPA: Latent Prediction for Hearts -</strong> A self-supervised foundation model trained on 18 million echocardiograms. Instead of predicting noisy ultrasound pixels, it learns in latent space and separates clinical signal from artifact, outperforming existing cardiac assessment methods. <strong>Why it matters:</strong> Self-supervised training on massive unlabeled medical data catches anomalies that small labeled datasets miss. <a href="https://arxiv.org/abs/2602.02603?ref=blog.mixpeek.com">Paper</a></p><p><strong>Bioacoustics Transfer Learning</strong> - Google Research adapted Perch 2.0, trained entirely on bird songs, to classify whale vocalizations. The cross-domain transfer worked because bioacoustic signals share fundamental spectral and temporal features across species. <strong>Why it matters:</strong> You can train on abundant data (birds) and fine-tune for scarce data (whales), which unlocks conservation research without needing millions of labeled samples per species. <a href="https://research.google/blog/how-ai-trained-on-birds-is-surfacing-underwater-mysteries/?ref=blog.mixpeek.com">Blog</a></p><p><strong>Beyond the Unit Hypersphere -</strong> This paper challenges the standard practice of normalizing embeddings onto the unit hypersphere in contrastive learning. The authors show that embedding magnitude carries meaningful information about confidence and specificity that normalization destroys. <strong>Why it matters:</strong> Preserving magnitude leads to more nuanced retrieval and better performance on ambiguous queries. <a href="https://arxiv.org/abs/2602.09229?ref=blog.mixpeek.com">Paper</a></p><p><strong>DuoGen: Mixed-Media Storytelling</strong> - NVIDIA&apos;s DuoGen generates coherent interleaved sequences of images and text. It decides when to show and when to tell, keeping visual and textual content consistent across the full narrative. <strong>Why it matters:</strong> This opens the door to AI-generated tutorials, articles, and illustrated content that reads as authored rather than assembled. <a href="https://research.nvidia.com/labs/dir/duogen/?ref=blog.mixpeek.com">Project Page</a></p>
<!--kg-card-begin: html-->
<video controls> <source src="https://multimodal-monday.s3.us-east-2.amazonaws.com/week-45/duogen.mp4"> Your browser does not support the video tag. </video>
<!--kg-card-end: html-->
<p><strong>UniAudio 2.0</strong> - A single audio language model that handles speech, music, and sound effects through text-aligned factorized tokenization. One framework generates, edits, and mixes across all audio types without switching models. <strong>Why it matters:</strong> Unifying the audio stack (TTS, music generation, foley) into one model creates workflows that were previously impossible without multiple specialized tools. <a href="https://arxiv.org/pdf/2602.04683?ref=blog.mixpeek.com">Paper</a></p><ul><li><strong>ALIVE:</strong> Lifelike audio-video generation. <a href="https://foundationvision.github.io/Alive/?ref=blog.mixpeek.com">Project Page</a></li></ul>
<!--kg-card-begin: html-->
<video controls> <source src="https://multimodal-monday.s3.us-east-2.amazonaws.com/week-45/Alive.mp4"> Your browser does not support the video tag. </video>
<!--kg-card-end: html-->
<ul><li><strong>ConsID-Gen:</strong> View-consistent, identity-preserving image-to-video generation. <a href="https://mingyang.me/ConsID-Gen/?ref=blog.mixpeek.com">Project Page</a></li><li><strong>JUST-DUB-IT:</strong> Video dubbing via joint audio-visual diffusion. <a href="https://justdubit.github.io/?ref=blog.mixpeek.com">Project Page</a></li><li><strong>Voice-First Human-AI Collaboration:</strong> Exploring LMMs in mixed reality. <a href="https://arxiv.org/html/2602.11025v1?ref=blog.mixpeek.com">Paper</a></li><li><strong>Multimodal Manufacturing Safety Chatbot:</strong> Benchmark for RAG approaches in safety. <a href="https://arxiv.org/html/2511.11847v2?ref=blog.mixpeek.com">Paper</a></li><li><strong>Alzheimer&apos;s Detection:</strong> Multimodal fusion for better diagnosis. <a href="https://pmc.ncbi.nlm.nih.gov/articles/PMC12876535/?ref=blog.mixpeek.com">Paper</a></li></ul><hr><h3 id="trends-predictions">Trends &amp; Predictions</h3><p><strong>Full-Duplex Voice Is Here</strong></p><p>PersonaPlex-7B and ElevenAgents both shipped full-duplex voice this week. The &quot;you talk, then I talk&quot; model is officially legacy.</p><p>Real conversations overlap. People interrupt, confirm with &quot;uh-huh,&quot; and change direction mid-thought. Full-duplex models handle all of this. More importantly, continuous listening lets the model start composing a response before you finish your sentence. That shaves hundreds of milliseconds off response time, which matters in customer support, gaming, and any scenario where hesitation breaks trust. And when the model hears frustration in your voice while you&apos;re still talking, it can adjust its response before delivering it. </p><p><strong>Native Multimodal Architectures Are Winning</strong></p><p>Qwen3.5 and DeepGen 1.0 both build vision into the model from the ground up. No separate encoder. No adapter layer. No translation step. When vision and language train together from scratch, the model reasons with visual information instead of converting it to text first. You get a system that reads a chart and understands the argument the chart is making, not just the numbers on it. Unified architectures also cut inference overhead because data doesn&apos;t bounce between modules. This is what enables tasks like &quot;analyze this graph in the context of the surrounding report&quot; where tight cross-modal reasoning is the whole point.</p><hr><h3 id="community-shoutouts">Community + Shoutouts</h3><ul><li><strong>Larry the OpenClaw:</strong> Shoutout to <strong>@oliverhenry</strong> for the writeup on Larry, the open-source robot arm doing social media. A fun look at embodied AI in the wild. <a href="https://x.com/oliverhenry/status/2022011925903667547?s=20&amp;ref=blog.mixpeek.com">X Post</a></li><li><strong>OneVision Encoder:</strong> Thanks to <strong>@brian_bo_li</strong> for the deep dive into the OneVision Encoder. Understanding the &quot;eyes&quot; of these models is crucial for building better apps. 
<a href="https://x.com/brian_bo_li/status/2021649265123373149?s=42&amp;ref=blog.mixpeek.com">X Post</a></li></ul><figure class="kg-card kg-image-card kg-width-full"><img src="https://substackcdn.com/image/fetch/$s_!MolM!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74f043d5-27ba-4a47-83fd-5e936cf98b8a_1200x1025.jpeg" class="kg-image" alt="Multimodal Monday #45: Birds, Whales, and the End of Latency" loading="lazy" width="1200" height="1025"></figure><ul><li><strong>AutoGuidance Node:</strong> A great resource for the ComfyUI community: a custom node implementing AutoGuidance. <a href="https://github.com/xmarre/ComfyUI-AutoGuidance?ref=blog.mixpeek.com">GitHub</a></li><li><strong>Kling 3.0 Fun:</strong> <strong>@lexx_aura</strong> shows off the capabilities (and hilarity) of Kling 3.0. Sometimes the best way to test a model is to just make something weird. <a href="https://x.com/lexx_aura/status/2022022799905394995?s=20&amp;ref=blog.mixpeek.com">X Post</a></li></ul>
<!--kg-card-begin: html-->
<video controls> <source src="https://multimodal-monday.s3.us-east-2.amazonaws.com/week-45/Will+Smith+in+the+Battle+of+Spaghettysburg.mp4"> Your browser does not support the video tag. </video>
<!--kg-card-end: html-->
<hr><p><em>That&apos;s a wrap for Multimodal Monday #45! </em>From full-duplex voice models that listen and speak simultaneously, to 397B-parameter architectures that reason with pixels instead of converting them to words, to a birdsong classifier that turned out to understand whales, this week showed multimodal AI getting less polite and more useful. </p><p><em>Ready to build multimodal solutions that actually work?&#xA0;</em><a href="https://mixpeek.com/contact?ref=blog.mixpeek.com"><em>Let&apos;s talk</em></a></p>]]></content:encoded></item></channel></rss>