Semantic Chunking Strategies for Multimodal RAG

How semantic chunking improves RAG quality by splitting content at natural boundaries rather than fixed token counts. Covers text, documents, video, and audio.
The quality of a RAG system depends more on how you chunk your data than on which embedding model you use. Poor chunking — splitting mid-sentence, breaking apart related paragraphs, or creating chunks that are too large to be specific — produces embeddings that do not accurately represent their content, leading to irrelevant retrieval results.

Semantic chunking solves this by splitting content at natural boundaries: topic shifts, paragraph breaks, scene changes, and structural markers. This post covers chunking strategies for text, documents, video, and audio.

Why Fixed-Size Chunking Fails

The most common chunking approach splits text every N tokens (typically 256-512) with some overlap. This is simple to implement but creates several problems:

  • Split sentences — A chunk boundary in the middle of a sentence produces two fragments, neither of which captures the complete idea
  • Mixed topics — A 512-token chunk might contain the end of one topic and the beginning of another, creating an embedding that represents neither well
  • Lost context — Important context like "the following table shows..." gets separated from the table it references
  • Redundant overlap — Overlap-based approaches duplicate content, inflating index size and retrieval noise
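
To see the failure mode concretely, here is a minimal fixed-size chunker (a sketch for illustration, not production code). Because it counts whitespace tokens and ignores punctuation, chunk boundaries routinely land mid-sentence:

```python
def fixed_size_chunks(text, chunk_size=256, overlap=32):
    """Naive fixed-size chunking over whitespace tokens with overlap."""
    tokens = text.split()
    step = chunk_size - overlap
    return [
        " ".join(tokens[start:start + chunk_size])
        for start in range(0, len(tokens), step)
    ]

text = ("Semantic chunking splits content at natural boundaries. "
        "Fixed-size chunking splits every N tokens regardless of meaning, "
        "so sentences are cut in half and topics are mixed together.")

# With a small chunk size, the cuts fall mid-sentence regardless of punctuation.
chunks = fixed_size_chunks(text, chunk_size=12, overlap=3)
```

Note also how the overlap duplicates tokens across neighboring chunks, which is exactly the index-inflation problem described above.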

Embedding-Based Semantic Chunking

The most effective text chunking strategy uses embeddings to detect topic boundaries:

  1. Split the document into sentences
  2. Compute embeddings for each sentence
  3. Calculate cosine similarity between consecutive sentence embeddings
  4. When similarity drops below a threshold, insert a chunk boundary
  5. Merge very short chunks with their neighbors

The threshold is typically set from a percentile of all consecutive similarities — the 25th percentile works well for most content, meaning boundaries are placed wherever similarity falls into the lowest 25% of scores, i.e. at the sharpest topic shifts.

import re

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_chunk(text, percentile_threshold=25):
    # Split into sentences, keeping terminal punctuation attached
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    if len(sentences) < 2:
        return sentences

    # Compute embeddings
    embeddings = model.encode(sentences)

    # Cosine similarity between consecutive sentence embeddings
    similarities = [
        np.dot(embeddings[i], embeddings[i + 1])
        / (np.linalg.norm(embeddings[i]) * np.linalg.norm(embeddings[i + 1]))
        for i in range(len(embeddings) - 1)
    ]

    # Boundaries go wherever similarity falls below the percentile threshold
    threshold = np.percentile(similarities, percentile_threshold)

    chunks = []
    current_chunk = [sentences[0]]

    for i, sim in enumerate(similarities):
        if sim < threshold:
            chunks.append(" ".join(current_chunk))
            current_chunk = [sentences[i + 1]]
        else:
            current_chunk.append(sentences[i + 1])

    chunks.append(" ".join(current_chunk))
    return chunks
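
Step 5 from the list above — merging very short chunks with their neighbors — is a simple post-processing pass. A minimal sketch using a word-count heuristic (the `min_words` value is an illustrative assumption, not a tuned constant):

```python
def merge_short_chunks(chunks, min_words=20):
    """Merge any chunk shorter than min_words into its predecessor.

    Merging backward preserves reading order; the very first chunk
    is kept as-is even if short, since it has no predecessor.
    """
    merged = []
    for chunk in chunks:
        if merged and len(chunk.split()) < min_words:
            merged[-1] = merged[-1] + " " + chunk
        else:
            merged.append(chunk)
    return merged
```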

Document Layout Chunking

For structured documents (PDFs, HTML, Markdown), layout analysis provides even better chunking signals than embedding similarity:

  • Headings — Each heading starts a new chunk. Nested headings create hierarchical chunks.
  • Tables — Tables are kept as complete chunks, never split across boundaries.
  • Lists — Bulleted/numbered lists are kept together with their introductory text.
  • Code blocks — Code examples are preserved as atomic units.
  • Page boundaries — In PDFs, page breaks are natural (though not always semantic) boundaries.

The best approach combines structural signals with semantic analysis: use document structure as primary split points, then apply embedding-based splitting within large sections.
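
The structure-first step can be sketched for Markdown, where ATX headings serve as primary split points; this is a minimal illustration, and oversized sections would then be handed to embedding-based splitting such as the semantic_chunk function above:

```python
import re

def split_by_headings(markdown_text):
    """Split a Markdown document into sections at ATX heading lines.

    Each section keeps its heading as the first line, so the hierarchy
    level can be recovered later from the leading '#' count.
    """
    sections = []
    current = []
    for line in markdown_text.splitlines():
        # A heading starts a new section, unless it is the very first line
        if re.match(r"^#{1,6}\s", line) and current:
            sections.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current))
    return sections
```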

Video Chunking: Scene-Based Segmentation

For video content, the equivalent of semantic chunking is scene-based segmentation. Instead of splitting every N seconds, detect visual and audio boundaries:

  • Visual scene changes — Camera cuts, transitions, and significant visual shifts
  • Speaker changes — When a different person starts speaking (via diarization)
  • Topic shifts — Detected from the transcript using the same embedding-based approach as text
  • Silence boundaries — Pauses in audio often correspond to topic transitions

Each scene becomes a retrieval unit with its own embedding, transcript chunk, and metadata. This enables frame-accurate search results rather than returning entire videos.
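
In practice, visual scene detection is usually delegated to a library such as PySceneDetect, but the core idea — thresholding frame-to-frame difference scores — can be sketched in a few lines. The inputs here are hypothetical, pre-computed difference scores, and the threshold values are illustrative:

```python
def scene_boundaries(frame_diffs, threshold=0.5, min_scene_len=5):
    """Return frame indices where a new scene starts.

    frame_diffs[i] is a 0-1 score of how different frame i is from
    frame i-1 (e.g. normalized mean absolute pixel difference).
    min_scene_len suppresses cuts too close to the previous one.
    """
    boundaries = [0]
    for i, diff in enumerate(frame_diffs):
        if diff > threshold and i - boundaries[-1] >= min_scene_len:
            boundaries.append(i)
    return boundaries
```

Each (boundary, next boundary) pair then defines one retrieval unit to embed alongside its transcript slice and metadata.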

Audio Chunking

Audio content (podcasts, calls, meetings) benefits from transcript-based semantic chunking combined with audio signals:

  • Speaker turns — Split when the speaker changes
  • Topic boundaries — Semantic chunking on the transcript
  • Silence detection — Pauses longer than a threshold indicate segment boundaries
  • Music/jingle detection — In podcasts, musical interludes separate segments
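
The silence-detection signal can be sketched over raw PCM samples represented as floats; the 100 ms window and RMS threshold below are illustrative assumptions, not tuned values:

```python
import math

def split_on_silence(samples, rate=16000, window_ms=100, silence_rms=0.01):
    """Split audio into voiced (start, end) sample ranges at quiet windows.

    Computes RMS energy per fixed window; windows below silence_rms
    end the current segment, windows above it start a new one.
    """
    win = max(1, int(rate * window_ms / 1000))
    segments = []
    seg_start = None
    for start in range(0, len(samples), win):
        window = samples[start:start + win]
        rms = math.sqrt(sum(s * s for s in window) / len(window))
        if rms >= silence_rms:
            if seg_start is None:
                seg_start = start  # voiced region begins
        elif seg_start is not None:
            segments.append((seg_start, start))  # silence ends the segment
            seg_start = None
    if seg_start is not None:
        segments.append((seg_start, len(samples)))
    return segments
```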

Chunk Size Guidelines

Content Type | Target Chunk Size | Rationale
--- | --- | ---
General text | 200-500 tokens | Balances specificity and context
Technical docs | 300-800 tokens | Preserves code examples and explanations
FAQs | 1 Q&A pair per chunk | Each pair is a complete retrieval unit
Video scenes | 10-60 seconds | Long enough for context, short enough for relevance
Audio segments | 30-120 seconds | Aligns with natural speech patterns
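
These targets can be enforced as a post-processing pass. A sketch using word counts as a rough stand-in for tokens (a real pipeline would count tokens with the embedding model's own tokenizer):

```python
def enforce_max_size(chunks, max_words=500):
    """Split any chunk exceeding max_words into consecutive pieces."""
    sized = []
    for chunk in chunks:
        words = chunk.split()
        if len(words) <= max_words:
            sized.append(chunk)
        else:
            for i in range(0, len(words), max_words):
                sized.append(" ".join(words[i:i + max_words]))
    return sized
```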

Measuring Chunking Quality

Evaluate your chunking strategy by measuring downstream retrieval quality:

  • Chunk coherence — Do chunks contain complete, self-contained ideas?
  • Retrieval recall — When you search for a known answer, does the relevant chunk appear in the top results?
  • Answer quality — Does the LLM produce better answers when using semantically chunked context vs. fixed-size chunks?

In our testing, semantic chunking typically improves retrieval recall by 15-25% compared to fixed-size chunking, with the largest gains on long documents with multiple topics.
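
Retrieval recall can be measured with a small harness; a sketch, assuming each evaluation query is labeled with the id of the single chunk known to answer it:

```python
def recall_at_k(results_per_query, relevant_ids, k=5):
    """Fraction of queries whose labeled chunk appears in the top-k results.

    results_per_query: one ranked list of chunk ids per query.
    relevant_ids: the labeled relevant chunk id for each query.
    """
    hits = sum(
        1 for ranked, rel in zip(results_per_query, relevant_ids)
        if rel in ranked[:k]
    )
    return hits / len(relevant_ids)
```

Running the same labeled query set against a fixed-size index and a semantically chunked index makes the comparison above directly measurable.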

Learn more about chunking in our glossary entry on semantic chunking, or explore document understanding for more on layout-aware processing.

About the author
Ethan Steininger

Former lead of MongoDB's Search Team, Ethan noticed the most common problem customers faced was building indexing and search infrastructure on their S3 buckets. Mixpeek was born.

Mixpeek Engineering Blog

Deep dive into multimodal AI, data processing, and best practices from our engineering team.
