Semantic Video Chunking: Scene Detection

Intelligent video chunking using scene detection and vector embeddings. This tutorial covers how to break down videos into semantic scenes, generate embeddings, and enable powerful semantic search capabilities.

Just as RAG systems break down text documents into meaningful chunks for better processing and retrieval, video content benefits from intelligent segmentation through scene detection. This approach parallels text tokenization in several crucial ways:

Semantic Coherence

Scene detection identifies natural boundaries in video content, maintaining semantic completeness just as text chunking preserves sentence or paragraph integrity. Each scene represents a complete "thought" or action sequence rather than an arbitrary time-based split. For example, in a cooking tutorial:

[Figure: Scene-Based Video Chunking with Vector Embeddings — a cooking tutorial timeline (0:00–7:00) split into four scenes (Ingredient Prep, Mixing, Cooking, Plating), each mapped to its own embedding vector.]

Retrieval Precision

Scene-based chunks enable precise content retrieval. Instead of returning entire videos, systems can identify and serve the exact relevant scene, similar to how RAG systems return specific text passages rather than complete documents.

Vector Embedding Quality

Scene-based chunking produces higher quality embeddings because:

  1. Each embedding represents a coherent visual concept
  2. The embeddings aren't "confused" by mixing multiple scenes
  3. The semantic space remains clean, enabling better similarity matching

Processing Efficiency

Like token windows in language models, scene-based chunking helps manage video processing:

  1. Smaller, focused chunks enable efficient processing
  2. Parallel processing becomes more feasible (a minimal sketch follows this list)
  3. Storage and retrieval operations are optimized
  4. Reduces redundant processing of similar frames
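
As a rough illustration of the second point, independent scene chunks can be fanned out to a worker pool. This is a minimal sketch, where process_fn stands in for whatever per-chunk work (embedding generation, transcoding) you run:

from concurrent.futures import ThreadPoolExecutor

def process_chunks_in_parallel(chunk_paths, process_fn, max_workers=4):
    """Apply process_fn to each independent chunk concurrently."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(process_fn, chunk_paths))
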
[Figure: Semantic Chunking — Text vs. Video. Left: a text document split into sentence-level chunks, each with its own embedding. Right: a video split into scenes (opening scene, action sequence, closing scene), each with a duration and its own embedding. Key benefits: maintains semantic meaning, enables efficient processing, improves retrieval accuracy, preserves context.]

Multimodal Understanding

Unlike text tokenization, video scenes often contain multiple modalities (visual, audio, text-on-screen) that need to be processed together. This complexity makes intelligent chunking even more crucial for maintaining context and enabling accurate understanding.

💡 Learn multimodal understanding for free: http://multimodaluniversity.com/

Introduction

Video understanding at scale requires efficient processing and indexing of video content. This tutorial demonstrates how to implement dynamic video chunking using scene detection, generate embeddings with Mixpeek, and store them in Weaviate for semantic search capabilities.

[Figure: Video Processing Pipeline — video input (S3, direct upload, or URL) → scene detection (PySceneDetect, content-based) → video chunking (FFmpeg extraction, parallel processing) → Mixpeek embedding generation → Weaviate vector storage → semantic search (scene retrieval and ranking). Data flow: raw video → scene timestamps → video chunks → embeddings → vector storage → search results.]

Prerequisites

pip install scenedetect weaviate-client python-dotenv requests

You will also need FFmpeg installed and available on your PATH, since the chunking step shells out to it.

Implementation Guide

1. Scene Detection with PySceneDetect

First, let's implement the scene detection logic:

from scenedetect import detect, ContentDetector
import os

def detect_scenes(video_path, threshold=27.0):
    """
    Detect scene changes in a video file using content detection.
    
    Args:
        video_path (str): Path to the video file
        threshold (float): Detection threshold (lower = more sensitive)
    
    Returns:
        list: List of scene timestamps (start, end) in seconds
    """
    # Detect scenes using content detection
    scenes = detect(video_path, ContentDetector(threshold=threshold))
    
    # Convert scenes to timestamp ranges
    scene_timestamps = []
    for scene in scenes:
        start_time = scene[0].get_seconds()
        end_time = scene[1].get_seconds()
        scene_timestamps.append((start_time, end_time))
    
    return scene_timestamps
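
A quick usage example (the file path is hypothetical):

scenes = detect_scenes("tutorial.mp4", threshold=27.0)
for i, (start, end) in enumerate(scenes):
    print(f"Scene {i}: {start:.2f}s - {end:.2f}s")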

2. Video Chunking Utility

Create a utility to split the video into chunks based on detected scenes:

import subprocess

def chunk_video(video_path, output_dir, timestamps):
    """
    Split video into chunks based on scene timestamps.
    
    Args:
        video_path (str): Path to the source video
        output_dir (str): Directory to save video chunks
        timestamps (list): List of (start, end) timestamps
    
    Returns:
        list: Paths to generated video chunks
    """
    chunk_paths = []
    
    for idx, (start, end) in enumerate(timestamps):
        output_path = os.path.join(output_dir, f"chunk_{idx}.mp4")
        
        # Use ffmpeg to extract the chunk via stream copy (fast, no re-encode)
        command = [
            'ffmpeg', '-y', '-i', video_path,
            '-ss', str(start),
            '-t', str(end - start),
            '-c', 'copy',
            output_path
        ]
        
        subprocess.run(command, capture_output=True, check=True)
        chunk_paths.append(output_path)
    
    return chunk_paths
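
One caveat: -c copy cuts on keyframes, so chunk boundaries can drift slightly from the detected scene timestamps. If you need frame-accurate cuts, re-encode instead of stream copying; a sketch with one reasonable (but not mandatory) choice of codecs:

command = [
    'ffmpeg', '-y', '-i', video_path,
    '-ss', str(start),
    '-t', str(end - start),
    '-c:v', 'libx264', '-preset', 'fast',  # re-encode video for exact cuts
    '-c:a', 'aac',
    output_path
]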

3. Mixpeek Integration

Set up the Mixpeek client for generating embeddings:

import requests
import json
from typing import List

class MixpeekClient:
    def __init__(self, api_key: str):
        self.api_key = api_key
        self.base_url = "https://api.mixpeek.com"
        
    def generate_embedding(self, video_url: str, vector_index: str) -> dict:
        """
        Generate embeddings for a video chunk using Mixpeek.
        """
        headers = {
            'Content-Type': 'application/json',
            'Authorization': f'Bearer {self.api_key}'
        }
        
        payload = {
            "type": "url",
            "value": video_url,
            "vector_index": vector_index
        }
        
        response = requests.post(
            f"{self.base_url}/features/extractors/embed",
            headers=headers,
            json=payload
        )
        response.raise_for_status()  # surface HTTP errors instead of returning an error body
        
        return response.json()
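
Example usage (the API key and chunk URL are placeholders):

client = MixpeekClient(api_key="YOUR_MIXPEEK_API_KEY")
result = client.generate_embedding(
    "https://example.com/chunks/chunk_0.mp4",
    vector_index="video_vector"
)
# the response is expected to carry an "embedding" field, used below
print(result)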

4. Weaviate Integration

Set up the Weaviate client for storing embeddings:

import weaviate
from datetime import datetime

def setup_weaviate_schema(client):
    """
    Set up the Weaviate schema for video chunks.
    """
    class_obj = {
        "class": "VideoChunk",
        "vectorizer": "none",  # We'll use custom vectors from Mixpeek
        "properties": [
            {
                "name": "videoId",
                "dataType": ["string"]
            },
            {
                "name": "chunkStart",
                "dataType": ["number"]
            },
            {
                "name": "chunkEnd",
                "dataType": ["number"]
            },
            {
                "name": "sourceUrl",
                "dataType": ["string"]
            }
        ]
    }
    
    client.schema.create_class(class_obj)

def store_embedding(client, embedding_data: dict, chunk_metadata: dict):
    """
    Store video chunk embedding in Weaviate.
    """
    vector = embedding_data["embedding"]
    
    properties = {
        "videoId": chunk_metadata["video_id"],
        "chunkStart": chunk_metadata["start_time"],
        "chunkEnd": chunk_metadata["end_time"],
        "sourceUrl": chunk_metadata["url"]
    }
    
    # weaviate-client v3 expects the data object first, then the class name
    client.data_object.create(
        properties,
        "VideoChunk",
        vector=vector
    )
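
Example usage against a locally running Weaviate instance (the URL and the three-element vector are placeholders; real Mixpeek embeddings are much higher-dimensional):

client = weaviate.Client("http://localhost:8080")
setup_weaviate_schema(client)  # run once per instance

store_embedding(
    client,
    embedding_data={"embedding": [0.1, 0.9, 0.3]},  # placeholder vector
    chunk_metadata={
        "video_id": "tutorial.mp4",
        "start_time": 0.0,
        "end_time": 12.5,
        "url": "https://example.com/chunks/chunk_0.mp4"
    }
)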

5. Putting It All Together

Here's how to use all the components together:

import os
from dotenv import load_dotenv

def process_video(video_path: str, upload_base_url: str):
    """
    Process a video through the entire pipeline:
    1. Detect scenes
    2. Create chunks
    3. Generate embeddings
    4. Store in Weaviate
    """
    load_dotenv()
    
    # Initialize clients
    mixpeek_client = MixpeekClient(os.getenv("MIXPEEK_API_KEY"))
    weaviate_client = weaviate.Client(os.getenv("WEAVIATE_URL"))
    
    # Detect scenes
    scenes = detect_scenes(video_path)
    
    # Create chunks
    output_dir = "video_chunks"
    os.makedirs(output_dir, exist_ok=True)
    chunk_paths = chunk_video(video_path, output_dir, scenes)
    
    # Process each chunk
    video_id = os.path.basename(video_path)
    
    for chunk_path, (start_time, end_time) in zip(chunk_paths, scenes):
        # Build the chunk's public URL; uploading the file itself depends on
        # your storage solution (e.g. S3), since Mixpeek fetches chunks by URL
        chunk_url = f"{upload_base_url}/{os.path.basename(chunk_path)}"
        
        # Generate embedding
        embedding_data = mixpeek_client.generate_embedding(
            chunk_url,
            "video_vector"
        )
        
        # Store in Weaviate
        chunk_metadata = {
            "video_id": video_id,
            "start_time": start_time,
            "end_time": end_time,
            "url": chunk_url
        }
        
        store_embedding(weaviate_client, embedding_data, chunk_metadata)
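
With MIXPEEK_API_KEY and WEAVIATE_URL set in your .env file, running the whole pipeline is a single call (the path and base URL are illustrative):

process_video(
    video_path="tutorial.mp4",
    upload_base_url="https://example.com/chunks"
)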

Searching Video Chunks

Here's how to search through the processed video chunks:

def search_video_chunks(client, query_vector, limit=5):
    """
    Search for similar video chunks using the query vector.
    """
    response = (
        client.query
        .get("VideoChunk", ["videoId", "chunkStart", "chunkEnd", "sourceUrl"])
        .with_near_vector({
            "vector": query_vector,
            "certainty": 0.7
        })
        .with_limit(limit)
        .do()
    )
    
    return response["data"]["Get"]["VideoChunk"]
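
To run a search you need a query vector in the same embedding space as the stored chunks. One way, using only the endpoints shown above, is to embed a short reference clip through the same Mixpeek index (the clip URL is a placeholder):

query_data = mixpeek_client.generate_embedding(
    "https://example.com/query-clip.mp4",
    vector_index="video_vector"
)
results = search_video_chunks(weaviate_client, query_data["embedding"])

for chunk in results:
    print(chunk["videoId"], chunk["chunkStart"], chunk["chunkEnd"])
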
[Figure: Video Scene Search Process — a text query ("cooking scene") is embedded and matched against stored scene vectors, returning ranked results (e.g. a 95% match for a cooking demonstration at 2:30–4:15). Example metrics: 1024-dimensional vectors, cosine similarity, sub-100ms response, top-k of 10.]

Best Practices

  1. Scene Detection Tuning
    • Adjust the threshold based on your video content
    • Consider using multiple detection methods for different types of content
    • Implement minimum/maximum chunk duration constraints
  2. Embedding Storage
    • Use batch processing for multiple chunks
    • Implement error handling and retries (a minimal retry sketch follows this list)
    • Consider implementing a caching layer
  3. Performance Optimization
    • Process chunks in parallel when possible
    • Implement progressive loading for large videos
    • Use appropriate video codec settings for chunks
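
As a starting point for the error-handling suggestion in point 2, here is a minimal retry wrapper with exponential backoff (attempt counts and delays are arbitrary defaults):

import time

def with_retries(fn, max_attempts=3, base_delay=1.0):
    """Call fn(), retrying with exponential backoff on failure."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))

# e.g. wrap an embedding call:
# embedding = with_retries(lambda: client.generate_embedding(url, "video_vector"))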

Conclusion

This pipeline enables efficient video understanding by:

  • Breaking videos into meaningful segments
  • Generating rich embeddings for each segment
  • Enabling semantic search across video content

The combination of PySceneDetect, Mixpeek, and Weaviate creates a powerful system for video understanding and retrieval.

All this in two API calls

Want to implement this entire pipeline in just two API calls? Here's how you can do it with Mixpeek:

Ingest video:

import requests

url = "https://api.mixpeek.com/ingest/videos/url"

payload = {
    "url": "https://example.com/sample-video.mp4",
    "collection": "scene_tutorial"
}
headers = {
    "Content-Type": "application/json",
    "Authorization": "Bearer YOUR_MIXPEEK_API_KEY"  # Bearer auth, as in MixpeekClient above
}

response = requests.request("POST", url, json=payload, headers=headers)

print(response.text)

Hybrid search videos:

import requests

url = "https://api.mixpeek.com/features/search"

payload = {
    "queries": [
        {
            "type": "text",
            "value": "boy outside",
            "vector_index": "multimodal"
        },
        {
            "type": "url",
            "value": "https://example.com/dog.jpg",
            "vector_index": "multimodal"
        }
    ],
    "collections": ["scene_tutorial"]
}
headers = {
    "Content-Type": "application/json",
    "Authorization": "Bearer YOUR_MIXPEEK_API_KEY"  # Bearer auth, as in MixpeekClient above
}

response = requests.request("POST", url, json=payload, headers=headers)

That's it! All the complexity of scene detection, chunking, embedding generation, vector storage, and semantic search is handled for you in these two simple API calls.

About the author

Ethan Steininger

Probably outside.
