Just as RAG systems break down text documents into meaningful chunks for better processing and retrieval, video content benefits from intelligent segmentation through scene detection. This approach parallels text tokenization in several crucial ways:
Semantic Coherence
Scene detection identifies natural boundaries in video content, maintaining semantic completeness much as text chunking preserves sentence or paragraph integrity. Each scene represents a complete "thought" or action sequence rather than an arbitrary time-based split. For example, in a cooking tutorial, each scene might correspond to a distinct step, such as preparing ingredients, cooking, and plating, rather than a cut at some fixed timestamp that splits a step in half.
Retrieval Precision
Scene-based chunks enable precise content retrieval. Instead of returning entire videos, systems can identify and serve the exact relevant scene, similar to how RAG systems return specific text passages rather than complete documents.
Vector Embedding Quality
Scene-based chunking produces higher quality embeddings because:
- Each embedding represents a coherent visual concept
- The embeddings aren't "confused" by mixing multiple scenes
- The semantic space remains clean, enabling better similarity matching
Processing Efficiency
Like token windows in language models, scene-based chunking helps manage video processing:
- Smaller, focused chunks enable efficient processing
- Parallel processing becomes more feasible
- Storage and retrieval operations are optimized
- Reduces redundant processing of similar frames
Multimodal Understanding
Unlike text tokenization, video scenes often contain multiple modalities (visual, audio, text-on-screen) that need to be processed together. This complexity makes intelligent chunking even more crucial for maintaining context and enabling accurate understanding.
Introduction
Video understanding at scale requires efficient processing and indexing of video content. This tutorial demonstrates how to implement dynamic video chunking using scene detection, generate embeddings with Mixpeek, and store them in Weaviate for semantic search capabilities.
Prerequisites
pip install scenedetect weaviate-client python-dotenv requests
The chunking step also shells out to ffmpeg, so make sure it is installed and on your PATH.
Implementation Guide
1. Scene Detection with PySceneDetect
First, let's implement the scene detection logic:
from scenedetect import detect, ContentDetector
import os

def detect_scenes(video_path, threshold=27.0):
    """
    Detect scene changes in a video file using content detection.

    Args:
        video_path (str): Path to the video file
        threshold (float): Detection threshold (lower = more sensitive)

    Returns:
        list: List of scene timestamps (start, end) in seconds
    """
    # Detect scenes using content detection
    scenes = detect(video_path, ContentDetector(threshold=threshold))

    # Convert scenes to timestamp ranges
    scene_timestamps = []
    for scene in scenes:
        start_time = scene[0].get_seconds()
        end_time = scene[1].get_seconds()
        scene_timestamps.append((start_time, end_time))

    return scene_timestamps
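A quick way to sanity-check the detector is to run it on a local file (the path below is just a placeholder) and print the boundaries it finds:

timestamps = detect_scenes("sample_video.mp4")
for start, end in timestamps:
    print(f"Scene from {start:.2f}s to {end:.2f}s ({end - start:.2f}s)")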
2. Video Chunking Utility
Create a utility to split the video into chunks based on detected scenes:
import subprocess

def chunk_video(video_path, output_dir, timestamps):
    """
    Split video into chunks based on scene timestamps.

    Args:
        video_path (str): Path to the source video
        output_dir (str): Directory to save video chunks
        timestamps (list): List of (start, end) timestamps

    Returns:
        list: Paths to generated video chunks
    """
    chunk_paths = []
    for idx, (start, end) in enumerate(timestamps):
        output_path = os.path.join(output_dir, f"chunk_{idx}.mp4")

        # Use ffmpeg to extract the chunk
        command = [
            'ffmpeg', '-i', video_path,
            '-ss', str(start),
            '-t', str(end - start),
            '-c', 'copy',
            output_path
        ]
        subprocess.run(command, capture_output=True)
        chunk_paths.append(output_path)

    return chunk_paths
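Combining the two steps looks like this (file names are placeholders). One caveat worth knowing: because the chunks are stream-copied (-c copy), ffmpeg cuts at the nearest keyframe, so boundaries may be slightly off; re-encoding instead of copying gives frame-accurate cuts at the cost of speed.

scenes = detect_scenes("sample_video.mp4")
os.makedirs("video_chunks", exist_ok=True)
chunk_paths = chunk_video("sample_video.mp4", "video_chunks", scenes)
print(f"Wrote {len(chunk_paths)} chunks to video_chunks/")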
3. Mixpeek Integration
Set up the Mixpeek client for generating embeddings:
import requests

class MixpeekClient:
    def __init__(self, api_key: str):
        self.api_key = api_key
        self.base_url = "https://api.mixpeek.com"

    def generate_embedding(self, video_url: str, vector_index: str) -> dict:
        """
        Generate embeddings for a video chunk using Mixpeek.
        """
        headers = {
            'Content-Type': 'application/json',
            'Authorization': f'Bearer {self.api_key}'
        }
        payload = {
            "type": "url",
            "value": video_url,
            "vector_index": vector_index
        }
        response = requests.post(
            f"{self.base_url}/features/extractors/embed",
            headers=headers,
            json=payload
        )
        return response.json()
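A minimal usage sketch, assuming you have a Mixpeek API key and a publicly reachable chunk URL (both are placeholders here); the index name "video_vector" matches the one used later in the pipeline:

mixpeek = MixpeekClient(api_key="YOUR_MIXPEEK_API_KEY")
embedding_data = mixpeek.generate_embedding(
    video_url="https://example.com/chunks/chunk_0.mp4",
    vector_index="video_vector"
)
print(embedding_data)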
4. Weaviate Integration
Set up the Weaviate client for storing embeddings:
import weaviate

def setup_weaviate_schema(client):
    """
    Set up the Weaviate schema for video chunks.
    """
    class_obj = {
        "class": "VideoChunk",
        "vectorizer": "none",  # We'll use custom vectors from Mixpeek
        "properties": [
            {
                "name": "videoId",
                "dataType": ["string"]
            },
            {
                "name": "chunkStart",
                "dataType": ["number"]
            },
            {
                "name": "chunkEnd",
                "dataType": ["number"]
            },
            {
                "name": "sourceUrl",
                "dataType": ["string"]
            }
        ]
    }
    client.schema.create_class(class_obj)

def store_embedding(client, embedding_data: dict, chunk_metadata: dict):
    """
    Store video chunk embedding in Weaviate.
    """
    vector = embedding_data["embedding"]

    properties = {
        "videoId": chunk_metadata["video_id"],
        "chunkStart": chunk_metadata["start_time"],
        "chunkEnd": chunk_metadata["end_time"],
        "sourceUrl": chunk_metadata["url"]
    }

    client.data_object.create(
        "VideoChunk",
        properties,
        vector=vector
    )
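The schema only needs to be created once per Weaviate instance, so it is worth guarding the call. A small sketch (the URL is a placeholder) that skips creation when the class already exists:

client = weaviate.Client("http://localhost:8080")

existing_classes = [c["class"] for c in client.schema.get().get("classes", [])]
if "VideoChunk" not in existing_classes:
    setup_weaviate_schema(client)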
5. Putting It All Together
Here's how to use all the components together:
import os
from dotenv import load_dotenv

def process_video(video_path: str, upload_base_url: str):
    """
    Process a video through the entire pipeline:
    1. Detect scenes
    2. Create chunks
    3. Generate embeddings
    4. Store in Weaviate
    """
    load_dotenv()

    # Initialize clients
    mixpeek_client = MixpeekClient(os.getenv("MIXPEEK_API_KEY"))
    weaviate_client = weaviate.Client(os.getenv("WEAVIATE_URL"))

    # Detect scenes
    scenes = detect_scenes(video_path)

    # Create chunks
    output_dir = "video_chunks"
    os.makedirs(output_dir, exist_ok=True)
    chunk_paths = chunk_video(video_path, output_dir, scenes)

    # Process each chunk
    video_id = os.path.basename(video_path)
    for idx, (chunk_path, (start_time, end_time)) in enumerate(zip(chunk_paths, scenes)):
        # Upload chunk and get URL (implementation depends on your storage solution)
        chunk_url = f"{upload_base_url}/{os.path.basename(chunk_path)}"

        # Generate embedding
        embedding_data = mixpeek_client.generate_embedding(
            chunk_url,
            "video_vector"
        )

        # Store in Weaviate
        chunk_metadata = {
            "video_id": video_id,
            "start_time": start_time,
            "end_time": end_time,
            "url": chunk_url
        }
        store_embedding(weaviate_client, embedding_data, chunk_metadata)
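Running the full pipeline is then a single call; the file name and upload base URL below are placeholders, and remember that actually uploading the chunks to that location is left to your storage solution:

process_video(
    video_path="cooking_tutorial.mp4",
    upload_base_url="https://storage.example.com/video-chunks"
)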
Searching Video Chunks
Here's how to search through the processed video chunks:
def search_video_chunks(client, query_vector, limit=5):
    """
    Search for similar video chunks using the query vector.
    """
    response = (
        client.query
        .get("VideoChunk", ["videoId", "chunkStart", "chunkEnd", "sourceUrl"])
        .with_near_vector({
            "vector": query_vector,
            "certainty": 0.7
        })
        .with_limit(limit)
        .do()
    )
    return response["data"]["Get"]["VideoChunk"]
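The query vector can come from any source that matches the index. One simple option, sketched below, is to reuse the embedding Mixpeek returned for an existing chunk to find visually similar scenes; the chunk URL and the "embedding" response key are the same assumptions used in store_embedding above.

mixpeek = MixpeekClient(os.getenv("MIXPEEK_API_KEY"))
client = weaviate.Client(os.getenv("WEAVIATE_URL"))

reference = mixpeek.generate_embedding(
    "https://storage.example.com/video-chunks/chunk_3.mp4",
    "video_vector"
)
results = search_video_chunks(client, reference["embedding"], limit=5)
for match in results:
    print(match["videoId"], match["chunkStart"], match["chunkEnd"])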
Best Practices
- Scene Detection Tuning
  - Adjust the threshold based on your video content
  - Consider using multiple detection methods for different types of content
  - Implement minimum/maximum chunk duration constraints (see the sketch after this list)
- Embedding Storage
  - Use batch processing for multiple chunks
  - Implement error handling and retries
  - Consider implementing a caching layer
- Performance Optimization
  - Process chunks in parallel when possible
  - Implement progressive loading for large videos
  - Use appropriate video codec settings for chunks
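As an example of a duration constraint, here is a small post-processing pass over the detected timestamps (a sketch under the assumption that very short scenes should simply be absorbed by their predecessor):

def enforce_min_duration(timestamps, min_duration=2.0):
    """Merge any scene shorter than min_duration seconds into the previous scene."""
    merged = []
    for start, end in timestamps:
        if merged and (end - start) < min_duration:
            prev_start, _ = merged[-1]
            merged[-1] = (prev_start, end)
        else:
            merged.append((start, end))
    return merged

scenes = enforce_min_duration(detect_scenes("sample_video.mp4"), min_duration=2.0)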
Conclusion
This pipeline enables efficient video understanding by:
- Breaking videos into meaningful segments
- Generating rich embeddings for each segment
- Enabling semantic search across video content
The combination of PySceneDetect, Mixpeek, and Weaviate creates a powerful system for video understanding and retrieval.
All this in two API calls
Want to implement this entire pipeline in just two API calls? Here's how you can do it with Mixpeek:
Ingest video:
import requests

url = "https://api.mixpeek.com/ingest/videos/url"

payload = {
    "url": "https://example.com/sample-video.mp4",
    "collection": "scene_tutorial"
}
headers = {
    "Content-Type": "application/json",
    "Authorization": "Bearer YOUR_MIXPEEK_API_KEY"  # same Bearer auth as the client above
}

response = requests.request("POST", url, json=payload, headers=headers)
print(response.text)
Hybrid search videos:
import requests

url = "https://api.mixpeek.com/features/search"

payload = {
    "queries": [
        {
            "type": "text",
            "value": "boy outside",
            "vector_index": "multimodal"
        },
        {
            "type": "url",
            "value": "https://example.com/dog.jpg",
            "vector_index": "multimodal"
        }
    ],
    "collections": ["scene_tutorial"]
}
headers = {
    "Content-Type": "application/json",
    "Authorization": "Bearer YOUR_MIXPEEK_API_KEY"  # same Bearer auth as the client above
}

response = requests.request("POST", url, json=payload, headers=headers)
That's it! All the complexity of scene detection, chunking, embedding generation, vector storage, and semantic search is handled for you in these two simple API calls.