AI Video Tagging With Dynamic Taxonomies

AI video tagging used to mean manual review and basic object detection. With multimodal models and dynamic taxonomies, you can now automatically detect brand moments, inappropriate content, actions, moods and trending content at scale.

Dynamic taxonomies enable automatic classification of video content at scale. Instead of manually tagging thousands of hours of footage, multimodal AI can identify scenes, moods, actions, and key moments across your video library.

Diagram: Video Segmentation & Taxonomy Classification. A one-minute video is split into four 15-second segments (0:00-0:15 through 0:45-1:00), each connected to a classifier label (high_energy, emotional, dialog, action) with confidence scores between 0.85 and 0.95.

Real-World Applications

Content Libraries

  • Scene-level categorization for episodic content
  • Identification of specific actions (fights, chases, emotional moments)
  • Automated content moderation
  • Mood-based classification for recommendation systems

News & Sports

  • Automatic distinction between studio/field footage
  • Action detection (goals, plays, celebrations)
  • Speaker/anchor identification
  • On-screen text extraction and classification

User-Generated Content

  • Brand moment detection
  • Inappropriate content flagging
  • Action/mood classification
  • Trending content identification

Implementation Guide

Define Your Taxonomy Structure

Create hierarchical classifications that match your content:

POST /entities/taxonomies
{
  "taxonomy_name": "content_classifier",
  "nodes": [
    {
      "name": "moods",
      "embedding_config": [
        {
          "embedding_model": "multimodal",
          "type": "text",
          "value": "Scene mood and emotional atmosphere analysis"
        }
      ],
      "children": [
        {
          "name": "high_energy",
          "embedding_config": [
            {
              "embedding_model": "multimodal",
              "type": "video",
              "value": "https://assets.example.com/reference/action_scene.mp4"
            },
            {
              "embedding_model": "text",
              "value": "Fast-paced, dynamic, intense action and movement"
            }
          ]
        },
        {
          "name": "emotional",
          "embedding_config": [
            {
              "embedding_model": "multimodal",
              "type": "video",
              "value": "https://assets.example.com/reference/dramatic_scene.mp4"
            },
            {
              "embedding_model": "text",
              "value": "Dramatic, emotional, intimate character moments"
            }
          ]
        }
      ]
    }
  ]
}
Docs: Taxonomies - Mixpeek (create and manage hierarchical classifications for multimodal content organization)
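
If you prefer to build the payload programmatically, here is a minimal Python sketch that assembles the same taxonomy from a small mood-to-reference mapping and submits it. The base URL, auth header, and response handling are assumptions; adjust them to your own account setup.

```python
# A minimal sketch of building and submitting the taxonomy payload above.
# API_BASE and HEADERS are assumptions; swap in your real endpoint and key.
import requests

API_BASE = "https://api.mixpeek.com"                 # assumed base URL
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}   # assumed auth scheme

# Map each child node to a reference clip and a text description so the
# taxonomy can be regenerated whenever reference content is updated.
MOOD_NODES = {
    "high_energy": (
        "https://assets.example.com/reference/action_scene.mp4",
        "Fast-paced, dynamic, intense action and movement",
    ),
    "emotional": (
        "https://assets.example.com/reference/dramatic_scene.mp4",
        "Dramatic, emotional, intimate character moments",
    ),
}

def build_taxonomy_payload() -> dict:
    """Assemble the same structure shown in the JSON example above."""
    children = [
        {
            "name": name,
            "embedding_config": [
                {"embedding_model": "multimodal", "type": "video", "value": clip_url},
                {"embedding_model": "text", "value": description},
            ],
        }
        for name, (clip_url, description) in MOOD_NODES.items()
    ]
    return {
        "taxonomy_name": "content_classifier",
        "nodes": [
            {
                "name": "moods",
                "embedding_config": [
                    {
                        "embedding_model": "multimodal",
                        "type": "text",
                        "value": "Scene mood and emotional atmosphere analysis",
                    }
                ],
                "children": children,
            }
        ],
    }

response = requests.post(f"{API_BASE}/entities/taxonomies",
                         json=build_taxonomy_payload(),
                         headers=HEADERS)
response.raise_for_status()
print(response.json())
```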

Set Up Processing Pipeline

Configure your namespace and collection:

POST /namespaces
{
  "namespace_name": "video_processing",
  "vector_indexes": ["multimodal", "text"],
  "payload_indexes": [
    {
      "field_name": "taxonomy.classifications",
      "type": "keyword",
      "field_schema": {
        "type": "keyword",
        "is_tenant": false
      }
    }
  ]
}
Docs: Namespaces - Mixpeek (create isolated environments for organizing and managing your search applications)
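
The same namespace call from Python, assuming the same base URL and auth header as before. The payload mirrors the JSON exactly so that taxonomy.classifications is indexed as a filterable keyword field.

```python
# A short sketch submitting the namespace payload above.
# API_BASE and HEADERS are assumptions carried over from the previous sketch.
import requests

API_BASE = "https://api.mixpeek.com"                 # assumed
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}   # assumed

namespace_payload = {
    "namespace_name": "video_processing",
    "vector_indexes": ["multimodal", "text"],
    "payload_indexes": [
        {
            "field_name": "taxonomy.classifications",
            "type": "keyword",
            "field_schema": {"type": "keyword", "is_tenant": False},
        }
    ],
}

resp = requests.post(f"{API_BASE}/namespaces", json=namespace_payload, headers=HEADERS)
resp.raise_for_status()
print("Namespace created:", resp.json())
```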

Process Videos

Ingest videos with intelligent sampling and taxonomy classification:

POST /ingest/videos/url
{
  "url": "https://content.example.com/videos/episode_123.mp4",
  "collection": "premium_content",
  "feature_extractors": {
    "interval_sec": 10,
    "embed": [
      {
        "type": "url",
        "vector_index": "multimodal"
      }
    ],
    "describe": {
      "enabled": true,
      "vector_index": "text"
    }
  },
  "taxonomy_config": {
    "taxonomy_ids": ["tax_abc123"],
    "confidence_threshold": 0.75,
    "min_segment_duration": 5
  }
}
Docs: Feature Extraction - Mixpeek (configure and customize multimodal feature extraction for different content types)
Diagram: ingestion pipeline. Raw video flows through intelligent sampling (key frames) and taxonomy classification into searchable tags (action, high_energy, mood, scene) stored in the search index.
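
To process a whole library, you can wrap the request above in a small helper and reuse it per batch of similar content. This sketch assumes the same base URL and auth header as earlier; tax_abc123 and the episode URLs are placeholders.

```python
# A hedged sketch of batch ingestion: every URL in a batch shares one
# sampling interval and taxonomy config, mirroring the JSON example above.
import requests

API_BASE = "https://api.mixpeek.com"                 # assumed
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}   # assumed

def ingest_video(url: str, interval_sec: int, taxonomy_id: str = "tax_abc123") -> dict:
    payload = {
        "url": url,
        "collection": "premium_content",
        "feature_extractors": {
            "interval_sec": interval_sec,
            "embed": [{"type": "url", "vector_index": "multimodal"}],
            "describe": {"enabled": True, "vector_index": "text"},
        },
        "taxonomy_config": {
            "taxonomy_ids": [taxonomy_id],
            "confidence_threshold": 0.75,
            "min_segment_duration": 5,
        },
    }
    resp = requests.post(f"{API_BASE}/ingest/videos/url", json=payload, headers=HEADERS)
    resp.raise_for_status()
    return resp.json()

# Batch-process similar content with the same settings.
episode_urls = [
    "https://content.example.com/videos/episode_123.mp4",
    "https://content.example.com/videos/episode_124.mp4",
]
for url in episode_urls:
    print(ingest_video(url, interval_sec=10))
```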

Intelligent Sampling Settings

Choose sampling intervals based on content type:

| Content Type | Interval (sec) | Rationale |
|---|---|---|
| Action/Sports | 5-10 | Capture rapid changes |
| Dialog Scenes | 15-20 | Focus on key moments |
| News/Interviews | 20-30 | Capture scene changes |
💡 For more intelligent sampling, consider dynamic scene splitting: https://blog.mixpeek.com/dynamic-video-chunking-scene-detection/
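
The interval in the table above directly controls how many segments (and therefore embeddings) each hour of footage produces, which is the main cost and latency lever. A quick back-of-the-envelope illustration, not an API call:

```python
# Illustration only: segments generated per hour of video at each interval.
INTERVALS = {"action_sports": 5, "dialog_scenes": 15, "news_interviews": 30}

for content_type, interval_sec in INTERVALS.items():
    segments_per_hour = 3600 // interval_sec
    print(f"{content_type}: {interval_sec}s interval -> "
          f"{segments_per_hour} segments per hour of video")

# -> 720, 240, and 120 segments per hour, respectively.
```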

Key Optimizations

Reference Selection

  • Use high-quality, representative video clips for each category
  • Include multiple examples per taxonomy node
  • Update reference content as your library evolves

Confidence Thresholds

  • Start high (0.85+) for critical classifications
  • Lower (0.7+) for general categorization
  • Adjust based on validation results
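
One way to ground the threshold choice is to score a small, manually reviewed validation set at a few candidate thresholds and compare precision against how many classifications survive. A minimal sketch with hypothetical validation data:

```python
# Hypothetical (confidence, is_correct) pairs from manually reviewed
# classifications; in practice these come from your own validation pass.
validation = [
    (0.95, True), (0.91, True), (0.88, True), (0.83, False),
    (0.79, True), (0.76, False), (0.72, False), (0.68, False),
]

def precision_at(threshold: float, samples) -> float:
    """Precision over the classifications that clear the threshold."""
    accepted = [correct for conf, correct in samples if conf >= threshold]
    return sum(accepted) / len(accepted) if accepted else 0.0

for threshold in (0.70, 0.75, 0.85):
    kept = sum(1 for conf, _ in validation if conf >= threshold)
    print(f"threshold {threshold:.2f}: precision {precision_at(threshold, validation):.2f}, "
          f"{kept}/{len(validation)} classifications kept")
```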

Search Integration

Query classified content:

POST /features/search
{
  "collections": ["premium_content"],
  "queries": [
    {
      "vector_index": "multimodal",
      "type": "text",
      "value": "high energy action sequence"
    }
  ],
  "filters": {
    "AND": [
      {
        "key": "taxonomy.classifications.node_id",
        "operator": "in",
        "value": ["tax_node_high_energy"]
      }
    ]
  },
  "group_by": {
    "field": "asset_id",
    "max_features": 5
  }
}
Docs: Queries - Mixpeek (build powerful multimodal search queries across text, images, and videos)
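
The same query from Python, with a loop over the grouped results. The response layout (groups and features keys) is an assumption, so adapt it to what your deployment actually returns.

```python
# A hedged sketch of running the search above and walking the grouped results.
import requests

API_BASE = "https://api.mixpeek.com"                 # assumed
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}   # assumed

query = {
    "collections": ["premium_content"],
    "queries": [
        {"vector_index": "multimodal", "type": "text",
         "value": "high energy action sequence"}
    ],
    "filters": {
        "AND": [
            {"key": "taxonomy.classifications.node_id",
             "operator": "in",
             "value": ["tax_node_high_energy"]}
        ]
    },
    "group_by": {"field": "asset_id", "max_features": 5},
}

resp = requests.post(f"{API_BASE}/features/search", json=query, headers=HEADERS)
resp.raise_for_status()
results = resp.json()

# Assumed response layout: one group per asset, each with its top segments.
for group in results.get("groups", []):
    print(group.get("asset_id"), "->", len(group.get("features", [])), "matching segments")
```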

Practical Tips

  1. Start Small
    • Begin with 2-3 main categories
    • Validate classification accuracy
    • Expand based on results
  2. Optimize Processing
    • Use appropriate sampling intervals
    • Batch process similar content
    • Monitor classification confidence (see the sketch after this list)
  3. Maintain Quality
    • Regularly update reference content
    • Review edge cases
    • Adjust thresholds based on needs
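
The confidence-monitoring tip above can be a few lines of aggregation: average the classification confidence per taxonomy node and flag nodes that drift low, since those usually need fresher reference content. The records here are hypothetical; pull real ones from your processed segments.

```python
# Aggregate confidence per taxonomy node and flag low-confidence nodes.
from collections import defaultdict

classifications = [
    {"node": "high_energy", "confidence": 0.93},
    {"node": "high_energy", "confidence": 0.88},
    {"node": "emotional", "confidence": 0.71},
    {"node": "emotional", "confidence": 0.66},
]

by_node = defaultdict(list)
for record in classifications:
    by_node[record["node"]].append(record["confidence"])

for node, scores in by_node.items():
    avg = sum(scores) / len(scores)
    flag = "  <- review reference clips" if avg < 0.75 else ""
    print(f"{node}: avg confidence {avg:.2f} over {len(scores)} segments{flag}")
```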

Common Challenges

  1. Mixed Content
    • Solution: Use multiple reference examples
    • Example: News segments with both studio/field footage
  2. Temporal Context
    • Solution: Adjust sampling intervals
    • Example: Sports highlights need denser sampling
  3. Scale Issues
    • Solution: Batch processing with appropriate intervals
    • Example: Process episodic content in seasons
Taxonomy configuration decision tree, summarized by content type:

| Content Type | interval_sec | confidence | embedding | min_segment | Optimization Focus |
|---|---|---|---|---|---|
| High Action (sports, action) | 5 | 0.85 | multimodal | 3s | Optimize for rapid changes |
| Dialog Heavy (news, interviews) | 15 | 0.75 | text + multimodal | 10s | Prioritize speaker detection |
| Mixed Content (UGC, shows) | 10 | 0.80 | multimodal | 5s | Balance accuracy and performance |
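
Encoded in code, that decision tree becomes a simple lookup by content type. A sketch, with illustrative content-type labels:

```python
# Pick ingestion settings by content type, mirroring the table above.
from typing import TypedDict

class IngestConfig(TypedDict):
    interval_sec: int
    confidence_threshold: float
    vector_indexes: list[str]
    min_segment_duration: int

CONFIGS: dict[str, IngestConfig] = {
    "high_action":  {"interval_sec": 5,  "confidence_threshold": 0.85,
                     "vector_indexes": ["multimodal"],         "min_segment_duration": 3},
    "dialog_heavy": {"interval_sec": 15, "confidence_threshold": 0.75,
                     "vector_indexes": ["text", "multimodal"], "min_segment_duration": 10},
    "mixed":        {"interval_sec": 10, "confidence_threshold": 0.80,
                     "vector_indexes": ["multimodal"],         "min_segment_duration": 5},
}

def config_for(content_type: str) -> IngestConfig:
    """Fall back to the balanced 'mixed' profile for unknown content types."""
    return CONFIGS.get(content_type, CONFIGS["mixed"])

print(config_for("high_action"))
```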

The power of dynamic taxonomies comes from combining intelligent sampling with multimodal understanding. By properly configuring your taxonomy structure and processing pipeline, you can automatically classify thousands of hours of content with high accuracy.


About the author
Ethan Steininger

Probably outside.
