Reproducible Multimodal Datasets, Without Losing Your Mind

Reproducible multimodal datasets using immutable object storage. Learn how Tigris and Mixpeek enable dataset versioning, retraining, and auditability.
Reproducible Multimodal Datasets, Without Losing Your Mind

Imagine you’re retraining a model that worked great three weeks ago.

Same code.
Same hyperparameters.
Same architecture.

But the results are… different.

Not wildly different. Just enough to make you uneasy.

Was it a model change?
A preprocessing tweak?
A labeling update?


Or did the underlying dataset quietly shift without anyone noticing?

two “training runs” diverging despite identical code

For most ML teams working with images, video, audio, or documents, this moment is painfully familiar. The hard part isn’t training models—it’s understanding exactly what data went into them, and whether you can ever get that dataset back again.


The real problem: datasets don’t stand still

Multimodal ML pipelines are inherently dynamic:

  • New raw assets are ingested continuously
  • Extractors generate derived clips, embeddings, transcripts, and metadata
  • Labels evolve as taxonomies change
  • Clusters shift as new data arrives
  • Models are retrained, evaluated, and rolled forward

Over time, what we casually call “the dataset” is actually a moving target.

raw data → extractors → labels → retraining, with subtle drift over time

Most teams try to manage this with a mix of folder conventions, timestamps, and best intentions. But as scale grows, small inconsistencies compound. Reproducing last month’s training set becomes guesswork. Audits turn into archaeology.


Treating object storage as dataset lineage

This is where Tigris changes the baseline.

Tigris provides immutable, versioned object storage with fast listing and rich metadata—while remaining fully S3-compatible. Instead of thinking in terms of “latest files,” teams can reason about states over time.

Every object version captures:

  • What existed
  • When it changed
  • What state it was in at any point
multiple versions of the same asset with timestamps

Object storage stops being a dumping ground. It becomes a durable system of record.

“Once you start treating object storage as a system of record, rather than just a place to put files, reproducibility becomes a property of the system."
Ovais Tariq, CEO, Tigris Data

Where Mixpeek fits in

Mixpeek builds directly on top of that foundation.

Rather than treating datasets as flat files, Mixpeek models multimodal data as a progression:

raw video/image/PDF → document abstraction → embeddings, transcripts, labels, clusters

For each dataset version, Mixpeek persists:

  • Original raw assets
  • Derived artifacts (clips, page segments)
  • Extractor outputs (JSON, embeddings, transcripts)
  • Cluster assignments and label manifests

Because these artifacts live as versioned objects in Tigris, any dataset snapshot can be reconstructed deterministically.


Rebuilding a dataset, not reconstructing it

Need to retrain a model from six weeks ago?
Need to explain why a prediction changed?
Need to audit what data powered a decision?

You don’t reverse-engineer the past.

You rebuild it.

“rebuild dataset” flow — select timestamp → fetch object versions → regenerate dataset

This shifts reproducibility from “best effort” to default behavior.


Why this matters in practice

For ML and platform teams, this unlocks:

  • True reproducibility — dataset snapshots, not guesses
  • Safer retraining — re-runs start from known inputs
  • Clear audit trails — derived signals trace back to raw objects
  • Lower operational overhead — no bespoke versioning logic
ad-hoc pipelines vs versioned datasets

The result isn’t just cleaner pipelines—it’s calmer ones.

💡
Want to stop guessing what went into your last training run?

The recipe walks through the full implementation—from raw asset ingestion to rebuildable snapshots.

→ View the Recipe


A small shift with outsized impact

There’s no new model architecture here.
No exotic optimization trick.

Just a reframing:

What if object storage was the source of truth for dataset lineage—and everything else was derived?
Tigris at the base, Mixpeek above, models at the top

By combining Tigris’s immutable object versions with Mixpeek’s multimodal enrichment and retrieval layer, dataset versioning stops being an afterthought and becomes the default.

And when datasets stop slipping out from under you, everything downstream gets easier to reason about.

About the author
Ethan Steininger

Ethan Steininger

Former lead of MongoDB's Search Team, Ethan noticed the most common problem customers faced was building indexing and search infrastructure on their S3 buckets. Mixpeek was born.

Mixpeek Engineering Blog

Deep dive into multimodal AI, data processing, and best practices from our engineering team.

Great! You’ve successfully signed up.

Welcome back! You've successfully signed in.

You've successfully subscribed to Mixpeek Engineering Blog.

Success! Check your email for magic link to sign-in.

Success! Your billing info has been updated.

Your billing was not updated.