Unlock your S3 Bucket with Mixpeek and MongoDB KNN

The standard design pattern when you want to serve non-JSON data to your client is to first store it in S3, then save that object's public file_url in your transactional database of choice.

While MongoDB does have GridFS support for files that exceed the 16 MB BSON document limit, it's not always the most effective option for large binary objects. This is why it's encouraged to take advantage of cheap cold storage, and then simply use that s3_url as a reference in your MongoDB collection, like:

{
    "s3_url": "https://s3.resume.pdf",
    "filename": "Ethan's Resume",
    "metadata": {}
}
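
As a minimal sketch of how such a reference document might be assembled before it is written to MongoDB, the helper below builds the record from a bucket name and an object key. The function name build_file_reference, the virtual-hosted-style URL layout, and the example bucket are assumptions for illustration, not part of Mixpeek or MongoDB.

```python
from urllib.parse import quote


def build_file_reference(bucket, key, filename, metadata=None):
    """Build the document that points a MongoDB record at an S3 object.

    Hypothetical helper: assumes the object is reachable at the
    virtual-hosted-style URL https://<bucket>.s3.amazonaws.com/<key>.
    """
    # Percent-encode the key but keep "/" so nested prefixes stay readable.
    s3_url = f"https://{bucket}.s3.amazonaws.com/{quote(key, safe='/')}"
    return {
        "s3_url": s3_url,
        "filename": filename,
        "metadata": dict(metadata or {}),
    }


doc = build_file_reference("my-bucket", "resumes/ethan resume.pdf", "Ethan's Resume")
print(doc["s3_url"])
```

With pymongo, the resulting dict could then be passed to collection.insert_one(doc).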

This allows our client to decide how they want to process the object. However, some issues come up:

  • What if we want to access the contents?
  • What if we want our server to process it before sending to the client?
  • What if we want to do fancy AI on top of the documents, images, video or audio?

This is where multimodal indexing comes in.

How does it work?

Once you create a Mixpeek connection with your S3 bucket as the source, every new object added to that bucket is streamed into your Mixpeek ML pipeline, and the output is sent right into your MongoDB collection.

Mixpeek supports three steps out of the box, and you can still bring your own Docker containers for custom AI-powered chaining.

  • Extract: If it's a PDF, the table contents, text and even images are pulled out. Audio gets transcribed, video can be tagged for objects and motion, and images can go through OCR or object detection. Read the extract docs
file_output = mixpeek.extract.text(file_url="s3://document.pdf")
  • Generate: If it's text, you can instruct the pipeline to use ML to generate a summary or tags. Read the generate docs
from pydantic import BaseModel

class Authors(BaseModel):
    author_email: str

class PaperDetails(BaseModel):
    paper_title: str
    author: Authors

response = mixpeek.generate(
    model="gpt-3.5-turbo",
    modality="text",
    response_format=PaperDetails,
    context=f"Format this document: {file_output}",
)
  • Embed: Supply your own transformer embeddings or use ours (everything is open source). We'll embed the extracted contents or the raw files using text, video, image or audio encoders. Read the embed docs
embedding = mixpeek.embed(
  modality="text", 
  input="hello world",
  model="jinaai/jina-embeddings-v2-base-en"
)

All of these methods are abstracted into a single pipeline: https://docs.mixpeek.com/pipelines/create

Now to combine it all into a single change-data-capture, AI-powered ETL from S3 into your MongoDB collection:

from mixpeek import Mixpeek, FileTools, SourceS3

def handler(event, context):
    mixpeek = Mixpeek("API_KEY")
    # Resolve the S3 object that triggered this invocation.
    file_url = SourceS3.file_url(event['bucket'], event['key'])
    pdf_data = FileTools.load_document(file_url)
    num_pages = FileTools.document_page_count(pdf_data)

    results = []

    # Walk the document page by page (numbered from 1) and embed each page's text.
    for page_number in range(1, num_pages + 1):
        page_text = FileTools.extract_text(pdf_data, page_number)
        page_embedding = mixpeek.embed.text(
          input=page_text,
          model="sentence-transformers/all-MiniLM-L6-v2"
        )

        # One record per page; file_url links it back to the source object.
        obj = {
            "page_number": page_number,
            "text": page_text,
            "embedding": page_embedding,
            "file_url": file_url
        }
        results.append(obj)

    return results

This creates an event stream from your S3 bucket so that every change in your bucket invokes your pipeline endpoint and stores the output in your collection.

One major point of frustration developers experience is "what happens if I modify my representative data?" Objects in your MongoDB collection are rarely static; they change often, as do the objects in your S3 bucket.

Mixpeek understands inserts vs updates vs deletes and is able to intelligently handle the embeddings by replacing/updating them in real-time.
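
One way to picture that insert/update/delete handling is sketched below. This is not Mixpeek's implementation; it is a simplified model that assumes S3-style event names (ObjectCreated:*, ObjectRemoved:*) and one stored record per page, keyed by file_url. InMemoryCollection is a hypothetical stand-in that mimics only the delete_many and insert_many calls of a pymongo collection, so the example runs without a database.

```python
class InMemoryCollection:
    """Hypothetical stand-in for a MongoDB collection (equality filters only)."""

    def __init__(self):
        self.docs = []

    def delete_many(self, query):
        # Keep only documents that do NOT match every key/value in the filter.
        self.docs = [
            d for d in self.docs
            if not all(d.get(k) == v for k, v in query.items())
        ]

    def insert_many(self, docs):
        self.docs.extend(dict(d) for d in docs)


def apply_s3_change(collection, event_name, file_url, pages=()):
    """Keep page-level records in step with a single S3 object change."""
    if event_name.startswith("ObjectCreated:"):
        # New or overwritten object: drop stale pages, then store the fresh ones.
        collection.delete_many({"file_url": file_url})
        if pages:
            collection.insert_many([{**p, "file_url": file_url} for p in pages])
    elif event_name.startswith("ObjectRemoved:"):
        # Deleted object: its pages and embeddings should disappear as well.
        collection.delete_many({"file_url": file_url})


coll = InMemoryCollection()
apply_s3_change(coll, "ObjectCreated:Put", "s3://bucket/a.pdf",
                [{"page_number": 1, "embedding": [0.1, 0.2]},
                 {"page_number": 2, "embedding": [0.3, 0.4]}])
# Overwriting the object replaces both old pages with the single new one.
apply_s3_change(coll, "ObjectCreated:Put", "s3://bucket/a.pdf",
                [{"page_number": 1, "embedding": [0.5, 0.6]}])
pages_after_update = len(coll.docs)
apply_s3_change(coll, "ObjectRemoved:Delete", "s3://bucket/a.pdf")
pages_after_delete = len(coll.docs)
print(pages_after_update, pages_after_delete)
```

Against a real collection, the delete-then-insert pair would likely need to run inside a transaction so readers never see a file with no pages.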

Once we have extracted text, tags, and embeddings, the sky is really the limit. We advise writing queries that span these data structures, and MongoDB has you covered.

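Here's a MongoDB aggregation pipeline that combines:

  • $vectorSearch: stored vectors queried with K nearest neighbors similarity (this stage supersedes the older knnBeta operator)
  • filter: pre-filtering on fields indexed alongside the vectors
  • $match: a standard MongoDB query stage applied to the results
  • text: full-text search on Lucene's inverted index can be added through $search, as in the hybrid search tutorial linked after the query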
[
  {
    $vectorSearch: { // KNN query
      index: "indexName",
      path: "fieldToSearch",
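      queryVector: [0, 1, 2, 3],
      numCandidates: 100, // illustrative value: neighbors considered during the search
      limit: 10, // illustrative value: documents returned by this stage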
      filter: {
        $and: [
          {
            freshness: {
              $eq: "fresh",
            },
            year: {
              $lt: 1975,
            },
          },
        ], // text and integer (pre-filtering)
      },
    },
  },
  {
    $match: {
      foo: "bar",
    },
  }, // standard mongodb query
]
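
To make the vector stage less opaque, the sketch below shows what a K nearest neighbors lookup with a pre-filter computes conceptually, using brute-force cosine similarity in plain Python. It is an illustration only: Atlas Vector Search relies on an index rather than a linear scan, and the function names, field names, and sample documents here are made up.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_search(docs, query_vector, k, predicate=lambda doc: True):
    """Return the k documents most similar to query_vector.

    `predicate` plays the role of the pre-filter: documents that fail it
    are never scored.
    """
    candidates = [d for d in docs if predicate(d)]
    candidates.sort(
        key=lambda d: cosine_similarity(d["embedding"], query_vector),
        reverse=True,
    )
    return candidates[:k]


docs = [
    {"title": "A", "year": 1970, "embedding": [1.0, 0.0]},
    {"title": "B", "year": 1980, "embedding": [0.9, 0.1]},
    {"title": "C", "year": 1960, "embedding": [0.0, 1.0]},
]
top = knn_search(docs, [1.0, 0.0], k=2, predicate=lambda d: d["year"] < 1975)
print([d["title"] for d in top])
```

Document B is the second-closest vector overall, but it never reaches the ranking because the year filter removes it first, which is the practical difference between pre-filtering and a later $match.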

More advanced query that enables hybrid search in MongoDB: https://www.mongodb.com/docs/atlas/atlas-vector-search/tutorials/reciprocal-rank-fusion/
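
The tutorial linked above is built around reciprocal rank fusion (RRF), which merges several ranked result lists by summing 1 / (k + rank) for each document across the lists. A minimal sketch in Python, assuming each input is a list of document ids ordered best-first and using the commonly cited constant k = 60 (the function name and sample ids are illustrative):

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of ids into one list ordered by RRF score."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for a stable result.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


vector_hits = ["doc1", "doc2", "doc3"]  # e.g. ids from a vector query
text_hits = ["doc3", "doc1", "doc4"]    # e.g. ids from a full-text query
fused = reciprocal_rank_fusion([vector_hits, text_hits])
print(fused)
```

Documents that appear in both lists accumulate two contributions, so they tend to rise above documents that rank well in only one of them.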

What about Multimodal Retrieval-Augmented Generation?

Send the query results to an LLM for "reasoning". Mixpeek has a library that lets you structure the output:

class Authors(BaseModel):
    author_email: str

class PaperDetails(BaseModel):
    paper_title: str
    author: Authors

response = mixpeek.generate(
    model={"provider":"GPT", "model":"gpt-3.5-turbo"},
    response_format=PaperDetails,
    context=f"format this document and make sure to respond and adhere to the provided JSON format: {corpus}",
    messages=[],
    settings={"temperature":0.5},
)

Here we're supplying a corpus to our GPT model and telling it to structure the output in a certain way based on Pydantic schemas.
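
The corpus variable is whatever text your query retrieved. As one possible way to assemble it, the sketch below concatenates the text field of the top query results, with a character budget so the prompt stays bounded. The helper name build_corpus, the budget, and the page-record shape (borrowed from the handler earlier in this post) are assumptions.

```python
def build_corpus(results, max_chars=4000):
    """Join retrieved page texts into one prompt-sized string.

    Results are assumed to be ordered best-first; pages are added whole
    until the next one would exceed max_chars.
    """
    parts = []
    used = 0
    for doc in results:
        chunk = f"[{doc['file_url']} p.{doc['page_number']}]\n{doc['text']}"
        # +2 accounts for the blank-line separator between chunks.
        cost = len(chunk) + (2 if parts else 0)
        if used + cost > max_chars:
            break
        parts.append(chunk)
        used += cost
    return "\n\n".join(parts)


results = [
    {"file_url": "s3://bucket/paper.pdf", "page_number": 1, "text": "Title and abstract"},
    {"file_url": "s3://bucket/paper.pdf", "page_number": 2, "text": "Related work"},
]
corpus = build_corpus(results)
print(corpus)
```

Prefixing each chunk with its file_url and page number gives the model something to cite and makes it easier to trace an answer back to the source object.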

This returns amazing, structured outputs:

{
    "author": {"author_email": "shannons@allenai.org"},
    "paper_title": "LayoutParser: A Unified Toolkit for Deep Learning Based"
}

Completely free AI playground to use these methods: https://mixpeek.com/start

Benefits of Mixpeek & MongoDB

  • Consistent: Leveraging MongoDB's change streams, every write is causally consistent
  • Multimodal: One query that spans multiple indexes and embedding spaces
  • Durable: Mixpeek ensures the entire process per write has guaranteed execution
  • Atomic: If one step fails in the pipeline, nothing gets written, so you don't have any half-written data
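
The atomicity point can be pictured as "stage everything, then commit once". The sketch below is a simplified model of that idea, not Mixpeek's implementation: run_pipeline, the step functions, and the list standing in for a collection are all hypothetical.

```python
def run_pipeline(record, steps, sink):
    """Run every step on a staged copy; write to sink only if all succeed."""
    staged = dict(record)
    for step in steps:
        # Any exception raised here aborts before the sink is touched.
        staged = step(staged)
    sink.append(staged)
    return staged


def extract(doc):
    return {**doc, "text": "hello world"}


def embed(doc):
    if not doc.get("text"):
        raise ValueError("nothing to embed")
    # Toy embedding: a single number derived from the text length.
    return {**doc, "embedding": [float(len(doc["text"]))]}


def failing_step(doc):
    raise RuntimeError("model unavailable")


sink = []
run_pipeline({"file_url": "s3://bucket/a.pdf"}, [extract, embed], sink)

try:
    run_pipeline({"file_url": "s3://bucket/b.pdf"}, [extract, failing_step, embed], sink)
except RuntimeError:
    pass  # the failed run left no partial record behind

print(len(sink))
```

Only the first file reaches the sink; the second run fails midway and contributes nothing, which is the behavior the bullet describes.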

What else can you build?

Much, much more. The sky is the limit with multimodal AI.

💡
We'll even build a FREE multimodal proof of concept for your business; just schedule a call.
About the author
Ethan Steininger


Multimodal Makers | Mixpeek
