Reverse Image Search with CLIP and MongoDB

AI-powered image discovery app using Mixpeek's multimodal SDK and MongoDB's $vectorSearch. Features deep learning, vector embeddings, and KNN search for advanced visual content management.

The team at Shimmer built an AI-powered image discovery app that showcases Mixpeek's multimodal understanding and MongoDB's $vectorSearch.

This writeup explores the architecture, implementation, and key features of Shimmer.

Architecture Overview

graph TD
    A[User] -->|Interacts with| B[Shimmer Web App]
    B -->|Sends requests| C[Backend Server]
    C -->|Uses| D[Mixpeek SDK]
    D -->|Generates embeddings and metadata| C
    C -->|Stores/Retrieves data| E[MongoDB]
    E -->|Performs $vectorSearch| E
    subgraph "Frontend"
        A
        B
    end
    subgraph "Backend"
        C
        D
    end
    subgraph "Database"
        E
    end
    style A fill:#f9f,stroke:#333,stroke-width:2px
    style B fill:#bbf,stroke:#333,stroke-width:2px
    style C fill:#dfd,stroke:#333,stroke-width:2px
    style D fill:#fdd,stroke:#333,stroke-width:2px
    style E fill:#ddf,stroke:#333,stroke-width:2px

Shimmer consists of three main components:

  1. Frontend: A responsive web application built with HTML, CSS, and JavaScript
  2. Backend: A server that uses Mixpeek's multimodal understanding SDK to generate embeddings and metadata
  3. Database: MongoDB with $vectorSearch for similarity search

Implementation Details

1. Image Indexing

When a new image is added to the system, it goes through the following process:

  1. The image is uploaded to the Mixpeek API
  2. Mixpeek processes the image, generating:
    • Vector embedding
    • Caption
    • Extracted metadata
  3. The processed data is stored in MongoDB
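
The last step can be sketched in a few lines of Python. This is a minimal illustration, not Shimmer's actual code: the helper name `build_image_document` and the example values are hypothetical, and the arguments stand in for whatever the Mixpeek processing step returns. The field names follow the sample document in this section.

```python
from datetime import datetime, timezone


def build_image_document(file_id, file_url, collection_id, embedding, caption, metadata=None):
    """Assemble the record stored for one indexed image.

    Hypothetical helper: the arguments represent the output of the
    image-processing step, which is not shown here.
    """
    if not embedding:
        raise ValueError("embedding must be a non-empty list of floats")
    return {
        "created_at": datetime.now(timezone.utc),
        "embedding": [float(x) for x in embedding],
        "caption": caption,
        "file_url": file_url,
        "file_id": file_id,
        "collection_id": collection_id,
        "metadata": dict(metadata or {}),
    }


doc = build_image_document(
    file_id="example-file-id",
    file_url="https://example.com/example.jpg",
    collection_id="shimmer",
    embedding=[0.19, -0.05, 0.33],  # toy 3-dimensional vector for illustration only
    caption="A small wooden boat on a lake",
    metadata={"url": "https://example.com/source.jpg"},
)

# With pymongo, the document could then be stored with:
#   collection.insert_one(doc)
```

Validating the embedding before insertion keeps documents without a usable vector out of the collection, where they could never be matched by a vector query.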

Here's a sample of the MongoDB document structure with the data extracted via Mixpeek AI:

{
  "_id": { "$oid": "66c2350bd8a79aee18e80e25" },
  "created_at": { "$date": "2024-08-18T17:53:15.387Z" },
  "embedding": [0.18961688876152039, -0.05179230123758316, ...],
  "caption": "A small wooden boat floating on a serene lake surrounded by misty mountains at sunset",
  "file_url": "https://mixpeek-api.s3.us-east-2.amazonaws.com/ix-0Sqm__ZbVpAIHRkOJIyqEyhGEaNXs2dvY6WXJ6o6GkEzI0lXfR7S-qBimKKhI_OS8Bw/shimmer/a26b0f1b-6ada-4ab8-8536-f827582e4859.jpg",
  "file_id": "a26b0f1b-6ada-4ab8-8536-f827582e4859",
  "collection_id": "shimmer",
  "metadata": {
    "url": "https://scontent-lga3-1.cdninstagram.com/v/t51.29350-15/455897686_812566104022598_2572383807345858400_n.jpg?stp=dst-jpg_e35&_nc_ht=scontent-lga3-1.cdninstagram.com&_nc_cat=102&_nc_ohc=WgWr7a2ok-8Q7kNvgEBReYB&edm=AEhyXUkBAAAA&ccb=7-5&ig_cache_key=MzQzNTI2ODU2NDU0NDQ5NTI2Ng%3D%3D.2-ccb7-5&oh=00_AYC2dV7pLOxfcPrYfoKX7dAVwXFqeHsqGfEWjVYi9xbXiA&oe=66C7F7B3&_nc_sid=8f1549",
    "ai_analysis": {
      "scene_classification": [
        {"label": "landscape", "confidence": 0.98},
        {"label": "sunset", "confidence": 0.95},
        {"label": "lake", "confidence": 0.92}
      ],
      "object_detection": [
        {"label": "boat", "confidence": 0.89, "bounding_box": [0.2, 0.6, 0.4, 0.3]},
        {"label": "mountain", "confidence": 0.87, "bounding_box": [0.1, 0.1, 0.8, 0.5]},
        {"label": "tree", "confidence": 0.76, "bounding_box": [0.05, 0.4, 0.2, 0.3]}
      ],
      "color_analysis": {
        "dominant_colors": [
          {"color": "#FF7F50", "percentage": 0.35},
          {"color": "#4682B4", "percentage": 0.25},
          {"color": "#2F4F4F", "percentage": 0.20}
        ],
        "average_color": "#A0522D"
      },
      "emotion_analysis": {
        "overall_mood": "peaceful",
        "emotions": [
          {"emotion": "tranquility", "intensity": 0.85},
          {"emotion": "awe", "intensity": 0.72},
          {"emotion": "serenity", "intensity": 0.68}
        ]
      },
      "artistic_style": {
        "style": "landscape photography",
        "influences": ["Ansel Adams", "Peter Lik"],
        "techniques": ["long exposure", "golden hour lighting"]
      },
      "image_quality": {
        "sharpness": 0.92,
        "exposure": 0.88,
        "noise_level": 0.05,
        "dynamic_range": 0.85
      },
      "semantic_segmentation": {
        "sky": 0.30,
        "water": 0.25,
        "mountains": 0.20,
        "vegetation": 0.15,
        "boat": 0.05,
        "other": 0.05
      },
      "text_detection": [],
      "estimated_location": {
        "type": "mountain lake",
        "confidence": 0.85,
        "possible_locations": [
          {"name": "Lake Louise, Canada", "confidence": 0.6},
          {"name": "Lake Bled, Slovenia", "confidence": 0.4}
        ]
      }
    }
  }
}
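
For $vectorSearch to query the embedding field, the collection needs an Atlas Vector Search index over it. A minimal sketch of such an index definition, written as a Python dict, might look like the following. The helper name `vector_index_definition`, the cosine similarity, the `collection_id` filter field, and the 512 dimensions are all assumptions for illustration; `numDimensions` must equal the actual length of the stored embedding.

```python
def vector_index_definition(num_dimensions, similarity="cosine", filter_paths=()):
    """Return an Atlas Vector Search index definition for the `embedding` field.

    Hypothetical helper; the similarity function and filter paths are assumptions.
    """
    if similarity not in ("cosine", "euclidean", "dotProduct"):
        raise ValueError(f"unsupported similarity: {similarity}")
    fields = [{
        "type": "vector",
        "path": "embedding",
        "numDimensions": num_dimensions,
        "similarity": similarity,
    }]
    # Fields used in the $vectorSearch `filter` must be indexed as filter fields.
    fields += [{"type": "filter", "path": path} for path in filter_paths]
    return {"fields": fields}


# 512 is a placeholder; use the real embedding length.
definition = vector_index_definition(512, filter_paths=["collection_id"])
```

An index built from a definition like this can be created through the Atlas UI or a driver.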

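2. Vector Search

Shimmer uses MongoDB's $vectorSearch for efficient nearest-neighbor search over the stored embeddings. Because the stage sets numCandidates, MongoDB runs an approximate nearest-neighbor search against the vector index rather than an exact scan. The search query looks like this: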

{
  "$vectorSearch": {
    "index": "vector_index",
    "filter": filter_query,
    "path": "embedding",
    "queryVector": embedding,
    "numCandidates": 100 * page,
    "limit": 100 * page
  }
}

This query allows for:

  • Filtering based on metadata
  • Efficient similarity search using vector embeddings
  • Pagination for large result sets, by growing limit and numCandidates with the page number
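
One way to wrap this query in a reusable function is sketched below. It is a variation on the query above, not Shimmer's implementation: the function name, the `oversample` factor, the trailing `$skip` stage, and the projected fields are assumptions. It requests more candidates than results, which generally improves recall, and caps numCandidates at 10,000, the maximum MongoDB accepts.

```python
MAX_NUM_CANDIDATES = 10_000  # upper bound MongoDB accepts for numCandidates


def build_search_pipeline(query_vector, page, page_size=100, filter_query=None,
                          index="vector_index", oversample=10):
    """Build an aggregation pipeline that returns one page of similar images."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    limit = page_size * page
    num_candidates = min(limit * oversample, MAX_NUM_CANDIDATES)
    if limit > num_candidates:
        raise ValueError("page is too deep: limit cannot exceed numCandidates")

    stage = {
        "index": index,
        "path": "embedding",
        "queryVector": list(query_vector),
        "numCandidates": num_candidates,
        "limit": limit,
    }
    if filter_query:
        stage["filter"] = filter_query

    return [
        {"$vectorSearch": stage},
        # Drop the results already shown on earlier pages.
        {"$skip": page_size * (page - 1)},
        # Return display fields plus the similarity score, not the raw vector.
        {"$project": {
            "caption": 1,
            "file_url": 1,
            "score": {"$meta": "vectorSearchScore"},
        }},
    ]


pipeline = build_search_pipeline(
    [0.1, 0.2], page=2, filter_query={"collection_id": "shimmer"}
)
```

With pymongo, the resulting list could be passed to `collection.aggregate(pipeline)`. Since the limit grows with each page, deep pagination gets progressively more expensive, so this pattern suits feeds where most users only view the first few pages.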

3. Frontend Implementation

The frontend is built using HTML, CSS, and JavaScript. Here's a snippet of the JavaScript code that handles image rendering and infinite scrolling:

class GalleryManager {
  constructor() {
    this.isLoading = false; // guards against overlapping fetches
    this.itemsPerPage = 12;
    this.currentFetcher = new ListFilesFetcher(); // supplies pages of images
  }

  // Fetch the next batch of images and render it.
  async fetchImages() {
    if (this.isLoading) return; // a request is already in flight
    this.isLoading = true;
    this.showLoader();

    try {
      const data = await this.currentFetcher.fetch();
      if (data.results) {
        await this.displayImages(data.results);
      } else {
        console.error('Unexpected data structure:', data);
      }
    } catch (error) {
      console.error('Error fetching images:', error);
    } finally {
      // Always reset the loading state, even if the fetch failed.
      this.isLoading = false;
      this.hideLoader();
    }
  }

  async displayImages(images) {
    // ... (code to display images in the gallery)
  }

  // Load more images once the user scrolls within `threshold` pixels of the bottom.
  handleScroll() {
    const scrollPosition = window.innerHeight + window.scrollY;
    const documentHeight = document.documentElement.scrollHeight;
    const threshold = 200;

    if (scrollPosition >= documentHeight - threshold && this.currentFetcher.hasNextPage()) {
      this.fetchImages();
    }
  }

  // ... (other methods)
}

This code manages the image gallery, handles infinite scrolling, and dynamically loads images as the user scrolls.

Key Features and Benefits

  1. Image Search: Utilizes vector embeddings and KNN search for fast and accurate image retrieval.
  2. Multimodal Understanding: Leverages Mixpeek's SDK to generate rich metadata, including captions and vector embeddings.
  3. Scalability: MongoDB's $vectorSearch allows for efficient searching even with large image collections.
  4. Responsive UI: The frontend provides a smooth user experience with infinite scrolling and dynamic image loading.
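
To make the first feature concrete, the idea behind nearest-neighbor search over embeddings can be shown with a brute-force version in plain Python. This is a conceptual sketch only: the function names and toy 2-dimensional vectors are invented for illustration, and MongoDB's $vectorSearch uses an index instead of comparing the query against every stored vector.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def knn(query, items, k):
    """Return the ids of the k items most similar to `query`.

    `items` is a list of (item_id, vector) pairs.
    """
    scored = [(cosine_similarity(query, vec), item_id) for item_id, vec in items]
    scored.sort(reverse=True)  # highest similarity first
    return [item_id for _, item_id in scored[:k]]


# Toy data: real image embeddings have hundreds of dimensions.
items = [
    ("boat", [1.0, 0.0]),
    ("lake", [0.9, 0.1]),
    ("city", [0.0, 1.0]),
]
nearest = knn([1.0, 0.05], items, k=2)
```

Here `nearest` contains the two ids whose vectors point in nearly the same direction as the query. The brute-force scan costs time proportional to the collection size, which is why an indexed search matters for large image collections.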

Why Shimmer is Useful

Shimmer improves image discovery by:

  1. Enabling Visual Search: Users can find similar images based on visual content, not just text descriptions.
  2. Improving Content Organization: Automatically generated captions and metadata help in better categorization and retrieval of images.
  3. Enhancing User Experience: The intuitive interface and efficient search capabilities make image exploration enjoyable and productive.

Aesthetic-minded creators (like myself, hehe) can then create custom feeds of their posts: https://shimmer.so/ethan.codes

Other Use Cases for Image Discovery

  1. Medical Imaging Analysis:
    Enable radiologists to quickly find similar X-rays or MRIs, improving diagnosis accuracy and speed. This could potentially save lives by identifying critical conditions earlier.
  2. Retail Product Recommendation:
    Implement visual search in e-commerce platforms, allowing customers to find products similar to an image they upload. This can significantly boost sales and improve customer satisfaction.
  3. Satellite Imagery for Climate Change Research:
    Analyze vast archives of satellite images to identify and track changes in landscapes over time, aiding in climate change research and environmental monitoring.

These use cases demonstrate the versatility and potential impact of AI-powered image discovery across various industries, showcasing how technology like Shimmer can drive innovation and efficiency in critical sectors.

About the author
Ethan Steininger

Probably outside.
