Build the Ultimate MCP Server for Multimodal AI

Supports RAG over audio, video, images and text!

Today, we will show you how to add multimodal capabilities to any AI app.

We will achieve this by building an ultimate MCP server for multimodal AI.

Our tech stack:

  • Pixeltable to build the multimodal AI infrastructure (open-source).

  • CrewAI to orchestrate the agentic workflow.

For context, Pixeltable is a go-to Python library for Multimodal AI, which streamlines the entire pipeline from data storage to model execution.

It handles images, videos, text & audio effortlessly. Our MCP servers are built on top of Pixeltable.

Here’s the system overview:

  • User submits a query

  • The router agent identifies the modality and triggers a specialist

  • Specialist agent sends relevant context to the response generator

  • The user receives a coherent response
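To make the four steps above concrete, here is a minimal plain-Python sketch of the control flow. It is only an illustration: the real router is an LLM agent rather than a keyword matcher, and the names `MODALITY_HINTS`, `route`, `make_specialist`, and `answer` are hypothetical, not taken from the article's code.

```python
from typing import Callable, Dict, Tuple

# Hypothetical keyword hints standing in for the LLM-based router agent.
MODALITY_HINTS: Dict[str, Tuple[str, ...]] = {
    "video": ("video", "clip", "scene"),
    "audio": ("audio", "podcast", "recording"),
    "image": ("image", "photo", "picture"),
}


def route(query: str) -> str:
    """Step 2: pick a modality, falling back to documents."""
    lowered = query.lower()
    for modality, hints in MODALITY_HINTS.items():
        if any(hint in lowered for hint in hints):
            return modality
    return "document"


def make_specialist(modality: str) -> Callable[[str], str]:
    """Step 3: a stub specialist; a real one would query its MCP server."""
    def specialist(query: str) -> str:
        return f"[{modality} context for: {query}]"
    return specialist


SPECIALISTS = {m: make_specialist(m) for m in ("video", "audio", "image", "document")}


def answer(query: str) -> str:
    """Steps 1-4: route the query, fetch context, and build a response."""
    modality = route(query)
    context = SPECIALISTS[modality](query)
    return f"Answer based on {context}"
```

Calling `answer("Describe this photo")` sends the query through the image specialist stub. In the actual system, each stub is replaced by an agent with MCP tools.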

We have added a video at the top if you prefer that.

The code is available in this Studio: Ultimate MCP server for Multimodal AI. You can run it without any installations by reproducing our environment below:

Let's dive into the code!


Docker Setup

Deploy the Pixeltable MCP server using Docker Compose.

This setup starts 4 MCP servers (document, audio, image, & video) with Server-Sent Events (SSE) transport.
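Once the containers are up, a quick check that all four servers are reachable can save some debugging later. The sketch below uses only the standard library. The port numbers and the `/sse` path are assumptions for illustration, so substitute whatever your Docker Compose file actually publishes.

```python
import socket

# Assumed host ports, one per server; replace with the ports from your compose file.
SERVERS = {"document": 8080, "audio": 8081, "image": 8082, "video": 8083}


def sse_url(port: int, host: str = "localhost") -> str:
    """Build the SSE endpoint URL (the /sse path is an assumption)."""
    return f"http://{host}:{port}/sse"


def is_listening(port: int, host: str = "localhost", timeout: float = 1.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    for name, port in SERVERS.items():
        status = "up" if is_listening(port) else "down"
        print(f"{name:<8} {sse_url(port)} {status}")
```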

Connect MCP server to CrewAI

With our Pixeltable servers prepared, let's integrate MCP servers as tools in CrewAI!

It takes only a few lines of code.
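Here is a minimal sketch of the connection step, assuming the `MCPServerAdapter` class from `crewai-tools` (installed with the `mcp` extra) and a video server on a hypothetical local port. The helper names `sse_params` and `run_with_video_tools` are ours, and the URL is an assumption, not the article's actual configuration.

```python
def sse_params(port: int, host: str = "localhost") -> dict:
    """Connection parameters for one SSE-based MCP server (URL is assumed)."""
    return {"url": f"http://{host}:{port}/sse"}


def run_with_video_tools(port: int = 8083) -> None:
    """List the tools exposed by the (assumed) video MCP server."""
    # Imported lazily so sse_params() works without CrewAI installed.
    from crewai_tools import MCPServerAdapter  # pip install "crewai-tools[mcp]"

    # The adapter connects to the server and wraps its tools as CrewAI tools;
    # the connection is closed when the `with` block exits.
    with MCPServerAdapter(sse_params(port)) as tools:
        print("Available tools:", [tool.name for tool in tools])
```

Any agent that needs those tools has to be created and run inside the `with` block, while the connection is still open.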

Next, we start defining the agents...

Define Query Router Agent

The Router Agent analyzes each user query and assigns it to the appropriate specialist agent.
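A router along these lines could be declared with CrewAI's `Agent` class as sketched below. The role, goal, and backstory strings are our own placeholders rather than the article's prompts, and the agent falls back to whichever LLM your CrewAI environment is configured with.

```python
# Placeholder prompt text; the article's actual wording is not shown here.
ROUTER_CONFIG = {
    "role": "Query Router",
    "goal": "Decide whether a query concerns video, audio, image, or document data.",
    "backstory": "You triage incoming questions and hand each one to the right specialist.",
    "allow_delegation": False,
    "verbose": True,
}


def build_router_agent():
    """Create the router agent (requires crewai and a configured LLM)."""
    from crewai import Agent  # imported lazily; needs `pip install crewai`

    return Agent(**ROUTER_CONFIG)
```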

Define Video Specialist Agent

The Video Specialist Agent uses the Video MCP server for its tools.

It creates an index, inserts videos, and processes both frames and audio to make them available for RAG.

Similarly, we can define the other specialists: the Image, Audio, and Document Specialist Agents. They use the same code, which is linked at the end.
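Since all four specialists share the same shape, one way to express them is a small factory that takes the modality and the tools loaded from the matching MCP server. This is a sketch under our own assumptions: `specialist_config` and `build_specialist` are hypothetical helpers, and the prompt text is a placeholder.

```python
def specialist_config(modality: str) -> dict:
    """Agent settings for one modality (placeholder prompt text)."""
    return {
        "role": f"{modality.capitalize()} Specialist",
        "goal": f"Index {modality} data and retrieve context relevant to the user's query.",
        "backstory": (
            f"You operate the {modality} MCP server: "
            "you create indexes, insert files, and run searches."
        ),
        "verbose": True,
    }


def build_specialist(modality: str, tools: list):
    """Create a specialist agent; `tools` come from that modality's MCP server."""
    from crewai import Agent  # imported lazily; needs `pip install crewai`

    return Agent(tools=tools, **specialist_config(modality))
```

With this, `build_specialist("video", tools)` and `build_specialist("audio", tools)` differ only in the modality name and the tool list passed in.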

Define Response Synthesis Agent

The Synthesis Agent serves as the final quality-control layer, refining the retrieval outputs of the specialist agents into polished, user-friendly responses.
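As a rough sketch, the synthesis step can be modeled as one agent with one task whose description carries the user query and the retrieved context. The prompt wording and the helper names `synthesis_prompt` and `build_synthesis_crew` are assumptions for illustration.

```python
def synthesis_prompt(query: str, context: str) -> str:
    """Task description combining the query with the specialist's output."""
    return (
        f"User query: {query}\n"
        f"Retrieved context: {context}\n"
        "Write a clear, accurate answer using only the retrieved context."
    )


def build_synthesis_crew(query: str, context: str):
    """Wrap the synthesis agent and its task in a single-agent crew."""
    from crewai import Agent, Crew, Task  # imported lazily; needs crewai

    agent = Agent(
        role="Response Synthesizer",
        goal="Turn raw retrieval output into a polished, user-friendly answer.",
        backstory="You are the final quality check before a reply reaches the user.",
    )
    task = Task(
        description=synthesis_prompt(query, context),
        expected_output="A concise answer grounded in the retrieved context.",
        agent=agent,
    )
    return Crew(agents=[agent], tasks=[task])
```

Calling `.kickoff()` on the returned crew would run the task with your configured LLM.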

Create CrewAI Agentic Flow

Let's explore how to connect our crews of agents and Pixeltable MCP servers as tools within CrewAI Flow...👇
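Below is one possible shape for such a flow, using the `Flow` base class with the `@start` and `@listen` decorators from `crewai.flow.flow`. To keep the sketch small, the three stages are passed in as plain callables (`classify`, `retrieve`, `synthesize`, all hypothetical) and the chosen modality is stored in the flow state. CrewAI also offers a `@router` decorator for branching, which the original code may use instead.

```python
MODALITIES = ("video", "audio", "image", "document")


def normalize_route(label: str) -> str:
    """Map the router's free-text label to a known modality."""
    label = label.strip().lower()
    return label if label in MODALITIES else "document"


def build_flow_class(classify, retrieve, synthesize):
    """Return a Flow subclass wired to the three stage callables."""
    from crewai.flow.flow import Flow, listen, start  # needs crewai

    class PixeltableFlow(Flow):
        @start()
        def route_query(self):
            # `query` arrives via kickoff(inputs={"query": ...}).
            self.state["modality"] = normalize_route(classify(self.state["query"]))

        @listen(route_query)
        def retrieve_context(self):
            # The specialist for the chosen modality fetches relevant context.
            self.state["context"] = retrieve(
                self.state["modality"], self.state["query"]
            )

        @listen(retrieve_context)
        def respond(self):
            # The synthesis step produces the final answer.
            return synthesize(self.state["query"], self.state["context"])

    return PixeltableFlow
```

A run would then look like `build_flow_class(classify, retrieve, synthesize)().kickoff(inputs={"query": "What is the name of the younger kid?"})`, where each callable wraps the corresponding crew.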

Now here's the video that we'll ingest and do RAG over.

You can do the same for any other modality (images, audio, documents).

No code changes are required.

Now let's see our MCP-powered, multi-modal, multi-agent workflow in action.

Below, we invoke the PixeltableFlow and ask the system for the name of the younger kid (which is mentioned in the video). It returned the correct answer:

The entire code is available in this Studio: Ultimate MCP server for Multimodal AI. You can run it without any installations by reproducing our environment below:

Thanks for reading!
