GroundX: An enterprise-grade RAG solution [Open-source]
If you don't want to send your data to OpenAI or any external servers...
Try GroundX On-Prem, the ultimate secure and scalable RAG solution you can use locally or on-premise.
GroundX is designed to process complex, real-world documents that can have images, tables, and flowcharts along with regular text.
What makes it a game-changer:
Great Python SDK
Compatible with any Kubernetes setup
Secure storage for data and vectors
Ingest service fine-tuned on 1M+ documents
Supports hybrid RAG pipelines effortlessly
GroundX consistently beats leading RAG tools when it comes to handling complex, large-scale documents.
Thanks to EyeLevel.ai for partnering today.
Multimodal RAG with DeepSeek Janus
Continuing the discussion from GroundX…
After DeepSeek-R1, DeepSeek dropped more open-weight multimodal models—Janus, Janus-Pro, and Janus-Flow.
They can understand images and generate images from text input.
Moreover, Janus-Pro beats OpenAI's DALL-E 3 and Stable Diffusion on the GenEval and DPG-Bench benchmarks.
Today, let’s do a hands-on demo of building a multimodal RAG with Janus-Pro on a complex document shown below:
It has several complex diagrams, text within visualizations, and tables—perfect for multimodal RAG.
We’ll use:
ColPali to understand and embed docs using its vision capabilities.
Qdrant as the vector database.
DeepSeek’s latest Janus-Pro multimodal LLM to generate a response.
The video at the top shows the final outcome.
Let's build it!
1) Embed data
We extract each document page as an image and embed it using ColPali.
We did a full architectural breakdown of ColPali in Part 9 of the RAG crash course and also optimized it with binary quantization.
ColPali uses vision capabilities to understand the context. It produces patches for every page, and each patch gets an embedding vector.
This is implemented below:
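Here's a minimal sketch of this step. It assumes the pdf2image and colpali-engine packages, the vidore/colpali-v1.2 checkpoint, and a file named document.pdf; swap in your own document and checkpoint as needed:

```python
import torch
from pdf2image import convert_from_path
from colpali_engine.models import ColPali, ColPaliProcessor

# Convert each PDF page into a PIL image (requires poppler installed)
pages = convert_from_path("document.pdf")

# Load the ColPali model and its processor
model_name = "vidore/colpali-v1.2"
colpali_model = ColPali.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="cuda:0"
).eval()
colpali_processor = ColPaliProcessor.from_pretrained(model_name)

# Embed pages in small batches; each page yields a grid of patch embeddings
page_embeddings = []
for i in range(0, len(pages), 4):
    batch = colpali_processor.process_images(pages[i : i + 4]).to(colpali_model.device)
    with torch.no_grad():
        embeddings = colpali_model(**batch)  # shape: (batch, n_patches, dim)
    page_embeddings.extend(embeddings.to(torch.float32).cpu().tolist())
```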
2) Vector database
Embeddings are ready. Next, we create a Qdrant vector database and store these embeddings in it, as demonstrated below:
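A minimal sketch of this step, assuming qdrant-client 1.10+ (which supports multivector points) and the page_embeddings list from the previous step; the collection name "docs" is arbitrary:

```python
from qdrant_client import QdrantClient, models

# Spin up an in-memory Qdrant instance (use a local/remote server in production)
client = QdrantClient(":memory:")

# One multivector per page: a list of 128-dim patch embeddings,
# scored with MAX_SIM (ColPali's late-interaction scoring)
client.create_collection(
    collection_name="docs",
    vectors_config=models.VectorParams(
        size=128,
        distance=models.Distance.COSINE,
        multivector_config=models.MultiVectorConfig(
            comparator=models.MultiVectorComparator.MAX_SIM
        ),
    ),
)

# Store each page's patch embeddings along with its page index
client.upsert(
    collection_name="docs",
    points=[
        models.PointStruct(id=i, vector=emb, payload={"page": i})
        for i, emb in enumerate(page_embeddings)
    ],
)
```

The MAX_SIM comparator mirrors ColPali's late-interaction scoring: each query token is matched against its best page patch, and the per-token scores are summed.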
3) Download DeepSeek Janus
Next, we set up DeepSeek's latest Janus-Pro by downloading it from Hugging Face.
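A sketch of the setup, loosely following the usage shown in DeepSeek's Janus GitHub repo; it assumes the janus package (installed from that repo) and the deepseek-ai/Janus-Pro-7B checkpoint:

```python
import torch
from transformers import AutoModelForCausalLM
from janus.models import MultiModalityCausalLM, VLChatProcessor

model_path = "deepseek-ai/Janus-Pro-7B"

# The processor handles the chat template, image tokens, and tokenization
vl_chat_processor = VLChatProcessor.from_pretrained(model_path)
tokenizer = vl_chat_processor.tokenizer

# Load the multimodal LLM and move it to the GPU
vl_gpt: MultiModalityCausalLM = AutoModelForCausalLM.from_pretrained(
    model_path, trust_remote_code=True
)
vl_gpt = vl_gpt.to(torch.bfloat16).cuda().eval()
```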
4) Query vector database and generate a response
Next, we:
Query the vector database to get the most relevant pages.
Pass the pages (as images) along with the query to DeepSeek Janus-Pro to generate the response.
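Here's a rough end-to-end sketch of these two steps, reusing the objects created earlier (colpali_model, colpali_processor, client, pages, vl_chat_processor, vl_gpt, tokenizer); the query string is illustrative, and the prompt format follows the conversation template from DeepSeek's Janus repo:

```python
import torch

query = "Explain the architecture diagram in this document."  # hypothetical example question

# 1) Embed the query with ColPali and fetch the most relevant pages from Qdrant
query_batch = colpali_processor.process_queries([query]).to(colpali_model.device)
with torch.no_grad():
    query_embedding = colpali_model(**query_batch)[0]  # (n_query_tokens, dim)

results = client.query_points(
    collection_name="docs",
    query=query_embedding.to(torch.float32).cpu().tolist(),
    limit=2,
)
retrieved_pages = [pages[hit.payload["page"]] for hit in results.points]

# 2) Pass the retrieved pages (as images) plus the query to Janus-Pro
placeholders = "\n".join("<image_placeholder>" for _ in retrieved_pages)
conversation = [
    {"role": "<|User|>", "content": f"{placeholders}\n{query}"},
    {"role": "<|Assistant|>", "content": ""},
]

prepare_inputs = vl_chat_processor(
    conversations=conversation, images=retrieved_pages, force_batchify=True
).to(vl_gpt.device)

# Fuse image and text embeddings, then generate the answer
inputs_embeds = vl_gpt.prepare_inputs_embeds(**prepare_inputs)
outputs = vl_gpt.language_model.generate(
    inputs_embeds=inputs_embeds,
    attention_mask=prepare_inputs.attention_mask,
    pad_token_id=tokenizer.eos_token_id,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    max_new_tokens=512,
    do_sample=False,
    use_cache=True,
)
answer = tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=True)
print(answer)
```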
Done!
We have implemented a 100% local Multimodal RAG powered by DeepSeek's latest Janus-Pro.
There's also a small Streamlit layer on top of this, and after building it, we get this clean and neat interface.
In this example, it produces the right response by retrieving the correct page and understanding a complex visualization👇
Here's one more example with a correct response:
Wasn’t that easy and straightforward?
That said, you can avoid the hassle of building an enterprise-grade RAG pipeline yourself with GroundX.
In head-to-head testing, GroundX significantly outperforms many popular RAG tools, especially with respect to complex documents at scale.
Get started with GroundX (open-source) here: GroundX GitHub.
The code for today's demo is available here: Multimodal RAG with DeepSeek.
👉 Over to you: What other demos would you like to see with DeepSeek?
Thanks for reading!