Compare Qwen 3 Coder vs. Sonnet 4 for Code Generation

...by building an entire eval pipeline.

Jul 25, 2025

Build production-ready apps directly from Slack

Factory has introduced Droids into Slack. They can now read and write to your channels, streamlining team workflows.

Steps:

Connect your Slack workspace via the Settings page in Factory.
Next, add the Factory app to any channel by typing /invite Factory in Slack.

Once connected, Droids can see your channels, read conversations, and send messages to you and your team.

Vibe-code production-code apps here →

Build with Factory

Thanks to Factory for partnering today!

Qwen 3 Coder & Sonnet 4 for Code Generation

Qwen-3 Coder is Alibaba’s most powerful open-source coding LLM.

Today, let's build a pipeline to compare it to Sonnet 4 using:

LiteLLM for orchestration (open-source).
DeepEval for evaluation (open-source).
AnthropicAI Claude Sonnet 4 and Qwen 3 Coder as LLMs.
Open Router to access Qwen 3 Coder.

Here's our workflow:

Ingest a GitHub repo and provide it as context to the LLMs.
Generate code using both models.
Evaluate and compare the generated code using DeepEval.

Let’s implement this!

Load API keys

Qwen3 Coder is open-source. But for this demo, we are going to access it via the OpenRouter API.

So we store the OpenRouter and Anthropic API keys in a .env file and load them into the environment.

Ingest GitHub repo

We use GitIngest to turn the user-specified GitHub repo into simple LLM-ready text data.

LLMs will use this as context to answer the user's query.

Code correctness metric

We will now create evaluation metrics for our task using DeepEval.

This metric compares the quality and correctness of the generated code against a reference ground truth code.

Code readability metric

This metric ensures the code adheres to proper formatting and consistent naming conventions.

It also assesses the quality of comments and docstrings that make the code easy to understand.

Best practices metric

This metric ensures that the code is modular, efficient, and implements proper error handling.

Generate model response

Now we are all set to generate responses from both models.

We specify the ingested codebase as context in the prompt, and stream the responses from both models in parallel.

Evaluate generated code

We use GPT-4o as the judge LLM.

It evaluates both responses, produces the metrics declared above, and also provides detailed reasoning for each metric.

Streamlit UI

Finally, we create a nice Streamlit UI that makes comparing and evaluating both models in a single interface easy.

Time to test..

Query 1: Build an MCP server that watches a GitHub repo for new issues and sends them to a Telegram group.

Sonnet 4 vs Qwen 3 Coder:

Correctness: 0.79 vs 0.90
Readability: 0.91 vs 0.90
Best practices: 0.82 vs 0.82

Overall, Qwen3 Coder wins.

Query 2: Build an MCP server that creates a new Notion page when someone drops a file into a specific Google Drive folder.

Sonnet 4 vs. Qwen 3 Coder:

Correctness: 0.74 vs 0.84
Readability: 0.90 vs 0.91
Best practices: 0.73 vs 0.78

Qwen3 Coder wins again!

Finally, here are 10 more evaluations I ran using DeepEval on building MCP servers.

Qwen 3 Coder won in 9 cases.
Claude Sonnet 4 won in 1 case (while having a lower correctness score).

Qwen 3 Coder consistently has a higher correctness score than Sonnet 4.

You can find the code for this newsletter issue here →

Thanks for reading!

P.S. For those wanting to develop “Industry ML” expertise:

At the end of the day, all businesses care about impact. That’s it!

Can you reduce costs?
Drive revenue?
Can you scale ML models?
Predict trends before they happen?

We have discussed several other topics (with implementations) that align with such topics.

Develop "Industry ML" Skills

Here are some of them:

Learn everything about MCPs in this crash course with 9 parts →
Learn how to build Agentic systems in a crash course with 14 parts.
Learn how to build real-world RAG apps and evaluate and scale them in this crash course.

Learn sophisticated graph architectures and how to train them on graph data.
So many real-world NLP systems rely on pairwise context scoring. Learn scalable approaches here.
Learn how to run large models on small devices using Quantization techniques.
Learn how to generate prediction intervals or sets with strong statistical guarantees for increasing trust using Conformal Predictions.
Learn how to identify causal relationships and answer business questions using causal inference in this crash course.
Learn how to scale and implement ML model training in this practical guide.
Learn techniques to reliably test new models in production.
Learn how to build privacy-first ML systems using Federated Learning.
Learn 6 techniques with implementation to compress ML models.

All these resources will help you cultivate key skills that businesses and companies care about the most.

Daily Dose of Data Science

Discussion about this post

Ready for more?