Multi-turn Evals for LLM Apps

Hands-on guide with code.

Avi Chawla

Jul 15, 2025

In today's newsletter:

Turn a Jira ticket into a PR.
Multi-turn evals for LLM apps.
Why do ML models need calibration?

Turn a Jira ticket into a PR with coding Agents

Now you can solve Jira tickets directly via Coding Agents. Just assign the ticket to Codegen and done!

This will generate a PR in a few minutes based on what’s detailed in the ticket!

Try it here →

Use Codegen Coding Agents

Thanks to Codegen for partnering today!

Multi-turn Evals for LLM Apps

Conversational systems need a different kind of evaluation.

Unlike single-turn tasks, conversations unfold over multiple messages.

This means the AI’s behavior must be consistent, compliant, and context-aware across turns, not just accurate in one-shot outputs.

The code snippet below depicts how to use DeepEval (open-source) to run multi-turn, regulation-aware evaluations in just a few lines:

Here’s a quick explanation:

Define your multi-turn test case: Use ConversationalTestCase and pass in a list of turns, just like OpenAI’s message format:

Define a custom metric: This metric uses ConversationalGEval to define a metric in plain English. It checks whether the assistant avoids giving investment advice and instead nudges users toward professional help.

Finally, run the evaluation:

Done!

This will provide a detailed breakdown of which conversations passed and which failed, along with a score distribution:

Moreover, you also get a full UI to inspect individual turns:

There are two good things about this:

The entire pipeline is extremely simple to set up and requires just a few lines of code.
DeepEval is 100% open-source with 9200+ stars, and you can easily self-host it so your data stays where you want.

Here’s the GitHub repo →

You can read about multi-turn evals in the documentation here →

Why ML models need calibration?

Modern neural networks being trained today are highly misleading.

They appear to be heavily overconfident in their predictions.

For instance, if a model predicts an event with a 70% probability, then ideally, out of 100 such predictions, approximately 70 should result in the event occurring.

However, many experiments have revealed that modern neural networks appear to be losing this ability, as depicted below:

The average confidence of LeNet (an old model) closely matches its accuracy.
The average confidence of the ResNet (a relatively modern model) is substantially higher than its accuracy.

Calibration solves this.

A model is calibrated if the predicted probabilities align with the actual outcomes.

Handling this is important because the model will be used in decision-making and an overly confident can be fatal.

To exemplify, say a government hospital wants to conduct an expensive medical test on patients.

To ensure that the govt. funding is used optimally, a reliable probability estimate can help the doctors make this decision.

If the model isn't calibrated, it will produce overly confident predictions.

There has been a rising concern in the industry about ensuring that our machine learning models communicate their confidence effectively.

Thus, being able to detect miscalibration and fix is a super skill one can possess.

Learn how to build well-calibrated models in this crash course →

P.S. Assuming you care about probabilities, which model would you prefer in the image below?

Thanks for reading!

P.S. For those wanting to develop “Industry ML” expertise:

At the end of the day, all businesses care about impact. That’s it!

Can you reduce costs?
Drive revenue?
Can you scale ML models?
Predict trends before they happen?

We have discussed several other topics (with implementations) that align with such topics.

Develop "Industry ML" Skills

Here are some of them:

Learn how to build Agentic systems in a crash course with 14 parts.
Learn how to build real-world RAG apps and evaluate and scale them in this crash course.

Learn sophisticated graph architectures and how to train them on graph data.
So many real-world NLP systems rely on pairwise context scoring. Learn scalable approaches here.
Learn how to run large models on small devices using Quantization techniques.
Learn how to generate prediction intervals or sets with strong statistical guarantees for increasing trust using Conformal Predictions.
Learn how to identify causal relationships and answer business questions using causal inference in this crash course.
Learn how to scale and implement ML model training in this practical guide.
Learn techniques to reliably test new models in production.
Learn how to build privacy-first ML systems using Federated Learning.
Learn 6 techniques with implementation to compress ML models.

All these resources will help you cultivate key skills that businesses and companies care about the most.

Daily Dose of Data Science

Discussion about this post

Ready for more?

Daily Dose of Data Science

Multi-turn Evals for LLM Apps

Hands-on guide with code.

Turn a Jira ticket into a PR with coding Agents

Multi-turn Evals for LLM Apps

​Why ML models need calibration?​

P.S. For those wanting to develop “Industry ML” expertise:

Discussion about this post

Ready for more?

Why ML models need calibration?