Multi-turn Evals for LLM Apps

...explained with code!

Mar 19, 2026

Multi-turn Evals for LLM Apps

Conversational systems need a different kind of evaluation.

Unlike single-turn tasks, conversations unfold over multiple messages.

This means the AI’s behavior must be consistent, compliant, and context-aware across turns, not just accurate in one-shot outputs.

The code snippet below depicts how to use DeepEval (open-source) to run multi-turn, regulation-aware evaluations in just a few lines:

Here’s a quick explanation:

Define your multi-turn test case: Use ConversationalTestCase and pass in a list of turns, just like OpenAI’s message format:

Define a custom metric: This metric uses ConversationalGEval to define a metric in plain English. It checks whether the assistant avoids giving investment advice and instead nudges users toward professional help.

Finally, run the evaluation:

Done!

This will provide a detailed breakdown of which conversations passed and which failed, along with a score distribution:

Moreover, you also get a full UI to inspect individual turns:

There are two good things about this:

The entire pipeline is extremely simple to set up and requires just a few lines of code.
DeepEval is 100% open-source with 9200+ stars, and you can easily self-host it so your data stays where you want.

Here’s the GitHub repo →

You can read about multi-turn evals in the documentation here →

Any AI agent can now do what Claude does!

There’s an open-source version of Anthropic’s new Generative UI tools for Claude.

It lets you render live charts, interactive diagrams, and 3D scenes right inside your own AI apps, just like Claude.

Check this:

You can ask the agent to visualize sorting algorithms, render 3D animations, build interactive dashboards, or generate network diagrams, and it renders everything as live HTML/SVG inside the chat.

It’s built on top of CopilotKit, the open-source frontend stack for AI agents and Generative UI.

The key insight is that most AI apps today are stuck in text-only mode. Generative UI changes that by letting your agents produce fully interactive visual components on the fly.

And CopilotKit lets you build exactly that.

It supports multiple agentic UI patterns like frontend tools, human-in-the-loop workflows, and generative components, and lets you wire them into any AI agent built with LangGraph, CrewAI, or custom backends.

You can find the project repo here →

You can find the CopilotKit main repo here →

We will cover a hands-on demo on Generative UI with CopilotKit pretty soon, but in the meantime, we have shared a hands-on demo below on how you can use CopilotKit to build a full-stack agentic app.

Continue reading below 👇

Building a Full-stack Agentic App

Backends like CrewAI, LangGraph, Mastra, etc., can do a lot.

But the hardest part is embedding them into interactive user-facing software products, like Cursor.

AG-UI (Agent-User Interaction Protocol) is an open-source protocol designed to address this and build front-end-powered Agents.

Today, let’s do a hands-on demo of this and learn how you can use the AG-UI protocol to build user-facing AI apps (you can find the full code here).

First, for more context...

AG-UI is an event-based protocol that enables communication between agents and frontends. It does this using events like:

RUN_STARTED/RUN_FINISHED: Lifecycle event
TEXT_MESSAGE_: Streaming response
TOOL_CALL_: Tool execution

You can directly embed these events into your UI and get reactive agents.

Here’s the workflow of our app

User submits a portfolio request
The agent runs the steps: fetch prices → allocate cash → create insights
AG-UI streams the agent’s progress step by step
CopilotKit updates the UI in real time

1️⃣ Import events

To begin, we import the key events that our Agent will use to communicate with the frontend of our app.

The description for each of these event classes is also shown in the snippet below:

Essentially, the point of these events is that the frontend can use these to display info on the UI when the backend (Agent) sends some info.

For instance, events RUN_STARTED and RUN_FINISHED mark the start or end of an agent’s task execution. As soon as the Agent begins the task, it will emit this event, which can be captured by the UI.

2️⃣ Define State

First, we define the Agent’s Flow state to track:

Available cash
Portfolio
Tool call logs, etc.

This state inherits from CopilotKitState, which extends LangGraph’s MessagesState and will be updated based on the app’s emitted events.

3️⃣ Define Flow

Next, we define a 5-step workflow using CrewAI Flows:

start(): Initialize with portfolio data
chat(): Extract investment parameters
simulation(): Download stock data
allocation(): Calculate performance
insights(): Generate analysis

4️⃣ Create FastAPI Endpoint

This is our endpoint that wraps the Flow to handle incoming requests and stream events to the frontend.

Here, the event generator:

Emits lifecycle events
Streams state updates
Handles tool calls
Manages text responses

5️⃣ Human-in-the-Loop (HITL)

In Agentic apps, it is always desired to have HITL functionalities.

In our case, the Agent can request human approval before rendering charts and updating the portfolio.

This is added in the allocation step of our Flow, and the rationale is that as soon as the Agent has decided on an allocation, it can get human approval before permanently making an update: