In today's newsletter:
Test Agents at scale with other Agents [open-source].
Top 4 LLM fine-tuning frameworks.
16 techniques to build real-world RAG systems.
Test Agents at scale with other Agents [open-source]
Traditional testing relies on fixed inputs and exact expected outputs. But Agents respond in free-form natural language, so there's no single "correct" response. That's why we test Agents using other Agents, by simulating Users and Judges.
Here's the process with LangWatch Scenario framework (open-source):
Define three Agents:
The Agent you want to test.
A User Simulator Agent that acts like a real user.
A Judge Agent for evaluation.
Let your Agent and User Simulator Agent interact with each other.
Evaluate the exchange using the Judge Agent based on the specified criteria.
Done!
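The three-role loop above can be sketched in plain Python. This is an illustrative stand-in, not the LangWatch Scenario API: the class and method names below are hypothetical, but each agent exposes the same single call() entry point the framework expects.

```python
# Illustrative stand-ins for the three roles (NOT the LangWatch Scenario API).
# Each agent implements a single call() method, mirroring the framework's
# one-method integration contract.

class WeatherAgent:
    """The Agent under test: answers (very naively) about the weather."""
    def call(self, message: str) -> str:
        if "weather" in message.lower():
            return "It is sunny and 22C today."
        return "I can only help with weather questions."

class UserSimulatorAgent:
    """Plays a scripted user across multiple turns."""
    def __init__(self, turns):
        self._turns = iter(turns)
    def call(self, last_agent_reply: str) -> str:
        return next(self._turns, "")

class JudgeAgent:
    """Evaluates the full transcript against simple keyword criteria."""
    def __init__(self, criteria):
        self.criteria = criteria
    def call(self, transcript) -> dict:
        text = " ".join(reply for _, reply in transcript)
        return {c: (c.lower() in text.lower()) for c in self.criteria}

def run_scenario(agent, user, judge, max_turns=3):
    """Let the Agent and User Simulator talk, then hand the exchange to the Judge."""
    transcript, message = [], user.call("")
    for _ in range(max_turns):
        if not message:
            break
        reply = agent.call(message)
        transcript.append((message, reply))
        message = user.call(reply)
    return judge.call(transcript)

verdict = run_scenario(
    WeatherAgent(),
    UserSimulatorAgent(["What's the weather like?", "Thanks!"]),
    JudgeAgent(criteria=["sunny"]),
)
print(verdict)  # {'sunny': True}
```

A real Judge would be an LLM scoring the conversation against natural-language criteria rather than a keyword check, but the control flow is the same.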
Key features:
Test Agent behavior by simulating users in different scenarios.
Evaluate at any point of the conversation using multi-turn control.
Integrate any Agent by implementing just one call() method.
Combine with any LLM eval framework or custom evals.
Top 4 Open-source LLM Fine-tuning Frameworks
There are several techniques for fine-tuning LLMs (LoRA, QLoRA, full fine-tuning, and more), and likewise, there are several frameworks available to run them!
Here are the top 4:
From single-GPU "click-to-tune" notebooks to trillion-parameter clusters, these four libraries cover every LLM fine-tuning scenario.
Let's understand which one to use, and when.
Unsloth (42k stars)
Unsloth makes fine-tuning easy and fast, turning a mid-range GPU into a powerhouse with a simple Colab or Kaggle notebook.
Triton kernels: ~2× training speed, up to 80% less VRAM
LoRA / QLoRA / full fine-tuning in 4-, 8-, and 16-bit
Works across text, speech, diffusion, and BERT-style models
Runs on any NVIDIA GPU with CUDA compute capability 7.0+
Perfect for hackers and small teams on 12–24 GB GPUs who need quick LoRA experiments without DeepSpeed configs or clusters
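Most of these frameworks center on LoRA, which freezes the pretrained weight W and trains only a low-rank update ΔW = B·A. A minimal NumPy sketch of the idea (sizes are hypothetical, chosen for illustration):

```python
import numpy as np

# Hypothetical sizes for illustration (think one attention projection).
d, k, r = 1024, 1024, 8   # full weight is d x k; LoRA rank r << min(d, k)

rng = np.random.default_rng(0)
W = rng.standard_normal((d, k)) * 0.01   # frozen pretrained weight
A = rng.standard_normal((r, k)) * 0.01   # trainable, r x k
B = np.zeros((d, r))                     # trainable, d x r; zero init so the update starts at 0

x = rng.standard_normal(k)
# LoRA forward pass: y = (W + B @ A) @ x, without materializing W + B @ A.
y = W @ x + B @ (A @ x)

full_params = d * k        # what full fine-tuning would train
lora_params = r * (d + k)  # what LoRA trains instead
print(f"trainable: {lora_params:,} vs {full_params:,} "
      f"({100 * lora_params / full_params:.2f}% of full fine-tuning)")
```

Training well under 2% of the parameters is what makes a 12–24 GB GPU viable; QLoRA pushes further by also quantizing the frozen W to 4-bit.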
Axolotl (10k stars)
Axolotl keeps your entire pipeline in one YAML file; write once, reuse from data prep to serving.
Full fine-tuning, LoRA, QLoRA, GPTQ, RL, and preference tuning
FlashAttn, XFormers, multi-packing, seq-parallel
Laptop-to-cluster scaling via FSDP, DeepSpeed, Ray
Ready Docker images & PyPI wheels
Perfect for teams that crave reproducibility and want to toggle advanced recipes by flipping a YAML switch.
LlamaFactory (54k stars)
LlamaFactory offers an easy no-code web interface for fine-tuning models: step through a wizard, watch training progress, and deploy with one command.
16-bit, freeze-tune, LoRA, low-bit QLoRA
FlashAttn-2, LongLoRA, GaLore, DoRA baked in
Dashboards via LlamaBoard, W&B, MLflow
One-click OpenAI-style API or vLLM worker
Perfect for builders who prefer GUIs, need cutting-edge features, and want built-in dashboards.
DeepSpeed (39k stars)
DeepSpeed is the engine that turns clusters into supercomputers, unlocking super-fast LLM training and inference.
ZeRO, MoE, 3-D parallelism for trillion-scale training
Custom inference kernels for sub-second latency
ZeroQuant & XTC compression cut size and cost
Plug-and-play with HF, Lightning, MosaicML
Perfect for enterprises and researchers pushing models above ten billion parameters or serving at massive QPS.
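The intuition behind ZeRO is easy to reproduce: with mixed-precision Adam, each parameter costs roughly 16 bytes (2 for fp16 weights, 2 for fp16 gradients, 12 for fp32 optimizer states), and each ZeRO stage partitions one more of those buffers across GPUs. A back-of-envelope calculator based on that accounting (activations and temporary buffers ignored):

```python
def zero_memory_per_gpu(params: float, n_gpus: int, stage: int) -> float:
    """Rough per-GPU memory (bytes) for mixed-precision Adam training:
    2 bytes fp16 params + 2 bytes fp16 grads + 12 bytes fp32 optimizer
    states per parameter. Activations and buffers are ignored."""
    p, g, o = 2 * params, 2 * params, 12 * params
    if stage == 0:   # plain data parallelism: everything replicated
        return p + g + o
    if stage == 1:   # ZeRO-1: partition optimizer states
        return p + g + o / n_gpus
    if stage == 2:   # ZeRO-2: also partition gradients
        return p + (g + o) / n_gpus
    if stage == 3:   # ZeRO-3: also partition parameters
        return (p + g + o) / n_gpus
    raise ValueError("stage must be 0-3")

GB = 1024 ** 3
for s in range(4):
    need = zero_memory_per_gpu(7e9, n_gpus=64, stage=s) / GB
    print(f"7B model, 64 GPUs, ZeRO-{s}: {need:,.1f} GB per GPU")
```

At stage 0 a 7B model needs over 100 GB per GPU just for model state; at stage 3 the same state shards down to a couple of GB, which is why trillion-parameter training becomes feasible at all.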
👉 Over to you: What other frameworks would you add here?
16 techniques to build real-world RAG systems
On paper, implementing a RAG system seems simple: connect a vector database, process documents, embed the data, embed the query, query the vector database, and prompt the LLM with the retrieved context.
But in practice, turning a prototype into a high-performance application is an entirely different challenge.
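That "on paper" pipeline fits in a page of code. A runnable toy sketch, where term-frequency vectors and an in-memory list stand in for a real embedding model and vector database, and the final LLM call is left as a printed prompt:

```python
import math
from collections import Counter

# Toy end-to-end RAG skeleton. Real systems swap in a learned embedding
# model, a vector database, and an actual LLM call; term-frequency
# "embeddings" and cosine similarity stand in here so the flow is runnable.

docs = [
    "RAG retrieves documents and feeds them to an LLM as context.",
    "Vector databases index embeddings for fast similarity search.",
    "Fine-tuning updates model weights on task-specific data.",
]

def embed(text: str) -> Counter:
    for ch in ".,?!":
        text = text.replace(ch, " ")
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

index = [(doc, embed(doc)) for doc in docs]   # the "vector database"

def retrieve(query: str, k: int = 1):
    q = embed(query)
    return [doc for doc, vec in sorted(index, key=lambda p: -cosine(q, p[1]))[:k]]

query = "How does a vector database search embeddings?"
context = retrieve(query, k=1)
prompt = f"Context:\n{context[0]}\n\nQuestion: {query}\nAnswer:"
print(prompt)  # in a real system, this prompt goes to the LLM
```

Every gap between this toy and production (chunking, hybrid search, reranking, evaluation, and so on) is exactly where the 16 techniques below come in.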
We published a two-part guide that covers 16 practical techniques to build real-world RAG systems:
Thanks for reading!
P.S. For those wanting to develop “Industry ML” expertise:
At the end of the day, all businesses care about impact. That’s it!
Can you reduce costs?
Drive revenue?
Can you scale ML models?
Predict trends before they happen?
We have discussed several other topics (with implementations) that align with these goals.
Here are some of them:
Learn how to build Agentic systems in a crash course with 14 parts.
Learn how to build real-world RAG apps and evaluate and scale them in this crash course.
Learn sophisticated graph architectures and how to train them on graph data.
Many real-world NLP systems rely on pairwise context scoring. Learn scalable approaches here.
Learn how to run large models on small devices using Quantization techniques.
Learn how to generate prediction intervals or sets with strong statistical guarantees, and increase trust, using Conformal Prediction.
Learn how to identify causal relationships and answer business questions using causal inference in this crash course.
Learn how to scale and implement ML model training in this practical guide.
Learn techniques to reliably test new models in production.
Learn how to build privacy-first ML systems using Federated Learning.
Learn 6 techniques with implementation to compress ML models.
All these resources will help you cultivate key skills that businesses and companies care about the most.