Reinforcement fine-tuning (RFT) allows you to transform any open-source LLM into a reasoning powerhouse.
No labeled data is needed.
Today, we'll use Predibase for RFT to transform Qwen-2.5:7b into a reasoning model.
We have attached a video above if you prefer watching instead of reading. The code is linked later in the issue.
Let's begin!
Fine-tuning techniques
Before diving into RFT, it's crucial to understand how we usually fine-tune LLMs using SFT, or supervised fine-tuning.
SFT process:
It starts with a static labeled dataset of prompt–completion pairs.
The model weights are adjusted to match these completions.
The best model (LoRA checkpoint) is then deployed for inference.
RFT process:
RFT uses an online “reward” approach—no static labels required.
The model explores different outputs, and a Reward Function scores their correctness.
Over time, the model learns to generate higher-reward answers using GRPO.
SFT uses static data and often memorizes answers. RFT, being online, learns from rewards and explores new strategies.
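At the core of GRPO is a group-relative advantage: for each prompt, the model samples several completions, and each completion's reward is normalized against the mean and standard deviation of its group. A minimal sketch of that computation (illustrative only, not Predibase's implementation):

```python
import statistics

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantage: normalize each sampled completion's reward
    against the mean and std of its group (completions for the same prompt)."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]

# Four completions for the same prompt, scored by a reward function:
print(group_relative_advantages([1.0, 0.0, 2.0, 1.0]))
```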
Implementation
Now, let's turn Qwen-2.5 into a reasoning LLM.
Installation and setup
First, we install all the necessary dependencies and complete the setup (you need a Predibase API key):
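A minimal sketch of what that setup looks like, assuming the Predibase Python SDK and an API key stored in an environment variable (check the SDK docs for the exact client interface):

```python
# Install the Predibase SDK (run once):
#   pip install -U predibase

import os
from predibase import Predibase  # assumes the official Predibase Python SDK

# Authenticate with your Predibase API key; here we assume it lives in an
# environment variable rather than being hard-coded in the notebook.
pb = Predibase(api_token=os.environ["PREDIBASE_API_KEY"])
```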
Training Objective
We'll fine-tune our model using the Countdown dataset, a popular resource for evaluating and enhancing the reasoning and math capabilities of an LLM.
This is passed as a prompt template in the code.
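For illustration, a Countdown-style prompt template might look like the sketch below; the exact wording and tag conventions are assumptions, not the template from the original run:

```python
# Hypothetical Countdown prompt template (illustrative wording).
# {numbers} and {target} are filled in per record; the model is asked to reason
# inside <think> tags and place its final equation inside <answer> tags.
PROMPT_TEMPLATE = (
    "Using the numbers {numbers}, create an equation that equals {target}. "
    "You may use +, -, *, / and each number exactly once. "
    "Show your reasoning inside <think> </think> tags and return the final "
    "equation inside <answer> </answer> tags."
)

print(PROMPT_TEMPLATE.format(numbers=[15, 74, 9], target=50))
```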
Load dataset
Load the dataset for finetuning and add a proper prompt template to each record that we can feed to the LLM.
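A hedged sketch using Hugging Face datasets; the dataset ID and the column names (nums, target) are assumptions, so point this at whichever Countdown dataset you are using. It reuses the PROMPT_TEMPLATE from the previous sketch:

```python
from datasets import load_dataset

# Load a Countdown dataset (the dataset ID and column names are assumptions).
ds = load_dataset("Jiayi-Pan/Countdown-Tasks-3to4", split="train")

def add_prompt(example):
    # Render the prompt template from the previous sketch for every record.
    example["prompt"] = PROMPT_TEMPLATE.format(
        numbers=example["nums"], target=example["target"]
    )
    return example

ds = ds.map(add_prompt)
print(ds[0]["prompt"])
```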
Define Reward functions
#1) Format reward function:
This checks whether the model's output contains a <think> block and an <answer> block in the correct order.
If it does, the model scores +1; otherwise, it scores zero.
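A minimal sketch of such a check with a regular expression; the function name and exact signature are illustrative and will depend on what your training framework expects:

```python
import re

def format_reward(completion: str) -> float:
    """Return 1.0 if the completion contains a <think> block followed by an
    <answer> block, else 0.0."""
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return 1.0 if re.search(pattern, completion, flags=re.DOTALL) else 0.0

# Quick check:
print(format_reward("<think>74 - 15 - 9</think><answer>74 - 15 - 9</answer>"))  # 1.0
print(format_reward("the answer is 50"))                                         # 0.0
```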
#2) Equation reward function:
We parse the model's final equation and check that it uses each of the given numbers exactly once, contains only valid math symbols, and evaluates to the target.
If everything checks out, the reward is +1.
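A hedged sketch of that check; as before, the function name and signature are illustrative:

```python
import re

def equation_reward(completion: str, nums: list[int], target: int) -> float:
    """Return 1.0 if the equation inside <answer>...</answer> uses each given
    number exactly once, contains only valid math symbols, and evaluates to
    the target; otherwise 0.0."""
    match = re.search(r"<answer>(.*?)</answer>", completion, flags=re.DOTALL)
    if not match:
        return 0.0
    equation = match.group(1).strip()

    # Only digits, + - * /, parentheses, dots and whitespace are allowed.
    if not re.fullmatch(r"[\d+\-*/().\s]+", equation):
        return 0.0

    # Every provided number must be used exactly once.
    used = [int(n) for n in re.findall(r"\d+", equation)]
    if sorted(used) != sorted(nums):
        return 0.0

    try:
        # The character whitelist above already rules out names and attributes.
        result = eval(equation, {"__builtins__": {}}, {})
    except Exception:
        return 0.0
    return 1.0 if abs(result - target) < 1e-6 else 0.0

# Quick check: 74 - 15 - 9 = 50
print(equation_reward("<think>...</think><answer>74 - 15 - 9</answer>", [15, 74, 9], 50))  # 1.0
```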
Fine-Tuning Setup
We set up a GRPO-based fine-tuning job with two reward functions—one checks the format, and the other verifies the math.
We also specify target modules for LoRA fine-tuning.
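Predibase wires this up through its SDK; as an open-source sketch of the same recipe (this is not Predibase's API), here is how the pieces fit together with TRL's GRPOTrainer, using the batched reward-function signature TRL expects and LoRA on the attention projections (the specific target modules and hyperparameters are assumptions):

```python
from peft import LoraConfig
from trl import GRPOConfig, GRPOTrainer

# TRL reward functions receive batched completions (plus the other dataset
# columns as keyword arguments) and return one score per sample.
def format_reward_batch(completions, **kwargs):
    return [format_reward(c) for c in completions]

def equation_reward_batch(completions, nums, target, **kwargs):
    return [equation_reward(c, n, t) for c, n, t in zip(completions, nums, target)]

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-7B-Instruct",
    reward_funcs=[format_reward_batch, equation_reward_batch],
    args=GRPOConfig(output_dir="qwen2.5-7b-countdown-grpo"),
    train_dataset=ds,  # must contain a "prompt" column (plus "nums" / "target")
    peft_config=LoraConfig(  # LoRA target modules (assumed choice)
        r=16,
        lora_alpha=32,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    ),
)
trainer.train()
```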
And we're done!
That’s all we need to fine-tune the LLM.
Once training finishes, we can test the model.
Below, we ask it to create an equation with the numbers 15, 74, and 9 that evaluates to 50 (one valid solution is 74 - 15 - 9 = 50), and it produces a correct response:
On the flip side, when we ask Llama3.2 the same question, it fails to produce a correct equation:
The model we just trained is also available on Hugging Face:
You can find all the code for this demo in this Colab notebook →
We'll leave you with a guide on how to choose the right fine-tuning method:
Thanks for reading!