There’s not much we can do to build a supervised system when the data we begin with is unlabeled.
Using unsupervised techniques (if they fit the task) can be a solution, but supervised systems typically outperform unsupervised ones.
Another way, if feasible, is to rely on self-supervised learning.
Self-supervised learning is when we have an unlabeled dataset (say text data), but we somehow figure out a way to build a supervised learning model out of it.
This becomes possible due to the inherent nature of the task.
Consider an LLM, for instance.
In a nutshell, its core objective is to predict the next token based on previously predicted tokens (or the given context).
This is a classification task, and the labels are tokens.
But text data is raw. It has no labels.
Then how do we train a model for this classification task?
Self-supervised techniques solve this problem.
Due to the inherent nature of the task (next-token prediction, to be specific), every piece of raw text data is already self-labeled.
The model is only supposed to learn the mapping from previous tokens to the next token.
This is called self-supervised learning.
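To make the "self-labeled" idea concrete, here is a minimal sketch of how raw text can be turned into (context, label) pairs. The whitespace tokenizer, the `context_size` parameter, and the function name `next_token_pairs` are assumptions made for illustration; real LLMs use subword tokenizers and much longer contexts.

```python
def next_token_pairs(text, context_size=3):
    """Turn raw text into (context, next_token) training pairs."""
    tokens = text.split()  # naive whitespace tokenizer, for illustration only
    pairs = []
    for i in range(1, len(tokens)):
        # The previous tokens are the input; the token at position i is the label.
        context = tokens[max(0, i - context_size):i]
        pairs.append((context, tokens[i]))
    return pairs


pairs = next_token_pairs("the cat sat on the mat")
for context, label in pairs:
    print(context, "->", label)
```

No human wrote any of these labels. They fall out of the text itself.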
Coming back to the topic…
While self-supervised learning is promising, it has limited applicability, largely depending on the task.
When neither option fits, the only remaining possibility is to annotate the dataset.
However, data annotation is difficult, expensive, time-consuming, and tedious.
Active learning is a relatively easy, inexpensive, quick, and interesting way to address this.
Here’s how it works.
Active learning
As the name suggests, the idea is to build the model with active human feedback on examples it is struggling with.
Let’s get into the details.
We begin by manually labeling a tiny percentage of the dataset.
While there’s no rule on how much data should be labeled, I have used active learning (successfully) while labeling as little as ~1% of the dataset, so try something in that range.
Next, build a model on this small labeled dataset.
Of course, this won’t be a perfect model, but that’s okay.
Next, generate predictions on the dataset we did not label.
It’s obvious that we cannot determine if these predictions are correct as we do not have any labels.
That’s why we need to be a bit selective with the type of model we choose.
More specifically, we need a model that, either implicitly or explicitly, can also provide a confidence level with its predictions.
As the name suggests, a confidence level reflects the model’s confidence in generating a prediction.
If a model could speak, it would be like:
I am predicting a “cat” and am 95% confident about my prediction.
I am predicting a “cat” and am 5% confident about my prediction.
And so on…
Probabilistic models (ones that provide a probabilistic estimate for each class) are typically a good fit here.
This is because one can determine a proxy for confidence level from probabilistic outputs.
In the above two examples, consider the gap between 1st and 2nd highest probabilities:
In example #1, the gap is large. This can indicate that the model is quite confident in its prediction.
In example #2, the gap is small. This can indicate that the model is NOT quite confident in its prediction.
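The gap between the top two probabilities is easy to compute. Below is a small sketch; the function name `margin_confidence` and the probability values are made up for illustration and are not taken from any real model.

```python
def margin_confidence(probs):
    """Gap between the highest and second-highest class probabilities."""
    top_two = sorted(probs, reverse=True)[:2]
    return top_two[0] - top_two[1]


confident = [0.90, 0.06, 0.04]   # large gap: the model is fairly sure
uncertain = [0.40, 0.38, 0.22]   # small gap: the model is torn between two classes

print(margin_confidence(confident))
print(margin_confidence(uncertain))
```

The margin is only one possible proxy. The top probability itself or the entropy of the distribution are common alternatives.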
Now, go back to the predictions generated above and rank them in order of confidence.
In such a ranking:
The model is already quite confident about the instances at the top. There’s no point checking those.
Instead, it would be best if we (the human) annotate the instances with which it is least confident.
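The ranking step can be sketched in a few lines. The helper `least_confident`, the batch size `k`, and the example probabilities are assumptions for illustration; the confidence proxy is the top-two margin discussed earlier.

```python
def least_confident(prob_rows, k):
    """Return indices of the k predictions with the smallest top-two margin."""
    def margin(probs):
        first, second = sorted(probs, reverse=True)[:2]
        return first - second

    # Sort example indices from least confident to most confident.
    ranked = sorted(range(len(prob_rows)), key=lambda i: margin(prob_rows[i]))
    return ranked[:k]


predictions = [
    [0.95, 0.03, 0.02],
    [0.88, 0.07, 0.05],
    [0.45, 0.40, 0.15],
    [0.36, 0.34, 0.30],
]
to_annotate = least_confident(predictions, k=2)
print(to_annotate)
```

Here the last two rows have the smallest margins, so they are the ones a human should look at first.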
To get some more perspective, picture two unlabeled data points: one the model predicts with high confidence and one it is unsure about. Logically speaking, which data point’s human label will provide more information to the model? I know you already know the answer.
Thus, in the next step, we provide human labels for the low-confidence predictions and feed them back to the model along with the previously labeled dataset.
Repeat this a few times and stop when you are satisfied with the performance.
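Putting the steps together, the whole loop might look like the sketch below. Everything in it is an assumption made to keep the example self-contained: the synthetic 1-D data, the toy nearest-centroid "model", the `ask_human` stand-in for the annotator, the batch size of 10, and the fixed five rounds. In practice you would plug in your own probabilistic model and a real labeling step.

```python
import math
import random

rng = random.Random(0)

# Synthetic 1-D pool (assumption): class 0 centred at -1, class 1 at +1.
pool = [(rng.gauss(-1.0, 1.0), 0) for _ in range(200)]
pool += [(rng.gauss(1.0, 1.0), 1) for _ in range(200)]
rng.shuffle(pool)
features = [x for x, _ in pool]
hidden_labels = [y for _, y in pool]  # only the simulated annotator reads these


def ask_human(i):
    """Hypothetical stand-in for a human annotator."""
    return hidden_labels[i]


def fit(labeled):
    """Toy model: one centroid (mean feature value) per class."""
    centroids = {}
    for c in (0, 1):
        xs = [features[i] for i, y in labeled.items() if y == c]
        centroids[c] = sum(xs) / len(xs)
    return centroids


def predict_proba(centroids, x):
    """Softmax over negative distances to the class centroids."""
    scores = [math.exp(-abs(x - centroids[c])) for c in (0, 1)]
    total = sum(scores)
    return [s / total for s in scores]


def accuracy(centroids):
    """Evaluation only; a real project would use a small labeled validation set."""
    hits = 0
    for x, y in zip(features, hidden_labels):
        p = predict_proba(centroids, x)
        hits += (0 if p[0] >= p[1] else 1) == y
    return hits / len(features)


# Step 1: label a tiny random seed set (about 1%), until both classes appear.
order = list(range(len(features)))
rng.shuffle(order)
labeled = {}
for i in order:
    labeled[i] = ask_human(i)
    if len(labeled) >= 4 and len(set(labeled.values())) == 2:
        break

for round_no in range(5):
    # Step 2: train on everything labeled so far.
    centroids = fit(labeled)
    unlabeled = [i for i in range(len(features)) if i not in labeled]

    # Step 3: predict on the unlabeled pool; confidence is the top-two margin.
    def margin(i):
        p = predict_proba(centroids, features[i])
        return abs(p[0] - p[1])

    # Step 4: send the 10 least confident examples to the annotator.
    for i in sorted(unlabeled, key=margin)[:10]:
        labeled[i] = ask_human(i)
    print(round_no, len(labeled), round(accuracy(centroids), 3))

final_model = fit(labeled)
```

A fixed number of rounds is used here only to keep the sketch short. The stopping rule from the text (stop when you are satisfied with the performance) would replace it in a real project.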
In my experience, active learning has always been an immensely time-saving approach to building supervised models on unlabeled datasets.
The only thing that you have to be careful about is generating confidence measures.
If you mess this up, it will affect every subsequent training step.
There’s one more thing I like to do when using active learning.
While combining the low-confidence data with the seed data, we can also use the high-confidence data. The labels would be the model’s predictions.
This variant of active learning is called cooperative learning.
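One way to sketch this variant is to split the unlabeled predictions by a confidence threshold. The function name `split_by_confidence`, the margin threshold of 0.5, and the example probabilities are all assumptions for illustration; a sensible threshold depends on your model and data.

```python
def split_by_confidence(prob_rows, threshold=0.5):
    """Split unlabeled predictions into pseudo-labeled and to-annotate groups."""
    pseudo_labeled = {}  # index -> model's predicted class, used as the label
    needs_human = []     # indices to send to the annotator
    for i, probs in enumerate(prob_rows):
        first, second = sorted(probs, reverse=True)[:2]
        if first - second >= threshold:
            # High confidence: trust the model's own prediction as the label.
            pseudo_labeled[i] = probs.index(first)
        else:
            # Low confidence: ask a human.
            needs_human.append(i)
    return pseudo_labeled, needs_human


predictions = [
    [0.95, 0.03, 0.02],
    [0.10, 0.85, 0.05],
    [0.45, 0.40, 0.15],
    [0.36, 0.34, 0.30],
]
pseudo, ask = split_by_confidence(predictions)
print(pseudo)
print(ask)
```

Both groups are then merged with the seed data before retraining. Since the pseudo-labels are only as good as the confidence measure, a poorly calibrated model can reinforce its own mistakes here.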
I will publish a demo of active learning in this newsletter in a few days.
Hope you learned something new today.
👉 Over to you: What are some other efficient ways of building supervised models with unlabeled datasets?
Thanks for reading!
We have a similar problem and are already trying to solve it the same way, by creating the labels with some rules and training the model :)