More often than not, I find myself in situations where a labeled dataset would be truly useful, but no labels are available.
One way to counter this is by relying on unsupervised techniques, but they come with inherent challenges, especially when it comes to evaluating the resulting models.
Another way, if feasible, is to rely on self-supervised learning.
For more context about self-supervised learning…
Self-supervised learning applies when we have an unlabeled dataset (say, text data) but can still train a supervised learning model on it.
This becomes possible when the nature of the task lets us generate labels automatically from the data itself.
Consider an LLM, for instance.
In a nutshell, its core objective is to predict the next word (more precisely, the next token) based on the preceding words, i.e., the given context plus whatever it has generated so far.
This is a classification task, and the classes are all the tokens in the vocabulary.
But the text data is raw, right? It has no labels. Then how do we train a classifier on it?
That’s where self-supervised techniques help.
Due to the inherent nature of the task (next-word prediction, to be specific), every piece of raw text data is already self-labeled.
The model is only supposed to learn the mapping from previous words to the next word.
This is called self-supervised learning.
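To make the "self-labeled" idea concrete, here is a minimal sketch that turns a raw sentence into (context, next word) training pairs. The function name `next_word_pairs` and the `context_size` parameter are my own choices for illustration, and real LLMs work on tokens rather than whitespace-separated words.

```python
def next_word_pairs(text, context_size=3):
    """Turn raw text into (context, next_word) training pairs."""
    words = text.split()  # naive whitespace split, a stand-in for a real tokenizer
    pairs = []
    for i in range(1, len(words)):
        # The label for position i is simply the word that actually appears there.
        context = tuple(words[max(0, i - context_size):i])
        pairs.append((context, words[i]))
    return pairs


pairs = next_word_pairs("the cat sat on the mat")
for context, target in pairs:
    print(context, "->", target)
```

Nobody annotated anything here. Every label came straight from the raw text.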
Coming back to the topic…
However, self-supervised learning still has pretty limited applicability, which depends on the problem we intend to solve.
Thus, another possibility is manually annotating a small fraction of the dataset and then resorting to semi-supervised learning techniques.
For more context about semi-supervised learning…
Semi-supervised learning builds a model using both labeled and unlabeled data. A common way to do this is self-training, which works iteratively.
Initially, the model is trained on the small labeled dataset.
Then, this partially trained model makes predictions on the unlabeled data. These predictions (typically only the confident ones) are treated as pseudo-labels.
The model is then retrained using both the labeled data and the pseudo-labeled data.
This process continues iteratively, with the model improving its performance as more accurate pseudo-labels are generated and incorporated into the training process.
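The loop described above can be sketched by hand in a few lines. This is only an illustration under my own assumptions: a synthetic dataset from `make_classification`, a `LogisticRegression` model, a confidence cutoff of 0.95, and at most 10 rounds. None of these choices come from a specific recipe.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for a real dataset: 500 samples, only 25 of them labeled.
X, y_true = make_classification(n_samples=500, n_features=10, random_state=0)
rng = np.random.default_rng(0)
labeled = np.zeros(len(X), dtype=bool)
labeled[rng.choice(len(X), size=25, replace=False)] = True

y_train = np.where(labeled, y_true, -1)  # -1 marks "no label yet"
n_labeled_start = int(labeled.sum())
threshold = 0.95  # assumed confidence cutoff for accepting a pseudo-label

model = LogisticRegression(max_iter=1000)
for _ in range(10):
    # Train on everything that currently has a (true or pseudo) label.
    model.fit(X[labeled], y_train[labeled])
    unlabeled_idx = np.flatnonzero(~labeled)
    if unlabeled_idx.size == 0:
        break
    # Predict on the unlabeled pool and keep only the confident predictions.
    proba = model.predict_proba(X[unlabeled_idx])
    confident = proba.max(axis=1) >= threshold
    if not confident.any():
        break
    chosen = unlabeled_idx[confident]
    y_train[chosen] = model.classes_[proba[confident].argmax(axis=1)]
    labeled[chosen] = True

# Final fit so the model also sees the last batch of pseudo-labels.
model.fit(X[labeled], y_train[labeled])
print(f"labeled samples: {n_labeled_start} -> {int(labeled.sum())}")
```

The threshold is the main knob. A lower value adds pseudo-labels faster but lets more wrong ones into the training set.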
If the task allows, I find Pigeon, an open-source tool, to be extremely useful for annotation purposes.
I love this tool because it makes labeling and retrieving labels super quick.
More specifically, one can annotate the dataset by simply defining buttons and labeling the samples one by one.
This also works for text data.
Simply put, as long as we can quickly assess the data samples and define a set of labels to assign to them, we can use Pigeon.
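Here is a rough sketch of what this looks like for text data. It assumes Pigeon is installed (the `pigeon-jupyter` package, as far as I know) and that `annotate` returns a list of (example, label) tuples, which is how I understand its README. The widget only renders inside a Jupyter notebook, so the call sits in a function, and the helper names `annotate_texts` and `to_training_labels` are hypothetical, made up for this example.

```python
def annotate_texts(texts, options):
    """Start a Pigeon labeling widget; call this inside a Jupyter notebook."""
    from pigeon import annotate  # assumed install: pip install pigeon-jupyter
    return annotate(texts, options=options)


def to_training_labels(texts, annotations, classes, unlabeled=-1):
    """Map (example, label) tuples to integer labels aligned with `texts`."""
    label_of = dict(annotations)
    return [
        classes.index(label_of[t]) if t in label_of else unlabeled
        for t in texts
    ]


texts = [
    "I love this movie",
    "I was really disappointed",
    "It was fine",
    "Best film this year",
]
classes = ["positive", "negative"]

# In a notebook: annotations = annotate_texts(texts, classes)
# Hand-written stand-in for what we would get after labeling two samples:
annotations = [
    ("I love this movie", "positive"),
    ("I was really disappointed", "negative"),
]

y = to_training_labels(texts, annotations, classes)
print(y)  # [0, 1, -1, -1]
```

Marking the untouched samples with -1 is deliberate, since that is the convention Sklearn's semi-supervised estimators use for unlabeled data.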
Once we get some labels, we can proceed with semi-supervised techniques.
For instance, Sklearn provides a self-training classifier (SelfTrainingClassifier in sklearn.semi_supervised) for such purposes, which can be used after annotating part of the given data.
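A minimal sketch of that workflow with `SelfTrainingClassifier` is below. The synthetic dataset, the `LogisticRegression` base model, the 50 annotated samples, and the 0.9 threshold are all assumptions for illustration. The key convention is that unlabeled samples carry the label -1.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

# Synthetic stand-in for a dataset where only 50 of 500 samples were annotated.
X, y_true = make_classification(n_samples=500, n_features=10, random_state=0)
rng = np.random.default_rng(0)
y = np.full(len(X), -1)  # -1 means "unlabeled" to SelfTrainingClassifier
annotated = rng.choice(len(X), size=50, replace=False)
y[annotated] = y_true[annotated]

# The base model must expose predict_proba; it is passed positionally here.
clf = SelfTrainingClassifier(LogisticRegression(max_iter=1000), threshold=0.9)
clf.fit(X, y)

n_used = int((clf.transduction_ != -1).sum())
print(f"samples with a label after self-training: {n_used} of {len(X)}")
print(f"accuracy against the held-back true labels: {clf.score(X, y_true):.3f}")
```

In a real project the true labels for the unlabeled pool would not exist, so a small annotated hold-out set is a more honest way to measure accuracy.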
Isn’t Pigeon cool for data annotation?
👉 Over to you: What other techniques are you aware of for building ML models on unlabeled datasets?