More often than not, I find myself in situations where a labeled dataset would be truly useful, but no labels exist.
One way to counter this is by relying on unsupervised techniques, but there are many inherent challenges, especially in evaluating these models.
Another way, if feasible, is to rely on self-supervised learning.
For more context about self-supervised learning…
Self-supervised learning applies when we have an unlabeled dataset (say, text data) but can still train a supervised model on it.
This is possible when the nature of the task lets us generate labels automatically from the data itself.
Consider an LLM, for instance.
In a nutshell, its core objective is to predict the next word (more precisely, the next token) based on the preceding context.
This is a classification task in which the candidate labels are all the entries in the vocabulary.
But the text data is raw, right? It has no labels. So how do we train this classifier?
That’s where self-supervised techniques help.
Due to the inherent nature of the task (next-word prediction, to be specific), every piece of raw text data is already self-labeled.
The model is only supposed to learn the mapping from previous words to the next word.
This is called self-supervised learning.
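To make the "self-labeled" idea concrete, here is a minimal sketch of how raw text yields (context, next word) training pairs with no manual labeling. The function name `next_word_pairs` and the `context_size` parameter are hypothetical, and splitting on whitespace is a simplification: real LLMs operate on tokens produced by a tokenizer, not on whole words.

```python
def next_word_pairs(text, context_size=3):
    """Turn raw text into (context, next_word) training pairs."""
    words = text.split()  # simplification: real models use a tokenizer
    pairs = []
    for i in range(1, len(words)):
        # The "features" are the words before position i ...
        context = tuple(words[max(0, i - context_size):i])
        # ... and the "label" is simply the word at position i.
        pairs.append((context, words[i]))
    return pairs


pairs = next_word_pairs("the cat sat on the mat")
for context, target in pairs:
    print(context, "->", target)
```

Every sentence in the corpus produces its own labels this way, which is why no annotation step is needed.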
Coming back to the topic…
Self-supervised learning, however, has fairly limited applicability: it only works when the problem we intend to solve lets us derive labels from the raw data.
Thus, another possibility is manually annotating a small fraction of the dataset and then resorting to semi-supervised learning techniques.
For more context about semi-supervised learning…
Semi-supervised learning builds a model using both labeled and unlabeled data. A common iterative variant, often called self-training or pseudo-labeling, works as follows.
Initially, the model is trained on the small labeled dataset.
Then, this partially trained model makes predictions on the unlabeled data. These predictions are treated as pseudo-labels.
The model is then retrained using both the labeled data and the pseudo-labeled data.
This process continues iteratively, with the model improving its performance as more precise pseudo-labels are generated and incorporated into the training process.
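The loop described above can be sketched by hand as follows. This is only an illustration under stated assumptions: the data is synthetic, the base model (`LogisticRegression`), the `THRESHOLD` value, and the number of rounds are arbitrary choices, and keeping only high-confidence predictions as pseudo-labels is an addition that the description above does not mention (it is a common safeguard against reinforcing wrong labels).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data; we pretend only the first 40 samples were annotated.
X, y_true = make_classification(
    n_samples=400, n_features=8, n_informative=4, random_state=1
)
labeled = np.zeros(len(X), dtype=bool)
labeled[:40] = True
y_work = np.where(labeled, y_true, -1)  # -1 marks "no label yet"

THRESHOLD = 0.95  # assumed confidence cut-off for accepting a pseudo-label

for round_ in range(5):
    # Train on everything that currently has a (true or pseudo) label.
    model = LogisticRegression(max_iter=1000).fit(X[labeled], y_work[labeled])
    if labeled.all():
        break

    # Predict on the still-unlabeled samples.
    proba = model.predict_proba(X[~labeled])
    confident = proba.max(axis=1) >= THRESHOLD
    if not confident.any():
        break  # nothing confident enough to add; stop early

    # Promote confident predictions to pseudo-labels.
    idx = np.flatnonzero(~labeled)[confident]
    y_work[idx] = model.classes_[proba[confident].argmax(axis=1)]
    labeled[idx] = True

print(f"{labeled.sum()} of {len(X)} samples are labeled after self-training")
```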
If the task allows, I find Pigeon, an open-source tool, to be extremely useful for annotation purposes.
I love this tool because it makes labeling and retrieving labels super quick.
More specifically, one can annotate the dataset by simply defining buttons and labeling samples, as demonstrated below:
This also works for text data, as depicted below:
Simply put, as long as we can quickly assess the data samples and define a set of labels for them, we can use Pigeon.
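As a rough sketch of what text annotation with Pigeon can look like, the snippet below assumes Pigeon's `annotate` function as described in the project's README: it takes a list of examples plus a list of `options` and returns a list of `(example, label)` tuples that fills in as buttons are clicked. It only works inside a Jupyter notebook with Pigeon installed. The helper names `start_text_annotation` and `to_arrays` are hypothetical and not part of Pigeon.

```python
def start_text_annotation(texts):
    """Show one button per label for each text (Jupyter notebook only)."""
    # Imported lazily because Pigeon needs a notebook environment.
    from pigeon import annotate

    return annotate(texts, options=["positive", "negative"])


def to_arrays(annotations):
    """Split (example, label) tuples into two parallel lists."""
    examples = [example for example, _ in annotations]
    labels = [label for _, label in annotations]
    return examples, labels


# In a notebook cell:
# annotations = start_text_annotation(
#     ["I love this movie", "I was really disappointed by the book"]
# )
# ...click through the buttons, then in a later cell:
# texts, labels = to_arrays(annotations)
```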
Once we get some labels, we can proceed with semi-supervised techniques.
For instance, Sklearn provides a semi-supervised classifier for such purposes, which can be used after annotating the given data:
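One such option is `SelfTrainingClassifier` from `sklearn.semi_supervised`; whether this is the exact classifier the original example used is an assumption. The sketch below uses synthetic data, and the base estimator and `threshold` value are illustrative choices. Scikit-learn expects unlabeled samples to be marked with `-1` in the target array.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

X, y_true = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=0
)

# Pretend only 50 samples were annotated; the rest get the label -1.
rng = np.random.default_rng(0)
labeled_idx = rng.choice(len(y_true), size=50, replace=False)
y = np.full_like(y_true, -1)
y[labeled_idx] = y_true[labeled_idx]

# Wrap any classifier that implements predict_proba.
model = SelfTrainingClassifier(LogisticRegression(max_iter=1000), threshold=0.9)
model.fit(X, y)

# labeled_iter_ is 0 for originally labeled samples, >0 for pseudo-labeled
# ones, and -1 for samples that never received a label.
n_pseudo = int((model.labeled_iter_ > 0).sum())
accuracy = model.score(X, y_true)
print(f"pseudo-labeled samples: {n_pseudo}, accuracy on all data: {accuracy:.3f}")
```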
Isn’t Pigeon cool for data annotation?
👉 Over to you: What other techniques are you aware of for building ML models on unlabeled datasets?
Thanks for reading!
The method shown in the section on semi-supervised learning seems clever but may be dangerous to use. Imagine that we have an incomplete dataset with 3 classes (A, B, C), in which a few examples carry known labels A and B, but none carry C. The iterative method will identify unlabeled items of classes A and B, but it will not work for class C items. Indeed, the model will only predict probabilities for classes A and B (which sum to 100% if we use a softmax layer for classification). Each C-class item will be assigned whichever of those two classes has the higher probability, so it will be labeled A or B instead of C. If we continue the same process, the C class will fuse with A and/or B.
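This failure mode is easy to reproduce. In the sketch below, the three Gaussian blobs and the choice of `LogisticRegression` are assumptions made purely for illustration. The model is trained on A and B only, and every C sample still receives a label of A or B, with probabilities that sum to 1.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
A = rng.normal(loc=[-3, 0], scale=0.5, size=(50, 2))
B = rng.normal(loc=[3, 0], scale=0.5, size=(50, 2))
C = rng.normal(loc=[0, 5], scale=0.5, size=(50, 2))  # never labeled

# Only classes A and B appear in the labeled set.
X_labeled = np.vstack([A, B])
y_labeled = np.array(["A"] * 50 + ["B"] * 50)
model = LogisticRegression().fit(X_labeled, y_labeled)

# The model has no notion of class C: its output only covers A and B.
proba_C = model.predict_proba(C)
pred_C = model.predict(C)

print("classes known to the model:", model.classes_)
print("labels assigned to C samples:", sorted(set(pred_C)))
```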
Data clustering could be used to estimate the approximate number of classes. Say we have 4 classes: A, B, and 2 unknown classes Z1 and Z2. We can create a new dataset where all known A elements are in group A, likewise for B elements, and where we put a similar number of unlabeled samples in groups Z1 and Z2. Next, we train a 4-class model on this dataset.
We expect the classification results for A and B classes to be better than those for Z1 and Z2 classes. Let us take a look at the misclassifications of Z1 elements. Those allocated (with high probability) to classes A and B could be relabeled accordingly. We do the same to elements in class Z2. The training process restarts and the relabeling continues until desired.
We can add new unknown Z1 and Z2 elements to the dataset and redo the training-relabeling process.
At some point, if necessary, we could take a look at the data in the Z1 and Z2 classes and figure out what they contain. If they are easy to identify (e.g., images of horses and whales), we can use active learning and ask an expert to resolve the very difficult cases where the probabilities of 2 or more classes are similar. This will result in a refinement of the boundaries between classes.
This is not foolproof, obviously, but it is worth a try.