Back in 2019, I was working with an ML research group in Germany.
One day, a Ph.D. student came up to me (and others in the lab), handed over a small sample of the dataset he was working with, and asked us to label it, even though he already had the true labels.
This made me curious about why gathering human labels was necessary for him when he already had ground truth labels available.
So I asked.
What I learned that day changed my approach to incremental model improvement, and I am sure you will find this idea fascinating too.
Let me explain what I learned.
Suppose we are building a multiclass classification model. Say it’s a model that classifies an input image as a rock, paper, or scissors.
For simplicity, let’s assume there’s no class imbalance.
Calculating the class-wise validation accuracies gives us the following results:
Paper class: 97%
Rock class: 82%
Scissor class: 75%
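For reference, here is a minimal sketch of how such class-wise accuracies can be computed. It is not the setup from the story: the function name `per_class_accuracy` and the tiny label lists are made up for illustration. For each class, it takes the samples whose true label is that class and measures the fraction predicted correctly (equivalently, the diagonal of the confusion matrix divided by its row totals).

```python
from collections import Counter


def per_class_accuracy(y_true, y_pred):
    """Fraction of samples of each true class that were predicted correctly."""
    totals = Counter(y_true)  # number of samples per true class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: correct[label] / totals[label] for label in totals}


# Tiny made-up validation set, for illustration only.
y_true = ["rock", "rock", "paper", "paper",
          "scissor", "scissor", "scissor", "scissor"]
y_pred = ["rock", "paper", "paper", "paper",
          "scissor", "scissor", "scissor", "rock"]

class_acc = per_class_accuracy(y_true, y_pred)
print(class_acc)  # {'rock': 0.5, 'paper': 1.0, 'scissor': 0.75}
```

The same function works on human-provided labels: pass them in place of `y_pred` to get a per-class human accuracy.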
Question: Which class would you intuitively inspect further and try to improve the model on?
After looking at these results, most people believe that “Scissor” is the worst-performing class and should be inspected further.
But this might not be true.
And this is precisely what that Ph.D. student wanted to verify by collecting human labels.
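Let’s say that the human labels give us the following results: an average human reaches about 95% accuracy on “Rock” and about 77% on “Scissor”, that is, 13 and 2 percentage points above the model, respectively.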
Based on this, do you still think the model performs the worst on the “Scissor” class?
No, right?
Of course, the model has the lowest accuracy on the “Scissor” class, and I am not denying that.
However, with more context, we notice that the model is doing a pretty good job on the “Scissor” class: an average human achieves just 2 percentage points higher accuracy than our model.
Instead, the results reveal that it is the “Rock” class that demands more attention. The accuracy gap between an average human and the model is much larger there (13 percentage points).
Had we not known this, we would have continued to improve the “Scissor” class, when in reality, “Rock” requires more improvement.
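The comparison above can be sketched in a few lines. This is only an illustration under stated assumptions: the model accuracies are the ones listed earlier, the human accuracies for “Rock” and “Scissor” are derived from the 13- and 2-point gaps mentioned above, the human accuracy for “Paper” is a hypothetical placeholder (the text does not state it), and `rank_by_human_gap` is a name I made up.

```python
# Accuracies in whole percentage points.
model_acc = {"paper": 97, "rock": 82, "scissor": 75}

# "rock" and "scissor" follow from the 13- and 2-point gaps in the text.
# "paper" is a HYPOTHETICAL placeholder; the text does not give this number.
human_acc = {"paper": 98, "rock": 95, "scissor": 77}


def rank_by_human_gap(model_acc, human_acc):
    """Sort classes by (human accuracy - model accuracy), largest gap first."""
    gaps = {label: human_acc[label] - model_acc[label] for label in model_acc}
    return sorted(gaps.items(), key=lambda item: item[1], reverse=True)


ranking = rank_by_human_gap(model_acc, human_acc)
for label, gap in ranking:
    print(f"{label}: model is {gap} points below the human baseline")
```

Ranking by the gap rather than by raw accuracy puts “Rock” first, even though “Scissor” has the lowest accuracy. A negative gap would mean the model already beats the human baseline on that class.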
Ever since I learned this technique, I have found it super helpful for determining my next steps for model improvement, if possible.
I say “if possible” because I understand that many datasets are hard for humans to interpret and label.
Nonetheless, if it is feasible to set up such a “human baseline,” one can gain a lot of clarity about how the model is performing.
As a result, one can effectively redirect their engineering efforts in the right direction.
Of course, I am not claiming that this will be useful in all use cases.
For instance, if the model is already performing better than the human baseline, further model improvements will have to be guided by past results.
Yet, in such cases, surpassing a human baseline at least helps us validate that the model is doing better than what a human can do.
Isn’t that a great technique?
👉 Over to you: What other ways do you use to direct model improvement efforts?
Thanks for reading!
Great article. Before taking up any classification problem, my first question is always how the existing system (be it human or rule-based) is performing. I take that as the baseline and tune from there.
To calculate the class-wise validation accuracy for each of the rock, paper, and scissor classes, did they use a confusion matrix to come up with the percentages?