Clean ML Datasets With Cleanlab

...in just 4 lines of Python code.

Oct 24, 2024

For the longest time, no model could get past 91% accuracy on ImageNet (92.4% is quite recent).

Why?

It was found that ImageNet had over 100k mislabeled images.

Real-world datasets are messy.

They often come with noisy labels, missing values, and outliers that can severely degrade your model’s performance.

No ML algorithm, however sophisticated, can compensate for poor-quality or mislabeled data.

Researchers from MIT developed Cleanlab, an open-source library that helps you clean your data in just a few lines of code.

As shown in the image above, Cleanlab can flag errors in any type of data (text, image, tabular, audio), like:

out-of-distribution samples
outliers
label issues
duplicates, etc.

It takes just four lines of code:

Import the package.
Pass the dataset and specify the label column.
Find issues by passing the embedding matrix and the probabilities predicted by the model.
Finally, generate the report!

Done!

It will generate a report like the one shown above.

This way, you can easily clean your datasets for training accurate ML models.

Isn’t that impressive?

Several notebook demos are available here if you want to learn more: Cleanlab demo.

Cleanlab GitHub repo: GitHub repository

P.S. For those wanting to develop “Industry ML” expertise:

At the end of the day, all businesses care about impact. That’s it!

Can you reduce costs?
Drive revenue?
Can you scale ML models?
Predict trends before they happen?

We have covered several topics (with implementations) in the past that align with these goals.

Here are some of them:

Learn techniques to run large models on small devices: Quantization: Optimize ML Models to Run Them on Tiny Hardware
Learn how to generate prediction intervals or sets with strong statistical guarantees for increasing trust: Conformal Predictions: Build Confidence in Your ML Model’s Predictions
Learn how to identify causal relationships and answer business questions: A Crash Course on Causality – Part 1
Learn how to scale ML model training: A Practical Guide to Scaling ML Model Training
Learn techniques to reliably roll out new models in production: 5 Must-Know Ways to Test ML Models in Production (Implementation Included)
Learn how to build privacy-first ML systems: Federated Learning: A Critical Step Towards Privacy-Preserving Machine Learning
Learn how to compress ML models and reduce costs: Model Compression: A Critical Step Towards Efficient Machine Learning

All these resources will help you cultivate the key skills that businesses care about the most.

Daily Dose of Data Science