A Beginner-friendly Guide to Multi-GPU Training
Learn how to scale models using distributed training.
If you look at job descriptions for Applied ML or ML engineer roles on LinkedIn, most of them demand skills like the ability to train models on large datasets.
Of course, this is not something new or emerging.
But the reason they explicitly mention “large datasets” is quite simple to understand.
Businesses have more data than ever before.
Traditional single-node model training just doesn’t scale to this volume of data, because one cannot wait months for a model to train.
Distributed (or multi-GPU) training is one of the most essential ways to address this.
In this week’s ML deep dive, I decided to cover this: A Beginner-friendly Guide to Multi-GPU Model Training.
We cover the core technicalities behind multi-GPU training, how it works under the hood, and the implementation details.
We also look at the key considerations for multi-GPU (or distributed) training, which, if not addressed appropriately, may lead to suboptimal performance or slow training.
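To make this concrete before you dive in, here is a minimal sketch of what data-parallel multi-GPU training can look like in practice. I’m assuming PyTorch’s DistributedDataParallel (DDP) here; the toy model, dataset, and hyperparameters are placeholders, and the deep dive walks through the real details:

```python
# Minimal data-parallel training sketch using PyTorch DDP.
# Launch with: torchrun --nproc_per_node=<num_gpus> train.py
# The model, dataset, and hyperparameters are placeholders.

import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each process
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Toy dataset and model; replace with your own
    dataset = TensorDataset(torch.randn(1024, 32), torch.randn(1024, 1))
    # DistributedSampler gives each process a distinct shard of the data
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=64, sampler=sampler)

    model = nn.Linear(32, 1).cuda(local_rank)
    # DDP wraps the model and all-reduces gradients across GPUs on backward()
    model = DDP(model, device_ids=[local_rank])

    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.MSELoss()

    for epoch in range(3):
        sampler.set_epoch(epoch)  # reshuffle the shards each epoch
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()   # gradients are synchronized across GPUs here
            optimizer.step()  # every replica applies the same update

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Each GPU runs one copy of this script as its own process: the sampler hands every process a different shard of the data, and DDP averages gradients across all GPUs on every backward pass, so the replicas stay in sync while the data is processed in parallel.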
Why care about it?
Every business loves to save/make money. Every!
Honestly speaking, when I started my first DS/ML role, I was under the impression that my manager would tell me what I was supposed to do, and that my job was simply to complete it.
I didn’t realize that all an employee is expected to do is help their employer either make or save more money. That’s it!
Almost every project one takes up is centered around this objective.
When you want to propose a new project, think about how it can help them make or save more money, and show them the ROI.
Your idea could:
Optimize resource consumption, which saves them money.
Model compression techniques align in this direction: Model Compression: A Critical Step Towards Efficient Machine Learning.
Result in lower training times, which saves them money on compute. Some ideas include:
Multi-GPU Model Training (which we are discussing today)
Improve customer experience, which increases customer lifetime value and makes them more money.
I hope you get the point.
Always remember: business metrics are not “Model Accuracy” or “F1 score” per se. Instead, they are “ROI,” “Cost-saving,” “Customer LTV increase,” “Latency improvement,” etc.
That’s why employers list these skills in job descriptions: they believe that someone with distributed training skills will help them save/make more money.
Develop those skills.
Here’s a deep dive into multi-GPU training: A Beginner-friendly Guide to Multi-GPU Model Training.
Along similar lines, PySpark is another must-have skill. I wrote this beginner-friendly guide for you to get started: Don’t Stop at Pandas and Sklearn! Get Started with Spark DataFrames and Big Data ML using PySpark.
Thanks for reading.
Have a good day!
Avi