A Misconception About Pandas Apply

Avoid using apply() method at all times.

Avi Chawla

Oct 09, 2024

AI isn’t magic. It’s math.

Understand the concepts powering technology like ChatGPT in minutes a day with Brilliant.

Thousands of quick, hands-on lessons in AI, programming, logic, and more make it easy. Level up fast with:

Bite-sized, interactive lessons that make complex AI concepts accessible and engaging.
Personalized learning paths and competitive features to keep you motivated and on track.
Building skills to tackle real-world problems—not just memorizing information.

Join over 10M people and try it free for 30 days. Plus, Daily Dose of Data Science readers can snag a special discount on an annual premium subscription with this link.

Join Today!

Thanks to Brilliant for sponsoring today’s issue.

What’s wrong with apply()?

In my experience, most Pandas users believe that apply() is a vectorized method.

But this is NOT true.

Contrary to this common belief, every Pandas user MUST know that Pandas’ apply() method is NOT vectorized.

Instead, it’s just a glorified Python for-loop, which never offers any inherent vectorization-based optimization that one might expect.

As a result, the code always runs at native Python speed, i.e., slow.

What are the solutions?

One solution is to eliminate the apply() method by using a vectorized approach instead.

But I understand that, at times, coming up with a vectorized approach is difficult.

Another solution that I find handy is to parallelize the apply() method by using third-party optimized libraries instead.

The image below compares the run-time of Pandas apply() with four alternatives that support parallelization:

It is evident that Pandas’ apply() is not the optimal way to apply a method. In fact, it’s the slowest of all five.

There are a couple of reasons for this:

Pandas ALWAYS run on a single core of a CPU. Therefore, it does not possess any parallelization capabilities that it could possibly leverage.
Pandas’ apply() method is not vectorized. Therefore, it does not possess any vectorization capabilities either.

Honestly speaking, while the four external libraries shown in the visual above do not possess any vectorization capabilities either, they do leverage parallelization.

That is how we get to see a run-time improvement when we use them.

Here, please note that even though mapply() is the fastest here, it does not mean it will always be the fastest. Consider benchmarking on your own dataset first.

Moreover, I know that the add_row() method I demonstrated in the image above can be easily vectorized. I picked this particular example just for the sake of simplicity.

As a departing note, remember that your first possible attempt must ALWAYS be to write vectorized operations.

Consider these third-party libraries only when you see no scope to write vectorized code, and you see no other option but to use apply().

Get started with these libraries here:

Swifter: https://github.com/jmcarpenter2/swifter
Pandarallel: https://github.com/nalepae/pandarallel
Parallel Pandas: https://pypi.org/project/parallel-pandas/
Mapply: https://pypi.org/project/mapply/

👉 Over to you: What other techniques do you commonly use to optimize Pandas’ operations?

For those wanting to develop “Industry ML” expertise:

P.S. At the end of the day, all businesses care about impact. That’s it!

Can you reduce costs?
Drive revenue?
Can you scale ML models?
Predict trends before they happen?

We have discussed several other topics (with implementations) in the past that align with such topics.

Here are some of them:

Learn sophisticated graph architectures and how to train them on graph data: A Crash Course on Graph Neural Networks – Part 1
Learn techniques to run large models on small devices: Quantization: Optimize ML Models to Run Them on Tiny Hardware
Learn how to generate prediction intervals or sets with strong statistical guarantees for increasing trust: Conformal Predictions: Build Confidence in Your ML Model’s Predictions.
Learn how to identify causal relationships and answer business questions: A Crash Course on Causality – Part 1
Learn how to scale ML model training: A Practical Guide to Scaling ML Model Training.
Learn techniques to reliably roll out new models in production: 5 Must-Know Ways to Test ML Models in Production (Implementation Included)
Learn how to build privacy-first ML systems: Federated Learning: A Critical Step Towards Privacy-Preserving Machine Learning.
Learn how to compress ML models and reduce costs: Model Compression: A Critical Step Towards Efficient Machine Learning.

All these resources will help you cultivate key skills that businesses and companies care about the most.

SPONSOR US

Get your product in front of 100,000 data scientists and other tech professionals.

Our newsletter puts your products and services directly in front of an audience that matters — thousands of leaders, senior data scientists, machine learning engineers, data analysts, etc., who have influence over significant tech decisions and big purchases.

To ensure your product reaches this influential audience, reserve your space here or reply to this email to ensure your product reaches this influential audience.

Daily Dose of Data Science