The Most Common Misconception Pandas Users Have About the apply() Method
Avoid the apply() method whenever a vectorized alternative exists.
The apply() method in Pandas is the most common approach to apply a function along an axis of a DataFrame/Series.
In my experience, most Pandas users who use apply() believe that it is a vectorized method.
In other words, they believe that apply() operates efficiently and performs element-wise operations like other vectorized operations in Pandas.
But this is NOT true.
Contrary to this common belief, every Pandas user MUST know that Pandas’ apply() method is NOT vectorized.
Instead, it’s just a glorified Python for-loop, which never offers any inherent vectorization-based optimization that one might expect.
As a result, the code always runs at native Python speed, i.e., slow.
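To make the "glorified for-loop" point concrete, here is a minimal sketch that counts how often Pandas calls a plain Python function during apply(). The function `square` and the counter `calls` are hypothetical names chosen for illustration; they are not from the original post.

```python
import pandas as pd

# Hypothetical counter used only to observe how often pandas calls our function.
calls = {"n": 0}

def square(x):
    """Plain Python function; bumps the counter on every invocation."""
    calls["n"] += 1
    return x * x

s = pd.Series(range(1_000))

looped = s.apply(square)  # square() is invoked element by element, in Python
vectorized = s * s        # one expression, evaluated in compiled code

print(f"square() was called {calls['n']} times for {len(s)} elements")
print("Same result:", (looped == vectorized).all())
```

Both approaches produce the same values, but only the first one pays the cost of a Python-level function call per element.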
What are the solutions?
One solution is to eliminate the apply() method by using a vectorized approach instead.
But I understand that, at times, coming up with a vectorized approach is difficult.
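As a small illustration of what "a vectorized approach" can look like when the row-wise function contains a condition, the sketch below swaps an apply() call for NumPy's `np.where`. The `score` column and the `label` function are made up for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one numeric column to classify.
df = pd.DataFrame({"score": [35, 72, 50, 91, 12]})

def label(row):
    """Row-wise version: runs once per row in Python."""
    return "pass" if row["score"] >= 50 else "fail"

# Row-by-row, via apply().
applied = df.apply(label, axis=1)

# Vectorized: the comparison and the selection both operate on whole arrays.
vectorized = np.where(df["score"] >= 50, "pass", "fail")

print(list(applied) == list(vectorized))
```

Many if/else style functions can be rewritten this way; conditions with more than two branches may fit `np.select` instead.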
Another solution that I find handy is to parallelize the apply() method by using third-party optimized libraries instead.
The image below compares the run-time of Pandas apply() with four alternatives that support parallelization:
It is evident that Pandas’ apply() is not the optimal way to apply a function. In fact, it’s the slowest of all five.
There are a couple of reasons for this:
Pandas ALWAYS runs on a single core of a CPU. Therefore, it does not possess any parallelization capabilities that it could possibly leverage.
Pandas’ apply() method is not vectorized. Therefore, it does not possess any vectorization capabilities either.
Honestly speaking, while the four external libraries shown in the visual above do not possess any vectorization capabilities either, they do leverage parallelization.
That is how we get to see a massive run-time improvement when we use them.
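The general idea of parallelizing apply() can be sketched with the standard library alone: split the DataFrame into chunks, run a regular apply() on each chunk in a separate process, and stitch the results back together. This is only an illustration of the idea under my own assumptions, not how any of the libraries above are actually implemented. The helper names (`apply_chunk`, `chunked_apply`, `parallel_apply`) and the `add_row` function are hypothetical.

```python
from concurrent.futures import ProcessPoolExecutor

import numpy as np
import pandas as pd

def add_row(row):
    """Hypothetical row-wise function: adds two columns of one row."""
    return row["a"] + row["b"]

def apply_chunk(chunk):
    """Ordinary (non-vectorized) apply() on one slice of the DataFrame."""
    return chunk.apply(add_row, axis=1)

def chunked_apply(df, n_chunks=4, mapper=map):
    """Split df into row blocks, apply to each block, concatenate in order.

    `mapper` defaults to the built-in map (serial); pass a pool's .map
    to spread the chunks across worker processes.
    """
    bounds = np.linspace(0, len(df), n_chunks + 1, dtype=int)
    chunks = [df.iloc[start:stop] for start, stop in zip(bounds[:-1], bounds[1:])]
    return pd.concat(list(mapper(apply_chunk, chunks)))

def parallel_apply(df, n_workers=4):
    """Same computation, but each chunk is handled by a separate process."""
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        return chunked_apply(df, n_chunks=n_workers, mapper=pool.map)

df = pd.DataFrame({"a": range(1_000), "b": range(1_000, 2_000)})

# Serial run of the chunked logic; parallel_apply(df) gives the same values.
serial = chunked_apply(df)
print(serial.head())
```

When trying `parallel_apply(df)` in a script, call it under an `if __name__ == "__main__":` guard, which process pools need on platforms that spawn fresh worker processes. Each worker still runs a plain Python loop, so the gain comes purely from using more cores.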
Here, please note that even though mapply() is the fastest here, it does not mean it will always be the fastest. Consider benchmarking on your own dataset first.
Moreover, I know that the add_row() method I demonstrated in the image above can be easily vectorized. I picked this particular example just for the sake of simplicity.
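For reference, here is roughly what vectorizing a function like add_row() could look like, with a quick timing of both versions. The exact add_row() from the image is not reproduced here, so the definition below (adding two columns named `a` and `b`) is an assumption, as are the data and the row count.

```python
import time

import numpy as np
import pandas as pd

# Hypothetical data: two integer columns.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "a": rng.integers(0, 100, 20_000),
    "b": rng.integers(0, 100, 20_000),
})

def add_row(row):
    """Assumed shape of add_row(): sum two columns of a single row."""
    return row["a"] + row["b"]

# Row-by-row with apply().
start = time.perf_counter()
with_apply = df.apply(add_row, axis=1)
apply_seconds = time.perf_counter() - start

# Vectorized: add the two columns directly.
start = time.perf_counter()
vectorized = df["a"] + df["b"]
vectorized_seconds = time.perf_counter() - start

print(f"apply():    {apply_seconds:.4f} s")
print(f"vectorized: {vectorized_seconds:.4f} s")
print("Same result:", (with_apply == vectorized).all())
```

The absolute timings depend on your machine and Pandas version, so treat them as indicative only.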
As a parting note, remember that your first attempt should ALWAYS be to write vectorized operations.
Consider these third-party libraries only when you see no scope to write vectorized code, and you see no other option but to use apply().
Get started with these libraries here:
Pandarallel: https://github.com/nalepae/pandarallel
Parallel Pandas: https://pypi.org/project/parallel-pandas/
Mapply: https://pypi.org/project/mapply/
👉 Over to you: What other techniques do you commonly use to optimize Pandas’ operations?
Thanks for reading!