Most Pandas users have a misconception about inplace operations.
They use them heavily, expecting:
Smaller run-time
Lower memory usage
And, of course, the reasoning makes intuitive sense as well.
Inplace, as the name suggests, should modify the DataFrame without creating a new copy. Thus, it is reasonable to expect that inplace will be more efficient.
Yet, this is rarely the case, which is also evident from the image below:
It is clear that in most cases, inplace operations are slower than their non-inplace counterparts.
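Since the numbers depend on the pandas version, the data size, and the operation, a comparison like this is easy to rerun locally. Here is a minimal timing sketch; the DataFrame size, column names, and repeat count are arbitrary assumptions chosen for illustration, not the setup behind the results above:

```python
import timeit

import numpy as np
import pandas as pd

# Illustrative data: 100,000 rows of random floats (sizes are assumptions).
rng = np.random.default_rng(0)
base = pd.DataFrame({"a": rng.random(100_000), "b": rng.random(100_000)})


def sort_without_inplace():
    df = base.copy()          # fresh copy so every run starts from unsorted data
    df = df.sort_values("a")  # returns a new DataFrame
    return df


def sort_with_inplace():
    df = base.copy()
    df.sort_values("a", inplace=True)  # modifies df
    return df


t_copy = timeit.timeit(sort_without_inplace, number=20)
t_inplace = timeit.timeit(sort_with_inplace, number=20)
print(f"inplace=False: {t_copy:.3f}s, inplace=True: {t_inplace:.3f}s")
```

Both functions pay for the same `base.copy()`, so any difference comes from the sort call itself. Try it with other methods (`dropna`, `rename`, `reset_index`) before drawing conclusions for your own workload.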
Why does this happen?
Contrary to common belief, Pandas’ inplace operations generally do not prevent the creation of a new copy.
In most cases, the operation still builds a new result and then assigns it back to the same object.
But during this assignment step, Pandas has to perform some additional checks (SettingWithCopy, for instance) to ensure that the DataFrame is being modified correctly.
This, at times, can be an expensive operation.
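To make the "build a copy, then assign it back" idea concrete, here is a toy sketch. `ToySeries` is a hypothetical class written only for this illustration; it is not pandas' actual implementation and it skips the extra checks mentioned above:

```python
class ToySeries:
    """Hypothetical stand-in for a pandas object (not pandas' real code)."""

    def __init__(self, values):
        self._values = list(values)

    def sort_values(self, inplace=False):
        # The sorted data is always built as a brand-new list, i.e. a copy.
        new_values = sorted(self._values)
        if inplace:
            # Only the last step differs: swap the internal data, return None.
            self._values = new_values
            return None
        return ToySeries(new_values)


s = ToySeries([3, 1, 2])
original_buffer = s._values
s.sort_values(inplace=True)

print(s._values)                     # [1, 2, 3]
print(s._values is original_buffer)  # False: a new list replaced the old one
```

In this simplified model the same amount of new data is allocated either way; `inplace=True` only changes where the result ends up.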
So, in general, there is no guarantee that an inplace operation is faster, which is also validated by the results above.
What’s more, one thing I particularly dislike about inplace operations is that they inhibit method chaining as depicted below:
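A minimal sketch of the difference, assuming a small made-up DataFrame (the column names and values are only for illustration):

```python
import pandas as pd

df = pd.DataFrame({"a": [3, 1, None, 2], "b": ["x", "y", "z", "w"]})

# Without inplace, each method returns a new DataFrame, so calls chain naturally.
chained = (
    df.dropna()
      .sort_values("a")
      .reset_index(drop=True)
)

# With inplace=True, sort_values returns None, so there is nothing to chain on.
tmp = df.copy()
returned = tmp.sort_values("a", inplace=True)
print(returned)  # None

# Trying to continue the chain fails because the next call lands on None.
chain_error = None
tmp2 = df.copy()
try:
    tmp2.sort_values("a", inplace=True).reset_index(drop=True)
except AttributeError as exc:
    chain_error = exc
print(type(chain_error).__name__)  # AttributeError
```

With inplace, every step has to be its own statement, which tends to make multi-step transformations longer and harder to read.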
As a result, I avoid inplace operations in Pandas.
👉Over to you: Despite this, are there still any situations where you prefer using inplace operations in Pandas?
And tracemalloc is an excellent tool to show that inplace=True does NOT even save memory, although saving memory should be the very reason it exists. tracemalloc reports both the current increase in memory usage and the peak transient usage. I created a test Series:
import tracemalloc as tm  # needed for the tm.start() / tm.stop() calls below
import pandas as pd
data = [10, 8, 10, 20, 10, 8, 9, 11, 8, 6, 11, 6]
idx = ['b', 'a', 'b', 'c', 'b', 'a', 'd', 'e', 'a', 'f', 'e', 'f']
sr = pd.Series(data, index=idx)
Then I sorted it with inplace set to False and True. First measurement:
tm.start()
sr_1 = sr.sort_values(inplace=False)
del sr # just to be fair, though it played no role
print(tm.get_traced_memory()) # (6352, 10093)
tm.stop()
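Second measurement (with sr recreated first, since the first run deleted it):
sr = pd.Series(data, index=idx)  # recreate the Series before tracing starts
tm.start()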
sr.sort_values(inplace=True)
print(tm.get_traced_memory()) # (5648, 10093)
tm.stop()
I find tracemalloc a very good instrument because it shows the real increase in Python's memory consumption. When we run memory_usage(deep=True) on the above Series, it shows only 792 bytes. That is only the tip of the iceberg.
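The two measurements above can also be wrapped into one repeatable script. This is only a sketch: the helper name `traced` and the two wrapper functions are hypothetical additions for illustration, the Series is created inside the traced region in both cases, and exact byte counts will vary with the pandas and Python versions:

```python
import tracemalloc

import pandas as pd

data = [10, 8, 10, 20, 10, 8, 9, 11, 8, 6, 11, 6]
idx = ['b', 'a', 'b', 'c', 'b', 'a', 'd', 'e', 'a', 'f', 'e', 'f']


def traced(operation):
    """Run operation() under tracemalloc and return (current, peak) in bytes."""
    tracemalloc.start()
    try:
        result = operation()  # keep a reference so the result stays alive
        current, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return current, peak


def sort_not_inplace():
    sr = pd.Series(data, index=idx)
    return sr.sort_values(inplace=False)  # the original sr is released on return


def sort_inplace():
    sr = pd.Series(data, index=idx)
    sr.sort_values(inplace=True)
    return sr


# Warm-up calls so one-time internal allocations do not skew the first result.
sort_not_inplace()
sort_inplace()

print("inplace=False:", traced(sort_not_inplace))
print("inplace=True: ", traced(sort_inplace))
```

Comparing the peak values is the interesting part: if inplace really avoided the copy, its peak should be clearly lower.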
Thank you for the writeup, Avi Chawla. I didn't know inplace operations were this bad in terms of performance and usability. I guess this feature should never have been introduced to pandas: there is no actual functional benefit to inplace operations, and developers might use them without knowing the performance implications.