Avoid This Costly Mistake When Indexing A DataFrame
Row-then-column is not the same as Column-then-row.
When indexing a DataFrame, whether you select a column first or slice a row first has a significant effect on run time.
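The difference can be measured with a small sketch like the one below. The DataFrame, its size, and the column names (`num_0`, `label`, and so on) are made up for illustration, and the exact ratio you see will depend on your data, dtypes, and Pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical data: 100,000 rows, ten float columns and one string column.
rng = np.random.default_rng(0)
data = {f"num_{i}": rng.random(100_000) for i in range(10)}
data["label"] = ["a"] * 100_000
df = pd.DataFrame(data)


def column_then_row(frame):
    # Select the column (already a Series), then pick one element from it.
    return frame["num_5"].iloc[0]


def row_then_column(frame):
    # Build a Series for the whole row first, then pick one element from it.
    return frame.iloc[0]["num_5"]


# Time both access orders over the same number of repetitions.
t_col = timeit.timeit(lambda: column_then_row(df), number=1_000)
t_row = timeit.timeit(lambda: row_then_column(df), number=1_000)

print(f"column first: {t_col:.4f} s")
print(f"row first:    {t_row:.4f} s")
```

Both functions return the same value; only the order of the two indexing steps differs.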
As shown above, selecting the column first is over 15 times faster than slicing the row first. Why?
As I have discussed before, a Pandas DataFrame is a column-major data structure. Thus, consecutive elements in a column are stored next to each other in memory.
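To make "column-major" concrete, here is a minimal sketch that uses a plain NumPy array in Fortran (column-major) order as a stand-in. This is an assumption for illustration only: how Pandas actually lays out a given DataFrame internally depends on its dtypes, how it was built, and the Pandas version.

```python
import numpy as np

# Hypothetical 4x3 float64 array stored in column-major (Fortran) order.
a = np.asfortranarray(np.arange(12, dtype=np.float64).reshape(4, 3))

col = a[:, 0]  # one column
row = a[0, :]  # one row

# Strides give the number of bytes between consecutive elements.
print("column strides:", col.strides)
print("row strides:   ", row.strides)

# A column is one contiguous block of memory; a row is not.
print("column contiguous:", col.flags["C_CONTIGUOUS"])
print("row contiguous:   ", row.flags["C_CONTIGUOUS"])
```

In this layout, stepping along a column moves one element at a time through memory, while stepping along a row jumps over an entire column's worth of bytes at each step.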
As processors are efficient with contiguous blocks of memory, accessing a column is much faster than accessing a row (read more about this in one of my previous posts here).
But when you slice a row first, each row is retrieved by accessing non-contiguous blocks of memory, thereby making it slow.
Also, once all the elements of a row are gathered, Pandas converts them to a Series, which is another overhead.
We can verify this conversion below:
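A minimal way to check this, assuming a small made-up DataFrame with columns `A` and `B`:

```python
import pandas as pd

# Hypothetical DataFrame with two columns of different dtypes.
df = pd.DataFrame({"A": [1, 2, 3], "B": ["x", "y", "z"]})

row = df.iloc[0]  # row first: Pandas builds a new Series from the row's values
col = df["A"]     # column first: the column is returned as a Series

print(type(row))
print(type(col))

# The row Series is indexed by the DataFrame's column names.
print(row.index.tolist())
```

Both results are Series objects, but the row Series has to be assembled from values that belong to different columns.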
Instead, when you select a column first, elements are retrieved by accessing contiguous blocks of memory, which is way faster. Also, a column is inherently a Pandas Series. Thus, there is no conversion overhead involved like above.
Overall, by accessing the column first, we avoid the non-contiguous memory access that happens when we access the row first.
This makes selecting the column first faster than slicing a row first in indexing operations.
Find the code for my tips here: GitHub.
I like to explore, experiment and write about data science concepts and tools. You can read my articles on Medium. Also, you can connect with me on LinkedIn and Twitter.
Let me see if I get this straight. df = pd.DataFrame(some_data)
df.iloc[0] should pull the first row, but it uses a lot of memory and is slow (let's assume the index has a name like "apples" for this row).
But if you transpose the df DataFrame first, you could call df['apples'] and get the same data returned in a Series, and it would be faster and more memory efficient?
Is this correct?