If You Are Not Able To Code A Vectorized Approach, Try This.
A simple tweak to improve iteration run-time over a DataFrame.
Ideally, we should avoid iterating over a dataframe and write vectorized code instead. But what if we cannot come up with a vectorized solution?
In yesterday's post on why iterating over a dataframe is costly, someone asked a fair question: “Let’s just say you are forced to iterate. What will be the best way to do so?”
First, understand that iteration is slow primarily because of the way a dataframe is stored in memory. (If you wish to recap this, read yesterday’s post here.)
A dataframe is a column-major data structure, so retrieving one of its rows requires accessing non-contiguous blocks of memory. This increases the run-time drastically.
Yet, if you wish to perform only row-based operations, a quick fix is to convert the dataframe to a NumPy array.
NumPy is faster here because, by default, it stores data in a row-major manner. Thus, its rows are retrieved by accessing contiguous blocks of memory, which makes iterating over the array more efficient than iterating over a dataframe.
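Here is a minimal sketch of the idea, not a benchmark. The dataframe, its column names (`a`, `b`), and the two helper functions are made up for illustration. The call to `np.ascontiguousarray` is an extra precaution I am assuming you may want: it guarantees that the array you loop over is actually laid out row by row.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two numeric columns of the same dtype.
df = pd.DataFrame({
    "a": np.arange(5, dtype=np.float64),
    "b": np.arange(5, 10, dtype=np.float64),
})

def row_sums_iterrows(frame):
    # Baseline: iterate over the dataframe itself, one row at a time.
    return [row["a"] + row["b"] for _, row in frame.iterrows()]

def row_sums_numpy(frame):
    # Convert once, then iterate over the array's rows.
    # ascontiguousarray ensures a row-major (C-ordered) layout.
    arr = np.ascontiguousarray(frame.to_numpy())
    return [row[0] + row[1] for row in arr]

slow = row_sums_iterrows(df)
fast = row_sums_numpy(df)
```

Both functions return the same values; only the way rows are fetched differs. To see the gap on your own data, time the two calls with `timeit` on a dataframe with many more rows.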
That said, the best approach is always to write vectorized code. Use the Pandas-to-NumPy approach only when you are truly struggling to write vectorized code.
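For comparison, this is roughly what a vectorized version of a simple row-wise sum looks like. The column names are again hypothetical, and real row-wise logic may not reduce to a single expression this neatly.

```python
import numpy as np
import pandas as pd

# Hypothetical data, same shape as a typical row-wise task.
df = pd.DataFrame({
    "a": np.arange(5, dtype=np.float64),
    "b": np.arange(5, 10, dtype=np.float64),
})

# Vectorized: one expression over whole columns, no Python-level loop.
total = df["a"] + df["b"]
```

If your logic can be written like this, prefer it over any form of iteration.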
Find the code for my tips here: GitHub.
I like to explore, experiment and write about data science concepts and tools. You can read my articles on Medium. Also, you can connect with me on LinkedIn.