How to Read Multiple CSV Files Efficiently

Oct 22, 2022

Data is often split into multiple CSV files before it is handed to the DS/ML team.

Pandas reads a CSV file on a single thread by default and has no built-in way to read several files at once, so one typically has to iterate over the list of files and read them one by one before further processing.
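For reference, the one-by-one approach usually looks something like the sketch below. The function name `read_csvs_pandas` and the glob pattern are illustrative assumptions, not something from a specific codebase, and the sketch assumes all files share the same columns.

```python
import glob

import pandas as pd


def read_csvs_pandas(pattern):
    """Read every CSV matching `pattern` one at a time and stack the rows.

    `read_csvs_pandas` is a hypothetical helper name used for illustration.
    """
    files = sorted(glob.glob(pattern))
    if not files:
        raise FileNotFoundError(f"no files match {pattern!r}")

    # Each file is parsed sequentially, one after the other.
    frames = [pd.read_csv(path) for path in files]

    # Concatenate once at the end instead of appending inside the loop.
    return pd.concat(frames, ignore_index=True)
```

Calling `read_csvs_pandas("data/*.csv")` (with a path of your own) returns a single DataFrame containing the rows of all matching files.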

Datatable can provide a quick fix for this. Instead of reading the files iteratively with Pandas, you can hand the whole list to Datatable in one call. Because its CSV reader is multi-threaded, it can give a significant performance boost over Pandas.

The performance gain is not limited to I/O; it shows up in many other tabular operations as well.

Daily Dose of Data Science
