0:00
/
0:00
Transcript

FireDucks vs. cuDF

CPU library vs. GPU library.

While Pandas is the most popular DataFrame library, it is terribly slow.

  • It only uses a single CPU core.

  • It has bulky DataFrames.

  • It eagerly executes code, which prevents any possible optimization.

FireDucks is a highly optimized, drop-in replacement for Pandas with the same API.

You just need to change one line of code → 𝐢𝐦𝐩𝐨𝐫𝐭 𝗳𝗶𝗿𝗲𝗱𝘂𝗰𝗸𝘀.𝐩𝐚𝐧𝐝𝐚𝐬 𝐚𝐬 𝐩𝐝

Done!

The video above and the image below compare the run-time of FireDucks with Pandas and cuDF—a GPU DataFrame library.

As you can tell, FireDucks is even faster than cuDF in this case.

That said, the query in the above experiment loads all columns of the two parquet files.

When I optimized it manually by only loading the required columns, the run-time dropped to:

  • Pandas: 14 seconds (from 48 seconds)

  • FireDucks: 0.8 seconds (from 0.8 seconds) [same as before]

  • cuDF: 0.9 seconds (from 2.6 seconds)

This shows that the FireDucks’ compiler does the same optimization automatically, which one has to explicitly do in cuDF and Pandas.

Most importantly, there is no impact on the final result.

You can find the Colab notebook for our experiment here: FireDucks vs. cuDF Colab Notebook.

Thanks for reading!


P.S. For those wanting to develop “Industry ML” expertise:

At the end of the day, all businesses care about impact. That’s it!

  • Can you reduce costs?

  • Drive revenue?

  • Can you scale ML models?

  • Predict trends before they happen?

We have discussed several other topics (with implementations) in the past that align with such topics.

Develop "Industry ML" Skills

Here are some of them:

All these resources will help you cultivate key skills that businesses and companies care about the most.

Discussion about this video