While Pandas is the most popular DataFrame library, it can be slow on large datasets, for a few reasons:
It only uses a single CPU core.
Its DataFrames have a large memory footprint.
It executes code eagerly, so it cannot optimize a sequence of operations as a whole.
FireDucks is a highly optimized, drop-in replacement for Pandas with the same API.
You just need to change one line of code: import fireducks.pandas as pd
Done!
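As a minimal sketch of that one-line change, the snippet below imports fireducks.pandas under the usual pd alias and falls back to plain Pandas when FireDucks is not installed. The fallback and the toy city/sales data are assumptions added for illustration; they are not part of the experiment.

```python
# Try FireDucks first; fall back to plain Pandas if it is not installed.
# (The fallback is an assumption for portability, not something FireDucks requires.)
try:
    import fireducks.pandas as pd  # the one-line change
except ImportError:
    import pandas as pd

# Toy data (hypothetical); the rest of the code is ordinary Pandas-style code.
df = pd.DataFrame({"city": ["A", "B", "A"], "sales": [10, 20, 30]})

# Same API either way: group by city and sum the sales.
totals = df.groupby("city")["sales"].sum()
print(totals)
```

Everything after the import stays unchanged, which is the point of a drop-in replacement.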
The video above and the image below compare the run time of FireDucks with Pandas and cuDF, a GPU DataFrame library.
As you can tell, FireDucks is even faster than cuDF in this case.
That said, the query in the above experiment loads all columns of the two parquet files.
When I optimized it manually by only loading the required columns, the run-time dropped to:
Pandas: 14 seconds (from 48 seconds)
FireDucks: 0.8 seconds (from 0.8 seconds) [same as before]
cuDF: 0.9 seconds (from 2.6 seconds)
This shows that FireDucks’ compiler applies the same optimization automatically, whereas in cuDF and Pandas you have to do it explicitly.
Most importantly, there is no impact on the final result.
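In Pandas or cuDF, the manual version of this optimization usually amounts to passing a column list to the Parquet reader. The sketch below shows the idea with plain Pandas. The column names and the join-then-aggregate query are hypothetical stand-ins, since the actual query and schema live in the notebook rather than in this post.

```python
import pandas as pd

# Hypothetical column names; the real schema from the experiment is not shown here.
REQUIRED_LEFT = ["key", "value"]
REQUIRED_RIGHT = ["key", "category"]


def load_all_columns(left_path, right_path):
    """Baseline: read every column of both Parquet files."""
    return pd.read_parquet(left_path), pd.read_parquet(right_path)


def load_required_columns(left_path, right_path):
    """Manual optimization: read only the columns the query needs."""
    left = pd.read_parquet(left_path, columns=REQUIRED_LEFT)
    right = pd.read_parquet(right_path, columns=REQUIRED_RIGHT)
    return left, right


def run_query(left, right):
    """Hypothetical query: join on `key`, then sum `value` per `category`."""
    merged = left.merge(right, on="key")
    return merged.groupby("category")["value"].sum()
```

Both loaders feed the same run_query and should produce the same result; the second one just avoids reading columns the query never touches.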
You can find the Colab notebook for our experiment here: FireDucks vs. cuDF Colab Notebook.
Thanks for reading!
P.S. For those wanting to develop “Industry ML” expertise:
At the end of the day, all businesses care about impact. That’s it!
Can you reduce costs?
Drive revenue?
Can you scale ML models?
Predict trends before they happen?
We have covered several topics (with implementations) in the past that address these questions.
Here are some of them:
Learn sophisticated graph architectures and how to train them on graph data: A Crash Course on Graph Neural Networks – Part 1.
So many real-world NLP systems rely on pairwise context scoring. Learn scalable approaches here: Bi-encoders and Cross-encoders for Sentence Pair Similarity Scoring – Part 1.
Learn techniques to run large models on small devices: Quantization: Optimize ML Models to Run Them on Tiny Hardware.
Learn how to generate prediction intervals or sets with strong statistical guarantees to increase trust: Conformal Predictions: Build Confidence in Your ML Model’s Predictions.
Learn how to identify causal relationships and answer business questions: A Crash Course on Causality – Part 1
Learn how to scale ML model training: A Practical Guide to Scaling ML Model Training.
Learn techniques to reliably roll out new models in production: 5 Must-Know Ways to Test ML Models in Production (Implementation Included)
Learn how to build privacy-first ML systems: Federated Learning: A Critical Step Towards Privacy-Preserving Machine Learning.
Learn how to compress ML models and reduce costs: Model Compression: A Critical Step Towards Efficient Machine Learning.
All these resources will help you cultivate the key skills that businesses care about most.