I created the following visual, which depicts the 15 most common tabular operations in Pandas and their corresponding translations in SQL, Polars, and PySpark.
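To give a feel for what one of these translations looks like, here is a minimal sketch of a single operation (filter, then group and sum) written in Pandas and again in SQL. The toy data, the `orders` table name, and the column names are made up for illustration, and the SQL runs on an in-memory SQLite database purely for convenience.

```python
import sqlite3

import pandas as pd

# Hypothetical toy data, not taken from the visual.
df = pd.DataFrame(
    {
        "city": ["Pune", "Pune", "Oslo", "Oslo", "Lima"],
        "sales": [10, 25, 5, 40, 15],
    }
)

# Pandas: boolean-mask filter, then group by a column and sum.
pandas_result = (
    df[df["sales"] > 8]
    .groupby("city", as_index=False)["sales"]
    .sum()
    .sort_values("city")
    .reset_index(drop=True)
)

# SQL: the same logic expressed with WHERE and GROUP BY.
conn = sqlite3.connect(":memory:")
df.to_sql("orders", conn, index=False)
sql_result = pd.read_sql_query(
    """
    SELECT city, SUM(sales) AS sales
    FROM orders
    WHERE sales > 8
    GROUP BY city
    ORDER BY city
    """,
    conn,
)
conn.close()

print(pandas_result)
print(sql_result)
```

Both versions should produce the same per-city totals; only the syntax differs.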
While the motivation for Pandas and SQL is clear and well-known, let me tell you why you should care about Polars and PySpark.
Why Polars?
Pandas has many limitations, which Polars addresses, such as:
Pandas runs most operations on a single core → Polars is multi-threaded and uses all available cores.
Pandas offers no lazy execution → Polars does.
Pandas DataFrames have a large memory footprint → Polars DataFrames are more memory-efficient.
Pandas is slow on large datasets → Polars is remarkably efficient.
In fact, if we look at the run-time comparison on some common operations, it’s clear that Polars is much more efficient than Pandas:
Why Spark?
While the tabular data space is dominated by Pandas and Sklearn, they offer little benefit beyond a few GBs of data because they process everything on a single node.
A more practical solution is distributed computing: a framework spreads the data and the computation across many smaller machines.
Spark is among the best technologies used to quickly and efficiently analyze, process, and train models on big datasets.
That is why many data science roles at big tech demand proficiency in Spark. It’s that important.
We covered this in detail in a recent deep dive as well: Don’t Stop at Pandas and Sklearn! Get Started with Spark DataFrames and Big Data ML using PySpark.
👉 Over to you: What are some other faster alternatives to Pandas that you are aware of?
Thanks for reading!