5 Comments
Nov 24, 2023 · Liked by Avi Chawla

I presume that all of the figures were calculated using a single node/CPU.

My usage of parquet has been on multi-node clusters running in cloud environments (AWS, Databricks) atop S3 storage. If a very large dataset has been "sharded" into, say, 20 parquet files, and the cluster is running a matching number (20) of nodes, then reading and writing the dataset to/from parquet can be lightning-fast.

It would be interesting to rerun the analysis on clusters of varying sizes where formats such as parquet would benefit from the parallelization while other formats may not.


Worthwhile to investigate all sorts of potential optimization areas... On my machine I tend to think of the multiple things I'm running locally rather than connections, as I have a fiber connection, but that is not necessarily the case.

Nov 23, 2023 · Liked by Avi Chawla

Nice. Wondering how you plot those data so nicely in your post?
