I presume that all of the figures were calculated using a single node/CPU.
My usage of parquet has been on clusters of multiple nodes running on cloud environments (AWS, Databricks) atop S3 storage. If a very large dataset has been "sharded" into, say, 20 parquet files, and if the cluster is running a like number (20) of nodes, then reading and writing the dataset to/from parquet can be lightning-fast.
It would be interesting to rerun the analysis on clusters of varying sizes where formats such as parquet would benefit from the parallelization while other formats may not.
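For concreteness, here's a minimal sketch of that kind of read in PySpark, assuming a running cluster session and a hypothetical S3 path; each of the 20 part files can be picked up by a separate executor, which is where the speedup comes from.

# Minimal sketch: parallel parquet read on a Spark cluster.
# The bucket/path below is hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-parallel-read").getOrCreate()

# Spark distributes the read across executors, so a dataset sharded into
# 20 parquet files on a 20-node cluster is read roughly in parallel.
df = spark.read.parquet("s3://my-bucket/big-dataset/")

df.count()  # forces the full read, so any timing reflects actual I/O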
Yes, Michael, that's correct. I used Pandas, which inherently runs on a single node :)
And of course, totally agree with what you mentioned. I use that quite often too, so I think it would be pretty cool to do some analysis along those lines. Thanks for the suggestion :)
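For contrast, a single-node Pandas read is a minimal sketch like the following (the file name is hypothetical); everything runs in one process on one machine, which is why it can't benefit from sharding the way a cluster does.

# Minimal sketch: single-node parquet read with Pandas.
import pandas as pd

df = pd.read_parquet("big-dataset.parquet")  # hypothetical file name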