
Parquet is underdesigned. Some parts of it do not scale well.

I believe that Parquet files have rather monolithic metadata at the end, and that footer has a 4 GB maximum size limit. With 600 columns (a realistic number, believe me), we are at slightly less than 7.2 million row groups. Give each row group 8K rows and we are limited to roughly 60 billion rows in total. That is not much.
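Back-of-envelope version of that ceiling (a sketch; the bytes-per-column-chunk figure is an assumption I am plugging in, and real Thrift footers need tens to hundreds of bytes per column chunk, which only lowers the ceiling further):

    # Back-of-envelope: how many rows fit before the 4 GiB footer limit?
    # bytes_per_column_chunk is an assumed parameter, not a Parquet constant.
    footer_limit = 4 * 2**30          # 4 GiB cap on the Thrift footer
    columns = 600
    rows_per_group = 8 * 1024         # 8K rows per row group

    def max_rows(bytes_per_column_chunk):
        row_groups = footer_limit // (columns * bytes_per_column_chunk)
        return row_groups * rows_per_group

    print(max_rows(1))     # ~58.6 billion rows, the optimistic figure above
    print(max_rows(100))   # ~0.59 billion rows with a more realistic footer entry size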

The flatness of the file metadata requires external data structures to handle it well. You cannot just mmap it and be done. Such an external data structure will most probably take as much memory as the file metadata itself, or even more. So 4 GB+ of your RAM will be, well, used somewhat inefficiently.
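To illustrate with pyarrow (a sketch; the file name is made up): the whole footer is deserialized into in-memory objects before you can reach any single column chunk, nothing is lazily mapped.

    import pyarrow.parquet as pq

    # Reading the footer deserializes the entire flat metadata structure.
    md = pq.read_metadata("events.parquet")   # hypothetical file
    print(md.num_row_groups, md.num_columns, md.serialized_size)

    # Locating one column chunk still means walking the in-memory objects.
    rg = md.row_group(0)
    col = rg.column(0)
    print(col.file_offset, col.total_compressed_size)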

(A block-run-mapped log-structured merge tree in one file can be as compact as a Parquet file and allows very efficient memory-mapped operations without additional data structures.)

Thus, while Parquet is a step, I am not sure it is definitely a step in the right direction. Some aspects of it are good, some are not that good.



Parquet is not a database. It's a storage format that allows efficient column reads, so you can get just the data you need without having to parse and read the whole file.

Most tools can run queries across parquet files.

Like everything, it has its strengths and weaknesses, but in most cases it has better trade-offs than CSV if you have more than a few thousand rows.
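A minimal sketch of both points, assuming pyarrow and DuckDB and made-up file names:

    import pyarrow.parquet as pq
    import duckdb

    # Only the 'user_id' and 'amount' column chunks are read from disk.
    table = pq.read_table("sales.parquet", columns=["user_id", "amount"])

    # Query across many Parquet files without loading them whole.
    duckdb.sql(
        "SELECT user_id, sum(amount) FROM 'sales/*.parquet' GROUP BY user_id"
    ).show()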


> Parquet is not a database.

This is not emphasized often enough. Parquet is useless for anything that requires writing back computed results, such as data used by signal processing applications.
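Concretely, with pyarrow (a sketch; the file and column names are made up), there is no in-place update path, so writing back even one computed column means materializing and rewriting the whole file:

    import pyarrow.parquet as pq
    import pyarrow.compute as pc

    table = pq.read_table("signal.parquet")          # hypothetical input file
    idx = table.schema.get_field_index("sample")     # "sample" is a made-up column
    updated = table.set_column(idx, "sample", pc.multiply(table["sample"], 0.5))
    pq.write_table(updated, "signal_v2.parquet")     # whole new file; no in-place update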


> 7.2 millions row groups

Why would you need 7.2 mil row groups?

Row group size when stored in HDFS is usually equal to the HDFS block size by default, which is 128 MB.

7.2 mil * 128MB ~ 1PB

You have a single parquet file 1PB in size?


Parquet is not HDFS. It is a static format, not a B-tree in disguise like HDFS.

You can have compressed Parquet column chunks with 8192 entries that are only a couple of tens of bytes in size. 600 columns in a row group is then 12 KB or so, leading us to a ~100 GB file, not a petabyte. Four orders of magnitude of difference between your assessment and mine.
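Just to spell out where the gap comes from, both figures follow from the assumed size of a row group (numbers taken from the two comments, nothing here is a Parquet constant):

    row_groups = 7_200_000

    # Parent comment: one 128 MB HDFS block per row group.
    print(row_groups * 128e6 / 1e15)      # ~0.92 PB

    # This comment: 600 column chunks of ~20 bytes each per row group.
    print(row_groups * 600 * 20 / 1e9)    # ~86 GB, roughly four orders of magnitude less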


Some critiques of Parquet by Andy Pavlo:

https://www.vldb.org/pvldb/vol17/p148-zeng.pdf


Thanks, very insightful.

"Dictionary Encoding is effective across data types (even for floating-point values) because most real-world data have low NDV ratios. Future formats should continue to apply the technique aggressively, as in Parquet."

So this is not a critique but an assessment. And Parquet has some interesting design decisions I did not know about.

So, let me thank you again. ;)
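For anyone curious, pyarrow exposes that dictionary-encoding decision directly (a sketch; the column names are made up): it is on by default and can be restricted to specific columns, including floating-point ones as the paper discusses.

    import pyarrow as pa
    import pyarrow.parquet as pq

    # Low-NDV columns: many repeated values, so the dictionaries stay tiny.
    table = pa.table({
        "country": ["DE", "US", "US", "DE", "FR"] * 100_000,
        "reading": [0.1, 0.2, 0.2, 0.1, 0.3] * 100_000,
    })

    # use_dictionary accepts True/False or a list of column names.
    pq.write_table(table, "readings.parquet", use_dictionary=["country", "reading"])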


What format would you recommend instead?


I do not know a good one.

A former colleague of mine is now working on a memory-mapped log-structured merge tree implementation, and it could be a good alternative. An LSM tree provides elasticity: one can store as much data as one needs. Its runs are static, so they can be compressed as well as Parquet-stored data, and memory mapping with implicit indexing of the data does not require additional data structures.

Something like LevelDB and/or RocksDB can provide most of that, especially when used in covering index [1] mode.

[1] https://www.sqlite.org/queryplanner.html#_covering_indexes
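Here is roughly what the covering-index mode from [1] looks like with stock sqlite3 (a sketch; the schema is made up). Once the index contains every column the query touches, SQLite answers from the index alone and never reads the table:

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE events (ts INTEGER, sensor TEXT, value REAL)")
    # The index covers every column the query needs.
    con.execute("CREATE INDEX idx_cover ON events (sensor, ts, value)")

    plan = con.execute(
        "EXPLAIN QUERY PLAN SELECT ts, value FROM events WHERE sensor = ?", ("a",)
    ).fetchall()
    print(plan)   # reports 'USING COVERING INDEX idx_cover'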


Nobody is forcing you to use a single Parquet file.


Of course.

But nobody tells me that I can hit a hard limit, after which I need a second Parquet file and some code to handle that.

The situation looks to me as if my "Favorite DB server" supported, say, only 1.9 billion records per table, and if I hit that limit I needed a second instance of my "Favorite DB server" just for that unfortunate table. And it is not documented anywhere.



