Comparison of Time-Series Data Transport Formats for Smallest Storage: Avro, Parquet, CSV

Goal: Efficiently transport integer-based financial time-series data to dedicated machines and research partners by experimenting with the smallest data transport format(s) among Avro, Parquet, and compressed CSVs.

My financial time-series data is currently collected and stored in hundreds of gigabytes of SQLite files on non-clustered, RAIDed Linux machines. I have an experimental cluster computer running Spark, but I also have access to AWS ML tools, as well as partners with their own ML tools and environments (TensorFlow, Keras, etc.). My goal this weekend is to experiment with and implement a compact and efficient data transport format.

Results Summary

Parquet v2 with internal GZip achieved an impressive 83% compression on my real data, saving over 7 GB compared with compressed CSVs, which reached 78%. Both took a similar amount of time to produce, but Parquet files are more easily ingested by Hadoop HDFS.


Preliminary tests

Using a sample of 35 random symbols containing only integers, here are the aggregate data sizes under various storage formats and compression codecs on Windows, sorted by increasing compression ratio with plain CSVs as the baseline.

| Format | Internal Compression | Post-Compression | Size (KB) | Ratio |
|---|---|---|---:|---:|
| SQLite | — | — | 148,173 | -92% |
| CSV | — | — | 77,280 | 0% |
| SQLite | — | Deflate | 47,395 | 39% |
| Avro | None | — | 45,969 | 41% |
| Avro | Snappy | — | 27,916 | 64% |
| SQLite | — | LZMA | 27,330 | 64% |
| Parquet | None | — | 26,108 | 66% |
| Parquet | Snappy | — | 24,793 | 68% |
| Avro | GZip(1) | — | 19,563 | 75% |
| Avro | GZip(9) | — | 18,781 | 76% |
| Parquet | GZip | — | 18,524 | 76% |
| Parquet | Snappy | Deflate | 18,231 | 76% |
| Avro | Snappy | LZMA | 18,093 | 76% |
| Parquet | None | Deflate | 17,895 | 77% |
| Avro | BZip2 | — | 17,810 | 77% |
| Parquet | GZip | Deflate | 17,711 | 77% |
| Parquet | GZip | LZMA | 17,677 | 77% |
| Parquet (v2) | Snappy | — | 16,785 | 78% |
| Avro | XZ(1) | — | 16,606 | 78% |
| Avro | XZ(9) | — | 15,528 | 80% |
| CSV | — | Deflate | 15,469 | 80% |
| Parquet | Snappy | LZMA | 14,821 | 81% |
| Parquet (v2) | GZip | — | 14,764 | 81% |
| Parquet | None | LZMA | 14,588 | 81% |
| Avro | None | LZMA | 14,555 | 81% |
| CSV | — | LZMA | 13,274 | 83% |

Note: The most promising transport formats from the table above are examined in the sections that follow.

GZip is supplied by org.apache.hadoop.io.compress.GzipCodec; Snappy by org.apache.parquet.hadoop.codec.SnappyCodec; XZ by org.apache.commons.compress.compressors.xz; BZip2 by org.apache.commons.compress.compressors.bzip2; Deflate and LZMA by 7-Zip on ultra compression. Other compression codecs for Parquet are not available on Windows without Hadoop installed.
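For anyone wanting to reproduce a rough version of this comparison without a Hadoop install, here is a minimal sketch using pyarrow as a stand-in for the Java codecs listed above. The sample table and column names are made up for illustration, and sizes will not match the table exactly because the codec implementations differ.

```python
import os
import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical integer-only sample table standing in for the 35 symbols.
table = pa.Table.from_pydict({
    "ts": [1514764800 + 60 * i for i in range(10_000)],
    "close": [10_000 + (i % 250) for i in range(10_000)],
    "volume": [1_000 + (i * 7) % 900 for i in range(10_000)],
})

# Write the same data with several internal codecs and compare file sizes.
for codec in ("none", "snappy", "gzip"):
    path = f"sample_{codec}.parquet"
    pq.write_table(table, path, compression=codec)
    print(f"{codec:>6}: {os.path.getsize(path):,} bytes")
```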

Test data: Compressed CSVs

The schema doesn’t change for the majority of the data, so CSV is simple, convenient, and human-readable. TensorFlow, for example, ingests CSV easily. In the preliminary tests above it outperformed all other transport formats when post-compressed with LZMA on ultra compression (83%). The major drawback is that the compressed transport file must first be decompressed back into bloated plain CSV files (0%).

| Internal Compression | Post-Compression | Size (KB) | Ratio |
|---|---|---:|---:|
| None | LZMA | 13,274 | 83% |
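As a concrete sketch of this workflow, the following snippet exports a hypothetical SQLite table to plain CSV and then post-compresses it with LZMA. The database name, table name, and columns are illustrative, and Python's lzma module writes an .xz stream rather than 7-Zip's container, so the exact sizes will differ from the table above.

```python
import csv
import lzma
import shutil
import sqlite3

# Export a (hypothetical) SQLite table of integer candles to plain CSV.
with sqlite3.connect("quotes.db") as conn, open("candles.csv", "w", newline="") as out:
    cursor = conn.execute(
        "SELECT ts, open, high, low, close, volume FROM candles ORDER BY ts"
    )
    writer = csv.writer(out)
    writer.writerow([col[0] for col in cursor.description])  # header row
    writer.writerows(cursor)

# Post-compress with LZMA; preset 9 extreme approximates 7-Zip's ultra setting.
with open("candles.csv", "rb") as src, lzma.open(
    "candles.csv.xz", "wb", preset=9 | lzma.PRESET_EXTREME
) as dst:
    shutil.copyfileobj(src, dst)
```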

Test data: Avro and Parquet (v1) with post-compression

The next most efficient data transport formats are uncompressed Avro and Parquet (version 1) with LZMA post-compression set to ultra (81% each). Avro is a record-based data format that embeds its schema and can be split across several files. Parquet is a column-based data format that makes it quick to look up or join a subset of columns. Both formats are used natively in the Apache ecosystem, for instance in Hadoop and Spark. The drawback again is that the transport files must be expanded into individual Avro and Parquet files (41% and 66%). In the sample data, the expanded Parquet files occupy less space than the Avro files.

| Format | Internal Compression | Post-Compression | Size (KB) | Ratio |
|---|---|---|---:|---:|
| Parquet (v1) | None | LZMA | 14,588 | 81% |
| Avro | None | LZMA | 14,555 | 81% |
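A minimal sketch of producing the two uncompressed transport formats in Python, using fastavro for Avro and pyarrow for Parquet v1; my actual exports used the Java writers, and the schema below is illustrative.

```python
import fastavro
import pyarrow as pa
import pyarrow.parquet as pq

# A few integer-only minute candles; column names are illustrative.
rows = [
    {"ts": 1514764800, "open": 100, "high": 105, "low": 99, "close": 104, "volume": 1200},
    {"ts": 1514764860, "open": 104, "high": 106, "low": 103, "close": 105, "volume": 800},
]

# Avro: record-oriented, schema embedded in the file; codec="null" means no
# internal compression (post-compression with LZMA happens separately).
schema = {
    "type": "record",
    "name": "Candle",
    "fields": [{"name": k, "type": "long"} for k in rows[0]],
}
with open("candles.avro", "wb") as out:
    fastavro.writer(out, schema, rows, codec="null")

# Parquet v1: column-oriented, also written without internal compression.
table = pa.Table.from_pylist(rows)
pq.write_table(table, "candles.parquet", version="1.0", compression="none")
```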
Fun fact: After not knowing how to pronounce the word “parquet”, I found the online consensus to be “par-KAY”, from the French for “floor”. Parquet is also a flooring style (mosaic flooring) popular in the 1600s (and again in the ’60s) that resembles the Apache Parquet project logo.

Apache Parquet logo

Test data: Parquet (v2) with GZip

It turns out Parquet (version 2) files with GZip column compression yield an 81% compression ratio without any additional post-compression. This means these Parquet files can be ingested directly by Hadoop’s HDFS without a prior decompression step. The observed 81% compression is comparable to the 83% of the LZMA-compressed CSV files.

| Internal Compression | Post-Compression | Size (KB) | Ratio |
|---|---|---:|---:|
| GZip | Store | 14,764 | 81% |
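My exports used the Java parquet-mr writer with writer version 2; a rough pyarrow equivalent, shown below as a sketch, is to request version-2 data pages together with internal GZip column compression.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Same illustrative candle data as before.
table = pa.Table.from_pylist([
    {"ts": 1514764800, "open": 100, "high": 105, "low": 99, "close": 104, "volume": 1200},
    {"ts": 1514764860, "open": 104, "high": 106, "low": 103, "close": 105, "volume": 800},
])

# Approximation of "Parquet v2 with internal GZip": 2.x format features,
# version-2 data pages, and gzip-compressed column chunks.
pq.write_table(
    table,
    "candles_v2.parquet",
    version="2.6",            # format / logical-type version
    data_page_version="2.0",  # version-2 data pages
    compression="gzip",
)
```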
On the sample data, plain CSV files compressed with LZMA still have the best compression ratio. Next, I experiment on the full data to see whether this holds.

Results on full data

A cleaned export of real data into CSV format results in 146 GB (149,503 MB) of plain CSV files. When converted into the above transport formats, here are the results¹:

| Format | Internal Compression | Post-Compression | Size (MB) | Ratio | Time (hrs) |
|---|---|---|---:|---:|---|
| CSV | — | — | 149,503 | 0% | — |
| SQLite | — | Deflate | 58,327 | 61% | — |
| Parquet (v1) | GZip | Store | 34,089 | 77% | 11.9 (11.1 + 0.8) |
| CSV | — | LZMA | 32,877 | 78% | 13.8 (7.2 + 6.6) |
| Parquet (v2) | GZip | Store | 25,632 | 83% | 14.0 (13.2 + 0.8) |

Full data: Compressed CSVs

On the real data, it took 7.2 hours to export to plain CSVs and another 6.6 hours to compress all the files (GZip level 6) into a single transport zip file. The 13.8 hours of processing resulted in a 78% compression. CPU and heap activity is shown below.

| Internal Compression | Post-Compression | Size (MB) | Ratio | Time (hrs) |
|---|---|---:|---:|---|
| None | LZMA | 32,877 | 78% | 13.8 (7.2 + 6.6) |
7 hours of exporting data to plain CSV

Full data: Parquet (v1) with GZip

Exporting to Parquet v1 and then storing the files in a single archive took 11.9 hours and resulted in a 77% compression. CPU and heap activity is shown below.

| Internal Compression | Post-Compression | Size (MB) | Ratio | Time (hrs) |
|---|---|---:|---:|---|
| GZip | Store | 34,089 | 77% | 11.9 (11.1 + 0.8) |
11 hours of writing Parquet v1 files with internal GZip
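The “Store” step in the tables simply bundles the already-compressed Parquet files into one transport archive without recompressing them, which is why it adds only 0.8 hours. A minimal sketch with Python’s zipfile (the directory name is hypothetical):

```python
import pathlib
import zipfile

# Bundle the exported Parquet files into a single uncompressed ("stored")
# archive for transport; the internal GZip already did the compression.
with zipfile.ZipFile("transport.zip", "w", compression=zipfile.ZIP_STORED) as archive:
    for path in sorted(pathlib.Path("parquet_out").glob("*.parquet")):
        archive.write(path, arcname=path.name)
```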

Full data: Parquet (v2) with GZip

Exporting to Parquet v2 and then storing all the files in a single archive took 14 hours and achieved an 83% compression. The total CPU and heap activity is shown below.

| Internal Compression | Post-Compression | Size (MB) | Ratio | Time (hrs) |
|---|---|---:|---:|---|
| GZip | Store | 25,632 | 83% | 14.0 (13.2 + 0.8) |
13 hours of writing Parquet v2 files with internal GZip

Discussion

The default Parquet writer is version one, so what is the difference between Parquet versions one and two? Version two adds delta encoding, which is extremely well suited to sorted timestamp columns. Instead of storing four bytes per timestamp, only one or two bytes may be needed per entry: the difference in timestamp seconds between two consecutive minute-candles is 60, which fits in a single byte. This explains the large compression advantage of Parquet version two and the more than 7 GB of extra savings over compressed CSVs.
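A quick back-of-the-envelope illustration of the delta-encoding argument (the timestamps are arbitrary example values):

```python
# Consecutive minute-candle timestamps differ by a constant 60 seconds, so
# after taking differences each entry needs only a few bits instead of 4 bytes.
timestamps = [1514764800 + 60 * i for i in range(5)]   # five one-minute candles
deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]

print(timestamps)   # [1514764800, 1514764860, ...]  -> ~4 bytes per value
print(deltas)       # [60, 60, 60, 60]               -> fits in a single byte
print(max(d.bit_length() for d in deltas))            # 6 bits per delta
```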

Success: CSV is the de facto format for small data sets, but for larger data sets a more exotic format is warranted. By leveraging Parquet v2 and its delta encoding, my entire data set now fits on a 32 GB USB stick for transport at an 83% compression.

Notes:

  1. Times in brackets denote export time followed by the time to store the output in a single transport file.