Comparison of Time-Series Data Transport Formats for Smallest Storage: Avro, Parquet, CSV
My financial time-series data is currently collected and stored in hundreds of gigabytes of SQLite files on non-clustered, RAIDed Linux machines. I have an experimental cluster computer running Spark, but I also have access to AWS ML tools, as well as partners with their own ML tools and environments (TensorFlow, Keras, etc.). My goal this weekend is to experiment with and implement a compact and efficient data transport format.
Results Summary
Parquet v2 with internal GZip achieved an impressive 83% compression on my real data, roughly 7 GB smaller than the compressed CSVs, which achieved 78%. Both took a similar amount of time to produce, but Parquet files are more easily ingested by Hadoop HDFS.
Preliminary tests
Using a sample of 35 random symbols containing only integer values, here are the aggregate data sizes under various storage formats and compression codecs on Windows, sorted by increasing compression ratio with plain CSVs as the baseline.
Format | Internal Compression | Post-Compression | Size (KB) ⯆ | Ratio ⯅ |
---|---|---|---|---|
SQLite | – | – | 148,173 | -92% |
CSV | – | – | 77,280 | 0% |
SQLite | – | Deflate | 47,395 | 39% |
Avro | None | – | 45,969 | 41% |
Avro | Snappy | – | 27,916 | 64% |
SQLite | – | LZMA | 27,330 | 64% |
Parquet | None | – | 26,108 | 66% |
Parquet | Snappy | – | 24,793 | 68% |
Avro | GZip(1) | – | 19,563 | 75% |
Avro | GZip(9) | – | 18,781 | 76% |
Parquet | GZip | – | 18,524 | 76% |
Parquet | Snappy | Deflate | 18,231 | 76% |
Avro | Snappy | LZMA | 18,093 | 76% |
Parquet | None | Deflate | 17,895 | 77% |
Avro | BZip2 | – | 17,810 | 77% |
Parquet | GZip | Deflate | 17,711 | 77% |
Parquet | GZip | LZMA | 17,677 | 77% |
Parquet (v2) | Snappy | – | 16,785 | 78% |
Avro | XZ(1) | – | 16,606 | 78% |
Avro | XZ(9) | – | 15,528 | 80% |
CSV | – | Deflate | 15,469 | 80% |
Parquet | Snappy | LZMA | 14,821 | 81% |
Parquet (v2) | GZip | – | 14,764 | 81% |
Parquet | None | LZMA | 14,588 | 81% |
Avro | None | LZMA | 14,555 | 81% |
CSV | – | LZMA | 13,274 | 83% |
Note: The most promising transport formats from the table above are examined individually below.
Codec notes:
- GZip is supplied by org.apache.hadoop.io.compress.GzipCodec.
- Snappy is supplied by org.apache.parquet.hadoop.codec.SnappyCodec.
- XZ is supplied by org.apache.commons.compress.compressors.xz.
- BZip2 is supplied by org.apache.commons.compress.compressors.bzip2.
- Deflate is supplied by 7-Zip on ultra compression.
- LZMA is supplied by 7-Zip on ultra compression.
- Other compression codecs for Parquet on Windows are not available without Hadoop installed.
Test data: Compressed CSVs
The schema doesn’t change for the majority of the data, so CSV is simple, convenient, and human-readable. TensorFlow, for example, ingests CSV easily. In the preliminary tests above, CSV outperformed all other transport formats when post-compressed with LZMA on ultra compression (83%). The major drawback is that the compressed transport file must first be decompressed back into bloated CSV files (0%).
Internal Compression | Post-Compression | Size (KB) ⯆ | Ratio ⯅ |
---|---|---|---|
None | LZMA | 13,274 | 83% |
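For completeness, here is a minimal sketch of what the post-compression step could look like programmatically, using the XZ (LZMA2) stream from Apache Commons Compress, the same library that supplies the XZ codec rows above. In the tests themselves the LZMA compression was done with 7-Zip on ultra, and the file paths below are placeholders.

```java
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import org.apache.commons.compress.compressors.xz.XZCompressorOutputStream;

public class PostCompressCsv {
    public static void main(String[] args) throws IOException {
        // Hypothetical paths: one exported CSV in, one .xz transport file out.
        Path csv = Paths.get("export/AAPL.csv");
        Path out = Paths.get("transport/AAPL.csv.xz");

        // Preset 9 corresponds to the XZ(9) rows in the tables above.
        try (XZCompressorOutputStream xz = new XZCompressorOutputStream(
                new BufferedOutputStream(Files.newOutputStream(out)), 9)) {
            Files.copy(csv, xz);
        }
    }
}
```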
Test data: Avro and Parquet (v1) with post-compression
The next most efficient data transport formats are uncompressed Avro and Parquet (version 1) with LZMA post-compression set to ultra (81% each). Avro is a record-based data format that contains the schema and can be split across several files. Parquet is a column-based data format that makes it quick to look up or join a subset of columns. Both formats are natively supported in the Apache ecosystem, for instance in Hadoop and Spark. The drawback again is that the transport files must be expanded into individual Avro and Parquet files (41% and 66%). In the sample data, the expanded Parquet files occupy less space than the Avro files.
Format | Internal Compression | Post-Compression | Size (KB) ⯆ | Ratio ⯅ |
---|---|---|---|---|
Parquet (v1) | None | LZMA | 14,588 | 81% |
Avro | None | LZMA | 14,555 | 81% |
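A minimal sketch of writing an uncompressed Avro file with the Java API, assuming a simplified candle schema and placeholder values (the real schema has more columns); the resulting files are then post-compressed with 7-Zip as described above.

```java
import java.io.File;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class WriteAvro {
    // Hypothetical candle schema for illustration only.
    private static final Schema SCHEMA = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Candle\",\"fields\":["
      + "{\"name\":\"ts\",\"type\":\"long\"},"
      + "{\"name\":\"open\",\"type\":\"int\"},"
      + "{\"name\":\"close\",\"type\":\"int\"}]}");

    public static void main(String[] args) throws IOException {
        try (DataFileWriter<GenericRecord> writer =
                 new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(SCHEMA))) {
            // No internal codec is set, so the file is written uncompressed
            // and post-compressed into the transport archive afterwards.
            writer.create(SCHEMA, new File("export/AAPL.avro"));

            GenericRecord candle = new GenericData.Record(SCHEMA);
            candle.put("ts", 1546300800L);
            candle.put("open", 15710);
            candle.put("close", 15722);
            writer.append(candle);
        }
    }
}
```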
Test data: Parquet (v2) with GZip
It turns out Parquet (version 2) files with GZip column compression yield an 81% compression ratio without any additional post-compression. This means these Parquet files can be ingested by Hadoop’s HDFS directly, without a pre-decompression step. The observed 81% compression is comparable to the 83% of the LZMA-compressed CSV files.
Internal Compression | Post-Compression | Size (KB) ⯆ | Ratio ⯅ |
---|---|---|---|
GZip | Store | 14,764 | 81% |
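A minimal sketch of producing such files with the parquet-avro Java API, assuming a simplified candle schema and placeholder paths. The two builder options that matter here are writer version 2 and the GZip codec; the exact builder API may differ slightly between Parquet releases.

```java
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.column.ParquetProperties;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

public class WriteParquetV2 {
    // Hypothetical candle schema for illustration only.
    private static final Schema SCHEMA = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Candle\",\"fields\":["
      + "{\"name\":\"ts\",\"type\":\"long\"},"
      + "{\"name\":\"close\",\"type\":\"int\"}]}");

    public static void main(String[] args) throws IOException {
        try (ParquetWriter<GenericRecord> writer =
                 AvroParquetWriter.<GenericRecord>builder(new Path("export/AAPL.parquet"))
                     .withSchema(SCHEMA)
                     // Version 2 data pages enable delta encoding (see Discussion).
                     .withWriterVersion(ParquetProperties.WriterVersion.PARQUET_2_0)
                     // GZip column compression, as in the table above.
                     .withCompressionCodec(CompressionCodecName.GZIP)
                     .build()) {
            GenericRecord candle = new GenericData.Record(SCHEMA);
            candle.put("ts", 1546300800L);
            candle.put("close", 15722);
            writer.write(candle);
        }
    }
}
```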
Results on full data
A cleaned export of real data into CSV format results in 146 GB (149,503 MB) of plain CSV files. When converted into the above transport formats, here are the results¹:
Format | Internal Compression | Post-Compression | Size (MB) ⯆ | Ratio ⯅ | Time (hrs) |
---|---|---|---|---|---|
CSV | – | – | 149,503 | 0% | – |
SQLite | – | Deflate | 58,327 | 61% | – |
Parquet (v1) | GZip | Store | 34,089 | 77% | 11.9 (11.1 + 0.8) |
CSV | – | LZMA | 32,877 | 78% | 13.8 (7.2 + 6.6) |
Parquet (v2) | GZip | Store | 25,632 | 83% | 14.0 (13.2 + 0.8) |
Full data: Compressed CSVs
On real data, it took 7.2 hours to export to plain CSVs and another 6.6 hours to compress all the files (GZip level 6) into a single transport zip file. The 13.8 hours of processing resulted in a 78% compression. CPU and heap activity during the run is shown below.
Internal Compression | Post-Compression | Size (MB) ⯆ | Ratio ⯅ | Time (hrs) |
---|---|---|---|---|
None | LZMA | 32,877 | 78% | 13.8 (7.2 + 6.6) |
Full data: Parquet (v1) with GZip
Exporting to Parquet v1 and then storing the files into a single archive took 11.9 hours and resulted in a 77% compression. CPU and heap activity is as follows.
Internal Compression | Post-Compression | Size (MB) ⯆ | Ratio ⯅ | Time (hrs) |
---|---|---|---|---|
GZip | Store | 34,089 | 77% | 11.9 (11.1 + 0.8) |
Full data: Parquet (v2) with GZip
Exporting to Parquet v2 and then storing all the files to a single archive took 14 hours and achieved an 83% compression. The total CPU and heap activity is shown below.
Internal Compression | Post-Compression | Size (MB) ⯆ | Ratio ⯅ | Time (hrs) |
---|---|---|---|---|
GZip | Store | 25,632 | 83% | 14.0 (13.2 + 0.8) |
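Because no pre-decompression step is needed, loading the copied Parquet files into Spark is straightforward. A sketch, with a hypothetical HDFS path and column names:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ReadParquetFromHdfs {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("candles")
            .getOrCreate();

        // Hypothetical HDFS location of the Parquet v2 files.
        Dataset<Row> candles = spark.read()
            .parquet("hdfs:///data/candles/*.parquet");

        // Column pruning: only the requested Parquet columns are read.
        candles.select("ts", "close").show(5);

        spark.stop();
    }
}
```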
Discussion
The default Parquet writer is version one, so what is the difference between Parquet version one and version two? Parquet version two uses delta encoding, which is extremely well suited to sorted timestamp columns. Instead of storing a full four-byte value for every entry in a timestamp column, only one or two bytes may be needed per entry once the differences between consecutive values are stored. For instance, the difference in timestamp seconds between two minute-candles is 60, which fits in a single byte. This explains the large compression advantage of Parquet version two: roughly 7 GB of additional savings over the compressed CSVs.
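As a rough illustration of the intuition (Parquet’s actual DELTA_BINARY_PACKED encoding additionally bit-packs blocks of deltas), the sketch below computes the deltas for a few hypothetical minute-candle timestamps:

```java
public class DeltaEncodingSketch {
    public static void main(String[] args) {
        // Epoch seconds for three consecutive one-minute candles (made-up values).
        long[] ts = {1546300800L, 1546300860L, 1546300920L};

        // Plain encoding stores each full value; delta encoding stores the
        // first value and then only the differences between neighbours.
        long previous = ts[0];
        for (int i = 1; i < ts.length; i++) {
            long delta = ts[i] - previous;   // 60 for minute candles
            previous = ts[i];
            System.out.println("delta=" + delta
                + " fits in one byte: " + (delta <= 0xFF));
        }
    }
}
```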
Notes:
- Times in parentheses denote the export time followed by the time to store the output into a single transport file. ↩