Storing Financial Time-Series Data Efficiently

Goal: Before acquiring and storing long-term financial time-series data, I need to settle on a database schema and understand the storage growth and the difficulty and cost of maintaining the database.

How large could the financial data grow, and what would it cost?

Each candle has OHLCV data plus a Unix timestamp in seconds. (Note: I chose this timestamp structure to aid in cleaning the candles without the need for a slow date interpreter; statistical languages like R and trading platforms like NinjaTrader or MetaTrader natively use human-readable date-time strings.) I go through some napkin math below.

Initial Schema

Here is the bare minimum needed to encode OHLCV data.

Timestamp   | Open         | High         | Low          | Close        | Volume
INT 4 bytes | REAL 8 bytes | REAL 8 bytes | REAL 8 bytes | REAL 8 bytes | INT 4 bytes

4B + (8B * 4) + 4B = 40 bytes/candle
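For concreteness, here is a minimal sketch of that table in SQLite via Python (the OneDay table name follows the period-in-the-name convention described below; the database file name is a placeholder):

```python
import sqlite3

# Minimal sketch of the bare-bones OHLCV table described above.
conn = sqlite3.connect("candles_prototype.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS OneDay (
        Timestamp INTEGER NOT NULL,  -- Unix timestamp in seconds
        Open      REAL    NOT NULL,
        High      REAL    NOT NULL,
        Low       REAL    NOT NULL,
        Close     REAL    NOT NULL,
        Volume    INTEGER NOT NULL
    )
""")
conn.commit()
conn.close()
```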

When I built my first prototype database I encoded the candle time period in the table name (e.g. OneDay, OneMinute). This saved space until I tried to clean the candles. It turns out that the broker will send incomplete period candles, so a start and an end timestamp are needed.

Better Schema

A start and end timestamp are needed, so the updated structure looks like the following.

Start       | End         | Open         | High         | Low          | Close        | Volume
INT 4 bytes | INT 4 bytes | REAL 8 bytes | REAL 8 bytes | REAL 8 bytes | REAL 8 bytes | INT 4 bytes

4B + 4B + (8B * 4) + 4B = 44 bytes/candle

Efficient Schema

The overwhelming majority of the people who have come before me recommend using integers instead of floats or reals (doubles). There are two very strong reasons for this: floating-point representations introduce rounding errors (especially when keeping accurate track of forex pips), and analysis software handles integers much faster. With SQLite, the size of an INT actually varies depending on the magnitude of the stored value. I also added a status column to help clean the candles.

Start       | End         | Open          | High          | Low           | Close         | Volume
INT 4 bytes | INT 4 bytes | INT 1~8 bytes | INT 1~8 bytes | INT 1~8 bytes | INT 1~8 bytes | INT 4 bytes

4B + 4B + (1~8B * 4) + 4B = between 16 bytes/candle and 44 bytes/candle

Until July 2019 I was using 8-byte REALs to represent price values since the broker’s API returns doubles. I’ve since converted my databases to use integers. The rest of this analysis uses the original REAL numbers.
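A rough sketch of what that conversion looks like, following the schema above (the 10,000 scale factor, the example prices, and the Status semantics are assumptions for illustration, not the exact values my databases use):

```python
import sqlite3

PRICE_SCALE = 10_000  # assumption: store prices as integer ten-thousandths to avoid float rounding

def to_int_price(price: float) -> int:
    """Convert a broker-supplied double into a scaled integer."""
    return round(price * PRICE_SCALE)

conn = sqlite3.connect("candles.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS OneMinute (
        Start  INTEGER NOT NULL,  -- Unix seconds, candle start
        End    INTEGER NOT NULL,  -- Unix seconds, candle end
        Open   INTEGER NOT NULL,  -- scaled integer prices; SQLite stores them
        High   INTEGER NOT NULL,  --   in 1~8 bytes depending on magnitude
        Low    INTEGER NOT NULL,
        Close  INTEGER NOT NULL,
        Volume INTEGER NOT NULL,
        Status INTEGER NOT NULL DEFAULT 0  -- placeholder flag used while cleaning candles
    )
""")
# Arbitrary example values, converted from doubles on the way in.
conn.execute(
    "INSERT INTO OneMinute VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    (1532523600, 1532523660,
     to_int_price(217.50), to_int_price(218.62),
     to_int_price(217.44), to_int_price(218.01), 125_000, 0),
)
conn.commit()
conn.close()
```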

Expected space increase per day

I established previously that (at this time) there are about 48,750 quotable symbols, but let’s round that to 50,000 symbols. How much storage will one day of candles require? Let’s assume there is at least one trade per day for each symbol and nothing is halted or delisted this day.

50,000 day candles * 44 bytes = 2,148 KB ≈ 2.1 MB per day

Worst-case minute-candle space per day

It is more challenging to estimate the space required for minute candles because not all symbols move every minute. A thinly-traded warrant might only have a handful of trades in the 390 minutes of in-hours trading (6.5 hours * 60 minutes = 390 minutes) whereas FB or AAPL may have up to 960 minute-candles due to pre-market and after-hours trading.

When Facebook’s stock crashed in the blink of an eye on July 25, 2018, from an all-time high of US$218 swiftly down to US$173.50, it traded after hours for a full 4 hours past the end of regular trading. I was watching the Time and Sales flash like Christmas lights. That’s an extra 240 minute-candles for this day, not to mention the pre-market activity.
(Image: Facebook after-hours stock crash on July 25, 2018)

It turns out popular stocks, especially those on the NASDAQ, like FB and AAPL start trading at 4 AM ET (5.5 hours early) and finish after-hours by 8 PM ET (4 hours after the bell) – that’s a full 16 hours of trading. This is what I have access to via my real-time API, anyway.

Let’s calculate the worst-case in-hours candle data space:

50,000 symbols * 390 candles/day * 44 bytes/candle ≈ 818 MB per day!

Estimated real minute-candle space per day

Let’s assume there isn’t a panic crash today and the average number of trades per symbol is 10% of the worst case, or 39 candles/day; then

50,000 symbols * 390 candles/day * 0.1 * 44 bytes/candle ≈ 82 MB per day
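These estimates are easy to reproduce; a quick sketch of the same napkin math:

```python
# Reproduce the storage napkin math above (all figures are estimates).
SYMBOLS = 50_000
BYTES_PER_CANDLE = 44
MINUTES_PER_SESSION = 390     # 6.5 in-hours trading hours * 60
ACTIVE_FRACTION = 0.10        # assumed fraction of minutes a symbol actually trades

MB = 1024 * 1024

day_candles = SYMBOLS * BYTES_PER_CANDLE / MB
worst_minute = SYMBOLS * MINUTES_PER_SESSION * BYTES_PER_CANDLE / MB
typical_minute = worst_minute * ACTIVE_FRACTION

print(f"Day candles:        {day_candles:6.1f} MB/day")     # ~2.1 MB
print(f"Worst-case minutes: {worst_minute:6.1f} MB/day")    # ~818 MB
print(f"10% estimate:       {typical_minute:6.1f} MB/day")  # ~82 MB
```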

The actual number fluctuates wildly day after day, but PINX stocks don’t move as much as AAPL. I started with an optimistic estimate of 84 MB/day as our storage requirement (82 MB + 2 MB). It will take time to know what the real average is.


Some time goes by…


Measured candle space per day

It’s good to start with estimates to get ballpark requirements. Now let’s actually calculate the data growth and see how close I came to the optimistic estimate. Using a 10-trading-day period from Monday, July 9th, 2018 at midnight until Friday, July 20th, 2018 at 23:59 ET, a helper program calculated this:

Candles added since 2018-07-09:
OneDay: 742203
OneMinute: 68556081

The space required, averaged over 10 trading days, then becomes:

One day: 742,203 day candles / 10 days * 44 bytes/candle ≈ 3.1 MB per day
One minute: 68,556,081 minute candles / 10 days * 44 bytes/candle ≈ 288 MB per day

It turns out my optimistic estimate of 10% was quite off [1]. From the above, the percentage of the worst case is closer to 35%. We can then say that the available quotable symbols “move” only about 35% of the time on average.
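A sketch of how such a helper could tally the growth, using the table and column names from the schema above (an illustration, not necessarily the exact query the helper runs):

```python
import sqlite3

BYTES_PER_CANDLE = 44
TRADING_DAYS = 10
SINCE = 1531108800  # 2018-07-09 00:00 ET as a Unix timestamp (approximate)

conn = sqlite3.connect("candles.db")

def added_since(table: str, since: int) -> int:
    """Count candles whose start timestamp falls on or after the cutoff."""
    (count,) = conn.execute(
        f"SELECT COUNT(*) FROM {table} WHERE Start >= ?", (since,)
    ).fetchone()
    return count

for table in ("OneDay", "OneMinute"):
    candles = added_since(table, SINCE)
    mb_per_day = candles / TRADING_DAYS * BYTES_PER_CANDLE / (1024 * 1024)
    print(f"{table}: {candles} candles, {mb_per_day:.1f} MB/day")

conn.close()
```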

The reason there are more day candles than estimated (3.1 MB vs. 2.1 MB) is how my cleaning algorithm works: cleaned candles are sometimes assembled from partial candles. That means there may be multiple day candles for a given day, but not all of them span a full 24-hour period.
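Roughly, assembling one candle from partials means taking the first open, the last close, the extreme high/low, and the summed volume; a sketch of the general idea rather than the exact cleaning code:

```python
from typing import NamedTuple

class Candle(NamedTuple):
    start: int
    end: int
    open: int
    high: int
    low: int
    close: int
    volume: int

def merge(partials: list[Candle]) -> Candle:
    """Assemble one candle from partial candles covering adjacent spans."""
    parts = sorted(partials, key=lambda c: c.start)
    return Candle(
        start=parts[0].start,
        end=parts[-1].end,
        open=parts[0].open,
        high=max(c.high for c in parts),
        low=min(c.low for c in parts),
        close=parts[-1].close,
        volume=sum(c.volume for c in parts),
    )
```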

Estimated storage space per year

There are about 252 regular U.S. trading days per year.

365.25 (avg. days per year) * 5/7 (5 trading days a week) = 260.89; 260.89 – 9 (U.S. holidays) = 251.89 days ≈ 252 days

Canada trades on some U.S. holidays, and the U.S. trades on some Canadian holidays, but let’s use the 252 trading days a year. The napkin math then becomes:

252 days * 291 MB/day (288 MB + 3.1 MB, rounded) = 73,332 MB/year ≈ 72 GB per year!

Assuming equal turnover of symbols (which is not the case [2]), we would need to be prepared to store about 72 GB of candle data per year over and above storing all the historic data we want.

Estimated AWS storage cost per year

Let’s assume we have one year of historic data, or 72 GB of clean OHLCV candle data, already. Where exactly does that data reside? A local MySQL database? Would SQLite work? How about Amazon AWS’s RDS solution?

(Image: Storing candle data… AWS RDS or SQLite?)

If we are just accumulating candles each day for storage, then it requires a minimum of 33 database-hours per week. If we want to accumulate and analyze the candles, then at least 88 database-hours per week are needed:

6.5 hours/day * 5 days/week = 32.5 hours/week ≈ 33 hours/week (storage)
(24 hours/day – 6.5 hours/day) * 5 days/week = 87.5 hours/week ≈ 88 hours/week (storage + analysis)

With a 72 GB general-purpose SSD [3] and 293 MB/day inbound (free), the AWS costs are:

DB Instance                   | 33 hours/week | 88 hours/week
db.t2.medium (2 CPUs, 2 GB)   | US$15.64/mo   | US$31.76/mo
db.t2.xlarge (4 CPUs, 16 GB)  |               | US$109.07/mo

Say you have 2 years of historic data, or 144 GB:

DB Instance                   | 33 hours/week | 88 hours/week
db.t2.medium (2 CPUs, 2 GB)   | US$23.92/mo   | US$40.04/mo
db.t2.xlarge (4 CPUs, 16 GB)  |               | US$117.35/mo

Say you have 5 years of historic data, or 360 GB:

DB Instance                   | 33 hours/week | 88 hours/week
db.t2.medium (2 CPUs, 2 GB)   | US$71.76/mo   | US$100.73/mo
db.t2.xlarge (4 CPUs, 16 GB)  |               | US$165.19/mo
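For reference, the shape of these monthly figures is simply instance-hours plus provisioned storage. A sketch of that cost model with placeholder rates (the rates below are assumptions for illustration, not quoted AWS prices; plug in current RDS pricing):

```python
# Rough shape of a monthly RDS bill: instance-hours plus provisioned storage.
WEEKS_PER_MONTH = 52 / 12

def monthly_cost(hourly_rate: float, hours_per_week: float,
                 storage_gb: float, gb_month_rate: float) -> float:
    compute = hourly_rate * hours_per_week * WEEKS_PER_MONTH
    storage = gb_month_rate * storage_gb
    return compute + storage

storage_hours = 6.5 * 5          # ~33 database-hours/week (accumulate only)
analysis_hours = (24 - 6.5) * 5  # ~88 database-hours/week (accumulate + analyze)

# Example: a hypothetical US$0.07/hour instance with 72 GB at US$0.115/GB-month.
print(round(monthly_cost(0.07, storage_hours, 72, 0.115), 2))
print(round(monthly_cost(0.07, analysis_hours, 72, 0.115), 2))
```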

This doesn’t include the hardware running the TA algorithms or where our intermediate data (for indicators) resides. With this approach, we would most likely be losing money on both trades and database usage fees. While convenient, let’s not use AWS or any cloud-hosted solution.


Distributed local SQLite databases

There are advantages to using SQLite databases over monolithic databases: speed, transportability, and scalability. In my setup, the combined candles DB for AAPL is under 10 MB. If I need to perform heavy analysis on tech alpha-dog Apple, I can load the entire candle DB into a SQLite in-memory store. If I need to see how other NASDAQ securities move in relation and/or response to Apple, then the roughly 3,300 other securities can be loaded into distributed memory; it’s ballpark 30 GB of RAM to hold it all.
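Loading one of these per-symbol databases into memory is straightforward with SQLite’s online backup API; a sketch (the candles/ file layout is a placeholder for however the per-symbol databases are stored):

```python
import sqlite3

def load_into_memory(path: str) -> sqlite3.Connection:
    """Copy an on-disk per-symbol candle database into an in-memory SQLite store."""
    disk = sqlite3.connect(path)
    mem = sqlite3.connect(":memory:")
    disk.backup(mem)   # SQLite online backup API (Python 3.7+)
    disk.close()
    return mem

# e.g. heavy analysis on AAPL without touching the on-disk copy
aapl = load_into_memory("candles/AAPL.db")
print(aapl.execute("SELECT COUNT(*) FROM OneMinute").fetchone())
```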

Tip: Use the noatime mount option in /etc/fstab for the mounted folder holding the SQLite databases to reduce file-metadata activity and speed up I/O.

Small, individual databases can be replicated across clusters on file systems such as GFS, GlusterFS, and Hadoop’s HDFS. This scales well on inexpensive hardware too. I just wouldn’t know where to begin with a monolithic 360 GB MySQL database entrenched in a standalone file system while keeping cost down.

Conclusion: Having a single 360 GB MySQL database doesn’t make sense. Optimizing it, backing it up, creating replicas, or getting into parallel processing with concurrency raises all kinds of challenges. It’s also expensive if hosted in a cloud environment.

Update 2018-10: It is better not to use a local SSD to store real-time stock data as it comes in over the wire, or to post-process data on one. I came to this conclusion after 8 months and just 15 TB of writes (on a less popular SSD brand):

(Image: SSD about to fail)

I now use RAID-1 HDDs to record daily candles.

Notes:

  1. I actually just guessed because I was going to calculate it anyway.
  2. “One trend that needs to be highlighted is the number of listed companies in the US: over the past 20-years from 1996 to 2016 they have been falling by half from approx 7,322 to 3,671” Source: WorldBank.org
  3. Do not use an SSD to record real-time stock data. See the 2018-10 update above.