Acquiring Candle Data for Quantitative Financial Analysis Research

EricJune 20, 2018August 25, 2019

This would make a good interview question:

There are about 120,000 public North American securities, bonds, rights, and index symbols.
You have a paid API that can access all of them in OHLCV¹ format if they are quotable.
There are two critical API constraints²:
- 15,000 calls per hour
- 20 calls per second

Goal: Design an algorithm that efficiently gets all the daily trading OHLCV candles for quantitative financial analysis research.

Napkin math

Minute candles are the most valuable, followed by day candles. Other time scale candles can be assembled from these without the need for unnecessary storage. Without any other information, the number of daily API calls would be:

120,000 symbols x 2 = 240,000 API calls needed

We are only allowed 15,000 calls per hour, so

240,000 calls / 15,000 calls/hr = 16 hours

It looks like the worst case time frame is 16 hours per day of constantly calling the broker API endpoint to get the candle data for every North American symbol. A trading day goes from 09:30 until 16:00 Toronto time – 6.5 hours of trading³. If we start the API call algorithm right at EOD, then 16:00 plus 16 hours means we would have all the previous day data by 08:00 the next morning. In theory with this approach, we will not have a data deficit. BUT, there will only be a scant 1.5 hours to perform technical analysis before placing the new day’s orders.

Stock API candle calls — Sample output of stock API candle calls

Single-threaded or multi-threaded?

Can we get away with a synchronous call algorithm? Let’s see.

15,000 calls/hr ≈ 4.16 calls/s ≈ 240 ms/call

For a single-threaded for-loop to succeed, each API call, including the overhead of DB storage, can take at most 240ms. I can tell you that this is an unreasonable assumption because:

A security with more trades⁴ has a bigger JSON payload which increases transit time
The broker’s servers go down for daily maintenance
There are weekly server problems
The downtime varies
The broker does not gzip their API responses⁵

From experience, some requests take over a second to process. The proposed algorithm must make asynchronous API calls.

Do we really need to make 240,000 API calls?

When I built my production algorithm, I did the math and it seems extreme to make 240,000 API calls per day. So, how to know which symbols to call and which to defer or ignore altogether? The symbols have flags to indicate if they are quotable and/or tradable (and much more). I wrote a helper class to enumerate the Cartesian products of the combination of types of symbols. Here is a high-level breakdown to the types of symbols:

Number of symbols: 122857
Number of quotable symbols: 48753
Number of non-quotable symbols: 74853 ← %61!
Number of only-quotable symbols: 19287
Number of tradable symbols: 29466

About 61% of the symbols are non-quotable. The breakdown of these symbols are:

Security Type	Count
Currency	4
-Not set-	38
Commodity	66
Right	3351
Mutual Fund	8367
Bond	11077
Index	11891
Stock	40084

All kinds of securities are not tradable. Why? Drilling down into the stocks in the breakdown reveals that all of them are no longer trading! Did they go bust? Maybe. It turns out that there is a lot of symbol turnover through name changes and delistings due to mergers, acquisitions, bankruptcies, and other reasons. At this moment in time, there are 48,753 quotable symbols.

48,753 symbols x 2 = 97,506 API calls needed
97,506 calls / 15,000 calls/hr = 6.5 hours vs. 16 hours

We’re now down to about 6.5 hours of API calls.

Do all symbols have equal priority?

We’ve established that we can process all the quotable symbols in about 6.5 hours. If the market closes at 16:00 Toronto time, then we can only start technical analysis from 22:30. AWS Labmda⁶ processing time is least expensive at night, so this is a silver lining. However, the next question to ask is if we need all the symbols before we start TA. If the answer is no, then which symbols can be deferred or excluded?

Here is a breakdown of the quotable symbols by exchange:

Quotable Security Exchange	Count
OPRAI	2
OTCBB	34
TSXI	47
NEO	82
S&P	259
⋮	⋮
TSXV	2150
TSX	2332
NYSE	3253
NASDAQ	3393
NASDAQI	15033
PINX	15999

Serious day and swing traders don’t bother with PINX, and the NASDAQI are niche indexes that aren’t as valuable as bigger indexes. Deferring them until the late evening or weekend saves about 31,000 x 2 API calls.

Number of quotable symbols excluding PINX and NASDAQI: 17648

The napkin math then becomes:

17,648 symbols x 2 = 35,296 API calls needed
35,296 calls / 15,000 calls/hr = 2.35 hours vs. 16 hours

Above is a diagnostics chart (with JFreeChart) that shows the real progress of the API calls and how outdated the symbols are. The green lines are symbols excluding PINX and NASDAQI, and the red lines are those including them. The yellow band is the trading period. The green band is an approximate after-hours trading band. The red band is when the broker servers are notorious for going down. This real acquisition of all the quotable 1-minute and 1-day candles completed in about 6.5 hours as expected.

Conclusion: In under two and a half hours an algorithm should be able acquire the most valuable one-minute and one-day candles. This means by about 18:30 Toronto time we can begin mass technical analysis. While that is happening by another algorithm, the first algorithm can go back and get the Pinx and NASDAQI symbols for completeness.

Next: The next step is to figure out the storage footprint, database architecture, and how to deal with multi-threaded concurrency of the databases: Storing Stock Candle Data Efficiently

Notes:

OHLCV = Open, High, Low, Close, Volume ↩
https://www.questrade.com/api/documentation/rate-limiting. ↩
6.5 hours of in-hours trading that does not include pre-market and after-hours trading. For example, when Facebook crashed on July 25, 2018, after-hours trading went on until 20:00 Toronto time! ↩
For example, AAPL or DJX ↩
I’ve called API support on this, and they refuse to implement gzip because of the extra CPU power required ↩
https://aws.amazon.com/lambda/ ↩