# Acquiring Candle Data for Quant Research

This would make a good interview question:

- There are about 120,000 public North American securities, bonds, rights, and index symbols.
- You have a paid API that can access all of them in OHLCV
^{1}format if they are quotable. - There are two critical API constraints
^{2}:- 15,000 calls per hour
- 30 calls per second

**Goal:**Design an algorithm that efficiently gets the all the daily trading candles for quantitative financial research.

## Napkin math

Minute candles are the most valuable, followed by day candles. Other time scale candles can be assembled from these without the need for unnecessary storage. Without any other information, the number of daily API calls would be:

120,000 symbols x 2 = 240,000 API calls needed

We are only allowed 15,000 calls per hour, so

240,000 calls / 15,000 calls/hr = 16 hours

It looks like the worst case time frame is 16 hours per day of constantly calling the broker API endpoint to get the candle data for every North American symbol. A trading day goes from 09:30 until 16:00 Toronto time – 6.5 hours of trading^{3}. If we start the API call algorithm right at EOD, then 16:00 plus 16 hours means we would have all the previous day data by 08:00 the next morning. In theory with this approach we will not have a data deficit. BUT, there will only be a scant 1.5 hours to perform technical analysis before placing the new day’s orders.

## Single-threaded or multi-threaded?

Can we get away with a synchronous call algorithm? Let’s see.

15,000 calls/hr ≈ 4.16 calls/s ≈ 240 ms/call

For a single-threaded for-loop to succeed, each API call, including the overhead of DB storage, can take at most 240ms. I can tell you that this is an unreasonable assumption because:

- A security with more trades
^{4}has a bigger JSON payload which increases transit time - The broker’s servers go down for daily maintenance
- There are weekly server problems
- The downtime varies
- The broker does
**not**gzip their API responses^{5}

From experience, some requests take over a second to process. The proposed algorithm must make asynchronous API calls.

## Do we really need to make 240,000 API calls?

When I built my production algorithm, I did the math and it seems extreme to make 240,000 API calls per day. So, how to know which symbols to call and which to defer or ignore altogether? The symbols have flags to indicate if they are quotable and/or tradable (and much more). I wrote a helper class to enumerate the Cartesian products of the combination of types of symbols. Here is a high-level breakdown to the types of symbols:

Number of symbols: 122857

Number of quotable symbols: 48753Number of non-quotable symbols: 74853 ← %61!

Number of only-quotable symbols: 19287

Number of tradable symbols: 29466

About 61% of the symbols are non-quotable. The breakdown of these symbols are:

Security Type | Count |
---|---|

Currency | 4 |

-Not set- | 38 |

Commodity | 66 |

Right | 3351 |

Mutual Fund | 8367 |

Bond | 11077 |

Index | 11891 |

Stock | 40084 |

All kinds of securities are not tradable. Why? Drilling down into the stocks in the breakdown reveals that all of them are no longer trading! Did they go bust? Maybe. It turns out that there is a lot of symbol turnover through name changes and delistings due to mergers, acquisitions, bankruptcies, and other reasons. At this moment in time, there are 48,753 quotable symbols.

48,753 symbols x 2 = 97,506 API calls needed

97,506 calls / 15,000 calls/hr =6.5 hours vs. 16 hours

We’re now down to about 6.5 hours of API calls.

## Do all symbols have equal priority?

We’ve established that we can process all the quotable symbols in about 6.5 hours. If the market closes at 16:00 Toronto time, then we can only start technical analysis from 22:30. AWS Labmda^{6} processing time is least expensive at night, so this is a silver lining. However, the next question to ask is if we need all the symbols before we start TA. If the answer is no, then which symbols can be deferred or excluded?

Here is a breakdown of the quotable symbols by exchange:

Quotable Security Exchange | Count |
---|---|

OPRAI | 2 |

OTCBB | 34 |

TSXI | 47 |

NEO | 82 |

S&P | 259 |

⋮ | ⋮ |

TSXV | 2150 |

TSX | 2332 |

NYSE | 3253 |

NASDAQ | 3393 |

NASDAQI | 15033 |

PINX | 15999 |

Serious day and swing traders don’t bother with PINX, and the NASDAQI are niche indexes that aren’t as valuable as bigger indexes. Deferring them until the late evening or weekend saves about 31,000 x 2 API calls.

Number of quotable symbols excluding PINX and NASDAQI:

17648

The napkin math then becomes:

17,648 symbols x 2 = 35,296 API calls needed

35,296 calls / 15,000 calls/hr =2.35 hours vs. 16 hours

Above is a diagnostics chart (with JFreeChart) that shows the real progress of the API calls and how outdated the symbols are. The green lines are symbols excluding PINX and NASDAQI, and the red lines are those including them. The yellow band is the trading period. The green band is an approximate after-hours trading band. The red band is when the broker servers are notorious for going down. This real acquisition of all the quotable 1-minute and 1-day candles completed in about 6.5 hours as expected.

**Conclusion:**In under two and a half hours an algorithm should be able acquire the most valuable one-minute and one-day candles. This means by about 18:30 Toronto time we can begin mass technical analysis. While that is happening by another algorithm, the first algorithm can go back and get the Pinx and NASDAQI symbols for completeness.

**Next:** The next step is to figure out the storage footprint, database architecture, and how to deal with multi-threaded concurrency of the databases: Storing Stock Candle Data Efficiently

Notes:

- OHLCV = Open, High, Low, Close, Volume ↩
- https://www.questrade.com/api/documentation/rate-limiting. ↩
- 6.5 hours of in-hours trading that does not include pre-market and after-hours trading. For example, when Facebook crashed on July 25, 2018, after-hours trading went on until 20:00 Toronto time! ↩
- For example, AAPL or DJX ↩
- I’ve called API support on this, and they refuse to implement gzip because of the extra CPU power required ↩
- https://aws.amazon.com/lambda/ ↩