On Market Data Sampling.

HangukQuant

May 05, 2025


A reminder that access to prediction market arbitrage lectures will be closed in a week:

Prediction Market Arbitrage (released, batch 1)


This post covers various ways to sample tick data so that the resulting price series are more amenable to modelling. I am currently on leave, so I will try to be as concise and to the point as possible.

The most popular tick data sampling method is time-based OHLCV sampling, often known as candlestick or kline data. Many traders rely on time-based sampling for their trading strategies. This is evident, for instance, among momentum-based traders across various timeframes, such as:

a large volume of traders submitting orders around 0 UTC based on the previous day's trend, or
correlated trades of market participants around popular ‘crossovers’, such as the 20/50 day moving averages.

But time-based sampling has critical issues, in particular:

the choice of time axis is arbitrary, and market ‘speed’ is time-varying (derived effects, like volatility, are not constant in time space)
market variables (e.g. price changes) are not well behaved, so they are more difficult to model and/or violate the assumptions of useful models.

For example, a least squares regression does not inherently require errors to be Gaussian, but the hypothesis tests on its coefficients rely, albeit weakly, on the assumption of Gaussian errors. From one end of the spectrum of assumptions (random sample) to the other (IID, Gaussian), the better the data behaves, the easier it is to work with.

It is only fair that we consider sampling schemes that are less arbitrary than simply binning in time space, so that the sampled returns are closer to IID Gaussian.

God forbid, we must remember that they never will be, but what is the alternative? Should we assume a Paretian distribution with infinite variance and read tarot cards?

Since Mandelbrot’s work, many quants have explored sampling data in a manner that preserves some structural invariance, looking at statistical exhibits across volatility, time, dollar, cost and you-name-it space.

One of the most popular works to bring this into common knowledge is de Prado’s wildly famous book Advances in Financial Machine Learning (AFML). You will find many interesting topics, and some lofty ones like quantum computers…you don’t have to eat everything at the buffet to enjoy one.

We will briefly discuss some of these sampling variants, and how to use quantpylib to seamlessly receive these ‘candles’ as data feeds. Quantpylib supports [t-o-h-l-c-v-n-vwap-T].

Tick Bars

Sample by tick count on your trade feed: close a bar every N ticks. It is important to ensure that tick count reporting is consistent across data sources, taking into account things like auctions.
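To make the idea concrete, here is a minimal, library-agnostic sketch in plain Python. It is not quantpylib's API: the `Trade` tuple, the `tick_bars` function and the bar field names are hypothetical names chosen for illustration, and a trailing incomplete bar is simply dropped.

```python
from typing import Iterable, Iterator, NamedTuple


class Trade(NamedTuple):
    """Hypothetical trade record, for illustration only."""
    ts: float     # trade timestamp, any monotonic unit
    price: float
    size: float


def tick_bars(trades: Iterable[Trade], ticks_per_bar: int) -> Iterator[dict]:
    """Yield one OHLCV bar for every `ticks_per_bar` trades."""
    bucket = []
    for trade in trades:
        bucket.append(trade)
        if len(bucket) == ticks_per_bar:
            prices = [t.price for t in bucket]
            yield {
                "t": bucket[0].ts,                      # bar open time
                "open": prices[0],
                "high": max(prices),
                "low": min(prices),
                "close": prices[-1],
                "volume": sum(t.size for t in bucket),
                "n": len(bucket),                       # number of ticks in the bar
            }
            bucket = []
    # Any trailing partial bucket is dropped: that bar is not complete yet.


trades = [Trade(float(i), p, 1.0)
          for i, p in enumerate([100, 101, 99, 102, 103, 104, 100])]
bars = list(tick_bars(trades, ticks_per_bar=3))
```

Seven trades with three ticks per bar give two complete bars; the seventh trade waits for the next bar.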

In quantpylib:

Volume Bars

Tick bars depend on arbitrary factors such as the sizes of the resting orders: one aggressive order filled against several resting orders prints as several ticks. Circumvent this order fragmentation by sampling in units of contracts traded.
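A minimal sketch of the fragmentation point, again in plain Python rather than quantpylib. The `volume_bars` function and the `(price, size)` trade layout are assumptions for illustration; any overshoot past the threshold stays in the closing bar instead of being split across bars.

```python
from typing import Iterable, Iterator, Tuple


def volume_bars(trades: Iterable[Tuple[float, float]],
                contracts_per_bar: float) -> Iterator[dict]:
    """trades are (price, size); close a bar once cumulative size hits the threshold."""
    prices, volume = [], 0.0
    for price, size in trades:
        prices.append(price)
        volume += size
        if volume >= contracts_per_bar:
            yield {
                "open": prices[0],
                "high": max(prices),
                "low": min(prices),
                "close": prices[-1],
                "volume": volume,
                "n": len(prices),   # ticks that went into this bar
            }
            prices, volume = [], 0.0


# The same flow, printed as two 10-lot fills or as twenty 1-lot fills.
whole = [(100.0, 10.0), (101.0, 10.0)]
fragmented = [(100.0, 1.0)] * 10 + [(101.0, 1.0)] * 10

bars_whole = list(volume_bars(whole, contracts_per_bar=10))
bars_fragmented = list(volume_bars(fragmented, contracts_per_bar=10))
```

Both feeds produce two bars with the same closes, even though the tick counts inside each bar differ by a factor of ten. Tick bars would have sampled the fragmented feed ten times as often.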

Dollar Bars

Prices are not stationary over time: contracts roll, instruments split, and prices do funny things. Use dollar volumes!
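A small sketch of why dollar volume helps, using a hypothetical 2-for-1 split. As before this is plain Python, not quantpylib; `dollar_bars` and the `(price, size)` layout are illustrative assumptions.

```python
from typing import Iterable, Iterator, Tuple


def dollar_bars(trades: Iterable[Tuple[float, float]],
                dollars_per_bar: float) -> Iterator[dict]:
    """trades are (price, size); close a bar once traded notional hits the threshold."""
    prices, volume, dollars = [], 0.0, 0.0
    for price, size in trades:
        prices.append(price)
        volume += size
        dollars += price * size
        if dollars >= dollars_per_bar:
            yield {
                "open": prices[0],
                "close": prices[-1],
                "volume": volume,
                "dollars": dollars,
            }
            prices, volume, dollars = [], 0.0, 0.0


# Same notional flow before and after a hypothetical 2-for-1 split:
# the price halves and the share count doubles.
pre_split = [(100.0, 10.0)] * 4
post_split = [(50.0, 20.0)] * 4

bars_pre = list(dollar_bars(pre_split, dollars_per_bar=2000))
bars_post = list(dollar_bars(post_split, dollars_per_bar=2000))
```

The bar count is the same on both sides of the split, while the share volume per bar doubles. A fixed volume threshold would have sampled the post-split feed twice as often.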

(Signed) Tick, Volume and Dollar bars.

Sample based on the cumulative imbalance of trade ticks, rather than their absolute values.
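One way to read this, sketched in plain Python: accumulate the signed measure (ticks, contracts or dollars) and close a bar when its absolute value reaches a threshold. The aggressor `side` field (+1 buy, -1 sell), the `signed_bars` function and the fixed threshold are all assumptions for illustration, not quantpylib's interface.

```python
from typing import Iterable, Iterator, Tuple

# (price, size, side), where side is +1 for a buy aggressor and -1 for a sell aggressor.
SignedTrade = Tuple[float, float, int]

MEASURES = {
    "tick": lambda price, size: 1.0,
    "volume": lambda price, size: size,
    "dollar": lambda price, size: price * size,
}


def signed_bars(trades: Iterable[SignedTrade], threshold: float,
                kind: str = "tick") -> Iterator[dict]:
    """Close a bar when |cumulative signed measure| reaches `threshold`."""
    measure = MEASURES[kind]
    prices, net = [], 0.0
    for price, size, side in trades:
        prices.append(price)
        net += side * measure(price, size)
        if abs(net) >= threshold:
            yield {
                "open": prices[0],
                "close": prices[-1],
                "n": len(prices),
                "imbalance": net,   # signed total that triggered the bar
            }
            prices, net = [], 0.0


trades = [
    (100.0, 1.0, +1), (100.0, 1.0, -1), (101.0, 1.0, +1), (102.0, 1.0, +1),
    (101.0, 1.0, -1), (100.0, 1.0, -1), (99.0, 1.0, -1),
]
bars = list(signed_bars(trades, threshold=2, kind="tick"))
```

The first bar needs four ticks because a buy and a sell cancel out; the second closes after two consecutive sells. Two-sided flow stretches bars, one-sided flow shortens them.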

Imbalance Bars

AFML describes several kinds of ‘imbalance bars’; here is the tick imbalance one. Roughly,

b_t: sign of the price change from the previous tick (if any; with no price change, the previous sign carries over)

T: random variable for dynamic bar duration
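For reference, these are the tick imbalance relations as they appear in AFML. They are restated from the book, not recovered from this post, so treat the exact notation as an assumption.

```latex
\begin{align}
\theta_T &= \sum_{t=1}^{T} b_t \\
\mathbb{E}_0[\theta_T] &= \mathbb{E}_0[T]\,\bigl(P[b_t = 1] - P[b_t = -1]\bigr)
                        = \mathbb{E}_0[T]\,\bigl(2P[b_t = 1] - 1\bigr) \\
T^{*} &= \arg\min_{T}\,\Bigl\{\, \lvert \theta_T \rvert \;\ge\; \mathbb{E}_0[T]\,\bigl\lvert 2P[b_t = 1] - 1 \bigr\rvert \,\Bigr\}
\end{align}
```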

The expectation term holds because P(b_t = -1) + P(b_t = 1) = 1.

The argument is that when theta is more imbalanced than expected, a low value of T satisfies the condition and sampling is accelerated. This happens during directional trading, when informed traders cross the spread, so each candle is roughly ‘information-content’ invariant.
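A naive sketch of this rule in plain Python, to see the acceleration at work. Several choices here are my assumptions rather than anything prescribed above: the first tick gets sign +1, both expectations are exponentially weighted averages updated at each bar close, and `init_ticks`, `init_imbalance` and `alpha` are hypothetical seed parameters.

```python
from typing import List, Sequence


def tick_rule(prices: Sequence[float]) -> List[int]:
    """b_t: sign of the price change; an unchanged price carries the last sign."""
    signs, prev_price, prev_sign = [], None, 1   # first tick defaults to +1 (assumption)
    for p in prices:
        if prev_price is not None and p != prev_price:
            prev_sign = 1 if p > prev_price else -1
        signs.append(prev_sign)
        prev_price = p
    return signs


def tick_imbalance_bars(prices: Sequence[float], init_ticks: float = 20.0,
                        init_imbalance: float = 0.5, alpha: float = 0.1) -> List[int]:
    """Return the indices into `prices` at which each bar closes."""
    exp_ticks = init_ticks      # running estimate of E0[T]
    exp_imb = init_imbalance    # running estimate of 2 P[b_t = 1] - 1
    closes, theta, n = [], 0, 0
    for i, b in enumerate(tick_rule(prices)):
        theta += b
        n += 1
        if abs(theta) >= exp_ticks * abs(exp_imb):
            closes.append(i)
            # Update both expectations from the bar that just closed.
            exp_ticks = (1 - alpha) * exp_ticks + alpha * n
            exp_imb = (1 - alpha) * exp_imb + alpha * (theta / n)
            theta, n = 0, 0
    return closes


trend = list(range(20))  # a one-way market: every tick is an uptick

# Imbalance exactly as expected: bars last the expected 5 ticks.
steady = tick_imbalance_bars(trend, init_ticks=5, init_imbalance=1.0, alpha=0.5)

# Imbalance far above the expected 0.25: the first bar closes after 3 ticks, not 10.
fast = tick_imbalance_bars(trend, init_ticks=10, init_imbalance=0.25, alpha=0.5)
```

Note the feedback loop: short bars pull the estimate of E0[T] down, which lowers the threshold for the next bar. Whenever the product of the two estimates is 1 or less, a single tick is enough to close a bar.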

Some Modifications

Admittedly, the formula is very nice and has elegant interpretations. For practical purposes though, we would need to make some changes. For instance, this implementation is likely to collapse to single tick bars…

HangukQuant Research