Code Walkthrough for the Alpha Simulator (for Programming Beginners)

HangukQuant
Sep 07, 2023

As we advance into our third year on this blog, it is dawning on me that many readers are getting left behind. The biggest concern by far is the complexity of the current Russian Doll model, and not being sure how to proceed with the statistical suite presented therein, together with the formulaic alphas.

Although I had intended to take the Russian Doll further with integrated portfolio optimization methods, I think this is a fair time to step back and walk through the evolution to where we are today.

So the next series of lectures will be a lesson in history: how we arrived at the powerful Python module we have today. We will begin with a vanilla, random-signal-generating model and walk you through the individual lines of code, as well as the design choices and logical behaviour. This should get you up to speed before we charge ahead with more technical implementations for portfolio management. I am also creating video lectures and coding walkthroughs, so if you would rather watch in video format, you may skip this post. The lectures are due next month.

We will begin with the code and walk you through it line by line. We will also make comments throughout the guide and give tips on building larger-scale systems, so you can build your own application scaling to more than tens of thousands of lines of code. (The code files are downloadable at the bottom of the post.)

import pytz
import yfinance
import requests
import threading
import pandas as pd
from io import StringIO
from datetime import datetime
from bs4 import BeautifulSoup

def get_sp500_tickers():
    res = requests.get("https://en.wikipedia.org/wiki/List_of_S%26P_500_companies")
    soup = BeautifulSoup(res.content, "html.parser")  # name the parser explicitly
    table = soup.find_all("table")[0]  # the constituents table is the first <table> on the page
    df = pd.read_html(StringIO(str(table)))  # StringIO: pandas deprecates passing literal HTML strings
    tickers = list(df[0].Symbol)
    return tickers

def get_history(ticker, period_start, period_end, granularity="1d", tries=0):
    try:
        df = yfinance.Ticker(ticker).history(
            start=period_start,
            end=period_end,
            interval=granularity,
            auto_adjust=True  # adjust prices for splits and dividends
        ).reset_index()
    except Exception as err:
        if tries < 5:
            return get_history(ticker, period_start, period_end, granularity, tries+1)  # retry up to 5 times
        return pd.DataFrame()  # give up and return an empty frame the caller can filter out
    
    df = df.rename(columns={
        "Date":"datetime",
        "Open":"open",
        "High":"high",
        "Low":"low",
        "Close":"close",
        "Volume":"volume"
    })
    if df.empty:
        return pd.DataFrame()
    
    df["datetime"] = df["datetime"].dt.tz_localize(pytz.utc)
    df = df.drop(columns=["Dividends", "Stock Splits"])
    df = df.set_index("datetime",drop=True)
    return df

def get_histories(tickers, period_starts,period_ends, granularity="1d"):
    dfs = [None]*len(tickers)
    def _helper(i):
        print(tickers[i])
        df = get_history(
            tickers[i],
            period_starts[i], 
            period_ends[i], 
            granularity=granularity
        )
        dfs[i] = df
    threads = [threading.Thread(target=_helper,args=(i,)) for i in range(len(tickers))]
    for thread in threads: thread.start()
    for thread in threads: thread.join()
    #for i in range(len(tickers)): _helper(i) #can replace the 3 preceding lines for sequential polling
    tickers = [tickers[i] for i in range(len(tickers)) if not dfs[i].empty]
    dfs = [df for df in dfs if not df.empty]
    return tickers, dfs

def get_ticker_dfs(start,end):
    from utils import load_pickle,save_pickle
    try:
        tickers, ticker_dfs = load_pickle("dataset.obj")
    except Exception as err:
        tickers = get_sp500_tickers()
        starts=[start]*len(tickers)
        ends=[end]*len(tickers)
        tickers,dfs = get_histories(tickers,starts,ends,granularity="1d")
        ticker_dfs = {ticker:df for ticker,df in zip(tickers,dfs)}
        save_pickle("dataset.obj", (tickers,ticker_dfs))
    return tickers, ticker_dfs 

from utils import Alpha
period_start = datetime(2010,1,1, tzinfo=pytz.utc)
period_end = datetime.now(pytz.utc)
tickers, ticker_dfs = get_ticker_dfs(start=period_start,end=period_end)
alpha = Alpha(insts=tickers,dfs=ticker_dfs,start=period_start,end=period_end)
df = alpha.run_simulation()
print(df)

The entry point begins with

period_start = datetime(2010,1,1, tzinfo=pytz.utc)
period_end = datetime.now(pytz.utc)

The most pertinent issue here is the setting of a timezone. Throughout the application you may be using data from different vendors and/or different time alignments, such as the New York or Tokyo timezones. A timezone-less datetime is ambiguous as to when the data was actually sampled; particularly when you are working with intraday data, not knowing the timezone in which the time series was sampled invites lookahead bias and other deadly errors. Here we are working with daily data, so we can safely standardise on UTC, and we shall be consistent across the whole application so that we can make fair comparisons across application logic.
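As a quick illustration (an aside, not part of the simulator code), here is how an exchange-local timestamp maps to one unambiguous UTC instant; note that pytz requires localize() rather than passing tzinfo directly:

naive = datetime(2023, 9, 7, 9, 30)                      # ambiguous: 09:30 on whose clock?
ny = pytz.timezone("America/New_York").localize(naive)   # pytz wants localize(), not tzinfo=
print(ny.astimezone(pytz.utc))                           # 2023-09-07 13:30:00+00:00

The next line follows: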

tickers, ticker_dfs = get_ticker_dfs(start=period_start,end=period_end)

triggering

def get_ticker_dfs(start,end):
    from utils import load_pickle,save_pickle
    try:
        tickers, ticker_dfs = load_pickle("dataset.obj")
    except Exception as err:
        tickers = get_sp500_tickers()
        starts=[start]*len(tickers)
        ends=[end]*len(tickers)
        tickers,dfs = get_histories(tickers,starts,ends,granularity="1d")
        ticker_dfs = {ticker:df for ticker,df in zip(tickers,dfs)}
        save_pickle("dataset.obj", (tickers,ticker_dfs))
    return tickers, ticker_dfs 

We may ignore the try block for now: it simply looks for a cached dataset on our computer disk, so that we do not re-download everything on each run. The utils module is not reproduced in this post, but a minimal sketch of what these two pickle helpers might look like (my assumption, inferred from how they are used here) is:
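import pickle

def load_pickle(path):
    # deserialize whatever object was saved at path; raises if the cache does not exist yet
    with open(path, "rb") as fp:
        return pickle.load(fp)

def save_pickle(path, obj):
    # serialize obj to disk so subsequent runs can skip the download
    with open(path, "wb") as fp:
        pickle.dump(obj, fp)

The called function: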

def get_sp500_tickers():
    res = requests.get("https://en.wikipedia.org/wiki/List_of_S%26P_500_companies")
    soup = BeautifulSoup(res.content, "html.parser")  # name the parser explicitly
    table = soup.find_all("table")[0]  # the constituents table is the first <table> on the page
    df = pd.read_html(StringIO(str(table)))  # StringIO: pandas deprecates passing literal HTML strings
    tickers = list(df[0].Symbol)
    return tickers

goes to the specified URL and grabs the first <table> HTML block. You can visit the page in Chrome or Safari and press F12 to open the developer console and inspect the associated HTML. Hover over the table: that is the HTML table we are grabbing, which we dump into a pandas dataframe and extract the tickers from.
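As a quick sanity check (an aside), you can run the function on its own; the exact output depends on current index membership:

tickers = get_sp500_tickers()
print(len(tickers), tickers[:5])  # roughly 500 symbols, e.g. ['MMM', 'AOS', 'ABT', ...]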

We then make a call to get_histories, which isn't all that interesting, except for these few lines:

    threads = [threading.Thread(target=_helper,args=(i,)) for i in range(len(tickers))]
    for thread in threads: thread.start()
    for thread in threads: thread.join()
    #for i in range(len(tickers)): _helper(i) #can replace the 3 preceding lines for sequential polling

The first three lines do, in multi-threaded form, what the commented-out for loop does sequentially. Instead of going through the requests one by one and waiting for each network call to complete, we fire them all off at once by spinning up worker threads. We cannot use asyncio here, because the yfinance API is blocking, and a blocking call would stall the event loop. See our paper here for more notes:

Notes on Asynchronous Python Programming (HangukQuant, August 23, 2023)
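As an aside, if you would rather not manage raw threads yourself, the standard library's concurrent.futures offers a thread pool that also caps concurrency. A minimal sketch of the same fan-out (the function name and max_workers value are my assumptions, not from the original code):

from concurrent.futures import ThreadPoolExecutor

def get_histories_pooled(tickers, period_starts, period_ends, granularity="1d", max_workers=16):
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order, so dfs[i] still corresponds to tickers[i]
        dfs = list(pool.map(
            lambda args: get_history(*args, granularity=granularity),
            zip(tickers, period_starts, period_ends),
        ))
    pairs = [(t, df) for t, df in zip(tickers, dfs) if not df.empty]
    return [t for t, _ in pairs], [df for _, df in pairs]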

The next interesting function is get_history. Recall our UTC timezone standardisation:

df["datetime"] = df["datetime"].dt.tz_localize(pytz.utc)

This takes the non-timezone-aware datetime column in the pandas dataframe and makes it timezone aware. One caveat worth flagging (an aside; the daily yfinance data here arrives naive): tz_localize raises on a series that is already timezone aware, in which case tz_convert is the right call:
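if df["datetime"].dt.tz is None:
    df["datetime"] = df["datetime"].dt.tz_localize(pytz.utc)  # naive -> stamped as UTC
else:
    df["datetime"] = df["datetime"].dt.tz_convert(pytz.utc)   # aware -> converted to UTC

It is good practice to get rid of columns you won't need: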

df = df.drop(columns=["Dividends", "Stock Splits"])

which saves RAM, and this can make the application more performant in subtle ways: our cache and memory carry less unnecessary data. We will get to performant programming in the coming posts. Here's a tip for writing scalable code: standardise the schema (data types, data patterns, and so on) of objects passed between the various components of your software. Here, dataframes are always passed with a timezone-aware datetime index:

df = df.set_index("datetime",drop=True)
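To make that guarantee explicit, one could place a guard like the following at component boundaries; validate_ohlcv_schema is a hypothetical helper of mine, not part of the original module:

def validate_ohlcv_schema(df):
    # every dataframe handed between components must satisfy the agreed schema
    assert isinstance(df.index, pd.DatetimeIndex), "expected a datetime index"
    assert df.index.tz is not None, "datetime index must be timezone aware"
    assert {"open", "high", "low", "close", "volume"}.issubset(df.columns), "missing OHLCV columns"
    return df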

Following this principle, we can create powerful wrappers that use a common interface across multiple external APIs. Recently a reader commented that this post was a total game changer for them:

Data Service Layer - Retrieval of Datasources (HangukQuant, August 5, 2022)

Each method or function should have a predefined interface that specifies what is guaranteed about its inputs and outputs.
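In Python, type hints plus a docstring are a lightweight way to pin such a contract down. Here is how get_history's interface might be spelled out (the annotations are my addition, not in the original code):

def get_history(ticker: str, period_start: datetime, period_end: datetime,
                granularity: str = "1d", tries: int = 0) -> pd.DataFrame:
    """Return an OHLCV dataframe indexed by timezone-aware UTC datetimes.

    Guarantees: columns are open/high/low/close/volume; an empty
    dataframe (never None) is returned if the download fails.
    """
    ...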

Continuing….
