Original article from https://rviews.rstudio.com/2017/08/22/stocks/

This example groups stocks together in a network that highlights associations within and between the groups using only historical price data. The result is far from ground-breaking: you can already guess the output. For the most part, the stocks get grouped together into pretty obvious business sectors.

Despite the obvious result, the process of teasing out latent groupings from historic price data is interesting. That’s the focus of this example. A central idea of the approach taken here comes from the great paper of Ledoit and Wolf, “Honey, I Shrunk the Sample Covariance Matrix” (http://www.ledoit.net/honey.pdf). This example employs an alternative approach based on a matrix eigenvalue decomposition, but it’s the same general idea.

This note follows an informal, how-to format. Rather than focus on mathematical analysis, which is well-detailed in the references, I try to spell out the how’s and why’s: how to do things step by step (using R) and a somewhat non-rigorous rationale for each step that’s hopefully at least convincing and intuitive.

For emphasis, allow me to restate the first sentence as an objective:

    Group stocks together in a network that highlights associations within and between the groups, using only historical price data.

That’s what the rest of this example will do, hopefully illuminating some key ideas about regularization along the way.

Software used in the example

The example uses R of course, and the following R packages, all available on CRAN except where noted (some of the packages themselves have dependencies): quantmod, xts, zoo, qs, igraph, threejs, crosstalk, htmltools, and the experimental crosstool package (not on CRAN), used only in the appendix.

Getting data

NOTE: You can skip ahead to the Sample correlation matrix section by simply downloading a sample copy of processed log(return) data as follows:

library(quantmod)
load(url("http://illposed.net/logreturns.rdata"))

Otherwise, follow the next two sections to download the raw stock daily price data and process those data into log(returns).

Download daily closing price data from Yahoo Finance

The quantmod package (Ulrich and Ryan, http://www.quantmod.com/) makes it ridiculously easy to download (and visualize) financial time series data. The following code uses quantmod to download daily stock price data for about 100 companies with the largest market capitalizations listed on the Standard & Poor’s 500 index at the time of this writing. The code downloads daily closing prices from 2012 until the present. Modify the code to experiment with different time periods or stocks as desired!

Because stock symbol names may change and companies may come and go, it’s possible that some of the data for some time periods are not available. The tryCatch() block in the code checks for a download error and flags problems by returning NA, which is later removed from the result. The upshot is that the number of stock price time series columns in the output may be smaller than the input list of stock symbols.

The output of the following code is an xts time series matrix of stock prices called prices whose rows correspond to days and columns to stock symbols.

from <- "2012-05-17"
sym <- c("AAPL", "ABBV", "ABT", "ACN", "AGN", "AIG", "ALL", "AMGN",
         "AMZN", "AXP", "BA", "BAC", "BIIB", "BK", "BLK", "BMY",
         "BRK.B", "C", "CAT", "CELG", "CL", "CMCSA", "COF", "COP",
         "COST", "CSCO", "CVS", "CVX", "DD", "DHR", "DIS", "DOW",
         "DUK", "EMR", "EXC", "F", "FB", "FDX", "FOX", "FOXA", "GD",
         "GE", "GILD", "GM", "GOOG", "GOOGL", "GS", "HAL", "HD",
         "HON", "IBM", "INTC", "JNJ", "JPM", "KHC", "KMI", "KO",
         "LLY", "LMT", "LOW", "MA", "MCD", "MDLZ", "MDT", "MET",
         "MMM", "MO", "MON", "MRK", "MS", "MSFT", "NEE", "NKE",
         "ORCL", "OXY", "PCLN", "PEP", "PFE", "PG", "PM", "PYPL",
         "QCOM", "RTN", "SBUX", "SLB", "SO", "SPG", "T", "TGT", "TWX",
         "TXN", "UNH", "UNP", "UPS", "USB", "UTX", "V", "VZ", "WBA",
         "WFC", "WMT", "XOM")

prices <- parallel::mclapply(sym, \(ticker) {
  print(ticker)
  tryCatch(quantmod::getSymbols(ticker, src="yahoo", auto.assign=FALSE, env=NULL, from=from),
           error=\(e) NA)
}, mc.cores=10)
# Optionally cache the downloaded data locally to avoid repeating the download:
# qs::qsave(x=prices, file="2022-02-01-prices-2.qs", nthreads=10)
# prices <- qs::qread("2022-02-01-prices-2.qs", nthreads=10)

# keep only symbols that returned valid data (drop the NA entries)
prices <- prices[!is.na(prices)]

prices <- do.call(xts::merge.xts, prices)            # merge into one xts matrix
prices <- prices[, grep("Close", colnames(prices))]  # keep only the closing prices
colnames(prices) <- gsub("\\.Close", "", colnames(prices))  # tidy the column names

Clean up and transform data

Not every stock symbol may have prices available for every day. Trading can be suspended for some reason, companies get acquired or go private, new companies form, etc.

Let’s fill in missing values going forward in time using the last reported price (piecewise constant interpolation)–a reasonable approach for stock price time series. After that, if there are still missing values, just remove those symbols that contain them, possibly further reducing the universe of stock symbols we’re working with.

prices <- zoo::na.locf(prices)
prices <- prices[,colSums(is.na(prices))==0]

Now that we have a universe of stocks with valid price data, convert those prices to log(returns) for the remaining analysis (by log(returns) I mean the differences of the logarithms of successive daily prices, that is, the logs of day-over-day price ratios).

Why log(returns) instead of prices?

The log(returns) are closer to normally distributed than prices, especially in the long run. Pat Burns wrote a note about this (with a Tom Waits soundtrack): http://www.portfolioprobe.com/2012/01/23/the-distribution-of-financial-returns-made-simple/. (A quick check on our own data appears after the conversion code below.)

But why care about getting data closer to normally distributed?

That turns out to be important to us because later we’ll use a technique called partial correlation. That technique generally works better for normally distributed data than otherwise; see for example a nice technical discussion about this by Baba, Shibata, and Sibuya here: https://doi.org/10.1111%2Fj.1467-842X.2004.00360.x

The following simple code converts our prices matrix into a matrix of log(returns):

log_returns <- diff(log(prices))
log_returns <- log_returns[rowSums(is.na(log_returns))==0,]  # drop the leading NA row introduced by diff()
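As a quick, informal check of the normality claim above, compare normal Q-Q plots of one stock’s raw prices and its log(returns). This is a minimal sketch; the AAPL column is an assumption, so substitute any symbol that survived the cleaning. The log(returns) should track the reference line much more closely than the prices do.

op <- par(mfrow=c(1, 2))
qqnorm(as.numeric(prices[, "AAPL"]), main="AAPL prices")  # "AAPL" assumed present
qqline(as.numeric(prices[, "AAPL"]))
qqnorm(as.numeric(log_returns[, "AAPL"]), main="AAPL log(returns)")
qqline(as.numeric(log_returns[, "AAPL"]))
par(op)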

Sample correlation matrix

It’s easy to convert the downloaded log(returns) data into a Pearson sample correlation matrix X:

X <- cor(log_returns)

The (i, j)th entry of the sample correlation matrix X above is a measurement of the degree of linear dependence between the log(return) series for the stocks in columns i and j.
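For example, to peek at the estimated association between two specific stocks (the AAPL and MSFT names are assumptions; substitute any two symbols present in X):

X["AAPL", "MSFT"]  # sample correlation between Apple and Microsoft log(returns)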

There exist at least two issues that can lead to serious problems with the interpretation of the sample correlation values:

  1. As Ledoit and Wolf point out, it’s well-known that empirical correlation estimates may contain lots of error.
  2. Correlation estimates between two stock log(return) series can be misleading for many reasons, including spurious correlation or existence of confounding variables related to both series (http://www.tylervigen.com/spurious-correlations).

A Nobel Prize-winning approach to dealing with the second problem considers cointegration between series instead of correlation; see for example notes by Eric Zivot (https://faculty.washington.edu/ezivot/econ584/notes/cointegrationslides.pdf), Bernhard Pfaff’s lovely book “Analysis of Integrated and Cointegrated Time Series with R” (http://www.springer.com/us/book/9780387759661), or Wikipedia (https://en.wikipedia.org/wiki/Cointegration). (I also have some weird technical notes on the numerics of cointegration at http://illposed.net/cointegration.html.)

Cointegration is a wonderful but fairly technical topic. Instead, let’s try a simpler approach.

We can try to address issue 2 above by controlling for confounding variables, at least partially. One approach considers partial correlation instead of correlation (see for example the nice description in Wikipedia https://en.wikipedia.org/wiki/Partial_correlation). That approach works best in practice with approximately normal data–one reason for the switch to log(returns) instead of prices. We will treat the entries of the precision matrix as measures of association in a network of stocks below.

It’s worth stating that our simple approach basically treats the log(returns) series as a bunch of vectors rather than bona fide time series, and it can’t handle as many of the pathologies that might occur as cointegration can. But as we will see, this simple technique is still pretty effective at finding structure in our data. (And indeed, related methods as discussed by Ledoit and Wolf and elsewhere are widely used in portfolio and risk analyses in practice.)

The partial correlation coefficients between all pairs of stock log(returns) series can be read off the inverse of the sample correlation matrix, known as the precision matrix (https://www.statlect.com/glossary/precision-matrix).
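Concretely, if W denotes the precision matrix, the partial correlation between series i and j, controlling for all the other series, is -W[i, j] / sqrt(W[i, i] * W[j, j]). Here is a minimal sketch of that relationship, naively inverting X with solve() purely for illustration (the next paragraphs explain why naive inversion is a bad idea here):

W <- solve(X)                                  # naive inverse: illustration only!
partial_cor <- -W / tcrossprod(sqrt(diag(W)))  # scale by sqrt(W[i,i] * W[j,j]) and flip sign
diag(partial_cor) <- 1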

Market trading of our universe of companies, with myriad known and unknown associations between them and the larger economy, produced the stock prices we downloaded. Our objective is a kind of inverse problem: given a bunch of historical stock prices, produce a network of associations.

You may recall from some long ago class that, numerically speaking, inverting matrices is generally a bad idea. Even worse, issue 1 above says that our estimated correlation coefficients contain error (noise). Even a tiny amount of noise can be hugely amplified if we invert the matrix. That’s because, as we will soon see, the sample correlation matrix contains tiny eigenvalues, and matrix inversion effectively divides the noise by those tiny values. Simply stated, dividing by a tiny number returns a big number–that is, matrix inversion tends to blow the noise up. This is a fundamental issue (in a sense, the fundamental issue) common to many inverse problems.
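We can gauge how touchy inversion is for our data via the condition number of X, the ratio of its largest to smallest eigenvalue, which bounds how much relative error can grow under inversion (expect a large value for data like these):

kappa(X, exact=TRUE)  # condition number of the sample correlation matrix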

Ledoit and Wolf’s sensible answer to reducing the influence of noise is regularization. Regularization replaces models with different, but related, models designed to reduce the influence of noise on their output. Ledoit and Wolf use a form of regularization related to ridge regression (a.k.a. Tikhonov regularization) with a peculiar regularization operator based on a highly structured estimate of the covariance. We will use a simpler kind of regularization based on an eigenvalue decomposition of the sample correlation matrix X.

Regularization

Here is an eigenvalue decomposition of the sample correlation matrix:

L <- eigen(X, symmetric=TRUE)

Note that R’s eigen() function takes care to return the (real-valued) eigenvalues of a symmetric matrix in decreasing order for us. (Technically, the correlation matrix is symmetric positive semi-definite, and will have only nonnegative real eigenvalues.)

Each eigenvector represents an orthogonal projection of the sample correlation matrix into a line (a 1-d shadow of the data); the first two eigenvectors define a projection of the sample correlation matrix into a plane (2-d), and so on. The eigenvalues estimate the proportion of information (or variability if you prefer) from the original sample correlation matrix contained in each eigenvector. Because the eigenvectors are orthogonal, these measurements of projected information are additive.
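A quick numerical sanity check of those two facts (the eigenvectors are orthonormal, and the eigenvalues of a correlation matrix sum to its trace, the number of stocks):

max(abs(crossprod(L$vectors) - diag(ncol(X))))  # ~0: the eigenvectors are orthonormal
sum(L$values)                                   # equals ncol(X), the trace of X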

Here is a plot of all the sample correlation matrix eigenvalues (along with a vertical line that will be explained in a moment):

plot(L$values, ylab="eigenvalues")
abline(v=10)

The eigenvalues fall off rapidly: most of the information in the sample correlation matrix is concentrated in the first several eigenvectors. The vertical line marks a cutoff after the tenth eigenvalue. Our regularization scheme simply inverts X restricted to the subspace spanned by its first N = 10 eigenvectors, discarding the tiny trailing eigenvalues that would otherwise amplify the noise, and then rescales the result to have a unit diagonal:

N <- 10  # (use 1st 10 eigenvectors; set N larger to reduce regularization)
P <- L$vectors[,1:N] %*% (t(L$vectors[,1:N]) / L$values[1:N])
P <- P / tcrossprod(sqrt(diag(P)))
colnames(P) <- colnames(X)
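Because the projected information measurements are additive, it’s a one-liner to check what proportion of the original sample correlation matrix’s information the first N eigenvectors retain:

sum(L$values[1:N]) / sum(L$values)  # fraction of the total eigenvalue mass kept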

Other approaches

I’m not qualified to write about them, but you should be aware that Bayesian approaches to solving problems like this are also effectively (and effective!) regularization methods. I hope to someday better understand the connections between classical inverse problem solution methods that I know a little bit about, and Bayesian methods that I know substantially less about.

Put a package on it

There is a carefully written R package for constructing regularized correlation and precision matrices: the corpcor package (https://cran.r-project.org/package=corpcor, and also see http://strimmerlab.org/software/corpcor/) by Juliane Schafer, Rainer Opgen-Rhein, Verena Zuber, Miika Ahdesmaki, A. Pedro Duarte Silva, and Korbinian Strimmer. Their package includes the original Ledoit-Wolf regularization method, as well as refinements to it and many other methods. The corpcor package, like Ledoit and Wolf’s approach, supports sophisticated regularization operators and applies more broadly than the simple approach taken in this note.

You can use the corpcor package to form a Ledoit-Wolf-like regularized precision matrix P, and you should try it! The result is pretty similar to what we get from our simple truncated eigenvalue decomposition regularization in this example.
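For instance, here is a minimal sketch (assuming the corpcor package is installed): pcor.shrink() estimates a regularized partial correlation matrix directly from the data, choosing the shrinkage intensity automatically.

# Shrinkage-regularized partial correlation matrix from the log(returns) data;
# compare its entries with our truncated eigendecomposition estimate P.
P_shrink <- corpcor::pcor.shrink(as.matrix(log_returns))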

Networks and clustering

The (i, j)th entry of the precision matrix P is a measure of association between the log(return) time series for the stocks in columns i and j, with larger values corresponding to more association.

An interesting way to group related stocks together is to think of the precision matrix as an adjacency matrix defining a weighted, undirected network of stock associations. Thresholding entries of the precision matrix to include, say, only the top ten per cent results in a network of only the most strongly associated stocks.

Thinking in terms of networks opens up a huge and useful toolbox: graph theory. We gain access to all kinds of nifty ways to analyze and visualize data, including methods for clustering and community detection.

R’s comprehensive igraph package by Gábor Csárdi (https://cran.r-project.org/package=igraph) includes many network cluster detection algorithms. The example below uses Blondel and co-authors’ fast community detection algorithm implemented by igraph’s cluster_louvain() function to segment the thresholded precision matrix of stocks into groups. The code produces an igraph graph object g, with vertices colored by group membership.

threshold <- 0.90
Q <- P * (P > quantile(P, probs=threshold))  # thresholded precision matrix

g <- igraph::graph.adjacency(
    adjmatrix=Q,
    mode="undirected",
    weighted=TRUE,
    diag=FALSE)  # ...expressed as a graph

# Group the vertices by cluster membership, ordering clusters from largest to smallest
x <- igraph::groups(igraph::cluster_louvain(g))
i <- unlist(lapply(x, length))   # cluster sizes
d <- order(i, decreasing=TRUE)
x <- x[d]
i <- i[d]
j <- i > 1                       # clusters with more than one member
s <- sum(j)
names(x)[j] <- seq(1, s)         # number the multi-member clusters 1, 2, ...
names(x)[!j] <- s + 1            # lump all singletons together into one last group
grp <- as.integer(rep(names(x), i))
clrs <- c(rainbow(s), "gray")[grp[order(unlist(x))]]  # gray marks the singletons
g <- igraph::set_vertex_attr(g, "color", value=clrs)

Use the latest threejs package to make a nice interactive visualization of the network (you can use your mouse/trackpad to rotate, zoom and pan the visualization).

threejs::graphjs(g, vertex.size=0.2, vertex.shape=colnames(X), edge.alpha=0.5)

The stock groups identified by this method are uncanny, but hardly all that surprising really. Look closely and you will see clusters made up of bank-like companies (AIG, BAC, BK, C, COF, GS, JPM, MET, MS, USB, WFC), pharmaceutical companies (ABT, AMGN, BIIB, BMY, CELG, GILD, JNJ, LLY, MRK, PFE), computer/technology-driven companies (AAPL, ACN, CSCO, IBM, INTC, MSFT, ORCL, QCOM, T, TXN, VZ), and so on. With the threshold value of 0.9 above, a few stocks aren’t connected to any others – they appear in gray.

The groups more or less correspond to what we already know!

The FB, GOOG, AMZN, PCLN (Facebook, Alphabet/Google, Amazon, Priceline) group is interesting–it includes the credit card companies V (Visa) and MA (Mastercard). Perhaps the returns of FB, GOOG, and AMZN are more closely connected to consumer spending than to technology!

This way of looking at things also nicely highlights connections between groups. For instance, the pharma group is connected to the consumer products group through JNJ and PG (Johnson & Johnson and Procter & Gamble). See the appendix below for a visualization that explores different precision matrix threshold values, including lower values with far greater network connectivity.
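You can poke at such connections programmatically, too. A small sketch (assuming JNJ survived the data cleaning and so appears as a named vertex in g):

igraph::neighbors(g, "JNJ")  # stocks directly connected to Johnson & Johnson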

Review

We downloaded daily closing stock prices for 100 stocks from the S&P 500, and, using basic tools of statistics and analysis like correlation and regularization, we grouped the stocks together in a network that highlights associations within and between the groups. The structure teased out of the stock price data is reasonably intuitive.

Appendix: threejs tricks

The following self-contained example shows how the network changes with the threshold value. It performs the same steps as we did above, but uses some tricks in threejs, an experimental extension to the crosstalk package called crosstool, and a few additional R packages to present an interactive animation. Enjoy!

# A function that creates a network for a given threshold and precision matrix
f <- \(threshold, P) {
    Q <- P * (P > quantile(P, probs=threshold))  # thresholded precision matrix
    g <- igraph::graph.adjacency(
        Q, 
        mode="undirected", 
        weighted=TRUE, 
        diag=FALSE)  # ...expressed as a graph

    x <- igraph::groups(igraph::cluster_louvain(g))
    i <- sapply(x, length)
    d <- order(i, decreasing=TRUE)
    x <- x[d]
    i <- i[d]
    j <- i > 1
    s <- sum(j)
    names(x)[j] <- seq(s)
    names(x)[! j] <- s + 1
    grp <- as.integer(rep(names(x), i))
    clrs <- c(rainbow(s), "gray")[grp[order(unlist(x))]]
    g <- igraph::set_vertex_attr(g, "color", value=clrs)
    igraph::set_vertex_attr(g, "shape", value=colnames(P))
}

threshold <- c(0.97, 0.95, 0.90, 0.85, 0.8)
g <- lapply(threshold, f, P=P)  # list of graphs, one for each threshold

# Compute force-directed network layouts for each threshold value
# A bit expensive to compute, so run in parallel!
l <- parallel::mcMap(\(x) igraph::layout_with_fr(x, dim=3, niter=150), 
    g, mc.cores=parallel::detectCores())

len <- length(threshold) - 1
sdf <- crosstalk::SharedData$new(
    data=data.frame(key=0:len), 
    key=~key)
slider <- crosstool::crosstool(
    data=sdf,
    class="transmitter",
    html=sprintf("<input type='range' min='0' max='%d' value='0'/>", len),
    width="100%",
    height=20,
    channel="filter")
vis <- threejs::graphjs(
    g=g,
    layout=l,
    vertex.size=0.2,
    main=as.list(threshold),
    defer=TRUE,
    edge.alpha=0.5,
    deferfps=30,
    crosstalk=sdf,
    width="100%",
    height=900)

htmltools::browsable(htmltools::div(list(
    htmltools::HTML("<center>"), 
    htmltools::tags$h3("Precision matrix quantile threshold (adjust slider to change)"), 
    slider, vis)))


Software

sessionInfo()
## R version 4.1.2 (2021-11-01)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Arch Linux
## 
## Matrix products: default
## BLAS:   /usr/lib/libblas.so.3.10.0
## LAPACK: /usr/lib/liblapack.so.3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_DK.UTF-8        LC_COLLATE=en_GB.UTF-8    
##  [5] LC_MONETARY=nb_NO.UTF-8    LC_MESSAGES=en_GB.UTF-8   
##  [7] LC_PAPER=nb_NO.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=nb_NO.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.7          bslib_0.3.1         compiler_4.1.2     
##  [4] later_1.3.0         jquerylib_0.1.4     highr_0.9          
##  [7] base64enc_0.1-3     xts_0.12.1          tools_4.1.2        
## [10] qs_0.25.2           digest_0.6.29       lifecycle_1.0.1    
## [13] jsonlite_1.7.3      evaluate_0.14       lattice_0.20-45    
## [16] pkgconfig_2.0.3     rlang_1.0.0         igraph_1.2.7       
## [19] shiny_1.7.1         cli_3.1.0           crosstalk_1.2.0    
## [22] yaml_2.2.2          parallel_4.1.2      xfun_0.27          
## [25] fastmap_1.1.0       stringr_1.4.0       knitr_1.36         
## [28] sass_0.4.0          htmlwidgets_1.5.4   grid_4.1.2         
## [31] RApiSerialize_0.1.0 R6_2.5.1            crosstool_0.0.0    
## [34] rmarkdown_2.11      magrittr_2.0.2      ellipsis_0.3.2     
## [37] promises_1.2.0.1    threejs_0.3.3       htmltools_0.5.2    
## [40] xtable_1.8-4        mime_0.12           httpuv_1.6.3       
## [43] stringfish_0.15.5   stringi_1.7.5       lazyeval_0.2.2     
## [46] RcppParallel_5.1.4  zoo_1.8-9