# Cluster Shrinkage

April 8, 2014

At GestaltU we see ourselves as incrementalists. We aren’t so much prone to true quantum leaps in thinking, but we excel at finding novel ways to apply others’ brilliant concepts. In other words, we appreciate the fact that, for the most part, we ‘stand on the shoulders of giants’.

There are of course some true giants in the field of portfolio theory. Aside from timeless luminaries like Markowitz, Black, Sharpe, Thorp and Litterman, we perceive thinkers like Thierry Roncalli, Attilio Meucci, and Yves Choueifaty to be modern giants. We also admire the work of David Varadi for his contributions in the field of heuristic optimization, and his propensity to introduce concepts from fields outside of pure finance and mathematics. Also, Michael Kapler has created a truly emergent phenomenon in finance with his Systematic Investor Toolbox, which has served to open up the previously esoteric field of quantitative finance to a much wider set of practitioners. I (Adam) know I’ve missed many others, for which I deeply apologize and take full responsibility. I never was very good with names.

In this article, we would like to integrate the cluster concepts we introduced in our article on Robust Risk Parity with some ideas proposed and explored by Varadi and Kapler in the last few months (see here and here). Candidly, as so often happens with the creative process, we stumbled on these ideas in the process of designing a Monte-Carlo based robustness test for our production algorithms, which we intend to explore in greater detail in a future post.

### The Curse of Dimensionality

In a recent article series, Varadi and Kapler proposed and validated some novel approaches to the ‘curse of dimensionality’ in correlation/covariance matrices for high dimensional problems with limited data histories. Varadi used the following slide from R. Gutierrez-Osuna to illustrate this concept.

Figure 1. Curse of Dimensionality

Source: R. Gutierrez-Osuna

The ‘curse of dimensionality’ sounds complicated but is actually quite simple. Imagine you seek to derive volatility estimates for a universe of 10 assets based on 60 days of historical data. The volatility of each asset is held in a 1 x 10 vector, where each of the 10 elements of the vector holds the volatility for one asset class. From a data density standpoint, we have 600 observations (60 days x 10 assets) contributing to 10 estimates, so our data density is 600/10 = 60 pieces of data per estimate. From a statistical standpoint, this is a meaningful sample size.

Now let’s instead consider trying to estimate the variance covariance matrix (VCV) for this universe of 10 assets, which we require in order to estimate the volatility of a portfolio constituted from this universe. The covariance matrix is symmetric about the diagonal, so that values in the bottom left half of the matrix are repeated in the upper right half. So how might we calculate the number of independent elements in a covariance matrix with 10 assets?

For those who are interested in such things, the generalized formula for calculating the number of independent elements of a symmetric tensor of rank M with N elements along each dimension (here, N assets) is:

$E_i=\frac{(M+N-1)!}{M!(N-1)!}$

For a rank 2 tensor (such as a covariance matrix) the number of independent elements is:

$E_i=\frac{N(N+1)}{2}$

Therefore, accounting for the diagonal, the covariance matrix generates (10 * 11) / 2 = 55 independent pairwise variance and covariance estimates from the same 600 data points. In this case, the data density falls to 600/55 = 10.9 data points per estimate.

Now imagine projecting the same 60 days into a rank 3 tensor (like the 3 dimensional cube in the figure above), such as the one used to derive the third moment (skewness) of a portfolio of assets. Now we have 10 x 10 x 10 = 1000 elements. The tensor is also symmetric (an entry keeps the same value when its three asset indices are reordered), so we can calculate the number of independent elements using the generalized equation above, which reduces to the following expression for rank=3:

$E_i=\frac{N(N+1)(N+2)}{6}$

Plugging in N=10, we easily calculate that there are (10 * 11 * 12)/6 = 220 independent estimates in this co-skewness tensor. Given that we have generated these estimates from the same 600 data points, we now have a data density of 600/220 = 2.7 pieces of data per estimate.

You can see how, even with just 10 assets to work with, to generate meaningful estimates for covariance, and especially higher order estimates like co-skewness and co-kurtosis (by the same formula, a rank 4 tensor has (10 * 11 * 12 * 13)/24 = 715 independent elements, for a data density of 600/715 = 0.84 observations per estimate), the amount of historical data required grows too large to be practical. For example, to achieve the same 60 data points per estimate for our covariance matrix as we have for our volatility vector would require 60*55 / 10 = 330 days of data per asset.
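The arithmetic above is easy to reproduce. The following is a minimal Python sketch, not code from the original analysis; the function names `independent_elements`, `data_density` and `days_needed` are our own labels for illustration. It relies on the fact that the generalized formula is the binomial coefficient C(M+N-1, M).

```python
from math import comb


def independent_elements(rank: int, n_assets: int) -> int:
    """Count the independent entries of a fully symmetric tensor.

    Implements E = (M + N - 1)! / (M! (N - 1)!), i.e. C(M + N - 1, M).
    """
    return comb(rank + n_assets - 1, rank)


def data_density(n_days: int, n_assets: int, rank: int) -> float:
    """Observations available per independent estimate."""
    return n_days * n_assets / independent_elements(rank, n_assets)


def days_needed(target_density: float, n_assets: int, rank: int) -> float:
    """Days of history per asset required to reach a target data density."""
    return target_density * independent_elements(rank, n_assets) / n_assets


if __name__ == "__main__":
    # 10 assets and 60 days of history, as in the text.
    for rank, label in [(1, "volatility"), (2, "covariance"), (3, "co-skewness")]:
        count = independent_elements(rank, 10)
        density = data_density(60, 10, rank)
        print(f"{label}: {count} estimates, {density:.1f} observations per estimate")
    print("days needed for 60 observations per covariance estimate:",
          days_needed(60, 10, 2))
```

Changing the number of assets in the loop shows how quickly the data density collapses as the universe grows.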

### Decay vs. Significance

In finance, we are often faced with a tradeoff between informational decay (or availability for testing purposes) and estimation error. On the one hand, we need a large enough data sample to derive statistically meaningful estimates. But on the other hand, price signals from long ago may carry less meaningful information than near term price signals.

For example, a rule of thumb in statistics is that you need at least 30 data points in a sample to test for statistical significance. For this reason, when simulating methodologies with monthly data, many researchers will use the past 30 months of data to derive their estimates for covariance, volatility, etc. While the sample may be meaningful from a density standpoint (enough data points to be meaningful), it may not be quite as meaningful from an ‘economic’ standpoint, because price movements 2.5 years ago may not materially reflect current relationships.

To overcome this common challenge, researchers have proposed several ways to reduce the dimensionality of higher order estimates. For example, the concept of ‘shrinkage’ is often applied to covariance estimates for large dimensional universes in order to ‘shrink’ the individual estimates in a covariance matrix toward the average of all estimates in the matrix. Ledoit and Wolf pioneered this domain with their whitepaper, Honey, I Shrunk the Sample Covariance Matrix. Varadi and Kapler explore a variety of these methods, and propose some novel and exciting new methods in their recent article series. Overall, our humble observation from these analyses and a quick survey of the literature is that while shrinkage methods help overcome some theoretical hurdles involved with time series parameter estimation, empirical results demonstrate mixed practical improvement.
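To make the idea of shrinking toward the average concrete, here is a rough Python sketch applied to a correlation matrix. It is a simplification and not the Ledoit and Wolf estimator: the function name `shrink_to_grand_mean` and the fixed `intensity` parameter are assumptions for illustration, whereas Ledoit and Wolf estimate the shrinkage intensity from the data. The input matrix is made up.

```python
def shrink_to_grand_mean(corr, intensity=0.5):
    """Pull every off-diagonal correlation toward the average off-diagonal value.

    corr      : square list-of-lists correlation matrix
    intensity : assumed fixed weight on the target (0 = sample, 1 = target)
    """
    n = len(corr)
    if n < 2:
        return [row[:] for row in corr]
    off_diagonal = [corr[i][j] for i in range(n) for j in range(n) if i != j]
    target = sum(off_diagonal) / len(off_diagonal)
    return [
        [
            1.0 if i == j else (1.0 - intensity) * corr[i][j] + intensity * target
            for j in range(n)
        ]
        for i in range(n)
    ]


if __name__ == "__main__":
    # Made-up 3-asset sample correlation matrix, for illustration only.
    sample = [
        [1.0, 0.2, 0.4],
        [0.2, 1.0, 0.6],
        [0.4, 0.6, 1.0],
    ]
    for row in shrink_to_grand_mean(sample, intensity=0.5):
        print([round(value, 3) for value in row])
```

With a single target for the whole matrix, every pair is pulled toward the same number regardless of which assets are involved.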

### Cluster Shrinkage

Despite the mixed results of shrinkage methods in general, we felt there might be some value in proposing a slightly different type of shrinkage method which represents a sort of ‘compromise’ between traditional shrinkage methods and estimates derived from the sample matrix with no adjustments. The compromise arises from the fact that our method introduces a layer of shrinkage that is more granular than the average of all estimates, but less granular than the sample matrix, by shrinking toward clusters.

Clustering is a method of dimensionality reduction because it segregates assets into groups with similar qualities based on information in the correlation matrix. As such, an asset universe of several dozen or even hundreds of securities can be reduced to a handful of significant moving parts. I would again direct readers to a thorough exploration of clustering methods by Varadi and Kapler here, and how clustering might be applied to robust risk parity in our previous article, here.

Figure 2 shows the major market clusters for calendar year 2013 and year-to-date 2014. The clusters were derived using k-means, with the number of relevant clusters determined by the percentage of variance method (p>0.90) (find code here from Kapler).

Figure 2. Major market clusters in 2013-2014

In this universe there appear to have been 4 significant clusters over this period, which we might broadly categorize thusly:

1. Bond cluster (IEF, TLT)
2. Commodity (GLD, DBC)
3. Global equity cluster (EEM, EWJ, VGK, RWX, VTI)
4. U.S. Real Estate cluster (ICF)
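The cluster selection behind Figure 2 can be sketched roughly as follows. This is our own minimal Python illustration, not Kapler’s R code: we assume each asset is described by its row of the correlation matrix, we seed k-means deterministically with a farthest-point rule, and we pick the smallest number of clusters whose explained share of variance (one minus within-cluster sum of squares over total sum of squares) exceeds 0.90. All function names and the toy matrix are hypothetical.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    return [sum(column) / len(points) for column in zip(*points)]


def kmeans(points, k, n_iter=100):
    """Plain Lloyd's algorithm with deterministic farthest-point seeding."""
    centers = [points[0]]
    while len(centers) < k:
        # Next seed: the point farthest from its nearest existing seed.
        centers.append(
            max(points, key=lambda p: min(squared_distance(p, c) for c in centers))
        )
    labels = None
    for _ in range(n_iter):
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = centroid(members)
    return labels, centers


def explained_variance(points, labels, centers):
    """Share of total variance explained by the clustering."""
    grand = centroid(points)
    total_ss = sum(squared_distance(p, grand) for p in points)
    if total_ss == 0:
        return 1.0
    within_ss = sum(squared_distance(p, centers[lab]) for p, lab in zip(points, labels))
    return 1.0 - within_ss / total_ss


def choose_clusters(points, threshold=0.90):
    """Smallest k whose explained variance exceeds the threshold."""
    for k in range(1, len(points) + 1):
        labels, centers = kmeans(points, k)
        if explained_variance(points, labels, centers) > threshold:
            return k, labels
    return len(points), list(range(len(points)))


if __name__ == "__main__":
    # Made-up correlation matrix: two bond-like and two equity-like assets.
    corr = [
        [1.0, 0.9, 0.0, 0.1],
        [0.9, 1.0, 0.1, 0.0],
        [0.0, 0.1, 1.0, 0.8],
        [0.1, 0.0, 0.8, 1.0],
    ]
    print(choose_clusters(corr))
```

A production version would more likely use a library k-means with several random restarts; the deterministic seeding here is only to keep the example reproducible.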

Now that we have the clusters, we can think about each cluster as a new asset which captures a meaningful portion of the information from each of the constituents of the cluster. As such, once we choose a weighting scheme for how the assets are weighted inside each cluster, we can now form a correlation matrix from the 4 cluster ‘assets’, and this matrix will contain a meaningful portion of the information contained in the sample correlation matrix.

Figure 3. Example cluster correlation matrix

Once we have the cluster correlation matrix, the next step is to map each of the original assets to its respective cluster. Then we will ‘shrink’ each pairwise estimate in the sample correlation matrix toward the correlation estimate derived from the assets’ respective clusters. Where two assets are from the same cluster, we will shrink the sample pairwise correlation toward the average of all the pairwise correlations between assets of that cluster.

An example should help to cement the logic. Let’s assume the sample pairwise correlation between IEF and VTI is -0.1. Then we would shrink this pairwise correlation toward the correlation between the clusters to which IEF (bond cluster) and VTI (global equity cluster) respectively belong. From the table, we can see that the correlation between the bond and global equity clusters is 0.05, so the ‘shrunk’ pairwise correlation estimate for IEF and VTI becomes mean(-0.1, 0.05) = -0.025.

Next let’s use an example of two assets from the same cluster, say EWJ and VTI which both belong to the global equity cluster. Let’s assume the sample pairwise correlation between these assets is 0.6, and that the average of all pairwise correlations between all of the assets in the global equity cluster is 0.75. Then the ‘shrunk’ pairwise correlation estimate between EWJ and VTI becomes mean(0.6, 0.75) = 0.675.
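The two worked examples can be reproduced with a short Python sketch. This is our illustration of the logic described above rather than the production R code: the function name `cluster_shrink` is hypothetical, the equal 50/50 averaging mirrors the mean() in the examples, and every input number other than those quoted in the examples is made up. No check is made that the shrunk matrix remains positive semi-definite.

```python
def cluster_shrink(sample_corr, labels, cluster_corr):
    """Shrink each pairwise correlation halfway toward its cluster-level target.

    sample_corr  : n x n sample correlation matrix (list of lists)
    labels       : cluster index for each asset, in matrix order
    cluster_corr : k x k correlation matrix between the cluster 'assets'
    """
    n = len(sample_corr)

    # Average sample correlation across all pairs inside each cluster.
    within = {}
    for cluster in set(labels):
        members = [i for i in range(n) if labels[i] == cluster]
        pairs = [
            sample_corr[i][j]
            for position, i in enumerate(members)
            for j in members[position + 1:]
        ]
        if pairs:
            within[cluster] = sum(pairs) / len(pairs)

    shrunk = [row[:] for row in sample_corr]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # diagonal stays at 1
            ci, cj = labels[i], labels[j]
            target = within[ci] if ci == cj else cluster_corr[ci][cj]
            shrunk[i][j] = 0.5 * (sample_corr[i][j] + target)
    return shrunk


if __name__ == "__main__":
    tickers = ["IEF", "EWJ", "VGK", "VTI"]
    labels = [0, 1, 1, 1]  # 0 = bond cluster, 1 = global equity cluster
    # Made-up sample matrix, chosen so that IEF/VTI = -0.1, EWJ/VTI = 0.6
    # and the average within-equity correlation is 0.75.
    sample_corr = [
        [1.00, -0.05, 0.00, -0.10],
        [-0.05, 1.00, 0.80, 0.60],
        [0.00, 0.80, 1.00, 0.85],
        [-0.10, 0.60, 0.85, 1.00],
    ]
    cluster_corr = [[1.00, 0.05], [0.05, 1.00]]
    shrunk = cluster_shrink(sample_corr, labels, cluster_corr)
    print("IEF/VTI:", round(shrunk[0][3], 4))
    print("EWJ/VTI:", round(shrunk[1][3], 4))
```

The 0.5 weight could be exposed as a parameter if a different balance between the sample estimate and the cluster target is desired.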

### Empirical Results

We have coded up the logic for this method in R for use in Kapler’s Systematic Investor Toolbox backtesting environment. The following tables offer a comparison of results on two universes. We ran minimum risk or equal risk contribution weighting methods with and without the application of our cluster shrinkage method, using a 250 day lookback window. All portfolios were rebalanced quarterly.

EW = Equal Weight (1/N)
MV = Minimum Variance
MD = Maximum Diversification
ERC = Equal Risk Contribution
MVA = David Varadi’s Heuristic Minimum Variance Algorithm

Results with cluster shrinkage show a .CS to the right of the weighting algorithm at the top of each performance table.

Table 1. 10 Global Asset Classes (DBC, EEM, EWJ, GLD, ICF, IEF, RWX, TLT, VGK, VTI)

Data from Bloomberg (extended with index or mutual fund data from 1995-)

Table 2. 9 U.S. sector SPDR ETFs (XLY, XLP, XLE, XLF, XLV, XLI, XLB, XLK, XLU)

Data from Bloomberg

We can draw some broad conclusions from these performance tables. At the very least we have achieved golden rule number 1: first, do no harm. Most of the CS methods at least match the raw sample versions in terms of Sharpe ratio and MAR, with comparable returns.

In fact, we might suggest that cluster shrinkage delivers meaningful improvement relative to the unadjusted versions, producing a noticeably higher Sharpe ratio for minimum variance, maximum diversification, and heuristic MVA algorithms for both universes, and for ERC as well with the sector universe. Further, we observe a material reduction in turnover as a result of the added stability of the shrinkage overlay, especially for the maximum diversification based simulations, where turnover was lower by 30-35% for both universes.

Cluster shrinkage appears to deliver a more consistent improvement for the sector universe than the asset class universe. This may be due to the fact that sector correlations are less stable than asset class correlations, and thus benefit from the added stability. If so, we should see even greater improvement on larger and noisier datasets such as individual stocks. We look forward to investigating this in the near future.