Machine learning risk models


Risk Market Journals. Journal of Risk & Control, 6(1), 37-64. March 30, 2019.

Machine Learning Risk Models

Zura Kakushadze^§1 and Willie Yu^2

Abstract

We give an explicit algorithm and source code for constructing risk models based on machine learning techniques. The resultant covariance matrices are not factor models. Based on empirical backtests, we compare the performance of these machine learning risk models to other constructions, including statistical risk models, risk models based on fundamental industry classifications, and also those utilizing multilevel clustering based industry classifications.

JEL Classification numbers: G00; G10; G11; G12; G23

Keywords: machine learning; risk model; clustering; k-means; statistical risk models; covariance; correlation; variance; cluster number; risk factor; optimization; regression; mean-reversion; factor loadings; principal component; industry classification; quant; trading; dollar-neutral; alpha; signal; backtest

^1 Zura Kakushadze, Ph.D., is the President of Quantigic® Solutions LLC, and a Full Professor at Free University of Tbilisi. Email: zura@quantigic.com

^§ Quantigic® Solutions LLC, 1127 High Ridge Road #135, Stamford, CT 06905. DISCLAIMER: This address is used by the corresponding author for no purpose other than to indicate his professional affiliation as is customary in publications. In particular, the contents of this paper are not intended as an investment, legal, tax or any other such advice, and in no way represent views of Quantigic® Solutions LLC, the website www.quantigic.com or any of their other affiliates.

^2 Willie Yu, Ph.D., is a Research Fellow at Duke-NUS Medical School.

Article Info: Received: March 4, 2019. Published online: March 30, 2019.

1 Introduction and Summary

In most practical quant trading applications^3 one faces an old problem when computing a sample covariance matrix of returns: the number N of returns (e.g., the number of stocks in the trading universe) is much larger than the number T of observations in the time series of returns. The sample covariance matrix C_ij (i, j = 1, ..., N) in this case is badly singular: its rank is at best T - 1. So, it cannot be inverted, which is required in, e.g., mean-variance optimization [17]. In fact, the singularity of C_ij is only a small part of the trouble: its off-diagonal elements (more precisely, sample correlations) are notoriously unstable out-of-sample.

The aforesaid "ills" of the sample covariance matrix are usually cured via multifactor risk models,^4 where stock returns are (linearly) decomposed into contributions stemming from some number K of common underlying factors plus idiosyncratic "noise" pertaining to each stock individually. This is a way of dimensionally reducing the problem in that one only needs to compute a factor covariance matrix Φ_AB (A, B = 1, ..., K), which is substantially smaller than C_ij assuming K ≪ N.^5

In statistical risk models^6 the factors are based on the first K principal components of the sample covariance matrix C_ij (or the sample correlation matrix).^7 In this case the number of factors is limited (K ≤ T - 1), and, furthermore, the principal components beyond the first one are inherently unstable out-of-sample. In contrast, factors based on a granular fundamental industry classification^8 are much more ubiquitous (in hundreds), and also stable, as stocks seldom jump industries.
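The rank deficiency described above is easy to verify numerically. Below is a minimal R sketch (purely illustrative; the universe size, lookback, and Gaussian returns are made up for the example) showing that with T < N the sample covariance matrix has numerical rank at most T - 1 and hence cannot be inverted.

    set.seed(0)
    n.stocks <- 50     # N: number of stocks (hypothetical)
    n.obs <- 20        # T: number of observations (hypothetical)
    ret <- matrix(rnorm(n.obs * n.stocks), nrow = n.obs)   # T x N simulated returns
    cov.mat <- cov(ret)                                    # N x N sample covariance matrix C_ij
    ev <- eigen(cov.mat, symmetric = TRUE, only.values = TRUE)$values
    sum(ev > 1e-10 * max(ev))    # numerical rank: at most T - 1 (19 here), well below N
    ## solve(cov.mat) fails: the matrix is singular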
Heterotic risk models [8] based on such industry classifications sizably outperform statistical risk models.^9 Another alternative is to replace the fundamental industry classification in the heterotic risk model construction by a statistical industry classification based on clustering (using machine learning techniques) the return time series data [11],^10 without any reference to a fundamental industry classification. Risk models based on statistical industry classifications outperform statistical risk models but underperform risk models based on fundamental industry classifications [11].

^3 Similar issues are also present in other practical applications unrelated to trading or finance.

^4 For a general discussion, see, e.g., [3]. For explicit implementations (including source code), see, e.g., [10], [12].

^5 This does not solve all problems, however. Thus, unless K < T, the sample factor covariance matrix is still singular (albeit the model covariance matrix Γ_ij that replaces C_ij need not be). Furthermore, the out-of-sample instability is still present in sample factor correlations. This can be circumvented via the heterotic risk model construction [8]; see below.

^6 See [12], which gives complete source code, and references therein.

^7 The (often misconstrued) "shrinkage" method [13] is nothing but a special type of statistical risk models; see [9], [12] for details.

^8 E.g., BICS (Bloomberg Industry Classification System), GICS (Global Industry Classification Standard), ICB (Industry Classification Benchmark), SIC (Standard Industrial Classification), etc.

^9 In the heterotic risk model construction the sample factor covariance matrix at the most granular level in the industry classification typically would be singular. However, this is rectified by modeling the factor covariance matrix by another factor model with factors based on the next-less-granular level in the industry classification, and this process of dimensional reduction is repeated until the resultant factor covariance matrix is small enough to be nonsingular and sufficiently stable out-of-sample [8], [10]. Here one can also include non-industry style factors. However, their number is limited (especially for short horizons) and, contrary to an apparent common misconception, style factors generally are poor proxies for modeling correlations and add little to no value [10].

^10 Such statistical industry classifications can be multilevel and granular.

^11 In fact, instead of the arithmetic mean, here we can more generally consider a weighted average with some positive weights w_m (see below). Also, in this paper the C_ij^(m) are nonsingular.

In this paper we discuss a different approach to building a risk model using machine learning techniques. The idea is simple. A sample covariance matrix C_ij is singular (assuming T ≪ N), but it is semi-positive definite. Imagine that we could compute a large number M of "samplings" of C_ij, call them C_ij^(m), m = 1, ..., M, where each "sampling" is semi-positive definite. Consider their mean^11

$$ \Gamma_{ij} = \frac{1}{M} \sum_{m=1}^{M} C^{(m)}_{ij} \qquad (1) $$

By construction Γ_ij is semi-positive definite. In fact, assuming the C_ij^(m) are all (sizably) different from each other, Γ_ij generically will be positive definite and invertible (for large enough M). So, the idea is sound, at least superficially, but the question is, what should these "samplings" C_ij^(m) be?

Note that each element of the sample covariance matrix C_ij (i ≠ j) only depends on the time series of the corresponding two stock returns R_i(t) and R_j(t), and not on the universe of stocks, so any cross-sectional "samplings" cannot be based on sample covariance matrices. In principle, serial "samplings" could be considered if a long history were available. However, here we assume that our lookback is limited, be it due to a short history that is available, or, more prosaically, due to the fact that data from a while back is not pertinent to forecasting risk for short horizons as market conditions change.

A simple way around this is to consider cross-sectional "samplings" C_ij^(m) that are not sample covariance matrices but are already dimensionally reduced, even though they do not have to be invertible.
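To see why the averaging in Eq. (1) generically restores invertibility, consider the following small R sketch. The rank-5 random "samplings" are purely illustrative stand-ins (the actual samplings used in this paper are constructed in Section 2); the point is only that averaging many different low-rank semi-positive definite matrices quickly produces a positive definite matrix.

    set.seed(1)
    n <- 100             # N
    m.samp <- 50         # M: number of "samplings"
    gamma <- matrix(0, n, n)
    for (m in 1:m.samp) {
      x <- matrix(rnorm(5 * n), nrow = 5)        # each "sampling" has rank at most 5
      gamma <- gamma + crossprod(x) / m.samp     # Eq. (1) with equal weights
    }
    min(eigen(gamma, symmetric = TRUE, only.values = TRUE)$values)   # strictly positive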
Thus, given a clustering of N stocks into K clusters, we can build a multifactor risk model, e.g., via an incomplete heterotic construction (see below). Different clusterings then produce different "samplings" C_ij^(m), which we average via Eq. (1) to obtain a positive definite Γ_ij, which is not a factor model. However, as usual, the devil is in the detail, which we discuss in Section 2. E.g., the matrix (1) can have nearly degenerate or small eigenvalues, which requires further tweaking Γ_ij to avert, e.g., undesirable effects on optimization. In Section 3 we discuss backtests to compare the machine learning risk models of this paper to statistical risk models, and heterotic risk models based on fundamental industry classification and statistical industry classification. We briefly conclude in Section 4. Appendix A provides R source code^12 for machine learning risk models, and some important legalese relating to this code is relegated to Appendix B.

^12 The code in Appendix A is not written to be "fancy" or optimized for speed or otherwise.

2 Heterotic Construction and Sampling

So, we have time series of returns (say, daily close-to-close returns) R_is for our N stocks (i = 1, ..., N, s = 1, ..., T, and s = 1 corresponds to the most recent time in the time series). Let us assume that we have a clustering of our N stocks into K clusters, where K is sizably smaller than N, and each stock belongs to one and only one cluster. Let the clusters be labeled by A = 1, ..., K. So, we have a map

$$ G : \{1, \dots, N\} \mapsto \{1, \dots, K\} \qquad (2) $$

Following [8], we can model the sample correlation matrix Ψ_ij = C_ij / (σ_i σ_j) (here σ_i^2 = C_ii are the sample variances) via a factor model:

$$ \tilde\Psi_{ij} = \xi_i^2\, \delta_{ij} + \sum_{A,B=1}^{K} \Omega_{iA}\, \Phi_{AB}\, \Omega_{jB} = \xi_i^2\, \delta_{ij} + U_i\, U_j\, \Phi_{G(i),G(j)} \qquad (3) $$

$$ \Omega_{iA} = U_i\, \delta_{G(i),A} \qquad (4) $$

$$ \xi_i^2 = 1 - \lambda^{(G(i))}\, U_i^2 \qquad (5) $$

$$ \Phi_{AB} = \sum_{i \in J(A)} \sum_{j \in J(B)} U_i\, \Psi_{ij}\, U_j \qquad (6) $$

Here the N_A components of U_i for i ∈ J(A) are given by the first principal component of the N_A × N_A matrix [Ψ(A)]_ij = Ψ_ij, i, j ∈ J(A), where J(A) = {i | G(i) = A} is the set of the values of the index i corresponding to the cluster labeled by A, and N_A = |J(A)| is the number of such i. Also, λ^(A) is the largest eigenvalue (corresponding to the first principal component) of the matrix [Ψ(A)]_ij. The matrix Ω_iA is the factor loadings matrix, ξ_i^2 is the specific variance, and the factor covariance matrix Φ_AB has the property that Φ_AA = λ^(A). By construction, $\tilde\Psi_{ii}$ = 1, and the matrix $\tilde\Psi_{ij}$ is positive-definite. However, Φ_AB is singular unless K ≤ T - 1.
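As an illustration of Eqs. (3)-(6), here is a minimal R sketch of the incomplete heterotic construction for a single given clustering. This is not the paper's Appendix A code: the function name het.corr() and its interface are made up for the example, ret is assumed to be a T x N matrix of returns, g is assumed to be the cluster map G(i) with values in 1, ..., K, and singleton clusters are not given any special treatment.

    het.corr <- function(ret, g) {
      psi <- cor(ret)                        # sample correlation matrix Psi_ij
      n <- ncol(ret)
      k <- max(g)
      u <- rep(NA_real_, n)                  # U_i: first principal component within each cluster
      lam <- rep(NA_real_, k)                # lambda^(A): largest eigenvalue of [Psi(A)]_ij
      for (a in 1:k) {
        ix <- which(g == a)                  # J(A)
        es <- eigen(psi[ix, ix, drop = FALSE], symmetric = TRUE)
        lam[a] <- es$values[1]
        u[ix] <- es$vectors[, 1]
      }
      xi2 <- 1 - lam[g] * u^2                # Eq. (5): specific variances
      phi <- matrix(0, k, k)                 # Eq. (6): factor covariance matrix
      for (a in 1:k)
        for (b in 1:k) {
          ia <- which(g == a)
          ib <- which(g == b)
          phi[a, b] <- sum(outer(u[ia], u[ib]) * psi[ia, ib, drop = FALSE])
        }
      diag(xi2) + (u %o% u) * phi[g, g]      # Eq. (3): model correlation matrix
    }

The returned matrix has a unit diagonal, and its factor covariance matrix satisfies Φ_AA = λ^(A), as noted above.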
This is because the rank of Ψ_ij is (at most) T - 1. Let $V_i^{(a)}$ be the principal components of Ψ_ij with the corresponding eigenvalues λ^(a) ordered decreasingly (a = 1, ..., N). More precisely, at most T - 1 eigenvalues λ^(a), a = 1, ..., T - 1, are nonzero, and the others vanish. So, we have

$$ \Phi_{AB} = \sum_{a=1}^{T-1} \lambda^{(a)}\, \tilde U_A^{(a)}\, \tilde U_B^{(a)} \qquad (7) $$

$$ \tilde U_A^{(a)} = \sum_{i \in J(A)} U_i\, V_i^{(a)} \qquad (8) $$

So, the rank of Φ_AB is (at most) T - 1, and the above incomplete heterotic construction provides a particular regularization of the statistical risk model construction based on principal components. In the complete heterotic construction Φ_AB itself is modeled via another factor model, and this nested "Russian-doll" embedding is continued until at the final step the factor covariance matrix (which gets smaller and smaller at each step) is nonsingular (and sufficiently stable out-of-sample).

2.1 Sampling via Clustering

However, there is another way, which is what we refer to as "machine learning risk models" here. Suppose we have M different clusterings. Let $\tilde\Psi_{ij}^{(m)}$ be the model correlation matrix (3) for the m-th clustering (m = 1, ..., M). Then we can construct a model correlation matrix as a weighted sum

$$ \tilde\Psi_{ij} = \sum_{m=1}^{M} w_m\, \tilde\Psi_{ij}^{(m)} \qquad (9) $$

$$ \sum_{m=1}^{M} w_m = 1 \qquad (10) $$

The simplest choice for the weights is to have equal weighting: w_m = 1/M. More generally, so long as the weights w_m are positive, the model correlation matrix $\tilde\Psi_{ij}$ is positive-definite. (Also, by construction $\tilde\Psi_{ii}$ = 1.) However, combining a large number M of "samplings" $\tilde\Psi_{ij}^{(m)}$ accomplishes something else: each "sampling" provides a particular regularization of the sample correlation matrix, and combining such samplings covers many more directions in the risk space than each individual "sampling". This is because the $\tilde U_A^{(a)}$ in Eq. (7) are different for different clusterings.

2.2 K-means

We can use k-means [2], [14], [15], [4], [5], [16], [21] for our clusterings. Since k-means is nondeterministic, it automatically produces a different "sampling" with each run. The idea behind k-means is to partition N observations into K clusters such that each observation belongs to the cluster with the nearest mean. Each of the N observations is actually a d-vector, so we have an N × d matrix X_is, i = 1, ..., N, s = 1, ..., d. Let C_a, a = 1, ..., K, be the K clusters, i.e., the sets of values of the index i assigned to cluster a. Then k-means attempts to minimize

$$ g = \sum_{a=1}^{K} \sum_{i \in C_a} \sum_{s=1}^{d} (X_{is} - Y_{as})^2 \qquad (11) $$

where

$$ Y_{as} = \frac{1}{n_a} \sum_{i \in C_a} X_{is} \qquad (12) $$

are the cluster centers (i.e., cross-sectional means),^13 and n_a = |C_a| is the number of elements in the cluster C_a. In Eq. (11) the measure of "closeness" is chosen to be the Euclidean distance between points in R^d, albeit other measures are possible.^14

^13 Throughout this paper "cross-sectional" refers to "over the index i".

^14 E.g., the Manhattan distance, cosine similarity, etc.

2.3 What to Cluster?

Here we are not going to reinvent the wheel. We will simply use the prescription of [11]. Basically, we can cluster the returns, i.e., take X_is = R_is (then d = T). However, stock volatility is highly variable, and its cross-sectional distribution is not even quasi-normal but highly skewed, with a long tail at the higher end: it is roughly log-normal. Clustering returns does not take this skewness into account, and inadvertently we might be clustering together returns that are not at all highly correlated, solely due to the skewed volatility factor.
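Before turning to the cure for this skewness (next), here is a hypothetical R sketch of the sampling-and-averaging step of Subsections 2.1 and 2.2. It uses R's built-in kmeans() and the het.corr() helper from the sketch above; the function name avg.corr() is made up, x stands for the N × d matrix X_is to be clustered (its construction is the subject of this subsection), and equal weights w_m = 1/M are hard-coded.

    avg.corr <- function(ret, x, k, m.samp = 100) {
      n <- ncol(ret)
      psi.avg <- matrix(0, n, n)
      for (m in 1:m.samp) {
        # each k-means run is nondeterministic, hence yields a different clustering/"sampling"
        g <- kmeans(x, centers = k, iter.max = 100)$cluster    # minimizes Eqs. (11)-(12)
        psi.avg <- psi.avg + het.corr(ret, g) / m.samp         # Eqs. (9)-(10) with w_m = 1/M
      }
      psi.avg
    }

Any positive weights satisfying Eq. (10) would work here; equal weighting is simply the simplest choice.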
A simple "machine learning" solution is to cluster the normalized returns $\tilde R_{is}$ = R_is / σ_i, where σ_i^2 = Var(R_is) is the serial variance (σ_i^2 = C_ii). However, as was discussed in detail in [11], this choice would also be suboptimal, and this is where quant trading experience and intuition trumps generic machine learning "lore". It is more optimal to cluster $\hat R_{is}$ = R_is / σ_i^2 (see [11] for a detailed explanation). A potential practical hiccup with this is that if some stocks have very low volatilities, we could have large $\hat R_{is}$ for such stocks. To avoid any potential issues with computations, we can "smooth" this out via "Winsorization" of sorts (MAD = mean absolute deviation):^15

$$ \hat R_{is} = \frac{R_{is}}{\sigma_i\, u_i} \qquad (13) $$

$$ u_i = \frac{\sigma_i}{v} \qquad (14) $$

$$ v = \exp(\mathrm{Median}(\ln(\sigma_i)) - 3\, \mathrm{MAD}(\ln(\sigma_i))) \qquad (15) $$

and for all u_i < 1 we set u_i = 1. This is the definition of $\hat R_{is}$ that is used in the source code internally. Furthermore, Median(·) and MAD(·) above are cross-sectional.

^15 This is one possible tweak. Others produce similar results.

2.4 A Tweak

The number of clusters K is a hyperparameter. In principle, it can be fixed by adapting the methods discussed in [11]. However, in the context of this paper, we will simply keep it as a hyperparameter and test what we get for its various values. As K increases, in some cases it is possible to get relatively small eigenvalues in the model correlation matrix $\tilde\Psi_{ij}$, or nearly degenerate eigenvalues. This can cause convergence issues in optimization with bounds (see below). To circumvent this, we can slightly deform $\tilde\Psi_{ij}$ for such values of K.

Here is a simple method that deals with both of the aforesaid issues at once. To understand this method, it is helpful to look at the eigenvalue graphs given in Figures 1, 2, 3, 4, which are based on a typical data set of daily returns for N = 2000 stocks and T = 21 trading days. These graphs plot the eigenvalues for a single "sampling" $\tilde\Psi_{ij}^{(m)}$, as well as $\tilde\Psi_{ij}$ based on averaging M = 100 "samplings" (with equal weights), for K = 150 and K = 40 (K is the number of clusters). Unsurprisingly, there are some small eigenvalues. However, their fraction is small. Furthermore, these small eigenvalues get even smaller for larger values of K, but increase when averaging over multiple "samplings", which also smoothes out the eigenvalue graph structure.

What we wish to do is to deform the matrix $\tilde\Psi_{ij}$ by tweaking the small eigenvalues at the tail. We need to define what we mean by the "tail", i.e., which eigenvalues to include in it. There are many ways of doing this, some are simpler, some are more convoluted. We use a method based on eRank or effective rank [19], which can be more generally defined for any subset S of the eigenvalues of a matrix, which (for our purposes here) is assumed to be symmetric and semi-positive-definite. Let

$$ \mathrm{eRank}(S) = \exp(H) \qquad (16) $$

$$ H = -\sum_{a=1}^{L} p_a\, \ln(p_a) \qquad (17) $$

$$ p_a = \frac{\lambda^{(a)}}{\sum_{b=1}^{L} \lambda^{(b)}} \qquad (18) $$

where λ^(a) are the L positive eigenvalues in the subset S, and H has the meaning of the (Shannon a.k.a. spectral) entropy [1], [22].

If we take S to be the full set of N eigenvalues of $\tilde\Psi_{ij}$, then the meaning of eRank(S) is that it is a measure of the effective dimensionality of the matrix $\tilde\Psi_{ij}$. However, this is not what we need to do for our purposes here. This is because the large eigenvalues of $\tilde\Psi_{ij}$ contribute heavily into eRank(S). So, we define S to include all eigenvalues $\tilde\lambda^{(a)}$ (a = 1, ..., N) of $\tilde\Psi_{ij}$ that do not exceed 1: S = {$\tilde\lambda^{(a)}$ | $\tilde\lambda^{(a)}$ ≤ 1}.
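At this point both the clustering input and the eigenvalue subset S are fully specified, so here is a minimal R sketch of the normalization of Eqs. (13)-(15) and of eRank, Eqs. (16)-(18). Assumptions: ret is the T x N returns matrix, psi.avg is the averaged model correlation matrix from the earlier sketches, and "MAD" is read as the mean absolute deviation about the cross-sectional median (one possible reading of the parenthetical above).

    sigma <- apply(ret, 2, sd)                    # serial volatilities sigma_i
    log.s <- log(sigma)
    mad.ls <- mean(abs(log.s - median(log.s)))    # "MAD" under the assumption stated above
    v <- exp(median(log.s) - 3 * mad.ls)          # Eq. (15)
    u <- pmax(sigma / v, 1)                       # Eq. (14), with u_i < 1 set to 1
    x <- t(ret) / (sigma * u)                     # Eq. (13): N x T matrix of R_is / (sigma_i u_i)

    erank <- function(lambda) {                   # Eqs. (16)-(18); lambda: positive eigenvalues in S
      p <- lambda / sum(lambda)
      exp(-sum(p * log(p)))                       # exp of the Shannon (spectral) entropy
    }
    lam <- eigen(psi.avg, symmetric = TRUE, only.values = TRUE)$values
    s.sub <- lam[lam <= 1 & lam > 0]              # the subset S of eigenvalues not exceeding 1
    erank(s.sub)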
Then we define (here floor(·) = ⌊·⌋ can be replaced by round(·))

$$ n_* = |S| - \mathrm{floor}(\mathrm{eRank}(S)) \qquad (19) $$

So, the tail is now defined as the set S_* of the n_* smallest eigenvalues $\tilde\lambda^{(a)}$ of $\tilde\Psi_{ij}$. We can now deform $\tilde\Psi_{ij}$ by (i) replacing the n_* tail eigenvalues in S_* by $\tilde\lambda_*$ = max(S_*), and (ii) then correcting for the fact that the so-deformed matrix no longer has a unit diagonal. The resulting matrix $\hat\Psi_{ij}$ is given by:

$$ \hat\Psi_{ij} = \sum_{a=1}^{N-n_*} \tilde\lambda^{(a)}\, \tilde V_i^{(a)}\, \tilde V_j^{(a)} + z_i\, z_j \sum_{a=N-n_*+1}^{N} \tilde\lambda_*\, \tilde V_i^{(a)}\, \tilde V_j^{(a)} \qquad (20) $$

$$ z_i^2 = y_i^{-2} \sum_{a=N-n_*+1}^{N} \tilde\lambda^{(a)}\, [\tilde V_i^{(a)}]^2 \qquad (21) $$

$$ y_i^2 = \sum_{a=N-n_*+1}^{N} \tilde\lambda_*\, [\tilde V_i^{(a)}]^2 \qquad (22) $$

Here $\tilde V_i^{(a)}$ are the principal components of $\tilde\Psi_{ij}$. This method is similar to that of [18]. The key difference is that in [18] the "adjustments" z_i are applied to all principal components, while here they are only applied to the tail principal components (for which the eigenvalues are deformed). This results in a smaller distortion of the original matrix. The resultant deformed matrix $\hat\Psi_{ij}$ has improved tail behavior (see Figure 5). Another bonus is that, while superficially we only modify the tail, the eigenvectors of the deformed matrix $\hat\Psi_{ij}$ are no longer $\tilde V_i^{(a)}$ for all values of a, and the eigenvalues outside of the tail are also deformed. In particular, in some cases there can be some (typically, a few) nearly degenerate^16 eigenvalues $\tilde\lambda^{(a)}$ in the densely populated region of $\tilde\lambda^{(a)}$ (where they are of order 1), i.e., outside of the tail and the higher-end upward-sloping "neck". The deformation splits such nearly degenerate eigenvalues, which is a welcome bonus. Indeed, the issue with nearly degenerate eigenvalues is that they can adversely affect convergence of the bounded optimization (see below), as the corresponding directions in the risk space have almost identical risk profiles.

^16 They are not degenerate even within the machine precision. However, they are spaced much more closely than other eigenvalues (on average, that is).

3 Backtests

Here we discuss some backtests. We wish to see how our machine learning risk models compare with other constructions (see below). For this comparison, we run our backtests exactly as in [8], except that the model covariance matrix is built as above (as opposed to the full heterotic risk model construction of [8]). To facilitate the comparisons, the historical data we use in our backtests here is the same as in [8]^17 and is described in detail in Subsections 6.2 and 6.3 thereof. The trading universe selection is described in Subsection 6.2 of [8]. We assume that i) the portfolio is established at the open with fills at the open prices; and ii) it is liquidated at the close on the same day (so this is a purely intraday strategy) with fills at the close prices (see [6] for pertinent details). We include strict trading bounds

$$ |H_i| \le 0.01\, A_i \qquad (23) $$

Here H_i are the portfolio stock holdings (i = 1, ..., N), and A_i are the corresponding historical average daily dollar volumes computed as in Subsection 6.2 of [8]. We further impose strict dollar-neutrality on the portfolio, so that

$$ \sum_{i=1}^{N} H_i = 0 \qquad (24) $$

The total investment level in our backtests here is I = $20M (i.e., $10M long and $10M short), same as in [8]. For the Sharpe ratio optimization with bounds we use the R function bopt.calc.opt() in Appendix C of [8].

^17 The same data is also used in [11], [12].
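Putting the pieces together, here is a hypothetical R sketch of the tail deformation of Subsection 2.4, Eqs. (19)-(22), followed by the conversion of the deformed correlation matrix into the model covariance matrix that would be fed to the optimizer. It reuses the erank() helper and the sigma and psi.avg objects from the earlier sketches; the function name deform.tail() is made up, and scaling by the sample variances is assumed to be the intended covariance conversion, as in the heterotic construction of [8].

    deform.tail <- function(psi.mod) {
      es <- eigen(psi.mod, symmetric = TRUE)            # eigenvalues come out in decreasing order
      lam <- es$values
      v <- es$vectors                                   # principal components tilde-V
      n <- length(lam)
      s.sub <- lam[lam <= 1 & lam > 0]
      n.star <- length(s.sub) - floor(erank(s.sub))     # Eq. (19)
      if (n.star < 1) return(psi.mod)                   # nothing to deform
      tail.ix <- (n - n.star + 1):n
      head.ix <- 1:(n - n.star)
      lam.star <- max(lam[tail.ix])                     # lambda_* = max(S_*)
      v.tail <- v[, tail.ix, drop = FALSE]
      y2 <- lam.star * rowSums(v.tail^2)                               # Eq. (22)
      z <- sqrt(rowSums(sweep(v.tail^2, 2, lam[tail.ix], "*")) / y2)   # Eq. (21)
      v.head <- v[, head.ix, drop = FALSE]
      psi.head <- v.head %*% (lam[head.ix] * t(v.head))                # untouched part of Eq. (20)
      psi.tail <- lam.star * tcrossprod(z * v.tail)                    # deformed tail part of Eq. (20)
      psi.head + psi.tail                               # unit diagonal preserved by construction
    }

    psi.hat <- deform.tail(psi.avg)
    gamma <- outer(sigma, sigma) * psi.hat              # model covariance matrix (assumed conversion)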
Table 1 gives summaries of the eigenvalues for various values of K. Considering that the algorithm is nondeterministic, the results are stable against reruns. Table 2 summarizes the backtest results.

Here we can wonder whether the following would produce an improvement. Suppose we start from the sample correlation matrix Ψ_ij and run the algorithm, which produces the model correlation matrix $\tilde\Psi_{ij}$. Suppose now we rerun the algorithm (with the same number of "samplings" M) but use $\tilde\Psi_{ij}$ instead of Ψ_ij in Eq. (6) to build "sampling" correlation matrices $\tilde\Psi_{ij}^{(m)}$. In fact, we can do this iteratively, over and over again, which we refer to as multiple iterations in Table 3. The results in Table 3 indicate that we do get some improvement on the second iteration, but not beyond. Let us note that for K ≥ 100 with iterations (see Table 3) the method of Subsection 2.4 was insufficient to deal with the issues with small and nearly degenerate eigenvalues, so we used the full method of [18] instead (see Subsection 2.4 and Table 3 for details), which distorts the model correlation matrix more (and this affects performance).

4 Concluding Remarks

So, the machine learning risk models we discuss in this paper outperform statistical risk models [12]. They have performance essentially similar to the heterotic risk models based on statistical industry classifications using multilevel clustering [11]. However, here we have single-level clustering, and there is no