An Intersection–Union Test for the Sharpe Ratio

An intersection–union test for supporting the hypothesis that a given investment strategy is optimal among a set of alternatives is presented. It compares the Sharpe ratio of the benchmark with that of each other strategy. The intersection–union test takes serial dependence into account and does not presume that asset returns are multivariate normally distributed. An empirical study based on the G–7 countries demonstrates that it is hard to find significant results due to the lack of data, which confirms a general observation in empirical finance.

equivalent or Sharpe ratio, with the performance of each other strategy that is taken into consideration. Let $d > 1$ be the number of investment strategies and $i \in \{1, 2, \ldots, d\}$ be our benchmark. We may suppose that $i = 1$ without loss of generality. Furthermore, let $\eta = (\eta_1, \eta_2, \ldots, \eta_d) \in \mathbb{R}^d$ be a (column) vector of performance measures. Now, first of all, consider the hypotheses

$$H_{0\wedge}\colon \eta_1 \geq \eta_2, \eta_3, \ldots, \eta_d \quad \text{vs.} \quad H_{1\wedge}\colon \neg H_{0\wedge}.$$

That is, $H_{0\wedge}$ states that our benchmark is optimal. After performing a (joint) hypothesis test, we could reject the null hypothesis $H_{0\wedge}$ in favor of the alternative hypothesis $H_{1\wedge}$. In this case, we could say that there exists some strategy that is better than our benchmark, but not which one.² By contrast, if we are not able to reject $H_{0\wedge}$, we must not conclude that our benchmark is optimal. A well-known method for testing the intersection of a number of single null hypotheses is studied by Roy (1953), which is called a union-intersection test (Sen and Silvapulle 2002). However, union-intersection tests are not the object of this work.
By contrast, I consider here the following hypotheses:

$$H_{0\vee}\colon \neg H_{1\vee} \quad \text{vs.} \quad H_{1\vee}\colon \eta_1 \geq \eta_2, \eta_3, \ldots, \eta_d.$$

Now, the joint null hypothesis $H_{0\vee}$ asserts that our benchmark is not optimal. If we are able to reject $H_{0\vee}$, our benchmark turns out to be (significantly) optimal among all alternatives. By contrast, in the case in which we cannot reject the null hypothesis, we must not conclude that our benchmark is outperformed by some other strategy. Applying a test for $H_{0\vee}$ might be the primary goal both in theoretical and in practical applications of portfolio theory.
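The former test can be rewritten, equivalently, as

$$H_{0\wedge}\colon \bigwedge_{i=2}^{d} \eta_1 \geq \eta_i \quad \text{vs.} \quad H_{1\wedge}\colon \bigvee_{i=2}^{d} \eta_1 < \eta_i,$$

whereas the latter test reads

$$H_{0\vee}\colon \bigvee_{i=2}^{d} \eta_1 < \eta_i \quad \text{vs.} \quad H_{1\vee}\colon \bigwedge_{i=2}^{d} \eta_1 \geq \eta_i.$$

This explains the chosen symbols for the null and the alternative hypothesis. However, in the following, I focus on the latter test and write only "$H_0$" and "$H_1$" for notational convenience. The test proposed in this work is very simple: The null hypothesis is rejected if and only if we can reject each single hypothesis $H_{0i}\colon \eta_1 < \eta_i$ in favor of $H_{1i}\colon \eta_1 \geq \eta_i$. Let $A_i$ be the event that $H_{0i}$ is rejected. The probability that all single null hypotheses are rejected amounts to $P\big(\bigcap_{i=2}^{d} A_i\big)$, which cannot exceed $P(A_i)$ for any single $i$. If $H_{0i}$ is true for some $i \in \{2, 3, \ldots, d\}$, we must have that $P(A_i) \leq \alpha_i$, where $\alpha_i \in (0, 1)$ denotes the significance level of the (single) hypothesis test for $H_{0i}$. Under $H_0$, at least one single null hypothesis must be true and thus we have that

$$P\Big(\bigcap_{i=2}^{d} A_i\Big) \leq \max\{\alpha_2, \alpha_3, \ldots, \alpha_d\}.$$

² In order to identify the outperforming strategies, we would have to apply a multiple test. For more details on that topic, see Frahm et al. (2012) as well as Romano and Wolf (2005).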
Hence, the proposed test for $H_0$ has level $\alpha \in (0, 1)$ if $\alpha_2, \alpha_3, \ldots, \alpha_d \leq \alpha$. The least conservative choice is $\alpha_2 = \alpha_3 = \ldots = \alpha_d = \alpha$, in which case $H_0$ is rejected if and only if the largest p-value of all single tests falls below $\alpha$. Throughout this work, I assume that each single test has level $\alpha$.
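The decision rule just described can be sketched in a few lines of Python. This is only an illustration under the assumption that the p-values of the single tests are already available; the function name `intersection_union_test` and the p-values in the usage lines are hypothetical and not taken from the empirical study.

```python
def intersection_union_test(p_values, alpha=0.05):
    """Reject the joint null H_0 iff every single null H_0i is rejected.

    p_values: p-values of the single tests H_0i (i = 2, ..., d).
    Returns (reject_h0, largest_p_value).
    """
    p_values = list(p_values)
    if not p_values:
        raise ValueError("at least one alternative strategy is required")
    if any(not 0.0 <= p <= 1.0 for p in p_values):
        raise ValueError("p-values must lie in [0, 1]")
    largest = max(p_values)
    # Every single test is run at the same level alpha. No Bonferroni
    # correction is applied because all single nulls must be rejected.
    return largest < alpha, largest


# Hypothetical p-values for d = 4 strategies (benchmark vs. 3 alternatives).
print(intersection_union_test([0.01, 0.03, 0.04]))  # (True, 0.04)
print(intersection_union_test([0.01, 0.20, 0.04]))  # (False, 0.2)
```

A single large p-value is enough to prevent a rejection of $H_0$, which is why the procedure becomes more demanding as the number of alternatives grows.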
At first glance, this testing procedure might seem to suffer from a lack of power because it does not take the dependence structure of the single test statistics into account. Nonetheless, it is a likelihood-ratio test that is commonly referred to as an intersection-union test (Berger 1997). Thus, it inherits the general asymptotic optimality properties of likelihood-ratio tests that are known from likelihood theory (see, e.g., van der Vaart 1998, chp. 15 and 16). Another striking feature might be the fact that the overall test has the same significance level as each single test. This is because $H_0$ is rejected only if all single tests lead to a rejection and so we need no Bonferroni correction in order to preserve the significance level of each single test. For more details on that topic, see Berger (1997) as well as Sen and Silvapulle (2002).
In this work, I present an intersection-union test in order to decide whether a given investment strategy is optimal among a set of alternative strategies. This is done with respect to the Sharpe ratio. Joint and multiple tests for the Sharpe ratio are applied also in Frahm et al. (2012) by using a stationary block-bootstrap procedure. By contrast, I provide here analytical results. I refrain from assuming that asset returns are serially independent and multivariate normally distributed. Each single test represents a (nonparametric) generalization of the Jobson-Korkie test (Jobson and Korkie 1981; Memmel 2003). Finally, I apply the intersection-union test to historical data.
The same problem is addressed by Ledoit and Wolf (2008) as well as Schmid and Schmidt (2009) in a bivariate setting. However, the intersection-union test presented here is motivated by a multivariate point of view, i.e., $d > 2$, and its primary goal is to avoid any kind of selection bias that can occur when testing a joint hypothesis. Thus, it cannot be said that the intersection-union test is "better" or "worse" than the tests proposed by Ledoit and Wolf (2008). It is hardly possible to provide any general answer to this question at all (Ledoit and Wolf 2008, sct. 4 and 5). Instead, I try to fill a gap between Frahm et al. (2012) and Ledoit and Wolf (2008): (i) I derive closed-form expressions for the standard errors of the test statistics, instead of providing numerical results that have been obtained by bootstrapping, and (ii) I do this for the case $d \geq 2$ but not (only) for $d = 2$.

Gordin's Condition
In the following, "$X_n \to X$" denotes almost sure convergence, whereas "$X_n \rightsquigarrow X$" stands for convergence in distribution. Let $P_t > 0$ be the price of some asset or, more generally, the value of some strategy at time $t \in \mathbb{Z}$. Throughout this work, the terms "asset" and "strategy" as well as "price" and "value" are used synonymously. The asset return after Period $t$ is defined as $R_t := P_t/P_{t-1} - 1$. I assume that the return process $\{R_t\}$ is (strongly) stationary with expected return $\mu := E(R_t)$ and variance $\sigma^2 := \mathrm{Var}(R_t) < \infty$. The process $\{R_t\}$ shall also be ergodic. This means that

$$\frac{1}{n}\sum_{t=1}^{n} f(R_t) \to E\big(f(R)\big)$$

for every measurable function $f$ with $E(|f(R)|) < \infty$, where the random variable $R$ has the same distribution as each component of $\{R_t\}$. This guarantees that every finite moment of $R$ can be consistently estimated by the corresponding moment estimator. The return process is ergodic if it is mixing (Bradley 2005). More precisely, for all $k, l = 1, 2, \ldots$, the random vector … (Hayashi 2000, p. 101).
The ergodicity of $\{R_t\}$ implies that $\mu_n \to \mu$, where $\mu_n := \frac{1}{n}\sum_{t=1}^{n} R_t$ is the sample mean of $R_1, R_2, \ldots, R_n$. Put another way, the return process satisfies the Strong Law of Large Numbers. In order to preserve the Central Limit Theorem (CLT), i.e., $\sqrt{n}\,(\mu_n - \mu) \rightsquigarrow \mathcal{N}(0, \sigma_L^2)$, we need an additional requirement. This is known as Gordin's condition (Hayashi 2000, p. 402). Let H … and, according to Hayashi (2000, p. 403), we must have that

$$\sigma_L^2 = \sum_{k=-\infty}^{\infty} \Gamma(k) < \infty,$$

where $\Gamma$ is the autocovariance function of $\{R_t\}$ (Hayashi 2000, Proposition 6.10). The number $\sigma_L^2$ is referred to as the large-sample variance of $\{R_t\}$, whereas $\sigma^2$ represents its stationary variance. In the following, I assume that …

The aforementioned requirements can easily be extended to any $d$-dimensional return process (Hayashi 2000, p. 405) and applied to a broad class of standard time-series models. There exist a number of alternative criteria for the CLT, which can be found, e.g., in Brockwell and Davis (1991, p. 213) as well as Hamilton (1994, p. 195). However, to the best of my knowledge, Gordin's condition represents the least restrictive set of assumptions about the serial dependence structure of a stochastic process (Eagleson 1975). In particular, it can be considered a natural generalization of the CLT for martingale difference sequences (Hayashi 2000, p. 106).
It is worth emphasizing that the number of dimensions, $d$, is supposed to be fixed. At least, we have to assume that $n/d \to \infty$. If $n/d$ tends to a finite number, the CLT might become invalid and other interesting issues, which are well-known from random matrix theory, can arise (Frahm and Jaekel 2015). By contrast, if the number of observations relative to the number of strategies is sufficiently large, we may expect that the CLT is satisfied under the aforementioned conditions.
I suppose, without loss of generality, that the risk-free interest rate is constantly zero. That is, I implicitly refer to asset returns in excess of the risk-free interest rate that can be observed at the beginning of each period. The Sharpe ratio $\eta := \mu/\sigma$ (Sharpe 1966) is frequently used as a performance measure both in theory and in practice. In the following section, I present the intersection-union test, which can be applied in order to judge whether a given investment strategy possesses the largest Sharpe ratio among a set of alternatives. This can be done under the quite general assumptions about the return process $\{R_t\}$ mentioned above.

Asymptotic Properties of Sharpe Ratios
In this section, I present some asymptotic properties of Sharpe ratios. The reader can find the derivations in Appendix A. For the sample variance $\sigma_n^2$ of $R_1, R_2, \ldots, R_n$, it holds that

$$\sigma_n^2 \to \sigma^2 \quad \text{and} \quad \sqrt{n}\,\big(\sigma_n^2 - \sigma^2\big) \rightsquigarrow \mathcal{N}\big(0, \tau_L^2\big).$$

This means that $\sigma_n^2$ is a consistent estimator for the stationary variance $\sigma^2$ and $\sqrt{n}\,(\sigma_n^2 - \sigma^2)$ is asymptotically normally distributed with large-sample variance $\tau_L^2$. For assessing the large-sample variance of $\{R_t\}$, i.e., $\sigma_L^2 = \sum_{k=-\infty}^{\infty} \Gamma(k)$, we need to estimate the autocovariance function $\Gamma$. There are many ways to achieve this goal. Usually, one applies either heteroscedasticity-autocorrelation consistent (HAC) inference or some bootstrap procedure (Andrews 1991; Ledoit and Wolf 2008; Politis 2003). A nice comparison between HAC inference and bootstrapping in the context of performance measurement can be found in Ledoit and Wolf (2008).
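Bootstrapping is a very powerful tool, but it can be computationally more intensive than HAC inference. Moreover, sometimes it is not clear whether or not the necessary (mathematical) conditions for the bootstrap are satisfied. The method proposed here, in some sense, bypasses the aforementioned problems. However, HAC estimation can also be somewhat obscure when it comes to choosing the right kernel, bandwidth, etc. For this reason, I keep things as simple as possible, i.e., I choose the box-kernel-type HAC estimator

$$\sigma_{Ln}^2 := \sum_{k=-l}^{l} \Gamma_n(k),$$

where $\Gamma_n$ is the empirical autocovariance function of $\{R_t\}$ with $l \ll n$ (Hayashi 2000, p. 142), i.e.,

$$\Gamma_n(k) := \frac{1}{n}\sum_{t=|k|+1}^{n} (R_t - \mu_n)(R_{t-|k|} - \mu_n).$$

It is a stylized fact of empirical finance that $\Gamma_n(k) \approx \Gamma(k) \approx 0$ for all $k \neq 0$, i.e., asset returns are not significantly autocorrelated, and so we may expect that $\sigma_{Ln}^2 \approx \sigma_n^2$. The large-sample variance of $(R_t - \mu)^2$ is $\tau_L^2$, which can be estimated by applying the same estimator to $(R_t - \mu_n)^2$. Typically, asset returns are conditionally heteroscedastic. This means that, in contrast to $\sigma_L^2$ vs. $\sigma^2$, the large-sample variance $\tau_L^2$ can be significantly larger than the stationary variance $\tau^2$. Gordin's condition guarantees that

$$\sqrt{n}\,\left(\begin{bmatrix} \mu_n \\ \sigma_n^2 \end{bmatrix} - \begin{bmatrix} \mu \\ \sigma^2 \end{bmatrix}\right) \rightsquigarrow \mathcal{N}\left(0, \begin{bmatrix} \sigma_L^2 & \kappa_L \\ \kappa_L & \tau_L^2 \end{bmatrix}\right),$$

where $\kappa_L$ represents the large-sample covariance between $R$ and $(R - \mu)^2$. Due to the so-called "leverage effect" (Black 1976), we can expect that $\kappa_L$ is negative. Moreover, we already know that $\sqrt{n}\,(\mu_n - \mu) \rightsquigarrow \mathcal{N}(0, \sigma_L^2)$ and, by applying the delta method (van der Vaart 1998, Chp. 3), we obtain

$$\sqrt{n}\,(\sigma_n - \sigma) \rightsquigarrow \mathcal{N}\left(0, \frac{\tau_L^2}{4\sigma^2}\right),$$

which can be used in order to calculate the standard error of $\sigma_n$. The Sharpe ratio is estimated by $\eta_n := \mu_n/\sigma_n$ and the delta method leads to

$$\sqrt{n}\,(\eta_n - \eta) \rightsquigarrow \mathcal{N}\left(0, \frac{\sigma_L^2}{\sigma^2} - \eta\,\frac{\kappa_L}{\sigma^3} + \frac{\eta^2}{4}\,\frac{\tau_L^2}{\sigma^4}\right).$$

Schmid and Schmidt (2009) obtain the same large-sample variance of $\{\eta_n\}$ under the assumption that the processes are strongly mixing (Bradley 2005), but that assumption seems to be more restrictive than Gordin's condition.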
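To make the estimation steps concrete, the following Python sketch computes the box-kernel estimates of the large-sample (co-)variances and plugs them into the delta-method variance of the Sharpe ratio. It is a minimal illustration and not the implementation used for the empirical study: the function names are hypothetical, the sample variance is computed with divisor $n$, and the asymptotic variance $\sigma_L^2/\sigma^2 - \eta\,\kappa_L/\sigma^3 + \eta^2\tau_L^2/(4\sigma^4)$ is the expression that results from applying the delta method to $(\mu_n, \sigma_n^2)$.

```python
import math


def autocov(x, y, k):
    """Empirical cross-autocovariance (1/n) * sum_t (x_t - mean_x)(y_{t-k} - mean_y), k >= 0."""
    n = len(x)
    if not 0 <= k < n:
        raise ValueError("lag must satisfy 0 <= k < n")
    mx, my = sum(x) / n, sum(y) / n
    return sum((x[t] - mx) * (y[t - k] - my) for t in range(k, n)) / n


def long_run_cov(x, y, lags):
    """Box-kernel estimate: sum of the cross-autocovariances over k = -lags, ..., lags."""
    total = autocov(x, y, 0)
    for k in range(1, lags + 1):
        # Lag -k of (x, y) equals lag +k of (y, x).
        total += autocov(x, y, k) + autocov(y, x, k)
    return total


def sharpe_ratio_se(returns, lags=12):
    """Return (Sharpe ratio estimate, standard error) for a list of excess returns."""
    n = len(returns)
    mu = sum(returns) / n
    var = autocov(returns, returns, 0)      # sample variance with divisor n
    sigma = math.sqrt(var)
    eta = mu / sigma
    squared_dev = [(r - mu) ** 2 for r in returns]
    sigma2_l = long_run_cov(returns, returns, lags)         # estimate of sigma_L^2
    tau2_l = long_run_cov(squared_dev, squared_dev, lags)   # estimate of tau_L^2
    kappa_l = long_run_cov(returns, squared_dev, lags)      # estimate of kappa_L
    avar = sigma2_l / var - eta * kappa_l / sigma ** 3 + eta ** 2 * tau2_l / (4.0 * var ** 2)
    if avar < 0.0:
        # The box kernel does not guarantee a nonnegative estimate.
        raise ValueError("negative variance estimate; try a different lag length")
    return eta, math.sqrt(avar / n)
```

With `lags=0` the estimates reduce to their stationary counterparts, which gives a quick way to see how much the serial dependence contributes to the standard error.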
To the best of my knowledge, Lo (2002) is the first who analyzes the potential impact of serial dependence when estimating the Sharpe ratio. Mertens (2002) points out that the formula for independent and identically distributed asset returns presented by Lo (2002) is based, implicitly, on the normal-distribution hypothesis. More precisely, he shows that the large-sample variance of $\{\eta_n\}$ is

$$1 + \frac{\eta^2}{2} - \eta\,\gamma_3 + \eta^2\,\frac{\gamma_4 - 3}{4}$$

if the components of $\{R_t\}$ are independent and identically distributed, where

$$\gamma_3 := \frac{E\big((R - \mu)^3\big)}{\sigma^3} \quad \text{and} \quad \gamma_4 := \frac{E\big((R - \mu)^4\big)}{\sigma^4}$$

denote the skewness and the kurtosis of $R_t$, respectively. Lo (2002) presumes that $\gamma_3 = 0$ and $\gamma_4 = 3$, in which case the large-sample variance of $\{\eta_n\}$ is $1 + \eta^2/2$. Some of those results can be found also in Opdyke (2007). However, Ledoit and Wolf (2008) mention that the formula for serially dependent asset returns presented by Opdyke (2007) is wrong because it does not distinguish between large-sample and stationary (co-)variances. One purpose of this work is to clarify the aforementioned misunderstandings.

Suppose, without loss of generality, that we want to compare the Sharpe ratio of Strategy 1 with that of Strategy 2. In Appendix A, the reader can verify that $\sqrt{n}\,(\Delta\eta_n - \Delta\eta)$ is asymptotically normally distributed with mean zero and a large-sample variance that depends on the large-sample variances and covariances of the returns and the squared return deviations of both strategies, with $\Delta\eta_n := \eta_{1n} - \eta_{2n}$ and $\Delta\eta := \eta_1 - \eta_2$. It is worth emphasizing that the benchmark must be chosen before examining the Sharpe ratios. Otherwise, the entire procedure would suffer from a selection bias and then the results derived so far are no longer valid. However, this is not a serious drawback: If our choice of the benchmark is based on historical data, we can simply apply the test out of sample.
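A single test for the Sharpe-ratio difference can be sketched as follows. The code is an illustration under stated assumptions rather than a transcription of Appendix A: it applies the delta method to $\Delta\eta = \mu_1/\sigma_1 - \mu_2/\sigma_2$, viewed as a function of $(\mu_1, \sigma_1^2, \mu_2, \sigma_2^2)$, and estimates the required $4 \times 4$ large-sample covariance matrix with the box kernel. The function names, the divisor $n$ in the sample variances, and the one-sided p-value $1 - \Phi(t)$ are choices made for this sketch.

```python
import math
from statistics import NormalDist


def long_run_cov(x, y, lags):
    """Box-kernel estimate: sum of empirical cross-autocovariances over k = -lags, ..., lags."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    dx = [v - mx for v in x]
    dy = [v - my for v in y]
    total = 0.0
    for k in range(-lags, lags + 1):
        if k >= 0:
            total += sum(dx[t] * dy[t - k] for t in range(k, n)) / n
        else:
            total += sum(dx[t + k] * dy[t] for t in range(-k, n)) / n
    return total


def sharpe_difference_test(r1, r2, lags=12):
    """Test H_0: eta_1 < eta_2 against H_1: eta_1 >= eta_2 (benchmark = strategy 1).

    Returns (difference, standard error, t-statistic, one-sided p-value).
    """
    n = len(r1)
    if len(r2) != n or n <= lags:
        raise ValueError("need two series of equal length n > lags")
    mu1, mu2 = sum(r1) / n, sum(r2) / n
    sq1 = [(v - mu1) ** 2 for v in r1]
    sq2 = [(v - mu2) ** 2 for v in r2]
    var1, var2 = sum(sq1) / n, sum(sq2) / n
    eta1, eta2 = mu1 / math.sqrt(var1), mu2 / math.sqrt(var2)
    # Gradient of mu1/sigma1 - mu2/sigma2 with respect to (mu1, var1, mu2, var2).
    grad = [1.0 / math.sqrt(var1), -eta1 / (2.0 * var1),
            -1.0 / math.sqrt(var2), eta2 / (2.0 * var2)]
    series = [r1, sq1, r2, sq2]
    # Quadratic form grad' * Sigma_L * grad with the estimated long-run covariances.
    avar = sum(grad[i] * grad[j] * long_run_cov(series[i], series[j], lags)
               for i in range(4) for j in range(4))
    if avar <= 0.0:
        raise ValueError("nonpositive variance estimate; try a different lag length")
    se = math.sqrt(avar / n)
    t_stat = (eta1 - eta2) / se
    # Large positive t-statistics speak in favor of the benchmark.
    p_value = 1.0 - NormalDist().cdf(t_stat)
    return eta1 - eta2, se, t_stat, p_value
```

Running this function for the benchmark against each alternative yields the single p-values that enter the intersection-union test.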
As already mentioned at the end of Section 1, the given result represents a nonparametric generalization of the Jobson-Korkie test (Jobson and Korkie 1981), which is frequently used in finance. The latter is based on the assumption that asset returns are serially independent and multivariate normally distributed. In this special case, it follows that

$$\sqrt{n}\,(\Delta\eta_n - \Delta\eta) \rightsquigarrow \mathcal{N}\left(0,\; 2\,(1 - \rho_{12}) + \frac{\eta_1^2 + \eta_2^2 - 2\,\eta_1\eta_2\,\rho_{12}^2}{2}\right),$$

where $\rho_{12} := \sigma_{12}/(\sigma_1\sigma_2)$ is the linear correlation coefficient between the return on Strategy 1 and the return on Strategy 2. This expression for the large-sample variance of $\{\Delta\eta_n\}$ corrects a typographical error made by Jobson and Korkie (1981), which is observed by Memmel (2003).
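For comparison, the standard errors under the Jobson-Korkie assumption can be computed in closed form. The sketch below uses the variance $2(1 - \rho_{12}) + (\eta_1^2 + \eta_2^2 - 2\eta_1\eta_2\rho_{12}^2)/2$ for the difference, i.e., the corrected expression attributed to Memmel (2003), and $1 + \eta^2/2$ for a single Sharpe ratio. The function names and the inputs in the usage line are hypothetical.

```python
import math


def jk_sharpe_se(eta, n):
    """Standard error of a single Sharpe ratio under serial independence and normality."""
    return math.sqrt((1.0 + eta ** 2 / 2.0) / n)


def jk_difference_se(eta1, eta2, rho12, n):
    """Standard error of the Sharpe-ratio difference under the same assumptions."""
    if not -1.0 <= rho12 <= 1.0:
        raise ValueError("correlation must lie in [-1, 1]")
    avar = 2.0 * (1.0 - rho12) + 0.5 * (eta1 ** 2 + eta2 ** 2
                                        - 2.0 * eta1 * eta2 * rho12 ** 2)
    return math.sqrt(avar / n)


# Hypothetical monthly Sharpe ratios and correlation, n = 577 observations.
print(jk_sharpe_se(0.1, 577), jk_difference_se(0.1, 0.08, 0.7, 577))
```

The difference formula shows why highly correlated strategies are easier to compare: the leading term $2(1 - \rho_{12})$ shrinks as the correlation approaches one.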

Empirical Study
In order to demonstrate the intersection-union test, I consider monthly excess returns on the MSCI stock indices for the G-7 countries, i.e., Canada, France, Germany, Italy, Japan, UK, and USA, from January 1970 to January 2018. The given indices are calculated on the basis of USD stock prices that are adjusted for dividends, splits, etc.⁴ The sample size corresponds to $n = 577$ and the risk-free interest rate is calculated on the basis of the secondary market 3-month US treasury bill rate at the beginning of each period.⁵ I choose the equally weighted portfolio (EWP) of all G-7 countries as a benchmark. This choice can be justified by the argument that investors should make use of international diversification (Jorion 1985).
For estimating the large-sample variances, I choose the lag length $l = 12$. First of all, I show that $\Gamma_n(k) \approx 0$ for all $k \in \{1, 2, \ldots, l\}$. For this purpose, I focus on the empirical autocorrelation function, i.e., $k \mapsto \rho_n(k) := \Gamma_n(k)/\Gamma_n(0)$. Figure A1 (see Appendix B) contains the correlograms with respect to $\{R_t\}$ for the EWP and each G-7 country. The red lines indicate the critical thresholds for the null hypothesis that the (true) autocorrelation at $k$ is zero at the level $\alpha = 0.05$. Furthermore, the reader can find the Ljung-Box Q-statistic in each plot, whose critical threshold at the level $\alpha = 0.05$ amounts to 21.0261. The given results confirm the general opinion that first-order autocorrelations of asset returns do not significantly differ from zero.⁶ Put another way, the large-sample variances and covariances of asset returns are not significantly larger than their stationary counterparts. This picture changes substantially in Figure A2, which shows the empirical autocorrelations with respect to $(R_t - \mu_n)^2$. Now, the Ljung-Box test always leads to a rejection of the null hypothesis $H_0\colon \rho(1) = \rho(2) = \ldots = \rho(12) = 0$. That is, there is strong evidence that monthly asset returns exhibit conditional heteroscedasticity.
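The Ljung-Box statistic mentioned above can be reproduced with a short Python sketch. It assumes the standard definition $Q = n(n+2)\sum_{k=1}^{l} \rho_n(k)^2/(n-k)$ and takes the critical threshold 21.0261 (the 0.95 quantile of the chi-square distribution with 12 degrees of freedom) from the text instead of computing it. The function name and the constant name are hypothetical.

```python
CRITICAL_Q_12_LAGS = 21.0261  # threshold quoted in the text for alpha = 0.05, l = 12


def ljung_box_q(x, lags=12):
    """Ljung-Box Q-statistic based on the first `lags` empirical autocorrelations."""
    n = len(x)
    if not 0 < lags < n:
        raise ValueError("need 0 < lags < n")
    mean = sum(x) / n
    dev = [v - mean for v in x]
    gamma0 = sum(v * v for v in dev) / n
    q = 0.0
    for k in range(1, lags + 1):
        gamma_k = sum(dev[t] * dev[t - k] for t in range(k, n)) / n
        rho_k = gamma_k / gamma0          # empirical autocorrelation at lag k
        q += rho_k ** 2 / (n - k)
    return n * (n + 2) * q


def rejects_no_autocorrelation(x):
    """Compare Q (12 lags) with the critical threshold quoted in the text."""
    return ljung_box_q(x, lags=12) > CRITICAL_Q_12_LAGS
```

Passing the squared deviations $(R_t - \mu_n)^2$ instead of the returns gives the corresponding test for conditional heteroscedasticity.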
Table 1 contains the estimated large-sample variances divided by their stationary counterparts both for $\{R_t\}$ and for $(R_t - \mu_n)^2$. We can see that the estimates of the large-sample variance of $\{R_t\}$ do not differ very much from the stationary ones, except for Japan, where the large-sample variance seems to be more than twice the stationary variance. By contrast, the estimates of the large-sample variance of $(R_t - \mu_n)^2$ are always more than twice their stationary counterparts. Hence, it is inappropriate to ignore the serial dependence structure of monthly asset returns. Table 2 contains the means, standard deviations, and Sharpe ratios for the EWP and the G-7 countries based on the monthly asset returns from January 1970 to January 2018. The standard errors are given in parentheses. Despite the large number of observations, the standard errors of $\mu_n$ and $\eta_n$ are large relative to the corresponding estimates. This is a common problem in financial econometrics or, more specifically, in performance measurement. The last row of Table 2 contains the standard errors of the Sharpe ratios under the Jobson-Korkie assumption, i.e., that asset returns are serially independent and multivariate normally distributed. These numbers are smaller than their nonparametric counterparts and they do not vary too much. Under the Jobson-Korkie assumption, the large-sample variance of $\{\eta_n\}$ is $1 + \eta^2/2 \approx 1$. Hence, the standard error of $\eta_n$ is approximately $1/\sqrt{n}$, which explains why the standard errors are almost constant in the last row of Table 2.

⁴ The total returns have been retrieved from the MSCI webpage (MSCI 2018).

⁵ The data have been obtained from the Federal Reserve Bank of St. Louis (FRED 2018).

⁶ The only exception is Japan, where we can find a relatively large Q-statistic of 31.7637.

Now, in principle, we would like to support the (alternative) hypothesis that the EWP is optimal compared to each G-7 country. Unfortunately, Table 2 shows that the UK has the largest Sharpe ratio and so the EWP cannot be significantly better. Interestingly, this was not always the case. A closer inspection of the data reveals that the EWP had the largest Sharpe ratio before the financial crisis of 2007-2008. However, now we have to stop our testing procedure. Nonetheless, for informational purposes, I provide the Sharpe-ratio differences for each of the seven pairs, the corresponding standard errors, and the associated t-statistics in Table 3. The reader can verify that it would have been hard to reject $H_0$, anyway. The problem is that every t-statistic must be greater than $\Phi^{-1}(1 - \alpha) = 1.6449$ in order to reject $H_0$, but this stringent condition is fulfilled only for Italy.
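The rejection rule in terms of t-statistics can be sketched as follows. The critical value is obtained from the standard normal quantile function in Python's standard library; the function name and the t-statistics in the usage line are hypothetical and are not the values reported in Table 3.

```python
from statistics import NormalDist


def iu_test_from_t_statistics(t_stats, alpha=0.05):
    """Reject H_0 iff every single t-statistic exceeds the one-sided critical value.

    Returns (reject_h0, critical_value, list of single rejections).
    """
    t_stats = list(t_stats)
    if not t_stats:
        raise ValueError("at least one alternative strategy is required")
    critical = NormalDist().inv_cdf(1.0 - alpha)   # about 1.6449 for alpha = 0.05
    single = [t > critical for t in t_stats]
    return all(single), critical, single


# Hypothetical t-statistics for the benchmark against three alternatives.
reject, critical, single = iu_test_from_t_statistics([2.1, 0.3, 1.7])
print(reject, round(critical, 4), single)  # False 1.6449 [True, False, True]
```

The list of single rejections is returned only for information; the joint decision depends exclusively on whether all of its entries are true.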
The lower part of Table 3 contains the standard errors of the Sharpe ratio differences and the t-statistics that are calculated under the Jobson-Korkie assumption. Although the standard errors of $\eta_n$ that are obtained under the same distributional assumption are always lower than their nonparametric counterparts (see the last row of Table 2), the same effect cannot be observed regarding $\Delta\eta_n$. The Jobson-Korkie assumption underestimates the standard errors for some indices, but it overestimates them for other indices. All in all, it appears to be very difficult to compare investment strategies by historical observation because the given results are hardly ever significant if we apply a joint or a multiple hypothesis test (Frahm et al. 2012).

Conclusions
In portfolio optimization, we are often concerned with the question of whether a given investment strategy is optimal among a set of alternatives. In this work, I presented an intersection-union test for the null hypothesis that the benchmark is suboptimal in terms of the Sharpe ratio. The proposed test can easily be implemented. Furthermore, it accounts for serial dependence and it does not presume that asset returns are multivariate normally distributed. Thus, it is compatible with the stylized facts of empirical finance. However, an empirical study demonstrates that, in most practical applications, it is hard to reject the null hypothesis due to the lack of data.

Conflicts of Interest:
The author declares no conflict of interest.

Figure A1. Correlograms with respect to $\{R_t\}$ of the EWP and each G-7 country.

Figure A2. Correlograms with respect to $(R_t - \mu_n)^2$ of the EWP and each G-7 country.

Table 2. Means, standard deviations, and Sharpe ratios for the EWP and the G-7 countries. The standard errors are given in parentheses.

Table 3. Sharpe ratio differences, standard errors, and t-statistics.