1. Introduction
The synthetic control method, proposed and discussed by Abadie and Gardeazabal (2003) and Abadie et al. (2010), is a very useful way of conducting comparative studies when exact matches are unavailable. Estimation of treatment effects usually takes the form of comparing outcomes between the treated unit and the control unit. Common sense suggests that, for the comparison to be meaningful, the control unit needs to be similar to the treated unit in various dimensions in the absence of the treatment. Such a requirement may not be satisfied in many observational studies. In some cases, the availability of panel data makes such comparisons reasonable, the difference-in-differences method being a very well-known example. The difference-in-differences method requires a very specific assumption, namely the common trend assumption, which may not be plausible for many applications. The synthetic control method offers a sensible generalization of difference-in-differences. The synthetic control is a linear combination of the potential control outcomes, where the weights are constructed from the pre-intervention outcomes.
For the purpose of statistical inference with synthetic control, i.e., confidence intervals and hypothesis testing, various versions of placebo tests are often adopted. The idea underlying the placebo tests is that of the usual permutation test, where the critical value of a test statistic is computed under all possible permutations of the “treatment” assignment across the control units.
The idea of the permutation test is very intuitive and attractive. Applying the synthetic control method to every potential control unit presumably allows researchers to assess the distribution of a test statistic under the null hypothesis of no treatment effects, and the inference is seemingly exact in the sense that the burden of asymptotic approximation can be avoided.
The purpose of this paper is very specific. We ask whether the permutation test is a reasonable idea in the context of the synthetic control method, and argue that the intuitive appeal of the permutation test is misplaced. The validity of permutation tests usually requires a certain symmetry assumption, which is often violated in the context of synthetic control studies. Using Monte Carlo simulations, we document the size distortion of the permutation tests. We also discuss a few alternative methods of inference.
Alberto Abadie kindly pointed out that the placebo test in synthetic control is often based on the randomization inference idea, under which the symmetry restriction is built in, while our analysis is predicated on the usual random sampling perspective, which leads to the violation of symmetry. This perspective is shared by an anonymous referee, who notes that (i) the synthetic control literature uses permutation tests in the context of design-based inference, and, as such, the permutation tests have exact size; (ii) the present article shows that permutation tests may not have correct size under a different mode of inference based on repeated sampling, although interpreting the permutation tests in the previous literature as tests based on repeated sampling would be incorrect; and (iii) the present article also proposes some alternatives that are valid in a repeated sampling setting. It would be useful to understand the exact mechanism through which the difference between the two perspectives manifests itself. The same referee points out that the present paper adopts a setting where $T_0 \to \infty$, while the original literature assumes fixed $T_0$.
2. Placebo Test and Synthetic Control
In this section, we provide a brief discussion of the placebo test in the context of the synthetic control method. We begin with an overview of the synthetic control, borrowing heavily from discussions in Abadie et al. (2010) and Doudchenko and Imbens (2016). We then move on to describe the placebo test, and point out the importance of the symmetry assumption. We argue that the symmetry assumption is violated in general for placebo tests using linear combinations of outcomes, such as synthetic control. We conclude this section by arguing that such a violation should be expected in general even when a normalized version of the test statistic is adopted.
We start with the overview of the synthetic control method. Consider panel data with $J+1$ cross sectional units observed over the time periods $t = 1, \dots, T$. Units $j = 1, \dots, J$ are the control units that receive the treatment in none of the time periods. The unit $J+1$ receives no treatment in periods $t = 1, \dots, T_0$, and receives active treatment in time periods $t = T_0 + 1, \dots, T$. For simplicity, we will often assume that $T = T_0 + 1$. The outcome variable $Y_{jt}$ is such that $Y_{jt} = Y_{jt}(1)$ if the $j$th unit receives treatment in time $t$, and $Y_{jt} = Y_{jt}(0)$ otherwise. Obviously, $Y_{jt} = Y_{jt}(0)$ for the control units in every period and for the treated unit in the pre-intervention periods.
The idea underlying the synthetic control is that, if there were some weights $w_1, \dots, w_J$ adding up to 1 such that
$$Y_{J+1,t} = \sum_{j=1}^{J} w_j Y_{jt} \qquad (1)$$
during the pre-intervention periods ($t = 1, \dots, T_0$), then $\sum_{j=1}^{J} w_j Y_{jt}$ can be used as a (synthetic) control for $Y_{J+1,t}(0)$ during the post-intervention periods ($t = T_0 + 1, \dots, T$).
Abadie et al. (2010) and Doudchenko and Imbens (2016) discuss various methods of finding the $w_j$'s so that the requirement in Equation (1) is satisfied. We analyze the weights and the nature of the approximation from the asymptotic perspective where $T_0 \to \infty$. Note that a special case of the estimator discussed by Abadie et al. (2010, p. 496) solves
$$\widehat{w} = \arg\min_{w} \sum_{t=1}^{T_0} \Big( Y_{J+1,t} - \sum_{j=1}^{J} w_j Y_{jt} \Big)^2 \quad \text{subject to} \quad w_j \ge 0, \ \sum_{j=1}^{J} w_j = 1. \qquad (2)$$
Under our interpretation, $\widehat{w}$ above is an estimator of $w^*$.
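For concreteness, a weighting scheme in the spirit of Equation (2) can be computed from pre-intervention outcomes alone as follows. This is a minimal sketch rather than the exact estimator of Abadie et al. (2010); the function name and the use of SciPy's SLSQP solver are our own choices.

```python
import numpy as np
from scipy.optimize import minimize

def synth_weights(y_pre, Y0_pre):
    """Weights for a synthetic control built from pre-intervention outcomes.

    y_pre  : (T0,)   pre-intervention outcomes of the treated unit
    Y0_pre : (T0, J) pre-intervention outcomes of the J control units
    Returns w of length J with w >= 0 and sum(w) = 1 that minimizes
    ||y_pre - Y0_pre @ w||^2, as in the constrained problem of Equation (2).
    """
    T0, J = Y0_pre.shape
    objective = lambda w: np.sum((y_pre - Y0_pre @ w) ** 2)
    constraints = ({"type": "eq", "fun": lambda w: np.sum(w) - 1.0},)
    bounds = [(0.0, 1.0)] * J
    w0 = np.full(J, 1.0 / J)                     # start from equal weights
    result = minimize(objective, w0, method="SLSQP",
                      bounds=bounds, constraints=constraints)
    return result.x
```

In the placebo exercises discussed below, the same routine is simply reapplied with each control unit playing the role of the treated unit.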
Suppose that the outcomes in the absence of the treatment are generated by the linear factor model
$$Y_{jt}(0) = \delta_t + \theta_t' Z_j + \lambda_t' \mu_j + \varepsilon_{jt}, \qquad (3)$$
where $Z_j$ and $\mu_j$ denote observed and unobserved characteristics of unit $j$, and suppose that $(\delta_t, \theta_t, \lambda_t)$ and the $\varepsilon_{jt}$'s satisfy strict stationarity. Without loss of generality, we also assume that $E[\lambda_t] = 0$ and $E[\varepsilon_{jt}] = 0$. We would then have $\widehat{w} \to w^*$ in probability as $T_0 \to \infty$, where $w^*$ solves the population version of the problem in Equation (2). Assuming that $w^*$ satisfies
$$\sum_{j=1}^{J} w_j^* = 1 \quad \text{and} \quad E\Big[ Y_{J+1,t} - \sum_{j=1}^{J} w_j^* Y_{jt} \Big] = 0, \qquad (4)$$
we can understand that the population version of the synthetic control $\sum_{j=1}^{J} w_j^* Y_{jt}$ is such that the difference $Y_{J+1,t} - \sum_{j=1}^{J} w_j^* Y_{jt}$ is designed to have mean zero.
Our $T_0 \to \infty$ asymptotic interpretation is not the only possible one. Doudchenko and Imbens (2016) provide an in-depth analysis of many possible methods. Our interpretation, however, is helpful for two reasons. First, it gives a concrete interpretation of the $\widehat{w}_j$'s as estimates of some pseudo-parameters, say the $w_j^*$'s, along with analytic expressions for the $w_j^*$'s, which makes it easy to understand the potential pitfalls of permutation methods afterwards. Second, it helps us motivate alternative methods of inference exploiting time series variation.
We now discuss how placebo tests can be used in the context of synthetic control. For this purpose, we first present a summary of the placebo/permutation tests. The tests are motivated by the case where the number of treated units is small and the number of controls is relatively large. In order to focus on the salient feature of the tests, we will consider an extreme case and assume that there is only one treated unit.
The basic intuition underlying the general placebo test can be gleaned by examining a standard textbook case of randomized treatments. Suppose that there are cross sectional data with $J+1$ units, where the units $j = 1, \dots, J$ are the control units and the unit $J+1$ receives the active treatment. A reasonable estimator of the treatment effect is the difference $Y_{J+1} - \bar{Y}$, where $Y_{J+1}$ is the outcome of the unit $J+1$, and $\bar{Y}$ denotes the average of the outcomes of the controls. Suppose that we are interested in testing whether the treatment had an impact. Given that there is only one treated unit, the standard t-test comparing the difference of the mean outcomes is not applicable. On the other hand, common sense suggests that we may implement such a test by “assigning” each control unit to a fictitious treatment. More precisely, one can compute the empirical distribution of $Y_k - \bar{Y}_{(-k)}$ for $k = 1, \dots, J$, where $\bar{Y}_{(-k)}$ denotes the average of the outcomes of the controls other than $k$, and use it as if it were the distribution of the estimated treatment effect under the null hypothesis.
Implementation of the placebo test with synthetic control requires a bit more notation. First, let $\widehat{w}$ denote the estimator of $w^*$. Although we will use the method of exact balancing later in our Monte Carlo simulations, we do not need to restrict ourselves to this particular estimator. For now, we can view $\widehat{w}$ as an output from a black box and let $w^*$ denote its probability limit as $T_0 \to \infty$. Second, let $\widehat{w}^{(k)}$ denote the output of the same black box except that we use the $k$th control unit as the outcome of the treated unit, with the remaining control units as our control units. The placebo test then uses the empirical distribution of the resulting placebo estimates of the treatment effect for $k = 1, \dots, J$ as if it were the distribution of the estimated treatment effect under the null hypothesis of no treatment effect. If the estimated effect belongs to the extreme tails of the empirical distribution, it is understood to be evidence that the null hypothesis is incorrect.
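As a concrete illustration, the following sketch implements this placebo loop for the case of a single post-intervention period. The interface and the two-sided p-value rule are our own simplifications rather than the exact procedure of any particular study, and synth_weights refers to the hypothetical helper from the earlier sketch.

```python
import numpy as np

# synth_weights(y_pre, Y0_pre) is the constrained least-squares helper from
# the earlier sketch; it returns nonnegative weights summing to one.

def placebo_test(Y_pre, y_pre, Y_post, y_post, alpha=0.10):
    """Placebo (permutation-style) test with a single post-intervention period.

    Y_pre  : (T0, J) pre-period outcomes of the J control units
    y_pre  : (T0,)   pre-period outcomes of the treated unit
    Y_post : (J,)    post-period outcomes of the controls
    y_post : float   post-period outcome of the treated unit
    """
    J = Y_pre.shape[1]
    # estimated effect for the actually treated unit
    w = synth_weights(y_pre, Y_pre)
    gap_treated = y_post - Y_post @ w
    # placebo effects: pretend each control unit is the treated one
    gaps = []
    for k in range(J):
        others = [j for j in range(J) if j != k]
        w_k = synth_weights(Y_pre[:, k], Y_pre[:, others])
        gaps.append(Y_post[k] - Y_post[others] @ w_k)
    gaps = np.asarray(gaps)
    # treat the placebo gaps as the null distribution of the estimated effect
    p_value = np.mean(np.abs(gaps) >= abs(gap_treated))
    return gap_treated, p_value, p_value <= alpha
```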
In order to understand the size property of the placebo test, it helps to recall that the placebo test is a version of the permutation test, which requires for its validity what may be called the symmetry assumption. For a review of this property, we will borrow the short discussion in Canay et al. (2017). Suppose that a researcher observes a vector of observations $X$, whose joint distribution is $P$. The objective is to test whether $P \in \mathbf{P}_0$, where $\mathbf{P}_0$ is a collection of probability distributions such that the distribution of $X$ is equal to that of $gX$ for every $g$ in $\mathbf{G}$, where $\mathbf{G}$ is a finite collection of transformations. The permutation test has exact size if, for the test statistic $T(X)$, the critical value is taken from the empirical distribution of $T(gX)$ over $g$ in $\mathbf{G}$. In the context of the placebo test above, one can understand $X$ to be the vector of the outcomes of the $J+1$ units, and $\mathbf{G}$ to be the permutations of the $Y$s.
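In code, this abstract recipe can be sketched as follows. The quantile-based critical value below is a simplification of the exact randomization construction reviewed in Canay et al. (2017), and the function and argument names are ours.

```python
import numpy as np

def permutation_test(X, T, transforms, alpha=0.10):
    """Generic permutation test.

    X          : observed data (any object the statistic T accepts)
    T          : test statistic, a function of the data
    transforms : a finite list of functions g with g(X) distributed like X
                 under the null (the group G); should include the identity
    alpha      : nominal level
    Rejects when T(X) exceeds the (1 - alpha) quantile of {T(gX) : g in G}.
    """
    reference = np.array([T(g(X)) for g in transforms])
    critical_value = np.quantile(reference, 1.0 - alpha)
    return T(X) > critical_value
```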
We note that the symmetry is not mathematically obvious in the context of synthetic control. In order for the permutation test to be valid, it is necessary for the distribution of the treated unit's gap and those of the placebo gaps for $k = 1, \dots, J$ to be identical. Even for the relatively simple model in Equation (3), the nature of the synthetic control is such that the symmetry does not naturally follow. Using the restriction in Equation (4), we may write
$$Y_{J+1,t} - \sum_{j=1}^{J} w_j^* Y_{jt} = \theta_t' \Big( Z_{J+1} - \sum_{j=1}^{J} w_j^* Z_j \Big) + \lambda_t' \Big( \mu_{J+1} - \sum_{j=1}^{J} w_j^* \mu_j \Big) + \Big( \varepsilon_{J+1,t} - \sum_{j=1}^{J} w_j^* \varepsilon_{jt} \Big). \qquad (5)$$
Even if the first two terms on the right-hand side of Equation (5) were identically equal to zero over the permutations, we believe that the third term is not likely to satisfy the symmetry property. This is because, under the further restriction that the $\varepsilon_{jt}$'s have a finite variance, the term can be symmetric only when they are normally distributed.
We show that normality is necessary if the distribution of the error term $\varepsilon_{J+1,t} - \sum_{j=1}^{J} w_j^* \varepsilon_{jt}$ in Equation (5) is to be symmetric up to normalization. Suppose that $\varepsilon_{1t}, \dots, \varepsilon_{J+1,t}$ are i.i.d., and their common distribution is such that the variance is finite and the characteristic function does not vanish. If the weight vector is a nontrivial function of the $Z_j$'s and $\mu_j$'s, then symmetry over the permutations requires that the marginal distributions of the linear combinations $a' \varepsilon_t$, where $\varepsilon_t = (\varepsilon_{1t}, \dots, \varepsilon_{J+1,t})'$, should remain invariant over all weight vectors $a$ that arise across the permutations. Without loss of generality, we can focus on the distribution of $a' \varepsilon_t$ for a generic vector $a$, and conclude that the symmetry requires that there exists a random variable $Y$ such that the distribution of $a' \varepsilon_t$ is the same as that of $cY$ for some scalar $c$. Because the standard deviation of $a' \varepsilon_t$ is proportional to $\lVert a \rVert$, we may without loss of generality take $c = \lVert a \rVert$. This implies that the distribution of $a' \varepsilon_t$ only depends on $\lVert a \rVert$. In other words, for $a$ and $b$ such that $\lVert a \rVert = \lVert b \rVert$, the distribution of $a' \varepsilon_t$ is identical to that of $b' \varepsilon_t$. In particular, let all components of $b$ be zero except for the first one. Then, the distribution of $a' \varepsilon_t$ is identical to that of $\lVert a \rVert \, \varepsilon_{1t}$. This implies that $\varepsilon_{1t}$ should have a stable distribution. Because the only stable distribution with a finite variance is the normal distribution, we should conclude that normality is a necessary condition of the symmetry (up to normalization). Note that the third term in Equation (5) arises in an ideal situation where the weights $w^*$ do not need to be estimated and the first two terms completely disappear. Our analysis suggests that even if we normalize the third term by its standard deviation, the symmetry requires the normal distribution. The necessity of the normality assumption is about any linear combination, so it applies a fortiori to synthetic control.
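The point can also be seen in a small simulation. The snippet below draws i.i.d. centered exponential errors (finite variance, non-normal) and compares two “treated minus weighted controls” gaps computed with different weight vectors; even after dividing by the standard deviation, the two distributions differ in shape, whereas under normal errors both would be standard normal. The specific choices (exponential errors, nine controls, the two weight vectors) are ours and purely illustrative.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
J, n = 9, 200_000                                   # illustrative sizes
eps = rng.exponential(1.0, size=(n, J + 1)) - 1.0   # i.i.d., mean zero, skewed

def normalized_gap(w):
    """(epsilon_treated - sum_k w_k * epsilon_k) divided by its std. dev."""
    g = eps[:, -1] - eps[:, :J] @ w
    return g / g.std()

w_one = np.zeros(J); w_one[0] = 1.0    # all weight on a single control
w_eq = np.full(J, 1.0 / J)             # equal weights on all controls

# Under normal errors both would be standard normal; here the shapes differ.
print(skew(normalized_gap(w_one)))     # close to 0 (symmetric, Laplace-type)
print(skew(normalized_gap(w_eq)))      # clearly positive
```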
3. Monte Carlo
The discussion at the end of the previous section casts doubt on the placebo test, even for the simple case where the first two terms in Equation (5) can be ignored. In order to understand the roles that the first two terms may play, we adopt Monte Carlo simulations. We try to find data generating processes (DGPs hereafter) that generate a large amount of size distortion. This is helpful in understanding the potential problem of the placebo test from the uniformity perspective; after all, the mathematical definition of the “size” of a test is the maximum probability of rejection under the null, and here the null hypothesis is a composite hypothesis where the only requirement on the DGP is that the treatment effect is zero, which allows many possibilities for the terms in Equation (5). For this purpose, we found it most convenient to work with the first two terms in Equation (5), although we acknowledge that there may be other important sources of size distortion that we have not explored. Since the last paragraph of Section 2 showed that normalization does not alleviate the symmetry requirement, we examine the importance of the first two terms in Equation (5) using a more natural statistic. The version of the synthetic control that we use in the Monte Carlo is the method of exact balancing, the population version of which minimizes the variance of $Y_{J+1,t} - \sum_{j=1}^{J} w_j Y_{jt}$ subject to $\sum_{j=1}^{J} w_j = 1$ and $E\big[ Y_{J+1,t} - \sum_{j=1}^{J} w_j Y_{jt} \big] = 0$.
The method of exact balancing may not be an ideal version of the synthetic control, but it reflects a certain ambiguity in the method of synthetic control. In the factor model in Equation (3), it is impossible to find weights $w$ such that $Y_{J+1,t} = \sum_{j=1}^{J} w_j Y_{jt}$ for every $t \le T_0$ if $T_0$ is large enough, as long as $\varepsilon_{jt}$ is continuously distributed. In other words, the condition (2) in Abadie et al. (2010) is incompatible with the factor model unless the idiosyncratic errors $\varepsilon_{jt}$ are identically zero. The assumption that the $\varepsilon_{jt}$'s are zero has at least two implications. First, the weights $w^*$ can be estimated without error with a sufficiently large $T_0$. Second, the permutation distribution would have a point mass at zero, and as such, there is no reason to conduct any test. Both implications are questionable. In any case, under the assumption that the $\varepsilon_{jt}$'s are zero, the weights can be estimated (without error) by the method of least squares that minimizes $\sum_{t=1}^{T_0} \big( Y_{J+1,t} - \sum_{j=1}^{J} w_j Y_{jt} \big)^2$. If the assumption is violated, the method of least squares would be subject to a version of the measurement error problem; the true regressors there are the systematic components $\delta_t + \theta_t' Z_j + \lambda_t' \mu_j$ in Equation (3), and the observed $Y_{jt}$ plays the role of a regressor with measurement error $\varepsilon_{jt}$. Note that such a problem is avoided by the method of exact balancing.
We consider the method of exact balancing in this section not because it is necessarily an ideal version of the synthetic control, but because it is a convenient way of examining the impact of the first two terms in Equation (5). As mentioned at the beginning of this section, our analysis at the end of the previous section suggests that the placebo test may have a problem even when these two terms are dismissed, and the purpose of our Monte Carlo exercise is to focus on the potential impact of these two terms.
For our Monte Carlo analysis, we adopted a simplified version of the factor model in Equation (3) such that (i) the terms $\delta_t$ and $\theta_t' Z_j$ are absent; (ii) the common factor $\lambda_t$ is a scalar; (iii) the loadings $\mu_j$ are nonrandom constants; and (iv) $\varepsilon_{jt}$ is i.i.d. over $j$ and $t$. In matrix notation, our estimator $\widehat{w}$ solves
$$\min_{w} \; (y - Yw)'(y - Yw) \quad \text{subject to} \quad \ell' w = 1 \ \text{ and } \ \ell'(y - Yw) = 0, \qquad (6)$$
where $y$ is the $T_0 \times 1$ vector of the treated unit's pre-intervention outcomes, $Y = (Y_1, \dots, Y_J)$ is the $T_0 \times J$ matrix of the controls' pre-intervention outcomes, and $\ell$ denotes a conformable vector of ones. Because the outcomes are stationary over $t$, we can see that the population counterpart $w^*$ solves
$$\min_{w} \; E\big[ (Y_{J+1,t} - Y_t' w)^2 \big] \quad \text{subject to} \quad \ell' w = 1 \ \text{ and } \ E\big[ Y_{J+1,t} - Y_t' w \big] = 0,$$
where $Y_t = (Y_{1t}, \dots, Y_{Jt})'$.
Substituting the factor model into the treated unit's statistic and the placebo statistics yields a decomposition whose terms we label A, B(i), B(ii), C(i), C(ii), D(i), and D(ii). Note that the term B(i) is equal to 0 by design here, although it can in principle be different from 0 depending on the DGP and the estimator chosen. We speculate that the placebo test is used in the hope that (a) the statistic is dominated by the term D(i) above; (b) the four terms A, B(ii), C(ii) and D(ii) above, which reflect the noise of estimating $w^*$ by $\widehat{w}$, are ignorable; and (c) the two terms C(i) and D(i) more or less satisfy the symmetry property.
We argued in the previous section that the term D(i) is likely to violate the symmetry property. In order to assess the impacts of the other terms, we consider the following variations in DGPs:
Vary the values of the weights $w_j^*$ such that (a) none of the components of $w^*$ dominates; (b) only two of the elements are non-zero.
Vary the values of the loadings $\mu_j$ such that the unbalanced unobservable factors C(i) (a) disappear; and (b) are present.
Vary $T_0$ such that the estimation errors in the weights are (a) prominent; and (b) negligible.
Combinations of the first two variations give us four different DGPs, shown as DGP No. 1 to No. 4 in
Table 1.
We considered two versions of the placebo tests: the first one is what might be called a feasible version of the test. Formally, for $j = 1, \dots, J$, let $Y_j$ be a $T_0 \times 1$ vector of pre-intervention outcomes for the $j$th control unit, let $Y = (Y_1, \dots, Y_J)$, and let $Y_{-j}$ be a $T_0 \times (J-1)$ matrix that deletes the $j$th column from $Y$. Then, similar to Equation (6), define the leave-one-out synthetic control weights $\widehat{w}^{(j)}$ for the $j$th control unit as a solution to the analogous problem in which $Y_j$ plays the role of the treated unit's outcomes and $Y_{-j}$ plays the role of the controls' outcomes. We likewise define the population counterpart $w^{(j)}$ as a solution to the corresponding population problem. For $j = 1, \dots, J$ and $k \neq j$, let $\widehat{w}_k^{(j)}$ be the element in $\widehat{w}^{(j)}$ that corresponds to the $k$th control unit. In addition, define the placebo statistics
$$\widehat{\tau}_j = Y_{jT} - \sum_{k \neq j} \widehat{w}_k^{(j)} Y_{kT}, \qquad j = 1, \dots, J.$$
Then, for the treated unit, we can compute
$$\widehat{\tau} = Y_{J+1,T} - \sum_{j=1}^{J} \widehat{w}_j Y_{jT}.$$
Let $\widehat{\tau}_{(1)} \le \dots \le \widehat{\tau}_{(J)}$ be the order statistics of the $\widehat{\tau}_j$'s. We reject if $\widehat{\tau}$ falls below the lower-tail order statistic or above the upper-tail order statistic implied by the nominal level of the test.
The second test is an infeasible version of the test, which is identical to the first test, except that we use the true value $w^{(j)}$ of the leave-one-out weights, i.e., we compute $\widetilde{\tau}_j = Y_{jT} - \sum_{k \neq j} w_k^{(j)} Y_{kT}$ for $j = 1, \dots, J$, and we reject the null hypothesis if $\widehat{\tau}$ falls below the lower-tail order statistic or above the upper-tail order statistic of the $\widetilde{\tau}_j$'s implied by the nominal level.
For each DGP, we try a range of values for $T_0$ and $J$. For all designs, we fix the nominal level of the tests and set the number of Monte Carlo runs to 1000.
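For concreteness, the rejection frequencies reported below can be computed with a driver of the following form. This is a sketch under our own conventions: dgp is a placeholder for whichever data generating process is being simulated (its parameterization corresponds to Table 1), and placebo_test refers to the earlier sketch.

```python
import numpy as np

def rejection_rate(dgp, n_runs=1000, alpha=0.10, seed=0):
    """Monte Carlo rejection frequency of the placebo test under a DGP with
    no treatment effect.  `dgp(rng)` is a user-supplied function returning
    (Y_pre, y_pre, Y_post, y_post) in the format expected by placebo_test
    (the earlier sketch)."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_runs):
        Y_pre, y_pre, Y_post, y_post = dgp(rng)
        _, _, reject = placebo_test(Y_pre, y_pre, Y_post, y_post, alpha=alpha)
        rejections += int(reject)
    return rejections / n_runs
```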
The results are summarized in Table 2. We see size distortions in Table 2, especially for DGPs No. 2 and No. 4. The size distortion there cannot be attributed to the noise of estimating $w^*$. First, the problem persists even as $T_0$ approaches unrealistically large values. Second, the size distortion is similar over the feasible and infeasible versions of the test. We suspect that the problem is a fundamental one that may have something to do with the violation of symmetry. (An anonymous referee pointed out that DGPs No. 2 and No. 4 cannot produce synthetic controls that approximate the trajectory of the outcome for the treated, and that synthetic controls should not be applied in those settings.)
Our Monte Carlo analysis indicates that the placebo test does have the size distortion problem. The results in Table 2 suggest that the size problem is potentially bigger in DGPs No. 2 and No. 4. DGPs No. 2 and No. 4 differ from No. 1 and No. 3 in that the $\mu_j$'s are nonzero and the aggregate shock $\lambda_t$ plays a role as a consequence. Therefore, it is of interest to investigate further sources of asymmetry. For this purpose, we revisit the decomposition in Equation (5) of $Y_{J+1,t} - \sum_{j=1}^{J} w_j^* Y_{jt}$, assuming that the first and second terms in the factor model in Equation (3) are not present:
$$Y_{J+1,t} - \sum_{j=1}^{J} w_j^* Y_{jt} = \lambda_t \Big( \mu_{J+1} - \sum_{j=1}^{J} w_j^* \mu_j \Big) + \Big( \varepsilon_{J+1,t} - \sum_{j=1}^{J} w_j^* \varepsilon_{jt} \Big).$$
This implies that the variance of $Y_{J+1,t} - \sum_{j=1}^{J} w_j^* Y_{jt}$ can be written as
$$\sigma_\lambda^2 \Big( \mu_{J+1} - \sum_{j=1}^{J} w_j^* \mu_j \Big)^2 + \sigma_\varepsilon^2 \Big( 1 + \sum_{j=1}^{J} (w_j^*)^2 \Big)$$
under the assumptions of the DGPs, where $\sigma_\lambda^2$ and $\sigma_\varepsilon^2$ denote the variances of the aggregate shock and of the idiosyncratic errors, respectively. Likewise, the variances of the permutation statistics are
$$\sigma_\lambda^2 \Big( \mu_k - \sum_{j \neq k} w_j^{(k)} \mu_j \Big)^2 + \sigma_\varepsilon^2 \Big( 1 + \sum_{j \neq k} \big(w_j^{(k)}\big)^2 \Big), \qquad k = 1, \dots, J.$$
Depending on the relative magnitudes of the unbalanced factor loadings $\mu_k - \sum_{j \neq k} w_j^{(k)} \mu_j$, we can easily construct examples that violate the symmetry, such as DGPs No. 2 and No. 4. As of now, it is not clear to us whether there is another venue (other than the variation in the size of the unbalanced factor loadings) that leads to a violation of the symmetry.
4. Possible Alternatives to Placebo Tests
If we take the time series asymptotics ($T_0 \to \infty$) seriously, the problem can be avoided by using the same idea as in Andrews (2003). The hypothesis of no treatment effects can be understood to be a hypothesis of stationarity of the time series $u_t = Y_{J+1,t} - \sum_{j=1}^{J} w_j^* Y_{jt}$. In particular, the researcher is interested in whether the distribution of $u_T$ is the same as that of $u_1, \dots, u_{T_0}$, for which Andrews (2003)'s test is well-suited. In the simple case that we consider, where $T = T_0 + 1$, one rejects the null if $\widehat{u}_T = Y_{J+1,T} - \sum_{j=1}^{J} \widehat{w}_j Y_{jT}$ belongs to the extreme tails of the empirical distribution of $\widehat{u}_t = Y_{J+1,t} - \sum_{j=1}^{J} \widehat{w}_j Y_{jt}$, $t = 1, \dots, T_0$. We conducted Monte Carlo simulations for all the DGPs considered in the previous section, and verified that Andrews (2003)'s test suffered no size distortion.
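In the single post-intervention period case described above, the procedure reduces to comparing the post-period gap with the empirical distribution of the pre-period gaps. The sketch below is a minimal version of this idea under our notation; it is not a full implementation of Andrews (2003), which covers longer post-intervention windows and accounts for parameter estimation more carefully.

```python
import numpy as np

def end_of_sample_test(y_pre, Y_pre, y_post, Y_post, w_hat, alpha=0.10):
    """Simplified end-of-sample instability test (single post period).

    y_pre, Y_pre   : pre-period outcomes of the treated unit (T0,) and of the
                     controls (T0, J)
    y_post, Y_post : post-period outcomes, scalar and (J,)
    w_hat          : synthetic-control weights (J,)
    Rejects the null of no treatment effect when the post-period gap lies in
    the extreme tails of the empirical distribution of the pre-period gaps.
    """
    gaps_pre = y_pre - Y_pre @ w_hat          # u_t for t = 1, ..., T0
    gap_post = y_post - Y_post @ w_hat        # u_T for the post period
    lower = np.quantile(gaps_pre, alpha / 2)
    upper = np.quantile(gaps_pre, 1 - alpha / 2)
    return (gap_post < lower) or (gap_post > upper)
```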
Andrews (2003)'s test is geared for application in time series, and as such, is robust to certain forms of heteroscedasticity. If the variances of $\varepsilon_{jt}$ in Equation (3) were different across $j$'s, most of the available methods exploiting cross sectional variation may need to be used with caution, as noted by Ferman and Pinto (2017). Andrews (2003)'s end-of-sample instability test being a test of stationarity of $Y_{J+1,t} - \sum_{j=1}^{J} w_j^* Y_{jt}$, its validity does not depend on whether the $\varepsilon_{jt}$'s have identical variances or not. The usefulness of Andrews (2003)'s test in this context was recognized earlier by Ferman and Pinto (2017).
Andrews (2003)'s test utilizes time series variation seriously. When $T_0$ is relatively small, perhaps the researcher would like to have a procedure that is based on cross sectional variation. If the factor structure is taken seriously and if the number of factors is a priori known, we can produce such a procedure by combining the ideas in Conley and Taber (2011) and Holtz-Eakin et al. (1988). For simplicity, assume that the model is a single-factor version of Equation (3) with observed regressors, with the usual normalization of the factor. Under the strict exogeneity assumption on the $x$'s, the common parameters of the model can be consistently estimated by using the control group. Now, assume that the idiosyncratic errors are i.i.d. across units, which would imply that the corresponding unit-level statistics are i.i.d. A simple modification of Conley and Taber (2011)'s argument establishes that the distribution of the statistic for the treated unit can be consistently estimated by the empirical distribution of the analogous statistics for the control units, with the common parameters replaced by Holtz-Eakin et al. (1988)'s estimator. Therefore, in order to test that the treatment effect is zero, it suffices to consider a test that rejects whenever the statistic for the treated unit is in the extreme tails of such an empirical distribution.
Ahn et al. (2013), for example, discussed how Holtz-Eakin et al. (1988)'s method can be generalized when there are multiple factors. The idea of combining Holtz-Eakin et al. (1988) with Conley and Taber (2011), although straightforward, does not seem to have been considered elsewhere.
We have considered two alternative methods of inference, one based on $T_0 \to \infty$ asymptotics, and the other based on $J \to \infty$ asymptotics. In addition to these two methods, if both $T_0$ and $J$ are large, it may also be possible to use the panel techniques as in Bai (2009); see, e.g., Gobillon and Magnac (2016). The latter two procedures are based on the presumption that the researcher takes the linear factor structure seriously, so they may be more powerful than Andrews (2003)'s test. On the other hand, if a researcher views the linear factor model as just a toy model to illustrate the potential problem of difference-in-differences methods, then she would probably be hesitant to discard the synthetic control method, which may be able to accommodate potentially complicated statistical structures that go beyond the linear factor model.
The three methods that we discussed here as possible alternatives are all theoretically valid under some asymptotics. Asymptotic validity does not necessarily imply that any given method performs reasonably in a given finite sample. A serious Monte Carlo comparison of the relative performance of the three alternatives, which is beyond the scope of the current paper, is required before any one of them can be recommended to practitioners.