Causal Random Forests Model Using Instrumental Variable Quantile Regression

We propose an econometric procedure based mainly on the generalized random forests method. Not only does this process estimate the quantile treatment effect nonparametrically, but our procedure yields a measure of variable importance in terms of heterogeneity among control variables. We also apply the proposed procedure to reinvestigate the distributional effect of 401(k) participation on net financial assets, and the quantile earnings effect of participating in a job training program.


Introduction
Causal machine learning, which is based on two approaches: the double machine learning (DML), cf. Chernozhukov et al. (2018), and the generalized random forests method (GRF), cf. , has been actively studied in economics in recent years. With the identification strategy of selection on observables, empirical applications have been investigated by using the aforementioned two approaches, including the works by Gilchrist and Sands (2016) and Davis and Heller (2017). When it comes to the identification strategy of selection on unobservables, few empirical papers using causal machine learning can be found in the existing literature. Those empirical applications very often lack important observed control variables or involve reverse causality, and thus researchers resort to the instrumental variable approach. Additionally, it remains unclear how the quantile treatment effect is to be estimated under the DML and GRF methods. In this paper, with the use of instrumental variables, we propose an econometric procedure for estimating quantile treatment effects based primarily on the generalized random forests of . Chernozhukov and Hansen (2005) propose an estimator that addresses endogeneity in quantile regressions via rank similarity, a crucial feature absent in the prior approaches. Using rank similarity, this estimator studies the heterogeneous quantile effects of an endogenous variable over the entire population (rather than for the compliers). Rank similarity thus identifies population-based quantile treatment effects, cf. Frandsen and Lefgren (2018). This approach does not require the monotonicity assumption used in Abadie et al. (2002) and allows for binary or continuous endogenous and instrumental variables. Chernozhukov and Hansen (2008) create a bridge between two-stage least squares (2SLS) estimator and their 2005 estimator, and propose an estimator robust to weak instruments. However, it is noteworthy that these estimator are unable to estimate unconditional quantiles, which are, as discussed in Guilhem et al. (2019), quantities that should be of utmost interest to empirical researchers. In this paper, we use the instrumental variable quantile regression of Chernozhukov and Hansen (2008) as a vehicle for identifying the quantile treatment effect. Athey and Imbens (2016) is the first paper that develops the regression tree model to estimate heterogeneous treatment effects using the honest splitting algorithm. Wager and Athey (2018) extend the regression tree model to causal forests. Recently,  have developed the generalized random forests model, which is a unified framework in the sense that it is built on local moment conditions capable of encompassing many models. Therefore, we bring the first order condition of the instrumental variable quantile regression into the local moment conditions and then modify the GRF algorithm. Accordingly, the quantile treatment effect can be estimated under the framework of causal random forests. Thus, our proposed estimator and the generalized random forests model both share the advantage of estimating the conditional quantile treatment effect nonparametrically. Chen and Tien (2019) investigate the instrumental variable quantile regression in the context of double machine learning. Although related to their paper, our procedure is not considering the same high-dimensional setting. Further, in contrast to the DML for instrumental variable quantile regressions, the proposed econometric procedure yields a measure of variable importance in terms of heterogeneity among control variables. The pattern of variable importance across quantiles can be revealed as well. We highlight the usage of exploring variable importance by reinvestigating two empirical studies -the distributional effect of 401(k) participation on net financial assets, and the quantile effect of participating in a job training program on earnings.
The rest of the paper is organized as follows. The model specification and practical algorithm are introduced in Section 2. Section 3 presents the measure of variable importance. Section 4 presents two empirical applications. Section 5 concludes the paper. The Appendix A discusses the usage of a doubly robust method along with the causal random forests structure for achieving more efficient estimation. The Appendix A also discusses the identifying restrictions and regularity conditions for the instrumental variable quantile regression and the generalized random forests, and further verifies conditions for establishing consistency and asymptotic normality of the proposed estimator.

The Model and Algorithm
We propose the causal random forests with the instrumental variable quantile regression (GRF-IVQR, hereafter). Estimation procedure of the GRF-IVQR is constructed as below, essentially based on the method developed in .

Generalized Random Forests
The classification tree and regression tree (CART) and its extension, random forests Breiman (2001), are effective methods for flexibly estimating regression functions in terms of out-of-sample predictive power. Random forests have become particularly popular methods. A key attraction is that they require relatively little tuning and have superior performance to more complex methods such as deep learning neural networks, cf. Section 3.2 of Athey and Imbens (2019). Recently, random forests have garnered interest and have been extended to causal effects; that is, the generalized random forests estimator.
In what follows, we describe how we incorporate the instrumental variable quantile regression into the framework of GRF and modify the resulting estimator accordingly.
Given data (X i , O i ) ∈ X × O, we estimate the parameter of interest θ(x) via the following moment conditions where ψ(·) stands for the score function and ν(x) are optional nuisance parameters. The above moment conditions, similar to the generalized method of moments (GMM), can be used to identify many objects of interest from an economic perspective. We seek forest-based estimates,θ(x), which are the conditional quantile treatment effects, in the context of instrumental variable quantile regressions. Chernozhukov and Hansen (2005) laid the theoretical foundations for the instrumental variable quantile regression (IVQR). With outcome Y i , endogenous treatment variable D i , instrumental variable Z i , and control variables X i , the IVQR can be represented as the following moment conditions where θ(τ) is the conditional quantile treatment effect, ν(τ) are the nuisance parameters, 1(·) is the indicator function, and τ is a quantile index.
The sample counterpart of the local moment conditions and the estimator of θ are introduced by Athey et al. (2019) and defined as below.
where α i (x) are tree-based weights averaged by the forest, which measure how often each training example falls in the same leaf as x. In other words, these weights represent the relevance of each sample when we estimate θ. Specifically, the weights are obtained by a forest-based algorithm. For the point of interest x, let L b (x) represent the set of samples which fall in the same terminal leaf and contain x in bth tree, where b ∈ {1, 2, ..., B}. That is to say, the weight α i (x) of each sample for the point of interest x will be the frequency with which the ith sample is in the same terminal leaf among all trees {1, 2, ..., B}. That is, With such forest-based weights and a pre-specified quantile index τ, we minimize the criterion function constructed using sample moment conditions, and then an estimate of the conditional quantile treatment effectθ(τ) is obtained. In the subsequent section, we discuss how to grow the trees and the forests with the instrumental variable quantile regression.

Tree Splitting Rules
Growing a tree is a recursive binary splitting process. The spirit of the tree-based algorithm is to split the data in the parent node P in half by maximizing the heterogeneity of the associated two children nodes {C 1 , C 2 }.
Specifically, for node j with data J , we define the node parameters as follows.
where j ∈ {P, C}. In each node, we minimize the following criterion which is based on the GRF method. However, the minimization is infeasible due to the unknown value of θ(τ, X).  turn this minimization problem of err(C 1 , C 2 ) into an accessible model-free maximization problem of where n C 1 , n C 2 , n P are numbers of observations in children and parent nodes. Along the way of maximizing ∆, the θ C j (τ) is estimated by the IVQR with respect to all possible splittings which correspond to the set {all possible values for X i , ∀i}. Consequently, the estimation is computationally infeasible. To circumvent this difficulty,  suggest a gradient tree algorithm which maximizes an approximate criterion∆(C 1 , C 2 ). In what follows, with two new ingredients A p and ρ defined below, we construct ∆(C 1 , C 2 ) step by step.
We first define A p as the gradient of the expectation of the moment condition.
where F(·) is a cumulative distribution function, and m is the dimension of X. For simplicity of derivation, we fix the following notations.
which means the estimation is conditional on the parent node. Accordingly, A p can be written as the gradient of g 0 , g 1 , · · · , g m with respect to the parent node parameters.
where f (·) is the probability density function of F(·). Therefore, the inverse of A p , We then construct the pseudo-outcomes, where ξ is a vector that picks out the θ(τ)-coordinate from the (θ(τ), ν(τ)) vector. In the case with one treatment variable D, the ξ vector is (1, 0, ..., 0). Thus, the corresponding pseudo-outcomes are where the # C denotes the number of observations in the children node C. The splitting rule is to maximize the following approximate criterioñ Notice that since some terms in ρ i , such as f (·), do not affect the optimization of∆(C 1 , C 2 ), the ρ i can be further simplified as follows.
Using the modified ρ i above,∆(C 1 , C 2 ) is our splitting rule for the instrumental variable quantile regression within the framework of generalized random forests. Based on the splitting rule, the tree is grown by recursively partitioning the data until a stopping criterion is met, cf. Section 2.4.

The Algorithm and an Example Illustrating Weights Calculation
With the splitting rule established, we can now grow the entire forests. In Athey and Imbens (2016) and Wager and Athey (2018), the concept of honest estimation is introduced, which is also included in the generalized random forests model. A model is honest if the information for the model construction and estimation is not the same. In the tree-forming case, the honesty here is consider as a sub-sample splitting between tree forming and weight calculation.
Here is an example of the implementation of honest estimation. Suppose we have eight samples in our data J, where J = {a, b, c, d, e, f , g, h}. We split the sample in half honestly, and we have two sub-samples J 1 = {a, b, c, d} for tree forming and J 2 = {e, f , g, h} weight calculation. By the splitting rule, we can construct the following tree with J 1 = {a, b, c, d}, Next, we identify where the data of J 2 = {e, f , g, h} is located in the tree.  2019) prove that with proper honest sub-sampling rate and regularity conditions, the generalized random forests estimatorθ(x) is consistent and asymptotic normal to θ(x).
To build the random forests with honest tree, we first randomly select 1 2 of sample for each tree. Then in each tree, we use 1 2 subsampling rate for honest splitting. For the average quantile treatment effect, we adopt each point in the data as the point of interest using their own weights one by one and get the average of all results.

Practical Implementation
When implementing the generalized random forests algorithm, we first obtain baseline grids through the conventional IVQR estimator, and then utilize those grids to grow the tree. With the IVQR estimator γ pre and its standard errorσ pre , we construct the interval [γ pre − 3σ pre ,γ pre + 3σ pre ]. We divide this interval into 100 equal parts, and then obtain the baseline grid baseline grid = γ pre − 3σ pre , · · · ,γ pre + 3σ pre .
For tree b ∈ {1, 2, ..., B} in the random forest estimation, half of the data is randomly selected. Consequently, we should reconstruct the grid for each tree. Similarly, we build the grid for the tree b grid for the tree b = γ tree b − 3σ tree b , · · · ,γ tree b + 3σ tree b , which is obtained via the randomly selected half of data in the tree b.
Following the concept of honest estimation, we further split the data into two parts denoted as data J 1 and J 2 . Data J 1 is used to grow the tree, and data J 2 is used to form the weight α i . As to grow the tree with data J 1 , in what follows, we outline the splitting process in each node. We first estimate the parent node parametersθ P (τ) andν P (τ) by optimizing with the grid for the tree b. We then implement the splitting criterion The tree keeps splitting recursively until they reach the minimum-node-size constraint or a situation that the data in the parent node has little variation, therefore further splitting is infeasible. These two practical stopping criteria on splitting suffice for reasonable estimates.
Regarding estimation of the weight, we first identify where the observations in J 2 will be located in the tree constructed by the data J 1 . Using the algorithm discussed in the Section 2.3, we compute the weight for every data point. Accordingly, we have determined the estimation of growing a tree b.
By growing a total of B trees and averaging the weight in each tree, we obtain the weight of each observation. With the weight α i (x), we estimate the conditional local quantile treatment effect To yield the local quantile treatment effect, we could average all x-pointwise conditional local quantile treatment effects. However, the averaging procedure can be further modified to get more efficient estimates, which is discussed in the Appendix A. Nevertheless, our empirical studies in Section 4 suggest that with a proper sampled data, the aforementioned practical procedure performs substantially well.  and the associated grf R package develop a measurement for sorting variable importance which is a unique advantage of tree-based models. To explore the variable importance across quantiles, we adopt their measure of importance reproduced as follows.

Variable Importance
where the number of maximum depth is pre-specified by empirical researchers. Specifically, this measure of variable importance only considers the splitting frequency for variable X i in trees b = 1, ..., B. This version of importance measurement shares similarity with the Gini importance widely used in random forests. Therefore, both algorithms prefer continuous variables since they have more potential splitting chances compared to binary variables. We thus shall be cautious when interpreting variable importance between a continuous variable and a categorical variable. Another important remark is that we should not conclude a particular covariate is unrelated to treatment effects simply because the tree did not split on it. There can be many different ways to pick out a subgroup of units with high or low treatment effects. Thus by comparing the average characteristics of units with high treatment effects to those with low treatment effects, researchers could obtain a fuller picture of the differences between these groups across all covariates.
Similar to the R-squared, variable importance signifies whether a variable yields enough explanatory power to the outcome variable in light of variation. Variable importance can also be used for model selection. In recent literature, e.g., O'Neill and Weeks (2018), researchers adopt variable importance measurement for policy making. Given hundreds of variables, the forest-based algorithm picks out important variables, which suffices for policy makers to identify their benchmark models.

Empirical Studies
In this section, we reinvestigate two empirical studies on quantile treatment effects: the effect of 401(k) participation on wealth, cf. Chernozhukov and Hansen (2004), and the effect of job training program participation on earnings, cf. Abadie et al. (2002). Not only does this conduct data-driven robustness checks on the econometric results, but the GRF-IVQR yields a measure of variable importance in terms of heterogeneity among control variables. This complements the existing empirical findings. In addition, we compare our empirical results with those from Chen and Tien (2019), the IVQR estimation based on the double machine learning approach, which is an alternative in causal machine learning literature.
As a critical note, we do not estimate and report the conditional quantile treatment effect (CQTE) in the applications. When the outcome level has an impact on the effect size and the conditional outcome variable are heterogeneous, then the CQTE could report spurious heterogeneity; see comprehensive summary of the problem in Strittmatter (2019). The same problem carries through to the importance measure. Therefore, the variable importance has to be interpreted with caution across different quantiles.

The 401(k) Retirement Savings Plan
Examining the effects of 401(k) plans on accumulated wealth is an issue of long-standing empirical interest. For example, based on the identification of selection on observables, Chiou et al. (2018) and Chernozhukov and Hansen (2013) suggest that the income nonlinear effect exists in the 401(k) study. Nonlinear effects from other control variables are identified as well. Few papers, however, investigate variable importance among control variables, cf. Chen and Tien (2019). In addition to estimating the quantile treatment effect of 401(k) participation, we fully explore variable importance across the conditional quantiles of accumulated wealth in light of the generalized random forests. The corresponding findings shed some light on the existing literature.
The data with 9915 observations are from the 1991 Survey of Income and Program Participation. The outcome variable is the net financial asset. The treatment variable is a binary variable standing for participation in the 401(k) plan. The instrument is an indicator for being eligible to enroll in the 401(k) plan. Control variables consist of age, income, family size, education, marital status, two-earner status, defined benefit pension status, individual retirement account (IRA) participation status, and homeownership status, which follow the model specification used in Chernozhukov and Hansen (2004). Table 1 signifies that the quantile treatment effects estimated by the GRF-IVQR are similar to those calculated in Chernozhukov and Hansen (2004). The 401(k) participation has larger positive effects on net financial assets for people with higher savings propensity which corresponds to the upper conditional quantiles. The estimated treatment effects show a monotonically increasing pattern across the conditional distribution of net financial assets. Thus, the pattern identified by Chernozhukov and Hansen (2004) is assured through our data-driven robustness checks. Based on the measure of variable importance introduced in Section 3, Table 2 and Figure 1 depict that income, age, education, and family size are the first four important variables in the analysis 1 . On average, income and age are the most important variables accounting for heterogeneity, which lead to values of the variable importance 64.4% and 15.6%, respectively. We should interpret the variable importance measure with caution, because researchers could reduce the importance measure of one variable by adding a highly correlated additional variable to the model. Accordingly, in this case, the two highly correlated variables have to share the sample splits. However, even with the caution mentioned above, we now have an additional dimension, τ, which suffices to compare variable importance across quantiles. Particularly, the importance of age variable increases as the savings propensity (quantile index) goes up. The importance 1 Following the default setting of the grf package, we set the max.depth equal to 4. of income variable, however, decreases across conditional distribution of net financial assets. In addition, these four variables are also identified as important in the context of double machine learning, cf. Chen and Tien (2019).

The Job Training Program
Abadie et al. (2002) use the Job Training Partnership Act (JTPA) data to estimate the quantile treatment effect of job training on the earning distribution 2 . The data is from Title II of the JTPA in the early 1990's, which consists of 11,204 samples, 5102 of which are male, and 6102 of which are female. In the estimation, they take 30-month earnings as the outcome variable, enrollment for JTPA service as the treatment variable, and a randomized offer of JTPA enrollment as the instrumental variable. Control variables include education, race, marital status, previous year work status, job training service strategies, age, and whether earnings data is from the second follow-up survey. In the female group, an additional control, aid to families with dependent children (AFDC), is added. We follow the same model specifications when estimating the GRF-IVQR.
Tables 3 and 4 show that for females, job training program generates a significantly positive treatment effect on earnings at 0.5 and 0.75 quantiles. GRF-IVQR signifies similar results.  Note: GRF-IV: 2127.544 (607.943). The GRF-IV stands for the 2SLS in the context of generalized random forests. AAI, CH and GRF-IVQR stand for, respectively, Abadie, Angrist and Imbens (2002), Chernozhukov and Hansen (2005) and our estimator. Numbers in parentheses are standard errors.
For the male group, Table 5 and Figure 2 depicts that work less than 13 weeks (wlkess13) and on-the-job training and/or job search assistance (ojt_jsa) are the most important variables. However, there is no apparent pattern suggesting that variable importance differs across quantiles. The pattern of variable importance resulting from the GRF-IV and the GRF-IVQR are different as well.   As to the GRF-IVQR, in Table 5, the variance importance for Hispanics and the age group 45 to 54 are 0 across all quantiles, while the GRF-IV suggests these two variables are of some importance. Possible explanations are as follows. Compared to the GRF-IVQR's moment condition, the GRF-IV still performs well in nodes with a relatively small amount of data. Consequently, the GRF-IVQR is more restrictive for growing a shallower tree than the GRF-IV. Therefore, some variables used to make a split in deeper nodes will not be chosen by the GRF-IVQR algorithm. Besides, at deeper nodes, the data is very similar in each node. Specifically, this situation occurs frequently with a large number of binary variables, and thus leads to no variation in a certain variable. Therefore, in practical estimation, the GRF-IVQR grows a relatively small tree.

GRF-IVQR at Specific
For the female group, Table 6 and Figure 3 depicts that classroom training (class_tr) and on-the-job training and/or job search assistance (ojt_jsa) are the most important variables. The importance of on-the-job training and/or job search assistance decreases across quantiles, which is different from the pattern in the male group. The issue concerning no variation of a binary variable in deeper nodes becomes severe in the female group.
The variance importance for Hispanics and several age binary variables are 0 across all quantiles, which indicates that in the female group, the aforementioned characteristics variables are more homogeneous over the conditional distribution of earnings.

Conclusions
Based on the generalized random forests of , we propose an econometric procedure to estimate the quantile treatment effect. Not only does this method estimate the treatment effect nonparametrically, but our procedure yields a measure of variable importance, in terms of heterogeneity among control variables. We provide the practical algorithm and the associated R codes. We also apply the proposed procedure to reinvestigate the distributional effect of 401(k) participation on net financial assets, and the quantile effect of participating a job training program on earnings. Income, age, education, and family size are identified as the first-four important variables in the 401(k) analysis. In the job training program example, our procedure suggests that the previous year work status and the job training service strategies are important control variables.
Author Contributions: Both authors contributed equally to the paper.

Acknowledgments:
We are grateful to the two anonymous referees for their constructive comments that have greatly improved this paper. We thank Patrick DeJarnette, Masaru Inaba, Min-Jeng Lin and Shinya Tanaka for discussions and comments. This paper has benefited from presentation at the Aoyama Gakuin University, Kansai University, and 2019 Annual Conference of Taiwan Economic Association. The usual disclaimer applies.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript:

DML
Double machine learning GRF Generalized random forests IVQR Instrumental variable quantile regression Appendix A Appendix A.1. Improving Efficiency by Doubly Robust Estimators Chernozhukov et al. (2018) and Athey and Wager (2018) pioneered the use of the doubly robust estimator embedded in a framework of causal machine learning. The resulting estimator becomes more accurate and gains efficiency. In light of their idea, it might be beneficial to incorporate the doubly robust estimation in our methodology.
A doubly robust augmented inverse propensity weighted (AIPW) estimator was introduced by Robins et al. (1994). The AIPW estimator for average treatment effect is constructed by two components as follows.
being the propensity score. The first line in the equation represents the inverse probability weighted estimator, and the second line depicts a weighted regression. The AIPW estimator is doubly robust because the estimator will be consistent, provided that at least one of the two components is correctly specified.

Appendix A.2. The Doubly Robust Estimation for Causal Forests
Athey and Wager (2019) and their grf R package implement a variant of doubly robust AIPW estimators for causal forests. Specifically, for estimating average treatment effect, their doubly robust estimator is shown as follows.
whereγ(X i ) is the conditional average treatment effect estimator based on causal forest,Γ i is the conditional average treatment effect estimator adjusted by inverse probability weighting,m(x) andê(x) are the estimators of E[Y|X = x] and E[D|X = x] which are based on random forest with honest splitting, and the average treatment effect estimatorγ is simply the sample average of those adjusted conditional average treatment effect estimates. Glynn and Quinn (2009) provide some evidence that the doubly robust estimator performs better in terms of efficiency than inverse probability weighting estimators, matching estimators, and regression estimators. To explore how adapting the doubly robust method in the causal forest estimator affects the efficiency and accuracy, we follow their DGP designs and conduct Monte Carlo experiments with different degree of confoundedness. In the simulation, X 1 , X 2 , and X 3 are covariates following N(0, 1), D is the treatment variable, Y is the outcome variable, and is the disturbance which follows N(0, 1). Two data generating processes are considered. Degree of confoundedness are modeled in three levels: low, moderate, and severe.

Outcome (control) Outcome (treatment)
Simple DGP Y = X 2 + X 3 + Y = 5 + 3X 2 + X 3 + Complicate DGP Y = X 2 + X 3 + Y = 5 + 3X 2 + X 3 + 2X 2 2 + 2X 2 3 + Degree of confoundedness True treatment assignment probabilities With three different sample sizes, 250, 500, and 1000, three degrees of confoundedness, and two DGP settings, the Monte Carlo results are tabulated in Table A2. The results confirm that the causal forest with doubly robust estimation indeed has efficiency gains over the conventional causal forest. Given Assumptions ATW1-ATW6, the Theorems 3 and 5 of  guarantee that the GRF estimator achieves consistency and asymptotic normality. In what follows, we check each assumptions for the proposed GRF-IVQR estimator.
In Chernozhukov and Hansen (2008), the moment functions are conditional on {X i , D i , Z i }. For simplicity, we write conditional functions as [ · |X i = x] when considering splitting in X i within the framework of generalized random forests.
Thus the expected score function We want the conditional cumulative distribution function is Lipschitz continuous. Since every function with bounded first derivatives is Lipschitz, we need the conditional density is bounded.
Assumption CH3 states that the conditional density f Y (Y|X, D, Z) is bounded by a constantf a.s.. In particular, f Y (Y|X, D, Z) is a density of a convolution of a continuous random variable and a discrete random variable, we also need the continuous variable not to be degenerate.
We want V is invertible and therefore Z i D i x D i Z i x x x needs to be invertible. In addition, the conditional density f (D i θ + X i ν|X i = x) is required to have continuous uniformly bounded first derivative. If f (D i θ + X i ν|X i = x) is continuously differentiable, then its first derivative is uniformly bounded. Those conditions are implied by Assumptions CH4 and CH5. Thus A p is invertible as well.
Taylor expansion implies the following approximation of γ.