Simultaneous Maximum Likelihood Estimation for Piecewise Linear Instrumental Variable Models

Analysis of instrumental variables is an effective approach to dealing with endogenous variables and unmeasured confounding in causal inference. We propose using piecewise linear models to fit the relationship between the continuous instrumental variable and the continuous explanatory variable, as well as the relationship between the continuous explanatory variable and the outcome variable, which generalizes the traditional linear instrumental variable models. The two-stage least squares and limited information maximum likelihood methods are used for the simultaneous estimation of the regression coefficients and the threshold parameters. Furthermore, we study the limiting distributions of the estimators in correctly specified and misspecified models and provide a robust estimator of the variance-covariance matrix. We illustrate the finite sample properties of the estimators in terms of Monte Carlo biases, standard errors, and coverage probabilities via simulated data. Our proposed model is applied to education-salary data to investigate the causal effect of children's years of schooling on estimated hourly wage, with father's years of schooling as the instrumental variable.


Introduction
In observational studies, measured confounders can be controlled by a variety of methods, such as propensity score based matching and regression adjustment. However, when a confounding variable is unmeasured, traditional causal inference methods usually lead to biased estimators, since changes in the unmeasured confounder induce changes in the explanatory variable, and both result in changes in the response variable. Failing to adjust for such a confounder leads to a spurious association between the explanatory variable and the outcome. Analysis of instrumental variables (IV) has gained popularity in causal inference, for example in investigating causal graphical structures [1,2] and controlling for unmeasured confounding [3,4]. An instrument is a variable that is correlated with the explanatory variable but not associated with any unmeasured confounders. In addition, the instrumental variable is supposed to influence the response variable only through the explanatory variable, i.e., there is no direct effect of this variable on the response. Instrumental variable analysis can be applied in many areas and disciplines, such as economics and epidemiology. For example, the causality between years of schooling and earnings has been studied in the economics literature [5]. This example exploits college proximity as the instrumental variable: it has been observed that those living near a college or university usually attain a significantly higher level of education than others, while college proximity is believed to improve earnings only by increasing the subject's years of schooling. Both points indicate that college proximity is a useful instrumental variable. In biomedical and epidemiological research, the main interest is to investigate the causal effect of an exposure variable on a certain disease outcome.
A gene can be assumed as a good instrument if it is closely linked to the exposure but has no direct effect on the disease [6]. The study of genetic variants as instrumental variables is known as Mendelian randomization, which is discussed extensively in the literature (e.g., [7,8]). For instance, a set of 32 recently identified genetic variants are used as instrumental variables to study whether child fat mass causally affects academic achievement and blood pressure [9].

Related Work
Since the development of instrumental variables, many instrumental variable estimation methods have been proposed for causal effect estimation. Two-stage least squares (2SLS) [10] is one of the most commonly used methods for instrumental variable estimation, and theoretical analyses such as consistency and asymptotic normality exist in the literature. When the response variable is binary, the second stage can be modified with logistic regression in Mendelian randomization studies [11]. Another approach is likelihood-based, particularly the limited information maximum likelihood (LIML) [12], which has been shown to be more effective in dealing with weak instruments [13]. The phenomenon of weak instruments occurs when the correlation between the instrument(s) and the explanatory variable is close to zero. With weak instruments, 2SLS is generally unstable and the causal effect estimators are badly biased. The typical rule of thumb for detecting weak instruments is based on the F-statistic: an instrument may be weak if the first-stage F-statistic is less than 10 [14].
Most IV approaches impose linear assumptions among the instrument, explanatory, and response variables. However, this is not always appropriate. For example, a subject's years of schooling may only have a positive effect on subsequent earnings if the subject obtained at least a high-school degree; there would be no difference in earnings between subjects who obtained only an elementary or a middle school education. In this hypothetical scenario, a linear regression model between the explanatory and response variables is clearly misspecified. When the null hypothesis of a linear relationship is rejected, one strategy is to develop piecewise linear models, which are more interpretable than fully nonlinear models.
In this paper, we propose a piecewise linear instrumental variable (PLIV) model for estimating the causal effect via a continuous threshold function. The continuous threshold function assumes that both the explanatory variable and the instrumental variable are continuous. Instrumental variable models with continuous variables have been studied extensively in the literature. For example, continuous instruments have been used in the classical IV models, developed in a structural equation modeling framework [15]. A recent paper proposes semiparametric doubly robust estimators of causal effects with continuous instruments [16]. Moreover, some discussions about a continuous exposure and a continuous response in Mendelian randomization can be found in a review paper [8].
A threshold in a variable occurs when there is a sudden change in the values of this variable. We call the point where the change happens a cut-off point or a threshold. A subset causal effect exists when there is a threshold in the explanatory variable. The proposed PLIV model is useful because it can capture the subset causal effect when the true model is not linear, and it degenerates to a linear instrumental variable model when the relationship among the variables is indeed linear. In other words, by using piecewise linear functions, we can quantitatively find the subset effects of the explanatory and the instrumental variables.
We use the Rectified Linear Unit (ReLU) function, mathematically defined in Equation (1), to incorporate the piecewise relationships. Utilization of the ReLU function for defining subset effects has been studied in the literature, such as a regression kink model that tests the presence of a threshold [17] and the segmented and hinge models that study subset effects in logistic regression [18]. Besides, a continuous threshold model via the ReLU function with two-way interactions has been considered in the Cox proportional hazards model, where asymptotic normality under mild conditions is established [19]. In this paper, we use a continuous threshold function with multiple thresholds to formulate the piecewise linear instrumental variable model. A similar piecewise linear instrumental variable model through a random slope approach has been studied in the literature [20]; it divides the data into a few segments and analyzes each segment individually, but suffers from substantial losses in efficiency and accuracy.

Contribution of This Article
In this paper, we consider a piecewise linear model when the linearity assumption of the data is inappropriate and provide a rigorous treatment of the statistical properties of the model. Our contributions can be summarized as follows.

• We simultaneously estimate the coefficients and thresholds of the piecewise linear instrumental variable model by the limited information maximum likelihood (LIML) method, assuming the number of thresholds is known.
• The proposed piecewise linear instrumental variable model degenerates to the linear instrumental variable model if there are no thresholds; therefore, it provides a generalization of the linear instrumental variable model. To the best of our knowledge, this is the first work on a piecewise linear extension of the traditional linear instrumental variable models.
• We also study the theoretical properties of the PLIV model, including the consistency and asymptotic normality of the estimators.

Piecewise Linear Instrumental Variable Model
Notations: we denote scalars by unbolded lowercase letters (e.g., sample size n and the i-th observation of the outcome variable y_i), random variables by unbolded capital letters (e.g., X), vectors by boldface lowercase letters (e.g., x_i and β), and matrices by boldface capital letters (e.g., X).
In the ordinary linear regression model y_i = x_i^T β + ε_i, there is an assumption that the explanatory variables are uncorrelated with the error term, i.e., cov(x_i, ε_i) = 0. However, in some situations the covariance between the explanatory variables and the error term is nonzero. This leads to inconsistent ordinary least squares estimation due to the endogeneity in x. One way to deal with this issue is to introduce an instrumental variable, whose changes are related to changes in the explanatory variable but do not directly lead to changes in the response variable.
Let (x_i, y_i, z_i), i = 1, . . . , n, denote the observed data for (X, Y, Z), where X is the explanatory variable, Y is the response variable, and Z is the instrumental variable. To estimate the subset causal effect and establish the piecewise linear relationship, for any threshold parameter t ∈ R, we use a continuous threshold function defined as

    ϕ(x, t) = (x − t)_+ = (x − t) I(x > t),    (1)

where I(·) is an indicator function. The ReLU function, commonly used as an activation function in deep learning, is the special case with t = 0, so that ϕ(x, 0) = max(x, 0). The proposed model provides sparsity and computational efficiency compared to the smoothing or approximation approaches in the literature: the estimation stage involves indicator functions but does not require approximating them. Let K and J denote the numbers of thresholds in Z and X, respectively. Denote c = (c_1, . . . , c_K)^T as the vector of thresholds in Z and t = (t_1, . . . , t_J)^T as the vector of thresholds in X. We propose the following piecewise linear instrumental variable model:

    y_i = β_0 + ∑_{j=1}^{J} β_j ϕ(x_i, t_j) + β_{J+1} x_i + u_i,    (2)
    x_i = α_0 + ∑_{k=1}^{K} α_k ϕ(z_i, c_k) + α_{K+1} z_i + v_i,    (3)

where β = (β_0, . . . , β_{J+1})^T is the vector of coefficients representing the causal effect of X on Y; α = (α_0, . . . , α_{K+1})^T is the vector of coefficients representing the instrumental effect of Z on X; and u_i and v_i are the error terms for the i-th observation. In the context of causal inference, we interpret β as the causal effect of x on y. More specifically, for t_j < x ≤ t_{j+1}, 1 ≤ j ≤ J, with t_{J+1} denoting the maximum value of x, a one-unit increase in x leads to a change of β_{J+1} + ∑_{j'=1}^{j} β_{j'} units in y. Besides, β_{J+1} represents the change in y caused by a one-unit increase in x for t_0 < x ≤ t_1, where t_0 is the minimum value of x. To better understand this, in Figure 1 we plot the function y = ϕ(x, 2) + 3ϕ(x, 3) + 2x, where β_1 = 1, β_2 = 3, β_3 = 2, as an example. When 2 < x ≤ 3, the slope is β_1 + β_3 = 3. When 3 < x ≤ 4, the slope is β_1 + β_2 + β_3 = 6.
Here, we assume K and J are prespecified according to prior knowledge or theoretical justifications. In practice, we may use the Akaike information criterion (AIC) or the Bayesian information criterion (BIC) [21] to select them. A more elegant examination of the condition for the number of thresholds can be found in Newey [22]. In particular, when α_1 = · · · = α_K = 0 and β_1 = · · · = β_J = 0, our proposed model degenerates to the traditional linear instrumental variable model.
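The slope calculation above can be checked numerically. The following sketch (assuming a NumPy environment; the names `phi` and `y` are ours) evaluates the example function y = ϕ(x, 2) + 3ϕ(x, 3) + 2x on the two segments.

```python
import numpy as np

def phi(x, t):
    """Continuous threshold (shifted ReLU) function: (x - t) * I(x > t)."""
    return np.maximum(x - t, 0.0)

# The worked example from the text: y = phi(x, 2) + 3*phi(x, 3) + 2*x,
# i.e. beta_1 = 1, beta_2 = 3, beta_3 = 2 with thresholds t = (2, 3).
def y(x):
    return phi(x, 2.0) + 3.0 * phi(x, 3.0) + 2.0 * x

# Slope on (2, 3] is beta_1 + beta_3 = 3;
# slope on (3, 4] is beta_1 + beta_2 + beta_3 = 6.
slope_23 = (y(2.9) - y(2.1)) / 0.8
slope_3plus = (y(4.0) - y(3.5)) / 0.5
```

Finite differences within each segment recover the piecewise slopes exactly, since the function is linear between adjacent thresholds.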
For instrumental variable analysis, an instrumental variable is correlated with the explanatory variable but not with the error term. In our model, (Z − c)_+ = {(Z − c_1)_+, · · · , (Z − c_K)_+} is the vector of instrumental variables with the following properties: cov{(Z − c_k)_+, U} = 0 and cov{(Z − c_k)_+, X} ≠ 0 for k = 1, . . . , K. We assume K ≥ J for identifiability, i.e., the number of instruments should be larger than or equal to the number of endogenous variables.

Remark 1.
Note that intensive research about nonlinear instrumental variable models has been conducted in the literature, such as the nonparametric instrumental regression [23][24][25]. We point out that the target of our method is to quantitatively find the thresholds and estimate the subset causal effects. We aim to generalize the traditional linear IV model and fit an interpretable model rather than approximate the data by a nonlinear function.
To estimate the unknown parameters in (2) and (3), we utilize the two-stage least squares (2SLS) method and the limited information maximum likelihood (LIML) method. Details of the proposed estimation methods are discussed below.

Simultaneous Maximum Likelihood Estimation
We first introduce how the LIML method is used in our model and initialize the naive estimators by the 2SLS method.

Limited Information Maximum Likelihood
As discussed in the Introduction, limited information maximum likelihood is another popular approach for estimation in instrumental variable models. Here, we assume the error terms (U, V) are jointly normally distributed and correlated to some extent due to the unmeasured confounding effect. Let 0 be the zero-mean vector and ρ be the correlation of (U, V). Denote σ_u^2 and σ_v^2 as the variances of the error terms U and V, respectively. Then the probability density function of the bivariate normal (U, V) can be written as

    f(u, v) = 1 / {2π σ_u σ_v √(1 − ρ^2)} exp[ −1/{2(1 − ρ^2)} { u^2/σ_u^2 − 2ρuv/(σ_u σ_v) + v^2/σ_v^2 } ].

For a single observation, the log-likelihood is

    ℓ(u_i, v_i; θ) = −log(2π) − log σ_u − log σ_v − (1/2) log(1 − ρ^2) − 1/{2(1 − ρ^2)} { u_i^2/σ_u^2 − 2ρ u_i v_i/(σ_u σ_v) + v_i^2/σ_v^2 },

where u_i = y_i − β_0 − ∑_{j=1}^{J} β_j ϕ(x_i, t_j) − β_{J+1} x_i and v_i = x_i − α_0 − ∑_{k=1}^{K} α_k ϕ(z_i, c_k) − α_{K+1} z_i depend on θ. To simplify notation, we let ℓ(θ) = ℓ(u_i, v_i; θ). The maximum likelihood estimate for θ is obtained by maximizing the log-likelihood within a compact set Θ ⊂ R^{D(θ)}, such that θ̂_n = arg max_{θ∈Θ} ℓ_n(θ), where ℓ_n(θ) = (1/n) ∑_{i=1}^n ℓ(u_i, v_i; θ). However, there is no closed-form solution for θ, so we use a gradient-based algorithm for estimation. This yields approximate M-estimators. To speed up estimation, we use the two-stage least squares method to initialize the estimators.
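The per-observation log-likelihood above translates directly into code. This is a minimal sketch (function and argument names are ours) of the bivariate-normal log-likelihood averaged over the residual pairs.

```python
import numpy as np

def loglik(u, v, sigma_u, sigma_v, rho):
    """Average bivariate-normal log-likelihood of the residual pairs (u_i, v_i)."""
    # Quadratic form inside the exponent of the bivariate normal density.
    q = (u / sigma_u) ** 2 - 2.0 * rho * (u / sigma_u) * (v / sigma_v) + (v / sigma_v) ** 2
    return np.mean(
        -np.log(2.0 * np.pi) - np.log(sigma_u) - np.log(sigma_v)
        - 0.5 * np.log(1.0 - rho ** 2) - q / (2.0 * (1.0 - rho ** 2))
    )
```

In the full estimation, `u` and `v` would be the residuals of models (2) and (3), so this objective is a function of all of (β, α, t, c, σ_u, σ_v, ρ) and can be handed to a gradient-based optimizer.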

Initialization: Two-Stage Least Square
The traditional two-stage least squares method regresses the explanatory variable on the instrumental variable and computes the predictions x̂ in the first stage. In the second stage, it regresses the response variable on the predictions x̂. The causal effect of interest is estimated from the second stage. In our method, we employ 2SLS to obtain the initial values of the parameters of the piecewise linear instrumental variable model. Below we describe the 2SLS procedure for initialization. Stage 1: we regress x on {(z − c)_+, z} and obtain the fitted values x̂, where

    x̂_i = α̂_0 + ∑_{k=1}^{K} α̂_k ϕ(z_i, c_k) + α̂_{K+1} z_i.

Stage 2: we fit the following regression model:

    y_i = β_0 + ∑_{j=1}^{J} β_j ϕ(x̂_i, t_j) + β_{J+1} x̂_i + e_i.

For each combination of the numbers of thresholds in X and Z, we could pick c, t, and the regression coefficients simultaneously through a grid search that minimizes the sum of squared errors (SSE) of Y. However, for J ≥ 2 or K ≥ 2, such a grid search is computationally expensive. Since we only need 2SLS to provide the initialization of the parameters, we choose c to be a vector of points evenly spaced between the 5% and 95% quantiles of Z, and similarly choose t to be a vector of points evenly spaced between the 5% and 95% quantiles of X. We ignore points below the 5% and above the 95% quantiles in order to avoid boundary effects. The regression coefficients are then obtained accordingly.
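The initialization just described can be sketched in NumPy (our own implementation; the quantile placement of c and t follows the description above, and all names are ours).

```python
import numpy as np

def phi(x, t):
    """Continuous threshold (shifted ReLU) function."""
    return np.maximum(x - t, 0.0)

def tsls_init(x, y, z, K, J):
    """2SLS initialization: thresholds fixed at evenly spaced quantiles
    between the 5% and 95% levels; coefficients from two OLS fits."""
    c = np.quantile(z, np.linspace(0.05, 0.95, K + 2)[1:-1])  # thresholds in z
    t = np.quantile(x, np.linspace(0.05, 0.95, J + 2)[1:-1])  # thresholds in x
    # Stage 1: regress x on an intercept, {(z - c_k)_+}, and z.
    Z1 = np.column_stack([np.ones_like(z)] + [phi(z, ck) for ck in c] + [z])
    alpha = np.linalg.lstsq(Z1, x, rcond=None)[0]
    xhat = Z1 @ alpha
    # Stage 2: regress y on an intercept, {(xhat - t_j)_+}, and xhat.
    X2 = np.column_stack([np.ones_like(xhat)] + [phi(xhat, tj) for tj in t] + [xhat])
    beta = np.linalg.lstsq(X2, y, rcond=None)[0]
    return alpha, beta, c, t
```

The returned (α, β, c, t) serve only as starting values for the gradient-based LIML maximization; they are not the final estimates.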

Theoretical Analysis
Under mild conditions, we study the statistical properties of the proposed model and establish robust variance-covariance estimators for the estimated parameters under the correctly specified and misspecified models, separately. To investigate the theoretical properties, we consider the following regularity conditions. C2: the explanatory variable X and the instrumental variable Z are continuous in the parameter space, i.e., they have continuous probability density functions f_X(·) and f_Z(·). The density functions are uniformly bounded; that is, there exist constants c_1, c_2, c̃_1, and c̃_2 such that c_1 ≤ f_X(·) ≤ c_2 and c̃_1 ≤ f_Z(·) ≤ c̃_2. Furthermore, the true values of the coefficients for the threshold effects satisfy β_j ≠ 0 for j = 1, . . . , J and α_k ≠ 0 for k = 1, . . . , K.

Remark 2.
Condition C1 is commonly used in regression models. Condition C2 is used for estimating the unknown thresholds and ensures the model is identifiable. The continuity requirements of X and Z are used to estimate the thresholds. Condition C3 is used to establish the consistency and the asymptotic normality of the maximum likelihood estimator.
In terms of estimation, we take the gradient-based method, which depends on the first-order derivative ℓ̇(θ) = ∂ℓ(θ)/∂θ (details can be found in Appendix A), with the estimators initialized by 2SLS. In this paper, we do not approximate the indicator function by the logistic function as some researchers do (e.g., [18,26,27]). The gradient-based algorithm for the ReLU function has shown success in the context of deep learning and machine learning. Compared to the approximation techniques discussed in Section 1, model estimation with the ReLU function is computationally cheaper since no approximation of the indicator function is required. In fact, as long as Condition C2 is satisfied, which requires the variables X and Z to be continuous, the gradients composed of indicator functions converge to continuous functions of the threshold parameters as n → ∞; for example, (1/n) ∑_{i=1}^n I(z_i > c_k) → P(Z > c_k) = 1 − F_Z(c_k) for k = 1, . . . , K by the law of large numbers. Therefore, the second-order derivative of the ReLU function with respect to the thresholds can be derived from the resulting continuous probability function. More specifically, the second-order derivative with respect to c_k is simply −f_Z(c_k).
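The law-of-large-numbers argument can be verified empirically. In this sketch (with Z standard normal, as in our simulations; variable names are ours), the empirical mean of the indicator approaches 1 − F_Z(c_k), a smooth function of the threshold.

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(1)
z = rng.normal(size=200_000)   # Z ~ N(0, 1)
c = 0.5                        # a candidate threshold

# (1/n) * sum I(z_i > c) -> P(Z > c) = 1 - F_Z(c), whose derivative
# in c is -f_Z(c), which is continuous in c by Condition C2.
empirical = np.mean(z > c)
true_tail = 1.0 - 0.5 * (1.0 + erf(c / sqrt(2.0)))  # 1 - Phi(0.5)
```

Although each indicator is discontinuous in c, its average converges to a differentiable limit, which is what makes the gradient-based algorithm well behaved.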
To prove the asymptotic normality, we first need to show the consistency of the proposed estimators.

Theorem 1. Under the regularity conditions, θ̂_n →p θ_0.

Proof. The proof follows Theorem 5.7 of van der Vaart [28]. For completeness, we include it as Theorem A1 in Appendix B. To utilize Theorem 5.7, we need to check the condition that ℓ_n(θ̂_n) ≥ ℓ_n(θ_0) − o_P(1) for some θ_0 ∈ Θ_0. This is true since ℓ_n(θ) is continuous in θ, ℓ_n(θ) converges to ℓ(θ) uniformly, and θ̂_n (approximately) maximizes ℓ_n(θ). Thus, all the conditions are satisfied and the result follows.

Theorem 2.
Under conditions C1–C4, let θ_0 be the true value of θ, and let ℓ̇(θ) be a measurable function with E{ℓ̇(θ)ℓ̇(θ)^T}_(i,j) < ∞ for i, j = 1, . . . , |θ|*, where |θ|* denotes the number of elements in θ. Then

    √n (θ̂_n − θ_0) →d N(0, V_{θ_0}^{−1} M_{θ_0} V_{θ_0}^{−1}),

where M_{θ_0} = E{ℓ̇(θ_0) ℓ̇(θ_0)^T}, ℓ̇(θ_0) is the first-order derivative of ℓ(θ) with respect to θ evaluated at θ_0, and V_{θ_0} is the second-order derivative of E{ℓ(θ)} with respect to θ evaluated at θ_0 (derivations in Appendix A). V_θ has the block-diagonal form

    V_θ = ( V_θ^(1)    0
              0      V_θ^(2) ),

where 0 denotes a zero vector or a zero matrix of conformable dimension. Details of V_θ^(1) and V_θ^(2) are given in Appendix A.
Proof. First, note that ℓ(θ) is Lipschitz continuous in θ. Moreover, the fact that V_θ is continuous in θ admits a Taylor expansion of E{ℓ(θ)} around θ_0. Since θ̂_n is the maximum likelihood estimate of θ, and θ̂_n →p θ_0 by Theorem 1, we conclude from Theorem 5.14 of van der Vaart [28] that

    √n (θ̂_n − θ_0) = −V_{θ_0}^{−1} (1/√n) ∑_{i=1}^n ℓ̇(θ_0) + o_P(1),

which implies an asymptotic normal distribution with mean 0 and variance-covariance matrix V_{θ_0}^{−1} M_{θ_0} V_{θ_0}^{−1}. For completeness, we include Theorem 5.14 of van der Vaart [28] as Theorem A2 in Appendix B. When the model is correctly specified, V_{θ_0} = −M_{θ_0}, so the asymptotic variance is the inverse of the Fisher information. The matrices V_{θ_0} and M_{θ_0} are estimated by replacing θ_0 with the MLE θ̂_n. Thus, for the correctly specified model, the variance-covariance matrix is estimated by the inverse of M_{θ̂_n}. For the misspecified model, the variance-covariance matrix is estimated by V_{θ̂_n}^{−1} M_{θ̂_n} V_{θ̂_n}^{−1}. Let V_n denote the second derivative of ℓ_n(θ) with respect to θ; then we can decompose V_n in the same way as V_θ into two matrices V_n^(1) and V_n^(2). Note that V_n is the empirical counterpart of V_θ and V_n →p V_θ by the law of large numbers, so we use the estimated probability densities f̂_Z(ĉ_k) and f̂_X(t̂_j) in place of f_Z(c_k) and f_X(t_j) for k = 1, . . . , K and j = 1, . . . , J, respectively.
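The plug-in sandwich estimator described above can be written generically. This sketch (all names ours) takes per-observation scores and Hessians of the log-likelihood, evaluated at the estimate, and returns the robust covariance of θ̂_n.

```python
import numpy as np

def sandwich(scores, hessians):
    """Robust (sandwich) covariance V^{-1} M V^{-1} / n, where M is the
    average outer product of per-observation scores (n x p) and V is the
    average per-observation Hessian (n x p x p) of the log-likelihood."""
    n = scores.shape[0]
    M = scores.T @ scores / n      # estimate of M_{theta_0}
    V = hessians.mean(axis=0)      # estimate of V_{theta_0}
    Vinv = np.linalg.inv(V)
    return Vinv @ M @ Vinv / n     # covariance of theta-hat itself
```

Under a correctly specified model, M ≈ −V and the expression collapses to the inverse Fisher information divided by n; under misspecification, the sandwich form remains valid.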

Simulation Studies
In this section, we evaluate the performance of the proposed model using simulated datasets. We consider two scenarios with the same sample size n = 500. We let the error terms U and V be jointly normally distributed with mean 0 and correlation ρ ∈ {0.2, 0.5, 0.8}, with a common standard deviation σ_u = σ_v = √0.3. Besides, we simulate the instrumental variable Z ∼ N(0, 1). The first scenario has one threshold in X and one threshold in Z, and takes the following form:

    x_i = −1 + 0.5 ϕ(z_i, 0.5) + z_i + v_i,
    y_i = −0.2 + ϕ(x_i, 0) + 0.5 x_i + u_i,

i.e., the true values of the parameters in the PLIV model are α = (−1, 0.5, 1), β = (−0.2, 1, 0.5), c = 0.5, and t = 0. The second scenario has two thresholds in X and two thresholds in Z, and takes the following form:

    x_i = −1 + 0.5 ϕ(z_i, −1) + ϕ(z_i, 1) + z_i + v_i,
    y_i = −1 + 1.2 ϕ(x_i, −1) + ϕ(x_i, 2) + 0.5 x_i + u_i,

i.e., the true parameters are α = (−1, 0.5, 1, 1), β = (−1, 1.2, 1, 0.5), c = (−1, 1), and t = (−1, 2). We show the simulated piecewise linear instrumental variable models for scenarios 1 and 2 in Figure 2. We replicate the simulation 1000 times to evaluate the finite sample properties of the proposed PLIV method. Table 1 summarizes the biases, standard errors of θ̂, and coverage probabilities of θ by the proposed PLIV method for scenario 1, where tse is the theoretical standard error and ese is the empirical standard error. As we can see in the table, all the biases of θ̂ are close to zero. We also find that the theoretical and empirical standard errors are close, which confirms the validity of our theoretical results in Section 3. The results show that our model estimation is quite accurate and therefore provides approximately unbiased and consistent estimators. Besides, we notice that the coverage probabilities are around 95% under different values of ρ. Moreover, the biases and the standard errors decrease as we increase ρ because the instrumental variable becomes stronger.
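Scenario 1 can be reproduced with a short generator. This is a sketch under our reading of the setup (seed, covariance construction, and function names are ours).

```python
import numpy as np

def phi(x, t):
    """Continuous threshold (shifted ReLU) function."""
    return np.maximum(x - t, 0.0)

def simulate_scenario1(n=500, rho=0.5, seed=0):
    """Draw one dataset from scenario 1: one threshold in each of X and Z."""
    rng = np.random.default_rng(seed)
    # (U, V) jointly normal, var 0.3 each, correlation rho => cov = 0.3 * rho.
    cov = [[0.3, 0.3 * rho], [0.3 * rho, 0.3]]
    u, v = rng.multivariate_normal([0.0, 0.0], cov, size=n).T
    z = rng.normal(size=n)                         # Z ~ N(0, 1)
    x = -1.0 + 0.5 * phi(z, 0.5) + 1.0 * z + v     # alpha = (-1, 0.5, 1), c = 0.5
    y = -0.2 + 1.0 * phi(x, 0.0) + 0.5 * x + u     # beta = (-0.2, 1, 0.5), t = 0
    return x, y, z
```

Each Monte Carlo replication draws one such dataset with a fresh seed, fits the PLIV model, and records the estimates; biases and empirical standard errors are then computed across the 1000 replications.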
Table 2 summarizes the biases, standard errors of θ̂, and 95% coverage probabilities of θ by the PLIV method for scenario 2, where tse is the theoretical standard error and ese is the empirical standard error. We find similar patterns to those in Table 1 for scenario 1. For instance, all the biases are small, and the theoretical and empirical standard errors are close to each other. Most coverage probabilities are around 95% when ρ = 0.2 and ρ = 0.5. We also observe that the coverage probabilities of the thresholds are slightly low when ρ = 0.8, which might be due to the high correlation between the errors: with multiple thresholds and high correlation, it is challenging to estimate the exact threshold locations. We include results with a sample size of 1000 in Appendix C, fixing ρ = 0.5. Overall, as n increases, both the biases and the standard errors drop.

Application
In this section, we revisit Card's education data [5]. We apply the proposed model to study the causal effect of years of schooling on hourly wage in cents, with father's years of schooling as the instrumental variable. The interest here is to find a threshold and study the threshold effect of the years of schooling. It is generally believed that a child's years of schooling has a direct effect on the child's wage, and that parents' education affects the child's income only through the child's education level. In other words, parents' education level has no direct effect on the child's wage. Therefore, the father's years of schooling can be treated as a valid instrumental variable.
In Card's data, we remove the missing values and include a total of n = 2657 observations. The explanatory variable X (child's years of education) is between 1 and 18 with median 13, and the instrumental variable Z (father's years of education) has minimum 0, maximum 18, and median 12. Figure 3 indicates that the variables X and Y are skewed and have heavy tails, so transformations are needed before the analysis; a log transformation is applied to both. Table 3 shows the point estimates, standard errors, and associated 95% confidence intervals of θ by the proposed model with K = 1 and J = 0, which are selected by BIC. In the table, α_1 and c are the coefficient and threshold for the transformed father's years of schooling, respectively, and β_1 is the causal effect of years of schooling on earnings. The estimated causal effect of interest, β̂_1 = 0.87, means that an increase of a units in the log of years of schooling multiplies the wage by a factor of exp(0.87 × a). In economics, β̂_1 is interpreted as an "elasticity": if years of education increases by 1%, the person's income is estimated to increase by 0.87%. In terms of the instrumental variable, the threshold c is estimated to be 7.86; the corresponding p-value is not calculated since testing c = 0 is meaningless in this context. This shows that there exists a threshold at around 8 in the father's years of schooling: the father's years of schooling only has a positive effect on the child's years of schooling if the father receives at least 8 years of education. This information cannot be observed if the traditional 2SLS method or nonparametric approaches are applied to analyze the data. The threshold effect as well as the other coefficients are statistically significant, since their corresponding p-values are far less than 0.05.

Discussion, Limitations, and Future Research
In this paper, we propose a simultaneous maximum likelihood estimation for a piecewise linear instrumental variable model. We use the two-stage least square estimators as the initial values and the limited information maximum likelihood methods to estimate the regression coefficients and the threshold parameters simultaneously. We also provide a robust inference of the proposed model. The proposed model with the piecewise linear functions allows us to find the thresholds for both the explanatory and the instrumental variables, which generalizes the traditional linear instrumental variable models. In the simulation study, we evaluate the performance of the proposed model and find that it behaves well in terms of the biases, standard errors, and coverage probabilities in different settings.
In our model, we include a single continuous explanatory variable and a single continuous instrumental variable. More complicated cases can be considered; for example, developing a piecewise linear model with count data might be interesting, although finding the optimal number of thresholds as well as their locations is theoretically challenging. Furthermore, we assume the numbers of thresholds K and J are prespecified. Treating the numbers of thresholds as random variables, finding their optimal values, and investigating the corresponding theoretical properties can be future research.

Institutional Review Board Statement: Not applicable.

Informed Consent Statement: Not applicable.
Data Availability Statement: Data used in the application section come from the ivmodel package of CRAN, which can be downloaded from https://github.com/hyunseungkang/ivmodel/tree/master/ data (accessed on 31 August 2022). Codes to simulate data, generate tables and plots in Section 4 can be found at https://github.com/shuoshuoliu/PLIV (accessed on 31 August 2022).

Acknowledgments:
We thank the editor, the associate editor, and the three reviewers for careful reviews and insightful comments, which have improved this article.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix B. Theorems
Define Pf as the expectation Ef(X) = ∫ f dP and abbreviate the average n^{−1} ∑_{i=1}^n f(X_i) to P_n f, the empirical measure. Furthermore, we define Ψ_n(θ) = P_n ψ_θ.

Theorem A1 (Theorem 5.7 of van der Vaart [28]). Let M_n be random functions and let M be a fixed function of θ such that for every ε > 0,

    sup_{θ∈Θ} |M_n(θ) − M(θ)| →P 0,
    sup_{θ: d(θ, θ_0) ≥ ε} M(θ) < M(θ_0).

Then every sequence of estimators θ̂_n with M_n(θ̂_n) ≥ M_n(θ_0) − o_P(1) converges in probability to θ_0.

Theorem A2 (Theorem 5.14 of van der Vaart [28]). For each θ in an open subset of Euclidean space, let θ → ψ_θ(x) be twice continuously differentiable for every x. Suppose that Pψ_{θ_0} = 0, that P‖ψ_{θ_0}‖^2 < ∞, and that the matrix Pψ̇_{θ_0} exists and is nonsingular. Assume that the second-order partial derivatives are dominated by a fixed integrable function ψ̈(x) for every θ in a neighborhood of θ_0. Then every consistent estimator sequence θ̂_n such that Ψ_n(θ̂_n) = 0 for every n satisfies

    √n (θ̂_n − θ_0) = −(Pψ̇_{θ_0})^{−1} (1/√n) ∑_{i=1}^n ψ_{θ_0}(X_i) + o_P(1).

In particular, the sequence √n(θ̂_n − θ_0) is asymptotically normal with mean zero and covariance matrix (Pψ̇_{θ_0})^{−1} P{ψ_{θ_0} ψ_{θ_0}^T} (Pψ̇_{θ_0})^{−1}.

Note: all numbers are multiplied by 1000. These results are based on 1000 replications.