Sequentially Estimating the Approximate Conditional Mean Using Extreme Learning Machines

This study examined the extreme learning machine (ELM) applied to the Wald test statistic for the model specification of the conditional mean, which we call the WELM testing procedure. The omnibus test statistics available in the literature weakly converge to a Gaussian stochastic process under the null that the model is correct, and this makes their application inconvenient. By contrast, the WELM testing procedure is straightforwardly applicable when detecting model misspecification. We applied the WELM testing procedure to the sequential testing procedure formed by a set of polynomial models and estimated an approximate conditional expectation. We then conducted extensive Monte Carlo experiments to evaluate the performance of the sequential WELM testing procedure and verified that it consistently estimates the most parsimonious conditional mean when the set of polynomial models contains a correctly specified model. Otherwise, it consistently rejects all the models in the set.


Introduction
Conducting data inference using correctly specified models is desirable for predicting future observations. If models are misspecified, however, proper data inference cannot be conducted, and predicting future observations may then involve an undesired bias. Because of this, previous studies have developed methodologies to test for correct model specification. For example, in a classical study, Ramsey [1] provides a test statistic for non-linearity. In another classical study, Bierens [2] provides an omnibus model specification test statistic that detects arbitrary model misspecification consistently. In addition to these works, a number of studies provide correct model specification testing methodologies [3-7].
Despite the rapid development of correct model specification testing, researchers may still be unable to obtain a correctly specified model and may have to predict future observations using misspecified models. If all candidate models are found to be misspecified by model specification tests, the model with the lowest mean squared error is typically chosen to forecast future observations, even if it is known to be misspecified.
To address this concern, the present study provides a robust methodology to search for a correct model in a systematic way. To do so, we developed a sequential testing procedure that combines the model specification test statistic available in the previous literature with high-degree polynomial models, so that a close approximation of the conditional mean equation can be consistently estimated.
A further advantage of this approach is that the conditioning variable does not have to be positively valued, as required by Cho and Phillips [17]. Even when the conditioning variable is negatively valued, the sequential testing procedure using the WELM test statistic is directly applicable.
The rest of this paper is organized as follows. Section 2 focuses on the polynomial model and provides the null limit distribution of the WELM test statistic, along with a literature review. Section 3 applies the WELM test statistic to the sequential testing procedure and provides the theoretical results. Section 4 discusses the extensive simulations conducted using the WELM test statistic and sequential testing procedure. We consider three data-generating processes (DGPs) and examine how the sequential testing procedure responds to various plans for the level of significance. Section 5 provides concluding remarks and summarizes the main findings. All the mathematical proofs are presented in Appendix A.

Method 1: Application of the WELM Test to the Polynomial Model
In this section, we first describe the main motivation of this study in relation to the development of the model specification testing literature. To fix ideas, we focus on the WELM test statistic applied to the polynomial model. Our primary interest is in developing a statistical methodology to estimate the conditional mean equation of time-series observations. We therefore suppose that the data are weakly dependent observations, as follows:

Assumption 1 (DGP). Let (Ω, F, P) be a complete probability space and let k ∈ N. Let {(y_t, x_t, d_t')' : Ω → R^{2+k} : t = 1, 2, ...} be a strictly stationary and absolutely regular process with mixing coefficients β_τ such that for some ρ > 1, ∑_{τ=1}^∞ τ^{2ρ/(ρ−1)} β_τ < ∞, and x_t is strictly non-negative with probability 1.
Here, y_t and z_t := (x_t, d_t')' are serially dependent target and explanatory variables, respectively, and z_t can contain lagged target variables, so that dynamic misspecification can be removed from our consideration. Specifically, researchers are concerned about possible non-linearity with respect to x_t when they attempt to approximate the conditional mean equation E[y_t | F_t] using the p-th-degree polynomial function x_t(p)'α*(p) + d_t'η*, where x_t(p) := (1, x_t, ..., x_t^p)', θ*(p) := (α*(p)', η*')' is the linear coefficient of (x_t(p)', d_t')', and F_t is the smallest σ-field generated by (z_t, y_{t−1}, z_{t−1}, y_{t−2}, ...).
Polynomial functions are uniformly dense, and this motivates us to estimate the conditional mean using the above specification. The Stone-Weierstrass theorem implies that continuous functions on a compact set are uniformly approximated by polynomials of sufficiently high degree, so that the above polynomial function becomes a successful approximation of the conditional mean if the degree p is sufficiently large.
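This approximation logic is easy to see numerically. The following sketch is our own illustration (the sine target is a hypothetical choice, not from the paper): it fits least-squares polynomials of increasing degree and tracks the uniform (sup-norm) approximation error, which shrinks as the degree grows.

```python
import numpy as np

# Hypothetical continuous target on the compact set [0, 1].
x = np.linspace(0.0, 1.0, 400)
target = np.sin(2.0 * np.pi * x)

# Least-squares polynomial fits of increasing degree; record the sup-norm error.
sup_errors = []
for degree in (1, 3, 5, 7, 9):
    coefs = np.polyfit(x, target, degree)
    approx = np.polyval(coefs, x)
    sup_errors.append(float(np.max(np.abs(target - approx))))
```

A degree-1 fit leaves a large uniform error, while by degree 9 the error is negligible, matching the Stone-Weierstrass intuition that a sufficiently high degree yields a uniformly good approximation.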
The current study seeks to provide a statistical method to estimate the degree of the polynomial function in the most parsimonious manner. The non-local behavior of a high-degree polynomial model is one of the drawbacks of estimating such a model by regression: outliers of x_t can substantially affect the estimated forecast, which reduces the utility of polynomial model estimation [21].
We accommodate this aspect by estimating the most parsimonious polynomial model. Specifically, we estimate the polynomial degree p to be as small as possible, and for this purpose, we provide the sequential testing methodology described in the next section. In particular, our testing approach is based upon the generically comprehensively revealing (GCR) property of an ANN model and ELMs.
To describe our testing procedure using the ELM applied to the GCR property, we note that Stinchcombe and White [10] show that when the regression model is estimated by attaching an analytic function to a linear model, the linear coefficient consistently estimates a non-zero coefficient if and only if the regression model is misspecified for the conditional mean equation; the authors refer to this as the GCR property. Specifically, the following assumption gives the model advocated by Stinchcombe and White [10]:

Assumption 2 (Model). M_p := {x_t(p)'α(p) + d_t'η + λΨ(δx_t) : θ(p) := (α(p)', η')' ∈ Θ(p), λ ∈ Λ, δ ∈ ∆}, where Ψ(·), δ, and λ are the additional hidden unit constructed by an analytic function, the input-to-hidden weight, and the hidden-to-output weight, respectively.
Here, if we let λ* be the probability limit of the parameter estimated by regression, the GCR property implies that λ* is consistently different from zero if the p-th-degree polynomial model is misspecified for the conditional mean. Therefore, we can detect whether the p-th-degree polynomial model is correct by testing whether the coefficient of the hidden unit is zero. That is, if the estimated hidden-to-output weight is statistically different from zero, M_p does not approximate the conditional mean sufficiently well; otherwise, M_p becomes a successful approximation of the conditional mean. This motivates us to rephrase the hypotheses H_0': E[y_t | F_t] = x_t(p)'α*(p) + d_t'η* with probability 1 versus H_1': the negation of H_0', as the following equivalent hypotheses in their framework: H_0: λ* = 0 versus H_1: λ* ≠ 0. This implies that we can let our null model be M_p^0 := {f_0(z_t; θ(p)) : θ(p) ∈ Θ(p)} with f_0(z_t; θ(p)) := x_t(p)'α(p) + d_t'η for p = 1, 2, .... In what follows, we let Ψ_t(δ) denote Ψ(δx_t) for notational simplicity. The GCR property can thus be exploited by testing H_0 against H_1; that is, we need to test whether the hidden-to-output weight is zero.
We now provide the regularity conditions for the regular behavior of the test statistics provided below: Assumption 3 (Regularity). (i) (∆, D, Q) and (Ω × ∆, F × D, P · Q) are complete probability spaces.
(ii) For p ∈ N, Θ(p) is a non-empty compact and convex set, and Λ and ∆ are non-empty compact and convex subsets such that 0 is an interior element of Λ.
(iii) The required moments are finite, and there is a sequence of stationary and ergodic random variables dominating the model functions.

Assumptions 1-3 are obtained by adapting the regularity conditions in Cho and White [15] to the current polynomial model structure. Their model allows non-linearity with respect to the parameters, and we further simplify their assumptions by imposing the polynomial model structure used herein, so that the limit results provided below can be obtained as corollaries of their theorems.
Indeed, testing H_0: λ* = 0 is irregular because it involves Davies' [11,12] identification problem. That is, if λ* = 0, δ* is not identified; δ* is identified only when λ* ≠ 0. As a result, the null limit distribution of the t-test statistic testing H_0 differs from the standard normal distribution and is instead characterized by a Gaussian stochastic process indexed by the unidentified parameter δ. That is, if we let t_n(δ) be the standard t-test statistic testing H_0 for given δ, then under H_0 and Assumptions 1-3, t_n(·) weakly converges to G(·), where G(·) is a Gaussian stochastic process such that for every δ ∈ ∆, E[G(δ)] = 0, with a covariance kernel between G(δ) and G(δ') determined by the data and by Ψ(·). Here, we let u_t := y_t − x_t(p)'α*(p) − d_t'η* denote the error. This limit distribution makes it inconvenient to apply the standard t-test statistic when testing H_0 against H_1, as the limit is affected by too many features of the data and model. If the error u_t is conditionally homoscedastic, the associated Gaussian process is standard in the sense that for every δ ∈ ∆, G(δ) ∼ N(0, 1); however, this does not hold if u_t is conditionally heteroscedastic. Furthermore, there are many candidate analytic functions for Ψ(·). As Cho and White [7] highlight, the previous literature chooses different functions for Ψ(·), namely the logistic cumulative distribution function in White [22], the exponential function in Bierens [2], and the ridgelet function in Candès [23], among others. Different analytic functions yield different covariance kernel structures and hence different null limit distributions for the t-test statistic. Empirical researchers applying the standard t-test statistic therefore have to apply different critical values for different models and data, making it more difficult to obtain asymptotic critical values than the test statistic value itself. This aspect applies analogously to other standard test statistics, such as the Wald, Lagrange multiplier, and quasi-likelihood ratio (QLR) statistics.
To overcome this, we use another testing method that applies the ELM proposed by Huang, Zhu, and Siew [16]. Cho and White [15] note that the functional ordinary least squares (FOLS) estimator suggested by Cho, Huang, and White [14] and Cho, Phillips, and Seo [13] can be exploited to yield a straightforward statistic to test H_0 against H_1 by applying the ELM. As we detail below, the limit distribution of the FOLS estimator involves an integration over δ, which lets the estimator follow a normal distribution asymptotically instead of being characterized by a Gaussian process. Using this property, we can convert the FOLS estimator into a Wald test statistic that follows a chi-squared distribution asymptotically under the null hypothesis. Here, the ELM is exploited to compute the involved integrations.
Specifically, first, for each δ, E[u_t Ψ_t(δ)] = 0 under H_0 because Ψ_t(δ) := Ψ(δx_t) is measurable with respect to F_t and u_t is a martingale difference sequence under H_0. Hence, if we regress u_t on (1, Ψ_t(δ)), the population coefficient of Ψ_t(δ), denoted β*(δ), has to be zero irrespective of δ. Therefore, instead of testing H_0 against H_1, we opt to test whether β*(δ) = 0 for each δ ∈ ∆. Nevertheless, many of the entities involved are unknown to the researcher, necessitating the estimation of each expectation by its sample analog: for each δ ∈ ∆, we let β̂_n(δ) be the sample coefficient obtained by regressing û_t on (1, Ψ_t(δ)), where û_t is the regression residual obtained from M_p^0, so that ∑_{t=1}^n û_t ≡ 0 and û_t consistently estimates u_t under H_0. As there is a continuum of δs in ∆, Cho and White [15] integrate the above estimators using an adjunct probability measure Q(·) defined on ∆ (which is selected by the researcher) and derive the limit distribution of ∫_∆ β̂_n(δ) dQ(δ) under H_0 and Assumptions 1-3. The limit is characterized by two independent Gaussian processes G_1(·) and G_2(·) such that for each δ ∈ ∆, E[G_1(δ)] = 0 and E[G_2(δ)] = 0, with covariance kernels determined by the data and Ψ(·). This null limit distribution is obtained by following the limit distribution theory of the FOLS estimator in Cho, Huang, and White [14] and Cho, Phillips, and Seo [13], who test the population mean function of functional data by estimating a parametric model using the FOLS estimator. More precisely, the FOLS estimator (γ̂_n, ξ̂_n) is obtained by minimizing the functional mean squared error Q_n(γ, ξ), formed by integrating the squared regression errors of û_t over δ with respect to Q(·). Under Assumptions 1-3, the resulting limit distribution is identical to that in (1), and ξ* = 0 under H_0.
Based on the FOLS estimator, Cho and White [15] test the null hypothesis using the Wald test statistic. Integrating Gaussian processes produces a normally distributed random variable, implying that √n ξ̂_n is asymptotically normal under H_0 with asymptotic variance σ²_ξ, which depends on σ²_u := E[u_t²] and on the covariance kernel induced by Ψ(·) and Q(·). The null limit distribution of the FOLS estimator thus motivates the Wald test statistic W_n, which follows a chi-squared distribution with one degree of freedom under H_0 and Assumptions 1-3, where σ̂²_ξ,n and σ̂²_u,n, the consistent estimators of σ²_ξ and σ²_u, respectively, are used in its construction. Under H_1, ∫_∆ β*(δ) dQ(δ) is not necessarily equal to zero, and the test statistic derives its power from this aspect. The test statistic is defined following Wald's [24] test principle. Owing to its trivial null limit behavior, its empirical application is more straightforward than that of other test statistics requiring extra effort by the researcher to obtain asymptotic critical values, such as the QLR test statistic in Baek, Cho, and Phillips [18] and Cho and Phillips [17].
Nevertheless, the burden of computing the Wald test statistic can be immense because of the involved integrations. To compute the statistic, it is necessary to calculate the integral of Ψ_t(·) for each t, and if n is large, the computational burden can be huge.
Cho and White [15] recommend resolving this issue by applying the ELM proposed by Huang, Zhu, and Siew [16]. That is, if we let {δ_i : i = 1, 2, ..., m} be a set of identically and independently distributed (IID) random variables following the Q distribution, then by the law of large numbers, m^{−1} ∑_{i=1}^m Ψ_t(δ_i) → ∫_∆ Ψ_t(δ) dQ(δ) with probability 1, so that the FOLS estimator can be well approximated by its ELM counterpart ξ̂_{m,n}, obtained by replacing the integral with this sample average, if m is sufficiently large. To implement this plan, we formally assume the following condition.
The only difference between W_n and W_{m,n} is that ξ̂_{m,n} is used to estimate ∫_∆ β̂_n(δ) dQ(δ). Following Cho and White [15], we refer to W_{m,n} as the WELM test statistic.
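The ELM integration step is simple to sketch. The following is our own Python illustration (not the authors' R code), taking Ψ as the exponential function and Q = U(0, 1) as in the simulation design below; for this choice, the integral has the closed form (exp(x) − 1)/x, which lets us check the Monte Carlo average directly.

```python
import numpy as np

rng = np.random.default_rng(0)

def elm_hidden_unit(x, m=50_000, rng=rng):
    """Approximate the integral of exp(delta * x_t) over delta ~ U(0, 1)
    by averaging over m IID draws delta_i, as the ELM step does."""
    deltas = rng.uniform(0.0, 1.0, size=m)
    # One averaged hidden-unit value per observation.
    return np.exp(np.outer(x, deltas)).mean(axis=1)

x = np.array([0.5, 1.0, 2.0])
approx = elm_hidden_unit(x)
exact = (np.exp(x) - 1.0) / x   # closed form of the integral for Q = U(0, 1)
```

With m = 50,000 draws, the Monte Carlo average agrees with the exact integral to well within 0.05 at each point.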
Cho and White [15] show by simulation that the null distribution of the WELM test statistic is well approximated by the chi-squared distribution by letting n and m be sufficiently large when the null model is the first-order autoregressive model. In addition, they verify that the WELM test statistic displays respectable power.
Before moving on to the next section, we collect the main claims of this section into the following lemma.

Lemma 1. Given Assumptions 1-4, (i) W_{m,n} weakly converges to a chi-squared distribution with one degree of freedom under H_0 as m and n → ∞; and (ii) for any positive sequence {c_n} such that c_n = o(n), P(W_{m,n} > c_n) → 1 under H_1 as m and n → ∞.
Lemma 1(i) and (ii) are provided by Cho, Huang, and White [14] and Cho and White [15], respectively, in a general context, but we provide their proofs in Appendix A to fit the current context.
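To make the construction concrete, here is a minimal Python sketch of a WELM-type statistic for a linear null model. It is our own simplification, not the authors' implementation: we assume conditionally homoscedastic errors and use the squared t-ratio of the hidden-to-output weight from the auxiliary residual regression as the Wald statistic.

```python
import numpy as np

rng = np.random.default_rng(1)

def welm_statistic(y, x, m=5000, rng=rng):
    """Sketch of a WELM-type statistic for the linear null y_t = a + b*x_t + u_t,
    assuming conditionally homoscedastic errors. Returns the squared t-ratio of
    the hidden-to-output weight, approximately chi-squared(1) under the null."""
    n = len(y)
    # Step 1: residuals from the null (linear) model.
    X0 = np.column_stack([np.ones(n), x])
    beta0, *_ = np.linalg.lstsq(X0, y, rcond=None)
    resid = y - X0 @ beta0
    # Step 2: ELM hidden unit -- average exp(delta_i * x_t), delta_i ~ U(0, 1).
    deltas = rng.uniform(0.0, 1.0, size=m)
    psi = np.exp(np.outer(x, deltas)).mean(axis=1)
    # Step 3: regress the residuals on (1, psi); Wald = squared t-ratio of psi.
    X1 = np.column_stack([np.ones(n), psi])
    xi, *_ = np.linalg.lstsq(X1, resid, rcond=None)
    e = resid - X1 @ xi
    s2 = (e @ e) / (n - 2)                # homoscedastic variance estimate
    cov = s2 * np.linalg.inv(X1.T @ X1)   # OLS covariance matrix
    return xi[1] ** 2 / cov[1, 1]

x = rng.normal(size=500)
# Correctly specified linear DGP: the statistic behaves like chi-squared(1).
w_null = welm_statistic(1.0 + 0.5 * x + rng.normal(size=500), x)
# Neglected quadratic term: the statistic should be large.
w_alt = welm_statistic(1.0 + 0.5 * x ** 2 + rng.normal(size=500), x)
```

Under misspecification the residuals correlate with the averaged hidden unit, so the statistic far exceeds the 5% chi-squared(1) critical value 3.841; under the correct model it stays moderate.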

Method 2: Sequential WELM Testing Procedure
In this section, we examine the sequential testing procedure combined with the WELM test statistic. The WELM test statistic developed by Cho and White [15] focuses on specification testing. We develop a testing methodology to estimate the most parsimonious polynomial model by combining the WELM test statistic with a sequential testing procedure.
To fix ideas on the sequential testing procedure, we first provide our models. The model in Assumption 2 assumes a p-th-degree polynomial model, and we now suppose that there are p̄ polynomial models altogether: M(p̄) := {M_p : p = 1, 2, ..., p̄} and M_0(p̄) := {M_p^0 : p = 1, 2, ..., p̄}, so that M(p̄) and M_0(p̄) are the sets of the alternative and null models, respectively. These model sets encompass the models in Assumption 2 as special cases; that is, M_p and M_p^0 in Assumption 2 are elements of M(p̄) and M_0(p̄), respectively. The most parsimonious model, which we seek to estimate using a sequential testing procedure, is obtained by testing smaller models against larger models sequentially. Specifically, we propose the following sequential testing procedure:

Step 1: We test M_1^0 against M_1 using the WELM test statistic. If M_1^0 cannot be rejected at the level of significance α, we stop the sequential testing procedure and conclude that the conditional mean is linear with respect to x_t; otherwise, we move on to the next step. The regression residual used in computing the WELM test statistic is obtained by regressing y_t on (1, x_t).
Step 2: We test M_2^0 against M_2 using the WELM test statistic. If M_2^0 cannot be rejected at the level of significance α, we stop the sequential testing procedure; otherwise, we move on to the next step. In this way, we continue our testing procedure until we reach p = p̄. As in the first step, the regression residual used to compute the WELM statistic testing M_p^0 against M_p is obtained by regressing y_t on (1, x_t, ..., x_t^p) for p = 2, 3, ..., p̄ − 1.

Step 3: We test M_p̄^0 against M_p̄ using the WELM test statistic. If M_p̄^0 cannot be rejected, we stop the sequential testing procedure and conclude that E[y_t | F_t] is sufficiently well approximated by M_p̄^0; otherwise, we conclude that M_0(p̄) is entirely misspecified for E[y_t | F_t].
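The three steps above can be sketched as a short loop. This is our own illustration: `welm_stat` stands in for any implementation of a WELM-type statistic (the toy version below reuses the homoscedastic squared t-ratio, an assumption on our part), and 6.635 is the 1% chi-squared(1) critical value.

```python
import numpy as np

def sequential_welm(y, x, d, welm_stat, p_max=4, critical_value=3.841):
    """For p = 1, ..., p_max, fit the p-th-degree polynomial null model,
    compute a WELM-type statistic on its residuals, and stop at the first
    non-rejection. Returns the estimated degree, or None if every model
    in the set is rejected."""
    for p in range(1, p_max + 1):
        # Null-model regressors: (1, x, ..., x^p) plus the extra covariate d.
        X0 = np.column_stack([x ** k for k in range(p + 1)] + [d])
        beta, *_ = np.linalg.lstsq(X0, y, rcond=None)
        resid = y - X0 @ beta
        if welm_stat(resid, x) <= critical_value:
            return p          # first non-rejected, most parsimonious model
    return None               # the whole model set is rejected

# Toy WELM-type statistic: squared t-ratio of the residuals on an
# ELM-averaged exp(delta * x) hidden unit (homoscedastic variance).
rng = np.random.default_rng(2)

def toy_stat(resid, x, m=2000, rng=rng):
    deltas = rng.uniform(0.0, 1.0, size=m)
    psi = np.exp(np.outer(x, deltas)).mean(axis=1)
    Z = np.column_stack([np.ones(len(x)), psi])
    b, *_ = np.linalg.lstsq(Z, resid, rcond=None)
    e = resid - Z @ b
    cov = (e @ e / (len(x) - 2)) * np.linalg.inv(Z.T @ Z)
    return b[1] ** 2 / cov[1, 1]

x = rng.normal(size=1000)
d = rng.normal(size=1000)
y = 1.0 + 0.5 * x + 0.5 * x ** 2 + 0.3 * d + rng.normal(size=1000)
degree = sequential_welm(y, x, d, toy_stat, critical_value=6.635)  # 1% level
```

For this quadratic data-generating process, the linear model is rejected at the first step and the procedure typically stops at degree 2, the most parsimonious correct model.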
Using this procedure, the most parsimonious and correct model is consistently detected. For a specific discussion, let p* be defined as the smallest p such that, for some α*(p) and η*, E[y_t | F_t] = x_t(p)'α*(p) + d_t'η* with probability 1. Note that p* is the smallest polynomial degree for which the polynomial model equals the conditional mean; if p > p*, the coefficients of the degrees greater than p* must be zero. Therefore, if M_{p*}^0 can be estimated, the most parsimonious polynomial model can be estimated, and the sequential testing procedure described above is designed to estimate p*. The WELM testing procedure has the GCR property [10], and the sequential testing procedure starts model testing from the smallest model and proceeds to larger ones. Therefore, if a lower-degree polynomial model is misspecified for the conditional mean, it is consistently rejected by the WELM test statistic, so that we can expect to estimate the most parsimonious correct model using the sequential testing procedure. From this result, we obtain the following corollary.

Corollary 1. Given Assumption 1, if Assumptions 2-4 hold for each p ∈ P := {1, 2, ..., p̄} and p* ∈ P, then for any ε > 0, lim_{n→∞} P(|p̂_n(α) − p*| > ε) = α, where p̂_n(α) is the polynomial degree estimator obtained by applying the WELM test statistics to the sequential testing procedure with the level of significance α.
Corollary 1 implies that the degree estimator p̂_n(α) has an asymptotic estimation error equal to the level of significance α; hence, unless this estimation error is removed, the degree estimator is not consistent for p*.
Further, the significance level α is selected by the researcher. We can let α depend on the sample size n, so that if α_n → 0 as n → ∞, the degree estimation error converges to zero, leading to a consistent estimator. We state this result in the following theorem.

Theorem 1. Given Assumption 1, if Assumptions 2 and 3 hold for each p ∈ P := {1, 2, ..., p̄}, p* ∈ P, and α_n = 1 − C(c_n) such that for some δ ∈ (0, 1), c_n = O(n^δ), then for any ε > 0, lim_{n→∞} P(|p̂_n(α_n) − p*| > ε) = 0, where C(·) is the chi-squared distribution function with one degree of freedom.
The results in Corollary 1 and Theorem 1 correspond to results obtained using sequential testing procedures in the literature. Hosoya [20] examines the sequential testing procedure for a set of models nested within larger models using the likelihood ratio test statistic, so that the likelihood ratio test statistics can be applied sequentially using chi-squared null limit distributions. Nevertheless, the models assumed by Hosoya [20] do not have the identification problem that we examine herein. Theorem 2 of Cho and Phillips [17] also provides a result analogous to Theorem 1 of the current study, but their conditions are more restrictive in the following senses. First, they apply the QLR test statistic to their sequential testing problem, which compares the mean squared errors obtained from the null and alternative models, where the alternative model is constructed by letting Ψ_t(δ) := x_t^δ. They show that a multifold identification problem exists under the null that the conditional mean is correctly specified by the polynomial model; therefore, their QLR test statistic weakly converges to a functional of a Gaussian stochastic process, and its null limit distribution does not follow a chi-squared distribution. The null limit distribution is instead obtained using the weighted bootstrap proposed by Hansen [19], making its application inconvenient. Second, the particular power-transformation form of Ψ_t(·) restricts their applications: x_t^δ = exp(δ log(x_t)) is defined only when x_t > 0, so the application of their methodology is restrictive if x_t can be negatively valued. Finally, their level of significance α_n is assumed to converge to zero slowly relative to the convergence rate herein: they require log(α_n)/n → 0 in addition to α_n → 0, whereas Theorem 1 assumes only the latter.
This requirement is imposed mainly because the null limit distribution of the QLR test statistic is characterized by the maximum of a squared Gaussian process. The tail distribution of this maximum is approximated by associating it with that of a squared fractional Brownian motion using the Slepian inequality, and log(α_n)/n → 0 is required to yield a sequence of critical values uniformly dominated by those from the squared fractional Brownian motion. By contrast, our sequential testing procedure does not need this additional condition.

Results: Monte Carlo Simulations
In this section, we illustrate the sequential WELM testing procedure by conducting Monte Carlo simulations using stationary time-series observations.

Linear Function and Sequential Testing Procedure
Without loss of generality, we first suppose the following dynamic and stationary time-series DGP: y_t = α_{1*} x_t + η* y_{t−1} + ε_t and x_t = φ* x_{t−1} + u_t, where (x_0, y_0)' ∼ N(0, I_2), (ε_t, u_t)' ∼ IID N(0, σ²* I_2), and t = 1, 2, ..., n, such that |φ*| < 1 and |η*| < 1. The last two inequality conditions are imposed for the stationarity of the data.
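This DGP can be simulated directly. The following sketch (our own Python code, not the paper's R script) uses the parameter values of the first experiment, (α_{1*}, η*, φ*, σ²*) = (0.5, 0.5, 0.5, 1.0), as stated in the Table 1 design.

```python
import numpy as np

rng = np.random.default_rng(3)

def simulate_dgp(n, alpha1=0.5, eta=0.5, phi=0.5, sigma2=1.0, rng=rng):
    """One path of y_t = alpha1*x_t + eta*y_{t-1} + e_t with
    x_t = phi*x_{t-1} + u_t, (e_t, u_t) ~ IID N(0, sigma2 * I_2),
    and (x_0, y_0) ~ N(0, I_2), as in the Table 1 design."""
    sd = np.sqrt(sigma2)
    x = np.empty(n + 1)
    y = np.empty(n + 1)
    x[0], y[0] = rng.normal(size=2)
    for t in range(1, n + 1):
        x[t] = phi * x[t - 1] + sd * rng.normal()
        y[t] = alpha1 * x[t] + eta * y[t - 1] + sd * rng.normal()
    return y[1:], x[1:]          # drop the initial condition

y, x = simulate_dgp(5000)
# Sanity check: OLS of y_t on (1, x_t, y_{t-1}) should recover (0.5, 0.5).
X = np.column_stack([np.ones(len(y) - 1), x[1:], y[:-1]])
b, *_ = np.linalg.lstsq(X, y[1:], rcond=None)
```

Since |φ*| < 1 and |η*| < 1, the simulated path is stationary, and the OLS coefficients on x_t and y_{t−1} converge to the true values 0.5.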
Given this DGP condition, we construct our models from polynomial models. We first consider a linear model as the first-degree polynomial model. For this purpose, we let the explanatory variable vector x_t(p) be simply (1, x_t)', so that p = 1, and we let d_t be the lagged dependent variable y_{t−1}. Therefore, if we let θ = (α_0, α_1, η)', the null model M_1^0 becomes {Φ(·; θ) : θ ∈ Θ}, where Φ(X_t; θ) := α_0 + α_1 x_t + η y_{t−1}. For our alternative model, we let Ψ(·) be the exponential function, with Λ := [−λ̄, λ̄] and ∆ := [δ̲, δ̄]. Next, we compute the WELM test statistic W_{m,n} by first approximating Ψ̄_t := ∫_∆ Ψ(x_t δ) dQ(δ) by Ψ̄_{m,t} := m^{−1} ∑_{i=1}^m exp(δ_i x_t) and then letting w_t(1) := [1, x_t, y_{t−1}]', where we suppose that Q is the uniform probability measure on ∆ = [δ̲, δ̄]. This linear model is correctly specified for the DGP. Therefore, we should expect the WELM test statistic to reject this model α × 100% of the time asymptotically when the level of significance is α.
Next, we extend the model scope to higher-degree polynomial models. For this purpose, we further let x_t(p) := (1, x_t, x_t², ..., x_t^p)' to specify the null and alternative models M_p^0 and M_p, and we let w_t(p) := [1, x_t, x_t², ..., x_t^p, y_{t−1}]' to compute the WELM test statistic for the p-th-degree polynomial model. If p = 1, the WELM test statistic is the same as that obtained using the linear model. For p = 2, 3 and p̄ = 4, the null models M_p^0 are correctly specified, so that the WELM test statistics are also expected to reject the null model α × 100% of the time asymptotically. Through this, we construct the sets of alternative and null models to which we apply the sequential testing procedure. For this DGP and these models, we conduct simulations and report the results in Table 1, which are obtained by applying the sequential testing procedure. We generate data by letting (α_{1*}, η*, φ*, σ²*) = (0.5, 0.5, 0.5, 1.0) and let the levels of significance α be 10%, 5%, and 1%. Given these simulation environments, we examine the empirical rejection rates of the WELM test statistic for n = 50, 100, 200, 500, 1000, 2000, and 5000. We also let m = 5000, and the total number of experiments is 5000. In the Supplementary Materials, we provide the URL containing this simulation code, written in R.
The simulation results can be summarized as follows. First, the sequential testing procedure stops mostly at the first step, implying that the sequential WELM test identifies the correct degree of the unknown polynomial function. More specifically, as the sample size n increases, the WELM test statistic detects the linear model as the correct model approximately (1 − α) × 100% of the time. This aspect is observed irrespective of the sample size, so we can expect the WELM test statistic to control the type-I error precisely; hence, the most parsimonious correct model can be efficiently estimated. Second, even when the sequential testing procedure selects models whose polynomial degrees are greater than unity, most of the selected models are quadratic. This implies that the sequential testing procedure has a strong tendency to select the next most parsimonious model for the conditional mean function; as a result, the selected models are mostly linear or quadratic functions. Third, as the level of significance α decreases, more precise estimation results are delivered. However, this also implies that the estimation error cannot be eliminated altogether as long as the level of significance is fixed.

Table 1. Estimated polynomial degrees using the sequential Wald extreme learning machine (WELM) testing procedure (in percent). Number of replications: 5000. This table reports the proportion of estimated polynomial degrees using the sequential WELM testing procedure. DGP: y_t = α_{1*} x_t + η* y_{t−1} + ε_t, where x_t = φ* x_{t−1} + u_t, (x_0, y_0)' ∼ IID N(0, I_2), (ε_t, u_t)' ∼ IID N(0, σ²* I_2), δ_i ∼ IID U(0, 1), and (α_{1*}, η*, φ*, σ²*) = (0.5, 0.5, 0.5, 1.0).

We therefore conduct another simulation by letting the level of significance depend on the sample size.
Specifically, we let the level of significance α_n be n^{−1/2}, n^{−1}, n^{−3/2}, and n^{−2}. For these plans, α_n reduces to zero as n increases, so the sequential testing procedure is expected to eliminate the estimation error asymptotically; among them, n^{−2} approaches zero most quickly. Table 2 reports the simulation results obtained from 5000 experiments. The figures in the first panel denote P̂_n(α_n) := r^{−1} ∑_{i=1}^r I(p̂_{n,i} = 1), where r denotes the total number of experiments, set to 5000, p̂_{n,i} denotes the degree estimated by the sequential testing procedure in the i-th experiment when the level of significance is α_n, and I(·) denotes the indicator function. For each plan for the level of significance α_n, P̂_n(α_n) estimates the empirical probability that the degree estimated by the sequential testing procedure equals the true degree. The figures in parentheses denote the hypothetical proportion measured by (1 − α_n) × 100%; as α_n reduces to zero more quickly, the hypothetical proportion approaches 100% more quickly. In addition to the sequential testing procedure, we compare these estimation results with standard information criterion-based estimations using the Akaike information criterion (AIC), Bayesian information criterion (BIC), and small-sample corrected AIC (AICc). These information criteria are applied to the null models M_p^0 with p = 1, 2, 3, 4, and we compute the proportions P̃_n := r^{−1} ∑_{i=1}^r I(p̃_{n,i} = 1), where p̃_{n,i} denotes the degree selected by the information criterion. The figures in the second panel report the proportions estimated by the information criteria. Finally, we apply the same information criteria to the alternative models M_p with p = 1, 2, 3, 4 and report the estimated proportions in the third panel, obtained using the same methodology.
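The information-criterion benchmark can be sketched as follows. We use the standard Gaussian-likelihood forms of the criteria; this is an assumption on our part, as the paper does not spell out its exact formulas.

```python
import numpy as np

def select_degree(y, x, d, criterion="bic", p_max=4):
    """Select the polynomial degree for the null models M0_p by an information
    criterion, using the Gaussian-likelihood forms (assumed, not the paper's):
      AIC  = n*log(RSS/n) + 2k,
      BIC  = n*log(RSS/n) + k*log(n),
      AICc = AIC + 2k(k+1)/(n-k-1)."""
    n = len(y)
    scores = {}
    for p in range(1, p_max + 1):
        X = np.column_stack([x ** j for j in range(p + 1)] + [d])
        b, *_ = np.linalg.lstsq(X, y, rcond=None)
        rss = float(np.sum((y - X @ b) ** 2))
        k = X.shape[1]
        aic = n * np.log(rss / n) + 2 * k
        if criterion == "bic":
            scores[p] = n * np.log(rss / n) + k * np.log(n)
        elif criterion == "aicc":
            scores[p] = aic + 2 * k * (k + 1) / (n - k - 1)
        else:
            scores[p] = aic
    return min(scores, key=scores.get)

# Quadratic DGP: BIC should pick degree 2 with high probability; AIC may overfit.
rng = np.random.default_rng(4)
x = rng.normal(size=1000)
d = rng.normal(size=1000)
y = 0.5 * x + 0.5 * x ** 2 + 0.3 * d + rng.normal(size=1000)
p_bic = select_degree(y, x, d, criterion="bic")
p_aic = select_degree(y, x, d, criterion="aic")
```

The heavier log(n) penalty of the BIC makes it select the parsimonious correct degree more reliably than the AIC, consistent with the simulation findings reported below.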
We distinguish them from the earlier information criteria by attaching a prime, so that AIC′, BIC′, and AICc′ denote the information criteria applied to the alternative models.

Table 2. Proportion of sequentially estimated polynomial degrees using the sequential WELM testing procedure (in percent). Number of replications: 5000. This table reports the percentages of the correctly estimated polynomial degree using the sequential WELM testing procedure and the information criteria. The figures in the first panel denote P̂_n(α_n) × 100, and those in the second and third panels are P̃_n × 100. The figures in parentheses denote (1 − α_n) × 100, where P̂_n(α_n) := r^{−1} ∑_{i=1}^r I(p̂_{n,i} = p*); r is the number of iterations, p̂_{n,i} denotes the degree estimator obtained from the sequential testing procedure for the i-th simulation, and I(·) is the indicator function. Similarly, P̃_n := r^{−1} ∑_{i=1}^r I(p̃_{n,i} = p*), where p̃_{n,i} is the degree estimator obtained by the information criteria. MODEL: M_p := {x_t(p)'α(p) + η y_{t−1} + λΨ(δx_t)}, where p = 1, 2, 3, 4. The Akaike information criterion (AIC), Bayesian information criterion (BIC), and small-sample corrected AIC (AICc) are applied to M_p^0 := {x_t(p)'α(p) + η y_{t−1}}, and the AIC′, BIC′, and AICc′ are applied to M_p, where p = 1, 2, 3, 4. DGP: y_t = α_{1*} x_t + η* y_{t−1} + ε_t, where x_t = φ* x_{t−1} + u_t, δ_i ∼ IID U(0, 1), and (α_{1*}, η*, φ*, σ²*) = (0.5, 0.5, 0.5, 1.0).

The simulation results in Table 2 can be summarized as follows. First, for every significance level α_n, the distance between P̂_n(α_n) and (1 − α_n) approaches zero as the sample size n increases. This suggests that the first-degree polynomial model is successfully estimated using the sequential estimation procedure. Second, the distance between P̂_n(α_n) and (1 − α_n) is closest to zero when the plan for the level of significance is set to α_n = n^{−2}.
This implies that the sequential testing procedure can estimate the first-degree polynomial model more precisely by letting the level of significance converge to zero more quickly. Third, as the second panel shows, the BIC proportion converges to 100% as the sample size increases, whereas the AIC and AICc are not as fast as the BIC. Fourth, as the third panel shows, the BIC′ performs similarly to the BIC, whereas the AIC′ and AICc′ perform a little worse than the AIC and AICc, respectively. Finally, when comparing the BIC (or the BIC′) with the sequential WELM testing procedure, the performance of the information criteria is inferior to that of the sequential testing procedure when α_n converges to zero quickly. Specifically, if we let α_n be n^{−3/2} or n^{−2}, the sequential testing procedure performs better than the BIC for all sample sizes. By contrast, if we let α_n be n^{−1/2}, the BIC is superior to the sequential testing procedure for all sample sizes. In the middle, if α_n reduces to zero at a moderate rate, namely n^{−1}, the performance of the sequential testing procedure depends on the sample size n: if n is relatively small, the sequential testing procedure performs better than the BIC, whereas if n is relatively large, the BIC performs better. This implies that letting the level of significance converge to zero as quickly as possible can produce the best estimation result if the first-degree polynomial model is a correct model.

Quadratic Function and Sequential Testing Procedure
We extend the earlier simulation by examining a different DGP. Specifically, we now suppose that the data are generated by the quadratic analogue of the earlier DGP, with (α_{1∗}, α_{2∗}, η_∗, φ_∗, σ²_∗) = (0.5, 0.5, 0.5, 0.5, 1.0). Therefore, the first-degree polynomial model is now incorrectly specified, whereas the second-, third-, and fourth-degree polynomial models are correctly specified. Hence, the desired sequential testing procedure should estimate the second-degree polynomial model as the most parsimonious correctly specified model. This second simulation verifies whether the lessons obtained from the simulations in Section 4.1 remain valid for other DGPs.
As before, we first conduct the simulations by fixing the levels of significance and then by letting them depend on the sample size. Tables 3 and 4, respectively, report the simulation results for these two cases, obtained in the same simulation environments as for Tables 1 and 2.

Table 3. Estimated polynomial degrees using the sequential WELM testing procedure (in percent). Number of replications: 5000. This table reports the proportion of estimated polynomial degrees using the sequential WELM testing procedure. DGP: δ_i ∼ IID U(0, 1) and (α_{1∗}, α_{2∗}, η_∗, φ_∗, σ²_∗) = (0.5, 0.5, 0.5, 0.5, 1.0). The hypotheses are specified as before.

Table 4. Proportion of sequentially estimated polynomial degrees using the sequential WELM testing procedure (in percent). Number of replications: 5000. This table reports the percentages of the correctly estimated polynomial degree using the sequential WELM testing procedure and the information criteria. The figures in the first panel denote P_n(α_n) × 100, and those in the second and third panels are P̄_n × 100. In addition, the figures in parentheses denote (1 − α_n) × 100. Here, we let P_n(α_n) := r^{-1} ∑_{i=1}^{r} I(p̂_{n,i} = p_*), where r is the number of replications, p̂_{n,i} denotes the degree estimator obtained using the sequential testing procedure for the i-th simulation, and I(·) is the indicator function. Similarly, P̄_n := r^{-1} ∑_{i=1}^{r} I(p̃_{n,i} = p_*), where p̃_{n,i} is the degree estimator obtained by the information criteria. MODEL: M_p := {x_t(p) α(p) + η y_{t−1} + Ψ(δ′x_t)}, where p = 1, 2, 3, 4. The AIC, BIC, and AICc are the information criteria applied to M⁰_p := {x_t(p) α(p) + η y_{t−1}}, and the AIC′, BIC′, and AICc′ are those applied to M_p, where p = 1, 2, 3, 4. DGP: δ_i ∼ IID U(0, 1) and (α_{1∗}, α_{2∗}, η_∗, φ_∗, σ²_∗) = (0.5, 0.5, 0.5, 0.5, 1.0).

We can summarize the simulation results as follows.
First, Table 3 shows that the proportion of the linear model selected by the sequential testing procedure decreases to zero as the sample size n increases. For each level of significance, 10%, 5%, and 1%, the first-degree polynomial model is selected less and less often as n increases, implying that the WELM test statistic has consistent power to reject the misspecified model. Second, as Table 3 also shows, the second-degree polynomial model is selected (1 − α) × 100% of the time asymptotically, implying that the WELM test statistic controls the type-I error efficiently. Hence, the most parsimonious correctly specified model can be consistently selected by the sequential testing procedure. Third, as before, the estimation error incurred by the sequential testing procedure cannot be removed altogether as long as the level of significance is fixed irrespective of the sample size. Fourth, Table 4 reports the proportions of the polynomial degrees estimated by the sequential WELM testing procedure with significance levels dependent on the sample size, together with those estimated by the information criteria. As we can see, for α_n = n^{−1}, α_n = n^{−3/2}, and α_n = n^{−2}, the distance between P_n(α_n) and (1 − α_n) approaches zero as the sample size increases. Fifth, if the sample size is relatively small, slowly converging levels of significance estimate the correct degree better than quickly converging levels. For example, if n = 50, letting α_n = n^{−1/2} produces higher proportions than letting α_n be n^{−2}. Nevertheless, as the sample size increases, the two plans show different estimation patterns: for α_n = n^{−1/2}, the proportion converges to 100% slowly, whereas for α_n = n^{−2}, it converges to 100% quickly. This implies that the plan for the level of significance has to be chosen carefully when applying the sequential testing procedure.
If a relatively large sample is examined, the correct degree of the polynomial model is better estimated by letting the level of significance converge to zero quickly. On the contrary, if the sample size is small, a level of significance converging to zero relatively slowly should be chosen. Sixth, comparing the performances of the information criteria, we observe that the BIC overall performs better than the AIC and AICc, and the same holds among the AIC′, BIC′, and AICc′. Further, the BIC always provides better estimates than the BIC′. Finally, we compare the simulation results of the sequential testing procedure with the BIC. If the sample size is small, the BIC dominates all the estimation results from the sequential testing procedures; however, if the sample size is sufficiently large, say more than 2000, the sequential testing procedure with a level of significance converging to zero quickly provides better estimates than the BIC. This simulation result differs from what we observed in Section 4.1: the sequential testing procedure does not always perform better than the BIC. If the polynomial function has a lower degree in the DGP, the sequential testing procedure may perform better than the information criterion; in particular, if the sample size is sufficiently large, the use of the sequential testing procedure appears more advisable.
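For comparison with the information criteria, a minimal BIC-based degree selector can be sketched as follows. Since the paper's DGP equation is not reproduced in this excerpt, the snippet assumes a hypothetical quadratic AR(1) DGP with coefficients matching (α_{1∗}, α_{2∗}, η_∗) = (0.5, 0.5, 0.5), and the model set is the polynomial-plus-lag family M⁰_p without the Ψ term.

```python
import numpy as np

def bic(y, X):
    """Gaussian BIC for an OLS fit: n*log(RSS/n) + k*log(n)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    n_obs, k = X.shape
    return n_obs * np.log(rss / n_obs) + k * np.log(n_obs)

def bic_degree(y, x, ylag, max_p=4):
    """Pick the polynomial degree (with an AR(1) term) minimizing the BIC."""
    scores = [bic(y, np.column_stack([x ** k for k in range(p + 1)] + [ylag]))
              for p in range(1, max_p + 1)]
    return 1 + int(np.argmin(scores))

rng = np.random.default_rng(1)
n, reps, hits = 1000, 200, 0
for _ in range(reps):
    x = rng.uniform(-1.0, 1.0, n + 1)
    y = np.zeros(n + 1)
    for t in range(1, n + 1):  # hypothetical quadratic AR(1) DGP
        y[t] = 0.5 * x[t] + 0.5 * x[t] ** 2 + 0.5 * y[t - 1] + rng.normal()
    hits += bic_degree(y[1:], x[1:], y[:-1]) == 2
print(hits / reps)  # proportion selecting the correct degree p* = 2
```

The BIC's log n complexity penalty is what drives its consistency here; replacing `k * np.log(n_obs)` with `2 * k` gives the AIC variant, which tends to overfit the degree.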

Misspecified Models and Sequential Testing Procedure
As our final simulation, we now suppose that none of the models is correctly specified: the data are generated by y_t = π_∗ cos(y_{t−1}) + u_t, where y_0 ∼ N(0, σ²_{y0}) and u_t ∼ IID N(0, σ²_u). Here, we let (π_∗, σ²_{y0}, σ²_u) = (1.0, 1.0, 1.0). We apply the same models as before and select the best model using the sequential testing procedure. Note that the cos(·) function is expressed as an infinite-degree polynomial by Taylor expansion, so the fourth-degree polynomial model cannot be correctly specified for this DGP. This implies that the sequential testing procedure is expected to estimate a degree greater than 4. Our primary interest in this simulation is in how the earlier finite-sample properties of the sequential testing procedure are modified by this new DGP condition. As the model conditions and simulation environments are the same as before, we do not repeat them.
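The Taylor-expansion point can be illustrated numerically: no finite-degree polynomial in y_{t−1} reproduces the conditional mean cos(y_{t−1}) over the range the series actually visits. The sketch below simulates the DGP from the text and computes the residual sum of squares of the best least-squares polynomial approximation of the (noiseless) conditional mean for degrees 1 through 4; the fit improves with the degree but never becomes exact, which is why every model in the set should eventually be rejected.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5000
y = np.zeros(n + 1)
for t in range(1, n + 1):  # DGP from the text: y_t = cos(y_{t-1}) + u_t
    y[t] = np.cos(y[t - 1]) + rng.normal()

ylag = y[:-1]
target = np.cos(ylag)  # the true (noiseless) conditional mean

def poly_fit_rss(z, f, p):
    """RSS of the best degree-p least-squares polynomial approximation of f."""
    X = np.column_stack([z ** k for k in range(p + 1)])
    beta, *_ = np.linalg.lstsq(X, f, rcond=None)
    return float(np.sum((f - X @ beta) ** 2))

rss = {p: poly_fit_rss(ylag, target, p) for p in range(1, 5)}
print(rss)  # decreasing in p, but bounded away from zero: misspecification remains
```

Because the simulated y_{t−1} values span more than a full period of the cosine, even the degree-4 approximation leaves a systematic error, so the Wald statistic diverges for every candidate model as n grows.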
Tables 5 and 6 report the simulation results: Table 5 is obtained by fixing the levels of significance, and Table 6 by letting them depend on the sample size. The simulation results are summarized as follows. First, as the sample size n increases, the empirical rejection rates also increase for each degree p = 1, 2, and 3, and the sequential testing procedure concludes that the polynomial degree is greater than or equal to 4 in most experiments. For example, if n = 2000, the sums of the proportions of p = 1, 2, and 3 are only 0.58%, 4.82%, and 11.55% for the 10%, 5%, and 1% significance levels, respectively, and they decrease further as n increases to 5000. This result indicates that the sequential WELM testing procedure has good power if the sample size is sufficiently large. Second, when an incorrect model is selected, the quadratic model is overall selected more often than the linear or cubic models; that is, the second-degree polynomial model is preferred to the first- and third-degree polynomial models. This is mainly because the cosine function is an even function around zero, so the quadratic function may better approximate the cosine function when the sample size is not sufficiently large. Third, we now let the levels of significance depend on the sample size and examine the simulation results in Table 6.
As we can see, the distance between P_n(α_n) and (1 − α_n) shrinks as the sample size increases. Although the distance does not come as close to zero as in Tables 2 and 4, it does shrink. Further, if n is small, slowly converging levels of significance provide better estimates than quickly converging plans. Nevertheless, as n increases, the proportions converge to 100% more quickly when we let α_n be n^{−2} than when we let α_n be n^{−1/2}. Hence, if the data set has a large sample size, a level of significance converging to zero relatively quickly should be chosen, which is the same observation as in Section 4.2. Moreover, comparing the performances of the information criteria, we observe that the AIC overall performs better than the BIC and AICc, and the same holds among the AIC′, BIC′, and AICc′. In addition, the AIC always provides better estimates than the AIC′. Finally, we compare the simulation results of the sequential testing procedure with the AIC. The AIC dominates all the estimation results from the sequential testing procedures. This simulation result implies that the BIC is not always the best-performing information criterion and that the sequential testing procedure can dominate the BIC even when the sample size is small. Furthermore, if all the considered models are misspecified, it is difficult to draw regular patterns among the sequential testing procedure and the information criteria.

Table 5. Estimated polynomial degrees using the sequential WELM testing procedure (in percent). Number of replications: 5000. This table reports the proportion of estimated polynomial degrees using the sequential WELM testing procedure. DGP: y_t = π_∗ cos(y_{t−1}) + u_t, where y_0 ∼ N(0, σ²_{y0}), δ_i ∼ IID U(−1, 1), and u_t ∼ IID N(0, σ²_u). Here, we let (π_∗, σ²_{y0}, σ²_u) = (1.0, 1.0, 1.0). The hypotheses are specified as before, with each null model including the lag term θ_y y_{t−1}; all of these null hypotheses are misspecified for the DGP.
We further let Ψ(δ′x_t) = exp(δ′x_t) to compute the WELM test statistic.

Table 6. Proportion of sequentially estimated polynomial degrees using the sequential WELM testing procedure (in percent). Number of replications: 5000. This table reports the percentages of the correctly estimated polynomial degree using the sequential WELM testing procedure and the information criteria. The figures in the first panel denote P_n(α_n) × 100, and those in the second and third panels are P̄_n × 100. In addition, the figures in parentheses denote (1 − α_n) × 100. Here, we let P_n(α_n) := r^{-1} ∑_{i=1}^{r} I(p̂_{n,i} = p_*), where r is the number of replications, p̂_{n,i} denotes the degree estimator obtained using the sequential testing procedure for the i-th simulation, and I(·) is the indicator function. Similarly, P̄_n := r^{-1} ∑_{i=1}^{r} I(p̃_{n,i} = p_*), where p̃_{n,i} is the degree estimator obtained by the information criteria. MODEL: M_p := {x_t(p) α(p) + η y_{t−1} + Ψ(δ′x_t)}, where p = 1, 2, 3, 4. The AIC, BIC, and AICc are the information criteria applied to M⁰_p := {x_t(p) α(p) + η y_{t−1}}, and the AIC′, BIC′, and AICc′ are those applied to M_p, where p = 1, 2, 3, 4. DGP: y_t = π_∗ cos(y_{t−1}) + u_t, where y_0 ∼ N(0, σ²_{y0}), δ_i ∼ IID U(−1, 1), u_t ∼ IID N(0, σ²_u), and (π_∗, σ²_{y0}, σ²_u) = (1.0, 1.0, 1.0).

Conclusions
We applied the Wald test statistic assisted by the ELM to test the correct model assumption and to estimate a close approximation of the conditional mean. When testing for model misspecification of the conditional mean, omnibus test statistics typically converge weakly to a Gaussian stochastic process under the null hypothesis that the model is correctly specified, which makes their application inconvenient. We defined the Wald test statistic using the functional regression and applied the ELM to compute the test statistic efficiently (i.e., WELM), following Cho and White [15]. The WELM test statistic is generically comprehensively revealing (GCR) and follows a chi-squared distribution under the null. We further applied the WELM test statistic to a sequential testing procedure to search for an approximate conditional expectation and conducted extensive Monte Carlo experiments to evaluate its performance. Using these simulations, we verified that, if the set of candidate polynomial models contains a correctly specified model, the sequential WELM testing procedure consistently estimates the most parsimonious correct model; further, it consistently rejects all the candidate models if none of the polynomial models is correctly specified. We also compared the performance of standard information criteria, such as the BIC and AIC, as well as the small-sample corrected AICc. From this comparison, we find that model estimation using the sequential testing procedure is competitive in estimating the most parsimonious correct model.

By contrast, if p ≥ p_∗, the model is correctly specified, and W_{m,n}(p) is asymptotically distributed as χ²_1 by the structure of the WELM test statistic, so that P(W_{m,n}(p) > cv(α)) → α as n → ∞. That is, for each p ≥ p_∗, lim_{n→∞} P(W_{m,n}(p) ≤ cv(α)) = 1 − α.
Proof of Theorem 1. Let cv_n be the critical value corresponding to α_n; cv_n is O(n^δ) and also o(n), because δ ∈ (0, 1) by the given condition. Therefore, for each p, (A3) implies that P(W_{m,n}(p) > cv_n) → 1, so that lim_{n→∞} P(p̂_n(α_n) ≥ p_∗) = 1.
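To make the rate condition concrete for the χ²(1) null distribution, the standard Gaussian tail bound gives an explicit growth rate for cv_n. The constants below are our own illustration, not part of the paper's proof:

```latex
% With \alpha_n = n^{-c}, c > 0, the critical value cv_n solves
% P(\chi^2_1 > \mathrm{cv}_n) = \alpha_n. Using
% P(\chi^2_1 > w) = 2\bigl(1 - \Phi(\sqrt{w})\bigr) \le e^{-w/2} for w \ge 0:
\begin{align*}
  n^{-c} = P\bigl(\chi^2_1 > \mathrm{cv}_n\bigr) \le e^{-\mathrm{cv}_n/2}
  \quad\Longrightarrow\quad
  \mathrm{cv}_n \le 2c \log n .
\end{align*}
% Hence cv_n = O(\log n), which is o(n^{\delta}) for any \delta > 0: the
% critical value grows far more slowly than the Wald statistic diverges under
% misspecification, so P(W_{m,n}(p) > \mathrm{cv}_n) \to 1 for each p < p_*.
```

This logarithmic bound is consistent with, and stronger than, the O(n^δ) condition invoked in the proof above.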