A New Semiparametric Regression Framework For Analyzing Non-Linear Data

This work introduces a straightforward framework for semiparametric non-linear models as an alternative to existing non-linear parametric models, whose interpretation primarily depends on biological or physical aspects that are not always available in every practical situation. The proposed methodology does not require intensive numerical methods to obtain estimates in non-linear contexts, which is attractive as such algorithms' convergence strongly depends on assigning good initial values. Moreover, the proposed structure can be compared with standard polynomial approximations often used for explaining non-linear data behaviors. Approximate posterior inferences for the semiparametric model parameters were obtained from a fully Bayesian approach based on the Metropolis-within-Gibbs algorithm. The proposed structures were considered to analyze artificial and real datasets. Our results indicated that the semiparametric models outperform linear regression approximations in predicting the behavior of response variables in non-linear settings.


Introduction
Non-linear models are often applied in many areas of quantitative research, such as biology, chemistry, epidemiology, and physics, among many others. These models have appeal in these areas as they have straightforward interpretability (regarding the nature of the underlying process) and typically provide excellent predictions for the response variable [1]. As an alternative to non-linear parametric models, researchers may consider approximating unknown non-linear functions using linear polynomial models. However, adopting such linear approximations to describe non-linear behaviors may be cumbersome, as it could involve estimating more (and not easily interpretable) parameters [2]. On the other hand, non-linear models are sensitive and should be carefully chosen depending on the application, and their parameters generally do not have analytical forms for their estimators. The absence of analytical solutions implies the need for numerical algorithms whose convergence strongly depends on the initial values chosen for the iterative procedures.
Several proposals for non-linear models can be found in the literature. However, modern techniques can also be used for non-linear modeling and prediction, such as nonparametric regression based on spline smoothing [3][4][5] and Generalized Additive Models (GAMs) [6,7]. In the context of agricultural applications, ref. [8] classifies non-linear models into six groups, as detailed in Table 1. Moreover, ref. [9] provides an excellent review of Gaussian Processes (GPs) and Relevance Vector Machines (RVMs), discussing how these nonparametric methods can be applied in non-linear frameworks for regression over large datasets and how effective they can be for dealing with sequential data. The primary advantage of RVMs is that one can choose more general basis functions, while GPs present excellent behavior for predicting variances, although they are more restrictive regarding the choice of kernel function. Ref. [10] has also studied such methods and concluded that the difficulty of GPs lies in learning by maximizing the evidence, through which the hyperparameters can be learned, in contrast to RVMs, whose basis functions are fixed and whose learning targets are the inputs. In this context, this work aims to introduce a semiparametric non-linear regression model framework that can be very useful for obtaining highly accurate fits to non-linear datasets. Our goal is to provide an alternative to the existing linear approximations and nonparametric models. The proposed approach does not require numerical methods that strongly depend on precise initial values to reach convergence. By adopting the proposed framework, one can derive accurate inferences and predictions under a fully Bayesian approach [29] using standard MCMC (Markov Chain Monte Carlo) methods [30][31][32] (e.g., Gibbs Sampling, Metropolis-Hastings, and Metropolis-within-Gibbs (MwG), among others).
In this paper, we have chosen to work with the MwG algorithm [33] to draw pseudo-random samples from the approximate posterior distribution of model parameters.
This paper is organized as follows. In Section 2, we present fundamental concepts regarding the formulation and estimation of parametric non-linear regression models, introduce the proposed semiparametric framework, and describe the Bayesian inference and model comparison procedures.
In Section 3, we analyze and discuss the results obtained using the proposed methodology for modeling artificial and real datasets featuring non-linear relationships. Model comparisons regarding linear polynomial approximations are also presented. General comments and concluding remarks are addressed in Section 4.

Materials and Methods
Non-linear models are similar to linear regressions [34] in the sense of outlining the functional relationship between a continuous response variable Y and a set of covariates, thus providing a statistical prediction tool. Linear regressions are used to build purely empirical models, while non-linear models are typically applied when biological or physical interpretations imply relationships between responses and covariates that are not linear [35,36]. It is important to establish that linearity or non-linearity refers to the unknown parameters, not to the response-covariate relationship. In this context, a non-linear regression model for representing a response variable Y_i (i = 1, . . . , n) has the general form

Y_i = f(z_i, α) + ε_i,    (1)

where f is a known function of the designed covariate z_i, and α = (α_1, . . . , α_p) is a p-dimensional vector of non-linear parameters indexing f. Moreover, ε_i denotes the random error, which is typically assumed to be normally distributed with zero mean and constant variance. It is also usual to assume that the errors are uncorrelated, that is, Cov(ε_i, ε_j) = 0 for all i ≠ j.
The most popular method for estimating α is non-linear least squares, which is based on minimizing

S(α) = Σ_{i=1}^{n} [y_i − f(z_i, α)]² = ε'ε,

where ε = (ε_1, . . . , ε_n). It is worth mentioning that if ε_i ∼ N(0, σ²), then the least squares and maximum likelihood estimators of α coincide. Typically, point estimates for non-linear regression coefficients are obtained from iterative optimization processes based on techniques to minimize the error sum of squares. A widespread iterative method to derive least-squares estimates for non-linear models is the Gauss-Newton algorithm. In this context, if f(z_i, α) in Equation (1) is continuously differentiable at α, then f can be linearized locally at α_0 as

f(z, α) ≈ f(z, α_0) + Z_0 (α − α_0),

where Z_0 is the n × p Jacobian matrix whose elements ∂f(z_i, α)/∂α_j are evaluated at α = α_0. Thus, the iterative algorithm to estimate α is given by

α^{(k+1)} = α^{(k)} + (Z_k' Z_k)^{−1} Z_k' [y − f(z, α^{(k)})],   k = 0, 1, 2, . . . ,

where α^{(0)} = α_0 is the vector of initial values for α, and Z_k is the Jacobian evaluated at α = α^{(k)}. If the errors are independent and normally distributed, then the Gauss-Newton algorithm is an application of the Fisher scoring method. Implementations of the Gauss-Newton algorithm are available in most existing statistical software but, in practice, there is no guarantee that the algorithm will converge from initial values that are far from the solution. In this sense, some improvements of this method can be found in the literature, such as the Gradient Descent and Levenberg-Marquardt algorithms [36].
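As an illustration of the update above, the sketch below implements the Gauss-Newton iteration (with a simple step-halving safeguard against overshooting) for a hypothetical two-parameter exponential model f(z, α) = α_1 exp(α_2 z). The model, data, and initial values are illustrative choices of ours, not taken from this work.

```python
import numpy as np

def gauss_newton(f, jac, z, y, alpha0, max_iter=100, tol=1e-10):
    """Gauss-Newton least squares for y ≈ f(z, alpha), with step-halving."""
    alpha = np.asarray(alpha0, dtype=float)
    sse = np.sum((y - f(z, alpha)) ** 2)
    for _ in range(max_iter):
        r = y - f(z, alpha)                          # residual vector
        Z = jac(z, alpha)                            # n x p Jacobian of f
        step = np.linalg.solve(Z.T @ Z, Z.T @ r)     # Gauss-Newton direction
        improved, lam = False, 1.0
        while lam > 1e-8:                            # halve the step until SSE drops
            cand = alpha + lam * step
            cand_sse = np.sum((y - f(z, cand)) ** 2)
            if cand_sse < sse:
                alpha, sse, improved = cand, cand_sse, True
                break
            lam /= 2.0
        if not improved or np.linalg.norm(lam * step) < tol:
            break
    return alpha

# Hypothetical model: f(z, alpha) = a1 * exp(a2 * z)
f = lambda z, a: a[0] * np.exp(a[1] * z)
jac = lambda z, a: np.column_stack([np.exp(a[1] * z),
                                    a[0] * z * np.exp(a[1] * z)])

rng = np.random.default_rng(0)
z = np.linspace(0.0, 2.0, 40)
y = 2.0 * np.exp(0.7 * z) + rng.normal(0.0, 0.01, z.size)

alpha_hat = gauss_newton(f, jac, z, y, alpha0=[1.0, 0.5])
```

The step-halving line search is one of the standard safeguards mentioned in the literature for initial values far from the solution; a plain Gauss-Newton step is recovered whenever the full step already decreases the error sum of squares.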
After obtaining point estimates for α, one may derive confidence intervals and conduct hypothesis tests by assuming, approximately,

α̂ ∼ N_p(α, σ² (Ẑ'Ẑ)^{−1}),

where σ² can be estimated by σ̂² = S(α̂)/(n − p).

The Semiparametric Non-Linear Regression Model
Suppose that a random experiment is conducted with n subjects. The primary response in this setting is described by a random variable Y_i denoting the outcome for the i-th subject (i = 1, . . . , n). The full response vector of the experiment is given by Y = (Y_1, . . . , Y_n), and we assume that the behavior of Y_i can be partially explained by a non-linear relationship involving a designed covariate z_i through a known function f. Simultaneously, we can consider that part of the variability of Y_i can also be linearly modeled by a k-dimensional vector x_i = (x_1i, . . . , x_ki) of fixed covariates [37][38][39]. In this context, we have the non-linear regression model

Y_i = f(z_i, α) + x_i'β + ε_i,    (2)

where β = (β_1, . . . , β_k) is a k-dimensional vector of regression coefficients related to x_i, and ε_i is the random error of the i-th observation. Here, we assume that the errors are uncorrelated and normally distributed with zero mean and constant variance (σ²). A particular case arising from Equation (2) is the p-order polynomial regression model, which can be obtained by taking

f(z_i, α) = α_0 + α_1 z_i + α_2 z_i² + · · · + α_p z_i^p.

In the context of Model (2), let z = (z_1, . . . , z_n) be the full vector of designed values. In order to obtain an approximation for f, we assume that z_1 ≤ z_2 ≤ · · · ≤ z_n, and then we associate these values to each Y_i non-linearly through α. Thus, for each point,

f(z_i) ≈ f(z_{i−1}) + [f(z_{i−1}) − f(z_{i−2})] (z_i − z_{i−1}) / (z_{i−1} − z_{i−2}),

which is based on a Taylor series expansion of the function f(z_i) around z_{i−1}, with the derivative f'(z_{i−1}) replaced by its first-difference approximation. Now, one can notice that replacing f(z_i) with the observed data on the right side of the previous equation leads to the approximation

f(z_i) ≈ y_{i−1} + (y_{i−1} − y_{i−2}) (z_i − z_{i−1}) / (z_{i−1} − z_{i−2}).

Therefore, an alternative to Model (2) is the semiparametric non-linear regression model given by

Y_i = α_1 g_1(z_i) + α_2 g_2(z_i) + α_3 + x_i'β + ε_i,    (3)

where g_1(z_i) = y_{i−1} and g_2(z_i) = (y_{i−1} − y_{i−2})(z_i − z_{i−1})/(z_{i−1} − z_{i−2}), which holds for i ∈ {3, . . . , n}.
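The construction of the data-driven covariates can be sketched as follows; this is a minimal illustration under our reading of the first-order Taylor construction, and the function name and array layout are our own.

```python
import numpy as np

def semiparametric_design(y, z):
    """Build the data-driven covariates g1 and g2 of the semiparametric
    model from observed responses y and sorted design points z.
    Both arrays are aligned with observations i = 3, ..., n."""
    y = np.asarray(y, dtype=float)
    z = np.asarray(z, dtype=float)
    g1 = y[1:-1]                                       # y_{i-1}
    slope = (y[1:-1] - y[:-2]) / (z[1:-1] - z[:-2])    # first-difference slope
    g2 = slope * (z[2:] - z[1:-1])                     # extrapolation increment
    return g1, g2

# Sanity check: for an exactly linear signal, g1 + g2 reproduces y_i exactly
z = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * z + 1.0
g1, g2 = semiparametric_design(y, z)
```

Note that the first two observations are consumed by the construction, which is why the resulting model only holds for i ∈ {3, . . . , n}.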

Bayesian Inference
In this subsection, we address the problem of estimating and making inferences from Model (2) under a fully Bayesian perspective. Writing ζ = σ^{−2} for the precision parameter, the log-likelihood of the vector θ = (α, β, ζ) can be written as

ℓ(θ; y, x, z) = (n/2) log(ζ) − (ζ/2) Σ_{i=1}^{n} [y_i − μ_i(α, β)]² + constant,

where μ_i(α, β) denotes the mean of Y_i under the adopted model. For the p-order polynomial model, we have α = (α_0, α_1, . . . , α_p) and, specifically for the semiparametric non-linear regression model, we have α = (α_1, α_2, α_3). In this work, we have adopted weakly informative Normal prior distributions for the vectors α and β, that is, α ∼ N_q(0, 1_q) and β ∼ N_k(0, 1_k), where 1_q and 1_k are identity matrices of sizes q and k, respectively. For the p-order polynomial model, we have that q = p + 1. For the parameter ζ, we have adopted a Gamma prior distribution with both hyperparameters equal to 0.01. We further assume prior independence among all parameters. Now, we can express the posterior distribution of θ as

π(θ | y, x, z) ∝ exp{ℓ(θ; y, x, z) + log[π(α)] + log[π(β)] + log[π(ζ)]}.    (4)
From the Bayesian point of view, inferences for the elements of θ can be derived from their marginal posterior distributions. Here, we have opted to use a suitable iterative procedure to draw pseudo-random samples from the approximate posterior density (Equation (4)) in order to make inferences for θ. Thus, to generate N pseudo-random values for each element of θ, we have adopted the MwG algorithm.
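A minimal sketch of the MwG idea is given below on a toy linear mean, updating one coordinate at a time with a random-walk Metropolis step; the toy data, proposal scale, prior variances, and chain length are illustrative choices of ours and not the settings used in this work.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: y = 1 + 2x + noise (a stand-in for the paper's models)
x = np.linspace(0.0, 1.0, 50)
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.3, x.size)

def log_post(theta):
    """Unnormalized log-posterior for (a, b, log zeta): hypothetical
    N(0, 10^2) priors on a and b, Gamma(0.01, 0.01) prior on zeta."""
    a, b, log_zeta = theta
    zeta = np.exp(log_zeta)
    resid = y - (a + b * x)
    loglik = 0.5 * y.size * log_zeta - 0.5 * zeta * resid @ resid
    logprior = -(a**2 + b**2) / (2.0 * 10.0**2)
    logprior += 0.01 * log_zeta - 0.01 * zeta   # Gamma prior + log-scale Jacobian
    return loglik + logprior

def mwg(theta0, n_iter=20000, step=0.1):
    """Metropolis-within-Gibbs: one random-walk Metropolis update
    per coordinate within each sweep."""
    theta = np.array(theta0, dtype=float)
    lp = log_post(theta)
    chain = np.empty((n_iter, theta.size))
    for t in range(n_iter):
        for j in range(theta.size):
            prop = theta.copy()
            prop[j] += rng.normal(0.0, step)
            lp_prop = log_post(prop)
            if np.log(rng.uniform()) < lp_prop - lp:   # accept/reject coordinate j
                theta, lp = prop, lp_prop
        chain[t] = theta
    return chain

chain = mwg([0.0, 0.0, 0.0])
burned = chain[5000::10]            # burn-in and thinning, as discussed below
a_hat, b_hat = burned[:, :2].mean(axis=0)
```

The posterior means of the retained draws serve as the Bayesian point estimates, mirroring the estimator adopted in this work.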
The convergence of the simulated sequences can be monitored using trace plots, autocorrelation plots, and statistical tests (e.g., Heidelberger and Welch [40] and Geweke [41]). After diagnosing convergence, an initial portion of each chain can be discarded as burn-in. To decrease the correlation between generated values, the chains are then thinned, so the final sample has size B ≪ N. After that, a descriptive summary of Equation (4) can be obtained through approximate Monte Carlo estimators using the generated chains. We choose the posterior expected value as the Bayesian point estimator for the elements of θ.
The next section illustrates the usefulness of the proposed semiparametric non-linear regression model using artificial and real datasets. All computations were performed using the R2jags package, which is available in the R environment [42]. The executable scripts can be made available by the authors upon justified request.

Model Comparison
There are many methods for Bayesian model selection that are useful for comparing competing models. The most popular is the Deviance Information Criterion (DIC), which simultaneously measures the model's fit and complexity. The DIC criterion is defined as

DIC = D(θ̄) + 2 p_D,

where D(θ) = −2 ℓ(θ; y, x, z) is the deviance function, p_D = E[D(θ)] − D(θ̄) is the effective number of model parameters, and θ̄ is the posterior expected value of θ. Noticeably, we are not able to compute the expectation of D(θ) over θ analytically. Therefore, an approximate Monte Carlo estimator for such a measure is

D̄ = (1/B) Σ_{b=1}^{B} D(θ^{(b)}),

where θ^{(1)}, . . . , θ^{(B)} are the generated posterior draws, and so the DIC can be estimated by

DIC ≈ 2 D̄ − D(θ̄).

The Expected Akaike (EAIC) and the Expected Bayesian (EBIC) information criteria can also be used when comparing Bayesian models [43,44]. Based on the approximation for the expected value of D(θ), these measures can be estimated by EAIC = D̄ + 2k and EBIC = D̄ + k log(n), where k = dim(θ).
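Given a matrix of posterior draws, the DIC estimate can be computed as sketched below; the function names are our own, and `deviance` is assumed to return −2 times the log-likelihood evaluated at a parameter vector.

```python
import numpy as np

def dic(deviance, chain):
    """Estimate DIC = 2*D_bar - D(theta_bar) from posterior draws.
    `chain` is a (B, dim) array; `deviance(theta)` = -2 * log-likelihood."""
    d_bar = np.mean([deviance(t) for t in chain])  # Monte Carlo E[D(theta)]
    theta_bar = chain.mean(axis=0)                 # posterior expected value
    d_hat = deviance(theta_bar)                    # deviance at theta_bar
    p_d = d_bar - d_hat                            # effective no. of parameters
    return d_hat + 2.0 * p_d, p_d                  # equals 2*d_bar - d_hat

# Sanity check with a deviance that is linear in theta, so p_D must be ~0
rng = np.random.default_rng(0)
chain = rng.normal(size=(200, 2))
dev = lambda t: 1.0 + 3.0 * t.sum()
dic_val, p_d = dic(dev, chain)
```

EAIC and EBIC follow from the same `d_bar` quantity as `d_bar + 2*k` and `d_bar + k*np.log(n)`, respectively.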
Another widely used criterion is derived as a posterior measure of goodness-of-fit based on the observed and predicted values. This measure is given by

A(m) = (1/n) Σ_{i=1}^{n} [y_i − μ̂_i]²,    (5)

where μ̂_i denotes the estimated mean of Y_i, which depends on the adopted model (m). For instance, under the semiparametric non-linear regression model in Equation (3), we have that

A(m) = (1/(n − 2)) Σ_{i=3}^{n} [y_i − μ̂_i]²,

since the first two observations are not considered when computing A under the semiparametric model in Equation (3).
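A minimal sketch of this computation is given below, assuming A is the mean squared deviation between the observed values and the estimated means (our reading of the criterion); the `skip` argument drops the leading observations that the semiparametric model does not fit.

```python
import numpy as np

def goodness_of_fit_A(y, mu_hat, skip=0):
    """Posterior goodness-of-fit A: mean squared deviation between observed
    values and estimated means; use skip=2 for the semiparametric model,
    whose first two observations are not fitted."""
    y = np.asarray(y, dtype=float)
    mu_hat = np.asarray(mu_hat, dtype=float)
    return np.mean((y[skip:] - mu_hat[skip:]) ** 2)

# Toy check: only the last observation deviates (by 1), and skip=2
# leaves two terms, so A = (0 + 1) / 2 = 0.5
y = [1.0, 2.0, 3.0, 4.0]
mu = [1.0, 2.0, 3.0, 5.0]
a_val = goodness_of_fit_A(y, mu, skip=2)
```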

Non-Linear Data Analysis
To illustrate the usefulness of the proposed methodology, we have considered three datasets and the non-linear models presented in Section 2.1. Using the MwG algorithm, a total of N = 110,000 pseudo-random values of θ were drawn from the approximate posterior distribution in Equation (4). After generating the values, the first 10,000 samples were discarded (burn-in period). Then, 1 out of every 100 generated values was kept, resulting in chains of size B = 1000 for each element of θ. Finally, trace plots were used to assess the stationarity of the obtained chains.

Artificial Data
Let us consider the artificial dataset displayed in Table 2. This dataset has n = 21 observations and is composed of a designed covariate and a fixed covariate. For analyzing these data, we have adopted the second-order polynomial model (Model (6)) and the semiparametric non-linear regression model (Model (7)). Table 3 presents the posterior parameter estimates and the 95% Credible Intervals (CIs) based on the fitted models. From the displayed results, one can draw some conclusions. Firstly, one can notice that the CIs of parameter α_1 of both models do not contain the value zero, which establishes z and g_1(z) as relevant covariates for explaining part of the response's variability. The comparison between the fitted models is presented in Table 4. One can notice that Model (6) performed poorly compared with the proposed semiparametric non-linear regression model. Figure 1 illustrates the behavior of the predicted responses against the values of the designed covariate. When considering the goodness-of-fit measure in Equation (5), we have that A[(6)] = 1.8061 and A[(7)] = 0.0007, which indicates that the proposed semiparametric non-linear model performed better in predicting the response variable. In order to reinforce this conclusion, both models were refitted considering only the first 20 observations, so we could predict the 21st outcome (y_21 = 10). From Model (6), we obtained ŷ_21 = 11.8 (±2.2), and from the semiparametric non-linear regression model, we obtained ŷ_21 = 10.4 (±3.3), which also suggests a better fit of Model (7).

COVID-19 Count Data
As a second application, we have considered data from n = 358 daily counts of cases and deaths caused by COVID-19 in Brazil (from 17 March 2020 to 21 March 2021). For analyzing these data, we have adopted a second-order autoregressive model with lagged effects (Model (8)) and, for the moving averages of the daily COVID-19 counts (average of the last seven days), the semiparametric non-linear model (Model (9)). Table 5 presents the posterior parameter estimates and the 95% CIs based on the fitted models. From the fitted AR(2) model, it can be noticed that the covariate day is not relevant for describing the incidence of cases and deaths by COVID-19 in the observed time frame. The comparison between the fitted models is presented in Table 6. Noticeably, the semiparametric non-linear model outperformed the AR(2) model in both cases, which can be acknowledged as an excellent result since Model (9) has one fewer parameter. Figure 2 illustrates the fitted means (moving averages for COVID-19 cases and deaths) across days. Both models provide excellent fits for the COVID-19 counts, with the semiparametric non-linear model being slightly better than the AR(2) model.

Tuberculosis Count Data
For the last application, we have considered data from n = 216 monthly counts of tuberculosis cases in Brazil (from January 2001 to December 2018). For analyzing these data, we have adopted the third-order polynomial regression model (Model (10)) and the semiparametric non-linear regression model (Model (11)). Table 7 presents the posterior parameter estimates and the 95% Credible Intervals (CIs) based on the fitted models. From the displayed results, one can draw some conclusions. Firstly, one can notice that the CIs of parameter β of Model (10) do not contain the value zero, which establishes year as a relevant covariate for explaining part of the response's variability. The comparison between the fitted models is presented in Table 8. One can notice that, even with one fewer parameter, the proposed semiparametric non-linear model (Equation (11)) performed much better than the polynomial model (Equation (10)). Figure 3 illustrates the fitted means for the monthly number of tuberculosis cases. When considering the goodness-of-fit measure from Equation (5), we have that A[(10)] = 407 and A[(11)] = 315.83, which indicates that the semiparametric non-linear regression model (Equation (11)) performed better in predicting the number of tuberculosis cases. In the following, both models were refitted considering only the first 215 observations, so we could predict the 216th outcome (y_216 = 6836). From Model (10), we obtained ŷ_216 = 7926.76 (±0.019), and from the semiparametric non-linear regression model we obtained ŷ_216 = 7030.41 (±0.003), which also suggests a better fit of Model (11).

Concluding Remarks
Parametric non-linear approaches typically involve choosing a model among many existing non-linear formulations, which can be a burden in many applications. Moreover, most numerical iterative methods for model fitting strongly depend on choosing precise initial values. However, non-linear models often provide insightful (biological or physical) parameter interpretations for many researchers. In this sense, we aimed to introduce a semiparametric non-linear regression framework as an alternative to standard non-linear models. The proposed model can be considered an excellent alternative to many existing nonparametric regression techniques based on spline smoothing and GAMs. Approximate posterior inferences for the model parameters were obtained from a fully Bayesian approach based on the MwG algorithm with weakly informative priors. The proposed model and some well-established non-linear models were considered for analyzing three datasets. Based on prediction accuracy, we could conclude that the proposed semiparametric framework can be a powerful alternative for estimation and prediction in non-linear settings.