Article

A New Semiparametric Regression Framework for Analyzing Non-Linear Data

by Wesley Bertoli 1,*, Ricardo P. Oliveira 2 and Jorge A. Achcar 3

1 Department of Statistics, Federal University of Technology-Paraná, Curitiba 80230-901, Brazil
2 Department of Statistics, Maringá State University, Maringá 87020-900, Brazil
3 Ribeirão Preto Medical School, University of São Paulo, Ribeirão Preto 14049-900, Brazil
* Author to whom correspondence should be addressed.
Analytics 2022, 1(1), 15-26; https://doi.org/10.3390/analytics1010002
Submission received: 11 May 2022 / Revised: 31 May 2022 / Accepted: 7 June 2022 / Published: 16 June 2022

Abstract: This work introduces a straightforward framework for semiparametric non-linear models as an alternative to existing non-linear parametric models, whose interpretation primarily depends on biological or physical aspects that are not always available in every practical situation. The proposed methodology does not require intensive numerical methods to obtain estimates in non-linear contexts, which is attractive as such algorithms’ convergence strongly depends on assigning good initial values. Moreover, the proposed structure can be compared with standard polynomial approximations often used for explaining non-linear data behaviors. Approximate posterior inferences for the semiparametric model parameters were obtained from a fully Bayesian approach based on the Metropolis-within-Gibbs algorithm. The proposed structures were considered to analyze artificial and real datasets. Our results indicated that the semiparametric models outperform linear polynomial regression approximations to predict the behavior of response variables in non-linear settings.

1. Introduction

Non-linear models are often applied in many areas of quantitative research, such as biology, chemistry, epidemiology, and physics, among others. These models are appealing in such areas because they are straightforward to interpret (regarding the nature of the underlying process) and typically provide excellent predictions for the response variable [1]. As an alternative to non-linear parametric models, researchers may approximate unknown non-linear functions using linear polynomial models. However, adopting such linear approximations to describe non-linear behaviors may be cumbersome, as it can involve estimating more (and not easily interpretable) parameters [2]. On the other hand, non-linear models are sensitive to the choice of functional form, which should be made carefully depending on the application, and their parameters generally do not have closed-form estimators. The absence of analytical solutions implies the need for numerical algorithms whose convergence strongly depends on the initial values chosen for the iterative procedures.
Several proposals for non-linear models can be found in the literature, and modern techniques can also be used for non-linear modeling and prediction, such as nonparametric regression based on spline smoothing [3,4,5] and Generalized Additive Models (GAMs) [6,7]. In the context of agricultural applications, ref. [8] classifies non-linear models into six groups, as detailed in Table 1. Moreover, ref. [9] provides an excellent review of Gaussian Processes (GPs) and Relevance Vector Machines (RVMs), discussing how those nonparametric methods can be applied in non-linear frameworks for regression over large datasets and how effective they can be for dealing with sequential data. The primary advantage of RVMs is that one can choose more general basis functions, whereas GPs excel at predicting variances, although they are more restrictive regarding the choice of kernel function. Ref. [10] also studied such methods and concluded that the main difficulty with GPs lies in learning by maximizing the evidence, through which the hyperparameters can be learned, in contrast with RVMs, whose basis functions are fixed and whose inputs are the learning targets.
In this context, this work aims to introduce a semiparametric non-linear regression framework that can be very useful for obtaining highly accurate fits to non-linear datasets. Our goal is to provide an alternative to the existing linear approximations and nonparametric models. The proposed approach does not require numerical methods that strongly depend on precise initial values to reach convergence. By adopting the proposed framework, one can derive accurate inferences and predictions under a fully Bayesian approach [29] using standard MCMC (Markov Chain Monte Carlo) methods [30,31,32] (e.g., Gibbs Sampling, Metropolis–Hastings, and Metropolis-within-Gibbs (MwG), among others). In this paper, we chose to work with the MwG algorithm [33] to draw pseudo-random samples from the approximate posterior distribution of the model parameters.
This paper is organized as follows. In Section 2, we present fundamental concepts regarding the formulation and estimation of parametric non-linear regression models, introduce the proposed semiparametric framework, and describe the Bayesian estimation and model-comparison procedures. In Section 3, we analyze and discuss the results obtained with the proposed methodology for modeling artificial and real datasets featuring non-linear relationships. Comparisons with linear polynomial approximations are also presented. General comments and concluding remarks are addressed in Section 4.

2. Materials and Methods

Non-linear models are similar to linear regression models [34] in the sense of outlining the functional relationship between a continuous response variable Y and a set of covariates, thus providing a statistical prediction tool. Linear regressions are used to build purely empirical models, while non-linear models are typically applied when biological or physical interpretations imply relationships between responses and covariates that are not linear [35,36]. It is important to emphasize that linearity or non-linearity refers to the unknown parameters and not to the response–covariate relationship. In this context, a non-linear regression model for representing a response variable $Y_i$ $(i = 1, \ldots, n)$ has the general form
$$Y_i = f(z_i, \alpha) + \epsilon_i, \qquad (1)$$
where f is a known function of the designed covariate $z_i$, and $\alpha = (\alpha_1, \ldots, \alpha_p)^\top$ is a p-dimensional vector of non-linear parameters indexing f. Moreover, $\epsilon_i$ denotes the random error, which is typically assumed to be normally distributed with zero mean and constant variance. It is also usual to assume that the errors are uncorrelated, that is, $\mathrm{Cov}(\epsilon_i, \epsilon_j) = 0$ for all $i \neq j$.
The most popular method for estimating $\alpha$ is non-linear least squares, which is based on minimizing
$$S(\epsilon) = \sum_{i=1}^{n} \epsilon_i^2 = \sum_{i=1}^{n} \left[ y_i - f(z_i, \alpha) \right]^2,$$
where $\epsilon = (\epsilon_1, \ldots, \epsilon_n)^\top$. It is worth mentioning that if $\epsilon_i \sim N(0, \sigma_\epsilon^2)$, then the least squares and maximum likelihood estimators of $\alpha$ are the same.
Typically, point estimates for non-linear regression coefficients are obtained from iterative optimization processes based on techniques to minimize the error sum of squares. A widespread iterative method to derive least-squares estimates for non-linear models is the Gauss–Newton algorithm. In this context, if $f(z_i, \alpha)$ in Equation (1) is continuously differentiable at $\alpha$, then f can be linearized locally at $\alpha_0$ as
$$f(z_i, \alpha) = f(z_i, \alpha_0) + z_0 (\alpha - \alpha_0),$$
where $z_0$ is the $n \times p$ Jacobian matrix whose elements
$$\frac{\partial f(z_i, \alpha)}{\partial \alpha_j}$$
are evaluated at $\alpha = \alpha_0$. Thus, the iterative algorithm to estimate $\alpha$ is given by
$$\alpha^{(k+1)} = \alpha^{(k)} + \left( z_0^\top z_0 \right)^{-1} z_0^\top \epsilon,$$
where $\alpha^{(0)} = \alpha_0$ is the vector of initial values for $\alpha$, and $\epsilon$ is evaluated at $\alpha = \alpha^{(k)}$. If the errors are independent and normally distributed, then the Gauss–Newton algorithm is an application of the Fisher Scoring method.
Implementations of the Gauss–Newton algorithm are available in most existing statistical software but, in practice, there is no guarantee that the algorithm will converge from initial values that are far from the solution. In this sense, some improvements on this method can be found in the literature, such as the Gradient Descent and Levenberg–Marquardt algorithms [36].
After obtaining point estimates for $\alpha$, one may derive confidence intervals and conduct hypothesis tests by assuming
$$\hat{\alpha} \overset{a}{\sim} N_p\!\left( \alpha, \; \sigma_\epsilon^2 \left( z_0^\top z_0 \right)^{-1} \right),$$
where $\sigma_\epsilon^2$ can be estimated by
$$\hat{\sigma}_\epsilon^2 = \frac{1}{n - p} \sum_{i=1}^{n} \left[ y_i - f(z_i, \hat{\alpha}) \right]^2.$$
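A minimal sketch of the Gauss–Newton update above is given below, written for a hypothetical exponential decay model $f(z, \alpha) = \alpha_1 e^{-\alpha_2 z}$ (the model, data, and starting values are illustrative assumptions, not from the paper); in practice, nls() in base R implements this algorithm with additional safeguards.

gauss_newton <- function(y, z, alpha0, max_iter = 50, tol = 1e-8) {
  f <- function(z, a) a[1] * exp(-a[2] * z)                  # model mean function
  J <- function(z, a) cbind(exp(-a[2] * z),                  # df/d alpha_1
                            -a[1] * z * exp(-a[2] * z))      # df/d alpha_2
  alpha <- alpha0
  for (k in seq_len(max_iter)) {
    eps   <- y - f(z, alpha)                                 # current residuals
    Z0    <- J(z, alpha)                                     # n x p Jacobian at alpha^(k)
    step  <- solve(crossprod(Z0), crossprod(Z0, eps))        # (Z0' Z0)^{-1} Z0' eps
    alpha <- alpha + as.vector(step)
    if (sum(step^2) < tol) break                             # stop once the update is negligible
  }
  Z0 <- J(z, alpha)                                          # Jacobian at the final estimate
  n <- length(y); p <- length(alpha)
  sigma2 <- sum((y - f(z, alpha))^2) / (n - p)               # residual variance estimate
  list(alpha = alpha, vcov = sigma2 * solve(crossprod(Z0)))  # asymptotic covariance of alpha-hat
}

# Usage on simulated data; convergence depends on the initial values alpha0.
set.seed(1)
z <- seq(0, 5, length.out = 30)
y <- 10 * exp(-0.8 * z) + rnorm(30, sd = 0.2)
gauss_newton(y, z, alpha0 = c(5, 0.5))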

2.1. The Semiparametric Non-Linear Regression Model

Suppose that a random experiment is conducted with n subjects. The primary response in this setting is described by a random variable $Y_i$ denoting the outcome for the i-th subject $(i = 1, \ldots, n)$. The full response vector of the experiment is given by $Y = (Y_1, \ldots, Y_n)^\top$, and we assume that the behavior of $Y_i$ can be partially explained by a non-linear relationship involving a designed covariate $z_i$ through a known function f. Simultaneously, we consider that part of the variability of $Y_i$ can also be linearly modeled by a k-dimensional vector $x_i = (x_{1i}, \ldots, x_{ki})^\top$ of fixed covariates [37,38,39]. In this context, we have the non-linear regression model
$$Y_i = f(z_i, \alpha) + x_i^\top \beta + \epsilon_i, \qquad (2)$$
where $\beta = (\beta_1, \ldots, \beta_k)^\top$ is a k-dimensional vector of regression coefficients related to $x_i$, and $\epsilon_i$ is the random error of the i-th observation. Here, we assume that the errors are uncorrelated and normally distributed with zero mean and constant variance $(\sigma_\epsilon^2)$.
A particular case arising from Equation (2) is the p-order polynomial regression model, which is obtained by taking
$$f(z_i, \alpha) = \alpha_0 + \sum_{j=1}^{p} \alpha_j z_i^j.$$
In the context of Model (2), let $z = (z_1, \ldots, z_n)^\top$ be the full vector of designed values. In order to obtain an approximation for f, we assume that $z_1 \leq z_2 \leq \cdots \leq z_n$, and then we associate these values to each $Y_i$ non-linearly through $\alpha$. Thus, for each point $z_i$ $(i = 3, \ldots, n)$, we take $a = z_{i-1}$, $a + h = z_i$, and $a - h = z_{i-2}$ to express the approximation to $f(z_i)$ as
$$f(z_i) = f(z_{i-1}) + \frac{\left[ f(z_i) - f(z_{i-1}) \right](z_i - z_{i-1})}{h_i} + \frac{\left[ f(z_i) - 2 f(z_{i-1}) + f(z_{i-2}) \right](z_i - z_{i-1})^2}{2 h_i^2},$$
which is based on a Taylor series expansion of $f(z_i)$ around $z_{i-1}$. Now, one can notice that replacing $f(z_i)$ with the observed data on the right-hand side of the previous equation leads to the approximation
$$f(z_i) \approx y_{i-1} + g_1(z_i) + g_2(z_i),$$
where
$$g_1(z_i) = \frac{(y_i - y_{i-1})(z_i - z_{i-1})}{h_i} \quad \text{and} \quad g_2(z_i) = \frac{(y_i - 2 y_{i-1} + y_{i-2})(z_i - z_{i-1})^2}{2 h_i^2},$$
with $h_i = z_{i+1} - z_i$. Therefore, an alternative to Model (2) is the semiparametric non-linear regression model given by
$$Y_i = \alpha_1 Y_{i-1} + \alpha_2 g_1(z_i) + \alpha_3 g_2(z_i) + x_i^\top \beta + \epsilon_i, \qquad (3)$$
which holds for $i \in \{3, \ldots, n\}$.
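A minimal sketch (not the authors' code) of how the terms $y_{i-1}$, $g_1(z_i)$, and $g_2(z_i)$ in Equation (3) could be built from observed data is shown below; it assumes the z values are already sorted in increasing order and, since $h_n = z_{n+1} - z_n$ is undefined for the last point, reuses the previous spacing there (an assumption made only for this illustration).

semiparametric_terms <- function(y, z) {
  n <- length(y)
  i <- 3:n
  h <- c(diff(z), NA)          # h_i = z_{i+1} - z_i
  h[n] <- h[n - 1]             # assumption: reuse the last available spacing for i = n
  g1 <- (y[i] - y[i - 1]) * (z[i] - z[i - 1]) / h[i]
  g2 <- (y[i] - 2 * y[i - 1] + y[i - 2]) * (z[i] - z[i - 1])^2 / (2 * h[i]^2)
  data.frame(i = i, y = y[i], ylag = y[i - 1], g1 = g1, g2 = g2)
}

# Toy usage with purely illustrative values
semiparametric_terms(y = c(2.1, 2.9, 4.2, 6.0, 8.5, 11.9), z = 1:6)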

2.2. Bayesian Inference

In this subsection, we address the problem of estimating and making inferences from Model (2) under a fully Bayesian perspective. Firstly, the log-likelihood function of the vector $\theta = (\alpha, \beta, \zeta)$ can be written as
$$\ell(\theta; y, x, z) \propto \frac{n}{2} \log(\zeta) - \frac{\zeta}{2} \sum_{i=1}^{n} \left[ y_i - f(z_i, \alpha) - x_i^\top \beta \right]^2,$$
where $\zeta = \sigma_\epsilon^{-2}$ is the precision parameter.
For the p-order polynomial model, we have $\alpha = (\alpha_0, \alpha_1, \ldots, \alpha_p)^\top$ and, specifically for the semiparametric non-linear regression model, we have $\alpha = (\alpha_1, \alpha_2, \alpha_3)^\top$. In the latter case, the log-likelihood function of $\theta$ can be expressed as
$$\ell(\theta; y, x, z) \propto \frac{n}{2} \log(\zeta) - \frac{\zeta}{2} \sum_{i=3}^{n} \left[ y_i - \alpha_1 y_{i-1} - \alpha_2 g_1(z_i) - \alpha_3 g_2(z_i) - x_i^\top \beta \right]^2.$$
In this work, we have adopted weakly informative Normal prior distributions for the vectors $\alpha$ and $\beta$, that is,
$$\alpha \sim N_q(0, \mathbf{1}_q) \quad \text{and} \quad \beta \sim N_k(0, \mathbf{1}_k),$$
where $\mathbf{1}_q$ and $\mathbf{1}_k$ are identity matrices of sizes q and k, respectively. For the p-order polynomial model, we have $q = p + 1$. As for the parameter $\zeta$, we have adopted a Gamma prior distribution with both hyperparameters equal to 0.01. We further assume prior independence among all parameters.
Now, we can express the posterior distribution of $\theta$ as
$$\pi(\theta; y, x, z) \propto \exp\left\{ \ell(\theta; y, x, z) + \log[\pi(\alpha)] + \log[\pi(\beta)] + \log[\pi(\zeta)] \right\}. \qquad (4)$$
From the Bayesian point of view, inferences for the elements of $\theta$ can be derived from their marginal posterior distributions. Here, we have opted to use a suitable iterative procedure to draw pseudo-random samples from the approximate posterior density (Equation (4)) in order to make inferences for $\theta$. Thus, to generate N pseudo-random values for each element of $\theta$, we have adopted the MwG algorithm.
The convergence of the simulated sequences can be monitored using trace plots, autocorrelation plots, and statistical tests (e.g., Heidelberger and Welch [40] and Geweke [41]). After diagnosing convergence, some samples can be discarded as burn-in. The strategy to decrease the correlation between generated values is based on thinning the chains, so the final sample has size B < N. After that, a descriptive summary of Equation (4) can be obtained through approximate Monte Carlo estimators using the generated chains. We chose the posterior expected value as the Bayesian point estimator for the elements of $\theta$.
The next section illustrates the usefulness of the proposed semiparametric non-linear regression model using artificial and real datasets. All computations were performed using the R2jags package, which is available in the R environment [42]. The executable scripts can be made available by the authors upon justified request.
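As an illustration of the computational setup just described, a hedged sketch of how Model (3) and the priors above could be coded for JAGS via R2jags is given below; the variable names (y, ylag, g1, g2, x, m, k) and the model layout are illustrative assumptions, since the authors' actual scripts are available only upon request.

library(R2jags)

# Sketch (not the authors' code) of Model (3) with the priors of Section 2.2:
# N(0, 1) priors on alpha and beta (JAGS uses the precision parameterization)
# and a Gamma(0.01, 0.01) prior on the error precision zeta.
semiparametric_model <- function() {
  for (t in 1:m) {                      # m = n - 2 usable observations (i = 3, ..., n)
    mu[t] <- alpha[1] * ylag[t] + alpha[2] * g1[t] + alpha[3] * g2[t] + inprod(x[t, ], beta)
    y[t] ~ dnorm(mu[t], zeta)           # normal likelihood with precision zeta
  }
  for (j in 1:3) { alpha[j] ~ dnorm(0, 1) }
  for (j in 1:k) { beta[j] ~ dnorm(0, 1) }
  zeta ~ dgamma(0.01, 0.01)
}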

2.3. Model Comparison

There are many methods for Bayesian model selection that are useful for comparing competing models. The most popular is the Deviance Information Criterion (DIC), which simultaneously measures the model's fit and complexity. The DIC is defined as
$$\mathrm{DIC} = E_\theta\left[ D(\theta) \right] + p_D = \bar{D}(\theta) + p_D,$$
where $D(\theta) = -2\,\ell(\theta; y, x, z)$ is the deviance function, and $p_D = \bar{D}(\theta) - D(\hat{\theta})$ is the effective number of model parameters, with $\hat{\theta}$ denoting the posterior expected value.
Noticeably, we are not able to compute the expectation of $D(\theta)$ over $\theta$ analytically. Therefore, an approximate Monte Carlo estimator for this measure is
$$\hat{\bar{D}}(\theta) = -\frac{2}{B} \sum_{i=1}^{B} \ell\!\left( \theta^{(i)}; y, x, z \right),$$
and so the DIC can be estimated by
$$\widehat{\mathrm{DIC}} = 2\,\hat{\bar{D}}(\theta) - D(\hat{\theta}).$$
The Expected Akaike (EAIC) and the Expected Bayesian (EBIC) information criteria can also be used when comparing Bayesian models [43,44]. Based on the approximation to the expected value of $D(\theta)$, these measures can be estimated by
$$\widehat{\mathrm{EAIC}} = \hat{\bar{D}}(\theta) + 2k \quad \text{and} \quad \widehat{\mathrm{EBIC}} = \hat{\bar{D}}(\theta) + k \log n,$$
where $k = \dim(\theta)$.
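A minimal sketch of how these estimates could be computed from the MCMC output is given below, assuming that loglik_draws stores the log-likelihood evaluated at each of the B retained posterior draws and loglik_at_mean stores its value at the posterior expected value of $\theta$ (both hypothetical objects, not part of the paper).

bayes_criteria <- function(loglik_draws, loglik_at_mean, k, n) {
  D_bar <- -2 * mean(loglik_draws)   # Monte Carlo estimate of E[D(theta)]
  D_hat <- -2 * loglik_at_mean       # deviance at the posterior expected value
  c(pD   = D_bar - D_hat,            # effective number of parameters
    DIC  = 2 * D_bar - D_hat,
    EAIC = D_bar + 2 * k,
    EBIC = D_bar + k * log(n))
}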
Another widely used criterion is derived as a posterior measure of goodness-of-fit based on the observed and predicted values. This measure is given by
$$A[m] = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{\mu}_i \right|, \qquad (5)$$
where $\hat{\mu}_i$ denotes the estimated mean of $Y_i$, which depends on the adopted model (m). For instance, under the semiparametric non-linear regression model in Equation (3), we have
$$A[(3)] = \frac{1}{n - 2} \sum_{i=3}^{n} \left| y_i - \hat{\alpha}_1 y_{i-1} - \hat{\alpha}_2 g_1(z_i) - \hat{\alpha}_3 g_2(z_i) - x_i^\top \hat{\beta} \right|,$$
since the first two observations are not considered when computing A under the semiparametric model in Equation (3).
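In code, A[m] reduces to a mean absolute difference between observed and fitted values; for the semiparametric model the first two observations are dropped, so the average runs over n − 2 terms. A sketch follows, with mu_hat denoting the fitted means of the chosen model (a hypothetical object).

A_measure <- function(y, mu_hat) mean(abs(y - mu_hat))
# e.g., for Model (3): A_measure(y[3:n], mu_hat_semiparametric)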

3. Non-Linear Data Analysis

To illustrate the usefulness of the proposed methodology, we have considered three datasets and the non-linear models presented in Section 2.1. Using the MwG algorithm, a total of N = 110,000 pseudo-random values were drawn from the approximate posterior distribution of $\theta$ in Equation (4). After generating the values, the first 10,000 samples were discarded (burn-in period). Then, 1 out of every 100 generated values was kept, resulting in sequences of size B = 1000 for each element of $\theta$. Finally, trace plots were used to assess the stationarity of the obtained chains.
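For concreteness, a call along the following lines (using the R2jags package mentioned in Section 2.2) reproduces these MCMC settings; the data-list names match the illustrative model sketch given there and are assumptions rather than the authors' exact code.

fit <- jags(data = list(y = y_obs, ylag = ylag, g1 = g1, g2 = g2, x = x_mat,
                        m = length(y_obs), k = ncol(x_mat)),
            parameters.to.save = c("alpha", "beta", "zeta"),
            model.file = semiparametric_model,
            n.chains = 1, n.iter = 110000, n.burnin = 10000, n.thin = 100)
traceplot(fit)    # visual check of chain stationarity, as described above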

3.1. Artificial Data

Let us consider the artificial dataset displayed in Table 2. This dataset has n = 21 observations and is composed of a designed and a fixed covariate. For analyzing these data, we have adopted the second-order polynomial model
$$Y_i = \alpha_0 + \alpha_1 z_i + \alpha_2 z_i^2 + x_i \beta + \epsilon_i \quad (i = 1, \ldots, 21), \qquad (6)$$
and the semiparametric non-linear regression model
$$Y_i = \alpha_1 Y_{i-1} + \alpha_2 g_1(z_i) + \alpha_3 g_2(z_i) + x_i \beta + \epsilon_i \quad (i = 3, \ldots, 21). \qquad (7)$$
Table 3 presents the posterior parameter estimates and the 95% Credible Intervals (CIs) based on the fitted models. Some conclusions can be drawn from the displayed results. Firstly, one can notice that the CIs for parameter $\alpha_1$ in both models do not contain zero, which establishes z and $g_1(z)$ as relevant covariates for explaining part of the response's variability. The comparison between the fitted models is presented in Table 4. One can notice that Model (6) performed poorly compared with the proposed semiparametric non-linear regression model.
Figure 1 illustrates the behavior of the predicted responses against the values of the designed covariate. When considering the goodness-of-fit measure in Equation (5), we have A[(6)] = 1.8061 and A[(7)] = 0.0007, which indicates that the proposed semiparametric non-linear model performed better in predicting the response variable. To reinforce this conclusion, both models were refitted considering only the first 20 observations so that the 21st outcome $(y_{21} = 10)$ could be predicted. From Model (6), we obtained $\hat{y}_{21} = 11.8\ (\pm 2.2)$, and from the semiparametric non-linear regression model, we obtained $\hat{y}_{21} = 10.4\ (\pm 3.3)$, which also suggests a better fit of Model (7).
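As a hedged sketch of this hold-out exercise, the posterior draws from a refit of Model (7) to the first 20 observations can be combined with the known covariates of observation 21 to approximate $\hat{y}_{21}$; the object names (fit20, g1_21, g2_21, x21) are illustrative assumptions that follow the earlier sketches, not the authors' code.

post <- fit20$BUGSoutput$sims.list                  # draws of alpha and beta from the refit
y21_draws <- post$alpha[, 1] * y_obs[20] +          # alpha_1 * y_20
             post$alpha[, 2] * g1_21 +              # g_1 evaluated at z_21
             post$alpha[, 3] * g2_21 +              # g_2 evaluated at z_21
             post$beta[, 1] * x21                   # single fixed covariate here (k = 1)
c(mean(y21_draws), sd(y21_draws))                   # point prediction and spread for y_21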

3.2. COVID-19 Count Data

As a second application, we have considered data from n = 358 daily counts of cases and deaths caused by COVID-19 in Brazil (from 17 March 2020 to 21 March 2021). For analyzing these data, we have adopted a second-order autoregressive (AR) model with lagged effects given by
$$Y_i = \beta_0 + \beta_1\, day_i + \beta_2 Y_{i-1} + \beta_3 Y_{i-2} + \epsilon_i \quad (i = 1, \ldots, 358), \qquad (8)$$
and, for the moving averages (MA) of the daily COVID-19 counts (average of the last seven days), we have considered the following semiparametric non-linear model:
$$Y_i = \alpha_1 Y_{i-1} + \alpha_2 g_1(day_i) + \alpha_3 g_2(day_i) + \epsilon_i \quad (i = 1, \ldots, 358). \qquad (9)$$
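The data preparation implied by Models (8) and (9), namely lagged counts for the AR(2) specification and a trailing seven-day moving average as the response of the semiparametric model, could be sketched as follows, assuming a data frame covid with a cases column (the column names are assumptions for illustration).

ma7 <- function(v) as.numeric(stats::filter(v, rep(1 / 7, 7), sides = 1))   # trailing 7-day average

n_days         <- nrow(covid)
covid$day      <- seq_len(n_days)                                 # day index, i = 1, ..., 358
covid$lag1     <- c(NA, covid$cases[-n_days])                     # Y_{i-1} for Model (8)
covid$lag2     <- c(NA, NA, covid$cases[-((n_days - 1):n_days)])  # Y_{i-2} for Model (8)
covid$cases_ma <- ma7(covid$cases)                                # response for Model (9)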
Table 5 presents the posterior parameter estimates and the 95% CIs based on the fitted models. From the fitted AR(2) model, it can be noticed that the covariate day is not relevant for describing the incidence of COVID-19 cases and deaths in the observed time frame. The comparison between the fitted models is presented in Table 6. Noticeably, the semiparametric non-linear model outperformed the AR(2) model in both cases, which can be regarded as an excellent result since Model (9) has one fewer parameter.
Figure 2 illustrates the fitted means (moving averages for COVID-19 cases and deaths) across days. Noticeably, both models provide excellent fits for the COVID-19 counts, with the semiparametric non-linear model being slightly better than the AR(2) since we have A[(8)] = 994.03 against A[(9)] = 934.11 for the number of cases and A[(8)] = 21.59 against A[(9)] = 20.68 for the number of deaths.

3.3. Tuberculosis Count Data

For the last application, we have considered data from n = 216 monthly counts of tuberculosis cases in Brazil (from January 2001 to December 2018). For analyzing these data, we have adopted the third-order polynomial regression model
$$Y_i = \alpha_0 + \alpha_1 z_i + \alpha_2 z_i^2 + \alpha_3 z_i^3 + \beta\, year_i + \epsilon_i \quad (i = 1, \ldots, 216), \qquad (10)$$
and the following semiparametric non-linear regression model:
$$Y_i = \alpha_1 Y_{i-1} + \alpha_2 g_1(z_i) + \alpha_3 g_2(z_i) + \beta\,(year_i - 2000) + \epsilon_i \quad (i = 1, \ldots, 216). \qquad (11)$$
Table 7 presents the posterior parameter estimates and the 95% Credible Intervals (CIs) based on the fitted models. From the displayed results, one can notice that the CI for parameter $\beta$ in Model (10) does not contain zero, which establishes year as a relevant covariate for explaining part of the response's variability.
The comparison between the fitted models is presented in Table 8. One can notice that, even with one fewer parameter, the proposed semiparametric non-linear model (Equation (11)) performed much better than the polynomial model (Equation (10)).
Figure 3 illustrates the fitted means for the monthly number of tuberculosis cases. When considering the goodness-of-fit measure from Equation (5), we have A[(10)] = 407 and A[(11)] = 315.83, which indicates that the semiparametric non-linear regression model (Equation (11)) performed better in predicting the number of tuberculosis cases. These models were then refitted considering only the first 215 observations so that the 216th outcome $(y_{216} = 6836)$ could be predicted. From Model (10), we obtained $\hat{y}_{216} = 7926.76\ (\pm 0.019)$, and from the semiparametric non-linear regression model, we obtained $\hat{y}_{216} = 7030.41\ (\pm 0.003)$, which also suggests a better fit of Model (11).

4. Concluding Remarks

Parametric non-linear approaches typically involve choosing a model among many existing non-linear formulations, which can be a burden in many applications. Moreover, most numerical iterative methods for fitting such models strongly depend on choosing precise initial values. Nevertheless, non-linear models often provide researchers with insightful (biological or physical) parameter interpretations. In this sense, we aimed to introduce a semiparametric non-linear regression framework as an alternative to standard non-linear models. The proposed model can be considered an excellent alternative to many existing nonparametric regression techniques based on spline smoothing and GAMs. Approximate posterior inferences for the model parameters were obtained from a fully Bayesian approach based on the MwG algorithm with weakly informative priors. The proposed model and some well-established non-linear models were considered for analyzing three datasets. Based on the prediction accuracy, we conclude that the proposed semiparametric framework can be a powerful alternative for estimation and prediction in non-linear settings.

Supplementary Materials

The datasets information can be downloaded at: https://www.mdpi.com/article/10.3390/analytics1010002/s1.

Author Contributions

Conceptualization, W.B., R.P.O. and J.A.A.; Formal analysis, W.B., R.P.O. and J.A.A.; Methodology, W.B., R.P.O. and J.A.A.; Software, W.B., R.P.O. and J.A.A.; Writing–original draft, W.B., R.P.O. and J.A.A.; Writing–review and editing, W.B., R.P.O. and J.A.A. All authors contributed equally to developing this work. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets used in this work were made available as Supplementary Materials.

Acknowledgments

We would like to thank the Associate Editor and the three anonymous referees for their careful reading and thoughtful suggestions, which helped us improve this work’s content and presentation.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AR    Autoregressive
CI    Credible Interval
GAM   Generalized Additive Model
GP    Gaussian Process
MA    Moving Average
MCMC  Markov Chain Monte Carlo
MwG   Metropolis-within-Gibbs
RVM   Relevance Vector Machine

References

1. Bates, D.M.; Watts, D.G. Nonlinear Regression Analysis and Its Applications, 2nd ed.; John Wiley & Sons: New York, NY, USA, 2007.
2. Pinheiro, J.C.; Bates, D.M. Mixed-Effects Models in S and S-Plus; Springer: New York, NY, USA, 2000.
3. Eubank, R.L. Spline Smoothing and Nonparametric Regression; Marcel Dekker: New York, NY, USA, 1988.
4. Green, P.J.; Silverman, B.W. Nonparametric Regression and Generalized Linear Models: A Roughness Penalty Approach; Chapman & Hall: London, UK, 1994.
5. Gu, C. Smoothing Spline ANOVA Models, 2nd ed.; Springer: New York, NY, USA, 2013.
6. Hastie, T.; Tibshirani, R. Generalized Additive Models; Chapman & Hall: London, UK, 1990.
7. Hastie, T.; Tibshirani, R. Varying-coefficient models. J. R. Stat. Soc. Ser. B 1993, 55, 757–796.
8. Archontoulis, S.V.; Miguez, F.E. Nonlinear regression models and applications in agricultural research. Agron. J. 2015, 107, 786–798.
9. Martino, L.; Read, J. A joint introduction to Gaussian processes and relevance vector machines with connections to Kalman filtering and other kernel smoothers. Inf. Fusion 2021, 74, 17–38.
10. Candela, J.Q. Learning with Uncertainty-Gaussian Processes and Relevance Vector Machines; Technical University of Denmark: Copenhagen, Denmark, 2004; pp. 1–152.
11. Dixon, B.L.; Sonka, S.T. A note on the use of exponential functions for estimating farm size distributions. Am. J. Agric. Econ. 1979, 61, 554–557.
12. Shimojo, M.; Nakano, Y. An investigation into relationships between exponential functions and some natural phenomena. J. Fac. Agric. Kyushu Univ. 2013, 58, 51–53.
13. Gompertz, B. On the nature of the function expressive of the law of human mortality, and on a new mode of determining the value of life contingencies. Philos. Trans. R. Soc. B 1825, 115, 513–585.
14. Verhulst, P.F. A note on population growth. Corresp. Math. Phys. 1838, 10, 113–121.
15. Weibull, W. A statistical distribution function of wide applicability. J. Appl. Math. 1951, 18, 293–297.
16. Richards, F.J. A flexible growth function for empirical use. J. Exp. Bot. 1959, 10, 290–300.
17. Yin, X.; Goudriaan, J.; Lantinga, E.A.; Vos, J.; Spiertz, J.H.J. A flexible sigmoid function of determinate growth. Ann. Bot. 2003, 91, 361–371.
18. Blackman, F.F. Optima and limiting factors. Ann. Bot. 1905, 19, 281–295.
19. Sinclair, T.R.; Horie, T. Leaf nitrogen, photosynthesis, and crop radiation use efficiency: A review. Crop Sci. 1989, 29, 90–98.
20. van’t Hoff, J.H. Lectures on Theoretical and Physical Chemistry. Part 1: Chemical Dynamics; Edward Arnold: London, UK, 1898.
21. Arrhenius, S. Über die Reaktionsgeschwindigkeit bei der Inversion von Rohrzucker durch Säuren. Z. Für Phys. Chem. 1889, 4, 226–248.
22. Ratkowsky, D.A.; Olley, J.; McMeekin, T.A.; Ball, A. Relationship between temperature and growth rate of bacterial cultures. J. Bacteriol. 1982, 149, 1–5.
23. Lloyd, J.; Taylor, J.A. On the temperature dependence of soil respiration. Funct. Ecol. 1994, 8, 315–323.
24. Yin, X.; Kroff, M.J.; McLean, G.; Visperas, R.M. A nonlinear model for crop development as a function of temperature. Agric. For. Meteorol. 1995, 77, 1–16.
25. Hu, Y.; Tao, V.; Croitoru, A. Understanding the rational function model: Methods and applications. Int. Arch. Photogramm. Remote Sens. 2004, 20, 119–124.
26. Braverman, E.; Kinzebulatov, D. On linear perturbations of the Ricker model. Math. Biosci. 2006, 202, 323–339.
27. Nijland, G.O.; Schouls, J.; Goudriaan, J. Integrating the production functions of Liebig, Michaelis-Menten, Mitscherlich and Liebscher into one system dynamics model. NJAS-Wagening. J. Life Sci. 2008, 55, 199–224.
28. Ye, Z.; Zhao, Z. A modified rectangular hyperbola to describe the light-response curve of photosynthesis of Bidens pilosa L. grown under low and high light conditions. Front. Agric. China 2010, 4, 50–55.
29. Bernardo, J.M.; Smith, A.F.M. Bayesian Theory; John Wiley & Sons: New York, NY, USA, 1994.
30. Gelfand, A.E.; Smith, A.F.M. Sampling based approaches to calculating marginal densities. J. Am. Stat. Assoc. 1990, 85, 398–409.
31. Casella, G.; George, E.I. Explaining the Gibbs sampler. Am. Stat. 1992, 46, 167–174.
32. Chib, S.; Greenberg, E. Understanding the Metropolis-Hastings algorithm. Am. Stat. 1995, 49, 327–335.
33. Gilks, W.R.; Best, N.G.; Tan, K.K. Adaptive rejection Metropolis sampling within Gibbs sampling. J. R. Stat. Soc. Ser. C (Appl. Stat.) 1995, 44, 455–472.
34. Seber, G.A.F.; Lee, A.J. Linear Regression Analysis, 2nd ed.; John Wiley & Sons: New York, NY, USA, 2003.
35. Ratkowsky, D.A. Nonlinear Regression Modelling: A Unified Practical Approach; Marcel Dekker: New York, NY, USA, 1983.
36. Seber, G.A.F.; Wild, C.J. Nonlinear Regression; John Wiley & Sons: New York, NY, USA, 1989.
37. Koop, G.; Poirier, D.J. Bayesian variants of some classical semiparametric regression techniques. J. Econom. 2004, 123, 259–282.
38. Munkin, M.; Trivedi, P. Bayesian analysis of the ordered probit model with endogenous selection. J. Econom. 2008, 143, 334–348.
39. Feng, L.; Munkin, M. Bayesian semiparametric analysis on the relationship between BMI and income for rural and urban workers in China. J. Appl. Stat. 2021.
40. Heidelberger, P.; Welch, P.D. Simulation run length control in the presence of an initial transient. Oper. Res. 1983, 31, 1109–1144.
41. Geweke, J. Evaluating the accuracy of sampling-based approaches to the calculation of posterior moments. J. R. Stat. Soc. 1994, 56, 501–514.
42. R Development Core Team. R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing: Vienna, Austria, 2020.
43. Carlin, B.P.; Louis, T.A. Bayes and Empirical Bayes Methods for Data Analysis; Chapman & Hall: Boca Raton, FL, USA, 2001.
44. Brooks, S.P. Discussion on the paper by Spiegelhalter, Best, Carlin, and van der Linde. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 2002, 64, 616–639.
Figure 1. Predicted responses vs. designed covariate (z) for the artificial dataset.
Figure 2. Fitted means for the daily number of COVID-19 cases (left panel) and deaths (right panel).
Figure 3. Fitted means for the monthly number of tuberculosis cases.
Table 1. Groups of non-linear models in agricultural applications.

Group | Models | Published Works
I | Exponential Functions | [11,12]
II | Sigmoids (e.g., Logistic Function) | [13,14,15,16,17]
III | Asymptotic Exponential; Modified Logistic; Photosynthesis | [18,19]
IV | Modified Arrhenius; Temperature Dependencies; van’t Hoff (Q10 Function) | [20,21,22,23]
V | Bell-shaped Curves; Gaussian Function | [24]
VI | Michaelis–Menten; Modified Hyperbola; Power Functions; Rational Functions; Ricker Curve | [8,25,26,27,28]
Table 2. Artificial dataset from a hypothetical experiment with 21 subjects.

y | x | z | y | x | z
12 | 1 | 10.0375 | 31 | 12 | 8.9171
14 | 2 | 11.4128 | 29 | 13 | 10.0933
15 | 3 | 9.8035 | 27 | 14 | 11.9097
18 | 4 | 9.9774 | 25 | 15 | 11.0709
20 | 5 | 9.0706 | 20 | 16 | 10.3041
21 | 6 | 10.8220 | 19 | 17 | 9.4895
22 | 7 | 9.6170 | 18 | 18 | 9.1792
25 | 8 | 9.1354 | 16 | 19 | 9.5295
28 | 9 | 10.0180 | 15 | 20 | 9.2414
30 | 10 | 10.1596 | 10 | 21 | 10.3354
34 | 11 | 10.3520 | - | - | -
Table 3. Posterior parameter estimates and 95% credible intervals for the artificial dataset.

Model | Parameter | Mean | Std. Dev. | 95% CI (Lower) | 95% CI (Upper)
(6) | α0 | 1.3780 | 2.7771 | −3.7411 | 7.0241
    | α1 | 3.6490 | 0.4037 | 2.8950 | 4.4070
    | α2 | −0.1659 | 0.0181 | −0.1990 | −0.1297
    | β | 0.5975 | 0.3185 | −0.0185 | 1.2420
    | ζ | 0.1697 | 0.0592 | 0.0756 | 0.3008
(7) | α1 | 1.0000 | 0.0029 | 0.9944 | 1.0061
    | α2 | 0.9990 | 0.0121 | 0.9749 | 1.0230
    | α3 | 0.0012 | 0.0257 | −0.0468 | 0.0518
    | β | −0.0002 | 0.0064 | −0.0131 | 0.0122
    | ζ | 85.2600 | 29.6000 | 39.1100 | 152.5000
Table 4. Posterior comparison criteria for the fitted models for the artificial dataset.

Model | k | pD | DIC | EAIC | EBIC
(6) | 5 | 4.826 | 103.986 | 109.160 | 114.382
(7) | 5 | 5.022 | 59.858 | 69.413 | 79.857
Table 5. Posterior parameter estimates and 95% credible intervals for the COVID-19 dataset.

Count | Model | Parameter | Mean | Std. Dev. | 95% CI (Lower) | 95% CI (Upper)
Cases | (8) | β0 | 0.0371 | 1.0080 | −1.8620 | 1.9800
      |     | β1 | 0.8959 | 0.7839 | −0.6451 | 2.5610
      |     | β2 | 1.3070 | 0.0540 | 1.2100 | 1.4230
      |     | β3 | −0.3089 | 0.0545 | −0.4249 | −0.2088
      |     | ζ | <0.0001 | <0.0001 | <0.0001 | <0.0001
      | (9) | α1 | 1.0060 | 0.0021 | 1.0021 | 1.0100
      |     | α2 | −0.1662 | 0.0222 | −0.2113 | −0.1265
      |     | α3 | −5.3710 | 0.5814 | −6.5710 | −4.3110
      |     | ζ | <0.0001 | <0.0001 | <0.0001 | <0.0001
Deaths | (8) | β0 | −0.0296 | 0.9695 | −1.9500 | 1.8390
      |     | β1 | 0.0161 | 0.0185 | −0.0214 | 0.0556
      |     | β2 | 1.2930 | 0.0550 | 1.1940 | 1.4080
      |     | β3 | −0.2908 | 0.0558 | −0.4049 | −0.1893
      |     | ζ | 0.0009 | <0.0001 | <0.0001 | <0.0001
      | (9) | α1 | 1.0101 | 0.0018 | 1.0060 | 1.0130
      |     | α2 | −0.1681 | 0.0221 | −0.2136 | −0.1287
      |     | α3 | −5.4050 | 0.5801 | −6.5920 | −4.3460
      |     | ζ | 0.0011 | <0.0001 | <0.0001 | <0.0001
Table 6. Posterior comparison criteria for the fitted models for the COVID-19 dataset.

Count | Model | k | pD | DIC | EAIC | EBIC
Cases | (8) | 5 | 3.084 | 6237.000 | 6243.637 | 6263.040
      | (9) | 4 | 3.186 | 6194.920 | 6193.722 | 6197.602
Deaths | (8) | 5 | 4.174 | 4104.000 | 4109.994 | 4129.397
      | (9) | 4 | 3.945 | 4057.920 | 4060.384 | 4063.967
Table 7. Posterior parameter estimates and 95% credible intervals for the tuberculosis dataset.

Model | Parameter | Mean | Std. Dev. | 95% CI (Lower) | 95% CI (Upper)
(10) | α0 | 0.7285 | 0.7262 | −0.5683 | 1.9650
     | α1 | 0.0007 | 0.0008 | <−0.0001 | 0.0023
     | α2 | <0.0001 | <0.0001 | <−0.0001 | <−0.0001
     | α3 | <0.0001 | <0.0001 | <0.0001 | <0.0001
     | β | 0.0041 | 0.0004 | 0.0035 | 0.0047
     | ζ | 208.1000 | 19.8200 | 172.4000 | 249.7000
(11) | α1 | 0.9998 | 0.0009 | 0.9982 | 1.0000
     | α2 | −0.1817 | 0.0197 | −0.2273 | −0.1478
     | α3 | −6.2030 | 0.5525 | −7.3740 | −5.1280
     | β | 0.0002 | 0.0007 | −0.0012 | 0.0015
     | ζ | 340.1000 | 34.2600 | 275.5000 | 411.0000
Table 8. Posterior comparison criteria for the fitted models for the tuberculosis dataset.

Model | k | pD | DIC | EAIC | EBIC
(10) | 6 | 4.171 | 3316.000 | 3323.942 | 3344.194
(11) | 5 | 4.121 | 1582.000 | 1596.344 | 1613.220
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
