Penalized Spline Estimator for Semiparametric Binary Logistic Regression Model with Application to Coronary Heart Disease Risk Factors

Chamidah, Nur; Rifada, Marisa; Lestari, Budi; Aydin, Dursun; Siregar, Naufal Ramadhan Al Akhwal

doi:10.3390/sym18030432


by Nur Chamidah ^1,2,*, Marisa Rifada ^1,2, Budi Lestari ^2,3, Dursun Aydin ^4,5 and Naufal Ramadhan Al Akhwal Siregar ^6

^1 Department of Mathematics, Faculty of Science and Technology, Airlangga University, Surabaya 60115, Indonesia

^2 Research Group of Statistical Modeling in Life Science, Faculty of Science and Technology, Airlangga University, Surabaya 60115, Indonesia

^3 Department of Mathematics, Faculty of Mathematics and Natural Sciences, The University of Jember, Jember 68121, Indonesia

^4 Department of Statistics, Faculty of Science, Muğla Sıtkı Koçman University, Muğla 48000, Turkey

^5 Department of Mathematics, University of Wisconsin, Oshkosh Algoma Blvd, Oshkosh, WI 54901, USA

^6 Master of Mathematics Study Program, Department of Mathematics, Faculty of Science and Technology, Airlangga University, Surabaya 60115, Indonesia

^* Author to whom correspondence should be addressed.

Symmetry 2026, 18(3), 432; https://doi.org/10.3390/sym18030432

Submission received: 16 December 2025 / Revised: 4 February 2026 / Accepted: 12 February 2026 / Published: 28 February 2026

(This article belongs to the Section Mathematics)


Abstract

In this study, we develop a regression method, the Semiparametric Binary Logistic Regression (SBLR) model, which extends classical logistic regression by integrating parametric and nonparametric components so that linear and non-linear relationships can be modeled simultaneously. To estimate the nonparametric component, which takes the form of a non-linear (sigmoid) curve, we use the penalized spline, a smoothing technique valued in nonparametric regression for producing smooth curves that adapt to fluctuating data. In this technique, the choice of the smoothing parameter plays an important role in fitting the model. This choice is commonly based on the minimum of the ordinary Cross-Validation (CV) or Generalized Cross-Validation (GCV) score. However, CV and GCV are not directly applicable here because the log-likelihood function is non-quadratic; in such cases the CV and GCV curves may decline continuously without ever rising, so that no minimum is attained. We therefore use the Generalized Approximate Cross-Validation (GACV) criterion, which distinguishes this study from previous studies that used the CV or GCV criterion. In the application to real data, we build an SBLR model of Coronary Heart Disease (CHD) risk factors that can be used for prediction and interpretation. The results demonstrate that the proposed method identifies critical non-linear thresholds for CHD risk factors and is statistically valid and highly effective for CHD risk prediction. In the future, these results could serve as the basis of an early warning system that alerts individuals with moderate stress levels and dietary habits exceeding the identified thresholds to their heightened probability of developing CHD. In addition, this research aligns with the third Sustainable Development Goal (SDG), namely, the reduction in premature mortality from non-communicable diseases by 2030.

Keywords:

coronary heart disease; generalized approximate cross-validation; non-communicable diseases; odds ratio; penalized spline; semiparametric binary logistic regression

1. Introduction

Logistic regression is one of the most fundamental tools in statistical modeling due to its strong performance in both prediction and classification tasks [1,2,3,4]. A common variant, Binary Logistic Regression (BLR), relies on an assumed functional relationship between the covariates and the logit of the response [5,6,7,8]. Despite its popularity, the parametric logistic regression formulation is constrained by the assumption of linearity in the logit, which may lead to biased or inefficient estimates when the true relationship deviates from this structure [9,10,11]. To address these limitations, nonparametric logistic regression approaches provide a flexible alternative because they do not impose a predefined functional form on the association between the predictor and response variables [12,13]. Instead, these methods construct smooth curves driven entirely by the observed data, minimizing potential bias in modeling [14,15]. A wide range of smoothing techniques has been developed for nonparametric regression, including kernel estimators [16], local linear estimators [17,18], local polynomial estimators [19,20], truncated spline estimators [21], penalized spline estimators [22], least squares splines [23], smoothing splines [24,25,26], Fourier series estimators [27], and mixed estimators [28]. Among these techniques, penalized spline estimators are particularly advantageous because they effectively capture complex non-linear patterns while ensuring smooth estimated curves through the Penalized Least Squares (PLS) criterion, regulated by a smoothing parameter λ [22].

However, in practical application, the functional association between response variables and certain predictors may exhibit specific patterns, while others may not. Although the nonparametric approach offers significant flexibility, it can suffer from dimensionality issues and reduced interpretability as the degree of the polynomial in each predictor variable increases [29,30]. To balance model flexibility and simplicity, a semiparametric logistic regression emerges as an alternative solution [31,32]. This approach combines components of both parametric and nonparametric approaches into a single model structure, allowing for some predictors to enter linearly while others are modeled through data-driven smooth functions [33,34]. This maintains the interpretability of parametric effects while accommodating non-linear relationships where necessary. Therefore, semiparametric regression models are better suited to capture such data structures [35,36].

Binary Logistic Regression (BLR) is a statistical method for predicting outcomes with only two possible categories, such as “yes/no”, by modeling the probability of an event as a function of one or more independent variables (predictors). It uses the logistic (sigmoid) function to convert the output into a probability between 0 and 1, and it has become an important tool for binary classification in fields such as health, finance, and marketing. If the functional association between the response variable and some of the predictors follows a specific curve, such as a linear curve, while the remaining predictors form a non-linear or sigmoid function, the resulting model is called the Semiparametric Binary Logistic Regression (SBLR) model. In other words, in the SBLR model, the functional association between the predictors and the binary response is formulated through a decomposition into linear and smooth functional components. These smooth components are commonly represented using penalized spline bases, which ensure both flexibility and smoothness through appropriate regularization [37]. SBLR models based on penalized spline estimators have gained considerable attention due to their computational efficiency and favorable theoretical properties. These models use a spline-based representation for the nonparametric component, regulated by a smoothing parameter that controls the balance between fit and smoothness [38]. As a result, the SBLR model can capture complex non-linear patterns without sacrificing the stability associated with parametric estimation [36]. This hybrid structure is particularly advantageous in biomedical and epidemiological applications, where certain risk factors exhibit well-established linear effects, whereas others demonstrate inherently non-linear behavior.

A flexible and accurate prediction model that combines the SBLR method with a data smoothing technique, namely the penalized spline estimator, can be applied to predicting the risk of diseases such as Coronary Heart Disease (CHD), which is a type of Cardiovascular Disease (CVD). CVD is one of the major non-communicable diseases (NCDs) in the world. In many developed countries, the proportion of deaths caused by CVD decreased from around 48% in 1990 to about 43% in 2010. In contrast, developing countries experienced an increase over the same period, with the share rising from 18% to 25% [39]. Indonesia, as one of the developing countries, has shown a similar upward trend, with CVD becoming a major contributor to mortality over the past two decades [40]. Among the various types of CVD, CHD stands out as the main cause of death, contributing around 32% of the total number of deaths in the world. According to the WHO (World Health Organization), there were 17.9 million deaths caused by CVD in 2018, of which 85% were due to CHD [41]. Data from basic health research in Indonesia from 2013 and 2018 also indicated an increasing trend in the prevalence of heart disease, namely from 0.5% in 2013 to 1.5% in 2018 [42].

One potential strategy for early diagnosis and intervention is the development of risk prediction models for CHD. Numerous studies have demonstrated that CHD risk is influenced by multiple predictors, shaped by a complex interplay of demographic, metabolic, dietary, and psychosocial determinants. As atherosclerosis is the primary pathological mechanism underlying CHD, the disease is often associated with degenerative processes and is more common among individuals with established risk factors and older adults. Age, in particular, is consistently identified as one of the strongest predictors of CHD due to cumulative vascular degeneration over time [43,44], where the 65–74 year age group had the highest observed prevalence [45]. Anthropometric measures, especially body weight and height, have also been shown to correlate strongly with CHD risk, as they reflect an individual’s metabolic load and potential predisposition to cardio-metabolic disorders [46]. Furthermore, lifestyle-related behaviors further contribute to this burden. The Indonesian Cardiology Society (PERKI) emphasizes that the rising prevalence of CHD in Indonesia is largely attributable to unhealthy lifestyle patterns, noting that approximately 50% of CHD patients are at risk of experiencing sudden cardiac arrest or sudden cardiac death [47]. Dietary habits also play a critical role; high sugar intake is associated with insulin resistance and metabolic syndrome, whereas the excessive consumption of saturated and trans fats accelerates the formation of atherosclerotic plaques [48]. In addition, psychological stress has been shown to exacerbate CHD risk through prolonged activation of neuroendocrine and inflammatory pathways [49], indicating that both physiological and behavioral responses contribute to overall disease progression. 
Given the multifactorial and potentially non-linear nature of these risk factors, previous studies have increasingly emphasized the need for statistical models capable of capturing both linear and non-linear relationships among CHD predictors, thereby improving estimation accuracy and predictive performance.

The novelty of the method discussed here lies in the development of the Semiparametric Binary Logistic Regression (SBLR) method, an extension of classical logistic regression that integrates both parametric and nonparametric components, allowing it to model linear and non-linear relationships simultaneously. To estimate the nonparametric component, which takes the form of a non-linear (sigmoid) curve, we use the penalized spline, a smoothing technique valued in the nonparametric approach for its ability to produce smooth and adaptive curves for fluctuating data. Research conducted by Wirtz and Von Känel [50] highlighted that penalized splines can produce smooth curves that are also adaptive to data patterns, making them highly suitable for the classification of heterogeneous medical data. Here, we not only develop a theoretical penalized spline estimator that uses the Generalized Approximate Cross-Validation (GACV) criterion to estimate the SBLR model but also apply it to estimate the SBLR model of CHD risk factors, which can be used for prediction, interpretation, and prevention purposes. The GACV criterion is used in the present study because the classical criteria, namely CV (Cross-Validation) and GCV (Generalized Cross-Validation), cannot be applied directly owing to the non-quadratic nature of the log-likelihood function. This differentiates the present study from previous studies, which used the CV and GCV criteria.

Explicitly, the fundamental novelty of this study lies in two main aspects: (a) Methodological Contribution: While previous studies have extensively used standard spline estimators (such as B-Spline) in semiparametric regression, many suffer from subjectivity in selecting the optimal number and location of knots. This study develops a Penalized Spline (P-Spline) estimator specifically for the SBLR model. The P-Spline approach introduces a penalty term that automatically controls the smoothness of the curve, overcoming the over-fitting or under-fitting issues often found in ordinary spline methods. This derivation for the binary logistic context is the core mathematical novelty of our work. (b) Practical Significance: We demonstrate the significance of this estimator by applying it to CHD risk factors. Standard parametric logistic regression assumes a linear relationship between predictors and the log-odds of CHD, which is often unrealistic in medical data. Our proposed model successfully captures the complex, non-linear patterns of CHD risk factors with higher accuracy than standard models, providing more reliable insights for medical interventions.

A majority of the existing studies on CHD risk modeling rely heavily on Parametric Binary Logistic Regression. While useful, this approach assumes a strict linear relationship between the logit of the response and the predictors, which often fails to capture complex, non-linear biological patterns. On the other hand, standard nonparametric spline methods often suffer from subjectivity in knot selection, leading to potential over-fitting. Our proposed method fills this gap by utilizing a Penalized Spline (P-Spline) Estimator. This method overcomes previous limitations by incorporating a penalty term that automatically balances the trade-off between goodness-of-fit and smoothness, removing the subjectivity of knot selection and providing a more robust model for non-linear medical data.

Furthermore, the methods discussed by other previous studies, such as Harrell [51] and Gauthier et al. [52], primarily rely on regression splines (e.g., Restricted Cubic Splines or B-splines) where the smoothness is determined by the number and location of knots selected a priori by the user. This approach is technically parametric. In contrast, the proposed SBLR method utilizes a Penalized Spline (P-Spline) approach. We employ a high number of knots combined with a roughness penalty term in the likelihood function. The smoothness is not controlled by knot selection, but by a smoothing parameter (λ) optimized via GACV. This allows for a more adaptive fit to the data structure compared to the fixed-knot approach in the cited literature. Additionally, the cited references guide researchers on how to apply existing functions available in standard R packages (e.g., the rms package by Harrell [51]). The novelty of the proposed SBLR method lies in the mathematical derivation and computational construction of the estimator itself. We do not rely on pre-existing functions to estimate the model parameters. Instead, we have manually derived the penalized likelihood, the Gradient vector, and the Hessian matrix, and constructed our own Iteratively Reweighted Penalized Least Squares (IRPLS) algorithm from scratch. This allows for greater flexibility in controlling the estimation process and integrating the specific GACV criteria, which is not standard in the basic glm. Thus, the proposed method not only advances statistical modeling methodology but also provides practical insights for managing and mitigating CHD risk. Moreover, this research aligns with the third goal of the Sustainable Development Goals (SDGs), namely, premature mortality reduction from non-communicable diseases by 2030. 
In other words, the proposed SBLR model has practical implications for SDG target 3.4, which aims to reduce premature mortality from non-communicable diseases, specifically highlighting early screening and risk stratification. Unlike standard models that may under-estimate risk due to linearity assumptions, the proposed SBLR model accurately identifies non-linear risk patterns. This allows for the development of a more precise “Early Warning System” where patients can be stratified based on calculated probabilities. High-risk individuals identified by the model can be prioritized for further medical attention, which is crucial for preventative care.

2. Materials and Methods

We first discuss, theoretically, the estimation of the SBLR model parameters by a penalized likelihood method based on the Penalized Spline (P-Spline) estimator. The P-Spline estimator is particularly appropriate for this study because it addresses the “curse of dimensionality” and the sensitivity to knot placement found in ordinary B-splines or Smoothing Splines. In the context of medical risk factors, interpretability and curve smoothness are crucial. The P-Spline method allows for flexible data fitting without the erratic fluctuations seen in unpenalized methods. Compared to fully Bayesian approaches, P-Spline is also computationally more efficient while still providing robust estimation for the binary response variable suitable for CHD classification. If the resulting estimating equations are implicit, that is, they have no closed-form solution, we obtain the SBLR parameter estimates numerically using the Iteratively Reweighted Penalized Least Squares (IRPLS) method [53]. Next, we apply the estimation method to medical data to estimate the SBLR model of CHD risk as influenced by age, body weight, body height, stress level, and consumption of sugar and fat, through the following stages: (i) describe the characteristics of the data; (ii) perform the Eta correlation test; (iii) determine the optimal value of the smoothing parameter based on the GACV criterion; (iv) estimate the SBLR model of CHD risk using R code; (v) calculate the deviance and investigate the stability of the classification model; (vi) determine the optimal classification threshold based on Youden’s J index; (vii) calculate the accuracy, sensitivity, specificity, and AUC (Area Under the Curve). All statistical computations and algorithm implementations were performed in the open-source R programming environment using RStudio (version 2025.09.02). Additionally, in Sections 2.1 to 2.4, we provide a brief review of the concepts of the Semiparametric Binary Logistic Regression (SBLR) model, truncated spline bases, the penalized log-likelihood method, and Generalized Approximate Cross-Validation (GACV) that are used to achieve the goals of this study.
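Stage (vi) can be illustrated with a minimal sketch of threshold selection by Youden's J index, J = sensitivity + specificity - 1. The sketch is written in Python purely for illustration and is not the R implementation used in this study; the function name `youden_threshold` and the toy labels and probabilities are hypothetical.

```python
def youden_threshold(y_true, prob):
    """Return (threshold, J) maximizing J = sensitivity + specificity - 1.

    Assumes both classes (0 and 1) are present in y_true; an observation is
    classified as positive when its predicted probability is >= threshold.
    """
    best_thr, best_j = None, -1.0
    for thr in sorted(set(prob)):  # each distinct predicted probability is a candidate
        tp = sum(1 for y, p in zip(y_true, prob) if y == 1 and p >= thr)
        fn = sum(1 for y, p in zip(y_true, prob) if y == 1 and p < thr)
        tn = sum(1 for y, p in zip(y_true, prob) if y == 0 and p < thr)
        fp = sum(1 for y, p in zip(y_true, prob) if y == 0 and p >= thr)
        sensitivity = tp / (tp + fn)
        specificity = tn / (tn + fp)
        j = sensitivity + specificity - 1.0
        if j > best_j:  # keep the first threshold that attains the maximum
            best_thr, best_j = thr, j
    return best_thr, best_j


# Toy data, for illustration only (not taken from the CHD data set).
y_true = [0, 0, 0, 1, 0, 1, 1]
prob = [0.10, 0.20, 0.30, 0.35, 0.40, 0.70, 0.90]
thr, j = youden_threshold(y_true, prob)
print(thr, j)
```

Accuracy, sensitivity, and specificity in stage (vii) follow from the same four counts evaluated at the selected threshold.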

2.1. Semiparametric Binary Logistic Regression Model

The Semiparametric Binary Logistic Regression (SBLR) model is a regression model that combines parametric and nonparametric components and is used to analyze binary response variables. It can be estimated within a framework that integrates the GLM (Generalized Linear Model) and the GAM (Generalized Additive Model), namely the GAPLM (Generalized Additive Partially Linear Model) [54]. Let $y_{i}$ be the binary response variable for the i-th observation $(i = 1, 2, \dots, n)$, assumed to follow a Bernoulli distribution, $y_{i} \sim B(1, π_{i})$, where $π_{i} = P(y_{i} = 1 \mid x_{i}, t_{i})$. Let $x_{i} = {(1, x_{1 i}, \dots, x_{p i})}^{T}$ denote the vector of predictors modeled parametrically, and $t_{i} = {(t_{1 i}, \dots, t_{q i})}^{T}$ denote the vector of predictors modeled nonparametrically using smooth functions.

The functional association between the conditional mean of the response variable and the predictors is modeled through a logit link function, so the SBLR model can be expressed as an additive model as follows [55]:

η_{i} = \ln \left( \frac{π_{i}}{1 - π_{i}} \right) = x_{i}^{T} β + \sum_{j = 1}^{q} g_{j} (t_{j i})

(1)

where $η_{i}$ is the logit of $π_{i}$, and $g_{j}(\cdot)$ are the unknown smooth functions associated with the continuous predictors $t_{j i}$ in the nonparametric component.
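As a small numerical illustration of Equation (1), the following Python sketch evaluates the logit for one observation and maps it back to a probability. It is only a sketch under stated assumptions: the helper names `logit_eta` and `inverse_logit` and all numerical values are hypothetical, and the smooth terms are taken as already evaluated.

```python
import math


def logit_eta(x, beta, smooth_terms):
    """Equation (1): x^T beta plus the sum of the smooth-term values g_j(t_ji)."""
    return sum(xk * bk for xk, bk in zip(x, beta)) + sum(smooth_terms)


def inverse_logit(eta):
    """Probability pi = exp(eta) / (1 + exp(eta)), computed without overflow."""
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    z = math.exp(eta)
    return z / (1.0 + z)


# Hypothetical values, for illustration only.
x_i = [1.0, 0.5]        # the leading 1 carries the intercept
beta = [-1.0, 2.0]
g_values = [0.3, -0.3]  # g_1(t_1i) and g_2(t_2i), assumed already evaluated

eta_i = logit_eta(x_i, beta, g_values)
pi_i = inverse_logit(eta_i)
print(eta_i, pi_i)
```

Applying the logit to `pi_i` recovers `eta_i`, which is the sense in which the link function connects the probability scale to the additive predictor.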

2.2. Truncated Spline Bases

One common way of representing spline functions, including penalized splines, is through truncated spline bases [37]. For a polynomial spline of degree $d_{j}$ with $r_{j}$ knots located at $ξ_{j 1}, ξ_{j 2}, ξ_{j 3}, \dots, ξ_{j r_{j}}$, the function $g_{j}(t_{j i})$ can be expressed as follows:

g_{j} (t_{j i}) = θ_{j 0} + θ_{j 1} t_{j i} + \dots + θ_{j d_{j}} t_{j i}^{d_{j}} + \sum_{k_{j} = 1}^{r_{j}} θ_{j (d_{j} + k_{j})} {(t_{j i} - ξ_{j k_{j}})}_{+}^{d_{j}}

(2)

In this study, the unknown functions $g_{j}(t_{j i})$ are approximated using penalized splines (P-splines) based on the truncated linear spline basis. For a specific predictor $t_{j i}$ with degree 1 and a set of knots $ξ_{j 1}, ξ_{j 2}, ξ_{j 3}, \dots, ξ_{j r_{j}}$, the function $g_{j}(t_{j i})$ can be represented as:

g_{j} (t_{j i}) = θ_{j 0} + θ_{j 1} t_{j i} + \sum_{k_{j} = 1}^{r_{j}} θ_{j (1 + k_{j})} {(t_{j i} - ξ_{j k_{j}})}_{+}

(3)

where

{(t_{j i} - ξ_{j k_{j}})}_{+} = \begin{cases} t_{j i} - ξ_{j k_{j}}, & t_{j i} \geq ξ_{j k_{j}} \\ 0, & t_{j i} < ξ_{j k_{j}} \end{cases}

and $θ_{j (1 + k_{j})}$ are the coefficients of the basis functions associated with the knots.

Here, we selected linear splines (degree d = 1) primarily for clinical interpretability. While cubic splines (degree d = 3) produce smoother curves, the resulting coefficients are often complex and difficult to translate into actionable medical insights. In contrast, linear splines provide a piecewise linear structure, allowing us to estimate distinct slopes (gradients) for different intervals. This makes it straightforward to explain to medical practitioners how the risk changes (e.g., “Risk increases sharply after threshold X”) compared to the more abstract curvature of cubic splines. Furthermore, linear splines avoid the boundary artifacts (“wiggliness”) sometimes observed with higher-order splines, which is crucial for reliable risk factor analysis. Also, we employed quantile-based knots rather than equidistant knots to ensure robustness against data sparsity. Medical data often have skewed distributions; using quantiles ensures that each interval contains a sufficient and comparable number of observations, stabilizing the estimation of the spline coefficients.
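The piecewise-linear interpretation described above can be checked with a short Python sketch of the truncated linear basis of Equation (3) with quantile-based knots. This is an illustration under assumptions, not the authors' R code: the exact quantile rule (equally spaced quantiles of the unique values with linear interpolation), the function names, and the toy data and coefficients are all hypothetical.

```python
def quantile_knots(values, r):
    """Place r knots at equally spaced sample quantiles of the unique values (assumed rule)."""
    u = sorted(set(values))
    knots = []
    for k in range(1, r + 1):
        pos = k * (len(u) - 1) / (r + 1)  # fractional position in the sorted unique values
        lo = int(pos)
        hi = min(lo + 1, len(u) - 1)
        knots.append(u[lo] + (pos - lo) * (u[hi] - u[lo]))  # linear interpolation
    return knots


def truncated_linear_row(t, knots):
    """Basis row [t, (t - xi_1)_+, ..., (t - xi_r)_+], without the intercept."""
    return [t] + [max(t - xi, 0.0) for xi in knots]


def g_linear(t, theta, knots):
    """Equation (3): theta_0 + theta_1 * t + sum_k theta_(1+k) * (t - xi_k)_+."""
    row = [1.0] + truncated_linear_row(t, knots)
    return sum(c * b for c, b in zip(theta, row))


ages = [30, 35, 40, 45, 50, 55, 60, 65, 70]  # hypothetical predictor values
knots = quantile_knots(ages, 2)
theta = [0.0, 0.1, 0.2, -0.5]                # hypothetical coefficients
print(knots)
print(g_linear(50, theta, knots))
```

With these coefficients the slope is theta_1 below the first knot, theta_1 + theta_2 between the knots, and theta_1 + theta_2 + theta_3 above the second knot, which is the interval-wise gradient reading that motivates the choice of degree one.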

2.3. Penalized Log-Likelihood Method

The penalized log-likelihood method is a statistical method that modifies the standard MLE (Maximum Likelihood Estimation) by subtracting a penalty term from the log-likelihood function that aims to prevent overfitting, improve model stability, and handle complex models, namely models with many parameters [55,56]. Penalized splines have emerged as a flexible and robust tool for modeling non-linear relationships in regression analysis. They are constructed from piecewise polynomial segments that are smoothly joined at specific knot locations to ensure a desired degree of continuity. In most applications, the knot locations are determined using the sample quantiles of the unique values of the explanatory variable, which allows the method to adapt to the data distribution. To address the inherent trade-off between flexibility and overfitting, P-splines generally employ a relatively large number of knots to capture complex data structures. However, to avoid excessive fluctuations, a penalty term is introduced into the likelihood function to control smoothness and stabilize estimation [38]. In spline-based regression models, the estimation process can be performed using the penalized log-likelihood method. This penalized log-likelihood method modifies the standard log-likelihood function by incorporating a penalty term that constrains the roughness of the fitted curve. The general formula of the penalized log-likelihood is given as follows [38,56,57]:

l_{p} (ω) = l (ω) - \frac{λ}{2} ω^{T} S ω

(4)
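To make the role of the penalty term in Equation (4) concrete, the following Python sketch evaluates the penalized log-likelihood for several values of λ. It is a sketch under assumptions: the penalty matrix S below (zeros for unpenalized coefficients, ones on the diagonal for knot coefficients) is one common choice and is not confirmed by this part of the text, and the value of l(ω) is a placeholder rather than a quantity computed from data.

```python
def quadratic_penalty(omega, S):
    """Return omega^T S omega for a square matrix S given as a list of rows."""
    n = len(omega)
    return sum(omega[a] * S[a][b] * omega[b] for a in range(n) for b in range(n))


def penalized_log_likelihood(loglik, omega, S, lam):
    """Equation (4): l_p(omega) = l(omega) - (lambda / 2) * omega^T S omega."""
    return loglik - 0.5 * lam * quadratic_penalty(omega, S)


# Hypothetical setup: two unpenalized coefficients followed by two knot coefficients.
omega = [0.4, -1.0, 2.0, -3.0]
S = [[0, 0, 0, 0],
     [0, 0, 0, 0],
     [0, 0, 1, 0],
     [0, 0, 0, 1]]
loglik = -12.5  # placeholder for l(omega), not computed from data

for lam in (0.0, 0.1, 1.0):
    print(lam, penalized_log_likelihood(loglik, omega, S, lam))
```

Only the last two entries of ω contribute to the penalty here, so increasing λ lowers the objective in proportion to the size of the knot coefficients while leaving the first two coefficients unconstrained.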

2.4. Generalized Approximate Cross-Validation (GACV)

The estimation of the parameters of the SBLR model using a penalized spline estimator relies heavily on the selection of the smoothing parameter λ. This parameter plays a crucial role in controlling the balance between the fit and the smoothness of the spline curve. A small value of λ tends to produce a rough curve that over-fits the data, whereas a large value of λ results in an overly smooth curve that may lead to under-fitting (high bias). Therefore, an optimal selection method is required to determine the value of λ that balances bias and variance.

For non-Gaussian data, such as the binary response in this study, the ordinary CV (Cross-Validation) and GCV (Generalized Cross-Validation) methods developed for linear regression are not directly applicable because of the non-quadratic nature of the log-likelihood function. While standard GCV is computationally efficient for linear models, it is known to be less stable for non-Gaussian data (such as binary responses in logistic regression) and can lead to under-smoothing. This problem can be addressed by employing the GACV (Generalized Approximate Cross-Validation) method, which is an approximation to LOOCV (Leave-One-Out Cross-Validation) adapted for Generalized Linear Models (GLMs). We highlight that this approach offers more direct control over the penalty structure and provides better interpretability for the specific risk thresholds associated with CHD factors. The optimal smoothing parameter λ is obtained by minimizing the GACV score function. Based on the formulation by Xiang and Wahba [58], the GACV criterion for a penalized likelihood model is defined as follows:

\mathrm{GACV} (λ) = - \frac{1}{n} l (\hat{ω}) + \frac{\mathrm{tr} (H)}{n} \frac{y^{T} (y - \hat{π})}{n - \mathrm{tr} (W^{1 / 2} H W^{1 / 2})} .

(5)

Equation (5) can be rewritten as follows:

\mathrm{GACV} (λ) = \frac{1}{n} \sum_{i = 1}^{n} \left( - y_{i} η_{λ} (x_{i}, t_{i}) + \log (1 + e^{η_{λ} (x_{i}, t_{i})}) \right) + \frac{\mathrm{tr} (H)}{n} \frac{\sum_{i = 1}^{n} y_{i} (y_{i} - π_{λ} (x_{i}, t_{i}))}{n - \mathrm{tr} (W^{1 / 2} H W^{1 / 2})} .

(6)
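The arithmetic of Equation (6) can be sketched in Python as follows. The matrices H and W are not defined in this part of the text, so the sketch takes the two traces as given inputs rather than computing them; the function name `gacv_score` and the toy numbers are hypothetical.

```python
import math


def gacv_score(y, eta, tr_H, tr_WHW):
    """Equation (6), given fitted logits eta and the traces tr(H) and tr(W^(1/2) H W^(1/2))."""
    n = len(y)
    # First term: average of -y_i * eta_i + log(1 + exp(eta_i)), written to avoid overflow.
    obs = sum(-yi * ei + max(ei, 0.0) + math.log1p(math.exp(-abs(ei)))
              for yi, ei in zip(y, eta)) / n
    # Fitted probabilities pi_i = exp(eta_i) / (1 + exp(eta_i)), overflow-safe via tanh.
    pi = [0.5 * (1.0 + math.tanh(0.5 * ei)) for ei in eta]
    numer = sum(yi * (yi - p) for yi, p in zip(y, pi))
    return obs + (tr_H / n) * numer / (n - tr_WHW)


# Toy inputs, for illustration only; in practice eta and both traces come from
# the model fitted at a given lambda.
y = [1, 0, 1, 0]
eta = [0.0, 0.0, 0.0, 0.0]
score = gacv_score(y, eta, tr_H=2.0, tr_WHW=0.5)
print(score)
```

In a smoothing-parameter search, this score would be evaluated for each candidate λ after refitting the model, and the λ with the smallest score would be retained.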

3. Results and Discussions

Here, we discuss the theoretical estimation results of the SBLR model parameters using penalized spline and the application of the SBLR model on real data, namely, coronary heart disease (CHD) risk factors for prediction, interpretation, and prevention purposes.

3.1. Estimation of SBLR Model

Suppose we have paired observations $\{x_{i}, t_{i}, y_{i}\}$, $i = 1, 2, 3, \dots, n$, where $x_{i} = {(x_{1 i}, x_{2 i}, \dots, x_{p i})}^{T}$ is the vector of predictors in the parametric component, $t_{i} = {(t_{1 i}, t_{2 i}, \dots, t_{q i})}^{T}$ is the vector of predictors in the nonparametric component, and $y_{i}$ is a binary response variable whose distribution belongs to the exponential family; specifically, $y_{i}$ has a Bernoulli distribution, written as $y_{i} \sim B (1, π_{i} (x_{i}, t_{i}))$, $i = 1, 2, \dots, n$, with probability mass function as follows:

P (Y_{i} = y_{i}) = {π_{i} (x_{i}, t_{i})}^{y_{i}} {(1 - π_{i} (x_{i}, t_{i}))}^{1 - y_{i}}; \quad y_{i} = 0, 1; \quad 0 < π_{i} (x_{i}, t_{i}) < 1

(7)

where the probability of success is $π_{i} (x_{i}, t_{i}) = P (y_{i} = 1 \mid x_{i}, t_{i})$.

In the SBLR model, the mean of the response variable $y_{i}$ is $π_{i}$, which depends on the predictor variables $(x_{i}, t_{i})$ and is given by the following formula:

π_{i} (x_{i}, t_{i}) = \frac{e^{(x_{i}^{T} β + \sum_{j = 1}^{q} g_{j} (t_{j i}))}}{1 + e^{(x_{i}^{T} β + \sum_{j = 1}^{q} g_{j} (t_{j i}))}}

(8)

where $\sum_{j = 1}^{q} g_{j} (t_{j i})$ in Equation (8) is the regression function of the nonparametric component of the SBLR model, whose shape is unknown and which is assumed to be smooth. This regression function is estimated by using a linear multi-predictor spline estimator (i.e., a spline of degree one), with the spline function $g_{j} (t_{j i})$ given as follows:

g_{j} (t_{j i}) = γ_{j 0} + γ_{j 1} t_{j i} + \sum_{k_{j} = 1}^{r_{j}} γ_{j (1 + k_{j})} {(t_{j i} - ξ_{j k_{j}})}_{+}, \quad j = 1, 2, \dots, q

(9)

where

{(t_{j i} - ξ_{j k_{j}})}_{+} = \begin{cases} t_{j i} - ξ_{j k_{j}}, & t_{j i} \geq ξ_{j k_{j}} \\ 0, & t_{j i} < ξ_{j k_{j}} \end{cases}

Therefore, we have the splines as follows:

\left. \begin{matrix} g_{1} (t_{1 i}) = γ_{1 0} + γ_{1 1} t_{1 i} + \sum_{k_{1} = 1}^{r_{1}} γ_{1 (1 + k_{1})} {(t_{1 i} - ξ_{1 k_{1}})}_{+} \\ g_{2} (t_{2 i}) = γ_{2 0} + γ_{2 1} t_{2 i} + \sum_{k_{2} = 1}^{r_{2}} γ_{2 (1 + k_{2})} {(t_{2 i} - ξ_{2 k_{2}})}_{+} \\ ⋮ \\ g_{q} (t_{q i}) = γ_{q 0} + γ_{q 1} t_{q i} + \sum_{k_{q} = 1}^{r_{q}} γ_{q (1 + k_{q})} {(t_{q i} - ξ_{q k_{q}})}_{+} \end{matrix} \right\}

(10)

Hence, we can express the linear multi-predictor spline $\sum_{j = 1}^{q} g_{j} (t_{j i})$ presented in Equation (10) in matrix notation as follows:

g (t_{i}) = t_{i}^{T} γ = \begin{pmatrix} 1 & t_{1 i} & t_{2 i} & \dots & t_{q i} \end{pmatrix} {\begin{pmatrix} \sum_{j = 1}^{q} γ_{j 0} & γ_{1} & γ_{2} & \dots & γ_{q} \end{pmatrix}}^{T}

(11)

where each entry $t_{j i}$ in Equation (11) denotes the row vector of basis functions for the j-th predictor and each $γ_{j}$ denotes the corresponding coefficient vector:

\left. \begin{matrix} t_{1 i} = [t_{1 i} \; {(t_{1 i} - ξ_{1 1})}_{+} \; {(t_{1 i} - ξ_{1 2})}_{+} \; \dots \; {(t_{1 i} - ξ_{1 r_{1}})}_{+}] \\ t_{2 i} = [t_{2 i} \; {(t_{2 i} - ξ_{2 1})}_{+} \; {(t_{2 i} - ξ_{2 2})}_{+} \; \dots \; {(t_{2 i} - ξ_{2 r_{2}})}_{+}] \\ ⋮ \\ t_{q i} = [t_{q i} \; {(t_{q i} - ξ_{q 1})}_{+} \; {(t_{q i} - ξ_{q 2})}_{+} \; \dots \; {(t_{q i} - ξ_{q r_{q}})}_{+}] \end{matrix} \right\} and \left. \begin{matrix} γ_{1} = [γ_{1 1} \; γ_{1 2} \; \dots \; γ_{1 (1 + r_{1})}] \\ γ_{2} = [γ_{2 1} \; γ_{2 2} \; \dots \; γ_{2 (1 + r_{2})}] \\ ⋮ \\ γ_{q} = [γ_{q 1} \; γ_{q 2} \; \dots \; γ_{q (1 + r_{q})}] \end{matrix} \right\} .

Next, by substituting Equation (11) into Equation (8), we obtain the following formula:

π_{i} (x_{i}, t_{i}) = \frac{e^{(x_{i}^{T} β + t_{i}^{T} γ)}}{1 + e^{(x_{i}^{T} β + t_{i}^{T} γ)}} .

(12)

Hence, based on Equation (1), we obtain the SBLR model as follows:

η_{i} (x_{i}, t_{i}) = \ln (\frac{π_{i} (x_{i}, t_{i})}{1 - π_{i} (x_{i}, t_{i})}) = x_{i}^{T} β + t_{i}^{T} γ

(13)

Thus, for $i = 1, 2, \dots, n$, we can present Equation (13) as follows:

η = \ln \left( \frac{π (x, t)}{1 - π (x, t)} \right) = X β + T γ

(14)

where $X = [1 \; x_{1} \; x_{2} \; \dots \; x_{p}]$; $β = {[β_{0} \; β_{1} \; β_{2} \; \dots \; β_{p}]}^{T}$; $T = [1 \; t_{1} \; t_{2} \; \dots \; t_{q}]$; and $γ = {[\sum_{j = 1}^{q} γ_{j 0} \; γ_{1} \; γ_{2} \; \dots \; γ_{q}]}^{T}$.

The next step, after obtaining the matrix form in Equation (14), is to rewrite it using partitioned matrices to facilitate the statistical analysis, as follows:

η = \ln \left( \frac{π (x, t)}{1 - π (x, t)} \right) = C ω

(15)

where $C = [X ⋮ T]$ and $ω = {[β^{T} ⋮ γ^{T}]}^{T}$.

Hereinafter, we estimate the parameters

ω

in Equation (15) by employing the Penalized Maximum Likelihood Estimation (P-MLE) method. In this P-MLE method, we first determine the likelihood function of the parameters

ω

as follows:

\begin{matrix} L (ω) & = \prod_{i = 1}^{n} P (Y_{i} = y_{i}) \\ = \prod_{i = 1}^{n} π_{i} {(x_{i}, t_{i})}^{y_{i}} {(1 - π_{i} (x_{i}, t_{i}))}^{1 - y_{i}} \end{matrix}

(16)

To simplify the calculation, the likelihood function is converted into the log-likelihood function as follows:

\begin{matrix} l (ω) = & \ln (L (ω)) = \ln (\prod_{i = 1}^{n} π_{i} {(x_{i}, t_{i})}^{y_{i}} {(1 - π_{i} (x_{i}, t_{i}))}^{1 - y_{i}}) \\ = \sum_{i = 1}^{n} y_{i} \ln (π_{i} (x_{i}, t_{i})) + \sum_{i = 1}^{n} (1 - y_{i}) \ln (1 - π_{i} (x_{i}, t_{i})) \\ = \sum_{i = 1}^{n} [y_{i} \ln (\frac{π_{i} (x_{i}, t_{i})}{1 - π_{i} (x_{i}, t_{i})}) + \ln (1 - π_{i} (x_{i}, t_{i}))] \\ = \sum_{i = 1}^{n} [y_{i} \ln (\frac{π_{i} (x_{i}, t_{i})}{1 - π_{i} (x_{i}, t_{i})}) + \ln (\frac{1}{1 + \exp (η_{i})})] \end{matrix}

(17)

By substituting Equation (15) into Equation (17), we have the following function of log-likelihood:

l (ω) = \sum_{i = 1}^{n} [y_{i} c_{i}^{T} ω - \ln (1 + e^{c_{i}^{T} ω})]

(18)
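The log-likelihood in Equation (18) can be evaluated directly from the rows of the combined design matrix. The sketch below is a hedged illustration in Python: the helper names `log1p_exp` and `log_likelihood` and the toy data are assumptions made for this example, and the stable form of ln(1 + e^η) is a numerical convenience rather than part of the derivation.

```python
import math


def log1p_exp(eta):
    """Numerically stable evaluation of ln(1 + e^eta)."""
    return max(eta, 0.0) + math.log1p(math.exp(-abs(eta)))


def log_likelihood(C, y, omega):
    """Eq. (18): sum_i [ y_i * c_i^T omega - ln(1 + exp(c_i^T omega)) ]."""
    total = 0.0
    for c_i, y_i in zip(C, y):
        eta_i = sum(c * w for c, w in zip(c_i, omega))
        total += y_i * eta_i - log1p_exp(eta_i)
    return total


# Hypothetical toy design matrix (intercept plus one covariate) and responses.
C = [[1.0, 0.5], [1.0, 1.5], [1.0, 2.5]]
y = [0, 1, 1]
print(log_likelihood(C, y, [0.0, 0.0]))  # equals -3 * ln(2) when omega = 0
```

With all coefficients at zero every fitted probability is 0.5, so the log-likelihood reduces to n ln(0.5), which gives a quick sanity check.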

After the log-likelihood function is obtained in Equation (18), a penalty function will be added, which can be directly written in the general quadratic form as follows:

l_{p} (ω) = l (ω) - \frac{1}{2} λ ω^{T} S ω

(19)

Here, S is a block-diagonal matrix defined as follows:

S = [\begin{matrix} 0_{p \times p} & 0_{p \times K} \\ 0_{K \times p} & S_{γ} \end{matrix}]

(20)

where p represents the number of parameters in the parametric component of the SBLR model that is not penalized,

K = \sum_{j = 1}^{q} r_{j}

is the total number of knots across all nonparametric predictors, and q represents the number of nonparametric predictor variables. Here,

S_{γ}

is the penalty matrix defined as follows:

S_{γ} = {[\begin{matrix} S_{γ_{1}} & 0 & \begin{matrix} \dots & 0 \end{matrix} \\ 0 & S_{γ_{2}} & \begin{matrix} \dots & 0 \end{matrix} \\ \begin{matrix} ⋮ \\ 0 \end{matrix} & \begin{matrix} ⋮ \\ 0 \end{matrix} & \begin{matrix} \begin{matrix} ⋱ & 0 \end{matrix} \\ \begin{matrix} 0 & S_{γ_{q}} \end{matrix} \end{matrix} \end{matrix}]}_{K \times K}

(21)

where

S_{γ_{j}}

is an r_j × r_j sub-matrix that represents the penalty component for the spline basis coefficients of the j-th nonparametric predictor variable. It is defined as follows:

S_{γ_{j}} = I_{r_{j} \times r_{j}} = {[\begin{matrix} 1 & 0 & \begin{matrix} \dots & 0 \end{matrix} \\ 0 & 1 & \begin{matrix} \dots & 0 \end{matrix} \\ \begin{matrix} ⋮ \\ 0 \end{matrix} & \begin{matrix} ⋮ \\ 0 \end{matrix} & \begin{matrix} \begin{matrix} ⋱ & 0 \end{matrix} \\ \begin{matrix} 0 & 1 \end{matrix} \end{matrix} \end{matrix}]}_{r_{j} \times r_{j}}

(22)
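The block structure in Equations (20)–(22) can be assembled in a few lines. This Python sketch assumes that the columns of C are ordered so that the p unpenalized columns come first, followed by the K knot coefficients; the function name `penalty_matrix` and the example sizes are hypothetical.

```python
def penalty_matrix(p, knots_per_predictor):
    """Build S = blockdiag(0_{p x p}, I_{K x K}) with K = sum of r_j, as in Eqs. (20)-(22)."""
    K = sum(knots_per_predictor)
    size = p + K
    S = [[0.0] * size for _ in range(size)]
    for d in range(p, size):
        S[d][d] = 1.0  # identity block: one unit penalty per knot coefficient
    return S


# Hypothetical sizes: 2 unpenalized columns and two predictors with r = (2, 1) knots.
S = penalty_matrix(2, [2, 1])
for row in S:
    print(row)
```

Because every block of the spline penalty is an identity matrix, the stacked penalty is itself diagonal, with zeros for the unpenalized coefficients and ones for the knot coefficients.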

Then, by substituting Equation (18) into Equation (19), we obtain:

l_{p} (ω) = \sum_{i = 1}^{n} [y_{i} c_{i}^{T} ω - \ln (1 + e^{c_{i}^{T} ω})] - \frac{λ}{2} ω^{T} S ω

(23)

The next step is to take the first partial derivative of Equation (23) with respect to the parameter ω as follows:

\begin{matrix} \frac{\partial l_{p} (ω)}{\partial ω} & = \frac{\partial}{\partial ω} (\sum_{i = 1}^{n} [y_{i} c_{i}^{T} ω - \ln (1 + e^{c_{i}^{T} ω})] - \frac{λ}{2} ω^{T} S ω) \\ = \sum_{i = 1}^{n} [y_{i} c_{i} - (\frac{e^{c_{i}^{T} ω}}{1 + e^{c_{i}^{T} ω}}) c_{i}] - λ S ω \\ = \sum_{i = 1}^{n} [y_{i} c_{i} - π (x_{i}, t_{i}) c_{i}] - λ S ω \\ = \sum_{i = 1}^{n} [y_{i} - π (x_{i}, t_{i})] c_{i} - λ S ω \\ = C^{T} (y - π) - λ S ω \end{matrix}

(24)

Hence, from Equation (24), we obtain the first partial derivative with respect to the parameter ω, also called the gradient vector,

g

, as follows:

g = C^{T} (y - π) - λ S ω .

(25)
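One way to check the gradient in Equation (25) is to compare it with a finite-difference approximation of the penalized log-likelihood in Equation (23). The Python sketch below does this on hypothetical toy data; all function names, the knot at 2, and the values of ω and λ are assumptions chosen only for illustration.

```python
import math


def penalized_loglik(C, y, omega, S, lam):
    """Eq. (23): log-likelihood minus (lam / 2) * omega^T S omega."""
    ll = 0.0
    for c_i, y_i in zip(C, y):
        eta = sum(c * w for c, w in zip(c_i, omega))
        ll += y_i * eta - math.log(1.0 + math.exp(eta))
    dim = len(omega)
    quad = sum(omega[a] * S[a][b] * omega[b] for a in range(dim) for b in range(dim))
    return ll - 0.5 * lam * quad


def gradient(C, y, omega, S, lam):
    """Eq. (25): g = C^T (y - pi) - lam * S omega."""
    dim = len(omega)
    g = [0.0] * dim
    for c_i, y_i in zip(C, y):
        eta = sum(c * w for c, w in zip(c_i, omega))
        pi_i = 1.0 / (1.0 + math.exp(-eta))
        for a in range(dim):
            g[a] += (y_i - pi_i) * c_i[a]
    for a in range(dim):
        g[a] -= lam * sum(S[a][b] * omega[b] for b in range(dim))
    return g


def numerical_gradient(C, y, omega, S, lam, h=1e-6):
    """Central finite differences of the penalized log-likelihood."""
    g = []
    for a in range(len(omega)):
        up = list(omega)
        dn = list(omega)
        up[a] += h
        dn[a] -= h
        diff = penalized_loglik(C, y, up, S, lam) - penalized_loglik(C, y, dn, S, lam)
        g.append(diff / (2.0 * h))
    return g


# Hypothetical toy data: intercept, linear term, one truncated term with knot at 2.
t = [0.0, 1.0, 2.0, 3.0, 4.0]
C = [[1.0, ti, max(ti - 2.0, 0.0)] for ti in t]
y = [0, 1, 0, 1, 1]
S = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
omega = [0.1, -0.2, 0.3]
analytic = gradient(C, y, omega, S, lam=0.5)
numeric = numerical_gradient(C, y, omega, S, lam=0.5)
print(max(abs(a - n) for a, n in zip(analytic, numeric)))
```

The printed value is the largest absolute discrepancy between the two gradients; a value near zero indicates that the analytic expression and the objective agree.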

Since setting the first derivative (the gradient vector) to zero does not yield a closed-form solution, an iterative numerical procedure is required. This is carried out using the local scoring and back-fitting algorithm, also known as Iteratively Reweighted Penalized Least Squares (IRPLS) [53], which is based on the Newton–Raphson method and proceeds as follows:

ω^{(m + 1)} = ω^{(m)} - {[H (ω^{(m)})]}^{- 1} g (ω^{(m)})

(26)

where

ω^{(m)}

is the vector of parameter estimates at the m-th iteration,

g (ω^{(m)})

is the gradient vector evaluated at

ω^{(m)}

, and

H (ω^{(m)})

is the Hessian matrix evaluated at

ω^{(m)}

.

However, to perform this iterative process, information on the local curvature of the penalized log-likelihood is required, which is obtained by computing the second partial derivative with respect to the parameter ω. This means that the gradient must be differentiated once more with respect to

ω^{T}

. For the stationary point to be a maximum, the matrix of second partial derivatives must be negative definite, namely,

\frac{\partial^{2} l_{p} (ω)}{\partial ω \partial ω^{T}} < 0

in the matrix sense.

It should be noted that ω is a vector; therefore, its derivative results in a Hessian matrix given by:

\frac{\partial^{2} l_{p} (ω)}{\partial ω \partial ω^{T}} = \frac{\partial}{\partial ω^{T}} (\sum_{i = 1}^{n} [y_{i} - π (x_{i}, t_{i})] c_{i} - λ S ω)

(27)

Using the expression obtained earlier,

π (x_{i}, t_{i}) = (\frac{e^{c_{i}^{T} ω}}{1 + e^{c_{i}^{T} ω}})

, the derivative can then be expressed as follows:

\begin{matrix} \frac{\partial^{2} l_{p} (ω)}{\partial ω \partial ω^{T}} & = \frac{\partial}{\partial ω^{T}} (\sum_{i = 1}^{n} [y_{i} - \frac{e^{c_{i}^{T} ω}}{1 + e^{c_{i}^{T} ω}}] c_{i} - λ S ω) \\ = \frac{\partial}{\partial ω^{T}} (\sum_{i = 1}^{n} [y_{i} - \frac{e^{c_{i}^{T} ω}}{1 + e^{c_{i}^{T} ω}}] c_{i}) - \frac{\partial}{\partial ω^{T}} (λ S ω) \\ = - \sum_{i = 1}^{n} c_{i} [\frac{e^{c_{i}^{T} ω}}{1 + e^{c_{i}^{T} ω}} - \frac{e^{c_{i}^{T} ω}}{1 + e^{c_{i}^{T} ω}} \frac{e^{c_{i}^{T} ω}}{1 + e^{c_{i}^{T} ω}}] c_{i}^{T} - λ S \\ = - \sum_{i = 1}^{n} c_{i} [\frac{e^{c_{i}^{T} ω}}{1 + e^{c_{i}^{T} ω}} (1 - \frac{e^{c_{i}^{T} ω}}{1 + e^{c_{i}^{T} ω}})] c_{i}^{T} - λ S \\ = - \sum_{i = 1}^{n} π_{i} (x_{i}, t_{i}) [1 - π_{i} (x_{i}, t_{i})] c_{i} c_{i}^{T} - λ S \end{matrix}

Thus, from the second-derivative process, the Hessian matrix can be written as follows:

H = - (C^{T} W C + λ S)

(28)

where

H

is a negative definite Hessian matrix,

W

is a diagonal matrix with elements

W_{i i} = π_{i} (x_{i}, t_{i}) (1 - π_{i} (x_{i}, t_{i}))

,

S

is the penalty matrix, and

λ

is the smoothing parameter.

After obtaining the first partial derivative (or gradient vector) in Equation (25) and the second partial derivative (or Hessian matrix) in Equation (28), the next step is to substitute these results into the iterative process in Equation (26), yielding:

ω^{(m + 1)} = ω^{(m)} + {(C^{T} W^{(m)} C + λ S)}^{- 1} [C^{T} (y - π^{(m)}) - λ S ω^{(m)}]

(29)

The updated Newton–Raphson equation above can be algebraically manipulated to reveal its relationship with the Weighted Least Squares method so that it becomes as follows:

(C^{T} W^{(m)} C + λ S) ω^{(m + 1)} = C^{T} W^{(m)} (C ω^{(m)} + {(W^{(m)})}^{- 1} (y - π^{(m)}))

(30)

Next, a new variable called the adjusted dependent variable

z^{(m)}

is defined, which represents the linearization of the model at the m-th iteration. The variable is as follows:

z^{(m)} = C ω^{(m)} + {(W^{(m)})}^{- 1} [(y - π^{(m)})]

(31)

Then, we substitute Equation (31) into Equation (30), and it yields the following equation:

(C^{T} W^{(m)} C + λ S) ω^{(m + 1)} = C^{T} W^{(m)} z^{(m)}

(32)

Thus, the transformation of the Newton–Raphson process results in the core equation of the IRPLS algorithm as follows:

ω^{(m + 1)} = {(C^{T} W^{(m)} C + λ S)}^{- 1} C^{T} W^{(m)} z^{(m)}

(33)

The iterative process in Equation (33) is solved by using the IRPLS algorithm, as follows:

Step 1: Define the initial value $ω^{(0)}$ and the tolerance $δ$ as a very small positive number, namely $10^{- 6}$.
Step 2: Compute the adjusted dependent variable $(z)$ for the scoring step as follows:

$z^{(m)} = η^{(m)} + {(W^{(m)})}^{- 1} [(y - π^{(m)})]$

(34)
Step 3: Update the parameter estimates $ω^{(m + 1)}$ by solving the Penalized Weighted Least Squares (PWLS) problem using the adjusted dependent variable $z^{(m)}$ and the weights $W^{(m)}$ obtained in the previous step:

$ω^{(m + 1)} = {(C^{T} W^{(m)} C + λ S)}^{- 1} C^{T} W^{(m)} z^{(m)}$

(35)
Step 4: Using the new parameter estimates $ω^{(m + 1)}$ , calculate the new $η^{(m + 1)}$ and the predicted probabilities $π^{(m + 1)}$ .
Step 5: Calculate the Deviance of the model $D^{(m + 1)}$ using the updated probabilities. Details of the Deviance will be discussed in Section 3.2.6. Check whether the algorithm has converged by comparing the difference in Deviance between iterations against the tolerance $δ$ , namely:

$| D^{(m + 1)} - D^{(m)} | \leq δ .$

(36)

If the condition is met, stop the iteration, and the resulting

{\hat{ω}}^{(m + 1)}

are the final parameter estimates, namely:

{\hat{ω}}^{(m + 1)} = {(C^{T} W^{(m)} C + λ S)}^{- 1} C^{T} W^{(m)} z^{* (m)}

(37)

If not, set

m = m + 1

and return to Step 2.
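The iteration described above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' R script: the function names, the starting value ω = 0, the toy data with a single knot at 5, and the use of the binary deviance as −2 times the log-likelihood of Equation (18) are all choices made for this example, and the small Gaussian-elimination solver stands in for a linear-algebra library.

```python
import math


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(row) + [b_i] for row, b_i in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= factor * M[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x


def linear_predictor(C, omega):
    """Return eta = C omega as a list."""
    return [sum(c * w for c, w in zip(row, omega)) for row in C]


def deviance(C, y, omega):
    """Binary deviance, taken here as -2 times the log-likelihood of Eq. (18)."""
    return -2.0 * sum(y_i * e - math.log(1.0 + math.exp(e))
                      for y_i, e in zip(y, linear_predictor(C, omega)))


def irpls(C, y, S, lam, tol=1e-6, max_iter=50):
    """Iterate Eqs. (34)-(36), starting from omega = 0 (an assumed initial value)."""
    n, dim = len(C), len(C[0])
    omega = [0.0] * dim
    dev_old = deviance(C, y, omega)
    for _ in range(max_iter):
        eta = linear_predictor(C, omega)
        pi = [1.0 / (1.0 + math.exp(-e)) for e in eta]
        w = [p * (1.0 - p) for p in pi]  # diagonal of W
        # Adjusted dependent variable, Eq. (34).
        z = [e + (y_i - p) / w_i for e, y_i, p, w_i in zip(eta, y, pi, w)]
        # Penalized weighted least squares system, Eq. (32).
        A = [[sum(w[i] * C[i][a] * C[i][b] for i in range(n)) + lam * S[a][b]
              for b in range(dim)] for a in range(dim)]
        rhs = [sum(w[i] * C[i][a] * z[i] for i in range(n)) for a in range(dim)]
        omega = solve(A, rhs)  # update, Eq. (35)
        dev_new = deviance(C, y, omega)
        if abs(dev_new - dev_old) <= tol:  # convergence check, Eq. (36)
            break
        dev_old = dev_new
    return omega


# Hypothetical toy data: intercept, linear term, one truncated term with knot at 5.
t = [float(v) for v in range(10)]
C = [[1.0, ti, max(ti - 5.0, 0.0)] for ti in t]
y = [0, 0, 1, 0, 0, 1, 1, 0, 1, 1]
S = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
omega_hat = irpls(C, y, S, lam=1.0)
print(omega_hat)
```

At convergence the penalized gradient C^T(y − π) − λSω should be close to zero, which is a convenient check that the weighted least squares updates and the Newton–Raphson step coincide.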

After obtaining the estimation of parameters of the SBLR model, based on Equation (15), we obtain the estimated SBLR model as follows:

\hat{η} (x_{i}, t_{i}) = \ln (\frac{\hat{π} (x_{i}, t_{i})}{1 - \hat{π} (x_{i}, t_{i})}) = X \hat{β} + T \hat{γ} = C \hat{ω}

(38)

Based on Equation (37), we may express Equation (38) as follows:

\hat{η} (x_{i}, t_{i}) = C {(C^{T} W C + λ S)}^{- 1} C^{T} W z^{*} = H z^{*}

(39)

where

H

in Equation (39) denotes the hat (smoother) matrix, which is distinct from the Hessian matrix in Equation (28), and is given by:

H = C {(C^{T} W C + λ S)}^{- 1} C^{T} W .

Finally, we can express the estimation of the Semiparametric Binary Logistic Regression (SBLR) model as follows:

\hat{η} (x_{i}, t_{i}) = H z^{*} .

(40)

3.2. Application to Coronary Heart Disease (CHD) Risk Factors

The dataset used in this study is secondary data that represent clinical records from Universitas Airlangga Hospital (RSUA) in 2025, capturing key risk factors for Coronary Heart Disease (CHD). While we acknowledge that data from a single center may introduce selection bias, the demographic characteristics of our sample are consistent with broader epidemiological trends in Indonesia. The dataset was obtained through patient questionnaires and interviews conducted during medical care visits. Specifically, medical record data from 80 post-cardiac catheterization patients were used.

3.2.1. Statistical Description

The descriptive analysis aims to identify the distributional patterns and fundamental characteristics of the research subjects concerning CHD risks. To facilitate interpretation, the predictor variables in this research are classified into three primary categories. The first category pertains to objective factors, represented by age, body weight, and body height. The second category encompasses biochemical parameters, specifically fat and sugar levels, which serve as crucial metabolic indicators. The third category involves psychological factors, quantified through the Stress Level variable. Stress levels were categorized based on the knot points identified in the spline regression analysis, which align with the clinical progression from normal to severe stress states. The detailed summary statistics for both groups are presented in Table 1.

Table 1 presents the statistical description of the predictor variables by response status, namely Non-CHD (healthy) and CHD (Coronary Heart Disease patients). It shows that:

(a): The most striking difference was seen in the Age variable. Patients in the CHD group had a significantly higher mean age (59.3 years) compared to the Non-CHD group (39.7 years). In addition, the variance in the Non-CHD group (380.25) was much larger than that in the CHD group (104.04), indicating that the healthy control group covered a much wider age range, whereas CHD cases were more concentrated at older ages.
(b): Regarding anthropometric measurements, the CHD group showed a higher average body weight (69.2 kg) compared to the Non-CHD group (63.8 kg). Similarly, the average height was slightly higher in the CHD group (164.4 cm) compared to the Non-CHD group (160.7 cm).
(c): An interesting pattern also emerged regarding dietary habits and psychological factors. The group with coronary heart disease (CHD) recorded a higher average consumption of sugary foods (14.69) compared to the non-CHD group (10.42). Conversely, average consumption of fatty foods was unexpectedly lower in the CHD group (16.73) compared to the non-CHD group (21.37). These counterintuitive findings may be due to dietary adjustments or medical interventions implemented by CHD patients after their diagnosis, or perhaps age-related differences in food metabolism.
(d): In terms of psychological factors, the Non-CHD group showed a slightly higher mean Stress Level score (5.27) compared to the CHD group (4.167). However, the variance for Stress Level was relatively similar between the two groups (11.56 and 12.92, respectively), indicating a consistent distribution of stress scores across the sample population regardless of disease status.

Based on the descriptive analysis, the data highlight complex and potentially non-linear relationships between dietary habits, particularly sugary and fatty food consumption, and disease status. These complexities warrant the use of the Semiparametric Binary Logistic Regression (SBLR) model approach. Unlike standard parametric methods, the SBLR provides the necessary flexibility to capture these irregular patterns.

3.2.2. Observed Logit Plot

The Semiparametric Binary Logistic Regression (SBLR) modeling begins with an analysis of the functional association between the response and predictor variables. To assess the linearity assumption inherent in logistic regression, we examined the functional association between the continuous predictors and the observed log-odds (logit) of Coronary Heart Disease (CHD). The scatterplot results are presented in Figure 1.

The visual evidence in Figure 1 clearly shows that the functional form of the association between the predictor and response variables is non-linear and complex. Applying a strict parametric (linear) structure would lead to model misspecification and biased estimates. Consequently, these findings provide a compelling empirical justification for employing the SBLR model. The use of the penalized spline estimator is necessary to flexibly capture these local fluctuations and provide a more accurate representation of the data structure.

3.2.3. Eta Correlation Test

While the graphical exploration via observed logit plots in the previous section provided visual indications of fluctuating and non-linear patterns, interpretations based solely on visual inspection are inherently subjective. A pattern that appears non-linear visually must be statistically verified to ensure that the deviation from linearity is significant and not merely due to random noise. Therefore, to overcome this subjectivity and objectively validate the assumption for using the nonparametric functions, we employ the Eta Correlation (

η

) hypothesis test [59], where the hypotheses are

H_{0} :

There is a non-linear relationship between the dependent variable and the independent variable versus

H_{1} :

There is a linear relationship between the dependent variable and the independent variable. The statistical test is given by the following equation:

F = \frac{η^{2} (N - k)}{(1 - η^{2}) (k - 1)}

(41)

where

η^{2} = 1 - \frac{\sum Υ_{T}^{2} - (n_{1}) {({\bar{Υ}}_{1})}^{2} - (n_{2}) {({\bar{Υ}}_{2})}^{2} - (n_{3}) {({\bar{Υ}}_{3})}^{2}}{\sum Υ_{T}^{2} - (n_{1} + n_{2} + n_{3}) {({\bar{Υ}}_{T})}^{2}}

, N is the total number of observations, and k is the number of groups. In this test, we reject the null hypothesis (

H_{0}

), if the p-value is less than significance level

(α)

, such as

α = 0.05

.
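The test statistic in Equation (41) can be computed as in the following Python sketch. It assumes the standard definition of η² as one minus the ratio of the within-group sum of squares to the total sum of squares; the function names and the three toy groups are hypothetical and are not taken from the CHD data.

```python
def eta_squared(groups):
    """eta^2 = 1 - SS_within / SS_total for responses split into groups."""
    all_values = [v for g in groups for v in g]
    n_total = len(all_values)
    grand_mean = sum(all_values) / n_total
    sum_sq = sum(v * v for v in all_values)
    ss_total = sum_sq - n_total * grand_mean ** 2
    ss_within = sum_sq - sum(len(g) * (sum(g) / len(g)) ** 2 for g in groups)
    return 1.0 - ss_within / ss_total


def eta_f_statistic(groups):
    """Eq. (41): F = eta^2 (N - k) / ((1 - eta^2) (k - 1))."""
    eta2 = eta_squared(groups)
    n_total = sum(len(g) for g in groups)
    k = len(groups)
    return eta2 * (n_total - k) / ((1.0 - eta2) * (k - 1))


# Hypothetical toy groups (k = 3, N = 9).
groups = [[1, 2, 3], [2, 3, 4], [6, 7, 8]]
print(eta_squared(groups), eta_f_statistic(groups))  # 0.875 21.0
```

For these toy groups the within-group sum of squares is 6 and the total sum of squares is 48, so η² = 0.875 and F = 0.875 × 6 / (0.125 × 2) = 21.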

In Table 2, we present the calculation results of the linearity test using the Eta correlation test for every predictor with a significance level of

α = 0.05

.

The calculation results given in Table 2 indicate that the predictor variables exhibit a mixed structure, containing both linear and non-linear components. This condition is well-suited for the Semiparametric Binary Logistic Regression (SBLR) model, which can simultaneously accommodate the linear effect of the age variable and the non-linear effects of the other risk factors. Therefore, the SBLR is not only an appropriate method but also an effective modeling approach for this dataset.

3.2.4. Model Specification and Optimal Smoothing Parameters Selection

In the penalized spline estimator, the optimal smoothing parameters, including the optimal knot points and the optimal smoothing parameter (λ), must be selected to determine the best SBLR model. This selection is based on the GACV criterion, choosing the values that minimize the GACV function. The GACV function is given by Equation (5) or Equation (6). The optimal smoothing parameters can be selected by using the following algorithm, namely, Algorithm 1.

Algorithm 1. Selecting Optimal Smoothing Parameters Based on GACV Criterion

Creating the general inverse function: “ginverse”.
Creating a function, “quant”, for determining the optimal knot point.
Creating a function, “trun”, for determining the value of the knot point location for each variable.
Creating a function, “matrikx”, for obtaining optimal knot point iterations and knot locations.
Creating a function, “matrikd”, for constructing the penalty matrix.
Creating a function, “hitung.gacv”, for obtaining the minimum GACV value.
Creating a function, “plot_gacv”, for obtaining the plot of GACV value.
Defining q for i-th predictor on components of nonparametric.
Defining a vector of nonparametric predictor variable (t).
Defining a vector of response variable (y).
Sorting the data on each predictor for obtaining the optimal knot point and knot location.
Making iterations to obtain the minimum GACV value with the syntax “carioptimal_logistic_gacv”, so that we obtain the optimal knot point, knot location, and lambda $(λ)$ based on the minimum value of GACV.

Based on the R-code outputs for the nonparametric component, we obtain the results, which include the order of the penalized spline of

d = 1

(i.e., linear order), the optimal number of knot points, the optimal knot points, the optimal value of λ (smoothing parameter), and the minimum GACV value for all the predictor variables. The results are given in Table 3.

3.2.5. Estimation Results

After obtaining the optimal smoothing parameters for all predictor variables, we use the IRPLS method, iterating from the initial values, and write the R code of the penalized spline estimator for estimating the multipredictor SBLR model of the CHD data. The penalized spline algorithm used to estimate the SBLR model of the CHD data is provided in Algorithm 2.

Algorithm 2. Penalized Spline Estimation for SBLR Model of CHD Data

Defining $n$ , namely the number of observations in the CHD data.
Defining $x$ , namely a vector of parametric predictor variable.
Defining $t$ , namely a vector of nonparametric predictor variable.
Defining $y$ , namely a vector of response variable.
Constructing the combined design matrix C and penalty matrix S:
- Step 5a: Form the parametric design matrix X and the nonparametric spline T.
- Step 5b: Combine them into C = [X|T].
- Step 5c: Construct the block-diagonal penalty matrix S.
Selecting the optimal smoothing parameter $(k_{o p t}, λ_{o p t})$ using the GACV method:
- Step 6a: Define a grid of $k, λ$ values.
- Step 6b: For each $k, λ$ , run the IRPLS process until convergent.
- Step 6c: Calculate the GACV score based on the converged model.
- Step 6d: Choose $k_{o p t}, λ_{o p t}$ that minimizes the GACV score.
Iterating the Iteratively Reweighted Penalized Least Square (IRPLS) algorithm using $k_{o p t}, λ_{o p t}$ to obtain final parameters $\hat{ω}$
- Step 7a: Initialize the parameter vector $ω^{(0)}$ and set the tolerance $δ = 10^{- 6}$ .
- Step 7b: Calculate the fitted probabilities $π^{(m)}$ and the diagonal weighting matrix $W^{(m)}$ .
- Step 7c: Calculate the adjusted dependent vector $z^{(m)}$ based on Equation (31).
- Step 7d: Update the parameter estimates based on Equation (32).
- Step 7e: Calculate the model deviance $D^{(m + 1)}$ .
- Step 7f: Check convergence; otherwise, set $m = m + 1$ and repeat from step 7b.
Extracting the final estimator $\hat{ω}$ , separating the parametric coefficients $(\hat{β})$ and the nonparametric spline coefficients $(\hat{γ})$ .
Calculating the vector of estimated probabilities $\hat{π}$ based on step 8.
Determining the optimal classification threshold $(c_{o p t})$ based on Youden’s J index:
- Step 10a: Define a sequence of candidate thresholds $τ \in [0,1]$ .
- Step 10b: For each candidate $τ$ , classify the observations: ${\hat{y}}_{i} = 1$ if ${\hat{π}}_{i} > τ$ , else 0.
- Step 10c: Calculate the Sensitivity $({S e}_{τ})$ and Specificity $({S p}_{τ})$ for each threshold.
- Step 10d: Compute Youden’s Index: $J_{τ} = {S e}_{τ} + {S p}_{τ} - 1$ .
- Step 10e: Select the optimal threshold that maximizes the index: $c_{o p t} = \arg \max_{τ} (J_{τ})$
Performing the final classification using $c_{o p t}$ .
Calculating the Press’s Q, McNemar, Deviance, Brier Score (BS), and Brier Skill Score (BSS) values.
Calculating the final Confusion Matrix, and model evaluation metrics (Accuracy, Sensitivity, Specificity, and AUC).
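The threshold selection in Step 10 of Algorithm 2 can be illustrated with a short Python sketch. The function name `youden_threshold`, the grid of 101 evenly spaced candidate thresholds, and the toy labels and probabilities are assumptions made for this example; the sketch also assumes that both classes are present in the data.

```python
def youden_threshold(y, prob, n_grid=101):
    """Return (tau, J) maximizing J = Se + Sp - 1 over an evenly spaced grid on [0, 1]."""
    n_pos = sum(1 for yi in y if yi == 1)
    n_neg = sum(1 for yi in y if yi == 0)
    best_tau, best_j = 0.0, -1.0
    for step in range(n_grid):
        tau = step / (n_grid - 1)
        pred = [1 if p > tau else 0 for p in prob]
        tp = sum(1 for yi, pi in zip(y, pred) if yi == 1 and pi == 1)
        tn = sum(1 for yi, pi in zip(y, pred) if yi == 0 and pi == 0)
        j = tp / n_pos + tn / n_neg - 1.0  # sensitivity + specificity - 1
        if j > best_j:  # keep the first threshold attaining the maximum
            best_tau, best_j = tau, j
    return best_tau, best_j


# Hypothetical labels and fitted probabilities.
y = [0, 0, 1, 1]
prob = [0.2, 0.3, 0.7, 0.9]
tau_opt, j_opt = youden_threshold(y, prob)
print(tau_opt, j_opt)
```

When several thresholds give the same index, this sketch keeps the smallest one; other tie-breaking rules are equally valid and may give a different classification threshold.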

All statistical computations and algorithm implementations were performed in the R programming environment using RStudio (version 2025.09.02). It is important to note that the proposed SBLR method was implemented using scripts written manually by the authors, without relying on standard built-in packages. This ensures that the estimation process strictly follows the specific Iteratively Reweighted Penalized Least Squares (IRPLS) algorithm and GACV optimization derived in this study.

Based on the R-code outputs, the estimation of parameters of the multipredictor SBLR model can be obtained. These estimated parameters of the model are as follows:

\begin{matrix} {\hat{θ}}_{0} = {\hat{β}}_{0} + {\hat{γ}}_{0} = 1.0124 \\ {\hat{β}}_{1} = 0.1360 \\ {\hat{γ}}_{1} = {[\begin{matrix} - 0.1722 & 0.8029 & - 0.5068 \end{matrix}]}^{T} \\ {\hat{γ}}_{2} = {[\begin{matrix} 0.0059 & 0.1931 & \begin{matrix} - 0.6561 & 0.6266 \end{matrix} \end{matrix}]}^{T} \\ {\hat{γ}}_{3} = {[\begin{matrix} - 0.3675 & 0.6871 & \begin{matrix} - 0.1121 & 0.0970 \end{matrix} \end{matrix}]}^{T} \\ {\hat{γ}}_{4} = {[\begin{matrix} - 0.0273 & 0.1570 \end{matrix}]}^{T} \\ {\hat{γ}}_{5} = {[\begin{matrix} - 0.6763 & 0.1225 & \begin{matrix} 1.6701 & - 0.7207 \end{matrix} \end{matrix}]}^{T} \end{matrix}\}

(42)

Next, based on values given in Equation (42), we can determine the estimation value of the nonparametric penalized spline function,

{\hat{g}}_{j} (t_{j i})

, of the CHD data for the initial value of each predictor variable for n observations as follows:

(a): By assuming that the other variables are constant, we obtain the estimation of nonparametric penalized spline function, ${\hat{g}}_{1} (t_{1 i})$ , on the first predictor, namely, Body Weight, as follows:

${\hat{g}}_{1} (t_{1 i}) = - 0.1722 t_{1 i} + 0.8029 {(t_{1 i} - 63)}_{+} - 0.5068 {(t_{1 i} - 71)}_{+} = \{\begin{matrix} - 0.1722 t_{1 i} for t_{1 i} < 63 \\ - 50.5827 + 0.6307 t_{1 i} for 63 \leq t_{1 i} < 71 \\ - 14.5999 + 0.1239 t_{1 i} for t_{1 i} \geq 71 \end{matrix}$

(43)
(b): By assuming that the other variables are constant, we obtain the estimation of nonparametric penalized spline function, ${\hat{g}}_{2} (t_{2 i})$ , on the second predictor, namely, Body Height, as follows:

$\begin{matrix} {\hat{g}}_{2} (t_{2 i}) & = 0.0059 t_{2 i} + 0.1931 {(t_{2 i} - 159)}_{+} - 0.6561 {(t_{2 i} - 162.5)}_{+} + 0.6266 {(t_{2 i} - 168)}_{+} \\ = \{\begin{matrix} 0.0059 t_{2 i} for t_{2 i} < 159 \\ - 30.7029 + 0.1990 t_{2 i} for 159 \leq t_{2 i} < 162.5 \\ 75.9133 - 0.4571 t_{2 i} for 162.5 \leq t_{2 i} < 168 \\ - 29.3555 + 0.1695 t_{2 i} for t_{2 i} \geq 168 \end{matrix} \end{matrix}$

(44)
(c): By assuming that the other variables are constant, we obtain the estimation of nonparametric penalized spline function, ${\hat{g}}_{3} (t_{3 i})$ , on the third predictor, namely, Sugary Food Consumption, as follows:

$\begin{matrix} {\hat{g}}_{3} (t_{3 i}) & = - 0.3675 t_{3 i} + 0.6871 {(t_{3 i} - 4.34)}_{+} - 0.1121 {(t_{3 i} - 8.40)}_{+} + 0.0970 {(t_{3 i} - 18.83)}_{+} \\ = \{\begin{matrix} - 0.3675 t_{3 i} for t_{3 i} < 4.34 \\ - 2.9820 + 0.3196 t_{3 i} for 4.34 \leq t_{3 i} < 8.40 \\ - 2.0403 + 0.2075 t_{3 i} for 8.40 \leq t_{3 i} < 18.83 \\ - 3.8668 + 0.3045 t_{3 i} for t_{3 i} \geq 18.83 \end{matrix} \end{matrix}$

(45)
(d): By assuming that the other variables are constant, we obtain the estimation of nonparametric penalized spline function, ${\hat{g}}_{4} (t_{4 i})$ , on the fourth predictor, namely Fatty Food Consumption, as follows:

$\begin{matrix} {\hat{g}}_{4} (t_{4 i}) & = - 0.0273 t_{4 i} + 0.1570 {(t_{4 i} - 17.29)}_{+} \\ = \{\begin{matrix} - 0.0273 t_{4 i} for t_{4 i} < 17.29 \\ - 2.7145 + 0.1297 t_{4 i} for t_{4 i} \geq 17.29 \end{matrix} \end{matrix}$

(46)
(e): By assuming that the other variables are constant, we obtain the estimation of nonparametric penalized spline function, ${\hat{g}}_{5} (t_{5 i})$ , on the fifth predictor, namely Stress Level, as follows:

\begin{matrix} {\hat{g}}_{5} (t_{5 i}) & = - 0.6763 t_{5 i} + 0.1225 {(t_{5 i} - 2)}_{+} + 1.6701 {(t_{5 i} - 5)}_{+} - 0.7207 {(t_{5 i} - 7)}_{+} \\ = \{\begin{matrix} - 0.6763 t_{5 i}, for t_{5 i} < 2 \\ - 0.2450 - 0.5538 t_{5 i}, for 2 \leq t_{5 i} < 5 \\ - 8.5955 + 1.1163 t_{5 i}, for 5 \leq t_{5 i} < 7 \\ - 3.5506 + 0.3956 t_{5 i}, for t_{5 i} \geq 7 \end{matrix} \end{matrix}

(47)
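The piecewise forms in Equations (43)–(47) follow mechanically from the truncated-basis coefficients: each knot adds its coefficient to the slope and subtracts the coefficient times the knot from the intercept. The Python sketch below reproduces this bookkeeping for the Body Weight function in Equation (43); the function names are hypothetical.

```python
def spline_value(t, coefs, knots):
    """Evaluate coefs[0] * t + sum_k coefs[k + 1] * (t - knots[k])_+ ."""
    return coefs[0] * t + sum(c * max(t - k, 0.0) for c, k in zip(coefs[1:], knots))


def piecewise_segments(coefs, knots):
    """Return (intercept, slope) for each interval, from left to right."""
    intercept, slope = 0.0, coefs[0]
    segments = [(intercept, slope)]
    for c, k in zip(coefs[1:], knots):
        intercept -= c * k  # (t - k)_+ contributes c * t - c * k once t >= k
        slope += c
        segments.append((intercept, slope))
    return segments


weight_coefs = [-0.1722, 0.8029, -0.5068]  # coefficients from Eq. (43)
weight_knots = [63.0, 71.0]                # Body Weight knots from Eq. (43)
for intercept, slope in piecewise_segments(weight_coefs, weight_knots):
    print(round(intercept, 4), round(slope, 4))
```

For example, the second segment has slope −0.1722 + 0.8029 = 0.6307 and intercept −0.8029 × 63 = −50.5827, which matches the middle branch of Equation (43).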

Based on the estimation of nonparametric penalized spline functions in Equations (42)–(47), we obtain the estimation of probability models of the SBLR as follows:

{\hat{π}}_{i} = \frac{e^{1.0124 + 0.1360 x_{i} - 0.1722 t_{1 i} + 0.8029 {(t_{1 i} - 63)}_{+} - 0.5068 {(t_{1 i} - 71)}_{+} - \dots - 0.6763 t_{5 i} + 0.1225 {(t_{5 i} - 2)}_{+} + 1.6701 {(t_{5 i} - 5)}_{+} - 0.7207 {(t_{5 i} - 7)}_{+}}}{1 + e^{1.0124 + 0.1360 x_{i} - 0.1722 t_{1 i} + 0.8029 {(t_{1 i} - 63)}_{+} - 0.5068 {(t_{1 i} - 71)}_{+} - \dots - 0.6763 t_{5 i} + 0.1225 {(t_{5 i} - 2)}_{+} + 1.6701 {(t_{5 i} - 5)}_{+} - 0.7207 {(t_{5 i} - 7)}_{+}}} .

(48)

Next, the SBLR model based on penalized spline can be written as follows:

\begin{matrix} {\hat{η}}_{i} = \ln (\frac{{\hat{π}}_{i}}{1 - {\hat{π}}_{i}}) = & 1.0124 + 0.1360 x_{i} - 0.1722 t_{1 i} + 0.8029 {(t_{1 i} - 63)}_{+} - 0.5068 {(t_{1 i} - 71)}_{+} - \dots - 0.6763 t_{5 i} + \\ 0.1225 {(t_{5 i} - 2)}_{+} + 1.6701 {(t_{5 i} - 5)}_{+} - 0.7207 {(t_{5 i} - 7)}_{+} \end{matrix}

(49)

Based on the obtained estimation results presented in Equations (42)–(49), the model interpretation can be performed by calculating the Odds Ratio (OR) because, in the context of medical risk analysis (logistic framework), the OR is the “gold standard” for communicating relative risk. In the SBLR model based on the penalized spline estimator, predictor variables included in the parametric systematic component have a single OR value only. In contrast, predictor variables belonging to the nonparametric systematic component may have multiple OR values, corresponding to the number of optimal knots determined for each variable. Accordingly, the Odds Ratio calculations for all predictor variables are given in Table 4.

Table 4 shows that CHD risk has a significant linear association with Age and non-linear associations with Body Weight, Body Height, Sugary Food Consumption, Fatty Food Consumption, and Stress Level. These associations can be explained as follows:

(a)

For the Age (X) factor, the OR for the variable is

\exp (0.1360) = 1.1457 .

Holding the other predictor variables constant, this OR shows that the odds of a patient developing coronary heart disease (CHD) are multiplied by 1.1457 (an increase of about 14.6%) for each one-year increase in the patient’s age.

(b)

For the Body Weight

(t_{1 i})

factor:

-: Holding the other predictor variables constant, for body weight below 63 kg, each additional 1 kg of body weight multiplies the odds of coronary heart disease (CHD) by 0.8418, i.e., the odds decrease.
-: Holding the other predictor variables constant, for body weight between 63 kg and 71 kg, each additional 1 kg of body weight multiplies the odds of coronary heart disease (CHD) by 1.8789, i.e., the odds increase.
-: Holding the other predictor variables constant, for body weight above 71 kg, each additional 1 kg of body weight multiplies the odds of coronary heart disease (CHD) by 1.1319, i.e., the odds increase.

(c)

For the Body Height

(t_{2 i})

factor:

-: Holding the other predictor variables constant, for height below 159 cm, each additional 1 cm of height multiplies the odds of coronary heart disease (CHD) by 1.0059, i.e., the odds increase slightly.
-: Holding the other predictor variables constant, for height between 159 cm and 162.5 cm, each additional 1 cm of height multiplies the odds of coronary heart disease (CHD) by 1.2201, i.e., the odds increase.
-: Holding the other predictor variables constant, for height between 162.5 cm and 168 cm, each additional 1 cm of height multiplies the odds of coronary heart disease (CHD) by 0.6331, i.e., the odds decrease.
-: Holding the other predictor variables constant, for height above 168 cm, each additional 1 cm of height multiplies the odds of coronary heart disease (CHD) by 1.1847, i.e., the odds increase.

(d)

For the Sugary Food Consumption

(t_{3 i})

factor:

-: Holding the other predictor variables constant, for sugary food consumption below 4.34 g, each additional 1 g multiplies the odds of coronary heart disease (CHD) by 0.6925, i.e., the odds decrease.
-: Holding the other predictor variables constant, for sugary food consumption between 4.34 g and 8.40 g, each additional 1 g multiplies the odds of coronary heart disease (CHD) by 1.3766, i.e., the odds increase.
-: Holding the other predictor variables constant, for sugary food consumption between 8.40 g and 18.83 g, each additional 1 g multiplies the odds of coronary heart disease (CHD) by 1.2306, i.e., the odds increase.
-: Holding the other predictor variables constant, for sugary food consumption above 18.83 g, each additional 1 g multiplies the odds of coronary heart disease (CHD) by 1.3559, i.e., the odds increase.

(e)

For the Fatty Food Consumption

(t_{4 i})

factor:

-: Holding the other predictor variables constant, for fatty food consumption below 17.29 g, each additional 1 g multiplies the odds of coronary heart disease (CHD) by \exp (- 0.0273) = 0.9731, i.e., the odds decrease slightly.
-: Holding the other predictor variables constant, for fatty food consumption above 17.29 g, each additional 1 g multiplies the odds of coronary heart disease (CHD) by 1.1385, i.e., the odds increase.

(f): For the Stress Level ($t_{5 i}$) factor:

-: Holding the other predictor variables constant, for stress levels below 2 (i.e., the Normal category), the OR shows that the odds of a patient developing coronary heart disease (CHD) are multiplied by 0.5085 (a decrease) when the patient’s stress level increases by one point.
-: Holding the other predictor variables constant, for stress levels between 2 and 5 (i.e., the Mild Stress category), the OR shows that the odds of a patient developing coronary heart disease (CHD) are multiplied by 0.5748 (a decrease) when the patient’s stress level increases by one point.
-: Holding the other predictor variables constant, for stress levels between 5 and 7 (i.e., the Moderate Stress category), the OR shows that the odds of a patient developing coronary heart disease (CHD) are multiplied by 3.0535 (an increase) when the patient’s stress level increases by one point.
-: Holding the other predictor variables constant, for stress levels greater than 7 (i.e., the Severe Stress category), the OR shows that the odds of a patient developing coronary heart disease (CHD) are multiplied by 1.4853 (an increase) when the patient’s stress level increases by one point.
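
The odds ratios listed above can be related to the segment slopes of the fitted penalized spline. As a minimal sketch, assuming that each OR is obtained as the exponential of the slope on the corresponding locally linear segment (the usual relationship under a logit link), the following Python snippet recomputes four of the reported ORs from the slopes that appear in Equations (52)–(55). The dictionary keys and the function name `odds_ratio` are illustrative and are not taken from the authors' software.

```python
import math

# Segment slopes quoted in the worked example, Equations (52)-(55).
slopes = {
    "body_height_162.5_to_168": -0.4571,
    "sugary_food_above_18.83": 0.3045,
    "fatty_food_above_17.29": 0.1297,
    "stress_level_2_to_5": -0.5538,
}


def odds_ratio(slope):
    """Odds ratio for a one-unit increase on a locally linear segment
    (assumption: OR = exp(slope) under the logit link)."""
    return math.exp(slope)


for name, slope in slopes.items():
    print(f"{name}: OR = {odds_ratio(slope):.4f}")
```

An OR below 1 corresponds to a negative slope (the odds fall as the predictor rises), and an OR above 1 corresponds to a positive slope.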

Each observation in the SBLR model has a linear (parametric) component and a penalized spline (nonparametric) component. As an illustration, we work through one in-sample observation, namely a patient with Age = 71 years, Body Weight = 72 kg, Body Height = 164 cm, Sugary Food Consumption = 20.63 mg/dL, Fatty Food Consumption = 33.66 mg/dL, and Stress Level = 2. The calculation for the other in-sample observations proceeds in the same way. The results for this example are as follows:

(a): For the predictor of the parametric component, namely Age, based on Equation (42), the value of $f (x)$ for $x = 71$ is:

$f (x) = 0.1360 x = 0.1360 (71) = 9.6560 .$

(50)
(b): For the first predictor of the nonparametric component, namely Body Weight, based on Equation (43), $t_{1} = 72$ satisfies the condition $t_{1 i} \geq 71$, so the penalized spline value of the estimated nonparametric function ${\hat{g}}_{1} (t_{1})$ is:

${\hat{g}}_{1} (t_{1}) = - 14.5999 + 0.1239 t_{1 i} = - 14.5999 + 0.1239 (72) = - 5.6791 .$

(51)
(c): For the second predictor of the nonparametric component, namely Body Height, based on Equation (44), $t_{2} = 164$ satisfies the condition $162.5 \leq t_{2 i} < 168$, so the penalized spline value of the estimated nonparametric function ${\hat{g}}_{2} (t_{2})$ is:

${\hat{g}}_{2} (t_{2}) = 75.9133 - 0.4571 t_{2 i} = 75.9133 - 0.4571 (164) = 0.9489 .$

(52)
(d): For the third predictor of the nonparametric component, namely Sugary Food Consumption, based on Equation (45), $t_{3} = 20.63$ satisfies the condition $t_{3 i} \geq 18.83$, so the penalized spline value of the estimated nonparametric function ${\hat{g}}_{3} (t_{3})$ is:

${\hat{g}}_{3} (t_{3}) = - 3.8668 + 0.3045 t_{3 i} = - 3.8668 + 0.3045 (20.63) = 2.4150 .$

(53)
(e): For the fourth predictor of the nonparametric component, namely Fatty Food Consumption, based on Equation (46), $t_{4} = 33.66$ satisfies the condition $t_{4 i} \geq 17.29$, so the penalized spline value of the estimated nonparametric function ${\hat{g}}_{4} (t_{4})$ is:

${\hat{g}}_{4} (t_{4}) = - 2.7145 + 0.1297 t_{4 i} = - 2.7145 + 0.1297 (33.66) = 1.6512 .$

(54)
(f): For the fifth predictor of the nonparametric component, namely Stress Level, based on Equation (47), $t_{5} = 2$ satisfies the condition $2 \leq t_{5 i} < 5$, so the penalized spline value of the estimated nonparametric function ${\hat{g}}_{5} (t_{5})$ is:

${\hat{g}}_{5} (t_{5}) = - 0.2450 - 0.5538 t_{5 i} = - 0.2450 - 0.5538 (2) = - 1.3526 .$

(55)

Based on the calculation results for the parametric component and the nonparametric penalized spline components given in Equations (50)–(55), we obtain the estimated probability as follows:

$\begin{aligned} {\hat{\pi}}_{i} &= \frac{e^{(1.0124 + 9.6560 - 5.6791 + 0.9489 + 2.4150 + 1.6512 - 1.3526)}}{1 + e^{(1.0124 + 9.6560 - 5.6791 + 0.9489 + 2.4150 + 1.6512 - 1.3526)}} \\ &= \frac{e^{8.6518}}{1 + e^{8.6518}} = 0.999 . \end{aligned}$

(56)
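
The arithmetic in Equations (50)–(56) can be checked with a few lines of code. The following Python sketch recomputes each component for the example patient and applies the inverse logit; the intercept 1.0124 is read from Equation (56), and the variable names (`components`, `inverse_logit`, `eta`, `pi_hat`) are illustrative choices rather than names from the authors' implementation.

```python
import math

# Intercept and segment coefficients as they appear in Equations (50)-(56).
intercept = 1.0124
components = {
    "age": 0.1360 * 71,
    "body_weight": -14.5999 + 0.1239 * 72,
    "body_height": 75.9133 - 0.4571 * 164,
    "sugary_food": -3.8668 + 0.3045 * 20.63,
    "fatty_food": -2.7145 + 0.1297 * 33.66,
    "stress_level": -0.2450 - 0.5538 * 2,
}


def inverse_logit(eta):
    """Map a linear predictor to a probability under the logit link."""
    return 1.0 / (1.0 + math.exp(-eta))


# Linear predictor: intercept plus the parametric and spline components.
eta = intercept + sum(components.values())
pi_hat = inverse_logit(eta)

for name, value in components.items():
    print(f"{name}: {value:.4f}")
print(f"eta = {eta:.4f}, pi_hat = {pi_hat:.4f}")
```

Because the products here are not rounded to four decimals before summation, the linear predictor comes out at about 8.652 rather than exactly 8.6518; the resulting probability is about 0.9998, in agreement with the value of 0.999 reported in Equation (56).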

The estimated probability ${\hat{\pi}}_{i} = 0.999$ derived in Equation (56) represents the likelihood of the event occurring. To classify this observation into a binary outcome, either CHD (i.e., $y = 1$) or Non-CHD (i.e., $y = 0$), a decision threshold ($c$) is required. Instead of utilizing the default threshold of 0.5, which may yield biased classifications in medical datasets, we determined the optimal cut-off point using Youden’s J statistic. This method selects the threshold that maximizes the difference between the True Positive Rate (Sensitivity) and the False Positive Rate, as follows [60]:

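$J = \max_{c} \left( {Se}_{c} + {Sp}_{c} - 1 \right)$

(57)

where ${Se}_{c}$ and ${Sp}_{c}$ denote the sensitivity and specificity obtained at threshold $c$. The ten best thresholds ranked by Youden’s J statistic are given in Table 5.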
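
The threshold search in Equation (57) can be sketched as a simple grid search. The following Python snippet is a minimal illustration, assuming a grid of candidate thresholds from 0.01 to 0.99 and a rule that classifies an observation as CHD when its estimated probability is at least the threshold; the function name `youden_threshold` and the toy data are hypothetical and are not taken from the study.

```python
def youden_threshold(y_true, pi_hat, grid):
    """Return (c, J) maximizing J = Se + Sp - 1 over the candidate grid.

    Assumes both classes are present in y_true. Ties are resolved in
    favor of the smallest threshold.
    """
    best_c, best_j = None, float("-inf")
    for c in grid:
        tp = sum(1 for y, p in zip(y_true, pi_hat) if y == 1 and p >= c)
        fn = sum(1 for y, p in zip(y_true, pi_hat) if y == 1 and p < c)
        tn = sum(1 for y, p in zip(y_true, pi_hat) if y == 0 and p < c)
        fp = sum(1 for y, p in zip(y_true, pi_hat) if y == 0 and p >= c)
        sensitivity = tp / (tp + fn)
        specificity = tn / (tn + fp)
        j = sensitivity + specificity - 1
        if j > best_j:
            best_c, best_j = c, j
    return best_c, best_j


# Hypothetical toy data, for illustration only.
y_toy = [0, 0, 0, 1, 0, 1, 1, 1]
p_toy = [0.05, 0.20, 0.30, 0.35, 0.45, 0.60, 0.80, 0.95]
grid = [k / 100 for k in range(1, 100)]

c_opt, j_opt = youden_threshold(y_toy, p_toy, grid)
y_pred = [1 if p >= c_opt else 0 for p in p_toy]
print(c_opt, j_opt, y_pred)
```

A finer or coarser grid, or a search over the observed probabilities only, would be equally valid; the grid spacing here is an assumption made for the sketch.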

Based on the values presented in Table 5, the optimal threshold was identified as $c_{opt} = 0.39$. Applying this threshold to the sample observation calculated above, the estimated probability ${\hat{\pi}}_{i} = 0.999$ from Equation (56) is greater than $c_{opt}$ (i.e., ${\hat{\pi}}_{i} = 0.999 > c_{opt}$). Since the estimated probability exceeds the optimal threshold, the observation is classified as ${\hat{y}}_{i} = 1$. This indicates that the patient is predicted to be at high risk of Coronary Heart Disease (CHD), based on their specific parametric and nonparametric risk factors. The estimation results for the other in-sample observations are obtained by the same calculation. Using this optimal threshold across the entire dataset, the predictive performance of the SBLR model is evaluated.

3.2.6. Model Evaluation and Comparison

Based on the SBLR model estimates, we assess the suitability of the model using the Deviance statistic, D, which is calculated by the following formula [61,62]:

D = - 2 \sum_{i = 1}^{n} \{y_{i} \log {\hat{π}}_{i} + (1 - y_{i}) \log (1 - {\hat{π}}_{i})\} .

(58)
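
Equation (58) translates directly into code. The following Python sketch computes the Deviance from observed binary responses and estimated probabilities; the clipping constant `eps`, which guards against taking the logarithm of zero, is an assumption added for numerical safety and is not part of Equation (58). The toy data are hypothetical.

```python
import math


def binomial_deviance(y_true, pi_hat, eps=1e-12):
    """Deviance D = -2 * sum(y*log(pi) + (1 - y)*log(1 - pi))."""
    total = 0.0
    for y, p in zip(y_true, pi_hat):
        # Clip probabilities away from 0 and 1 (numerical safeguard only).
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -2.0 * total


# Hypothetical toy data, for illustration only.
y_toy = [1, 0, 1, 0]
p_toy = [0.9, 0.2, 0.6, 0.4]
deviance = binomial_deviance(y_toy, p_toy)
print(f"D = {deviance:.4f}")
```

Smaller values of D indicate that the estimated probabilities lie closer to the observed responses.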

The hypotheses used for testing the suitability of the semiparametric binary logistic regression (SBLR) model based on the Deviance value are H₀: the semiparametric binary logistic regression model is suitable, versus H₁: the semiparametric binary logistic regression model is not suitable. The null hypothesis (H₀) is rejected if the p-value is less than the significance level (α) [63,64,65].

Furthermore, we use the Press’s Q statistical test described by Maroco et al. [66] to test whether the model’s classification accuracy significantly exceeds chance. The Press’s Q formula, in which N is the total number of observations, n is the number of correctly classified observations, and K is the number of groups, is as follows:

$Press's\ Q = \frac{{(N - n K)}^{2}}{N (K - 1)} \sim \chi_{(1)}^{2}$

(59)
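
As a minimal sketch of Equation (59), the following Python function evaluates Press's Q in its standard form, with a squared numerator, and compares it with the 5% critical value of the chi-square distribution with 1 degree of freedom (3.841). The counts used in the example are hypothetical and are not taken from the paper's tables.

```python
CHI2_CRITICAL_1DF_05 = 3.841  # chi-square critical value, 1 df, alpha = 0.05


def press_q(n_total, n_correct, n_groups=2):
    """Press's Q = (N - n*K)^2 / (N*(K - 1))."""
    return (n_total - n_correct * n_groups) ** 2 / (n_total * (n_groups - 1))


# Hypothetical counts: 50 observations, 40 classified correctly, 2 groups.
q = press_q(n_total=50, n_correct=40)
better_than_chance = q > CHI2_CRITICAL_1DF_05
print(q, better_than_chance)
```

With two equally likely groups, a classifier that is right exactly half of the time gives Q = 0, and Q grows as the number of correct classifications moves away from N/K.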

A higher Press’s Q value indicates stronger predictive power [66,67,68,69]. In addition, to evaluate the symmetry of misclassification, we use the McNemar test, which is based on a $\chi^{2}$ test that compares the observed counts with those expected under the null hypothesis. The following statistic is (approximately) $\chi^{2}$ distributed with 1 degree of freedom [70,71,72,73,74,75,76]:

$McNemar = \frac{{(|FP - FN| - 1)}^{2}}{FP + FN} \sim \chi_{(1)}^{2}$

(60)
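
The continuity-corrected McNemar statistic can be sketched as follows. This Python snippet uses the absolute difference between the false positive (FP) and false negative (FN) counts, as in the standard form of the test, and obtains the p-value from the identity that the upper tail of a chi-square distribution with 1 degree of freedom equals erfc(sqrt(x/2)). The function name `mcnemar_test` and the counts are hypothetical.

```python
import math


def mcnemar_test(fp, fn):
    """Continuity-corrected McNemar statistic and its chi-square(1) p-value."""
    if fp + fn == 0:
        # No misclassifications of either type: nothing to compare.
        return 0.0, 1.0
    statistic = (abs(fp - fn) - 1) ** 2 / (fp + fn)
    # Upper-tail probability of chi-square with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# Hypothetical counts, for illustration only.
statistic, p_value = mcnemar_test(fp=5, fn=3)
print(f"statistic = {statistic:.3f}, p-value = {p_value:.4f}")
```

A p-value above 0.05 means that the FP and FN counts do not differ significantly, which is the desirable outcome when the test is used to check for systematic bias in the errors.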

The decision rule for Equation (60) is as follows: if the statistic is greater than $\chi_{(1; 0.05)}^{2} = 3.841$, or equivalently if the p-value is less than 0.05, we reject the null hypothesis (H₀) and conclude that the two types of misclassification (FP and FN) occur at different rates. In this context, a non-significant result is desirable, as it suggests no systematic bias in the model’s errors. In addition, the Brier Score (BS) [77,78,79,80,81,82,83,84] and the Brier Skill Score (BSS) [85,86,87,88] are widely used metrics for evaluating the accuracy of probabilistic predictions of binary outcomes in clinical research [89,90,91]. The BS and BSS are defined, respectively, by:

B S (y_{i}, {\hat{π}}_{i}) = \frac{1}{n} \sum_{i = 1}^{n} {({\hat{π}}_{i} - y_{i})}^{2}

(61)

$BSS (y_{i}, {\hat{\pi}}_{i}) = 1 - \frac{BS}{{BS}_{r}} .$

(62)

In Equation (62), ${BS}_{r}$ denotes the Brier Score of a reference (baseline) prediction.
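
Equations (61) and (62) can be sketched in a few lines of Python. The reference score is not specified in this section, so the snippet assumes that the reference prediction assigns every observation the observed event rate, which is a common choice; this is an assumption, and the function names and toy data are hypothetical.

```python
def brier_score(y_true, pi_hat):
    """Mean squared difference between estimated probabilities and outcomes."""
    return sum((p - y) ** 2 for y, p in zip(y_true, pi_hat)) / len(y_true)


def brier_skill_score(y_true, pi_hat):
    """BSS = 1 - BS / BS_r, with BS_r from a constant base-rate prediction
    (assumed reference; requires both classes to be present)."""
    base_rate = sum(y_true) / len(y_true)
    bs_ref = brier_score(y_true, [base_rate] * len(y_true))
    return 1.0 - brier_score(y_true, pi_hat) / bs_ref


# Hypothetical toy data, for illustration only.
y_toy = [1, 0, 1, 0]
p_toy = [0.9, 0.2, 0.6, 0.4]
bs = brier_score(y_toy, p_toy)
bss = brier_skill_score(y_toy, p_toy)
print(f"BS = {bs:.4f}, BSS = {bss:.4f}")
```

A BS of 0 corresponds to perfect probabilistic predictions, and a positive BSS indicates an improvement over the reference prediction.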

Here, we select standard parametric logistic regression as the primary baseline because it remains the standard in clinical research for odds-ratio interpretation. Furthermore, we compare the proposed Penalized Spline (P-Spline) semiparametric model against the P-Spline nonparametric model to explicitly demonstrate the improvement. These comparisons were chosen to highlight the progression from parametric to nonparametric to semiparametric penalized methods. We believe this comparison isolates the contribution of the P-Spline estimator in handling non-linearity relative to the baselines.

Table 6 presents a performance comparison between Binary Logistic Regression (BLR) as the parametric regression approach, Nonparametric Binary Logistic Regression (NBLR) based on penalized spline as the nonparametric regression approach, and the proposed SBLR based on penalized spline as the semiparametric regression approach. Based on the goodness-of-fit criteria, the proposed SBLR model demonstrates the best performance, with the lowest Deviance value (41.43), compared with 55.30 for BLR and 48.20 for NBLR. The p-value of the Deviance test for the SBLR model is 0.628, which is greater than the significance level $\alpha = 0.05$, suggesting that the model fits the data adequately.

Furthermore, based on Table 6, all models yielded a Press’s Q p-value of 0.000, confirming that their classification accuracy is significantly better than chance. Although the NBLR model based on penalized spline achieved the highest Press’s Q value (36.00), the proposed SBLR model followed closely with a value of 33.06, which is substantially higher than that of the parametric BLR model (18.06). This indicates that the proposed SBLR approach captures the predictive strength of nonparametric methods while maintaining interpretability.

According to Table 6, the McNemar test for the proposed SBLR model yields a p-value of 0.5050, which is greater than 0.05. This non-significant result indicates the absence of significant systematic bias in the model’s misclassifications. In terms of predictive accuracy, the proposed model outperforms the others with the lowest Brier Score (i.e., 0.1048) and the highest Brier Skill Score (i.e., 0.5741). Thus, the SBLR is confirmed as the most suitable and statistically valid model for the CHD dataset in this study. The classification results for the in-sample, out-sample, and overall datasets are presented in Table 7.

Table 7 and Figure 2 present a comprehensive comparison of classification performance across Training, Testing, and Full Dataset scenarios using Sensitivity, Specificity, Accuracy, and Area Under the Curve (AUC) metrics [91]. The analysis reveals that purely nonparametric methods, despite their high flexibility, are prone to over-fitting when the data contains underlying parametric structures. This is clearly demonstrated by the PS-NBLR (Penalized Spline Nonparametric Binary Logistic Regression) model, which achieved excellent training metrics (AUC = 0.915) but failed to generalize, with Testing Accuracy plummeting to 56.25%. This performance gap indicates that omitting the fixed parametric components destabilizes prediction on unseen data. Conversely, the parametric BLR model also proved inadequate (Testing Accuracy = 62.50%), likely due to its inability to capture non-linear complexities. The SBLR model resolves these limitations by integrating a parametric part to handle systematic patterns, thereby mitigating over-fitting and ensuring a more robust generalization capability compared to its counterparts.

The proposed model, namely, the SBLR model, demonstrates robustness and stability. Based on Figure 2 and Figure 3, and Table 7, the SBLR model maintained the highest consistency when applied to the testing data, achieving a balanced Sensitivity and Specificity of 0.75, along with the highest Testing Accuracy of 75% and a Testing AUC of 0.81. Furthermore, across the entire dataset, the SBLR model tied for the highest AUC (0.899) but offered a more balanced trade-off between sensitivity (0.82) and specificity (0.86). These results suggest that the proposed semiparametric approach successfully mitigates the over-fitting issues inherent in purely nonparametric models while offering greater flexibility than parametric ones, making it the most reliable model for predictive applications.

Based on the results described above, we can summarize the following statistical tests used to validate the model’s significance and robustness:

(a): Deviance Test (Goodness-of-Fit). We utilized the Deviance statistic to formally test the suitability of the model. The test does not reject the null hypothesis that the SBLR model is suitable (p-value = 0.628), which indicates that the proposed SBLR model fits the data adequately.
(b): Press’s Q Statistic. To validate the predictive accuracy, we employed the Press’s Q test. The calculated value exceeds the critical value, statistically confirming that the classification accuracy of our model is significantly better than chance.
(c): McNemar’s Test. To check the symmetry of misclassification, we conducted McNemar’s test. The non-significant result for the proposed SBLR model (p-value = 0.5050) indicates that its false positives and false negatives do not differ systematically, i.e., there is no significant systematic bias in its errors.

3.2.7. Limitations and Future Work

The proposed Semiparametric Binary Logistic Regression (SBLR) model relies on several key statistical assumptions. The model presumes the correct specification of the logit link function to relate the binary response variable to the systematic component, which consists of both parametric (linear) and nonparametric (smooth) terms. While the penalized spline estimator effectively captures non-linear patterns, the flexibility of the model presents certain limitations. Specifically, although the Generalized Approximate Cross-Validation (GACV) criterion is employed to optimize the smoothing parameter, the method requires a sufficient sample size to reliably estimate the local behavior of the spline functions without over-fitting. Furthermore, as the number of knots increases to capture finer data structures, the computational intensity of the Iteratively Reweighted Penalized Least Squares (IRPLS) algorithm may increase, necessitating careful consideration of the trade-off between model complexity and computational efficiency.

Note that although this model demonstrates high internal validity, external validation on a multi-center dataset is recommended for further work to improve generalizability. More explicitly, there are two main limitations of this proposed method, namely:

(a): Computational Complexity. The selection of the optimal smoothing parameter (λ) for multiple predictors can be computationally intensive compared to simple parametric models.
(b): Data Constraints. The study relies on cross-sectional data, which limits causal inference compared to longitudinal data. For future work, we propose extending this estimator to Generalized Additive Models (GAM) for high-dimensional data and applying it to longitudinal datasets to observe risk factor progression over time.

4. Conclusions

The results of this research demonstrate the efficacy of the proposed model, namely, the Semiparametric Binary Logistic Regression (SBLR) model based on penalized spline, in identifying critical non-linear thresholds for Coronary Heart Disease (CHD) risk factors. Unlike the Binary Logistic Regression (BLR) model and the Nonparametric Binary Logistic Regression (NBLR) model based on penalized spline, the SBLR model revealed knot points that serve as physiological turning points for disease risk. Key findings indicate that Age ($X_{i}$) maintains a positive linear association with CHD risk. However, continuous variables such as Body Weight, Sugary Food Consumption, and Stress Level exhibit significant non-linear behaviors. Specifically, the analysis identified a critical risk threshold for Sugary Food Consumption ($t_{3 i}$) at 4.34 g, where the risk shifts from protective to hazardous. Similarly, Fatty Food Consumption ($t_{4 i}$) shows a clear transition to increased risk beyond 17.29 g. Most notably, the model captured a distinct risk pattern in Stress Level ($t_{5 i}$). While low stress (less than 5) is protective, moderate stress (score 5–7) was identified as the most dangerous range, with the highest Odds Ratio (OR = 3.0535), and severe stress (greater than 7) is also associated with increased risk (OR = 1.4853). The results also confirm that the SBLR model offers a robust alternative by effectively balancing interpretability and predictive precision.

Statistically, the proposed model outperformed both parametric and nonparametric baselines. The Nonparametric Binary Logistic Regression (NBLR) model based on penalized spline exhibited signs of overfitting, evidenced by a sharp decline in testing accuracy to 56.25%. The SBLR model maintained superior generalization capabilities. It achieved the highest testing accuracy (75%) and testing AUC (0.81), alongside a symmetric error distribution (McNemar p-value = 0.5050), confirming the absence of significant systematic bias. The implication of these results is that the SBLR model is statistically valid and highly effective for predicting the risk of CHD. Therefore, in the future, the results of this study can be used as a basis for an early warning system, specifically alerting individuals with moderate stress levels and dietary habits exceeding the identified thresholds to be aware of the heightened probability of developing CHD.

Author Contributions

All authors have contributed to this research article, namely, Conceptualization, N.C., M.R. and N.R.A.A.S.; Methodology, N.C., D.A. and N.R.A.A.S.; Software, M.R., N.R.A.A.S. and D.A.; Validation, N.C. and B.L.; Formal analysis, N.C., B.L., D.A. and N.R.A.A.S.; Investigation, N.C. and M.R.; Resources and Data curation, N.C. and N.R.A.A.S.; Writing—original draft preparation, N.R.A.A.S.; Writing—review and editing, N.C. and B.L.; Visualization, N.C., B.L. and D.A.; Supervision, N.C. and M.R.; Project administration, N.C. and M.R.; Funding acquisition, N.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Directorate of Research and Community Service (Direktorat Riset dan Pengabdian kepada Masyarakat-DRPM), the Ministry of Research, Technology and Higher Education of the Republic of Indonesia, through the Regular Fundamental Research Scheme grant (SKIM PFR DIPA-DRPM) fiscal year 2025, with the Master Contract No.: 059/C3/DT.05.00/PL/2025 and the Derivative Contract No.: 2351/B/UN3.LPPM/PT.01.03/2025.

Institutional Review Board Statement

The study was conducted in accordance with the Declaration of Helsinki, and approved by the Institutional Review Board (or Ethics Committee) of UNIVERSITAS AIRLANGGA (protocol number: UA–02 –24072 and date of approval: 6 June 2024).

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

The authors confirm that the data used in this study are available from the corresponding author upon reasonable request.

Acknowledgments

Authors thank the Directorate of Research and Community Service (Direktorat Riset dan Pengabdian kepada Masyarakat-DRPM), Ministry of Research, Technology and Higher Education of the Republic of Indonesia, for funding this research. In addition, authors thank the Cardiology Clinic of the Universitas Airlangga Hospital for providing data and research facilities, and also thank Airlangga University for technical support.

Conflicts of Interest

The authors declare no conflicts of interest. Also, the funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the article, or in the decision to publish the results.

Abbreviations

The following abbreviations are used in this manuscript:

PS-SBLR	Penalized Spline–Semiparametric Binary Logistic Regression
PS-NBLR	Penalized Spline–Nonparametric Binary Logistic Regression
BS	Brier Score
BSS	Brier Skill Score
GACV	Generalized Approximate Cross-Validation
BLR	Binary Logistic Regression
GLM	Generalized Linear Model
GAM	Generalized Additive Model
GAPLM	Generalized Additive Partial Linear Model

References

1. Bewick, V.; Cheek, L.; Ball, J. Statistics Review 14: Logistic Regression. Crit. Care 2005, 9, 112–118.
2. Nick, T.G.; Campbell, K.M. Logistic Regression. Methods Mol. Biol. 2007, 404, 273–301.
3. Domínguez-Almendros, S.; Benítez-Parejo, N.; Gonzalez-Ramirez, A.R. Logistic Regression Models. Allergol. Immunopathol. 2011, 39, 295–305.
4. Stoltzfus, J.C. Logistic Regression: A Brief Primer. Acad. Emerg. Med. 2011, 18, 1099–1104.
5. Kleinbaum, D.G.; Klein, M. Logistic Regression: A Self-Learning Text, 3rd ed.; Springer Nature: Berlin/Heidelberg, Germany, 2010.
6. Hosmer, D.W., Jr.; Lemeshow, S.; Sturdivant, R.X. Applied Logistic Regression; John Wiley & Sons: Hoboken, NJ, USA, 2013.
7. Sperandei, S. Understanding Logistic Regression Analysis. Biochem. Med. 2014, 24, 12–18.
8. Ranganathan, P.; Pramesh, C.S.; Aggarwal, R. Common Pitfalls in Statistical Analysis: Logistic Regression. Perspect. Clin. Res. 2017, 8, 148–151.
9. Wang, Q.Q.; Yu, S.C.; Qi, X.; Hu, Y.H.; Zheng, W.J.; Shi, J.X.; Yao, H.Y. Overview of Logistic Regression Model Analysis and Application. Chin. J. Prev. Med. 2019, 53, 955–960.
10. Yu, X.; Li, S.; Chen, J. A Three-parameter Logistic Regression Model. Stat. Theory Relat. Fields 2021, 5, 265–274.
11. Castro, H.M.; Ferreira, J.C. Linear and Logistic Regression Models: When to Use and How to Interpret them? J. Bras. Pneumol. 2022, 48, e20220439.
12. Wang, T.; Tang, W.; Lin, Y.; Su, W. Semi-Supervised Inference for Nonparametric Logistic Regression. Stats. Med. 2023, 42, 2573–2589.
13. Yara, A.; Terada, Y. Nonparametric Logistic Regression with Deep Learning. Bernoulli 2026, 32, 952–977.
14. Bonnini, S.; Borghesi, M. Nonparametric Test for Logistic Regression with Application to Italian Enterprises’ Propensity for Innovation. Mathematics 2024, 12, 2955.
15. Kima, S.; Bak, K.-Y. Nonparametric Logistic Regression Based on Sparse Triangulation Over a Compact Domain. Commun. Stat. Appl. Methods 2024, 31, 557–569.
16. Ali, T.H. Modification of the Adaptive Nadaraya-Watson Kernel Method for Nonparametric Regression (Simulation Study). Commun. Stat.-Simul. Comput. 2022, 51, 391–403.
17. Linke, Y.; Borisov, I.S.; Ruzankin, P.; Kutsenko, V.A.; Yarovaya, E.; Shalnova, S.A.S. Universal Local Linear Kernel Estimators in Nonparametric Regression. Mathematics 2022, 10, 2693.
18. Chamidah, N.; Lestari, B.; Larasati, T.N.; Muniroh, L. Designing Z-Score Standard Growth Charts Based on Height-for-Age of Toddlers Using Local Linear Estimator for Determining Stunting. AIP Conf. Proc. 2024, 3083, 030002.
19. Aydin, D.; Chamidah, N.; Lestari, B.; Mohammad, S.; Yilmaz, E. Local Polynomial Estimation for Multi-response Semiparametric Regression Models with Right Censored Data. Commun. Stats. Simul. Comput. 2025, 1–32.
20. Utami, T.W.; Chamidah, N.; Saifudin, T.; Lestari, B.; Aydin, D. Estimation of Biresponse Semiparametric Regression Model for Longitudinal Data Using Local Polynomial Kernel Estimator. Symmetry 2025, 17, 392.
21. Islamiyati, A.; Kalondeng, A.; Sunusi, N.; Zakir, M.; Amir, A.K. Biresponse Nonparametric Regression Model in Principal Component Analysis with Truncated Spline Estimator. J. King Saud. Univ.-Sci. 2022, 34, 101892.
22. Chamidah, N.; Lestari, B.; Susilo, H.; Dewi, T.K.; Saifudin, T.; Siregar, N.R.A.A.; Aydin, D. Modeling Coronary Heart Disease Risk Based on Age, Fatty Food Consumption and Anxiety Factors Using Penalized Spline Nonparametric Logistic Regression. MethodsX 2025, 14, 103320.
23. Chamidah, N.; Lestari, B.; Wulandari, A.Y.; Muniroh, L. Z-Score Standard Growth Chart Design of Toddler Weight Using Least Square Spline Semiparametric Regression. AIP Conf. Proc. 2021, 2329, 060031.
24. Lestari, B.; Chamidah, N.; Aydin, D.; Yilmaz, E. Reproducing Kernel Hilbert Space Approach to Multiresponse Smoothing Spline Regression Function. Symmetry 2022, 14, 2227.
25. Lestari, B.; Chamidah, N.; Budiantara, I.N.; Aydin, D. Determining Confidence Interval and Asymptotic Distribution for Parameters of Multiresponse Semiparametric Regression Model Using Smoothing Spline Estimator. J. King Saud. Univ.-Sci. 2023, 35, 102664.
26. Wang, Y. Smoothing Splines: Methods and Applications; CRC Press: New York, NY, USA, 2011.
27. Zulfadhli, M.; Budiantara, I.N.; Ratnasari, V. Nonparametric Regression Estimator of Multivariable Fourier Series for Categorical Data. MethodsX 2024, 13, 102983.
28. Chamidah, N.; Lestari, B.; Budiantara, I.N.; Aydin, D. Estimation of Multiresponse Multipredictor Nonparametric Regression Model Using Mixed Estimator. Symmetry 2024, 16, 386.
29. Alswaitti, M.; Siddique, K.; Jiang, S.; Alomoush, W.; Alrosan, A. Dimensionality Reduction, Modelling, and Optimization of Multivariate Problems Based on Machine Learning. Symmetry 2022, 14, 1282.
30. Reddy, T.A.; Henze, G.P. Parametric and Non-Parametric Regression Methods: Applied Data Analysis and Modeling for Energy Engineers and Scientists; Springer: Berlin/Heidelberg, Germany, 2023; pp. 355–407.
31. Carroll, R.J.; Wand, M.P. Semiparametric Estimation in Logistic Measurement Error Models. J. R. Stat. Soc. Ser. B (Methodol.) 1991, 53, 573–585.
32. Fang, F.; Li, J.; Xia, X. Semiparametric Model Averaging Prediction for Dichotomous Response. J. Econom. 2020, 229, 219–245.
33. Hesamian, G.; Akbari, M.G. Semi-parametric Partially Logistic Regression Model with Exact Inputs and Intuitionistic Fuzzy Outputs. Appl. Soft Comput. 2017, 58, 517–526.
34. Razzaq, A.; Shemaila, H.A. Fuzzy Semi-Parametric Logistic Quantile Regression Model. Wasit J. Pure Sci. 2023, 2, 184–194.
35. Breslow, N.E.; Robins, J.M.; Wellner, J.A. On the Semi-parametric Efficiency of Logistic Regression Under Case-Control Sampling. Bernoulli 2000, 6, 447–455.
36. Zheng, X.; Rong, Y.; Liu, L.; Cheng, W. A More Accurate Estimation of Semiparametric Logistic Regression. Mathematics 2021, 9, 2376.
37. Mullah, M.; Hanley, J.A.; Benedetti, A. LASSO Type Penalized Spline Regression for Binary Data. BMC Med. Res. Methodol. 2021, 21, 83.
38. Yu, J.; Shi, J.; Liu, A.; Wang, Y. Smoothing Spline Semiparametric Density Models. J. Am. Stat. Assoc. 2022, 117, 237–250.
39. Maharani, A.; Tampubolon, G. Unmet Needs for Cardiovascular Care in Indonesia. PLoS ONE 2014, 9, e105831.
40. Savira, F.; Wang, B.; Kompa, A.R.; Ademi, Z.; Owen, A.; Liew, D.; Zomer, E. The Impact of Coronary Heart Disease Prevention on Work Productivity: A 10-Year Analysis. Eur. J. Prev. Cardiol. 2021, 28, 418–425.
41. WHO. Non-Communicable Diseases Country Profiles 2018; World Health Organization (WHO): Geneva, Switzerland, 2018; Available online: https://iris.who.int/handle/10665/274512 (accessed on 15 December 2025).
42. Kementerian Kesehatan RI. Laporan Nasional Riskesdas 2018; Kementerian Kesehatan RI: Jakarta, Indonesia, 2018. Available online: https://repository.kemkes.go.id/book/1323 (accessed on 20 December 2025).
43. Hosseini, K.; Mortazavi, S.H.; Sadeghian, S.; Ayati, A.; Nalini, M.; Aminorroaya, A.; Tavolinejad, H.; Salarifar, M.; Pourhosseini, H.; Aein, A. Prevalence and Trends of Coronary Artery Disease Risk Factors and Their Effect on Age of Diagnosis in Patients with Established Coronary Artery Disease: Tehran Heart Center (2005–2015). BMC Cardiovasc. Disord. 2021, 21, 477.
44. Lee, Y.T.H.; Fang, J.; Schieb, L.; Park, S.; Casper, M.; Gillespie, C. Prevalence and Trends of Coronary Heart Disease in the United States, 2011 to 2018. JAMA Cardiol. 2022, 7, 459–462.
45. Rethemiotaki, I. Global Prevalence of Cardiovascular Diseases by Gender and Age during 2010–2019. Arch. Med. Sci. Atheroscler. Dis. 2023, 8, e196.
46. Meyer, J.F.; Larsen, S.B.; Blond, K.; Damsgaard, C.T.; Bjerregaard, L.G.; Baker, J.L. Associations Between Body Mass Index and Height during Childhood and Adolescence and the Risk of Coronary Heart Disease in Adulthood: A Systematic Review and Meta-Analysis. Obes. Rev. 2021, 22, e13276.
47. Juzar, D.A. Pedoman Tatalaksana Sindroma Koroner Akut; Perhimpunan Dokter Spesialis Kardiovaskular Indonesia (PERKI): Jakarta, Indonesia, 2024.
48. Temple, N.J. Fat, Sugar, Whole Grains and Heart Disease: 50 Years of Confusion. Nutrients 2018, 10, 39.
49. Wirtz, P.H.; Von Känel, R. Psychological Stress, Inflammation, and Coronary Heart Disease. Curr. Cardiol. Rep. 2017, 19, 111.
50. Marra, G.; Radice, R. Penalised Regression Splines: Theory and Application to Medical Research. Stat. Methods Med. Res. 2010, 19, 107–125.
51. Harrell, F.E., Jr. Regression Modeling Strategies with Applications to Linear Models, Logistic and Ordinal Regression, and Survival Analysis, 2nd ed.; Springer International Publishing: Cham, Switzerland, 2015.
52. Gauthier, J.; Wu, Q.V.; Gooley, T.A. Cubic Splines to Model Relationships Between Continuous Variables and Outcomes: A Guide for Clinicians. Bone Marrow Transpl. 2020, 55, 675–680, Erratum in Bone Marrow Transpl. 2023, 58, 962. https://doi.org/10.1038/s41409-023-01993-7.
53. Saveliev, A.A.; Galeeva, E.V.; Semanov, D.A.; Galeev, R.R.; Aryslanov, I.R.; Falaleeva, T.S.; Davletshin, R.R. Adaptive Noise Model Based Iteratively Reweighted Penalized Least Squares for Fluorescence Background Subtraction from Raman Spectra. J. Raman Spectrosc. 2022, 53, 247–255.
54. Liu, R.; Härdle, W.K. Statistical Inference for Generalized Additive Partially Linear Model. J. Multivar. Anal. 2017, 162, 1–15.
55. Ma, S.; Kosorok, M.R. Penalized Log-Likelihood Estimation for Partly Linear Transformation Models with Current Status Data. Ann. Stats. 2005, 33, 2256–2290.
56. Cole, S.R.; Chu, H.; Greenland, S. Maximum Likelihood, Profile Likelihood, and Penalized Likelihood: A Primer. Am. J. Epidemiol. 2013, 179, 252–260.
57. Manghi, R.F.; Cysneiros, F.J.A.; Paula, G.A. Generalized Additive Partial Linear Models for Analyzing Correlated Data. Comput. Stat. Data Anal. 2019, 129, 47–60.
58. Xiang, D.; Wahba, G. A Generalized Approximate Cross Validation for Smoothing Splines with Non-Gaussian Data. Stat. Sin. 1996, 6, 675–692. Available online: https://www3.stat.sinica.edu.tw/statistica/oldpdf/A6n312.pdf (accessed on 20 December 2025).
59. Uçar, M.K. Eta Correlation Coefficient Based Feature Selection Algorithm for Machine Learning: E-Score Feature Selection Algorithm. J. Intell. Syst. Theory Appl. 2019, 2, 7–12.
60. Ruopp, M.D.; Perkins, N.J.; Whitcomb, B.W.; Schisterman, E.F. Youden Index and Optimal Cut-Point Estimated from Observations Affected by a Lower Limit of Detection. Biom. J. 2008, 50, 419–430.
61. Harris, J.K. Primer on Binary Logistic Regression. Fam. Med. Community Health 2021, 9, e001290.
62. Western, B. Concepts and Suggestions for Robust Regression Analysis. Amer. J. Political Sci. 1995, 39, 786–817.
63. Taneichi, N.; Sekiya, Y.; Toyama, J. Improved Transformed Deviance Statistic for Testing a Logistic Regression Model. J. Multivar. Anal. 2011, 102, 1263–1279.
64. Shen, J.; He, X. Generalized F Test and Generalized Deviance Test in Two-Way ANOVA Models for Randomized Trials. J. Biopharm. Stat. 2014, 24, 523–534.
65. Hardle, W.K.; Huang, L.-S. Analysis of Deviance for Hypothesis Testing in Generalized Partially Linear Models. J. Bus. Econ. Stat. 2017, 37, 322–333.
66. Maroco, J.; Silva, D.; Rodrigues, A.; Guerreiro, M.; Santana, I.; De Mendonça, A. Data Mining Methods in the Prediction of Dementia: A Real-Data Comparison of the Accuracy, Sensitivity and Specificity of Linear Discriminant Analysis, Logistic Regression, Neural Networks, Support Vector Machines, Classification Trees and Random Forests. BMC Res. Notes 2011, 4, 299.
67. Veronese, G.; Pepe, A.; Giordano, F. Child Psychological Adjustment to War and Displacement: A Discriminant Analysis of Resilience and Trauma in Syrian Refugee Children. J. Child. Fam. Stud. 2021, 30, 2575–2588, Erratum in J. Child. Fam. Stud. 2022, 31, 337. https://doi.org/10.1007/s10826-021-02118-8.
Ruiz, E.D.; González, F.J.N.; Jurado, J.M.L.; Arbulu, A.A.; Bermejo, J.V.D.; Ariza, A.G. Effects of Supplementation of Different Antioxidants to Cryopreservation Extender on the Post-Thaw Quality of Rooster Semen—A Meta-Analysis. Animals 2024, 14, 2936. [Google Scholar] [CrossRef]
Ruiz, E.D.; Bermejo, J.V.D.; Ariza, A.G.; Jurado, J.M.L.; Arbulu, A.A.; González, F.J.N. Effects of Meteorology and Lunar cycle on the Post-Thawing Quality of Avian Sperm. Front. Vet. Sci. 2024, 11, 1394004. [Google Scholar] [CrossRef]
Levin, J.R.; Serlin, R.C. Changing Students’ Perspectives of McNemar’s Test of Change. J. Stats. Educ. 2000, 8, 1–9. [Google Scholar] [CrossRef]
Simion, C.; Borza, S.I. Inspection Performance’s Estimation Using McNemar Statistical Test. In Proceedings of the 2005 WSEAS Int. Conf. on Dynamical Systems and Control, Venice, Italy, 2–4 November 2005; pp. 131–135. [Google Scholar]
Fisher, M.J.; Marshall, A.P.; Mitchell, M. Testing Differences in Proportions. Aust. Crit. Care 2011, 24, 133–138. [Google Scholar] [CrossRef] [PubMed]
Adedokun, O.A.; Burgess, W.D. Analysis of Paired Dichotomous Data: A Gentle Introduction to the McNemar Test in SPSS. J. Multidiscip. Eval. 2012, 8, 125–131. [Google Scholar] [CrossRef]
Fagerland, M.W.; Lydersen, S.; Laake, P. The McNemar Test for Binary Matched-Pairs Data: Mid-p and Asymptotic are Better Than Exact Conditional. BMC Med. Res. Methodol. 2013, 13, 91. [Google Scholar] [CrossRef] [PubMed]
Smith, M.Q.R.P.; Ruxton, G.D. Effective Use of the McNemar Test. Behav. Ecol. Sociobiol. 2020, 74, 133. [Google Scholar] [CrossRef]
Wu, Y. Weighted McNemar’s Test for the Comparison of Two Screening Tests in the Presence of Verification Bias. Stat. Med. 2022, 41, 3149–3163. [Google Scholar] [CrossRef]
Roulston, M.S. Performance Targets and the Brier Score. Meteorol. Appl. 2007, 14, 105–207. [Google Scholar] [CrossRef]
Weigel, A.P.; Liniger, M.A.; Appenzeller, C. The Discrete Brier and Ranked Probability Skill Scores. Mon. Weather Rev. 2007, 135, 118–124. [Google Scholar] [CrossRef]
Ferro, C.A.T.; Fricker, T.E. A Bias-Corrected Decomposition of the Brier Score. Q. J. R. Meteorol. Soc. 2012, 138, 1954–1960. [Google Scholar] [CrossRef]
Assel, M.; Sjoberg, D.D.; Vickers, A.J. The Brier Score Does Not Evaluate the Clinical Utility of Diagnostic Tests or Prediction Models. Diagn. Progn. Res. 2017, 1, 19. [Google Scholar] [CrossRef]
Yang, W.; Jiang, J.; Schnellinger, E.M.; Kimmel, S.E.; Guo, W. Modified Brier Score for Evaluating Prediction Accuracy for Binary Outcomes. Stat. Methods Med. Res. 2022, 31, 2287–2296. [Google Scholar] [CrossRef]
Patel, A.R.; Dhingra, A.; Liggesmeyer, P. Leveraging the Brier Score to Enhance Predictive Accuracy in Learning-Based Risk Assessment. In Proceedings of the 8th International Conference on System Reliability and Safety (ICSRS), Sicily, Italy, 20–22 November 2024; pp. 418–425. [Google Scholar] [CrossRef]
Zhu, K.; Zheng, Y.; Chan, K.C.G. Weighted Brier Score—An Overall Summary Measure for Risk Prediction Models with Clinical Utility Consideration. Stat. Biosci. 2025, 1–29, Erratum in Stats. Biosci. 2025. https://doi.org/10.1007/s12561-025-09507-3. [Google Scholar] [CrossRef] [PubMed]
Hoessly, L. On Misconceptions about the Brier Score in Binary Prediction Models. Glob. Epidemiol. 2026, 11, 100242. [Google Scholar] [CrossRef] [PubMed]
Bradley, A.A.; Schwartz, S.S.; Hashino, T. Sampling Uncertainty and Confidence Intervals for the Brier Score and Brier Skill Score. Weather Forecast. 2008, 23, 992–1006. [Google Scholar] [CrossRef]
Hoss, F.; Fischbeck, P.S. Performance and Robustness of Probabilistic River Forecasts Computed with Quantile Regression Based on Multiple Independent Variables in the North Central USA. Hydrol. Earth Syst. Sci. Discuss. 2014, 11, 11281–11333. [Google Scholar] [CrossRef]
Bosboom, J.; Reniers, A. The Deceptive Simplicity of the Brier Skill Score. Coast. Eng. Environ. Fluid. Mech. 2017, 22, 1639–1663. [Google Scholar] [CrossRef]
Li, Y.; Wang, X.; Zhao, B.; Ying, M.; Liu, Y.; Vitart, F. Using the Debiased Brier Skill Score to Evaluate S2S Tropical Cyclone Forecasting. J. Mar. Sci. Eng. 2025, 13, 1035. [Google Scholar] [CrossRef]
Steyerberg, E.W. Statistical Models for Prediction. In Clinical Prediction Models; Statistics for Biology and Health; Springer International Publishing: Cham, Switzerland, 2019; pp. 59–93. ISBN 978-3-030-16398-3. [Google Scholar]
Steyerberg, E.W. Clinical Prediction Models: A Practical Approach to Development, Validation and Updating. Biom. J. 2020, 62, 1122–1123. [Google Scholar] [CrossRef]
Wibowo, W.; Amelia, R.; Octavia, F.A.; Wilantari, R.N. Classification Using Nonparametric Logistic Regression for Predicting Working Status. AIP Conf. Proc. 2021, 2329, 060032. [Google Scholar] [CrossRef]

Figure 1. Observed logit plot of each predictor variable, including (a) Age, (b) Body Weight, (c) Body Height, (d) Sugar Consumption, (e) Fat Consumption, and (f) Stress Level.

Figure 2. Comparison of metric evaluations for each model.

Figure 3. Confusion matrix of SBLR model.

Table 1. Statistical description for every predictor variable.

Variable	Status	Minimum	Maximum	Mean	Variance
Age	Non-CHD	18	78	39.7	380.25
Age	CHD	31	74	59.3	104.04
Body Weight	Non-CHD	36	97	63.8	176.89
Body Weight	CHD	38	100	69.2	141.61
Body Height	Non-CHD	147	178	160.7	60.83
Body Height	CHD	155	178	164.4	30.58
Sugary Food Consumption	Non-CHD	0.63	32.27	10.42	68.66
Sugary Food Consumption	CHD	1.57	40.29	14.69	125.02
Fatty Food Consumption	Non-CHD	2.84	54.64	21.37	164.16
Fatty Food Consumption	CHD	2.52	42.93	16.73	96.78
Stress Level	Non-CHD	0	14	5.27	11.56
Stress Level	CHD	0	12	4.167	12.92

Table 2. Calculation results of the linearity test for every predictor.

Variable	Value of $η$	Value of $η^{2}$	p-Value	Decision
Age	0.541	0.293	0.000	Reject H₀ (Linear)
Body Weight	0.191	0.036	0.131	Do not reject H₀ (Non-Linear)
Body Height	0.246	0.060	0.051	Do not reject H₀ (Non-Linear)
Sugary Food Consumption	0.179	0.032	0.155	Do not reject H₀ (Non-Linear)
Fatty Food Consumption	0.193	0.037	0.126	Do not reject H₀ (Non-Linear)
Stress Level	0.103	0.011	0.416	Do not reject H₀ (Non-Linear)

Table 3. Results obtained from R-code output for nonparametric components.

Predictor Variable	Number of Knots	Knot Points	Smoothing Parameter ($λ$)	Minimum GACV Value
Body Weight	2	63; 71	0.045	0.6377
Body Height	3	159; 162.5; 168	0.06	0.6552
Sugary Food Consumption	3	4.34; 8.40; 18.83	0.078	0.6730
Fatty Food Consumption	1	17.29	2	0.6728
Stress Level	3	2; 5; 7	0.0011	0.6641

Table 4. Calculation results of odds ratio for each predictor variable.

Predictor Variable	Knot Points	Parameter Estimation	Odds Ratio (OR)	Risk Effect of CHD
Age $(X_{i})$ /(year)	-	0.1360	1.1457	Increase
Body Weight $(t_{1 i})$ /(kg)	$t_{1 i} < 63$	−0.1722	0.8418	Decrease
	$63 \leq t_{1 i} < 71$	0.6307	1.8789	Increase
	$t_{1 i} \geq 71$	0.1239	1.1319	Increase
Body Height $(t_{2 i})$ /(cm)	$t_{2 i} < 159$	0.0059	1.0059	Increase
	$159 \leq t_{2 i} < 162.5$	0.1990	1.2201	Increase
	$162.5 \leq t_{2 i} < 168$	−0.4571	0.6331	Decrease
	$t_{2 i} \geq 168$	0.1695	1.1847	Increase
Sugary Food Consumption $(t_{3 i})$ /(gram)	$t_{3 i} < 4.34$	−0.3675	0.6925	Decrease
	$4.34 \leq t_{3 i} < 8.40$	0.3196	1.3766	Increase
	$8.40 \leq t_{3 i} < 18.83$	0.2075	1.2306	Increase
	$t_{3 i} \geq 18.83$	0.3045	1.3559	Increase
Fatty Food Consumption $(t_{4 i})$ /(gram)	$t_{4 i} < 17.29$	−0.0273	0.9731	Decrease
	$t_{4 i} \geq 17.29$	0.1297	1.1385	Increase
Stress Level $(t_{5 i})$ /(point)	$t_{5 i} < 2$	−0.6763	0.5085	Decrease
	$2 \leq t_{5 i} < 5$	−0.5538	0.5748	Decrease
	$5 \leq t_{5 i} < 7$	1.1163	3.0535	Increase
	$t_{5 i} \geq 7$	0.3956	1.4853	Increase
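
The odds ratios in Table 4 can be reproduced by exponentiating the corresponding parameter estimates, OR = exp(estimate). The following Python sketch checks this for a few rows copied from the table. The dictionary keys and the function name `odds_ratio` are illustrative labels chosen here; they are not taken from the authors' R code.

```python
import math

# Parameter estimates copied from Table 4 (a subset of rows).
# The keys are illustrative labels, not identifiers from the original analysis.
estimates = {
    "Age": 0.1360,
    "Body Weight, t < 63": -0.1722,
    "Body Weight, 63 <= t < 71": 0.6307,
    "Stress Level, 5 <= t < 7": 1.1163,
}


def odds_ratio(beta: float) -> float:
    """Odds ratio for a one-unit increase, given a slope on the logit scale."""
    return math.exp(beta)


odds_ratios = {name: odds_ratio(beta) for name, beta in estimates.items()}

for name, value in odds_ratios.items():
    # An odds ratio above 1 raises the odds of CHD; below 1 lowers them.
    effect = "Increase" if value > 1 else "Decrease"
    print(f"{name}: OR = {value:.4f} ({effect})")
```

Read this way, the Age estimate of 0.1360 corresponds to roughly a 14.6% increase in the odds of CHD per additional year, with the other predictors held fixed.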

Table 5. Results of optimal threshold by Youden’s J statistic.

Threshold	Sensitivity	Specificity	Youden’s J
0.39–0.46	0.8929	0.8333	0.7262
0.38	0.8929	0.8056	0.6984
0.47	0.8214	0.8611	0.6825
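
Youden's J statistic is J = sensitivity + specificity − 1, and the optimal threshold is the one that maximizes J. The short Python sketch below recomputes J from the sensitivity and specificity values in Table 5 and picks the best row. The variable names are illustrative, and small differences in the fourth decimal place are expected because the table inputs are already rounded.

```python
# Sensitivity and specificity copied from Table 5, keyed by threshold label.
rows = {
    "0.39-0.46": (0.8929, 0.8333),
    "0.38": (0.8929, 0.8056),
    "0.47": (0.8214, 0.8611),
}


def youden_j(sensitivity: float, specificity: float) -> float:
    """Youden's J statistic: sensitivity + specificity - 1."""
    return sensitivity + specificity - 1.0


scores = {thr: youden_j(se, sp) for thr, (se, sp) in rows.items()}

# The optimal threshold is the one with the largest J.
best_threshold = max(scores, key=scores.get)

for thr, j in scores.items():
    print(f"threshold {thr}: J = {j:.4f}")
print("best:", best_threshold)
```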

Table 6. Evaluation of goodness-of-fit criteria for model comparison.

Model Evaluation Criteria	BLR (Parametric Model)	NBLR (Nonparametric Model)	SBLR (Proposed Model)
Deviance	55.3011	48.2016	41.4327
Deviance p-value	0.5391	0.5861	0.6281
Press-Q	18.0625	36	33.0625
Press-Q p-value	0.000	0.000	0.000
McNemar Test	0.0833	1.1250	0.4444
McNemar Test p-value	0.7728	0.2888	0.5050
Brier Score	0.1449	0.1191	0.1048
Brier Skill Score	0.4111	0.5161	0.5741
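
Two rows of Table 6 can be cross-checked with a few lines of Python. The sketch rests on two assumptions that the table itself does not state: that the Brier Skill Score follows the usual definition BSS = 1 − BS/BS_ref, and that the McNemar p-values are upper-tail probabilities of a chi-square distribution with one degree of freedom. The reference Brier score is not reported, so the code backs it out from each column; all function names are illustrative.

```python
import math

# Values copied from Table 6, in the order BLR, NBLR, SBLR.
brier = [0.1449, 0.1191, 0.1048]
skill = [0.4111, 0.5161, 0.5741]
mcnemar_stat = [0.0833, 1.1250, 0.4444]


def implied_reference(bs: float, bss: float) -> float:
    """Reference Brier score implied by BSS = 1 - BS / BS_ref (assumed definition)."""
    return bs / (1.0 - bss)


def chi2_sf_1df(stat: float) -> float:
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    # For 1 df, P(X > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(stat / 2.0))


references = [implied_reference(bs, bss) for bs, bss in zip(brier, skill)]
p_values = [chi2_sf_1df(s) for s in mcnemar_stat]

print("implied reference Brier scores:", [round(r, 4) for r in references])
print("McNemar p-values:", [round(p, 4) for p in p_values])
```

If the assumed definition holds, the three implied reference scores should agree with one another, since all models are scored against the same baseline.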

Table 7. Evaluation of classification performance (sensitivity, specificity, accuracy, and AUC) for model comparison.

Methods	Training				Testing				All Observations
	Sens.	Spec.	Acc. (%)	AUC	Sens.	Spec.	Acc. (%)	AUC	Sens.	Spec.	Acc. (%)	AUC
BLR (Parametric)	0.75	0.86	81.25	0.876	0.50	0.75	62.50	0.80	0.69	0.84	77.5	0.861
PS-NBLR (Nonparametric)	0.93	0.83	87.50	0.915	0.50	0.625	56.25	0.7	0.88	0.81	85	0.892
PS-SBLR (Semiparametric)	0.83	0.89	85.94	0.892	0.75	0.75	75	0.81	0.82	0.86	83.75	0.899
