A Method of Accuracy Increment Using Segmented Regression

Abstract: The main purpose of mathematical model building while employing statistical data analysis is to obtain high accuracy of approximation within the range of observed data and sufficient predictive properties. One of the methods for creating mathematical models is to use the techniques of regression analysis. Regression analysis usually applies single polynomial functions of higher order as approximating curves. Such an approach provides high accuracy; however, in many cases, it does not match the geometrical structure of the observed data, which results in unsatisfactory predictive properties. Another approach is associated with the use of segmented functions as approximating curves. Such an approach has the problem of estimating the coordinates of the breakpoint between adjacent segments. This article proposes a new method for determining the abscissas of the breakpoints for segmented regression, which minimizes the standard deviation by means of a multidimensional paraboloid. The proposed method is explained by calculation examples obtained using statistical simulation and real data observation.


Introduction
Scientists use various models when studying different environmental phenomena. Mathematical models provide an opportunity to determine equations and dependencies to correlate the parameters of miscellaneous objects and processes. Mathematical models are built for various reasons, including the achievement of the best understanding of the objects under study, the possibility of mathematical analysis, and the possibility of conducting experimentation with the model in case it is difficult to repeat the experiment with the objects under study [1].
The process of mathematical model building contains several steps: (1) Experimental study and the measuring of the parameters of real-world systems and phenomena; (2) Collecting initial data for the model; (3) Mathematical formulations and fitting one or more models; (4) The statistical simulation of the model to validate it [2].
There are general rules for building mathematical models. These rules assume the following: (1) collecting background information for the phenomenon under study, (2) using simple models at the first stage, (3) determining all parameters and the quantities and correlations between them based on data analysis, (4) complicating the model based on the nature of the phenomenon under study, (5) estimating the efficiency of the model, and (6) others [3]. The efficiency analysis involves choosing the optimal mathematical model for the problem considered.
There are various efficiency measures for mathematical models. Generally, researchers use the following parameters: (1) Accuracy: for the coincidence analysis of the output of a mathematical model with observed data; (2) Reliability: for the analysis of the precision of a mathematical model; (3) Transparency: for the analysis of choices and assumptions of the output expectations [4,5].
To analyze mathematical models, researchers can use additional criteria, such as model simplicity, calculation time, costs, depth level, and others.
The main parameters for the efficiency level of mathematical models in terms of accuracy analysis are standard deviation [6,7], the sum of absolute deviations between the model output and the observed data [8], a weighted sum of squared deviations [9,10], and the maximal deviation [11]. The criterion for these parameters is the minimum value of the estimated parameter [12,13].
This article contains seven sections. The first section discusses the background information for the problems of mathematical model building. The second section presents a literature review regarding the topic of research and presents the statement of the problem. The third section deals with the description of mathematical tools for segmented regression building while using ordinary least squares. The fourth section proposes the step-by-step procedure for accuracy increment during segmented regression usage. The fifth section concentrates on the analysis of the proposed method based on statistical simulations. The sixth section discusses the implementation of the proposed method in real data examples, and the seventh section presents the conclusions.

Literature Review and Statement of the Problem
Mathematical model building aims at decreasing the uncertainty level for the objects being studied [14,15]. The analysis of the level, location, and nature of uncertainty helps to obtain more reliable information and adequate knowledge [16,17].
To build mathematical models, researchers use methods from different sciences, such as mathematical analysis, probability theory, data science, regression analysis, mathematical statistics, recognition theory, applied geometry, and others [18].
This article concentrates on the techniques of regression analysis for mathematical model building, so corresponding methods are considered in detail. Regression analysis is used to determine the relationship between two or more variables [19] and is widely used to fit mathematical models to statistical data [20].
Regression analysis is frequently used in various applications due to its relative ease of calculation, high accuracy, and good predictive properties, depending on the type of approximating function used. Regression analysis is applied in different fields, for example, in: (1) Medicine: to detect Parkinson's disease based on the analysis of finger-tapping data [21], to forecast the uptake of oxygen based on genes evaluation and to predict data on patient admission [22], and others; (2) Econometrics: to predict the audit opinion using six financial indicators [23], to determine the dependence of economic growth on the level of environmental pollution [24], to describe the trends of economic parameters in correlation with various factors [25,26], and others; (3) Transport systems: to determine the optimal periodicity of the implementation of operation processes [27,28], and to analyze possible routes and traffic intensity [29-31]; (4) Aviation: to identify flight conditions and situations based on diagnostic parameter monitoring [32,33], and to predict the human state and decision making depending on various environmental factors [34,35]; (5) Radar systems: to estimate the efficiency of signal detection [36], to determine the dependence of weather parameters on radar-received signals [37-39], and others; (6) Navigation systems: to build a mathematical model for the optimal selection of the navigation equipment [40-42], to establish the correlation between navigation equipment failures [43], to approximate operational data trends for the prediction of possible aviation events [44], and others; (7) Cybersecurity: to evaluate the efficiency of information web-resources functioning [45], to synthesize data-processing algorithms while detecting cyberattacks [46-48], to ensure high-level security against cyberattacks [49,50], and others; (8) Engineering and control: to describe nonlinear dynamic object behavior [51,52], to build the mathematical model for statistical parameters while designing control systems [53], to make decisions based on statistical information processing [54,55], and others; (9) Equipment maintenance: to build the mathematical model for diagnostic variable trends [56], and to determine the uncertainty level while conducting condition monitoring and maintenance preference analysis [57,58]; (10) Reliability analysis: to describe the behavior of reliability parameters [59,60], to simulate statistically nonstationary random processes of failures occurrence [61,62], to describe the processes of technical condition deterioration in the trend of failure rate [63,64], and others.
Regression analysis usually starts with research on the possibility of using a linear regression model. In the case of an unsatisfactory level of accuracy, more complicated models are used [65]. These models are nonlinear regression models [66]. Nonlinear regression models suggest parabolic, hyperbolic, exponential, segmented, and other approximating functions [65,67]. Because of the complicated calculations required when using a nonlinear regression model, various software can be utilized [68].
There are various methods for increasing the accuracy and predictive properties of mathematical models. One approach is to use segmented regression [69,70]. In this case, it is necessary to determine the coordinates of the breakpoint between adjacent segments. This problem can be solved using various algorithms [69-75]. These algorithms use the maximum likelihood estimator [69,70], Bayesian changepoint models [71,72], inverted F test [73], random search method, the method of cumulative sums [74,75], and others. A comparative analysis showed some flaws in the algorithms for determining breakpoint coordinates. These flaws are related to a need for prior limitations, as well as the effectiveness of the obtained estimate in terms of robustness and bias. Additionally, the discussed algorithms do not give the possibility to obtain a single mathematical formula for breakpoint coordinates and require the usage of the iterative numerical method described in [76].
The considered literature review motivates the authors to synthesize a new approach for calculating the optimal coordinates of breakpoints while using segmented regression and analyzing time series with nonstationary behavior. The building of a mathematical model based on segmented regression usage is of considerable importance because: (1) using segmented regression gives the possibility to obtain a model with greater accuracy; (2) segmented regression more correctly describes the geometrical structure of time series; (3) the obtained segmented models have effective predictive properties.
The research gap in the field of mathematical model building is associated with the absence of a step-by-step procedure for determining the optimal segmented regression model in case of multiple breakpoints in a dataset structure. At the same time, to solve such problems, the method of simple enumeration of the possible options is often used. However, such an approach does not provide mathematical formulations and requires a long computing time.
Therefore, the goal of this article is: (1) to describe the technique of segmented regression building and (2) to obtain mathematical equations for a step-by-step procedure of accuracy increment based on optimal breakpoints abscissas calculations.
Let us state the research problem mathematically. Let us present the statistical dataset in two arrays, $X = \{x_1, x_2, \dots, x_n\}$ and $Y = \{y_1, y_2, \dots, y_n\}$, each with sample size $n$. $Y$ is the dependent or response variable, while $X$ is the independent or predictor variable. The relationship between the variables is determined by the function set $f_k(x, \vec{c}_{m,k})$, where $k$ describes the number of the model being fitted to the dataset and $\vec{c}_{m,k}$ is a vector of $m$ parameters for the $k$-th regression model. In this case, the regression model is determined by the equation [65]

$$y_i = f_k(x_i, \vec{c}_{m,k}) + \Delta_i,$$

where $\Delta$ is an error, which can be described by a normal probability density function. Such an assumption allows the use of ordinary least squares (OLS). For example, in the case of linear regression, $f(x, \vec{c}) = c_0 + c_1 x$, where $c_0$ and $c_1$ are the parameters of the model.
This paper focuses on increasing the accuracy of mathematical models based on segmented regression usage. In this case, the function set also depends on the abscissas $x_{\mathrm{br}\,q,k}$ of the breakpoints, where $q$ is the quantity of breakpoints. The accuracy of the model using OLS is usually estimated by the standard deviation $\sigma$ between the model output and the observed data. The standard deviation depends on the values of the abscissas $x_{\mathrm{br}\,q,k}$ of the breakpoints. Thus, this paper aims to solve the minimization problem that can be formulated as follows:

$$\sigma\left(x_{\mathrm{br}\,1,k}, \dots, x_{\mathrm{br}\,q,k}\right) \to \min.$$

Segmented Regression Models
This section presents the basic mathematical equations for different segmented regression models. Authors mostly employ piecewise linear, linear-quadratic, and quadratic models.

Segmented linear regression (SLR)
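This regression type is a sequential connection of $q + 1$ straight-line segments without discontinuities. The mathematical model of SLR is given as

$$y(x) = c_0 + c_1 x + \sum_{j=1}^{q} c_{j+1}\,(x - x_{\mathrm{br}\,j})\,h(x - x_{\mathrm{br}\,j}),$$

where $h(\cdot)$ is the Heaviside function. This function helps to obtain the single mathematical equation for the segmented model.
An example of a mathematical model of three-segmented linear regression has the form

$$y(x) = c_0 + c_1 x + c_2\,(x - x_{\mathrm{br}\,1})\,h(x - x_{\mathrm{br}\,1}) + c_3\,(x - x_{\mathrm{br}\,2})\,h(x - x_{\mathrm{br}\,2}).$$

This model has two breakpoints, $x_{\mathrm{br}\,1}$ and $x_{\mathrm{br}\,2}$, and it requires the computation of four unknown coefficients: $c_0$, $c_1$, $c_2$, and $c_3$.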
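Because the Heaviside form collapses the segmented model into one equation that is linear in its coefficients, the coefficients can be estimated by an ordinary least-squares solve once the breakpoints are fixed. The following minimal Python sketch illustrates this under stated assumptions: it assumes the common form $y = c_0 + c_1 x + \sum_j c_{j+1}(x - x_{\mathrm{br}\,j})h(x - x_{\mathrm{br}\,j})$, relies on NumPy, and uses hypothetical function names, coefficients, and breakpoints that do not come from the article.

```python
import numpy as np


def slr_design_matrix(x, breakpoints):
    """Build the design matrix with columns 1, x, and (x - x_br) * h(x - x_br)."""
    x = np.asarray(x, dtype=float)
    columns = [np.ones_like(x), x]
    for x_br in breakpoints:
        # (x - x_br) * h(x - x_br) equals max(x - x_br, 0) for the Heaviside step h.
        columns.append(np.maximum(x - x_br, 0.0))
    return np.column_stack(columns)


def fit_slr(x, y, breakpoints):
    """Estimate the SLR coefficients by ordinary least squares for fixed breakpoints."""
    design = slr_design_matrix(x, breakpoints)
    coeffs, *_ = np.linalg.lstsq(design, np.asarray(y, dtype=float), rcond=None)
    return coeffs


# Hypothetical noise-free data for a three-segment model with two breakpoints.
x = np.arange(0.0, 50.0)
true_coeffs = np.array([10.0, 2.0, -3.0, 4.0])
breakpoints = (15.0, 35.0)
y = slr_design_matrix(x, breakpoints) @ true_coeffs

estimated = fit_slr(x, y, breakpoints)
print(np.round(estimated, 6))
```

On noise-free data the fit should return the generating coefficients up to rounding error, which gives a quick sanity check before the function is applied to observed data.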

Segmented quadratic regression (SQR)
This regression type is a sequential connection of $q + 1$ quadratic parabola segments without discontinuities. For example, a two-segmented quadratic regression model has one breakpoint, $x_{\mathrm{br}}$, and requires the computation of four unknown coefficients. These coefficients are estimated based on OLS, and the computation result can be presented in the form of matrix equations.

Segmented linear-quadratic regression (SLQR)
This regression type is a sequential connection of $q + 1$ straight-line and quadratic parabola segments without discontinuities. For example, a two-segmented linear-quadratic regression model has one breakpoint, $x_{\mathrm{br}}$, and requires the computation of three unknown coefficients. The feature of this model is the equality of adjacent coefficients for the transition between the quadratic parabola segment and the straight-line segment. The computation result can be presented in the form of matrix equations.

Step-by-Step Procedure for Accuracy Increment during Segmented Regression Usage
The method of accuracy increment during segmented regression usage is associated with the estimation of breakpoint abscissas. The breakpoint is the point of connection between two neighboring segments.
The step-by-step procedure contains the following operations:
1. Choosing the regression model and the quantity of segments. At this stage, the researcher analyzes the geometrical structure of the observed data presented graphically in the form of the dependence of $Y$ on $X$. After that, based on their experience, the researcher must choose one of the models SLR, SQR, and SLQR. To substantiate the decision on segmented regression usage, the researcher can test the initial data for nonlinearity. The geometrical structure of the observed data also gives the ability to choose the quantity $q$ of the breakpoints.
2. Determining the possible range of values of the breakpoint abscissas. At this stage, the researcher subjectively chooses the discrete range for all breakpoints. The minimal quantity of discrete values should be greater than five. The result of this step is a two-dimensional array $x_{\mathrm{br}}$ with size $q \times w$, where $w$ is the number of discrete values in the range of breakpoint abscissas.
3. Building a regression model. At this stage, based on the matrix equations presented in the previous section, the researcher calculates the unknown coefficients for the chosen regression model and all possible values in the array $x_{\mathrm{br}}$.
4. Calculating the standard deviations. In the case of OLS usage, the accuracy of the model is determined by the standard deviation between the model output and the observed data, which can be presented as follows:

$$\sigma = \sqrt{\frac{1}{n - l}\sum_{i=1}^{n}\left(y_i - f(x_i, \vec{c})\right)^2},$$

where $l$ is the degree of freedom for the chosen regression model.
At this stage, it is necessary to determine the discrete multidimensional dependence $\sigma(x_{\mathrm{br}\,1}, \dots, x_{\mathrm{br}\,q})$ for all possible values in the array $x_{\mathrm{br}}$.
Note that in the case of an alternative regression method (for example, least absolute deviations regression), similar calculations for corresponding accuracy measures should be completed.

5. Approximating the dependence of the standard deviation on the breakpoint abscissas by a multidimensional paraboloid using OLS. The dimension of the paraboloid corresponds to the quantity $q$ of breakpoints. It is possible to use one of two types of paraboloid: (a) the general paraboloid (4), a second-order polynomial in $x_{\mathrm{br}\,1}, \dots, x_{\mathrm{br}\,q}$ that contains a constant term, linear terms, squared terms, and pairwise cross-product terms; (b) the simplified paraboloid (5), the same polynomial without the cross-product terms. Here, $\alpha_i$, $\beta_i$, and $\gamma_{i,j}$ are approximation coefficients. The simplified paraboloid (5) can be used under the assumption $\gamma_{i,j} = 0$ for the general paraboloid (4).
The coefficients of Equations (4) and (5) are estimated based on OLS. Consider the case of a simplified paraboloid. According to OLS, it is necessary to solve the system of equations obtained by equating to zero the partial derivatives, with respect to each approximation coefficient, of the sum of squared deviations between the calculated standard deviations and the paraboloid values. After the derivatives are calculated, each equation becomes linear in the approximation coefficients. Therefore, the computation result for paraboloid (5) can be presented in the form of matrix equations.
6. Calculating the coordinates of the paraboloid optimum. To obtain the minimum standard deviation, it is necessary to determine the coordinates of the minimum of the multidimensional paraboloid. To do this, the partial derivatives are calculated and equated to zero [77]:

$$\frac{\partial \sigma}{\partial x_{\mathrm{br}\,i}} = 0, \quad i = 1, \dots, q.$$

For the general paraboloid (4), this system can be presented in the form of a system of $q$ linear equations. For paraboloid (5), the equations are independent of each other, so each optimal abscissa is obtained separately from the linear and quadratic coefficients of the corresponding breakpoint.
7. Calculating the coefficients of the model for the optimal case. The coefficients of SLR, SQR, or SLQR are computed for the optimal location of the breakpoints using OLS.
The final model can be used for the explanation and prediction of the response variable.
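Steps 5 and 6 of the procedure can be sketched for the simplified paraboloid as follows. This is a minimal Python illustration, not the authors' implementation: the coefficient names (`a_i` for linear and `b_i` for squared terms), the function names, the 5 × 5 grid, and the synthetic standard deviations are all assumptions introduced here. The abscissas are centred before fitting, which is a numerical convenience added in this sketch and not a step of the described method.

```python
import numpy as np


def fit_simplified_paraboloid(points, sigma):
    """Least-squares fit of a0 + sum(a_i * u_i) + sum(b_i * u_i**2).

    The abscissas are centred (u = x - centre) to keep the problem well conditioned.
    Returns the centre, the linear coefficients a_i, and the quadratic coefficients b_i.
    """
    points = np.asarray(points, dtype=float)
    centre = points.mean(axis=0)
    u = points - centre
    design = np.column_stack([np.ones(len(u)), u, u ** 2])
    coeffs, *_ = np.linalg.lstsq(design, np.asarray(sigma, dtype=float), rcond=None)
    q = points.shape[1]
    return centre, coeffs[1:1 + q], coeffs[1 + q:]


def simplified_optimum(centre, linear, quadratic):
    """Minimum of the separable paraboloid: u_i = -a_i / (2 * b_i) for each breakpoint."""
    if np.any(quadratic <= 0.0):
        raise ValueError("the fitted paraboloid has no minimum")
    return centre - linear / (2.0 * quadratic)


# Hypothetical 5 x 5 grid of breakpoint candidates and exact paraboloid values.
grid = np.array(
    [(a, b) for a in (15, 20, 25, 30, 35) for b in (55, 60, 65, 70, 75)],
    dtype=float,
)
sigma = 18.0 + 0.05 * (grid[:, 0] - 24.0) ** 2 + 0.03 * (grid[:, 1] - 66.0) ** 2

centre, linear, quadratic = fit_simplified_paraboloid(grid, sigma)
optimum = simplified_optimum(centre, linear, quadratic)
print(np.round(optimum, 6))
```

The check on the sign of the quadratic coefficients guards against a fitted surface that opens downward along some axis, in which case the stationary point would not be a minimum of the standard deviation.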
Consider a simple example of the proposed method. Let us use the dataset with a small sample size presented in [6]. These data describe the relationship between production lot size $x$ and the average production cost per unit $y$ (in dollars) and are given in Table 1.
The standard deviation for the optimal SLR model is 0.313.The result of the model building using the SLR model is shown in Figure 1.

Analysis of Proposed Method Based on Statistical Simulation
The analysis of the proposed method is performed using statistical simulation and real data examples. This section presents the statistical simulation results. During the simulation, a dataset with two breakpoints is generated using built-in software operators. The dataset is an additive mixture of deterministic components and random noise.
Assume that the deterministic component corresponds to an SLR model. The random noise is distributed according to the Gaussian probability density function.
The initial data for the simulation are as follows: [...] (such parameters correspond, for example, to the real process of deterioration occurrence when monitoring the supply voltage of electronic devices [63]); (4) predetermined parameters of Gaussian noise: the expected value is equal to zero and the standard deviation is equal to 20 (additionally, it is assumed that the noise values are independent random variables at any sampling time); (5) the quantity of simulation repetitions $N = 1000$.
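The data-generation step can be sketched in Python as an additive mixture of an SLR trend and independent Gaussian noise. Only the noise parameters (zero mean, standard deviation 20), the two breakpoints, and the number of repetitions $N = 1000$ are taken from the text; the sample size, the trend coefficients, the breakpoint positions, and all names below are hypothetical values chosen for illustration.

```python
import numpy as np


def slr_trend(x, coeffs, breakpoints):
    """Deterministic SLR component: c0 + c1*x plus one hinge term per breakpoint."""
    c0, c1, *hinges = coeffs
    trend = c0 + c1 * x
    for c, x_br in zip(hinges, breakpoints):
        trend = trend + c * np.maximum(x - x_br, 0.0)
    return trend


COEFFS = (300.0, 2.0, -6.0, 9.0)   # hypothetical trend coefficients
BREAKPOINTS = (25.0, 65.0)         # hypothetical breakpoint abscissas
NOISE_SD = 20.0                    # noise standard deviation stated in the text
N_RUNS = 1000                      # number of simulation repetitions stated in the text

rng = np.random.default_rng(seed=1)
x = np.arange(100.0)               # hypothetical sampling grid
deterministic = slr_trend(x, COEFFS, BREAKPOINTS)
noise = rng.normal(loc=0.0, scale=NOISE_SD, size=(N_RUNS, x.size))
datasets = deterministic + noise   # each row is one generated realization

print(datasets.shape)
```

Drawing all realizations in one array keeps the noise samples independent across both time and repetitions, which matches the independence assumption stated for the noise.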
Consider the calculation procedure of the proposed method for one of the generated datasets. Table 2 shows one of the generated datasets. Figure 2 presents three realizations of the generated datasets; each realization is marked by circles, triangles, or diamonds (the circles correspond to the data in Table 2). The coefficients $c$ of the SLR model are calculated for all possible values of the first and second breakpoints using OLS. As a result, 25 alternative SLR models are obtained.
After that, the standard deviations between the model output and the observed data for these SLR models are determined. Table 3 shows the computation results. Even a visual analysis of the standard deviations in Table 3 indicates the approximate location of the minimum. To estimate the exact values of the breakpoint abscissas, paraboloids (4) and (5) are built using OLS.
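The 25 alternative fits and their standard deviations can be produced by a loop over all combinations of candidate abscissas. The Python sketch below is an illustration under assumptions: the dataset, the second candidate range, and the function names are hypothetical, and the divisor of the standard deviation is assumed to be the sample size minus the number of fitted coefficients.

```python
import itertools

import numpy as np


def slr_sigma(x, y, breakpoints):
    """Fit an SLR model for fixed breakpoints and return its standard deviation."""
    columns = [np.ones_like(x), x]
    columns += [np.maximum(x - x_br, 0.0) for x_br in breakpoints]
    design = np.column_stack(columns)
    coeffs, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coeffs
    # Assumption: the divisor is n minus the number of fitted coefficients.
    dof = len(y) - design.shape[1]
    return float(np.sqrt(residuals @ residuals / dof))


def sigma_table(x, y, candidates):
    """Return rows (x_br1, ..., x_brq, sigma) for every combination of candidates."""
    rows = [(*combo, slr_sigma(x, y, combo))
            for combo in itertools.product(*candidates)]
    return np.array(rows)


# Hypothetical dataset: SLR trend with breakpoints at 25 and 65 plus Gaussian noise.
rng = np.random.default_rng(seed=0)
x = np.arange(100.0)
trend = 300.0 + 2.0 * x - 6.0 * np.maximum(x - 25.0, 0.0) + 9.0 * np.maximum(x - 65.0, 0.0)
y = trend + rng.normal(0.0, 20.0, size=x.size)

candidates = [(15, 20, 25, 30, 35), (55, 60, 65, 70, 75)]  # w = 5 values, q = 2
table = sigma_table(x, y, candidates)
best = table[np.argmin(table[:, -1])]
print(table.shape, best)
```

The resulting table has one row per candidate pair and plays the role of the discrete dependence of the standard deviation on the breakpoint abscissas, which is the input for the paraboloid fit.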
After the calculations, the numerical equations of both paraboloids were obtained. Figures 3 and 4 show the visual presentation of paraboloids (4) and (5) for this numerical example, respectively. To determine the optimum coordinates for the three-dimensional general paraboloid (4), it is necessary to solve a system of two linear equations. The SLR models obtained for the optima of paraboloids (4) and (5) give almost the same standard deviations, equal to 18.429 and 18.424, respectively.
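For two breakpoints, locating the optimum of the general paraboloid reduces to a 2 × 2 linear solve. The following Python sketch shows one way to do this; the coefficient names (`a1`, `a2`, `b1`, `b2`, `g`), the grid, and the synthetic surface are assumptions made for illustration, and the centring of the abscissas is a numerical convenience added here.

```python
import numpy as np


def general_paraboloid_optimum(points, sigma):
    """Fit a0 + a1*u1 + a2*u2 + b1*u1**2 + b2*u2**2 + g*u1*u2 and locate its minimum."""
    points = np.asarray(points, dtype=float)
    centre = points.mean(axis=0)
    u1, u2 = (points - centre).T
    design = np.column_stack([np.ones_like(u1), u1, u2, u1 ** 2, u2 ** 2, u1 * u2])
    solution = np.linalg.lstsq(design, np.asarray(sigma, dtype=float), rcond=None)[0]
    _, a1, a2, b1, b2, g = solution
    # Equating both partial derivatives to zero gives two linear equations in (u1, u2).
    hessian = np.array([[2.0 * b1, g], [g, 2.0 * b2]])
    if np.any(np.linalg.eigvalsh(hessian) <= 0.0):
        raise ValueError("the fitted paraboloid has no minimum")
    return centre + np.linalg.solve(hessian, -np.array([a1, a2]))


# Hypothetical 5 x 5 grid and a synthetic surface with a cross-product term.
grid = np.array(
    [(a, b) for a in (15, 20, 25, 30, 35) for b in (55, 60, 65, 70, 75)],
    dtype=float,
)
d1, d2 = grid[:, 0] - 24.0, grid[:, 1] - 66.0
sigma = 18.0 + 0.05 * d1 ** 2 + 0.03 * d2 ** 2 + 0.01 * d1 * d2

optimum = general_paraboloid_optimum(grid, sigma)
print(np.round(optimum, 6))
```

The eigenvalue check confirms that the fitted surface is convex, so that the stationary point returned by the linear solve is a minimum and not a saddle point.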
Figure 5 shows the generated dataset and final optimal SLR models. Visual analysis shows the coincidence of both SLR models.
We now consider the general simulation results for all iterations. Repeating the simulation provides an opportunity to perform a complete statistical analysis of the breakpoint estimation during mathematical model building. An analysis was performed by plotting histograms and evaluating the numerical characteristics of the random variables. Figure 6 shows the histograms for the estimates of two breakpoint abscissas and the usage of different optimization options (general and simplified paraboloids). The parameter $\lambda$ in Figure 6 is the quantity of breakpoint abscissa estimates that are located in the corresponding grouping interval of the histogram. Table 4 shows the numerical characteristics of the breakpoint abscissa estimates (mathematical expectation, standard deviation, range of change, and skewness). To describe the obtained estimates of the breakpoint abscissas completely, it is necessary to fit the histogram by a theoretical probability density function. Approximate assumptions can be made based on the graphical view of the histograms in Figure 6. The shape of the histogram can correspond to the Gaussian probability density function. Such an assumption can be tested using the chi-squared test with high confidence probability.
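The numerical characteristics listed for the breakpoint estimates (mathematical expectation, standard deviation, range of change, skewness) and the histogram counts can be computed as in the following Python sketch. The sample of estimates is synthetic and hypothetical, the function and key names are assumptions, and the skewness is taken here as the third central moment divided by the cubed population standard deviation, which may differ from the estimator used in the article.

```python
import numpy as np


def describe_estimates(estimates, true_value):
    """Summarize breakpoint abscissa estimates obtained over repeated simulations."""
    values = np.asarray(estimates, dtype=float)
    centred = values - values.mean()
    return {
        "mean": values.mean(),
        "std": values.std(ddof=1),
        "range": (values.min(), values.max()),
        # Moment-based skewness: m3 / m2**1.5 (assumed definition).
        "skewness": np.mean(centred ** 3) / np.mean(centred ** 2) ** 1.5,
        "relative_bias_percent": 100.0 * abs(values.mean() - true_value) / true_value,
    }


# Hypothetical estimates of one breakpoint abscissa over 1000 repetitions.
rng = np.random.default_rng(seed=2)
estimates = rng.normal(loc=25.0, scale=1.5, size=1000)

summary = describe_estimates(estimates, true_value=25.0)
counts, edges = np.histogram(estimates, bins=10)  # counts per grouping interval
print(summary)
print(counts.sum())
```

The histogram counts correspond to the quantity of estimates per grouping interval, and they are the input that a chi-squared goodness-of-fit test would compare against the expected Gaussian frequencies.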
The breakpoint estimation bias has preferable values when the general paraboloid method is used.However, the benefit is negligible and averages 0.337% compared with the simplified paraboloid method.The highest percentage of estimate bias (in relative values) is 3.012%.In the case of a long-term breakpoint, the simplified paraboloid method has, on average, a narrower range of change of breakpoint estimates.
Let us analyze the proposed method in comparison with the method of simple enumeration. To obtain a breakpoint abscissa estimate bias of approximately 3%, the method of simple enumeration requires at least 33 possible values for each breakpoint. Therefore, it is necessary to repeat computations for at least 1089 iterations in the case of two breakpoints. At the same time, the proposed method requires 25 iterations and additional calculations of the paraboloid optimum. Therefore, the proposed method reduces the computing time by a factor of at least 30 compared to the method of simple enumeration.
A comparison of the simulation results for a range of initial data shows approximately the same accuracy characteristics for SLR models based on the general and simplified paraboloids. Therefore, in practical cases, the simplified paraboloid method is more advantageous when creating a segmented regression model because of the reduction in computations and calculation time.
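The iteration counts in this comparison follow from the size of a full grid of candidate values, which a short Python calculation makes explicit. The function name is hypothetical. The ratio of model fits alone is about 43.6, so the more conservative factor of 30 quoted in the text presumably leaves room for the additional paraboloid calculations.

```python
def number_of_fits(values_per_breakpoint, breakpoint_count):
    """Number of candidate models on a full grid of breakpoint abscissas."""
    return values_per_breakpoint ** breakpoint_count


enumeration_fits = number_of_fits(33, 2)  # simple enumeration, two breakpoints
proposed_fits = number_of_fits(5, 2)      # proposed method with w = 5
ratio = enumeration_fits / proposed_fits

print(enumeration_fits, proposed_fits, round(ratio, 2))  # 1089 25 43.56
```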

Real Data Example
Consider the example of real data on the number of earthquakes with a magnitude of 7 or higher by year, according to the United States Geological Survey [78]. Table 5 presents the corresponding data from 1922 to 2021, where $i$ is the number of the observation, $X$ is the year, and $Y$ is the quantity of earthquakes.
Figure 7 shows the graphical view of the dataset. To simplify the presentation and calculations, the first year of observation (1922) is assigned to the zero point of the abscissa axis in the subsequent computations. Thus, to return to the original data, it is necessary to add 1922 to the shifted abscissa values.
The method of simple enumeration for the given dataset gives approximately the same result as that shown in Figure 8. However, this method approximately doubles the computing time. Polynomial regression using a seventh-order polynomial is characterized by a faster computation time; however, it gives unacceptable predictive properties.
The results of the mathematical model building can be used for solving prediction problems.Consider this problem for the observed dataset based on generally known results that have been extensively described in the literature (for example, [76,77]) and innovative methods that may be used in accordance with the properties of segmented regression models.
This method of prediction and the obtained SLR model allow us to anticipate that, through 2042, the average annual number of earthquakes with a magnitude of 7 or higher will decrease.
In general, the proposed method can be applied to different datasets to determine breakpoints using multidimensional optimization.

Conclusions
This article presents a method of accuracy increment when segmented regression is used. The main problem for segmented regression model building is the estimation of the coordinates of the breakpoint between adjacent segments. To solve this problem, two types of multidimensional optimization paraboloids are used. The paraboloids contain information on standard deviations between the model output and the observed data for different sets of possible values of breakpoint abscissas. The minimum standard deviation of each paraboloid coincides with the optimal position of the breakpoints.
A step-by-step procedure for the proposed method was described by examples based on statistical simulation and real data observation.

The coefficients are estimated based on OLS.
The OLS estimation of the coefficients of Equations (4) and (5) is possible because all of the values of the possible breakpoints in the two-dimensional array $X_{\mathrm{br}}$ with size $q \times w$ are known, and the function $\sigma(x_{\mathrm{br}\,1}, \dots, x_{\mathrm{br}\,q})$ has been calculated for each of them.

Figure 1. Dataset and final optimal SLR model.

Figure 2. Examples of generated additive mixture of SLR model and Gaussian noise.

To describe the obtained dataset, we choose the SLR model with three segments and $q = 2$ breakpoints. To simplify the calculations, we choose the quantity of discrete values within the range of possible breakpoints to be $w = 5$. According to the geometrical structure of the observed dataset (Figure 2), the ranges for the two breakpoints are as follows: $\{15; 20; 25; 30; 35\}$

Figure 5. Generated dataset and final optimal SLR models.

Figure 6. Histograms for breakpoint abscissas for different optimization options: (A) estimates of the first breakpoint for general paraboloid; (B) estimates of the second breakpoint for general paraboloid; (C) estimates of the first breakpoint for simplified paraboloid; (D) estimates of the second breakpoint for simplified paraboloid.

Figure 7. Data on quantity of earthquakes of magnitude 7 or higher by year.
To predict the future trend, let us determine the range of the SLR model change. For this purpose, we used straight lines fitted by OLS to approximate the upper and lower ordinates of the breakpoints. The lower line contains the zero point and the second and fourth breakpoints. The upper line contains the first, third, and fifth breakpoints. The segment of the SLR model is then continued to the intersection point with the lower straight line. Figure 9 shows the visual representation of the trend prediction.

Table 1. Relationship between production lot size x and the average production cost per unit y.

There are five alternative SLR models for all possible values in the array $x_{\mathrm{br}}$:

Table 2. Example of obtained dataset.

Table 3. Computation results for standard deviation.

Table 5. Quantity of earthquakes of magnitude 7 or higher by year.

Table 5 contains data observed from 1922 to 2021.