Modeling Mortality Based on Pollution and Temperature Using a New Birnbaum–Saunders Autoregressive Moving Average Structure with Regressors and Related-Sensors Data

Environmental agencies are interested in relating mortality to pollutants and possible environmental contributors such as temperature. The Gaussianity assumption is often violated when modeling this relationship due to asymmetry and then other regression models should be considered. The class of Birnbaum–Saunders models, especially their regression formulations, has received considerable attention in the statistical literature. These models have been applied successfully in different areas with an emphasis on engineering, environment, and medicine. A common simplification of these models is that statistical dependence is often not considered. In this paper, we propose and derive a time-dependent model based on a reparameterized Birnbaum–Saunders (RBS) asymmetric distribution that allows us to analyze data in terms of a time-varying conditional mean. In particular, it is a dynamic class of autoregressive moving average (ARMA) models with regressors and a conditional RBS distribution (RBSARMAX). By means of a Monte Carlo simulation study, the statistical performance of the new methodology is assessed, showing good results. The asymmetric RBSARMAX structure is applied to the modeling of mortality as a function of pollution and temperature over time with sensor-related data. This modeling provides strong evidence that the new ARMA formulation is a good alternative for dealing with temporal data, particularly related to mortality with regressors of environmental temperature and pollution.


Introduction
Environmental agencies charged with establishing health-based air pollution standards are interested in determining significant relationships between pollution levels and human mortality [1]. These agencies must choose the admissible levels of these standards to protect the population including sensitive groups, such as children and the elderly, against adverse effects on their health [2]. In general, a relevant question to answer is related to the degree of association between pollutants and mortality considering possible environmental contributors, such as climate, linked mainly to temperature [3,4].
Variables associated with mortality, pollutants and temperature are often statistically related, but also their data are dependent over time. Then, a simple multiple regression is not enough to model this relationship, since a time-series structure should be considered [5]. This type of modeling is frequently conducted under the Gaussianity/normality assumption. However, such an assumption is often violated in environmental phenomena due to asymmetry and then diverse practitioners employ logarithmic transformations to reach Gaussianity. Nevertheless, data transformation brings difficulties of interpretation autoregressive moving average with regressors (RBSARMAX) time-series model, which is specified in terms of a conditional mean varying over time and extends the RBS regression proposed in [33], where temporal dependence was not considered. Our approach is similar to that studied in [5,41,42]. The secondary objective is to apply the RBSARMAX structure for modeling mortality as a function of pollution and temperature with data that are related to sensors as detailed in the section on application.
The rest of this article is organized as follows. Section 2 presents the RBS distribution, some of its properties, and the RBS regression model proposed in [33]. In Section 3, the new RBSARMAX model is formulated, conditional maximum likelihood (CML) estimators of the model parameters are derived, and residual analysis is considered for this model. In Section 4, we conduct Monte Carlo simulations to evaluate the performance of the proposed methodology. Section 5 applies the RBSARMAX modeling approach to sensor-related timeseries data to show its potential. The results are compared with an approach based on a Gaussian ARMA model. Finally, Section 6 provides a summary and some concluding observations, limitations, and ideas for the future of the present work.

The RBS Distribution
The RBS distribution [32], as one of the various forms of parameterization of the BS distribution, was introduced using a new parametrization of the latter as a function of its mean. The RBS distribution allows several characteristics of data modeling to be considered [32,43].
To start, if a random variable T follows a BS distribution, usually denoted by T ∼ BS(α, λ), then its cumulative distribution function (CDF) is given by: where Φ is the standard normal CDF, α is a shape parameter, and λ is a scale parameter, as well as the distribution median. Then, by considering the parameters of the BS distribution with CDF defined in (1) as α = √ 2/δ and λ = µδ/(δ + 1), the new parameters of the form reparametrized of the BS distribution are expressed as µ = λ(1 + α 2 /2) and δ = 2/α 2 , where µ > 0 is the mean of the distribution and also a scale parameter, whereas δ > 0 is a shape and precision parameter. In this case, we use the notation Y ∼ RBS(µ, δ).
The CDF of Y ∼ RBS(µ, δ) is stated as: whereas the probability density function (PDF) of Y is obtained by differentiating the expression established in (2) with respect to y formulated as: Figure 1 shows some shapes of the RBS PDF. From Figure 1a, note that δ, in addition to being a precision parameter, is also a shape parameter. Observe that, as δ increases, the PDF is more concentrated around the mean µ = 1 and therefore the variability decreases. In Figure 1b, note that the distribution mean µ also behaves as a scale parameter. Hence, as it increases, there is an increase in the variance and an increased flatness in the PDF.
Due to the relationship of the BS distribution in its original version to the normal distribution, the RBS distribution has the following relationship with the normal distribution: wherein, from (4), we obtain Consequently, from (4) and (5), the quantile function for the RBS distribution is expressed as: where z(q) defined in (6) is the q-th quantile of the standard normal distribution and F −1 Y is the inverse of the CDF of Y applied to q. The expressions for the mean and variance of the RBS distribution are stated, respectively, as: where the notation CV defined in (7) is formulated as CV(Y) = √ 2δ + 5/(δ + 1) ∈ (0, √ 5) and corresponds to the coefficient of variation of Y. As mentioned, δ can be interpreted as a precision parameter, that is, for fixed values of µ, when δ → ∞, the variance of Y tends to zero. In addition, for fixed values of µ, if δ → 0, then Var(Y) = 5µ 2 . The median of Y is δµ/(δ + 1) and hence is proportional to the mean. Note that, for µ fixed, we have that δµ/(δ + 1) → µ when δ → ∞.

Formulation
Based on the RBS distribution, a new approach to the regression modeling of the BS distribution was proposed in [33]. In this approach, the construction of the regression model is similar to the GLM, in which the mean is directly described without the need for a transformation of the dependent variable to the logarithmic scale. Formally, consider Y = (Y 1 , . . . , Y n ) , which is a sample of independent random variables, where each Y t ∼ RBS(µ t , δ), for t ∈ {1, . . . , n}, and their respective observations are y = (y 1 , . . . , y n ) . Then, a regression model based on (3) is defined by a systematic component expressed as: where x t = (x t1 , . . . , x tr ) is a vector of known values for r regressors, with t ∈ {1, . . . , n} and r < n, β = (β 1 , . . . , β r ) is a vector of unknown regression coefficients to be estimated, and α t is the linear predictor. Here, we have a link function g: R → R + which is strictly monotonic, always positive, and at least twice differentiable. Hence, the mean of the response variable is given by µ t = g −1 (x t β), with g −1 being the inverse function of g.

Estimation
The logarithm of the likelihood function of the RBS regression model for the parameter vector γ = (β , δ) has the form: where t (y t ; µ t , δ) defined in (9) is given by: The maximum likelihood estimate of γ is stated through solution of the system of equations U β j (γ) = 0, for j ∈ {1, . . . , k}, and U δ (γ) = 0, where U β j (γ) = ∂ (γ)/∂β j , and U δ (γ) = ∂ (γ)/∂δ. In this case, it is not possible to find an analytical solution so that the maximum likelihood estimates must be obtained numerically using an appropriate iterative method for nonlinear optimization problems, such as the Broyden-Fletcher-Goldfarb-Shanno (BFGS) quasi-Newton method, which is implemented in the R software (https://www.r-project.org, accessed on 22 September 2021) [44,45] by a command named optim.

Estimation
Parameter estimation in the RBSARMAX model is performed with the CML method or the first m observations, in which m = max{p, q} and n > m. From the expression stated in (10), we have that the log-likelihood function for γ = (δ, η, β , φ , θ ) conditional on m observations is given by The CML estimate of γ can be obtained by maximizing the log-likelihood function defined in (13), matching the score vector U(γ) = ∂ /∂γ to zero. Thus, the CML estimates are obtained numerically using the BFGS method. The methodology proposed in this work can be easily used by a practitioner through the R software. In particular, by employing the function garmaFit of a package named gamlss.util and some functions of the RBS package, which can be downloaded from GitHub via remotes:: install github("santosneto/RBS"). Note that the computational cost and complexity are relatively low. In Appendix A, we present mathematical results associated with the Fisher information matrix.

Residual Analysis
Residuals play a key role in the validation of any statistical model and permit us to detect the existence of outliers. In particular, two types of residuals are proposed in this study. The first is a generalized Cox-Snell (GCS) residual given by: wherein S(y t |F t−1 ) is the estimated survival function for the fitted model, defined as: The GCS residuals follow a unit exponential distribution, EXP (1) in short, when the model is specified correctly, and a plot of the theoretical quantiles versus empirical quantiles (QQ) of r GCS t , defined in (14), can be used to assess the fit of the model to the data. The randomized quantile (RQ) residual is also proposed, which is expressed as: where Φ −1 is the inverse function of the CDF of the standard normal distribution and S(y t |F t−1 ) is the estimated survival function, adjusted as in (15). The RQ residual follows a standard normal distribution when the model is specified correctly. Hence, a QQ plot of the residuals defined as in (16) may be utilized to assess the fit of the model to the data.

Definitions and Simulation Model
The simulations are performed using the RBSARMAX(1,1,1) model and are based on samples of size n ∈ {100, 200, 500}, considering two cases. In Case 1, simulations are performed with the values δ ∈ {8, 15, 25, 50}, β = 0.7, η = 1.0, φ = 0.7, and θ = 0.5. For Case 2, the autoregressive (φ) and moving average (θ) parameters take the values of 0.3, 0.5, and 0.7, with δ = 8, β = 0.7, and η = 1.0. These simulations evaluate the performance of the CML estimators of the RBSARMAX(1,1,1) model parameters. The simulation study is based on 1000 Monte Carlo replicates for each n. The proposed sample sizes aim to verify whether there are improvements in the parameter estimation as the sample size increases. The criteria used to evaluate performance for CML estimators of φ, θ, and δ are the empirical mean, bias, variance and mean square error (MSE) given, respectively, by: where ϕ r is the estimate obtained from the r-th replicate of the corresponding parameter, ϕ represents the true value of the parameter and N r is the number of Monte Carlo replicates.
With the exception of the mean, for all other calculated statistics, as the value is smaller, the estimator has a better statistical performance. Note that the bias has this characteristic when analyzed in terms of its absolute value. All simulation and estimation routines were developed employing the R software.

RBSARMAX(1,1,1) Model
Tables 1 and 2 report the empirical mean, bias, variance, and MSE calculated as in (17) of the estimators for the shape and precision parameter (δ), autoregressive parameters (φ), and moving average parameters (θ), respectively. Table 1 shows the estimates for the parameter δ, fixed according to Case 1. Note that the performance of the estimator of δ is related to the sample size. For example, when the sample size increases from n = 100 to n = 500, the empirical bias in absolute value of the estimator of δ = 8, on average, decreases considerably, from 0.4705 to 0.0720. Consequently, the mean of the estimator of δ tends to the true parameter value. In all considered scenarios, the parameter δ is, on average, overestimated, that is, the estimate δ provided by the CML estimator for δ is greater than the true value of the parameter. The results of Table 1 are also shown in Figure 2 to simplify the interpretation of the calculated statistics in relation to the sample size and the true values of δ. Note in Figure 2a that, as n → ∞, the bias of the estimator in absolute value is smaller.
The results in Table 1 and Figure 2 allow us to conclude that, in general, the performance of the estimator of δ is directly related to the sample size. That is, as n → ∞, the values of the statistics are smaller and, consequently, the statistical performance of the estimator is better. Such behavior is expected, because as the sample size is greater, more information is available to estimate the parameters. Table 2 presents summary statistics for the estimates of the parameters φ and θ, fixed according to the settings described for Case 2. Note that the estimators of φ and θ are very accurate for large sample sizes. This makes the results obtained for the MSE very close to the variance. For example, for a sample size of n = 500, φ = 0.5, and θ = 0.3, the estimates are very close to the true value of the parameters, that is, φ = 0.4922 and θ = 0.3034. On average, absolute biases in estimated values of φ or θ are always less than 0.0336. The maximum values of the MSE are observed for φ = 0.3 and θ = 0.3 with a sample size equal to 100. Considering a fixed sample size, there is a slight reduction in the variance and MSE of the estimators of φ and θ as both of these parameters increase. Observe that the estimated values for φ and θ are, on average, underestimated. That is, the estimates φ and θ are less than the true parameter, in most of the considered scenarios.  Figure 2. Empirical bias (a), variance (b) and MSE (c) of δ with simulated data from the RBSARMAX(1,1,1) model.

Performance Measures and Model Selection
Performance measures are used to assess the accuracy of forecasts and compare models. These measures are a function of the observed and predicted values of the time series. Here, we consider two scenarios with respect to the data generating model: (Scenario 1) the model is correctly specified, that is, simulated values from the RBSARMAX model are generated and the RBSARMAX and Gaussian ARMA models are fitted; and (Scenario 2) the model is incorrectly specified, that is, simulated values from an ARMA model based on the Weibull distribution [46] are generated and the RBSARMAX and Gaussian ARMA models are fitted. The Weibull model was chosen because it is an asymmetrical distribution that often is considered as a competing model of the BS distribution. Then, the performance and goodness of fit of the models are compared. To evaluate the predictive ability of the models, the mean absolute percentage error (MAPE) is employed, which is given by: where n is the number of observations in the time series, y t is the observed value at time t, and y t is the predicted value of y t . To select the best model, we use the Akaike information criterion (AIC) and Bayesian information criterion (BIC), which are stated as: where L is the maximized likelihood for the estimated model, n is the number of observations, and k is the number of parameters. The AIC relies on the likelihood penalized by the number of model parameters, while the BIC in addition weights the number of parameters using the sample size. Smaller AIC and/or BIC values indicate better models [47].
4.3.1. Scenario 1 Table 3 reports the results for sample sizes n ∈ {100, 200, 500} of the RBSARMAX(1,1,1) model, with η = 1.0, β = 0.7, δ = 8 and φ, θ ∈ {0.3, 0.5, 0.7}. In the simulation, 1000 replicates are utilized for each combination of parameters. The Gaussian ARMA(1,1) model is also considered. Comparing the RBSARMAX and Gaussian ARMA estructures based on the statistics described in Table 3, note that the values of AIC and BIC highlight the fact that the RBSARMAX model fits the data better than the Gaussian ARMA model, with AIC and BIC being calculated as in (19). Considering the forecasting performance, the RBSARMAX model also provides smaller MAPE values, indicating a better forecasting capacity, with the MAPE being calculated as in (18). To measure the effects of the parameter δ on the performance of the model, Table 4 shows the summary results of 1000 Monte Carlo replicates with η = 1.0, β = 0.7, φ = 0.7, θ = 0.5 and δ ∈ {8, 15, 25, 50}. In this case, the RBSARMAX model provides smaller values of AIC, BIC and MAPE, indicating better goodness-of-fit and forecasting ability. Table 5 reports results for the RBSARMAX and Gaussian ARMA models. The simulated values are generated from a Weibull ARMA model with η = 1.0, β = 0.7 and δ = 8 (shape parameter of the Weibull distribution) and φ, θ ∈ {0.3, 0.5, 0.7} in the case of Table 5, and from a Weibull ARMA model with η = 1.0, β = 0.7, φ = 0.5, θ = 0.3 and δ ∈ {2.5, 5,8,15,25, 50} in the case of Table 6. In general, the results of both tables show that the RBSARMAX model outperforms the ARMA model in terms of forecasting ability based on the MAPE and root mean squared error (RMSE), with RMSE = (1/n) ∑ n t=1 (y t − y t ) 2 , where n, y t and y t are as stated in (18). However, the selection criteria (AIC and BIC) indicate an advantage of the latter model. Since usually in time series, one is interested in forecasting, the RBSARMAX model is a better choice.

Sensor-Related Data and Definition of the Variables
Next, we deal with an illustration and evaluation of the performance of the RBSAR-MAX model applied to a real environmental process composed of three time series related to mortality, pollutants, and temperature. Note that the pollutant data are often available from monitoring stations which are associated with sensors [48] and similarly with the temperature. On the one hand, the monitoring stations extract air from the environment for time intervals and then measure the amount of transmitted light. The measurement method is considered to be quite sensitive to particles small enough to penetrate deep into the human lung. On the other hand, the temperature sensors are electrical and electronic components that, as sensors, allow temperature to be measured using a specific electrical signal. This signal can be sent directly or by changing the resistance. They are also called heat sensors or thermosensors.
The analyzed data are available in the R software through the astsa package. These data correspond to 508 observations of weekly averages of cardiovascular mortality in Los Angeles County, CA, USA, from 1970 to 1979, associated with effects of temperature variation and levels of particulate matter (PM), which are very fine particles of solids or liquids suspended in the air [2]. The variables under analysis are mortality (M t ), temperature (X 1t ) and PM (X 2t ). A study similar to this was carried out in [4], which used the same dataset for regression models in the context of a time series.

Exploratory Data Analysis
The behavior of the variables M t , X 1t , and X 2t over time are shown in Figure 3. Note that all series have a notorious seasonality. In addition, Figure 3a shows a downward trend in mortality over the period under study. Table 7 provides some descriptive measures for each variable, which include: sample size (n), minimum and maximum values, median, standard deviation (SD), CV, and coefficients of symmetry (CS) and kurtosis (CK). Figure 4 displays summaries of M t , X 1t , and X 2t . Histograms are shown along the diagonal; below the diagonal are scatterplots and above the diagonal are the Pearson correlation coefficients (ρ). These graphical plots allow us to identify that mortality M t and temperature X 1t have a clear relationship, with lower temperatures giving higher mortality, and that the mortality is the highest at lower temperatures. Here, ρ = −0.44 indicates a moderate negative correlation which is statistically different from zero at 1% significance. Similarly, mortality M t and PM levels X 2t have a linear relationship and a moderate positive correlation ( ρ = 0.44, which is also statistically different from zero at 1% significance), indicating that higher levels of PM are associated with higher levels of mortality. However, temperature X 1t and PM X 2t have practically no correlation ( ρ = −0.02). The histograms confirm the summaries in Table 7 show that mortality M t and PM levels X 2t have positive skewed behavior, whereas temperature X 1t is more symmetric. This behavior is confirmed by the box-plots shown in Figure 5. Additionally, in this plot, the presence of outliers for mortality M t and PM levels X 2t is evident.

Time-Series Modeling
Based on the analysis of Figure 4, which shows the relationship between the variables M t , X 1t , and X 2t , in addition to considering M t as the response variable, these relationships can be modeled over time t using the corresponding observed values x 1t and x 2t as: where the first two terms define a linear trend in t, as seen in Figure 3a; the next two terms describe a quadratic relationship with temperature and x 1 being the average temperature included to avoid collinearity; the next is a linear term in PM levels; and then ε t is a random error or a noise process. In [4], the error consists of independent and identically distributed variables with zero mean and variance σ 2 ε , whereas an alternative approach is taken here. Figure 6 shows the plots of the autocorrelation function -ACF-(a) and the partial autocorrelation function -PACF-(b) of the residuals fitted with the least squares method for the model stated in (20). Consideration of the ACF and PACF plots suggests the characteristic of a stationary AR(p) model of order p = 2 for the residuals. Thus, the correlated error model defined in (20) is expressed as: where ε t is an AR(2) model and u t is a white noise. The results for this model are obtained using the garmaFit function of a package named gamlss.util (http://www.gamlss.org, accessed on 22 September 2021). Now, consider an analysis with the RBSARMAX model defined by (10) and (12). Table 8 reports the CML estimates as well as the MAPE and AIC/BIC values. From this table, note that the RBSARMAX(2,0,2) model provides a better fit than the ARMA(2,0) model based on the AIC/BIC values. Moreover, the RBSARMAX(2,0,2) model has less MAPE, indicating better forecasting capacity. We emphasize that, in addition to the advantage of these results, the RBSARMAX(2,0,2) model is more appropriate due to the skewed and kurtosis features in the data empirical distribution.
The QQ plots of the GCS and RQ residuals, with simulation envelopes, are presented in Figure 7a,b, respectively, which indicate better agreement with the EXP(1) distribution in the RBSARMA model. However, for the same analysis referring to the ARMA model based on Figure 7, note that the plots of GCS and RQ residuals, with simulation envelopes, produce points that are located far from the diagonal line and outside the envelope. In the ACF and PACF charts, observe that both models produce non-autocorrelated errors; see Figure 7c,d. The time-series forecasts using the fitted RBSARMAX and ARMA models are presented together with the observed time-series data in Figure 8.

Conclusions, Limitations, and Future Research
In this work, a new mean-based autoregressive moving average model using the Birnbaum-Saunders distribution, called RBSARMAX, was studied and formulated mathematically. We have estimated the model parameters with the maximum likelihood method and used information criteria for model selection to assess the adequacy of the new Birnbaum-Saunders autoregressive moving average structure.
We have conducted Monte Carlo simulations to evaluate in practice the statistical performance of the conditional maximum likelihood estimators for the parameters of the new model, showing a good performance. Additionally, in this simulation, several performance measures were used to assess the level of accuracy of forecasts and to compare different models, obtaining similarly reasonable and good results.
In the application, when modeling mortality as a function of pollution and temperature with data related to sensors, the RBSARMAX model presented a superior result to that of the Gaussian ARMA model, providing strong evidence that the Birnbaum-Saunders distribution is a good alternative for dealing with temporal data. Consequently, the results have suggested that the RBSARMAX model can become a valuable tool for analyzing positive and asymmetric time-series data in environmental sciences and other fields of knowledge.
The new methodology is an addition to the tools of applied statisticians, data scientists, and diverse users interested in the modeling of time-series data. From the application presented in this study, we have generated helpful information that may be employed by practitioners and users of statistics.
Some limitations of our proposal are described next. Since the BS distribution is related to the normal distribution, parameter estimation in RBSARMAX models may be affected by outliers and potentially influential cases. To obtain robust estimation, the BS-Student-t distribution could be considered instead [30,49]. Besides fixed effects considered by regression parameters in the RBSARMAX model, random effects may be formulated. A multivariate version of the RBSARMAX model might also be of interest [12,50], and local influence diagnostics could be derived, allowing the detection of potentially influential cases [16]. Other aspects for future study using this new model are associated with quantile, spatial, partial least squares, principal components, and sampling structures [51][52][53][54][55][56].
The authors are working on these and other aspects related to the study reported in this paper, and their findings will be presented in future articles.