Modeling and Predicting Pulmonary Tuberculosis Incidence and Its Association with Air Pollution and Meteorological Factors Using an ARIMAX Model: An Ecological Study in Ningbo of China

The autoregressive integrated moving average with exogenous regressors (ARIMAX) modeling studies of pulmonary tuberculosis (PTB) are still rare. This study aims to explore whether incorporating air pollution and meteorological factors can improve the performance of a time series model in predicting PTB. We collected the monthly incidence of PTB, records of six air pollutants and six meteorological factors in Ningbo of China from January 2015 to December 2019. Then, we constructed the ARIMA, univariate ARIMAX, and multivariate ARIMAX models. The ARIMAX model incorporated ambient factors, while the ARIMA model did not. After prewhitening, the cross-correlation analysis showed that PTB incidence was related to air pollution and meteorological factors with a lag effect. Air pollution and meteorological factors also had a correlation. We found that the multivariate ARIMAX model incorporating both the ozone with 0-month lag and the atmospheric pressure with 11-month lag had the best performance for predicting the incidence of PTB in 2019, with the lowest fitted mean absolute percentage error (MAPE) of 2.9097% and test MAPE of 9.2643%. However, ARIMAX has limited improvement in prediction accuracy compared with the ARIMA model. Our study also suggests the role of protecting the environment and reducing pollutants in controlling PTB and other infectious diseases.


Introduction
Tuberculosis (TB) is a chronic infectious disease caused by Mycobacterium tuberculosis, and it often affects the lungs. WHO proposed an End TB Strategy in 2014, with the targets to reduce TB deaths by 95% and to cut incident cases by 90% between 2015 and 2035 [1]. The cumulative reduction in TB incidence from 2015 to 2020 was 11%, just over half-way to the 2020 milestone of the strategy [2]. To achieve this ambitious goal, accurate prediction of disease trends, as well as related factors, is of great importance.
The autoregressive integrated moving average (ARIMA) model, also known as the Box-Jenkins model, is a commonly used model in time series analysis. Despite its effectiveness to study time series, the ARIMA model applies only to one variable. The autoregressive integrated moving average with exogenous regressors (ARIMAX) model, however, adds other variables related to the target series as input variables to improve the prediction accuracy. Unlike the ARIMA model, the autoregressive integrated moving average with exogenous regressors (ARIMAX) model adds other variables related to the target series as input variables to improve the prediction accuracy. Several ARIMAX studies suggested that weather parameters such as temperature, rainfall, humidity, wind speed, and air pollutants may influence the occurrence of disease [3][4][5][6][7]. ARIMAX modeling studies of tuberculosis are still rare. A time series study in Jiangsu, China, showed that the prediction accuracy of the ARIMA model for PTB was improved by adding monthly PM 2.5 with 0-month lag as an external variable [8]. Another time series study in eastern China also indicated that the predictive performance of the ARIMA model was improved after incorporating meteorological factors [9].
Several studies have proposed the biological mechanisms linking air pollution exposure and the risk of PTB. For example, PM 2.5 exposure disrupted the synthesis and secretion of inflammatory cytokines including interferon (IFN)-γ, tumor necrosis factor (TNF)-α, and interleukin (IL)-10 and impaired key anti-mycobacterial T cell immune functions [10]. In addition, blocking the IL-10 pathway and downregulating TNF-α by CO in lung macrophages may promote the reactivation of PTB [11]. Therefore, we predicted a positive correlation between pollutant concentration and tuberculosis incidence. The possible link between PTB and meteorological factors may be attributable to the following reasons. The risk of Mycobacterium tuberculosis transmission appears to be the greatest during the cold winter, particularly in overcrowded and poorly ventilated settings [12]. Less humidity leads to the evaporation of droplets, reduces their size, and escalates their ability to travel further, which increases the possibility of transmission [13]. The atmosphere pressure has an indirect negative effect on the incidence of PTB. Low-level air rises under low-pressure conditions, and surface pollutants diffuse vertically into the air, resulting in increased air pollution. Therefore, we predicted a negative correlation between meteorological factors and tuberculosis incidence [14].
To our knowledge, few studies used the ARIMAX model to incorporate both the air pollution and meteorological factors to predict PTB. Thus, in the current study, we performed a time series analysis in Ningbo, China, and applied ARIMA models (ARIMA, univariate ARIMAX and multivariate ARIMAX model) to explore whether the inclusion of air pollution and meteorological factors can improve the performance of prediction modeling.

Study Site and Data Collection
As a city in Zhejiang province located along the eastern coast of China, Ningbo covers an area of 98,000 thousand square kilometers. It governed 10 counties and had a permanent population of 9.4 million at the end of 2020. All newly diagnosed TB cases are registered in an online Tuberculosis Management Information System (TBIMS), which is operated by the Center for Disease Control and Prevention (CDC) of China. We extracted monthly incidence of PTB from January 2015 to December 2019 as the study subjects. Population data were obtained from the Ningbo Statistical Yearbook. We used the monthly incidence from January 2015 to December 2018 as the model-construction datasets and incidence from January 2019 to December 2019 as the validation datasets.
The monthly average concentrations of the ambient air pollutants including nitrogen dioxide (NO 2 ), carbon monoxide (CO), sulfur dioxide (SO 2 ), ozone (O 3 ), particulate matter 2.5 µm in diameter (PM 2.5 ), and particulate matter 10µm in diameter (PM 10 ) for the same period were obtained from the Ningbo Environment Monitoring Center. Meteorological factors included monthly average temperature (MAT, • C), monthly average highest temperature (MAHT, • C), monthly average lowest temperature (MALT, • C), monthly average relative humidity (MAH, %), monthly average atmospheric pressure (MAP, hPa), and monthly average wind speed (MAS, m/s) and data were obtained from the Ningbo Meteorological Bureau.

Construction of the ARIMA Model
Following Li et al. [9], we constructed a seasonal ARIMA model, which was expressed as ARIMA (p, d, q) (P, D, Q) s . The variable p is the order of the autoregression (AR) process, q is the number of moving average (MA) terms, d represents the differencing process to form a stationary times series, and P, D, and Q are the seasonal orders of the AR, differencing, and MA processes, respectively [3]. Additionally, s denotes the seasonal period. The number of PTB incidence predicted at time t (Y t ) was determined by the is the operator of the moving-average model, Θ Q (B S ) is the operator of the seasonal-moving average model, ϕ p (B) is the operator of the auto-regressive model, Φ P (B S ) is the operator of the seasonal autoregressive model, (1−B) d is the component of the ordinary differences, (1−B S ) D is the component of the seasonal differences, a t is white noise, and Y t is the predicted variable [9]. Based on the monthly incidence of PTB, we used the auto. arima ( ) function in R software (the R Core Team, Vienna, Austria) to undertake automatic ARIMA model selection and fitting according to Bayesian information criterion (BIC). BIC takes into account the number of observations and has a larger penalty compared with Akaike Information Criterion (AIC) [15]. When the number of observations is too large, it can effectively prevent the model from being too complicated and over-fitting. The model with the lowest BIC is defined as the optimal model. The Ljung-Box test was used to test whether the model residual sequence showed auto-correlation. Finally, we selected the optimal ARIMA model to predict PTB incidence in 2019 and mean absolute percentage error (MAPE) was used to model validation.

Cross-Correlation Analysis
Due to strong auto-correlations in the data, correlations of the time series of the monthly incidence of PTB with air pollution and meteorological factors were difficult to identify. In this study, a prewhitening process was applied to the data among the multiple exogenous regressors [3]. Prewhitening is used to avoid common trends between incidence and ambient factors [16]. We calculated the cross-correlation function (CCF) between the residual series from ARIMA model of the monthly incidence and ambient factors to identify the significant time lags. Based on the results from previous studies as well as biological and epidemiological plausibility, within a lag length of 12, we selected the lag periods with positive and significant values for the next step of the analysis [17].

Construction of the ARIMAX Model
The lag value of the statistically significant factors identified by the cross-correlation analysis was incorporated as exogenous regressors into the ARIMAX model constructed above. The ARIMAX model is described by equation where X represents the exogenous regressor, which can be univariate or multivariate [9,16]. The other parameters are as described in the ARIMA model above. In this study, we first took a single lag period of risk factor in the univariate ARIMAX model. The coefficients of the model were estimated using the maximum likelihood method. Fitted MAPE was used to measure the performance of the ARIMA and the ARIMAX model in predicting PTB incidence from 2016 to 2018, and test MAPE was used to measure the performance in 2019. We determined the suitable univariate ARIMAX models according to four criteria: (a) the BIC value smaller than the optimal ARIMA model; (b) the coefficients of the regression term all significant (p < 0.05); (c) the fitted MAPE value smaller than the optimal ARIMA model; and (d) the test MAPE value smaller than the optimal ARIMA model.
For the multivariate ARIMAX analysis, the incorporated factors were selected from the factors in the suitable univariate ARIMAX models. The suitable multivariate ARIMAX models were also determined by four criteria above. The optimal ARIMAX model should have the lowest BIC, fitted MAPE, and test MAPE values.

Statistical Software
We used the packages of "forecast", "Stats", and "ggplot2" of R 3.5.1 (the R Core Team, Vienna, Austria) (https://www.r-project.org/, accessed on 8 Janunary 2022) to estimate the ARIMA and ARIMAX models and present data visualization. The R code can be found in Supplementary File S1. The significance level was set at 0.05.

Descriptive Analysis
Descriptive statistics of the monthly incidence of PTB, the monthly average air pollutants concentration, and meteorological factors in Ningbo from 2015 to 2019 are listed in Table 1. Time series plots of these are shown in Figure 1. The annual PTB notification rates from 2015 to 2019 of Ningbo were 48.20/100,000, 45.81/100,000, 45.99/10,000, 46.92/100,000, and 44.14/100,000, respectively. The monthly incidence series plot showed a seasonal fluctuation. The peak incidence mainly occurred in March, April, and May, and the trough was more common in November, December, and February. The mean concentrations of PM 2.5 , PM 10

ARIMA Model
The optimal ARIMA model constructed by monthly incidence of PTB from 2015 to 2018 according to BIC was the ARIMA (0,0,0)(1,1,0) 12 Table 2. The Ljung-Box test indicated that residual sequence from all models did not significantly depart from a white noise sequence (p > 0.05).

CCF between the Air Pollutants and Meteorological Factors
We also calculated the CCF coefficients between prewhitened air pollutants concentration and meteorological factors residual series at lag 0 between 2015 and 2018. Table 4 shows that PM 2.5 and PM 10 were significant negatively correlated with MAT, MALT, and MAH; SO 2 was significant negatively correlated with MALT and MAH; NO 2 was significant negatively correlated with MALT, MAH, and MAS; O 3 was significant negatively correlated with MAH , but significant positively correlated with MAP.

Univariate and Multivariate ARIMAX Analyses
Based on the cross-correlation analysis, we tested the univariate ARIMAX model by incorporating different lag periods of risk factors as exogenous regressors based on the ARIMA (0,0,0)(1,1,0) 12 model from PTB time series. Because different ambient factors have different lag periods, from 1 to 11 months, in order to compare the performance of different univariate ARIMAX models and the optimal ARIMA model, it is necessary to make the sequence length of modeling consistent. Therefore, the monthly incidence of PTB and environmental factors with different lag periods from January 2016 to December 2018 were selected to construct the ARIMAX model and the modeling times were all 36 months.

Discussion
In this study, we developed and evaluated time series models to characterize the ambient factors of pulmonary tuberculosis transmission in Ningbo to inform control measures. To our knowledge, this is the first time series study to construct ARIMAX models to explore the role of both air pollution and meteorological factors in predicting PTB. We found that both air pollution and meteorological factors were associated with the incidence of tuberculosis in Ningbo with a lag effect. Additionally, the multivariate ARIMAX model that included both the ozone at lag 0 and the monthly mean atmospheric pressure at lag 11 had a better predicting performance than the inclusion of one variable. This modeling technique can be a useful tool for planning control interventions and could be implemented during routine tuberculosis surveillance in Ningbo.
Although preventive measures have made great progress, the prevention and treatment of tuberculosis still involves enormous challenges, such as increased drug resistance [19,20], dual infection of tuberculosis and AIDS [21], and increased migrant population [22]. Furthermore, urban air quality is an important potential factor in the contribution of tuberculosis infection. There is a growing body of evidence suggesting an association between air pollution exposure and PTB incidence. However, available evidence on the association of air pollution and PTB risk is inconsistent [23]. Our study found a significantly positive correlation of particulate matter with an aerodynamic diameter ≤2.5 µm (PM 2.5 ) or ≤10 µm (PM 10 ) with PTB incidence, which is consistent with other studies [24,25]. Moreover, we also found that CO was positively correlated with PTB incidence, while O 3 was negatively correlated with PTB incidence, which is consistent with other studies [11,24]. Limited evidence from vitro studies has suggested that the survival rate of mice intravenously inoculated with Mycobacterium tuberculosis after intravenous injection of dissolved ozone was significantly higher than that of mice not treated with ozone [26]. The association between PTB and air pollution exposure varied across regions, which may be partially attributed to pollutant concentration or analytic methods. Our study site is a coastal city of South China, the concentrations of air pollutants in our study (PM 2. 5 [27]. Evidence from the area with lower air pollution levels will be important to strengthen the basis for policy making [28].
Meteorological factors, including MAT, MALT, MAH, and MAP, had a negative effect on PTB with a lag effect in our study, which are largely consistent with the findings in previous studies [14,29]. Climate change affects tuberculosis through diverse pathways: changes in climatic factors such as temperature, humidity, and precipitation influence host response through alterations in vitamin D distribution, ultraviolet radiation, malnutrition, and other risk factors [30]. Our research also found that PM 2.5 and PM 10 were significant negatively correlated with MAT, MALT, and MAH, and O 3 was significant negatively correlated with MAH, but significant positively correlated with MAP. The results of our study support previous findings that climatic factors could affect the incidence of tuberculosis by indirectly regulating urban air quality [14].
Our study used incidence of PTB from 2016 to 2018 and delayed ambient factors modeling to predict TB incidence in 2019 and compare it with the actual incidence. Crossvalidating to calculate test MAPE could avoid the over-fitting of the model caused by incorporating ambient factors. For example, ARIMA (0,0,0)(1,1,0) 12 + MAT (lag1) has the smallest fitted MAPE of 2.3205%, but its test MAPE is as high as 12.0531%, which is an obvious over-fitting phenomenon. The fitted and test MAPE of ARIMA (0,0,0)(1,1,0) 12 + O 3 (lag0) + MAP (lag11) are both smaller than the optimal ARIMA model, so over-fitting is effectively avoided. Through ARIMA model prewhitening, we found a lag correlation between the residual sequence of ambient factors and the residual sequence of tuberculosis incidence, which suggests that it is possible to control tuberculosis and other infectious diseases by protecting the environment and reducing pollutants.
Admittedly, there were limitations in our study. Firstly, because there was a lag relationship between ambient factors and the incidence of PTB, we could accurately predict the incidence in these future lag periods by using the ARIMAX model. For example, the lag time of MAP is 11 months, so we can accurately predict the incidence of the next 11 months. However, there was no lag relationship between O 3 and PTB in this study. Therefore, using O 3 to predict tuberculosis in the same month does not produce significant result. In addition, the ARIMA model can only extract the seasonal and long-term trends of ambient factors and tuberculosis incidence. This model cannot extract and predict the information of residual sequences. Thus, it is invalid to use the ARIMA model to predict future ambient factors and then incorporate them into the ARIMAX model. Because we are temporarily unable to accurately predict the future ambient factors, we can only rely on factors such as MAP to accurately predict the incidence of PTB in the next 11 months. It is our future direction to improve the accuracy of the ARIMAX model in predicting long-term incidence of PTB by finding a more advanced model to predict the future ambient factors. Secondly, the ARIMAX model can only identify a correlation between PTB incidence and ambient factors but cannot identify causal relationships. Thirdly, data from fixed monitoring stations assume that all participants in the target area were at the same level of air pollution exposure, and personal exposure data were not collected [31].

Conclusions
Ambient air pollutants and meteorological factors were associated with monthly incidence of PTB with a lag effect. Meteorological factors may affect the incidence of PTB by indirectly regulating urban air quality. The multivariate ARIMAX model that included both the ozone with a 0-month lag and the monthly average atmospheric pressure with an 11-month lag had the best performance to predict the incidence of PTB. However, ARIMAX has limited improvement in prediction accuracy compared with the ARIMA model. This modeling technique can be a useful tool for planning control interventions and could be implemented during routine tuberculosis surveillance in Ningbo. When the peak of tuberculosis and low atmosphere pressure arise, health education on tuberculosis should be strengthened to remind people to seek medical treatment as soon as symptoms appear.

Data Availability Statement:
The data used and analyzed in the current study are not publicly available because restrictions apply to the availability of the data. Data are, however, available from the corresponding author by reasonable requests, and with permission from the Ningbo Municipal CDC.