Ensemble Model Development for the Prediction of a Disaster Index in Water Treatment Systems

Abstract: The quantitative analysis of the disaster effect on water supply systems can provide useful information for water supply system management. In this study, a total disaster index (TDI) was developed using open-source public data in 419 water treatment plants in Korea with 23 input variables. The TDI quantifies the possible effects or damage caused by three major disasters (typhoons, heavy rain, and earthquakes) on water supply systems. The four components (regional factor, risk factor, urgency factor, and response and recovery factor) were calculated using input variables to determine the disaster index (DI) of each disaster. The weight of the input variables was determined using principal component analysis (PCA), and the weights of the DI of three natural disasters and four components used to calculate the TDI were determined by the analytical hierarchy process (AHP). Specifically, two ensemble machine learning models, random forest (RF) and XGBoost (XGB), were used to develop models to predict the TDI. Both models predicted the TDI with the coefficient of determination and root-mean-square error-observations standard deviation ratio of 0.8435 and 0.3957 for the RF model and 0.8629 and 0.3703 for the XGB model, respectively. The relative importance analysis suggests that the number of input variables can be minimized, which improves the models' practical applicability.


Introduction
Various natural disasters, such as floods and earthquakes, cause considerable damage to water supply systems. This damage includes the destruction of plants, intake systems, pipelines, and electric systems, and the consequent interruption of water supply to the public [1]. However, quantitative, indicator-based assessments of the effects of disasters on water infrastructure have not been thoroughly conducted.
In this study, the effects of various disasters on water supply systems are quantified from a management perspective using statistical data analysis methods, namely PCA and the analytic hierarchy process (AHP). From this statistical approach, a total disaster index (TDI) was developed. In the second part, tree-based ensemble models (i.e., RF and gradient boosting decision tree (GBDT)) were used to predict the TDI, which provides valuable information for the safety management of water supply systems.

Data Sources
A total of 23 input variables, covering facility specifications and operational data from 419 water treatment plants in Korea, were used to develop the TDI. The data were obtained from statistical yearbooks and open-source public data (Table 1). The 23 input variables provide information about the water supply systems, including water supply capacity, pipeline density, number of customers, management labor, and the regional natural conditions where the water treatment plants are located (Table 2). The local peak ground acceleration caused by an earthquake at each water treatment plant was estimated from the Korea Seismicity Map program developed by Cao et al. [29]. The data for regional natural conditions were obtained from meteorological data available from the national meteorological administration information portal [30]. The financial status of the local government that manages each water treatment plant was collected from the public data portal of the Ministry of the Interior and Safety in Korea [31].

Data | Reference
Water treatment plant operational information and facility specification | Statistical yearbook of water treatment system [32]
Meteorological data | Meteorological administration information portal [30]
Financial status in local government | Ministry of the interior and safety information portal [31]
Design standard for wind speed | Korean Design Standard [33]
Local peak ground acceleration by earthquake | Korea Seismicity Map [29]

Typhoons and heavy rains are among the most frequent disasters in Korea, while Korea has been known to be relatively safe from earthquakes. However, interest in earthquakes in Korea has increased since the two earthquakes with magnitudes of 5.8 and 5.4 on the Richter scale in 2016 and 2017, respectively. In this study, three natural disasters (typhoons, heavy rains, and earthquakes) were selected as the disasters with the greatest influence on the water supply system, considering the natural characteristics of Korea, and were used for the TDI development.

Component of Disaster Index
The four components (i.e., regional factor, risk factor, urgency factor, and response and recovery factor), describing the level of the damage caused by each type of disaster, were used to determine the disaster index (DI) of three natural disasters as follows.

1. Regional factor (RE) represents regional characteristics such as the frequency of natural disaster occurrence in the selected areas;
2. Risk factor (RI) represents the quantity of possible damage caused by natural disasters. For example, the RI increases as the capacity of water treatment plants or the length of water supply pipelines increases;
3. Urgency factor (UR) represents the urgency of recovery after a disaster. For example, the UR increases with a larger population in the area receiving drinking water; and,
4. Response and recovery factor (RR) represents the recovery ability during and after a disaster, which is estimated by the financial status or manpower of the authority of a water treatment plant, such as the local government.
A total of 23 input variables obtained from open-source public data were used to determine the four components (RE, RI, UR, and RR), as summarized in Table 2.

PCA Analysis for Index Weight
The weight of each variable for the DI of the three natural disasters (typhoon, heavy rain, and earthquake) was determined using PCA. PCA is a statistical method that reduces the dimension of the variables and determines each variable's relative importance from an eigenvector. Before PCA, the input variables were standardized to a mean of zero and a standard deviation of one [34][35][36].
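The standardization and eigenvector steps can be sketched as follows. This is only an illustration under stated assumptions: the text does not say which principal component or sign convention was used, so the sketch takes the absolute loadings of the first principal component and scales them to sum to 1. The function name `pca_weights` and the random demo data are hypothetical.

```python
import numpy as np

def pca_weights(X):
    """Weights from the first principal component (assumed choice), summing to 1."""
    X = np.asarray(X, dtype=float)
    # Standardize each variable to mean 0 and standard deviation 1.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Covariance of standardized data equals the correlation matrix.
    R = np.cov(Z, rowvar=False)
    # eigh returns eigenvalues in ascending order for symmetric matrices.
    eigvals, eigvecs = np.linalg.eigh(R)
    first_pc = eigvecs[:, -1]
    # Assumption: use absolute loadings so that weights are non-negative.
    loadings = np.abs(first_pc)
    return loadings / loadings.sum()

# Hypothetical demo data: 50 plants, 4 variables (not the study's data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 4))
X_demo[:, 1] += 0.8 * X_demo[:, 0]
weights = pca_weights(X_demo)
print(weights)
```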

AHP Analysis
The DI of three natural disasters and four components were used for the calculation of TDI in 419 water treatment plants. However, there was limited data available for the statistical determination of the relative weight of each natural disaster and four components for the TDI calculation. In addition, although the effect of earthquakes on the water supply system is expected to be extremely large, there were only two significant earthquakes in Korea that occurred in 2016 and 2017. Thus, it should be noted that quantitative data for the analysis of the effect of earthquakes was limited.
The weights of three natural disasters and four components used to calculate the TDI were determined by the AHP suggested by Saaty [37,38]. The AHP is a structured data analysis method for complex decision-making, which is also widely used to analyze disaster data [9,23,39]. In AHP, a pairwise comparison matrix of each element for the decision-making process is structured. This structure relates to the matrix's eigenvector, which represents the weight of each element in the decision-making process [37,40].
The survey results from 62 experts or engineers currently working in water treatment plants were used for AHP analysis. To maintain the consistency of the AHP analysis result, only the survey data with a consistency ratio (CR) of less than 0.2 was used to calculate the weight of each element (Equation (1)) [40][41][42].

$$\mathrm{CI} = \frac{\lambda_{\max} - n_f}{n_f - 1}, \qquad \mathrm{CR} = \frac{\mathrm{CI}}{\mathrm{RI}} \tag{1}$$

where $\lambda_{\max}$: principal eigenvalue of the pairwise comparison matrix, $n_f$: number of features, CI: consistency index, RI: random consistency index (RI = 0.90 for $n_f$ = 4 and RI = 0.58 for $n_f$ = 3), and CR: consistency ratio.
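As a hedged illustration of the AHP calculation, the sketch below derives weights from the principal eigenvector of a pairwise comparison matrix and computes CR with Saaty's standard definitions, using the RI values quoted above. The pairwise matrix is a made-up example for three disasters, not survey data from the study, and `ahp_weights` is a hypothetical helper name.

```python
import numpy as np

# Random consistency index values quoted in the text.
RANDOM_INDEX = {3: 0.58, 4: 0.90}

def ahp_weights(pairwise):
    """Return (weights, consistency ratio) for a reciprocal pairwise matrix."""
    A = np.asarray(pairwise, dtype=float)
    n = A.shape[0]
    eigvals, eigvecs = np.linalg.eig(A)
    k = int(np.argmax(eigvals.real))      # index of the principal eigenvalue
    lam_max = eigvals[k].real
    w = np.abs(eigvecs[:, k].real)        # principal eigenvector, sign removed
    w = w / w.sum()                       # normalize weights to sum to 1
    ci = (lam_max - n) / (n - 1)          # consistency index
    cr = ci / RANDOM_INDEX[n]             # consistency ratio
    return w, cr

# Hypothetical judgments for (typhoon, heavy rain, earthquake).
A_demo = [[1.0, 3.0, 2.0],
          [1 / 3, 1.0, 1 / 2],
          [1 / 2, 2.0, 1.0]]
w_demo, cr_demo = ahp_weights(A_demo)
keep_response = cr_demo < 0.2             # screening threshold used in the text
print(w_demo, cr_demo, keep_response)
```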

Disaster Index Model
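The TDI is determined by the weighted sum of the DI for three natural disasters, and each DI is the weighted sum of its four components (Equations (2)-(5)).

$$\mathrm{TDI} = a \cdot \mathrm{TI} + b \cdot \mathrm{HI} + c \cdot \mathrm{EI} \tag{2}$$
$$\mathrm{TI} = a_t \cdot \mathrm{RE}_t + b_t \cdot \mathrm{RI}_t + c_t \cdot \mathrm{UR}_t + d_t \cdot \mathrm{RR}_t \tag{3}$$
$$\mathrm{HI} = a_h \cdot \mathrm{RE}_h + b_h \cdot \mathrm{RI}_h + c_h \cdot \mathrm{UR}_h + d_h \cdot \mathrm{RR}_h \tag{4}$$
$$\mathrm{EI} = a_e \cdot \mathrm{RE}_e + b_e \cdot \mathrm{RI}_e + c_e \cdot \mathrm{UR}_e + d_e \cdot \mathrm{RR}_e \tag{5}$$

where TI: DI for typhoon; HI: DI for heavy rain; EI: DI for earthquake; $a$, $b$, and $c$: weight of each natural DI; and $a_t$, $b_t$, $c_t$, $d_t$, $a_h$, $b_h$, $c_h$, $d_h$, $a_e$, $b_e$, $c_e$, and $d_e$: weight of each component (RE, RI, UR, and RR) for typhoon, heavy rain, and earthquake, respectively.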
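A minimal sketch of the two-level weighted sum for a single plant is given below. All numbers are hypothetical placeholders rather than the weights in Tables 4 and 5, and the sketch assumes that the four component weights of each disaster are applied in the order RE, RI, UR, RR.

```python
COMPONENTS = ("RE", "RI", "UR", "RR")
DISASTERS = ("TI", "HI", "EI")  # typhoon, heavy rain, earthquake

def disaster_index(component_values, component_weights):
    """DI of one disaster: weighted sum of its four components."""
    return sum(component_weights[c] * component_values[c] for c in COMPONENTS)

def total_disaster_index(di_values, disaster_weights):
    """TDI: weighted sum of the three disaster indexes."""
    return sum(disaster_weights[d] * di_values[d] for d in DISASTERS)

# Hypothetical component values and weights for one plant.
equal_w = {c: 0.25 for c in COMPONENTS}
di = {
    "TI": disaster_index({"RE": 1.0, "RI": 2.0, "UR": 0.0, "RR": -1.0}, equal_w),
    "HI": disaster_index({"RE": 0.4, "RI": 0.4, "UR": 0.4, "RR": 0.4}, equal_w),
    "EI": disaster_index({"RE": 1.0, "RI": 1.0, "UR": 1.0, "RR": 1.0}, equal_w),
}
tdi = total_disaster_index(di, {"TI": 0.5, "HI": 0.2, "EI": 0.3})
print(di, tdi)
```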

Model Selection
Two ensemble models, RF and GBDT, have been increasingly used as ML models to manage the water environment. Both models perform well even for nonlinear relationships and for data with outliers, and both are applicable to classification and regression [43,44].
RF is a tree-based ensemble model in which a random data selection approach generates multiple decision trees. RF randomly selects several sets of input features from the original input features by a bagging method before generating the decision trees, which increases the independence and variability of each decision tree. The final RF prediction is determined by averaging the predictive results from individual decision trees in RF [45]. Consequently, the prediction performance of RF can be dramatically improved [46][47][48] and outperforms other ML models [49]. RF has shown high performance in various domains and has also been continuously applied to environmental research, such as water quality prediction [50,51].
GBDT is an ensemble model based on a gradient boosting method (GBM), called a sequential tree-based calculation process [45,52,53], and a set of decision trees. Unlike RF which determines the final prediction by voting (for classification) or averaging (for regression), GBDT uses the decision tree, called a weak learning model, from a previous stage in the ML process to improve model performance in the following stage. Residual errors of the prior stage are included in developing the decision tree in the current stage to reduce the residual errors by optimizing a specified loss function [45,52]. This optimization process is sequentially performed until the predefined number of decision trees is reached, which is a major difference with RF, where the calculation of each tree is independent.
GBDT is optimized by minimizing an objective function, J, for a training data set with n samples. A regularization term can be added to avoid overfitting of the model [44,54]. Equation (6) shows an illustrative example of the objective function of GBDT [44,54].

$$J = \sum_{i=1}^{n} L\left(y_i, \hat{y}_i\right) + \sum_{k=1}^{K} \Omega\left(f_k\right) \tag{6}$$

where $f_k$: function of the kth decision tree, $K$: number of decision trees, $L$: loss function that calculates the difference between an observation ($y_i$) and model prediction ($\hat{y}_i$), $\Omega$: regularization function that penalizes the complexity of the model, and $n$: number of data samples. The schematics of RF and GBDT are compared in Figure 1, where X denotes the input features, $\{h(X, \theta_k)\}$ is a collection of decision trees, and the $\theta_k$ are independent and identically distributed random vectors [44,45,54].
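The sequential residual-fitting idea behind GBDT can be sketched with shallow regression trees from Scikit-learn. This is a simplified illustration, not the XGBoost algorithm: it assumes a squared-error loss, omits the explicit regularization term Ω, and uses a shrinkage factor and a depth limit as crude complexity controls. The function names and the synthetic data are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_boosted_trees(X, y, n_trees=50, learning_rate=0.1, max_depth=2):
    """Fit trees sequentially, each one on the residuals of the current model."""
    base = float(np.mean(y))              # initial constant prediction
    pred = np.full(len(y), base)
    trees = []
    for _ in range(n_trees):
        residual = y - pred               # negative gradient of squared error
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        tree.fit(X, residual)
        pred += learning_rate * tree.predict(X)
        trees.append(tree)
    return base, trees

def predict_boosted(base, trees, X, learning_rate=0.1):
    """Sum the shrunken contributions of all trees on top of the base value."""
    return base + learning_rate * sum(t.predict(X) for t in trees)

# Synthetic nonlinear data (not the study's data).
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=200)

base, trees = fit_boosted_trees(X_demo, y_demo)
mse_start = float(np.mean((y_demo - base) ** 2))
mse_end = float(np.mean((y_demo - predict_boosted(base, trees, X_demo)) ** 2))
print(mse_start, mse_end)
```

In contrast, an RF would fit each tree independently on a bootstrap sample and average the predictions.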

Model Optimization
In this study, both RF and GBDT models were used for the TDI estimation of 419 drinking water treatment plants. The Python open-source libraries Scikit-learn (for RF) and XGBoost (for GBDT) were used for regression model development [55,56]. XGBoost (XGB) is one of the most popular GBDT implementations and was developed by Chen and Guestrin [45,54]; Scikit-learn is a popular Python-based ML library developed by Pedregosa et al. [55]. The hyperparameters of RF and XGB were optimized by a trial-and-error method with ten-fold cross-validation using the grid search library in Scikit-learn [57]. The models were developed with 23 input variables of 419 water treatment plants, with the data split between training and testing at a ratio of 8:2.
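A hedged sketch of this workflow for the RF model is shown below, using Scikit-learn's `GridSearchCV` with ten-fold cross-validation and an 8:2 train/test split. The data are synthetic stand-ins with the same shape as the study's data set (419 samples, 23 variables), and the parameter grid is a hypothetical example; the grid actually searched is not given in the text.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the 419 x 23 data set (not the study's data).
rng = np.random.default_rng(0)
X = rng.normal(size=(419, 23))
y = 0.6 * X[:, 0] + 0.3 * X[:, 1] ** 2 + rng.normal(scale=0.1, size=419)

# 8:2 split between training and test data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Hypothetical hyperparameter grid for illustration only.
param_grid = {"n_estimators": [20, 50], "max_depth": [3, None]}
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=10,                               # ten-fold cross-validation
    scoring="neg_mean_squared_error",
)
search.fit(X_train, y_train)

best_rf = search.best_estimator_
test_r2 = best_rf.score(X_test, y_test)  # R^2 on the held-out 20%
print(search.best_params_, test_r2)
```

Because `xgboost.XGBRegressor` follows the Scikit-learn estimator interface, the same search could be run for XGB by swapping the estimator and the grid.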

Feature Importance (FI) of Input Variables
The relative importance of input variables on RF and XGB model performance was calculated using the feature importance (FI) algorithm in Scikit-learn [57]. The FI in the tree-based model was computed as the total impurity reduction of the model brought by that feature [55,58,59].
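As an illustration, impurity-based FI is exposed in Scikit-learn through the `feature_importances_` attribute of a fitted tree ensemble. In the sketch below the variable names are borrowed from the study, but the values are synthetic and the dependence of the target on Q and PUMP_EP is an assumption made only for the demo.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: the target depends mainly on the first two columns.
rng = np.random.default_rng(1)
names = ["Q", "PUMP_EP", "RAIN", "TYPHOON"]
X = rng.normal(size=(300, 4))
y = 2.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(scale=0.1, size=300)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances, normalized by Scikit-learn to sum to 1.
fi = model.feature_importances_
ranking = sorted(zip(names, fi), key=lambda pair: pair[1], reverse=True)
for name, value in ranking:
    print(f"{name}: {value:.3f}")
```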

Model Evaluation
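The model performance was evaluated by three indices (Equations (7)-(9)): the root-mean-square error (RMSE), the coefficient of determination (R²), and the RMSE-observation standard deviation ratio (RSR). RSR ranges from 0 to a large positive value and approaches 0 as the model fit to the observations improves. The model performance is considered satisfactory when RSR < 0.70 [60,61].

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(M_{i,\mathrm{obs}} - M_{i,\mathrm{model}}\right)^2} \tag{7}$$
$$R^2 = 1 - \frac{\sum_{i=1}^{n}\left(M_{i,\mathrm{obs}} - M_{i,\mathrm{model}}\right)^2}{\sum_{i=1}^{n}\left(M_{i,\mathrm{obs}} - \bar{M}_{\mathrm{obs}}\right)^2} \tag{8}$$
$$\mathrm{RSR} = \frac{\sqrt{\sum_{i=1}^{n}\left(M_{i,\mathrm{obs}} - M_{i,\mathrm{model}}\right)^2}}{\sqrt{\sum_{i=1}^{n}\left(M_{i,\mathrm{obs}} - \bar{M}_{\mathrm{obs}}\right)^2}} \tag{9}$$

where $M_{i,\mathrm{obs}}$: observed values, $\bar{M}_{\mathrm{obs}}$: mean of observed values, $M_{i,\mathrm{model}}$: model predicted values, and $n$: number of data samples.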
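The three indices can be computed as in the sketch below. It assumes that R² is defined as one minus the ratio of the residual sum of squares to the total sum of squares, in which case R² = 1 − RSR². The values reported in this study agree with that relation (for example, 1 − 0.3957² ≈ 0.843 for RF). The function names and sample numbers are hypothetical.

```python
import numpy as np

def rmse(obs, pred):
    """Root-mean-square error."""
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    return float(np.sqrt(np.mean((obs - pred) ** 2)))

def r2(obs, pred):
    """Coefficient of determination, assumed as 1 - SS_res / SS_tot."""
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    ss_res = np.sum((obs - pred) ** 2)
    ss_tot = np.sum((obs - obs.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

def rsr(obs, pred):
    """RMSE divided by the standard deviation of the observations."""
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    ss_res = np.sum((obs - pred) ** 2)
    ss_tot = np.sum((obs - obs.mean()) ** 2)
    return float(np.sqrt(ss_res) / np.sqrt(ss_tot))

# Hypothetical observations and predictions.
obs_demo = [1.0, 2.0, 3.0, 4.0]
pred_demo = [1.1, 1.9, 3.2, 3.8]
rmse_val = rmse(obs_demo, pred_demo)
r2_val = r2(obs_demo, pred_demo)
rsr_val = rsr(obs_demo, pred_demo)
print(rmse_val, r2_val, rsr_val)
```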

Characteristics of Input Variables
A total of 23 input variables for the development of the DI were identified from open-source public statistical data. The characteristics of the input variables are summarized in Table 3. The frequency of warning advisories of natural disasters was calculated at each water treatment plant from the sum of the three variables in Table 3 (i.e., RAIN, SWIND, and TYPHOON). The frequency of warning advisories ranged from 0.017 to 1.29 times/km² and tended to be higher in areas near the ocean, as shown in Figure 2 (mapped using ArcGIS Pro).

PCA Analysis
The weights for each natural disaster index were determined from PCA with 23 input variables (Table 4). The eigenvectors were calculated from PCA and normalized so that the weights within each component sum to 1.

AHP Analysis
The weights for each disaster type were determined from the AHP analysis using the survey data (CR < 0.2) (Table 5). The response rate of the survey ranged between 52 and 69% for each item. In descending order of weight, the disasters rank as typhoons, earthquakes, and heavy rain.

Disaster Index (DI)
The TDI was determined using the following model (Equations (10)-(13)), which was developed from the PCA and AHP analysis (Tables 4 and 5). Using the developed model, the TDI values of the 419 water treatment plants ranged from −0.526 to 3.813, with a mean of 0 and a standard deviation of 0.343. A higher TDI represents a higher potential for effects or damage from a disaster in water treatment systems. The TDI tends to be higher in water treatment plants near metropolitan cities as well as in areas near the ocean.
The TDI was developed considering the natural status of Korea. For example, there were only two earthquakes in 2016 and 2017, which were considered to have caused actual damage to water treatment plants in Korea. As the data available for the quantification of damage by earthquakes is minimal, the AHP based on survey data was used for the DI calculation.
Although there were few cases of earthquake damage to water treatment systems, the weight of the earthquake was larger than that of heavy rain. The AHP results indicate that, although earthquakes have been rare in Korea, the damage and consequences of an earthquake would not be negligible when one occurs, so a preventive plan against earthquakes should be prepared in advance. In addition, given that most facilities already experience heavy rain and are relatively well prepared for it, the actual damage caused by heavy rain is expected to be relatively small compared to other disasters.

Total Disaster Index (TDI) Prediction using Ensemble Models
Two ensemble ML models, RF and XGB, were used to develop a model to predict the TDI. The model performance with the test data set was evaluated by three indices, as summarized in Table 6. The R² and RSR were 0.8435 and 0.3957 for the RF model and 0.8629 and 0.3703 for the XGB model, respectively. The observed data and model predictions are compared in Figure 3. The predictions of both the RF and XGB models fit the observations well, while XGB performed slightly better on all three evaluation indices (Table 6 and Figure 3).

Feature Importance (FI) Analysis
The FI of the 23 input variables for both the RF and XGB models to predict the TDI is shown in Figure 4. The FI differed between RF and XGB, while the variables that represent the scale of water treatment plants, such as PUMP_EP and Q, tended to have a larger effect on model performance in both models. For RF, the nine input variables with the highest FI accounted for more than 80% of the total FI, while for XGB, the four highest variables accounted for more than 80%.
The performance of the RF and XGB models was then compared using fewer input variables, starting with one and adding up to ten input variables in descending order of FI (Figure 5). The performance of the RF model tended to improve as the number of input variables increased from one to ten; with only three input variables, the RSR was 0.6954, just below the 0.70 threshold for satisfactory performance. XGB performed better with few input variables: its RSR was 0.5323 with only three input variables and decreased to 0.3937 with ten. The FI analysis shows that the few input variables with the highest FI have a considerable effect on model performance, and that the RF and XGB models perform similarly when five or more of the input variables with the highest FI are used.
FI is one factor to consider when choosing the model structure, not an absolute criterion. The necessary input variables are not always obtainable from the actual operation and management of water treatment systems, so the practical applicability of the model improves as fewer input variables are required. The FI analysis suggests that the models achieve acceptable performance using only the subset of input variables with the highest FI, which would increase the practical applicability of the models.
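The reduced-input experiment can be sketched as follows: rank the variables by FI from a model trained on all inputs, then retrain using only the top-k variables and track the RSR on the test set. The data, the number of variables, and the RF settings are synthetic assumptions for illustration and do not reproduce the values in Figure 5.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

def rsr(obs, pred):
    """RMSE divided by the standard deviation of the observations."""
    return float(np.sqrt(np.sum((obs - pred) ** 2))
                 / np.sqrt(np.sum((obs - obs.mean()) ** 2)))

# Synthetic data: only the first four of ten variables carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 10))
coef = np.array([3.0, 2.0, 1.0, 0.5, 0, 0, 0, 0, 0, 0])
y = X @ coef + rng.normal(scale=0.2, size=400)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Rank variables by impurity-based FI from a model using all inputs.
full = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_train, y_train)
order = np.argsort(full.feature_importances_)[::-1]

# Retrain with the top-k variables only and record the test RSR.
rsr_by_k = {}
for k in range(1, 6):
    cols = order[:k]
    model = RandomForestRegressor(n_estimators=50, random_state=0)
    model.fit(X_train[:, cols], y_train)
    rsr_by_k[k] = rsr(y_test, model.predict(X_test[:, cols]))
print(rsr_by_k)
```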

Summary and Conclusions
In this study, a disaster index (DI) for predicting the effect or damage caused by three major natural disasters in Korea (i.e., typhoons, heavy rain, and earthquakes) was newly developed to quantify each natural disaster's effect on water utilities.
Although operational data from water utilities provide a good understanding of the effect of disasters, such data are usually collected in individually specified, often site-specific formats, which makes them difficult to collect, organize, and analyze. In addition, the operational data for water utilities were not easily accessible, limiting the comprehensive development of the DI. Therefore, in this study, the DI of natural disasters in water treatment systems was developed using statistical open-source public data. Two well-defined statistical data analysis methods (i.e., AHP and PCA) were used for the determination of the DI.
Open-source public data are more accessible and are updated regularly, so the DI can also be updated to reflect the current status, which is a significant benefit of using such data. The DI developed in this study may be specific to the location and conditions of the water utilities considered, but the developed framework would be applicable for quantifying the effect of disasters on water treatment systems in other regions with different natural conditions.
In the second part, two ensemble models (i.e., RF and XGB) were used to develop models to predict the TDI. Both RF and XGB showed similarly satisfactory performance in predicting the TDI, while XGB performed slightly better in general. The FI analysis also suggested that the models perform sufficiently well for practical use with only the several input variables of the highest FI, which can improve the practical applicability of the models.
Quantitative assessment of disaster effects on water treatment systems is essential for better management of these systems and a stable supply of drinking water to the public. However, data related to disaster analysis are often limited and hard to quantify. One possible solution would be to keep collecting data and analyzing them statistically, while facilitating frequent discussions among experts who have experienced disasters in their utilities [11,35]. Recent advances in information and communication technologies, such as sensor-based real-time monitoring methods, can provide continuous monitoring data about the operational condition of water treatment plants and related infrastructure, which can improve the pre- and post-disaster management planning processes [20,22]. However, the quantification and assessment of disaster effects on water treatment systems are still at an early stage, and the use of field operational data and responses, in particular during disaster events, is currently limited.
This study provided quantified information on the impact of various natural disasters on water treatment systems with open-source public data, which would be useful for creating a plan to reduce damage to water supply systems caused by natural disasters. Further study is warranted to use high-frequency real-time data to improve the model performance and practical applicability.