Identification of a Contaminant Source Location in a River System Using Random Forest Models

Abstract: We consider the problem of identifying the source location of a contaminant by analyzing changes in concentration levels observed by a sensor network in a river system. To address this problem, we propose a framework with two main steps: (i) pre-processing the data; and (ii) training and testing a classification model. Specifically, we first obtain a data set presenting concentration levels of a contaminant from a simulation model and extract numerical characteristics from the data set. Then, random forest models are generated and assessed to identify the source location of the contaminant. Using the numerical characteristics from the prior step as inputs, the models output the possibility, i.e., a value between 0 and 1, of a spill event at each candidate location. The performance of the framework is tested on a part of the Altamaha river system in the state of Georgia, United States of America.


Introduction
Many scholars and practitioners have explored technologies for monitoring water quality because of urbanization, industrialization, climate change, and threats related to terrorism. Among these technologies, identifying the source location of a contaminant in groundwater and river systems has been significantly improved by the development of sensors and data analytics. Rapid identification of contaminant source location enables us to reduce the risk of contaminant exposure by preventing pollution events and providing fast responses to undesired phenomena caused by such events.
Most past research works on identifying contaminant source locations have focused on groundwater systems. Gorelick et al. [1], Aral and Guan [2], Aral et al. [3], Sun et al. [4], and Singh and Datta [5] adopted optimization algorithms, such as linear and non-linear programming algorithms as well as meta-heuristic algorithms, to identify the source location of a contaminant in groundwater systems. Instead of optimization algorithms, some statistical methods, such as a backward probability model approach [6,7] and a geostatistical approach [8], were used for similar problems. In addition, Singh et al. [9], Singh and Datta [10] and Srivastava and Singh [11] employed an artificial neural network model to efficiently identify the source location of a contaminant in a groundwater system.
For river systems, there are relatively few studies regarding the identification of a contaminant source because of the size of such systems and the complexity of the corresponding problems. Boano et al. [12] used a geostatistical approach to generate historical information related to a pollutant when the source location is known. Chen et al. [13] employed multivariate statistical methods to determine spatial and temporal variations in water quality and to identify the contaminant source in a lake. Ghane et al. [14] applied the backward probability method to identify the source location and the release time of pollutants in a river system. In this paper, we focused on a river system and sought to identify the source location of a contaminant spill.
The research of Telci and Aral [15] is most closely related to ours. This work considers the problem of identifying the source location of a single instantaneous contaminant among given candidate locations in a river system while considering uncertainties in spill and rainfall events. They used estimates of statistical changes in concentration levels over time, as shown in Grubner [16]. Then, they applied an adaptive sequential feature selection algorithm developed by Jiang [17] to sequentially screen possible candidate locations of the contaminant source in a river system. Although their sensor network includes more than one sensor, Telci and Aral [15] do not consider relative information among sensors as an input. Moreover, the final result of the adaptive sequential feature selection algorithm is only a single source location index, and thus one cannot evaluate how reliable the identified location is. In this paper, we suggest a way to pre-process data related to changes in contamination levels and use relative information observed by pairs of sensors. We construct random forest models and provide a measure of the possibility that each candidate location is identical to the correct location of the contaminant source, reported as a number between 0 and 1. As a result, a decision maker can quantitatively evaluate the possibility of the results from the model.

This paper is organized as follows. Section 2 provides notations and the problem description. In Section 3, we provide a two-step framework to effectively identify the source location of a contaminant, including data pre-processing and model generation and assessment. Section 4 presents experimental results of a case study, and concluding remarks follow in Section 5.

Problem Description
In the river system, there are N candidate locations at which a monitoring sensor can be installed or where a spill event may occur. Let $D = \{1, 2, \ldots, N\}$ be the index set of the candidate locations, and let K be the number of sensors in the river system, with $2 \le K \le N$ because we consider a network system with multiple sensors. A vector $z = (z_1, \ldots, z_K)$ represents the location indices of the K sensors, with $z_j \in D$ for $j = 1, \ldots, K$, and we assume $z_1 < z_2 < \ldots < z_K$ to avoid repetition. In this paper, we consider the case where K and $z = (z_1, \ldots, z_K)$ are given. Each sensor returns a concentration level of the contaminant at time index t, for $t = 1, \ldots, T$. Let $Y^t_{z_j}$ denote the concentration level at time index t monitored by the sensor located at $z_j$, for $t = 1, \ldots, T$ and $j = 1, \ldots, K$. Then, the collection of concentration levels monitored by the K sensors over all time indices is denoted by

$$Y(z) = \left( Y^t_{z_j} : t = 1, \ldots, T,\ j = 1, \ldots, K \right).$$

For all $d \in D$, we denote by $P(d)$ a measure representing the possibility that location d is identical to the correct spill location. Note that $0 \le P(d) \le 1$ for all $d \in D$; the closer $P(d)$ is to 1, the more likely it is that a spill event has occurred at location index d. Let $P$ denote the vector of these values:

$$P = \left( P(1), P(2), \ldots, P(N) \right).$$
The main purpose of the paper is to construct a data-driven model which evaluates P for a given Y(z) as shown in Figure 1. To construct the data-driven model, we first need to prepare large-sized training data that can be obtained from a hydrodynamics simulation. In the next section, we briefly describe the hydrodynamics simulation considering random contaminant spill and rainfall events.

Hydrodynamics Simulation
A simulation software package was employed to get observations of Y(z) and P. The Storm Water Management Model (SWMM) is a popular software package for simulating the hydrodynamics and contaminant transportation in a river system. The SWMM was developed and released by the Environmental Protection Agency (EPA) of the United States of America, and it was designed for simulating hydrodynamic systems around urban areas under dynamic flow, including rainfall events and various watershed conditions as described in Rossman [18]. To construct a simulation model within the SWMM, we need (i) fixed information representing geological and geometrical properties of the system and fundamental hydrodynamics in the system; and (ii) variable information representing random spill and rainfall events based on historical data.
For each SWMM run, the fixed information is modeled as a set of constants, and the variable information is modeled as a set of random variables describing spill and rainfall events. Note that a single instantaneous spill is considered for each spill event. To describe the spill event, we denote by $Q_i$ and $M_i$ the spill starting time and spill intensity of the ith simulation, respectively. For rainfall events, we employed the method described in Telci et al. [19]. We partitioned the entire area of the river system into ω regions, called sub-catchments, each with a number of pre-generated rain patterns based on historical data. A rain pattern for each sub-catchment is then randomly selected from among the pre-generated patterns. An ω-dimensional vector $I_i$ denotes an instance of rain patterns over the entire river network in the ith simulation run.
For the ith simulation, a set of random variables $(Q_i, M_i, I_i)$ was generated and combined with the fixed information in an input file. After executing the SWMM software with the input file, we obtained a large output file that includes concentration levels, as well as various quantities regarding hydrodynamics, at each candidate location at every inter-reporting time of the simulation clock. We denote by $Y^i(z)$ and $P^i$ the ith simulation observation of $Y(z)$ and $P$; similarly, $Y^{i,t}_{z_j}$ and $P^i(d)$ represent the ith simulation observation of $Y^t_{z_j}$ and $P(d)$. We obtained the values of $Y^i(z)$ from the large output file and constructed $P^i$ by assigning 1 to the correct spill location and 0 elsewhere.
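
For concreteness, constructing the label vector $P^i$ from one simulation run can be sketched as follows (a minimal illustration; the 1-based location indices match the paper's notation):

```python
def one_hot_label(correct_location, num_locations):
    """Build the label vector P^i for one simulation run: 1 at the correct
    spill location index (1-based, as in the paper) and 0 elsewhere."""
    p = [0.0] * num_locations
    p[correct_location - 1] = 1.0
    return p
```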

Overall Workflow
In this section, we suggest a framework to identify the source location of a contaminant spill through a classification model with simulation data. The overall workflow of the proposed framework is presented in Figure 2. The framework consists of two main steps: pre-processing the simulation data, and generating and evaluating a classification model. As described in Section 2.2, a SWMM run with an input file returns $Y^i(z)$ and $P^i$. We quantitatively characterize the changes in non-zero concentration levels, whose shape is called the breakthrough curve, and we calculate relative time indices observed by each pair of sensors (see Section 3.2). After pre-processing $Y^i(z)$, we construct and evaluate a classification model (see Section 3.3).

Data Pre-Processing
Using $Y^i(z)$ directly to construct a data-driven model causes two main issues. First, the size of $Y^i(z)$ is $K \times T$, which becomes extremely large as the number of discretized time indices increases; note that T is often a couple of thousand in practice. Second, when we track $Y^{i,1}_{z_j}, \ldots, Y^{i,T}_{z_j}$ for a fixed $z_j$, most of the values are zero, and the non-zero values appear consecutively under a single, instantaneous spill. Therefore, we need to handle $Y^{i,1}_{z_j}, \ldots, Y^{i,T}_{z_j}$ for a fixed $z_j$ efficiently, and we focus on characterizing the non-zero values of $Y^{i,t}_{z_j}$ whenever any contaminant mass is observed at $z_j$. Let a and b represent the time indices at which the sensor starts and ends the detection of non-zero concentration levels, respectively:

$$a = \min\left\{t : Y^{i,t}_{z_j} > 0\right\}, \quad (4)$$

$$b = \max\left\{t : Y^{i,t}_{z_j} > 0\right\}. \quad (5)$$

Figure 3 shows a scatter plot of samples of $Y^{i,t}_{z_j}$ for $t = a, \ldots, b$; the plot has a curved, unimodal shape referred to as a breakthrough curve [11,15]. The breakthrough curve can be interpreted as a constant multiple of a probability density function, and thus it can be characterized using the statistical moments of that density [15,16]. As described in Telci and Aral [15], we first estimate the first statistical moment, standard deviation, skewness, and kurtosis of the breakthrough curve by approximation using the trapezoid rule. Let $r_t$ be the simulation-clock time corresponding to time index t. For $Y^{i,t}_{z_j}$, $t = a, \ldots, b$, the estimated first moment, denoted by $\mu^i_{1,z_j}$, is calculated by

$$\mu^i_{1,z_j} = \frac{\sum_{t=a}^{b-1} \tfrac{1}{2}\left(r_t Y^{i,t}_{z_j} + r_{t+1} Y^{i,t+1}_{z_j}\right)\left(r_{t+1} - r_t\right)}{\sum_{t=a}^{b-1} \tfrac{1}{2}\left(Y^{i,t}_{z_j} + Y^{i,t+1}_{z_j}\right)\left(r_{t+1} - r_t\right)}, \quad (6)$$

and the estimated kth central moment, for $k = 2, 3, \ldots$, denoted by $m^i_{k,z_j}$, is calculated by

$$m^i_{k,z_j} = \frac{\sum_{t=a}^{b-1} \tfrac{1}{2}\left[\left(r_t - \mu^i_{1,z_j}\right)^k Y^{i,t}_{z_j} + \left(r_{t+1} - \mu^i_{1,z_j}\right)^k Y^{i,t+1}_{z_j}\right]\left(r_{t+1} - r_t\right)}{\sum_{t=a}^{b-1} \tfrac{1}{2}\left(Y^{i,t}_{z_j} + Y^{i,t+1}_{z_j}\right)\left(r_{t+1} - r_t\right)}. \quad (7)$$

Let $\sigma^i_{z_j}$, $S^i_{z_j}$, and $E^i_{z_j}$ be the estimated standard deviation, skewness, and kurtosis of the breakthrough curve corresponding to $Y^{i,t}_{z_j}$, $t = a, \ldots, b$.
Using Equations (6) and (7), $\sigma^i_{z_j}$, $S^i_{z_j}$, and $E^i_{z_j}$ are calculated as follows:

$$\sigma^i_{z_j} = \sqrt{m^i_{2,z_j}}, \quad (8)$$

$$S^i_{z_j} = \frac{m^i_{3,z_j}}{\left(\sigma^i_{z_j}\right)^3}, \quad (9)$$

$$E^i_{z_j} = \frac{m^i_{4,z_j}}{\left(\sigma^i_{z_j}\right)^4} - 3. \quad (10)$$

If there is no positive $Y^{i,t}_{z_j}$ for any $t = 1, \ldots, T$, one may assign $\sigma^i_{z_j} = C$, $S^i_{z_j} = -C$, and $E^i_{z_j} = -C$, where C is a large positive constant.
In addition to $\sigma^i_{z_j}$, $S^i_{z_j}$, and $E^i_{z_j}$, we introduce two more quantitative characteristics, $U^i_{z_j}$ and $A^i_{z_j}$, which represent estimates of the total area and the time-averaged area between the horizontal axis and the breakthrough curve, respectively, as described in Srivastava and Singh [11]. Using the left Riemann sum, they can be calculated by

$$U^i_{z_j} = \sum_{t=a}^{b-1} Y^{i,t}_{z_j}\left(r_{t+1} - r_t\right), \quad (11)$$

$$A^i_{z_j} = \frac{U^i_{z_j}}{r_b - r_a}. \quad (12)$$

Using Equations (8)-(12), the series $Y^{i,1}_{z_j}, \ldots, Y^{i,T}_{z_j}$ for a fixed $z_j$ can be transformed into a 5-dimensional vector:

$$B^i_{z_j} = \left(\sigma^i_{z_j},\, S^i_{z_j},\, E^i_{z_j},\, U^i_{z_j},\, A^i_{z_j}\right). \quad (13)$$

Therefore, when considering K sensors whose locations are specified by vector z, $Y^i(z)$ can be transformed into the 5K-dimensional vector $B^i(z)$:

$$B^i(z) = \left(B^i_{z_1}, \ldots, B^i_{z_K}\right). \quad (14)$$

Since the sensor network includes at least two sensors, we may also utilize relative information across different sensors regarding when a non-zero $Y^{i,t}_{z_j}$ is first detected at each sensor. Let $R^i_{z_j}$ denote the time index of the first detected non-zero $Y^{i,t}_{z_j}$:

$$R^i_{z_j} = \min\left\{t : Y^{i,t}_{z_j} > 0\right\}. \quad (15)$$

Then, we define $R^i_{z_p,z_q}$ as the difference between the time indices at which non-zero concentration levels are first detected by the sensors located at $z_p$ and $z_q$:

$$R^i_{z_p,z_q} = R^i_{z_p} - R^i_{z_q}. \quad (16)$$

To exclude meaningless values of $R^i_{z_p,z_q}$ that are due to the structure of the river system, we only considered pairs of sensors satisfying the following conditions: (i) one of the sensors is located upstream of the other; and (ii) there is no other sensor between the pair located at $z_p$ and $z_q$. We denote by $R^i(z)$ the collection of $R^i_{z_p,z_q}$ for all pairs satisfying these two conditions. After pre-processing, $\left(B^i(z), R^i(z)\right)$ becomes the input vector of the classification model described in the next section.
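
The pre-processing above can be sketched in Python as follows. This is an illustrative implementation, not the paper's code: plain weighted sums stand in for the trapezoid-rule estimates of Equations (6) and (7), uniform reporting intervals are assumed for the area estimates, and the fallback constant C handles sensors with no detection.

```python
import math

def breakthrough_features(y, r, C=1e6):
    """Characterize one sensor's breakthrough curve by the five quantities
    (sigma, S, E, U, A) used as model inputs. `y` holds the concentration
    levels Y_t and `r` the corresponding simulation-clock times r_t."""
    nz = [t for t, yt in enumerate(y) if yt > 0]
    if not nz:                           # no detection at this sensor
        return C, -C, -C, 0.0, 0.0
    a, b = nz[0], nz[-1]                 # first/last non-zero time indices
    ys, rs = y[a:b + 1], r[a:b + 1]
    total = sum(ys)
    mu = sum(rt * yt for yt, rt in zip(ys, rs)) / total        # first moment
    m2, m3, m4 = (sum((rt - mu) ** k * yt for yt, rt in zip(ys, rs)) / total
                  for k in (2, 3, 4))                          # central moments
    sigma = math.sqrt(m2)
    if sigma == 0.0:                     # degenerate single-point curve
        S = E = 0.0
    else:
        S = m3 / sigma ** 3              # skewness
        E = m4 / sigma ** 4 - 3.0        # excess kurtosis
    dt = rs[1] - rs[0] if len(rs) > 1 else 1.0
    U = total * dt                       # total area (left Riemann sum)
    A = U / (rs[-1] - rs[0]) if rs[-1] > rs[0] else U          # time-averaged
    return sigma, S, E, U, A

def relative_detection_time(y_p, y_q):
    """Difference between the first non-zero detection time indices at a
    valid sensor pair (None if either sensor never detects)."""
    def first(y):
        return next((t for t, yt in enumerate(y) if yt > 0), None)
    tp, tq = first(y_p), first(y_q)
    return None if tp is None or tq is None else tp - tq
```

Concatenating the five features per sensor with the pairwise detection-time differences yields the input vector described above.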

Model Generation and Assessment
A random forest model is a popular classification model that contains a collection of tree-structured classifiers. As mentioned in Breiman [20], the random forest model has several advantages regarding accuracy, robustness and computational efficiency, compared with other classification models. Figure 4 shows the schematic flow diagram of the generation of a random forest model.
The first step of model generation is referred to as bootstrapping. In this step, the bootstrapping algorithm generates L sample data sets from the training data, and each sample data set corresponds to exactly one tree classifier. Each bootstrap sample contains approximately 2/3 of the distinct training observations, which are used to construct the tree classifier; the remaining 1/3 of the observations, left out of that sample, serve as out-of-bag (OOB) data to evaluate the generalization error of the random forest model [21]. The estimate of the generalization error is called the OOB error, which is calculated as the ratio of the number of misclassified OOB observations to the total number of OOB observations.
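
The bootstrapping step and the OOB error can be sketched as follows (illustrative Python; the paper's experiments used the R randomForest package, which performs this internally):

```python
import random

def bootstrap_split(n, seed=None):
    """Draw one bootstrap sample of size n (with replacement) and return the
    in-bag indices plus the out-of-bag (OOB) indices left out of the draw.
    On average about 1 - (1 - 1/n)^n ~ 63% of observations are in-bag, so
    roughly 1/3 remain as OOB data, matching the proportions cited above."""
    rng = random.Random(seed)
    in_bag = [rng.randrange(n) for _ in range(n)]
    oob = sorted(set(range(n)) - set(in_bag))
    return in_bag, oob

def oob_error(misclassified_oob, total_oob):
    """OOB error: misclassified OOB observations over all OOB observations."""
    return misclassified_oob / total_oob
```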
In the second step of model generation, we construct and train tree classifiers represented by nodes and edges. A tree classifier has two types of nodes: internal nodes and terminal nodes. At each internal node, F input variables are randomly selected and linearly combined with coefficients. We check whether the linear combination of the input variables is greater than a certain constant threshold, and then move to the next node accordingly. The constant thresholds and coefficients of the linear combination at each internal node are determined by a randomized node-optimization algorithm developed by Ho [22]; this process is called training. Each terminal node corresponds to a certain final class (e.g., a location index), and no further decisions or movements occur at a terminal node. After training, we obtain a combined classifier consisting of L tree classifiers. Note that different tree classifiers have different structures (e.g., numbers of nodes and edges) and different decision rules at their internal nodes. When a vector of input values (e.g., $(B^i(z), R^i(z))$) is entered into the model, each tree classifier selects one of the classes (e.g., a location index) as its result. Figure 5 shows an example of a trained tree classifier for z = (9, 19, 26) and F = 1. A vector of input values, $(B^i(9, 19, 26), R^i(9, 19, 26))$, is first entered into the node at the top of the tree, and it then moves along the edges. If $U^i(19) \le 7.46$ and $E^i(26) \le -0.773$, then the example tree concludes that location index 26 is the spill location. Each tree classifier votes for one of the classes (e.g., a location index from 1 to N) based on its conclusion, and the proportions of votes out of the total number of tree classifiers are returned as the output values (e.g., $P^i(d)$ for all $d \in D$), as shown in Figure 6.
Recall that L represents the number of tree classifiers, and F represents the number of input variables, or input features, randomly selected at each node. The choice of these two parameters may affect the performance of the combined classifier. As L increases, the generalization error gradually decreases and converges; in this paper, we selected an L at which the OOB error converges. A small value of F may reduce the accuracy of individual tree classifiers, but it also reduces the correlation among the trees, which decreases the generalization error. When M is the number of values in an input vector, F is typically selected as $\sqrt{M}$ [23,24] or $\log_2 M + 1$ [20,23]. We used the F selection strategy described in Breiman [25]: we checked the OOB errors for all integer values of F between $0.5\sqrt{M}$ and $2\sqrt{M}$, as well as between $0.5(\log_2 M + 1)$ and $2(\log_2 M + 1)$, and selected the F value with the lowest OOB error. In addition, we noted that a random forest model performs well when the number of classes is 32 or fewer [25], and thus we constructed a unified model with multiple random forest models based on partitioned classes. One way to achieve this for our problem is described in Section 4.2.
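
The F-scan above can be sketched as follows; `candidate_F_values` is a hypothetical helper name, and in practice each candidate F would be evaluated by training a forest and recording its OOB error:

```python
import math

def candidate_F_values(M):
    """Candidate values of F (input variables tried per node) following the
    selection strategy described in the paper: all integers between
    0.5*sqrt(M) and 2*sqrt(M), and between 0.5*(log2(M)+1) and 2*(log2(M)+1).
    The F value with the lowest OOB error over these candidates is kept."""
    ranges = [(0.5 * math.sqrt(M), 2 * math.sqrt(M)),
              (0.5 * (math.log2(M) + 1), 2 * (math.log2(M) + 1))]
    cands = set()
    for lo, hi in ranges:
        cands.update(range(math.ceil(lo), math.floor(hi) + 1))
    return sorted(f for f in cands if 1 <= f <= M)
```

For example, with M = 25 input values this yields the candidates 3 through 11.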

Study Area and Simulation Setup
In this section, we briefly review the study area, the Altamaha river system, and explain the method used to generate a SWMM input file for the river system. The Altamaha river system has the largest watershed in the state of Georgia, United States of America (31°57′33″ N, 82°32′37″ W), and it flows south-eastward to the Atlantic Ocean. The length and watershed size of the river system are approximately 760 km (470 miles) and 36,260 km² (14,000 square miles), respectively. The system consists of the Ocmulgee river, the Oconee river, the Ohoopee river, and their confluence, and it includes 60 river reaches and 62 junctions, as shown in Figure 7. We selected 100 candidate locations (marked by small circles in Figure 7), which include the most upstream locations, the locations of confluences, and locations evenly spaced along each river reach. Details regarding the selection of candidate locations in the Altamaha river system are given in Telci et al. [19].
Fixed information for the SWMM input file, which includes geological, geometrical, and fundamental hydrodynamics data of the river system, was obtained from the United States Geological Survey (USGS) National Elevation Dataset. The fundamental hydrodynamics of the river system were represented by a steady-state hydraulic system, calibrated using annual average flow rates measured in 2006 at twenty USGS gauging stations. Note that all lakes and impoundments were approximated as river reaches to simplify the network. Detailed information on the river system and the corresponding fixed information used to construct the SWMM model is provided in Telci et al. [19].
We used two random events, spill and rainfall events, as the variable information in the SWMM input file. For a spill event, the spill starting time and spill intensity, $Q_i$ and $M_i$, respectively, were assumed to be uniformly distributed between their lower and upper limits: 0 and 10 days for $Q_i$, and 10 and 1000 grams per liter for $M_i$. For rainfall events, the whole region of the river system was partitioned into 10 sub-catchments (i.e., ω = 10), and the rain pattern of each sub-catchment was randomly selected from among five pre-generated rain patterns. Note that each pre-generated rain pattern represents time-dependent rain events causing dynamic changes in the hydrological conditions of the sub-catchment. Detailed information on generating the random variables for a SWMM run is provided in Park et al. [26]. An input file for a SWMM run of the Altamaha river system is structured by combining the fixed information and the variable information, and the SWMM is then executed with that input file. Each SWMM run simulates the changes in hydrodynamics and contamination levels over 40 days and reports related quantitative values (e.g., concentration levels, flow rates, and amounts of overflow) at each candidate location every 15 min of the simulation clock.
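
Drawing one set of variable inputs $(Q_i, M_i, I_i)$ can be sketched as follows, assuming the uniform distributions given above; the 0-based indexing of the five pre-generated rain patterns is an arbitrary choice for illustration, since the paper does not specify pattern identifiers:

```python
import random

def sample_spill_and_rain(num_subcatchments=10, num_patterns=5, seed=None):
    """Draw one set of variable SWMM inputs (Q_i, M_i, I_i): spill start time
    uniform on [0, 10] days, spill intensity uniform on [10, 1000] g/L, and
    one pre-generated rain pattern chosen uniformly per sub-catchment."""
    rng = random.Random(seed)
    Q = rng.uniform(0.0, 10.0)                 # spill starting time [days]
    M = rng.uniform(10.0, 1000.0)              # spill intensity [g/L]
    I = [rng.randrange(num_patterns) for _ in range(num_subcatchments)]
    return Q, M, I
```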

Random Forest Model Generation
As seen in Figure 8, we considered the set of candidate locations D = {1, 2, . . . , 53} at which a spill event can occur. Since a random forest model performs well when the number of classes is 32 or fewer [25], we partitioned the set D into $D_1$ = {1, 2, . . . , 26} and $D_2$ = {27, 28, . . . , 53} and constructed a corresponding random forest model for each. In Figure 8, the region encircled by the solid line includes the candidate locations of $D_1$, which are located upstream, and the region encircled by the dotted line includes the candidate locations of $D_2$, which are located downstream. For p = 1, 2, let $z_p$ be a vector representing the location indices of the sensors that deliver direct information for detecting a source location in $D_p$, and let $\Phi_p(z_p)$ be a random forest model whose input is $(B^i(z_p), R^i(z_p))$ and whose output is $P^i(d)$, for $d \in D_p$. That is, the input of $\Phi_p(z_p)$ is the information obtained from the sensors located at $z_p$, and the output of $\Phi_p(z_p)$ is a measure of the possibility that the correct spill location is d, for all $d \in D_p$. Note that $z_1$ may include sensor locations in $D_2$ because some sensors located in $D_2$ can observe non-zero concentration levels for spill events that occurred in $D_1$. After training $\Phi_1(z_1)$ and $\Phi_2(z_2)$, a unified model was constructed, as described in Figure 9.
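
Since Figure 9 is not reproduced here, the exact unification rule is not available in this text; the sketch below shows one plausible routing, dispatching on whether the sensor at the dividing location $z_D$ detects the contaminant. This routing rule is an assumption for illustration, not the paper's exact construction.

```python
def unified_model(detected_at_zD, phi1, phi2, features):
    """One possible unified classifier over the partition D_1 / D_2:
    `phi1` and `phi2` stand for the trained random forest models for the
    upstream set D_1 and the downstream set D_2, and `detected_at_zD` is
    whether the sensor at the dividing location z_D observed the contaminant.
    ASSUMPTION: the dispatch rule below approximates the unification shown
    in the paper's Figure 9, which is not reproduced in this text."""
    model = phi1 if detected_at_zD else phi2
    return model(features)
```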
For the experiments, we considered three unified models with different numbers of sensors: (i) two sensors at z = (26, 53); (ii) four sensors at z = (9, 26, 46, 53); and (iii) six sensors at z = (9, 19, 26, 33, 46, 53). See Figure 8 for the locations of the sensors. The sensor locations of the first and second unified models were determined based on a subset of the optimal sensor locations from Park et al. [26]. In unified model 3, we arbitrarily added two more sensor locations, 19 and 33, to unified model 2. Note that all unified models use location index 26 as $z_D$ in Figure 9.
To train the random forest models, we first ran the SWMM model for the Altamaha river system under a single instantaneous spill at each candidate location with 500 random scenarios. Then, we constructed data sets for training each random forest model; the numbers of observations used to construct $\Phi_1(z_1)$ and $\Phi_2(z_2)$ were 26 × 500 and 27 × 500, respectively. For all random forest models, we set L = 500. To train each random forest model, we used the "randomForest" package in R version 3.3.0 on a personal computer (Intel Core i7-4790 CPU; 8 GB RAM). The average time required to construct a unified classification model was 17.87 s. Table 1 shows detailed information on the three models, including the model parameter F and the OOB errors, which are the training errors of each random forest model. Note that the OOB errors decreased significantly as the number of sensors increased.

Figure 9. Unified classification model with random forest models $\Phi_1(z_1)$ and $\Phi_2(z_2)$.

Model Assessment 1: Spill at a Candidate Location
To evaluate the performance of our unified classification models, we first tested them on the case in which a single instantaneous spill occurs exactly at one of the candidate locations, under uncertainties in the spill starting time, the spill intensity, and the rain patterns. The SWMM model was executed with a spill event at each candidate location under 100 random scenarios of spill and rainfall events, and thus the number of observations for testing was 53 × 100. In Figure 10a-c, the horizontal axis represents the spill location indices, and the vertical axis represents the percentage of scenarios in which the location with the maximum $P^i(d)$ is exactly the correct spill location. Overall, the percentages at all candidate locations become higher as the number of sensors in a model increases. The accuracy and robustness of unified model 3, with 6 sensors, were significantly enhanced compared with those of unified model 1, with 2 sensors. The proportions of misclassifications among the total test data were 28%, 8%, and 4% for unified models 1, 2, and 3, respectively.

Figure 11 shows the percentage of scenarios in which the correct spill location index was included in the top κ locations when all candidate locations are ordered by $P^i(d)$. Unified model 1 includes the correct spill location within the top 5 locations, and unified models 2 and 3 include it within the top 2 locations, with 100% accuracy over the 100 random scenarios.

Figure 11. Percentage that the correct spill location is included in the top κ locations with respect to the ranking of the value $P^i(d)$.
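
The top-κ statistic reported in Figure 11 can be computed as in the following sketch, where `P_list` is a hypothetical container holding one possibility vector per test run as a dict keyed by location index:

```python
def top_k_hit_rate(P_list, truth_list, k):
    """Percentage of test runs in which the correct spill location appears
    among the k locations with the highest possibility values P^i(d)."""
    hits = 0
    for P, truth in zip(P_list, truth_list):
        top = sorted(P, key=P.get, reverse=True)[:k]  # k largest P^i(d)
        hits += truth in top
    return 100.0 * hits / len(P_list)
```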

Model Assessment 2: Spill Near a Candidate Location
Another part of the model assessment was designed with spill events near candidate locations, as in Telci and Aral [15]. We selected 19 spill locations, marked R1 to R19 in Figure 12, and assessed the values of $P^i(d)$ for all $d \in D$. In this assessment, both the nearest upstream and the nearest downstream candidate locations from the spill location were accepted as correct spill locations. Figure 13 shows the percentage that either of the correct spill locations was included in the top κ locations when ordered by decreasing $P^i(d)$. All unified models performed better than the model in Telci and Aral [15], and, as expected, higher percentages of correct identification were achieved when more sensors were installed.
As described in Telci and Aral [15], it is difficult to recognize the correct spill location for R13 because the base flow from location index 32 is much smaller than the discharged flow from location index 33 in the hydrodynamics simulation. This is the main reason the model in Telci and Aral [15] cannot achieve 100% accuracy for spill location R13 even as κ increases. Figure 14a-c show the reported values of $P^i(d)$ at the candidate locations in $D_2$ from unified models 1, 2, and 3, respectively. As shown in Figure 14a, unified model 1, with two sensors, cannot clearly identify either location index 32 or 33 as the correct spill location. However, for unified models 2 and 3, the value of $P^i(32)$ increases up to 0.7, and location 32 can then be quantitatively identified as the correct source location based on its $P^i(32)$ value.

Conclusions
In this study, we proposed a framework to identify the source location of a contaminant spill, when changes in concentration levels can be observed at multiple sensors in a river system, via simulation. Specifically, the targeted river system was simulated to obtain a large data set under various random scenarios involving spill and rainfall events. To improve data-handling efficiency, the large data set was pre-processed and condensed into breakthrough curves and relative information on detection times at each pair of sensors. Random forest models were constructed and trained based on the pre-processed data. The random forest models were tested on the Altamaha river. Our model performs better than an existing model in terms of source identification. In addition, our model provides quantitative measures indicating that a selected location is the correct spill location.
We employed simulation data to test our framework in this study. Since real data tend to include more noise of various types than simulation data do, one may consider adopting noise-handling techniques (e.g., see Kim et al. [27]) to enhance the accuracy of our framework on real data.
Park et al. [26] presented a model to determine the best sensor locations, minimizing detection times while maintaining a certain level of detection reliability, whereas we presented a model to identify a contaminant source location when the number and locations of sensors are given. As in this study, users may apply the method of Park et al. [26] to determine the optimal sensor locations first, and then apply our framework to identify a source location based on the data obtained from those sensors. However, since fast contaminant detection and accurate source identification are closely related and often considered together in practical applications, a meaningful extension would be to find the optimal number and locations of sensors while simultaneously considering detection time, detection reliability, and accuracy of source identification. This is ongoing work.