Application of Convolution Neural Networks and Hydrological Images for the Estimation of Pollutant Loads in Ungauged Watersheds

Abstract: River monitoring and predictive analysis for pollutant load management require substantial budgets and human resources; however, the number of government officials in charge of these tasks is generally small. Although the government has commissioned river management studies to external experts, massive expenditures have been unavoidable because the characteristics of pollutant loads show various patterns according to the topography of each watershed, as in South Korea. To address this, previous studies have used conceptual and empirical models and, more recently, artificial neural network models. The conceptual model has the shortcoming of requiring massive amounts of data, as well as the burden of sensitivity and uncertainty analyses. The empirical model and the artificial neural network (ANN) need less data than the conceptual model; however, these models have the flaw that they cannot reflect topographical characteristics. To this end, this study used a convolutional neural network (CNN), one of the deep learning algorithms, to reflect topographical characteristics and estimated the pollutant loads of ungauged watersheds. The estimation results for the biochemical oxygen demand (BOD) and total phosphorus (TP) loads were excellent for all three ungauged watersheds. However, predictions with low accuracy were obtained when the hydrological images of a watershed with a land cover status different from that of the ungauged watersheds were used as training data for the CNN model.


Introduction
The control of pollution loads discharged from watersheds plays a crucial role in preserving healthy water resources. The case of South Korea is globally rather unusual in that approximately 50% of the population (approximately 25 million people) uses a single water resource known as Paldang Lake; therefore, the South Korean government imposes strict control measures on pollution loads to maintain the health of Paldang Lake and ensure safety from a national security perspective. Until the end of the 1990s, the Korean government adopted a qualitative management method for its water quality management policy, in which a certain level of standard concentrations was set for management. With this method, however, the water quality continued to degrade, requiring a paradigm shift in water quality management. Accordingly, the governmental water quality management policy was transformed into a quantitative management method, which has been applied to date in the planning and administrative implementation of pollution load control. In establishing a plan for quantitative water quality management, the most important factor is the prediction of the pollution load discharged from watersheds. As the pollution load is affected by conditions such as the land cover in the watershed, the physical characteristics of the soil, and hydrological phenomena, the discharge characteristics of the pollution load show different patterns for each watershed. In this regard, when actual measurements for pollution load management research or planning are difficult due to physical limitations, using a model capable of estimating the pollution load is common practice.
Two types of models are employed for pollution load estimation, i.e., conceptual and empirical models. The conceptual model uses factors, such as variations in human activities or in the natural conditions that generate pollution loads, as parameters for empirical formulas or estimation equations based on physical laws, which enables the description of the causal relationship between the input and resultant values. Conceptual models include the agricultural non-point source pollution model [1], the soil and water assessment tool [2], the storm water management model [3], and the areal non-point source watershed environment response simulation [4]. However, despite these distinct advantages, the conceptual model requires a large amount of data on human activities, as well as meteorological and spatial data, and has the further disadvantage of requiring post-processing, such as testing and calibration to close the gap between measured and simulated data, as well as sensitivity and uncertainty analyses. In contrast, the empirical model is characterized by no dependence on physical laws and is developed based on a stable relationship between the input and output data [5]. The selection and use of an appropriate statistical analysis method, with associated interpretation skills, enable an accurate analysis of the research results, which leads to an efficient and valuable reference by reducing unnecessary trial and error and data loss across all stages of research [6]. The empirical model has the advantage of relatively simple development and operation compared with the conceptual model but is unable to describe the causal relationship between the input and output data; it only describes the size of the factors expected to affect the result.
In particular, the artificial neural network (ANN), which is categorized as an empirical model, has gained significant attention by producing excellent research results, with mainly positive assessments of its applicability. Some studies that prefer conceptual models argue that the dynamic relationship between the input and output data cannot be explained by empirical models [7]. However, ANNs offer high accuracy and fast prediction, particularly when applied to flood prediction [8,9]. ANNs have been employed to predict air pollutants, such as fine dust, NO3, and O3, with good reported applicability [10]. ANNs have also been used to simulate total nitrogen and total phosphorus in wetlands, yielding good Nash-Sutcliffe efficiency [11], and have been applied to estimate suspended solids (SS) and biochemical oxygen demand (BOD) concentrations in the effluent of a wastewater treatment plant [12]. Beyond the studies mentioned above, a number of previous studies have reported excellent results using ANNs, which are sufficiently capable of resolving nonlinear relationships in data [13][14][15][16]. In addition, some studies have shown that the detection of physical processes in a hydrological model is possible with ANNs, reducing skeptical views on the use of empirical models [17][18][19].
As described above, when determining the suitability of a model for establishing watershed management plans based on the strengths and weaknesses of each type of model, operating a simpler and more intuitive model is more efficient than handling complex models for local governments or related fields that lack experts; one can thus argue that the use of empirical models is more appropriate than that of conceptual models. In particular, as is the case in South Korea, there are a number of situations with a substantial shortage of manpower, with only three or four public officials in charge of pollution load management in each local government. Furthermore, despite the numerous sites that require actual measurements, there are many ungauged watersheds due to physical limitations on available budgets and human resources. Thus, there is an increasing consensus on the suitability of empirical models.
Therefore, in this study, we adopted an ANN from the empirical model category to efficiently support pollution load management and to alleviate the difficulty and inconvenience of collecting pollution load data from ungauged watersheds. Unlike the methodologies used in previous studies, a deep neural network (DNN) with a multi-layer architecture was used. In particular, this study aims to estimate the pollution loads of ungauged watersheds by introducing the convolutional neural network (CNN), a DNN algorithm, and using hydrological images as input data for the CNN, thereby analyzing the applicability and model performance for pollution load estimation.

Study Area
For the pollution load estimation of ungauged watersheds, three watersheds were set as target areas in this study. As shown in Figure 1, the three study areas are JJ, HC, and BH, three of the Paldang watersheds located in the Seoul metropolitan area. The Paldang watershed encompasses the seven municipalities surrounding Paldang Lake, which was artificially created through the construction of Paldang Dam. Paldang Lake is located approximately 27.3 km from Seoul, the capital of South Korea. The Namhan River, Bukhan River, and Gyeongan Stream flow into the lake, which has a storage capacity of 244 million m³, an effective storage capacity of 18 million m³, a watershed area of 23,800 km², and a retention time of 5.4 days. As described in the introduction, Paldang Lake is a water resource used by approximately 50% of the Korean population, and its water quality is under special management by the national government of the Republic of Korea to ensure the health and security of the lake. For the national target of Paldang Lake, the water quality standard is set to 1.0 ppm biochemical oxygen demand (BOD), and a water resource protection zone, a Special Countermeasure Area, and waterfront areas (such as a riparian buffer zone) have been established to enforce highly strict regulations for the site. Furthermore, the Korean national government implemented a total maximum daily load (TMDL) system [20] in which BOD and total phosphorus (TP) were set as water quality management indicators in the Paldang watershed. Therefore, while planned development activities are allowed at the site, owing to these regulations there has been no local development, leading to little variation in the land cover of the area.
As shown in Figure 1a-c, the three study areas are mostly composed of forests and agricultural areas. JJ is a watershed that contains the Jojongcheon, with a main stream length of 39 km.

For the data required in this study, spatial data (i.e., land cover and soil maps), two hydrologic datasets (i.e., precipitation and discharge), and two water quality datasets (i.e., BOD and TP) were used for simplicity and convenience. This study used only BOD and TP for the water quality data because they are the water quality management indicators in the TMDL.
All the data used in this study were obtained from the South Korean government; the land cover maps were obtained from the Environmental Geographic Information Service [21], the soil map from the Water Resources Management Information System [22], the precipitation data from the Korea Meteorological Administration [23], and the discharge, BOD, and TP data from the Water Environment Information System [24]. The data collection period was from 1 January 2010 to 31 December 2019; the precipitation data were collected daily (mm·day⁻¹) from 34 precipitation gauges located around the study area, and the discharge data were daily average data (m³·s⁻¹·day⁻¹) collected from discharge gauges located at the terminus of each watershed. Water quality data were daily average data (mg·L⁻¹·day⁻¹) measured at the same points as the discharge gauges, and the land cover and soil maps were in grid form with a resolution of 30 × 30 m and a scale of 1:5000.

Data during the Study Period
The number of rainfall events during the study period was investigated: 554 events for JJ, 571 for HC, and 566 for BH. The annual average precipitation of the JJ, HC, and BH watersheds are 1104.

The variations in the daily average BOD concentration and daily average TP concentration in the JJ, HC, and BH watersheds, corresponding to the same dates as in Figure 2, are shown in Figures 3 and 4, respectively. The daily average BOD concentration in the JJ watershed ranged from 0.1 to 3.8 mg·L⁻¹·day⁻¹; the BOD concentration tended to increase with increasing discharge (Figure 3a). The daily average BOD concentrations in the HC and BH watersheds ranged from 0.3 to 5.7 mg·L⁻¹·day⁻¹ and from 0.4 to 16.7 mg·L⁻¹·day⁻¹, respectively (Figure 3b,c). The BOD concentrations in the HC and BH watersheds also varied consistently with increases or decreases in the discharge. Among the three study areas, the daily average BOD concentration in the BH watershed was higher than in the other two because, as listed in Table 1, the proportion of agricultural land use in the BH watershed is significantly higher, at 55.5%, than in the other two study areas (JJ: 6.9%; HC: 12.8%).
The daily average TP concentrations in the JJ and HC watersheds ranged from 0.002 to 0.384 mg·L⁻¹·day⁻¹ and from 0.003 to 0.378 mg·L⁻¹·day⁻¹, respectively (Figure 4a,b). The daily average TP concentration in the BH watershed ranged from 0.159 to 1.865 mg·L⁻¹·day⁻¹, which was higher than that of the other two study areas (Figure 4c). As with the BOD concentration, these results are likely because the agricultural land area in the BH watershed is larger than that in the other two study areas.

Figure 5 shows the relationships between the discharge and pollutant loads. The daily BOD load corresponding to the rainfall event dates ranged from 12.4 to 104,195.6 kg/day for the JJ watershed and from 8.6 to 297,054.1 kg/day for the HC watershed. The daily BOD load in the BH watershed ranged from 77.76 to 424,436.2 kg/day, which was higher than the ranges in the JJ and HC watersheds. The daily TP load in the JJ watershed ranged from 0.3 to 10,529.2 kg/day, and that in the HC watershed ranged from 0.1 to 19,699.4 kg/day. The daily TP load in the BH watershed was 12.4 to 48,277.0 kg/day; similar to the daily BOD load, the TP load in the BH watershed was more than two-fold higher than that of the other study areas.
This is also thought to be due to the influence of TP concentrations that were higher than those in the other study areas. A t-test was performed to investigate the relationship between the BOD and TP loads and the discharge in the JJ, HC, and BH watersheds; the p-values for all the study areas were not more than 0.001, indicating that the correlations between the BOD and TP loads and the discharge in all of the study areas were highly significant.
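The significance check described above can be sketched as follows. The discharge and load series are synthetic stand-ins for the observed data, and `scipy.stats.pearsonr` is used because its reported p-value is based on the same t-distribution test of the correlation coefficient.

```python
import numpy as np
from scipy.stats import pearsonr

# Synthetic stand-ins for the observed discharge (m3/s) and pollutant
# load (kg/day) series; the load is made roughly proportional to the
# discharge, as observed in the study areas.
rng = np.random.default_rng(0)
discharge = rng.uniform(0.1, 100.0, 200)
load = 86.4 * discharge * rng.uniform(0.8, 1.2, 200)

# pearsonr returns the correlation coefficient and a t-test-based p-value.
r, p = pearsonr(discharge, load)
significant = p <= 0.001
```

With a strong load-discharge relationship like the one in the study areas, the p-value falls far below the 0.001 threshold.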

Research Method
This study was conducted in two steps. As the CNN model was used for the estimation of the pollution load, the first step was to build an input dataset suitable for training the CNN model and to train the CNN model using the developed dataset. The second step was to estimate the BOD and TP loads of the ungauged watersheds.

1. Building a Hydrological Image as a Feature for CNN Model Training

The CNN model is mainly used for classification problems, and RGB (red, green, and blue) photographs are generally used to train CNN models. However, in this study, the hypothesis of using an empirical model cannot be established if ordinary RGB photos are used to train the CNN model. In other words, the pollution load responds to major variables, such as the hydrological conditions and land cover; however, if RGB photos are used, the empirical relationship between the independent and dependent variables is not considered. Even if the prediction of the pollution load were highly accurate with the use of RGB photos, the relationship between the features in the training data and the target of prediction would have no significance.
Therefore, in this study, to satisfy the causal relationship in which the pollution loads are determined by variables such as the hydrological conditions and land cover, the hydrological images proposed by Song [25], which carry hydrological attributes, were used as training data for the CNN model. Song [25] defined a hydrological image as a set of grids with dimensionless hydrological attributes in a two-dimensional space for an arbitrary watershed, where the hydrological attribute of each grid is calculated using the hydrological curve number (CN) [26] published by the Soil Conservation Service (SCS). The CN has also been used in a number of recent studies for similar purposes. The CN is a dimensionless integer ranging from 1 to 100, derived from conditions such as the precipitation, soil characteristics, and land cover, and is used for the simulation of direct runoff or the evaluation of hydrological effects [27,28].
As described above, each grid of a hydrological image represents a hydrological attribute, defined as Q in Equations (1) and (2); this quantity is calculated in the same manner as the direct runoff proposed by the SCS. The S parameter in Equation (3) is a parameter of Equation (2), representing the potential watershed storage determined by the CN. The SCS states that the CN should be assigned according to the four types of physical properties of soil, i.e., A, B, C, and D from the hydrologic soil group (HSG), and the land cover, and provides CN values. However, as the topographic conditions in South Korea differ too much for direct use of the CN values provided by the SCS, this study used CN values established by the Design Flood Estimation Techniques of the Ministry of Land, Transport and Maritime Affairs [29], South Korea. The Q and S parameters are calculated as

Q = (P − 0.2S)² / (P + 0.8S), for P ≥ 0.2S (1)

Q = 0, for P < 0.2S (2)

S = (25,400 / CN) − 254 (3)

The other parameter in Equations (1) and (2) is the precipitation (P); the Thiessen polygon [30] has traditionally been used to estimate precipitation. However, discontinuity in the numerical variation occurs at the polygonal boundaries when using this method, causing distortion of, or discrepancy with, the actual precipitation. Therefore, in this study, using inverse distance weighting (IDW), the precipitation data were expressed as two-dimensional data in a grid format with a resolution of 30 × 30 m, similar to the soil and land cover maps. IDW is an interpolation method based on Tobler's law [31] that can estimate continuous precipitation variations in grid units, which compensates for the discontinuity problem of Thiessen polygons.
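The SCS-CN runoff relation of Equations (1)-(3) can be sketched in Python as follows; the 0.2S initial abstraction and the SI (millimetre) form of S are the standard SCS conventions.

```python
def scs_direct_runoff(p_mm: float, cn: float) -> float:
    """Direct runoff Q (mm) from precipitation P (mm) and curve number CN,
    following Equations (1)-(3): runoff is zero until P exceeds the
    initial abstraction 0.2*S."""
    s = 25400.0 / cn - 254.0        # potential watershed storage S (mm), Eq. (3)
    if p_mm < 0.2 * s:              # below the initial abstraction, Eq. (2)
        return 0.0
    return (p_mm - 0.2 * s) ** 2 / (p_mm + 0.8 * s)   # Eq. (1)
```

A hydrological image is then this value evaluated per 30 × 30 m grid cell, with P interpolated by IDW and CN taken from the soil and land cover maps.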
The SCS recommends adjusting the CN in Equation (3) to consider the effect of antecedent precipitation, which accounts for the characteristics of direct runoff caused by prior rainfall. The SCS therefore proposed adjustments to the CN, as in Equations (4) and (5), by classifying the antecedent moisture condition (AMC), a concept based on the five-day antecedent cumulative precipitation, into AMC I, AMC II, and AMC III, which indicate dry, normal, and wet conditions, respectively. These modified CNs can be expressed as

CN(I) = 4.2 CN(II) / (10 − 0.058 CN(II)) (4)

CN(III) = 23 CN(II) / (10 + 0.13 CN(II)) (5)

As shown in Equations (1) and (2), only hydrological images composed of fixed CNs are generated under non-precipitation conditions. Hydrological images composed of fixed CNs were excluded from this study because such images cannot serve as a variable; only the hydrological images for days with precipitation were collected and used for CNN model training. The hydrological images were built as two-dimensional images with a 30 × 30 m resolution, identical to the land cover and soil maps, and were created as TIFF files. The TIFF file format was used because when a hydrological image is created in other file formats, such as BMP, JPEG, or PNG, the hydrological attribute values of the grids do not follow the results calculated by Equations (1)-(3) but are recalculated into color values from 0 to 255, resulting in information loss. The hydrological images were built using Python 3.7 [32], an open-source language, through the process illustrated in Figure 6.
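The AMC adjustment of Equations (4) and (5) can be sketched as follows. The 12.7 mm and 27.9 mm thresholds on the five-day antecedent precipitation are the commonly tabulated dormant-season limits and are an assumption here, as the text does not list them.

```python
def adjust_cn_for_amc(cn2: float, antecedent_5day_mm: float) -> float:
    """Adjust an AMC II curve number for dry (AMC I) or wet (AMC III)
    antecedent moisture conditions, per Equations (4) and (5).
    Thresholds are assumed dormant-season values (mm)."""
    if antecedent_5day_mm < 12.7:          # AMC I: dry
        return 4.2 * cn2 / (10.0 - 0.058 * cn2)      # Eq. (4)
    if antecedent_5day_mm > 27.9:          # AMC III: wet
        return 23.0 * cn2 / (10.0 + 0.13 * cn2)      # Eq. (5)
    return cn2                             # AMC II: normal
```

The dry adjustment lowers the CN (less runoff from dry soil) and the wet adjustment raises it, which is the behavior the hydrological images inherit from antecedent rainfall.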

2. Building Target Data

The target data are the results corresponding to the given input data and were used to train the model within the neural network. In this study, supervised learning was used as the model training method, which continuously reduces the error between the predicted data and the target data by updating the weights of the neural network over training iterations. The target data were actual observed data representing the BOD and TP loads, corresponding to the same dates as the hydrological images. They consisted of the BOD and TP loads of the JJ, HC, and BH watersheds and were developed in a CSV file format for recognition by the CNN model. The pollution load was calculated using the discharge and water quality data.
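The load calculation from discharge and concentration is not spelled out in the text; the conventional unit conversion, assumed here, multiplies the daily average discharge (m³·s⁻¹) by the concentration (mg·L⁻¹) and the factor 86.4 to obtain kg/day.

```python
def pollutant_load_kg_per_day(discharge_m3s: float, conc_mg_per_l: float) -> float:
    """Daily pollutant load (kg/day): 1 mg/L equals 1 g/m3, and a day has
    86,400 s, so load = Q * C * 86,400 / 1000 = Q * C * 86.4."""
    return discharge_m3s * conc_mg_per_l * 86.4
```

For example, a daily average discharge of 10 m³·s⁻¹ at a BOD concentration of 2 mg·L⁻¹ corresponds to a load of 1728 kg/day.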

3. Establishing the Dataset for CNN Model Training and Testing

Datasets are required to train CNN models; in this study, the datasets were built using the hydrological images and target data described above. In general, datasets in the CNN field are divided into input and test datasets. The input datasets are further divided into training and validation datasets, while the test dataset is used to verify the prediction performance of the CNN model. The CNN model in this study was designed for six cases, where cases 1-3 were models for the simulation of the BOD load and cases 4-6 were models for the simulation of the TP load. As listed in Table 2, two watershed datasets were used as the input datasets for CNN model training, and the remaining watershed dataset was used as the test dataset, resulting in the final settings for the pollution load estimation of the ungauged watersheds.

CNN Model Structure
The CNN model mimics the visual processing of biological organisms with visual organs and was first proposed by LeCun et al. [33]. The CNN model produces highly reliable results and has superior efficiency, as well as a reputation as a core technology for research and applications in rapid image classification. Recently, the model has been applied not only to image classification, but also to speech recognition and semantic image segmentation [34].
The structure of the CNN model can be largely divided into two parts. The first part appears at the front of the CNN architecture and consists of repetitions of convolution layers and pooling layers, while the second part appears at the back of the model and consists of a flatten layer, serving as a connection between the first and second parts, followed by dense layers and an output layer. The first part is also known as the convolution layer (CL) part, which manages the core functions of the CNN: while maintaining the structure of the I/O data for each layer and the spatial information of the images, it effectively recognizes features in adjacent image regions. The second part is known as the fully connected layer (FC) part, a deep neural network (DNN) composed of multiple dense layers between the flatten layer and the output layer, which derives the simulation results [35,36].
In this study, the architecture of the CNN model was designed as five alternating repetitions of two convolution layers and one pooling layer for the CL part, as shown in Figure 7. The CNN uses a two-dimensional image filter known as a kernel, which is located in each convolution layer and serves as a parameter to identify features of the image. The kernels are square matrices of parameters, and their size and number can be adjusted arbitrarily according to the user's purpose. In this study, the size of all kernels was set to 3 × 3, and 32, 32, 64, 64, 128, 128, 256, 256, 512, and 512 kernels were allocated, in order, to each convolution layer, as shown in Figure 7. The kernel scans the image and extracts features while moving from left to right and top to bottom from the top of the image; the movement spacing of the kernel is referred to as the stride. In this study, the stride of the first to third convolution layers was set to 2, and the stride of all other convolution layers was set to 1. The pooling layer is a sub-sampling technique that retains only the important information while reducing the size of the data received from the convolution layer. In addition, each convolution layer is assigned an activation function, which was set to the rectified linear unit (ReLU) function in this study. The ReLU function can prevent the vanishing gradient problem, in which the gradient value disappears, and has been used in several prior studies reporting high model optimization efficiency [37,38]. The pooling layer, with the max or average pooling method, can efficiently utilize computer memory and reduce the computation, which helps prevent overfitting. As average pooling causes information loss, the max pooling method was used in this study. The pooling size can be set arbitrarily, as is the case for the kernels; here, the pooling size was set to 3 × 3, identical to the kernel size.
The difference between the CNN model used in this study and those of previous studies lies in the FC part. CNN models typically employ a softmax or sigmoid function as the activation function in the last dense layer of the FC because they are designed to solve classification problems. However, this is not suitable for the simulation of unspecified continuous variables, as is the case in this study. Unlike previous studies, a linear function, frequently used in regression models, was applied to the last dense layer of the FC in this study. For all environments of the CNN model design, implementation, and operation, Keras [39], a machine learning library released by Google, was used. Keras is a Python-based high-level deep neural network application programming interface (DNN API) with a Tensorflow [40] backend, which supports all special functions of Tensorflow and is highly flexible in implementing DNN models. Keras facilitates rapid experiments with the efficient use of the CPU and GPU.
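The architecture described above can be sketched in Keras as follows. The single input channel and the "same" padding on the convolution and pooling layers are assumptions needed to keep five pooling stages valid for a 1024 × 1024 input, as the text does not state them; filter counts, kernel and pooling sizes, strides, and the FC head follow the description.

```python
from tensorflow.keras import layers, models

def build_cnn(input_shape=(1024, 1024, 1)):
    """Sketch of the CL part (five blocks of two 3x3 convolutions plus one
    3x3 max pooling) and the FC part (three 1024-node dense layers, batch
    normalization, and a single linear output node for load regression)."""
    filters = [32, 32, 64, 64, 128, 128, 256, 256, 512, 512]
    model = models.Sequential([layers.Input(shape=input_shape)])
    for i, f in enumerate(filters):
        stride = 2 if i < 3 else 1      # stride 2 for the first three conv layers
        model.add(layers.Conv2D(f, (3, 3), strides=stride,
                                padding="same", activation="relu"))
        if i % 2 == 1:                  # pool after every second conv layer
            model.add(layers.MaxPooling2D(pool_size=(3, 3), padding="same"))
    model.add(layers.Flatten())
    for _ in range(3):
        model.add(layers.Dense(1024, activation="relu"))
    model.add(layers.BatchNormalization())
    model.add(layers.Dense(1, activation="linear"))  # regression output
    return model
```

The linear final node replaces the softmax/sigmoid of a classification CNN, which is the key departure from previous studies noted above.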

Hyper-Parameter Setting of the CNN Model
In the field of artificial neural networks, there are various types of hyper-parameters. Depending on how the hyper-parameters are set, the model prediction results can either show excellent performance or fall short of the expected performance, indicating the importance of hyper-parameter implementation. Hyper-parameters are not the main parameters subject to tuning or optimization; rather, they are set by the user with a priori knowledge or automatically set through the mechanism of an external model. For the CNN model in this study, the hyper-parameters were set directly, without using an external model. They included the LOSS function, the optimizer for model optimization, the learning rate, the number of layers and nodes of the FC, and the epoch count, which is the number of model training iterations.
For the learning method of the CNN model, supervised learning was used, in which the model is optimized by minimizing the error, i.e., the difference between the predicted value derived during training and the target value. To reduce the error, the values of the weights within the neural network are continuously updated. The error is referred to as the LOSS in neural networks, and the optimization of the model using supervised learning can be described as minimizing this LOSS. Functions such as binary crossentropy, categorical crossentropy, and sparse categorical crossentropy have mainly been used to calculate the LOSS of CNN models. However, these functions are suitable for classification problems and are not adequate for models predicting continuous variables, as is the case in this study. Therefore, in this study, the mean square error (MSE), which is mainly used in regression analysis, was used, as shown in Equation (6); through this, the optimization of the model was determined. The predictive performance of the model must also be evaluated during model training; as a metric for this evaluation, the mean absolute error (MAE) was used, as shown in Equation (7):

MSE = (1/n) Σᵢ (yᵢ − ŷᵢ)² (6)

MAE = (1/n) Σᵢ |yᵢ − ŷᵢ| (7)

where yᵢ and ŷᵢ denote the observed and predicted values, respectively; for both the MSE and MAE, values converging to zero indicate higher model performance. Even if the result of the LOSS calculation is good or excellent, it is a result obtained only on the training dataset; thus, it is necessary to perform validation on data that the model has not been trained with (or unobserved data). The dataset used for this purpose was the validation dataset mentioned above, and the functions in Equations (6) and (7) were applied to it in the same manner as the LOSS of the training dataset.
The names were separated to distinguish between the results of the training dataset and those of the validation dataset: the results of the training dataset were denoted LOSS and MAE, and the results of the validation dataset were denoted validation LOSS and validation MAE. For the model optimization algorithm, the stochastic gradient descent (SGD) [41] method is regarded as the most basic methodology. To improve the optimization speed and accuracy by addressing the shortcomings of SGD, previous studies have developed Momentum [42], Nesterov accelerated gradient [43], Adagrad [44], Nadam [45], AdaDelta [46], RMSProp [47], and Adam [48]. In this study, Adamax [48], an algorithm proposed as an extension of Adam [48], was used. Adamax is based on Adam's update rule and exhibits excellent performance in CNN models; it outperforms other algorithms under noisy conditions and converges quickly when the data variation is large. Hence, Adamax was an appropriate selection as the optimization algorithm for the CNN model in this study, because the characteristics of the pollution loads to be simulated vary significantly depending on the hydrological phenomena, with substantial variations in the parameters and noise due to natural conditions. In addition, the learning rate was set to 0.0001 and the number of epochs to 1000. The FC part was designed with an architecture consisting of three dense layers and one batch normalization layer to avoid overfitting, followed by one dense layer. The number of nodes in each of the three dense layers was 1024, and the number of nodes in the last dense layer was one. In this study, CNN model construction and training were performed on a desktop with an Intel® Core™ i9-9900K CPU @ 3.60 GHz, 32 GB RAM, and an NVIDIA® GeForce RTX™ 2080 Ti GPU running a Windows OS.
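The training configuration can be sketched as follows. A toy network and random data stand in for the full CNN and the hydrological-image dataset so that the snippet runs quickly, and the two epochs here stand in for the 1000 used in the study.

```python
import numpy as np
from tensorflow.keras import layers, models
from tensorflow.keras.optimizers import Adamax

# Toy regression model standing in for the CNN described above.
model = models.Sequential([
    layers.Input(shape=(8,)),
    layers.Dense(16, activation="relu"),
    layers.Dense(1, activation="linear"),
])

# Adamax optimizer, learning rate 0.0001, MSE as the LOSS, MAE as the metric.
model.compile(optimizer=Adamax(learning_rate=1e-4), loss="mse", metrics=["mae"])

x = np.random.rand(64, 8).astype("float32")
y = np.random.rand(64, 1).astype("float32")

# validation_split=0.3 mirrors the 7:3 training/validation division;
# Keras then reports LOSS/MAE and validation LOSS/validation MAE per epoch.
history = model.fit(x, y, validation_split=0.3, epochs=2, verbose=0)
```

The `history.history` dictionary holds the per-epoch LOSS, MAE, validation LOSS, and validation MAE series used to monitor training.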

Evaluation of the CNN Model
For the evaluation of the BOD and TP load predictions of the ungauged watersheds with the CNN model, the performance of the CNN model was assessed by comparing the predicted and observed values. As listed in Table 2, the models for evaluation were the six models of cases 1-6: cases 1-3 were used to estimate the BOD load, and cases 4-6 were used to estimate the TP load. For the evaluation of the CNN model, four methods were used: the Pearson correlation coefficient (r), Nash-Sutcliffe efficiency (NSE), root-mean-square error (RMSE), and mean absolute percent error (MAPE).
The Pearson correlation coefficient (r) examines the linearity between the observed and predicted values in the range from −1 to 1; values of r closer to 1 indicate stronger linearity between the predicted and observed values. The Pearson correlation coefficient is calculated as

r = Σᵢ (yᵢ − ȳ)(ŷᵢ − ŷ̄) / √[Σᵢ (yᵢ − ȳ)² Σᵢ (ŷᵢ − ŷ̄)²] (8)

The NSE has a maximum value of 1; if the NSE is 1, the model result is in perfect agreement with the observed values. The NSE is calculated as

NSE = 1 − Σᵢ (yᵢ − ŷᵢ)² / Σᵢ (yᵢ − ȳ)² (9)

The RMSE is a quantitative metric showing whether the predicted value of the model is close to the observed value; its range extends from 0 to infinity within the range of the data. As the RMSE follows the unit of the target to be simulated, the unit of the RMSE in this study is kg/day, the unit of the pollution load. No definitive evaluation criteria exist for the RMSE itself, but values closer to 0 indicate model results closer to the observed values. The RMSE is calculated as

RMSE = √[(1/n) Σᵢ (yᵢ − ŷᵢ)²] (10)

The MAPE is presented as a percentage. The key point of the MAPE is that it expresses the size of the error as a percentage rather than as a raw quantity, as is the case for the RMSE. As the MAPE is a percentage, typically interpreted on a scale from 0 to 100, its results can be easily interpreted; values closer to 0 indicate better model performance, although there is no definitive evaluation criterion. The MAPE is calculated as

MAPE = (100/n) Σᵢ |(yᵢ − ŷᵢ) / yᵢ| (11)

where yᵢ and ŷᵢ are the observed and predicted values, respectively, and ȳ and ŷ̄ represent the averages of the observed and predicted values, respectively.
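These four metrics can be computed directly in NumPy; the function names are illustrative, and the observed values passed to `mape` must be non-zero.

```python
import numpy as np

def r_pearson(obs, pred):
    """Pearson correlation coefficient between observed and predicted values."""
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    return np.corrcoef(obs, pred)[0, 1]

def nse(obs, pred):
    """Nash-Sutcliffe efficiency; 1 means perfect agreement."""
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    return 1.0 - np.sum((obs - pred) ** 2) / np.sum((obs - obs.mean()) ** 2)

def rmse(obs, pred):
    """Root-mean-square error, in the units of the target (kg/day here)."""
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    return np.sqrt(np.mean((obs - pred) ** 2))

def mape(obs, pred):
    """Mean absolute percent error; observed values must be non-zero."""
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    return 100.0 * np.mean(np.abs((obs - pred) / obs))
```

For a perfect prediction, r and NSE equal 1 while RMSE and MAPE equal 0, matching the interpretation given above.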

Hydrological Image Generation and Image Datasets Construction Results
As described in Section 2.3.1 for the method of generating hydrological images, the number of hydrological images equals the number of rainfall events during the study period: 554 images were created for the JJ watershed, 571 for the HC watershed, and 566 for the BH watershed. The sizes of the hydrological images, in a grid format with a resolution of 30 × 30 m, were 670 × 819 for the JJ watershed, 1014 × 632 for the HC watershed, and 513 × 872 for the BH watershed. Figure 8 shows part of the generated hydrological images. As shown in Figure 8, each grid cell holds a hydrological attribute value, and each value is rendered as a color code, so that the set of cells takes the shape of an image (Figure 8a).
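To illustrate the structure of such an image (this is not the study's actual generation code), a hydrological image can be represented as a 2-D array on the watershed grid, with one attribute value per 30 m cell; the grid shape follows the JJ watershed, the orientation of rows versus columns is assumed, and the rainfall values and mask are synthetic stand-ins.

```python
import numpy as np

# One hydrological image for a single rainfall event on a JJ-like
# grid (670 x 819 cells at 30 m resolution, orientation assumed).
rows, cols = 819, 670
rng = np.random.default_rng(0)
event_rainfall = rng.uniform(0.0, 50.0, size=(rows, cols))  # mm, synthetic
inside = np.ones((rows, cols), dtype=bool)  # watershed-boundary mask
# Cells outside the watershed boundary carry no attribute value
hydro_image = np.where(inside, event_rainfall, 0.0)
```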
The CNN model requires images of the same size during training. However, as shown in Figure 8a, the images of the three study areas had different sizes and were therefore not suitable for CNN model training. Hence, the images had to be resized to an identical size across watersheds. In this study, as shown in Figure 8b, the images were resized into square images on a 1024 × 1024 grid, i.e., the largest side length among the hydrological images of the three watersheds. For the resizing, linear interpolation with Pillow [49], an imaging module for Python, was used.
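The resizing step can be sketched with Pillow's bilinear resize; the input array below is a synthetic stand-in for a hydrological image, and the helper name is ours.

```python
import numpy as np
from PIL import Image

def resize_to_square(grid, size=1024):
    # Resize a 2-D attribute grid to size x size using bilinear
    # (linear) interpolation, as done with Pillow in this study.
    img = Image.fromarray(grid.astype(np.float32), mode="F")
    return np.asarray(img.resize((size, size), Image.BILINEAR))

# e.g., a JJ-like grid resized to the common 1024 x 1024 shape
resized = resize_to_square(np.random.rand(819, 670).astype(np.float32))
```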
For training and testing of the CNN models, the input and test datasets were organized as listed in Table 3 using the hydrological images and target data. There is no established criterion for the ratio used to divide training and validation datasets in a CNN model; however, previous studies have used ratios of 7:3 or 8:2 between the training and validation datasets [50][51][52]. Accordingly, in this study, the datasets for training and validation of the model were divided at a ratio of 7:3, as listed in Table 3. The test datasets were composed only of data for the watershed to be simulated, to fulfill the purpose of predicting pollution loads in ungauged watersheds, and were not included in the input datasets. In this manner, the numbers of training and validation datasets for the input datasets in cases 1 and 4 were 796 and 341, respectively, and the number of test datasets was 554. The numbers of training and validation datasets in cases 2 and 5 were 784 and 336, respectively, and the number of test datasets was 571. For cases 3 and 6, the numbers of training, validation, and test datasets were 788, 337, and 566, respectively.
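The 7:3 split can be sketched as follows; the shuffling seed and ID naming are arbitrary, and the counts reproduce those of cases 1 and 4, where the HC (571) and BH (566) images form the input dataset and the JJ images serve as the test set.

```python
import random

def split_70_30(samples, seed=42):
    # Shuffle indices reproducibly, then cut at 70% for training;
    # the remaining 30% becomes the validation set.
    idx = list(range(len(samples)))
    random.Random(seed).shuffle(idx)
    cut = round(len(idx) * 0.7)
    train = [samples[i] for i in idx[:cut]]
    val = [samples[i] for i in idx[cut:]]
    return train, val

# HC (571) + BH (566) images form the input dataset; JJ (554) is held
# out entirely as the test set for the "ungauged" watershed.
input_ids = [f"img_{i}" for i in range(571 + 566)]
train, val = split_70_30(input_ids)
```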

CNN Model Architecture and Training Results
The architecture of the CNN model in this study was composed of five alternating repetitions of two convolution layers and one pooling layer, followed by one flatten layer, three dense layers, and one batch normalization layer, and finally one dense output layer, as summarized in Table 4. In this model, the total number of parameters was 15,205,601, the number of trainable parameters was 15,203,553, and the number of non-trainable parameters, owing to the batch normalization layer, was 2,048.

Table 4. Summary of CNN model structure used in this study.
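The layer pattern can be sketched in Keras as follows. This is an architecture sketch only: the filter counts, kernel sizes, dense-layer widths, and activations are illustrative assumptions rather than the exact values of Table 4, so the parameter counts of this sketch will differ from those reported above.

```python
from tensorflow.keras import layers, models

def build_model(input_shape=(1024, 1024, 1)):
    # Five repetitions of (Conv2D, Conv2D, MaxPooling2D), then
    # Flatten, three Dense layers, BatchNormalization, and a final
    # single-unit Dense output for the pollutant load (kg/day).
    model = models.Sequential()
    model.add(layers.Input(shape=input_shape))
    filters = 16
    for _ in range(5):
        model.add(layers.Conv2D(filters, 3, padding="same", activation="relu"))
        model.add(layers.Conv2D(filters, 3, padding="same", activation="relu"))
        model.add(layers.MaxPooling2D(2))
        filters *= 2
    model.add(layers.Flatten())
    model.add(layers.Dense(256, activation="relu"))
    model.add(layers.Dense(128, activation="relu"))
    model.add(layers.Dense(64, activation="relu"))
    model.add(layers.BatchNormalization())
    model.add(layers.Dense(1))
    return model
```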

Figure 9 shows the training results of the CNN model. There is no established criterion for evaluating the training results of an artificial neural network model; however, in general, training is considered sufficient when, as the epochs progress, the curves of the training results (loss, validation loss, MAE, and validation MAE) decay toward 0 on the y-axis in a shape resembling exponential decay.

Examining Figure 9, the loss, validation loss, MAE, and validation MAE in all cases decreased sharply over the early epochs and then decreased gradually. As mentioned above, the loss and MAE indicate the result of model training on the training datasets, while the validation loss and validation MAE indicate the generalization of the model on the validation datasets. As shown in Figure 9, all of the curves oscillate with large amplitude, indicating an unstable training state; however, all values trend toward zero, indicating that the training and generalization of the model were performed sufficiently.

Prediction Results of the CNN Model and Model Evaluation
Using the trained CNN models for cases 1-6, the BOD and TP loads were predicted by treating each of the three study watersheds in turn as an ungauged watershed, which is the purpose of this study. The BOD load was predicted using the CNN models for cases 1-3, and the TP load was predicted using the CNN models for cases 4-6. Figure 10a shows the relationship between the predicted and observed BOD loads, while Figure 10b shows the relationship between the predicted and observed TP loads. Cases 1-3 represent the BOD loads of the JJ, HC, and BH watersheds, respectively (Figure 10a), and cases 4-6 represent the TP loads of the JJ, HC, and BH watersheds, respectively (Figure 10b). In all cases, the scatter plots of the predicted and observed values cluster around the 1:1 line, indicating a linear relationship and confirming that the predicted values agree with the observed values. In this study, r, NSE, RMSE, and MAPE were used to evaluate the performance of the CNN models; the results are listed in Table 5. In the evaluation using r, a model is rated weak for r in the range 0.1-0.3, moderate for 0.4-0.6, strong for 0.7-0.9, and perfect at 1 [53]. For the NSE, a model is rated moderate for NSE values of 0.6-0.8 and good for values above 0.8 [54]. In all cases, r was 0.9, a strong correlation, so all cases were rated strong; likewise, all NSE values were 0.9, so all cases were rated good. In the evaluation using the RMSE, the values for cases 1-3 (the BOD load simulations) were 1031.1, 1152.2, and 5376.5 kg·day−1, respectively, with the largest value for case 3, the BH watershed. For cases 4-6 (the TP load simulations), the values were 53.6, 169.5, and 476.5 kg·day−1, respectively, with the largest value for case 6, again the BH watershed.
In the evaluation using the MAPE, the values were 11.5, 15.9, and 20.1% for cases 1-3 and 17.9, 16.3, and 21.3% for cases 4-6, respectively; as with the RMSE, the MAPE was highest for the BH watershed. It is difficult to judge the models using the RMSE and MAPE because no specific criteria exist, but the closer the values are to 0, the better the model performance can be considered. In this regard, this study compared the CNN model performance with the results of previous studies using artificial neural networks. Azadi et al. [55] combined principal component analysis-M5P and an ANN to simulate the chemical oxygen demand (COD) load of a landfill; the Pearson correlation coefficient between the simulated and observed COD loads was 0.98 and the MAPE was 4%, and they reported the excellent performance of the model. Yu et al. [56] developed a kernel principal component analysis and extreme learning machine (KPCA-ELM) model for predicting the COD and BOD concentrations of wastewater treatment influent, obtaining MAPE values of 1.2% for COD and 1.3% for BOD. Yan et al. [57] used a back-propagation neural network (BPNN), a genetic algorithm combined with a BPNN (GA-BPNN), particle swarm optimization combined with a BPNN (PSO-BPNN), and a model combining PSO-BPNN and GA-BPNN to predict dissolved oxygen; the resulting MAPE values were 25.1%, 8.5%, 15.7%, and 6.7%, respectively, with the combined PSO-BPNN and GA-BPNN model performing best for predicting the dissolved oxygen concentration. In addition, Shao et al. [58] reported the considerably high performance of their model, with MAPE values ranging from 0.04% to 1.2%, after developing and applying a water quality prediction model for nine water quality items based on a back-propagation neural network optimized by the Cuckoo Search algorithm.
As these prior studies show, the application of artificial neural network technology to the simulation of contaminants has demonstrated excellent results. The results of cases 1, 2, 4, and 5 of this study were comparable to those of previous studies, but cases 3 and 6 showed relatively low model performance.
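The qualitative rating bands from [53] and [54] used above can be written directly as a pair of helper functions; how a value falling exactly on a band edge is classified is our assumption, as the cited sources list the bands without specifying boundary handling.

```python
def rate_r(r):
    # Rating bands for the Pearson correlation coefficient per [53]:
    # weak 0.1-0.3, moderate 0.4-0.6, strong 0.7-0.9, perfect at 1.
    if r == 1.0:
        return "perfect"
    if r >= 0.7:
        return "strong"
    if r >= 0.4:
        return "moderate"
    if r >= 0.1:
        return "weak"
    return "negligible"

def rate_nse(nse):
    # Rating bands for the NSE per [54]: moderate 0.6-0.8, good above 0.8.
    if nse > 0.8:
        return "good"
    if nse >= 0.6:
        return "moderate"
    return "unsatisfactory"
```

Applied to the results in Table 5, all six cases (r and NSE of 0.9) rate as "strong" and "good".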
The low performance for cases 3 and 6 results from input data not seen by the CNN model during the training process. The land cover of the JJ and HC watersheds is forest-dominant, with similar ratios of agricultural land area. As discussed in Section 3.1, the average discharge and the BOD and TP concentrations in the JJ and HC watersheds were at similar levels. However, the BH watershed is an agriculture-dominant region with higher discharge and higher BOD and TP concentrations than the other study areas. In other words, cases 1 and 4 simulated the BOD and TP loads in the JJ watershed; during the training of the CNN model, data for the HC watershed were also used, providing indirect experience of the JJ watershed and resulting in good prediction performance for the JJ watershed. Likewise, in cases 2 and 5, data for the JJ watershed were used during training, providing indirect experience of the HC watershed and resulting in good prediction performance for the BOD and TP loads in the HC watershed. In contrast, in cases 3 and 6, no training data similar to the BH watershed were available during the training of the CNN model, resulting in lower model performance than in cases 1, 2, 4, and 5; it is thought that better performance could be obtained if the CNN models were trained with data from watersheds similar to the BH watershed.
Consequently, the results of this study and the comparison with previous studies suggest that the CNN model using hydrological images can serve as a pollutant load prediction model. In particular, unlike preceding studies using ordinary ANN models, the CNN model in this study offers the improvement of simulating pollutant loads while reflecting two-dimensional spatial information. In addition, this study showed that convenient and simple modeling is possible with limited input data. This CNN model may therefore assist the government's policy decision-making and supplement insufficient administrative capacity for water management. Furthermore, the methodology proposed in this study could be applied in the remote sensing field, for example, to predict the pollution loads of rivers using satellite images.

Conclusions
In this study, hydrological images and CNN models were used to simulate the pollution loads of ungauged watersheds. The model performance of cases 1, 2, 4, and 5 was good, but cases 3 and 6 showed relatively low performance. This resulted from a lack of predictive power for data not used during CNN model training, which is considered a limitation of the CNN model when using hydrological images. Although this study cannot clearly demonstrate whether spatial feature extraction, the core function of the CNN model, was properly applied to the hydrological images in simulating the pollution load, training the CNN model with hydrological images whose features were similar to those of the simulation target yielded better model performance. In addition, the fact that continuous simulation is not possible using hydrological images and the CNN model is a limitation of this study. Through improvements to these methods, further research will aim to enable the continuous simulation of pollution loads even under non-precipitation conditions. Nevertheless, in simulating the pollution loads of ungauged watersheds affected by precipitation and land cover, the CNN model showed a good level of performance and demonstrated simplicity, i.e., it does not require the varied data, additional calibration and adjustment, or sensitivity analyses required by conceptual models. Furthermore, a CNN model with high accuracy should be obtainable if the model is trained using the methodology proposed in this study with a larger number and greater variety of hydrological images.