Prediction of Aquatic Ecosystem Health Indices through Machine Learning Models Using the WGAN-Based Data Augmentation Method

Abstract: Changes in hydrological characteristics and increases in various pollutant loadings due to rapid climate change and urbanization have a significant impact on the deterioration of aquatic ecosystem health (AEH). Therefore, it is important to effectively evaluate the AEH in advance and establish appropriate strategic plans. Recently, machine learning (ML) models have been widely used to solve hydrological and environmental problems in various fields. However, in general, collecting sufficient data for ML training is time-consuming and labor-intensive. Especially in classification problems, data imbalance can lead to erroneous prediction results of ML models. In this study, we proposed a method to solve the data imbalance problem through data augmentation based on the Wasserstein Generative Adversarial Network (WGAN) and to efficiently predict the grades (A to E) of AEH indices (i.e., Benthic Macroinvertebrate Index (BMI), Trophic Diatom Index (TDI), and Fish Assessment Index (FAI)) through ML models. Raw datasets for the AEH indices, composed of various physicochemical factors (i.e., WT, DO, BOD5, SS, TN, TP, and Flow) and AEH grades, were built and augmented through the WGAN. The performance of each ML model was evaluated through a 10-fold cross-validation (CV), and the performances of the ML models trained on the raw and WGAN-based training sets were compared and analyzed through AEH grade prediction on the test sets. The results showed that the ML models trained on the WGAN-based training set had an average F1-score of 0.9 or greater for the grades of each AEH index on the test set, which was superior to the models trained only on the raw training set, which contained fewer data.
Through the above results, it was confirmed that by using the dataset augmented through WGAN, the ML model can yield better AEH grade predictive performance compared to the model trained on limited datasets; this approach reduces the effort needed for actual data collection from rivers, which requires enormous time and cost. In the future, the results of this study can be used as basic data to construct the aquatic ecosystem big data needed to efficiently evaluate and predict AEH in rivers based on ML models. However, this study did not reflect the data characteristics of periods such as flood and drought, so there are limits to accurately predicting AEH under extreme conditions. In future research, it will be necessary to continuously train and validate the ML model by acquiring observed data from various rainfall events and supplementing the synthetic datasets of WGAN to accurately predict changes in AEH through the ML model. The ML approach proposed in this study could contribute not only to the achievement of AEH improvement goals through the introduction of structural and non-structural watershed management plans, but also to selecting priority areas for water quality and aquatic ecology management. In addition, the WGAN-based data augmentation method can help secure the large amount of data required for an integrated water environment management system that organically links water quantity, water quality, and the aquatic ecosystem.


Introduction
The recent rapid climate change and the increase in human activity due to urbanization and industrialization have caused sudden variability of river water quantity and quality [1][2][3]. In particular, the decrease in water quantity and deterioration of water quality in rivers have had a significant impact on the reduction in biodiversity and degradation of aquatic ecosystem health (AEH) [4,5]. Maintaining and restoring the AEH has a positive effect on the ecosystem services provided to humans and the creation of a sustainable habitat environment for aquatic organisms [6,7]. Therefore, it is important to establish a systematic ML model to effectively evaluate the grades of each AEH index by considering multiple physicochemical factors such as flow, water quality, and water temperature. In the future, the AEH grade results predicted from the ML model reflecting various physicochemical factors can be utilized in decision making for the selection of impaired rivers and the establishment of restoration strategies. To our knowledge, this is the first study to predict the grades of each AEH index through ML models and WGAN-based data augmentation, which is the main novelty of this study. The overall workflow of this study is shown in Figure 1.

Description of the Study Area
The Han River basin (35,770 km²) is located in the center of the Korean peninsula within the latitudes of 126°35′ to 128°46′ and the longitudes of 36°30′ to 38°56′ (Figure 2). The Han River has a main reach length of about 481.8 km from the source of the river to the estuary, with about 920 rivers flowing into the main stream. In the Han River basin, there are 232 water quality monitoring stations collecting flow and water quality data and 746 biomonitoring stations collecting aquatic ecosystem data. The Han River consists largely of the Bukhan River and the Namhan River, and these two main rivers flow into Paldang-dam, which supplies drinking and living water for the residents of the metropolitan Seoul area. In particular, the numerous tributaries flowing into these two main rivers are important in terms of river water quality and aquatic ecosystem management as habitats for various species. Recently, due to urbanization and industrialization, water pollution problems have been occurring frequently in the Han River due to the influence of non-point sources as well as point source discharge facilities such as wastewater treatment plants (WWTPs) [28]. In addition, the recent decrease in river flow due to frequent drought in spring is causing deterioration of river water quality [29]. These natural and anthropogenic influences can have a significant impact on aquatic ecosystems. Therefore, it is necessary to establish appropriate sustainable management plans for the protection of aquatic ecosystems through the efficient prediction of the AEH in the Han River in advance.



Data Collection
In Korea, AEH monitoring has been performed in most rivers nationwide. The grades of the AEH indices (BMI, TDI, and FAI) are calculated according to the health score evaluation criteria shown in Table 1. In this study, raw datasets for the AEH indices, required for training the ML models, were prepared using observed data collected from 76, 73, and 67 water quality monitoring and biomonitoring stations, respectively, from 2008 to 2018. Each raw dataset was composed of physicochemical factors (i.e., flow, dissolved oxygen (DO), five-day biochemical oxygen demand (BOD5), total nitrogen (TN), total phosphorus (TP), suspended solids (SS), and water temperature (WT)) and AEH grade data (A to E), which are the target data in the ML training process. The physicochemical factors were selected as factors that can have a relatively significant effect on the health of aquatic ecosystems by referring to previous research results [30,31]. However, in Korea, the AEH monitoring data, collected twice a year during spring and autumn, are insufficient compared with water quality data in terms of the number of datasets and the timing of data collection. Thus, the raw datasets were built by matching physicochemical factor data from the same monitoring date to each AEH monitoring date. General statistics such as mean, minimum, and maximum values for each input datum of the raw datasets are presented in Table 2. Table 3 shows the number of data by grade in the raw datasets for each AEH index. In particular, as shown in Figure 3, the number of samples of grades D or E is relatively small compared to the number of samples of grades A, B, and C for each AEH index. In such cases, ML models tend to be overwhelmed by large classes and ignore small classes [32]. Therefore, to improve the performance of the ML models for grade prediction of each AEH index, it is necessary to alleviate the imbalance of the dataset by augmenting the input data for each grade.

Table 1. Evaluation criteria for grades of each AEH index [11].
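The matching step described above can be sketched with a pandas join. This is a minimal illustration, not the study's actual data pipeline; the station identifiers, dates, and column names below are hypothetical.

```python
import pandas as pd

# Hypothetical water-quality records and AEH monitoring records; the
# column names and values are illustrative, not the study's schema.
wq = pd.DataFrame({
    "station": ["S1", "S1", "S2"],
    "date": pd.to_datetime(["2018-04-10", "2018-09-12", "2018-04-10"]),
    "BOD5": [2.1, 3.4, 1.2], "TN": [3.0, 4.1, 2.2], "TP": [0.08, 0.12, 0.05],
})
aeh = pd.DataFrame({
    "station": ["S1", "S2"],
    "date": pd.to_datetime(["2018-04-10", "2018-04-10"]),
    "BMI_grade": ["B", "A"],
})

# Keep only physicochemical records whose station and date match an AEH
# monitoring event, pairing each grade with same-day water-quality factors.
raw = wq.merge(aeh, on=["station", "date"], how="inner")
print(raw)
```

An inner join drops water-quality records with no same-day AEH observation, which mirrors why the raw datasets are small: AEH is sampled only twice a year.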

Wasserstein Generative Adversarial Network (WGAN)
The Generative Adversarial Network (GAN) is an algorithm that learns the data distribution itself through competitive learning of two neural networks, the generator G and the discriminator D [20]. The G is trained to generate data similar to the real data, whereas the D is trained to lower the probability of erroneously discriminating the data generated by G. However, the original GAN is sometimes difficult to train due to mode collapse and vanishing gradient problems [23]. Therefore, we used WGAN as a data augmentation method to increase the number of training samples. The WGAN uses the Wasserstein Distance (WD) in Equation (1) as the loss function instead of the Jensen-Shannon Divergence loss function used in the original GAN [24].
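Equation (1) is not reproduced in this excerpt; in the WGAN formulation [24], the Wasserstein Distance between the real distribution P_r and the generated distribution P_g is conventionally written as:

```latex
W(\mathbb{P}_r, \mathbb{P}_g) = \inf_{\gamma \in \Pi(\mathbb{P}_r, \mathbb{P}_g)} \mathbb{E}_{(x, y) \sim \gamma}\left[\lVert x - y \rVert\right] \quad (1)
```

where Π(P_r, P_g) denotes the set of all joint distributions γ(x, y) whose marginals are P_r and P_g, consistent with the description of the WD in the following paragraph.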
The WD is the infimum, over all joint probability distributions of x and y, of the expected value of the distance between x and y. This loss function has the advantage that the gradient does not vanish near the optimum point of the parameters, so learning can proceed stably [33]. In WGAN, a newly defined critic is used instead of the discriminator, and for each training step of G, the critic is trained n_critic times. The critic updates its parameters w based on the RMSProp optimizer, and weight clipping of w is performed to satisfy the k-Lipschitz continuity condition. The TensorFlow (version 2.3.0) and Keras (version 2.4.3) libraries were used as the deep learning framework for WGAN. As parameters of WGAN, the learning rate (α) of the RMSProp optimizer was 0.00005, the weight clipping parameter (c) was 0.01, and n_critic was 5. The raw datasets of the AEH indices were used as the input data of WGAN, the number of epochs was set to 15,000 for each, and training was performed. Figure 4 shows the model networks for G and D (the critic in this paper) constructed for WGAN training. The model network for G was composed of dense layers and LeakyReLU activation layers, and the model network for D was constructed by adding dropout layers in the middle to prevent overfitting [34].
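The structure of the critic update (n_critic steps per generator step, followed by weight clipping) can be sketched with NumPy. This is a toy one-parameter linear critic with plain gradient descent, not the authors' Keras implementation (which uses dense networks and RMSProp); it only illustrates the update rule and the clipping step.

```python
import numpy as np

rng = np.random.default_rng(0)
c, n_critic, lr = 0.01, 5, 5e-5  # clipping threshold, critic steps, learning rate (as in the paper)

# Toy linear critic f_w(x) = w * x on 1-D data, a stand-in for the
# dense critic network, just to show the WGAN update structure.
w = rng.normal(size=(1,))

def critic_loss(w, real, fake):
    # The critic maximizes E[f(real)] - E[f(fake)]; we minimize its negative.
    return -(np.mean(real * w) - np.mean(fake * w))

real = rng.normal(loc=2.0, size=100)   # "real" samples
for _ in range(n_critic):              # critic trained n_critic times per G step
    fake = rng.normal(loc=0.0, size=100)   # generator output placeholder
    grad = -(np.mean(real) - np.mean(fake))  # d(loss)/dw for the linear critic
    w -= lr * grad                            # plain SGD here (RMSProp in the paper)
    w = np.clip(w, -c, c)  # weight clipping keeps f_w k-Lipschitz

print(float(critic_loss(w, real, fake)))
```

In the full algorithm this inner loop alternates with one generator update, and the epoch count (15,000 in the study) controls how many such alternations are run.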

ML Models Building and Evaluation Process
In this study, we built six ML models (Support Vector Machine [35], Decision Tree [36], K-Nearest Neighbors [37], Random Forest [38], Gradient Boosting [39], and eXtreme Gradient Boost [40]) to predict the grades of each AEH index. The Scikit-learn module (version 0.24.1) and XGBoost library (version 1.4.2) in the Python 3.7 environment were used to build each ML model. A detailed description of the ML models can be found in the work of Bae et al. [41]. The function information of the ML models used in the study is presented in Table 4. First, to evaluate the AEH grade prediction performance of the ML models and the applicability of the WGAN, two datasets were constructed. One is a raw dataset with an unbalanced distribution, and the other is a WGAN-based dataset in which a synthetic dataset augmented through WGAN and the raw dataset are combined. Before model training, each dataset was split into 80% for the training set and 20% for the test set to evaluate the performance of the models. After that, preprocessing was performed on the training set. In general, appropriate preprocessing such as data normalization, abnormal data processing, and data format conversion is required to improve the performance of an ML model [42]. The StandardScaler function was used as a preprocessing step to give the input data a mean of 0 and a variance of 1. Additionally, the dummy function of the Pandas module (version 1.1.4) was used to encode the grades of each AEH index as integers. Second, to evaluate the reliability and stability of the AEH grade prediction of the six ML models, the models were trained and validated using the raw training set with the K-fold cross-validation (CV) technique.
This technique has the advantage of being able to split training and validation sets while maintaining the proportions of the classes, and it can prevent over-fitting and under-fitting on a specific dataset and improve the prediction accuracy of the model [43,44]. For the single parameter K, representing the number of groups into which a given dataset will be split, we used a value of 10, which is usually recommended to check the generalizability of the model [45]. The training performance of the ML models was validated based on the F1-score performance metric. Additionally, based on the results, the three ML models with the best AEH grade prediction performance were selected. Finally, the three selected ML models were trained and validated on the raw and WGAN-based training sets and evaluated on each test set for the applicability evaluation of WGAN.
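The evaluation procedure above (scaling, stratified 10-fold CV, F1 scoring) can be sketched with scikit-learn. The synthetic data below are a stand-in for the raw training set (7 physicochemical features, 5 grades); the real data and model hyperparameters are not reproduced here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative stand-in: 7 features (WT, DO, BOD5, SS, TN, TP, Flow)
# and 5 classes (grades A-E); not the study's observed data.
X, y = make_classification(n_samples=500, n_features=7, n_informative=5,
                           n_classes=5, random_state=42)

# StandardScaler inside the pipeline so each CV fold is scaled using only
# its own training split; stratified folds preserve class proportions.
model = make_pipeline(StandardScaler(), RandomForestClassifier(random_state=42))
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring="f1_macro")
print(scores.mean())
```

Placing the scaler inside the pipeline avoids leaking validation-fold statistics into the scaling step, which would otherwise inflate the CV scores.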

The confusion matrix [46], which contains information about the actual and predicted classes obtained by a classification model, is useful for visually understanding the performance of the model. The basic form of the confusion matrix is shown in Figure 5. The confusion matrix is useful for evaluating the performance of ML models on various classification problems [47,48]. In general, four evaluation metrics (Accuracy, Precision, Recall, and F1-score) are widely used to evaluate model classification performance based on the confusion matrix. These metrics can be obtained by Equations (2)-(5). Accuracy measures the overall predictive performance of a model as the ratio of the number of correctly classified data to the total number of data. Precision is the ratio of correctly classified data out of the total data predicted by the classification model as positive. Recall is a measure of the completeness of a model and is the ratio of correctly classified data to the total number of actually positive data.
The F1-score is the harmonic mean of precision and recall [49]. Precision and recall have an inherent trade-off relationship in that increasing one tends to decrease the other. Therefore, the F1-score is a measure that can evaluate the performance of a model through the trade-off between precision and recall. Additionally, the F1-score gives a better view of ML model performance, especially for datasets with an imbalanced class distribution, because the F1-score is not biased towards majority classes [50]. This study evaluated the classification performance of the ML models using the F1-score based on the confusion matrix.
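Equations (2)-(5) are not reproduced in this excerpt; in terms of the true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) read off the confusion matrix, the four metrics take their standard forms:

```latex
\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \quad (2)
\qquad
\mathrm{Precision} = \frac{TP}{TP + FP} \quad (3)
\qquad
\mathrm{Recall} = \frac{TP}{TP + FN} \quad (4)
\qquad
\mathrm{F1\text{-}score} = 2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \quad (5)
```

For the multi-class grade problem, these are computed per grade (one-vs-rest) and then averaged.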


Correlation Analysis Results
It is an important process to analyze and understand each input datum before training the ML model [41]. Correlation analysis was carried out on the input data (i.e., WT, DO, BOD5, SS, TN, TP, and Flow) of the raw dataset for the AEH indices, and the results were visualized using a heat map, as shown in Figure 6. The results showed that BOD5, TN, TP, and SS concentrations had a strong negative correlation with the scores of each AEH index compared to the other factors. In the results of Woo et al. [51], water quality concentrations and the scores of AEH indices showed a similar negative correlation. Kim et al. [52] also showed that TDI had a strong negative correlation with BOD5, TN, and TP concentrations. This indicates that water quality concentration can be used as an important indicator to identify changes in AEH. Additionally, SS concentration had the highest correlation with the BMI score, followed by the TDI and FAI scores. According to Griffiths and Walton [53], upper tolerance levels for SS concentration are between 80 and 100 mg/L for fish and as low as 10-15 mg/L for bottom invertebrates. SS concentration is a factor that interferes with the feeding or spawning of most benthic organisms and has a greater effect on benthic organisms than on any other organisms in the river [54]. On the other hand, the correlation between the FAI score and SS concentration was rather small at −0.22, which appears to be because relatively low-concentration data, rather than high-concentration data, dominated the correlation analysis. This indicates that ML models need to be trained on sufficient high-concentration SS data to accurately capture the impact of SS on fish population changes and habitat environment.
In addition, since fish and other organisms respond strongly to the duration of exposure to sediment as well as to its concentration [55], correlation analysis including variables such as the period of exposure to SS concentration will be required for an accurate sensitivity analysis of physicochemical factors according to AEH changes.
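The correlation analysis behind Figure 6 can be sketched with pandas. The data below are synthetic (a negative BOD5-health relationship with noise), standing in for the study's observations; `seaborn.heatmap(corr)` would render the heat map, so plotting is omitted here.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 200
# Synthetic stand-in for the raw dataset; the real observations are not
# reproduced. Health score is constructed to fall as BOD5 rises.
bod5 = rng.gamma(2.0, 1.5, n)
df = pd.DataFrame({
    "BOD5": bod5,
    "TN": bod5 * 0.8 + rng.normal(0, 0.5, n),
    "BMI_score": 80 - 6 * bod5 + rng.normal(0, 5, n),
})

# Pearson correlation matrix, as typically shown in a correlation heat map.
corr = df.corr(method="pearson")
print(corr.loc["BOD5", "BMI_score"])
```

A value near −1 here corresponds to the strong negative correlation reported between pollutant concentrations and AEH scores.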


Correlation Analysis and WGAN-Based Data Augmentation Results
In this study, the datasets for ML model training were augmented from the raw datasets for the AEH indices through WGAN. In the WGAN training process, we found that the loss function of the discriminator showed large fluctuations in the initial stage (Figure 7). This implies that the generator and discriminator at the initial stage recognize nothing about the basic characteristics of the data, so the discriminator cannot yet learn sufficient information related to the real data [56]. However, as training progresses, the loss function stably converges to an optimal value close to zero. In addition, as the loss function was optimized, the synthetic data generated by the generator showed a distribution similar to that of the real data (Figure 8). This indicates that the generator produces better quality synthetic data because the discriminator's ability to capture the details of the real data improved as the training iterations progressed. After training, 25,000 synthetic samples for the AEH indices were randomly generated through WGAN and combined with the raw datasets to construct the WGAN-based datasets. Table 5 shows the amount of data by grade in the synthetic dataset and the WGAN-based dataset for the AEH indices. As can be seen in Table 5, the amount of data by grade for each AEH index augmented through WGAN is not identical, but the overall data balance by grade is improved compared to the raw dataset. When the input data of an ML model for a classification problem are unbalanced by class, the model may be overwhelmed by the large classes and ignore the small classes [57]. Therefore, increasing the amount of data by grade for the AEH indices through WGAN is a meaningful attempt to improve the classification ability of the ML model.
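The bookkeeping step of combining WGAN output with the raw data and checking per-grade balance can be sketched as follows. The per-grade counts below are hypothetical; the actual counts appear in Tables 3 and 5 of the paper.

```python
import pandas as pd

# Hypothetical per-grade counts before and after augmentation; the real
# values are in Tables 3 and 5, not reproduced here.
raw = pd.DataFrame({"grade": ["A"] * 50 + ["B"] * 20 + ["E"] * 3})
synthetic = pd.DataFrame({"grade": ["A"] * 10 + ["B"] * 40 + ["E"] * 55})

# WGAN-based dataset = raw observations plus synthetic samples.
wgan_based = pd.concat([raw, synthetic], ignore_index=True)
counts = wgan_based["grade"].value_counts()
print(counts.to_dict())
```

Checking `value_counts()` after concatenation makes it easy to verify that minority grades (here E) are no longer dwarfed by the majority grade before the ML models are retrained.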
WGAN has the advantage of being able to generate various new data similar to real data through stable learning compared to the original GAN, but it is difficult for the distribution of synthetic data generated under the influence of weight clipping to always accurately match the real distribution of the data [24,58]. Thus, to improve the convergence speed and accuracy of WGAN in the future, it will be necessary to study WGAN algorithm improvements and parameter calibration.

Comparison of Validation Results of ML Models
Figure 9 and Table 6 show the 10-fold CV results of the six ML models using the raw training set in terms of F1-score. As shown in Figure 9, there was no significant difference in the F1-scores for BMI and FAI among the ML models, but among the six models, XGB was the highest, with an F1-score of 0.475 for BMI and 0.310 for FAI. On the other hand, in the case of TDI, RF and SVM had a significantly higher predictive performance, with F1-scores of 0.388 and 0.370, respectively, than the other models. The average F1-scores for the AEH indices of the six models ranged from 0.349 to 0.376 (Table 6). In particular, RF had the highest average F1-score of 0.376, followed by XGB (0.367) and SVM (0.359). RF and XGB are decision tree-based ensemble models and have excellent classification capacity on imbalanced datasets [59,60]. SVM is a powerful state-of-the-art algorithm with a strong theoretical foundation, which aims to find the decision boundary or hyperplane, and is widely used to solve high-dimensional nonlinear classification problems using various kernel functions [61]. Additionally, to analyze the improvement in grade prediction performance for the AEH indices when using the WGAN-based training set, the three selected ML models (RF, XGB, and SVM) were trained and validated using the WGAN-based training set. Overall, better results were obtained from the ML models trained on the WGAN-based training set, while the worst results were obtained from the ML models trained on the raw training set (Figure 10). We also found that the ML models trained on the WGAN-based training set outperformed the models trained on the raw training set, with F1-scores of 0.953 for RF, 0.959 for XGB, and 0.953 for SVM (Table 7). This indicates that data augmentation through WGAN alleviated the imbalanced distribution of the raw training set and successfully improved the grade prediction performance of the ML models for each AEH index. In particular, these results confirm that the ML models can better predict grades for which data are scarce than models trained using only the limited datasets. Other studies in various fields have also demonstrated that data augmentation through WGAN improves the performance of trained classifiers [62,63]. This implies that the WGAN-based data augmentation method can be reasonably utilized to enlarge datasets in the fields of hydrology and aquatic ecosystems.

Figure 9. Comparison of F1-scores for AEH indices through 10-fold CV by ML model using the raw training set.


Grade Prediction of Each AEH Index for Test Set Using the ML Models
The confusion matrices for the grades of each AEH index on the test sets predicted by the three ML models trained on the raw and WGAN-based training sets are shown in Figures 11 and 12, respectively, and the performance results in terms of the three evaluation metrics (Precision, Recall, and F1-score) are summarized in Tables 8 and 9. The average F1-scores of the three models trained using the raw training set for the BMI, TDI, and FAI grades were 0.53, 0.35, and 0.56 for RF; 0.44, 0.32, and 0.58 for XGB; and 0.49, 0.24, and 0.61 for SVM (Table 8). The average F1-scores for the FAI grades predicted by each ML model were greater than those for the other indices. This indicates that, in the process of training and validating each ML model, the characteristics of the data by grade in the raw training set for FAI were well reflected. On the other hand, the F1-score for BMI grade A showed a high performance, with 0.80 for RF, 0.73 for XGB, and 0.79 for SVM, but the prediction performance for the other grades (B to E) was mostly lower than 0.29. The raw training set for BMI has an unbalanced distribution, with grade A accounting for at least 50% of the samples. Therefore, it appears that the ML models trained using the raw training set for BMI overfit grade A and did not correctly classify all grades in the test set. In the study of Woo et al. [18], the results also showed that the predictive performance of RF for the minority grades (B to E) was relatively lower than that for the majority grade A due to the effect of data imbalance. Additionally, although the samples by grade in the raw training set for TDI were relatively balanced, each ML model did not correctly classify the TDI grades as a whole (Figure 11). This implies that each ML model did not capture the characteristics of the samples in the test set well, and it is necessary to train and validate the models with sufficient data for each grade.
As can be seen in Table 9, the predictive performance of the ML models trained on the WGAN-based training set was superior to that of the ML models trained on the raw training set. The average F1-scores of the ML models trained on the WGAN-based training set for the BMI, TDI, and FAI grades were 0.92, 0.77, and 0.93 for the RF; 0.92, 0.75, and 0.93 for the XGB; and 0.93, 0.84, and 0.94 for the SVM. In particular, the results showed that the ML models trained on the WGAN-based training set were not biased toward the majority grades and predicted all grades well overall (Figure 12). This is because the synthetic data by grade of the AEH indices augmented through the WGAN alleviated the unbalanced distribution of the raw training set [64]. Through the above results, it was confirmed that the WGAN-based data augmentation method can overcome the limitations of limited datasets and improve the overall model performance for grade prediction of each AEH index. Therefore, securing data for each AEH grade through WGAN-based data augmentation is an important step in building a robust ML model that can efficiently predict all AEH grades.
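The balancing step described above can be outlined roughly as follows. This is a simplified sketch, not the authors' pipeline: `generate` is a hypothetical stand-in for a trained per-grade WGAN generator, and the grade counts are invented for illustration.

```python
import random

# Simplified sketch: top up each AEH grade with WGAN-generated samples until
# every grade matches the majority-grade count, so no grade dominates training.
# `generate(grade)` is an assumed interface to a trained per-grade generator.

def augment_to_balance(samples_by_grade, generate):
    """samples_by_grade: dict mapping grade -> list of feature vectors."""
    target = max(len(v) for v in samples_by_grade.values())
    balanced = {}
    for grade, real in samples_by_grade.items():
        n_synthetic = target - len(real)  # 0 for the majority grade
        balanced[grade] = real + [generate(grade) for _ in range(n_synthetic)]
    return balanced

# Hypothetical imbalanced BMI-like training set: grade A dominates.
raw = {"A": [[0.1]] * 50, "B": [[0.2]] * 8, "C": [[0.3]] * 5}
fake_generator = lambda grade: [random.random()]  # stub for the WGAN generator
balanced = augment_to_balance(raw, fake_generator)
```

In practice, the quality of the result depends entirely on how well the WGAN has learned each grade's distribution; the balancing arithmetic itself is the trivial part.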

Conclusions
This study evaluated the applicability of the WGAN for augmenting ML training datasets in the fields of hydrology and aquatic ecosystems and proposed a method to predict the grades of each AEH index using ML models. The main results of this study are as follows. First, among the various physicochemical factors, water quality factors such as BOD5, TN, TP, and SS concentrations were found to have a relatively significant effect on all of the AEH indices compared to the other factors. This implies that water quality factors can be important indicators for predicting AEH changes in rivers through ML models. Second, as the training of the WGAN progressed stably, synthetic data for the AEH indices with distributions similar to those of the real data were generated. Additionally, the 10-fold CV performance of the ML models (RF, XGB, and SVM) trained with these synthetic data showed improved results, with an average F1-score of 0.9 or more for each AEH index. Finally, in predicting the grades of the test set for each AEH index through the above three ML models, the models trained using the raw training set did not properly classify the minority grades due to data imbalance, whereas the models trained using the WGAN-based training set classified the grades of all AEH indices well overall without being biased toward the majority grades. Through the results of this study, it was confirmed that the AEH grade classification performance of an ML model can vary greatly depending on the data distribution for each grade; additionally, the synthetic datasets augmented through the WGAN can contribute to improving model performance and reducing the effort needed for real data collection. However, the ML models built in this study did not reflect the data characteristics of periods such as floods and droughts; therefore, their ability to accurately predict changes in AEH caused by extreme events is limited.
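As a minimal illustration of the 10-fold CV scheme referred to above, the sketch below shows a plain, unstratified index split in which each sample is held out exactly once; the actual study may have used stratified or library-provided splitting.

```python
# Minimal 10-fold cross-validation index split (unstratified, illustrative).
# Each sample index appears in exactly one test fold; the remaining indices
# form the corresponding training fold.

def kfold_indices(n_samples, k=10):
    """Return a list of k (train_indices, test_indices) pairs."""
    folds = [[] for _ in range(k)]
    for i in range(n_samples):
        folds[i % k].append(i)  # round-robin assignment to folds
    return [(sorted(set(range(n_samples)) - set(f)), f) for f in folds]

splits = kfold_indices(100, k=10)
```

A model would be trained and evaluated once per pair, and the k scores averaged, which is how per-index average F1-scores like those reported here are typically obtained.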
In future research, it will be necessary to continuously train and validate the ML model by acquiring observed data from various rainfall events and supplementing the synthetic datasets of the WGAN to accurately predict changes in AEH. The ML approach proposed in this study could contribute not only to evaluating the achievement of AEH improvement goals through the introduction of structural and non-structural watershed management plans, but also to selecting impaired rivers that need water quality and aquatic ecology management. In addition, the WGAN-based data augmentation method can be used to build the large amount of data necessary for an integrated water environment management system that organically links water quantity, water quality, and aquatic ecosystems.