Evaluation of Classification Algorithms to Predict Largemouth Bass (Micropterus salmoides) Occurrence

This study aimed to evaluate classification algorithms to predict largemouth bass (Micropterus salmoides) occurrence in South Korea. Fish monitoring and environmental data (temperature, precipitation, flow rate, water quality, elevation, and slope) were collected from 581 locations throughout four major river basins for 5 years (2011–2015). Initially, 13 classification models built in the caret package were evaluated for predicting largemouth bass occurrence. Based on the accuracy (>0.8) and kappa (>0.5) criteria, the top three classification algorithms (i.e., random forest (rf), C5.0, and conditional inference random forest) were selected to develop ensemble models. However, combining the best individual models did not work better than the best individual model (rf) at predicting the frequency of largemouth bass occurrence. Additionally, annual mean temperature (12.1 °C) and fall mean temperature (13.6 °C) were the most important environmental variables to discriminate the presence and absence of largemouth bass. The evaluation process proposed in this study will be useful to select a prediction model for the prediction of freshwater fish occurrence but will require further study to ensure ecological reliability.


Introduction
Various classification algorithms have been used to predict the presence of freshwater fish under certain environmental conditions [1–3]. For instance, boosted regression tree [4], classification tree [5], genetic algorithm for rule-set prediction [2,6], logistic regression [3], generalized additive model [7], and artificial neural networks [1,8] have been used for freshwater fish prediction. In particular, the habitat preference, distribution shifts, and invasion risk of freshwater fish have been evaluated using classification models [9,10]. For instance, Fukuda et al. analyzed the occurrence of the invasive fish Pseudorasbora parva using a random forest algorithm [9]. Kwon et al. also used a random forest model among six candidate models and predicted the occurrence of 22 endemic fishes in South Korea [11].
One of the most promising tools to build classification algorithms is the caret package [12] in R [13]. The caret package can be an alternative to the widely used biomod2 package, which requires environmental data in raster file format [14]. Given that freshwater environmental data are often available as point data [15], the application of the biomod2 package is limited. Additionally, the caret package offers a wider variety of hyperparameter settings than the biomod2 package. The caret package offers 238 methods in total, consisting of 102 classification, 48 regression, and 88 classification/regression algorithms. Because of the large number of algorithms and parameter settings, it is necessary to screen algorithms in the caret package.
Recently, ensemble modeling approaches have been widely used to improve the performance of individual models [16,17]. Ensemble models have advantages in improving prediction accuracy, reducing variance, and interpolating sampling bias errors [18,19]. Several studies have predicted the occurrence of freshwater fish by taking advantage of ensemble modeling [1,20]. For example, Poulos et al. demonstrated that an ensemble model that integrated four algorithms outperformed the individual algorithms in terms of predicting the distributions of three invasive fishes [20]. Grenouillet et al. also suggested an ensemble model of eight algorithms to estimate the distributions of freshwater fishes in France instead of individual models [1]. However, Hao et al. reported that some tuned individual models performed better than ensemble models to predict species distribution of eucalypt trees [21]. These findings suggest that the outperformance of ensemble models over individual models is still controversial.
This study aimed to evaluate the performance of classification algorithms to predict largemouth bass (Micropterus salmoides) occurrence in South Korea. The largemouth bass was selected as the model species for this study because it is an invasive species that causes numerous ecological impacts worldwide [22,23]. For instance, the largemouth bass disturbs freshwater ecosystems by competing with endemic fish for food [24] or by predating on endemic species [25]. Therefore, this study may contribute to the establishment of a suitable classification model that can be used in the management of invasive freshwater fish.

Study Area and Fish Data
The study area included four major river basins (Han, Nakdong, Geum, and Yeongsan) in South Korea (Figure 1). Largemouth bass monitoring data (2011–2015) were obtained from the Water Environment Information System (http://water.nier.go.kr/, accessed on 3 August 2020). Among the 960 fish monitoring stations in South Korea, 581 sites throughout the four major river basins were selected by considering the availability of largemouth bass monitoring data and environmental data. Specifically, 226, 155, 94, and 106 sites belonged to the Han, Nakdong, Geum, and Yeongsan River basins, respectively.

Fish monitoring is conducted annually by the National Aquatic Ecological Monitoring Program of Korea [26]. Fish were captured using casting net and skimming net methods. Fish monitoring stations are representative sites reflecting the characteristics of rivers and streams. Each station covers a 200 m section that includes a riffle, a pool, and a run.

Environmental Data
The environmental data used in this study were temperature (Temp), precipitation (Prcp), flow rate (Flow), total nitrogen (TotalN), total phosphorus (TotalP), and total suspended solids (TotalSS). Temperature is an important determinant because freshwater fish are ectotherms. Precipitation and flow rate are well-known variables influencing the physical habitat suitability of fish [27]. In addition, water quality variables such as TotalN, TotalP, and TotalSS are known to play key roles in fish distribution [28]. Each of these variables was further summarized into six categories: annual average, monthly difference, and the means of spring (March to May), summer (June to August), fall (September to November), and winter (December to February), to reflect annual trends and seasonal variability. Additionally, two topographic variables that limit the geographic distribution of fish [1,29,30], elevation and slope, were used as background environmental data. Elevation is generally negatively correlated with water temperature, which affects fish distribution [31]. Slope can influence water velocity [32], which is one of the important hydraulic variables determining the distribution of fish [33].
Temperature and precipitation data were downloaded from the Korea Meteorological Administration (http://www.climate.go.kr/, accessed on 5 August 2020). Flow rate and water quality data were obtained from the climate change database (http://motive.kei.re.kr/, accessed on 10 August 2020) of the Model of Integrated Impact and Vulnerability Evaluation (MOTIVE). Elevation and slope data were acquired from the National Geographic Information Institute (http://www.ngii.go.kr/, accessed on 3 August 2020) and from the National Institute of Agricultural Sciences (http://www.naas.go.kr/, accessed on 3 August 2020), respectively.

Classification Modeling
Fernández-Delgado et al. proposed a list of the top 20 binary classification models by comprehensively comparing the accuracy of 179 algorithms using 121 data sets [34]. Among the 20 models, 13 algorithms that were included in the caret package [12] were used in this study: random forest (rf), C5.0, conditional inference random forest (cforest), k-nearest neighbor (knn), support vector machine with radial basis function kernel (svmRadial and svmRadialCost), flexible discriminant analysis (fda), neural network with feature extraction (pcaNNet), Bayesian generalized linear model (bayesglm), support vector machine with polynomial kernel (svmPoly), model averaged neural network (avNNet), neural network (nnet), and penalized discriminant analysis (pda). Classification algorithms were developed in R [13] and all default parameters in the caret package [12] were used in this study.
In total, 2869 records (778 presence and 2091 absence records) were collected from 2011 to 2015 (581 records per year, except for 545 records in 2011). Of these, 70% were used for model training, whereas the remaining 30% were used as the test set (Figure 2). The training and test samples were selected using the createDataPartition function of the caret package, and this sample selection was replicated 10 times. The createDataPartition function offers a balanced sampling of records, which can prevent bias. For each replication, 13 classification algorithms were calibrated using the training set, and algorithm performance was evaluated using the corresponding test set. The training set was divided into two subsets to search for the best hyperparameter setting. Subsets were randomly selected using the bootstrap method. We used 75% of the training set for hyperparameter searching, and the remainder was used for hyperparameter validation. This hyperparameter searching was replicated 25 times, and the best hyperparameter was selected based on the algorithm accuracy. Accuracy and kappa values were used to assess the performance of each algorithm.
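The splitting and screening steps described above can be sketched in a few lines of Python. This is a minimal illustration rather than the study's R workflow: `stratified_split`, `accuracy_and_kappa`, and `passes_screening` are hypothetical helper names, the split only mimics the class-balanced sampling idea, and the example predictions are invented. Only the record counts (778 presence, 2091 absence) and the screening criteria (accuracy > 0.8, kappa > 0.5) come from the text.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_fraction=0.7, seed=0):
    """Split record indices so each class keeps roughly the same proportion."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = round(train_fraction * len(indices))
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return sorted(train), sorted(test)


def accuracy_and_kappa(observed, predicted):
    """Return overall accuracy and Cohen's kappa for two label sequences."""
    n = len(observed)
    classes = set(observed) | set(predicted)
    p_observed = sum(o == p for o, p in zip(observed, predicted)) / n
    # Agreement expected by chance, from the marginal class frequencies.
    p_expected = sum(
        (observed.count(c) / n) * (predicted.count(c) / n) for c in classes
    )
    kappa = (p_observed - p_expected) / (1 - p_expected) if p_expected < 1 else 0.0
    return p_observed, kappa


def passes_screening(accuracy, kappa):
    """Screening rule from the text: accuracy > 0.8 and kappa > 0.5."""
    return accuracy > 0.8 and kappa > 0.5


# Record counts from the text: 1 = presence, 0 = absence.
labels = [1] * 778 + [0] * 2091
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))

# Invented predictions, used only to exercise the metric functions.
obs = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
print(accuracy_and_kappa(obs, pred))
```

Kappa discounts the agreement expected by chance, which matters here because absences outnumber presences by almost three to one: a model that always predicts absence would reach about 73% accuracy with a kappa of zero.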
Individual classification algorithms that had an average accuracy of over 0.8 and an average kappa value above 0.5 were selected to develop the ensemble models. The ensemble models were constructed from a combination of the top-three-ranked algorithms (i.e., rf, C5.0, and cforest). Each algorithm was optimized by two hyperparameter optimization methods, grid search and random search. Considering that the occurrence frequency can be used to assess the impacts of invasive species [35], the optimal model was selected based on the average difference between the observed and predicted frequency of largemouth bass occurrence within the study period (2011–2015). Additionally, the contribution of environmental variables to the prediction of largemouth bass occurrence was assessed using the optimized model. Moreover, the true skill statistic (TSS) was applied to derive threshold values for the top-three-ranked environmental variables: elevation, annual mean temperature, and fall mean temperature.
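As a rough illustration of how a TSS-based threshold can be derived for a single environmental variable, the Python sketch below scans candidate cut-offs and keeps the one with the highest TSS (sensitivity + specificity - 1). It assumes that presence is predicted when the variable is at or above the cut-off, which suits a temperature variable but is an assumption, not a detail given in the text. The function names and the sample values are hypothetical.

```python
def tss(observed, predicted):
    """True skill statistic: sensitivity + specificity - 1."""
    tp = sum(1 for o, p in zip(observed, predicted) if o and p)
    fn = sum(1 for o, p in zip(observed, predicted) if o and not p)
    tn = sum(1 for o, p in zip(observed, predicted) if not o and not p)
    fp = sum(1 for o, p in zip(observed, predicted) if not o and p)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both presence and absence records are required")
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity + specificity - 1


def best_threshold(values, observed):
    """Scan each observed value as a cut-off and return (cut-off, TSS).

    Assumption: presence is predicted when value >= cut-off.
    """
    best = None
    for cut in sorted(set(values)):
        predicted = [v >= cut for v in values]
        score = tss(observed, predicted)
        if best is None or score > best[1]:
            best = (cut, score)
    return best


# Invented example: mean temperatures (degrees C) and presence (1) / absence (0).
temperatures = [9, 10, 11, 12, 13, 14, 15, 16]
presence = [0, 0, 1, 0, 1, 1, 1, 1]
print(best_threshold(temperatures, presence))
```

A TSS at or below zero for every cut-off means the variable separates presences from absences no better than chance, in which case no meaningful threshold exists.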

Performance of Classification Algorithms
Table 1 compares the ability of the 13 classification algorithms to predict largemouth bass occurrence. According to the average rank of the individual models, the algorithm rf showed the best performance, followed by C5.0 and cforest, all of which had an average accuracy over 0.8 and an average kappa value above 0.5 in both the training and test simulations. This result is consistent with previous studies that revealed the strong performance of the random forest algorithm [11,36]. The average rank of model performance fell sharply from the fourth model (knn) onward because of large decreases in the accuracy and kappa values. Previous studies have built ensemble models by simply integrating all the candidate algorithms into the model [1] or by weighting candidate algorithms based on their accuracy [20,36]. However, this study suggests that individual classification algorithms have a large spectrum of modeling performance and, therefore, should be incorporated into the ensemble model carefully.
Figure 3 illustrates the performance of ensemble models compared with the best individual model (rf). Model performance was evaluated using the average difference between the observed and predicted frequency of largemouth bass occurrence (Table S1). The rf model showed the least difference between the observed and predicted frequency, whereas the addition of the second and third classification algorithms (C5.0 and cforest) into the rf model notably increased prediction errors. These findings suggest that ensemble models did not work better than the best individual model (rf) at predicting largemouth bass occurrence. Ensemble models generally outperform individual models when combining individual models predicting different trends [37]. However, this might not have occurred in this study because all of the top three individual models underestimated the frequency of largemouth bass occurrence.
Ensemble models may have higher ecological validity than individual models. Muñoz-Mas et al. showed the ecological reliability of the ensemble model, which is obtained from the attenuation of response curves [37]. This attenuation occurred due to the diversity of the prediction result. The diversity derived from the model error can be measured by ambiguity decomposition [38] and bias-variance-covariance decomposition [39]. However, this "diversity" is only applicable in regression models [40], and there is still no consensus on defining diversity in classification tasks [41,42]. Future studies can be conducted by applying diversity measurements to ensemble classification algorithm evaluations.
Classification models depict only the species' characteristics observed in the past or present [43], which may lead to high uncertainty in invasive species modeling under certain conditions. For example, insufficient invasive species records due to a short invasion history may increase uncertainty [44]. Moreover, training samples recorded under a non-equilibrium state may also increase model uncertainty [44]. In this study, we assumed that largemouth bass live in an equilibrium state because the largemouth bass was first introduced several decades ago and now occurs in all four major river basins in South Korea. In addition, our occurrence data for largemouth bass are sufficient to reflect diverse distribution characteristics because they were collected from representative monitoring sites in South Korea.
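A small Python sketch can illustrate why an ensemble may fail to beat its best member when all members share the same bias. The text does not state how the individual predictions were combined, so simple majority voting is assumed here. The error measure is a simplified version of the observed-versus-predicted occurrence frequency difference, computed over one set of records rather than averaged over years. All names and prediction vectors are hypothetical.

```python
def majority_vote(*model_predictions):
    """Combine 0/1 predictions from several models; ties count as absence."""
    return [
        int(2 * sum(votes) > len(votes)) for votes in zip(*model_predictions)
    ]


def occurrence_frequency(labels):
    """Share of records labelled as presence (1)."""
    return sum(labels) / len(labels)


def frequency_error(observed, predicted):
    """Absolute difference between observed and predicted occurrence frequency."""
    return abs(occurrence_frequency(observed) - occurrence_frequency(predicted))


# Invented example in which every model underestimates occurrence.
observed = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
rf       = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
c50      = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
cforest  = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

ensemble = majority_vote(rf, c50, cforest)
print(frequency_error(observed, rf), frequency_error(observed, ensemble))
```

In this toy case the vote is pulled toward the two weaker models, so the ensemble underestimates occurrence more than rf alone. Voting can only cancel errors that point in different directions.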

Role of Environmental Variables
The mean values (2011–2015) of the environmental variables and their cumulative contribution to the prediction of largemouth bass occurrence are shown in Table 2 and Figure 4, respectively. The contribution was normalized by the most important variable, elevation (100%). Yoon et al. reported that elevation was the most influential variable in the distribution of freshwater fish in South Korea, which was negatively correlated with water temperature [31]. The next most important variables were climatic variables such as temperature (Temp) and precipitation (Prcp). The high contribution of temperature has been well demonstrated in the distribution of ectothermic freshwater fish [45,46]. In addition to annual average temperature, seasonal temperature, particularly in fall and winter, played an important role in largemouth bass occurrence (Figure 4). Kwon et al. also demonstrated that seasonal variation in temperature significantly influenced the distribution of freshwater fish in South Korea [11]. Following the climatic variables, water quality (TotalN, TotalP, and TotalSS) also affected largemouth bass occurrence. Meador et al. reported that TotalN, TotalP, and water temperature frequently correlated with the increased species richness of invasive freshwater fish, including the largemouth bass [47]. These findings suggest that water quality parameters and seasonal variations in environmental variables should be considered when predicting invasive freshwater fish distributions.
Threshold values of the top-three-ranked environmental variables for the prediction of largemouth bass occurrence were determined by the TSS (Figure 5). For elevation, a threshold was not determined because the TSS was less than zero, indicating that elevation has poor discriminating power. In contrast, the highest TSS values were 0.4185 at an annual mean temperature of 12.1 °C and 0.4190 at a fall mean temperature of 13.6 °C.
Figure 6 shows the accuracy of these threshold values to predict the presence and absence of largemouth bass. For annual mean temperature, the threshold distinguished presence and absence at 80.0% and 61.8% accuracy, respectively. In addition, the threshold of fall mean temperature distinguished the presence and absence at 68.6% and 73.3% accuracy, respectively. In general, the growth potential of largemouth bass decreases as temperature decreases, thus limiting distribution [48]. Moreover, fall mean temperature might restrict the distribution of largemouth bass because of higher swimming performance in fall than in spring or in winter [49]. These findings suggest that the frequency of largemouth bass occurrence in South Korea may increase under global warming.
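The reported threshold accuracies can be checked against the reported TSS values with a short worked calculation. It assumes that the presence and absence accuracies correspond to sensitivity and specificity, respectively, which the text does not state explicitly.

```latex
\begin{align*}
\mathrm{TSS} &= \text{sensitivity} + \text{specificity} - 1 \\
\mathrm{TSS}_{\text{annual mean temperature}} &\approx 0.800 + 0.618 - 1 = 0.418 \\
\mathrm{TSS}_{\text{fall mean temperature}} &\approx 0.686 + 0.733 - 1 = 0.419
\end{align*}
```

Under that assumption, both results agree with the reported maxima of 0.4185 and 0.4190 to within the rounding of the percentages.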

Conclusions
In this study, 13 classification algorithms were systematically evaluated to predict largemouth bass occurrence in South Korea. The best individual model (rf) performed better than any ensemble model of the top three algorithms (rf, C5.0, and cforest) at predicting the frequency of largemouth bass occurrence over a period of 5 years (2011–2015). In addition, water quality variables (TotalN, TotalP, and TotalSS) substantially contributed to the prediction of largemouth bass occurrence, following conventional climatic (temperature and precipitation) variables. Given that annual mean temperature and fall mean temperature are the most important discriminating variables, the ecological risk posed by invasive largemouth bass is expected to increase under climate change. The evaluation process proposed in this study can be useful for developing prediction models for invasive freshwater fish, but further study is required to establish the ecological reliability of the model. In addition, ecological factors such as interspecific competition and predation should be considered in further studies because ecological interactions can influence the distributions of invasive species.
