Article

Association between the Classification of the Genus of Batrachospermaceae (Rhodophyta) and the Environmental Factors Based on Machine Learning

1
Shanxi Key Laboratory for Research and Development of Regional Plants, School of Life Science, Shanxi University, Taiyuan 030006, China
2
School of Physical Education, Shanxi University, Taiyuan 030006, China
*
Authors to whom correspondence should be addressed.
Plants 2022, 11(24), 3485; https://doi.org/10.3390/plants11243485
Submission received: 15 October 2022 / Revised: 3 December 2022 / Accepted: 9 December 2022 / Published: 13 December 2022
(This article belongs to the Special Issue Genetic Diversity and Taxonomy of Algae)

Abstract

Batrachospermaceae is the largest family of freshwater red algae. It is widely distributed around the world and plays an important role in maintaining the balance of spring and creek ecosystems. The ongoing deterioration of the global ecological environment has damaged the habitats of Batrachospermaceae. Research on the environmental factors of Batrachospermaceae and on the accurate classification of its genera is therefore necessary for the protection, restoration, exploration, and utilization of Batrachospermaceae resources. In this paper, a database of the geographical distribution and environmental factors of Batrachospermaceae was compiled, and the relationship between genus-level classification and environmental factors was analyzed with two machine learning methods, random forest and XGBoost. The results show that: (1) the models constructed by the two machine learning methods can effectively distinguish the genera of Batrachospermaceae based on environmental factors; (2) the overall AUC score of the random forest model for genus-level classification reached 90.41%, and that of the XGBoost model reached 85.85%; (3) combining the two methods, the environmental factors that mainly affect the distinction between genera are altitude, average relative humidity, average temperature, and minimum temperature, among which altitude has the greatest influence. These results can further clarify the genus-level taxonomy of Batrachospermaceae and enrich research on the differences in environmental factors among its genera.

1. Introduction

Batrachospermaceae was established by Agardh in 1824 [1] and belongs to Rhodophyta, Florideophycidae, Batrachospermales. It is the largest family of freshwater red algae and there are currently 206 formally recognized species [2]. The members of Batrachospermaceae live in low temperature, low light, clean, high dissolved oxygen, and flowing spring or stream water [3]. Red algae (Rhodophyta) are an important part of the algal flora in the inland aquatic ecosystems, and the species diversity of freshwater red algae is mainly concentrated in Batrachospermaceae [4]. With the deterioration of the global ecological environment, the survival of Batrachospermaceae is under considerable threat. Some countries have listed its members as rare and protected species [5,6]. The studies of environmental factors and accurate classification of Batrachospermaceae are the basis for the protection and utilization of the family.
Many studies have examined the influence of environmental factors on the geographical distribution, growth, and development of freshwater algae. Sheath and Burkholder [7] pointed out that the rapid fluctuation of physical and chemical conditions in five streams in Rhode Island is the main force determining the distribution and abundance of macroalgal species, community composition, and seasonal dynamics. Biggs and Price [8] and Biggs [9] stated that there is a strong link between specific conductance and the algal communities in New Zealand streams. Branco et al. [10] and Krupek and Branco [11] found that local conditions, such as irradiance, water velocity, and light intensity, affect the spatial distribution of algal communities in the central and western regions of Paraná state in southern Brazil. Jimenez and Fatjo [12] documented that Batrachospermum gelatinosum and Sirodotia suecica were clearly distinguished from B. helminthosum based on nutrient content in tropical high-altitude rivers in central Mexico. Carmona et al. [13] showed that the gametophytes of S. suecica in a high-altitude stream in central Mexico grew under eutrophic conditions and particular microhabitat conditions (high current velocity, low irradiance, and shallow depth), and that some morphological and reproductive characteristics, namely abundant secondary branches, spermatangia, and carpogonia, seem to be adaptations to high current velocity. Xie found that the growth, occurrence frequency, and average cover of B. gelatinosum and Sheathia arcuata showed obvious seasonal variation in two springs in North China [14,15]. These studies show that there is a complex relationship between environmental factors and the growth and distribution of freshwater algae, including Batrachospermaceae. It can therefore be speculated that environmental factors are also related to the classification of Batrachospermaceae.
At present, most genus-level classifications of Batrachospermaceae are based on traditional morphological [16,17,18] and molecular biological methods [19,20,21]. No study has yet used machine learning to examine how differences in environmental factors relate to the genus-level classification of Batrachospermaceae.
Machine learning (ML) methods are statistical techniques originating in the field of artificial intelligence that focus on identifying complex, nonlinear structure in data and generating accurate predictive models [22]. Compared with traditional statistical analysis methods, ML can model complex nonlinear relationships in data without meeting the restrictive assumptions required by traditional parametric methods. ML provides techniques for dealing with high-dimensional data and missing values, and can mine the information hidden in large databases. With their rapid development, ML methods have been widely used in molecular biology, ecological informatics, medicine, and other fields [22,23,24].
Building on these advantages, this study used random forest (RF) and extreme gradient boosting (XGBoost) to classify genera and to rank the importance of environmental factors from environmental factor data. The effects of environmental factors on the growth and distribution of Batrachospermaceae are also discussed. This study enriches research on the genus-level taxonomy of Batrachospermaceae and on the influence of environmental factors on the growth and distribution of the family. The results demonstrate the feasibility of classifying the genera of Batrachospermaceae by environmental factors, and may provide some reference for the conservation, restoration, and utilization of Batrachospermaceae resources.

2. Results

2.1. Descriptive Statistical Analysis

Based on the latitude, longitude, and environmental factors of Batrachospermaceae, we produced a global distribution map (Figure 1) and boxplots of the environmental factors of Batrachospermaceae on different continents (Figure 2). The members of Batrachospermaceae are distributed all over the world, with a clear concentration in temperate regions. In addition, the distributions of the environmental factors differ markedly among the five continents; the ranges of altitude, average temperature, minimum temperature, and maximum temperature vary the most.
The altitude range of Batrachospermaceae is larger in Asia and Oceania and smaller in South America. The range of average temperature is largest in Asia and smallest in South America, with little difference among the other continents. The range of maximum temperature is larger in Asia, slightly smaller in Europe and Oceania, and smallest in South America. The range of average wind speed is larger in Asia and North America, somewhat smaller in Europe, and smallest in Oceania and South America.

2.2. Results of UMAP

The UMAP clustering results for the Batrachospermaceae samples by continent (Figure 3a) show that samples from different continents can be clustered. Oceania has the best clustering effect, and the European and North American samples lie close together in the projected space. The UMAP clustering results by genus (Figure 3b) show that the genera Nothocladus and Virescentia have the best clustering effect, followed by the genus Sheathia. The clustering effect for the genera Batrachospermum, Kumanoa, and Sirodotia is slightly inferior.

2.3. Results of Machine Learning Methods

2.3.1. Results of the Random Forest

To realize the genus-level classification of Batrachospermaceae based on geographical distribution and environmental factors, the classical random forest method was used first. The overall area under the receiver operating characteristic (ROC) curve (AUC score) of the random forest model reached 90.41%, indicating fairly good classification performance. Figure 4a shows the ROC curve of each genus in the random forest model. The ROC curve shows the tradeoff between specificity and sensitivity and is often used to evaluate model performance [25]. The abscissa of the ROC curve is 1 − specificity, i.e., the false positive rate (FPR), which is the proportion of negative samples that are predicted as positive among all negative samples. The ordinate is sensitivity, i.e., the true positive rate (TPR), which is the proportion of positive samples that are predicted as positive among all positive samples [26]. Normally, the area under the ROC curve ranges from 0.5 to 1. The closer the area is to 1, the better the performance of the model; the closer it is to 0.5, the worse the performance and the closer the model is to random prediction. The model has the best classification performance for the genera Nothocladus and Virescentia, followed by Sheathia and Sirodotia, and slightly lower performance for Batrachospermum and Kumanoa. Figure 4b is the confusion matrix of the random forest model on the validation set. During construction of the model, the original data set was randomly divided into a training set and a validation set in a ratio of 3:1. The confusion matrix shows how the model actually behaves when predicting each genus. For ease of observation, the numbers on the diagonal of the matrix were set to zero, so that only misclassifications remain.
Taking the genus Batrachospermum as an example, a total of 7 samples were mispredicted, including 1 in the genus Kumanoa, 1 in the Remainder, and 5 in the genus Sheathia. By analogy, samples of the genus Sheathia are easily misidentified as the genus Batrachospermum by the model, and samples of the genus Batrachospermum are also easily misidentified as the genus Sheathia. The model easily confuses these two genera, possibly because their geographical distributions and environmental factors are similar. In addition, samples of the genus Kumanoa are easily predicted to be the genus Sirodotia. Figure 4c shows the ranking of the importance of environmental factors for the classification results obtained from the random forest model. Altitude is the most important environmental factor affecting the classification of Batrachospermaceae, followed by atmospheric pressure, average relative humidity, minimum temperature, and average temperature, and to a lesser extent, maximum sustainable wind speed, maximum temperature, and average wind speed.
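The FPR, TPR, and AUC definitions above can be made concrete with a short sketch. It is not the authors' code (the study used R); it assumes a one-vs-rest setup in which one genus is treated as the positive class, and the labels and scores are hypothetical values chosen for illustration.

```python
from typing import List, Tuple


def roc_points(labels: List[int], scores: List[float]) -> List[Tuple[float, float]]:
    """Return (FPR, TPR) points obtained by sweeping a threshold over the scores."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    # Visit each distinct score from highest to lowest; samples with
    # score >= threshold are predicted positive.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points: List[Tuple[float, float]]) -> float:
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical one-vs-rest example: 1 = target genus, 0 = any other genus.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]
curve = roc_points(labels, scores)
area = auc(curve)  # 8 of the 9 positive/negative pairs are ranked correctly
```

In this toy case the area equals 8/9, the fraction of positive/negative pairs in which the positive sample receives the higher score. How the per-genus curves were combined into the overall AUC is not stated in the text, so no averaging scheme is assumed here.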

2.3.2. Results of the XGBoost

In the previous section, the random forest model was used to classify Batrachospermaceae according to geographical distribution and differences in environmental factors. In this section, the XGBoost method was used for comparison with the results of the random forest model. The overall area under the ROC curve (AUC score) of the XGBoost model for the classification of Batrachospermaceae reached 85.85%, which also indicates relatively good classification performance. Figure 5a shows the ROC curve for each genus in the XGBoost model. The model has the best classification performance for Nothocladus and Virescentia, followed by Sirodotia, slightly worse performance for Sheathia and Batrachospermum, and the worst performance for Kumanoa. Figure 5b is the confusion matrix of the XGBoost model on the validation set. During construction of the XGBoost model, the original data set was randomly divided into a training set and a validation set in a ratio of 3:1. The confusion matrix shows how the model actually behaves when predicting each genus. For ease of observation, the numbers on the diagonal of the matrix were set to zero. Taking the genus Batrachospermum as an example, a total of 12 samples were wrongly predicted, including 3 in the genus Kumanoa, 1 in the Remainder, 6 in the genus Sheathia, and 2 in the genus Sirodotia. By analogy, in the XGBoost model, samples of the genus Sheathia are easily misidentified as the genus Batrachospermum, and samples of the genus Batrachospermum are also easily misidentified as the genus Sheathia, which is similar to the results of the random forest model. In addition, samples of the genus Kumanoa were easily predicted to be the genera Batrachospermum and Sheathia.
Figure 5c is the ranking result of the importance of environmental factors on the classification results obtained in the operation of the XGBoost classification model. It can be found that, similar to the random forest model, the most important environmental factor affecting the classification results of Batrachospermaceae is altitude, followed by average relative humidity, mean temperature, minimum temperature, and atmospheric pressure, and to a lesser extent, mean wind speed, maximum temperature, and maximum sustainable wind speed.
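The zero-diagonal confusion matrices of Figures 4b and 5b can be sketched as follows. This is an illustration only: the orientation (rows as true genera, columns as predicted genera) is an assumption, since the text does not state it, and the six validation samples are hypothetical.

```python
from typing import List


def confusion_matrix(true: List[str], pred: List[str], classes: List[str]) -> List[List[int]]:
    """Count samples per (true genus, predicted genus) pair; rows are true genera."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(true, pred):
        matrix[index[t]][index[p]] += 1
    return matrix


def zero_diagonal(matrix: List[List[int]]) -> List[List[int]]:
    """Blank the correct predictions so that only misclassifications remain."""
    return [[0 if i == j else v for j, v in enumerate(row)]
            for i, row in enumerate(matrix)]


# Hypothetical validation-set labels and predictions.
classes = ["Batrachospermum", "Kumanoa", "Sheathia"]
true = ["Batrachospermum", "Batrachospermum", "Sheathia", "Sheathia", "Kumanoa", "Kumanoa"]
pred = ["Batrachospermum", "Sheathia", "Batrachospermum", "Sheathia", "Kumanoa", "Batrachospermum"]
errors = zero_diagonal(confusion_matrix(true, pred, classes))
```

With this orientation, a row of `errors` lists where the samples of one genus were wrongly sent, and a column lists which genera were wrongly predicted as that genus.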

3. Discussion

According to the results of classification prediction by two machine learning methods at the genus level of Batrachospermaceae, the classification performance of the random forest model is better than that of the XGBoost model. The overall AUC score of the random forest model reaches 90.41%, and the XGBoost model reaches 85.85%. Based on the ranking of the importance of environmental factors of the two methods, the most important environmental factors affecting the classification results at the genus level of Batrachospermaceae are altitude, average relative humidity, average temperature, and minimum temperature.
As a special environmental factor, altitude is a comprehensive reflection of many related environmental factors including climate, geology, water chemistry, and so on [27]. In addition, changes in altitude are also closely related to changes in many environmental factors such as precipitation and temperature [28,29]. Altitude may increase the heterogeneity of environmental elements including climate and may increase the spatial isolation of species, so it can reduce the similarity between regions or communities [30]. The difference in altitude shows the difference in conditions such as precipitation and temperature, so altitude is an important factor affecting the distribution difference of different genera of Batrachospermaceae.
The relative humidity is an important physical quantity that characterizes air humidity, which indicates the degree of saturation of water vapor in the air [31]. The change of relative humidity is comprehensively affected by various conditions such as circulation form, cloud cover, precipitation, wind, and topographic factors [32]. The change of relative humidity in a region is mainly affected by local temperature, precipitation, and wind speed. Batrachospermaceae are widely distributed around the world, and the difference in average relative humidity will inevitably lead to the difference in the geographical distribution of the family.
There is a close relationship between water temperature and air temperature. Changes in air temperature can cause corresponding changes in water temperature in rivers and lakes [27,33]. Because river and lake water temperatures are close to equilibrium with air temperature, air temperature is a key variable affecting water temperature in most biological systems, strongly affecting water chemistry, biochemical reactions, and the growth and death of biota [34]. Temperature conditions affect the latitude, elevation, watershed distribution, and seasonality of freshwater red algae, and some geographic patterns of freshwater red algae are also affected by the response of photosynthesis to temperature [3]. Therefore, the average temperature and minimum temperature can also affect the growth and distribution of Batrachospermaceae to a certain extent.
The importance rankings from the two machine learning methods in this study indicate that longitude and latitude are extremely important for the genus-level classification of Batrachospermaceae, and that altitude is also very important. This is consistent with Branco et al., who found that space can have a strong impact on the community composition of less dispersive red algal taxa [35]. This study classified the genera of Batrachospermaceae through environmental factors, which also indicates that the local ecological environment has a certain impact on the growth and distribution of Batrachospermaceae. In the study of Abdelahad et al. [4], some morphological changes in members of Batrachospermaceae likewise indicated that the local ecological environment affects the growth and distribution of macroalgae.
Although both the random forest model and the XGBoost model in this study have good classification performance, the prediction accuracy is limited by the size of the data set and can still be improved. As more comprehensive geographic distribution and environmental factor data for Batrachospermaceae become available, machine learning models are expected to classify Batrachospermaceae more effectively. In addition, the machine learning methods used here can rank the importance of the environmental factors that affect the genus-level classification of Batrachospermaceae, but they make it difficult to observe more specific relationships between genus-level classification and environmental factors. In future work, we will apply other methods to analyze the relationship between the genus-level taxonomy of Batrachospermaceae and its environmental factors.

4. Materials and Methods

Figure 6 shows the workflow of this study. The data set used in this study was standardized. Uniform manifold approximation and projection (UMAP) was first used to reduce the dimension of the data and to make a preliminary observation of the clustering of each genus. After that, two machine learning methods based on different ensemble strategies were used to classify the genera of Batrachospermaceae: random forest (RF), which is based on bagging, and extreme gradient boosting (XGBoost), which is based on boosting. Finally, the ROC curve was used to evaluate the classification performance of each model for each genus, the actual classification behavior of each model was examined through the confusion matrix, and the environmental factors affecting the classification results were ranked by importance.

4.1. Data Description

Based on the collected specimens, relevant literature [10,12,13,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53], and information from the AlgaeBase database (https://www.algaebase.org/, accessed on 10 September 2022), a total of 16 genera and 101 species of Batrachospermaceae were compiled in this study (Table 1). Figure 7 shows the proportion of data for each genus of Batrachospermaceae. Some genera have too few samples for analysis. Therefore, the 6 genera with more than 3 species (Batrachospermum, Kumanoa, Sheathia, Sirodotia, Virescentia, and Nothocladus) were selected for analysis, and the remaining genera were grouped into one category denoted as Remainder. Table 2 shows the number of samples per genus in the database compiled in this study.
The latitude and longitude data of the Batrachospermaceae samples’ collection site are from https://www.gpsspg.com/, accessed on 12 September 2022, and the data of 8 important environmental factors are from https://www.wunderground.com/, accessed on 12 September 2022. Longitude, latitude, and environmental factors are expressed as follows: Lat: Latitude (°), Long: Longitude (°), ASL: Altitude (m), TM: Maximum temperature (°C), T: Average temperature (°C), Tm: Minimum temperature (°C), H: Average relative humidity (%), V: Average wind speed (km/h), VM: Maximum sustainable wind speed (km/h), SLP: Atmospheric pressure at sea level (kPa). Data for the average relative humidity and average wind speed are obtained by calculating the average of the daily mean values for the same acquisition day over five years.

4.2. Methods

4.2.1. UMAP

Uniform manifold approximation and projection (UMAP) is a nonlinear dimensionality reduction technique based on manifold learning [54], which is widely used for visualization, exploratory data analysis, and clustering and classification tasks [55,56]. Besides UMAP, there are many other dimensionality reduction algorithms, such as principal component analysis (PCA), multidimensional scaling, Sammon's mapping, and t-distributed stochastic neighbor embedding (t-SNE). UMAP has been reported to significantly outperform other nonlinear dimensionality reduction methods [56]. Compared with PCA, UMAP can capture the nonlinear structure of large data sets [57]. Compared with t-SNE, UMAP is faster and has fewer parameters to tune. In addition, UMAP can shift focus between local and global structure and provides a way to infer similarities and differences between clusters based on their proximity in the latent space [58]. In this study, we used UMAP to observe the inter-continental and inter-generic clustering of Batrachospermaceae.

4.2.2. Random Forest

Random forest [59] is an efficient algorithm proposed by Leo Breiman in 2001 that builds an ensemble of decision trees based on classification and regression trees (CART). It combines the ideas of random subspaces [60] and bagging [61] and is a supervised machine learning algorithm. Its advantages are that it is simple to use, accurate, and computationally efficient, has few parameters to tune, trains quickly, can handle high-dimensional data (with many feature variables), and is relatively resistant to overfitting [62]. Owing to these advantages, the random forest algorithm has been widely used in many fields [63], including ecology, bioinformatics, and chemical informatics. In this study, the random forest model was constructed and plotted in the R language.
Construction of random forest:
(1)
Use the bootstrap method to resample the original data sample set X, and randomly generate K training sample sets X1, X2, …, XK;
(2)
Using each generated training set, grow the corresponding decision tree T1, T2, …, TK. At each internal (non-leaf) node, randomly select "mtry" attributes from the M available attributes as the candidate split set, and split the node on the attribute in this set that gives the best split;
(3)
Each decision tree grows completely without pruning;
(4)
Test and classify each decision tree on the original data sample set X;
(5)
By voting, the category with the most output from the K decision trees is taken as the category to which the original data sample set X belongs.
In building a random forest model, the two most important parameters are the number of decision trees (ntree, denoted trees below) and the number of attributes in the candidate split set (mtry). In addition, the minimum number of samples that a node must contain in order to be split further (min_n) needs to be set. After optimization, the parameters used in the construction of the random forest model in this study were set to mtry = 7, trees = 1000, and min_n = 6.
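The construction steps above can be sketched in a few dozen lines. This is a simplified illustration, not the R implementation used in the study: it assumes Gini impurity as the split criterion and midpoint thresholds, and the two-feature data set (altitude, average temperature) and the labels `genus_A` and `genus_B` are hypothetical. The parameter names `trees`, `mtry`, and `min_n` follow the text.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels, features):
    """Best (feature, threshold) among the candidate features, or None."""
    best, best_score = None, gini(labels)
    for f in features:
        values = sorted({r[f] for r in rows})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0  # midpoint between neighbouring values
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best_score:
                best, best_score = (f, t), score
    return best


def grow_tree(rows, labels, mtry, min_n, rng):
    """Steps (2)-(3): grow an unpruned tree, sampling mtry features per node."""
    if len(labels) < min_n or len(set(labels)) == 1:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority class
    features = rng.sample(range(len(rows[0])), mtry)
    split = best_split(rows, labels, features)
    if split is None:
        return Counter(labels).most_common(1)[0][0]
    f, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return (f, t,
            grow_tree([r for r, _ in left], [y for _, y in left], mtry, min_n, rng),
            grow_tree([r for r, _ in right], [y for _, y in right], mtry, min_n, rng))


def predict_tree(tree, row):
    """Step (4): walk from the root to a leaf and return its class."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree


def random_forest(rows, labels, trees=25, mtry=1, min_n=2, seed=0):
    """Step (1): draw one bootstrap sample per tree and grow the tree on it."""
    rng = random.Random(seed)
    forest = []
    for _ in range(trees):
        idx = [rng.randrange(len(rows)) for _ in rows]
        forest.append(grow_tree([rows[i] for i in idx],
                                [labels[i] for i in idx], mtry, min_n, rng))
    return forest


def predict_forest(forest, row):
    """Step (5): majority vote over all trees."""
    return Counter(predict_tree(t, row) for t in forest).most_common(1)[0][0]


# Hypothetical samples: [altitude (m), average temperature (°C)].
rows = [[100.0, 20.0], [150.0, 22.0], [120.0, 21.0],
        [2000.0, 8.0], [2200.0, 6.0], [2100.0, 7.0]]
labels = ["genus_A"] * 3 + ["genus_B"] * 3
forest = random_forest(rows, labels, trees=25, mtry=1, min_n=2, seed=0)
```

Calling `predict_forest(forest, [130.0, 21.0])` then returns the class chosen by most trees. A production model would also report out-of-bag error and variable importance, which this sketch omits.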

4.2.3. XGBoost

The XGBoost algorithm [64] is a tree-based model that improves on the gradient boosting decision tree (GBDT) method and adopts an additive learning model for optimization. The basic idea of the XGBoost algorithm is to continuously add new trees generated by feature splitting, with each new tree learning a new function that fits the residuals of the previous round of predictions [65]. After training, each sample falls into one leaf node of each tree according to its features, and each leaf node corresponds to a score [64]. Finally, the scores from all trees are added to obtain the predicted value of the sample. The XGBoost model has the advantages of strong generalization ability, high scalability, and fast computing speed [66].
In the XGBoost algorithm, an additive model is constructed by iteration after iteration. At each iteration, a sub-prediction model is generated to correct the prediction residual of the current model for classification events, and finally, a model consisting of multiple sub-models is constructed. As shown in Equation (1):
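$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in F \tag{1}$$
where $\hat{y}_i$ is the predicted value for the input sample $x_i$, and $F$ is the function space composed of all sub-prediction models $f_k$. The objective function is therefore defined as:
$$\mathrm{Obj} = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k) \tag{2}$$
where $l(y_i, \hat{y}_i)$ is the loss function, which measures the error between the predicted value and the actual value, and $\Omega(f_k)$ is the regularization term, indicating the complexity of the sub-prediction model generated in each iteration. The complexity of each tree is defined as follows:
$$\Omega(f) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} \omega_j^2 \tag{3}$$
where $T$ is the number of leaf nodes of the tree, $\omega_j$ is the predicted value (score) of leaf node $j$, and $\gamma$ and $\lambda$ are regularization coefficients. After a second-order Taylor expansion of the loss, the objective function in XGBoost for one iteration is defined as:
$$\mathrm{Obj} = \sum_{j=1}^{T} \left[ G_j \omega_j + \frac{1}{2} (H_j + \lambda) \omega_j^2 \right] + \gamma T \tag{4}$$
where $G_j$ and $H_j$ are the sums of the first-order and second-order gradients of the loss function over the training samples assigned to leaf node $j$.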
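For a fixed tree structure, this objective is a sum of independent quadratic functions of the leaf values $\omega_j$, so the optimal value of each leaf follows from setting the derivative to zero. The step below is the standard result from the XGBoost derivation [64], with $G_j$ and $H_j$ taken as the per-leaf sums of first-order and second-order gradients:

$$\frac{\partial\, \mathrm{Obj}}{\partial \omega_j} = G_j + (H_j + \lambda)\,\omega_j = 0 \quad\Longrightarrow\quad \omega_j^{*} = -\frac{G_j}{H_j + \lambda}, \qquad \mathrm{Obj}^{*} = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^{2}}{H_j + \lambda} + \gamma T$$

The minimized value $\mathrm{Obj}^{*}$ serves as a score for a candidate tree structure: the lower it is, the better the structure.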
The model training process is as follows:
(1)
The model starts the initial iteration, and a sub-prediction model is constructed in each iteration.
(2)
Before each iteration, calculate the first-order and second-order gradients of the loss function under each training sample value.
(3)
In each iteration, with the goal of minimizing Equation (4), a decision tree is generated as a sub-prediction model, and the corresponding prediction value ω of each leaf node of the decision tree is calculated.
(4)
After each iteration, the newly generated model is added to the previous model. After several iterations, the final prediction model can be obtained.
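The training steps above can be sketched for the simplest case. This is an illustration under stated assumptions, not the study's R implementation: it uses a binary logistic loss (the study classifies several genera), it keeps the tree structure fixed instead of searching for splits, and the data and the `lam` value are hypothetical. The `learn_rate` name follows the parameter list in the text.

```python
import math
from collections import defaultdict


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gradients(y, margin):
    """Step (2): first- and second-order gradients of the logistic loss."""
    p = [sigmoid(m) for m in margin]
    g = [pi - yi for pi, yi in zip(p, y)]
    h = [pi * (1.0 - pi) for pi in p]
    return g, h


def leaf_weights(leaf_of, g, h, lam):
    """Step (3): value of each leaf, w_j = -G_j / (H_j + lambda)."""
    G, H = defaultdict(float), defaultdict(float)
    for j, gi, hi in zip(leaf_of, g, h):
        G[j] += gi
        H[j] += hi
    return {j: -G[j] / (H[j] + lam) for j in G}


def boost(y, leaf_assignments, lam=1.0, learn_rate=0.3):
    """Steps (1) and (4): add one sub-model per iteration to the running margin."""
    margin = [0.0] * len(y)
    for leaf_of in leaf_assignments:
        g, h = gradients(y, margin)
        w = leaf_weights(leaf_of, g, h, lam)
        margin = [m + learn_rate * w[j] for m, j in zip(margin, leaf_of)]
    return margin


# Hypothetical data: 4 samples and a fixed tree structure that sends
# samples 0-1 to leaf 0 and samples 2-3 to leaf 1 in every round.
y = [1, 1, 0, 0]
structure = [0, 0, 1, 1]
one_round = boost(y, [structure])
margin = boost(y, [structure] * 10)
probs = [sigmoid(m) for m in margin]
```

After one round, leaf 0 has G = -1 and H = 0.5, so its value is 1/1.5 and the margin of the first two samples moves to 0.3 × 1/1.5 = 0.2. Further rounds keep pushing the predicted probabilities of the two groups apart.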
In this study, the construction of the XGBoost model was implemented by R language. In establishing the XGBoost model, many parameters need to be adjusted (Table 3). The optimized parameters are set in this study, as follows: mtry = 2, trees = 1000, min_n = 4, tree_depth = 11, learn_rate = 0.00046294621410231, loss_reduction = 0.0126584260009419, sample_size = 0.790578703079373.

5. Conclusions

Two different machine learning methods (random forest and XGBoost) can be used to classify the genera of Batrachospermaceae from environmental factor data, and both achieve good classification performance. The overall AUC score of the random forest model reached 90.41%, and the overall AUC score of the XGBoost model reached 85.85%. The models also provide a ranking of the importance of environmental factors. Altitude, average relative humidity, average temperature, and minimum temperature are the main factors affecting the classification results of the two models, among which altitude has the most significant influence. If more sample data and more environmental factor data can be obtained in the future, the models may yield finer and more accurate results.

Author Contributions

Conceptualization, F.W. and S.X.; methodology, Q.Y. and F.W.; software, Q.Y. and F.W.; formal analysis, Q.Y.; investigation, Q.Y., F.N., X.L., Q.L., J.L. and J.F.; data curation, Q.Y.; writing—original draft preparation, Q.Y.; writing—review and editing, F.W. and S.X.; supervision, S.X.; funding acquisition, S.X. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (32170204).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Agardh, C.A. Systema Algarum; Literis Berlingianis: Lundae, Sweden, 1824.
  2. Guiry, M.D.; Guiry, G.M. AlgaeBase. World-Wide Electronic Publication, National University of Ireland, Galway. 2022. Available online: https://www.algaebase.org (accessed on 10 September 2022).
  3. Sheath, R.G.; Vis, M.L. Red Algae. In Freshwater Algae of North America; Academic Press: Cambridge, MA, USA, 2015; pp. 237–264.
  4. Abdelahad, N.; Bolpagni, R.; Jona Lasinio, G.; Vis, M.L.; Amadio, C.; Laini, A.; Keil, E.J. Distribution, morphology and ecological niche of Batrachospermum and Sheathia species (Batrachospermales, Rhodophyta) in the fontanili of the Po plain (northern Italy). Eur. J. Phycol. 2015, 50, 318–329.
  5. Siemińska, J. Red list of threatened algae in Poland. In List of Threatened Plants in Poland, 2nd ed.; Polish Academy of Sciences, W Szafer Institute of Botany: Krakowie, Poland, 1992; pp. 7–19.
  6. Nemeth, J. Red list of algae in Hungary. Acta Bot. Hung. 2005, 47, 379–417.
  7. Sheath, R.G.; Burkholder, J.M. Characteristics of softwater streams in Rhode Island II. Composition and seasonal dynamics of macroalgal communities. Hydrobiologia 1985, 128, 109–118.
  8. Biggs, B.J.F.; Price, G.M. A survey of filamentous algal proliferations in New Zealand rivers. N. Z. J. Mar. Freshw. Res. 1987, 21, 175–191.
  9. Biggs, B.J.F. Periphyton communities and their environments in New Zealand rivers. N. Z. J. Mar. Freshw. Res. 1990, 24, 367–386.
  10. Branco, C.C.Z.; Krupek, R.A.; Peres, C.K. Distribution of stream macroalgal communities from the mid-western region of Paraná State, southern Brazil: Importance of local scale variation. Braz. Arch. Biol. Technol. 2009, 52, 379–386.
  11. Krupek, R.A.; Branco, C.C.Z. Ecological distribution of stream macroalgae in different spatial scales using taxonomic and morphological groups. Braz. J. Bot. 2012, 35, 273–280.
  12. Jimenez, J.C.; Fatjo, G.V. Survey and distribution of Batrachospermaceae (Rhodophyta) in tropical, high-altitude streams from central Mexico. Cryptogam. Algol. 2007, 28, 271–282.
  13. Carmona, J.; Bojorge-García, M.; Beltrán, Y.; Ramírez-Rodríguez, R. Phenology of Sirodotia suecica (Batrachospermaceae, Rhodophyta) in a high-altitude stream in central Mexico. Phycol. Res. 2009, 57, 118–126.
  14. Xie, S.L. Seasonal dynamics of Batrachospermum arcuatum growth and distribution in Jinci Spring, China. J. Shanxi Univ. (Nat. Sci. Ed.) 2009, 32, 596–600.
  15. Xie, S. Seasonal dynamics of Batrachospermum gelatinosum growth and distribution in Niangziguan spring, China. J. Appl. Ecol. 2004, 15, 1931–1934.
  16. Harvey, W.H. Nereis Boreali-Americana: Contributions to a History of the Marine Algae of North America; Smithsonian Institution: Washington, DC, USA, 1858.
  17. Kylin, H. Studien über die schwedischen Arten der Gattungen Batrachospermum Roth und Sirodotia nov. gen. Nova Acta Reg. Soc. Sci. Upsal. 1912, 3, 1–40.
  18. Skuja, H. Untersuchungen über die Rhodophyceen des Süßwassers. VI. Nemalionopsis shawii eine neue Gattung und Art der Helminthocladiaceen. Beih. Zum Bot. Cent. B 1934, 52, 188–192.
  19. Nan, F.; Zhao, Y.; Feng, J.; Lv, J.; Liu, Q.; Liu, X.; Xie, S. Morphological and Molecular Phylogenetic Analysis of a Lemanea Specimen (Batrachospermales, Rhodophyta) from China. Diversity 2022, 14, 479.
  20. Han, J.-F.; Nan, F.-R.; Feng, J.; Lv, J.-P.; Liu, Q.; Liu, X.-D.; Xie, S.-L. Sheathia matouensis (Batrachospermales, Rhodophyta), a new freshwater red algal species from North China. Phytotaxa 2019, 415, 255–263.
  21. Necchi, O., Jr.; Garcia, F.A.; Paiano, M.O. Revision of Batrachospermum sections Acarposporophytum and Aristata (Batrachospermales, Rhodophyta) with the establishment of the new genera Acarposporophycos and Visia. Phytotaxa 2019, 395, 51–65.
  22. Olden, J.D.; Lawler, J.J.; Poff, N.L. Machine Learning Methods without Tears: A Primer for Ecologists. Q. Rev. Biol. 2008, 83, 171–193.
  23. Sun, S.; Wang, C.; Ding, H.; Zou, Q. Machine learning and its applications in plant molecular studies. Brief. Funct. Genom. 2020, 19, 40–48.
  24. Roscher, R.; Bohn, B.; Duarte, M.F.; Garcke, J. Explainable Machine Learning for Scientific Insights and Discoveries. IEEE Access 2020, 8, 42200–42216.
  25. Chang, C.-I. An Effective Evaluation Tool for Hyperspectral Target Detection: 3D Receiver Operating Characteristic Curve Analysis. IEEE Trans. Geosci. Remote Sens. 2020, 59, 5131–5153.
  26. Hoo, Z.H.; Candlish, J.; Teare, D. What is an ROC curve? Emerg. Med. J. 2017, 34, 357–359.
  27. Hinden, H.; Oertli, B.; Menetrey, N.; Sager, L.; Lachavanne, J.-B. Alpine pond biodiversity: What are the related environmental variables? Aquat. Conserv. 2005, 15, 613–624.
  28. Fu, B. The effects of topography and elevation on precipitation. Acta Geogr. Sin. 1992, 47, 302–314.
  29. Feng, J.; Wang, X.; Fang, J. Altitudinal pattern of species richness and test of the Rapoport’s rules in the Drung river area, southwest China. Acta Sci. Nat. Univ. Pekin. 2006, 42, 515.
  30. Qian, H.; Ricklefs, R.E. Disentangling the effects of geographic distance and environmental dissimilarity on global patterns of species turnover. Glob. Ecol. Biogeogr. 2012, 21, 341–351.
  31. Wang, P.; Cheng, Q.P.; Kong, G.C.; Wang, Q. Variation characteristics analysis and forecast of relative humidity over past 53 years in Panlong river basin of Yunnan. J. Meteorol. Res. Appl. 2016, 37, 15–20.
  32. Jin, Y.H.; Lian, S.H.; Zhou, D.W.; Xu, J.B.; Peng, C. Study on change of relative humidity in semiarid region under global climate change. J. Northeast Norm. Univ. (Nat. Sci. Ed.) 2009, 41, 134–138.
  33. Cho, Y.-K.; Lee, K.-S.; Park, K.-Y. Year-to-year Variability of the Vertical Temperature Structure in the Youngsan Estuary. Ocean Polar Res. 2009, 31, 239–246.
  34. Jun, X.; Shubo, C.; Xiuping, H.; Rui, X.; Xiaojie, L. Potential impacts and challenges of climate change on water quality and ecosystem: Case studies in representative rivers in China. J. Resour. Ecol. 2010, 1, 31–35.
  35. Branco, C.C.Z.; Bispo, P.C.; Peres, C.K.; Tonetto, A.F.; Branco, L.H.Z. The roles of environmental conditions and spatial factors in controlling stream macroalgal communities. Hydrobiologia 2014, 732, 123–132.
  36. Akhtar, T.; Gilani, S.O.; Mushtaq, Z.; Arif, S.; Jamil, M.; Ayaz, Y.; Butt, S.I.; Waris, A. Effective Voting Ensemble of Homogenous Ensembling with Multiple Attribute-Selection Approaches for Improved Identification of Thyroid Disorder. Electronics 2021, 10, 3026.
  37. Vis, M.L.; Sheath, R.G.; Cole, K.M. Distribution and systematics of Batrachospermum (Batrachospermales, Rhodophyta) in North America. 8b. Section Batrachospermum: Previously described species excluding Batrachospermum gelatinosum. Eur. J. Phycol. 1996, 31, 189–199.
  38. Entwisle, T.J.; Vis, M.L.; Chiasson, W.B.; Necchi, O., Jr.; Sherwood, A.R. Systematics of the Batrachospermales (Rhodophyta)—A synthesis. J. Phycol. 2009, 45, 704–715.
  39. Kwandrans, J.; Eloranta, P. Diversity of freshwater red algae in Europe. Oceanol. Hydrobiol. Stud. 2010, 39, 161–169.
  40. Branco, C.C.Z.; Necchi, O. Distribution of stream macroalgae in the eastern Atlantic Rainforest of São Paulo State, southeastern Brazil. Hydrobiologia 1996, 333, 139–150.
  41. Eloranta, P. Freshwater red algae in Finland. Plant Fungal Syst. 2019, 64, 41–51.
  42. Chen, L.; Feng, J.; Han, X.-J.; Xie, S.-L. Investigation of a freshwater acrochaetioid alga (Rhodophyta) with molecular and morphological methods. Nord. J. Bot. 2014, 32, 529–535.
  43. Sherwood, A.R.; Vis, M.L.; Sheath, R.G. Phenology and phylogenetic positioning of the Hawaiian endemic freshwater alga, Batrachospermum spermatiophorum (Rhodophyta, Batrachospermales). Phycol. Res. 2004, 52, 193–203.
  44. Vis, M.L. Biogeography of River Algae. In River Algae; Springer: Berlin/Heidelberg, Germany, 2016; pp. 219–243.
  45. Rossignolo, N.L.; Necchi, O., Jr. Revision of section Setacea of the genus Batrachospermum (Batrachospermales, Rhodophyta) with emphasis on specimens from Brazil. Phycologia 2016, 55, 337–346.
  46. Ji, L.; Xie, S.L.; Feng, J.; Chen, L.; Wang, J. Molecular systematics of four endemic Batrachospermaceae (Rhodophyta) species in China with multilocus data. J. Syst. Evol. 2014, 52, 92–100.
  47. Shulian, X.; Zhixin, S. Taxonomy of algal genus Sirodotia Kylin (Batrachospermaceae, Rhodophyta) in China. J. Trop. Subtrop. Bot. 2004, 12, 1–6.
  48. Shulian, X.; Zhixin, S. Three new species of Batrachospermum Roth (Batrachospermaceae, Rhodophyta) in China. Chin. J. Oceanol. Limnol. 2005, 23, 204–209.
  49. Fang, K.-P.; Nan, F.-R.; Feng, J.; Lv, J.-P.; Liu, Q.; Liu, X.-D.; Xie, S.-L. Batrachospermum qujingense (Batrachospermales, Rhodophyta), a new freshwater red algal species from Southwest China. Phytotaxa 2020, 461, 1–11.
  50. Chankaew, W.; Sakset, A.; Ganesan, E.K.; Necchi, O., Jr.; West, J.A. Diversity of freshwater red algae at Khao Luang National Park, southern Thailand. Algae 2019, 34, 23–33.
  51. Xie, S.L.; Feng, J. Batrachospermum hongdongense (sect. Batrachospermum, Batrachospermaceae), a new species from Shanxi, China. Bot. Stud. 2007, 48, 459–464.
  52. Han, J.-F.; Nan, F.-R.; Feng, J.; Lv, J.-P.; Liu, Q.; Kociolek, J.P.; Xie, S.-L. Sheathia jinchengensis (Batrachospermales, Rhodophyta), a new freshwater red algal species described from North China. Phytotaxa 2018, 367, 63–70.
  53. Feng, J.; Chen, L.; Wang, Y.; Xie, S. Molecular Systematics and Biogeography of Thorea (Thoreales, Rhodophyta) from Shanxi, China. Syst. Bot. 2015, 40, 376–385.
  54. McInnes, L.; Healy, J.; Melville, J. UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv 2018, arXiv:1802.03426.
  55. Becht, E.; McInnes, L.; Healy, J.; Dutertre, C.-A.; Kwok, I.W.H.; Ng, L.G.; Ginhoux, F.; Newell, E.W. Dimensionality reduction for visualizing single-cell data using UMAP. Nat. Biotechnol. 2019, 37, 38–44.
  56. Weijler, L.; Diem, M.; Reiter, M.; Maurer-Granofszky, M. Detecting Rare Cell Populations in Flow Cytometry Data Using UMAP. Presented at the 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021; pp. 4903–4909.
  57. Rugard, M.; Jaylet, T.; Taboureau, O.; Tromelin, A.; Audouze, K. Smell compounds classification using UMAP to increase knowledge of odors and molecular structures linkages. PLoS ONE 2021, 16, e0252486.
  58. Joswiak, M.; Peng, Y.; Castillo, I.; Chiang, L.H. Dimensionality reduction for visualizing industrial chemical process data. Control Eng. Pract. 2019, 93, 104189.
  59. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32.
  60. Ho, T.K. The random subspace method for constructing decision forests. IEEE Trans. Pattern Anal. Mach. Intell. 1998, 20, 832–844.
  61. Breiman, L. Bagging predictors. Mach. Learn. 1996, 24, 123–140.
  62. Ying, Z.; Qianqian, G. Water quality evaluation of Chaohu Lake based on random forest method. Chin. J. Environ. Eng. 2016, 10, 992–998.
  63. Biau, G.; Scornet, E. A random forest guided tour. Test 2016, 25, 197–227.
  64. Chen, T.; Guestrin, C. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794.
  65. Mitchell, R.; Adinets, A.; Rao, T.; Frank, E. XGBoost: Scalable GPU accelerated learning. arXiv 2018, arXiv:1806.11248.
  66. Li, S.; Zhang, X. Research on orthopedic auxiliary classification and prediction model based on XGBoost algorithm. Neural Comput. Appl. 2019, 32, 1971–1979.
Figure 1. The global distribution of Batrachospermaceae.
Figure 2. Boxplots of environmental factors for Batrachospermaceae on different continents. ASL: Altitude; TM: Maximum temperature; T: Average temperature; Tm: Minimum temperature; H: Average relative humidity; V: Average wind speed; VM: Maximum sustained wind speed; SLP: Atmospheric pressure at sea level.
Figure 3. Clustering results of UMAP. (a) Clustering of Batrachospermaceae samples among continents; (b) Clustering among the genera of Batrachospermaceae.
Figure 4. Results of the random forest model. (a) ROC curve of each genus of the random forest model; (b) Confusion matrix for the random forest model on the validation set; (c) Ranking of environmental factor importance for the random forest model.
Figure 5. Results of the XGBoost model. (a) ROC curve of each genus of the XGBoost model; (b) Confusion matrix for XGBoost model on the validation set; (c) Ranking of environmental factor importance for XGBoost model.
Figure 6. Method block diagram of this study [36].
Figure 7. The proportion of sample data of each genus of Batrachospermaceae.
Table 1. The number of genera and species in the database compiled in this study.

| Genus | Number of Species | Genus | Number of Species |
|---|---|---|---|
| Acarposporophycos | 1 | Psilosiphon | 1 |
| Balliopsis | 2 | Sheathia | 12 |
| Batrachospermum | 24 | Sirodotia | 9 |
| Kumanoa | 25 | Torularia | 3 |
| Lympha | 1 | Tuomeya | 1 |
| Montagnia | 1 | Virescentia | 4 |
| Nothocladus | 12 | Visia | 1 |
| Petrohua | 1 | Volatus | 3 |
Table 2. The number of samples per genus in the database compiled in this study.

| Genus | Number of Samples |
|---|---|
| Batrachospermum | 61 |
| Kumanoa | 38 |
| Sheathia | 53 |
| Sirodotia | 30 |
| Virescentia | 24 |
| Nothocladus | 13 |
| Remainder | 45 |
| Total | 264 |
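The sample counts in Table 2 are uneven across genera, which matters when reading per-genus classification scores. As a minimal sketch (plain Python; the variable names are illustrative and not taken from the study's code), the share of each class can be derived directly from the counts in the table:

```python
# Sample counts per genus, copied from Table 2.
counts = {
    "Batrachospermum": 61,
    "Kumanoa": 38,
    "Sheathia": 53,
    "Sirodotia": 30,
    "Virescentia": 24,
    "Nothocladus": 13,
    "Remainder": 45,
}

total = sum(counts.values())  # should match the "Total" row of Table 2

# Fraction of the whole data set contributed by each class.
proportions = {genus: n / total for genus, n in counts.items()}

# Ratio between the largest and smallest class, a rough imbalance indicator.
imbalance_ratio = max(counts.values()) / min(counts.values())

for genus, share in sorted(proportions.items(), key=lambda kv: -kv[1]):
    print(f"{genus:<16} {counts[genus]:>3}  {share:6.1%}")
print(f"Total: {total}, largest/smallest class ratio: {imbalance_ratio:.2f}")
```

The largest-to-smallest ratio is only a rough indicator; it says nothing about how the study itself handled the imbalance.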
Table 3. Main parameters.

| Parameter | Description |
|---|---|
| mtry | The number of features randomly sampled when building each tree |
| trees | The number of decision trees |
| min_n | Minimum number of samples required to split an internal node further |
| tree_depth | Maximum depth of a decision tree |
| learn_rate | The learning rate of the ensemble, i.e., the weight reduction factor applied to each weak classifier |
| loss_reduction | Minimum loss reduction required to make a further partition on a leaf node of the tree |
| sample_size | Subsample ratio of the training instances |
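For readers who work in Python, the names in Table 3 can be related to keyword arguments they may already know. This section does not state which software was used, so the mapping below is an assumption: it pairs each Table 3 name with what appears to be the closest argument of scikit-learn's `RandomForestClassifier` and of the XGBoost scikit-learn wrapper. `PARAM_MAP` and `translate` are hypothetical helpers, and the numeric values in the usage lines are placeholders, not the study's settings.

```python
# Hypothetical mapping from the Table 3 names to Python keyword arguments.
# Assumption: the source does not say which library was used; these are
# the closest analogues, not a statement of the authors' configuration.
PARAM_MAP = {
    "random_forest": {  # scikit-learn RandomForestClassifier
        "mtry": "max_features",        # features considered per split
        "trees": "n_estimators",       # number of trees
        "min_n": "min_samples_split",  # samples needed to split a node
    },
    "xgboost": {  # XGBoost scikit-learn wrapper (XGBClassifier)
        "trees": "n_estimators",       # number of boosting rounds
        "min_n": "min_child_weight",   # closest analogue, not a sample count
        "tree_depth": "max_depth",
        "learn_rate": "learning_rate",
        "loss_reduction": "gamma",
        "sample_size": "subsample",
    },
}


def translate(params, model):
    """Rename Table 3 style parameters for the given model family.

    Raises KeyError for a model or parameter name with no mapping.
    """
    mapping = PARAM_MAP[model]
    return {mapping[name]: value for name, value in params.items()}


# Placeholder values for illustration only.
rf_kwargs = translate({"mtry": 3, "trees": 500, "min_n": 2}, "random_forest")
xgb_kwargs = translate({"tree_depth": 6, "learn_rate": 0.1}, "xgboost")
print(rf_kwargs)
print(xgb_kwargs)
```

The translated dictionaries could then be passed as keyword arguments to the corresponding estimator. `mtry` is deliberately left out of the XGBoost mapping, because XGBoost expresses column sampling as a fraction rather than a count.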
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Yang, Q.; Nan, F.; Liu, X.; Liu, Q.; Lv, J.; Feng, J.; Wang, F.; Xie, S. Association between the Classification of the Genus of Batrachospermaceae (Rhodophyta) and the Environmental Factors Based on Machine Learning. Plants 2022, 11, 3485. https://doi.org/10.3390/plants11243485

AMA Style

Yang Q, Nan F, Liu X, Liu Q, Lv J, Feng J, Wang F, Xie S. Association between the Classification of the Genus of Batrachospermaceae (Rhodophyta) and the Environmental Factors Based on Machine Learning. Plants. 2022; 11(24):3485. https://doi.org/10.3390/plants11243485

Chicago/Turabian Style

Yang, Qiqin, Fangru Nan, Xudong Liu, Qi Liu, Junping Lv, Jia Feng, Fei Wang, and Shulian Xie. 2022. "Association between the Classification of the Genus of Batrachospermaceae (Rhodophyta) and the Environmental Factors Based on Machine Learning" Plants 11, no. 24: 3485. https://doi.org/10.3390/plants11243485

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Metrics

Back to TopTop