Mapping Potential Plant Species Richness over Large Areas with Deep Learning, MODIS, and Species Distribution Models

The spatial patterns of species richness can be used as indicators for conservation and restoration, but data problems, including the lack of species surveys and geographical data gaps, are obstacles to mapping species richness across large areas. Lack of species data can be overcome with remote sensing because it covers extended geographic areas and generates recurring data. We developed a Deep Learning (DL) framework using Moderate Resolution Imaging Spectroradiometer (MODIS) products and modeled potential species richness by stacking species distribution models (S-SDMs) to ask, “What are the spatial patterns of potential plant species richness across the Korean Peninsula, including inaccessible North Korea, where survey data are limited?” First, we estimated plant species richness in South Korea by combining the probability-based SDM results of 1574 species and used independent plant surveys to validate our potential species richness maps. Next, DL-based species richness models were fitted to the species richness results in South Korea, and a time-series of the normalized difference vegetation index (NDVI) and leaf area index (LAI) from MODIS. The individually developed models from South Korea were statistically tested using datasets that were not used in model training and obtained high accuracy outcomes (0.98, Pearson correlation). Finally, the proposed models were combined to estimate the richness patterns across the Korean Peninsula at a higher spatial resolution than the species survey data. From the statistical feature importance tests overall, growing season NDVI-related features were more important than LAI features for quantifying biodiversity from remote sensing time-series data.


Introduction
Spatial patterns of species richness can provide insights for understanding and monitoring community, regional, and global scales of biodiversity [1,2]. Diverse approaches have been used to estimate species richness or biodiversity, including using drones [3], citizen science observation data [4], and remotely sensed data [5]. In applied ecology and conservation biology, species richness is an indicator at the crossroads of conservation and restoration [6]. In addition, it can support decision making, conservation strategies, and the management of natural resources worldwide [7][8][9].
Estimating and conserving biodiversity requires studies of extensive areas that are independent from administrative boundaries [10]. However, insufficient data, including the lack of species surveys and geographical data gaps, are obstacles to understanding and measuring species biodiversity [7,11,12]. In addition, data-poor regions are often under great risk of losing biodiversity, so filling the data gaps is essential to halt the ongoing declines in worldwide biodiversity [13]. Although many global species databases, such as the Global Biodiversity Information Facility (GBIF; www.gbif.org), provide species' been conducted. Novel biodiversity estimation approaches can be applied to other unreported areas where survey data are scarce.
The goal of the study was to estimate potential plant species richness for the Korean Peninsula, including North Korea, which has limited survey data, by combining deep learning with remote sensing data. To accomplish this goal, plant species richness in South Korea was estimated by combining the suitability predictions of 1574 species using species distribution models. To estimate potential plant species richness patterns for North Korea, a state-of-the-art DL approach was used to develop a species richness retrieval model by integration of the surveyed and estimated richness information collected in South Korea and multitemporal remote sensing data. Not only were we able to estimate potential plant species richness over the entire Korean Peninsula at a higher resolution than in previous efforts [33], but we identified which variables at which time periods were more important for estimating plant species richness.

Materials and Methods
Located at the eastern end of Eurasia, the total land area of the Korean Peninsula, including the islands, is 223,170 km 2 (100,210 km 2 for South Korea). The wildlife of the Korean Peninsula belongs to the Palearctic realm [35] and its climate is affected by the East Asian monsoon. Its 4 seasons have very distinct temperature and moisture patterns. The DMZ (demilitarized zone) divides the peninsula into North Korea (Democratic People's Republic of Korea) and South Korea (Republic of Korea) ( Figure 1).  Figure 2 illustrates a brief overview of this study to estimate plant species richness patterns for the Korean Peninsula, which comprised 2 parts: (1) species distribution modeling and (2) DL-based species richness modeling. First, we estimated plant species richness in South Korea by combining the 1574 species' suitability estimations from species distribution models at a resolution of 30-arcseconds. We then developed a DL-based species richness model using the species richness results from South Korea and a time-series of MODIS-driven NDVI and LAI, which were resampled to 30-arcseconds from 15-arcseconds of the original sources. Finally, we applied the DL model to estimate the plant species richness for both North and South Korea from the original MODIS-driven NDVI and LAI images (15-arcseconds). More details are described in the following sections.  Figure 2 illustrates a brief overview of this study to estimate plant species richness patterns for the Korean Peninsula, which comprised 2 parts: (1) species distribution modeling and (2) DL-based species richness modeling. First, we estimated plant species richness in South Korea by combining the 1574 species' suitability estimations from species distribution models at a resolution of 30-arcseconds. We then developed a DL-based species richness model using the species richness results from South Korea and a time-series of MODIS-driven NDVI and LAI, which were resampled to 30-arcseconds from 15-arcseconds of the original sources. Finally, we applied the DL model to estimate the plant species richness for both North and South Korea from the original MODIS-driven NDVI and LAI images (15-arcseconds). More details are described in the following sections. Overview of the present study. Potential plant species richness in South Korea calculated by combining the 1574 species' suitability estimations from species distribution models at 30-arcseconds was used as the ground truth. Multilayer perceptron (MLP) models using quarterly Moderate Resolution Imaging Spectroradiometer (MODIS)-driven Normalized Difference Vegetation Index (NDVI) and Leaf Area Index (LAI) time-series datasets in South Korea were applied to estimate the plant species richness at a 15-arcsecond resolution for the entire Korean Peninsula, including inaccessible North Korea.

Species Distribution Modeling
To estimate species richness in South Korea, we used 183,854 occurrence records of vascular plants, comprising 1574 species from South Korea's Third National Ecosystem Survey data (National Ecosystem Survey data are available at http://ecobank.nie.re.kr (accessed on 24 June 2021)). Most species observations were recorded between 2006 and 2012. Evaluating the habitat suitability for all species located in South Korea would be ideal for determining the potential species richness, but we required a minimum number of 10 occurrence records for use in modeling, since small sample sizes affect the reliability of statistical analyses [36,37]. The observation records for the 1574 species ranged from 10 to 921.
We downloaded 19 bioclimatic variables at a grid resolution of 30-arcseconds (≈900 × 900 m at the equator) from the WorldClim (http://worldclim.org/ (accessed on 24 June 2021); [38]), soil variables from the Harmonized World Soil Database at a resolution of 30 arcseconds [39], and digital elevation model (DEM) with a 90 m resolution from the SRTM (http://srtm.csi.cgiar.org/ (accessed on 24 June 2021); [40]). For elevation, we calculated the standard deviation and the range (maximum value minus minim value) of elevation at a coarser resolution of 30-arcseconds to identify the topographic changes in the surrounding areas [37,41]. We matched the resolutions of all variable layers to 30-arcseconds for further modeling processes. Among these variables, we selected 8 variables that had a Spearman's |rho| of <0.7 [42,43]. These were annual mean temperature (Bio1), mean diurnal range (Bio2), isothermality (Bio3), precipitation of wettest quarter (Bio16), precipitation of driest quarter (Bio17), topsoil silt fraction, topsoil pH, and the standard deviation of elevation.
We fitted each species' habitat suitability using the variables at a resolution of 30 arcseconds and predicted to South Korea at the same resolution level. We used the Maxent SDM [44] to estimate the plant species' habitat suitability in South Korea. Maxent is appropriate for presence-only data and is highly reliable, even when the number of samples is small [36,45,46]. However, models of species with few occurrence records can be overfitted when the number of environmental predictors is large compared with having few occurrence records [47,48]. Lomba et al. [47] proposed a strategy for modeling rare species' distributions using sets of bivariate models and averaging them using weights derived from model performance, and Breiner et al. [48] showed that ensembles of bivariate models performed better than standard models. We adopted the approach of using an ensemble of bivariate models for species with less than 50 occurrence records. In our Figure 2. Overview of the present study. Potential plant species richness in South Korea calculated by combining the 1574 species' suitability estimations from species distribution models at 30-arcseconds was used as the ground truth. Multilayer perceptron (MLP) models using quarterly Moderate Resolution Imaging Spectroradiometer (MODIS)-driven Normalized Difference Vegetation Index (NDVI) and Leaf Area Index (LAI) time-series datasets in South Korea were applied to estimate the plant species richness at a 15-arcsecond resolution for the entire Korean Peninsula, including inaccessible North Korea.

Species Distribution Modeling
To estimate species richness in South Korea, we used 183,854 occurrence records of vascular plants, comprising 1574 species from South Korea's Third National Ecosystem Survey data (National Ecosystem Survey data are available at http://ecobank.nie.re.kr (accessed on 24 June 2021)). Most species observations were recorded between 2006 and 2012. Evaluating the habitat suitability for all species located in South Korea would be ideal for determining the potential species richness, but we required a minimum number of 10 occurrence records for use in modeling, since small sample sizes affect the reliability of statistical analyses [36,37]. The observation records for the 1574 species ranged from 10 to 921.
We downloaded 19 bioclimatic variables at a grid resolution of 30-arcseconds (≈900 × 900 m at the equator) from the WorldClim (http://worldclim.org/ (accessed on 24 June 2021); [38]), soil variables from the Harmonized World Soil Database at a resolution of 30 arcseconds [39], and digital elevation model (DEM) with a 90 m resolution from the SRTM ( http://srtm.csi.cgiar.org/ (accessed on 24 June 2021); [40]). For elevation, we calculated the standard deviation and the range (maximum value minus minim value) of elevation at a coarser resolution of 30-arcseconds to identify the topographic changes in the surrounding areas [37,41]. We matched the resolutions of all variable layers to 30-arcseconds for further modeling processes. Among these variables, we selected 8 variables that had a Spearman's |rho| of <0.7 [42,43]. These were annual mean temperature (Bio1), mean diurnal range (Bio2), isothermality (Bio3), precipitation of wettest quarter (Bio16), precipitation of driest quarter (Bio17), topsoil silt fraction, topsoil pH, and the standard deviation of elevation.
We fitted each species' habitat suitability using the variables at a resolution of 30 arcseconds and predicted to South Korea at the same resolution level. We used the Maxent SDM [44] to estimate the plant species' habitat suitability in South Korea. Maxent is appropriate for presence-only data and is highly reliable, even when the number of samples is small [36,45,46]. However, models of species with few occurrence records can be overfitted when the number of environmental predictors is large compared with having few occurrence records [47,48]. Lomba et al. [47] proposed a strategy for modeling rare species' distributions using sets of bivariate models and averaging them using weights derived from model performance, and Breiner et al. [48] showed that ensembles of bivariate models performed better than standard models. We adopted the approach of using an ensemble of bivariate models for species with less than 50 occurrence records. In our study, we had 8 predictor variables, so the number of all possible bivariate predictor combinations was 28. We divided the occurrence records randomly in half and the portions of the data were used in fitting and validating a model, respectively. We calculated the AUC for each bivariate model, and the AUC (area under the curve) was used to calculate Somers' D for weighting all the possible bivariate models. Somers' D is D = 2 × (AUC-0.5), and bivariate models with a Somers' D lower than 0 (i.e., AUC < 0.5) were not used to build the ensemble range models [48]. Finally, the suitability of the species was averaged by constructing the ensembles of bivariate models 5 times. For the species with 50 or more occurrence records, we used all 8 predictor variables and fitted the models 5 times using k-fold cross-validation [49] with k = 5. We averaged the 5 results for the final suitability result for each species.

Species Richness (as a Response Variable)
Several modeling approaches, including macro-ecological models [50] and stacking species distribution models (S-SDMs) [51], have been proposed for estimating species richness. As many global species databases, such as the Global Biodiversity Information Facility (GBIF; www.gbif.org (accessed on 24 June 2021)), eBird (https://ebird.org (accessed on 24 June 2021) [52]), and iNaturalist (www.inaturalist.org (accessed on 24 June 2021)), provide species' occurrence data [14,16], S-SDMs which combine the predictions of each species' SDM to estimate species richness have become a common strategy in recent studies [53]. Although there is still no agreement as to how to stack each species' predictions, many studies have found that probability-based (raw SDM results ranging from 0 to 1) stacking produced unbiased richness that is closer to the true species richness [16,50,53,54]. We estimated the potential plant species richness of South Korea by combining the probability-based SDM results of the 1574 species.

MODIS Products: NDVI and LAI (as Input Variables)
Remote sensing data efficiently record reflected energy over extended and inaccessible areas using sensors on aircraft or spacecraft, while also providing information on the spatial variability of reflectance periodically. Over the past decade, users have had opportunities to access data acquired from a variety of Earth-observing sensors. The Moderate Resolution Imaging Spectroradiometer (MODIS) is one of the most widely used remote sensing instruments aboard the Terra and Aqua satellites (https://modis.gsfc.nasa.gov/ (accessed on 24 June 2021)). With a 2330 km swath, MODIS can observe the Earth's surface every 1 to 2 days, using an enhanced spectral resolution sensor in 36 spectral bands. Along with low-level MODIS products, diverse higher-level products for land, atmosphere, cryosphere, and ocean color applications have been distributed to the science and applications community [55]. Two MODIS sensors have sun-synchronous, near-polar circular orbits and observe the same regions 3 hours apart (Terra and Aqua are timed to cross the equator at 10:30 (from north to south) and 13:30 (from south to north), respectively).
Productivity measures calculated from remote sensing technology may be highly correlated to species richness [25]. Remote sensing indices provide a quantitative proxy of environmental phenomena as an alternative to in-situ measurements. Indices such as NDVI and leaf area index (LAI) that emphasize spectrally unique characteristics of the targets of interest have been developed for application of remote sensing data to vegetation dynamics. We used NDVI and LAI products, which may have high correlations with species richness [56], among the various regularly produced MODIS land products quantifying vegetation richness. NDVI quantifies vegetation by utilizing the spectral characteristics of green vegetation, which strongly reflects in the near-infrared range but absorbs in the red or blue wavelength region, and the value always falls between −1 and 1 (note: a negative NDVI indicates dead plants or inorganic objects, while a positive value indicates live plants, with those close to 1 being healthier) [57]. The NDVI is retrieved from daily atmosphere-corrected bidirectional surface reflectance using a specific compositing method to remove low quality data, and is widely used in all ecosystems and climates, and in natural resource management studies [58]. Leaf area index, which is defined as the one-sided green leaf area per unit ground area, is an important structural property of vegetation and characterizes plant canopies, representing foliage vigor [59]. While NDVI can be formulated by a simple difference/sum ratio, LAI is measured by a direct and destructive method, which is time-consuming [60]. LAI is mainly derived from a look-up-table that exploits the spectral information of the MODIS red and near-infrared surface reflectance, and a 3D radiative transfer equation [61].
In this study, we used the 16-day composite NDVI product (MOD13A1 and MYD13A1) at a resolution of 15 arcseconds and a combined 8-day composite LAI product (MCD15A2) at 15 arcseconds. However, since the MCD15A2 data contain many missing values because of high cloud cover, and to match the temporal resolution of the NDVI product, we converted them to 16-day composite images. To be consistent with the species richness results in South Korea, these NDVI and LAI data were then resampled at a spatial resolution of 30 arcseconds. The data were downloaded from the Land Processes Distributed Active Archive Center (LP DAAC; https://lpdaac.usgs.gov/ (accessed on 24 June 2021)) and processed using the MODIS Reprojection Tool to convert the map projection and to create mosaic images. To cover the entire Korean Peninsula, 3 images (h27v04, h27v05, and h28v05 of MODIS's Sinusoidal Tile Grid) were composited.

Deep Learning-Based Species Richness Model
An artificial neural network mimics the learning process of the human brain and then is expanded to deep learning using deeper network architecture [26]. From an applied perspective, DL networks are able to automatically learn arbitrary complex mappings from inputs to outputs, and they offer promise for problems with complex nonlinear and multivariate regression.
Among the various types of DL architectures, multilayer perceptron (MLP) is the most practical and typical type of feedforward neural network and is effective for solving regression problems [62]. However, MLP is now deemed insufficient for modern advanced computer vision tasks. As an alternative, there is increased interest in convolutional neural networks (CNNs), which are more effective at analyzing image data [63], since they use spatial information and sparsely (partially) connected layers instead of MLP's fully connected layers. Although we used co-registered MODIS-derived NDVI and LAI images collected in 2009 as inputs, there were many missing values at different pixel locations and times due to high cloud cover. In CNN architecture, these missing values could hinder both model training and proper inference. Therefore, in consideration of our data characteristics and the objectives of this study, pixel-wise MLP was used to estimate species richness. Multilayer perception comprises more than one perceptron, which consists of an input layer to receive the signal, an output layer that makes a decision, and one in between those two, and is capable of learning any continuous mapping function from the hierarchical or multilayered structure of the networks. Figure 3 illustrates the network architecture of MLP used in this study.  The input variables are the compiled MODIS-derived NDVI and LAI time-series (x = [x 1 , x 2 , ⋯, x , x 1 , x 2 , ⋯, x , = 23]) collected in 2009, the middle year of the period when the species surveys were conducted, and which may have a high correlation with species richness. The output layer (y) is a value of species richness derived from S-SDMs. At each pixel location where no missing values existed, we obtained approximately 500,000 samples of x and y in South Korea for model training and testing. TensorFlow (https://tensorflow.org (accessed on 24 June 2021)), which is an end-to-end open source platform for DL, was used to develop our species richness estimation model.
In developing DL models, there are several hyperparameters that need to be tuned prior to training the model, but there are common sets of rules or heuristics governing parameter tuning. After iterative grid search parameter tuning, using a small subset of our data, 5 hidden layers ( , = 1, 2, ⋯ ,5) were used, and the number of neurons (n) in ) collected in 2009, the middle year of the period when the species surveys were conducted, and which may have a high correlation with species richness. The output layer (y) is a value of species richness derived from S-SDMs. At each pixel location where no missing values existed, we obtained approximately 500,000 samples of x and y in South Korea for model training and testing. TensorFlow (https://tensorflow.org (accessed on 24 June 2021)), which is an end-to-end open source platform for DL, was used to develop our species richness estimation model.
In developing DL models, there are several hyperparameters that need to be tuned prior to training the model, but there are common sets of rules or heuristics governing parameter tuning. After iterative grid search parameter tuning, using a small subset of our data, 5 hidden layers (H j , j = 1, 2, . . . , 5) were used, and the number of neurons (n) in each hidden layer was 64, 128, 256, 128, and 64, respectively. The hidden layers were stacked one by one to transfer the input signals to the deeper layer, which could extract hidden and unknown features related to species richness. We chose the RMSprop stochastic descent optimizer with the default parameters [64], and a rectified linear unit (ReLU) as a nonlinear activation function because of its promising performance in the literature [65]. Dropout layers with a rate of 0.2 were added to each hidden layer to prevent model overfitting [66]. The L1 loss function, also known as the least absolute error, was used because it is not sensitive to outliers and is intuitive. The detailed MLP model structure is described in Figure 3a.
To develop a geographically more robust estimation model due to the lack of survey data in North Korea, stacked species richness training data for South Korea were divided into quadrants (NE, NW, SE, and SW). For each quadrant of South Korea, we developed 4 MLPs using 3 subsets as the training set and the other set as the testing set. For example, as illustrated in Figure 3b, MLP 1 used the NW, SW, and SE datasets as the training sets and the SW dataset as the testing set. In MLP 2, the NE, SW, and SE datasets were used to train and the NW dataset was used to test. A randomly selected 20% of the training samples were used as validation data to determine the models' performance and to minimize overfitting for unseen data. Therefore, we obtained 4 potential plant species richness models according to the geographical quadrants of South Korea. To evaluate the statistical model performance, estimations from the test set of each MLP model were combined, indicating that the combined statistical accuracies were calculated from the unseen data. For the final inference of the potential plant species richness of the entire Korean Peninsula at a resolution of 15 arcseconds, the resolution of the input NDVI and LAI products, and the results from all MLP branches were ensembled.
The training process was performed for a number of iterations in which all the training data were exposed to the network until the loss function reached its minimum value. The model score reached its maximum after approximately 5000 iterations with a NVIDIA Titan X GPU (3584 CUDA cores). The number of trainable parameters was 85,569 and the computational run-time was approximately 4 h with a training batch size of 1024.
Deep learning approaches have shown promising results in many scientific applications [67,68] but do not always guarantee better outcomes [69,70]. Therefore, we compared the performance of the proposed DL model with a RF regression model as a baseline, since RF showed reasonable results and is popular in various machine learning applications [71].
To determine the best values of the hyperparameters (number of trees, maximum tree depths, and the maximum number of features) in the RF model, a grid search was used.
Due to the unique characteristics of neural networks, which solve problems by exploiting the hidden relationships inherent in multiple input variables, it was difficult to physically quantify the importance of the input variables. As an alternative, we performed a statistical feature importance test (SFIT) to explain which feature had the greatest significance in the species richness retrievals and to determine the optimized features in an operational retrieval system. For the SFIT, a single feature was randomly shuffled, while all the other features were kept constant. We iterated this process by changing the test variable. The feature importance shows the extent to which the model performance decreased with random shuffling. In this study, we used the root mean square error (RMSE) as the performance metric.

Independent Validation of Species Richness
Finally, we evaluated our two model results, S-SDMs and DL species richness, by comparing the species richness obtained from independent tree plot datasets from the Korea Forest Service [72]. We calculated species richness using grids with a 10 km resolution after sensitivity analysis at different resolutions. We then calculated the overall correlation and local correlations between the species richness from the independent datasets and the results from the S-SDMs and DL species richness model. For direct comparisons, the model results were resampled at 10 km by pooling a median value of the corresponding pixels within a 10 km grid and comparing them with the results using the independent datasets. To calculate the local correlations, we defined a 3 × 3 square focal area (30 km × 30 km) for each grid, using a moving window to define the spatial ranges for correlations.

Species Richness Estimation from S-SDMs
The mean AUC and mean Boyce index of 1574 plant species were 0.77 (±0.11 SD) and 0.73 (±0.18 SD), respectively (Table S1, see Supplementary Materials). When we stacked the probability-based SDM results of 1574 species, the 30-arcsecond cell with the highest potential number of species richness had 1167 species. This was in the middle of Jeju Island, the largest island located in southernmost South Korea (Figure 4). S-SDMs predicted the highest species richness to be in the northeastern part of South Korea (Gangwon Province) and Jeju Island, and the results correspond with previous literature and other studies [33,73].
Remote Sens. 2021, 13, x FOR PEER REVIEW 9 of 20 the model results were resampled at 10 km by pooling a median value of the corresponding pixels within a 10 km grid and comparing them with the results using the independent datasets. To calculate the local correlations, we defined a 3 × 3 square focal area (30 km × 30 km) for each grid, using a moving window to define the spatial ranges for correlations.

Species Richness Estimation from S-SDMs
The mean AUC and mean Boyce index of 1574 plant species were 0.77 (±0.11 SD) and 0.73 (±0.18 SD), respectively (Table S1, see Supplementary Materials). When we stacked the probability-based SDM results of 1574 species, the 30-arcsecond cell with the highest potential number of species richness had 1167 species. This was in the middle of Jeju Island, the largest island located in southernmost South Korea (Figure 4). S-SDMs predicted the highest species richness to be in the northeastern part of South Korea (Gangwon Province) and Jeju Island, and the results correspond with previous literature and other studies [33,73].  Table 1 summarizes the mean absolute error (MAE), RMSE, bias (also known as the mean error), and the Pearson's correlation coefficient of RF and the proposed DL models. The p-values of all models from a two-sided Student's t-test were <0.05, which means that there was a statistically significant difference between the S-SDMs-derived and RF/DL model-derived species richness at a confidence level of 95%. As shown in Table 1 Table 1 summarizes the mean absolute error (MAE), RMSE, bias (also known as the mean error), and the Pearson's correlation coefficient of RF and the proposed DL models. The p-values of all models from a two-sided Student's t-test were <0.05, which means that there was a statistically significant difference between the S-SDMs-derived and RF/DL model-derived species richness at a confidence level of 95%. As shown in Table 1 A combined scatterplot of plant species richness from the S-SDMs (PSR S-SDMs ; x-axis) and DL models (PSR DL ; y-axis) for the test sets, with the number of pixels denoted by the color density, shows that most data points were located around the one-to-one line (Figure 5a; black solid line), although some pixels were underestimated or overestimated. Overall, our model yielded an RMSE of 38.58 and a MAE of 28.81, with a near unity slope (0.95) in the linear regression model. The bias (mean difference between PSR S-SDMs and PSR DL ) was calculated as 10.21, indicating that the proposed model slightly underestimated the potential plant species richness compared with the reference S-SDMs species richness, but the differences were small and may be handled by adjusting the slope and offset parameters in post-processing if necessary. A remarkable Pearson's correlation coefficient (0.98) was achieved.

Deep Learning-Based Species Richness Estimation Model Using Remote Sensing Data
A histogram (Figure 5b) of PSR S-SDMs and PSR DL , with a 20 PSR bin width, shows a relative weakness in estimating PSR values between 400-600 and PSRs larger than 850. The Kullback-Leibler divergence (D KL ) is often used to observe the statistical similarity of the two distributions. Although D KL is not a distance metric, it is close to zero if two distributions are similar. In this histogram, the D KL was 0.0043, indicating that the two distributions are nearly the same.
We applied the proposed model to a time-series of MODIS NDVI and LAI images acquired at a 15-arcsecond resolution to estimate plant species richness for both North and South Korea (Figure 4, right), whereas Figure 4 (left) is the reference S-SDM results for South Korea at a resolution of 30 arcseconds. For South Korea, the DL-predicted richness maps exhibited high agreement with the reference S-SDM results, which is similar to the statistical comparisons in Figure 5. There are some missing values in the predicted map ( Figure 4, right), but these are mostly areas of large cities or missing values in the input NDVI or LAI images due to clouds. Based on quantitative and qualitative analysis of the S-SDMs results in South Korea, the MODIS LAI and NDVI time-series data were successfully integrated into the proposed DL model to estimate species richness. The distribution of species richness values near the border between North and South Korea showed continuous values. Forested areas in the eastern part of the Korean Peninsula showed high species richness, while low concentrations were in urban areas in the west (Figure 1). Cold highland areas in the north also showed relatively low richness values.  Figure 6 shows the increased RMSE values obtained by the SFITs of the input features, which were calculated after 10 replications of the tests, with the Pearson's correlation coefficients of each feature (red lines). A more important feature has a higher increased RMSE value. Table 2 summarizes the average feature importance according to the feature sources and seasons, with the correlation coefficients. We divided the input features according to the seasonal characteristics of the Korean Peninsula. Overall, NDVIrelated features were generally more important than LAI features (32.64 versus 42.32). The features in the growing season (i.e., spring) exhibited greater feature importance than those of the other seasons in both the LAI and NDVI feature groups. Highly correlated features generally had higher statistical feature importance values, but were not proportional to each other.  Figure 6 shows the increased RMSE values obtained by the SFITs of the input features, which were calculated after 10 replications of the tests, with the Pearson's correlation coefficients of each feature (red lines). A more important feature has a higher increased RMSE value. Table 2 summarizes the average feature importance according to the feature sources and seasons, with the correlation coefficients. We divided the input features according to the seasonal characteristics of the Korean Peninsula. Overall, NDVI-related features were generally more important than LAI features (32.64 versus 42.32). The features in the growing season (i.e., spring) exhibited greater feature importance than those of the other seasons in both the LAI and NDVI feature groups. Highly correlated features generally had higher statistical feature importance values, but were not proportional to each other.

Independent Validation of Species Richness
Species richness at a 10-km resolution using the independent South Korea datasets was high in the middle of Jeju Island, in southern mountainous areas, and in the northeastern part of South Korea (Figure 7). This is similar to the species richness results from the two species richness models (Figure 8). The overall Pearson's correlation between the two species richness values was 0.49 for the species richness from S-SDMs and 0.47 from the DL model. From the local correlation map, we identified that positive correlations appeared in almost all regions (Figure 8). We found that the negative correlations along coastal and boundary lines, especially with the DL model results, mostly had to do with missing values from the input MODIS images.

Independent Validation of Species Richness
Species richness at a 10-km resolution using the independent South Korea datasets was high in the middle of Jeju Island, in southern mountainous areas, and in the northeastern part of South Korea (Figure 7). This is similar to the species richness results from the two species richness models (Figure 8). The overall Pearson's correlation between the two species richness values was 0.49 for the species richness from S-SDMs and 0.47 from the DL model. From the local correlation map, we identified that positive correlations appeared in almost all regions (Figure 8). We found that the negative correlations along coastal and boundary lines, especially with the DL model results, mostly had to do with missing values from the input MODIS images.

Discussion
Biodiversity data are lacking in many parts of the world [24,25]. In order to overcome this limitation, deep learning, which is one of the techniques of machine learning, was applied to multitemporal satellite remote sensing data that are globally available. By DL combining the modeled richness estimations and remote sensing data, we increased the resolution of potential plant species richness maps for the Korean Peninsula to 15 arcseconds, including the large unreported area of North Korea, and maintained statistical accuracy. Since species richness represents a basic indicator for effective conservation and restoration, our results can provide resource managers and planners with maps that show areas in need of active conservation or restoration, even in regions that have not been sufficiently investigated. Threats to biodiversity are global problems [74], but the spatial gaps in scientific information make actions less reliable, because of socioeconomic status, history, culture, geography, and scientific interests [14]. Our novel data fusion approach shows potential for overcoming these data limitations.

Deep Learning-Based Species Richness Estimation
Based on our results in Table 1, the proposed DL-based model exhibited a better estimation capacity for plant species richness from remote sensing time-series data than the widely used RF model. Pau et al. [75] identified that there was no direct relationship between NDVI and species richness using the mean NDVI values of a 10-year period. However, it is worth noting that the DL model successfully captured the inherent relationships between 46 remote sensing-based input features and the species richness from the S-SDMs. An ensemble of spatially validated models using quarterly geographic datasets confirmed that the DL model was spatially robust, at least within South Korea. In light of the size and ecosystems of the Korean Peninsula and the spatial robustness of the proposed model, it could be used to estimate the spatial patterns of plant species richness in North Korea, as shown in visual inspections of Figure 4 (right).
Plant species richness may be related to a variety of factors, including temperature, light and water availability, and environmental heterogeneity [76]. We found that the average feature importance of LAI and NDVI in spring were the two most important feature groups, marking the onset of vegetation growth, while the fall months, when deciduous plants start dropping their leaves, were the most insignificant months for LAI (Table 2). However, NDVI features in the fall and winter months were more critical than the fall and

Discussion
Biodiversity data are lacking in many parts of the world [24,25]. In order to overcome this limitation, deep learning, which is one of the techniques of machine learning, was applied to multitemporal satellite remote sensing data that are globally available. By DL combining the modeled richness estimations and remote sensing data, we increased the resolution of potential plant species richness maps for the Korean Peninsula to 15 arcseconds, including the large unreported area of North Korea, and maintained statistical accuracy. Since species richness represents a basic indicator for effective conservation and restoration, our results can provide resource managers and planners with maps that show areas in need of active conservation or restoration, even in regions that have not been sufficiently investigated. Threats to biodiversity are global problems [74], but the spatial gaps in scientific information make actions less reliable, because of socioeconomic status, history, culture, geography, and scientific interests [14]. Our novel data fusion approach shows potential for overcoming these data limitations.

Deep Learning-Based Species Richness Estimation
Based on our results in Table 1, the proposed DL-based model exhibited a better estimation capacity for plant species richness from remote sensing time-series data than the widely used RF model. Pau et al. [75] identified that there was no direct relationship between NDVI and species richness using the mean NDVI values of a 10-year period. However, it is worth noting that the DL model successfully captured the inherent relationships between 46 remote sensing-based input features and the species richness from the S-SDMs. An ensemble of spatially validated models using quarterly geographic datasets confirmed that the DL model was spatially robust, at least within South Korea. In light of the size and ecosystems of the Korean Peninsula and the spatial robustness of the proposed model, it could be used to estimate the spatial patterns of plant species richness in North Korea, as shown in visual inspections of Figure 4 (right).
Plant species richness may be related to a variety of factors, including temperature, light and water availability, and environmental heterogeneity [76]. We found that the average feature importance of LAI and NDVI in spring were the two most important feature groups, marking the onset of vegetation growth, while the fall months, when deciduous plants start dropping their leaves, were the most insignificant months for LAI (Table 2). However, NDVI features in the fall and winter months were more critical than the fall and winter LAI features. The high coverage of evergreen forests in the Korean Peninsula (approximately 30%) may explain the relatively high feature importance of fall and winter NDVI [77].
The relationship between LAI and NDVI is not clear. As seen in Figure 6, LAI or NDVI, at a specific time, particularly DOY 129-161, had relatively high agreement with the species richness values. However, exploiting a single LAI or NDVI feature was limited in terms of correctly retrieving the species richness values. Highly correlated features did not always guarantee high feature importance. For example, LAI 337 , NDVI 193 , and NDVI 209 had very low correlation coefficients (less than 0.1), but they were statistically important in the DL model. NDVI features from DOY 225-257 were highly ranked features in the single feature correlation tests, but their feature importance was ranked at the bottom of the graph. Thus, it is difficult to say that there is a linear relationship between species richness and the correlation coefficient of a single feature. However, incorporating multiple features into the state-of-the-art DL framework proposed in this study could define the inherent relationships between species richness and remote sensing time-series data, and then estimate species richness with the accuracy level of the S-SDMs model. It seems that 46 features may be redundant, but the lack of negative feature importance in Figure 6 indicates that each feature plays a specific role in improving the species richness model in hidden networks.
Although previous studies investigated biodiversity using high-resolution airborne/ spaceborne LiDAR and multispectral/hyperspectral data for local areas [29][30][31]78,79], data acquisition is challenging. Our proposed model used easy-to-access MODIS imagery to develop a species richness estimation model over extended and inaccessible regions by combining surveyed data, and further improved the spatial resolution of the species richness map.

Limitations and Recommendations
We validated our S-SDM-and DL-based species richness models using independent tree plot data. Although the independent data we used were only for trees, we found a moderate positive correlation between these species richness measures. As it can be seen in Figure 7, the area that has not been surveyed in the tree plot data is widely displayed, using smaller grids than 10 km produces many empty areas, and using larger grids generalized the richness pattern excessively [80]. There are widespread areas that have not been surveyed and it is difficult to evaluate the species richness of a large area depending only on surveys. In addition, in the local correlation map (Figure 8), most of the areas with negative correlations appear around coastal or administrative boundaries. This might be due to the differences in the resolution and the calculation methods of the two species richness sets. Therefore, it is necessary to use appropriate remote sensing data as inputs to the richness model to compensate for data scarcity according to the estimation's purpose. Moreover, we may be able to further validate the S-SDM-and DL-based plant richness models in different ways. For example, area-specific plant surveys, targeted systematic surveys such as the forest tree survey used here, and herbarium or global species presence records such as GBIF and iNaturalist could be used to compare the S-SDM and DL models' predictions at certain spots, either for subsets of the entire plant species richness (e.g., trees) or for the presence of individual species. While, in this study, we compared the overall pattern of species richness from each model and data source, it would also be possible to use a point-based or small area-based approach to assess the overall accuracy of our modeled approaches.
We were able to obtain a very detailed level of species richness, but it was greatly affected by the resolution of the remote sensing data we used. Although MODIS may be the best sensor for mapping at a global scale, it is not appropriate for observing local areas due to the low spatial resolution. Especially, NDVI shows higher correlation with species richness than LAI. NDVI can be readily derived from most remote sensing sensors, including sensors with better resolution but a narrower swath such as Landsat, WorldView, and KOMPSAT (Korean Multi-Purpose Satellite). Although the revisit cycle of highresolution data is less frequent than MODIS, it may be possible to develop high-resolution species richness for local areas by optimizing our proposed model and the results of the feature importance test.
Regarding the DL framework, our proposed models were based on the most typical type of neural network, using pixel-wise inputs due to the characteristics of the data used in this study. As mentioned in Section 2.4, general image-based CNN architectures may not be appropriate for our data; applying one-dimensional CNN [81,82] or recurrent neural networks [83] in the temporal domain, or graph convolutional networks for handling graph structures between samples [84] should be interesting topics, as these are more advanced approaches. Additionally, super-resolution using a generative adversarial network [85] would further improve the spatial resolution of species richness map.
In addition, we quantitatively tested the proposed model using quarterly datasets acquired in 2009. Thus, we could not test the temporal robustness of the model using data from other years. One of the advantages of remote sensing data is it allows continuous data acquisition over extended areas. If we investigate and test multiple-year data, the proposed model can be further extended to an operational system to estimate and monitor plant species richness periodically. The GEO BON (the Group on Earth Observations Biodiversity Observation Network) consortium has developed the Local Biodiversity Intactness Index by combining the annual PREDICTS global biodiversity survey database and 1 km landuse maps [86]. Our study was able to provide more detailed information using a similar approach to GEO BON, and periodical application of our study is expected in terms of practicality. Moreover, our approach overcomes the limitations of SDMs' environmental space, which can be explained only by the environmental variables used, by combining these with proxy values representing site degradation. Since many sites are not in pristine condition, especially in some parts of North Korea, developing a modification factor associated with site conditions could produce additional insights.

Conclusions
Species richness information provides basic clues for effective conservation and restoration, but many areas lack sufficient data for its estimation. We combined machine learning techniques in ecology and computer science to analyze remote sensing big data and estimate species richness across data-limited and inaccessible areas. Our DL-based model exhibited better estimation capacity for plant species richness than the widely used RF model. Overall, we found that NDVI related features in growing season are generally more important than LAI features for estimating plant species richness. Our methods are readily applicable to other regions, but the results may depend on the structure of the remote sensing data used. Thus, future studies using high-resolution remote sensing time-series data could provide more detailed information for effective conservation and restoration planning.