Integrating Multi-Sensors Data for Species Distribution Mapping Using Deep Learning and Envelope Models

: The integration of ecological and atmospheric characteristics for biodiversity management is fundamental for long-term ecosystem conservation and drafting forest management strategies, especially in the current era of climate change. The explicit modelling of regional ecological responses and their impact on individual species is a signiﬁcant prerequisite for any adaptation strategy. The present study focuses on predicting the regional distribution of Rhododendron arboreum, a medicinal plant species found in the Himalayan region. Advanced Species Distribution Models (SDM) based on the principle of predeﬁned hypothesis, namely BIOCLIM, was used to model the potential distribution of Rhododendron arboreum . This hypothesis tends to vary with the change in locations, and thus, robust models are required to establish nonlinear complex relations between the input parameters. To address this nonlinear relation, a class of deep neural networks, Convolutional Neural Network (CNN) architecture is proposed, designed, and tested, which eventually gave much better accuracy than the BIOCLIM model. Both of the models were given 16 input parameters, including ecological and atmospheric variables, which were statistically resampled and were then utilized in establishing the linear and nonlinear relationship to better ﬁt the occurrence scenarios of the species. The input parameters were mostly acquired from the recent satellite missions, including MODIS, Sentinel-2, Sentinel-5p, the Shuttle Radar Topography Mission (SRTM), and ECOSTRESS. The performance across all the thresholds was evaluated using the value of the Area Under Curve (AUC) evaluation metrics. The AUC value was found to be 0.917 with CNN, whereas it was 0.68 with BIOCLIM, respectively. The performance evaluation metrics indicate the superiority of CNN for species distribution over BIOCLIM.


Introduction
The Himalayan ecosystem is experiencing a continuous temperature rise, and the impact of climate change can be seen very clearly in the Himalayas, which demonstrates the need to monitor the Himalayan ecosystem even more [1,2]. The Himalayas are home to several medicinally and economically important plant species, Rhododendron species with botanical name Rhododendron arboreum Sm. from the family Ericaceae is among one of them [3][4][5]. It is widely spread in Himalayas, South India, and Sri Lanka [4]. With tremendous biological significance, it can sustain itself in the fragile ecotone between the alpine and subalpine biomes. Despite being identified as a medicinally important plant species, the geographical distribution and geospatial modelling of Rhododendron arboreum have not been explored to its fullest and needs to be deciphered, which will further benefit the formulation of conservation strategies [6]. The literature review of past studies offered isolated information on the distribution pattern, frequency of the species, genetic diversity, and net productivity of Rhododendron arboreum, particularly in Himalayan regions. The studies primarily focus on the areas that are found over the Mussoorie hills of the Uttarakhand [7], Himachal Pradesh [8], and Garhwal division of Himalayas [9] and mainly showcased the threat of habitat fragmentation and frequency degradation of Rhododendron arboreum over the Himalayan region. Therefore, a holistic and cohesive research approach is needed to track the distribution of these species so as to generate baseline data for future research programmes, management practices, and conservation policies.
Throughout the centuries, researchers have observed and documented the linear relationship between species distribution and their local physical ecosystem and the role of climate and altitudinal variation in their occurrence, which can be found in the available scientific writings of the early 19th century [10][11][12]. Identifying the parameters for establishing the relationship between species and the environment is the core step for simulating the geographical distribution of any species [13,14]. Species Distribution Modelling (SDM) or Ecological Niche Modelling (ENM) is widely used in biogeography [15], macroecology [16], and biodiversity [17] research to model the geographical distribution of species. It is a statistical tool that performs habitat suitability analysis using high-dimensional digital data through regression or machine learning algorithms. Distribution modelling is achieved by establishing a linear or nonlinear relationship between regional climatic and ecological conditions. As each species is adapted to specific tolerance zones (also known as niches), SDM helps to identify the environmental constraints and simulates the n-dimensional input data to produce habitat suitability [18].
Generally, the SDM techniques take in geocoded input data of species distribution and establish a relationship with the regional environmental and climatic conditions to map its distribution throughout an area of interest [14,19,20]. One of the most commonly used SDM is BIOCLIM, which is also used in the present study. BIOCLIM is one of the earliest developed SDM algorithms, and it is mainly used by ecologists due to its easy-to-use algorithm, which is more accessible than other models. BIOCLIM was first introduced by reporting bioclimatic profiles and distribution maps of 73 species. After introducing BIOCLIM, several researchers applied the model in different bioclimatic conditions and received better results [21][22][23][24]. They found BIOCLIM to be quite consistent. The reason behind BIOCLIM's popularity is credited to its predefined assumptions and less complex training algorithm.
The theoretical aspect of SDM considers that a given species is likely to be found in a single privileged ecological niche under an ecosystem of unimodal distribution [25]. Still, an actual scenario is more complex and diverse than a hypothetical niche. To overcome this limitation, deep neural networks would be a good option, as their architecture favours high order multi-dimensional feature interactions without constraining their functional form [26]. Deep neural networks have shown significantly better results in image classification, and there are several cases where single-layered neural networks have been used for SDM [27,28]. Recently, a study by [29] has shown a better deep neural network prediction ability in SDM that has been found to be even better performing than the conventional ecological SDM models. A detailed discussion of convolution neural networks in handling and learning non-linear features are given in later sections.
In this paper, a study was conducted to map the distribution of Rhododendron arboreum using linear models, namely BIOCLIM, and Convolutional Neural Network (CNN) architecture has been proposed to a establish nonlinear relationship between input parameters. A total of 16 environmental and climatic parameters acquired from different satellites are used as input parameters, which influence the distribution of a particular species to a greater extent in the given study area.

Study Area
The current study was conducted within four districts of Uttarakhand, India, namely Chamoli, Almora, Bagheshwar, and Pithoragarh, which is situated at the foothills of the Himalayas. Geographically, the study area lies between 28 •  The study area is shown in Figure 1. Being situated in the Himalayan Mountain range, the variation in ground elevation is very sharp, varying from 416 to 7801 m from mean sea level for the present study area. Due to the variation in elevation and unique climatic settings, this region is immensely rich with thousands of different plant species and has a remarkable diversity in flora and fauna [30,31]. This region experiences evenly distributed rainfall throughout the year and has an average temperature of 23.4 • C [32]. According to terrestrial ecoregion classifications that have been previously performed, the ecoregions found in the present study area are tropical and subtropical moist broadleaf forest, coniferous forest, temperate broadleaf and mixed forest, temperate conifer forest, and montane grassland and shrublands, and approximately 20% of the area is covered with snow throughout the year.

Target Species and Occurrence Data
Rhododendron species are found in the Himalayan range, which exhibits vast biological significance in the fragile ecotone of the alpine and subalpine zones [33]. Among the variety of Rhododendrons, Rhododendron arboreum is also a common species in the Western Himalayan region and can be found at an elevation range of 1200-4000 m above mean sea level. It shows some characteristics of invasive species and has a high medicinal value that makes the study of its distribution and the impact of climatic and ecological parameters on its growth very important [34]. Not only does this species carry medicinal value, but it is also very highly valued economically [35]. Medicinally, it is found to possess anti-cancer, immunomodulatory, anti-inflammatory, hepatoprotective, antidiabetic, antioxidant, antidiarrheal, adaptogens, antimicrobial, and antinociceptive properties, among others [35]. Economically, it has found its usage in squash, local brew, jellies, jams, and sherbet (rhodojuice). The juice from the leaves is used to encounter bed bugs bites. The wood from Rhododendron can be used to make tools used in agriculture such as 'Khukri' handles, etc. [36]. The leaves are used for decoration purposes in houses as well as in temples [37]. The wood is also used for the preparation of charcoal and can be used as a fuel. Some studies reported that consuming the squash made from the flowers can serve as a treatment for mental retardation [38,39], and flowers along with the roots and bark were found to be effective in treating digestive, heart, and respiratory complications [40]. The leaves of the plant burnt with juniper leaves are used to cleanse the air [41]. Menstrual cramps and heartaches are treated with the juice and squash made out of these flowers [42]. The extracts of the plants have also been utilized in curing nasal bleeding [43], headache, fever, rheumatism, wounds, dysentery [44], cough, skin diseases, liver malfunction, piles, worms, and jaundice as well as for preliminary cancer treatment [45].
Since phonological responses are better observed during the flowering season, the ground data sampling was done in September 2019, March 2020, and March 2021 at different elevation ranges [46]. The amount of ground data available for certain species, and particularly for Rhododendron arboreum, are limited, which makes it crucial to undertake the possibility of bias based on the sample size. As per the study by [47], they found that the SDM models with a smaller size consistently performed poorly and suggested that for reliable accuracy, the sample size must be greater than 30. In purview of the studies conducted by several researchers and considering the local geography, a total 65 homogenous patches of Rhododendron arboreum were identified and geotagged within the study area using the handheld Garmin GPS (Global Positioning System) with a horizontal accuracy of 95% ± 9.3 m. According to Wisz et al. [47], a small sample size (>30) is generally useful in exploratory modelling, and considering that the present study is conducted at a regional scale for a single target species, 65 sample points are enough for regional distribution modelling. The observed multiple species occurrence within the pixels were removed by applying spatial rarefication, which gave a single occurrence point per pixel. The photographs of the Rhododendron arboreum captured during field sampling are given in Figure 2.

Environmental Variables
The environmental variables used in this study include different bioclimatic variables acquired from various active satellites. Recent developments in satellite sensors have enabled access to several ecological and climatic information derived from satellite observations at higher spatial and temporal resolutions. Current studies have used MODIS (https: //modis.gsfc.nasa.gov/data accessed on 5 June 2021), Sentinel-2 (https://sentinel.esa.int/ web/sentinel/missions/sentinel-2/data-products accessed on 5 June 2021), Sentinel-5P (https://sentinel.esa.int/web/sentinel/missions/sentinel-5/data-products accessed on 5 June 2021), ECOSTRESS (https://ecostress.jpl.nasa.gov/data accessed on 5 June 2021), and SRTM (https://www2.jpl.nasa.gov/srtm accessed on 5 June 2021) satellite observations as an input parameter for the SDM algorithms. The satellite products are acquired during the sampling period (September 2019, March 2020, and March 2021) and then averaged, so that it can be used as a single product for each parameter. The Leaf Area Index (LAI) [48] and a fraction of Photosynthetically Active Radiation (fPAR) were retrieved from the MODIS data product. The motive behind using both LAI and fPAR establishes their direct relationship with the surface photosynthesis, evapotranspiration, and net primary productivity of the plants that is further utilized to estimate the water cycle processes, terrestrial energy, biophysical, and biochemical properties of the regional vegetation. Sentinel 2 optical data were used to estimate NDVI and EVI values at a fine scale of 10 m, which helped in understanding the vegetation status throughout the study area and were also used to mask the non-vegetative lands. Although there are a number of vegetation indices, which can play a crucial part in identifying the species distribution including Modified Soil Adjusted Vegetation Index (MSAVI) as well as other soil and ground surface adjusted indices, the target specie Rhododendron Arboreum is found in the dense forest cover of Himalayas and is independent of any soil and surface distortion; therefore, only NDVI and EVI were considered for SDM. Sentinel-5P is one of the most recent satellite missions from the European Space Agency (ESA), which is a combined mission of the European Union. It can take atmospheric measurements with the high spatial and temporal resolution and is utilized to retrieve several atmospheric parameters. Presently, eight different Sentinel-5P parameters have been used to establish a relationship between the existence of target species and to model species distribution which includes, the Aerosol Absorption Index (AAI), CO density, water vapor column, columnar NO 2 , columnar O 3 level, SO 2 density, surface albedo, and tropospheric Formaldehyde (HCHO) density [49]. Evapotranspiration (ET) and Land Surface Temperature (LST) are also included to study the response of the target species with regional land processes. It was retrieved from the ECOSTRESS satellite. Elevation is an important parameter when studying the species distribution. It has a huge role in species growth and distribution due to the changing conditions at varying altitudes; therefore, SRTM Digital Elevation Model (DEM) data are used to retrieve the elevation factor at a spatial resolution of 30 m. A significant factor that is also included is terrestrial ecoregions to acquire the regional biome information to understand the diversity of the Himalayan ranges that coexist with each other. The ecoregion the vector data provided by [50] was used, which was attributed into 14 classes, 5 of which are used in the current study area that is found in the foothills of the Himalayas, namely tropical and subtropical moist broadleaf forests, tropical and subtropical coniferous forests, temperate broadleaf and mixed forests, temperate coniferous forests, and montane grassland/shrubland. The input parameters and their source mission are listed in Table 1.
All of the input parameters were used based on their linear and non-linear correlation with the occurrence data. The parameters are the yearly average for year 2020, considering the cloud free pixels only. Additionally, a correlation matrix was plotted to interpret the relationship between each parameter.

Ecological Niche Modelling
Broadly, there are two methods to model the ecological niche: a mechanistic approach and a correlative approach [47]. The mechanistic approach deals with the physiological limiting mechanisms of species intolerance in ecological conditions. In this approach, the growth parameters are taken into consideration, such as soil pH, nitrogen content in the soil, incoming solar radiation, carbon dioxide intake by plants, etc. [51]. The correlative approach, also known as an empirical approach, uses the environmental variables that are reasonably expected to affect the growth of a particular species. The basis of the correlative approach is the interrelation between the observed parameters within the identified species location, which is used to establish a relationship between the parameters to model the species distribution for an entire area [52]. Having run the algorithm, a species distribution map can be generated using the established relationship. At this stage, the model's ability can also be tested using a set of species occurrence data that was not used in model development via suitable statistical parameters [53]. The SDM used in the present study only includes the presence of the BIOCLIM model and CNN based SDM. The representation of the conventional methodology is given in Figure 3.

BIOCLIM
BIOCLIM is one of the first SDM algorithms to be introduced by [54]. The BIOCLIM is a widely used SDM due to its easy-to-use graphical user interface and wide application area. After the introduction of BIOCLIM, many researchers published their work using this algorithm, including the work of [55], in which they discussed the application of BIOCLIM in building ecographic regions and ways to improve the estimation of the ecological distance between patches in meta-population landscape dynamics. The study by [55] also pointed out some pros and cons, which include the error associated with the climatic parameters, defining the ranking of factors, and taxonomic uncertainty. Early BIOCLIM applications occurred between 1984-1991 in terms of ecology and conservational biology and were addressed by [56].
BIOCLIM is based on a bioclimatic envelope model, which is widely used to predict the potential species distribution, which does not account for any possible interrelation between variables. Being an intuitively simple model, it assigns equal weight to each variable and produces binary predictions [57,58]. To predict the probability species distribution, BIOCLIM compares the values of input variables of known locations to the values of unknown pixels. The closer the value of an unknown pixel to the available pixel, the more suitable the location is for a particular species to be found. BIOCLIM is simple and intuitive. It is susceptible to over prediction and as specified, does not account for the interactions between input variables [59].

CNN
A deep neural network is a multi-layered model that can learn complex nonlinear relationships between the input parameters. The current study is an attempt that has been made to use the Convolutional Neural Network (CNN) architecture for SDM. During the last two decades, there has been a huge increase in deep learning and advanced machine learning algorithms in a variety of research fields [60,61]. Deep learning conducts high-level data abstraction using a hierarchical architecture consisting of multiple interconnected layers with multiple artificial neurons. The neurons receive the input values and multiply them with the specific weight obtained through optimization. Thereafter, the weighted sum is transformed through the nonlinear activation function to further pass it to the neurons of the next layer. The CNN architecture is represented in Figure 4 and the pseudocode is given in Table S1. Through this procedure, the network will learn through the optimal set of weights between the neurons in the adjoining layers and will maximize the network performance, which would help the neurons focus on specific patterns in the data. In the final layer, the parameters are passed through the SoftMax function, which transforms them into probabilities that sum to 1, as shown in Equation (1).
where K is the total number of classes, s(x) is a vector with the weight of each class for instance x, and σ(s(x)) k is the calculated probability of x belonging to class k as per the assigned weight. Although there have been several works that have been conducted with SDM using shallow networks containing a single hidden layer, their performance is not as good as that of the multi-layer networks [62]. The authors of [63], used a multi-layered network for distribution modelling and achieved a better performance compared to the single-layered networks. A Convolutional Neural Network (CNN) is a type of deep learning-based model for processing multidimensional data that follows a grid pattern [60]. The model is developed in such a way that the algorithm learns and adapts to the spatial hierarchies of features by itself from the lower to the higher levels of the pattern. Mathematically, it is composed of three layers or building blocks: convolution, pooling, and fully connected layers. Feature extraction is conducted using the first two layers and mapping the extracted features to the output is conducted by the third layer.
Convolution is used for feature extraction, in which a kernel is applied to an input tensor. A feature map is thus obtained through the product of kernel elements and tensor input. The procedure is then repeated on multiple kernels to obtain random feature maps that represent different feature extractors. The hyperparameters involved in convolution operations are the size and number of kernels. The size could be anything from 3 × 3 to 5 × 5 to 7 × 7, and the kernel could be chosen randomly.
A pooling layer offers downsampling functionality that decreases the dimensionality of the feature maps to achieve translation invariance to the alterations and the biases incorporated and thus helps in reducing the number of learnable parameters. There are two types of pooling operations, namely Max Pooling and Global Average Pooling [64]. The first one extracts speckles from the input feature maps and offers maximum values in each of the speckles and leaves the remaining values unattended. The second one downsamples a feature map with a size equaling product of height and width into an array of a one cross one by averaging the elements of each feature map by retaining the depth of the feature map. The advantage of Global Average Pooling lies in reducing the number of learnable parameters along with offering the CNN with variable sized input.
The features extracted by the convolution layers followed by downsampling by the pooling layers are mapped using a subset of fully connected layers to the final output of the network. The fully connected layer is executed with the ReLU function [65,66]. Mathematically, the Rectifier can be described as: where x is the input to the neuron. A unit employing the Rectifier is known as the Rectifier linear unit (ReLU). The performance evaluation of the model is conducted by tuning the learnable parameters, kernels, and weight by a loss function through the forward propagation followed by updating these parameter values through an optimization algorithm either by backpropagation or gradient descent.

Model Validation
All of the input parameters are resampled in a single grid size of 100 m and are converted into the same file format. Out of the in-situ occurrences of Rhododendron arboreum at ground locations, only 70% of the data were used in calibrating the model, whereas the remaining 30% of the data were used to test the model. In any type of modelling, performance evaluation is an essential task. In terms of validation of species probability distribution, the AUC (Area Under ROC (Receiver Operating Characteristics) Curve) is one of the most used performance evaluation metrics [67]. The primary application of the ROC curve is in the threshold independent assessment that characterizes the model performance at various discrimination thresholds. This application was found in raster-based studies focusing on predicting land use and land cover, species distribution modelling, risk assessment, and other probability mappings.
The AUC is generated by plotting the True Positive Rate (TPR) versus the False Positive Rate (FPR) at varied thresholds. The TPR is also known as sensitivity, probability of detection, or recall, and the FPR is also known as the probability of false alarm. Therefore, an accurate model will generate a ROC curve away from the 1:1 line, and a less accurate model will have a ROC curve towards the 1:1 line. The range of the AUC varies from 0 to 1. The closer the value is to 1, the better the prediction is. The plots can be described mathematically as: TPR or Sensitivity or Recall or Probability of Detection = TP TP + FN × 100 (3) FPR or Probability of false alarm = 1 − Specificity (5) Here, TP stands for true positive, and FP is false positive, where specificity is also termed the true negative rate. The TPR provides the percentage of correctly predicted instances of species other than rhododendron, whereas specificity provides the percentage of correctly predicted instances of rhododendron distribution.
Thereafter, Cohen's kappa is also calculated to support the AUC value. Being one of the most popular performance evaluation indices, it is considered to be less complex and dependent on prevalence. The kappa value ranges from −1 to +1, where +1 indicates the perfect agreement. Other than kappa, the True Skill Statistic (TSS) is also incorporated, as it corrects the unimodel dependency of kappa. TSS is widely used in ecology, and it can be explained as TSS = Sensitivity − Specificity − 1 (6)

Results
The spatial species distribution is highly associated with regional environmental conditions, climatic variability, and land use [68,69]. The species distribution is simulated using the correlation models between the dependent as well as independent parameters. These models were generated through the presence-only data, presence/absence data, and pseudo-presence locations of the species. A total of 16 input parameters were taken from different satellite observations to model the potential distribution of Rhododendron arboreum confined to the current study area. The in situ species locations were recorded to be used as training and testing data and to retrieve the corresponding ecological and climatic satellite observations [70,71]. To understand the overall objective of the work, analysis was conducted on the distribution of the input parameters followed by the intercorrelation between them.

Assessing the Distribution of Input Parameters
As several input parameters were used from different satellite observations, a statistical downscaling was first performed to achieve a common spatial resolution of 100 m to be given as the model input. The statistically resampled images of different input parameters are shown in Figure 5. The yearly average was taken for each parameter to incorporate the overall variation throughout the year. AAI was found to range between −2.196 to 0.071, in which the higher values were distributed where the higher altitudes have an upper limit of 7771 m and a lower limit of 379 m from mean sea level. This drastic variation in elevation permits rare species to grow in an extraordinary ecosystem, and it is the main reason for the higher species heterogeneity in this region. The EVI and NDVI derived from the Sentinel-2 optical data varies from −0.19 to 0.77 and −0.28 to 0.83, respectively. The lower and lower-middle altitude locations tend to have higher NDVI/EVI values than the higher altitudes. As the LST has a linear relationship with altitude, a drastic variation in the upper and lower limits of LST can be found to be in the range of 242.2 to 306.1 Kelvin, respectively, which reflects the presence of glaciers on top of the Himalayan mountains. Due to the presence of dense vegetation at the lower altitudes, the values of water vapour, fPAR, LAI, and ET are also higher in the foothills and lower in the upper Himalayas. At the same time, atmospheric constituents like ozone, nitrogen dioxide, and carbon monoxide also show high values at the lower altitude, where the vegetation density and the presence of anthropogenic factors contributing to their concentration are relatively higher. However, the concentration of SO 2 and HCHO are very low and are evenly distributed throughout, which directly relates to industrial and transportation activities, which is very low in these areas. The SO 2 varied from −0.0002 to 0.0004 mol/m 2 , and HCHO varied from −0.00005 to 0.00025 mol/m 2 .

Understanding Parameter Intercorrelation
A correlation matrix plot was drawn as depicted in Figure 6 to understand the relationship between the input parameters. Total fifteen parameters except for the biome layer, which is in the vector form, were used in the correlation matrix. A highly linear or nonlinear relationship shows a relation/dependency or non-relation/non-dependency between the parameters. The representation can be explained in terms of the values varying from −1 to +1. The −ve value represents the negative relationship, and the +ve value describes the positive relationship. The depiction of the negative relationship is in orange, where the higher correlation value is visualized through the steeper circular shape, and vice versa for the positive relationship. No relationship is represented by the correlation value of zero that is represented by a perfect circular shape, and the colour becomes whitish. It can be observed that many parameters are related to each other. A highly linear relationship exists between Sentinel-5p based ozone and carbon monoxide as well as with ozone and water vapour with a correlation value of 0.99. It can also be observed that the linear relationship of DEM with NDVI, LST, and water vapour is very high, which shows the variation in the local geometry and the influence of regional ecological and climatic parameters. A lower correlation value is observed between the atmospheric parameters and the vegetation indices, especially for EVI, LAI, and ET.

Spatial Distribution of Rhododendron arboreum
To simulate the potential distribution of Rhododendron arboreum, linear and a nonlinear SDM were used in the current study using 16 a priori input parameters. The input parameters were taken from different satellite observations followed by data resampling to match their spatial resolution to achieve a standard resolution of 100 m. The probability distribution is classified into four classes, namely very low, low, high, and very high, according to their distribution. The presence, based only on the BIOCLIM model, predicts the probability distribution of species using a linear correlation, as shown in Figure 7a. Apart from a well-established presence only algorithm, a deep learning-based convolution neural network model was used to establish a nonlinear relationship between the input parameters to predict the probability of species distribution. A CNN based architecture was used to train the model according to the known locations and was fitted on different layers. The perfect combination of layers and activation functions was then used to predict the species distribution Figure 7b.

Model Validation and Comparison
The accuracy or performance of the probability distribution of the incorporated models was compared using the AUC, TSS, and kappa coefficient that characterize the performance of the models with an in situ validation dataset. Table 2 shows the statistical performance for BIOCLIM and CNN based the probability distribution of Rhododendron arboreum. A lower AUC value was obtained by the BIOCLIM models, which is 0.639, based on the in situ points reserved for the validation purpose. The AUC values for the CNN-based probability distribution was found to be 0.917, which is considered to be very good compared to the BIOCLIM that was given an AUC of 0.68. In addition to the AUC value, TSS and kappa, with values 0.652 and 0.94, respectively, also gave support to the applicability of CNN in comparison to conventional SDM's such as BIOCLIM. These values showcase the superiority of deep learning models for species probability distribution using the given set of ecological and bioclimatic parameters.

Discussion
Understanding the dynamics and distribution of the forest ecosystem is a crucial step towards biodiversity conservation. The recent advancements in statistical machine learning models and the availability of reliable datasets help researchers build policies towards conservation and sustainable solutions to achieve this conservation. The SDMs came into existence in the mid-1980s with very limited ecological and climatic datasets, and from there, they have evolved with the regular integration of newer and more reliable datasets.
Recently, a great variety of SDMs have been used to model the distribution of species, in which the most popular are the ones that are based on the non-linear modelling approach followed by statistical and rule-based methods. Among machine learning models, Maxent is widely used due to its user-friendly interface and simple background algorithm. Maxent accuracy, as per [72], has provided robust predictions with an AUC of 0.75 and for BIOCLIM, an AUC of 0.65, whereas a similar result is achieved by [73] with an AUC of 0.73 and 0.66 for Maxent and BIOCLIM, respectively. Another well performing algorithm is the Boosted Regression Tree (BRT), which performs slightly better than Maxent. As per [74], they archived an overall AUC of 0.81 using BRT. Statistical and rule-based methods are among the conventional approaches that are not consistent with the changes in regional ecology.
The BIOCLIM model is one of the oldest yet most used SDMs due to its simple algorithm and easy parameterization [75]. It is based on a linear bioclimatic envelop model that assigns equal weight to each variable and offers a binary prediction. A pervious study indicated the use of BIOCLIM to predict species distribution, and similar studies were conducted in the past. The introduction of machine learning and its integration in SDM revolutionized the probability distribution modelling approach. The variety in modelling approaches and the increased number of datasets has made this model one of the most globally accepted SDMs. There is a growing concern for the establishment of the nonlinear relationship between the bioclimatic parameters through innovative approaches such as deep learning-based models. It has been observed that the BIOCLIM model is overestimates the species distribution and the higher probability of species occurrence at the higher altitude. Moreover, it has been observed that the distribution pattern of the predicted Rhododendron arboreum distribution using CNN architecture is quite different at some places than from conventional BIOCLIM models. The current work proposed deep learning-based CNN architecture for probability distribution modelling and proved to perform better than the traditional BIOCLIM model. There was an underestimation of species distribution observed in CNN than BIOCLIM. The distribution probability in CNN was precise, and it was found that the majority of Rhododendron arboreum is distributed in the southern part of the study area where the vegetation density is high. Some high probability patches at the higher altitudes are commonly predicted by both the models.
However, the scalability of the current outcome needs to be tested on a global scale. Apart from this, some limitations, namely uncertainties associated with the input data, the assigned weights, and some important biotic parameters, need to be handled for future work. An ensemble of all of these available methods needs to be explored in the future to establish the linear and nonlinear relationship between the dependent parameters to predict any one species out of the multiple species that are available in a location. Additionally, there is a need to perform sensitivity analysis to understand variable impact on the target variable, instead of forcing all of the variables into the model. This would reduce the algorithm complexity and computational demand.
In spite of achieving significant accuracy and popularity in the field of correlation modelling, there is still not a single algorithm that can be recommended. Deep neural networks are showing more promising results, but they are still to be tested in different ecological settings.

Conclusions
This study is a novel approach towards establishing a CNN architecture and testing the performance of CNN in SDM and its comparison with other well established SDMs namely, BIOCLIM. This study was conducted on the foothills of the Himalayas, where the altitudinal variation is very drastic and varies from 416 to 7801 m above mean sea level. This high-altitude Himalayan ranges constitutes a heterogeneous ecosystem and is home to many rare/endangered, medicinally, and economically important plant species. One of the major economically and medicinally important plant species, Rhododendron arboreum, was tracked and mapped in this study using different SDMs. Based on its occurrence and several ecological and bioclimatic satellite-based observations, the probability distribution of the Rhododendron arboreum was established. The CNN based probability distribution model outperformed the presence only based BIOCLIM model with an AUC score of 0.917. The CNN based prediction was also found to be more precise and accurate and with significantly less overestimation, whereas the AUC values of the BIOCLIM model were found to be 0.68 with a high overestimation. The superiority of CNN implies the role of nonlinear parameters in predicting the probability of species distribution. The scalability of the current solution on a global scale, the addition of some other important parameters, and an ensemble of all of the available SDMs need to be explored in future work. An increase in the presence of the Rhododendron species is an indication of strong soil retention, which, in turn, is fruitful for other vegetation to grow and flourish. Apart from this, an increase in the green vegetation fraction and a decrease in shade fraction was found to be associated with a higher likelihood of Rhododendron. This increased likelihood using the models would offer researchers an opportunity to understand the vegetation distribution and to contribute to the restoration of the ecology and biodiversity conservation in the protected areas so that a provision could be established for sustainable ecosystem services.