Potential Range Map Dataset of Indian Birds

Abstract: Conservation management relies heavily on accurate species distribution data. However, distributional information for most species is limited to range maps, which may lack the resolution needed to guide conservation action or to reflect current distribution status. In many cases, distribution maps are also difficult to obtain in data formats suitable for analysis and conservation planning. In this study, we addressed this issue by developing Species Distribution Models (SDMs) that integrate species presence data from various citizen science initiatives. This allowed us to systematically construct current distribution maps for 1091 bird species across India. To create these SDMs, we used MaxEnt 3.4.4 (Maximum Entropy) as the base for species distribution modelling and combined it with multiple citizen science datasets containing information on species occurrence and 29 environmental variables. Using this method, we were able to estimate species distribution maps at a national scale and at a high spatial resolution of 1 km². Thus, the results of our study provide current distribution maps for 968 bird species found in India. These maps significantly improve our knowledge of the geographic distribution of about 75% of India’s bird species and are essential for addressing spatial knowledge gaps in conservation. Additionally, by superimposing the distribution maps of different species, we can locate hotspots of bird diversity and align conservation action accordingly.


Summary
Distributional information is crucial for species conservation planning. However, because of the vast distributional range of species and the consequences of habitat loss and climate change, it is exceedingly difficult to monitor changes in the range of most species and to plan conservation measures.
For many species, expert range maps, created from expert knowledge and secondary literature or as part of threat assessments, may effectively identify coarse ranges. At finer spatial resolutions, such as below 100 km, however, their false presence rates are very high and they significantly overstate actual distribution [1][2][3]. Range maps also often exaggerate the real distribution of a species by including regions of appropriate habitat [2][3][4]. Past studies have shown that range maps can overestimate species' distribution ranges, underlining the need for more accurate, higher-resolution range maps for conservation planning [2][3][4][5][6].
Species presence-only records from museum specimens and citizen science data [7,8] provide fine-scale information on species distribution and serve as the basis for a variety of spatial analyses. One of the most important applications of such records is in correlative species distribution models (SDMs) [9][10][11][12]. In such models, presence records have been used extensively to quantify species-habitat or species-environment relationships, identify the appropriate species niche, and predict distributions [13][14][15][16][17]. SDM tools are therefore widely used to reduce uncertainty in species distribution projections. However, these datasets are often susceptible to sampling bias [18][19][20][21]. Improving the overall accuracy of SDMs requires careful handling of presence records: assessing the impact of spatial sampling bias and reducing it through data cleaning and thinning techniques [20][22][23][24][25][26].
In this study, we aimed to use citizen science occurrence data of bird species reported in India to construct fine-scale distribution range maps at the national level using MaxEnt-based species distribution modelling. These range maps are converted into binary presence-absence raster maps for broader use. The resulting data give information on the potential geographical distribution of birds in India. Different stakeholders, such as policymakers, academics, international and local non-government organizations, government organizations, and birder groups interested in preserving, conserving, and studying Indian birds, will find this dataset valuable. Inadequate awareness of the regional distribution of avian biodiversity impedes decision-making for bird conservation in India; addressing this gap is one of the goals of making these data available. The authors have used this dataset to assess climate change impacts on Indian birds [78].

Data Description
This dataset aims to offer an easy-to-use resource that allows non-specialists from various user groups to gain fast insights into the current distribution of Indian bird species. The dataset is therefore provided in the widely used GeoTIFF raster format, which stores each map in a single file and can be opened in all common GIS software.
As an example, we present the distribution maps of four bird species in Figure 1: (a) Rufous-faced warbler (Abroscopus albogularis, (Moore, 1854)), (b) Indian courser (Cursorius coromandelicus, (Gmelin, JF, 1789)), (c) Red-necked falcon (Falco chicquera, Daudin, 1800) and (d) Spot-bellied eagle-owl (Bubo nipalensis, (Hodgson, 1836)). The dataset contains the spatial distribution data for 968 birds along with the metadata. Details such as species, sample size, and MaxEnt model validation results are included in the metadata table (Supplementary File S1). The range maps are provided in GeoTIFF raster format as presence-absence maps covering the geographic area of India. Each raster map has a resolution of approximately 1 km and uses the WGS 1984 datum.
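Because every map is a binary presence-absence grid on the same extent and resolution, maps can be superimposed to estimate species richness, as mentioned in the abstract. The following is a minimal Python sketch of that idea, not part of the published workflow. It assumes the rasters have already been read into aligned nested lists of 0/1 values (reading GeoTIFF files would need a GIS library, which is left out here); the function name `richness` and the toy grids are hypothetical.

```python
def richness(binary_maps):
    """Sum aligned 0/1 grids cell by cell to obtain a species-richness grid."""
    if not binary_maps:
        return []
    n_rows, n_cols = len(binary_maps[0]), len(binary_maps[0][0])
    total = [[0] * n_cols for _ in range(n_rows)]
    for grid in binary_maps:
        # Every grid must have the same shape (same extent and resolution).
        if len(grid) != n_rows or any(len(row) != n_cols for row in grid):
            raise ValueError("grids are not aligned")
        for r in range(n_rows):
            for c in range(n_cols):
                total[r][c] += grid[r][c]
    return total


# Toy 2 x 3 grids standing in for three species' presence-absence rasters.
species_a = [[1, 0, 0], [1, 1, 0]]
species_b = [[1, 1, 0], [0, 1, 0]]
species_c = [[0, 0, 0], [1, 1, 1]]
richness_grid = richness([species_a, species_b, species_c])
```

Cells with the highest counts in `richness_grid` would be candidate diversity hotspots under this simple stacking approach.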

Species Presence Data
We used online, open-access citizen science databases: the Global Biodiversity Information Facility (GBIF; https://www.gbif.org/ (accessed on 25 September 2021)) [79] and eBird (https://ebird.org (accessed on 25 December 2021)) [8]. These databases hold presence-only records of bird species occurring in India, compiled by citizens during bird watching. Together they comprise ~28.4 million record locations of 1344 bird species across the Indian subcontinent. We used only data from 1950 onwards to match the temporal duration of the climatic data [80,81]. We also removed inaccurate species presence records using comprehensive range maps of species compiled by BirdLife International and Handbook of the Birds of the World [82], while keeping genuine species records through expert evaluation. To decrease the risk of errors in species identification and location, we used the research-grade presence-only occurrence data of each species from citizen science platforms such as eBird and iNaturalist, where each record is reviewed by an experienced reviewer [83][84][85].

Citizen science data suffer from sampling biases [85,86]. We removed all duplicates and low-precision coordinates, and kept only unique records within each 1 × 1 km cell to match the spatial resolution of the climatic data. We further applied spatial rarefaction (thinning) to the occurrence records using the "spThin" package [87,88] in R 3.4.0 [89]. We then eliminated species with fewer than thirty independent localities (presence records) from further modelling [90,91]. Furthermore, species with smaller sampling areas (i.e., < 10,000 km²) were removed from further analysis [78]. This includes species with small range areas, e.g., small-range, pelagic, coastal, or island species.
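The per-cell deduplication and the minimum-record filter described above can be sketched as follows. This is an illustrative Python sketch only; the authors' workflow used R. It assumes a 30 arc-second grid (about 1 km at the equator) as the cell size, and the function names are hypothetical. It does not reproduce the distance-based thinning done by the R package.

```python
import math
from collections import defaultdict

CELL_DEG = 1.0 / 120.0  # assumed grid: 30 arc-seconds, roughly 1 km at the equator
MIN_RECORDS = 30        # minimum independent localities per species, as stated in the text


def cell_id(lon, lat, cell=CELL_DEG):
    """Index of the grid cell that contains a coordinate."""
    return (math.floor(lon / cell), math.floor(lat / cell))


def unique_cells_per_species(records):
    """records: iterable of (species, lon, lat). Keeps one record per cell per species."""
    cells = defaultdict(set)
    for species, lon, lat in records:
        cells[species].add(cell_id(lon, lat))
    return dict(cells)


def species_to_model(records, min_records=MIN_RECORDS):
    """Species that still have at least `min_records` unique cells after deduplication."""
    cells = unique_cells_per_species(records)
    return sorted(sp for sp, ids in cells.items() if len(ids) >= min_records)
```

Using a set of cell indices per species means repeated sightings at the same place count once, which is the effect the deduplication step aims for.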
Sampling errors or biases in geographic coordinates and incomplete information about species can seriously affect biodiversity studies and must be addressed [22][92][93][94]. We therefore used the "sampbias" package [95] in R [89] to measure how data cleaning (i.e., removing duplicate, incorrect, or incomplete records) affected sampling bias. The results suggested that our processed datasets have less sampling error or bias than the initial datasets. Figure S1 in Supplementary File S2 has further information on bias correction.
After removing the biases and inconsistencies in species presence records, the final corrected occurrence database consists of ~1.9 million independent records of 1091 terrestrial avian species out of the initial 1344 species. We used these ~1.9 million locations to develop models for the 1091 species [78,80,81].

Climate Data
The generation of SDMs depends on a variety of environmental conditions associated with the places where bird species occur. We compiled 29 environmental variables (EVs): 19 bioclimatic variables summarizing temperature and precipitation, downloaded from WorldClim 1.4 layers [96]; five variables related to topography [96]; and five variables from ENVIREM (http://envirem.github.io/, accessed on 9 September 2023) [97]. Table S1 in Supplementary File S2 lists all 29 EVs used in the SDMs. The topographic and ENVIREM variables are proximally correlated with species' physiological requirements (e.g., microclimate, edaphic conditions) [66][98][99][100][101][102]. Because MaxEnt's built-in variable selection is dependable due to L1-regularization and is insensitive to correlation among variables, we retained all 29 variables for the SDMs. Imposing additional variable selection before running MaxEnt for all species under consideration could compromise model accuracy [103,104].

Species Distribution Modelling
In this study, we used the MaxEnt 3.4.4 platform to predict species distributions [105]. MaxEnt applies a machine learning approach to presence-only data and produces reliable results [106]. To determine the model calibration region, we applied the minimum convex polygon (MCP) method to the species occurrence data with a two-degree buffer [106,107].
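A calibration region built from a minimum convex polygon plus a buffer can be sketched in Python as below. This is only an illustration under simplifying assumptions: coordinates are treated as planar degrees, the hull is computed with the standard monotone-chain algorithm, and membership in the buffered region is tested by distance to the hull rather than by constructing a buffered polygon. All function names are hypothetical, and the authors' actual GIS procedure may differ.

```python
import math


def cross(o, a, b):
    """Z-component of the cross product OA x OB (positive for a left turn)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Monotone-chain convex hull; returns vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def dist_to_segment(p, a, b):
    """Planar distance from point p to the segment a-b."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    seg_len2 = dx * dx + dy * dy
    t = 0.0
    if seg_len2 > 0:
        t = max(0.0, min(1.0, ((p[0] - a[0]) * dx + (p[1] - a[1]) * dy) / seg_len2))
    return math.hypot(p[0] - (a[0] + t * dx), p[1] - (a[1] + t * dy))


def in_calibration_region(hull, p, buffer_deg=2.0):
    """True if p lies inside the hull or within buffer_deg of its boundary."""
    n = len(hull)
    if n >= 3 and all(cross(hull[i], hull[(i + 1) % n], p) >= 0 for i in range(n)):
        return True
    return any(dist_to_segment(p, hull[i], hull[(i + 1) % n]) <= buffer_deg
               for i in range(n))
```

Treating degrees as planar units distorts distances away from the equator, so a real implementation would normally buffer in a projected coordinate system or use geodesic distances.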
We used the target-group background selection strategy [22,24,108], drawing background points from the occurrence locations of all bird species across the Indian subcontinent, to diminish the impact of spatial sampling bias [78][109][110][111]. The background data define the study's environmental dimensions, while the presence data indicate conditions likely to be associated with species occurrence.
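The idea behind target-group background selection is that background points come from places where any species in the group was recorded, so the background carries the same observer bias as the presences. A minimal Python sketch follows, assuming a 30 arc-second grid and a caller-supplied `in_region` predicate for the calibration region; both, along with the function name, are illustrative assumptions rather than the authors' implementation.

```python
import math


def target_group_background(all_records, in_region, cell=1.0 / 120.0):
    """Return centres of grid cells in which any target-group species was recorded.

    all_records: iterable of (species, lon, lat) for every species in the group.
    in_region:   function (lon, lat) -> bool selecting the calibration region.
    """
    cells = set()
    for _species, lon, lat in all_records:
        if in_region(lon, lat):
            cells.add((math.floor(lon / cell), math.floor(lat / cell)))
    # One background point per sampled cell, placed at the cell centre.
    return sorted(((cx + 0.5) * cell, (cy + 0.5) * cell) for cx, cy in cells)
```

Heavily surveyed areas contribute background points as well as presences, which is what offsets the bias; unsurveyed areas contribute neither.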
We used the "ENMeval" package [112,113] in R [89] to fine-tune the MaxEnt models, which helps in choosing the model parameters that give the best performance.
Using the checkerboard2 approach to partition the occurrence data, we performed 4-fold cross-validation. For each species, we used ENMeval to evaluate 48 candidate models, combining eight RM (regularization multiplier) values from 0.5 to 4.0 (in increments of 0.5) with six feature class (FC) combinations. The FC combinations were L, LQ, H, LQH, LQHP and LQHPT, where L = linear, Q = quadratic, H = hinge, P = product, and T = threshold.
The model we selected for each species had the lowest test omission rate and the highest validation AUC (area under the receiver operating characteristic curve) [16,88]. Details of the selected model tuning parameters are given in Supplementary File S1. We chose MaxEnt's cloglog output format because it reduces the impact of sample selection bias, which can enhance model performance [105].
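The tuning grid and the selection rule can be written out explicitly. The Python sketch below only enumerates the 8 × 6 = 48 parameter combinations and picks a best candidate from already-computed evaluation results; it does not fit any models. Ranking first by test omission rate and then by validation AUC is an assumption about how the two criteria are combined, and the dictionary keys are hypothetical.

```python
from itertools import product

REG_MULTIPLIERS = [0.5 * k for k in range(1, 9)]           # 0.5, 1.0, ..., 4.0
FEATURE_CLASSES = ["L", "LQ", "H", "LQH", "LQHP", "LQHPT"]


def tuning_grid():
    """All (regularization multiplier, feature class) combinations."""
    return list(product(REG_MULTIPLIERS, FEATURE_CLASSES))


def pick_best(results):
    """Choose the candidate with the lowest test omission rate,
    breaking ties by the highest validation AUC (assumed ordering).

    results: list of dicts with keys 'rm', 'fc', 'or_test', 'auc_val'.
    """
    return min(results, key=lambda r: (r["or_test"], -r["auc_val"]))
```

In practice, `results` would hold one entry per grid combination, filled with the cross-validated metrics for that species.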
To create binary presence/absence maps from the cloglog raster outputs, we used the 10th percentile training presence threshold [88,91,114,115]. This threshold improves species distribution estimates and reduces overprediction in the final binary maps [115]. All the data needed for the species distribution models are in Supplementary File S1.
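The thresholding step can be illustrated with a short Python sketch. It assumes the threshold is the suitability value below which the lowest-scoring 10% of training presences fall; exact percentile conventions differ between implementations, so this is an approximation rather than MaxEnt's own computation. Function names and the use of `None` for no-data cells are hypothetical.

```python
def p10_threshold(train_suitability):
    """Suitability value that leaves out the lowest-scoring 10% of training presences."""
    vals = sorted(train_suitability)
    if not vals:
        raise ValueError("no training presences")
    n_omit = len(vals) // 10  # number of lowest-scoring presences treated as omitted
    return vals[n_omit]


def to_binary(grid, threshold):
    """Convert a continuous suitability grid to 1 (present) / 0 (absent).

    Cells equal to None are treated as no-data and passed through unchanged.
    """
    return [[None if v is None else int(v >= threshold) for v in row]
            for row in grid]
```

With ten training presences, for example, the single lowest suitability value falls below the threshold and every cell at or above the second-lowest value is mapped as present.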
To measure the effectiveness of the species distribution models (SDMs), we evaluated the final models using multiple threshold-dependent and threshold-independent criteria [116][117][118][119]. We derived model training and validation AUC (AUC_TRAIN, AUC_VAL) and estimated the difference between the two (AUC_DIFF). This difference is expected to be large in overfitted models [119]. We also calculated OR_MTP (omission rate at the 'Minimum Training Presence' threshold) and OR_10 (omission rate at the 10% training presence threshold) to quantify model overfitting [88,116,117,120]. Using the R package "kuenm" [121], we calculated the AUC ratio (pAUC Ratio) based on the partial ROC performance metric. We also calculated the Continuous Boyce Index (CBI_VAL, CBI_TRAIN) for training and validation data. This index measures how much the model predictions differ from a random distribution of the observed presences across the prediction gradient [122].
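Two of these metrics are simple enough to show directly. The Python sketch below computes a rank-based AUC (the probability that a presence scores higher than a background point, with ties counted as one half) and an omission rate at a given threshold. It is illustrative only and is not the ENMeval or kuenm implementation; the pairwise loop is quadratic and suited to small examples.

```python
def auc(presence_scores, background_scores):
    """Rank-based AUC: P(presence score > background score), ties count half."""
    wins = 0.0
    for p in presence_scores:
        for b in background_scores:
            if p > b:
                wins += 1.0
            elif p == b:
                wins += 0.5
    return wins / (len(presence_scores) * len(background_scores))


def omission_rate(scores, threshold):
    """Fraction of presence scores that fall below the threshold."""
    return sum(1 for s in scores if s < threshold) / len(scores)


def auc_diff(train_scores, val_scores, background_scores):
    """AUC_TRAIN minus AUC_VAL; larger values suggest overfitting."""
    return auc(train_scores, background_scores) - auc(val_scores, background_scores)
```

Under these definitions, OR_MTP for a validation set would be `omission_rate(val_scores, min(train_scores))`, since the minimum training presence threshold is the lowest suitability among the training presences.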
We retained 968 species out of a total of 1149, namely those with AUC_TRAIN and CBI_TRAIN greater than 0.7, which indicates appropriate model performance and a good ability to discriminate between conditions in the occurrence area and those in the background area [116,123].
The resulting models of the 968 bird species had mean AUC_TRAIN = 0.86 and mean AUC_VAL = 0.85. The mean pAUC Ratio was 1.95, indicating that the models performed better than random models. We obtained mean CBI_VAL = 0.89 and mean CBI_TRAIN = 0.97, indicating excellent model performance.
Information used for model validation and evaluation of the species distribution models is provided in Supplementary File S1.

Potential Constraints and Future Directions
Our method rests on the general assumptions shared by all multi-species studies that use species distribution models. These models assume that species are in equilibrium with their environment and that all relevant climatic parameters affecting species occurrence are taken into account when computing climatic tolerance from the observed distribution of the species. The primary disadvantages of this approach include the possible omission of crucial climatic variables from the models and the potential impact of other factors, such as habitat loss, hunting, and exploitation, on the existing and future distribution of bird species. Because of this, species distribution model assumptions are frequently violated [124].
Another weakness of species distribution models is the assumption that different species respond to climate change individually. This ignores interspecies interactions, even though interactions both within and across trophic levels may significantly affect whether a particular taxon can persist in its current range or colonize new areas [125,126].
All species distribution models incorporate some degree of uncertainty. We attempted to lessen sampling bias by combining the target-group background selection strategy with rarefaction of the presence-only data, and used ~70 years of presence-only data to reduce temporal sampling disparities. This work may be regarded as among the earliest efforts in India to undertake a comprehensive assessment of bird distributions based on presence-only data. The increasing popularity of bird watching through citizen science projects, together with improvements in data quality and quantity, provides unparalleled availability of bird distribution data for various purposes. We hope that future studies will regularly update the analysis presented here as more data become available, which will help address the diverse challenges encountered in biodiversity protection.