Assessing OpenStreetMap Completeness for Management of Natural Disaster by Means of Remote Sensing: A Case Study of Three Small Island States (Haiti, Dominica and St. Lucia)

Abstract: Over the last few decades, many countries, especially islands in the Caribbean, have been challenged by the devastating consequences of natural disasters, which pose a significant threat to human health and safety. Timely information on the distribution of vulnerable populations and critical infrastructure is key for effective disaster relief. OpenStreetMap (OSM) has repeatedly been shown to be highly suitable for disaster mapping and management. However, large portions of the world, including countries exposed to natural disasters, remain incompletely mapped. In this study, we propose a methodology that relies on remotely sensed measurements (e.g., Visible Infrared Imaging Radiometer Suite (VIIRS), Sentinel-2 and Sentinel-1) and derived classification schemes (e.g., forest and built-up land cover) to predict the completeness of OSM building footprints in three small island states (Haiti, Dominica and St. Lucia). We find that the combinatorial effects of these predictors explain up to 94% of the variation in the completeness of OSM building footprints. Our study extends the existing literature by demonstrating how remotely sensed measurements could be leveraged to evaluate the completeness of the OSM database, especially in countries with high risk of natural disasters. Identifying areas that lack coverage of OSM features could help prioritize mapping efforts, especially in areas vulnerable to natural hazards and where current data gaps pose an obstacle to timely and evidence-based disaster risk management actions.

[...] as well as with the distribution of geographical features, such as the density of the road network. Further, we show that remotely sensed spectral indices signifying the distribution of vegetation (NDVI, SAVI) and built-up land cover (NDBI, UI) are also correlated with the distribution of OSM building footprints. While previous studies propose to utilize vegetation spectral indices (e.g., NDVI) to estimate the completeness of urban green space mapping in OSM [92], we show that these indices are also significantly correlated with the coverage of OSM building footprints.


Introduction
Over the last few decades, many countries have been challenged by the devastating consequences of natural disasters, which pose a significant threat to human health and safety and affect vulnerable communities and critical infrastructure globally. Every year, natural disasters affect close to 160 million people worldwide [1], destroying physical, biological and social environments, undermining food security, and causing global losses that amount to over 100 billion dollars [2]. The frequency of natural disasters has been steadily increasing since 1940 [3] and over the next century, climate change will likely amplify the number and severity of such disasters [4].
While the impacts of natural disasters are felt worldwide, some countries have been more vulnerable to different types of disasters than others [5]. For example, in 2017, Puerto Rico, Sri Lanka and Dominica topped the list of the countries most affected by natural disasters such as significant precipitation, floods and landslides. Caribbean island countries are especially exposed to [...] the same overall area (as of 2018, there were nearly 60,000 mappers contributing to Missing Maps) [25].

Assessing OSM Completeness and Accuracy
Although OSM road network data is estimated to exceed 80% completeness in relation to the world's roads and streets [26], in general, the coverage and completeness of OSM features (including building footprints) vary significantly, not only between countries but also within countries. For example, completeness of coverage of remote and rural areas is often lower than that of highly populated urban areas [27], and the coverage of developing countries tends to be lower than that of developed countries [28–32]. These differences are in part due to societal factors, such as population distribution and population density, distance to major cities and the location of contributing users [29,33–37].
With the increased utilization of VGI, including OSM, for disaster preparedness and response, various methodologies have been proposed to assess the quality and the accuracy of the collected data [12]; for example, in terms of data completeness, logical consistency, positional, thematic, semantic and spatial accuracy, temporal quality and usability [28,38–42]. Several approaches have been proposed to assess the completeness of the OSM database and the completeness of the street networks [32,43], the land use and the building footprints [44]. The completeness of the coverage can be assessed by comparing the OSM mapped features with external datasets, for example, national administrative data [28,44–47]. Such data vary by country and are not always made available, especially in developing countries.
In this study, we propose a methodology that utilizes remotely sensed observations to estimate the coverage of OSM mapped features, specifically to identify gaps in the completeness of OSM building footprints. In the past, expensive satellite imagery and limited computational power only allowed analysis of small geographical contexts. This model is being replaced thanks to the accessibility of publicly available and free satellite data that capture every location on Earth every few days. The availability of daytime (e.g., Sentinel-2, Landsat) and nighttime (e.g., DMSP Operational Linescan System (OLS) or the Visible Infrared Imaging Radiometer Suite (VIIRS)) satellite imagery, together with advancements in the capabilities of cloud-based computational platforms, now allows for analyzing Land Use and Land Cover (LULC) characteristics of Earth across a greater geographic and temporal scale. Land cover refers to the attributes of the Earth's land surface and its immediate subsurface (e.g., biota, soil, topography, surface water, groundwater and human structures). Land use refers to the purpose for which humans exploit the land cover [48]. Because remotely sensed observations typically capture the unique reflectance characteristics of physical objects on Earth, most remote sensing applications focus on detection and classification of Earth's land cover characteristics. Differentiating between types of land use (which typically do not hold unique physical characteristics) remains challenging. With respect to OSM, mapped features can be tagged according to both their land use and their land cover. Although OSM contributors are free to use their own tags, there is a quasi-official collection of tags that has been established and agreed upon (for example, "landuse" and "landcover" keys or other more specific keys such as "building" or "highway") [49]. Previous studies demonstrated the potential use of these tags to create detailed LULC maps [50].
The methodology we propose in this study relies on remotely sensed measurements to estimate the coverage of OSM building footprints and to identify "mapping gaps" (i.e., areas that have not yet been mapped). Previous studies have utilized OSM data for different remote sensing applications, for example, for classification of urban areas [51] or for semantic labeling of aerial and satellite images [52]. Despite significant progress in the field of machine learning and the increasing availability of satellite imagery, there is still a scarcity of studies aiming to utilize remotely sensed observations to estimate the completeness of OSM building footprints at a given point in time. Identifying areas that lack coverage of OSM features could help plan and prioritize mapping efforts, especially in areas that are vulnerable to natural hazards and where current data gaps pose an obstacle to timely and evidence-based disaster risk management actions. By its nature, the OSM database is dynamic and is updated daily with thousands of new entries. However, as discussed above, the frequency and extent of updates vary widely between geographical areas. Some regions are updated more frequently than others, and developing countries in particular, which are often the most vulnerable to the impacts of natural disasters, are not fully mapped. The objective of this study is to propose a methodology to estimate the completeness of OSM building footprints based on remotely sensed measurements that are available at a global scale and are updated frequently. We demonstrate our methodology in the case study of three small island states: Haiti, Dominica and St. Lucia.
The remainder of this article is organized as follows. In Section 2, we discuss the methodology, the study area and the data we use to predict the coverage of OSM building footprints. In Section 3, we present and evaluate the results in the case of Haiti and in Section 3.3, we illustrate the applicability of our approach in the case of Dominica and St. Lucia. In Section 4, we offer a concluding discussion.

Study Areas
We demonstrate our methodology in the case of three small island states: Haiti, Dominica and St. Lucia (Figure 1).

Haiti
Located on the western side of Hispaniola Island, Haiti (27,750 km² in size, with a population of approximately 11.5 million) is the poorest country in the Western Hemisphere, with a Gross Domestic Product (GDP) per capita of US$ 870 [53]. Haiti is highly vulnerable to natural disasters; more than 96% of its population is exposed to different types of natural hazards, particularly hurricanes, coastal and riverine floods, and earthquakes [53]. More than half of the population lives in cities and towns, a major shift from the 1950s when approximately 90% of Haitians lived in the countryside [54]. Almost all of Haiti's 30 major watersheds experience significant flood events, due to intense seasonal rainfall, storm surge in the coastal zones, deforestation and erosion, and sediment-laden river channels [55]. Furthermore, large portions of the country's population (e.g., in the capital, Port-au-Prince) live in shanty towns built upon steep and exposed hillsides [56]. In 2018 alone, some 2.8 million people were considered to be in need of humanitarian assistance valued at US$ 252.2 million [57].

St. Lucia
A small windward island state located between the Caribbean Sea and the North Atlantic Ocean, St. Lucia (616 km² in size) has a population of approximately 165,000 [58] and a GDP per capita of US$ 10,315 [59]. St. Lucia is susceptible to numerous natural hazards, including hurricanes, landslides, flooding, and volcanic eruptions. Owing to its volcanic origins, its terrain consists mainly of mountains and steep slopes in the center of the country, with low-lying areas along the coasts [60]. As of 2018, approximately 19% of the population resided in these low-lying areas [61]. In addition, St. Lucia's economy is highly dependent on two sources: the export of bananas and income from tourism. Both have recently been negatively affected by natural hazards, such as in 2016, when Hurricane Matthew caused 70% of the island to lose power and damaged 80% of the country's banana plantations [60].

Dominica
Dominica (approximately 74,000 people [62]) is located in the Leeward Islands chain in the Lesser Antilles of the Caribbean Sea, approximately 1,200 km southeast of Haiti, with large portions of its population residing in the capital, Roseau (population 14,700), and in Portsmouth (population 5,200) [62]. Dominica is vulnerable to a wide range of natural hazards, including hurricanes, intense rainfall, slope instability, volcanic eruptions, seismic activity, and tsunamis [63]. Reflecting a rugged physical topography, most of the population and infrastructure are located on the coast, making the country particularly vulnerable to strong winds and high seas [64]. In September 2017, Hurricane Maria, a Category 5 storm, hit the country, causing losses and damages worth 226 percent of GDP [65].

Analytical Framework
The objective of this study is to identify gaps in the completeness of OSM building footprints in three small island states (Haiti, St. Lucia and Dominica) based on remotely sensed measurements and other geospatial features. The procedure involves seven steps.

Step 1: Construct an Artificial Tessellation
We constructed an artificial tessellated grid of cells spanning each of the countries; each cell is 0.25 km² in size (a total of 136,747 grid cells over Haiti, 2,796 grid cells over St. Lucia and 3,861 grid cells over Dominica). Each grid cell was treated as an independent unit of analysis.
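As an illustration of this step, the following minimal sketch builds a regular grid over a bounding box. It assumes square cells of 500 m × 500 m (which gives 0.25 km²) and coordinates in a projected system with metre units; the function name `build_grid` and the example extent are hypothetical and are not taken from the study.

```python
import math

def build_grid(min_x, min_y, max_x, max_y, cell_size=500.0):
    """Return a list of (cell_id, (x0, y0, x1, y1)) square cells covering a box.

    Assumption: coordinates are in metres, so a 500 m x 500 m cell
    has an area of 0.25 km^2.
    """
    n_cols = math.ceil((max_x - min_x) / cell_size)
    n_rows = math.ceil((max_y - min_y) / cell_size)
    cells = []
    for row in range(n_rows):
        for col in range(n_cols):
            x0 = min_x + col * cell_size
            y0 = min_y + row * cell_size
            cell_id = row * n_cols + col
            cells.append((cell_id, (x0, y0, x0 + cell_size, y0 + cell_size)))
    return cells

# Hypothetical 2 km x 1.2 km extent: 4 columns x 3 rows of cells.
grid = build_grid(0.0, 0.0, 2000.0, 1200.0)
```

In practice the grid would be clipped to each country's boundary; that step is omitted here.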

Step 2: Download the Current OSM Building Footprints
We downloaded the most up-to-date OSM data for the three countries (data downloaded in July 2019). For Haiti, we downloaded the data (in Shapefile format) from Geofabrik (https://www.geofabrik.de/data/download.html). At the time of the analysis, Geofabrik did not have data for Dominica and St. Lucia; thus, we downloaded the data for these countries from overpass turbo (https://overpass-turbo.eu/) in KML format (these data require additional preprocessing; we selected OSM features tagged "building=yes"). At the time of the analysis, there were 930,000 mapped buildings in Haiti, 38,619 mapped buildings in Dominica and 29,412 mapped buildings in St. Lucia.

Step 3: Calculate Total Area of OSM Building Footprints in a Grid Cell
We calculated the total area of OSM building footprints in each grid cell. This is the value to be predicted by the explanatory variables (the remotely sensed and geospatial measurements).
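One possible way to compute this quantity is sketched below. It assumes that a building straddling two cells contributes to each cell only the part of its footprint lying inside that cell, which the text does not specify. The sketch clips each polygon to the cell rectangle with the Sutherland-Hodgman algorithm and measures the result with the shoelace formula; all function names and the toy coordinates are hypothetical.

```python
def shoelace_area(poly):
    """Absolute area of a simple polygon given as [(x, y), ...]."""
    total = 0.0
    for i in range(len(poly)):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % len(poly)]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0

def clip_to_cell(poly, cell):
    """Clip a polygon to an axis-aligned cell (x0, y0, x1, y1) (Sutherland-Hodgman)."""
    x0, y0, x1, y1 = cell
    # Each clip edge: (coordinate index, bound, keep points with value >= bound?)
    for axis, bound, keep_ge in [(0, x0, True), (0, x1, False), (1, y0, True), (1, y1, False)]:
        if not poly:
            break

        def inside(p):
            return p[axis] >= bound if keep_ge else p[axis] <= bound

        out = []
        for i in range(len(poly)):
            cur, prev = poly[i], poly[i - 1]
            if inside(cur) != inside(prev):
                # The segment crosses the clip edge: add the intersection point.
                t = (bound - prev[axis]) / (cur[axis] - prev[axis])
                out.append((prev[0] + t * (cur[0] - prev[0]),
                            prev[1] + t * (cur[1] - prev[1])))
            if inside(cur):
                out.append(cur)
        poly = out
    return poly

def footprint_area_per_cell(cells, buildings):
    """Sum the clipped building area (in map units squared) for every cell id."""
    return {cid: sum(shoelace_area(clip_to_cell(b, rect)) for b in buildings)
            for cid, rect in cells.items()}

# Toy data: one building straddles the corner of cell 0, one lies inside cell 1.
cells = {0: (0.0, 0.0, 500.0, 500.0), 1: (500.0, 0.0, 1000.0, 500.0)}
buildings = [
    [(400.0, 400.0), (600.0, 400.0), (600.0, 600.0), (400.0, 600.0)],
    [(700.0, 100.0), (720.0, 100.0), (720.0, 110.0), (700.0, 110.0)],
]
areas = footprint_area_per_cell(cells, buildings)
```

A GIS library would normally handle this intersection; the pure-Python version is only meant to make the per-cell aggregation explicit.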

Step 4: Preprocess and Aggregate the Remotely Sensed and Geospatial Data
We relied on several predictors (explanatory variables) to estimate the coverage of OSM building footprints in a grid cell and to identify gaps in OSM coverage. We preprocessed the data and aggregated them to the level of a grid cell (Table 1 provides a description of the evaluated explanatory variables and the aggregation measures). The preprocessing, analysis and aggregation of the remotely sensed data were completed using Google Earth Engine (GEE). GEE is a platform that leverages cloud-computing services to achieve planetary-scale utility and has previously been used for a wide range of applications [66], including mapping population [67,68] and urban areas [69,70].
Nighttime Lights (VIIRS): The Visible Infrared Imaging Radiometer Suite (VIIRS) is one of the key instruments onboard the Suomi National Polar-Orbiting Partnership (Suomi NPP) spacecraft (launched in 2011). The VIIRS instrument collects visible and infrared imagery and global observations of land, atmosphere, cryosphere and oceans. This instrument offers significant improvements over the capabilities of the former DMSP-OLS [71], notably its availability on a daily basis and its higher spatial resolution (up to 500 m at the equator). The VIIRS Day/Night Band (DNB) provides global coverage with a 12-hour revisit time. First, we recorded for each pixel the maximum value of all overlapping pixels (in the same location) in a stack of seven monthly composites (January to July) of 2019. Then, for each grid cell, we calculated a Sum of Light (SOL) measure (calculated as the sum of the digital number values of all overlapping pixels in each cell).
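The two aggregation steps for nighttime lights can be sketched as follows. This is a simplified illustration on small in-memory rasters rather than the GEE workflow used in the study, and it assumes that each grid cell covers a whole block of pixels; the function names and the toy values are hypothetical.

```python
def max_composite(stack):
    """Per-pixel maximum over a list of equally sized 2-D rasters (lists of rows)."""
    n_rows, n_cols = len(stack[0]), len(stack[0][0])
    return [[max(img[r][c] for img in stack) for c in range(n_cols)]
            for r in range(n_rows)]

def sum_of_lights(raster, block):
    """Sum pixel values in non-overlapping block x block windows (one per grid cell)."""
    n_rows, n_cols = len(raster), len(raster[0])
    return [[sum(raster[r][c]
                 for r in range(r0, min(r0 + block, n_rows))
                 for c in range(c0, min(c0 + block, n_cols)))
             for c0 in range(0, n_cols, block)]
            for r0 in range(0, n_rows, block)]

# Two toy "monthly composites" instead of seven, on a 2 x 4 pixel raster.
january = [[1, 0, 3, 0],
           [0, 2, 0, 0]]
february = [[0, 5, 1, 0],
            [0, 1, 4, 0]]
composite = max_composite([january, february])
sol = sum_of_lights(composite, block=2)
```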
Sentinel-2-Derived Spectral Indices: The Copernicus Sentinel-2 mission comprises a constellation of two polar-orbiting satellites that collect multispectral data in 13 spectral bands, with four bands at a spatial resolution of 10 m and 6 bands at a spatial resolution of 20 m. The revisit period of Sentinel-2 is 5 days at the equator. We calculated four remotely sensed measures sensitive to vegetation and built-up land cover: Normalized Difference Vegetation Index (NDVI) [72], Soil Adjusted Vegetation Index (SAVI) [73], Normalized Difference Built-up Index (NDBI) [74] and Urban Index (UI) [75]. For each grid cell, we calculated a per-index sum value of all pixels overlapping with the grid cell.
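For reference, the sketch below computes the four indices from surface reflectance values and sums them over the pixels of one grid cell. The formulas are the commonly used normalized forms and are an assumption here: the soil adjustment factor of 0.5 for SAVI, the ratio form of UI, and the mapping of Sentinel-2 bands to red, NIR, SWIR1 and SWIR2 are not stated in the text and may differ from the formulations in [72–75]. The reflectance values are invented.

```python
def ndvi(nir, red):
    """Normalized Difference Vegetation Index."""
    return (nir - red) / (nir + red)

def savi(nir, red, soil_factor=0.5):
    """Soil Adjusted Vegetation Index (soil_factor = 0.5 is an assumed default)."""
    return (1.0 + soil_factor) * (nir - red) / (nir + red + soil_factor)

def ndbi(swir1, nir):
    """Normalized Difference Built-up Index."""
    return (swir1 - nir) / (swir1 + nir)

def urban_index(swir2, nir):
    """Urban Index in its normalized-difference form (assumed)."""
    return (swir2 - nir) / (swir2 + nir)

def cell_sums(pixels):
    """Sum each index over all pixels (dicts of band reflectances) in one grid cell."""
    totals = {"NDVI": 0.0, "SAVI": 0.0, "NDBI": 0.0, "UI": 0.0}
    for px in pixels:
        totals["NDVI"] += ndvi(px["nir"], px["red"])
        totals["SAVI"] += savi(px["nir"], px["red"])
        totals["NDBI"] += ndbi(px["swir1"], px["nir"])
        totals["UI"] += urban_index(px["swir2"], px["nir"])
    return totals

# Invented reflectances for one vegetated and one built-up pixel.
vegetated = {"red": 0.10, "nir": 0.50, "swir1": 0.30, "swir2": 0.20}
built_up = {"red": 0.20, "nir": 0.25, "swir1": 0.35, "swir2": 0.30}
sums = cell_sums([vegetated, built_up])
```

With these toy values the vegetated pixel has a high NDVI and a negative NDBI, while the built-up pixel shows the opposite pattern, which is the contrast the indices are meant to capture.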
Sentinel-1 SAR: The Sentinel-1 mission comprises a constellation of two polar-orbiting satellites performing C-band synthetic aperture radar imaging, which enables them to acquire imagery in day and night conditions regardless of the weather. Sentinel-1 has a 12-day repeat cycle, with a spatial resolution down to 5 m. Similarly to [70], we captured the texture of the surface by utilizing Sentinel-1's C-band (single co-polarization vertical transmit and vertical receive (VV) acquisition mode with an Interferometric Wide Swath (IW) instrument mode, a 250 km swath at 5 m by 20 m spatial resolution (single look)). From each scene, we removed speckle noise and performed radiometric calibration and terrain correction. To create the annual composites, we calculated for each location (pixel) the median value of all overlapping pixels in the stack of all scenes captured in 2019. For each grid cell, we calculated the average value of all pixels within the area of the grid cell.
Slope: To capture the topography of the surface, we used the Global SRTM mTPI dataset (available in GEE at a spatial resolution of 270 m), where a local gradient is calculated for each pixel based on the global SRTM DEM elevation data (30 m resolution). The mTPI distinguishes ridge from valley forms and is calculated as the elevation at each location minus the mean elevation within a neighborhood [76]. For each grid cell, we calculated the average value of all pixels in the grid cell.
Forest Cover: We estimated the extent of forest cover in 2018 based on the Hansen Global Forest Change v1.6 (2000-2018) dataset [77]. First, we defined a pixel as "forest" in the year 2000 if more than 20% of it was covered with forest in 2000. We then recorded pixels that experienced a major event of forest cover loss between 2000 and 2018 and estimated the total area of forest cover in 2018 per grid cell.
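A minimal sketch of this rule is given below. It assumes that a pixel counts as forest in 2018 if its tree cover in 2000 exceeded 20% and no loss event was recorded, that forest gain is ignored, and that each pixel covers a nominal 30 m × 30 m; none of these details is spelled out in the text. The raster names mirror the `treecover2000` and `lossyear` bands of the Hansen dataset, and the toy values are invented.

```python
PIXEL_AREA_M2 = 30.0 * 30.0  # assumed nominal pixel footprint in square metres

def forest_area_2018(treecover2000, lossyear, threshold=20):
    """Area (m^2) of pixels with > threshold % tree cover in 2000 and no recorded loss.

    lossyear follows the Hansen convention: 0 means no loss, 1..18 encode the
    year of loss (2001..2018).
    """
    area = 0.0
    for cover_row, loss_row in zip(treecover2000, lossyear):
        for cover, loss in zip(cover_row, loss_row):
            if cover > threshold and loss == 0:
                area += PIXEL_AREA_M2
    return area

# Toy 2 x 2 grid cell: one forest pixel was lost in 2012, one is below the threshold.
treecover = [[80, 15],
             [45, 21]]
loss = [[0, 0],
        [12, 0]]
area = forest_area_2018(treecover, loss)
```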
Urban Footprints: We relied on two remotely sensed derived products signifying urban and rural settlements that were produced by the Earth Observation Center at DLR: The Global Urban Footprint (GUF) (in a spatial resolution of ~12m) and the World Settlement Footprint (WSF) (in a spatial resolution of ~10m) [78][79][80].
OSM Transportation Network Features: We calculated the total length of OSM roads in a cell and the total number of junctions in a cell as additional potential predictors of OSM-building footprints.
Table 1 (aggregation measures for the Sentinel-2 spectral indices): the sum of the NDVI, SAVI, NDBI and UI values, respectively, of all pixels in a grid cell.

We adopted a visual interpretation method to assess the completeness of OSM building footprints in the grid cells in Haiti and St. Lucia. We achieved this by overlaying the OSM building footprint dataset on the most recent high-resolution base map image (provided by ESRI, updated as of 2019 [81]). We identified grid cells in Haiti and St. Lucia where we assessed that at least 75% of the buildings visible in the satellite image had been mapped (we identified 835 grid cells in Haiti and 179 grid cells in St. Lucia). Because most of Dominica has been mapped, we skipped this step for this country.

Step 6: Perform Correlation Analysis and Prediction
We evaluated the correlation between the remotely sensed and geospatial measures (the explanatory variables) and the area of OSM building footprints in a grid cell using a Pearson correlation test, and performed an Ordinary Least Squares (OLS) regression to estimate the potential of the variables, combined, to explain the observed variation in the area of OSM building footprints in a grid cell. Additionally, we evaluated the potential of the explanatory variables to predict the area of OSM building footprints in a grid cell using Random Forest regression. Random Forests [82] are tree-based models that include k decision trees and p randomly chosen predictors for each recursion. To make a prediction for an example, its variables are run through each of the k trees, and the k predictions are averaged (arithmetic mean). Each tree is trained on a subset of examples drawn randomly with replacement from the training set, with each node's binary question determined using a random subset of p input variables. We performed the regression with the 835 grid cells in Haiti that were visually assessed as being relatively fully mapped (i.e., at least 75% of the buildings in a grid cell were assessed as mapped). To evaluate the accuracy of the prediction, we adopted a fivefold cross-validation method. In each experiment, the examples in one of the data folds were left out for testing and the examples in the remaining four folds were used to train the model. The performance of the trained model was tested on the examples in the left-out fold, and the overall performance measure was then averaged over the five folds. We assessed the prediction accuracy with different numbers of decision trees: 2, 4, 8, 16, 32, 64, 128, 256 and 512, with the minimum size of terminal nodes set to 5.
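To make the procedure concrete, the following is a deliberately small, pure-Python sketch of a Pearson correlation, a Random Forest regressor (bootstrap samples, p random predictors per split, a minimum terminal node size of 5, averaged predictions) and a fivefold cross-validated R². It is not the implementation used in the study; the function names, the synthetic data and the choice of R² as the score are assumptions made for illustration.

```python
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def sse(values):
    """Sum of squared deviations from the mean."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def grow_tree(X, y, rng, p, min_leaf=5):
    """Grow a regression tree; each split searches p randomly chosen predictors."""
    best = None  # (error, feature index, threshold)
    if len(y) >= 2 * min_leaf:
        for j in rng.sample(range(len(X[0])), p):
            order = sorted(range(len(y)), key=lambda i: X[i][j])
            for cut in range(min_leaf, len(y) - min_leaf + 1):
                lo, hi = X[order[cut - 1]][j], X[order[cut]][j]
                if lo == hi:
                    continue  # identical values cannot be separated
                err = sse([y[i] for i in order[:cut]]) + sse([y[i] for i in order[cut:]])
                if best is None or err < best[0]:
                    best = (err, j, lo)
    if best is None:
        return sum(y) / len(y)  # terminal node: mean response
    _, j, thr = best
    left = [i for i in range(len(y)) if X[i][j] <= thr]
    right = [i for i in range(len(y)) if X[i][j] > thr]
    return (j, thr,
            grow_tree([X[i] for i in left], [y[i] for i in left], rng, p, min_leaf),
            grow_tree([X[i] for i in right], [y[i] for i in right], rng, p, min_leaf))

def predict_tree(node, x):
    """Walk down the tree until a terminal node (a float) is reached."""
    while isinstance(node, tuple):
        j, thr, left, right = node
        node = left if x[j] <= thr else right
    return node

def fit_forest(X, y, k=64, p=1, seed=0):
    """Train k trees, each on a bootstrap sample drawn with replacement."""
    rng = random.Random(seed)
    trees = []
    for _ in range(k):
        idx = [rng.randrange(len(y)) for _ in range(len(y))]
        trees.append(grow_tree([X[i] for i in idx], [y[i] for i in idx], rng, p))
    return trees

def predict_forest(trees, x):
    """Average the k tree predictions (arithmetic mean)."""
    return sum(predict_tree(t, x) for t in trees) / len(trees)

def cross_val_r2(X, y, folds=5, **forest_args):
    """Mean R^2 over the folds; each fold is held out once for testing."""
    idx = list(range(len(y)))
    random.Random(0).shuffle(idx)
    scores = []
    for f in range(folds):
        test = idx[f::folds]
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        trees = fit_forest([X[i] for i in train], [y[i] for i in train], **forest_args)
        obs = [y[i] for i in test]
        pred = [predict_forest(trees, X[i]) for i in test]
        scores.append(1.0 - sum((o - q) ** 2 for o, q in zip(obs, pred)) / sse(obs))
    return sum(scores) / folds

# Synthetic stand-in data: the first predictor drives the response, the second is noise.
demo = random.Random(1)
X = [[demo.random(), demo.random()] for _ in range(100)]
y = [100.0 * a + demo.gauss(0.0, 2.0) for a, _ in X]
r2 = cross_val_r2(X, y, k=8, p=2)
```

Repeating the last call with different values of `k` would mimic the comparison across numbers of decision trees, at a much smaller scale than the analysis described here.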

Step 7: Predict the Coverage of OSM-Building Footprints in Each Entire Country
We used either the grid cells that were visually assessed as relatively fully mapped (in the case of Haiti and St. Lucia) or all the grid cells (in the case of Dominica) as references to train the Random Forest regression and to predict the area of OSM building footprints over all the grid cells in each country. We then identified the grid cells that were predicted to contain OSM building footprints but were not yet mapped.

As discussed above, a visual examination of the completeness of OSM building footprints over Haiti suggests that large portions of the country remain unmapped (Figure 3a). Figure 3b,c show, as an illustration, the coverage of OSM building footprints in the capital of Haiti, Port-au-Prince, and in the adjacent Carrefour commune. While buildings in many areas within these cities have been mapped, large portions are still not fully mapped. We observe that densely mapped zones of Port-au-Prince co-exist alongside zones that remain entirely unmapped (Figure 3c), a visual pattern that may result from the episodic engagement of community mapping volunteers and the definition of mapping 'tasks' on a neighborhood scale through OSM editing tools. Moreover, significant parts of northern Haiti are not mapped (Figure 4), including, for example, the cities of Gonaïves and Cap-Haitien.

A Pearson correlation test indicated a significant (p<0.01) correlation between the total area of OSM building footprints in a grid cell and several of the examined explanatory variables.
As expected, there was a positive and significant correlation between the area of OSM building footprints in a grid cell and the total area of built-up land cover, according to WSF and GUF (r=0.73 and r=0.71, respectively, p<0.01), as well as with nighttime lights (VIIRS SOL) (r=0.63, p<0.01). We find a significant (p<0.01) correlation between the area of OSM building footprints in a grid cell and the four Sentinel-2 spectral indices: a positive correlation with UI and NDBI (r=0.59 and r=0.47, respectively) and a negative correlation with both SAVI and NDVI (r=-0.53).

We identified 835 grid cells where, according to a visual assessment, at least 75% of the buildings that were visible in the satellite image are mapped in OSM (Figure 5 shows examples of grid cells where more than 75% of the structures are mapped). The correlation between the area of OSM building footprints in a grid cell and the examined predictors was higher than in the previous experiment, where all the grid cells (i.e., 136,747 grid cells) were considered (for example, r=0.78 and r=0.65 with WSF and VIIRS and r=0.61 and r=-0.55 with UI and SAVI, respectively) (Table 2). This is likely because large portions of the country are not mapped (i.e., there are grid cells that lack OSM coverage although they are actually populated and exhibit LULC characteristics of populated areas). As expected, there were also similarities and correlations between some of the explanatory variables. Figure 6a presents pairwise correlation coefficients between the explanatory variables (variables are ordered according to a hierarchical clustering). The explanatory variables form several similarity clusters: a cluster composed of vegetation spectral indices (NDVI, SAVI) and forest cover (which are positively correlated with each other), and a cluster composed of built-up land cover spectral indices (NDBI, UI), together with VIIRS, WSF, GUF, and OSM road network features. As expected, there is a negative and significant correlation between the vegetation and the built-up land cover spectral indices. The dendrogram shown in the figure further highlights hierarchical clusters formed between the variables, notably, OSM area and VIIRS, UI and NDBI, NDVI and SAVI, and road length and number of junctions in a grid cell.

[...] (Table 4). While GUF and WSF together explain 66% of the variation, the addition of nighttime lights improves the fit of the model (explaining up to 76% of the variation). The addition of further remotely sensed measures (i.e., Sentinel-2-derived spectral indices, slope, texture and forest cover) improves the model fit by a further 5% (up to 81% of the variation). With the addition of OSM transportation network features, the fit of the model improves marginally to around 82%.

Prediction of OSM Building Footprint Coverage
The results above indicate that the area of OSM building footprints in a grid cell can be explained by several of the remotely sensed and geospatial explanatory variables. To evaluate the potential of these variables to predict the area of OSM building footprints in a cell, we performed a regression with Random Forests.
Random Forest regression predicts up to 89% of the variation of OSM building footprints in a grid cell. Performance improves with the addition of decision trees up to 64 trees, for example, from 81% to 89% of the predicted area (with 2 and 64 decision trees, respectively, Figure 7). Figure 8a presents a comparison between the actual and the predicted area of OSM building footprints in a grid cell (regression with 64 decision trees). The two most important variables in the model are WSF and GUF, followed by OSM road network features and Sentinel-2-derived spectral indices (as indicated by the variable importance measure (IncNodePurity), Figure 8b). We use the Random Forest model to predict the area of OSM building footprints over all the grid cells in Haiti (i.e., we train the model with the 835 grid cells that were assessed as relatively fully mapped and predict the coverage over the entire dataset). Figure 10b shows the predicted area of OSM building footprints per grid cell over Haiti. The results highlight large portions of Haiti that have not yet been mapped (for example, the northern cities of Gonaïves and Cap-Haitien) as well as patches of unmapped grid cells around major cities (e.g., Port-au-Prince). This analysis allows us to identify areas (grid cells) that are predicted to contain large areas of building footprints but are not mapped (Figure 9).
A visual examination shows that the predicted coverage of OSM building footprints (Figure 10b) corresponds more closely with the distribution of built-up land cover (according to GUF, for example) and nighttime lights (VIIRS) (Figure 10c,d, respectively) than does the current distribution of OSM building footprints (Figure 10a).
To further evaluate the accuracy of the model, we perform an OLS regression analysis using the entire Haiti dataset (i.e., 136,747 grid cells). We find that the remotely sensed measurements explain up to 89% of the variation of the predicted area of OSM building footprints in all the grid cells spanning Haiti (R² = 0.89, F(12, 136,734) = 90,690, p < 0.01). In comparison, these indicators explain only 48% of the variation of the current area of OSM building footprints in the Haiti dataset (R² = 0.48, F(12, 136,734) = 10,330, p < 0.01).
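As a rough consistency check on these figures, the F statistic of an OLS model with k predictors fitted to n observations can be recovered from R² with the standard relation below. The values k = 12 and n = 136,747 are read off the reported degrees of freedom; the calculation uses the rounded R², so only approximate agreement should be expected.

```latex
F = \frac{R^2 / k}{(1 - R^2) / (n - k - 1)}
  = \frac{0.89 / 12}{0.11 / 136{,}734}
  \approx 9.2 \times 10^{4}
```

This is close to the reported value of 90,690, which would correspond to an unrounded R² of about 0.888. The same relation with R² = 0.48 gives roughly 1.05 × 10⁴, in line with the reported 10,330.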
Finally, to identify grid cells that are predicted to contain building footprints but are not actually mapped, we calculated the ratio between the actual and the predicted area of OSM building footprints in a grid cell (i.e., the actual area of OSM building footprints in a grid cell divided by the predicted area) (Figure 11a,c). This analysis allows us to identify grid cells where this ratio is low (Figure 11b,d), highlighting grid cells that require mapping.
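The sketch below shows one way such a ratio could be used to flag cells. It takes the ratio as the actual area divided by the predicted area, so that low values indicate under-mapped cells; this direction, the 0.5 cut-off and the function names are assumptions made for illustration and are not values from the study.

```python
def completeness_ratio(actual_m2, predicted_m2):
    """Actual / predicted footprint area per cell; None where nothing is predicted."""
    ratios = {}
    for cell_id, predicted in predicted_m2.items():
        actual = actual_m2.get(cell_id, 0.0)
        ratios[cell_id] = actual / predicted if predicted > 0 else None
    return ratios

def cells_to_map(ratios, threshold=0.5):
    """Cell ids whose ratio falls below the (hypothetical) threshold."""
    return sorted(c for c, r in ratios.items() if r is not None and r < threshold)

# Toy values in square metres: cell 1 is unmapped, cell 2 is nearly complete,
# cell 3 has mapped buildings although none are predicted.
actual = {1: 0.0, 2: 900.0, 3: 5000.0}
predicted = {1: 4000.0, 2: 1000.0, 3: 0.0}
ratios = completeness_ratio(actual, predicted)
priority = cells_to_map(ratios)
```

Cells with no predicted building area are left undefined rather than divided by zero, so they never appear in the priority list.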

Evaluation of the Method in the Case of Dominica and St. Lucia
The results above suggest the potential of several remotely sensed indicators to predict the coverage of OSM building footprints, at least in the case of Haiti. To assess the validity of the method, we performed further analysis in the case of two additional small island states: Dominica and St. Lucia. A visual examination suggests that the coverage of OSM building footprints in Dominica is relatively complete, while large portions of St. Lucia remain unmapped. Areas lacking OSM building footprints include parts of the capital, Castries, and the second-largest town, Vieux Fort (Figure 12). These two areas account for approximately 49% of the population of St. Lucia (64,654 and 16,624 people, respectively [83]). Similar to the methodology described in the case of Haiti, we created a fishnet of grid cells, 0.25 km² in size, spanning the two islands. In the case of Dominica, we found a high positive and significant correlation between the area of OSM building footprints in a grid cell and several explanatory variables. We found a high positive correlation with GUF, number of junctions and WSF (between r=0.90 and r=0.91, p<0.01 for all). The correlation between the area of OSM building footprints and VIIRS is somewhat lower (r=0.75, p<0.01). With Sentinel-2-derived spectral indices, this correlation ranges between r=0.35 and r=0.38 (with UI and NDBI, respectively, p<0.01 for both). An OLS regression analysis reveals that together, these variables explain 92% of the variation of OSM building footprint area in a grid cell (R² = 0.92, F(12, 3846) = 3848, p < 0.01). Random Forest regression (with 64 decision trees) shows similar trends, with a high accuracy rate of 88% (regression accuracy assessed using fivefold cross-validation).
In the case of St. Lucia, when the analysis was done with the entire dataset of grid cells over the country (i.e., N=2,796), the correlation between the area of OSM building footprints and GUF and WSF ranges between 0.70 and 0.75 (p<0.01), and there is a lower correlation between OSM building footprint area and VIIRS (r=0.58, p<0.01). Together, the predictors explain only 66% of the variation in OSM building footprints (R² = 0.66, F(12, 2783) = 464.6, p < 0.01). We attribute the lower fit of the model to the fact that large portions of the country have not been mapped. Thus, we visually assessed the completeness of OSM building footprints in St. Lucia grid cells and identified 179 grid cells in which we could assess that at least 75% of the visible buildings are mapped. With these visually assessed grid cells, the fit of the model improved, and together, the predictors explain 92% of the variation of OSM building footprint area (R² = 0.92, F(12, 166) = 166.4, p < 0.01). Random Forest regression (with 64 decision trees) results in a similar accuracy rate (R² = 0.92) (Table 5).
Finally, we use Random Forest to predict the area of OSM building footprints in each of the countries. Figure 13 presents the predicted area of OSM building footprints in St. Lucia. Unmapped areas include, for example, areas surrounding the capital, Castries. The central part of the city is densely mapped while its adjacent neighborhoods lack coverage. Large swaths of the surrounding areas of Charlotte, Vigie, and Bisee completely lack OSM building footprints.

Discussion
In recent decades, natural disasters have been responsible for an estimated 0.1% of global deaths, killing on average 60,000 people per year [84]. In the last two decades alone, developing countries have accounted for more than half of all reported casualties [85]. Natural disasters often cause significant damage to communities, infrastructure and the environment, and require immediate intervention and implementation of appropriate measures aiming to save lives.
Accurate and easily accessible geospatial information is key for an effective disaster risk management cycle and for informed decision-making during humanitarian response [86]. The increasing availability of geospatial information is revolutionizing disaster research and emergency management. Until recently, much of this essential geospatial information was proprietary, scarce and in many cases, unavailable during significant disasters.
In the last decade, OSM has repeatedly been shown to be highly suitable for disaster mapping and management. Despite continuous efforts to improve the completeness of the OSM database, large portions of the world remain unmapped, especially in countries that are prone to natural hazards, reflecting limited internet speed and connectivity, limited availability of GPS devices, a lack of technologically skilled volunteers and limited awareness of VGI technologies [18].
Several methods have been proposed to assess the completeness of the OSM database, including evaluation of the completeness of street networks, land use and buildings; many of them rely on external datasets for accuracy and completeness estimation. The increased availability of free and open-source remotely sensed data can be utilized to identify mapping gaps in OSM datasets, find locations that require mapping, and help prioritize and plan mapping campaigns and efforts.
Although several applications have recently been proposed to predict the coverage of OSM by means of remotely sensed derived products (such as WorldPop) [87], to the best of our knowledge, no study has yet evaluated the potential use of remotely sensed measurements to predict the completeness of OSM features.
In this study, we demonstrate a methodology to identify areas where building footprints have not yet been mapped in the OSM dataset. The methodology relies on remotely sensed measurements, derived products and geospatial information related to the road network to predict the completeness of OSM building footprints in three small island states (Haiti, Dominica and St. Lucia).
In the case of Haiti, the results show that large portions of the country are still unmapped, despite the continued mapping efforts to maintain a full and up-to-date map while also keeping pace with the changing socio-physical characteristics of the country and to aid response and recovery in future disasters [88]. We find that in the case of the three countries, the coverage of OSM building footprints is significantly correlated with several remotely sensed measures and indicators. As expected, the coverage is positively correlated with the distribution of built-up land cover (indicated by a Pearson correlation coefficient of between r=0.78 and r=0.91 in the case of Haiti and Dominica, respectively). To some extent, this is not surprising, and previous studies have already shown the potential of remotely sensed derived products to predict the coverage of OSM building footprints [87]. However, a limiting factor in utilization of remotely sensed derived products is that their availability varies in space and time and these products are not always updated on a regular basis. In this study, we demonstrate the potential use of free and open-source remotely sensed indicators to estimate the coverage of OSM building footprints. We show that the intensity of nighttime lights luminosity (measured by VIIRS) is highly correlated with the area of OSM building footprints (indicated by a Pearson correlation coefficient of between r=0.65 and r=0.75 in the case of Haiti and Dominica, respectively). This finding aligns with previous studies showing that the intensity of nighttime lights is closely correlated with anthropogenic activities and with changes in the distribution of built-up land cover [89,90], as well as with the distribution of geographical features, such as the density of the road network [91]. 
Further, we show that remotely sensed spectral indices signifying the distribution of vegetation (NDVI, SAVI) and built-up land cover (NDBI, UI) are also correlated with the distribution of OSM building footprints. While previous studies propose to utilize vegetation spectral indices (e.g., NDVI) to estimate the completeness of urban green spaces mapping in OSM [92], we show that these indices are also significantly correlated with the coverage of OSM building footprints.
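For reference, three of the spectral indices named above can be computed per pixel from surface reflectance as in the sketch below. The formulas are the standard textbook definitions. The band assignment (Sentinel-2 B4 for red, B8 for near infrared, B11 for shortwave infrared) and the SAVI soil adjustment factor of 0.5 are assumptions here rather than details taken from this section. UI is omitted because its band choice is not stated in this section.

```python
def ndvi(nir, red):
    """Normalized Difference Vegetation Index: (NIR - Red) / (NIR + Red)."""
    return (nir - red) / (nir + red)


def savi(nir, red, soil_factor=0.5):
    """Soil-Adjusted Vegetation Index with soil adjustment factor L.

    L = 0.5 is a commonly used default; it is an assumption here.
    """
    return (nir - red) * (1.0 + soil_factor) / (nir + red + soil_factor)


def ndbi(swir, nir):
    """Normalized Difference Built-up Index: (SWIR - NIR) / (SWIR + NIR)."""
    return (swir - nir) / (swir + nir)
```

Vegetated pixels tend toward high NDVI and negative NDBI, while built-up pixels tend toward the opposite, which is consistent with both families of indices carrying information about building coverage.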
Because different sensors record distinct characteristics of the land (e.g., brightness, temperature, height, density, texture), data fusion techniques that exploit the best characteristics of each type of sensor have become a valuable procedure in remote-sensing analysis, including for urban applications [70,93] and in mapping the built-up land cover [69]. In this study, we show that the combined effect of the explanatory variables explains between 92% and 94% of the variation in the area of OSM building footprints, exceeding the effect of each predictor by itself. We show that in the case of Haiti, the addition of nighttime lights to built-up land cover classification products (GUF, WSF) improves the fit of the model by 10%, and that with the addition of remotely sensed measures, the fit of the model improves by 5%. Although land cover characteristics are often related to nighttime light luminosity (for example, vegetation density decreases while luminosity increases from the rural area to the urban core), fusing nighttime and daytime remotely sensed measures allows for an increase in the separability between urban and nonurban land [94]. Fusing daytime and nighttime measurements enables feature complementation and compensation for the limitation of single data sources in extracting urban information [95].
We also perform a Random Forest regression to predict the area of OSM building footprints in a cell and find that the regression explains between 88% and 94% of the variation (in Dominica and Haiti, respectively). Previous studies suggest that the number of decision trees of the Random Forest is generally proportional to the model's accuracy [96], although they show mixed results for the optimal number of trees in the forest. The number varies between 10 [97] and 150 trees [98].
Here, we find that with Random Forests, accuracy improves up to 64 decision trees and then moderately decreases as the number of trees increases.
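The fivefold cross-validation used to score each model configuration can be sketched in plain Python as below. To keep the example self-contained, a one-predictor least-squares line (`fit_line`) stands in for the Random Forest; that substitution, the contiguous unshuffled folds, and all function names are assumptions made for illustration. Comparing forest sizes would amount to calling `cross_validated_r2` once per candidate number of trees with a forest-fitting function in place of `fit_line`.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when actual values are constant")
    return 1.0 - ss_res / ss_tot


def k_fold_splits(n, k=5):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size


def fit_line(xs, ys):
    """Stand-in model: ordinary least-squares line, returned as a callable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def cross_validated_r2(xs, ys, fit, k=5):
    """Mean R^2 over k held-out folds for a model built by fit(xs, ys)."""
    scores = []
    for train, test in k_fold_splits(len(xs), k):
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        predictions = [model(xs[i]) for i in test]
        scores.append(r_squared([ys[i] for i in test], predictions))
    return sum(scores) / len(scores)
```

Scoring on held-out folds rather than on the training cells is what makes a comparison across forest sizes meaningful, since a larger forest can fit the training cells more closely without predicting unseen cells any better.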
Finally, we demonstrate the potential of our approach to identifying mapping gaps, or areas that lack OSM coverage. We do this by training the model with the grid cells that are visually assessed as relatively completely mapped and use the trained model to predict the coverage of OSM building footprints throughout the countries. We capture grid cells that lack OSM building footprints coverage (i.e., grid cells where the predicted coverage of OSM building footprints largely exceeds the actual coverage). These grid cells represent locations where mapping efforts could potentially be targeted.
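The gap-detection step described above reduces to comparing predicted and actual footprint area per grid cell. The sketch below assumes the predictions have already been produced by a model trained on the well-mapped cells. The threshold `min_excess_m2` is a hypothetical parameter, since the text does not quantify how far the prediction must exceed the actual coverage.

```python
def find_mapping_gaps(cells, min_excess_m2=5000.0):
    """Return (cell_id, excess) pairs where prediction exceeds actual area.

    cells maps a cell id to (actual_area_m2, predicted_area_m2).
    The result is sorted so that the largest shortfall comes first.
    """
    gaps = []
    for cell_id, (actual, predicted) in cells.items():
        excess = predicted - actual
        if excess >= min_excess_m2:
            gaps.append((cell_id, excess))
    gaps.sort(key=lambda item: item[1], reverse=True)
    return gaps
```

Sorting by excess gives a simple priority list: cells at the top are those where the predictors indicate substantial built-up area but OSM contains few or no footprints.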
To summarize, VGI contributions, especially in developing countries, tend to be made in spurts, for example in response to a specific trigger, such as a natural disaster or humanitarian crisis, rather than as a regular, continuous process [13]. Because OSM datasets rely on volunteers, the completeness and mapping efforts vary in space and time. There is a need for a systematic tool that would guide and prioritize mapping efforts and mapping campaigns. As more and more remotely sensed data become available to the research community, our study extends the existing literature by demonstrating how they could be leveraged to conduct novel, and critical, large-scale assessments of the completeness of the OSM database, especially in areas at high risk of natural disasters.
We note a few limitations to this study. First, we demonstrate our methodology in the case study of three small island states, which, at least to some extent, are characterized by relatively similar geographical conditions and characteristics. An extension of this study would evaluate our methodology in additional countries characterized by diverse geographical conditions (topography, land cover, land use, etc.). Second, in order to create the training examples for the model and to evaluate the relation between the explanatory variables and the area of OSM building footprints, we created a relatively small dataset of examples (grid cells), which were visually assessed using a subjective visual interpretation method as relatively fully mapped. By its nature, visual interpretation may be subject to idiosyncratic variation across individuals performing the manual classification. An extension to this study would leverage the crowd to create an extensive dataset of grid cells visually assessed and mapped and would account for an agreement between the interpreters. Third, the analysis presented in this study was done at a single point in time. By its nature, the OSM database is continuously updated and is dynamic, while simultaneously, new remotely sensed measurements are being collected. An extension to this study would account for these changes in time and evaluate the completeness of OSM building footprints on an ongoing basis as new data becomes available.
Extensions to our approach may improve the identification of areas that lack complete OSM coverage by accounting for additional inputs; for example, socioeconomic variables (including WorldPop and Facebook's High Resolution Settlement Layer (HRSL)) and additional physical/geographical characteristics and spectral indices. Further extensions to our approach may also include the application of learning algorithms and evaluation with various tuning parameters of the classifiers and the fit models.

Conclusions
Globally, there has been an increase in the frequency and impacts of major natural disaster events; in the next century, it is likely that climate change will amplify the number and severity of such disasters. While accurate and timely geospatial information is vital for the full cycle of disaster risk management, this data is not always available for the disaster management community when disasters occur. Although VGI platforms, specifically OpenStreetMap (OSM), show great potential to support humanitarian mapping tasks, gaps in VGI data remain a major concern [99]. There is an increasing need for a fully automatic tool that can identify areas that lack a complete mapping of OSM features, especially in areas prone to hazard events. While previous studies have utilized OSM data as reference for classification of built-up land cover with satellite imagery [100][101][102][103], here we show the potential use of publicly available, remotely sensed data as predictors of the spatial coverage of OSM building footprints. The tool and methodology we present here are time-efficient and scalable.
An extension to our approach may improve the accuracy of the prediction of OSM building footprint area by including further remotely sensed measures. Incorporating additional datasets, such as newly developed VIIRS nighttime light products, socioeconomic variables, and additional land cover and land use classification schemes, may offer opportunities to improve the accuracy of the prediction.
Author Contributions: R.G. designed the experiments and the methodology with N.J. and J.M., implemented the experiment and wrote the manuscript. J.M. and N.J. helped to carry out and implement the experiment and improve the manuscript. All authors have read and agreed to the published version of the manuscript.
Funding: This research was funded by the World Bank Group.