On the Synergistic Use of Optical and SAR Time-Series Satellite Data for Small Mammal Disease Host Mapping

(1) Background: Echinococcus multilocularis (Em), a highly pathogenic parasitic tapeworm, is responsible for a significant burden of human disease. In this study, optical and time-series Synthetic Aperture Radar (SAR) data is used synergistically to model key land cover characteristics driving the spatial distributions of two small mammal intermediate host species, Ellobius tancrei and Microtus gregalis, which facilitate Em transmission in a highly endemic area of Kyrgyzstan. (2) Methods: A series of land cover maps are derived from (a) single-date Landsat Operational Land Imager (OLI) imagery, (b) time-series Sentinel-1 SAR data, and (c) Landsat OLI and time-series Sentinel-1 SAR data in combination. Small mammal distributions are analyzed in relation to the surrounding land cover class coverage using random forests, before being applied predictively over broader areas. A comparison of models derived from the three land cover maps are made, assessing their potential for use in cloud-prone areas. (3) Results: Classification accuracies demonstrated the combined OLI-SAR classification to be of highest accuracy, with the single-date OLI and time-series SAR derived classifications of equivalent quality. Random forest analysis identified statistically significant positive relationships between E. tancrei density and agricultural land, and between M. gregalis density and water and bushes. Predictive application of random forest models identified hotspots of high relative density of E. tancrei and M. gregalis across the broader study area. (4) Conclusions: This offers valuable information to improve the targeting of limited-resource disease control activities to disrupt disease transmission in this area. Time-series SAR derived land cover maps are shown to be of equivalent quality to those generated from single-date optical imagery, which enables application of these methods in cloud-affected areas where, previously, this was not possible due to the sparsity of cloud-free optical imagery.


Introduction
The parasitic tapeworm Echinococcus multilocularis (Em) is a highly effective pathogen of Human Alveolar Echinococcosis (HAE), which is a chronic debilitating disease with >90% mortality in untreated patients 10 years after diagnosis [1][2][3]. A neglected zoonotic disease, HAE is responsible for a significant burden of human disease across continental Asia, and is expanding in the prevalence and range in Europe, North America, and Asia [4][5][6]. In China and central Asia, HAE prevalence humans can exceed 10% locally [7][8][9] even though geographical patterns are highly variable, as elsewhere in the world [10]. The reasons for this variability are unknown.
The Em transmission cycle is based on predator-prey relationships between definitive hosts (such as the widely distributed red fox, corsac fox, Tibetan fox, and wolf), and small mammal (rodent or lagomorph) intermediate hosts ( Figure 1). Tapeworm eggs are shed in definitive host faeces, contaminating the environment, with intermediate hosts infected by oral ingestion of eggs when feeding on vegetation. The transmission cycle is completed when definitive hosts themselves become infected by predating on infected, small mammals. Domestic dogs can also be infected through predation of small mammals, and, due to their close contact with people, are a major transmission source to humans [11] who are infected via accidental egg ingestion. Once infected, HAE is characterized by slow growing larval cysts or infiltrative metacestode lesions in the liver or other organs [12], with an asymptomatic period of several years usually followed by significant morbidity. Treatment is difficult and primarily based on surgery and/or high dose benzimidazole therapy. Disease control measures exist including anti-helminthic dog dosing. However, resources are limited and implementation lacks appropriate geographical targeting, which reduces the effectiveness of these measures. Domestic dog infection has been shown to be linked to surrounding high densities of small mammals [1]. Therefore, determining the environmental drivers influencing these small mammal distributions is critical to understanding Em transmission dynamics. Small mammal distributions are influenced by the availability of key favourable habitats [14,15], with landscape modification impacting these habitats, so that small mammal presence and transmission varies spatio-temporally [16,17]. Despite its importance, and due in part to its complexity, knowledge is lacking as to the specific landscape mechanisms that are driving these small mammal host distributions.
There is currently an urgent need for cost-effective methods to map the key landscape variables, which drive small mammal host distributions, as a proxy for identifying likely areas of active Em transmission. This cannot be achieved using conventional ground-based surveys, which are costly, time consuming, and spatially limited. Satellite remote sensing offers a solution. This enables costeffective multi-scale and regional-level monitoring, which, alongside in-situ ecological surveys identifying small mammal distributions, can model and extrapolate landscape-small mammal relationships.
Remote sensing has been applied to the study of a number of small-mammal borne diseases including leptospirosis [18], junin virus [19], sin-nombre virus [20], and other hantaviruses [21,22], and has emerged as an important surveillance tool [23]. It has also been applied to studies of Em, Domestic dog infection has been shown to be linked to surrounding high densities of small mammals [1]. Therefore, determining the environmental drivers influencing these small mammal distributions is critical to understanding Em transmission dynamics. Small mammal distributions are influenced by the availability of key favourable habitats [14,15], with landscape modification impacting these habitats, so that small mammal presence and transmission varies spatio-temporally [16,17]. Despite its importance, and due in part to its complexity, knowledge is lacking as to the specific landscape mechanisms that are driving these small mammal host distributions.
There is currently an urgent need for cost-effective methods to map the key landscape variables, which drive small mammal host distributions, as a proxy for identifying likely areas of active Em transmission. This cannot be achieved using conventional ground-based surveys, which are costly, time consuming, and spatially limited. Satellite remote sensing offers a solution. This enables cost-effective multi-scale and regional-level monitoring, which, alongside in-situ ecological surveys identifying small mammal distributions, can model and extrapolate landscape-small mammal relationships.
Remote sensing has been applied to the study of a number of small-mammal borne diseases including leptospirosis [18], junin virus [19], sin-nombre virus [20], and other hantaviruses [21,22], and has emerged as an important surveillance tool [23]. It has also been applied to studies of Em, which illustrates strong links between landscape composition and HAE prevalence in the South Gansu Province, China [8,24,25]. Links between Em intermediate host distributions and degraded grassland [14] and tree/shrub habitats [26] in China have also been demonstrated, which illustrate that landscape composition is a key spatial determinant of Em transmission [8,13,27].
Previous applications of remote sensing for the study of Em have utilised medium-resolution optical imagery, and, while this has advanced current knowledge of landscape-Em host dynamics, it does have limitations. Cloud cover inhibiting data collection is a persistent problem in many parts of the world, which significantly impacts data availability. Often, this results in limited optical satellite data availability for the study areas and time-periods pertinent to the studies being conducted. Where a single image is available, this can also be problematic. While a single image characterises the landscape at a snap-shot in time, that snap-shot may not necessarily be representative of the broader seasonal phenological variability in the landscape and vegetation characteristics of that location over the course of a year [28]. This can be problematic when investigating the relationships between the landscape and vegetation characteristics, and small mammal distributions for species where behaviour is highly responsive to the vegetation condition (and, consequentially, food and shelter availability) over relatively short time periods. This could result in inappropriate conclusions being drawn on the influences of the landscape on small mammal distributions. For example, in the case of M. arvalis and A. terrestris in France, absence is typically determined by bare ground in winter as a result of tilling, where populations cannot be sustained due to the lack of vegetation [16,29]. If single-date remote sensing data acquired in the spring or the summer is used to characterise vegetation in these locations, then these areas will, misleadingly, be characterised by cereal crops, which are considered a favourable habitat for these species. The only way to map the permanent presence of favourable habitats and vegetation conditions such as the permanent grassland shown to be key factors for species such as M. arvalis and A. terrestris [16,29] is to use a time-series of data. The importance of seasonal differences in vegetation condition for small mammal studies has also been illustrated by Reference [28], which determined relationships between Ochotona spp. small mammal presence and Enhanced Vegetation Index values to be the strongest in specific periods in the spring and autumn, and weaker in the summer and winter. Similarly, Reference [30] observed that correlations between vegetation index values and deer mouse (a primary host of sin nombre virus) density, and the number of infected deer mice, typically peaked in May or June. This again indicates that biomass levels at certain periods of year are key, and illustrates the importance of incorporating multi-temporal data capturing phenological variability, rather than just single-date imagery, for small mammal-landscape modelling.
Vegetation phenology is typically well defined [31], with the initial leaf emergence typically followed by rapid green-up prior to a stable period of maximum leaf area before senescence, leading to a low-leaf area period. Vegetation in different localities exhibits different and distinctive patterns in this phenological profile [32], with single date imagery alone unable to capture this distinctiveness. Whereas a time-series of optical imagery may typically be the first choice to monitor phenology, this may not be possible in many areas due to persistent cloud cover. In these situations, an alternative approach is to utilise a dense time-series of Synthetic Aperture Radar (SAR) data, using the SAR temporal backscatter profile to characterise vegetation type, via its changing physical structure due to seasonal phenology. Utilising this time-series of SAR imagery in combination with optical imagery where available, enables the generation of targeted land cover maps quantifying the spatial distribution of key land cover types influencing small mammal distributions. Multi-date SAR data has previously been utilised for other land cover mapping activities [33][34][35][36]. However, it has not been employed within an epidemiological context. Optical and Synthetic Aperture Radar (SAR) remote sensing datasets characterise target features in different ways, but deliver complementary information that, if used in combination, generally leads to an increase in land cover mapping accuracy [37]. Multispectral optical remote sensing imagery delivers rich spectral information about the scattered energy (visible and infrared) from the target surface, which enables the discrimination of different land cover classes on the basis of spectral variations in the specific features in question [38]. In contrast, SAR characterises the physical structure of target features, with the differing spatial structure of these features scattering the SAR signal at differing levels of intensity and amplitude. The ground surface texture strongly influences SAR backscatter [39], and the varying characteristics of SAR backscatter can be used to derive information about that target feature. By using SAR-based texture (structural) and optical (spectral) information synergistically to exploit the different physical principles, which they characterise, improved classifications of land cover have been observed [37,40], compared to where either optical or SAR data have been used independently [38]. This has been illustrated for examples such as mapping of winter wheat in China [35], maize, soybeans, and sunflowers in Ukraine [41,42], forestry studies [43][44][45][46], and mapping land cover [47][48][49][50][51][52].
The Sentinel remote sensing satellite series launched as part of the European Space Agency's Copernicus program offers cost-free, broad-area, high temporal frequency coverage. Sentinel-1 offers cloud-penetrating SAR data, which guarantees data collection even in heavily cloud-affected areas. The availability of Sentinel satellite data at 10 m resolution, along with the six-day repeat frequency of Sentinel 1, provide yet more advantages, and enable mapping of key land cover types present in the study areas using the differing phenological characteristics of the vegetation communities and land cover types present [53,54]. By utilising both Sentinel-1 SAR and optical remote sensing data in combination (for example, from Sentinel-2 or Landsat OLI), misclassifications, which commonly occur using single-date imagery due to the spectral similarity of features, should be reduced by using the additional phenological backscatter profile data from the SAR time-series.
In this case, we address gaps in current knowledge of the Em transmission cycle by investigating small mammal-landscape links at spatial and temporal scales not previously achieved. We address the following problems: (1) lack of information of the small mammal distributions, and landscape drivers thereof, of the species involved in the Em transmission cycle, which limits efficient targeting of control measures, (2) cloud-cover inhibiting optical remote sensing-based landscape-small mammal modelling, and (3) inaccuracies resulting from the influence of vegetation phenological dynamics for landscape-small mammal modelling.
We propose to exploit and combine the technological advances of the Sentinel-1 satellites along with novel land cover classification methods to identify landscape drivers determining small mammal distributions, and construct predictive models identifying potential Em transmission foci. Key advancements of this research are, first, to increase our fundamental understanding of the distribution and ecology of the intermediate hosts involved in the Em transmission processes and apply this knowledge for improved disease control targeting, and second, utilise new remote sensing and modelling methodologies for characterisation of small mammal habitats. Specifically, we pose two research questions: (1) Can time-series SAR data generate land cover maps of equivalent quality as optical imagery?
(2) What are the key land cover drivers impacting Em host Ellobius tancrei and Microtus gregalis distributions in the study area?

Study Site
The study area is located close to the town of Sary Mogol, Alay Valley, Kyrgyzstan (39. prevalence rates of 7.1% reported [25]. The Em parasite found in both dogs and voles [55] indicates that a transmission cycle is and has previously been active here. . Transect locations were separated at an average distance of 1.2 km from each other to ensure that spatial autocorrelation was not present [25]. Of these transects, 76 (80%) were completed in the winter pastures and 19 (20%) in more extensive expeditions to summer pastures to specific areas of interest identified in advance of satellite imagery. For each transect, 20 intervals of 10 paces were surveyed with indicators of small mammal activity recorded. Indicators identifiable to species or the genus level including visual sightings of small mammals, foraging corridors, ground holes, and small mammal faeces [1,56] were used as evidence of a small mammal presence using methods established in Reference [57]. Previous studies have performed transect surveys alongside trapping activities and shown good correspondence between activity indices and species with this method established and used widely (e.g., [56][57][58][59]). Relative density scores of small mammal presence (the number of intervals where indices of the small mammal presence were observed) were produced for each transect for each small mammal species present [25]. Transect routes were mapped with hand-held Global Positioning System (GPS) receivers with an accuracy of approximately 15 m. The small mammal surveys focused on two species known to be Em intermediate hosts, E. tancrei and M. gregalis, which are populations that dominate the small mammal community in this area. Previous studies have indicated that, in this region, the population density of E. tancrei increases with grassland productivity while M. gregalis is often found along streams and in marshes [25]. They both exhibit low population densities in the early spring and population peaks in the early autumn, but, . Transect locations were separated at an average distance of 1.2 km from each other to ensure that spatial autocorrelation was not present [25]. Of these transects, 76 (80%) were completed in the winter pastures and 19 (20%) in more extensive expeditions to summer pastures to specific areas of interest identified in advance of satellite imagery. For each transect, 20 intervals of 10 paces were surveyed with indicators of small mammal activity recorded. Indicators identifiable to species or the genus level including visual sightings of small mammals, foraging corridors, ground holes, and small mammal faeces [1,56] were used as evidence of a small mammal presence using methods established in Reference [57]. Previous studies have performed transect surveys alongside trapping activities and shown good correspondence between activity indices and species with this method established and used widely (e.g., [56][57][58][59]). Relative density scores of small mammal presence (the number of intervals where indices of the small mammal presence were observed) were produced for each transect for each small mammal species present [25]. Transect routes were mapped with hand-held Global Positioning System (GPS) receivers with an accuracy of approximately 15 m. The small mammal surveys focused on two species known to be Em intermediate hosts, E. tancrei and M. gregalis, which are populations that dominate the small mammal community in this area. Previous studies have indicated that, in this region, the population density of E. tancrei increases with grassland productivity while M. gregalis is often found along streams and in marshes [25]. They both exhibit low population densities in the early spring and population peaks in the early autumn, but, since they are habitat-specific, their relative distribution between habitats and seasons is the same over multiple years.

Satellite Data
A time-series of Sentinel-1 SAR data was acquired for a 12-month period (to capture the full annual phenological cycle), comprising 28 dates in total (Table 1), and was downloaded from the Copernicus Open Access Hub (https://scihub.copernicus.eu). This 12-month period began on 18 October 2014 as this was the earliest date for which Sentinel-1 data was available after the satellite was launched and was as close as possible to the small mammal survey dates. The SAR data was utilised in both ascending and descending orbits, with the VV polarisation Ground Range Detected (GRD) data product from the Interferometric Wide Swath (IW) acquisition mode used. Pre-processing steps for the SAR data were performed, which included range doppler terrain correction (performed using the Shuttle Radar Topography Mission (SRTM) 3Sec digital elevation model), sigma calibration, and de-speckle procedures applied on an individual SAR image basis using a Lee Sigma filter with a 7 × 7 window size. The data was then sub-set to the study area extent using the SNAP software package. The time series of SAR images were then layer-stacked in order of the imagery acquisition date using ERDAS Imagine to generate a 28-band dataset, which formed the SAR only dataset for analysis. The time-period of this study pre-dated Sentinel-2 imagery availability. Therefore, a 30 m resolution Landsat OLI optical image (acquisition date 22 July 2014) was also acquired and was the closest available cloud and snow-cover free image of the study area to the 2014 small mammal survey data collection period. This was acquired from the EarthExplorer data access portal (https://earthexplorer.usgs.gov) and was downloaded as a level-2 atmospherically corrected surface reflectance data product. Landsat OLI bands 2 (blue, 0.452-0.512 µm), 3 (green, 0.533-0.590 µm), 4 (red, 0.636-0.673 µm), 5 (near infrared, 0.851-0.879 µm), 6 (shortwave infrared, 1.566-1.651 µm), and 7 (shortwave infrared, 2.107-2.294 µm) were used to generate the single-date (OLI only) dataset for analysis. The OLI only imagery and time-series SAR only 28-band dataset were layer-stacked together to generate the combined SAROLI (34-band) dataset using ERDAS Imagine. All data was projected to the Universal Transverse Mercator (UTM) WGS84 zone 43N.

Land Cover Classification
Three land cover classifications were then generated using identical training data to assess their relative abilities to discriminate the land cover classes present using (1) Landsat OLI optical image only, (2) Sentinel-1 SAR time-series data only, (3) Sentinel-1 SAR time-series data, and Landsat OLI optical image in combination. The land cover classifications were performed using a random forest classifier using the R statistical package, and used an eight-class classification nomenclature using the classes: built-up, bare, water, dry grassland, alpine grassland, steppe, bushes, and agriculture (hay and low productivity barley fields). Reference locations of known land cover types were used for training the classification and accuracy assessment, respectively, with field photographs illustrating the different land cover classes within the study area displayed in Figure 3. These data were generated from four sources: (1) field-collected land cover survey locations, (2) reference locations derived from field photos, (3) reference locations derived from very high-resolution Bing aerial and Google Earth imagery, and (4) expert knowledge and direct interpretation of reference locations of clear imagery features (for example, water). The use of higher resolution imagery accessible via public portals such as Google Earth for identification of reference data points is an established technique [38]. For each land cover class, the reference data points were allocated on an alternating basis as training or independent validation locations, which created two equal-sized datasets providing 352 locations for both the training and validation datasets (704 in total) distributed across all land cover classes (built up 7.1%, bare 14.2%, water 10.8%, dry grassland 14.2%, alpine grassland 14.2%, steppe 14.2%, bushes 11.1%, and agriculture 14.2%). Built-up, water and bush classes had lower availability of reference data due to their more restricted coverage in the study area when compared to the other land cover classes. For the training data, training locations were used to generate reference polygons where spectral homogeneity was allowed. Validation data was used to assess the accuracy of the classifications, and to statistically compare the three classifications using McNemar's Test [60]. This is a non-parametric test that assesses, in a statistically rigorous manner, the binary distinction between correct and incorrect class allocations of two classification outputs, and determines whether statistically significant differences in these outputs exist [61].

Land Cover Data Extraction
Since E. multilocularis transmission is largely dependent on small mammal host population distribution and densities through ecological processes linking the landscape to prey/predator relationships [8,27,[62][63][64], determining the appropriate scale at which relevant information on small mammal host population densities can be captured is essential in determining risk areas for Em transmission. Since the home range size of E. tancrei and M. gregalis is not known a-priori, examining the relationships between small mammals and land cover presence over a pre-defined range without a specific basis for doing so could result in misleading results, which do not appropriately identify the relationships present. Instead, the relationships between small mammal relative density scores and surrounding land cover composition is investigated at multiple range sizes, with a series of nested circular buffers centered on the small mammal transect locations created using buffer radii from 50 m to 500 m increasing in 50 m increments. To minimise collinearity between nested land cover area measurements (variables calculated using smaller buffers partly measures the same area as the larger buffers) butretain the nested spatial structure, the buffers were converted to a series of concentric rings with the areas of smaller buffers removed from the larger buffer area within which they are nested. This created a new set of variables Z50 m … Z500 m following the methodology of Rhodes et al. [65] such that:

Land Cover Data Extraction
Since E. multilocularis transmission is largely dependent on small mammal host population distribution and densities through ecological processes linking the landscape to prey/predator relationships [8,27,[62][63][64], determining the appropriate scale at which relevant information on small mammal host population densities can be captured is essential in determining risk areas for Em transmission. Since the home range size of E. tancrei and M. gregalis is not known a-priori, examining the relationships between small mammals and land cover presence over a pre-defined range without a specific basis for doing so could result in misleading results, which do not appropriately identify the relationships present. Instead, the relationships between small mammal relative density scores and surrounding land cover composition is investigated at multiple range sizes, with a series of nested circular buffers centered on the small mammal transect locations created using buffer radii from 50 m to 500 m increasing in 50 m increments. To minimise collinearity between nested land cover area measurements (variables calculated using smaller buffers partly measures the same area as the larger buffers) butretain the nested spatial structure, the buffers were converted to a series of concentric rings with the areas of smaller buffers removed from the larger buffer area within which they are nested. This created a new set of variables Z50 m . . . Z500 m following the methodology of Rhodes et al. [65] such that: Z50 m = X50 m (1) where X50 m . . . X500 m are the land cover class coverage data for the 50 m . . . 500 m buffer sizes, respectively, and the Z100 m . . . Z500 m provide the difference between the original variables and the variable nested within it. For each new variable (Z50 m . . . Z500 m), the area of each of the eight land cover classes present was calculated). This produced a total of 80 land cover variables with 10 area measurements each containing 8 land cover classes. The nested land cover area variables were modelled using random forests in its regression form against the relative density scores for E. tancrei and M. gregalis, respectively, to identify the relative importance of the area coverage of the different land cover types at the different nested buffer sizes. Random forests (RF) are ideally suited for this analysis, and are an ensemble learning technique developed by Breiman [66] based on a large set of classification and regression trees. They are well-suited to complex, non-linear ecological datasets [67] handling large datasets with correlated predictor variables [68], are non-parametric [69], handle a variety of data types [70], make no assumption of independence concerning the data being analysed [71], and are robust to outliers, noise, and over-fitting [66]. Their use in ecology is becoming more widespread [72], including studies of landscape dynamic influences on parasite hosts [14,28].
Random forest analysis was performed six times, using the land cover maps derived from the combined SAROLI data, the SAR only data, and the OLI data only, for E. tancrei and M. gregalis, respectively. To achieve the final six random forest models, stepwise removal of the initial 80 nested land cover variables was performed, using the %IncMSE (percentage increase in mean squared error (MSE)) values produced for each variable. Variables with a negative %IncMSE value (i.e., their removal from the random forest would reduce the overall MSE) were removed with the RF then re-run with a reduced set of variables. This process was repeated iteratively, until all variables remaining in the random forest showed positive %IncMSE values (i.e., the MSE of the RF would increase if that variable was removed). No specific splitting rules or pruning methods were defined for the random forest analysis. Random forests also generate variable importance measures identifying the respective influence of the explanatory variables on the response variable (small mammal relative density) and allow the production of partial dependence plots for each variable, which enables further examination of the nature of the relationships present.
It is also possible to examine the statistical significance of the relative importance values of the predictor variables using a permutation-based random forest approach. This works by constructing a large number of random forest models from an identical dataset, and permuting the response variable to obtain the probability distributions of the relative importance measures of the respective predictor variables [73] under the null hypothesis of no relationship between the predictor and response variables. p-values for each predictor variable are also generated, which enables assessment of whether the observed relationships are statistically significant. The random forest analysis performed in this case built on methods previously applied in Reference [74] and was performed in R using the 'randomForests' package [75]. 2000 permutations were run for the random forest permutation analysis.
To enable the predictive application of the developed random forests across the study area, a regular point grid with 50 m spacing between points was created across the full extent of the study area. Around each point, nested buffers from 50 m to 500 m in 50 m increments were created, and the proportion of land cover class coverage calculated in an identical manner was conducted for the original survey transect locations. All points where the corresponding 500 m radius buffer extended beyond the classified area were disregarded from further analysis due to their incomplete coverage. This was performed for the SAROLI, SAR only, and OLI only generated land cover maps, respectively. The six random forest models were then applied predictively to generate a predictive relative density values for E. tancrei and M. gregalis for the land cover maps derived from the SAROLI, SAR-only, and OLI-only data set.
Lastly, to determine which buffer size is optimal for assessing the land cover class-small mammal relative density relationships, Pearson's Product Moment Correlations, p-values, and statistical significance are calculated individually for each land cover class at each nested buffer size for all three land cover classifications for both E. tancrei and M. gregalis.

Results
Three land cover classifications were produced using (1) Landsat OLI single-date optical imagery only, (2) a time-series of Sentinel-1 SAR data only, and (3) a combination of the single-date Landsat OLI image and time-series Sentinel-1 SAR data ( Figure 4). Accuracy assessments of the three classifications was performed using identical validation data with classification accuracies of 88.92% (Landsat OLI only), 90.91% (SAR only), and 94.6% (combined Landsat OLI and SAR), respectively. Individual class user's and producer's accuracies are presented in Table 2, with full error matrices presented in Supplementary Information Tables S1-S3. McNemar's test was performed between classification pairs (SAROLI and SAR-only, SAROLI and OLI-only, and SAR-only and OLI-only). Results examining the overall SAROLI and SAR-only classification accuracies produce a p-value of 0.037, which indicates that there is a statistically significant difference in classification accuracy between these two classifications at the 95% confidence interval. Likewise, a McNemar's test p-value of 0.003 for the SAROLI and OLI-only classifications also identifies a similar statistically significant difference in classification accuracy at the 95% confidence interval (refer to Supplementary Information Table S4). When taken in context with the respective classification accuracy figures, we can say that the SAROLI-based land cover classification offers a statistically significant improvement in classification accuracy over the SAR-only and OLI-only classifications. The McNemar's test results comparing SAR-only and OLI-only derived land cover classifications produced a p-value of 0.470, which confirm no statistically significant difference in classification accuracy between these two classifications. Therefore, where both SAR and optical remote sensing data are available, the best results are achieved by using both datasets in combination to generate land cover classification. Where optical remote sensing data is not available due to factors such as persistent cloud cover, McNemar's test results show that SAR-only based classifications perform as well as single-date Landsat OLI derived classifications, which produce similar classification accuracies (SAR-only = 90.91%, OLI-only = 88.92%). This offers a viable alternative method of land cover characterisation based on vegetation phenology where cloud cover prevents the use of optical imagery. statistical significance are calculated individually for each land cover class at each nested buffer size for all three land cover classifications for both E. tancrei and M. gregalis.

Results
Three land cover classifications were produced using (1) Landsat OLI single-date optical imagery only, (2) a time-series of Sentinel-1 SAR data only, and (3) a combination of the single-date Landsat OLI image and time-series Sentinel-1 SAR data (Figure 4). Accuracy assessments of the three classifications was performed using identical validation data with classification accuracies of 88.92% (Landsat OLI only), 90.91% (SAR only), and 94.6% (combined Landsat OLI and SAR), respectively. Individual class user's and producer's accuracies are presented in Table 2, with full error matrices presented in supplementary information in supplementary information Tables S1-S3.  Random forest analysis was then performed to rank the relative importance values of the presence of the different land cover types, at the different nested buffer sizes, in relation to the relative density scores of E. tancrei and M. gregalis. The relative density values across all transects for Ellobius tancrei are mean = 0.75, median = 0, minimum = 0, maximum = 10, standard deviation = 1.91, and for Microtus gregalis, the density values are mean = 1.95, median = 0, minimum = 0, maximum = 19, and standard deviation = 3.27. The random forest analysis for E. tancrei consistently shows agriculture for multiple nested buffer sizes to be the most important variable in influencing the relative density scores, with these agriculture values (with the exception of 350 m and 450 m buffers for SAROLI, and 400 m, 450 m, and 500 m for SAR-only) shown to be statistically significant by the random forest permutation analysis (Table 3)  Partial dependence plots were generated for the statistically significant variables in relation to E. tancrei relative density scores, as identified by the random forest permutation analysis to illustrate the nature of the relationships present. For brevity, only the SAROLI partial dependence plots are presented here, with the SAR-only and OLI-only plots displayed in Supplementary Information. For the SAROLI classification, Figure 5 consistently shows that E. tancrei relative density scores increase as the area of agricultural land present increases. This clearly illustrates that E. tancrei relative density scores are positively related to increasing areas of agricultural land. Similar patterns are observed for the SAR-only derived land cover classification of RF partial dependence plots (Supplementary Information Figure S1), and the OLI-only derived land cover classification partial dependence plots (Supplementary Information Figure S2), which demonstrate that this pattern is consistently observed for the land cover classifications derived using three different input datasets. Remote Sens. 2018, 10, x FOR PEER REVIEW 14 of 26  For M. gregalis, the nested land cover class variables ranked as being of highest importance were more variable (Table 4). However, SAROLI-only bushes (50 m, 100 m, and 300 m) were shown to be statistically significant. For SAR-only, only agriculture at 400 m, water at 200 m, and bushes at 50 m were statistically significant. For OLI-only, only bushes at 300 m was statistically significant. Partial dependence plots were again generated for the variables identified as being statistically significant in the random forest permutation analysis. For the SAROLI-derived land cover classification, the general trend was for increased M. gregalis relative density scores as bushes at 100 m and 300 m increased. This is indicative of a preference of M. gregalis to inhabit areas with higher levels of bushes present ( Figure 6). This pattern was also reflected in the SAR-only derived land cover classification of partial dependence plots, with increasing M. gregalis relative density scores associated with increasing bushes at 50 m, and also with increasing water at 200 m (bushes are typically found in close proximity to water at this study site). Conversely, a negative relationship is observed with agriculture, with higher M. gregalis relative density scores where agriculture is absent, and with these scores rapidly decreasing to low levels with increasing agriculture presence. For the OLI-only derived land cover classification partial dependence plots, only bushes at 300 m were shown to be statistically significant with a similar pattern of increasing M. gregalis relative density scores and with an increasing bush presence observed, as was seen with the SAROLI and SAR-only partial dependence plots. Predictive application of the random forest models illustrated the predicted relative density of E. tancrei and M. gregalis across the broader study area. Predicted E. tancrei distributions based on SAROLI, SAR-only, and OLI-only random forests show very similar predicted patterns of relative density (Figure 7). Predicted densities vary considerably across the study area with clusters of high predicted relative density shown to occur to the north and center of the study area. These patterns are consistent across all three E. tancrei land cover classifications, with high relative density clusters consistent with areas of agriculture presence. This is the case in close proximity to the village of Sary Mogol, which indicates that high densities of E. tancrei are likely to be present in close proximity to human settlement. Field surveyed relative density data for E. tancrei is overlaid on the predicted densities, which enables a comparison of predicted and observed densities. Although a small number of areas predicted low relative densities of E. tancrei where higher densities were observed during the field survey (towards the centre of the study area in Figure 7), the areas predicted to have high relative densities by the random forests were where higher densities were observed in the field survey data. This is especially the case to the north of the study area. Predictive application of the random forest models illustrated the predicted relative density of E. tancrei and M. gregalis across the broader study area. Predicted E. tancrei distributions based on SAROLI, SAR-only, and OLI-only random forests show very similar predicted patterns of relative density (Figure 7). Predicted densities vary considerably across the study area with clusters of high predicted relative density shown to occur to the north and center of the study area. These patterns are consistent across all three E. tancrei land cover classifications, with high relative density clusters consistent with areas of agriculture presence. This is the case in close proximity to the village of Sary Mogol, which indicates that high densities of E. tancrei are likely to be present in close proximity to human settlement. Field surveyed relative density data for E. tancrei is overlaid on the predicted densities, which enables a comparison of predicted and observed densities. Although a small number of areas predicted low relative densities of E. tancrei where higher densities were observed during the field survey (towards the centre of the study area in Figure 7), the areas predicted to have high For M. gregalis, all three RF predictions consistently identified the same area along a river towards the north of the study area as being of high predicted relative density (Figure 8). This corresponds to the area where the principle area of bushes (shown to be significant by the random forest analysis) was present, and also close to water, which was identified as important in the SAR-only RF analysis. Generally, across the remainder of the study area, predicted M. gregalis relative density was low even though the SAR-only RF did predict areas of medium relative density to the west, south, and east of the study area that were not predicted by the SAROLI and SAR-only RF modelling. All three input classifications did, however, identify mostly consistent hot-spots of higher predicted relative density across the study area. When comparing the predicted and observed relative densities for M. gregalis (Figure 8), the patterns were less consistent even though areas along the river where the highest relative densities were predicted also saw the highest observed densities during the field survey. Likewise, the area in the center of the study area was predicted to be of medium-to-high density and had also observed higher densities during the field survey.  To determine the optimal buffer size for assessing the land cover class, small mammal relative density relationships, Pearson's Product Moment Correlations, p-values, and statistical significance are calculated individually for each land cover class at each nested buffer size for all three land cover To determine the optimal buffer size for assessing the land cover class, small mammal relative density relationships, Pearson's Product Moment Correlations, p-values, and statistical significance are calculated individually for each land cover class at each nested buffer size for all three land cover classifications, for both E. tancrei and M. gregalis. Full results are presented in Supplementary  Information Tables S5-S7. When assessing the statistically significant correlations for E. tancrei, agriculture at the 150 m buffer radius size was shown to have the highest correlations with relative density for all three land cover classifications (SAROLI = 0.451, p-value ≤ 0.001, SAR-only = 0.479, p-value ≤ 0.001, OLI-only = 0.529, p-value ≤ 0.001), which indicates that a buffer size of 150 m can be considered optimal for detecting E. tancrei correlations. For M. gregalis, bushes at the 50 m buffer size showed the highest correlations with relative density for both the SAROLI (0.585, p-value ≤ 0.001) and the SAR-only (0.487, p-value ≤ 0.001) classifications. Although for the OLI-only classification, the bushes at the 300 m buffer size showed the highest correlation (0.504, p-value ≤ 0.001), the 50 m buffer size also exhibited a statistically significant correlation of 0.420 (p-value ≤ 0.001). Therefore, generally, a buffer size of 50 m can be considered optimum for assessing the correlations between M. gregalis relative density and land cover presence.

Discussion
This research examined a critical phase of the Echinococcus multilocularis (Em) transmission cycle, and adopted an analytical approach using RF to model and predict E. tancrei and M. gregalis small mammal presence in relation to landscape characteristics within a highly endemic Em area in Sary Mogol, Alay Valley, Kyrgyzstan. This approach successfully identified the key land cover types driving small mammal distributions and enabled the prediction of small mammal distributions based on land cover maps derived from combined SAR time-series and optical OLI, time-series SAR-only, and single-date optical OLI-only datasets.
Our first question 'Can time-series SAR data generate land cover maps of equivalent quality as optical imagery?' can be answered from our analysis. McNemar's test results showed no statistically significant difference between classification accuracies indicating that the performance of both time-series SAR data (90.91% accuracy) and single-date Landsat OLI imagery (88.92% accuracy) in generating land cover maps using identical training data are similar. However, in this study, the synergistic use of time-series SAR and Landsat OLI data in combination has produced statistically improved land cover classifications over each dataset used independently by using an identical training dataset. This shows that, where data availability allows, the combination of optical imagery (capturing spectral variability) and time-series SAR data (capturing structural variability and its dynamics resulting from seasonal vegetation phenology) is the best option for producing high accuracy land cover maps. Similar results have been observed in previous studies, which found higher classification accuracies when using SAR and optical data in combination [35,37,40,42,44,50] rather than using each dataset individually. These improved classification accuracies are due to the respective capabilities of optical and SAR datasets, which detect different physical properties of land cover features, provide complementary information across the electromagnetic spectrum, and compensate for limitations of using either sensor alone [40]. This is particularly relevant for the agriculture class identified as being of key importance, with its specific SAR backscatter profile exhibited over the growing cycle (based on sowing, growth, and harvesting of crops and ploughing of fields) differing from the typical phenological growth and senescence cycle of other vegetation types such as grassland. This provides additional information on which to classify this class.
Although the three land cover classifications generated are broadly similar, differences did exist in the distributions of the classified land cover types due to the way in which SAR and optical sensors characterise target features. The two key land cover classes of agriculture and bushes did vary in both producer's and user's accuracies for each of the SAROLI, SAR-only, and OLI-only classifications, although accuracy levels were generally high. For agriculture, producer's accuracies (98.0% (SAROLI), 88.0% (SAR-only), and 96% (OLI-only)) and user's accuracies (98.0% (SAROLI), 91.7% (SAR-only), and 82.8% (OLI-only)) showed that this class was classified well using all three input datasets. Likewise, bushes producer's accuracies (89.7% (SAROLI), 94.9% (SAR-only), and 61.5% (OLI-only)) and user's accuracies (97.2% (SAROLI), 97.4% (SAR-only), and 92.3% (OLI-only)) showed that this class was being classified well, with the exception of producer's accuracy for OLI, which performed less well.
Although there were some differences in the classification accuracies and spatial distributions of patches of these land cover classes, the random forest analysis consistently identified these classes of being of the highest importance in relation to E. tancrei and M. gregalis relative density.
Although there is some variation in the generated land cover maps and derived predicted E. tancrei and M. gregalis relative density maps, this study illustrates that where optical data is not available due to factors such as persistent cloud cover (a common problem in many parts of the world including Em endemic areas), time-series SAR data is shown to also produce high quality land cover classifications. Whereas, in this example, SAR-only could not achieve as high a classification accuracy as SAROLI, it did achieve classification accuracy similar to that of the single-date OLI data derived classification. This offers new opportunities for land cover mapping in cloud-affected areas, with the ability of SAR data to penetrate cloud offering guaranteed data available from Sentinel-1. This provides an improved potential for characterising phenological variability to strengthen land cover classifications. This variability cannot be captured from single-date imagery, which potentially results in misleading conclusions being drawn on landscape modelling results derived from snap-shot imagery.
Our second question 'What are the key land cover drivers impacting Em host Ellobius tancrei and Microtus gregalis distributions in the study area?' can again be answered by our analysis. E. tancrei distributions are consistently shown by all three land cover classifications to be directly related to agriculture, with increased E. tancrei relative density values demonstrated to be related to an increasing presence of this class. This relationship was shown to be statistically significant by the random forest permutation analysis, with agriculture consistently identified as variables of high importance at various nested buffer sizes by the random forest importance plots. When applied predictively across the study area, all three E. tancrei predicted models identified the same areas as being of high relative density. The models identified these areas as likely hot-spots for active Em transmission foci due to the denser presence of intermediate hosts (Figure 7). Since the predicted random forest relative density maps all identified similar areas as the highest E. tancrei relative density, this further supports the suitability of applying SAR-only datasets for generating land cover classifications for small mammal distribution modelling in this region. This shows that similar results can be achieved by using optical remote sensing imagery, while overcoming the challenges of persistent cloud cover and phenological vegetation dynamics.
For Microtus gregalis, the results were more varied. However, increasing the presence of bushes was consistently identified as a key variable resulting in higher M. gregalis relative density, which is both statistically significant and ranked as of high importance by the random forest analysis. Water was also identified as statistically significant in influencing M. gregalis relative density, which is consistent with the relationship observed with bushes since most areas of higher presence of bushes were found along a river towards the north of the image. This was reflected in the predicted random forest M. gregalis relative density map, which consistently identified areas along the river toward the north of the study area as being areas of higher relative density than the remainder of the study area ( Figure 8).
This study also determined, based on the highest statistically significant correlations between relative density and the land cover class presence, the optimum buffer sizes for assessing the relationships between E. tancrei and M. gregalis and the landscape. Results showed that a 150 m buffer size was optimal for assessing E. tancrei relative density and land cover presence. A 50 m buffer size is generally optimal for M. gregalis. This is important for future research conducted on the spatial distributions of these species, and the design and improvement of sampling strategies for small mammal distribution data collection. As shown in the results, the influence of different land cover types is also dependent on buffer size. Previously, in this region, there was no prior knowledge about small mammal home ranges or the relevant resolutions at which landscape analysis should be performed to capture relevant information for modelling small mammal distributions. Previous articles have supported the idea that small mammal population distributions and dynamics are responsive to landscape configuration, and that these are scale-dependent [17,76]. Identifying the appropriate scales on which to conduct these activities is critical for planning sampling strategies for further studies and to implement small mammal monitoring since the resolution chosen must be optimal. The multiple nested buffer methods presented here are appropriate for determining the optimal buffer size to investigate small mammal species distribution in a given area. This is a critical point for designing the resolution of risk maps by taking into account small mammal host species distribution and their optimal habitats. However, our results should be considered as preliminary since grassland small mammals can present large inter-annual variations of population density [77]. This might lead to variations in landscape patch occupancy that cannot be detected in short-term studies and calls for further research.
While the results presented illustrate the potential of these methods, it is also necessary to consider the limitations of this approach for small mammal distribution mapping. Although the acquisition dates of the satellite data used was selected to provide a full annual time-series as close as possible to the field survey dates, there was a temporal mismatch between field data collection and imagery acquisition dates. This was unavoidable in this case as the Sentinel-1 satellite was launched in April 2014. However, it is acknowledged that, as some small mammal species experience population cycles, population densities may differ between the field survey and imagery acquisition periods. These small mammals follow "classical" population dynamics with seasonal variations. Due to mortality and no reproduction (bad conditions in winter), a lower density of old animals having overwintered is observed in early spring (April-May), with the population then increasing until September-October due to breeding. Despite these seasonal changes, the distribution of each species is habitat-specific and spill-over to unfavourable habitats is negligible. This means the general pattern of their relative distribution between habitats is constant whatever the season. This is supported by the random forest analysis, which consistently identified the same land cover classes as key in influencing small mammal distributions, although the relative density of the two-species studied may have differed during the intervening period.
Misclassification is also observed within the three land cover classifications and is quantified in the accuracy assessment performed. Although the overall accuracy of the land cover classifications was good at 94.6% (combined Landsat OLI and SAR), 90.91% (SAR-only) and 88.92% (OLI-only), respectively, class accuracies did differ. Agriculture class accuracies did vary between the three land cover classifications with user's accuracy (UA) = 98.00%, producer's accuracy (PA) = 98.00% for SAROLI, UA = 91.67%, PA = 88.00% for SAR-only, UA = 82.76%, and PA = 96.00% for OLI-only classifications. Although generally high, there were differences in classification accuracy and this will, in turn, impact the resultant modelling and predicted small mammal relative density maps, which explains why there are some differences in the three predicted models for each species. Similarly, bushes also showed variability in class accuracy with UA= 97.22% and PA = 89.74% for SAROLI, UA = 97.37%, PA = 94.87% for SAR-only, and UA = 92.31%, and PA = 61.54% for OLI-only. It is likely that it is the misclassification of bushes that produced the inconsistent predicted relative density of M. gregalis toward the left of the image in Figure 8c and illustrates that, although the modelling approaches employed here have considerable potential, they are still limited by the quality of the land cover classifications on which the modelling is built. Future work should also investigate the influence of frequency of SAR data acquisition over the annual growing cycle (and how this affects its ability to capture phenological change), and ascending or descending orbital direction to establish their impacts of this classification accuracy.
Additionally, due to the sparsity of historical small mammal survey data it was not possible to validate the predictive distribution maps. This is a current weakness, and although the land cover classifications generated were accuracy assessed, it is important for future application of these methods that validation of the predicted models is also performed including additional data collection if required. However, when the relative density scores from the field survey data were overlaid on the predicted relative densities for visual comparison, the general pattern was that higher observed densities were generally in the areas where the random forest predicted densities were also high although there are some exceptions to this. A comparison can still be drawn between the predictive models generated from the three land cover maps using identical small mammal distribution data, with the primary outcome of this research being the comparison of the use of optical and radar data to map habitats favourable to the small mammal disease transmission vectors, and to propose an alternative to optical images for regions where cloud cover is persistently high.

Conclusions
The results of this study, both in confirming the ability of SAR-only time series remote sensing data to generate land cover classifications of equivalent quality as single-date optical imagery, and the ability to use random forests to identify key small mammal-landscape relationships and predictively apply these models over broader study areas, holds significant potential for Em disease control activities. These techniques, along with the availability of cost-free high temporal frequency and broad geographical coverage of Sentinel-1 SAR data, can be used to identify areas of high relative density of the small mammal species, which act as intermediate hosts in the Em transmission cycle. In turn, this could aid the improved geographical targeting of pre-emptive disease control measures, including targeted treatment of dogs with anti-helminthic drugs to disrupt the Em transmission cycle in that region. Therefore, this would reduce Em infection risk in local human populations. While this study shows that the best results can be achieved by using time-series SAR data and optical imagery in combination, for the first time, we offer a platform to conduct predictive modelling of small mammal distributions based on land cover distributions without the need for optical imagery. This overcomes a major challenge in many Em endemic regions, where persistent cloud cover inhibits the acquisition of good quality optical imagery. By providing new opportunities to incorporate remote sensing as a core element in disease control strategies in Em endemic areas where this was not previously possible, improved geographical targeting of limited disease control resources could contribute strongly to the disruption of Em transmission, and alleviate societal disease burdens in these regions.

Supplementary Materials:
The following are available online at http://www.mdpi.com/2072-4292/11/1/39/s1, Table S1. Confusion matrix for the SAR and OLI combined classification. Table S2. Confusion matrix for the SAR-only classification. Table S3. Confusion matrix for the OLI-only classification. Table S4 Table S5. Pearson product-moment correlation, p-values, and statistical significance for E. tancrei and M. gregalis in relation to individual land cover class presence for the 50 m to 500 m nested buffers derived from the SAROLI land cover classification. Table S6. Pearson product-moment correlation, p-values, and statistical significance for E. tancrei and M. gregalis in relation to individual land cover class presence for the 50 m to 500 m nested buffers derived from the SAR-only land cover classification. Table S7. Pearson product-moment correlation, p-values, and statistical significance for E. tancrei and M. gregalis in relation to individual land cover class presence for the 50 m to 500 m nested buffers derived from the OLI-only land cover classification. Figure S1. Random forest partial dependence plots for the statistically significant (as shown by random forest permutation analysis) nested land cover variables derived from the SAR-only land cover classification, and E. tancrei relative density. Figure S2. Random forest partial dependence plots for the statistically significant (as shown by random forest permutation analysis) nested land cover variables derived from the OLI-only land cover classification, and E. tancrei relative density.