Exploratory Analysis of Dengue Fever Niche Variables within the Río Magdalena Watershed

Previous research on Dengue Fever have involved laboratory tests or study areas with less diverse temperature and elevation ranges than is found in Colombia; therefore, preliminary research was needed to identify location specific attributes of Dengue Fever transmission. Environmental variables derived from the Moderate Resolution Imaging Spectroradiometer (MODIS) and Tropical Rainfall Measuring Mission (TRMM) satellites were combined with population variables to be statistically compared against reported cases of Dengue Fever in the Río Magdalena watershed, Colombia. Three-factor analysis models were investigated to analyze variable patterns, including a population, population density, and empirical Bayesian estimation model. Results identified varying levels of Dengue Fever transmission risk, and environmental characteristics which support, and advance, the research literature. Multiple temperature metrics, elevation, and vegetation composition were among the more contributory variables found to identify future potential outbreak locations.


Introduction
The resurgence of tropical diseases in the Americas, including Dengue Fever (DF), may be related to the suspension of vector eradication programs, such as those conducted by the Pan American Health Organization (PAHO) programs until the 1970s [1].Implementation of new control programs require ongoing data collection.Cost limitation of in situ data collection has increased the use of remote sensing (RS) and Geographic Information Science (GIS) applications to estimate disease spread and population vulnerability [2,3].Public health agencies can use such tools to implement monitoring and surveillance programs for multiple purposes, including the estimation of vector abundance, implement prevention strategies, or to control for further dispersion of the disease and vector [4,5].Additionally, local metrics and research are needed to improve current vector disease monitoring practices before cost effective predictive or forewarning systems can be studied or implemented [6][7][8][9].Therefore, the goal of this study was to investigate the variables which would provide the most benefit to future investigations of predictive modeling.The outputs of this study identified variables capable of inclusion in such studies, but the statistical methods used here are not designed for predictive purposes.
One of the more commonly implemented remote sensing (RS) tools for vector research has been the Moderate Resolution Imaging Spectroradiometer (MODIS).This dual satellite platform, Aqua and Terra, of NASA's Earth Observation System maintains a sun-synchronous orbit and is capable of collecting data over the entire Earth multiple times a day [10].The high temporal resolution reduces the sensor's spatial capabilities, 250 m for multispectral data and 1 km for land surface temperature (LST), but its ability to collect a variety of spectral wavelengths daily has proven useful for a variety of environmental assessments [11].The satellite has already been implemented for a variety of studies.Fuller et al. [12] demonstrated how predicative modeling of DF can be achieved through temperature records and remote sensing products.Vector modeling could therefore be implemented into current autonomous monitoring systems, such as MODIS's Famine Early Warning Systems Network (FEWS NET [13]) [2,12,[14][15][16].NASA scientists have developed such a Malaria transmission model though the Goddard Space Flight Center's Health Planet program, which incorporates input variables for parasites, hosts, vectors, and environmental and human factors [4].Similar vector warning products could be developed to improve disease dispersion monitoring practices for additional diseases [2,12].
The Aedes aegypti mosquito vector is capable of spreading a variety of diseases, including: Yellow Fever, Chikungunya, Zika, and DF.Disease transmission rates of this vector are considered to be influenced by both cultural and climatic conditions.Although commonly associated with tropical and subtropical climates, due to the habitable zone of the mosquito, changing climate conditions may expand the dispersion potential of these diseases [1,[17][18][19].This demonstrates why improved methods to identify habitable zones of this vector will become increasingly important to improve future disease mitigation and prevention practices.
Since no vaccine for DF is currently available, mosquito control practices are critical to reduce infection [1].Identification of geographical areas to reduce mosquito populations and contact between the vector and people are essential to attenuate potential outbreaks.Previous research has identified flower pots, water storage containers, impervious surface catchments, and other anthropogenic water sources as opportunities for Aedes aegypti larval development [1,6].These features are common to domesticated or developed areas, perpetuating the likelihood of the Aedes aegypti mosquito to be found in urban/populated environments, and increasing the opportunity for mature specimens to feed on human populations to disperse the virus [4,6,20].
The vector's habitat and disease transmission potential are largely set by given meteorological and elevation boundaries [3,21].Identification of these vector boundaries, and the pathogen it carries, is known as detection of the zoonotic niche [22][23][24][25].The variables defining this niche can be estimated through RS data to derive the geographic extent of the mosquito's habitat [3], and could become increasingly important to accurately monitor future transmission risk of DF.Similar habitat and dispersion studies are already being conducted for this and other vector diseases, including West Nile Virus and Malaria [2,4,7].The focus of this manuscript is the DF virus, although some of the incorporated variables and discussion directly address the Aedes aegypti mosquito vector, which we assume the vector dispersion to be associated to the disease spread.
A study of DF in Singapore described an increase of 22%-184% in reported cases given a 2-10 • C rise in maximum or minimum temperature [8].Additional laboratory and field research in Cali, Colombia have also demonstrated a strong relationship between DF dispersion and temperature range [26,27].Under lab conditions, Brady et al. [28] reported low survival of the mosquito vector below 14-15 • C or above 35 • C [29].Rueda et al. [30] similarly saw survival decline outside the 15-34 • C temperature range, but also documented an inverse relationship between body size and increasing temperature [30].These temperature limits may also impact the latitude and altitude of the mosquito's habitat [21].In Mexico and Colombia, the mosquito has been reported to be abundant up to 1700 [18,31] and 1800 m [32], respectively.However, more recently it was reported present (although not abundant) at 2130 m in Central Mexico [18] and 2300 m in Colombia [33].Cultural parameters can involve the presence of anthropogenic water storing activities, which may be responsible for up to 79% of the reported cases [3,8].Water storage tanks, potted plants, and similar peridomestic related containers can provide larval rearing sites proximal to populated areas.
Exploratory factor analysis has previously been used to investigate the relationship between a dependent variable and potential contributing, independent variables [34,35].Principal component analysis (PCA) is a factor analysis method used to identify trends within a multivariate dataset to reduce or derive new, efficient, component variables [5,36].Independent variables are plotted against the dependent variable during the PCA to identify linear relationships, or clusters, to build the component variable relationships.The amount of the original dataset variability a component variable encompasses is demonstrated by its eigenvalue; the larger the eigenvalues, the stronger the composite component.Composite variables with an eigenvalue of one (1), or greater, are classified as principal components according to the Kaiser Criteria as they describe at least as much variability as any one of the original variables [35,37].Components which do not reach the Kaiser Criteria can be considered 'noise' and not implemented in future stages; therefore, fewer variables are used to describe the majority of the data [37,38].These methods have been shown to reduce the amount of data used and identify trends in the dataset [8,36].Other studies have used regression techniques, including the geographically weighted regression and auto-regression, methods for virus modeling [8,21,39], and boosted regression trees [22][23][24][25].These methods can be computationally intense and difficult to implement.Although they were chosen for their respective studies due to their predictive ability, they sometimes incur complications with non-normalized data, such as that used in this study.Since the main goal of the present study was not prediction, but data reduction to identify strong determinant variables, the PCA methodology was identified to be more appropriate for this purpose [38,40].
This project was designed as an exploratory analysis to identify variables which may explain the annual incidence of DF within Colombia's Magdalena River watershed during the 2012-2014 study period.This study site is characterized by climate conditions ranging from sea level to permanent snow, and from desert to tropical rainforest.The wide range of climate conditions, which exceeds the presence of the upper and lower climate limits of the Aedes aegypti, makes this an isolated and ideal study site for this type of exploratory analysis.This methodology was intended to investigate whether the study site supports habitat and transmission information obtained about the vector during previous laboratory and other geographical area studies.Although this PCA analysis has a descriptive, rather than predictive, connotation, its results are aimed to guide future research projects intended to develop predictive models or early warning systems for the disease.

Study Site
Analysis was conducted in the Río Magdalena watershed in the South American country of Colombia.The approximately 273,000 km 2 area is bounded by a split in the Andes Mountain range on three sides and the Caribbean Sea coast to the north, as can be seen in the site map in Figure 1.The 2014 projected population of the watershed, including the capital city of Bogota, was calculated to be over thirty-six million.The watershed is in close proximity to the equator, indicating a warmer climate, and well within the currently-documented latitude range of the Aedes aegypti mosquito [8].The elevation of the watershed's mountain boundaries should naturally isolate the Aedes aegypti vector from external areas, providing a study area with limited external influence [41,42].Due to its location on the western side of South America, it is also influenced by El Niño/La Niña seasonal variations.DF has been present in this region for some time.Reports indicate the disease will continue to be an issue for the region, and may potentially reach epidemic proportions [32].

Population Data
Population variables incorporated into the analysis included socioeconomic data collected during the 2005 General Census administered by El Departamento Administrativo Nacional de Estadística (The National Administrative Department of Statistics).Projected populations for 2012-2014 were created by the data's proprietors, a sample of the data can be viewed in Table 1, and consisted of tabulated total population, rural and urban populations distinguished by gender [43].Population variables were collected at the municipality resolution, which became the assessment size for this study.Colombian municipality boundaries were distinguished through an open access Esri country shapefile [44], which was clipped to the watershed boundary.Projected population levels and incidence rates for the three study periods can be identified in the following graph.Since the disease is well established, these numbers do not reflect prevalence, but basic annual infection rates based on the population size.

Remote Sensing Data
Environmental variables were obtained from the U.S. Geological Survey's (USGS) Earth Explorer [45] and NASA [46] open access databases [47].Daily MODIS data was collected using an automated MS-DOS batch file operation covering 2012-2014, independently.Due to the high volume of files, a Python 2.7 [48] script was composed for ArcGIS 10.3's [44] model builder to automatically process the satellite imagery to use a consistent projection file (WGS 1984 UTM Zone 18N), mosaic, and apply correction algorithms as the products required.Rather than use the eight-day MODIS composite product available through the database, the daily imagery was composited into seven day

Population Data
Population variables incorporated into the analysis included socioeconomic data collected during the 2005 General Census administered by El Departamento Administrativo Nacional de Estadística (The National Administrative Department of Statistics).Projected populations for 2012-2014 were created by the data's proprietors, a sample of the data can be viewed in Table 1, and consisted of tabulated total population, rural and urban populations distinguished by gender [43].Population variables were collected at the municipality resolution, which became the assessment size for this study.Colombian municipality boundaries were distinguished through an open access Esri country shapefile [44], which was clipped to the watershed boundary.Projected population levels and incidence rates for the three study periods can be identified in the following graph.Since the disease is well established, these numbers do not reflect prevalence, but basic annual infection rates based on the population size.

Remote Sensing Data
Environmental variables were obtained from the U.S. Geological Survey's (USGS) Earth Explorer [45] and NASA [46] open access databases [47].Daily MODIS data was collected using an automated MS-DOS batch file operation covering 2012-2014, independently.Due to the high volume of files, a Python 2.7 [48] script was composed for ArcGIS 10.3's [44] model builder to automatically process the satellite imagery to use a consistent projection file (WGS 1984 UTM Zone 18N), mosaic, and apply correction algorithms as the products required.Rather than use the eight-day MODIS composite product available through the database, the daily imagery was composited into seven day composite raster files to coincide with the weekly calendar year.The composite data was necessary to reduce the impact of cloud cover [11].All imagery was clipped to the watershed shapefile.
ArcMap's raster calculator tool [44] was used to convert MODIS MYD11A1 composited data to LST through additional scripting.Both daytime and nighttime thermal wavelength raster cells were multiplied by a gain factor of 0.02, then converted to Celsius by subtracting 273.15 [49].Both day and night temperatures were included in this study to investigate previously reported low temperature [3,21] and high/low temperature range [26][27][28] impact on vector survivability.Another MODIS product, the MYD09GQ, was used to calculate the two-band enhanced vegetation index (EVI) (2.5 × [(NIR − Red)/(NIR + Red + 1)]) [49].The two-band EVI was used due to the MYD09GQ's 250 m spatial resolution, the prevalence of high biomass/cloud cover in the study region, the index's ability to differentiate vegetation canopy from background noise, and its proven usefulness in previous DF research [12,50].
Precipitation data was collected through the Tropical Rainfall Measuring Mission (TRMM); a joint venture between NASA and the Japanese Aerospace Exploration Agency launched in 1997 [51].TRMM's daily temporal resolution has proved useful for vector research [8].One of TRMM's three onboard sensors, the Visible Infrared Scanner (VIRS), provided the 3B42 V7 dataset, which reports global precipitation estimates at a resolution of 5 km.Although the TRMM was decommissioned in 2015, its flight time sufficiently covers the 2012-2014 study period.
Additional annual products for land use land cover (LULC) and elevation were incorporated into the analysis.The pre-processed MODIS MCD12Q1 product identified patterns of LULC at a resolution of 500 m.The data identifies major landscape elements, including: barren, cropland, forest, savanna, shrub land, snow and ice, urban, and wetland features [11].The Shuttle Radar Topography Mission (SRTM) provided elevation data at a spatial resolution of 30 m.These resolutions were considered to be sufficient to quantify variations across the watershed's contour.
This project consisted of three stages for data acquisition and analysis, which is graphically represented in Figure 2. The acquisition and development of the weekly composite satellite data, previously described, represent the preprocessing part of Stage 1. Composited data were further compounded into annual values using Esri ArcMap's Cell Statistics tool.Cell statistics calculated an input variable's maximum, minimum, and average value at a pixel's location over the calendar year.The final part of Stage 1 consisted of zonal statistics being applied to the yearly composited cell values, using Colombian municipality boundaries.Variables acquired as yearly variables, elevation and LULC, did not require a cell statistics step, but were still processed through zonal statistics.The processing order demonstrates the naming structure of the variables used throughout the rest of this project.Nomenclature begins with the variable type (example LST or VEG), followed by the cell statistic (annual longitudinal: min, max, mean), and concludes with the zonal statistic calculation (cross-sectional per municipality: min, max, range, mean, standard deviation, sum).The zonal statistic step also provided a unique identification code per municipality to improve database organization and allow for the municipality population data to be joined to the environmental variables.
composite raster files to coincide with the weekly calendar year.The composite data was necessary to reduce the impact of cloud cover [11].All imagery was clipped to the watershed shapefile.
ArcMap's raster calculator tool [44] was used to convert MODIS MYD11A1 composited data to LST through additional scripting.Both daytime and nighttime thermal wavelength raster cells were multiplied by a gain factor of 0.02, then converted to Celsius by subtracting 273.15 [49].Both day and night temperatures were included in this study to investigate previously reported low temperature [3,21] and high/low temperature range [26][27][28] impact on vector survivability.Another MODIS product, the MYD09GQ, was used to calculate the two-band enhanced vegetation index (EVI) (2.5 × [(NIR − Red)/(NIR + Red + 1)]) [49].The two-band EVI was used due to the MYD09GQ's 250 m spatial resolution, the prevalence of high biomass/cloud cover in the study region, the index's ability to differentiate vegetation canopy from background noise, and its proven usefulness in previous DF research [12,50].
Precipitation data was collected through the Tropical Rainfall Measuring Mission (TRMM); a joint venture between NASA and the Japanese Aerospace Exploration Agency launched in 1997 [51].TRMM's daily temporal resolution has proved useful for vector research [8].One of TRMM's three onboard sensors, the Visible Infrared Scanner (VIRS), provided the 3B42 V7 dataset, which reports global precipitation estimates at a resolution of 5 km.Although the TRMM was decommissioned in 2015, its flight time sufficiently covers the 2012-2014 study period.
Additional annual products for land use land cover (LULC) and elevation were incorporated into the analysis.The pre-processed MODIS MCD12Q1 product identified patterns of LULC at a resolution of 500 m.The data identifies major landscape elements, including: barren, cropland, forest, savanna, shrub land, snow and ice, urban, and wetland features [11].The Shuttle Radar Topography Mission (SRTM) provided elevation data at a spatial resolution of 30 m.These resolutions were considered to be sufficient to quantify variations across the watershed's contour.
This project consisted of three stages for data acquisition and analysis, which is graphically represented in Figure 2. The acquisition and development of the weekly composite satellite data, previously described, represent the preprocessing part of Stage 1. Composited data were further compounded into annual values using Esri ArcMap's Cell Statistics tool.Cell statistics calculated an input variable's maximum, minimum, and average value at a pixel's location over the calendar year.The final part of Stage 1 consisted of zonal statistics being applied to the yearly composited cell values, using Colombian municipality boundaries.Variables acquired as yearly variables, elevation and LULC, did not require a cell statistics step, but were still processed through zonal statistics.The processing order demonstrates the naming structure of the variables used throughout the rest of this project.Nomenclature begins with the variable type (example LST or VEG), followed by the cell statistic (annual longitudinal: min, max, mean), and concludes with the zonal statistic calculation (cross-sectional per municipality: min, max, range, mean, standard deviation, sum).The zonal statistic step also provided a unique identification code per municipality to improve database organization and allow for the municipality population data to be joined to the environmental variables.Compounded statistical analysis on the environmental variables, both cell and zonal statistics, was used to investigate the complexity of environmental influences on transmission rates.Previous 'non-laboratory' studies often incorporated simple statistics into their models, such as the mean temperature, which may not quantify the most useful variable to study [6,22,27,52].The complexity added by the two-stage statistics in this project allows for the assessment of variable diversity to be expanded from what is available in the literature [21,52].For example, the application of the municipality zonal statistic on the night LST Min cell values explored the minimum, mean, and maximum zone values of each minimum pixel value with the zone.This methodology was anticipated to expand the individual values previously used and further the literature's understanding on environmental influencers.For example, the maximum zone value may better demonstrate whether the minimum survival temperature of the mosquito is reached within a municipality, which would be an improvement over the amount of information available through the mean value.Similar examples to this use of extended variables include the use of day and nighttime mean and temperature ranges for zoonotic niche studies in Africa [23], or max and minimum temperatures for DF in Singapore [8].The involvement of additional data compounds the complexity of the dataset, further demonstrating the usefulness of a data reduction tool, such as PCA.This statistical tool is able to identify local attributes of infection and could be reproduced in additional study areas/time periods by investigators with limited programming or GIS experience.

Health Data
Municipality records of DF were generated and made available by the National Institute of Health (Instituto Nacional de Salud) of Colombia for the three study years [53].Confirmed cases were identified through laboratory tests of patients exhibiting symptoms of: headache, ocular pain, myalgia, arthralgia, and/or rash.Specific virus specimen antibodies were identified by polymerase chain reaction (PCR) tests for fever patients who had symptoms for less than five days, while an ELISA (enzyme-linked immunosorbent assay) test was used for patients who had symptoms for more than five days [17].The municipality records were compounded into annual records and merged to the shapefile database to be used as the dependent variable in the statistical analysis.Since this methodology incorporated an annual aggregation, it extended beyond the normal gestation or gonotrophic cycle of the virus and vector.Therefore, no temporal lag was needed to be incorporated into the methodology.

Principal Component Analysis
Following the creation of the complete database incorporating health data, population demographics, and yearly zonal environmental variables, Stage 2 conducted a PCA independently on each variable category using the IBM SPSS 23 statistical software's dimension reduction tool [54].A category consisted of the zonal statistics derived from a single data source, such as the nighttime LST or precipitation data.The dependent variable was comprised of the number of confirmed cases of DF per municipality.This stage was used to identify the strongest unique variables within each variable category to include in Stage 3 [40].A variable was considered for the subsequent stage if it strongly loaded in component 1 or 2 in both 2012 and 2014.The El Niño Southern Oscillation (ENSO) conditions between 2012 and 2014 were weak, thus, no important influence from El Niño or La Niña should have affected the studied relationship between meteorological variables and DF [55].Stage 2 was implemented to remove variables which were highly correlated to one another and eliminate noise within the dataset [36,38].One variable from component 1 and component 2 was added to Stage 3 to represent the largest diversity of information provided by the category, as multiple variables from a single component are commonly highly correlated to one another and, therefore, contribute little more than noise to the results.LULC and population variables were the exception to this rule.Four LULC variables were included in Stage 3 due to the literature's consistent documentation of their value during this type of analysis.The four variables included urban (1.8% of the study site's landmass), forest (39.8%), cropland (29.9%), and barren land (0.03%).Due to the similar results of population category, total population and population density were investigated independently in different models in Stage 3. The purpose of Stage 2 was to reduce the amount of input variables, from the original 150 zonal statistic derived variables, to the strongest variables for Stage 3.This step allowed for a more robust relationship to be built from the derived Stage 3 models.
Stage 3 also incorporated a PCA, which identified the relationship between the top variables identified in Stage 2. Three models were analyzed over each of the three years, 2012-2014.Stage 2 results identified total population as a superior variable over gender or urban vs. rural divisions; the models, therefore, incorporated variables of the total municipality population count, population density, and an empirical Bayesian estimation (EBE) of DF counts.The population model included the yearly projected population per municipality as one of the 14 independent variables, while the population density model used the municipality's projected population divided by its area (population/km 2 ).The municipalities are not consistent in size, but it was anticipated the higher populated urban areas would increase the density value of the municipality enough to distinguish them from less developed areas, regardless of the municipality's size.Both population models used DF cases as the dependent variable.
The EBE variable was developed following the procedures of Brandel et al. [56] using a standardized event ratio for each municipality and an adjacency matrix to produce an estimate of vector presence without political boundary influence [56,57].The adjacency matrix was created through the ArcGIS plug-in 'Adjacency for WinBUG Tool' (available through USGS [58]).Since the population of the municipality was incorporated into the EBE dependent variable, no population independent variable was required for the model, so only 13 variables were used.
Normalization of all variables was conducted for Stage 3 to reduce distribution or range bias in the results.Independent variables were normalized to Z-scores using SPSS's descriptive analysis tool.Similarly, the Poisson distribution of the dependent variable, reported DF cases, was transformed through a natural log calculation (Ln [cases + 1]).The EBE dependent variable included a normalizing step in its development, so no further manipulation was required.The Stage 3 PCA was processed by the SPSS dimension reduction tool [54], using a varimax rotation for an oblique transformation method, and included a regression variable output [59].As with previous vulnerability studies, these component values can be added together to derive a sum factoral output for the municipality to provide a comparable value and output for mapping [34,[60][61][62][63][64][65].
Comparative assessment of the Stage 3 models was conducted for the three models through a Receiver Operating Characteristic (ROC) Area under the Curve (AUC).The AUC demonstrates the positive reporting of DF cases in high sum factoral areas compared to modeled false positives in areas without reported DF cases.This can demonstrate the ability of the PCA regression output to correctly identify areas of increased risk of transmission, whether the variables are appropriate for identifying transmission risk, and provided a comparison between the results across years [22].

Results
The Stage 2 categorical PCA reduced the list of 150 independent variables to 14 variables which were able to more efficiently explain the variance in the dataset without extraneous variable noise.Among the more prominent variables found from Stage 3 were nighttime LST minimum cell-max zone value, elevation minimum value, vegetation min cell-max zone, and daytime LST max cell-mean zone.Four LULC variables were used in the Stage 3 assessment based on Stage 2 results and a priori knowledge from the literature, such as the relationship between the mosquito and urban populated areas, but all four variables demonstrated limited results [1,6].A complete list of the Stage 3 independent variables can be viewed, along with their component scores, in Figure 3; negative numbers indicate an inverse relationship, and the strongest component loading was bolded for each variable.The variables which load heavier in the earlier components (example 1 or 2) demonstrate a stronger relationship to the dependent variable.2 indicate the methods were fairly consistent between the three models.The results show both population-based models contain five components, and explain over 73% of the variance within the dataset.The EBE dependent model contained one fewer component, with four, and a lower explanation of variance compared to the other models.This may be due to the reduced number of input variables, aka population, but further analysis would be needed to confirm such a theory.The three models also had similar results when compared through the AUC, as is seen in Table 3.These results suggest there were limited statistical differences between them.The ROC AUC also demonstrated the exploratory methods used in this study were able to provide results better than statistical chance, but could be improved.The ROC AUC represent a ranking of "Poor-Fair" for the statistical method results, but are acceptable for an exploratory study [66].With the similarity in model PCA results, similar component loadings between the three models will support the incorporation of those strongly-loading variables in future projects.2 indicate the methods were fairly consistent between the three models.The results show both population-based models contain five components, and explain over 73% of the variance within the dataset.The EBE dependent model contained one fewer component, with four, and a lower explanation of variance compared to the other models.This may be due to the reduced number of input variables, aka population, but further analysis would be needed to confirm such a theory.The three models also had similar results when compared through the AUC, as is seen in Table 3.These results suggest there were limited statistical differences between them.The ROC AUC also demonstrated the exploratory methods used in this study were able to provide results better than statistical chance, but could be improved.The ROC AUC represent a ranking of "Poor-Fair" for the statistical method results, but are acceptable for an exploratory study [66].With the similarity in model PCA results, similar component loadings between the three models will support the incorporation of those strongly-loading variables in future projects.Night LST min cell-max zone consistently ranked high across all three models.Night LST mean cell-min zone and daytime LST max cell-range zone were also highly loaded in the first or second component, supporting previous documentation on the influence of temperature range, or diurnal temperature, on mosquito survivability and their use for DF modeling [26][27][28]52].Since both warmer daytime and cooler nighttime temperatures were found to be highly influential in the results, it demonstrates the variability of temperature within an area may influence DF transmission.This is particularly relevant since the range zonal statistic was determinant in both day and night temperature variables.Although daily diurnal temperatures has previously been demonstrated to influence the development of the mosquito, it should be noted these results are representative of data aggregated to an annual value which may not directly compare [27,52,67,68].The general term 'temperature range' may be used hereafter to refer to the general temperature variability observed on the dependent variable.LST data documented an annual mean nighttime pixel range of −8.218-21.88• C, and 8.1-42.73• C during the day between both 2012 and 2014.Although large ranges are detected in this study, they may be influenced by either the weather or the diverse landscapes within a municipality, and may not provide direct support for the previous laboratory identified vector temperature limits [27,28,30].The importance of the temperature range on DF is further supported by the consistently strong inverse loading of elevation in the first component across all three models; high elevations contain colder temperatures and have been documented to impede mosquito habitation [3,28,32].Another influential explanatory variable was identified as vegetation min cell-max zone; which loaded positive in the first component for all models and could represent areas capable of supporting flora and fauna [1].
DF has been widely recognized as a mainly urban or peri-urban disease.The mosquito vector prefers populated spaces, due to the availability of anthropogenic water breeding sites and human food sources [1,6,17,19,22,26,52].However, these results showed Urban LCLU as one of the lowest ranked variables in all three of the models, and across all three of the study years.This unexpected result might have been due to the low annual temporal or spatial resolutions used.Similarly, the population and population density variables ranked low across all years, not listing higher than the fourth component.Conversely, the vegetation index results were superior to those of the LULC variables.However, these results do not necessarily contradict previous work, as they still loaded within the principal components of Stage 2. The lower than expected importance of urban LCLU and population in this study suggests the need for future studies to investigate contributory urban attributes, particularly if using different spatial or temporal resolutions [1,3,17,41].For example measures of local density, urban heat island, or water storage potential variables could be investigated rather than an individual LULC urban variable.
The SPSS regression variable output from Stage 3 was mapped by municipality to demonstrate the geographical distribution of the project's results, following the procedures of previous vulnerability studies [34,[62][63][64]69]. Figure 4 demonstrates the visual comparison of the population density and EBE models to the distribution of reported DF cases per 10,000.The population and population density models were similar in component and mapped results.All three of the models follow a similar spatial pattern across all years by indicating low risk in higher elevated municipalities and increased risk in the more moderate temperature valleys, which coincide with areas more suitable for vector survival [4,18,19,28].The maps suggest the population based models have a propensity to indicate higher risk in more densely populated areas, even if they exist above the elevation limit of the mosquito, such as Bogota.The EBE model does not indicate as high of a risk for densely populated areas, even those within the mosquito's habitable elevation.Therefore, it is anticipated that the population models added weight to municipalities with larger urban areas, potentially due to the inclusion of the population independent variable, which the EBE model did not have.These mapped results show similar distribution of the vector count, supporting the models were able to appropriately identify the contributing independent variables, for example, the temperature range and elevation.
Remote Sens. 2016, 8, 770 10 of 16 follow a similar spatial pattern across all years by indicating low risk in higher elevated municipalities and increased risk in the more moderate temperature valleys, which coincide with areas more suitable for vector survival [4,18,19,28].The maps suggest the population based models have a propensity to indicate higher risk in more densely populated areas, even if they exist above the elevation limit of the mosquito, such as Bogota.The EBE model does not indicate as high of a risk for densely populated areas, even those within the mosquito's habitable elevation.Therefore, it is anticipated that the population models added weight to municipalities with larger urban areas, potentially due to the inclusion of the population independent variable, which the EBE model did not have.These mapped results show similar distribution of the vector count, supporting the models were able to appropriately identify the contributing independent variables, for example, the temperature range and elevation.

Discussion
Both nighttime LST min cell-max zone and daytime LST max cell-mean zone, were consistently ranked high in the top two components (Figure 3) of this study.These results are similar to previous laboratory studies on the Aedes aegypti mosquito which indicated temperature range can impact the vector's ability to thrive [28,52,67,68], and field studies identifying higher low temperatures improve vector resilience [3,8].Due predominantly to the mountainous terrain, the Magdalena River watershed has a large temperature range which exceeds both the minimum and maximum temperature thresholds (which has been found to extend from 14 • C and 24 • C, respectively, depending on location or study parameters [3,28]).This provides more influence on the mosquito's ability to thrive [3,18,28], and impacts the extrinsic incubation period and duration of the gonotrophic cycle [30,41], than has been witnessed in previous studies.As noted, the temperature range variables were based off three annual assessments, so the results do not necessarily document the actual temperature limits of the Aedes aegypti.Additionally, since the temperature range is identified in this study as indicative of DF, it supports the theory of vector habitat re-distribution is impacted by climate change [4,6,21,70].
Although the LULC forest classification often contained an inverse relationship in Stage 3's results, the EVI variables (vegetation) had markedly better results in the top component of Stages 2 and 3 (Figure 3).It seems logical to expect a lower prevalence of DF transmission in dense forest, since there are lower population rates.However, the vegetation index will have a more inclusive classification of features (grasslands, agriculture, etc.), which may allow for more anthropogenic uses, compared to the individual dense forested space variable.The positive relationship between vegetation and dengue cases might be explained by the increased vegetation in wetter areas, which in turn are more likely to provide larval rearing sites.This is particularly relevant near populated areas where surrounding vegetation (for example, gardens, potted plants or even subsistence agriculture) may be more likely to be irrigated and to have a greater human exposure potential as compared to dense forest.Furthermore, greener areas are often associated with increased moisture and biodiversity, leading to more fauna and flora, which may also provide opportunities for suitable breeding sites [1].These water sources are commonly located in more populated or urbanized areas, which may have lower forest LULC-classified space.The lower resolution of the forest LULC product used, compared to EVI vegetation index, could also explain the difference.
The widely accepted view that Aedes aegypti preferentially breeds in peridomestic water sources, such as storage tanks or flower pots [18], which are more abundant in urban areas, plus the higher density of people around these sites makes transmission of DF easier [71].Indeed, urban spaces may even be oasis breeding sites, due to impervious surfaces collecting water and increasing the UHI, in unexpected habitat ranges (temp or elevation) [72].The unexpected weak relationship detected here between urban LULC and DF in the Stage 3 models could be due to a low proportion of urbanized space within the municipalities, misclassification of suburban/rural communities in the LULC dataset, and/or the spatial resolution used in this study.Conversely, it may suggest future investigations as to whether higher contact with the vector is made outside, or on the fringe of, urban space, but reported in urban areas.The LULC might be identifying here primarily developed space, such as infrastructure, rather than metropolitan characteristics.Populated areas can be very green, which would increase misclassification of the urban LULC [50,73], while high vegetation index records would coordinate with the populated spaces.
Contrary to what was expected, precipitation, population, and LULC variables were less determinant within the Stage 3 results.Precipitation frequently loaded within the lowest 3-5 components across all models.This could represent a limitation of the data resolution or methodology used to measure rainfall's influence.Another potential limitation is the use of an annual aggregation of RS variables, which does not require the use of a temporal lag.Future studies incorporating higher temporal resolution should incorporate a time lag, to correspond with the gonotrophic cycle, to better identify precipitation trends.The low precipitation results could alternatively be explained by the dependence of this mosquito on peridomestic water sources for larval rearing sites.Use of containers filled with water by humans (i.e., water storage tanks, flower pots), may allow the mosquito to be less dependent on rainfall [74], particularly during the moderate ENSO seasons experienced during the study period [55].LULC overall loaded in the weakest 4-5 components during this project, particularly the EBE model, even though the literature indicated landscape composition to be important in modeling DF [3,6,18,19].This may indicate the 500 m low resolution MODIS LULC product is less capable of documenting environmental discontinuity for this type of study [3,34,50,73].This may impact the ability to identify urban LULC due to the large pixel resolutions and municipality aggregation used in this study [75,76].It should be noted that barren LULC frequently contained stronger results than urban, which might be due to a misclassification of urban, agriculture, or periphery/transitional landscapes [50].The yearly assessment of this project may also fail to identify seasonal transitional landscapes, land with intermittent use, which may impact the rate of transmission by bringing people to areas with a higher mosquito breeding potential [12,39].Seasonally transitional land could include agriculture or seasonal markets.Additionally, the majority of urbanized features in Colombia are built higher in elevation, where the mosquito has not been shows to naturally thrive [3,32].This cultural dynamic may reduce the effect of urban or similar LULC results within this study area.Future studies should continue to explore varying LULC variables, even though they were less influential during this project.
All three of the mapped models locate higher risk areas within the mountain valleys, which is supported by the knowledge that the Aedes aegypti does not naturally thrive in high altitudes [3,32].Although the population variables suggest increased identification of disease transmission risk in highly populated areas.Since impervious surfaces increase local thermal conditions in a phenomenon known as the urban heat island, it should be considered that urban environments hold the potential to be oasis breeding habitats.The increased temperature and water storage capacities of urban impervious features may provide a haven for mosquito habitats in higher elevations [3,72].Methods of transportation, particularly for goods, could bridge the gap between natural vector breeding sites in agricultural fields and municipalities outside the habitation zone.Human activities can also increase the promotion of vector breeding or host-vector contact through the establishment of gathering spaces in high vector breeding sites, such as seasonal markets, who then return infected to their urban residences where the disease is reported [39].Additionally, the EBE had a lower explanation of variance than the other models across all years, which may be attributed to the lack of a population independent variable.This suggests a need to strongly consider the use of population in future vector modeling or zoonotic niche studies.However, it seems appropriate to mention a priori knowledge of the areas would be required to take advantage of this type of information.
A supplementary study was conducted to ascertain the strength of the variables identified from the top 2 components in the Stage 3 analysis.A Stage 3 type PCA was conducted on the 6-7 variables which weighed heavier (0.7 rounded or higher) in the top two components of the original Stage 3 (see Figure 3).The variables were predominantly from the temperature, elevation, and vegetation categories.These results were very similar to those of the original 14 variable PCA.Although the top two studies had fewer components and a larger explanation of variance, which could be explained by the fewer input variables.The AUC results also demonstrate the similarity between the results, so the component results of the top two component models were not provided.These results simply supported the previously demonstrated influence of temperature, elevation, and vegetation.

Conclusions
This analysis was able to identify variables which are statistically useful to estimate the likelihood of DF transmission in the Río Magdalena watershed in Colombia, as is demonstrated by Figure 3. Stage 2 and 3 results indicate temperature is an important modeling variable within this study area.The results of nighttime min cell and daytime max cell variables demonstrate that temperature range is particularly relevant, presumably on how it relates to the habitation zone for the vector and virus replication.Elevation min also loaded highly through all of the results as an inverse value, supporting previously-identified altitude limitations to the vector.Vegetation was, similarly, an important variable to explain dengue cases, presumably due to a greater likelihood of larval rearing sites in wetter areas, which could also support increased quantities of vegetation.These results consistently suggest temperature and elevation as strongly related to the mosquito vector's ability to thrive and transmit the disease.
The PCA, AUC, and mapped results suggest the variables used in this exploratory analysis should be considered in future DF monitoring or predicting tools.Whether to adopt a population or EBE-based model would be dependent on the individual's prior knowledge and focus.The focus of population models appear to identify risk in higher populated areas, sometimes despite being located above the mosquito's expected habitable elevation.Such methodology could provide future studies with a population weight to identify spaces where there is a higher rate of transmission, rather than simply the presence of the mosquito vector.
As an exploratory study, the conditions and methods depicted within this report were able to appropriately quantify and model DF records through the use of environmental and population data.The results were able to explain approximately 70% of the dataset variability, across all of the models used.Future studies should include the variables identified in Stage 2 of this study, but consider modifying them in an attempt to further improve the model strength.Although elevation min proved to be a strong variable is this study, additional variations on elevation may prove beneficial in future studies.Future exploratory analysis could investigate whether a more computationally heavy variable could improve analysis, such as by measuring the amount of municipality area which is within the vector's habitable elevation limits or surface slope, either would better quantify the amount of potential breeding space over the simpler minimum elevation variable used in this study.Additionally, this analysis was focused on variables estimated through RS, however, it is imperative for future studies to incorporate a more direct variable of socioeconomic status.Other statistical methods, such as geographically weighted regression or boosted regression trees, should also be explored to interpret the relationship between similar independent variables and DF.Still, this project was able to appropriately and accurately accomplish its goal of identifying niche variables which identify increased Aedes aegypti, through DF cases, within the Magdalena Watershed.The results described by this exploratory analysis could be used to aid mitigation practices or develop an early warning system for the transmission of DF by the Aedes aegypti mosquito in future studies.

Figure 1 .
Figure 1.Study site identification of the Magdalena Watershed in Colombia, demonstrating its geographical location, elevation diversity, and presence of larger populated urban environments.

Figure 1 .
Figure 1.Study site identification of the Magdalena Watershed in Colombia, demonstrating its geographical location, elevation diversity, and presence of larger populated urban environments.

Figure 2 .
Figure 2. Graphic depicting stages of analysis.Figure 2. Graphic depicting stages of analysis.

Figure 2 .
Figure 2. Graphic depicting stages of analysis.Figure 2. Graphic depicting stages of analysis.

Figure 3 .
Figure 3. Model output of principal component loadings.The strongest component loading was bolded for each variable.

Figure 3 .
Figure 3. Model output of principal component loadings.The strongest component loading was bolded for each variable.

Figure 4 .
Figure 4. Population density and EBE model graphical comparisons to the reported cases per 10,000 population.

Figure 4 .
Figure 4. Population density and EBE model graphical comparisons to the reported cases per 10,000 population.

Table 1 .
Projected population and Dengue Fever Case records.

Table 2 .
Principal components analysis results.

Table 2 .
Principal components analysis results.

Table 3 .
ROC AUC result.There was no difference between the inputs in the population and population density models for 2012-2013 when only the top two component variables were used.