Advancing Cave Detection Using Terrain Analysis and Thermal Imagery

: Since the initial experiments nearly 50 years ago, techniques for detecting caves using airborne and spacecraft acquired thermal imagery have improved markedly. These advances are largely due to a combination of higher instrument sensitivity, modern computing systems, and processor-intensive analytical techniques. Through applying these advancements, our goals were to: (1) Determine the efﬁcacy of methods designed for terrain analysis and applied to thermal imagery; (2) evaluate the usefulness of predawn and midday imagery for detecting caves; and (3) ascertain which imagery type (predawn, midday, or the difference between those two times) was most informative. Using forward stepwise logistic (FSL) and Least Absolute Shrinkage and Selection Operator (LASSO) regression analyses for model selection, and a thermal imagery dataset acquired from the Mojave Desert, California, we examined the efﬁcacy of three well-known terrain descriptors (i.e., slope, topographic position index (TPI), and curvature) on thermal imagery for cave detection. We also included the actual, untransformed thermal DN values (hereafter “unenhanced thermal”) as a fourth dataset. Thereafter, we compared the thermal signatures of known cave entrances to all non-cave surface locations. We determined these terrain-based analytical methods, which described the “shape” of the thermal landscape, hold signiﬁcant promise for cave detection. All imagery types produced similar results. Down-selected covariates per imagery type, based upon the FSL models, were: Predawn— slope, TPI, curvature at 0 m from cave entrance, as well as slope at 1 m from cave entrance; midday— slope, TPI, and unenhanced thermal at 0 m from cave entrance; and difference— TPI and slope at 0 m from cave entrance, as well as unenhanced thermal and TPI at 3.5 m from cave entrance. We provide recommendations for future research directions in terrestrial and planetary cave detection using thermal imagery.


Introduction
Reliable detection of caves is required for both terrestrial and planetary speleological research. Improved cave detection on Earth will provide conservation biologists and resource managers with a tool to efficiently identify and prioritize caves for conservation and management [1]. In North America, cave-roosting bats are in decline due to the westward advance of white-nose syndrome. This epizootic disease has resulted in the loss of millions of bats [2] in 33 states and five Canadian provinces [3,4], and is affecting at Researchers conducted a series of experiments using several generations of QWIP (Quantum Well Infrared Photodetector) thermal instruments [47,61], finding a clear separation between cave, tunnel cave (i.e., a subterranean feature with entrances on either end, typically with frequent air flow), and random non-cave locations on the surface in thermal images captured at 10 min intervals over a 24 h period. Their results suggested how larger caves may be distinguishable from shallow alcoves and tunnel features. Using a similar dataset of thermal imagery, Titus et al. [59] further examined multiple thermal images containing cave entrances over a 24 h period; they reported the detectability of caves was optimal when multiple thermal images were acquired at either the warmest (midday) or coolest times of day (predawn).
In this paper, we applied advanced image processing techniques to multiple time-ofday thermal imagery to reduce the frequency of false positives. Specifically, we: (1) Determined the efficacy of methods developed for investigating topographic surfaces for thermal imagery analysis; (2) evaluated the usefulness of predawn and midday imagery for detecting caves; and (3) ascertained whether thermal characteristics captured at one time of day (predawn or midday), or the difference in thermal value between those two times, was more informative than another. Specifically, we compared the thermal signatures of cave entrances to all non-cave surface locations using a multivariate framework involving terrain analysis methodology applied to the thermal surface (i.e., topographic position index, slope, and curvature of thermal values) and unenhanced thermal values derived from thermal imagery.

Study Area and Field Data
Pisgah Lava Field (34 • 44 47 N, 116 • 22 30 W) is located about 175 miles northeast of Los Angeles, California ( Figure 1) and administered by Bureau of Land Management, Barstow Field Office. Situated within the eastern Mojave Desert, this lava field is approximately 21,000 years old [62]. Consisting of numerous thin flows covering approximately 51 km 2 and extending from the vent~18 km to the west and 8 km to the southeast, this lava field is primarily pāhoehoe lava mosaicked by pockets of 'a'ā lava. This large, sparsely vegetated lava field contains over 300 known lava tube caves [63]. Our focal area within the Pisgah Lava Field was 172 ha in size, with elevations ranging from 680 m to 772 m (x = 702 m, sd = 15.3). Two missions were flown over Pisgah to acquire thermal imagerythe first occurred from 1222 to 1457 h on 11 April 2011 with the second from 0426 to 0649 h local time on 13 April 2011. These times were selected to minimize temperature changes across the target area. We aimed to acquire predawn data prior to sunrise, while the midday collection was centered around the time of peak temperature (e.g., [64,65]).

Sensor Description, Linear Response, and Pixel Resolution
Thermal imagery was captured using a quantum well infrared photodetector (QWIP) thermal imaging camera, designed by NASA Goddard Space Flight Center in collaboration with QmagiQ, LLC of Nashua, New Hampshire. This experimental instrument was a precursor to the thermal infrared sensor onboard the Landsat 8 spacecraft [66]. Through this work, we collected field performance data to aid in maturing this technology to NASA technology readiness level 5 [67]. The QWIP thermal sensor responds to far infrared radiation from 7.5 to 9.1 µm spectral band with a peak response of 8.7 µm [68]. The detector array was cooled with a miniature Stirling cycle cryocooler to 67K (−206 • C) with a standard 50 mm IR lens and had an instantaneous field of view of 8.8 • × 11 • . Integration time was 0.0167 s. The camera was preset to capture two 16-bit images/second (2 frames per second). Each image is in a 320 × 256 pixel format where each pixel is a 30 µm square. While the imagery was not radiometrically calibrated, the instrument had the ability to resolve signals from objects where temperature variations were less than 0.02 • C [68]. Additionally, the imagery for both predawn and midday were captured within a~40 h window, the aircraft power source did not fluctuate, and the instrument was housed within a pressurized aircraft. The camera was stable over time-if the camera was at the same temperature looking at the same object at a given temperature, it would produce the same number of counts (DNs) to within <1% [68]. Titus et al. [69], using the same QWIP instrument, acquired two diurnal sets of oblique thermal images overlooking the Pisgah Lava Field from the approximate mid-elevation point on the Pisgah cinder cone in late March of 2010. In situ temperature data of both the surface and the near surface (~1 m AGL) atmosphere were acquired. They determined the precision of the brightness temperature (no correction for atmosphere or surface emissivity) was accurate to less than 1 • C. Based on the in situ calibration which included atmospheric correction, they reported the surface temperature estimates were accurate to~3.7 • C at 20 • C. Inadequate characterization of the atmosphere, where the air temperature and humidity data were used, may not accurately represent the conditions along the entire line-of-sight from camera to surface. This was likely the cause for the larger uncertainty in accuracy under field conditions, when compared to laboratory testing results. Refer to Appendix A and Titus et al. [69] for more details. Average aircraft height above ground level was 808 m for the predawn flight, and 825 m for the midday flight (which was similar to the distance (670 m) between instrument and target for the Titus et al. [69] study). We acquired~0.5 m resolution data at nadir. A total of 3675 and 4445 images were collected during the predawn and midday flights, respectively. Capture rate was at one and two second intervals for the predawn and midday flights, respectively. Predawn target ground speed varied from 56 m/s when flying in a westerly direction, to 69 m/s easterly. Midday ground speed averaged 63 m/s with little variation between easterly and westerly directions.
No atmospheric corrections were applied. Atmospheric temperature and opacity were not available. However, the analytical techniques employed in this study should be relatively insensitive to the effects of thermal remission and atmospheric absorption.
During midday, the observed radiance would decrease as the atmosphere is typically cooler than the observed surface, but the spatial indices would vary little as the atmosphere affect (offset) would be nearly uniform. Atmosphere would likely have even less of an effect for predawn temperatures as surface and atmosphere temperatures would be more similar. Data acquired one year prior using the same QWIP instrument supported this conclusion [69].

Georectification and Processing
We assigned approximate geospatial positions to all imagery using custom Visual Basic for Applications (VBA) functions in ArcGIS (refer to [70]). VBA functions roughly georeferenced, scaled, and oriented each image in its approximate geospatial location using a digital elevation model downloaded from the USGS 3D Elevation Program [71] and aircraft GPS data including time, altitude, and coordinate data. Using both ESRI World Imagery [72] and National Agriculture Imagery Program [73] imagery (1 m resolution) as base reference layers, we manually georeferenced a subset of 49 predawn and 60 midday images that were preselected based on their overlap with known cave locations (refer to Section 2.2.4). All procedures were executed using the Georeferencing Tool in ArcGIS 10.3 [74]. Georeferencing involved identifying reference points apparent on both the reference layer and the thermal images, and then mathematically warping the thermal images to match the base reference layer using a polynomial warping function (in most cases, an affine transformation) that minimized the root mean square error. We used bilinear interpolation to resample the warped image, and then selected a resample resolution that most closely approximated the original resolution. Because we had nadir-viewing imagery, and the elevation range in image footprints (roughly 126 m × 156 m) did not exceed 16.5 m (x = 6.7 m), we determined topographic terrain would not be a major impediment to the georectification process.
We used three datasets for our analysis: Predawn, midday, and difference imagery ( Figure 2). For the difference imagery, we reasoned the change between predawn DN values and midday DN values would be relatively low at cave entrances because entrance temperatures fluctuate less compared to the surrounding surface landscape. Thus, the lower pixel values associated with cave entrances would produce a distinguishable signature on the landscape. We mosaicked the best predawn images into a single composite image and repeated this process for midday images. Pixel values in overlapping areas were calculated as the pixel values of the last individual image in the mosaic list. We defined the best images as those with the least amount of geometric distortion required to georeference the image to the landscape. We then generated the difference imagery layer by subtracting the digital number (DN) values of all pixels within the predawn images from the overlapping corresponding DN pixel values in the midday images.

Thermal Transformations Based on Terrain Analysis
The fields of image analysis and object recognition have progressed steadily since the invention of photography and exploded with the advent of computer technology. Software can now, quickly and with high accuracy, recognize and identify obscure and subtle phenomena, such as individual human beings based on facial features or distinguish building footprints from roads or the general landscape. Often the most powerful and interesting methods depend on a specific characteristic of the data being measured. For example, Facebook's facial recognition software assumes an object in the image has eyes, a nose and a mouth [75]. Similarly, Microsoft's building footprint algorithm takes advantage of common characteristics of buildings, such as edges and edge angles [76].
As our goal was to identify thermal characteristics that distinguish caves from the general landscape, we identified methods that took advantage of the unique thermal characteristics of caves. Because temperatures at cave entrances tend to be more stable than that of the general landscape, we focused on methods that would specifically measure this thermal characteristic. Additionally, we applied methods that would capture the change in thermal radiance with increasing distance from the cave entrance. Methods developed for terrain analysis, which measured the rate of change of elevation, were applied to thermal imagery to assess the efficacy of these methods in identifying cave locations. Similar to digital elevation models, thermal imagery depicts temperatures of warm and cool areas appearing as high and low values, and thus presented an opportunity to analyze the thermal "shape" of the landscape. In this case, we treated the thermal surface as a 3D object with peaks, valleys, slopes, and curves. We hypothesized that cave entrances would be distinct features within the 3D thermal surface. To test this idea, we examined the predawn, midday, and difference thermal imagery datasets using the following topographic indices: (1) Topographic position index (TPI), (2) slope (but for thermal imagery should be considered a "thermal gradient"), and (3) curvature. For the predawn, midday, and difference datasets, we also included the actual, untransformed thermal DN values (hereafter "unenhanced thermal"). Figure 3 are histograms for the different imagery datasets for the three topographic indices and the unenhanced thermal.
TPI is the relative value of a specific pixel compared to the mean thermal value of a neighborhood of pixels [77,78], and provides a method for identifying the relatively low and high regions (i.e., local minimum and maximum values) in the thermal surface. In general, TPI serves to identify local valleys and peaks in the thermal surface; valleys represent negative values while peaks are positive values. When applied to our thermal surface, TPI identifies low or high thermal values (relative to their neighbors) and consequently identifies regions that represent a distinct local anomaly. We calculated TPI by first calculating the mean thermal value in a 5 m-radius circular neighborhood around each cell, and then subtracting that neighborhood value from the original cell value. A positive value indicates the area is higher (or warmer, on a thermal surface) than the pixels in the neighborhood. Our ArcGIS model for this technique is provided in [70]. For both slope [79,80] and curvature [81], we used standard ArcGIS algorithms. Given a 3 × 3 neighborhood, with cell values identified as Z 1 . . . Z 9 , and Z 5 being the central cell, and with L = cell edge length, slope is the standard first derivative (the rate of change of thermal values; Equation (1) in the direction of the greatest rate of change.
Curvature, in this case, is a measure of overall convexity or concavity of the thermal surface, calculated using the standard "curvature" output in ArcGIS. Conceptually, this is tantamount to the second derivative (the rate of change of slope; Equation (2) but is not defined as the curvature at a point on a line.
To avoid highly skewed slope and curvature distributions, we rescaled thermal DN values by dividing by 100 prior to calculating the slope and curvature values. This procedure resulted in more symmetric distributions, which improved our ability to detect patterns along the full range of observed values [82].

Study Sites and Pixel Averaging
Our study sites include several documented caves, which were located and mapped by the Southern California Grotto, National Speleological Society. These included 64 caves (within the midday imagery), 89 caves (predawn imagery), and 37 caves (difference imagery). All caves included in this study had entrances larger than the pixel resolution of the imagery. Given in-flight variations in the pitch, yaw, and roll of the aircraft, we did not acquire seamless and complete coverage of our study area; thus, we did not capture imagery for all known cave locations.
The study caves represent lava tubes of varying lengths and varying numbers of entrances. These include shallow through caves (or tunnels), shelter-like features extending 10 m to caves tens of meters in length with multiple collapse pit entrances. Given the exploratory nature of this study, we did not attempt to differentiate between shallow caves and caves with a deep zone environment (refer to [49,50]). Thus, we considered all features, regardless of size that permitted access to the subterranean realm, as caves. We also extracted pixel values for all locations on the landscape, which were identified as not being caves. In cases where we had multiple images that included a cave entrance, we used the image that was georeferenced with the least amount of geometric distortion.
We developed TPI, slope, curvature, and unenhanced thermal imagery values for all known cave and non-cave surface locations for each imagery dataset. Because thermal signal strength of cave entrances may be described as a gradation of thermal influence strongest near the center of the cave entrance and diminishing toward the outer edges of the entrance, we attempted to capture this slope using bilinear interpolation [83]. Starting at the center of the cave entrance (0 m) and increasing by 0.5 m concentric intervals, we extracted pixel values from 0 to 3.5 m ( Figure 4) and calculated the average for each distance interval. For each interval, we sampled multiple points evenly distributed in a circular pattern, separated by approximately the cell size (roughly 0.5 m). We then took the mean of all values at each distance range to determine the variable value at that distance from the cave. Thus, for each cave entrance location and landscape point, we had 32 covariates ([the four terrain analysis variables] × [average of pixel values at 0.5 m intervals]). For selecting non-cave surface locations, we used the following criteria: (i) Pixels ≥ 10 m from known cave entrances as these areas were most likely to represent non-cave thermal characteristics; (ii) only cave to non-cave locations on the same image were compared to minimize the effects of image-to-image variation; and (iii) only pixel locations that were far enough from the edge of the image to be completely sampled were included in the analysis. Through applying these criteria, we standardized our approach to sampling and analyzing non-cave pixels.

Analysis
We used 32 terrain covariates (i.e., curvature, slope, TPI, and unenhanced thermal values at the eight distance intervals from the cave entrance or non-cave surface location), which were repeated for each imagery dataset-predawn, midday, and difference. Pearson's Correlation coefficients (ρ) revealed that most covariates were substantially correlated and therefore interpretation of standard covariate selection methods would be challenging [84]. Therefore, our analysis focused on model prediction accuracy rather than covariate selection. Furthermore, traditional p-values or delta AIC (Akaike Information Criterion) values for each covariate was not appropriate [84]. To avoid over-fitting the models, we applied a machine-learning approach involving feature selection [85]. Our results are presented below. To select covariates to use, we performed a feature filter procedure [86] whereby we repeatedly and randomly sampled 4000 non-cave surface locations along with all known cave locations (including nine to 21 confirmed caves across the three imagery datasets). For each resulting 'filtered' sample, we performed a classic model selection routine ultimately restricting ourselves to using only covariates that were selected in more than 50% of the model runs. We considered two classic model selection routines: Forward stepwise logistic regression (FSL [87]) using AIC [88] and Least Absolute Shrinkage and Selection Operator (LASSO) on logistic regression utilizing 10-fold cross validation [85] for covariate selection. Both are common model selection methods for remote sensing applications (e.g., [89][90][91][92][93][94][95]). We aggregated the filtered sample results by including only covariates that were selected in more than 50% of the filtered samples. As a result of the two model selection techniques (which were designated FSL and LASSO), this became our competing models' framework.
As most classification models are binary (i.e., positive/negative, success/failure, true/false, etc.), there is some tuning parameter (c) which controls how strong the evidence must be before we predict a positive state (i.e., a cave and not the landscape). High classification thresholds result in higher misclassifications of caves as landscape (false negatives), while lower values result in higher misclassification of landscape pixels as caves (false positives). Fine tuning the model to lower the false negative rate inevitably raises the false positive rate. Therefore, when creating binary classification models for prediction, the threshold parameter tuning process must include a step to balance the false positive and false negative rates. Receiver operator characteristic (ROC) curves vary the model threshold to predict a positive state (c) and then plot the model's false positive rate verses the true positive rate over a range of thresholds. These ROC curves graphically depict classification curves (i.e., correct classification vs. misclassification) for known cave and non-cave landscape pixels. The quality of the classifier can be assessed from these curves by examining the area under the curve. A value of 1 represents a perfect classifier, while a value of 0.5 (a 1-to-1 line on the ROC curve) indicates the classifier is no better than a random guess.
Because our models were developed at the pixel-level employing the grid pixel extraction technique and not based on the total number of pixels that represent actual cave entrances, we grouped pixels into spatial clusters (code available in GitHub repository https://github.com/JeffJenness/Cave-Detection-Cluster-Analysis, accessed on 2 July 2021) to examine pixel aggregations of both known caves and regions of the non-cave landscape that the model identified as being cave-like. Pixelp values ranged from 0 (lowest similarity to known caves) to 1 (highest similarity to known caves). Thus, we arbitrarily selected the highest 0.01% of pixelp values, which were identified as being similar to cave entrances for each imagery type. Pixels were then grouped into clusters that were within a 2 m threshold distance of at least one other pixel within that cluster. Clusters were classified as being related to a known cave entrance if at least one pixel within the cluster was within the threshold distance of a known entrance. If clusters did not occur near known cave entrances, these aggregations represented either potential new caves or false positives.
We chose to use a custom clustering technique rather than other standard techniques (such as the ArcGIS "Hot Spot" analysis) for this step of the analysis. Specifically, we were most interested in grouping adjacent or highly proximal pixels that likely represented multiple observations of a single cave entrance. Our customized approach enabled us to set a distance threshold for grouping similar pixels.

Results
Pearson's correlations revealed high correlations across most paired comparisons of covariates. Variables were correlated within and between different variables and distances from cave entrances (e.g., slope 1.0 m vs. 2.0 m, as well as slope 0 m vs. TPI 0 m; Figure 5), as well as spatially (e.g., two surface locations within 3.5 m were highly spatially autocorrelated).
Given the large number of covariates used, and that developing a summary table of candidate models is both difficult to parse and interpret, we provide a graphical representation for discerning the most important covariates ( Figure 6, Table 1). Overall, model selection procedures for both regression analyses produced a similar set of covariates. For predawn and midday imagery, covariates representing pixels at or near (0.5 m) each cave entrance were most useful for cave identification/prediction. Interestingly, LASSO often selected covariates describing conditions 0.5 m from the cave entrance, while FSL typically selected covariates describing conditions centered directly over the cave entrance (i.e., at 0 m). Analyses performed on the difference imagery covariates did not present a clear pattern for model selection. Our FSL models generated simpler, and thus more interpretable models, than LASSO. Subsequently, we developed a final predictive model based on the FSL results and included covariates that were selected in more than 50% of our simulations.
The following covariates were selected per imagery type: Predawn-slope, TPI, curvature at 0 m from cave entrance, as well as slope at 1 m from cave entrance; midday-slope, TPI, unenhanced thermal at 0 m from cave entrance; and difference-TPI and slope at 0 m from cave entrance, as well as unenhanced thermal and TPI at 3.5 m from cave entrance. ROC curves for our final model had area under curve (AUC) values with 95% confidence intervals centered near 99% (predawn: (0.974, 0.998); midday: (0.994, 0.998); difference: (0.994, 0.999)). To examine the sensitivity of the variables included in our models, we varied the inclusion threshold between 30% and 70% and observed no practical difference in AUC values (Figure 7).   Our cluster analysis revealed that predawn and difference imagery resulted in the highest proportions of pixels representing known caves (both at 79%), while 55% of the pixels within the midday imagery represented known caves ( Table 2). The total number of pixel clusters representing known caves were 47, 43, and 17 for predawn, midday, and difference imagery, respectively. Additionally, we identified 26, 29, and 8 clusters as either false positives or potential new cave entrances for the predawn, midday, and difference imagery, respectively. Examples of both known cave entrances and features with cave-like thermal behavior are provided in Figure 8. Table 1. Summary of covariates and percentages in which each covariate was selected based upon 100 model iterations for forward stepwise logistic (FSL) and Least Absolute Shrinkage and Selection Operator (LASSO) regression analyses for terrain analysis indices (Curvature, Topographic Position Index, and Slope) and the Unenhanced Thermal imagery for the three imagery types (Predawn, Midday, and Difference). Distance is in meters from the center of known cave entrances. Values in the corresponding cells represent the percentage of iterations in which each covariate was selected. Bold values denote covariates selected more than 50% of the iterations and greyed cells indicate the covariates selected for the final models. Total pixels per imagery type are provided. We arbitrarily selected the top 0.01% of pixels representing known and predicted caves. Clusters were defined as pixel aggregations where each pixel within a 2 m threshold distance of at least one other pixel was included within the cluster. These clusters were subdivided into clusters of known and predicted caves. Pixelp values represent the score for each pixel, which is based on the most parsimonious suite of predictor variables for each imagery type. Values range from 0 (lowest similarity to known caves;p min ) to 1 (highest similarity to known caves;p max ).

Discussion
Through this work, we have demonstrated that applying techniques adapted from terrain analysis to thermal imagery holds significant promise for detecting caves. Across the three imagery types, combinations of slope, TPI, and curvature at 0 m from the cave entrance (i.e., the pixel at the approximate center of the cave's entrance) were the variables included in the most parsimonious regression models. This suggests the use of terrain analysis algorithms is a robust tool to enhance detection capabilities with thermal imagery.
Overall, the models generated for predawn, midday, and difference performed similarly. We observed no quantitative differences in cave detection rates. ROC area under curves were 0.998, 0.997, and 0.996 for predawn, midday, and difference imagery datasets, respectively (refer to Figure 7). We also achieved an acceptable false positive rate (percent of non-cave pixels incorrectly identified as a cave) of approximately 0.02% when the true positive rate (percent of cave entrance pixels correctly identified as a cave) was between 20 and 50%. In other words, a 0.02% false positive rate implies that when examining a given region of one million pixels, we would have approximately 200 false positives while correctly identifying between 20 and 50% of the actual caves present.
We observed a high degree of pixel aggregation centered on and around known cave locations. This phenomenon is influenced by imagery resolution, which affected the shape of the ROC curves. From our analysis, Figure 9 is an example of pixel aggregation where two of three clusters represented known caves. Given that our thermal imagery resolution was 0.5 m, and most cave entrances were meters in diameter (and thus represent multiple pixels), cluster analysis is a necessary step in imagery interpretation. Thus, identifying clusters of a small number of the highest-scoring pixels (i.e., those with the highestp values) and then subsequently examining the visible imagery would be a good strategy for identifying potential false positives or small cave entrances or skylights. For terrestrial applications, a small number of aggregated pixels may represent a cave with a small entrance. This feature would likely become a high priority target for groundtruthing. For Mars, a small number of aggregated pixels representing a potential cave may be a low priority target for further examination as a human shelter or storage depot, but potentially a higher priority target for robotic exploration-as it may be more buffered from the surface environment (e.g., [96,97]). Thus, the grouping of pixels with cave-like thermal signatures into clusters will enable workers to prioritize which potential cave localities require further examination (i.e., physical groundtruthing of terrestrial caves or additional imagery interpretation for planetary caves). As the difference imagery data layer performed similarly to predawn and midday imagery for the regression analysis, inclusion of this imagery type as part of the analytical framework may be considered unnecessary. However, analysis of the top-scoring 0.01% of the difference imagery identified the same high proportion of known caves as the predawn imagery (79%), although the lowerp max values suggest that the difference imagery in general performed poorly in capturing unique cave signatures.
However, we are hesitant to discount the utility of the difference imagery. Logically, this combined data layer had great potential. Its relatively unimpressive performance may have been due to smaller sample sizes (37 caves in the difference imagery compared to 64 and 89 caves in the midday and predawn imagery, respectively), and smaller analysis area (the overlap of midday and predawn imagery was 46% of the midday imagery and 36% of the predawn imagery). Furthermore, the warping and geometric transformations techniques employed to georeference the original images to the landscape almost certainly had some error. This error would have been compounded when we mosaicked multiple images into single midday and predawn images, and then the final difference image layer would have represented these compounded spatial errors. Subsequently, we suggest this introduced additionally noise into the analysis. With more accurately georeferenced images, we suspect the difference analysis would have performed much more favorably.
Our models show several distinct locations that exhibit thermal characteristics comparable to known cave entrances (Figure 8). The red peaks rising above the landscape correspond with cave-like thermal signatures. There are several other examples across the landscape with similar cave-like behavior but are not near any known caves; these features are either false positives, fissures, permanently shaded features, or undiscovered cave locations. These areas should ultimately be groundtruthed. A corresponding benefit of isolating the higher-probability sites is our ability to exclude large portions of the landscape as unlikely to be caves. Thus, we would be able to focus our examination of potential features of interest more efficiently.
Interestingly, a few known cave locations do not correspond with peaks on the probability surface, which raises the possibility that these caves may have some morphological or geometrical difference that changes their thermal signature. These features may be shallow caves with different thermal characteristics than actual cave entrances-the latter guided the development of our models. Alternatively, these thermally atypical features may occur in areas with different geologic material (i.e., loosely consolidated rock or soil), which is differentially influencing the thermal behavior within the immediate vicinity of the cave entrance. Further examination will be required to understand what sets them apart, as well as possibly developing models to capture their unique thermal signatures. These possibilities underscore the need to further examine these thermal signatures to potentially develop a method for organizing these features into categories. Once done, a groundtruthing effort may be undertaken, so that data may be collected to develop and test models that describe their thermal signatures.
Because cave entrances are rarely perfectly circular, entrance configurations (often oval, oblong, crescent shaped or elongated cracks) may have skewed our values for most cave entrances. While our radial sampling method from the cave entrance ( Figure 4) was both statistically and methodologically defensible and successful overall, in some cases we may have included pixels which are neither part of the cave entrance nor represent an area influenced by the cave entrance when we averaged values for a given feature. Selecting points in a radius from entrance may be extracting DN values in locations where surface temperatures are disproportionally influencing pixel values, which may have resulted in subpixel mixing [98]. It is also possible that we are extracting pixel values from surface locations where the thermal effects of the cave entrance were not influencing surface temperature. If this occurred, this would have skewed the average values for a given cave entrance. While the reality of imperfectly shaped cave entrances may have affected our results, we still differentiated known cave entrances from surface pixels regardless of cave entrance shape. However, a further refinement would be to visually examine each entrance (or suspected entrance) and then manually modify the pixel value extraction configuration to best fit the shape of the entrance of known or suspected cave entrances.
Generally, there are landscape conditions where caves may be either undetectable and/or the false positive rate may be high. Cave entrances occurring on north-facing slopes may be indistinguishable and lost due to topographic shadowing-as the DN values between these two features may be similar. Detection is also constrained by pixel resolution, such that cave entrances smaller than the pixel resolution may not be resolvable. Relying on one imagery type (e.g., predawn) could also increase the false positive rate as higher thermal inertia features, such as rock or bedrock outcrops, may be confused with cave entrances. Thus, acquiring imagery at two or more times over the 24-h cycle will be required to optimally improve detection capabilities under these conditions. Furthermore, poorly georeferenced thermal imagery may cause the cave locations to be shifted, or in particularly bad cases, may cause the predictive model to be based on skewed or warped data.
Overall, the identification and classification of non-cave landscape features that may be confused with cave entrances will be an important step towards further reducing the false positive rate. This becomes particularly relevant when examining possible cave features using only spacecraft-borne imagery of planetary bodies where a boots-on-theground groundtruthing effort is currently not possible. In the Pisgah dataset, we observed linear features (likely fissures), large collapse trenches, and areas of topographic shadowing. These features should be thoroughly examined. To further the science from both a terrestrial and planetary perspective, we could use this dataset as a feature class library. A first step would be to perform a supervised or unsupervised classification [99], or an object-based classification method to capture odd shapes on the landscape (e.g., [100][101][102]) within the thermal imagery. This would result in the development of feature categories, which then could be grouped based upon their likeness to one another; however, it is uncertain how these methods would perform in classifying the local thermal "shape" of the cave neighborhood. These procedures may require an intermediate step of describing shape metrics in a standard raster format before applying the classification methods. Thereafter, a field effort could be conducted to characterize and potentially reorganize these identified features into more specific categories. Once done, visible spectrum and thermal images of these features could be entered into a feature class library. These features could then be compared to similar features in both terrestrial and planetary imagery datasets. As this work progresses and more non-cave features with similar cave thermal signatures are identified, these data would be added to the feature class library.
Through applying the recommended methodological and field improvements elucidated above, this will enable us to further examine and hopefully quantify differences between deep and shallow caves, as well as identify and quantify those surface features, which may be confused with cave entrances. Through such an effort, we will further improve our ability to detect terrestrial and planetary caves-with the long-term goals of furthering cave conservation and management on Earth and identifying the best candidate cave sites for robotic exploration and human habitation on the moon, Mars and beyond.

Conclusions
We have shown that thermal imagery collected at least during predawn and midday are useful in detecting caves; other acquisition times may also be beneficial [69]. This finding has important implications for both terrestrial and planetary applications. As financial support continues to be one of the greatest challenges for biodiversity conservation efforts [103], the use of existing thermal imagery will factor prominently into budgeting cave research projects that may seek to detect caves for various conservation and management goals. For planetary missions, inherent difficulties often arise when attempting to acquire imagery from a specific region at a specific time from an orbiting or fly-by spacecraft. Thus, the flexibility illuminated through terrain analysis should expand the opportunities for detecting terrestrial and planetary caves using thermal imagery.
While the regression models for all three imagery types (predawn, midday, and difference) performed similarly, cluster analysis revealed that predawn and difference imagery out-performed midday imagery in identifying known cave entrance locations (see Table 2). This suggests that taking the time to develop and analyze difference imagery may be a worthwhile endeavor.
Finally, we emphasize the importance of cluster analysis as a necessary step for identifying and predicting cave locations. Overall, the number of pixels that look like a cave is less important than the groupings of these pixels. Thus, larger clusters of pixels may provide stronger evidence of a cave entrance.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A. Precision and Accuracy Calculations of QWIP Thermal Instrument
The analysis of DN values in this paper depends on the precision and calibration stability of the instrument over the period of data collection and not the accuracy (i.e., calibration) of the data. Importantly, the analysis of DN levels acquired at different times of day (e.g., predawn and midday) is valid if the DN variation for each time-period is larger than the instrument precision, and the mean difference in DN between the two time periods is significantly larger than the variation of DN for the two times of day.
Titus et al. [69], using the same QWIP instrument, acquired two 24-h sets of oblique thermal images overlooking the Pisgah Lava Field from the approximate mid-elevation point on the Pisgah cinder cone in late March of 2010 [104]. In situ temperature data of both surface and near surface (~1 m AGL) atmosphere were acquired. They determined the precision of the brightness temperature (no correction for atmosphere or surface emissivity) was accurate to 1 • C. Based on this in situ calibration, which included atmospheric correction, they found surface temperature estimates were accurate to~3.7 • C at 20 • C; this assumes an adequate characterization of the atmosphere, where air temperature and humidity data used may not accurately represent the conditions along the entire line-of-sight from camera to surface. They asserted this was likely the cause for the larger uncertainty in accuracy under field conditions, when compared to laboratory testing results. Refer to Titus et al. [69] for more details.
For the 2011 overflight data, no reliable surface or air temperature data were acquired. The calibration of the QWIP, which depends on the "Non-Uniformity Correction" setting was changed between the 2010 and 2011 QWIP campaigns. Therefore, we do not have absolute calibration coefficients. However, as previously mentioned, the technique presented in this paper only depends on the instrument data having high precision and the calibration remaining stable during the campaign. To increase our confidence, we present a comparison between the 2010 and 2011 datasets, as shown in Figure A1. Calibration curves assume atmospheric conditions as determined from PRISM data [105] and Humidity Converter [106] (see Table A1) and line-of-sight distance being the mean overflight altitude. Two sets of atmospheric conditions are shown for predawn and midday flight times. Surface radiance was corrected for atmospheric absorption and emission sensu Titus et al. [69]. The three sets of calibration coefficients from Titus et al. [69] are shown in Table A2. DN offset was increased by 4500 DN, while the calibration gain was unchanged. Surface emissivity was assumed to be 0.72, consistent with basalt. Box plots in Figure A1 show DN distributions for predawn (bottom left) and midday (upper right) for the overflight data (Table A3) and surface temperatures of the 11 sample sites for the 2010 study (Table A4). While surface temperature statistics are from the previous year and a few weeks earlier in the spring, close alignment of observed DN values and surface temperatures provided a confidence check that the QWIP continued to be a precision instrument with stable calibration over the period of the 2011 overflight campaign. The offset of the midday box plot could be due to an increase in gain from 2010, slightly higher surface temperatures (being early April and not late March), or both.
Flight times were selected to maximize the temperature difference between the two overflights while minimizing temperature changes during each overflight. Selection of flight times were based on the analysis of the 2010 data [104]. For the 11 sample sites where surface temperature was measured, the difference between minimum and maximum temperatures (temperature range) varied between 3.2 • C and 6.1 • C for the midday flight time span and between 0.9 • C and 2.1 • C for the predawn flight time span (Table A5). Temperature fluctuations should have a negligible effect on spatial indices calculated for either midday or predawn mosaics as the adjacent pixels would have been collected at the same time. For indices that utilize the difference between midday and predawn DN levels, temperature fluctuation may introduce additional uncertainty but the smallest difference in temperature for the 2010 data was 19 • C, which is easily discernable from the maximum variations observed during each overflight time span.
Acquisition of aircraft borne thermal imagery takes time, over which the surface temperature changes. In this paper, thermal indices generated from either midday or predawn flights were used for the orthorectified images, where all pixels were collected at the same local time. Because these indices were based on differences and not absolute DN values, these indices are relatively invariant to surface changes during each few-hour flight. Indices based on DN differences between midday and predawn flights must compare the mosaics, which do have variations in local time, and therefore changes in surface temperature. However, the expected DN changes over the time of flight is smaller than the DN gap between midday and predawn flights. This gap is apparent in both Figure A1 and Table A5. By selecting these two times of flight, we maximized the DN differences between the two overflight mosaics, while minimizing DN variations within each mosaic. Thus, the results presented in this paper should be robust.  Figure A1. Calibration curves estimated using the coefficients from Titus et al. [69] but adjusted for the Pisgah 2011 dataset. Red, blue, and black curves are based on the calibration coefficients derived from the B Cave, Station 7 trench, and combined datasets as shown in Table A2. Solid and dashed lines are for predawn and midday atmospheric conditions, respectively. Box plots show the comparison between 2011 DN range of values (Table A3) and the 2010 surface temperature range of values (Table A4). Top and bottom of vertical line (or whisker) represent maximum and minimum DN values, while upper and lower horizontal lines of the box represent the upper and lower quartiles for DN values; horizontal line within the box is the median DN value. Horizontal whisker, vertical bounding lines of boxes, and the vertical line at center represent minimum and maximum surface temperatures, upper and lower quartiles of surface temperature, and median surface temperature, respectively. Note: Box plot at bottom right is small due to low variation in predawn DN levels; thus, the box appears filled. Instrument bias has been increased by 4500 DN. The gain was not adjusted as we cannot determine if the April 2011 surface temperatures were higher or if the gain factor was slightly higher. It is likely that both occurred. A distinct gap for both surface temperature and measured DN levels exists between the midday and predawn data, suggesting that "topographical" indices that depend on midday and predawn differences should be robust even with some variation in surface temperatures during the overflight times.