Semantic Segmentation Deep Learning for Extracting Surface Mine Extents from Historic Topographic Maps

Abstract: Historic topographic maps, which are georeferenced and made publicly available by the United States Geological Survey (USGS) and the National Map's Historical Topographic Map Collection (HTMC), are a valuable source of historic land cover and land use (LCLU) information that could be used to expand the historic record when combined with data from moderate spatial resolution Earth observation missions. This is especially true for landscape disturbances that have a long and complex historic record, such as surface coal mining in the Appalachian region of the eastern United States. In this study, we investigate this specific mapping problem using modified UNet semantic segmentation deep learning (DL), which is based on convolutional neural networks (CNNs), and a large example dataset of historic surface mine disturbance extents from the USGS Geology, Geophysics, and Geochemistry Science Center (GGGSC). The primary objectives of this study are to (1) evaluate model generalization to new geographic extents and topographic maps and (2) assess the impact of training sample size, or the number of manually interpreted topographic maps, on model performance. Using data from the state of Kentucky, our findings suggest that DL semantic segmentation can detect surface mine disturbance features from topographic maps with a high level of accuracy (Dice coefficient = 0.902) and relatively balanced omission and commission error rates (Precision = 0.891, Recall = 0.917). When the model is applied to new topographic maps in Ohio and Virginia to assess generalization, model performance decreases; however, performance is still strong (Ohio Dice coefficient = 0.837 and Virginia Dice coefficient = 0.763). Further, when reducing the number of topographic maps used to derive training image chips from 84 to 15, model performance was only slightly reduced, suggesting that models that generalize well to new data and geographic extents may not require a large training set.
We suggest the incorporation of DL semantic segmentation methods into applied workflows to decrease manual digitizing labor requirements and call for additional research associated with applying semantic segmentation methods to alternative cartographic representations to supplement research focused on multispectral image analysis and classification.


Introduction
Patterns of land cover and land use (LCLU) change can be very complex, especially when investigated over long time periods and/or in areas with multiple and changing drivers of alteration [1][2][3][4][5][6]. Mapping and quantifying LCLU change is particularly difficult in topographically complex landscapes where inaccurate terrain correction can result in misalignment between analyzed datasets [7] and in heterogeneous landscapes with many and varied transitions between thematic classes [8][9][10]. Long time series of moderate spatial resolution satellite imagery have been of great value in documenting and quantifying LCLU change during their operational lifespans [1,5,6]. For example, the Landsat program, which first collected data in the early 1970s and currently collects data with the instruments onboard Landsat 8, has been used to map changes across the United States (US) as part of the National Land Cover Database (NLCD) [5,6]. Unfortunately, such products have a limited historic scope since moderate spatial resolution Earth observation data from civilian sensors only extend back to the early 1970s, with more frequent collections and finer temporal resolutions only offered more recently [11,12]. Thus, in order to more completely document and quantify the historic extent and associated impacts of LCLU change, additional data sources should be investigated.
As a specific example, understanding LCLU change and associated environmental impacts resulting from surface coal mining and subsequent reclamation could benefit from extending the land change record. In the Appalachian region of the eastern United States specifically, coal mining began in the late 1800s, with significant intensification in the early-to-mid-1900s resulting from railroad expansion, industrialization, and increased demand [13][14][15]. Further, the modes and intensity of mining have changed substantially over time as a result of technological advances and economic drivers. For example, in the coalfields of southern Appalachia, including southern West Virginia (WV), eastern Kentucky (KY), southwestern Virginia (VA), and eastern Tennessee (TN), surface mining was historically dominated by highwall and contour mining while more recent extraction has relied on mountaintop removal [16][17][18]. Additionally, mine reclamation practices have changed over time; for example, reclamation requirements were greatly expanded by the US Surface Mining Control and Reclamation Act (SMCRA) of 1977 [19,20].
In the US, 1:24,000-scale, 7.5-min quadrangle topographic maps offer a wealth of historic information that could be integrated with Earth observation data to expand the historic record. The United States Geological Survey (USGS) topographic map program was operational from 1947 to 1992, with some map revisions continuing until 2006, during which time more than 55,000 maps were produced across the 48 states of the conterminous US [21]. Thus, such data could offer a means to extend change mapping by several decades for disturbances that were consistently documented on these maps. Unfortunately, such features are not commonly in a format that can be easily integrated into analyses (e.g., geospatial vector point, line, or polygon data). Figure 1 shows an area of surface mine disturbance represented on a 1:24,000-scale topographic map of the Offutt quadrangle in the state of KY. This map represents the 1954 landscape, and surface mining disturbance is denoted using a pink "disturbed surface" symbol. As evident in this example, mining is well differentiated on these maps; however, automating the extraction of such features is complicated by inconsistencies between maps, differences in mine disturbance symbology, and overprinting with contour lines, text and labels, and other features [22][23][24][25].
Given the complexity of extracting such features, in this study, we investigate the use of deep learning (DL), specifically modified UNet semantic segmentation based on convolutional neural networks (CNNs), as a technique for extracting surface mine features from historic topographic maps. We make use of digitized and georeferenced historic topographic maps made publicly available by the USGS [26] along with manually digitized extents produced by the USGS Geology, Geophysics, and Geochemistry Science Center (GGGSC) [23]. Our primary objectives are to (1) quantify how well models trained on a subset of topographic maps in KY generalize to different maps in KY, VA, and Ohio (OH) and (2) assess the impact of training sample size, or the number of manually digitized topographic maps available, on model performance. This study does not attempt to develop a new DL semantic segmentation algorithm or compare a variety of existing algorithms for this specific mapping task. Our primary focus in this study is the use of DL-based semantic segmentation for extracting historic information from cartographic maps, which provide a wealth of information to extend land change studies and further quantify anthropogenic landscape alterations. Such methods are needed to take full advantage of current and historic data.

Topographic Maps
The first USGS topographic map was produced in 1879; however, the number of maps produced was limited until the 1940s and 1950s when aerial photo acquisition and photogrammetric methods became routine for deriving elevation data from overlapping stereo images. The largest-scale maps produced for the entire conterminous US, which are used in this study, were produced on a 1:24,000 scale and span 7.5 minutes of latitude and longitude, known as 7.5-minute quadrangles [27]. When the program ended in 2006, it was replaced by the US Topo program, which generates digital maps and associated geospatial databases. Paper maps from the prior program were scanned and georeferenced into the National Map's Historical Topographic Map Collection (HTMC) and can be accessed as digital raster graphics (DRGs) for use in geographic information systems [28][29][30][31]. Currently, maps can be viewed and downloaded using topoView, which includes an interactive web map (https://ngmdb.usgs.gov/topoview/) [32].
Historic topographic maps document a wide variety of mining and prospecting features including pits, strip mines, disturbed surfaces, mine dumps, quarries, and tailings using point symbols, areal symbols, and text, which are the product of manual photograph interpretation and some field validations. Areas of surface mining disturbance are generally symbolized using a generic brown and/or pink pattern (see Figure 1) with some more unique symbols for specific features, such as tailings [23]. Such areal, thematic features are the focus of this study.


Surface Mine Mapping
Despite the large database of historic cartographic representations of mining, much effort has been placed on the use of remotely sensed data for mapping and monitoring surface mining and subsequent reclamation. For example, Townsend et al. [33] documented surface mine extent change in portions of central Appalachia from 1976 to 2006 using a Landsat time series, while Pericak et al. [34] and Xiao et al. [35] used Landsat data and Google Earth Engine cloud-based computation to map surface mining in the mountaintop mining region of the eastern US and the Shengli coalfields of Inner Mongolia, respectively. Also using Landsat data, Sen et al. [36] applied disturbance/recovery trajectories for differentiating mining and mine reclamation from other forest displacing practices. Several studies have compared commercial satellite imagery, aerial orthophotography, and the combination of imagery and light detection and ranging (LiDAR) for mapping of mining and mine reclamation [37][38][39][40].
Despite the existing research on mine mapping from remotely sensed data, a review of the literature suggests that there is a lack of studies focused on extracting surface mine features from topographic maps and other cartographic representations. In fact, research on the use of machine learning (ML) or DL to extract features from topographic maps is generally limited. Li et al. [24] explored the recognition of text on topographic maps using DL while Uhl et al. [25] investigated the mapping of human settlements from maps using CNNs. Lui et al. [41] explored CNNs for the general segmentation of topographic maps. There has been some success in applying DL methods to other types of digital data, beyond the primary focus on multispectral satellite or aerial imagery. For example, DL has been applied to the detection of features from digital terrain data; Behrens et al. [42] investigated the application of digital elevation data and DL for digital soil mapping while Trier et al. [43] investigated the mapping of archeological features from LiDAR data. In a mining-related study, Maxwell et al. [44] used LiDAR-derived slopeshades and the Mask Regional-CNN (Mask R-CNN) DL algorithm for detecting instances of valley fill faces, which are an artificial landform resulting from mountaintop removal mining and reclamation practices. As these studies highlight, DL methods have been shown to be of value for mapping and extracting features from a variety of geospatial data layers, suggesting that further investigation of their application to historic topographic maps is merited.

Deep Learning Semantic Segmentation
Extraction of thematic information from geospatial data is both a common operational task and an active area of research [45]. Supervised classification, which relies on user-provided training data, using ML algorithms is often undertaken since these methods have been shown to be more robust to complex feature space than parametric methods [46,47]. However, the incorporation of spatial context information is limited when the pixel is used as the unit of analyses and no textural measures are included as predictor variables [48,49]. To overcome this lack of spatial context, especially when high spatial resolution data are used, geographic object-based image analysis (GEOBIA) first groups adjacent pixels into objects or polygons based on similarity. These regions then become the unit for subsequent analysis and classification [48,50,51]. CNN-based DL for semantic segmentation further expands the incorporation of spatial context information, and offers a natural extension of prior methods [52][53][54][55].
DL is an extension of artificial neural networks (ANNs). An ANN consists of neurons organized into layers where input layers represent input predictor variables, such as image bands, while output layers represent the desired output, such as land cover categories. Between the input and output layers are one or more hidden layers containing multiple neurons. Within the network, connections between neurons have associated weights that can be learned or updated. By applying a bias and a non-linear activation function while iteratively adjusting the weights through multiple passes, or epochs, on the training data while monitoring a loss metric, patterns in the data can be learned to make new predictions [46,56,57]. DL expands upon this ML framework to include many hidden layers and associated nodes (i.e., tens or hundreds), which can allow for the modeling of more complex patterns and a greater abstraction of the input data [52][53][54][55].
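The forward pass described above can be sketched in plain Python; the toy weights, biases, and input values below are purely illustrative and not drawn from the study:

```python
import math

def sigmoid(x):
    # non-linear activation squashing any real value into (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

def forward(inputs, hidden_weights, hidden_biases, out_weights, out_bias):
    # one hidden layer: each neuron takes a weighted sum of the inputs,
    # adds its bias, and applies the non-linear activation
    hidden = [sigmoid(sum(w * x for w, x in zip(ws, inputs)) + b)
              for ws, b in zip(hidden_weights, hidden_biases)]
    # the output neuron does the same over the hidden activations
    return sigmoid(sum(w * h for w, h in zip(out_weights, hidden)) + out_bias)

# toy network: 2 inputs -> 2 hidden neurons -> 1 output
p = forward([0.5, -1.0],
            [[0.1, 0.4], [-0.3, 0.2]],  # hidden weights
            [0.0, 0.1],                 # hidden biases
            [0.7, -0.5], 0.2)           # output weights and bias
print(round(p, 3))
```

During training, the weights and biases above would be iteratively adjusted over many epochs to minimize a loss metric; here they are fixed for illustration.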
DL and its applications in remote sensing science and image analysis have expanded greatly with the inclusion of convolutional layers, which allow CNNs to learn spatial patterns by updating weights associated with spatial filters or kernels. These learned filters then pass over the image as a moving window to perform convolution and generate spatial abstractions of the data, or feature maps. When applied to DL, it is common for hundreds of such filters to be learned. Initially, CNNs were applied to scene labelling problems where an entire image is categorized. Advancements in CNNs have allowed for semantic segmentation, where each pixel is classified, similar to traditional remote sensing classification methods, or instance segmentation, where individual instances of a class are differentiated [52][53][54][55]. We make use of semantic segmentation in this study.
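The moving-window convolution described above can be illustrated with a hand-written kernel; in a CNN the kernel weights are learned rather than fixed, so the vertical-edge filter below is purely illustrative:

```python
def convolve2d(image, kernel):
    # slide the kernel over the image (valid padding, stride 1)
    # and return the resulting feature map
    kh, kw = len(kernel), len(kernel[0])
    ih, iw = len(image), len(image[0])
    out = []
    for r in range(ih - kh + 1):
        row = []
        for c in range(iw - kw + 1):
            row.append(sum(image[r + i][c + j] * kernel[i][j]
                           for i in range(kh) for j in range(kw)))
        out.append(row)
    return out

# a vertical-edge kernel applied to a tiny image with a bright right half
image = [[0, 0, 9, 9],
         [0, 0, 9, 9],
         [0, 0, 9, 9],
         [0, 0, 9, 9]]
kernel = [[-1, 0, 1],
          [-1, 0, 1],
          [-1, 0, 1]]
print(convolve2d(image, kernel))
```

A DL model learns hundreds of such kernels, each producing its own feature map from the preceding layer's output.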
Fully convolutional neural networks (FCNs) offer one means to perform semantic segmentation with CNN-based architectures using a combination of convolution and deconvolution with up-sampling (i.e., increasing the size of the data array that represents the image). Example FCNs include SegNet [58], which was originally developed for semantic segmentation of digital photographs and introduced in 2017, and UNet [59][60][61], which was developed for segmentation of medical imagery and introduced in 2015. In this study, we use a modified version of UNet.
A conceptualization of a generic UNet is provided in Figure 2. First, in the convolution, contracting, or encoder component, spatial patterns are modeled at multiple scales by iteratively adjusting weights associated with a series of 3 × 3 2-dimensional (2D) convolutional layers and by applying 2 × 2 max pooling, which decreases the size of the data array by returning the maximum value of the feature maps within 2 × 2 pixel windows. This allows for spatial patterns to be learned at multiple scales since feature maps generated by the prior convolutional layer are used as input to the subsequent layer following the max pooling operation. Next, these same feature maps are used to convert the data back to the original dimensions using 2D deconvolution to output a semantic, or pixel-level, classification from the reconstruction. This is the deconvolution, expanding, or decoding component [59][60][61].
In Figure 2, the example UNet accepts training data as image chips with dimensions of 128 pixels (height) × 128 pixels (width) × 3 channels (for example, red, green, and blue). As the data are fed through the encoder component, 3 × 3 2D convolution is used to learn spatial filters, and the dimensions of the data array decrease (128 → 64 → 32 → 16 → 8) as a result of the max pooling operations while the number of filters increases (64 → 128 → 256 → 512 → 1024). In the decoding component, the size of the array increases as a result of 2 × 2 2D deconvolution applied to the feature maps learned in the encoding component while the number of filters decreases. In the final phase of the architecture, 64 channels or feature maps are used as input to perform a classification at each pixel location, where each pixel location is treated as a vector containing 64 values or predictor variables. The output is a probability of each pixel belonging to each of the defined classes, with the total of all class probabilities at each pixel location summing to 1.
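The dimension and filter progression described above can be tabulated with a short helper whose values mirror the example in Figure 2 (spatial size halves at each max pooling step while the filter count doubles):

```python
def encoder_shapes(size=128, filters=64, blocks=5):
    # spatial size halves after each max pooling step while the
    # number of learned filters doubles in each encoder block
    shapes = []
    for _ in range(blocks):
        shapes.append((size, filters))
        size //= 2
        filters *= 2
    return shapes

print(encoder_shapes())
```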
UNet and other semantic segmentation methods have been applied to a variety of feature extraction and classification problems and have also been applied to a variety of geospatial and remotely sensed data. For example, modifications of UNet have been applied to the mapping of general land cover change [62], coastal wetlands [63], palm trees [64], cloud and cloud shadows [65], urban buildings and change detection [66][67][68], roads [69], and landslides [70]. Generally, UNet and other FCNs have shown great promise due to their ability to model complex spatial patterns and context while generating data abstractions that generalize well to new data [54,55]. Given the complex patterns present in topographic maps, where features may be overprinted with contour lines, point symbols, and/or text, and inconsistencies between different scanned topographic maps in regard to tone, hue, contrast, and/or sharpness, we hypothesize that this is an optimal method to explore for this mapping problem.

Study Areas and Input Data
In this study, we focus on the surface coal mining region of Appalachia in the eastern US due to the availability of historic topographic maps and training data and an abundance of historic surface mine disturbance. As shown in Figure 3, eastern KY is the primary area of study. Within this region, 100 7.5-minute quadrangles were selected (Table 1). Since multiple maps can be generated for each quadrangle representing different dates and historic conditions, a total of 122 maps were included in the study. Any topographic map that did not include any mapped surface mine disturbance was removed. In order to assess how well the trained models generalize to different topographic maps and new geographic extents, we selected 23 additional topographic maps in OH and 25 in VA based on the prevalence of surface mine features. A total of 170 historic topographic maps were included in the study.
Table 1. Number of unique 1:24,000-scale, 7.5-minute topographic quadrangles within each dataset. A unique map is defined based on a SCANID. The Unique Quads column provides the number of unique quadrangle names in each dataset. The chips columns represent the number of 128-by-128 pixel chips in each dataset that contain mining and/or not mining pixels, only no mining pixels, and the total number of chips. The compilation years of the oldest and newest maps within each dataset are also provided.
Example mine extents were obtained from the publicly available USGS GGGSC prospect- and mine-related features database. This is an ongoing project, and data are not currently available for all historic maps and states. For example, data for WV is not yet complete [23]. This database contains manually digitized point and polygon features interpreted from mine symbols on historic topographic maps available in the HTMC. Examples of the historic topographic maps and mining features data are shown in Figure 4. We only used polygon features digitized from 1:24,000-scale maps in this study. Also, any features categorized as a pond were removed from the dataset, and we maintained all features that were denoted using the standard brown or pink "surface disturbance" pattern or symbols associated with tailings. In the public HTMC database, each historic topographic map is identified with a unique SCANID. We found that it was not possible to simply match the digitized features with the correct map based on this identifier, as features near quadrangle boundaries were not re-digitized if they were already captured in adjacent quads or if the features were captured in a prior version of the topographic map and did not change. So, all features that intersected the extent of the quadrangle were extracted and then manually inspected for all 170 maps. Features that were not present on the map were removed, and any missing features were added; however, missing features were a rare occurrence, as the database was very comprehensive.

DL models that make use of CNNs require that training data be provided as small image chips. When semantic segmentation is being performed, labels must be provided as raster masks where each category is assigned a unique code or cell value [52][53][54][55].
Image chips and associated label masks were generated using the Export Training Data for Deep Learning Tool in the ArcGIS Pro software environment [71,72]. Chips were produced at a size of 128-by-128 pixels, as this allowed for the generation of a large number of chips and the use of a large batch size during training and validation without overwhelming the computational resources. Given the large number of available chips, and to decrease correlation amongst the data, no overlap was used so that each chip covered a unique extent. Moreover, only full chips were produced. Thus, chips from near the edge of maps, which do not allow for a complete set of 128-by-128 rows and columns of data, were not included. All chips that contained at least some cells occurring in the historic mine extents were used. In order to incorporate a variety of examples of the background, absence, or "not mining" class, we randomly selected an additional 150 absence-only chips from each topographic map. This subsample from the complete dataset of absence-only chips was selected to reduce class imbalance in the training process.

Table 1 above summarizes the number of chips used in each dataset. In total, the model was trained using 30,392 examples (KY Training) and evaluated at the end of each training epoch with 5288 chips (KY Testing). To validate the final model, 6849 chips were withheld from the KY data (KY Validation); these chips did not overlap with the training or testing data. We also validated in the new regions within OH and VA using 10,299 and 15,890 chips, respectively (OH Validation and VA Validation). Example image chips and associated masks are shown in Figure 5.
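The chip-generation logic described above can be sketched as follows; the map dimensions, mining mask, and random seed below are illustrative stand-ins and do not reproduce the behavior of the actual ArcGIS Pro tool:

```python
import random

def chip_origins(map_height, map_width, chip_size=128):
    # non-overlapping grid of chip origins; partial chips at the map
    # edge (fewer than 128 rows/columns remaining) are skipped
    return [(r, c)
            for r in range(0, map_height - chip_size + 1, chip_size)
            for c in range(0, map_width - chip_size + 1, chip_size)]

def sample_chips(origins, has_mining, n_absence=150, seed=42):
    # keep every chip containing mine pixels; subsample the
    # absence-only chips to reduce class imbalance
    presence = [o for o in origins if has_mining(o)]
    absence = [o for o in origins if not has_mining(o)]
    rng = random.Random(seed)
    return presence + rng.sample(absence, min(n_absence, len(absence)))

# toy 1000 x 1300 pixel map; pretend mining occurs in the top-left corner
origins = chip_origins(1000, 1300)
chips = sample_chips(origins, lambda o: o[0] < 256 and o[1] < 256)
```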
In order to minimize autocorrelation in the training, testing, and validation datasets, all image chips from the same SCANID were included in the same partition. Moreover, all chips from the same quadrangle, as identified by the quad name, were included in the same partition. These divisions were determined using random sampling.
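The group-wise partitioning described above can be sketched as follows; the group labels, partition fractions, and seed are hypothetical choices for illustration:

```python
import random

def split_by_group(chips, group_of, fractions=(0.7, 0.15, 0.15), seed=7):
    # assign whole groups (e.g., all chips sharing a SCANID or quad name)
    # to one partition so that related chips never straddle partitions
    groups = sorted({group_of(c) for c in chips})
    rng = random.Random(seed)
    rng.shuffle(groups)
    n_train = round(len(groups) * fractions[0])
    n_test = round(len(groups) * fractions[1])
    train_g = set(groups[:n_train])
    test_g = set(groups[n_train:n_train + n_test])
    parts = {"train": [], "test": [], "val": []}
    for c in chips:
        g = group_of(c)
        key = "train" if g in train_g else "test" if g in test_g else "val"
        parts[key].append(c)
    return parts

# toy chips tagged with a quad name (10 quads, 10 chips each)
chips = [("quad%02d" % (i % 10), i) for i in range(100)]
parts = split_by_group(chips, group_of=lambda c: c[0])
```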


Model Training
DL model training and assessment were conducted using the R statistical and data science computational environment and language [73]. Specifically, models were generated using the keras [74] and tensorflow [75] packages. The keras package interfaces with the Keras Application Programming Interface (API) [76], which is written in Python [77] and can interface with several DL and tensor math backends, including Google's TensorFlow [78]. The reticulate package [79] allows for an interface between R and Python, and was used to allow the keras package to make use of the Python API. For image manipulation and preparation, we used the magick package [80], which provides bindings to the ImageMagick open-source image processing library. In order to implement and modify UNet specifically, we referenced the UNet example provided by RStudio and available on GitHub and as part of the online keras package documentation [81,82]. All experiments were conducted on a Microsoft Azure virtual machine, which provided access to an NVIDIA Tesla T4 graphics processing unit (GPU).
The UNet model used in this study is described in Table 2. Similar to the default architecture, we used 4 downsampling blocks in the encoder and 4 upsampling blocks in the decoder. The downsampling blocks each contained 2 convolutional layers, while the upsampling blocks contained 3. The default number of filters per block and kernel size were maintained. The architecture was designed to accept input tensors of shape 128 × 128 × 3 (height, width, channels) and differentiate two classes: mining features, coded to 1, and not-mining features, coded to 0. Since this is a binary classification problem, a sigmoid activation function [83] was used in the final layer, which resulted in a probabilistic prediction between 0 and 1, where 1 represents a predicted high likelihood of a mine feature occurring at the pixel location. In order to maintain the size of the arrays, padding was used along with a max pooling size and stride of 2 × 2. A total of 34,527,041 trainable parameters were included in the model. Some changes were made to the default UNet architecture. First, we used the leaky rectified linear unit (ReLU) activation function as opposed to a traditional ReLU in the convolutional layers to avoid "dying ReLU" issues [84]. We also used the AdaMax optimizer [85] instead of Adam (Adaptive moment estimation) or RMSProp (Root Mean Square Propagation) and included a callback to reduce the learning rate if the loss plateaued for more than 5 epochs. Lastly, we used Dice coefficient loss as opposed to binary cross-entropy loss due to issues of class imbalance and our desire to balance omission and commission error rates for the mine features class [86][87][88].
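The Dice coefficient loss used in place of binary cross-entropy can be expressed compactly. Below is an illustrative NumPy sketch (the study implemented the loss within keras in R); the `smooth` term is an assumed convention for numerical stability, not taken from the paper.

```python
import numpy as np

def dice_coefficient(y_true, y_pred, smooth=1.0):
    """Soft Dice coefficient, 2*|A intersect B| / (|A| + |B|); the smoothing
    term (an assumed convention) avoids division by zero on empty masks."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    intersection = np.sum(y_true * y_pred)
    return (2.0 * intersection + smooth) / (np.sum(y_true) + np.sum(y_pred) + smooth)

def dice_loss(y_true, y_pred):
    # Minimizing 1 - Dice pushes predictions toward overlap with the reference mask.
    return 1.0 - dice_coefficient(y_true, y_pred)

# A perfect prediction gives a Dice coefficient of 1.0 (loss of 0.0).
assert dice_loss([0, 1, 1, 0], [0, 1, 1, 0]) == 0.0
```

Because the loss is driven by overlap with the positive (mining) mask rather than per-pixel correctness, it is less dominated by the abundant background class than cross-entropy.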
The model was trained for 100 epochs using all the training examples in batches of 32 image chips, and the final model was selected as the model with the lowest Dice coefficient loss for the KY Testing data. The training examples were randomly shuffled between each epoch, and random data augmentations were included to increase the data variability and combat overfitting. Specifically, image chips were randomly flipped left-to-right and/or up-and-down. Slight random adjustments of brightness, contrast, saturation, and/or hue were also applied. Lastly, batch normalization was implemented for all convolutional layers. The model took roughly 24 h to train on the Microsoft Azure virtual machine using the available GPU.
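The flip augmentations described above must be applied identically to each image chip and its label mask so the two stay aligned. A minimal NumPy sketch follows (illustrative, not the authors' R implementation; the brightness/contrast/saturation/hue jitter, which applies to the image only, is omitted):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_augment(image, mask):
    """Randomly flip an image chip left-to-right and/or up-and-down,
    applying the same flip to its label mask."""
    if rng.random() < 0.5:  # left-to-right flip
        image, mask = image[:, ::-1], mask[:, ::-1]
    if rng.random() < 0.5:  # up-and-down flip
        image, mask = image[::-1, :], mask[::-1, :]
    return image, mask

chip = np.arange(16).reshape(4, 4)
aug_img, aug_mask = random_augment(chip, chip.copy())
assert aug_img.shape == chip.shape  # flips never change chip dimensions
```

Applying fresh random flips at every epoch effectively multiplies the variety of the training set without requiring additional manually interpreted maps.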

Prediction and Model Validation
Once a final model was obtained, the keras package was used to predict the withheld validation image chips in KY, OH, and VA and to calculate assessment metrics. We also developed a custom R script to apply the model to entire topographic maps, which broke the input data into 128-by-128 pixel arrays, used the model to predict each subset, and then merged the results to reproduce the topographic map extent. In order to remove poor predictions near the edges of each chip, the outer 20 rows and columns of each chip were removed, and a 50% overlap between adjacent chips was applied so that only the center of each chip was used in the final, merged prediction. Any cell with a mine feature probability higher than 0.5 was coded to the mine features class, while all other cells were coded to the not-mining features class. Using our custom script, an entire topographic map could be predicted in roughly 15 minutes.
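The tile-predict-merge procedure can be sketched as follows. This is an illustrative Python/NumPy analogue of the custom R script; map edges and partial tiles are not handled in this sketch.

```python
import numpy as np

CHIP = 128          # chip height/width expected by the model
TRIM = 20           # edge rows/columns discarded from each prediction
STRIDE = CHIP // 2  # 50% overlap between adjacent chips

def predict_map(image, predict_chip):
    """Tile a full map into overlapping CHIP x CHIP windows, predict each
    window with `predict_chip`, keep only the trimmed center of each
    prediction, and merge the centers back to the map extent."""
    h, w = image.shape[:2]
    probs = np.zeros((h, w), dtype=float)
    for top in range(0, h - CHIP + 1, STRIDE):
        for left in range(0, w - CHIP + 1, STRIDE):
            pred = predict_chip(image[top:top + CHIP, left:left + CHIP])
            probs[top + TRIM:top + CHIP - TRIM,
                  left + TRIM:left + CHIP - TRIM] = pred[TRIM:-TRIM, TRIM:-TRIM]
    # Threshold probabilities: > 0.5 -> mine features (1), else not-mining (0).
    return (probs > 0.5).astype(np.uint8)

# Toy example: a stand-in predictor that labels every pixel as a mine feature.
merged = predict_map(np.ones((256, 256)), lambda chip: np.ones((CHIP, CHIP)))
```

The 50% overlap ensures that the trimmed centers still tile the map interior, so discarding the unreliable 20-pixel borders does not leave gaps in the merged prediction.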
For validation, we relied on common binary classification accuracy assessment metrics, including precision, recall, specificity, the F1 score, and overall accuracy. Table 3 describes the terminology used to define these metrics. True positive (TP) samples are those that are in the positive class, in this case mine features, and are correctly mapped as positive, while false positives (FPs) are not in the positive class but are incorrectly mapped as positive. True negatives (TNs) are correctly mapped as negative, while false negatives (FNs) are mapped as negative when they are actually positive. Precision (Equation (1)) represents the proportion of the samples predicted to be positive that are correctly classified, and is equivalent to the user's accuracy (1 - commission error) for the positive class. Recall or sensitivity (Equation (2)) represents the proportion of the reference data for the positive class that is correctly classified, and is equivalent to the producer's accuracy (1 - omission error) for that class. The Dice coefficient and F1 score (Equation (3)) are equivalent and represent the harmonic mean of precision and recall, while specificity (Equation (4)) represents the proportion of negative reference samples that is correctly predicted, and is thus equivalent to the producer's accuracy for the negative class. Lastly, overall accuracy or binary accuracy (Equation (5)) represents the proportion of correctly classified features [86][87][88][89].

Precision = TP / (TP + FP) (1)

Recall or Sensitivity = TP / (TP + FN) (2)

Dice Coefficient or F1 Score = 2TP / (2TP + FP + FN) (3)

Specificity = TN / (TN + FP) (4)
Overall Accuracy or Binary Accuracy = (TP + TN) / (TP + TN + FP + FN) (5)

Model assessments were performed using the validation image chips in KY, OH, and VA, a total of 33,038 examples. Given that these chips did not incorporate all areas of the topographic maps, we also calculated assessment metrics for each validation topographic map. We argue that this is a more robust validation of the map products, as the entire extent of each map is assessed as opposed to just chips containing surface mine features or a subset of randomly selected absence-only chips. The model was validated on a total of 68 topographic maps that were not included in the KY Training or KY Testing sets.
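These metrics can be computed directly from the confusion-matrix counts. The sketch below is illustrative (the function name and counts are hypothetical, not from the study):

```python
def binary_metrics(tp, tn, fp, fn):
    """Assessment metrics from confusion-matrix counts (Equations (1)-(5))."""
    precision = tp / (tp + fp)                          # (1) user's accuracy
    recall = tp / (tp + fn)                             # (2) producer's accuracy
    f1 = 2 * precision * recall / (precision + recall)  # (3) = Dice coefficient
    specificity = tn / (tn + fp)                        # (4) negative-class producer's accuracy
    accuracy = (tp + tn) / (tp + tn + fp + fn)          # (5) overall accuracy
    return {"precision": precision, "recall": recall, "f1": f1,
            "specificity": specificity, "accuracy": accuracy}

# Hypothetical counts: the harmonic-mean F1 equals 2TP / (2TP + FP + FN) = 180/210.
m = binary_metrics(tp=90, tn=880, fp=10, fn=20)
assert abs(m["f1"] - 180 / 210) < 1e-9
```

The example also illustrates why overall accuracy alone is misleading under class imbalance: with abundant true negatives, accuracy stays high even when a substantial share of the positive class is missed.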

Sample Size Comparisons
In order to assess the impact of sample size on model performance and model generalization, we also trained models on subsets of the available KY Training dataset. This was accomplished by randomly selecting all chips assigned to a defined number of topographic maps. In the random selection, the probability of selection was weighted by the relative land area of mining in each topographic map, since it was assumed that analysts undertaking manual digitizing over a small set of available maps would select maps with many mining examples present. In order to assess model variability at different sample sizes, 5 random subsets of the training topographic maps and associated chips were selected for 2, 5, 10, and 15 topographic maps each, for a total of 20 models. All models were trained for 100 epochs using the same settings as the model that incorporated all KY Training image chips, including the same random image augmentations. For each sample, the final model was selected as the learned weights from the training epoch with the largest Dice coefficient for the KY Testing data. We also evaluated the results by predicting to the KY, OH, and VA validation datasets. We did not perform a validation at the topographic map scale due to the computational constraints and processing time required to predict all the validation topographic maps using all 20 models. Figure 6 summarizes the results of the training process for the KY Training and KY Testing datasets over the 100 training epochs. As the model iterated through the KY Training set and the weights were updated, the training accuracy continued to increase, while the pattern for the testing data was more variable or noisy. However, the overall trend was that accuracy for both the training and testing datasets stabilized quickly (i.e., after roughly 30 epochs).
Following the initial rapid improvement, small gains in the training data accuracy were observed while the testing data performance stopped improving. Generally, overfitting was not observed. Performance on the KY Testing data stabilized with a Dice coefficient of 0.960 to 0.970 (Figure 6a), and the best performance on the testing data occurred at epoch 75, which was selected as the final model for predicting to the three validation datasets. Also, precision and recall were similar, suggesting a balance between errors of commission and omission for the mining features class (Figure 6c,d).
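The area-weighted random selection of training topographic maps described in this subsection can be sketched as follows (illustrative Python; the quadrangle names and mined areas are hypothetical):

```python
import numpy as np

def sample_maps(map_names, mine_area, n, seed=0):
    """Randomly select n topographic maps without replacement, weighting the
    probability of selection by each map's relative mined land area."""
    weights = np.asarray(mine_area, dtype=float)
    rng = np.random.default_rng(seed)
    return list(rng.choice(map_names, size=n, replace=False,
                           p=weights / weights.sum()))

# Hypothetical quadrangle names and mined areas; maps with more mining
# are more likely to enter a training subset.
maps = ["quadA", "quadB", "quadC", "quadD", "quadE"]
area = [50.0, 5.0, 1.0, 20.0, 10.0]
subset = sample_maps(maps, area, n=2)
assert len(subset) == 2
```

Weighting by mined area mimics an analyst who, given time to digitize only a few maps, would prioritize maps rich in mining examples.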

Results
Some classification errors were observed, such as the inclusion of features that used similar thematic symbology; for example, some highways and small ponds were incorrectly detected.

Chip-Based Assessment
Table 4 summarizes the assessment results for all datasets as calculated using the image chips. Assessment results for the KY Training data are included for comparison; however, this assessment could be misleading due to overfitting. Generally, overall or binary accuracies were high, which we attribute to the large number of background pixels in the image chips. For the KY Testing and Validation data, precision and recall were generally well balanced, whereas precision was higher than recall for the OH and VA validation data, which suggests higher levels of omission error than commission error when the model is generalized to new geographic areas. This could potentially be attributed to differences in the representation or presentation of mine features in the new data. For the KY Testing and KY Validation data, the Dice coefficients or F1 scores were above 0.944. For the OH and VA validation chips, the Dice coefficients were 0.894 and 0.837, respectively. Similar to the results of Maxwell et al. [44], who investigated the mapping of valley fill faces from high spatial resolution digital elevation data, these results suggest some reduced performance when DL models are used to predict to new geographic extents; however, performance is still strong, suggesting that the model is valuable for predicting new data.
Table 5 summarizes the assessment results for the topographic map-based assessment, while Figure 8 provides example results for two topographic maps from each state validation dataset. These results generally agree with those obtained using the chip-based validation. The strongest performance was generally observed when predicting to new topographic maps within KY (i.e., the KY Testing and KY Validation datasets). Overall accuracy and specificity were generally high, suggesting that most of the background pixels were correctly classified. For the OH and VA validation datasets, recall was generally lower than precision, suggesting higher levels of omission error. For all datasets, precision was above 0.890, suggesting low commission error rates.
Figure 9 further summarizes the distribution of model performance for the testing and validation data. Generally, there was more variability in the predictions for the VA data than for the other datasets. However, all datasets had some outlying topographic maps where model performance was poor. Visual inspection suggests that these poor results were obtained for maps with a small percentage or land area of surface mining, where small amounts of omission or commission error carried large weight and thus greatly impacted the reported metrics.

Sample Size Comparisons
Figures 10 and 11 below describe the results of the sample size comparison. Figure 10 shows the changes in performance metrics for predicting the KY Training and KY Testing chips by epoch.
Generally, reducing the number of topographic maps used to train the model decreased model performance and increased variability between the different random subsets, as represented by the ribbons in the graphs, which show the range at each epoch. When chips from only two topographic maps were used, the Dice coefficient, precision, and recall (Figure 10a-c) varied greatly at the same training epoch, even after many training iterations. The model also generalized poorly to the KY Testing data; for example, the mean Dice coefficient for the five models never rose above 0.800 for the KY Testing set across all epochs but stabilized above 0.950 for the KY Training data. When the number of topographic maps was increased to 10 or 15, variability between the models decreased, overfitting was reduced (based on a comparison between the training and testing results), and performance approached that obtained when using the entire training dataset of 84 maps and associated chips. This generally suggests that quality results can be obtained with a smaller dataset, which reduces the manual labor required to train and validate DL models and adds practicality to incorporating DL semantic segmentation into production-level data generation over large spatial extents and/or large volumes of data.

Adding to the results summarized in Figure 10, Figure 11 shows the results when the models trained with reduced sample sizes were used to predict the KY, OH, and VA validation sets. Similar to the results for the model trained with all the training samples, these models generalized better to the KY Validation data than to the OH and VA sets. For all sample sizes, performance on the VA data was lowest. Variability between the different random subsets tended to decrease as the number of topographic maps used in the training process increased. Models using 15 topographic maps approached the performance levels reached with the full training dataset, in which chips were derived from 84 topographic maps. For example, the average Dice coefficients for the 15-map models were 0.940, 0.875, and 0.813 for the KY, OH, and VA validation sets, respectively; for comparison, the model using all the training samples yielded Dice coefficients of 0.949, 0.894, and 0.837 for the same validation sets. This generally suggests that, although more samples can improve the results, a reduced sample size can be adequate, which would minimize the manual labor required to generate and validate models.

Discussion
Based on an evaluation on multiple datasets, our results suggest that DL semantic segmentation can be successfully applied to historic topographic maps to extract the extent of mining-related landscape disturbance. Although model performance decreased when generalized to new geographic extents, performance was still strong, with Dice coefficients above 0.760. Practically, DL semantic segmentation can offer a means to reduce workloads and manual digitizing time when deriving vector geospatial data from historic, georeferenced topographic maps. This is especially true given that a reduced sample size can still yield accurate results, as demonstrated by our sample size reduction experimentation. Further, errors when generalizing to new data and geographic extents were dominated by omission error, as measured with recall, and precision remained above 0.890 for all datasets, even when the number of training topographic maps was reduced to 15. In general, we recommend that large-area mapping projects take advantage of DL methods to decrease manual labor and shorten project timelines. Results from DL models can then be further refined through manual interpretation by digitizing missing features, removing false positives, and refining feature outlines, a process that requires much less time than manually digitizing all features from scratch. Some components would still need to rely on manual interpretation, such as labeling the feature type associated with the digitized polygons. Given that DL semantic segmentation and its application to geospatial data have not yet matured, there is still a need to integrate these methods into data creation workflows and routine and operational mapping tasks. It should be noted that DL methods, including UNet, have been integrated into commercial geospatial software tools, such as ArcGIS Pro [90], which further simplifies their use in applied mapping tasks.
Generally, this research supports the value of topographic maps as a source of historic landscape data for assessing change and reinforces prior studies' findings, such as the work of García et al. [22], who explored changes in river corridors, and Uhl et al. [25], who mapped human settlement footprints. Reinforcing the findings of Uhl et al. [25] specifically, our study highlights the value of large, quality training datasets for training DL semantic segmentation algorithms to recognize features in topographic maps. Lastly, our study reinforces the documented strong performance of the UNet semantic segmentation method for extracting features and classifying pixels from a wide variety of data sources to support varying mapping tasks (for example, [54,55,[62][63][64][65][66][67]69,70,[91][92][93]). Such techniques, including future advancements and modifications, may eventually replace traditional ML methods, such as random forests (RF) and support vector machines (SVM), as operational standards in the field [46,[52][53][54][55].
This study has some notable limitations. First, it was not possible to fully assess the impact of hyperparameter selection and UNet architecture due to the required computational time and cost, an issue which has been raised in prior DL studies (for example, [44]). The Dice coefficient was a valuable loss measure in this study due to class imbalance and because overall accuracy was generally high and did not adequately quantify errors. The use of the Dice coefficient, precision, and recall allowed for a more detailed quantification of omission and commission errors. Our sample size experimentation was informative but could be expanded to include more replicates and sample sizes. However, training many models is computationally demanding; for example, the 20 models used in this study to assess sample size impacts took over a week to train on a Microsoft Azure virtual machine with GPU support. In future research, we plan to further assess the impact of training data size and the number of available manually interpreted topographic maps, and to use transfer learning to refine models trained in one area for use in new geographic extents where the presentation of surface mining may be different. We also plan to combine surface mining extents extracted from historic topographic maps with recent maps of mine disturbance, such as those generated by Pericak et al. [34], to more fully quantify the cumulative landscape alterations and impacts of surface coal mining in Appalachia.
Future research should investigate the mapping of additional features from historic topographic maps, such as the extent of forests and wetlands. DL could also be applied to other cartographic representations that characterize historic landscapes, represent the cumulative efforts of many professionals over many decades, and are not readily available for use in spatial analysis and change studies. For example, features could be extracted from historic reference and geologic maps. As in the lead author's prior study [44], we argue that there is a need to develop multiple and varied benchmark datasets to support DL semantic segmentation research, including those derived from image datasets and other geospatial data, such as digital terrain data, historic topographic maps, and other cartographic presentations. Such datasets will be of great value in comparative studies and for further development and refinement of algorithms. Comparisons of algorithms were not a focus of this study. Since DL semantic and instance segmentation algorithm development and refinement are still actively being studied, there will be a continued need to investigate these new and refined methods for a wide variety of mapping tasks. For example, high-resolution networks (HRNets) [94][95][96][97] have recently been shown to be of value for dealing with issues of intra-class heterogeneity and inter-class homogeneity. Further, gated shape CNNs have been shown to be useful for differentiating features based on unique shape characteristics [98,99]. This further highlights the need for the development of a wide variety of benchmark datasets. 
Comparison of DL algorithms is complicated by processing time and computational costs, which makes it difficult to consistently and systematically compare algorithms, assess the impact of algorithm settings and architecture, experiment with reductions in sample size and generalization to new data and/or geographic extents, and incorporate multiple datasets into studies [44,54,55]. Since this study relied on a modified UNet algorithm and explored a single mapping task and input dataset, our findings associated with sample size and model generalization may not translate to other algorithms and/or classification problems, which further highlights the need for additional research.

Conclusions
This study highlights the value of DL semantic segmentation methods for extracting data from historic topographic maps, which offer a valuable record of historic landscape conditions that can be combined with more recent data, such as those derived from moderate spatial resolution satellite imagery, for extending the LCLU change record and more completely quantifying anthropogenic landscape alterations. This is especially true for disturbances with a long and complex historic record that pre-date the era of satellite-based Earth observation missions. We further documented model generalization to new data and geographic extents and the impact of reduced sample size. Overall, this research suggests that currently available semantic segmentation methods are applicable and generalizable, with some reduced performance. Further, large sample sizes may not be necessary to train models that generalize well, which can greatly reduce manual labor requirements.
DL techniques should be applied to these historic datasets to further extend the land change record. For programs that are attempting to digitize such features, such as the USGS GGGSC, augmenting processes with DL output will allow for more efficient production, which is very important given the volume of data that needs to be interpreted. To obtain accurate results, operational workflows must be developed that allow for training data generation, model development, prediction to new geographic extents and data, and manual digitizing to improve results where necessary. The data used in this study are made available through the West Virginia View website (http://www.wvview. org/data_services.html) while the code can be obtained from the associated GitHub repository (https://github.com/maxwell-geospatial/topoDL). We hope that these data and resources will aid in future DL research.