Landscape Similarity Analysis Using Texture Encoded Deep-Learning Features on Unclassified Remote Sensing Imagery

Convolutional neural networks (CNNs) are known for their ability to learn shape and texture descriptors useful for object detection, pattern recognition, and classification problems. Deeper layer filters of CNN generally learn global image information vital for whole-scene or object discrimination. In landscape pattern comparison, however, dense localized information encoded in shallow layers can contain discriminative information for characterizing changes across image local regions but are often lost in the deeper and non-spatial fully connected layers. Such localized features hold potential for identifying, as well as characterizing, process–pattern change across space and time. In this paper, we propose a simple yet effective texture-based CNN (Tex-CNN) via a feature concatenation framework which results in capturing and learning texture descriptors. The traditional CNN architecture was adopted as a baseline for assessing the performance of Tex-CNN. We utilized 75% and 25% of the image data for model training and validation, respectively. To test the models’ generalization, we used a separate set of imagery from the Aerial Imagery Dataset (AID) and Sentinel-2 for model development and independent validation. The classical CNN and the Tex-CNN classification accuracies in the AID were 91.67% and 96.33%, respectively. Tex-CNN accuracy was either on par with or outcompeted state-of-the-art methods. Independent validation on Sentinel-2 data had good performance for most scene types but had difficulty discriminating farm scenes, likely due to geometric generalization of discriminative features at the coarser scale. In both datasets, the Tex-CNN outperformed the classical CNN architecture. Using the Tex-CNN, gradient-based spatial attention maps (feature maps) which contain discriminative pattern information are extracted and subsequently employed for mapping landscape similarity. To enhance the discriminative capacity of the feature maps, we further perform spatial filtering, using PCA and select eigen maps with the top eigen values. We show that CNN feature maps provide descriptors capable of characterizing and quantifying landscape (dis)similarity. Using the feature maps histogram of oriented gradient vectors and computing their Earth Movers Distances, our method effectively identified similar landscape types with over 60% of target-reference scene comparisons showing smaller Earth Movers Distance (EMD) (e.g., 0.01), while different landscape types tended to show large EMD (e.g., 0.05) in the benchmark AID. We hope this proposal will inspire further research into the use of CNN layer feature maps in landscape similarity assessment, as well as in change detection.


Introduction
Earth system and environmental data have become abundant via a variety of sources ranging from model simulation data, citizen science, amateur drones, airborne sensors, commercial satellites, and easily accessible data such as Landsat [1,2]. These data are available at unprecedented spatial and temporal resolutions and are widely used for understanding processes of environmental change across time and space. Given the rapidity of human-induced landscape disturbances, there is increasing interest in using Remote Sens. 2021, 13,492 3 of 25 robust to scale variation, and thus reduce misclassification rates, Liu et al. [25] proposed a method in which randomly cropped image patches are used for model development. Gong et al. [26] also introduced a saliency-based feature extraction framework with antinoise transfer network and found the approach to yield high classification accuracy on benchmark datasets. CNNs with feature concatenation or fusion modules are simple but effective feature extraction frameworks that have been adopted to combine local and global image features for improving the performance of many scene classification and other pattern recognition tasks [27][28][29]. Ye et al. [30] presented a multi-stage model that extracts and fuses low-, middle-, and high-level features, and obtained 95% accuracy on the Aerial Image Dataset (AID). Kang et al. [31] also developed a network that captures contextual information via the fusion of deep and shallow features to improve ship-detection accuracy. A framework with dilated convolution and skip connections was found to learn multiresolution discriminative features for scene classification [32]. Similarly, Gao et al. [33] proposed a network in which feature maps generated from input images are passed on to a concatenating layer, forming a combined feature map with richer discriminative information. The authors concluded that their method significantly improved hyperspectral image classification. In a related study, Huang and Xu [34] used weighted concatenation to combine features across all CNN layers, yielding overall accuracy of 95% in AID. Similarly, Zeng et al. [35] developed a two-branch CNN in which local and global features are independently extracted and concatenated. With extensive experiments, the authors demonstrated that feature concatenation resulted in over 90% accuracy for most scene classes in AID.
Despite the state-of-the-art performance of current CNN architectures, deep learning algorithms are generally perceived as "black-boxes" in both computer vision and across other domains; consequently, there have been intensifying calls to interrogate and reveal the inner workings of deep learning models in disciplines such as geography [36]. Visualizing spatial attention maps (i.e., feature maps) is a fairly simple method of exploring how CNNs learn and make decisions on an input image. The approach may be gradient-based and involve computing network output changes with respect to input [37], or utilizing a deconvolution network that projects image features over a plane [38]. Zhou et al. [39] also proposed converting the linear decision (regression) layer into a convolutional layer for generating class-based attention maps. To improve gradient-based feature map quality, guided backpropagation has also been introduced [40]. As these approaches do not always produce class-specific feature maps [41], Selvaraju et al. [42] proposed Grad-CAM, which integrates guided backpropagation and class activation maps, and thus yielding classdiscriminative spatial attention maps. In a related research, Omeiza et al. [43] proposed Smooth Grad-CAM++ to improve the spatial resolution and localization of patterns in feature maps. Class-selective relevance mapping has also been proposed to derive feature maps that contain the most discriminative regions of interest in medical images [44].
In this study, we focused on gradient-based convolutional feature maps. Gradient with respect to an input image, is a sensitivity map measuring how changes at an input pixel spatial location affect changes in CNN model predictions [42]. Given an input image, if small changes to its pixels correspond to a large network output change, then it follows that such pixels encode "significant" spatial information. The above novel approaches to visualize and interpret CNN feature maps have been used extensively to evaluate and improve models' performance. However, we consider such CNN features to have potential for image similarity matching and retrieval. For example, a global representation vector extracted from a CNN has been shown to improve object-image retrieval on Oxford and Paris datasets [45,46]. As demonstrated in recent studies, CNNs with feature concatenation framework incorporate fine-grain textural details which encode relatively significant discriminative patterns [27,28]. Traditional CNN architectures, on the other hand, tend to focus largely on processing input images and feature tensors from individual layers. Thus, traditional CNNs have the tendency to discard tangible proportions of original image texture, as well as CNN layer features that contain discriminative information. To this Remote Sens. 2021, 13, 492 4 of 25 end, we propose training and deploying a texture-encoded CNN model (Tex-CNN) to evaluate landscape similarity. Our Tex-CNN is a simple, yet computationally efficient feature concatenation architecture for generating discriminative feature maps. In order to compare our proposal with existing techniques, a classical CNN was trained. Using the trained Tex-CNN and the classical CNN, we derived feature maps, to compare different or repeating spatial patterns across space. Given the discriminative learning behavior of convolutional filters, feature maps are sometimes sparse; to reduce the CNN feature maps to a compact representation that best encodes patterns in a given landscape, principal component analysis (PCA) was further performed, and feature maps with the highest eigen values were selected. The histogram of oriented gradients (HoG) vector was then extracted from each map for comparison using the Earth Movers Distance (EMD) algorithm.
The contribution of this study is, therefore, two-fold. (1) A gradient-based convolutional feature map approach to landscape similarity analysis was proposed. Using gradientbased features, the proposed landscape similarity assessment utilizes significant spatial patterns in a query and a candidate image for comparison. (2) A landscape similarity metric capable of detecting within-and between-landscape types was developed. The proposed metric effectively discriminates farm landscapes from mountainous, as well as forested, landscapes. The paper is arranged as follows: We first illuminate the importance of spatial feature maps in landscape comparison; next, the methodological pipeline is presented, followed by results, discussion, and conclusion.

Related Work
Prior to the emergence of state-of-the-art of CNNs capable of detecting and classifying objects and patterns, image texture processing was one of the earliest applications in which CNNs were employed to extract discriminative local features [47,48].

Representing Patterns in CNN Feature Maps
Convolutional feature maps can be thought of as spatial activation features encoding discriminative regions within a given input image [49]. A feature map can also be viewed as detection scores resulting from the application of a filter over spatial locations in a 2D image; the activation value obtained at the i-th location quantifies the importance of the pixel at that location [50]. Such locations may be linked, at least conceptually, to "landscape features of interest" or those areas of the landscape that are discriminative of the landscape scene label. The potential of a convolutional-feature-based approach in urban landscape change detection was presented in El Amin [51]. The authors demonstrated that CNN features can perform higher than "hand-crated features" and other state-of-the-art techniques. In related research, Albert [52] showed that features extracted from CNNs trained discriminatively on urban imagery effectively compare neighborhood similarity across European cities.
In landscape research where local-to-global changes or pattern similarity are sometimes of interest, CNN maps can be helpful. Feature maps represent local response regions of filters and thus encapsulate valuable pattern information [41]. These local regions also encode information pertaining to the underlying pattern-generating process. Feature maps from convolutional layers represent local descriptors of particular image regions which can be aggregated into global descriptors for image retrieval [53]. An image-retrieval framework is also closely related to the landscape-pattern comparison problem. For instance, CNN activations containing pronounced spatial information can be utilized for detecting repeated patterns [21]. The challenge to detect repetitive spatial patterns is similar to landscape similarity analysis problem. It has been illustrated that convolutional layer activations are local region descriptors and outperform many state-of-the-art descriptors [48,49]; thus, if these feature maps are well-pooled, a compact representation of a given landscape can be derived. Additionally, Zagoruyko and Komodakis [41] have shown that feature maps represent "knowledge learned" by a given network about the underlying pattern and can be transferred to other networks, to improve pattern detection. Furthermore, classical machine-learning algorithms for pattern detection or classification, such as Random Forest, Support Vector Machine, and Maximum Likelihood being employed in landscape research, can be coupled with deep feature extraction models to boost performance [54]. For example, it has been shown that feeding features from CNNs to other models improve results [29]. We therefore postulate that CNN-feature-based frameworks hold the potential to enable the detection and quantification of spatially patterned processes.

CNN-Feature-Based Image Retrieval
Image retrieval is an active research area in this era of "big data", where the objective is to find a set of images that are the most similar to a given query image. Content-based image retrieval (CBIR) is a widely applied technique for retrieving images in databases. In CBIR, low-level image descriptors (e.g., color, texture, and structure) are extracted to form an image representation; a suitable measure is then selected to estimate similarity between images. Several algorithms have been proposed for an improved CBIR. For example, Unar et al. [55] combine both visual and textual features for image retrieval. Zhang et al. [43] also developed an algorithm that segments an image into salient, nonsalient, and shadowed regions, in order to extract spatially relevant information. Earth observation data now available in various archives could provide a wealth of information through effective search and retrieval techniques [6].
Recent research has shifted towards the use of features extracted from deep convolutional layers of CNNs for image matching and retrieval [46,56]. The use of deep convolutional features for image retrieval is demonstrated in a study conducted by Babenko et al. [53]. Chen et al. [45] propose region-of-interest deep convolutional representation for image retrieval. Their approach first identifies regions of interest and proceeds to extract features from the fully connected layer. Shi and Qian [46] also adapted the region-ofinterest-based approach called strong-response-stack-contribution, by exploring spatial and channel contribution, to generate a more compact global representation vector for an object-based image retrieval challenge. Cao et al. [50] applied adaptive matching by splitting feature maps and later spatially aggregating them into regions of interest for comparison. Liu et al. [57] proposed extracting and pooling subarrays of feature maps as local descriptors for visual classification task and found that the method outperforms features from fully connected layers. The aforementioned applications hold potential for designing resource management and decision-making applications in geography.

Models' Architecture
In the context of landscape similarity mapping, global shape information present in fully connected layers is of less significance, as landscape patterns often lack unique or stable geometry across space. Given that lower layers capture local patterns [16], we concatenated multi-layer features, to learn a discriminative representation of the data-generating process. In feature fusion, feature maps from three convolutional layers (i.e., conv1, conv2, and conv3) are concatenated followed by flattening into feature vectors to yield a dense layer (denoted FC1). One possible approach to improving CNN features' discriminative potential is to apply attention pooling strategies that takes the weighted sum of different feature maps instead of concatenating features, as this technique exponentially increases model parameters as well. However, we adopted feature concatenation, as it has been proven to enable the extraction of multiscale features, potentially obviating the need for multiscale inputs during model development [58]. Moreover, attention strategies are effective for object recognition tasks but may not tangibly improve landscape pattern discrimination.
Work similar to our approach is the Andrearczyk and Whelan [59] feature concatenation framework. Figure 1 illustrates the architecture of a classical CNN, while Figure 2 depicts our model architecture. For model design, we build on the VGG16 model architecture and filter constellation [60]. Thus, 32, 64, and 128 Filters are used in the first, second, and third convolutional layers of the classical CNN and the Tex-CNN models. The proposed model architecture is intended to be simple with minimum parameters as possible for field deployment. A model with fewer than three convolutional layers does not only limit the number of feature maps available for exploration but may not be able to shatter the training data as well. Given the data at hand, models with over four convolutional layers could potentially pose overfitting challenges. been proven to enable the extraction of multiscale features, potentially obviating the need for multiscale inputs during model development [58]. Moreover, attention strategies are effective for object recognition tasks but may not tangibly improve landscape pattern discrimination.
Work similar to our approach is the Andrearczyk and Whelan [59] feature concatenation framework. Figure 1 illustrates the architecture of a classical CNN, while Figure 2 depicts our model architecture. For model design, we build on the VGG16 model architecture and filter constellation [60]. Thus, 32, 64, and 128 Filters are used in the first, second, and third convolutional layers of the classical CNN and the Tex-CNN models. The proposed model architecture is intended to be simple with minimum parameters as possible for field deployment. A model with fewer than three convolutional layers does not only limit the number of feature maps available for exploration but may not be able to shatter the training data as well. Given the data at hand, models with over four convolutional layers could potentially pose overfitting challenges.   been proven to enable the extraction of multiscale features, potentially obviating the need for multiscale inputs during model development [58]. Moreover, attention strategies are effective for object recognition tasks but may not tangibly improve landscape pattern discrimination.
Work similar to our approach is the Andrearczyk and Whelan [59] feature concatenation framework. Figure 1 illustrates the architecture of a classical CNN, while Figure 2 depicts our model architecture. For model design, we build on the VGG16 model architecture and filter constellation [60]. Thus, 32, 64, and 128 Filters are used in the first, second, and third convolutional layers of the classical CNN and the Tex-CNN models. The proposed model architecture is intended to be simple with minimum parameters as possible for field deployment. A model with fewer than three convolutional layers does not only limit the number of feature maps available for exploration but may not be able to shatter the training data as well. Given the data at hand, models with over four convolutional layers could potentially pose overfitting challenges.

Model Parameterization and Training
We opted for training our models from scratch, as this approach gives flexibility over model architecture. Although there is potential for data limitation, as well as over-fitting, in this framework [61], the approach facilitates feature maps comparison, as it ensures that features are the direct result of filters learned on data presented to models, compared to using pretrained networks in which filters learned from an entirely different domain than the task at hand. Given that the input image size is large enough (i.e., 225 × 225), we selected 7 × 7 convolutional kernels and used a fixed filter size with stride 1 throughout the convolutional layers. Filter receptive field size changes with layer depth and could result in profound differences in feature spatial resolution between successive layers. In the pooling layers, 2 × 2 max pooling with stride 2 is applied. The receptive field size at the third convolutional layer, therefore, becomes 46. We utilized 75% of the sample data for training and 25% for validation. To mitigate potential overfitting, 25% drop-out is used in convolutional layers, while 50% is applied to the FC1 layer [62]. The rectified linear unit (ReLU) is used as the activation function. Multiclass cross-entropy loss function is employed, and the models are trained for 30 iterations with Adam as the optimizer. Adam adaptively computes and updates gradients and is invariant to diagonal scaling of gradients [63]. The Keras-Tensorflow backend was used for building and supporting computations required to train the CNN models on a GPU with a NVIDIA-supported graphics card. Table 1 summarizes the models' architecture and parameters. Table 1. A summary of models' architecture and parameters.

Layer Name Convolution Max-Pooling Activation Drop-Out
Conv-1

Application Context: Landscape Comparison
Unclassified imagery, which is now ubiquitous due to the availability of sensors of varying types, offers the potential for landscape similarity queries. While land-cover classification in which pixels are labeled (classified) or objects are segmented and characterized is a predominant use of aerial and satellite imagery [64], in this modeling framework, we focus on characterizing whole scenes or landscapes. An implementation of this would be helpful for automating image retrieval and potentially provide a basis for mixed scenes and/or novel land-scene categories and/or descriptors. A conceptual representation for comparing unclassified images (aka landscapes/scenes) is depicted in Figure 3, using three landscapes/scenes denoted as X, Y, and Z, but the representation is expandable to multiple landscape types. Given an image, the feature map will be extracted for comparison, using EMD. EMD(X, X'), EMD(Y, Y'), and EMD(Z, Z') compute within-landscape similarity, while EMD(X, Y), EMD(Y, Z), and EMD(Z, X) estimate between-landscape similarity.
Benchmark datasets have long been used in computer vision for model development, due to the scarcity of labeled data, and the laborious processes required for generating such datasets, yet they remain relatively rare in geospatial research. The aerial imagery dataset is composed of high-resolution benchmark data recommended for training scene classification models [65]. The AID contains multi-resolution images; the pixel spatial resolution varies from about half a meter to eight meters, providing a suitable dataset for training classical CNN and Tex-CNN models. A common protocol in computer vision is to split a given dataset into training, validation, and test samples. This may sometimes result in high-accuracy reports resulting from overfitting. Owing to this caveat, and the need to find models capable of generalizing over a range of datasets for field application, we propose carrying out further validation by using a dataset from an entirely different sensor. As such, we employed Sentinel data to evaluate the generalizability of the developed models. Table 2   Benchmark datasets have long been used in computer vision for model development, due to the scarcity of labeled data, and the laborious processes required for generating such datasets, yet they remain relatively rare in geospatial research. The aerial imagery dataset is composed of high-resolution benchmark data recommended for training scene classification models [65]. The AID contains multi-resolution images; the pixel spatial resolution varies from about half a meter to eight meters, providing a suitable dataset for training classical CNN and Tex-CNN models. A common protocol in computer vision is to split a given dataset into training, validation, and test samples. This may sometimes result in high-accuracy reports resulting from overfitting. Owing to this caveat, and the need to find models capable of generalizing over a range of datasets for field application, we propose carrying out further validation by using a dataset from an entirely different sensor. As such, we employed Sentinel data to evaluate the generalizability of the developed models. Table 2 describes the datasets utilized in this study.

Data Augmentation
The AID consists of diverse landscape types; however, considering only three landscapes reduces the sample size. CNNs are "data hungry" models; thus, training such models from scratch by using fewer samples and classes is likely to pose data limitation issues and overfitting. We therefore attempt to circumvent this challenge via the application of data augmentation. To that end, we employ the Keras image data generator API to augment our training dataset. Given that the AID is multiresolution-image pixel sizes vary from about half a meter to eight meters, scale representation challenges are inherently reduced such that scale transformations may not make substantial difference following data augmentation. Bearing this in mind, horizontal flips and rotations (i.e., 45-180 degrees) Remote Sens. 2021, 13, 492 9 of 25 were applied to generate enough training data. Three-thousand samples were generated for each landscape type, and thus yielding 9000 samples for the three landscape types: farm, mountain, and forest. A Sentinel dataset was used to test the potential generalizability of the method on medium-resolution satellite imagery. Figure 4 illustrates samples of AID landscapes used in our experiments.
scapes reduces the sample size. CNNs are "data hungry" models; thus, training such models from scratch by using fewer samples and classes is likely to pose data limitation issues and overfitting. We therefore attempt to circumvent this challenge via the application of data augmentation. To that end, we employ the Keras image data generator API to augment our training dataset. Given that the AID is multiresolution-image pixel sizes vary from about half a meter to eight meters, scale representation challenges are inherently reduced such that scale transformations may not make substantial difference following data augmentation. Bearing this in mind, horizontal flips and rotations (i.e., 45-180 degrees) were applied to generate enough training data. Three-thousand samples were generated for each landscape type, and thus yielding 9000 samples for the three landscape types: farm, mountain, and forest. A Sentinel dataset was used to test the potential generalizability of the method on medium-resolution satellite imagery. Figure 4 illustrates samples of AID landscapes used in our experiments.

Activation/Feature Maps Derivation
Given a trained CNN model, gradient-based activation maps can be computed to allow for visualization of localized regions in an image that contribute significantly to a given output pattern. Using our trained classification model, activation maps are derived via backpropagation of filter responses to input pixel intensities [42]. ReLU is employed to constrain the backpropagation process to propagate only positive pixel values that activate filters; these pixel positions contain the highest weight and are therefore said to encode "significant patterns" or represent the signatures of the underlying pattern-generating process.
The gradient-based class activation map proposed by Selvaraju et al. [42] is derived as follows: Let Y c denote the score for a particular landscape scene. The gradient, with respect to Y c , is formulated as ∂Y c The weight term α c k captures the "significance" of feature map k for a target landscape type/scene. ReLU is applied to the weighted sum of feature maps, yielding heat-maps whose local regions highlight the most discriminant patterns in images. The resultant CNN activation maps pinpoint locations where the model focuses its attention on, since such locations contain significant spatial patterns. Therefore, activation maps can be referred to as "saliency maps" or "spatial attention maps".

Extracting HoG Vector from Feature Maps
For each landscape type, 50 scenes across different locations were selected for feature map extraction. We note that, since the number of filters in the second convolutional layer from which the feature maps are computed is 64, each image correspondingly yields 64 feature maps. We perform spatial filtering by using PCA to reduce the number of feature maps per image. PCA reduces feature map dimensionality, yielding a more compact image descriptor [53]. Such a step is inevitable when CNN feature maps are being compared; due to discriminative learning, not all filters respond to input images or pixels, and, as such, certain feature maps may contain no features/patterns where a filter is not activated by an input image [66][67][68]. Using PCA, a feature map (i.e., eigen map) that has the highest eigen value is selected. Next, the HoG vector is extracted from each landscape type feature map. HoG has been shown to extract effective image descriptors for pattern recognition tasks. For example, human face recognition across standard datasets is found to improve, using HoG descriptors [69]. In related research, different plant species were effectively recognized from leaf patterns, using HoG descriptors [70]. Setting the spatial parameters (i.e., cell size and cells per block) for extracting HoG features, however, requires a careful approach. In our implementation, the HoG vector is extracted by considering cell size 24 × 24 dimension and cell-per-block = 2 × 2 of each feature map. We deploy the EMD, a multivariate histogram distance measure, proposed by Rubner et al. [71], to compare the resultant HoG vector representing reference and test feature maps.

Formulating the Feature Map Comparison Metric
In the literature, there are a variety of pattern similarity comparison metrics, yet it is challenging to find robust and generic metrics to rely on when it comes to landscape similarity comparison. In this section, we illustrate how our convolutional feature map comparison metric was derived. Figure 5 is a depiction of our proposed convolutional feature-based landscape similarity comparison.

Formulating the Feature Map Comparison Metric
In the literature, there are a variety of pattern similarity comparison metrics, yet it is challenging to find robust and generic metrics to rely on when it comes to landscape similarity comparison. In this section, we illustrate how our convolutional feature map comparison metric was derived. Figure 5 is a depiction of our proposed convolutional featurebased landscape similarity comparison. Equations (3) and (4) illustrate our formulation and computation of within-and between-landscape similarities. Equations (3) and (4) illustrate our formulation and computation of within-and between-landscape similarities.
W Lsim = EMD HoG L L1 type, locX , HoG(L L1, locY ) (3) where L L1 type and L L2 type represent different landscape categories from different spatial locations. W Lsim and BLsim denote within-and between-landscape type comparison, respectively. For W Lsim, we compare similar landscapes; example L L1 type , but from different locations (e.g., locX vs. locY). For example, to compare farm landscapes, locX will represent a reference landscape, while locY (1,2,3..., n) denotes farm landscapes (e.g., 225 × 225 grids) from other locations of interest. Landscapes whose spatial extents are large could be tiled into spatial grids of equivalent dimension as the model input size for comparison.
BLsim involves a comparison of two disparate landscape types (e.g., forest vs. farm). HoG( . ) computes HoG feature vector, given an input feature map, while EMD( . ) estimates HoG feature vector similarity based on the EMD between landscapes. To test the proposed metric, 50 images from each landscape type were taken from the AID and randomly split into two subsets, thus yielding 25 images per subset, which are named G1 and G2 (e.g., farmG1 and farmG2 each contains 25 images belonging to farm landscapes). Using the metric, a compact distribution based on EMD is computed for within-and between-landscape, by comparing each scene type; for example, in farmG1, a selected scene is compared with all other scenes in farmG2. This permutation schema is repeated for all the 25 scenes in farmG1.

Landscape Type Prediction Models
Figure 6a-f depicts classification accuracies for landscape types on AID and Sentinel data. The confusion matrices are computed by deploying the models on the test images from AID and Sentinel datasets (i.e., 900 images for AID and 600 images for Sentinel-2). In Figure 6a,b, the Tex-CNN and the classical CNN classification accuracy reports are similar except for mountainous scenes where Tex-CNN has higher classification accuracy. In Figure 6c,d, the first row of the confusion matrix shows that over 90% of the farm landscapes are misclassified as forest in Sentinel dataset. About 70% of the mountain landscapes are correctly classified by the Tex-CNN, while the classical CNN achieves only 25% classification accuracy. Figure 6e,f shows classification accuracies after finetuning the models with a combination of AID and Sentinel data. It can be observed that misclassification rates for farm landscapes have been substantially reduced. Table 3 compares overall accuracy reports, as well as per-landscape type classification accuracies for reference state-of-the-art techniques and Tex-CNN on the AID. It can be seen that the Tex-CNN is highly competitive and, in some instances, outperforms other methods. Table 3. Overall accuracy (OA) and selected per-scene class accuracy for reference and our proposed Tex-CNN on the AID.

Exploring CNN Layer Features Suitability for Landscape Comparison
Given that CNN layers process inputs hierarchically, feature maps spatial resolution become coarser with layer depth: Earlier layers contain finer resolution features, while deeper layer representation gives coarser features. We conducted visual assessment of feature map quality, as well as the potential utilization of the second-and third-layer feature maps. Layer-one features were not included in this analysis, as gradient-based features cannot be computed by using input image data as the penultimate layer. Figure 7 depicts feature maps with the highest eigen values extracted from Tex-CNN. The feature maps are the result of applying PCA to layer two-and three-feature tensors. Notice how the spatial resolution changes across the layers. While layer-two eigen maps are fine-grained, with distinct patterns (e.g., farm boundaries, tree clusters), this pattern is not clearly interpretable in layer-three eigen maps. In Figure 7, row (a), layer-two shows high-resolution features with conspicuous farm boundaries. Contrarily, layer-three map depicts low-resolution features; the boundaries of individual parcels are blurred out. In Figure 7, row (b), layertwo shows fine-grained clusters of trees; layer-three, on the other hand, depicts coarse scale patterns which are not immediately recognizable as forest.

Exploring CNN Layer Features Suitability for Landscape Comparison
Given that CNN layers process inputs hierarchically, feature maps spatial resolution become coarser with layer depth: Earlier layers contain finer resolution features, while deeper layer representation gives coarser features. We conducted visual assessment of feature map quality, as well as the potential utilization of the second-and third-layer feature maps. Layer-one features were not included in this analysis, as gradient-based features cannot be computed by using input image data as the penultimate layer. Figure 7 depicts feature maps with the highest eigen values extracted from Tex-CNN. The feature maps are the result of applying PCA to layer two-and three-feature tensors. Notice how the spatial resolution changes across the layers. While layer-two eigen maps are finegrained, with distinct patterns (e.g., farm boundaries, tree clusters), this pattern is not clearly interpretable in layer-three eigen maps. In Figure 7, row (a), layer-two shows highresolution features with conspicuous farm boundaries. Contrarily, layer-three map depicts low-resolution features; the boundaries of individual parcels are blurred out. In Figure 7, row (b), layer-two shows fine-grained clusters of trees; layer-three, on the other hand, depicts coarse scale patterns which are not immediately recognizable as forest.

Mountainous Terrains
We hypothesized that feature maps from within-landscape types would have lower EMD values, while those originating from disparate classes would have higher EMD values. We first conducted a Kolmogorov-Smirnov test to ascertain the validity of this hypothesis. As expected, it turns out that between-class feature distributions were statistically significantly different (p < 0.001). A sample of mountain landscapes from the AID and Sentinel datasets is depicted in Figure 8. Feature map regions that are highlighted in warmer colors represent the most significant discriminative patterns learned by the three filters; notice that most of these areas are predominantly less vegetated. Regions with cooler (blue) colors are found to be less important, according to the model's weighting decision. Notice also that the filters sometimes perceive similar regions differently in terms of significant patterns-pixels that are found to be significant by one filter may be seen to have less weight by another filter, due to the discriminative learning behavior of CNNs.

Mountainous Terrains
We hypothesized that feature maps from within-landscape types would have lower EMD values, while those originating from disparate classes would have higher EMD values. We first conducted a Kolmogorov-Smirnov test to ascertain the validity of this hypothesis. As expected, it turns out that between-class feature distributions were statistically significantly different ( < 0.001). A sample of mountain landscapes from the AID and Sentinel datasets is depicted in Figure 8. Feature map regions that are highlighted in warmer colors represent the most significant discriminative patterns learned by the three filters; notice that most of these areas are predominantly less vegetated. Regions with cooler (blue) colors are found to be less important, according to the model's weighting decision. Notice also that the filters sometimes perceive similar regions differently in terms of significant patterns-pixels that are found to be significant by one filter may be seen to have less weight by another filter, due to the discriminative learning behavior of CNNs.  Figure 9 shows the results for comparing mountainous landscapes and farm landscape types. It can be seen from Figure 9a,d that feature maps from similar landscapes display smaller distances, and hence their distribution falls to the left, characterized by smaller EMD. Over 60% of features in Wclass_mount of Figure 9a,b show EMD score of 0.01, while more than 50% of between class comparison yields EMD values higher than 0.05. Moreover, it can be observed that aside from shape differences, there is little overlap in the distributions of within class (Wclass_mount, Wclass_farmG1, and Wclass_farmG2).
HoG can also be extracted directly from the original data (i.e., raw images) for comparison. We demonstrate this by computing EMD over the same set of original images used for extracting CNN feature maps. Figure 10a,d presents within-class (i.e., Wclass_mount), between-class (i.e., (mountain vs. farm) and (mountain vs. forest) EMD distributions. As can be seen in the derived CNN features, the mountain vs. forest comparison poses challenges for real image comparison as well.  Figure 9 shows the results for comparing mountainous landscapes and farm landscape types. It can be seen from Figure 9a,d that feature maps from similar landscapes display smaller distances, and hence their distribution falls to the left, characterized by smaller EMD. Over 60% of features in Wclass_mount of Figure 9a,b show EMD score of 0.01, while more than 50% of between class comparison yields EMD values higher than 0.05. Moreover, it can be observed that aside from shape differences, there is little overlap in the distributions of within class (Wclass_mount, Wclass_farmG1, and Wclass_farmG2).
HoG can also be extracted directly from the original data (i.e., raw images) for comparison. We demonstrate this by computing EMD over the same set of original images used for extracting CNN feature maps. Figure 10a,d presents within-class (i.e., Wclass_mount), between-class (i.e., (mountain vs. farm) and (mountain vs. forest) EMD distributions. As can be seen in the derived CNN features, the mountain vs. forest comparison poses challenges for real image comparison as well. Figure 11 presents farm landscape samples and their corresponding feature maps. Convolutional filters are randomly selected to illustrate patterns learned on farm landscape types. It can be observed that the filters specialize in detecting different features. For example, Filter 43 recognizes farm boundaries to be significant patterns, while Filter 8 weights blocks of vegetated areas higher. As shown in Figure 11a,b, the filters appear to assign significance to similar features in both AID and Sentinel datasets. Figure 12a-d depicts within-landscape feature maps' similarity (Wclass_farm) and between-class similarity (Bclass_mountG1 and Bclass_mountG2, for mountains; Bclass_ forestG1 and Bclass_forestG2, for forests). The Wclass_farm distribution shows most feature maps with EMD values close to zero, and over 65% of the feature maps show EMD values of 0.01. Conversely, Bclass_forestG1 and Bclass_mountG1 distributions tend to fall towards higher distances, with over 50% of feature maps having EMD value of 0.05.

Forested Landscapes
Forest landscapes from the AID dataset and their feature maps are depicted in Figure 13a,b. Filters 11, 15, and 53 depict features at varying grain sizes, yet they represent discriminative features from an identical forest landscape.
Remote Sens. 2021, 12, x FOR PEER REVIEW 16 of 27 Figure 9. Landscape similarity comparison. EMD similarity distribution for mountain, forest, and farm patterns is depicted in (a-d). Mountain feature map comparison is within-class (i.e., mountain vs. mountain). Between-landscape type similarity distribution is derived through mountain vs. farm (a,b), and mountain vs. forest comparisons (c,d).  Figure 11 presents farm landscape samples and their corresponding feature maps. Convolutional filters are randomly selected to illustrate patterns learned on farm landscape types. It can be observed that the filters specialize in detecting different features. For example, Filter 43 recognizes farm boundaries to be significant patterns, while Filter 8 weights blocks of vegetated areas higher. As shown in Figure 11a,b, the filters appear to assign significance to similar features in both AID and Sentinel datasets.  Figure 12a-d depicts within-landscape feature maps' similarity (Wclass_farm) and between-class similarity (Bclass_mountG1 and Bclass_mountG2, for mountains; Bclass_forestG1 and Bclass_forestG2, for forests). The Wclass_farm distribution shows most feature maps with EMD values close to zero, and over 65% of the feature maps show

Forested Landscapes
Forest landscapes from the AID dataset and their feature maps are depicted in Figure 13a,b. Filters 11, 15, and 53 depict features at varying grain sizes, yet they represent discriminative features from an identical forest landscape. Figure 14a,b illustrates the similarity distributions for within forest landscape (Wclass_forest) and forest vs. farm landscapes (Bclass_farmG1 and Bclass_farmG2). The two landscape types show distinct EMD similarity distribution with very little overlap. Moreover, high variance is noticeable in the between-landscape comparison. Feature maps in within-landscape comparison depict lower EMD scores, with over 60% of features showing EMD values of 0.00-0.01, while over 70% feature maps in between-landscape comparison show 0.05 EMD similarity scores. Figure 14c,d compares forest landscapes with mountains. Within-class distribution (i.e., Wclass_forest) shows lower variance and relatively shorter EMD scores. However, though the distributions depict different shapes, there tend to be substantial overlap in within-class and between-class (Bclass_mountG1 and Bclass_mountG2) distributions.   Figure 14a,b illustrates the similarity distributions for within forest landscape (Wclass_ forest) and forest vs. farm landscapes (Bclass_farmG1 and Bclass_farmG2). The two landscape types show distinct EMD similarity distribution with very little overlap. Moreover, high variance is noticeable in the between-landscape comparison. Feature maps in withinlandscape comparison depict lower EMD scores, with over 60% of features showing EMD values of 0.00-0.01, while over 70% feature maps in between-landscape comparison show 0.05 EMD similarity scores. Figure 14c,d compares forest landscapes with mountains. Within-class distribution (i.e., Wclass_forest) shows lower variance and relatively shorter EMD scores. However, though the distributions depict different shapes, there tend to be substantial overlap in within-class and between-class (Bclass_mountG1 and Bclass_mountG2) distributions.

Discussion
A comparison of the Tex-CNN accuracy reports on the test data, as shown in the confusion matrix (Figure 6), emphasizes the promising potential of textural information in capturing discriminative patterns. Although the performance of both models is virtually similar for farm and forested landscapes, we noticed a dramatic difference in the models' classification accuracies for mountainous terrain types. Our observation implies that the higher accuracy for the Tex-CNN prediction is partly attributable to the model's architecture, which encodes representative features with textural information. The incorporation of texture features enhances model performance, especially for complex patterns and datasets [59,75]. As seen in Figure 13, the feature maps display multi-resolution patterns in the forest landscape types. The feature concatenation method introduced may have encouraged the CNN to learn both fine and coarse grain spatial patterns [75]. A comparison of the Tex-CNN classification results (e.g., OA) and that of the state-of-the-art models in AID is presented in Table 3. The model is highly competitive with existing high-performing techniques. Per-landscape accuracy shows that our method is either at par or outcompetes other methods. It should be emphasized that the model is simple (i.e., small in size) and computationally efficient compared to other models (e.g., References [34,72]). Thus, Tex-CNN can be used to extract feature maps with minimum overheard cost.
In Figure 6c,d, it can be observed that classifying landscapes in Sentinel data is challenging for both models, as they did not perform up to expectation in the first row of the confusion matrix. Over 90% of the farm landscapes tend to be classified as forest (i.e., false positive); contrarily, 92% and 79% of forest landscapes are correctly classified by the Tex-CNN and the classical CNN, respectively. This is partly explained by the relatively low spatial resolution of Sentinel's dataset, as well as the data not being part of the training sample. Visual exploration of feature maps in Sentinel data shows most farm boundaries disappearing completely in higher layers of the CNN, thus making farm samples appear as if they contain only vegetation patterns. The absence of boundary-like patterns likely triggers filter responses, leading to the misclassification of farms as forest. A profound reduction in misclassification rates, especially for farm landscapes, was achieved by adding Sentinel data (Figure 6e,f). Thus, presenting models with multimodal data at training time is likely to improve discriminative learning, while reducing misclassification errors.
The use of feature maps in pattern recognition is borne from the notion that the human visual system extracts the most relevant structural information from visual scenes in order to make decisions or characterize them semantically [76]. There is a great deal of analogy between landscape similarity comparison and assessment of feature maps (dis) similarity common in computer vision research [10]. CNN feature maps are continuous-valued data which can avoid classification problems that arise in landscape research owing to landcover type discretization and artificial boundaries generation [77]. We adopted a novel approach to compare landscapes via the extraction of feature maps from specific landscape types. This framework leads to the availability sufficient feature templates describing a particular landscape and thus enabling robust similarity mapping. The PCA method resulted in objective selection of feature maps that best represent a given landscape. Feature map dimensionality reduction through PCA has been proven to not degrade but further improve the discriminative potential of convolutional features [53]. Figure 7 shows samples of original images and their corresponding eigen maps. For landscape similarity comparison, layer-two feature maps were utilized. As can be seen in the figure, layer-two yields compact and high-resolution feature representations than layer-three. This suggests that layer-two features may be suitable for similarity assessment, hence our adoption of the layer's feature maps.
In Figure 9a,b, mountainous landscapes show distinct differences with farm landscape types. The EMD values for the within class comparison (Wclass_mount) falls largely on the left, pointing to shorter distances and hence higher similarity. More than 60% of the feature maps show EMD values of 0.01. Over 50%, the feature maps between class comparison EMD values are as high as 0.05. This suggests that there exist significant discriminative features between these two distinct landscape types. Song et al. [78] provide evidence that, by using feature map distances, it is possible to select the most discriminative patterns to represent mountainous terrains. The feature maps within the similar landscape also tend to depict higher EDM densities, which is an indicator of feature maps clustering [39], and high-density (frequency) values imply that a large proportion of feature maps are similar. Figure 10a-d compares HoG features extracted directly from the original images. The EDM distributions are somewhat similar to the CNN feature maps, but it can be observed that the CNN features appear to be slightly sensitive; for example, fewer images in the betweenclass comparison fall in EDM of 0-0.01. Moreover, compared to the original image HoG features, it can be seen that EMD values' distribution tends to be peakier for within-class and a little flatter for between-class in the CNN feature comparison. This suggests that our Tex-CNN features may possess more image descriptors compared to raw image pixels.
When comparing mountains versus forested landscapes, EMD distributions appear to overlap. This challenge is not unexpected, given the diverse morphology of mountains in some images, especially given that some mountains contain forest. Furthermore, recalling that the model's performance at predicting mountainous terrains is low, it follows that feature maps derived for certain input images that record poor scores may be of lower quality for landscape comparison. This suggests that, if a model is optimized to predict a particular landscape type with high accuracy, its corresponding feature maps will be of better discriminative quality and hence can be suitable for mapping landscape similarity [41].
Farm landscapes turn out to be the most easily discriminated patterns using the CNN model's feature maps (see Figure 12a,b). As expected, the within-landscape type comparison shows smaller EMD values for farm feature maps, with between-landscape distributions falling towards the right. Additionally, there is very little overlap in the distributions of within-and between-feature map distributions. Higher EMD values suggest lower similarity scores for landscapes being compared. Moreover, within-class feature maps exhibit somewhat low variance in EMD values. Over 65% of the Wclass_farm shows 0.01 EMD. This shows higher similarity compared to the farm vs. forest comparison, where EMD values as large as 0.05 are recorded. The unique vertical and horizontal boundary features may be among the discriminative patterns the model learns in farm landscapes. Lower layers of CNN are superior in learning edges, blobs, curves, and fine-grained textural patterns [12]. This observation emphasizes the high prediction accuracy recorded for the farm-landscapes type, as shown in the confusion matrix ( Figure 6). Murabito et al. [76] study found that saliency maps, a variant of gradient-based attention maps (i.e., feature maps), improve pattern detection. Figure 14a-d depicts within-forest landscape and between landscapes, which consist of forest vs. farm (e.g., Bclass_farmG1), and forest vs. mountain (e.g., Bclass_mountG1). Figure 14a,b emphasizes the existence of distinct discriminative features between forest and farm landscapes, as these two distributions show very little overlap. More importantly, within-forest landscape (Wclass_forest) distribution shows lower EMD values, suggesting higher similarity scores. More than 60% of the feature maps have EMD values of 0.00-0.01, while over 70% of the between-landscape comparison shows 0.05 EMD similarity scores. However, the Wclass_forest vs. Bclass_mount distributions show overlaps (Figure 14c,d), though the shape of the distributions suggest that the two landscapes belong to distinctively different class types. The Kolmogorov-Smirnov test further confirmed that the distributions are statistically significantly different (p − value < 0.001).
The remote-sensing and spatial-analysis literature has many metrics for comparing spatial patterns, yet this domain is largely fractured, and sometimes lacks generic toolsets for comparing continuous valued (i.e., unclassified) image data [7]. Amirshahi [79] proposed extracting HoG and applying histogram intersection kernel to compare feature maps. Liu et al. [80] also introduced a similarity distribution learning framework, using a CNN ensemble to incorporate feature uncertainty similarity at training time. The extracted features from the trained model are then employed in image retrieval and scene classification. Given that CNN feature maps are inherently discriminative and can potentially handle similarity uncertainties, we propose a metric to compare CNN feature maps' similarity via the computation of feature EMD. Our approach applies gradient-based computation to extract discriminative spatial patterns given an input image. The extracted feature maps contain local descriptors which are essential for pattern recognition. Utilizing EMD resolves the problem of histograms' bin size on similarity scores.
Our proposed metric effectively distinguishes farm landscape types from non-farm landscapes. Mountainous terrains and forested landscapes are discriminated, as their distributions are significantly different. A highly sensitive spatial pattern domain metric may be able to overcome the overlaps seen in forested and mountainous landscapes distributions. We tested structural similarity and the complex-wavelet structural similarity metrics which capture spatial information but did not realize impressive results. We point out that our findings demonstrate the challenging nature of the AID dataset and its potential suitability for training models; despite containing fewer samples per scene categories, the images can be described as multi-scale (i.e., mountain features' size vary within the same landscape type). Such data can present challenges to CNNs without explicit multiresolution encoding [81]. To surmount such a limitation, Li et al. [20] suggested utilizing the last convolutional layer filters, since these enable the discovery of locally consistent spatial patterns. However, we chose not to apply these features, since they lack full geometric invariance, as well as fine-grain textural details [17]. Figure 8 further emphasizes our claim, as it illustrates the lower spatial resolution of layer-three feature maps. The last layer (i.e., FC1) encodes structure and global information (e.g., shape). As pointed out earlier, unlike object recognition, landscape patterns lack definite shapes; hence, features from this layer may not improve mountain vs. forest discrimination substantially. Furthermore, given that the FC1 features are 1D vectors, the approach to computing the HoG adopted cannot be applied. The bag-of-words approach widely used in CBIR [82] could improve mountain vs. forest distinction, but this approach was not considered in this work, as it is out of scope.
The low classification accuracy of the models on Sentinel data (see Figure 6c,d) emphasizes the potential effects of spatial resolution on models' performance. Interestingly, however, the Tex-CNN outperforms the classical CNN, as it shows high classification accuracy for mountains. The inclusion of texture information may have improved the model's performance across scales.

Conclusions
The landscape-similarity mapping problem can be formulated as a challenge to detect repeated patterns, in other words, similar patterns across different locations, as shown in a study conducted by Lettry et al. [21]. The problem of comparing landscapes can also be considered in the context of image-retrieval tasks, as demonstrated by Yandex and Lempitsky [53], using convolutional feature maps. Landscape similarity or change-detection problems may further be cast as image-quality assessment challenges, as demonstrated in Reference [79]. In this study, we showed that CNN-based features (aka spatial attention maps) contain discriminative descriptors of image quality and, hence, computing similarity over feature maps can be an effective and generic way to compare landscapes. Our approach provides evidence that a generic pattern-comparison metric can be developed from highly discriminative feature maps capable of mapping diverse landscape types.
The challenge encountered in the mixing of forest and mountain similarity distributions points to the potential occurrence of false positives when attempting to make search queries between forests and mountains. The models' performance being consistently low for mountains in AID and Sentinel data further emphasizes that scaling of features represented in feature maps might work for farms and forests but not for mountains. As mentioned previously, the morphology of the mountain class is highly variable; moreover, the presence of forest on mountains further complicates discrimination between the landscapes. In this context, a priori knowledge may help decrease false positives at the time of query. Moreover, the relatively low CNN classification accuracy for the mountain landscapes likely influenced the quality of feature maps derived from convolutional layer filters; hence, a higher-performing model would be crucial for deriving highly discriminative patterns relevant for landscape similarity comparison.
One potential limitation of the proposal stems from the fact that mixed landscape samples were not considered in model development; widening the sample size to include scenes that contain a mixture of two or more landcover types could improve the metric's performance, especially in discriminating mountains and forests. Such a fuzzy definition of landscape classes may be more useful for landscape-similarity and/or scene-retrieval applications in the future, as they more closely align with the complexity of landscapes in the real world. Moreover, the nested framework (i.e., PCA and HoG, and EMD) computations may increase the complexity of the proposed metric. Given that what constitutes the best approach to feature map selection approach remains an open question [57], an innovative and objective framework to select feature maps to enhance (dis)similarity detection, as shown by Rui et al. [83] by utilizing feature map separability index, needs future consideration. Additionally, further research needs to consider expanding the number of landscape types so as to test the robustness and generalizability of the proposed metric. Independent validation datasets from different sensors, such as Sentinel-2, can be challenging for models trained on high-resolution aerial imagery; thus, it is essential that future research considers combining samples of multi-modal datasets for model development. The utilization of gradient-based CNN feature maps for landscape-change detection also warrants future research.