2.1. Methods
Restricted Boltzmann machines (RBMs) are simple two-layer learning architectures that can be trained in an unsupervised or supervised fashion. In this work, we use unsupervised RBMs. RBMs can be stacked to form a deep learning model called a Deep Belief Network (DBN) [7]. The work described here only uses RBMs with a single layer.
An RBM is a variant of a Markov random field with hidden units, whose energy function is linear in its free parameters. RBMs are “restricted” because edges can only connect units in adjacent layers: each unit in the visible layer is connected to each unit in the hidden layer, but no intra-layer connections are allowed. The energy function used for RBMs is:

$$E(v, h) = -b^{\top} v - c^{\top} h - h^{\top} W v,$$

where v is the set of visible units, h is the set of hidden units, b and c are the sets of offsets for the visible and hidden units, respectively, and W is the set of weights for the edges that connect the layers. The initial energy function can be translated into the free energy formula:

$$F(v) = -b^{\top} v - \sum_{i} \log \sum_{h_i} e^{h_i (c_i + W_i v)},$$

where W_i denotes the i-th row of W. This allows us, given the definition of energy-based models with hidden units, to define the probability distribution as:

$$P(v) = \frac{e^{-F(v)}}{Z}, \qquad Z = \sum_{\tilde{v}} e^{-F(\tilde{v})}.$$
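The quantities in this subsection can be checked numerically on a tiny model. The following is a minimal sketch, assuming binary visible and hidden units and the standard RBM forms E(v, h) = -b·v - c·h - h·Wv and F(v) = -b·v - Σ_i log(1 + exp(c_i + W_i·v)). The layer sizes, random parameters, and helper names are illustrative and are not taken from the study.

```python
import itertools
import numpy as np


def energy(v, h, W, b, c):
    """Joint energy E(v, h) = -b.v - c.h - h.W.v (W has shape hidden x visible)."""
    return -b @ v - c @ h - h @ W @ v


def free_energy(v, W, b, c):
    """Free energy for binary hidden units: -b.v - sum_i log(1 + exp(c_i + W_i.v))."""
    return -b @ v - np.sum(np.logaddexp(0.0, c + W @ v))


def all_binary(n):
    """Enumerate every binary vector of length n (only feasible for tiny n)."""
    return [np.array(bits, dtype=float) for bits in itertools.product([0, 1], repeat=n)]


rng = np.random.default_rng(0)
n_visible, n_hidden = 4, 3  # assumed toy sizes
W = rng.normal(scale=0.5, size=(n_hidden, n_visible))
b = rng.normal(scale=0.1, size=n_visible)
c = rng.normal(scale=0.1, size=n_hidden)

visible_states = all_binary(n_visible)
hidden_states = all_binary(n_hidden)

# The free energy should equal -log of the sum over hidden states of exp(-E).
v0 = visible_states[5]
brute_force = -np.log(sum(np.exp(-energy(v0, h, W, b, c)) for h in hidden_states))
closed_form = free_energy(v0, W, b, c)

# P(v) = exp(-F(v)) / Z, with Z summed over every visible configuration.
unnormalized = np.array([np.exp(-free_energy(v, W, b, c)) for v in visible_states])
Z = unnormalized.sum()
probabilities = unnormalized / Z
```

Enumerating all states is only possible for a toy model; for realistic layer sizes the partition function Z is intractable, which is what motivates approximate training procedures.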
RBMs are trained using a process called contrastive divergence, instead of performing gradient descent directly on the negative log-likelihood, as is done with traditional feed-forward neural networks. Contrastive divergence is used here to speed up training: as an RBM is a special case of a Markov random field, computing the exact update requires expected values from the model's equilibrium distribution, so the RBM would have to be run to convergence on that distribution for each parameter update. This is computationally expensive, and the variance of the values sampled from the equilibrium distribution is typically high enough to cause issues during training. Instead of comparing the input (initial) distribution with the equilibrium distribution, contrastive divergence runs only a small number N of Gibbs sampling steps, starting from the input, and uses the resulting samples to compute the update. In order to keep updates from causing the new distribution to deviate significantly from the initial distribution, the Kullback–Leibler (KL) divergence is measured for each parameter update. In addition, bias constraints have been recommended for the contrastive divergence process, in order to encourage sparsity and selectivity in the hidden units [8]. Sparsity allows for activation diversity, meaning that each hidden unit only activates when necessary, not simply whenever an instance reaches the hidden layer of the RBM. Selectivity means that sets of hidden units, while activating sparsely, should not all activate at the same time.
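The training procedure described above can be illustrated with a minimal NumPy sketch of a contrastive divergence update with N Gibbs steps. It assumes binary visible and hidden units and a toy data set; the function name `cd_step`, the learning rate, and the layer sizes are hypothetical. It omits the KL-divergence monitoring and the sparsity/selectivity bias constraints mentioned in the text, and it is not the Lrn2/Theano implementation used in the study.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def cd_step(v0, W, b, c, rng, n_gibbs=1, learning_rate=0.1):
    """Apply one CD-N update in place; v0 has shape (batch, n_visible).

    W has shape (n_hidden, n_visible). Returns the mean squared
    reconstruction error for the batch.
    """
    # Positive phase: hidden probabilities driven by the data.
    h0_prob = sigmoid(v0 @ W.T + c)
    h_sample = (rng.random(h0_prob.shape) < h0_prob).astype(float)

    # Negative phase: N alternating Gibbs sampling steps starting from the data.
    for _ in range(n_gibbs):
        vk_prob = sigmoid(h_sample @ W + b)
        vk = (rng.random(vk_prob.shape) < vk_prob).astype(float)
        hk_prob = sigmoid(vk @ W.T + c)
        h_sample = (rng.random(hk_prob.shape) < hk_prob).astype(float)

    # Update from the difference between data-driven and sample-driven statistics.
    batch = v0.shape[0]
    W += learning_rate * (h0_prob.T @ v0 - hk_prob.T @ vk) / batch
    b += learning_rate * (v0 - vk).mean(axis=0)
    c += learning_rate * (h0_prob - hk_prob).mean(axis=0)
    return float(np.mean((v0 - vk_prob) ** 2))


# Toy data: two repeated binary patterns (an assumption for illustration only).
rng = np.random.default_rng(0)
patterns = np.array([[1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1]], dtype=float)
data = np.repeat(patterns, 10, axis=0)

n_visible, n_hidden = data.shape[1], 4
W = rng.normal(scale=0.01, size=(n_hidden, n_visible))
b = np.zeros(n_visible)
c = np.zeros(n_hidden)

errors = [cd_step(data, W, b, c, rng) for _ in range(1000)]
```

The radiance inputs in this work are real-valued and standardized, so a practical model would likely use a different visible-unit type than the binary units assumed here.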
While patterns can be recognized by RBMs, the end-user cannot interpret the output of an RBM’s hidden layer as it is generated. In order to translate the output back into a human-readable format, we use a form of agglomerative clustering called BIRCH clustering. A general clustering problem can be seen as a multi-objective optimization problem. The input is a set of N data points with M features. The goal is to group the data into a desired number of clusters, K, while minimizing the given error (or distortion) function. In agglomerative (hierarchical) clustering, each data point initially belongs to its own cluster. At each step, all clusters are compared, and a merge operation is performed:

$$C_a^{(i+1)} = C_a^{(i)} \cup C_b^{(i)},$$

where a and b are cluster indices at step i. This merge operation is performed on the two clusters whose merge minimally affects the error function. BIRCH clustering achieves the goal of pattern recognition while also being memory efficient, by performing the clustering through a tree-based approach [9]. For the clustering process, the same pixels that are used to train the RBM are also used to train the clustering model.
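The merge step of agglomerative clustering can be sketched as follows. This is a deliberately naive illustration that assumes the error function is the total within-cluster squared error (the Ward criterion); it is not BIRCH itself, which avoids the exhaustive pairwise comparison by summarizing the data in a tree. The function names and toy data are hypothetical.

```python
import numpy as np


def merge_cost(points_a, points_b):
    """Increase in total within-cluster squared error if the two clusters merge."""
    n_a, n_b = len(points_a), len(points_b)
    diff = points_a.mean(axis=0) - points_b.mean(axis=0)
    return (n_a * n_b) / (n_a + n_b) * float(diff @ diff)


def agglomerate(data, n_clusters):
    """Naive agglomerative clustering; returns one integer label per row of data."""
    clusters = [[i] for i in range(len(data))]  # every point starts as its own cluster
    while len(clusters) > n_clusters:
        best = None
        # Compare all cluster pairs and keep the cheapest merge.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                cost = merge_cost(data[clusters[a]], data[clusters[b]])
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge cluster b into cluster a
        del clusters[b]
    labels = np.empty(len(data), dtype=int)
    for label, members in enumerate(clusters):
        labels[members] = label
    return labels


# Two well-separated toy blobs in a 2-feature space.
rng = np.random.default_rng(0)
data = np.vstack([
    rng.normal(loc=0.0, scale=0.1, size=(10, 2)),
    rng.normal(loc=5.0, scale=0.1, size=(10, 2)),
])
labels = agglomerate(data, n_clusters=2)
```

The exhaustive pair search makes this sketch impractical beyond a few hundred points, which is the limitation that a tree-based method such as BIRCH is designed to address.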
2.2. Materials and Tools
The software was developed with Python 3.6.8. All of the RBM training and testing was implemented using the Lrn2 Deep Learning Framework [10], utilizing Theano with a GPUArray back-end (https://github.com/Theano/Theano/wiki). An alternative [11], which utilizes PyTorch [12], is available, but the aforementioned libraries worked well for this study. The hardware utilized was an NVIDIA GeForce Titan X GPU with 12 GB memory, as well as the NCCS Prism GPU Cluster (https://www.nccs.nasa.gov/systems/ADAPT/Prism) [13] on a machine running Ubuntu 14.04.5.
Our RBMs used two types of input sets. The first consists of geolocated, orthorectified L1 data from a single instrument. The other consists of collocated, orthorectified L1 data sets for spatially and temporally overlapping targets from multiple instruments with similar spatial resolution. The fusion techniques are described in the results section, along with examples. Each sample consists of a pixel and all of its neighboring pixels. This allows a small amount of spatial context to be included as input, along with the spectral information. Pixels that were set to fill values or were outside the specified valid ranges were not used. All spectral bands were used, with the exception of bands that were extremely noisy or known to be non-functional for the time period tested. For each RBM, at least 1,000,000 samples were used for training, and at least another 1,000,000 samples were used for testing. All input was standardized (by channel) before being used as input to the RBM, and again before being used as input to the clustering model. Below, the reader will see that a smaller number of pixels (>80,000) was evaluated in the Landsat-8/Sentinel-2 tests. We believe this is still an adequate amount of data to evaluate the performance; fewer pixels were used because only a small percentage of the images were labeled (when labels were given). The same pixels were used to train both the RBMs and the clustering models.
Once the clusters are generated, they must be assigned a context in order to be used. For this experiment, we needed to assign a context relative to pre-existing products, such as pixel classifiers, fire masks, or aerosol optical depth (AOD) data sets, in order to compare within the same context. If such products already exist, they can be used as a reference for automated mapping. Within the automated mapping process, a full image mask is built from the pre-existing product. This mask is either provided within the product, or there are instructions on how to compute one, given various certainty levels for each possible label. Given the full label set for the pre-existing product, spanning all test-set scenes, each cluster from our clustering product was mapped to the label it best agrees with.
In some cases, as with the finer-scale evaluation of fire detection in the second subset of experiments, the automated mapping was paired with a secondary manual pixel-labeling process, in order to account for clusters that correctly identified parts of an object (e.g., a fire) that were not identified in the pre-existing products. This manual assignment process is much like the manual pixel-labeling process used for training supervised learning models. However, our methodology utilizes the strong pattern-matching and data shape understanding capabilities that RBMs and BIRCH clustering offer before human intervention occurs. These capabilities allow for a much simpler and less error-prone manual intervention technique. The separation of context assignment from the image segmentation/clustering itself is also valuable, as it allows the image segmentation product to be used for many other studies.
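The input construction described earlier in this subsection (a pixel plus its neighbors, exclusion of fill or out-of-range values, and per-channel standardization) can be sketched as below. Several details are assumptions made for illustration: a 3 × 3 window, dropping a sample when any pixel in its window is invalid, standardizing before windowing, and the function name `build_samples` with its parameters.

```python
import numpy as np


def build_samples(image, fill_value, valid_min, valid_max):
    """Turn an image of shape (rows, cols, bands) into neighborhood samples.

    Returns an array of shape (n_samples, 9 * bands) and the (row, col)
    position of each sample's center pixel.
    """
    # A pixel is valid only if every band is in range and not a fill value.
    valid = (image != fill_value) & (image >= valid_min) & (image <= valid_max)
    valid = valid.all(axis=2)

    # Standardize each channel using statistics from valid pixels only.
    mean = image[valid].mean(axis=0)
    std = image[valid].std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant channels
    scaled = (image - mean) / std

    rows, cols, _ = image.shape
    samples, centers = [], []
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            window = valid[r - 1:r + 2, c - 1:c + 2]
            if window.all():  # skip windows touching any invalid pixel
                samples.append(scaled[r - 1:r + 2, c - 1:c + 2, :].ravel())
                centers.append((r, c))
    return np.array(samples), np.array(centers)


# Toy 5 x 5 image with 2 bands and one fill-valued corner pixel.
rng = np.random.default_rng(0)
image = rng.random((5, 5, 2))
image[0, 0, :] = -9999.0
samples, centers = build_samples(image, fill_value=-9999.0, valid_min=0.0, valid_max=1.0)
```

In this toy case, the only window that is dropped is the one centered at (1, 1), because it contains the fill-valued corner. The second standardization applied before clustering is not shown.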
2.3. Large Scale Coarse Full Scene Evaluations
The first few experiments performed were large-scale multi-scene investigations, which aimed to quantitatively answer the following questions: Can we capture coarse-scale information across a large set of scenes inside and outside of the area used to train the models? Can we provide fused data based on a generic set of methodologies, such that they provide added value when fused, while the data based on singular instruments are still viable enough for use when collocation or overlap does not occur? Can we achieve this in a way that is not extremely resource-hungry? To answer these questions, we evaluated three separate groups of data sets. The first was from the Multi-angle Imaging SpectroRadiometer (MISR) and the Moderate Resolution Imaging Spectroradiometer (MODIS), two instruments aboard the same satellite, Terra. These were compared against classification data sets that have been previously produced using the science data-processing pipelines of the respective instruments. The second was a data set from the IEEE GRSS Data Fusion Challenge from 2017, which consists of imagery from the Landsat-8 and Sentinel-2 satellites. These were compared against local climate zones, provided as labels. Finally, for this first set of experiments, we used data from the Hyperion instrument, a hyperspectral imager aboard the EO-1 satellite.
For all models generated, the architecture and parameterization remained exactly the same, as we did not want model variation to play a part in performance variation. We evaluated these comparisons in a few ways. The first was agreement, which considers the total percentage of labels that the pre-existing label sets and our mapped data sets agree upon, measured as:

$$\text{agreement} = \frac{\text{number of pixels whose mapped label matches the pre-existing label}}{\text{total number of labeled pixels}} \times 100\%.$$
Note that, because there is inherent uncertainty within most of the pre-existing classification products and, as we will show, they are not always completely correct, we named this metric agreement, and not accuracy. The second metric is balanced agreement, which also measures the total percentage of agreement, but takes into account the imbalance of pixel counts across the different labels [14]. In order to evaluate the structural understanding of the data through the output received from the models, we used a clustering metric called the Davies–Bouldin score [15]. This metric is much like the more commonly used silhouette score, but is much less computationally complex and, therefore, more feasible to use, given the amount of data. The aim is to measure the compactness of each cluster and the separation between clusters, as good clustering performance is assumed to provide compact clusters that are far away from one another in the actual feature space, and not the (line, sample) image space. For the Davies–Bouldin score, a lower score indicates better clustering performance and, therefore, a better structural understanding of the data (especially when coupled with higher agreement percentages). Finally, we wanted to look at the computational cost of training the RBMs, and whether there was any extreme increase in processing incurred when the fusion was done. To this end, we measured the amount of time necessary to train the model and the number of iterations each RBM required to reach convergence. It should be noted that an early stopping condition was added to the training of these RBMs, which is a common practice [16]. Specifically, if the model’s reconstruction error (the difference between the output distribution and the input distribution, as discussed in the Methods Section) remains the same or increases for three iterations, the training is stopped and convergence is assumed. We also generally know that RBMs perform well with only a few training iterations [8], and a low number of iterations reduces the chance of overfitting. Within the table, we provide evaluations for singular instruments and fusion sets, as well as results for clustering without passing the data through an RBM first (which we label here as “Raw”, as only the raw orthorectified radiances were used). We provide information on the latter in order to show that the RBM enhances the structural understanding of the fused data. For all cases, the single-instrument data sets passed through the RBM and clustered were viable, and can be used for image segmentation when fusion/collocation is not available; however, the RBM-based fusion product always performed the best.
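The evaluation metrics described above can be sketched in NumPy as follows. Three interpretations here are assumptions rather than details confirmed by the text: "best agrees with" is implemented as a majority vote per cluster, balanced agreement is taken to be the mean of the per-label agreement rates, and the Davies–Bouldin score uses Euclidean distances to cluster centroids. All function names and toy arrays are hypothetical.

```python
import numpy as np


def map_clusters_to_labels(clusters, reference):
    """Map each cluster id to the reference label it overlaps with most often."""
    mapping = {}
    for cluster_id in np.unique(clusters):
        labels, counts = np.unique(reference[clusters == cluster_id], return_counts=True)
        mapping[int(cluster_id)] = int(labels[np.argmax(counts)])
    return mapping


def agreement(mapped, reference):
    """Percentage of pixels whose mapped label matches the reference label."""
    return 100.0 * float(np.mean(mapped == reference))


def balanced_agreement(mapped, reference):
    """Mean of the per-label agreement rates, so rare labels count equally."""
    per_label = [np.mean(mapped[reference == label] == label)
                 for label in np.unique(reference)]
    return 100.0 * float(np.mean(per_label))


def davies_bouldin(features, clusters):
    """Average, over clusters, of the worst (scatter_i + scatter_j) / distance_ij."""
    ids = np.unique(clusters)
    centroids = np.array([features[clusters == k].mean(axis=0) for k in ids])
    scatter = np.array([
        np.mean(np.linalg.norm(features[clusters == k] - centroids[i], axis=1))
        for i, k in enumerate(ids)
    ])
    worst = []
    for i in range(len(ids)):
        ratios = [(scatter[i] + scatter[j]) / np.linalg.norm(centroids[i] - centroids[j])
                  for j in range(len(ids)) if j != i]
        worst.append(max(ratios))
    return float(np.mean(worst))


# Toy example: 8 pixels, 3 clusters, 2 reference labels (5 and 7).
clusters = np.array([0, 0, 0, 1, 1, 2, 2, 2])
reference = np.array([5, 5, 7, 7, 7, 5, 5, 5])
mapping = map_clusters_to_labels(clusters, reference)
mapped = np.array([mapping[int(k)] for k in clusters])

overall = agreement(mapped, reference)            # 7 of 8 pixels match
balanced = balanced_agreement(mapped, reference)  # mean of 5/5 and 2/3

# Two compact, well-separated clusters give a low Davies-Bouldin score.
features = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
score = davies_bouldin(features, np.array([0, 0, 1, 1]))
```

In the toy example the overall agreement is higher than the balanced agreement because the single mismatch falls on the less frequent label, which is the effect the balanced metric is meant to expose.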
MISR is an instrument onboard the Terra satellite that consists of nine different cameras, one of which points at the nadir. The other eight are split into two equal groups of forward and aft cameras. Each group has cameras that point at matching angles, relative to the local normal at the Earth’s surface: 25.8°, 45.6°, 60.0°, and 72.5° [17]. The Moderate Resolution Imaging Spectroradiometer (MODIS) is an imager with 36 spectral bands, whose resolutions span from 250 to 1000 m. For this test, we used only the 1000 m bands, which are measured continuously during both day and night [18]. In order to evaluate performance, we chose a region over the west coast of North America for the scenes used to train the models, and a partially overlapping region as the testing region, as depicted in Figure 1. In this way, we allowed for the evaluation of performance inside and outside the training extents. The reference imagery used as an example within this paper is from outside the training extent. As there was no large difference in agreement inside or outside the training extent, all confusion matrices shown are a combination of all test scenes. Within this first experiment, we trained models for one MISR camera on its own, the fusion of all nine MISR cameras, MODIS, and the fusion of nine MISR cameras + MODIS. We also looked at the raw nine-camera MISR + MODIS fusion product generated from clustering, without passing the data through the RBM. We compared against two pre-existing products by mapping the classes from our RBM-based clustered product to the best-agreeing classes of the pre-existing MISR and MODIS pixel classification products, using the pre-existing product as a label set. The first product used was from the MISR Support Vector Machine (SVM) classifier. This classifier is able to efficiently distinguish between clouds, aerosols, water, land, smoke/dust, and snow/ice, with an impressive global accuracy of 81% over all defined classes [19]. The only drawback of the MISR SVM product that arose in the tested scenes is that the snow label is often applied to cloudy areas. This is a difficult problem to solve with classification alone, as some clouds and snow/ice contain the same materials, only at different altitudes. The other product used here was the MODIS cloud mask [20]. This data set has classes for clouds, aerosols, land, desert, snow/ice, and water. It appears to have issues when identifying large areas of aerosols, due to a thresholding issue identified in some studies using this data set [21]. The multiple land classes are extremely useful here, as they break land up into land and desert classes. There is also more detail in the inland water. One drawback is that the land/water identification is based on static data sets and, thus, may not fully reflect what is seen in a given scene; nonetheless, the granularity is still useful.
The second coarser-scale experiment involved Landsat-8 and Sentinel-2 data provided by IEEE GRSS. Landsat-8 is a satellite platform that carries two instruments: the Operational Land Imager (OLI) and the Thermal Infrared Sensor [23]. Alongside the Landsat-8 data, data from the Sentinel-2 constellation of satellites, specifically a subset of channels from the Multi-Spectral Instruments (MSIs), were provided [24]. OpenStreetMap data were also available, providing information about land use, buildings, and water, but these were not used. For labels, hand-labeled local climate zone (LCZ) data were provided by WUDAPT [25] and GeoWiki (http://www.geo-wiki.org/). As our task was unsupervised, we used the scenes with LCZ label data as our test set, such that we could evaluate performance, and we trained on the scenes where the information was withheld. The training and testing extents, depicted in Figure 1, are completely separate from one another; thus, we only evaluated scenes not used in the training process. One of these scenes, over Berlin, is the one we show as reference. On top of providing the mapped clusters for areas where the labels exist, we also provided a complete mapping of the scene for visual validation purposes. All labels (except for the bare soil/pavement/dirt/sand label) performed very well. As there was a relatively low number of labeled pixels for the soil/dirt/pavement/sand label, it was hard to conclude why the misclassification happened; however, from the confusion matrix and the mapped images, it appears that most of these pixels were on or near roadways and were labeled as part of the urban sprawl which, in this context, makes sense. This part of the experiment not only further demonstrates the structural understanding and fusion capabilities of the methodologies, but also indicates that collocated scenes over clear areas can easily be fused, even if they are not within the same temporal range.
The final experiment in this category used Hyperion data. Hyperion is a hyperspectral imager that flew aboard the Earth Observing-1 (EO-1) satellite. Using a single instrument is not traditionally thought of as data fusion; however, using the full set of Hyperion’s >200 channels is akin to data fusion, given the sheer number of channels used for a single scene, which makes it an interesting tangential use-case. The training and testing extents are global and completely separate, as seen in Figure 1. We do not have label data for this experiment, and only carried out imagery and cluster analysis.
2.4. Fine Scale Evaluation in Select Scenes
The second set of experiments was intended to show what the improvement in large-scale agreement and cluster performance meant in terms of finer-scale structural understanding of data sets with large class imbalances. To this end, we chose to look at fire and smoke detection with MISR and MODIS, two instruments used in the previous experimental set, and with two airborne instruments, MASTER and eMAS. In both cases, there are pre-existing fire detection products to compare against. For fire and smoke, when a pre-existing product is available, a first pass is conducted with the automated mapping procedure described above. As these are finer-scale evaluations, a manual mapping process was also performed, in order to ensure that detections present in our product, but absent from the pre-existing product, were not missed. Smoke detection (for the most part), as well as all burn scar detection, was evaluated qualitatively in this study, but will be examined further in future work.
The airborne instruments whose data were used are the MODIS/ASTER Airborne Simulator (MASTER) and the Enhanced MODIS Airborne Simulator (eMAS). MASTER is an airborne imager, which was aboard a DC-8 aircraft for the scenes tested; it has a spatial resolution of 10–30 m/pixel with 50 spectral bands [26]. eMAS is another airborne imager, flown aboard the high-altitude ER-2 aircraft. The eMAS instrument has 38 spectral bands and a spatial resolution of 50 m [27]. Data from these instruments were used separately as well as together, generating a MASTER/eMAS fusion product. Both eMAS and MASTER have pre-existing fire-detection products, which were generated using the same algorithms as MODIS. The training and testing extents for MASTER, eMAS, and MASTER + eMAS fusion can be seen in Figure 1. The MASTER, eMAS, and MASTER + eMAS fusion RBMs and clustering were generated and parameterized in the same way as the other models in this study, but no full-scene classification data sets are available for these instruments; hence, they are only included in this section. As these data sets were not a part of the initial experimental set above, we show the evaluation of all three RBM-based products, as well as the raw product, in this case.
Over the two scenes that were almost spatiotemporally collocated for fusion, the two fire detection products were compared. The MASTER fire detection product was resampled to eMAS resolution and then quantitatively compared. The training and testing extents can be seen in Figure 1.