# Looking for Change? Roll the Dice and Demand Attention


## Abstract

We introduce two new efficient self-contained feature extraction convolution units: the `CEECNet` and `FracTAL` `ResNet` units. Further, we propose a new encoder/decoder scheme, a network macro-topology, that is tailored for the task of change detection. The key insight in our approach is to facilitate the use of relative attention between two convolution layers in order to fuse them. We validate our approach by showing excellent performance and achieving state-of-the-art scores (F1 and Intersection over Union, hereafter IoU) on two building change detection datasets, namely, the LEVIRCD (F1: 0.918, IoU: 0.848) and the WHU (F1: 0.938, IoU: 0.882) datasets.

## 1. Introduction

#### 1.1. Related Work

#### 1.1.1. On Attention

Woo et al. [18] introduced the Convolutional Block Attention Module (`CBAM`), which is also a form of spatial and channel attention, and showed improved performance on image classification and object detection tasks. To the best of our knowledge, the most faithful implementation of multi-head attention [14] for convolution layers is [19] (spatial attention).

#### 1.1.2. On Change Detection

#### 1.2. Our Contributions

- We introduce a new set similarity metric that is a variant of the Dice coefficient: the Fractal Tanimoto similarity measure (Section 2.1). This similarity measure has the advantage that it can be made steeper than the standard Tanimoto metric towards optimality, thus providing a finer-grained similarity metric between layers. The level of steepness is controlled by a recursion-depth hyperparameter. It can be used both as a “sharp” loss function when fine-tuning a model at the later stages of training and as a set similarity metric between feature layers in the attention mechanism.
- Using the above set similarity as a loss function, we propose an evolving loss strategy for fine-tuning the training of neural networks (Section 2.2). This strategy helps to avoid overfitting and improves performance.
- We introduce the Fractal Tanimoto Attention Layer (hereafter `FracTAL`), which is tailored for vision tasks (Section 2.3). This layer uses the fractal Tanimoto similarity to compare queries with keys inside the Attention module. It is a form of combined spatial and channel attention.
- We introduce a feature extraction building block that is based on the residual neural network [32] and the fractal Tanimoto Attention (Section 2.4.2). The new `FracTAL` `ResNet` converges faster to optimality than standard residual networks and enhances performance.
- We introduce two variants of a new feature extraction building block, the Compress-Expand/Expand-Compress unit (hereafter the `CEECNet` unit; Section 2.5.1). This unit exhibits enhanced performance in comparison with standard residual units and the `FracTAL` `ResNet` unit.
- Capitalising on these findings, we introduce a new backbone encoder/decoder scheme, a macro-topology, the `mantis`, that is tailored for the task of change detection (Section 2.5.2). The encoder part is a Siamese dual encoder, where the corresponding extracted features at each depth are fused together with `FracTAL` relative attention. In this way, information exchange between features extracted from bi-temporal images is enforced. There is no need for manual feature subtraction.
- Given the relative fusion operation between the encoder features at different levels, our algorithm achieves state-of-the-art performance on the LEVIRCD and WHU datasets without requiring the use of contrastive loss learning during training (Section 3.2). Therefore, it is easier to implement with standard deep learning libraries and tools.

## 2. Materials and Methods

#### 2.1. Fractal Tanimoto Similarity Coefficient

#### 2.2. Evolving Loss Strategy

#### 2.3. Fractal Tanimoto Attention

#### 2.4. Fractal Tanimoto Attention Layer

It is instructive to compare the memory footprint of the `FracTAL` spatial similarity with that of the dot product similarity that appears in SAGAN [39] self-attention. Let us assume that we have an input feature layer of size $(B\times C\times H\times W)=32\times 1024\times 16\times 16$ (e.g., this appears in the layer at depth 6 of UNet-like architectures, starting from 32 initial features). From this, three layers of the same dimensionality are produced: the query, the key, and the value. With the Fractal Tanimoto spatial similarity, $\mathcal{T}_{\boxtimes}$, the output of the similarity of queries and keys is $B\times C\times 1\times 1=32\times 1024\times 1\times 1$ (Equation (12)). The corresponding output of the dot similarity of spatial components in self-attention is $B\times C\times C = 32\times 1024\times 1024$ (Equation (11)), which has a C-times higher memory footprint.
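The footprint claim can be checked with quick element-count arithmetic (a sketch using the layer sizes quoted above; element counts only, ignoring dtype and intermediate buffers):

```python
# Element counts of the two similarity outputs for the quoted layer size.
B, C, H, W = 32, 1024, 16, 16

# FracTAL spatial similarity: one scalar per (batch, channel) pair.
fractal_elems = B * C * 1 * 1

# Dot-product similarity over spatial components: a C x C map per batch item.
dot_elems = B * C * C

print(fractal_elems)               # 32768
print(dot_elems)                   # 33554432
print(dot_elems // fractal_elems)  # 1024, i.e. C-times larger
```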

We introduce the fractal Tanimoto Attention Layer (`FracTAL`) for vision tasks as an improvement over the scaled dot product attention mechanism [14] for the following reasons:

- The $\mathcal{FT}$ similarity is automatically scaled in the region $(0,1)$; therefore, it does not require normalisation or activation to be applied. This simplifies the design and implementation of Attention layers and enables training without ad hoc normalisation operations.
- The dot product does not have an upper or lower bound; therefore, a positive value cannot be a quantified measure of similarity. In contrast, $\mathcal{FT}$ has a bounded range of values in $(0,1)$. The lowest value indicates no correlation and the maximum value indicates perfect similarity. It is thus easier to interpret.
- The recursion depth d is a form of hyperparameter, akin to the “temperature” in annealing. Therefore, the $\mathcal{FT}$ can become as steep as we desire (by modification of the temperature parameter d), even steeper than the dot product similarity. This can translate to finer query and key similarity.
- Finally, it is efficient in terms of the GPU memory footprint (when one considers that it does both channel and spatial attention), thus allowing the design of more complex convolution building blocks.
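These properties can be illustrated with a minimal pure-Python sketch of the depth-d Tanimoto coefficient, following the form of Listing A.1 (the vectors below are made-up toy values):

```python
# Depth-d Tanimoto coefficient for flat vectors (cf. Listing A.1).
def tanimoto(p, l, depth=5):
    a = 2.0 ** depth
    b = -(2.0 * a - 1.0)
    tpl = sum(x * y for x, y in zip(p, l))  # <p, l>
    tpp = sum(x * x for x in p)             # <p, p>
    tll = sum(x * x for x in l)             # <l, l>
    return tpl / (a * (tpp + tll) + b * tpl)

p = [0.9, 0.1, 0.8]  # toy "prediction"
l = [1.0, 0.0, 1.0]  # toy "label"

# Identical inputs score exactly 1 at any depth; increasing the depth makes
# the measure steeper, so the same imperfect match scores lower.
print(tanimoto(p, p, depth=5))  # -> 1 (up to floating point)
print(tanimoto(p, l, depth=0))  # standard Tanimoto
print(tanimoto(p, l, depth=5))  # smaller value for the same pair
```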

A software implementation of the `FracTAL` is given in Listing A.2. The multihead attention is achieved using group convolutions for the evaluation of queries, keys, and values.
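As a side note on why group convolutions suit the multi-head split (our own back-of-the-envelope sketch, not from the paper): a convolution with `g` groups uses `1/g` of the weights of the equivalent full convolution, so per-head query/key/value projections come at a reduced parameter cost.

```python
# Weight count of a 2D convolution with optional grouping (bias ignored).
def conv_params(c_in, c_out, k, groups=1):
    assert c_in % groups == 0 and c_out % groups == 0
    return (c_in // groups) * c_out * k * k

c = 64
print(conv_params(c, c, 3, groups=1))  # 36864: full convolution
print(conv_params(c, c, 3, groups=8))  # 4608: same layer split into 8 heads
```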

#### 2.4.1. Attention Fusion

We describe the fusion of features with the `FracTAL` for two cases: self-attention fusion and relative attention fusion, where information from two layers is combined.

#### 2.4.2. Self Attention Fusion

`FracTAL` self-attention layer, $\mathbf{A}$:

Multiplying the `FracTAL` attention layer $\mathbf{A}$ with the features, $\mathbf{L}$, effectively lowers the values of features in areas that are not “interesting”. It does not alter the value of areas that “are interesting”. This can cause a loss of information in areas where $\mathbf{A}$ “does not attend” (i.e., it does not emphasize), which would otherwise be valuable at a later stage. Indeed, areas of the image that the algorithm “does not attend” should not be perceived as empty space [11]. For this reason, the “emphasised” features, $\mathbf{L}\odot \mathbf{A}$, are added to the original input $\mathbf{L}$. Moreover, $\mathbf{L}+\mathbf{L}\odot \mathbf{A}$ is identical to $\mathbf{L}$ in spatial areas where $\mathbf{A}$ tends to zero and is emphasised in areas where $\mathbf{A}$ is maximal.

The residual branch processes the input ${\mathbf{X}}_{\mathrm{in}}$ through convolutions and `ReLU` activations and produces the ${\mathbf{X}}_{\mathrm{out}}$ layer. A separate branch uses the ${\mathbf{X}}_{\mathrm{in}}$ input to produce the self-attention layer $\mathbf{A}$ (see Listing A.2). Then we multiply element-wise the standard output of the residual unit, ${\mathbf{X}}_{\mathrm{in}}+{\mathbf{X}}_{\mathrm{out}}$, with the $\mathbf{1}+\gamma \mathbf{A}$ layer. In this way, at the beginning of training, this layer behaves as a residual layer, inheriting the excellent convergence properties of ResNet at the initial stages; at later stages of training, the Attention becomes gradually more active and allows for greater performance. A software routine of this fusion for the residual unit, in particular, can be seen in Listing A.4 in Appendix C.
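The fusion just described can be sketched numerically (toy scalar values in plain Python; the real quantities are feature tensors): with $\gamma = 0$ the unit reduces to a plain residual sum, and as $\gamma$ grows the attention map re-weights the output.

```python
# Residual attention fusion: (X_in + X_out) * (1 + gamma * A), element-wise.
def fuse(x_in, x_out, att, gamma):
    return [(i + o) * (1.0 + gamma * a) for i, o, a in zip(x_in, x_out, att)]

x_in = [1.0, 2.0, 3.0]   # toy input features
x_out = [0.5, 0.5, 0.5]  # toy residual-branch output
att = [0.0, 0.5, 1.0]    # 0 = "not attended", 1 = maximally attended

print(fuse(x_in, x_out, att, gamma=0.0))  # [1.5, 2.5, 3.5]: pure residual
print(fuse(x_in, x_out, att, gamma=1.0))  # [1.5, 3.75, 7.0]: emphasised
```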

#### 2.4.3. Relative Attention Fusion

#### 2.5. Architecture

#### 2.5.1. Micro-Topology: The `CEECNet` Unit

The key idea behind the `CEEC` building block is that it provides two different, yet complementary, views of the same input. The first view (the **CE** block; see Figure 4b) is a “summary understanding” operation (performed in lower resolution than the input; see also [42,43,44]). The second view (the **EC** block) is an “analysis of detail” operation (performed in higher spatial resolution than the input). It then exchanges information between these two views using relative attention, and finally fuses them together by emphasising the most important parts using the `FracTAL`.

The first branch is a “mini ∪-Net” operation (the **CE** block), which is responsible for summarising information from the input features by first compressing the total volume of features into half its original size and then restoring it. The second branch, a “mini ∩-Net” operation (the **EC** block), is responsible for analysing the input features in higher detail: it initially doubles the volume of the input features by halving the number of features and doubling each spatial dimension. It subsequently compresses this expanded volume to its original size. The input to both layers is concatenated with the output, and then a normed convolution restores the number of channels to their original input value. Note that the mini ∩-Net is nothing more than the symmetric (or dual) operation of the mini ∪-Net.

The outputs of the **EC** and **CE** blocks are fused together with relative attention fusion (Section 2.4.3). In this way, the exchange of information between the layers is encouraged. The final emphasised outputs are concatenated together, thus restoring the initial number of filters, and the produced layer is passed through a normed convolution in order to bind the relative channels. The operation is concluded with a `FracTAL` residual operation and fusion (similar to Figure 4a), where the input is added to the final output and emphasised by the self-attention on the original input. The `CEECNet` building block is described schematically in Figure 4b.

The compression operation, **C**, is achieved by applying a normed convolution layer of stride 2 (`k` = 3, `p` = 1, `s` = 2), followed by another convolution layer that is identical in every aspect except the stride, which is now `s` = 1. The purpose of the first convolution is to both resize the layer and extract features; the purpose of the second is to extract features. The expansion operation, **E**, is achieved by first resizing the spatial dimensions of the input layer using bilinear interpolation; the number of channels is then brought to the desired size by the application of a convolution layer (`k` = 3, `p` = 1, `s` = 1). Another identical convolution layer is applied to extract further features. The full details of the convolution operations used in the **EC** and **CE** blocks can be found in Listing A.5.
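The resizing behaviour of the **C** and **E** operations follows directly from the standard convolution output-size formula; a quick arithmetic sketch (the input size H = 16 is an assumed example):

```python
# Spatial output size of a convolution: floor((H + 2p - k) / s) + 1.
def conv_out(h, k, p, s):
    return (h + 2 * p - k) // s + 1

H = 16
print(conv_out(H, k=3, p=1, s=2))  # 8: the stride-2 convolution halves H
print(conv_out(H, k=3, p=1, s=1))  # 16: the stride-1 convolution keeps H
print(2 * H)                       # 32: bilinear upsampling doubles H first
```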

#### 2.5.2. Macro-Topology: Dual Encoder, Symmetric Decoder

In this section, we describe the `macro`-topology (i.e., backbone) of the architecture that uses either the `CEECNet` or the `FracTAL` `ResNet` units as building blocks. We start by stating the intuition behind our choices and continue with a detailed description of the `macro`-topology. Our architecture is heavily influenced by the `ResUNet-a` model [35]. We will refer to this macro-topology as the `mantis` topology.

- We make the hypothesis that the process of change detection between two images requires a mechanism similar to human attention. We base this hypothesis on the fact that the time required for identifying objects that changed in an image correlates directly with the number of changed objects. That is, the more objects a human needs to identify between two pictures, the more time is required. This is in accordance with the feature-integration theory of Attention [11]. In contrast, subtracting features extracted from two different input images is a process that is constant in time, independent of the complexity of the changed features. Therefore, we avoid using ad hoc feature subtraction in all parts of the network.
- In order to identify change, a human needs to look at and compare two images multiple times, back and forth. We need to emphasise features on the image at date 1 based on information from the image at date 2 (Equation (16)) and, vice versa (Equation (17)). Then, we combine both of these pieces of information together (Equation (18)). That is, we exchange information with relative attention (Section 2.4.3) between the two at multiple levels. Stated differently, as a question: what is important on input image 1 based on information that exists on image 2, and vice versa?

We now describe the `mantis` macro-topology (with `CEECNet` V1 building blocks; see Figure 5). The encoder part is a series of building blocks, where the size of the features is downscaled between the application of each subsequent building block. Downscaling is achieved with a normed convolution with stride `s` = 2, without using activations. There exist two encoder branches that share identical parameters in their convolution layers. The input to each branch is an image from a different date, and the role of the encoder is to extract features at different levels from each input image. During the feature extraction by each branch, each of the two inputs is treated as an independent entity. At successive depths, the outputs of the corresponding building blocks are fused together with the relative attention methodology described in Section 2.4.3, but they are not used until later, in the decoder part. Crucially, this fusion operation suggests to the network that the important parts of the first layer will be defined by what exists on the second layer (and vice versa), but it does not dictate how exactly the network should compare the extracted features (e.g., by demanding that the features be similar for unchanged areas and maximally different for changed areas; we tried this approach, and it was not successful). This is something that the network has to discover in order to match its predictions with the ground truth. Finally, the last encoder layers are concatenated and inserted into the pyramid scene pooling layer (`PSPPooling` [35,49]).

After the `PSPPooling` layer (middle of the network), we upscale lower-resolution features with bilinear interpolation and combine them with the fused outputs of the encoder with a concatenation operation followed by a normed convolution layer, in a way similar to the `ResUNet-a` [35] model. The `mantis` `CEECNet` V2 model replaces all concatenation operations followed by a normed convolution with a `Fusion` operation, as described in Listing A.3.

The decoder concludes with the output of a `CEECNet` unit that has the same spatial dimensions as the first input layers, as well as the fused layers from the first `CEECNet` unit operation. Both of these layers are inserted into the segmentation `HEAD`.

#### 2.5.3. Segmentation `HEAD`

We use the `ResUNet-a` “causal” segmentation head, which has shown great performance in a variety of segmentation tasks [35,50], with two modifications.

The first modification of the `HEAD` relates to balancing the number of channels of the boundaries and distance transform predictions before re-using them in the final prediction of segmentation change detection. This is achieved by passing them through a convolution layer that brings the number of channels to the desired number. Balancing the number of channels treats the input features and the intermediate predictions as equal contributions to the final output. In Figure 6, we present schematically the conditioned multitasking head and the various dependencies between layers. Interested users can refer to [35] for details on the conditioned multitasking head.

#### 2.6. Experimental Design

All models, the `mantis` `CEECNet` V1, V2, and the `mantis` `FracTAL` `ResNet`, have an initial number of filters equal to `nf` = 32, and the depth of the encoder branches was equal to 6. We designate these models with `D6nf32`.

#### 2.6.1. LEVIRCD Dataset

#### 2.6.2. WHU Building Change Detection

#### 2.6.3. Data Preprocessing and Augmentation

We use `GroupNorm` [51] for all normalisation layers.

#### 2.6.4. Metrics

#### 2.6.5. Inference

#### 2.6.6. Inference on Large Rasters

#### 2.6.7. Model Selection Using Pareto Efficiency

## 3. Results

Unless otherwise stated, all reported models use a `FracTAL` depth of $d=5$, although this is not always the best-performing network.

#### 3.1. `FracTAL` Units and Evolving Loss Ablation Study

In this section, we compare the performance of the `FracTAL` `ResNet` [32,40] and `CEECNet` units we introduced against ResNet and CBAM [18] baselines, as well as the effect of the evolving ${L}^{\mathfrak{D}}=1-{\langle \mathcal{FT}\rangle}^{\mathfrak{D}}$ loss function on training a neural network. We also present a qualitative and quantitative analysis of the effect of the depth parameter in the `FracTAL`, based on the `mantis` `FracTAL` `ResNet` network.

#### 3.1.1. `FracTAL` Building Blocks Performance

We compare four networks that are identical in their `macro`-topological graph (backbone) but different in `micro`-topology (building blocks). The first two networks are equipped with two different versions of `CEECNet`: the first is identical to the one presented in Figure 4b; the second is similar to the one in Figure 4b, with all concatenation operations that are followed by normed convolutions replaced with Fusion operations, as described in Listing A.3. The third network uses the `FracTAL` ResNet building blocks (Figure 4a). Finally, the fourth network uses standard residual units as building blocks, as described in [32,40] (ResNet V2). All building blocks have the same dimensionality of input and output features. However, each type of building block has a different number of parameters. By keeping the dimensionality of input and output layers identical across all layers, we believe the performance differences of the networks will reflect the feature expression capabilities of the building blocks we compare.

The units equipped with the `FracTAL` outperform standard residual units. In particular, we find that the performance and convergence properties of the networks follow: `ResNet` < `FracTAL` `ResNet` < `CEECNet` V1 < `CEECNet` V2. The performance difference between `FracTAL` `ResNet` and `CEECNet` V1 will become more clearly apparent in the change detection datasets. The V2 version of `CEECNet`, which uses `Fusion` with relative attention (cyan solid line) instead of concatenation (V1: magenta dashed line) for combining layers in the Compress-Expand and Expand-Compress branches, is superior to V1. However, it is a computationally more intensive unit.

#### 3.1.2. Comparing `FracTAL` with CBAM

We compare the proposed `FracTAL` attention with a modern attention module, in particular the Convolutional Block Attention Module (`CBAM`) [18]. We construct two networks that are identical in all aspects except the implementation of the attention used. We base our implementation on a publicly available repository that reproduces the results of [18], written in PyTorch (https://github.com/luuuyi/CBAM.PyTorch, accessed on 1 February 2021), which we translated into the mxnet framework. From this implementation, we use the `CBAM-resnet34` model, and we compare it with a `FracTAL-resnet34` model, i.e., a model that is identical to the previous one, except that we replaced the `CBAM` attention with the `FracTAL` attention. Our results can be seen in Figure 10, where a clear performance improvement is evident merely by changing the attention layer used. The improvement is of the order of 1%, from 83.37% (`CBAM`) to 84.20% (`FracTAL`), suggesting that the `FracTAL` has better feature extraction capacity than the `CBAM` layer.

#### 3.1.3. Evolving Loss

We evaluate the evolving loss strategy on CIFAR10 using networks built from `CEECNet` `V1` units. The macro-topology of the networks is identical to the one in Table A1. In addition, we also demonstrate performance differences on the change detection task by training the `mantis` `CEECNet` V1 model on the LEVIRCD dataset with static and evolving loss strategies for `FracTAL` depth $d=5$.

For the `CEECNet` V1-based models, the evolution strategy is the same as above, with the difference that we use different depths for the $\mathcal{FT}$ loss (to observe potential differences); these are $\mathfrak{D}\in \{0,10,20\}$. Again, the difference in the validation accuracy is $\sim +0.22\%$ for the evolving loss strategy.
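For concreteness, the evolving schedule can be sketched as a step function of the number of learning-rate reductions (the depths $\{0,10,20\}$ are the values quoted above; tying the step to learning-rate reductions follows Section 2.2):

```python
# Evolving loss schedule: the fractal Tanimoto depth D of the loss
# L^D = 1 - <FT>^D is increased at each learning-rate reduction.
DEPTH_SCHEDULE = [0, 10, 20]

def loss_depth(num_lr_reductions):
    """Depth D in effect after the given number of learning-rate reductions."""
    idx = min(num_lr_reductions, len(DEPTH_SCHEDULE) - 1)
    return DEPTH_SCHEDULE[idx]

print([loss_depth(n) for n in range(4)])  # [0, 10, 20, 20]
```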

The evolving loss strategy also benefits the change detection task with `CEECNet` V1 units, as can be seen in Table 1. The `CEECNet` V1 unit, trained with the evolving loss strategy, demonstrates a +0.856% performance increase in Intersection over Union (IoU) and a +0.484% increase in MCC. Note that, for the same `FracTAL` depth, $d=5$, the `FracTAL` ResNet network trained with the evolving loss strategy performs better than the `CEECNet` V1 trained with the static loss strategy ($\mathfrak{D}=0$), while it falls behind the `CEECNet` V1 trained with the evolving loss strategy. We should also note that the performance increment is larger in comparison to the classification task on CIFAR10, reaching almost ∼1% for the IoU.

#### 3.1.4. Performance Dependence on `FracTAL` Depth

To understand how the `FracTAL` layer behaves with respect to different depths, we train three identical networks, the `mantis` `FracTAL` `ResNet` (`D6nf32`), using `FracTAL` depths in the range $d\in \{0,5,10\}$. The performance results on the LEVIRCD dataset can be seen in Table 1. The three networks perform similarly (they all achieve SOTA performance on the LEVIRCD dataset), with $d=10$ having top performance (+0.724% IoU), followed by $d=0$ (+0.332% IoU), and, lastly, the $d=5$ network (baseline). We conclude that the depth d is a hyperparameter dependent on the problem at hand that users of our method can choose to optimise. Given that all models have competitive performance, the proposed depth $d=5$ is a sensible choice.

We also visualise the extracted features for `FracTAL` depth $d=0$ (left panel) and $d=10$ (right panel). The features at different depths appear similar, all identifying the regions of interest clearly. To the human eye, in our opinion, the features for depth $d=10$ appear slightly more refined in comparison to those for depth $d=0$ (e.g., by comparing the images in the corresponding bottom rows). The entropy of the features for $d=0$ (entropy: 15.9982) is negligibly higher (+0.00625%) than for $d=10$ (entropy: 15.9972), suggesting the features of these two models have the same information content. We note that, from the perspective of information compression (assuming no loss of information), lower entropy values are favoured over higher values, as they indicate a better compression level.

#### 3.2. Change Detection Performance on LEVIRCD and WHU Datasets

#### 3.2.1. Performance on LEVIRCD

On the LEVIRCD dataset, both the `FracTAL` `ResNet` and `CEECNet` (V1, V2) outperform the baseline [1] with respect to the F1 score by ∼5%.

We present inference results of the `CEECNet` V1 algorithm for various images from the test set. For each row, from left to right, we have the input image at date 1, the input image at date 2, the ground truth mask, the inference (threshold = 0.5), and the algorithm’s confidence heat map (this should not be confused with statistical confidence). It is interesting to note that the algorithm has zero doubt in areas where buildings exist in both input images. That is, it is clear our algorithm identifies change in areas covered by buildings, and not building footprints. In Table 1, we present numerical performance results of the `FracTAL` `ResNet` as well as `CEECNet` V1 and V2. All metrics (precision, recall, F1, MCC, and IoU) are excellent. The `mantis` `CEECNet` for `FracTAL` depth $d=5$ outperforms the `mantis` `FracTAL` `ResNet` by a small numerical margin; however, the difference is clear. This difference can also be seen in the bottom panel of Figure 16. We should also note that a numerical difference in, say, the F1 score does not translate to an equal portion of quality difference in images. For example, a 1% difference in the F1 score may have a significant impact on the quality of inference. We further discuss this in Section 4.1. Overall, the best model is the `mantis` `CEECNet` V2 with `FracTAL` depth $d=5$. Second best is the `mantis` `FracTAL` `ResNet` with `FracTAL` depth $d=10$. Among the same set of models (`mantis` `FracTAL` `ResNet`), depth $d=10$ performs best; however, we do not know if this generalises to all models and datasets. We consider the `FracTAL` depth d a hyperparameter that needs to be fine-tuned for optimal performance, and, as we have shown, the choice $d=5$ is a sensible one, as in this particular dataset it provided us with state-of-the-art results.

#### 3.2.2. Performance on WHU

On the WHU dataset, we evaluate the `mantis` network with `FracTAL` ResNet and `CEECNet` V1 building blocks. Both of our proposed architectures outperform all other modelling frameworks, although we must stress that each of the other authors followed a different data-splitting strategy. With our splitting strategy, we used only 32.9% of the total area for training. This is significantly less than the majority of the other methods we report here, so we should anticipate a significant performance degradation in comparison with the other methods. In contrast, despite the relatively smaller training set, our method outperforms the other approaches. In particular, Ji et al. [28] used 50% of the raster for training and the other half for testing (Figure 10 in their manuscript). In addition, there is no spatial separation between their training and test sites, as there is in our case, and this should work to their advantage. Furthermore, the use of a larger window for training (their extracted chips have spatial dimension $512\times 512$) in principle increases performance because it includes more context information. There is a trade-off, though, in that a larger window size reduces the number of available training chips; therefore, the model sees fewer chips during training. Chen et al. [31] randomly split their training and validation chips. This should improve performance because there is a tight spatial correlation between two extracted chips in geospatial proximity. Cao et al. [57] used as a test set ∼20% of the total area of the WHU dataset; however, they do not specify the splitting strategy they followed for the training and validation sets. Finally, Liu et al. [58] used approximately ∼10% of the total area for reporting test score performance. They also do not mention their splitting strategy.

#### 3.3. The Effect of Scaled Sigmoid on the Segmentation HEAD

The `mantis` `CEECNet` V1 learns the following parameters that control how “crisp” the boundaries should be, or else, how sharp the decision boundary should be:

## 4. Discussion

Unless stated otherwise, the discussion of the `CEECNet` V1 and `FracTAL` `ResNet` models is for the case of `FracTAL` depth $d=5$.

#### 4.1. Qualitative `CEECNet` and `FracTAL` Performance

Although both `CEECNet` V1 and `FracTAL` `ResNet` achieve a very high MCC (Figure 16), the superiority of `CEECNet` for the same `FracTAL` depth $d=5$ is evident in the inference maps of both the LEVIRCD (Figure 17) and WHU (Figure 18) datasets. This confirms their relative scores (Table 1 and Table 2) and the faster convergence of `CEECNet` V1 (Figure 9). Interestingly, `CEECNet` V1 predicts change with more confidence than `FracTAL` `ResNet` (Figure 17 and Figure 18), even when it errs, as can be seen from the corresponding confidence heat maps. The decision on which of the models to use should be made with respect to the relative “cost” of training each model, the available hardware resources, and the performance target.

#### 4.2. Qualitative Assessment of the `Mantis` `Macro`-Topology

In this section, we provide visualisations from the `mantis` `FracTAL` `ResNet` model, trained on LEVIRCD with `FracTAL` depth $d=10$.

We visualise the relative attention layers `ratt12` (left panel) and `ratt21` (right panel) for a set of image patches belonging to the test set (size: $3\times 256\times 256$). Here, the notation `ratt12` indicates that the query features come from the input image at date $t_1$, while the key/value features are extracted from the input image at date $t_2$. Similar notation applies for the relative attention `ratt21`. Starting from the top left corner, we provide the input image at date $t_1$, the input image at date $t_2$, and the ground truth mask of change; after that, we visualise the features as single-channel images. Each feature (i.e., image per channel) is normalised in the range $(-1,1)$ for visualisation purposes. It can be seen that the algorithm attends, from the early stages (i.e., first layers), to structures containing buildings and their boundaries. In particular, `ratt12` (left panel) emphasises the boundaries of buildings that exist in both images. It also seems to represent all buildings that exist in both images. The `ratt21` layer (right panel) seems to emphasise more the buildings that exist on date 1 but not on date 2. In addition, in both relative attention layers, emphasis is given to roads and pavements.

## 5. Conclusions

- A novel set similarity coefficient, the fractal Tanimoto coefficient, that is derived from a variant of the Dice coefficient. This coefficient can provide finer detail of similarity at a desired level (up to a delta function), regulated by a temperature-like hyperparameter, d (Figure 2).
- A novel training loss scheme, where we use an evolving loss function that changes according to learning rate reductions. This helps avoid overfitting and allows for a small increase in performance (Figure 11a,b). In particular, this scheme provided a ∼0.25% increase in validation accuracy on CIFAR10 tests, and increases of ∼0.9% in IoU and ∼0.5% in MCC on the LEVIRCD dataset.
- A novel spatial and channel attention layer, the fractal Tanimoto Attention Layer (`FracTAL`; see Listing A.2), that uses the fractal Tanimoto similarity coefficient as a means of quantifying the similarity between query and key entries. This layer is memory efficient and scales well with the size of input features.
- A novel building block, the `FracTAL` `ResNet` (Figure 4a), that has a small memory footprint and excellent convergence and performance properties, outperforming standard ResNet building blocks.
- A corollary that follows from the introduced building blocks is a novel fusion methodology of layers and their corresponding attentions, both for self and relative attention, that improves performance (Figure 16). This methodology can be used as a direct replacement for concatenation in convolution neural networks.
- A novel macro-topology (backbone) architecture, the `mantis` topology (Figure 5), that combines the building blocks we developed and is able to consume images from two different dates and produce a single change detection layer. It should be noted that the same topology can be used in general segmentation problems, where we have two input images to a network that are somehow correlated and produce a semantic map. Moreover, it can be used for the fusion of features coming from different inputs (e.g., Digital Surface Maps and RGB images).

Our models, the `mantis` `FracTAL` `ResNet` and the `mantis` `CEECNet` V1 and V2, outperform other proposed networks and achieve state-of-the-art results on the LEVIRCD [1] and the WHU [36] building change detection datasets (Table 1 and Table 2). Note that there is no standardised test set for the WHU dataset; therefore, relative performance is indicative but not absolute: it depends on the train/test split that other researchers have performed. However, we used only 32.9% of the area of the provided data for training (much less than the other methods we compared against), and this demonstrates the robustness and generalisation abilities of our algorithm.

## Author Contributions

## Funding

## Acknowledgments

## Conflicts of Interest

## Appendix A. CIFAR10 Comparison Network Characteristics

**Table A1.** `CEECNet` V1 vs. `CEECNet` V2 vs. `FracTAL` `ResNet` vs. `ResNet` building blocks comparison. All building blocks use kernel size `k = 3`, padding `p = 1` (SAME), and stride `s = 1`. The transition convolutions that halve the size of the features use the same kernel size and padding; however, the stride is `s = 2`. In the following, we indicate with `nf` the number of output channels of the convolution layers and with `nh` the number of heads in the multihead `FracTAL` module.

| Layers | Proposed Models | ResNet |
|---|---|---|
| Layer 1 | BBlock[nf = 64, nh = 8] | BBlock[nf = 64] |
| Layer 2 | BBlock[nf = 64, nh = 8] | BBlock[nf = 64] |
| Layer 3 | Conv2DN(nf = 128, s = 2) | Conv2DN(nf = 128, s = 2) |
| Layer 4 | BBlock[nf = 128, nh = 16] | BBlock[nf = 128] |
| Layer 5 | BBlock[nf = 128, nh = 16] | BBlock[nf = 128] |
| Layer 6 | Conv2DN(nf = 256, s = 2) | Conv2DN(nf = 256, s = 2) |
| Layer 7 | BBlock[nf = 256, nh = 32] | BBlock[nf = 256] |
| Layer 8 | BBlock[nf = 256, nh = 32] | BBlock[nf = 256] |
| Layer 9 | ReLU | ReLU |
| Layer 10 | DenseN(nf = 4096) | DenseN(nf = 4096) |
| Layer 11 | ReLU | ReLU |
| Layer 12 | DenseN(nf = 512) | DenseN(nf = 512) |
| Layer 13 | ReLU | ReLU |
| Layer 14 | DenseN(nf = 10) | DenseN(nf = 10) |

## Appendix B. Inference across WHU Test Set

Inference across the whole WHU test set, produced with the `mantis` `CEECNet` V1 D6nf32 model, can be seen in Figure A1. The predictions match the ground truth very closely.

**Figure A1.** Inference across the whole test area over the NZBLDG CD dataset using the `mantis` `CEECNet` V1 D6nf32 model. From left to right: 2011 input image, 2016 input image, ground truth, prediction (threshold 0.5), and confidence heat map.

## Appendix C. Algorithms

In this appendix, we present the `FracTAL` and associated modules with mxnet style pseudocode. In all the listings presented, `Conv2DN` is a sequential combination of a 2D convolution followed by a normalisation layer. When the batch size per GPU is very small due to GPU memory constraints (e.g., fewer than 4 samples per GPU), the normalisation used was Group Normalisation [51]. In practice, in all `mantis` `CEECNet` realisations for change detection, we used GroupNorm.

#### Appendix C.1. Fractal Tanimoto Attention 2D Module

**Listing A.1.** mxnet/gluon style pseudocode for the fractal Tanimoto coefficient, predefined for spatial similarity.

```python
from mxnet.gluon import nn

class FTanimoto(nn.Block):
    def __init__(self, depth=5, axis=[2, 3], **kwards):
        super().__init__(**kwards)
        self.depth = depth
        self.axis = axis

    def inner_prod(self, prob, label):
        prdct = prob * label  # dim: (B,C,H,W)
        prdct = prdct.sum(axis=self.axis, keepdims=True)
        return prdct  # dim: (B,C,1,1)

    def forward(self, prob, label):
        a = 2.**self.depth
        b = -(2.*a - 1.)
        tpl = self.inner_prod(prob, label)
        tpp = self.inner_prod(prob, prob)
        tll = self.inner_prod(label, label)
        denum = a*(tpp + tll) + b*tpl
        ftnmt = tpl/denum
        return ftnmt  # dim: (B,C,1,1)
```
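As a quick sanity check on the formula above, the coefficient can be re-implemented in plain NumPy (an illustrative sketch, not part of the paper's mxnet listings). At depth $d=0$ it reduces to the standard Tanimoto coefficient, a perfect overlap scores 1 at every depth, and an imperfect overlap is penalised more strongly as the depth grows, which is the steepening-towards-optimality behaviour described in Section 2.1.

```python
import numpy as np

def ftanimoto(p, l, depth=0, axis=(2, 3)):
    # Fractal Tanimoto at a single depth d, as in Listing A.1:
    # FT^d = <p,l> / (2^d (<p,p> + <l,l>) - (2^{d+1} - 1) <p,l>)
    a = 2.0 ** depth
    b = -(2.0 * a - 1.0)
    tpl = np.sum(p * l, axis=axis, keepdims=True)
    tpp = np.sum(p * p, axis=axis, keepdims=True)
    tll = np.sum(l * l, axis=axis, keepdims=True)
    return tpl / (a * (tpp + tll) + b * tpl)

p = np.full((1, 1, 2, 2), 0.8)  # predictions, dim: (B,C,H,W)
l = np.full((1, 1, 2, 2), 0.5)  # labels
# perfect overlap scores 1 at every depth ...
assert np.allclose(ftanimoto(l, l, depth=7), 1.0)
# ... while imperfect overlap is penalised more heavily as depth grows
assert ftanimoto(p, l, depth=5).item() < ftanimoto(p, l, depth=0).item() < 1.0
```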

**Listing A.2.** mxnet/gluon style pseudocode for the `FTAttention2D` module, combining spatial and channel fractal Tanimoto attention.

```python
from mxnet import nd as F
from mxnet.gluon import nn

class FTAttention2D(nn.Block):
    def __init__(self, nchannels, nheads, **kwards):
        super().__init__(**kwards)
        self.q = Conv2DN(nchannels, groups=nheads)
        self.k = Conv2DN(nchannels, groups=nheads)
        self.v = Conv2DN(nchannels, groups=nheads)
        # spatial/channel similarity
        self.SpatialSim = FTanimoto(axis=[2, 3])
        self.ChannelSim = FTanimoto(axis=1)
        self.norm = nn.BatchNorm()

    def forward(self, qin, kin, vin):  # query, key, value
        q = F.sigmoid(self.q(qin))  # dim: (B,C,H,W)
        k = F.sigmoid(self.k(kin))  # dim: (B,C,H,W)
        v = F.sigmoid(self.v(vin))  # dim: (B,C,H,W)
        # similarity across channels gives a spatial attention map
        att_spat = self.ChannelSim(q, k)  # dim: (B,1,H,W)
        v_spat = att_spat * v  # dim: (B,C,H,W)
        # similarity across spatial axes gives channel attention weights
        att_chan = self.SpatialSim(q, k)  # dim: (B,C,1,1)
        v_chan = att_chan * v  # dim: (B,C,H,W)
        v_cspat = 0.5*(v_chan + v_spat)
        v_cspat = self.norm(v_cspat)
        return v_cspat  # dim: (B,C,H,W)
```
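The axis bookkeeping in this module is easy to get wrong: reducing the similarity over the channel axis yields one value per spatial location (the spatial attention map), while reducing over the spatial axes yields one value per channel (the channel attention weights). A hypothetical NumPy shape check at depth 0 (i.e., the standard Tanimoto):

```python
import numpy as np

def tanimoto(p, l, axis):
    # depth-0 fractal Tanimoto (= standard Tanimoto), reduced over `axis`
    tpl = np.sum(p * l, axis=axis, keepdims=True)
    tpp = np.sum(p * p, axis=axis, keepdims=True)
    tll = np.sum(l * l, axis=axis, keepdims=True)
    return tpl / (tpp + tll - tpl)

q = np.random.rand(2, 8, 4, 4)  # (B, C, H, W)
k = np.random.rand(2, 8, 4, 4)
assert tanimoto(q, k, axis=1).shape == (2, 1, 4, 4)       # spatial attention map
assert tanimoto(q, k, axis=(2, 3)).shape == (2, 8, 1, 1)  # channel attention weights
```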

**Listing A.3.** mxnet/gluon style pseudocode for the relative attention `Fusion` module.

```python
import mxnet as mx
from mxnet import nd as F
from mxnet.gluon import nn

class Fusion(nn.Block):
    def __init__(self, nchannels, nheads, **kwards):
        super().__init__(**kwards)
        self.fuse = Conv2DN(nchannels, kernel=3, padding=1, groups=nheads)
        self.att12 = FTAttention2D(nchannels, nheads)
        self.att21 = FTAttention2D(nchannels, nheads)
        self.gamma1 = self.params.get('gamma1', shape=(1,), init=mx.init.Zero())
        self.gamma2 = self.params.get('gamma2', shape=(1,), init=mx.init.Zero())

    def forward(self, input1, input2):
        ones = F.ones_like(input1)
        # Attention on 1, with k,v from 2
        att12 = self.att12(input1, input2, input2)
        out12 = input1*(ones + self.gamma1*att12)
        # Attention on 2, with k,v from 1
        att21 = self.att21(input2, input1, input1)
        out21 = input2*(ones + self.gamma2*att21)
        out = F.concat(out12, out21, dim=1)
        out = self.fuse(out)
        return out
```
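Note that the `gamma` parameters in the listing above are initialised to zero, so each gated branch, `out = input * (1 + gamma * att)`, starts out as the identity on its own input; relative attention is blended in only as the `gamma` parameters are learned. A minimal NumPy sketch of this gating arithmetic (hypothetical helper name, not the mxnet module):

```python
import numpy as np

def gated(x, att, gamma):
    # out = x * (1 + gamma * att), as used for out12/out21 in the Fusion block
    return x * (np.ones_like(x) + gamma * att)

x = np.random.rand(1, 4, 8, 8)    # features, dim: (B,C,H,W)
att = np.random.rand(1, 4, 8, 8)  # attention values in [0, 1)
assert np.allclose(gated(x, att, 0.0), x)  # identity at initialisation (gamma = 0)
```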

#### Appendix C.2. FracTALResNet

The `ResBlock` consists of the sequence `BatchNorm`, `ReLU`, `Conv2D`, `BatchNorm`, `ReLU`, and `Conv2D`. The normalisation can change to `GroupNorm` for a small batch size.

```python
import mxnet as mx
from mxnet import nd as F
from mxnet.gluon import nn

class FTAttResUnit(nn.Block):
    def __init__(self, nchannels, nheads, **kwards):
        super().__init__(**kwards)
        # Residual Block: sequence of (BN, ReLU, Conv, BN, ReLU, Conv)
        self.ResBlock = ResBlock(nchannels, kernel=3, padding=1)
        self.att = FTAttention2D(nchannels, nheads)
        self.gamma = self.params.get('gamma', shape=(1,), init=mx.init.Zero())

    def forward(self, input):
        out = self.ResBlock(input)  # dim: (B,C,H,W)
        att = self.att(input, input, input)  # dim: (B,C,H,W)
        att = self.gamma * att
        out = (input + out)*(F.ones_like(out) + att)
        return out
```

#### Appendix C.3. CEECNet Building Blocks

Here, we describe the `CEECNet` V1 unit with pseudocode.

```python
import mxnet as mx
from mxnet import nd as F
from mxnet.gluon import nn

class CEECNet_unit_V1(nn.Block):
    def __init__(self, nchannels, nheads, **kwards):
        super().__init__(**kwards)
        # Compress-Expand branch
        self.conv1 = Conv2DN(nchannels/2)
        self.compr11 = Conv2DN(nchannels, k=3, p=1, s=2)
        self.compr12 = Conv2DN(nchannels, k=3, p=1, s=1)
        self.expand1 = ExpandNCombine(nchannels/2)
        # Expand-Compress branch
        self.conv2 = Conv2DN(nchannels/2)
        self.expand2 = Expand(nchannels/4)
        self.compr21 = Conv2DN(nchannels/2, k=3, p=1, s=2)
        self.compr22 = Conv2DN(nchannels/2, k=3, p=1, s=1)
        self.collect = Conv2DN(nchannels, k=3, p=1, s=1)
        self.att = FTAttention2D(nchannels, nheads)
        self.ratt12 = RelFTAttention2D(nchannels, nheads)
        self.ratt21 = RelFTAttention2D(nchannels, nheads)
        self.gamma1 = self.params.get('gamma1', shape=(1,), init=mx.init.Zero())
        self.gamma2 = self.params.get('gamma2', shape=(1,), init=mx.init.Zero())
        self.gamma3 = self.params.get('gamma3', shape=(1,), init=mx.init.Zero())

    def forward(self, input):
        # Compress-Expand
        out10 = self.conv1(input)
        out1 = self.compr11(out10)
        out1 = F.relu(out1)
        out1 = self.compr12(out1)
        out1 = F.relu(out1)
        out1 = self.expand1(out1, out10)
        out1 = F.relu(out1)
        # Expand-Compress
        out20 = self.conv2(input)
        out2 = self.expand2(out20)
        out2 = F.relu(out2)
        out2 = self.compr21(out2)
        out2 = F.relu(out2)
        out2 = F.concat(out2, out20, dim=1)
        out2 = self.compr22(out2)
        out2 = F.relu(out2)
        # self attention
        att = self.gamma1*self.att(input, input, input)
        # relative attention 122: query from branch 1, key/value from branch 2
        ratt12 = self.gamma2*self.ratt12(out1, out2, out2)
        # relative attention 211: query from branch 2, key/value from branch 1
        ratt21 = self.gamma3*self.ratt21(out2, out1, out1)
        ones1 = F.ones_like(out10)  # nchannels/2
        out122 = out1*(ones1 + ratt12)
        out211 = out2*(ones1 + ratt21)
        out12 = F.concat(out122, out211, dim=1)
        out12 = self.collect(out12)
        out12 = F.relu(out12)
        # Final fusion
        ones2 = F.ones_like(input)
        out = (input + out12)*(ones2 + att)
        return out
```

```python
import mxnet as mx
from mxnet import nd as F
from mxnet.gluon import nn

class Expand(nn.Block):
    def __init__(self, nchannels, nheads, **kwards):
        super().__init__(**kwards)
        self.conv1 = Conv2DN(nchannels, k=3, p=1, groups=nheads)
        self.conv2 = Conv2DN(nchannels, k=3, p=1, groups=nheads)

    def forward(self, input):
        out = F.BilinearResize2D(input, scale_height=2, scale_width=2)
        out = self.conv1(out)
        out = F.relu(out)
        out = self.conv2(out)
        out = F.relu(out)
        return out
```

```python
import mxnet as mx
from mxnet import nd as F
from mxnet.gluon import nn

class ExpandNCombine(nn.Block):
    def __init__(self, nchannels, nheads, **kwards):
        super().__init__(**kwards)
        self.conv1 = Conv2DN(nchannels, k=3, p=1, groups=nheads)
        self.conv2 = Conv2DN(nchannels, k=3, p=1, groups=nheads)

    def forward(self, input1, input2):
        # input1 has lower spatial dimensions
        out1 = F.BilinearResize2D(input1, scale_height=2, scale_width=2)
        out1 = self.conv1(out1)
        out1 = F.relu(out1)
        out2 = F.concat(out1, input2, dim=1)
        out2 = self.conv2(out2)
        out2 = F.relu(out2)
        return out2
```

## Appendix D. Software Implementation and Training Characteristics

The `mantis` `CEECNet` and `mantis` `FracTAL` `ResNet` models were built and trained using the mxnet deep learning library [60] under the GLUON API. Each of the models was trained with an effective batch size of ∼256 on 16 nodes, each containing 4 NVIDIA Tesla P100 GPUs, in CSIRO HPC facilities. Due to the memory footprint of the network, the batch size in a single GPU iteration cannot be made larger than ∼4 (per GPU). The models were trained in a distributed scheme using the ring allreduce algorithm and, in particular, its Horovod [61] implementation for the mxnet [60] deep learning library. For all models, we used the Adam [56] optimiser, with momentum parameters $({\beta}_{1},{\beta}_{2})=(0.9,0.999)$. The learning rate was reduced by an order of magnitude whenever the validation loss stopped decreasing; overall, we reduced the learning rate three times. The depth, $\mathfrak{D}$, of the evolving loss function was increased every time the learning rate was reduced. The depths of the ${\langle \mathcal{FT}\rangle}^{\mathfrak{D}}$ that we used were $\mathfrak{D}\in \{0,10,20,30\}$. The training time for each of the models presented here was approximately 4 days.
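The coupled schedule described above (reduce the learning rate by an order of magnitude when the validation loss plateaus, and advance the loss depth $\mathfrak{D}$ through $\{0,10,20,30\}$ at the same time) can be sketched as follows. This is an illustrative helper with assumed names and a fixed patience window, not the actual training driver used for the paper:

```python
class EvolvingLossSchedule:
    """Joint learning-rate / fractal-Tanimoto-depth schedule (illustrative)."""

    def __init__(self, lr=1e-3, depths=(0, 10, 20, 30), patience=10):
        self.lr = lr
        self.depths = list(depths)
        self.stage = 0            # index into depths; also counts lr reductions
        self.patience = patience  # epochs without improvement before reducing
        self.best = float("inf")
        self.wait = 0

    @property
    def depth(self):
        return self.depths[self.stage]

    def step(self, val_loss):
        # Returns True when a reduction/deepening event fired this epoch.
        if val_loss < self.best:
            self.best, self.wait = val_loss, 0
            return False
        self.wait += 1
        if self.wait >= self.patience and self.stage + 1 < len(self.depths):
            self.stage += 1
            self.lr /= 10.0       # reduce learning rate by an order of magnitude
            self.wait = 0         # ... and sharpen the loss (larger depth D)
            return True
        return False

sched = EvolvingLossSchedule(lr=1e-3, patience=2)
for loss in (1.0, 1.0, 1.0):      # a validation-loss plateau
    fired = sched.step(loss)
assert fired and sched.depth == 10 and abs(sched.lr - 1e-4) < 1e-12
```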

## References

- Chen, H.; Shi, Z. A Spatial-Temporal Attention-Based Method and a New Dataset for Remote Sensing Image Change Detection. Remote Sens.
**2020**, 12, 1662. [Google Scholar] [CrossRef] - Giustarini, L.; Hostache, R.; Matgen, P.; Schumann, G.J.P.; Bates, P.D.; Mason, D.C. A change detection approach to flood mapping in urban areas using TerraSAR-X. IEEE Trans. Geosci. Remote Sens.
**2012**, 51, 2417–2430. [Google Scholar] [CrossRef] [Green Version] - Morton, D.C.; DeFries, R.S.; Shimabukuro, Y.E.; Anderson, L.O.; Del Bon Espírito-Santo, F.; Hansen, M.; Carroll, M. Rapid assessment of annual deforestation in the Brazilian Amazon using MODIS data. Earth Interact.
**2005**, 9, 1–22. [Google Scholar] [CrossRef] [Green Version] - Löw, F.; Prishchepov, A.V.; Waldner, F.; Dubovyk, O.; Akramkhanov, A.; Biradar, C.; Lamers, J. Mapping cropland abandonment in the Aral Sea Basin with MODIS time series. Remote Sens.
**2018**, 10, 159. [Google Scholar] [CrossRef] [Green Version] - Caye Daudt, R.; Le Saux, B.; Boulch, A.; Gousseau, Y. Multitask learning for large-scale semantic change detection. Comput. Vis. Image Underst.
**2019**, 187, 102783. [Google Scholar] [CrossRef] [Green Version] - Varghese, A.; Gubbi, J.; Ramaswamy, A.; Balamuralidhar, P. ChangeNet: A Deep Learning Architecture for Visual Change Detection. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops, Munich, Germany, 8–14 September 2018. [Google Scholar]
- Lu, D.; Mausel, P.; Brondizio, E.; Moran, E. Change detection techniques. Int. J. Remote Sens.
**2004**, 25, 2365–2401. [Google Scholar] [CrossRef] - Coppin, P.; Jonckheere, I.; Nackaerts, K.; Muys, B.; Lambin, E. Digital change detection methods in ecosystem monitoring: A review. Int. J. Remote Sens.
**2004**, 25, 1565–1596. [Google Scholar] [CrossRef] - Tewkesbury, A.P.; Comber, A.J.; Tate, N.J.; Lamb, A.; Fisher, P.F. A critical synthesis of remotely sensed optical image change detection techniques. Remote Sens. Environ.
**2015**, 160, 1–14. [Google Scholar] [CrossRef] [Green Version] - Hussain, M.; Chen, D.; Cheng, A.; Wei, H.; Stanley, D. Change detection from remotely sensed images: From pixel-based to object-based approaches. ISPRS J. Photogramm. Remote Sens.
**2013**, 80, 91–106. [Google Scholar] [CrossRef] - Treisman, A.M.; Gelade, G. A feature-integration theory of attention. Cogn. Psychol.
**1980**, 12, 97–136. [Google Scholar] [CrossRef] - Bahdanau, D.; Cho, K.; Bengio, Y. Neural Machine Translation by Jointly Learning to Align and Translate. arXiv
**2014**, arXiv:1409.0473. [Google Scholar] - Cho, K.; van Merrienboer, B.; Bahdanau, D.; Bengio, Y. On the Properties of Neural Machine Translation: Encoder-Decoder Approaches. arXiv
**2014**, arXiv:1409.1259. [Google Scholar] - Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. arXiv
**2017**, arXiv:1706.03762. [Google Scholar] - Hu, J.; Shen, L.; Sun, G. Squeeze-and-Excitation Networks. arXiv
**2017**, arXiv:1709.01507. [Google Scholar] - Wang, X.; Girshick, R.B.; Gupta, A.; He, K. Non-local Neural Networks. arXiv
**2017**, arXiv:1711.07971. [Google Scholar] - Chen, L.; Zhang, H.; Xiao, J.; Nie, L.; Shao, J.; Chua, T. SCA-CNN: Spatial and Channel-wise Attention in Convolutional Networks for Image Captioning. arXiv
**2016**, arXiv:1611.05594. [Google Scholar] - Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y., Eds.; Springer International Publishing: Cham, Switzerland, 2018; pp. 3–19. [Google Scholar]
- Bello, I.; Zoph, B.; Vaswani, A.; Shlens, J.; Le, Q.V. Attention Augmented Convolutional Networks. arXiv
**2019**, arXiv:1904.09925. [Google Scholar] - Katharopoulos, A.; Vyas, A.; Pappas, N.; Fleuret, F. Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention. arXiv
**2020**, arXiv:2006.16236. [Google Scholar] - Li, R.; Su, J.; Duan, C.; Zheng, S. Multistage Attention ResU-Net for Semantic Segmentation of Fine-Resolution Remote Sensing Images. arXiv
**2020**, arXiv:2011.14302. [Google Scholar] - Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16 × 16 Words: Transformers for Image Recognition at Scale. arXiv
**2020**, arXiv:2010.11929. [Google Scholar] - Sakurada, K.; Okatani, T. Change Detection from a Street Image Pair using CNN Features and Superpixel Segmentation. In Proceedings of the BMVC, Swansea, UK, 7–10 September 2015. [Google Scholar]
- Alcantarilla, P.F.; Stent, S.; Ros, G.; Arroyo, R.; Gherardi, R. Street-View Change Detection with Deconvolutional Networks. Robot. Sci. Syst.
**2016**. [Google Scholar] [CrossRef] - Guo, E.; Fu, X.; Zhu, J.; Deng, M.; Liu, Y.; Zhu, Q.; Li, H. Learning to Measure Change: Fully Convolutional Siamese Metric Networks for Scene Change Detection. arXiv
**2018**, arXiv:1810.09111. [Google Scholar] - Asokan, A.; Anitha, J. Change detection techniques for remote sensing applications: A survey. Earth Sci. Inform.
**2019**, 12, 143–160. [Google Scholar] [CrossRef] - Shi, W.; Zhang, M.; Zhang, R.; Chen, S.; Zhan, Z. Change Detection Based on Artificial Intelligence: State-of-the-Art and Challenges. Remote Sens.
**2020**, 12, 1688. [Google Scholar] [CrossRef] - Ji, S.; Shen, Y.; Lu, M.; Zhang, Y. Building Instance Change Detection from Large-Scale Aerial Images using Convolutional Neural Networks and Simulated Samples. Remote Sens.
**2019**, 11, 1343. [Google Scholar] [CrossRef] [Green Version] - He, K.; Gkioxari, G.; Dollár, P.; Girshick, R.B. Mask R-CNN. arXiv
**2017**, arXiv:1703.06870. [Google Scholar] - Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. arXiv
**2015**, arXiv:1505.04597. [Google Scholar] - Chen, J.; Yuan, Z.; Peng, J.; Chen, L.; Huang, H.; Zhu, J.; Liu, Y.; Li, H. DASNet: Dual Attentive Fully Convolutional Siamese Networks for Change Detection in High-Resolution Satellite Images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens.
**2021**, 14, 1194–1206. [Google Scholar] [CrossRef] - He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. arXiv
**2015**, arXiv:1512.03385. [Google Scholar] - Jiang, H.; Hu, X.; Li, K.; Zhang, J.; Gong, J.; Zhang, M. PGA-SiamNet: Pyramid Feature-Based Attention-Guided Siamese Network for Remote Sensing Orthoimagery Building Change Detection. Remote Sens.
**2020**, 12, 484. [Google Scholar] [CrossRef] [Green Version] - Lu, X.; Wang, W.; Ma, C.; Shen, J.; Shao, L.; Porikli, F. See More, Know More: Unsupervised Video Object Segmentation with Co-Attention Siamese Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019. [Google Scholar]
- Diakogiannis, F.I.; Waldner, F.; Caccetta, P.; Wu, C. ResUNet-a: A deep learning framework for semantic segmentation of remotely sensed data. ISPRS J. Photogramm. Remote Sens.
**2020**, 162, 94–114. [Google Scholar] [CrossRef] [Green Version] - Ji, S.; Wei, S.; Lu, M. Fully Convolutional Networks for Multisource Building Extraction From an Open Aerial and Satellite Imagery Data Set. IEEE Trans. Geosci. Remote Sens.
**2019**, 57, 574–586. [Google Scholar] [CrossRef] - Zhang, A.; Lipton, Z.C.; Li, M.; Smola, A.J. Dive into Deep Learning. 2020. Available online: https://d2l.ai (accessed on 1 January 2021).
- Kim, Y.; Denton, C.; Hoang, L.; Rush, A.M. Structured Attention Networks. arXiv
**2017**, arXiv:1702.00887. [Google Scholar] - Zhang, H.; Goodfellow, I.; Metaxas, D.; Odena, A. Self-Attention Generative Adversarial Networks. arXiv
**2018**, arXiv:1805.08318. [Google Scholar] - He, K.; Zhang, X.; Ren, S.; Sun, J. Identity Mappings in Deep Residual Networks. arXiv
**2016**, arXiv:1603.05027. [Google Scholar] - Ioffe, S.; Szegedy, C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. arXiv
**2015**, arXiv:1502.03167. [Google Scholar] - Newell, A.; Yang, K.; Deng, J. Stacked Hourglass Networks for Human Pose Estimation. arXiv
**2016**, arXiv:1603.06937. [Google Scholar] - Liu, J.; Wang, S.; Hou, X.; Song, W. A deep residual learning serial segmentation network for extracting buildings from remote sensing imagery. Int. J. Remote Sens.
**2020**, 41, 5573–5587. [Google Scholar] [CrossRef] - Qin, X.; Zhang, Z.; Huang, C.; Dehghan, M.; Zaiane, O.R.; Jagersand, M. U2-Net: Going deeper with nested U-structure for salient object detection. Pattern Recognit.
**2020**, 106, 107404. [Google Scholar] [CrossRef] - Lindeberg, T. Scale-Space Theory in Computer Vision; Kluwer Academic Publishers: Norwell, MA, USA, 1994; ISBN 978-0-7923-9418-1. [Google Scholar]
- Wang, Z.; Chen, J.; Hoi, S.C.H. Deep Learning for Image Super-resolution: A Survey. arXiv
**2019**, arXiv:1902.06068. [Google Scholar] [CrossRef] [Green Version] - Tschannen, M.; Bachem, O.; Lucic, M. Recent Advances in Autoencoder-Based Representation Learning. arXiv
**2018**, arXiv:1812.05069. [Google Scholar] - Kingma, D.P.; Welling, M. An Introduction to Variational Autoencoders. arXiv
**2019**, arXiv:1906.02691. [Google Scholar] - Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid Scene Parsing Network. In Proceedings of the CVPR, Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
- Waldner, F.; Diakogiannis, F.I. Deep learning on edge: Extracting field boundaries from satellite images with a convolutional neural network. Remote Sens. Environ.
**2020**, 245, 111741. [Google Scholar] [CrossRef] - Wu, Y.; He, K. Group Normalization. arXiv
**2018**, arXiv:1803.08494. [Google Scholar] - Haghighi, S.; Jasemi, M.; Hessabi, S.; Zolanvari, A. PyCM: Multiclass confusion matrix library in Python. J. Open Source Softw.
**2018**, 3, 729. [Google Scholar] [CrossRef] [Green Version] - Matthews, B. Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochim. Biophys. Acta (BBA) Protein Struct.
**1975**, 405, 442–451. [Google Scholar] [CrossRef] - Emmerich, M.T.; Deutz, A.H. A Tutorial on Multiobjective Optimization: Fundamentals and Evolutionary Methods. Nat. Comput. Int. J.
**2018**, 17, 585–609. [Google Scholar] [CrossRef] [Green Version] - Krizhevsky, A. Learning Multiple Layers of Features from Tiny Images; Technical Report; Citeseer: Princeton, NJ, USA, 2009. [Google Scholar]
- Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv
**2014**, arXiv:1412.6980. [Google Scholar] - Cao, Z.; Wu, M.; Yan, R.; Zhang, F.; Wan, X. Detection of Small Changed Regions in Remote Sensing Imagery Using Convolutional Neural Network. IOP Conf. Ser. Earth Environ. Sci.
**2020**, 502, 012017. [Google Scholar] [CrossRef] - Liu, Y.; Pang, C.; Zhan, Z.; Zhang, X.; Yang, X. Building Change Detection for Remote Sensing Images Using a Dual Task Constrained Deep Siamese Convolutional Network Model. arXiv
**2019**, arXiv:1909.07726. [Google Scholar] - Waldner, F.; Diakogiannis, F.I.; Batchelor, K.; Ciccotosto-Camp, M.; Cooper-Williams, E.; Herrmann, C.; Mata, G.; Toovey, A. Detect, Consolidate, Delineate: Scalable Mapping of Field Boundaries Using Satellite Images. Remote Sens.
**2021**, 13, 2197. [Google Scholar] [CrossRef] - Chen, T.; Li, M.; Li, Y.; Lin, M.; Wang, N.; Wang, M.; Xiao, T.; Xu, B.; Zhang, C.; Zhang, Z. Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv
**2015**, arXiv:1512.01274. [Google Scholar] - Sergeev, A.; Balso, M.D. Horovod: Fast and easy distributed deep learning in TensorFlow. arXiv
**2018**, arXiv:1802.05799. [Google Scholar]

**Figure 1.** Example of the proposed framework's change detection performance on the LEVIRCD test set [1] (architecture: `mantis` `CEECNet` V1). From left to right: input image at date 1, input image at date 2, ground truth building change mask, and colour-coded true negative (`tn`), true positive (`tp`), false positive (`fp`), and false negative (`fn`) predictions.

**Figure 2.**Fractal Tanimoto similarity measure. In the top row, we plot the two-dimensional density maps for the $\mathcal{FT}$ similarity coefficient. From left to right, the depths are $d\in \{0,3,5\}$. The last column corresponds to the average of values up to depth $d=5$, i.e., ${\langle \mathcal{FT}\rangle}^{5}=(1/5){\sum}_{d}{\mathcal{FT}}^{d}$. In the bottom figure, we represent the same values in 3D. The horizontal contour plot at $z=1$ corresponds to the Laplacian of the $\mathcal{FT}$. It is observed that as the depth, d, of the iteration increases, the function becomes steeper towards optimality.

**Figure 3.**Fractal Tanimoto similarity measure with noise. On the top row, from left to right, is the ${\mathcal{FT}}^{0}(\mathbf{p},\mathbf{l})$, and $(1/10){\sum}_{i=0}^{9}\left({\mathcal{FT}}^{i}(\mathbf{p},\mathbf{l})\right)$. The bottom row is the same corresponding ${\mathcal{FT}}^{d}(\mathbf{p},\mathbf{l})$ similarity measures with Gaussian random noise added. When the algorithmic training approaches optimality with the standard Tanimoto, local noise gradients tend to dominate over the background average gradient. Increasing the slope of the background gradient at later stages of training is a remedy to this problem.

**Figure 4.** Left panel (**a**): the `FracTAL` Residual unit. This building block demonstrates the fusion of the residual block with the `FracTAL` self attention layer evaluated from the input features. Right panel (**b**): the `CEECNet` (Compress–Expand/Expand–Compress) feature extraction units. The symbol ⊎ represents concatenation of features along the channel dimension (for V1). For version V2, we replace all of the concatenation operations, ⊎, followed by the normalised convolution layer with relative fusion attention layers, as described in Section 2.4.3 (see also Listing A.3).

**Figure 5.** The `mantis` `CEECNet` V1 architecture for the task of change detection. The Fusion operation (`FUSE`) is described in detail with mxnet/gluon style pseudocode in Listing A.3.

**Figure 6.** Conditioned multitasking segmentation `HEAD`. Here, features 1 and 2 are the outputs of the `mantis` `CEECNet` feature extractor. The symbol ⊎ represents concatenation along the channel dimension. The algorithm first predicts the distance transform of the classes (regression), then re-uses this information to estimate the boundaries, and finally, both of these predictions are re-used for the change prediction layer. Here, `Chng Segm` stands for the change segmentation layer and `mtsk` for multitasking predictions.

**Figure 7.**Train–validation–test split of the WHU dataset. The yellow (dash-dot line) rectangle represents the training data. The area between the magenta (solid line) and the yellow (dash-dot) rectangles represents the validation data. Finally, the cyan rectangle (dashed) is the test data. The reasoning for our split is to include in the validation data both industrial and residential areas and isolate (spatially) the training area from the test area in order to avoid spurious spatial correlation between training/test sites. The train–validation–test ratio split is $\mathtt{train}:\mathtt{val}:\mathtt{test}\approx 33:37:30$.

**Figure 8.**Pareto front selection after the last reduction in the learning rate. The bottom panel designates with open cyan circles the two points that are equivalent in terms of quality prediction when both MCC and $\langle \mathcal{FT}\rangle $ are taken into account. The top two panels show the corresponding evolutions of these measures during training. There, the Pareto optimal points are designated with full circle dots (cyan).

**Figure 9.** Comparison of the V1 and V2 versions of the `CEECNet` building blocks with a `FracTAL` ResNet implementation and standard ResNet V2 building blocks. The models were trained for 300 epochs on CIFAR10 with standard cross-entropy loss.

**Figure 10.** Performance improvement of the `FracTAL`-`resnet34` over the `CBAM-resnet34`: replacing the `CBAM` attention layers with `FracTAL` ones in two otherwise identical networks results in a 1% performance improvement.

**Figure 11.** Experimental results with the evolving loss strategy on the CIFAR10 dataset. Left panel (**a**): training of two classification networks with static and evolving loss strategies. The two networks have identical `macro`-topologies but different `micro`-topologies. The first network (top) uses standard Residual units for its building blocks, while the second (bottom) uses `CEECNetV1` units. The networks are trained with a static $\mathcal{FT}$ ($\mathfrak{D}=0$) loss strategy and an evolving one, in which we increase the depth $\mathfrak{D}$ of the ${L}^{\mathfrak{D}}=1-{\langle \mathcal{FT}\rangle}^{\mathfrak{D}}(\mathbf{p},\mathbf{l})$ loss function with each learning rate reduction. The vertical dashed lines designate epochs where the learning rate was scaled to 1/10th of its original value. The validation accuracy is mildly increased, although there is a clear difference. Right panel (**b**): training on CIFAR10 of a network with standard ResNet building blocks and fixed depth, $\mathfrak{D}$, of the ${L}^{\mathfrak{D}}=1-\langle {\mathcal{FT}}^{\mathfrak{D}}\rangle $ loss. The vertical dashed lines designate epochs where the learning rate was scaled to 1/10th of its original value. As the depth of iteration, $\mathfrak{D}$, increases ($\mathfrak{D}$ remains constant within each experiment), the convergence speed of the validation accuracy degrades.

**Figure 12.** Visualization of the last features (before the multitasking head) for the `mantis` `FracTAL` `ResNet` models of `FracTAL` depth $d=0$ (left panel) and $d=10$ (right panel). The features appear similar. For each panel, the first three images at the top left are the input image at date ${t}_{1}$, the input image at date ${t}_{2}$, and the ground truth mask.

**Figure 13.** Examples of inferred change detection on test tiles from the LEVIRCD dataset with the `mantis` `CEECNet` V1 model (evolving loss strategy, `FracTAL` depth $d=5$). For each row, from left to right: input image date 1, input image date 2, ground truth, change prediction (threshold 0.5), and confidence heat map.

**Figure 14.** A sample of change detection on windows of size $1024\times 1024$ from the WHU dataset. Inference is with the `mantis` `CEECNet` V1 model. The ordering of the inputs, for each row, is as in Figure 13. We indicate successful findings with blue boxes and missed changes on buildings with red boxes.

**Figure 15.**Trainable scaling parameters, $\gamma $, for the sigmoid activation, i.e., $\mathtt{sigmoid}(x/\gamma )$, that are used in the prediction of change mask boundary layers.

**Figure 16.** `mantis` `CEECNet` V1 vs. `mantis` `FracTAL` `ResNet` (`FracTAL` depth, $d=5$) evolution of performance on the change detection validation datasets. The top panel corresponds to the LEVIRCD dataset, the bottom panel to the WHU dataset. For each network, we followed the evolving loss strategy: there are two learning rate reductions followed by two scaling ups of the ${\langle \mathcal{FT}\rangle}^{d}$ loss function. All four training histories avoid overfitting, thanks to making the loss function sharper towards optimality.

**Figure 17.** Samples of relative quality of change detection on test tiles of size $1024\times 1024$ from the LEVIRCD dataset. For each row, from left to right: input image date 1, input image date 2, ground truth, and confidence heat maps of `mantis` `CEECNet` V1 and `mantis` `FracTAL` `ResNet`, respectively.

**Figure 19.** Visualization of the relative attention units, `ratt12` (**left panel**) and `ratt21` (**right panel**), for the `mantis` `FracTAL` `ResNet` with `FracTAL` depth, $d=10$. These come from the first feature extractors (channels = 32, filter spatial size $256\times 256$). Here, `ratt12` is the relative attention where the query is created from the input at date ${t}_{1}$, and the key/value filters are created from the input at date ${t}_{2}$. In the top left rows of each panel, we show the input image at date ${t}_{1}$, the input image at date ${t}_{2}$, and the ground truth building change labels, followed by the visualisation of each of the 32 channels of the features.

**Figure 20.** For the same model as in Figure 19, we plot the difference of the first feature extractor blocks (**left panel**) vs. the first Fusion feature extraction block (**right panel**). The entropy of the fusion features is half that of the difference channels, meaning there is less "surprise" in the fusion filters than in the difference of filters for the same trained network.

**Table 1.** Model comparison on the LEVIR building change detection dataset. We designate with **bold** font the best values, with underline the second best, and with round brackets, $\left(\phantom{\rule{0.277778em}{0ex}}\right)$, the third best model. All of our frameworks (`D6nf32`) use the `mantis` macro-topology and achieve state-of-the-art performance. Here, `evo` represents the evolving loss strategy, `sta` the static loss strategy, and the depth d refers to the $\mathcal{FT}$ similarity metric of the `FracTAL` (attention) layer. In the last column, we provide the number of trainable parameters for each model.

| Model | FracTAL Depth | Loss Strategy | Precision | Recall | F1 | MCC | IoU | Model Params |
|---|---|---|---|---|---|---|---|---|
| Chen and Shi [1] | - | - | 83.80 | 91.00 | 87.30 | - | - | - |
| CEECNetV1 | $d=5$ | sta, $\mathfrak{D}\in \{0,10,20,30\}$ | 93.36 | 89.46 | 91.37 | 90.94 | 84.10 | 49.2 M |
| CEECNetV1 | $d=5$ | evo, $\mathfrak{D}\in \{0,10,20,30\}$ | 93.73 | (89.93) | (91.79) | (91.38) | (84.82) | 49.2 M |
| CEECNetV2 | $d=5$ | evo, $\mathfrak{D}\in \{0,10,20,30\}$ | 93.81 | 89.92 | 91.83 | 91.42 | 84.89 | 92.4 M |
| FracTALResNet | $d=0$ | evo, $\mathfrak{D}\in \{0,10,20,30\}$ | 93.50 | 89.79 | 91.61 | 91.20 | 84.51 | 20.1 M |
| FracTALResNet | $d=5$ | evo, $\mathfrak{D}\in \{0,10,20,30\}$ | 93.60 | 89.38 | 91.44 | 91.02 | 84.23 | 20.1 M |
| FracTALResNet | $d=10$ | evo, $\mathfrak{D}\in \{0,10,20,30\}$ | (93.63) | 90.04 | 91.80 | 91.39 | 84.84 | 20.1 M |

**Table 2.** Model comparison on the WHU building change detection dataset. We designate with **bold** font the best values, with underline the second best, and with round brackets, $\left(\phantom{\rule{0.277778em}{0ex}}\right)$, the third best model. Ji et al. [28] presented two models for extracting buildings prior to estimating the change mask: Mask-RCNN (in table: `M1`) and MS-FCN (in table: `M2`). Our models consume input images of size $256\times 256$ pixels. With the exception of [58], which uses the same size, all other results consume inputs of size $512\times 512$ pixels.

| Model | FracTAL Depth | Loss Strategy | Precision | Recall | F1 | MCC | IoU |
|---|---|---|---|---|---|---|---|
| Ji et al. [28] M1 | - | - | 93.100 | 89.200 | (91.108) | - | (83.70) |
| Ji et al. [28] M2 | - | - | 93.800 | 87.800 | 90.700 | - | 83.00 |
| Chen et al. [31] | - | - | 89.2 | (90.5) | 89.80 | - | - |
| Cao et al. [57] | - | - | (94.00) | 79.37 | 86.07 | - | - |
| Liu et al. [58] | - | - | 90.15 | 89.35 | 89.75 | - | 81.40 |
| FracTALResNet | $d=5$ | evo, $\mathfrak{D}\in \{0,10,20,30\}$ | 95.350 | 90.873 | 93.058 | 92.892 | 87.02 |
| CEECNetV1 | $d=5$ | evo, $\mathfrak{D}\in \{0,10,20,30\}$ | 95.571 | 92.043 | 93.774 | 93.616 | 88.23 |


© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Diakogiannis, F.I.; Waldner, F.; Caccetta, P.
Looking for Change? Roll the Dice and Demand Attention. *Remote Sens.* **2021**, *13*, 3707.
https://doi.org/10.3390/rs13183707
