1. Introduction
The location of damaged buildings after a disastrous event is of utmost importance for several stages of the disaster management cycle [1,2]. Manual inspection is not efficient since it takes a considerable amount of resources and time, which prevents its use in the early response phase of the disaster management cycle [3]. Over the last decade, remote sensing platforms have been increasingly used for the mapping of building damages. These platforms usually offer wide coverage, fast deployment, and high temporal frequency. Space, air, and ground platforms mounted with optical [4,5,6], radar [7,8], and laser [9,10] sensors have been used to collect data to perform automatic building damage assessment. Regardless of the platform and sensor used, several central difficulties persist, such as the subjectivity in the manual identification of hazard-induced damages from the remote sensing data, and the fact that the damage evidenced by the exterior of a building might not be enough to infer the building's structural health. For this reason, most scientific contributions aim towards the extraction of damage evidence such as piles of rubble, debris, spalling, and cracks from remote sensing data in a reliable and automated manner.
Optical remote sensing images have been preferred for building damage assessments since these data are easier to interpret than other remote sensing data [1]. Moreover, these images may allow for the generation of 3D models if captured with enough overlap. The 3D information can then be used to infer the geometrical deformations of the buildings. However, the time needed to generate such 3D information through dense image matching might hinder its use in the search and rescue phase, where fast processing is mandatory.
Synoptic satellite imagery can cover regional to national extents and can be readily available after a disaster. The International Charter (IC) and the Copernicus Emergency Management Service (EMS) use synoptic optical data to assess building damage after a disastrous event. However, many signs of damage may not be identifiable using such data. Pancake collapses and damages along the façades might not be detectable due to the limited viewpoint of such platforms. Furthermore, their relatively low resolution may introduce uncertainty into satellite imagery damage mapping [11], even when performed manually [12,13].
To overcome these satellite imagery drawbacks, airborne images collected from manned aerial platforms have been considered in many events [14,15,16,17]. These images may not be as readily available as satellite data, but they can be captured at a higher resolution, and such aerial platforms may also perform multi-view image captures. While the increase in resolution aids in the disambiguation between damaged and non-damaged buildings, the oblique views enable the damage assessment of the façades [14]. These advantages were also realized by the EMS, which recently started signing contracts with private companies to survey regions with aerial oblique imagery after a disaster [18], as happened in the 2016 earthquakes in central Italy.
Unmanned aerial vehicles (UAV) have been used to perform a more thorough damage assessment of a given scene. Their high portability and higher resolution, when compared to manned platforms, have several benefits: they allow for a more detailed damage assessment [17], in which lower levels of damage such as cracks and smaller signs of spalling can be detected [19], and they allow the UAV flights to focus only on specific areas of interest [20].
Recent advances in the computer vision domain, namely, the use of convolutional neural networks (CNN) for image classification and segmentation [21,22,23], have also shown their potential in the remote sensing domain [24,25,26] and, more specifically, for the image classification of building damages such as debris or rubble piles [17,27]. All these contributions use data with similar resolutions that are specifically acquired to train and test the developed networks. The use of multi-resolution data has improved the overall image classification and segmentation in many computer vision applications [24,28,29] and in remote sensing [25]. However, these multi-resolution images are generated artificially: the input images are up-sampled and down-sampled at several scales and then fused to obtain a final, stronger classifier. While in computer vision the resolution of a given image is considered another inherent difficulty of the image classification task, in remote sensing there are several resolution levels defined by the platform and sensor used, and these are usually considered independently for any image classification task.
A growing amount of image data has been collected by map producers using different sensors and with different resolutions, and their optimal use and integration would, therefore, represent an opportunity to positively impact scene classification. More specifically, a successful multi-resolution approach would make the image classification of building damages more flexible and not reliant only on a given set of images from a given platform or sensor. This would be optimal since there often are not enough image samples of a given resolution level available to generate a strong CNN-based classifier. The first preliminary attempt in this direction, using image data from different platforms and optical sensors, has only been addressed recently [30]. That work focused on the satellite image classification of building damages (debris and rubble piles) whilst also considering image data from other (aerial) resolutions in its training. The authors reported an improvement of nearly 4% in the satellite image classification of building damages by fusing the feature maps obtained from the satellite and aerial resolutions. However, the paper limited its investigation to satellite images, not considering the impact of the multi-resolution approach in the case of aerial (manned and unmanned) images.
The present paper extends the previously reported work by thoroughly assessing the combined use of satellite and airborne (manned and unmanned) imagery for the image classification of building damages (debris and rubble piles, as in Figure 1) at these same resolution levels. This work focuses on the fusion of the feature maps coming from each of the resolutions. Specifically, the aim of the paper is twofold: (1) to assess whether feature maps derived from several resolution levels improve the image classification of building damages at each individual resolution level, and (2) to determine which feature fusion approach best combines this multi-resolution information.
The next section focuses on the related work of both image-based damage mapping and CNN feature map fusion. Section 3 presents the methodology followed to assess the use of multi-resolution imagery, where the used network is defined and the fusion approaches are formalized. Section 4 deals with the experiments and results, followed by a discussion of the results (Section 5) and the conclusions (Section 6).
3. Methodology
Three different CNN feature fusion approaches were used to assess the multi-resolution capabilities of CNN in performing the image classification of building damages. These multi-resolution experiments were compared with two baseline approaches. These baselines followed the traditional image classification pipeline using CNN, where each imagery resolution level was fed to a single network.
The network used in the experiments is presented in Section 3.1. This network exploited two main characteristics: residual connections and dilated convolutions (presented in the following paragraphs). The baseline experiments are presented in Section 3.2, while the feature fusion approaches are presented in Section 3.3.
A central aspect of a network capable of capturing the multi-resolution information present in the images is its ability to capture spatial context. Yu and Koltun [51] introduced the concept of dilated convolutions in CNN with the aim of capturing context in image recognition tasks. Dilated convolutions are applied to a given input image using a kernel with defined gaps (Figure 2). Due to these gaps, the receptive field of the network is larger, capturing more contextual information [51]. Moreover, the receptive field of the dilated convolutions also enables the capture of finer details, since there is no need to perform an aggressive down-sampling of the feature maps throughout the network, better preserving the original spatial resolution [52]. Looking at the specific task of building damage detection, the visual depiction of a collapsed building in a nadir aerial image patch may not appear in the form of a single rubble pile. Often, only smaller damage cues such as blown-out debris or smaller portions of rubble are found in the vicinity of such collapsed buildings. Hence, by using dilated convolutions in this study, we aim to learn the relationship between damaged areas and their context, relating these among all the levels of resolution.
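To make the mechanism concrete, the snippet below is a minimal PyTorch sketch (not the implementation used in this work; the layer sizes are illustrative) of how a dilation factor enlarges the receptive field of a 3 × 3 kernel without adding parameters.

```python
import torch
import torch.nn as nn

# A standard 3x3 convolution "sees" a 3x3 window of its input.
conv_standard = nn.Conv2d(3, 16, kernel_size=3, padding=1, dilation=1)

# The same 3x3 kernel with dilation=2 has gaps between its taps and
# covers a 5x5 window (effective size = dilation * (k - 1) + 1),
# i.e., more context with the same number of parameters.
conv_dilated = nn.Conv2d(3, 16, kernel_size=3, padding=2, dilation=2)

x = torch.randn(1, 3, 224, 224)   # one RGB image patch
print(conv_standard(x).shape)     # torch.Size([1, 16, 224, 224])
print(conv_dilated(x).shape)      # torch.Size([1, 16, 224, 224])
```

Stacking several such layers with growing dilation lets the receptive field expand quickly while the feature maps keep their spatial resolution.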
From the shallow AlexNet [22], to VGG [23], and the more recently proposed ResNet [21], the depth of the networks proposed for image classification has increased. Unfortunately, the deeper the network, the harder it is to train [23]. CNNs are usually built by stacking convolution layers, which allows a given network to learn from lower-level features to higher levels of abstraction in a hierarchical setting. Nonetheless, a given layer l is only connected with the layers adjacent to it (i.e., layers l−1 and l+1). This design has been shown to be suboptimal since the information from earlier layers may be lost during backpropagation [21]. Residual connections were therefore proposed [21], where the input of a given layer may be a summation of the outputs of previous layers. These residual connections allow us (1) to have deeper networks while maintaining a low number of parameters and (2) to preserve the feature information across all layers (Figure 3) [21]. The latter aspect is particularly important for a multi-resolution approach since a given feature may have a different degree of relevance for each of the considered levels of resolution. The preservation of this feature information is therefore critical when aggregating the feature maps generated using different resolution data.
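For illustration, a residual (skip) connection can be sketched as follows in PyTorch; this is a generic example and not the exact block of Reference [21] or of the network used here.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Output = F(x) + x: the identity branch carries earlier feature information forward."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = x                        # skip (residual) branch
        out = self.relu(self.conv1(x))
        out = self.conv2(out)
        return self.relu(out + identity)    # summation with the layer's input
```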
3.1. Basic Convolutional Set and Modules Definition
The main network configuration was built by considering two main modules: (1) the context module and (2) the resolution-specific module (Figure 4). This structure was inspired by the works of References [21,52,53]. The general idea behind the use of these two modules was that, while the dilated convolutions capture the wider context (context module), more local features may be lost in the dilation process, hence the use of the resolution-specific module [51,53] with decreasing dilation. In this way, the context is harnessed through the context module, while the resolution-specific module brings back the feature information related to a given resolution. The modules were built by stacking basic convolutional sets defined by convolution, batch normalization, and ReLU (rectified linear unit) (called CBR in Figure 4) [54]. As depicted in Figure 4, a pair of these basic convolutional sets bridged by a residual connection formed the simplest component of the network, which was then used to build the indicated modules.
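A minimal sketch of such a CBR set and of a pair of CBRs bridged by a residual connection is given below, assuming a PyTorch implementation; the filter counts and dilation values are placeholders rather than the exact configuration of Figure 4.

```python
import torch
import torch.nn as nn

def cbr(in_ch, out_ch, dilation=1, stride=1):
    """Basic convolutional set: convolution, batch normalization, ReLU (CBR)."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=stride,
                  padding=dilation, dilation=dilation, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class CBRPair(nn.Module):
    """Two CBR sets bridged by a residual connection: the simplest component of the network."""
    def __init__(self, channels, dilation=1):
        super().__init__()
        self.body = nn.Sequential(
            cbr(channels, channels, dilation=dilation),
            cbr(channels, channels, dilation=dilation),
        )

    def forward(self, x):
        return self.body(x) + x   # residual connection preserves the incoming features

x = torch.randn(1, 64, 56, 56)
print(CBRPair(64, dilation=2)(x).shape)   # torch.Size([1, 64, 56, 56])
```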
The context module was built by stacking 19 CBRs with an increasing number of filters and an increasing dilation factor. In our tests, a lower number of CBRs made the network weaker, while deeper networks gave no improvement and slowed the network runtime (increasing the risk of overfitting). A growing number of filters is commonly used in CNN approaches, following the general assumption that more filters are needed to represent more complex features [21,22,23]. The increasing dilation factor in the context module is aimed at gradually capturing feature representations over a larger context area [51]. The red dots in Figure 4 indicate where a stride of 2, instead of 1, was applied. The striding reduced the size of the feature maps (from the initial 224 × 224 px to the final 28 × 28 px) without performing max pooling. Larger strides have been shown to be beneficial when dilated convolutions are considered [52]. The kernel size was 3 × 3 [55]; only the first CBR block of the context module had a kernel size of 7 × 7 [52]. The increase in the dilation factor can generate artifacts (aliasing effect) on the resulting feature maps due to the gaps introduced by the dilated kernels [52,53]. To attenuate this drawback, the dilation increase in the context module was compensated in the resolution-specific module with a gradual reduction of the dilation value [53] and the removal of the residual connections from the basic CBR blocks [52]. This also allowed us to recapture the more local features [53], which might have been lost due to the increasing dilations in the context module. For the classification part of the network, global average pooling was applied, followed by a convolution which maps the feature maps to the number of classes [52,56]. Since this was a binary classification problem, a sigmoid function was used as the activation.
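A sketch of this classification part, assuming PyTorch and an illustrative 512-channel feature map at the end of the resolution-specific module:

```python
import torch
import torch.nn as nn

class DamageHead(nn.Module):
    """Global average pooling, a 1x1 convolution mapping to the classes, and a sigmoid."""
    def __init__(self, in_channels=512, num_classes=1):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)                       # global average pooling
        self.to_classes = nn.Conv2d(in_channels, num_classes, kernel_size=1)

    def forward(self, features):                                  # features: N x C x 28 x 28
        x = self.pool(features)                                   # N x C x 1 x 1
        x = self.to_classes(x)                                     # N x num_classes x 1 x 1
        return torch.sigmoid(x).flatten(1)                         # probability of "damaged"

features = torch.randn(4, 512, 28, 28)
print(DamageHead()(features).shape)                                # torch.Size([4, 1])
```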
3.2. Baseline Method
As already mentioned, the multi-resolution tests were compared against two baseline networks. These followed the traditional pipelines for the image classification of building damages [17,27]. In the first baseline network (Figure 5), the training samples of a single resolution (i.e., only airborne—manned or unmanned—or satellite) were fed into a network composed of the context and the resolution-specific module, as in a single-resolution approach. The second baseline (hereafter referred to as baseline_ft) used the same architecture as defined for the baseline (Figure 5). It fed generic image samples of a given resolution level (Tables 2 and 3) into the context module, while the resolution-specific module was only fed with the damage-domain image samples of that same resolution level. Fine-tuning a network that used a generic image dataset for training may improve the image classification process [25], especially in cases with a low number of image samples for the specific classification problem [57]. The generic resolution-specific image samples were used to train a network considering two classes: built and non-built environments. Its weights were then used as a starting point in the fine-tuning experiments for the specific case of the image classification of building damages. This led to two baseline tests for each resolution level (one trained from scratch and one fine-tuned on generic resolution-specific image samples).
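A minimal sketch of the baseline_ft procedure, assuming PyTorch; the placeholder network and training objects below only illustrate the two-step logic (generic pre-training, then fine-tuning on damage samples) and are not the authors' code.

```python
import copy
import torch
import torch.nn as nn

def make_network():
    # Placeholder architecture standing in for the context + resolution-specific modules.
    return nn.Sequential(
        nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),       # context module (placeholder)
        nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),     # resolution-specific module (placeholder)
        nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(128, 1),
    )

# 1) Train a network of the same architecture on generic resolution-specific samples
#    (two classes: built / non-built environment). Training loop omitted here.
generic_net = make_network()

# 2) Use its weights as the starting point for the damage classification task.
baseline_ft = make_network()
baseline_ft.load_state_dict(copy.deepcopy(generic_net.state_dict()))

# 3) Fine-tune on the damage-domain image samples of the same resolution level,
#    typically with a lower learning rate than when training from scratch.
optimizer = torch.optim.SGD(baseline_ft.parameters(), lr=1e-4, momentum=0.9)
criterion = nn.BCEWithLogitsLoss()
```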
3.3. Feature Fusion Methods
The multi-resolution feature fusion approaches used different combinations of the baseline modules and their computed features (Section 3.2). Three different approaches were defined: MR_a, MR_b, and MR_c, as shown in Figure 5. The three types of fusion were inspired by previous studies in computer vision [58] and remote sensing [30,43,48,49]. In the presented implementation, the baselines were independently computed for each resolution level without sharing the weights among them [49]. The used image samples have different resolutions and were acquired in different locations: multi-modal approaches (e.g., [48]), which deal with heterogeneous data fusion (synchronized and overlapping), could not be directly adopted in this case as there was no correspondence between the areas captured by the different sensors. Moreover, in a disaster scenario, time is critical, and acquisitions with three different sensors (mounted on three different platforms) and resolutions would not be easily feasible.
A fusion module (presented in Figure 5) was used in two of the fusion strategies, MR_b and MR_c, while MR_a followed the fusion approach used in Reference [30]. This fusion module aimed to learn from all the different feature representations, blending their heterogeneity [48,58] through a set of convolutions. The objective behind the three different fusion approaches was to understand (i) which layers (and their features) contributed more to the image classification of building damages at a certain resolution level and (ii) which was the best approach to fuse the different modules containing multi-resolution information. The networks were then fine-tuned with the image data (X in Figure 5) of the resolution level of interest. For example, in MR_a, the features from the context modules of the three baseline networks were concatenated; the resolution-specific module was then fine-tuned with the image data X of a given resolution level (e.g., satellite imagery).
The concatenation indicated in Figure 5 took as input feature maps with the same width and height and merged them along the channel dimension. Other merging approaches were tested, such as summation, addition, and the averaging of the convolutional modules; however, they underperformed compared to concatenation. In the bullet points below, each of the fusion approaches is defined in detail (a code sketch of the channel-wise concatenation and fusion module is given after the list). The three fusions (MR_a, MR_b, and MR_c) were performed for each resolution level.
MR_a: in this fusion approach, the features of the context modules of each of the baseline experiments were concatenated. The resolution-specific module was then fine-tuned using the image data of a given resolution level (X in Figure 5). This approach followed a general fusion scheme already used in computer vision to merge the artificial multi-scale branches of a network [28,59] or to fuse remote sensing image data [60]. Furthermore, this simple fusion approach had already been tested in another multi-resolution study [30].
MR_b: in this fusion approach, the features of the context module followed by the resolution-specific module of each baseline experiment were concatenated. The fusion module took this concatenation as input and was fine-tuned using the image data of a given resolution level (X in Figure 5). While only the context module of each resolution level was considered for the fusion in MR_a, MR_b also considered the feature information of the resolution-specific module. In this case, the fusion module aimed at blending all these heterogeneous feature maps and building the final classifier for each of the resolution levels separately (Figure 5). This fusion approach allows the use of traditional (i.e., mono-resolution) pre-trained networks, as only the last set of convolutions needs to be run (i.e., the fusion module).
MR_c: this approach builds on MR_a. However, in this case, the feature information from the concatenation of the several context modules is kept until a later stage of the fusion approach. This was done by further concatenating this feature information with the output of the resolution-specific module that was fine-tuned with the image data of a given resolution (X in Figure 5). As in MR_b, the feature information coming from the context modules and the resolution-specific module was blended using the fusion module.
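The sketch below illustrates, under the assumption of a PyTorch implementation with illustrative channel counts and depths, the channel-wise concatenation of per-resolution feature maps and a small convolutional fusion module of the kind used in MR_b and MR_c; it is not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class FusionModule(nn.Module):
    """Blends concatenated multi-resolution feature maps with a small set of convolutions."""
    def __init__(self, in_channels, mid_channels=256, num_classes=1):
        super().__init__()
        self.blend = nn.Sequential(
            nn.Conv2d(in_channels, mid_channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(mid_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels, mid_channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(mid_channels),
            nn.ReLU(inplace=True),
        )
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.to_classes = nn.Conv2d(mid_channels, num_classes, kernel_size=1)

    def forward(self, feature_maps):
        # feature_maps: list of N x C_i x H x W tensors with identical H and W,
        # one per resolution level (satellite, aerial manned, aerial unmanned).
        fused = torch.cat(feature_maps, dim=1)        # concatenation along the channel dimension
        fused = self.blend(fused)                     # blend the heterogeneous feature maps
        logits = self.to_classes(self.pool(fused))
        return torch.sigmoid(logits).flatten(1)

# Example: three per-resolution feature maps of 256 channels each.
maps = [torch.randn(2, 256, 28, 28) for _ in range(3)]
print(FusionModule(in_channels=3 * 256)(maps).shape)  # torch.Size([2, 1])
```

In MR_b the inputs to such a module would be the per-resolution context plus resolution-specific features, while in MR_c the concatenated context features would also be appended before blending.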
5. Discussion
The results show an improvement in the classification accuracy and the localization capabilities of a CNN for the image classification of building damages when using the multi-resolution feature maps. However, each of the feature fusion approaches behaved differently. The overall best multi-resolution feature fusion approach (MR_c) concatenates the feature maps from intermediate layers, confirming the need to preserve feature information from the intermediate layers at a later stage of the network [25,28]. This feature fusion approach also considers a fusion module (Figure 5) that is able to merge and blend the multi-resolution feature maps. Other feature fusion studies using small convolutional sets to merge audio and video features [58] or remote sensing multi-modal feature maps [44,48,50] have underlined the same aspect. In general, the satellite and aerial (unmanned) resolutions were the ones that presented the most improvements when using the multi-resolution feature fusion approaches. The aerial (manned) resolution also improved its image classification accuracy and localization capabilities, although only marginally; this resolution level had the least improvement with the multi-resolution feature fusion approach. This is discussed in detail below.
The model transferability experiments generally had a lower accuracy, indicating the need for in situ image acquisitions to obtain optimal classifiers, as shown in [17]. In the satellite case, both the precision and recall were higher in the multi-resolution feature fusion approaches, and the models produced fewer false positives and fewer false negatives. In the aerial (manned and unmanned) cases, the recall was higher and the precision was lower, reflecting that a higher number of image patches were correctly classified as damaged but more false positives were also present. In the aerial (manned) resolution tests, the multi-resolution feature fusion approaches had worse accuracies than the baselines; in this case, the best approach was to fine-tune a network trained on generic aerial (manned) image samples. In the aerial (manned) case, the image quality was better (high-end calibrated cameras), with more homogeneous captures throughout different geographical regions. The aerial (unmanned) image captures were usually performed with a wide variety of compact-grade cameras, which presented a higher variability both in the sensor characteristics and in the image capture specifications. Consequently, the image quality was more variable than for the aerial (manned) platforms.
The transferability tests with aerial (unmanned) imagery therefore deal simultaneously with geographical transferability and with very different image qualities and image capture specifications. In such cases, the presented results indicate that the multi-resolution feature fusion approaches helped the model to generalize better than the traditional mono-resolution methods.
The activations shown in the results are in agreement with the accuracy results. The multi-resolution feature fusion approaches presented better localization capabilities compared with the baseline experiments. Stripe patterns and gridding artifacts can be seen in the activations, which could be due to the use of dilated kernels in the presented convolutional modules, as indicated in [52,53].
The large image patches shown in Figure 11, Figure 12 and Figure 13 indicate that both the satellite and aerial (unmanned) resolution levels benefit more from the multi-resolution feature fusion approach than from the baseline experiments. Furthermore, the aerial (unmanned) multi-resolution feature fusion identifies only one of the patches as a false positive, while correctly classifying more damaged image patches.
The previous study on multi-resolution feature fusion [30], using both a baseline and a feature fusion approach similar to MR_a, reported better accuracies than the ones presented in this paper, although both contributions reflect a general improvement. The difference between the two works lies in the training data, which were extracted from the same dataset but considering different images and different damage thresholds for the labelling of the image patches (40% in this paper, 60% in Reference [30]). The different results confirm the difficulties and subjectivity inherent in the manual identification of building damages from any type of remote sensing imagery [12,58]. Moreover, they also indicate the sensitivity of damage detection with CNN to the input used for training.
6. Conclusions and Future Work
This paper assessed the combined use of multi-resolution remote sensing imagery coming from sensors mounted on different platforms within a CNN feature fusion approach to perform the image classification of building damages (rubble piles and debris). Both a context and a resolution-specific network module were defined by using dilated convolutions and residual connections. Subsequently, the feature information of these modules was fused using three different approaches. These were further compared against two baseline experiments.
Overall, the multi-resolution feature fusion approaches outperformed the traditional image classification of building damages, especially in the satellite and aerial (unmanned) cases. Two relevant aspects were highlighted by the experiments performed on the multi-resolution feature fusion approaches: (1) the importance of the fusion module, as it allowed both MR_b and MR_c to outperform MR_a, and (2) the beneficial effect of considering the feature information from the intermediate layers of each of the resolution levels in the later stages of the network, as in MR_c.
These results were also confirmed in the classification of larger image patches in the satellite and aerial (unmanned) cases. Gridding artifacts and stripe patterns could be seen in the activations of the several fusion and baseline experiments due to the use of dilated kernels; however, in the multi-resolution feature fusion experiments, the activations were often more detailed than in the traditional approaches.
In the model transferability experiments, the multi-resolution feature fusion approaches also improved the accuracy for the satellite and aerial (unmanned) imagery. On the contrary, fine-tuning a network trained with generic aerial (manned) images was preferable in the aerial (manned) case. The different behavior in the aerial (manned) case could be explained by the use of images captured with high-end calibrated cameras and with more homogeneous data capture settings. These characteristics of the aerial (manned) resolution level contrast with the aerial (unmanned) case, where the acquisition settings were more heterogeneous and a number of different sensors with a generally lower quality were used. In the aerial (manned) case, the model transferability to a new geographical region was, therefore, more related to the scene characteristics of that region (e.g., urban morphology) and less related to the sensor or capture settings. In the aerial (unmanned) case, the higher variability of the image datasets allowed the model to generalize better.
The transferability tests also indicated that the highest improvements of the multi-resolution approach were visible in the satellite resolution, with a substantial reduction of both false positives and false negatives. This was not the case in the aerial (unmanned) resolution level, where a higher number of false positives balanced the decrease in the number of false negatives. In a disaster scenario, the objective is to identify which buildings are damaged (and hence may contain victims). Therefore, it is preferable to lower the number of false negatives, even at the cost of a slight increase in false positives.
Despite the success of the multi-resolution feature fusion approach for the image classification of building damages, there is no information regarding the individual contribution of each of the resolution levels to the image classification task. Moreover, the presented results are mainly related to the overall accuracy and behavior of the multi-resolution feature fusion and baseline experiments. More research is needed to assess which signs of damage are better captured with this multi-resolution feature fusion approach for each of the resolution levels. The focus of this work was on the fusion of the several multi-resolution feature maps; however, other networks can be assessed to perform the same task. In this regard, MR_b, for example, can be directly applied to pre-trained modules, where the last set of activations can be concatenated and subsequently fed to the fusion module. In this case, there is no need to re-train a new network for a specific multi-resolution feature fusion approach. There is an ongoing increase in the amount of collected image data, and a multi-resolution approach could harness this vast amount of information to help build stronger classifiers for the image classification of building damages. Moreover, given the recent contributions focusing on online learning [27], the initial satellite images from a given disastrous event could be continuously refined with location-specific image samples coming from other resolutions. In such conditions, the use of a multi-resolution feature fusion approach would be optimal. This is especially relevant in an early post-disaster setting, where all these multi-resolution data would be captured independently with different sensors and at different stages of the disaster management cycle.
This multi-resolution feature fusion approach can also be assessed on other image classification problems with more classes, taking advantage of the ever-growing amount of collected remote sensing imagery.