GSCA-UNet: Towards Automatic Shadow Detection in Urban Aerial Imagery with Global-Spatial-Context Attention Module

: As an inevitable phenomenon in most optical remote-sensing images, the effect of shadows is prominent in urban scenes. Shadow detection is critical for exploiting shadows and recovering the distorted information. Unfortunately, in general, automatic shadow detection methods for urban aerial images cannot achieve satisfactory performance due to the limitation of feature patterns and the lack of consideration of non-local contextual information. To address this challenging problem, the global-spatial-context-attention (GSCA) module was developed to self-adaptively aggregate all global contextual information over the spatial dimension for each pixel in this paper. The GSCA module was embedded into a modiﬁed U-shaped encoder–decoder network that was derived from the UNet network to output the ﬁnal shadow predictions. The network was trained on a newly created shadow detection dataset, and the binary cross-entropy (BCE) loss function was modiﬁed to enhance the training procedure. The performance of the proposed method was evaluated on several typical urban aerial images. Experiment results suggested that the proposed method achieved a better trade-off between automaticity and accuracy. The F1-score, overall accuracy, balanced-error-rate, and intersection-over-union metrics of the proposed method were higher than those of other state-of-the-art shadow detection methods.


Introduction
Benefiting from the significant advantage in spatial resolution, aerial images have opened up a new era in urban observation. However, with the increase in spatial resolution, the influence of shadows is increasingly prominent, particularly in urban areas as there are many tall standing objects such as buildings, towers, and trees. In shaded regions, the undesirable radiometric distortion resulting from sunlight blockage plays a negative role in image processing and analysis [1]. For instance, it is hard to correctly distinguish a dark water body from shadows because they are both low-reflection [2,3], which reduces the accuracy of water mapping. On the other hand, shadows usually carry some helpful spatial semantic cues. The size, shape, and height of the associated obstacle can be obtained by analyzing the geometric attributes of shadows. Utilizing such information could contribute to building recognition, 3D reconstruction, and height estimation [4][5][6]. Consequently, shadow processing is a crucial step for image analysis.
Over the past few years, shadow detection has drawn much scholarly attention. Reviewing the previous literature, existing shadow detection methods can be divided into four main categories: property-, model-, soft-shadow-, and machine-learning-based methods [7][8][9][10].
Because of their simplicity, both in principle and implementation, property-based methods including three subcategories: thresholding-, color-transformation-, and object segmentation-based, are widely reported. Considering that a shadow pixel usually has lower, even the lowest, intensity, thresholding-based methods obtain the final shadow map by thresholding the original image by using a set threshold, which is usually derived from the Otsu [11] and bimodal histogram-splitting methods [12,13]. A prominent drawback to the methods is that it is rather difficult to obtain the optimal threshold, especially when the gray level distribution pattern of an input image does not present obvious bimodal distribution, resulting in lower accuracy. To improve performance, color-transformation-based methods were developed. For multiband images, the original image is first converted to the specifically invariant color space (e.g., HSV, C1, C2, C3, and CIELCh) [14][15][16]. Several image indices were also developed to enlarge the contrast between shadows and non-shadows [17,18], e.g., the normalized saturation-value difference index [19]. Then, shadows and non-shadows are distinguished by thresholding the converted shadow feature map. As the similarity of spectral characteristics between shadow and other dark non-shadow objects is very high, color-transformation-based methods cannot effectively separate shadows from dark non-shadows. Another weakness of the two types of methods is that additional postprocessing steps, as introduced in [20], are usually required to eliminate the salt-and-pepper phenomenon and fill the remained holes, which limitedly improves accuracy. Motivated by the limitations of pixel-based property-based methods, object segmentation-based methods were applied to reduce the interference of dark non-shadow objects, and more accurately locate shadow boundaries [21,22]. However, the methods still exhibit limited ability to identify nonuniform shadows, particularly the shadows with high brightness.
Model-based methods require specific prior knowledge about the scene and sensor, such as topographic data, atmospheric conditions, and imaging parameters, to construct specific models for shadow detection. With this prior information, the methods can usually acquire reasonable results. Nevertheless, when this information is unavailable, they may fail. There are two typical model-based methods: geometrical and physics-based methods. The former uses digital-surface-model (DSM) data to compute the shade coverage on the basis of solar position throughout a strict mathematical method [23,24], and its accuracy is wholly reliant on the quality of the DSM data. The main bottleneck of geometrical methods for shadow detection in urban aerial images its high cost to yield the high-quality DSM data. On the basis of atmospheric and light conditions, physics-based methods employ the spectral information of each pixel to deduce the physical properties of the ground surface to recognize shadows [25,26]. Because of the complex principle, and because aerial images usually lack some specific spectral bands, these methods are rarely applied for aerial images.
For soft-shadow methods, the ultimate aim is to produce a shadow-probability map to visualize the possibility of each pixel belonging to a shadow. In contrast with a conventional binary shadow map, each pixel in the soft-shadow map is encoded by a probability value of each pixel belonging to a shadow. Specifically, if a pixel is in umbra, it should be assigned a value of 1; if a pixel belongs to a penumbra, it should be assigned a value between 0 and 1; otherwise, it should be assigned a value of 0. In this way, shadows, particularly the penumbras, are depicted and located more precisely. However, as described in the literature [27,28], when the size of an input image is large, not only does it need much manual intervention to label a large quantity of positive and negative samples to get a satisfactory matting result, but it would also be time-consuming to perform the detection procedure.
Regarding shadow detection as a binary classification task, machine learning-based methods often employ some typical learning-based classifiers such as perceptron classifiers [29] and support vector machines (SVMs) [30,31] to label shadow and non-shadow pixels in an input image on the basis of handcrafted low-level visual features, such as brightness, texture, and color [32]. Although it was proven that these methods can achieve good performance for images with simple scenes, a trained classifier may be left empty-handed when the scene of an image is complex, since the employed handcrafted features may vary with light conditions and shadow surfaces. Recently, with the success of deep convolutional neural networks (CNNs) in computer-vision tasks (e.g., object detection and image classification) [33][34][35], researchers have also been taking advantage of CNNs to detect shadows. The research in [36] first introduced a shadow detection method using deep learning technology. Two CNN networks were designed and trained to detect the shadow region and shadow boundary, respectively. Then, the conditional-random-field (CRF) model was used to obtain the final results. Due to the powerful ability of automatically extracting multilevel features, the performance of shadow detection was significantly improved in natural images compared to that of traditional methods. After that, numerous CNN networks, represented by cascaded networks and generated adversary networks (GANs), were designed to further enhance performance [37][38][39][40][41]. The major issue for current CNN-based methods is that only local contextual information is considered, and global context is ignored. Herein, the local context means that the correlation between nearby pixels. Instead, the global context means the correlation between all pixels. On the other hand, there are as of yet few studies on using CNNs to extract shadows in urban aerial images.
From the above summary, we can see that the trade-off between automaticity and accuracy for the current methods is not as good as required. The reason for this can be summarized in two aspects. First, the traditional approaches mainly focus on the nonrobustly handcrafted shadow features and do not exploit the rich contextual information contained in a single image, which is insufficient to solve the confusion caused by certain types of objects, resulting in that it cannot be applied in the image with a new scene structure adaptively. On the other hand, there is no prior information in most cases, such as topographic data, imaging parameters, and light conditions, making it arduous to achieve higher accuracy. Therefore, an advanced approach should be developed to resolve this problem. Because traditional methods cannot automatically learn shadow-feature presentation, the limitation could be better addressed using deep-learning technology. Moreover, with the rapid development of remote-sensing technology, aerial images are universally accessible. Sufficient image data have made it possible to utilize CNNs to detect shadows in urban aerial images.
Inspired by the above analysis, in this paper we put forward a novel deep learning-based method, namely, a global-spatial-context attention U-shaped network (GSCA-UNet) to fill the gap between model automaticity and accuracy for shadow detection in urban aerial images. Experiment results on several typical urban aerial images verified the effectiveness and superiority of the proposed method. The main contributions of this work are listed as follows, (1) we developed a spatial-attention module to capture the long-ranged contextual information for each pixel, which contributed to identifying the challenging shadows and non-shadows; (2) on the basis of the UNet network architecture reported in [42], we realized end-to-end automatic and accurate shadow detection in urban aerial images; and (3) we developed a self-adaptively weighted binary cross-entropy (SAWBCE) loss function that enhanced the training procedure.
The remainder of this paper is organized as follows. Section 2 introduces the details of the proposed approach. Section 3 describes the experiment implementation and results. The method and experiments are discussed in Section 4. Last, this paper is concluded in Section 5.

Methodology
In this work, a CNN-based method is proposed for detecting shadows in urban aerial images. The overall flow of our method consists of the three following parts. First, a new shadow detection dataset including training and validation sets was constructed by labeling the selected sliced image tiles and adversarial preprocessing. Afterward, the proposed GSCA-UNet was trained on the dataset, and the optimal parameters were obtained under the supervision of the modified BCE loss function. Last, the chosen test images were fed into the well-trained model to produce the final shadow detection results. The intuitive pipeline of our method is given in Figure 1.

Data Preprocessing
As a data-driven method, the dataset is the key to deep learning-based methods. In previous studies, three benchmark shadow detection datasets (i.e., UCF, SBU [43], and ISTD [44]) were widely used. After many attempts, we found that the three datasets could not meet the needs of accurate shadow detection in urban aerial images. In addition to the model factor, the major reason was that images in these benchmarks could not model the complexity of shadows for urban aerial images because of the contained shadows that were mainly caused by individual trees, people, or cars. Therefore, a new dataset is urgently required to support shadow detection in urban aerial images. The three stages in our data preprocessing are as follows.
We first selected 17 images from the International Society for Photogrammetry and Remote Sensing (ISPRS) Urban Classification and 3D Building Reconstruction benchmark, the Inria Aerial Image Labeling dataset [45], and our own unmanned aerial vehicle (UAV) data captured by DJI Mavic2. All theses selected images have a very high spatial resolution and are characterized by quite a few heterogeneous shadows. Due to the limitation of the GPU device's memory, we downsampled the chosen images, and sliced each image into small patches with a size of 256 × 256 pixels. After that, the boundaries of the shadow regions were manually depicted using ArcMap10.4 and Label Me software; then, the raster labels were obtained. This process took about one month. We chose part of the patches to annotate instead of labeling each patch because there were many redundant patches with similar shadows or without shadows. We also selected 64 images from the SBU and ISTD benchmarks and added them into our dataset. By doing so, an initial dataset containing 2450 images was obtained.
A meager amount of data could easily make the deep learning model overfitted, while manually annotating a large dataset is labor-intensive and time-consuming. Facing this fact, the GAN principle was used in this work. Specifically, taking the major challenge of shadow detection into consideration, we employed data augmentation methods to generate two simple types of additional adversarial images to further enlarge the dataset. Unlike the work described in [39], the cumbersome training procedure was not required for our approach, reducing the complexity of the method. The first type of the adversarial image is shaded-variant images that were obtained by applying the gamma transformation. We only changed the intensity value of each pixel in the shaded regions, and the nonshaded regions remained unchanged. This transformation is described as where r denotes the original gray value of each shadow pixel, c and γ are two parameters to adjust the final result, and s is the transformed result. In this study, c = 1 and γ ∈ {0.6, 0.7, 0.8, 1.2, 1.4, 1.6}. Given an input image, when γ < 1, the shaded regions were brighter. Conversely, they were darker when γ > 1. Like most image enhancement approaches [46], this transformation was performed on the V channel of the HSV color space. After transformation, six adversarial images with different color ranges in shades were obtained for every single image. As part of the shades might have been too bright after transformation, and this is unrealistic in real cases, the generated images with too-bright shaded regions were removed, as it might mislead the model to detecting false shadows. The second type of the adversarial images is disturbed-dark-object images. For this, we first selected images in which the non-shadow regions were larger. Subsequently, we randomly erased a polygonal region in the non-shadow region for each selected image. The size and shape were also random. To avoid that the erased region covered too large or too small a region, its size was limited to s ∈ [ 1 80 S ns , 1 20 S ns ], where s and S ns denote the areas of erased and nonshaded regions, respectively. Last, the erased region was filled with a random dark color to model the challenging dark non-shadow objects. Figure 2 gives an example of the synthetic adversarial images obtained by using the proposed method. Through the two procedures introduced above, 16,300 additional adversarial images were obtained. The generated adversarial images were added into the primary dataset, and the final dataset was constructed. Compared with the primary labeled dataset, the number of images was increased to 18,750 images, which enhanced the ability of the proposed GSCA-UNet model to characterize the unknown input shadows.

Global-Spatial-Context Attention Module
Researchers argued that the network that is capable of capturing more nonlocal contextual information plays a critical role in enhancing performance for semantic segmentation [47,48].
As shadow detection can be regarded as a dense pixel-level semantic-segmentation task, the global spatial semantic associations between pixels contained in an image are also crucial for accurate shadow detection, especially when there is no prior knowledge about the image scene.
To better demonstrate the necessity of the global spatial contextual information for shadow detection, three real instances are given in Figure 3. Figure 3a presents a typical case where the dark roof in non-shadowed regions is easily identified as a shadow. Figure 3b illustrates a case where a road with relatively high brightness in the shadow region is easily classified as non-shadow. The two given instances are widespread in urban aerial images. Applying the previous methods published in [19,20,22], we noted that only considering general shadow characteristics made it almost impossible to predict the true category for the two small images, as their local features (e.g., low brightness) are highly similar. However, if the two small images are both put back into the original image, this confusion is readily solved by comparing the distinction of brightness between the shaded regions where they are located with surrounding nonshaded regions. Taking Figure 3b as an example, the road in the shaded region could be determined as a shadow by comparing it with the adjacent nonshaded road because the former is relatively darker than the latter. Therefore, we needed to analyze the surrounding or larger regions to determine whether an object was a shadow or non-shadow. That is to say, to improve shadow detection performance requires to analyze the nonlocal relationships between pixels. The nonlocal relationships in this study were interpreted as the global spatial contextual information. The case in Figure 3c confirmed this analysis. When comparing C with the B, a strong indicator that C was a shadow was that B was brighter than C. However, comparing C with the A, A being darker than C would indicate that C was a non-shadow, even they were both shadows. It is well known that the convolution operation, except for global convolution, only acts upon the adjacent pixels, and the long-ranged pixels are neglected. As a matter of fact, the informative correlation between pixels that are far from each other is also valuable. As analyzed above, if the shadow detection model could pay attention to larger (nonlocal) regions, it boosts performance of detection. Similar to the channel attention mechanism of SeNet [49], which weighs information on the channel dimension according to the importance of each channel, we designed an attention module to weigh each pixel in a different spatial position. Thus, the GSCA module was developed to clarify the importance of each pixel by aggregating the correlation between all pixels located in the same row and column.
The structure of the proposed GSCA module is shown in Figure 4. Given an input feature map f with a shape of (C, W, H), where C, W, and H represent the number of channels, width, and height of feature map f , respectively, f is first fed to the 1 × 1 convolution layer in Paths 1 and 2 to compress the number of the channels, reducing computation cost, and generating two new feature maps f 1 and The size of f 1 and f 2 is C , W, H , where C < C and C = 1 2 C in this study. Afterward, a feature vector Q i is extracted for the pixel with position i on the spatial dimension of feature map f 1 . At the same time, feature vector K i is obtained by extracting all pixels at the same row and column of the i-th pixel in feature map K. Dimensions of Q i and K i are C × 1 and C × (H + W − 1), respectively. Inspired by DANet [50], the affinity operation (see Formula (2)) was applied to compute the degree of correlation between feature vectors Q i and K i . For the i-th pixel in f , the contextual information of all pixels in the same row and column where it is located is aggregated through the affinity operation. This procedure is only related to pixel location (spatial dimension).
where K i,j is the j-th element in the extracted feature vector K i from feature map f 2 ; i is the spatial pixel position; j = 1, 2, 3, ..., W + H − 1, ⊗, vector multiplication; and d i,j is the computed degree of correlation between Q i and K i,j , with its dimension 1 × (W + H − 1). d i,j was normalized using the sigmoid function to obtain attention vector A i,j . The sigmoid function is written as where x, input tensor. In the attention vector A i,j , elements are suppressed or emphasized according to the corresponding feature responses, clarifying the contribution of each position, and pushing the model to consider more semantic regions. For Path 3 in the GSCA module, the 1 × 1 convolution layer was utilized to ensure that the model could learn nonlinear features between channels [51] and generate another feature vector V i with the shape of C × (H + W − 1). V i collects all pixels in the same row and column from feature map f 3 . The global spatial contextual information is then aggregated by the element-wise summing result of A i,j and V i,j . In order to make the network quickly converge and reuse the features, residual identity mapping was added at the end of the GSCA module. The skip connection and identity mapping proved that it has significant positive impact on the convergence and accuracy of deep CNNs [52]. Therefore, the final reweighted feature map is obtained by where f i stands for the final reweighted feature map, and i denotes the spatial position of each pixel.

Conv 1×1
Input Feature Map f

Mul Add
Vector Vi

C×(W+H-1)
Vector Ki Figure 4. Diagram of the proposed GSCA module. Mul, element-wise product; Add, element-wise addition. Pixel marked with blue in the final output of the GSCA module was reweighted by aggregating contextual correlation of all pixels in the same row and column.

C×W×H
Applying the GSCA module over all pixels on the spatial dimension once, the global contextual information of each pixel in the horizontal and vertical directions was aggregated.

Network Architecture
In this section, we present the details of the proposed GSCA-UNet. To date, due to the apparent advantage in the simple and effective symmetrical encoder-decoder structure, UNet has been widely applied for various remote-sensing tasks, such as building extraction [53], cloud detection [54], and urban village segmentation [55]. The UNet architecture can hierarchically aggregate low-level features from the shallow layer in the encoder path and high-level features from the decoder path by simple skip connection. Most importantly, UNet has proven to potentially be able to handle pixel-level semantic-segmentation tasks with a relatively small dataset. Because of these reasons, the U-shaped architecture was adopted as the baseline architecture in this study.
The original encoder path of UNet is constructed by stacking a sequence of basic blocks that consist of a 3 × 3 convolution layer, a batch-normalization (BN) layer, and a rectified-linear-unit (ReLU) layer. In order to achieve better performance, the ResNeXt-101 (32 × 4d) [56] network was adopted instead of the original UNet encoder to more effectively encode shadow features. ResNeXt-101 is the improved version of ResNet-101, which is the champion network in the Large-Scale Visual-Recognition Challenge (ILSVRC) competition in 2015. We removed the last three layers from the original ResNeXt-101 network, namely, the softmax, fully connected, and global average pooling layers. The final stage that outputs features with 2048 channels was served as the bridge bottleneck block, separating encoder from decoder. The other stages were utilized as the encoder backbone. As illustrated in Figure 5, the encoder path consists of four stages. Given an input image, when it passes through each stage of the encoder, the successive convolution and pooling-downsampling operations decrease the spatial dimension and increase the channel dimension. In order to output the predicted shadow maps that are the same size on the spatial dimension as the input image, a modified expansive path of UNet was adopted for this work. Unlike UNet, we utilized bilinear interpolation to replace the transposed convolution for performing the upsampling operation, which not only had minimal impact on the output of the model, but also effectively eliminated the checkerboard phenomenon resulting from transposed convolution. On the basis of the module that concatenates low-level and high-level features in UNet, the new feature fusion module (FFM) embedded two GSCA modules were developed. Through FFM, low-level and high-level features were first concatenated, and then the weighted feature map was obtained. The spatial size and the refined details were gradually restored by upsampling and 3 × 3 convolution operations. At the end of the decoder in GSCA-UNet, a 1 × 1 convolution layer was applied to compress the channel dimension from 64 to 1. Following that the nonlinearity sigmoid layer was applied to output the final shadow-prediction maps, where each pixel was encoded by the probability of belonging to shadow. The final binary shadow detection map was obtained by thresholding the prediction maps. The ResNeXt-101encoder path is adopted to encode both coarse and fine dense semantic features of input image. The modified decoder path of UNet is responsible for fusing different-stage features and gradually retrieving the detailed information to output the final shadow predictions.

Feature Fusion Module
It is a gradual procedure to restore the size of a high-level feature map, requiring the fusion of low-level features to supply spatial-location information for improving the segmentation accuracy [57]. To meet the need for accurate shadow detection, we designed a new feature fusion module to fuse the high-level and low-level features, and weigh the fused feature maps. In FFM (see Figure 6), there are two inputs, namely, the low-level features from ResNeXt-101 encoder and the high-level features upsampled by earlier bilinear interpolation. The two inputs are first concatenated over channel dimension. Considering that the concatenated features usually have a high dimension, we applied a 1 × 1 convolution layer following it. The effect of 1 × 1 convolution in FFM can be interpreted as two aspects: compressing channels and nonlinear interaction between channels.
The feature map f output by the 1 × 1 convolution layer was input to the first GSCA module to produce a temporary feature map f . Although the horizontal and vertical contextual information could be aggregated via a single GSCA module, the pixel-wise association between all pixels in f was still sparse, which was insufficient for accurate shadow detection. In order to obtain denser global spatial contextual information, another GSCA module was embedded. The two cascaded GSCA modules are called recurrent GSCA modules. f was again input into the second GSCA module to yield final reweighted features f . On the basis of the recurrent GSCA modules, each pixel in f " contained dense pixel-wise global spatial contextual information. At the end of FFM, the final reweighted feature map f " was input to the following convolution blocks for recovering details.  Figure 6. Structure of feature-fusion module. f , f , and f " denote imputed feature map, temporary feature map, and final output of recurrent GSCA module, respectively.

Modified Binary Cross-Entropy Loss
The loss function is a fundamental component in deep learning networks. During the backpropagation phase, the deviation between model predictions and ground truth is measured by loss function. Thus, loss function directly affects model performance. For two-class semantic-segmentation tasks, the BCE loss is a commonly used one, which is expressed as where p, prediction label; and y, ground-truth value. For each training image, the shadow area proportion is not fixed. If the original BCE loss is directly utilized to supervise the training procedure, the imbalance of class distribution between shadows and non-shadows could affect the final detection accuracy. For instance, if we directly employed the original BCE loss, the model tended to detect relatively large non-shadows. To avoid the negative factor, we developed the SAWBCE loss. Unlike the weighted BCE loss in [58], the weight of the proposed loss function is self-adaptively acquired based on the prediction and ground truth at each iteration. The SAWBCE loss consists of two parts, i.e., where y, true label; and p, predicted label. In Equations (6) and (7), N n is the number of non-shadow pixels in the reference shadow map, N s is the number of shadow pixels in the reference shadow map, TP (true positive) is the total number of true shadow pixels that are correctly detected, and TN (true negative) is the number of true non-shadow pixels that are correctly detected. On the other hand, FP (false positive), number of non-shadow pixels that were mistakenly identified as shadow pixels; and FN (false negative), number of true shadow pixels that were detected as non-shadow pixels. The final SAWBCE loss is calculated by where loss 1 is applied to balance the area proportion of detected shadows and non-shadows, and loss 2 helps the model to focus on identifying difficult shadows [59]. Specifically, if the area of shadow regions in ground truth is smaller than that of non-shadow regions, loss 1 penalizes non-shadow regions more heavily than it does shadow regions. If a shadow area is larger than a non-shadow area, the shadow region receives a heavier penalty. If the correctly classified shadow regions are relatively small, loss 2 has a relatively larger value, and vice versa.

Training Details
The proposed GSCA-UNet shadow detection network was trained and evaluated on the constructed shadow dataset introduced in Section 2.1. The original dataset contained 18,750 images and corresponding binary labels. Before training, the hold-out strategy was adopted to split the dataset, and the divided ratio was set at 7:3. We randomly split 70% of the images for the training set to train the model, and 30% of images for the validation set to supervise whether the model was overfitted, and verify network performance during each training phase. The shadow image samples and corresponding annotated shadow labels of the training and validation sets are displayed in Figure 7.
In order to make the proposed GSCA-UNet model more robust, the following data augmentation methods were adopted for this research. For every single image in the training set, we first randomly flipped it horizontally and vertically. Then, we randomly rotated it with a degree of (−90, 90), which aimed to handle the problem of shadow rotation. Last, we randomly resized it with scale (0.8, 1) to deal with the shadow multiscale problem. The above probability values were all set to 0.8 in data augmentation.
For parameter initialization, the encoder and bridge block of GSCA-UNet were initialized by the parameters of the ResNeXt-101 network, which was pretrained well on the ImageNet dataset [60], and the other was initialized using Kaiming uniform [61]. We adopted the Adam [62] algorithm equipped with a weight decay rate of 0.0005 and an initial learning rate of 0.01 as the optimizer for the network. The exponential weight-decay strategy with a decay rate γ = 0.9 was employed to adjust the learning rate for each epoch. The number of epochs was set to 100 to make the network converge well. We set the minibatch size to 12 to make full use of the memory of GPU device in this study.

Test Images
Eight images cut out from several urban aerial images captured in different places were used as test images. All of these test images cover the dissimilar urban scenes and are characterized by quite a few shadows whose surfaces, shapes, and sizes are irregular. Among these test images, we selected four typical images (namely, the Toronto, Chicago, Vienna, and Austin images) for a better and fairer comparison between different state-of-the-art methods. The details of the four test images are displayed in Table 1. The Toronto and Chicago images were characterized by dense commercial buildings. For the two images, shadows were mainly attributed to the blocking from high-rise buildings, and the main obstacle for automatically detecting shadows is how to separate shadows from many misleading objects such as black vehicles, dark roofs, and nonuniform shadows with high brightness. In the Vienna image, there was a large number of shadows with a relatively small size caused by low-rise ridge-style buildings. The presence of many dark roofs and shadows with small sizes in this image made precisely detecting shadows greatly challenging. For the Austin image, the main urban land cover types are cement pavement and greenish water bodies that are often identified as shadows by traditional methods. Therefore, these test images were greatly suitable to verify the effectiveness of our approach.

Implementation Details
In the experiments, the proposed GSCA-UNet network was implemented on the basis of the PyTorch-1.1 framework in an Ubuntu 18.04 environment except for the GSCA module, which was done by CUDA programming using C++ programming language. All experiments were done on a workstation whose configurations are Intel Core i9-9900k CPU, 32 GB RAM, and an Nvidia GeForce RTX 2080ti GPU with 11 GB RAM.
The detailed variations of the loss value and overall accuracy at each training stage are displayed in Figure 8. Figure 8 shows that the convergence process of the proposed GSCA-UNet was stable, and overall accuracy exceeded 96%. Moreover, the overfitting phenomenon was not evident.

Comparison with State-Of-The-Art Methods
The performance of the proposed method was compared with three other reported state-of-the-art shadow detection methods for urban remote-sensing images, i.e., the image-matting method (IMM) [27], the extended-random-walker (ERW) method [20], and the object-oriented (OO) method [22]. The selection of the three methods was based on two considerations: (1) as an automatic shadow detection method, the proposed method had to compare with the same method type, and (2) because the supervised methods can usually deliver high accuracy with manual intervention, but most existing automatic detection methods exhibit relatively low accuracy, revealing the effectiveness of the proposed method requires making comparisons with the supervised methods with high accuracy. In addition, we compared our method with the bidirectional feature pyramid network (BFPN) with recurrent attention residual (RAR) modules. Note, the method is called BDRAR for short in the following section. BDRAR is CNN-based and achieved nearly saturated performance of 3.64 and 5.30 on the SBU and UCF datasets, respectively, for balanced error rate [40].
From these methods, we implemented the IMM and the OO method using the default parameters given in the corresponding literature. Before implementing the IMM, we marked abundant scribes using Photoshop software. In the ERW method, because the scenes of the test images being extremely complex, they always achieved poor performance for all test images. Therefore, we fine-tuned the threshold level to obtain the best results according to the specific image. We trained the BDRAR model with the best performance on our dataset using the training strategies given by the authors.

Visualization Results
The visualization results of the final shadow detection of our approach and of the four state-of-the-art methods are presented in Figures 9-12, from which we can observe that these methods were capable of extracting most shadows in diverse cases. The obtained results by the IMM method look reasonable compared to the reference shadow maps. However, as shown in Figures 9c-12c, quite a few positive (shadow) and negative (non-shadow) samples were needed to be labeled in advance to yield the final detection results for each input image.
Another obvious weakness of the IMM method can be found from Figures 9c and 10c, where some small shadows were missed, which was caused by the fact that it is arduous to mark small shadows. In addition, the image-matting procedure for the IMM method occupied too much computer memory, and the computation speed was very slow when the size of an input image that contained a great many shadows was large. As a result, the IMM method had lower automaticity, which limits the practical applications.
The OO method maintained well the integrity of the detected shadow regions. As illustrated in Figures 9e-12e, applying the object-instead of the pixel-based strategy effectively eliminated the adverse phenomenon in detected shadow maps, such as fragmented shadows and holes. Meanwhile, compared to the ERW method, this method reduced the interference to some extent from dark objects. For example, part of the black vehicles was correctly ruled out in the Toronto and Chicago images. Unfortunately, the OO method failed to handle the nonuniform shadows in the Toronto image and the dark water body in the Austin image. In Figure 9e, this method classified vehicles with bright surfaces in shaded regions as non-shadows. In Figure 12e, because the river had similar low-intensity features to those of shadows, it was detected as shadows. Another disadvantage of the OO method is that it is hard to obtain optimal segmentation parameters with it. Although the state-of-the-art multiscale segmentation method [63][64][65] was adopted in the experiments, the oversegmentation phenomenon still occurred. Taking the Austin image as an example, the OO method failed to separate trees and correspondingly casted shadows, which affected detection accuracy. Benefiting from the great advantage in feature extraction and the employed recurrent residual attention module, the BDRAR method produced satisfactory detection results. However, it can be seen in Figure 12 that it was insufficient to precisely locate the shadow boundaries, and part of the shadows was missed due to lack of consideration for global spatial contextual information. In terms of identifying dark objects, there were many wrongly labeled shadows in the Toronto and Vienna images, leading to poor performance.
On the foundation of the U-shape-based fully convolutional neural network and the global-spatial-context attention module, compared with the other methods, our method always obtained remarkable shadow detection results that were the most consistent with the reference shadow maps. A striking illustration of the results obtained by our method is shown in Figures 9g-12g, from which we can see that there was little noise, and most of the shadows and non-shadows were precisely detected. When handling the four input images that contained many dark non-shadows and bright shadows, the performance of the other automatic methods was insufficient. In contrast, our method effectively suppressed these negative effects due to fully exploiting the global spatial contextual information of each pixel. Taking two difficult cases in the Toronto image as an example, as illustrated in Figure 13, the black car marked with blue was identified as shadows by all compared methods as the contextual information of long-ranged pixels was ignored, while our method detected it correctly by considering the global spatial contextual information. On the other hand, the proposed GSCA-UNet also completely detected the other challenging bright and nonuniform shadow regions marked with red by comprehensively analyzing the near regions. The close-up results shown in Figure 13 confirmed that global spatial contextual information plays a positive effect on shadow detection. Most importantly, our method was able to produce accurate shadow detection results in diverse images with different light conditions, projected surfaces, and occlusions, without any user intervention or prior knowledge.
From the aforementioned comparisons, we could conclude that the balance between automaticity and accuracy for the proposed method was better than that of other advanced methods.

Quantitative Comparisons
To further objectively evaluate the performance of different methods, following the works in [8,18,20], two popular evaluation metrics for the pixel-level classification task, i.e., F1-score (F 1 ) and overall accuracy (OA), were adopted to quantitatively validate detection performance in the subsequent accuracy assessment. We also employed the balanced error rate (BER) [32,[37][38][39]43] that is widely used for shadow detection in natural images. As shadow detection can be regarded as a semantic-segmentation task, we also utilized intersection over union (IoU) as a commonly used metric for semantic-segmentation tasks to validate the proposed network. The four metrics are expressed as In Equations (9)-(13), TP (true positive) is the total number of true shadow pixels that were correctly detected, TN (true negative) is the number of true non-shadow pixels that were correctly detected, FP (false positive) is the number of the non-shadow pixels that were mistakenly identified as shadow pixels, and FN (false negative) is the number of true shadow pixels that were detected as non-shadow pixels.
In our two-class pixel-level shadow and non-shadow classification task, F 1 reports the numerical assessment of the method by taking both precision and recall into consideration, with a higher value meaning that model performance was better; OA assesses the rate in which pixels are correctly detected as part of their true categories; BER indicates the balanced error rate for shadow and non-shadow error rates; IoU measures the intersection ratio between actually detected shadows and true shadows. Comparing the results of the IMM, ERW, OO, BDRAR methods and ours, Tables 2 and 3 clearly show that our method had the best values in terms of F 1 , OA, BER, and IoU. From the Toronto to the Austin images, with the increase in difficulty for shadow detection, the performance of all methods was more or less reduced. For the two conventional automatic methods (ERW and OO) and the CNN-based BDRAR method, a sharp reduction in performance was observed in the Vienna and Austin images, especially Austin, which was mainly covered by dark water. The ERW and OO methods reached very low IoU values (29.11% and 28.97%), while our method's IoU value was 77.18%. Even if compared to the supervised IMM method, the IoU value (77.18%) of our method was still higher than that of the IMM (63.07%). From the quantitative comparisons, it can be concluded that our approach could acquire high-accuracy results while maintaining automaticity, which again powerfully demonstrates the effectiveness and superiority of the proposed method for shadow detection in urban aerial images.

Network-Design Evaluation
In this section, we evaluate the network design of the proposed GSCA-UNet. To better test and verify the effects of different network structures for shadow detection, we performed comparative experiments with the two other CNNs. The first network was the original UNet without the GSCA module and the ResNeXt-based encoder part, and the second was ResNeXt101-UNet, which was derived by removing the GSCA module from GSCA-UNet. Compared with our network, the complexity of the network structure of the two models above increased, and similarity with our model was high. Accordingly, the two models were suitable to verify the feature extractor of ResNeXt-101 and the GSCA module, respectively. We trained UNet and ResNeXt-UNet on our dataset by using the same loss function and training strategy to ensure analysis credibility. Detailed comparison results are displayed in Table 4. As illustrated in Table 4, compared with the baseline UNet network, utilizing the ResNeXt-101 as feature encoder improved F 1 , OA, BER, and IoU by 0.58%, 0.22%, 6.48%, and 1.03%. The main reason for the improvement is that, compared with the original encoder part, the ResNeXt-101 network not only had the advantage of a residual shortcut, but also had a deeper and wider structure, enhancing its ability to extract shadow features. After embedding the recurrent GSCA module, Table 4 shows that F 1 , OA, BER, and IoU were significantly improved by 4.47%, 1.72%, 31.3%, and 7.77%, respectively, which validated the effectiveness of the GSCA module. The reason why the proposed GSCA module could obviously improve performance could be that we established the global link between pixels by enlarging the receptive filed of the CNN by using the module. Although the receptive filed of a CNN model can be theoretically increased by stacking many convolution layers, or applying the atrous convolution operations [66] or the spatial-pyramid-pooling (SPP) module [67,68], the structural limitation of the convolution kernel makes the model only learn the local dependence for each pixel in each stage even if the network contains deeper layers. Therefore, for each pixel, the global spatial contextual information in the feature maps obtained by the ResNeXt101-UNet network was absent. Applying the GSCA module twice in each feature-fusion module, the concatenated feature map paid attention to global contextual information, and each pixel in the spatial dimension was reweighed. The dense global spatial contextual information for each pixel contributed to better identifying suspected shadows. Therefore, the model could more precisely locate shadows. To test the importance of the SAWBCE loss function, we conducted comparison experiments with the original BCE loss and SAWBCE loss. We used two GSCA-UNet models that were trained with BCE and SAWBCE loss, respectively. As listed in Table 5, the proposed model with SAWBCE loss had the best shadow detection accuracy.

Advantages of The Proposed Method
Shadow detection in urban aerial images has been a popular research topic in the last few decades, yet automatic and accurate methods to detect shadows are still lacking. The proposed method in this paper filled this gap. Benefiting from dense global spatial contextual information, our method could yield accurate results without any manual intervention or prior knowledge in diverse cases, even compared with the representative supervised IMM method, which can usually obtain results with high accuracy. The proposed method is based on deep learning technology. Under the support of sufficient image data, it can be transferred to other unknown shadow detection tasks. For practical applications, the precisely detected shadows by our proposed method are more suitable as useful information for corresponding studies, such as urban-building instance recognition, building height estimation, and information recovery. At the same time, the proposed GSCA module has the advantage of flexibility. It can be applied to other CNN-based dense semantic-segmentation tasks for urban aerial remote sensing images, such as building extraction.

Limitations and Further Improvements
Although the proposed shadow detection method filled the gap left by the inadequate balance between automaticity and accuracy, some inherent weaknesses should not be ignored. Generally, spatial information in low-level features helps to locate shadow regions, but low-level features might bring unexpected noise that could cause detection errors. In this study, we directly concatenated the low-and high-level features via skip connection, and the spatial information of the shallow layer and contextual information were not leveraged. Thus, an individual feature fusion module is suggested to rethink the relationship between low-and corresponding high-level features in the feature fusion model.
To aggregate spatial contextual information in the horizontal and vertical directions requires to perform the iterative process pixel by pixel, which comes with high computation cost. The time complexity is O ((W × H) × (W + H − 1)) on the channel dimension when the size of the input tensor is (B, C, W, H), where B, C, W, and H are batch size, channel number, width, and height, receptively. In this study, with size (12, 3, 256, 256) of the initial input, it took about 75.40 min to train each epoch, and 0.11 s to inference, which was relatively slow. On the other hand, the number of parameters for the proposed GSCA-UNet was 106.04 million, making the GSCA-UNet impractical for implementation on some lightweight platforms. Future work should pay more attention to decreasing the number of network parameters to achieve real-time detection.

Conclusions
Although tremendous efforts have been devoted to coping with the challenge of automatic and accurate shadow detection in urban aerial images over many years, it still remains a challenging problem due to the limitation of feature patterns and not considering nonlocal contextual information. In this paper, a new end-to-end deep-learning-based shadow detection method was proposed to address this issue. On the basis of the popular U-shaped architecture of the UNet network, the developed GSCA modules were embedded to self-adaptively aggregate global contextual information for each pixel and generate weighted feature maps. Experiments performed on several typical urban aerial images demonstrated that the proposed shadow detection method could effectively suppress the negative effects such as salt-and-pepper noise, holes, and inaccurate boundary locations of the detected shadow regions. Compared with four state-of-art supervised and unsupervised methods in terms of visual interpretation and several quantitative evaluation metrics, our method exhibited the best performance without any human intervention and prior knowledge, which revealed the superiority of the proposed method in maintaining a balance between automaticity and accuracy. There still remain some issues, such as costly computation and the large number of network parameters. In the future, a larger dataset should be supplemented, and a lightweight network should be developed to achieve real-time detection.