1. Introduction
Synthetic aperture radar (SAR) is an advanced remote sensing tool for Earth observation. As an active radar sensor, it works around the clock and in all weather conditions, an advantage over optical sensors [
1,
2,
3]. It has therefore been widely applied in civil fields, such as marine exploration, forestry census, topographic mapping, land resources survey, and traffic control, as well as in military fields, such as battlefield reconnaissance, war situation monitoring, radar guidance, and strike effect evaluation [
4,
5,
6]. Video SAR provides a continuous sequence of SAR images of the imaging area, enabling dynamic, real-time monitoring of the target scene. It continuously records changes in the target area and presents this temporal information as a visual image stream, which facilitates intuitive interpretation by human observers [
7]. Thus, it has been receiving growing attention from researchers [
8,
9,
10].
Moving target tracking is one of the most significant applications of video SAR. It can provide important information such as the geographical location, moving direction [
11], moving route, and speed of high-value targets in real time [
12]. Such information supports ground traffic management and precise strikes on military targets, so moving target tracking has become a research hotspot in recent years [
7,
13]. So far, some scholars [
11,
12,
13,
14,
15,
16,
17,
18,
19,
20,
21,
22,
23,
24,
25,
26,
27,
28,
29,
30] have proposed various methods for video SAR moving target tracking that offer competitive results.
Notably, all of the above video SAR moving target tracking methods indicate the real moving target with the help of the target's shadow. This is because, in video SAR, the Doppler modulation of the moving target echo is highly sensitive to target motion due to the extremely high working frequency, so even a slight motion leads to a large location offset and defocusing of the target in SAR images, as shown in
Figure 1. However, these phenomena do not affect the shadow of the moving target [
7]; thus, the shadow reflects the real position and motion state of the moving target. More details on the formation mechanisms of moving target shadows in video SAR can be found in [
17].
In particular, moving target shadows are informative in the following two respects [
7]. On the one hand, the contrast between the moving target shadow and its background area, and the gradient information of the shadow intensity along the moving direction, are both closely related to the target speed. On the other hand, because the synthetic aperture time of a single frame image is short, the dynamic shadow also reflects the instantaneous position of the moving target in the scene [
7]. Thus, using shadows to accomplish video SAR moving target detection and tracking has become a new research direction. Furthermore, when combined with Doppler processing, shadow detection can also greatly expand the detectable velocity range of moving targets and further improve the robustness of trackers.
In summary, moving target shadow detection in video SAR is extremely important and valuable. It is a fundamental prerequisite for moving target tracking: only after the shadow is detected successfully can a series of subsequent tasks be carried out smoothly, such as trajectory filtering/reconstruction [
14], data association (i.e., target ID allocation), discrimination between the disappearance of old targets and the appearance of new ones (i.e., target ID switching), velocity estimation [
31], SAR image refocusing [
11,
12], and so on. More descriptions about the relationship between detection and tracking can be found in [
32,
33]. Thus, this paper focuses on this valuable problem, namely, video SAR moving target shadow detection. So far, various methods and algorithms [
16,
17] have been proposed for video SAR moving target shadow detection. These methods can be grouped into two types: (1) traditional feature extraction methods and (2) modern deep learning methods.
The traditional feature extraction methods are based on hand-designed features using expert experience. Wang et al. [
10] used a constant false alarm rate (CFAR) detector to detect the moving target shadow, but CFAR is very sensitive to ground clutter, resulting in poor transferability. Zhong et al. [
14] designed a cell-averaging CFAR based on mean filtering for shadow detection, but their method relied heavily on manual parameter tuning. Worse still, their detector had a weak capacity to suppress false alarms, which placed a heavy burden on the follow-up tracker. Zhao et al. [
15] proposed a visual saliency-based detection mechanism that exploits image contrast to enhance the target shadow and improves discrimination performance through an adaptive threshold. However, the shadow of a moving target is very dim [
16] and easily submerged by surrounding clutter, which makes its features less salient. Tian et al. [
16] proposed a region-partitioning-based algorithm to search for shadows, but this algorithm relies on overly complex mathematical machinery and offers poor flexibility, extensibility, and adaptability. Liu et al. [
17] proposed a local feature analysis method based on the OTSU’s method [
18] to detect moving target shadows, but their method requires modeling the background clutter, which is challenging for diverse backgrounds. Zhang et al. [
19] proposed a Tsallis-entropy-based [
34] segmentation threshold algorithm to classify background and shadow pixels, but obtaining the optimal threshold from the complex mathematical equations is rather time-consuming. Shang et al. [
20] leveraged the idea of change detection to detect moving target shadows in THz video SAR images acquired with their own terahertz radar system, but change detection (i.e., background subtraction) works only on strictly static backgrounds and is sensitive to clutter. He et al. [
21] proposed an improved difference-based moving target shadow detection method in which morphological filtering was used to suppress false alarms. However, their approach required a series of well-designed and complicated preprocessing techniques, reducing the application scope of the method. They improved their previous method [
21] in the report of [
22] using the speeded-up robust features (SURF) algorithm [
35], but the computational cost increased greatly. In short, the above traditional methods are all computationally heavy, weak in generalization, and dependent on troublesome manual feature extraction. Moreover, they are both time-consuming and labor-intensive.
Modern deep learning methods rely on multilayer neural networks to automatically extract features from given training samples. In the computer vision (CV) community, many deep learning-based methods using convolutional neural networks (CNNs) have greatly boosted object detection performance, e.g., Faster R-CNN [
36], feature pyramid network (FPN) [
37], you only look once (YOLO) [
38], RetinaNet [
39], and CenterNet [
40]. Some scholars [
41,
42] have applied them to detection and classification. Nowadays, many scholars in the video SAR community have also applied them to moving target shadow detection. Ding et al. [
24] applied Faster R-CNN to detect shadows, but the raw Faster R-CNN is designed for generic object detection in natural optical images, so using it directly without adapting it to the specific characteristics of the video SAR task is questionable. Wen et al. [
25] adopted dual Faster R-CNN detectors to simultaneously detect shadows in the image spatial domain and the range-Doppler (RD) spectrum domain. However, they did not comprehensively mine the shadow features in the image spatial domain, which led to missed detections and false alarms once the raw video SAR echo was unavailable. Huang et al. [
26] proposed an improved Faster R-CNN that boosts per-frame features by incorporating spatiotemporal information extracted from multiple adjacent frames using 3D CNNs, which improved shadow detection performance. However, for online shadow detection, future image frames are not available to build a spatiotemporal information space that enhances past frames. Moreover, this method requires accurate registration, a fixed scene, and a constant number of sequence images [
17]. These strict requirements inevitably limit the algorithm's applicability in the velocity-independent continuous tracking radar mode of video SAR, where the scene often changes constantly [
17]. Therefore, to achieve more flexible moving target detection and tracking, it is preferable to detect shadows using single-frame images [
17]. Yan et al. [
27] adopted FPN to detect shadows using their self-developed video MiniSAR system. They used k-means to cluster video SAR targets and then used the clustering results to set the anchor box scales, so as to speed up network convergence and improve accuracy. However, their preset anchor boxes cannot cope with shadow deformation once the motion speed changes. Therefore, their model is ineffective for noncooperative moving targets. Zhang et al. [
28] also used FPN to detect shadows and added a dense local regression module to boost shadow localization performance. However, their experimental dataset contains only simple scenes, which is insufficient to confirm the generality of the proposed method. Hu et al. [
29] adopted YOLOv3 equipped with FPN to provide initial shadow detection results for the follow-up tracker on the basis of the joint detector embedding model (JDE) [
33]. However, YOLOv3 may not be robust enough for more complex scenes. Additionally, in SAR surveillance videos, moving target shadows usually occupy relatively few pixels and thus appear small, which is rather challenging for YOLOv3 given its poor performance on small targets [
40,
41]. Wang et al. [
30] adopted CenterNet [
40] to detect shadows inspired by FairMOT [
43] and CenterTrack [
44], but this kind of anchor-free detector still lacks the capacity to deal with complex scenes and cases [
45], bringing about many missed detections and false alarms. It should be noted that Lu et al. [
46] proposed a RetinaTrack for online single-stage joint detection and tracking where RetinaNet [
39] was used to detect targets. RetinaNet could also be used to detect moving target shadows, because it addresses the extreme imbalance between foregrounds and backgrounds by introducing a focal loss, and this imbalance is universal in SAR images. Thus, we apply this focal loss to moving target shadow detection, for the first time, in this paper. To sum up, although the existing deep learning-based moving target shadow detectors have achieved competitive detection results, their detection performance is still limited. For one thing, they tend to generate missed detections due to their limited feature-extraction capacity in complex scenes. For another, they tend to produce numerous false alarms due to their poor foreground–background discrimination capacity.
Therefore, to handle the above problems, this paper proposes a novel deep learning network named ShadowDeNet for better moving target shadow detection in video SAR images. Five core contributions ensure the excellent performance of ShadowDeNet: (1) a histogram equalization shadow enhancement (HESE) preprocessing technique is used to enhance shadow saliency (i.e., contrast ratio) and facilitate the subsequent feature extraction, (2) a transformer self-attention mechanism (TSAM) is proposed to pay more attention to regions of interest and suppress clutter interference, (3) a shape deformation adaptive learning (SDAL) network is designed based on deformable convolutions [
47] to learn the deformed shadows of moving targets and cope with motion speed variations, (4) a semantic-guided anchor-adaptive learning (SGAAL) network is designed to generate optimized anchors that adaptively match shadow location and shape, and (5) an online hard-example mining (OHEM) training strategy [
48] is adopted to select typical difficult negative samples and boost background discrimination capacity. We conduct extensive ablation studies to confirm the effectiveness of each of the above contributions. Finally, the experimental results on the open Sandia National Laboratories (SNL) video SAR data [
49] reveal the state-of-the-art moving target shadow detection performance of ShadowDeNet compared with five other competitive methods. Specifically, ShadowDeNet surpasses the experimental baseline Faster R-CNN by 9.00% in
f1 accuracy, and it is also superior to the best existing model by 4.96% in
f1 accuracy. Furthermore, ShadowDeNet sacrifices only a small amount of detection speed, which remains within an acceptable range.
2. Methodology
ShadowDeNet is based on the mainstream two-stage framework Faster R-CNN [
36]. A two-stage detector usually has better detection accuracy than a one-stage one [
40], so we select the former as our experimental baseline in the paper.
Figure 2 shows the shadow detection framework of ShadowDeNet. Faster R-CNN consists of a backbone network, a region proposal network (RPN), and a Fast R-CNN [
36]. HESE is a preprocessing tool. TSAM and SDAL are used to improve the feature extraction ability of the backbone network. SGAAL is used to improve the proposal generation ability of RPN. OHEM is used to improve the detection ability of Fast R-CNN.
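For clarity, the following minimal PyTorch-style sketch shows how these components could be composed; the class and module names are ours for illustration only and do not reproduce the authors' implementation.

```python
import torch
from torch import nn

# Hypothetical top-level composition of ShadowDeNet (names are illustrative only).
# Each stage mirrors the roles described above: a ResNet-style backbone enhanced by
# TSAM/SDAL, an RPN guided by SGAAL, and a Fast R-CNN head trained with OHEM.
class ShadowDeNetSkeleton(nn.Module):
    def __init__(self, backbone: nn.Module, rpn: nn.Module, roi_head: nn.Module):
        super().__init__()
        self.backbone = backbone   # ResNet-50 with TSAM + SDAL inserted
        self.rpn = rpn             # RPN with SGAAL anchor generation
        self.roi_head = roi_head   # Fast R-CNN head (OHEM applied during training)

    def forward(self, images: torch.Tensor):
        feats = self.backbone(images)                  # shared feature maps
        proposals = self.rpn(feats)                    # location/shape-adaptive proposals
        detections = self.roi_head(feats, proposals)   # refined boxes and scores
        return detections
```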
From
Figure 2, we first preprocess the input video SAR images using the proposed HESE technique to enhance the shadows' saliency, i.e., their contrast ratio. The detailed descriptions are introduced in
Section 2.1. Then, a backbone network is used to extract shadow features. In ShadowDeNet, without losing generality, we select the commonly-used ResNet-50 [
50] as the backbone network. One could use a more advanced backbone network, which might achieve better performance, but this is beyond the scope of this article. In the backbone network, the proposed TSAM and SDAL are embedded, both of which enable better feature extraction. The former pays more attention to regions of interest to suppress clutter interference based on the attention mechanism [
51], which is introduced in detail in
Section 2.2. The latter is used to adapt to moving target deformed shadows to overcome motion speed variations based on deformable convolutions [
47], which is introduced in detail in
Section 2.3.
Next, the feature maps are fed into an RPN to generate regions of interest (ROIs), i.e., proposals. In the RPN, a classification network outputs a 2
k-dimensional vector to represent the proposal category, i.e., a positive or negative sample. Here,
k denotes the number of anchor boxes, which is set to nine in line with the raw Faster R-CNN. The determination of positive and negative samples is based on the intersection over union (IOU), also called Jaccard distance [
52,
53], with the corresponding ground truth (GT). Similar to the original Faster R-CNN, IOU > 0.70 means positive samples, while IOU < 0.30 means negative samples. Samples with 0.30 < IOU < 0.70 are discarded. Moreover, a regression network outputs a 4
k-dimensional vector to represent the proposal location and shape, i.e., (
x,
y,
w,
h), where (
x,
y) denotes the proposal central coordinate,
w denotes the width, and
h denotes the height. In essence, the regression locates shadows: its inputs are the feature maps extracted by the backbone network, and its outputs are the possible shadow locations. In the RPN, the proposed SGAAL is inserted to improve the quality of the proposals. It generates optimized anchors that adaptively match shadow location and shape, inspired by the works of [
23,
45], which is introduced in detail in
Section 2.4.
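As an illustration of the IOU-based sample assignment described above, the following sketch labels anchors with torchvision's box_iou using the 0.70/0.30 thresholds; the function name and tensor layout are our own assumptions for illustration.

```python
import torch
from torchvision.ops import box_iou

def label_proposals(anchors: torch.Tensor, gt_boxes: torch.Tensor,
                    pos_thr: float = 0.70, neg_thr: float = 0.30) -> torch.Tensor:
    """Assign 1 (positive), 0 (negative), or -1 (discarded) to each anchor box.

    anchors:  (N, 4) boxes in (x1, y1, x2, y2) format
    gt_boxes: (M, 4) ground-truth shadow boxes in the same format
    """
    labels = torch.full((anchors.size(0),), -1, dtype=torch.long)  # -1 = discarded
    if gt_boxes.numel() == 0:
        return torch.zeros_like(labels)        # no shadows: everything is background
    iou = box_iou(anchors, gt_boxes)           # (N, M) pairwise IoU (Jaccard) matrix
    best_iou, _ = iou.max(dim=1)               # best overlap of each anchor with any GT
    labels[best_iou > pos_thr] = 1             # IOU > 0.70 -> positive sample
    labels[best_iou < neg_thr] = 0             # IOU < 0.30 -> negative sample
    return labels                              # 0.30..0.70 stays -1 and is ignored
```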
Afterwards, one ROIAlign layer [
54] is used to map the proposals to the feature maps in the backbone network for the subsequent refined classification and regression. Note that the raw Faster R-CNN used one ROIPooling layer to reach such aim, but we replace it with ROIAlign because ROIAlign can address the problem of misalignments caused by twice-quantization [
54] so as to avoid a feature loss. Finally, the refined classification and regression are completed by Fast R-CNN [
55] to output the final shadow detection results. Moreover, in training, in Fast R-CNN, OHEM is applied to select typical difficult negative samples to boost background discrimination capacity inspired by the works of [
48,
53], which is introduced in detail in
Section 2.5.
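The ROIAlign step can be illustrated with torchvision's roi_align, which uses bilinear sampling instead of the twice-quantization of ROIPooling; the feature-map stride, output size, and tensor values below are illustrative only.

```python
import torch
from torchvision.ops import roi_align

# Illustrative values: a batch of 2 feature maps (256 channels, stride 16) and 5 proposals.
features = torch.randn(2, 256, 50, 50)
# Proposals given as (batch_index, x1, y1, x2, y2) in input-image coordinates.
proposals = torch.tensor([[0,  32.,  48., 200., 180.],
                          [0, 100., 120., 260., 300.],
                          [1,  10.,  10., 120.,  90.],
                          [1, 300., 200., 420., 330.],
                          [1,  64.,  64., 128., 128.]])

# spatial_scale = 1/16 maps image coordinates onto the stride-16 feature map;
# sampling_ratio > 0 fixes the number of bilinear sampling points per output bin.
pooled = roi_align(features, proposals, output_size=(7, 7),
                   spatial_scale=1.0 / 16, sampling_ratio=2, aligned=True)
print(pooled.shape)  # torch.Size([5, 256, 7, 7]) -> fed to the Fast R-CNN head
```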
Next, we introduce HESE, TSAM, SDAL, SGAAL, and OHEM in detail in the following subsections.
2.1. Histogram Equalization Shadow Enhancement (HESE)
Moving target shadows in video SAR images are rather dim [
16] and are easily submerged by surrounding clutter, making them less salient from the human vision perspective.
Figure 3 shows a video SAR image and the corresponding shadow ground truths. In
Figure 3a, it is difficult for human vision to find the shadow of a moving target quickly and clearly if one does not refer to the ground truths in
Figure 3b. The contrast between the shadow and its surroundings is very low, resulting in an unclear appearance. Moreover, the metal patrol gate also has a serious negative effect on shadow detection. This experimental data is introduced in detail in
Section 3.1. Therefore, some image preprocessing is necessary; otherwise, the benefits of feature learning would be reduced.
Many previous scholars proposed various techniques for image preprocessing, e.g., denoising [
8], pixel density clustering [
24], morphological filtering [
17], visual saliency-based enhancement [
15], etc. However, they all rely heavily on expert experience and involve a series of cumbersome steps, reducing model flexibility. Therefore, we adopt the simple but effective histogram equalization to preprocess video SAR images. For brevity, we denote this process histogram equalization shadow enhancement (HESE).
For a video SAR image I, if ni denotes the number of occurrences of the gray value i (0 ≤ i < 256), then the occurrence probability of pixels with the gray value i is
$$p_I(i) = \frac{n_i}{n}, \quad 0 \le i < 256,$$
where n denotes the total number of pixels in the image, and pI(i) is, in effect, the histogram of the image at pixel value i, normalized to [0, 1]. The HESE then applies the standard histogram equalization mapping
$$T(i) = \left\lfloor 255 \sum_{j=0}^{i} p_I(j) \right\rfloor,$$
which replaces every pixel of gray value i with T(i).
Figure 4 shows the image histogram of the image in
Figure 3a before HESE and after HESE. From
Figure 4, after HESE, the overall gray-value distribution (marked in red) is close to a uniform distribution, so the image has a large gray dynamic range and high contrast, and its details are richer. In essence, HESE stretches the image nonlinearly and redistributes the pixel values so that the number of pixels in each gray range is roughly equal. In this way, the contrast of the peak in the middle of the original histogram is enhanced, while the contrast of the valleys on both sides is reduced. Overall, the contrast of the entire image increases.
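As a concrete illustration, the following NumPy sketch implements the HESE mapping defined above (plain global histogram equalization of an 8-bit frame); the function name and file name are ours, and OpenCV's cv2.equalizeHist offers an essentially equivalent built-in.

```python
import numpy as np
import cv2

def hese(frame_u8: np.ndarray) -> np.ndarray:
    """Histogram equalization shadow enhancement for one 8-bit grayscale SAR frame."""
    # Occurrence probability p_I(i) = n_i / n for each gray value i in [0, 255].
    hist = np.bincount(frame_u8.ravel(), minlength=256).astype(np.float64)
    p = hist / frame_u8.size
    # Cumulative distribution mapped back to [0, 255]: T(i) = floor(255 * sum_{j<=i} p(j)).
    cdf = np.cumsum(p)
    lut = np.floor(255.0 * cdf).astype(np.uint8)
    return lut[frame_u8]                      # apply the mapping pixel-wise

frame = cv2.imread("videosar_frame.png", cv2.IMREAD_GRAYSCALE)  # hypothetical file name
if frame is not None:
    enhanced = hese(frame)                    # cv2.equalizeHist(frame) is comparable
```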
Figure 5 shows the moving target shadow enhancement results. From
Figure 5, one can clearly find that after HESE, the shadow in the zoom region becomes clearer. In
Figure 5a, the raw shadow is hardly visible to the human eye, but in
Figure 5b, anyone can find the shadow quickly and easily. We also evaluate the shadow quality in the zoom region by using the classic 4-neighborhood method [
56]. The evaluation results are shown in
Table 1. From
Table 1, the shadow contrast with HESE is far larger than that without HESE (29,215.43 >> 20,979.31). The contrast enhancement reaches ~40%, i.e., (29,215.43 − 20,979.31)/20,979.31. Moreover, the running time is just 13.06 ms, i.e., 7.66 images per second, which is acceptable. Compared with many previous preprocessing methods [
8,
15,
17,
24], HESE is rather fast with a rather simple theory and workflow. Readers can find more shadow enhancement results of other frames (#3, #13, #23, #33) in
Figure 6.
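For reference, the shadow contrast reported in Table 1 can be approximated by a 4-neighborhood contrast measure such as the one sketched below (the mean squared gray-level difference between each pixel and its four direct neighbors); this is one common formulation and only an assumption about the exact metric of [56].

```python
import numpy as np

def four_neighborhood_contrast(img: np.ndarray) -> float:
    """Approximate 4-neighborhood contrast: average squared difference between
    each pixel and its up/down/left/right neighbors (one common formulation)."""
    img = img.astype(np.float64)
    diffs = [
        (img[1:, :] - img[:-1, :]) ** 2,   # vertical neighbor pairs
        (img[:, 1:] - img[:, :-1]) ** 2,   # horizontal neighbor pairs
    ]
    num_pairs = sum(d.size for d in diffs)
    return float(sum(d.sum() for d in diffs) / num_pairs)
```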
2.2. Transformer Self-Attention Mechanism (TSAM)
Attention mechanisms are widely used in the CV community; they adaptively learn feature weights to focus on important information and suppress useless information. So far, scholars in the SAR community have applied them to various tasks, e.g., SAR automatic target recognition (ATR) [
57,
58], SAR target detection [
59,
60] and classification [
61,
62], and so on. Recently, transformer detectors [
63,
64] have received increasing attention in the CV community. The remarkable characteristic of transformer models is their internal self-attention mechanism, which can effectively capture important long-range dependencies across the entire location space [
65]. Video SAR images contain considerable clutter interference, as can be seen in
Figure 3a, so we adopt such a self-attention mechanism to suppress it and focus on more valuable regions of interest. We call this process the transformer self-attention mechanism (TSAM). Specifically, we insert TSAM into the backbone network, which enables efficient information flow and the extraction of more representative features. As mentioned before, we selected ResNet-50 as our backbone network in
Figure 2; thus, we insert TSAM into the residual block to promote better residual learning, as shown in
Figure 7.
In
Figure 7, the first 1 × 1 convolution (conv) is used to reduce the input channel dimension, and the second 1 × 1 conv is used to increase the output channel dimension for the subsequent adding operation. The 3 × 3 conv is used to extract shadow features. TSAM is placed after the 3 × 3 conv, meaning that the extracted shadow features are prescreened by TSAM. In this way, the important features are retained while useless interference is suppressed. This practice is similar to that in the convolutional block attention module (CBAM) [
66] and squeeze-and-excitation (SE) [
67], and can be described as
$$F' = F + f_{1\times1}\big(\mathrm{TSAM}\big(f_{3\times3}\big(f_{1\times1}(F)\big)\big)\big),$$
where F denotes the input of a residual block, F' denotes the output, f1×1(∙) denotes the 1 × 1 conv operation, f3×3(∙) denotes the 3 × 3 conv operation, and TSAM(∙) denotes the TSAM operation.
Figure 8 shows the detailed implementation process of
TSAM. In
Figure 8,
H denotes the height of the input feature map
X,
W denotes the width, and
C denotes the channel number. According to Wang et al. [
68], the general transformer self-attention can be summarized as
$$y_i = \frac{1}{C(x)} \sum_{\forall j} f(x_i, x_j)\, g(x_j),$$
where i is the index of the required output location (i.e., the response of the i-th location is to be calculated), and j is the index that enumerates all possible locations, i.e., ∀j. x denotes the input feature map, and y denotes the output feature map with the same dimensions as x. The pairwise function f computes the relationship between i and all j. The unary function g calculates the representation of the input feature map at the j-th location. Finally, the response is normalized by a factor C(x).
The pairwise function f can be implemented with an embedded Gaussian function so as to compute the similarity between i and all j in an embedding space, i.e.,
$$f(x_i, x_j) = e^{\theta(x_i)^{T} \phi(x_j)},$$
where θ denotes the embedding of xi and ϕ denotes the embedding of xj. From
Figure 8, they are implemented using two 1 × 1 convs Wθ and Wϕ, that is, θ(xi) = Wθxi and ϕ(xj) = Wϕxj. Here, to reduce the computation cost, their kernel numbers are set to C/2 if the input channel number is C. The normalization factor is set as C(x) = Σ∀j f(xi, xj). Thus, for a given i, f(xi, xj)/C(x) becomes a softmax computation along the dimension j, where softmax is defined by softmax(zj) = e^{zj}/Σk e^{zk} [
69]. Here, the softmax computation is responsible for generating the weight (i.e., the importance level) of each location. Similarly, the representation of the input feature map at the j-th location is also calculated in an embedding space using another 1 × 1 conv Wg. With matrix multiplication, the output of the self-attention Y is obtained. Finally, in order to complete the residual operation (i.e., the adding), one 1 × 1 conv Wz is used to increase the channel number from C/2 back to C, i.e.,
$$Z = W_z Y + X.$$
In essence, TSAM calculates the interaction between any two positions and directly captures long-range dependencies without being limited to adjacent points. It is equivalent to constructing a convolution kernel as large as the feature map itself, so that more background context information can be maintained. In this way, the network can focus on important regions of interest and suppress clutter interference or other negative effects of useless backgrounds.
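A compact PyTorch sketch of an embedded-Gaussian self-attention (non-local) block of the kind TSAM describes is given below, following the formulation of Wang et al. [68]; the class name and the dummy usage at the end are ours, and the exact placement inside ResNet-50 is not shown.

```python
import torch
from torch import nn

class TSAMBlock(nn.Module):
    """Embedded-Gaussian self-attention (non-local) block in the spirit of TSAM."""
    def __init__(self, channels: int):
        super().__init__()
        inter = channels // 2                       # reduced embedding dimension C/2
        self.theta = nn.Conv2d(channels, inter, 1)  # W_theta
        self.phi   = nn.Conv2d(channels, inter, 1)  # W_phi
        self.g     = nn.Conv2d(channels, inter, 1)  # W_g
        self.w_z   = nn.Conv2d(inter, channels, 1)  # W_z restores the channel number

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        theta = self.theta(x).flatten(2).transpose(1, 2)   # (B, HW, C/2)
        phi   = self.phi(x).flatten(2)                     # (B, C/2, HW)
        g     = self.g(x).flatten(2).transpose(1, 2)       # (B, HW, C/2)
        attn  = torch.softmax(theta @ phi, dim=-1)         # f(x_i, x_j)/C(x): softmax over j
        y     = (attn @ g).transpose(1, 2).reshape(b, -1, h, w)  # aggregated features Y
        return x + self.w_z(y)                             # residual: Z = W_z Y + X

# Minimal usage check on a dummy feature map
feat = torch.randn(1, 64, 32, 32)
out = TSAMBlock(64)(feat)   # same shape as the input: (1, 64, 32, 32)
```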
2.3. Shape Deformation Adaptive Learning (SDAL)
The contrast between the shadow generated by the moving target and its background area, the gradient information of the shadow intensity along the moving direction, and the shape of the shadow are all closely related to the moving speed of the target [
7]. On the premise that the shadow can be formed, the smaller the moving speed of the target, the greater the shadow extension [
70], and the clearer the shadow contour of the moving target. Conversely, the larger the moving speed of the target, the lower the shadow extension, and the more blurred the shadow contour. In other words, when the motion speed of the moving target changes, the shadow shape changes. When the motion speed changes continuously over multiframe video SAR images, the shadow of the same target becomes deformed. This challenges the robustness of the detector. Readers can refer to [
11] for more details about the shadow size relationship with the speed.
Figure 9 shows the moving target shadow deformation with the change of moving speed. In
Figure 9, because of the traffic light's stop signal, vehicle-A's speed decreases steadily. One can see that the same vehicle-A exhibits shadows with different shapes (in the zoomed region) in different frames of the video SAR. Thus, a good shadow detector should be robust to such shadow deformation. However, the feature extraction of classical convolutional neural networks depends on convolution kernels whose geometric structure is fixed; only fixed local feature information is extracted each time a convolution is performed. Thus, classical convolution cannot solve the shape deformation problem. Fortunately, the deformable convolution recently proposed by Dai et al. [
47] can overcome this problem because its convolution kernel can deform freely to adapt to the geometric deformation of the target. The deformable convolution changes the sampling positions of the standard convolution kernel by adding additional offsets at the sampling points. These offsets can be learned during training without additional supervision. We call the above shape deformation adaptive learning (SDAL).
Figure 10 shows a schematic of different convolutions. From
Figure 10, the deformable convolution can effectively resist shadow deformation through the learned location offsets.
Figure 11 shows the implementation of SDAL.
The standard convolution kernel is augmented with offsets ∆pn, which are adaptively learned during training to model various shape features, i.e.,
$$y(p_0) = \sum_{p_n \in \Re} w(p_n)\, x(p_0 + p_n + \Delta p_n),$$
where p0 denotes each location in the output feature map y, ℜ denotes the convolution sampling region, w denotes the weight parameters, x denotes the input, y denotes the output, and ∆pn denotes the learned offset at the n-th sampling location. ∆pn is typically fractional, so bilinear interpolation is used to ensure the smooth implementation of the convolution, i.e.,
$$x(p) = \sum_{q} g(q_x, p_x)\, g(q_y, p_y)\, x(q),$$
where p = p0 + pn + ∆pn denotes an arbitrary (fractional) location, q enumerates all integral spatial locations in x, and g(a, b) = max(0, 1 − |a − b|). We add another convolution layer (marked in purple in
Figure 11) to learn the offsets ∆
pn, and then, the standard convolution combining ∆
pn is performed on the input feature maps. Moreover, inspired by [
47], the traditional convolutions in the high-level layers, i.e., conv3_x, conv4_x, and conv5_x of ResNet-50, are replaced with deformable ones to extract more robust shadow features. This is because even a slight change in the receptive field size of the high-level layers makes a remarkable difference to the subsequent networks, yielding a better capacity for geometrically modeling the shape-changeable moving target shadows [
71].
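The following minimal sketch illustrates the SDAL idea with torchvision's DeformConv2d: an extra convolution predicts per-location offsets ∆pn, which then deform the sampling grid of a standard 3 × 3 convolution; the class name is ours, and the integration into conv3_x–conv5_x of ResNet-50 is omitted.

```python
import torch
from torch import nn
from torchvision.ops import DeformConv2d

class SDALConv(nn.Module):
    """3x3 deformable convolution with its offset-prediction branch (SDAL sketch)."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        # Offset branch (the "purple" conv in Figure 11): 2 offsets (dx, dy)
        # for each of the 3x3 = 9 sampling points -> 18 output channels.
        self.offset_conv = nn.Conv2d(in_ch, 2 * 3 * 3, kernel_size=3, padding=1)
        nn.init.zeros_(self.offset_conv.weight)   # start from the regular sampling grid
        nn.init.zeros_(self.offset_conv.bias)
        self.deform_conv = DeformConv2d(in_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        offsets = self.offset_conv(x)             # learned (fractional) offsets dp_n
        return self.deform_conv(x, offsets)       # bilinear sampling handled internally

feat = torch.randn(1, 256, 32, 32)
out = SDALConv(256, 256)(feat)                    # output shape: (1, 256, 32, 32)
```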
2.4. Semantic-Guided Anchor-Adaptive Learning (SGAAL)
Anchors are fundamental to modern object detection; they are usually a set of manually designed boxes used as references for classification and bounding box regression. However, previous video SAR moving target shadow detectors mostly adopt preset anchors with fixed shapes, sizes, or scales. In other words, the anchors have unchanging scales and aspect ratios, potentially degrading the feature learning of moving target shadows. Moreover, the raw anchors are arranged densely and uniformly over the feature map, which does not match the characteristics of video SAR images, as in
Figure 12a. This is because moving target shadows in video SAR images are distributed sparsely and unevenly; if dense and uniform anchors are used, many false alarms will be generated. Therefore, inspired by Wang et al. [
45], we design a novel semantic-guided anchor-adaptive learning (SGAAL) tool to generate high-quality location-adaptive and shape-adaptive anchors in the RPN, as in
Figure 12b. Here, we adopt high-level deep semantic features to guide anchor generation, which ensures higher anchor quality [
45].
The aim of SGAAL is to adaptively obtain the anchor location and the corresponding shape, that is, the parameters (
x,
y,
w,
h) of anchors in the image
I, where (
x,
y) denotes the spatial coordinate of the anchor center,
w denotes the width of the anchor box, and
h denotes the height of the anchor box. Therefore, SGAAL can be described by
where
p(
x,
y|
I) denotes the prediction process of the anchor location for a given image
I, and
p(
w,
h|
x,
y,
I) denotes the prediction process of the anchor shape for a given image
I and the corresponding known location. In other words, SGAAL will first adaptively predict the location (
x,
y) of anchors, and then adaptively predict the shape (
w,
h) of anchors.
Figure 13 shows the detailed implementation process of SGAAL. In
Figure 13, the input semantic feature is denoted by
Q. Its height is denoted by
H, its width is denoted by
W, and its channel number is denoted by
C.
From
Figure 13, we use a 1 × 1 conv
WL, whose output channel number is set to 1, to predict the anchor location; it encodes the whole
H ×
W location space. This 1 × 1 conv layer is followed by a
sigmoid activation function, defined as 1/(1 +
e−x) to represent the occurrence probability of the shadow location. Here, the location threshold is denoted by
εL. That is, when the value of the location (
x,
y) is bigger than
εL, this location is assigned a positive “1” label (i.e., the network should generate anchors at this location); otherwise, it is assigned a negative “0” label (i.e., the network should not generate anchors there). As locations with shadows occupy only a small portion of the whole feature map, we adopt the focal loss (FL) of RetinaNet [
39] to train this anchor location prediction network so as to avoid being overwhelmed by the large number of negative samples, i.e.,
$$\mathrm{FL}(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t), \qquad p_t = \begin{cases} p, & y = 1 \\ 1 - p, & \text{otherwise}, \end{cases}$$
where y denotes the ground-truth class (y = 1 means positive; otherwise it is negative), p denotes the predicted probability ranging from 0 to 1, γ denotes the focusing parameter, set to 2 empirically, and αt denotes the weighting factor, set to 0.25 empirically.
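A short sketch of this focal loss with γ = 2 and αt = 0.25 is given below; it follows the standard RetinaNet formulation rather than the authors' exact code, and the function name is ours.

```python
import torch

def focal_loss(pred_prob: torch.Tensor, target: torch.Tensor,
               gamma: float = 2.0, alpha: float = 0.25) -> torch.Tensor:
    """Binary focal loss on predicted probabilities (after the sigmoid).

    pred_prob: predicted shadow-location probability in (0, 1), any shape
    target:    ground-truth labels (1 = anchor location, 0 = background), same shape
    """
    eps = 1e-6
    p_t = torch.where(target == 1, pred_prob, 1.0 - pred_prob)
    alpha_t = torch.where(target == 1,
                          torch.full_like(pred_prob, alpha),
                          torch.full_like(pred_prob, 1.0 - alpha))
    loss = -alpha_t * (1.0 - p_t) ** gamma * torch.log(p_t.clamp(min=eps))
    return loss.mean()
```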
We use a 1 × 1 conv
WS, whose output channel number is set to 2, to predict the anchor shape, because we need to obtain both the anchor width
w and height
h. The anchor shape prediction is across the whole
H ×
W location space. However, the shape predictions whose corresponding location predictions are lower than the threshold
εL are filtered. This threshold
εL will be determined experimentally in
Section 5.4. The bounded IOU loss [
72] is used to train this anchor shape prediction network because it is more sensitive to box spatial locations. In this loss, G denotes the ground-truth box and P denotes the prediction box.
As a result, the anchor location and shape are obtained by combining the outputs of
WL and
WS. Note that Wang et al. [
45] pointed out that the feature for a large anchor should encode content over a large region, while the features for small anchors should accordingly have smaller scopes. Thus, following their practice, we also devise an anchor-guided feature adaptation component, which transforms the feature at each individual location
i based on the underlying anchor shape, i.e.,
$$q_i' = A(q_i, w_i, h_i),$$
where
qi denotes the
i-th location element of the raw feature map
Q, and
qi’ denotes the
i-th location element of the transformed feature map
Q’, and
wi and
hi denote the width and height of anchors at the
i-th location. Moreover,
A is a 3 × 3 deformable convolutional layer
WA, which predicts an offset field from the output of the anchor shape prediction branch; the learned offsets are then applied to the original feature map to obtain the final feature map
Q’.
Finally, based on the adaptively learned anchors, the high-quality proposals are generated, and then they are mapped to the transformed feature map
Q’ by ROIAlign to extract their corresponding feature regions for the subsequent classification and regression in Fast R-CNN, as in
Figure 2. In this way, the optimized anchors adaptively match shadow location and shape, enabling better false-alarm suppression and more attentive shadow feature learning.
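The following sketch outlines the SGAAL prediction branches in the spirit of guided anchoring [45]: a 1-channel location branch with a sigmoid, a 2-channel shape branch, and a deformable-convolution feature adaptation; the class name, the default threshold value, and the offset-prediction wiring are illustrative assumptions, and the training losses are omitted.

```python
import torch
from torch import nn
from torchvision.ops import DeformConv2d

class SGAALHead(nn.Module):
    """Semantic-guided anchor-adaptive learning head (sketch only)."""
    def __init__(self, channels: int, loc_threshold: float = 0.5):
        super().__init__()
        self.loc_conv = nn.Conv2d(channels, 1, 1)       # W_L: anchor location probability
        self.shape_conv = nn.Conv2d(channels, 2, 1)     # W_S: anchor (w, h) per location
        self.offset_conv = nn.Conv2d(2, 2 * 3 * 3, 1)   # offsets derived from predicted shape
        self.adapt_conv = DeformConv2d(channels, channels, 3, padding=1)  # W_A
        self.loc_threshold = loc_threshold              # epsilon_L, tuned experimentally

    def forward(self, q: torch.Tensor):
        loc_prob = torch.sigmoid(self.loc_conv(q))      # p(x, y | I), shape (B, 1, H, W)
        shape_pred = self.shape_conv(q)                 # p(w, h | x, y, I), shape (B, 2, H, W)
        offsets = self.offset_conv(shape_pred)          # anchor-shape-dependent offsets
        q_adapted = self.adapt_conv(q, offsets)         # Q' = A(Q, w, h)
        anchor_mask = loc_prob > self.loc_threshold     # keep anchors only where p > eps_L
        return loc_prob, shape_pred, q_adapted, anchor_mask

feats = torch.randn(1, 256, 32, 32)
loc, shape_wh, feats_adapted, mask = SGAALHead(256)(feats)
```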
2.5. Online Hard-Example Mining (OHEM)
SGAAL removes many negative samples through its location judgment, but it still does not solve the imbalance between positive and negative samples. For a given location in the feature map, the number of generated negative samples is still far larger than that of positive ones, because background pixels usually occupy a larger proportion. Among the large number of negative samples, it is necessary to select the more typical difficult negatives and abandon the easy ones to enhance background discrimination capacity. Online hard-example mining (OHEM) is an advanced method for mining hard negative samples during training. It was proposed by Shrivastava et al. [
48] in 2016 and selects difficult samples as training samples during the training of the detection model, so as to improve the model parameters and achieve better convergence. Difficult (hard) samples are those that are hard to distinguish and therefore have large training loss values. From a simple sample that is easy to classify correctly, the model can learn little additional information. Thus, hard samples are more valuable for model optimization and are more worth mining and utilizing.
Figure 14 shows the detailed implementation process of OHEM. In
Figure 14, the classification loss of Fast R-CNN is denoted
losscls, and the regression loss is denoted
lossreg. We sum the
losscls and
lossreg of the negative samples. Their sum loss
losssum is then ranked, and the top k samples are selected into the hard-negative sample pool, where k is set to 256, inspired by [
48]. Moreover, the number of positive samples is also set to 256 to avoid becoming trapped in a local optimum for a particular positive or negative category. When the number of samples in the pool reaches a batch size, they are mapped onto the feature maps by ROIAlign again and trained repeatedly and with emphasis. In this way, Fast R-CNN learns more representative background features to further suppress false alarms. More details can be found in [
48].
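A minimal sketch of the OHEM selection step is given below: per-ROI losses are computed without reduction, negatives are ranked by their summed classification and regression loss, and only the top-k (k = 256) hardest negatives are kept together with the positives; the function name is ours, and the batching and ROIAlign re-mapping are omitted.

```python
import torch

def ohem_select(loss_cls: torch.Tensor, loss_reg: torch.Tensor,
                labels: torch.Tensor, k: int = 256) -> torch.Tensor:
    """Return indices of ROIs kept for training: all positives plus the
    k hardest negatives ranked by loss_sum = loss_cls + loss_reg.

    loss_cls, loss_reg: per-ROI losses computed with reduction='none', shape (N,)
    labels:             per-ROI labels (1 = shadow, 0 = background), shape (N,)
    """
    loss_sum = loss_cls + loss_reg
    neg_idx = torch.nonzero(labels == 0, as_tuple=False).squeeze(1)
    pos_idx = torch.nonzero(labels == 1, as_tuple=False).squeeze(1)
    k = min(k, neg_idx.numel())
    # Hard-negative pool: negatives with the largest summed loss.
    _, order = loss_sum[neg_idx].sort(descending=True)
    hard_neg_idx = neg_idx[order[:k]]
    return torch.cat([pos_idx, hard_neg_idx])

# Usage: keep = ohem_select(cls_losses, reg_losses, roi_labels)
#        total_loss = (cls_losses[keep] + reg_losses[keep]).mean()
```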