3.1. Feature Fusion
The hierarchical backbone network yields feature maps of various sizes; unless otherwise mentioned, the backbone network in our experiments is ResNet-50. Consider the feature maps indexed in bottom-to-top order ($F_i \in \mathbb{R}^{H_i \times W_i \times C_i}$, $i = 1, \ldots, n$), in which $H$, $W$ and $C$ are the height, width and channel dimensions, respectively.
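For concreteness, the following minimal sketch shows how such multi-level feature maps can be obtained from a torchvision ResNet-50; the 512 × 512 input size is chosen only for illustration and is not a setting taken from this paper.

```python
import torch
from torchvision.models import resnet50

backbone = resnet50(weights=None)
backbone.eval()

def extract_features(x):
    """Return the four stage outputs of ResNet-50 in bottom-to-top order."""
    x = backbone.conv1(x)
    x = backbone.bn1(x)
    x = backbone.relu(x)
    x = backbone.maxpool(x)
    f1 = backbone.layer1(x)   # stride 4,  256 channels
    f2 = backbone.layer2(f1)  # stride 8,  512 channels
    f3 = backbone.layer3(f2)  # stride 16, 1024 channels
    f4 = backbone.layer4(f3)  # stride 32, 2048 channels
    return [f1, f2, f3, f4]

with torch.no_grad():
    features = extract_features(torch.randn(1, 3, 512, 512))
print([tuple(f.shape) for f in features])
# [(1, 256, 128, 128), (1, 512, 64, 64), (1, 1024, 32, 32), (1, 2048, 16, 16)]
```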
The fusion of multi-level features can be expressed as Equation (1):

$$\tilde{F}_i = f\left(F_1, F_2, \ldots, F_n\right), \tag{1}$$

where $\tilde{F}_i$ is the fused output in the $i$-th layer.
The goal of feature fusion is to integrate semantic information (from top layers) with spatial location information (from bottom layers). The essence of feature fusion is therefore to assess the importance of different features and filter out inconsistent information.
In mainstream semantic segmentation models, there are several types of feature fusion. Spatial pyramid pooling is embedded at the top of the backbone network to encode multi-scale contextual information; PSPNet [41] and DeepLabv3+ [40] built pyramid pooling modules with different dilation rates in convolutional neural networks. In encoder–decoder networks, the decoding process uses lateral connections [39] or skip connections [38] to integrate feature information and then outputs predicted probabilities. Another type of method computes a weighted sum of the responses at all positions (such as Non-local neural networks [42] and CCNet [43]).
Obviously, the fusion strategies described above do not explicitly establish feature correlations and cannot quantify the importance of each feature. Thus, a controllable feature fusion module is proposed in this paper.
The gating mechanism has proven effective for evaluating each feature vector in long short-term memory (LSTM) networks [51]. Inspired by LSTM, the controllable fusion module is depicted in Figure 5; it calculates a weighted sum of all features as the adjustable output. To explain the whole process, take the controllable fusion module in the $i$-th layer as an example. It can be formulated as Equation (2):

$$\tilde{F}_i = w_i \odot F_i + (1 - w_i) \odot \sum_{j \neq i} \mathcal{U}_i(F_j), \tag{2}$$
where the weight factor in the $i$-th layer is $w_i$, and the sum of the other layers is weighted by $(1 - w_i)$. Note that the spatial dimensions of the other layers are first unified to $H_i \times W_i$ by bilinear interpolation, denoted $\mathcal{U}_i(\cdot)$ above. Furthermore, the weight $w_i$ is a vector activated by a sigmoid function, and its specific computation is shown in Equation (3). $w_i$ is optimized automatically according to the importance of feature $F_i$: the larger the contribution of $F_i$ to the final prediction, the closer $w_i$ is to one, and vice versa.
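A minimal sketch of this gated fusion is given below. The 1 × 1 convolution used to produce $w_i$ is an assumption on our part (the paper specifies the exact computation in Equation (3)), and the channel dimensions of the input features are assumed to have been unified beforehand.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ControllableFusion(nn.Module):
    """Sketch of gated fusion for the i-th layer: w_i weights F_i,
    and (1 - w_i) weights the sum of the resized remaining layers."""
    def __init__(self, channels):
        super().__init__()
        # Assumption: the gate is a 1x1 conv followed by a sigmoid on F_i.
        self.gate = nn.Sequential(nn.Conv2d(channels, channels, 1), nn.Sigmoid())

    def forward(self, f_i, others):
        # 'others' are the remaining feature maps, assumed to already have
        # the same channel count as f_i (e.g. via 1x1 convs, not shown).
        h, w = f_i.shape[-2:]
        resized = [F.interpolate(f, size=(h, w), mode="bilinear",
                                 align_corners=False) for f in others]
        w_i = self.gate(f_i)                        # sigmoid-activated weight vector
        return w_i * f_i + (1 - w_i) * sum(resized)

# Hypothetical usage with the i-th map and the remaining (channel-unified) maps:
# fused_i = ControllableFusion(channels=256)(f1, [f2_u, f3_u, f4_u])
```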
3.2. Edge Detection
The cross-entropy loss function is a classic loss function in semantic segmentation, shown in Equation (4):

$$L_{CE} = -\frac{1}{N} \sum_{n=1}^{N} \sum_{k=1}^{K} y_{n,k} \log p_{n,k}, \tag{4}$$

where $N$ is the number of total pixels in a batch, $K$ is the total number of categories, and $y$ and $p$ represent the label and the prediction, respectively.
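For reference, this pixel-wise cross-entropy corresponds directly to the standard library call; the sketch below uses an ignore_index of 255 purely as an illustrative convention, not a setting taken from this paper.

```python
import torch
import torch.nn.functional as F

# logits: (N, K, H, W) raw class scores; labels: (N, H, W) integer class indices.
logits = torch.randn(2, 6, 64, 64)
labels = torch.randint(0, 6, (2, 64, 64))

# Averages -log p(y) over all labelled pixels, i.e. Equation (4).
loss = F.cross_entropy(logits, labels, ignore_index=255)
```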
Segmentation methods are guided by two criteria: homogeneity within a segment and distinction from adjacent segments [33]. Edge-based image segmentation methods attempt to detect the edges between regions and then identify segments as the regions within these edges. One assumption is that the edge features aid in pixel localization, while the body part, which contains rich semantic information, aids in pixel categorization. The definition and location of edge regions are therefore crucial for guiding semantic segmentation with edge information.
Recent research has revealed that purified edge information can help semantic segmentation. In DFN [14], the Canny operator was used to obtain additional edge labels, and a binary loss function was constructed for edge extraction. The gradient variation of optical images was used in other studies to build edge-aware loss functions [19,49]. While the aforementioned works enhanced classification properties, they simply followed the principle that intra-class features (also known as the body part) and inter-class features (also known as the edge part) are completely heterogeneous and interact orthogonally. There appears, however, to be an implicit link between intra-class and inter-class features. Assuming that joint edge–body optimization can further improve semantic segmentation, an adaptive edge loss function is proposed.
It is necessary to review the online hard example mining (OHEM) algorithm before elaborating on the proposed adaptive edge loss function. The motivation of the OHEM algorithm is to improve the sampling strategy of object-detection algorithms when dealing with extreme distributions of hard and easy cases [52]. The authors proposed OHEM for training Fast R-CNN, and it proceeds as follows: at iteration $t$, the RoI network [53] performs a forward pass using the feature maps from the backbone (such as VGG-16) and all RoIs. After sorting the losses of the outputs in descending order, the top 1% are assigned as hard examples (namely, the examples on which the network performs worst). When implemented in Fast R-CNN, backward passes are computed only for the hard RoIs.
The OHEM loss function can be extended to semantic segmentation frameworks with only modest alterations. During forward propagation, the outputs $P$, which refer to the predicted probability of the true category at each pixel, are sorted in ascending order. The threshold probability $p_t$ is updated according to the preset minimum number of reserved samples $N_{\min}$ (typically 100,000, depending on the patch size). In fact, the threshold probability $p_t$ is equal to the $N_{\min}$-th value of the sorted predicted outputs. Hard examples are those with a probability less than or equal to $p_t$. The remaining samples are then filled with ignored labels and do not contribute to gradient optimization. Finally, only the hard examples contribute to the cross-entropy loss.
The OHEM loss function is formulated as follows:

$$L_{OHEM} = -\frac{1}{N_{hard}} \sum_{n=1}^{N_{hard}} \sum_{k=1}^{K} y_{n,k} \log p_{n,k}, \tag{5}$$

where $N_{hard}$ accounts for the number of pixels participating in practical backpropagation.
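A minimal sketch of this selection procedure follows. The tensor shapes, the n_min default, and the ignore_index convention are illustrative assumptions; this is our reading of the OHEM extension described above, not the authors' code.

```python
import torch
import torch.nn.functional as F

def ohem_loss(logits, labels, n_min=100_000, ignore_index=255):
    """Sketch of OHEM for segmentation: keep only the pixels whose predicted
    probability for the true class is at most the n_min-th smallest one."""
    n, k, h, w = logits.shape
    probs = F.softmax(logits, dim=1)                        # (N, K, H, W)
    flat_labels = labels.reshape(-1)
    flat_probs = probs.permute(0, 2, 3, 1).reshape(-1, k)   # (N*H*W, K)

    valid = flat_labels != ignore_index
    # Probability assigned to the ground-truth class of each valid pixel.
    true_prob = flat_probs[valid].gather(1, flat_labels[valid].unsqueeze(1)).squeeze(1)

    sorted_prob, _ = torch.sort(true_prob)                  # ascending order
    idx = min(n_min, sorted_prob.numel()) - 1
    threshold = sorted_prob[idx]                            # adaptive threshold p_t

    # Mark easy pixels (true-class probability above the threshold) as ignored.
    hard_labels = flat_labels.clone()
    easy = valid.clone()
    easy[valid] = true_prob > threshold
    hard_labels[easy] = ignore_index

    return F.cross_entropy(logits, hard_labels.view(n, h, w),
                           ignore_index=ignore_index)
```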
Although the OHEM algorithm can mine hard examples in semantic segmentation, practical findings reveal that the majority of hard examples are dispersed along boundaries, where different categories are easily confused with each other due to visual resemblance. Inaccurate classification of pixels adjacent to boundaries is a bottleneck of FCN-like methods. The validity of OHEM lies in its ability to identify hard examples, allowing for more effective hard-example optimization. Nevertheless, the OHEM loss function treats each pixel equally without identifying edge parts, which limits its ability to interpret intricate scenes such as remote sensing images. In addition to mining hard examples, the optimization strategy must also analyze object structures. Fortunately, segmentation maps contain rich edge clues, which are essential for semantic edge refinement.
In the OHEM loss function, the number of sampled examples $N_{\min}$ is set in advance, and it partitions all examples into two sets (hard and easy examples). In essence, the OHEM loss function only selects relatively harder samples. Consider the following two extreme scenarios: ① an image patch contains only a single object; ② an image patch comprises a variety of objects. There are many similar examples in the first scenario; thus, $N_{\min}$ should be reduced to avoid overfitting. In the second scenario, it is difficult to determine an exact category for the pixels attached to both sides of a boundary; in this case, $N_{\min}$ is usually larger to ensure sufficient examples for optimization. The appropriate choice of $N_{\min}$ is fundamentally different in these two cases, revealing that OHEM cannot fit every patch precisely regardless of how the hyperparameter $N_{\min}$ is selected. Our goal is to design a loss function that dynamically divides hard and easy examples for each patch.
The model computes the probability of each label $k \in \{1, \ldots, K\}$ for a training example $x$ as follows:

$$p(k \mid x) = \frac{\exp(z_k)}{\sum_{j=1}^{K} \exp(z_j)}, \tag{6}$$

where $k$ represents the $k$-th category and $z_k$ is the model's logit for that category. Assume that, for each training example $x$, the true distribution over labels is $q(k \mid x)$, normalized as in Equation (7):

$$\sum_{k=1}^{K} q(k \mid x) = 1. \tag{7}$$
Let us omit the dependence of $p$ and $q$ on the example $x$ for the sake of simplicity. Thus, the cross-entropy loss for each example is defined as Equation (8):

$$\ell = -\sum_{k=1}^{K} q(k) \log p(k). \tag{8}$$
Minimizing this loss is equivalent to maximizing the expected log-likelihood of the correct label, where the label is selected according to its ground-truth distribution $q(k)$. Consider the case of a single ground-truth label $y$, so that $q(y) = 1$ and $q(k) = 0$ for all $k \neq y$. For a particular example $x$ with label $y$, the log-likelihood is maximized for $q(k) = \delta_{k,y}$, where $\delta_{k,y}$ is the Dirac delta. Under the cross-entropy loss, this maximum is approached when the logit $z_y$ is substantially larger than $z_k$ for all $k \neq y$.
This strategy, however, may result in over-fitting: if the model learns to assign the full probability to the ground-truth label of each training example, it is not guaranteed to generalize. Szegedy et al. proposed a regularization mechanism named label smoothing for a more adaptable optimization [54]. They introduced a distribution over labels $u(k)$ and a smoothing parameter $\epsilon$, both independent of the training example $x$. For each training example $x$ with ground truth $y$, the label distribution $q(k \mid x) = \delta_{k,y}$ was replaced with Equation (9):

$$q'(k \mid x) = (1 - \epsilon)\,\delta_{k,y} + \epsilon\, u(k), \tag{9}$$
which mixes the original ground-truth distribution $\delta_{k,y}$ and the fixed distribution $u(k)$ with weights $(1 - \epsilon)$ and $\epsilon$, respectively. The distribution of the label $k$ can be interpreted as follows: first, set it to the ground-truth label $k = y$; then, with probability $\epsilon$, replace $k$ with a sample drawn from the distribution $u(k)$. Szegedy et al. used the uniform distribution $u(k) = 1/K$, so that the label distribution becomes Equation (10):

$$q'(k \mid x) = (1 - \epsilon)\,\delta_{k,y} + \frac{\epsilon}{K}, \tag{10}$$
where $K$ is the number of total classes and $\epsilon$ represents the probability of the ground-truth label being replaced.
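As a concrete illustration (with a hypothetical $K = 6$ and $\epsilon = 0.1$), the smoothed distribution of Equation (10) places $1 - \epsilon + \epsilon/K \approx 0.917$ on the ground-truth class and $\epsilon/K \approx 0.017$ on every other class:

```python
import torch

K, eps = 6, 0.1                  # hypothetical class count and smoothing factor
y = 2                            # ground-truth class index

q = torch.full((K,), eps / K)    # epsilon/K mass on every class
q[y] += 1.0 - eps                # (1 - epsilon) extra mass on the true class
print(q)        # ~ tensor([0.0167, 0.0167, 0.9167, 0.0167, 0.0167, 0.0167])
print(q.sum())  # ~ 1.0
```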
In our proposed adaptive edge loss function, the ratio of hard examples to all examples determines $\epsilon$. During optimization, a gradient information map is calculated by applying the Laplacian operator to the true label map. Elements with a gradient of 0 are regarded as easy examples, while the other elements are set to 1 and regarded as hard examples (also known as edge parts). When constructing the final loss function, easy examples are optimized only by cross-entropy, while hard examples are fed into the adaptive edge loss for optimization, as sketched below.
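The partition itself can be sketched as follows; the 3 × 3 four-neighbour Laplacian kernel is one common discretization and is our assumption, since the paper does not fix a specific one in this section.

```python
import torch
import torch.nn.functional as F

def hard_example_mask(labels):
    """Binary map: 1 for hard examples (label discontinuities, i.e. edge parts),
    0 for easy examples, obtained with a discrete Laplacian of the label map."""
    kernel = torch.tensor([[0., 1., 0.],
                           [1., -4., 1.],
                           [0., 1., 0.]]).view(1, 1, 3, 3)
    lap = F.conv2d(labels.float().unsqueeze(1), kernel, padding=1)
    return (lap.abs() > 0).squeeze(1).long()    # 0 = easy, 1 = hard (edge)

labels = torch.randint(0, 6, (2, 64, 64))       # hypothetical label maps
mask = hard_example_mask(labels)
```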
The calculation process of our proposed adaptive edge loss function is shown in Algorithm 1, and it can be formulated as follows:

$$L_{AEL} = -\frac{1}{N_{easy}} \sum_{i \in \Omega_{easy}} \sum_{k=1}^{K} y_{i,k} \log p_{i,k} \;-\; \frac{1}{N_{hard}} \sum_{i \in \Omega_{hard}} \sum_{k=1}^{K} q'_{i,k} \log p_{i,k}, \tag{11}$$

where $N_{easy}$ and $N_{hard}$ are the numbers of easy examples and hard examples, respectively, $y$ is the ground-truth label, $q'$ is the smoothed label distribution of Equation (10), and $p$ is the predicted probability of the model.
Algorithm 1 Adaptive Edge Loss Function
Input: $D$: training dataset composed of pairs $(x, y)$; $K$: number of total categories; $q$: label distribution
Output: $\theta$: optimal parameters of the network
1: Initialize the parameters $\theta$ of the network according to [55]; $L_{easy} \leftarrow 0$, $L_{hard} \leftarrow 0$
2: for all $(x, y) \in D$ do
3:   for $k = 1$ to $K$ do
4:     compute the predicted probability $p(k \mid x)$
5:     compute the smoothed label distribution $q'(k \mid x)$
6:     for all pixels $j$ do
7:       if Laplace$(y_j) = 0$ then
8:         compute the loss value for easy examples
9:         $L_{easy} \leftarrow L_{easy} + \ell_j$
10:       end if
11:       if Laplace$(y_j) > 0$ then
12:         replace $y_j$ with $q'$ according to the smoothed distribution
13:         compute the loss value for hard examples
14:         $L_{hard} \leftarrow L_{hard} + \ell_j$
15:       end if
16:     end for
17:   end for
18:   compute the final loss value according to Equation (11)
19: end for
20: optimize $\theta$ according to Stochastic Gradient Descent
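For completeness, one possible implementation of Algorithm 1 as a loss function is sketched below. The combination of the two terms follows our reconstruction of Equation (11), the Laplacian partition is inlined so the sketch is self-contained, and the ignore_index convention is an assumption; this is a reading of the procedure rather than the authors' reference implementation.

```python
import torch
import torch.nn.functional as F

def adaptive_edge_loss(logits, labels, ignore_index=255):
    """Sketch of Algorithm 1 / Equation (11): plain cross-entropy on easy
    (non-edge) pixels plus label-smoothed cross-entropy on hard (edge) pixels."""
    n, k, h, w = logits.shape
    log_p = F.log_softmax(logits, dim=1)                    # (N, K, H, W)

    # Hard/easy partition via a discrete Laplacian of the label map.
    kernel = torch.tensor([[0., 1., 0.], [1., -4., 1.], [0., 1., 0.]],
                          device=labels.device).view(1, 1, 3, 3)
    lap = F.conv2d(labels.float().unsqueeze(1), kernel, padding=1).squeeze(1)

    valid = labels != ignore_index
    hard = (lap.abs() > 0) & valid                          # edge pixels
    easy = (~hard) & valid

    # Adaptive smoothing factor: ratio of hard examples to all examples.
    eps = hard.sum().float() / valid.sum().clamp(min=1).float()

    # log p(y) for every pixel (out-of-range labels clamped, then masked out).
    log_p_y = log_p.gather(1, labels.clamp(0, k - 1).unsqueeze(1)).squeeze(1)

    # Easy pixels: standard cross-entropy.
    loss_easy = -log_p_y[easy].mean() if easy.any() else logits.new_zeros(())

    # Hard pixels: cross-entropy against q'(k) = (1 - eps) * delta_{k,y} + eps / K.
    sum_log_p = log_p.sum(dim=1)                            # sum_k log p(k), per pixel
    loss_hard_pix = -(1.0 - eps) * log_p_y - (eps / k) * sum_log_p
    loss_hard = loss_hard_pix[hard].mean() if hard.any() else logits.new_zeros(())

    return loss_easy + loss_hard

# Hypothetical usage:
logits = torch.randn(2, 6, 64, 64, requires_grad=True)
labels = torch.randint(0, 6, (2, 64, 64))
loss = adaptive_edge_loss(logits, labels)
loss.backward()
```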