The DARFP-SD algorithm mainly consists of the Deformable Attention Recursive Feature Pyramid (DARFP) and the Bounding box Refinement (BR). The DARFP module first extracts features through a ResNet-50 backbone equipped with deformable convolution kernels [18], which enlarges the effective receptive field so that the sampling points of the convolution operation can avoid the interference of stems to a certain extent, improving the quality of the learned features for sheltered pods. To select appropriate feature maps for constructing the recursive pyramid, DARFP then quantifies the relationship between pod size and receptive field, and adds recursive feedback connections to the feature learning network. BR designs an adaptive SDIoU-NMS branch, in which the local area density is predicted to adaptively assign the NMS threshold. BR is supervised with the Repulsion loss [19] and GIoU loss, which constrain each predicted box to stay close to its corresponding ground truth and away from the labeled boxes of other targets, improving the position accuracy of the candidate proposals. We describe the details of each module step by step in the following subsections.
2.2.1. Deformable Attention Recursive Feature Pyramid
Feature extraction based on deformable convolution. A traditional convolution operation learns features through window sliding. Once the size and stride of the convolution kernel are determined, the receptive field is fixed, and the kernel weights are learned during network training. Taking a $3 \times 3$ convolution kernel and an input image $X$ as an example, the pixel $F(p_0)$ on the feature map $F$ can be calculated as Equation (1):

$$F(p_0) = \sum_{p_n \in R} w(p_n) \cdot X(p_0 + p_n) \tag{1}$$

where $w(p_n)$ represents the weight of the convolution kernel at position $p_n$, and $R$ is the 8-neighborhood of $p_0$, formulated as $R = \{(-1,-1), (-1,0), \ldots, (1,1)\}$. Due to the uncertainty of the growth direction of pods on a single soybean plant, as shown in the blue area of the local pod feature map in Figure 2, the fixed receptive field places a large number of sampling points outside the pods during feature learning, which amplifies the interference of background noise (such as stems) on pod features and restricts the quality of the feature map and of the candidate regions generated from it. To this end, we add an additional deformable convolution layer to predict horizontal and vertical offsets for each pixel in the feature map. The whole feature extraction process of the deformable convolution for pods is shown in Equation (2):

$$F(p_0) = \sum_{p_n \in R} w(p_n) \cdot X(p_0 + p_n + \Delta p_n) \tag{2}$$

where $\Delta p_n$ is the predicted offset for pixel $p_n$. For each pixel, the final offset is the superposition of the offset components in the horizontal and vertical directions. Since the offset positions are generally fractional, $X(p_0 + p_n + \Delta p_n)$ is obtained through bilinear interpolation. As shown in the green area of the local pod feature map in Figure 2, for pods with uncertain attitudes, deformable convolution can adaptively capture the various shape and scale information of pods, effectively reducing noise interference.
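As a concrete sketch of the sampling in Equation (2), the snippet below evaluates one output pixel of a deformable 3×3 convolution on a toy single-channel image; the `bilinear_sample` helper handles the fractional positions produced by the offsets. All names and values here are illustrative, not the paper's implementation:

```python
def bilinear_sample(img, y, x):
    """Bilinearly interpolate img (2D list) at fractional position (y, x)."""
    h, w = len(img), len(img[0])
    y0, x0 = int(y), int(x)
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = img[y0][x0] * (1 - dx) + img[y0][x1] * dx
    bot = img[y1][x0] * (1 - dx) + img[y1][x1] * dx
    return top * (1 - dy) + bot * dy

def deform_conv_pixel(img, weights, p0, offsets):
    """One output pixel of Equation (2): sum_n w(p_n) * X(p0 + p_n + dp_n)."""
    # R is the 8-neighborhood (plus center) of a 3x3 kernel, as in Equation (1)
    R = [(-1, -1), (-1, 0), (-1, 1),
         (0, -1),  (0, 0),  (0, 1),
         (1, -1),  (1, 0),  (1, 1)]
    out = 0.0
    for (py, px), w, (oy, ox) in zip(R, weights, offsets):
        out += w * bilinear_sample(img, p0[0] + py + oy, p0[1] + px + ox)
    return out
```

In the real layer, the offsets are themselves predicted by a convolution over the input feature map; here they are passed in directly to keep the sampling step visible.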
Feature enhancement based on attention. Before constructing the DARFP, we introduce channel attention and spatial attention based on the convolutional block attention module (CBAM) to enhance the quality of the feature map $F$, as in Equation (3):

$$F' = M_c(F) \otimes F, \quad F'' = M_s(F') \otimes F' \tag{3}$$

where $\otimes$ means element-wise multiplication, and $M_c$ and $M_s$ are the channel attention and spatial attention, respectively. For the original channel-wise feature $F$, channel attention helps capture the discriminative information of the object by learning the response relationship between channel features and the category label. For the crowded pods in our research scene, with semantic dependency between different channels, the features are guided to pay more attention to the pod areas rather than to the complex background. The feature enhancement process based on channel attention is modeled through max pooling $F^c_{max}$ and average pooling $F^c_{avg}$, as in Equation (4):

$$M_c(F) = \sigma\big(W_1(W_0(F^c_{avg})) + W_1(W_0(F^c_{max}))\big) \tag{4}$$

Here, $W_0$ and $W_1$ are the learned weights of a shared Multilayer Perceptron, and $\sigma$ denotes the Sigmoid activation function. For the unevenly distributed pods, fully embedding their spatial position information into the features clearly helps to improve the accuracy of detection and counting. Different from the channel attention mechanism, we further utilize the spatial dependency between features to generate a spatial attention map, which complementarily mines the spatial location information of pods ignored by the channel attention module. We calculate the spatial attention on the feature maps already enhanced by the channel attention. Similar to the channel attention, max pooling and average pooling operations output $F^s_{max}$ and $F^s_{avg}$. Then, the two feature maps are fused through a convolution operation $f$ with a $7 \times 7$ kernel, as in Equation (5):

$$M_s(F') = \sigma\big(f\big([F^s_{avg}; F^s_{max}]\big)\big) \tag{5}$$

To make the most of the semantic and spatial dependencies between different channels captured by $M_c$ and $M_s$, we add the channel attention and spatial attention to each layer of the recursive feature pyramid.
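A minimal numerical sketch of the channel attention in Equation (4) follows. For readability it uses a one-layer, per-channel weight in place of the shared two-layer bottleneck MLP $W_1(W_0(\cdot))$ that CBAM actually uses, so the numbers are illustrative only:

```python
import math

def channel_attention(feat, w):
    """feat: list of channels, each a 2D list; w: per-channel MLP weight.

    Returns sigmoid(MLP(avg_pool) + MLP(max_pool)) per channel (Equation (4)),
    here with a simplified one-layer diagonal MLP.
    """
    gates = []
    for ch, wc in zip(feat, w):
        flat = [v for row in ch for v in row]
        avg_p = sum(flat) / len(flat)          # F_avg^c
        max_p = max(flat)                      # F_max^c
        gates.append(1.0 / (1.0 + math.exp(-wc * (avg_p + max_p))))
    return gates

def apply_gates(feat, gates):
    """Element-wise multiplication M_c(F) ⊗ F from Equation (3)."""
    return [[[v * g for v in row] for row in ch]
            for ch, g in zip(feat, gates)]
```

The spatial attention of Equation (5) works analogously, pooling across channels instead of within them before the fusion convolution.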
Selection of feature maps. The area of the input image corresponding to any pixel on the feature map is described as the receptive field. The image information within the receptive field directly affects the quality of the features learned by the network. The calculation of the receptive field at each layer is shown in Equation (6):

$$l_t = l_{t-1} + (k_t - 1) \cdot \prod_{i=1}^{t-1} s_i \tag{6}$$

where $l_t$ is the size of the receptive field of convolution layer $t$, and $s_t$ and $k_t$ are the stride and kernel size of layer $t$, respectively. For soybean pods sheltered by leaves or branches, to suppress the interference of the background, we would like the receptive field to be equivalent to the pod size. According to Equation (6), the receptive field sizes of the C2, C3, C4 and C5 layers of ResNet-50 can be computed layer by layer. For the images of single soybean plants collected in this study, the average size (length × width) of a single pod was obtained by randomly selecting 50 images and manually measuring the pods. In order to make the sheltered pod feature learning network universal for pods of different sizes, without adding additional convolution layers, we select the outputs of the C3, C4 and C5 layers so that the original receptive field of the shallowest selected feature map is close to the average pod size. Similar to DCNv2 [20], our DARFP introduces deformable convolution with a $3 \times 3$ kernel into conv2, conv3, conv4 and conv5 of ResNet-50, so that the feature extraction can improve noise immunity at different scales.
Feature fusion based on recursive feature pyramid. The information contained in the feature maps output by different convolution layers differs. To fully exploit the limited pod features, the classical FPN [21] fuses features of different scales along the top-down direction. However, feedforward propagation is only conducted between the backbone and the pyramid structure, which means that the gradient optimization information obtained while constructing the pyramid cannot be fed back to the backbone to help parameter learning. Motivated by DetectoRS [22], we add cross-layer feedback links between successive feature pyramids. The feature map output from the previous recursive pyramid first passes through a convolution operation; the original feature and this output feature are then stacked together as the feature layer of the next recursive pyramid. The transmission and calculation between the feature layers of the recursive feature pyramid are shown in Equation (7):

$$x_i^l = B_i\big(x_{i-1}^l,\, R(f_i^{l-1})\big), \quad f_i^l = F_i\big(f_{i+1}^l,\, x_i^l\big) \tag{7}$$

where $R$ represents the feature transformation operation with a convolution kernel, and for any layer $i$, $x_i^l$ and $f_i^l$ represent the feature maps of the $i$-th backbone stage $B_i$ and of the $i$-th top-down operation $F_i$ of the FPN in recursion step $l$ (with $f_i^0 = 0$). After introducing the recursion parameter $l$, the recursive FPN can be unrolled into a sequential network that extracts and fuses features repeatedly, which effectively improves the utilization of the prior feature information. The feedback also allows parameter updates to optimize the feature extraction. To balance feature quality and model training speed, the maximum number of recursions is set to 2.
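The unrolled two-step recursion of Equation (7) can be sketched with toy scalar "feature maps" and placeholder transforms; the real $B_i$, $F_i$ and $R$ are convolutional stages, so everything below is purely illustrative of the dataflow:

```python
def recursive_fpn(inputs, B, F, R, recursions=2):
    """Unrolled recursive FPN (Equation (7)) over toy per-level features.

    inputs: list of bottom-up input features, one per pyramid level
    B(i, x_prev, fb): i-th backbone stage, consuming feedback fb
    F(i, f_top, x): i-th top-down FPN fusion
    R(f): feedback transform; feedback is zero on the first pass (f_i^0 = 0)
    """
    n = len(inputs)
    f = [0.0] * n                      # f_i^0 = 0: no feedback initially
    for _ in range(recursions):
        x, prev = [], 0.0
        for i in range(n):             # bottom-up pass with feedback
            prev = B(i, prev + inputs[i], R(f[i]))
            x.append(prev)
        top = 0.0
        for i in reversed(range(n)):   # top-down fusion
            top = F(i, top, x[i])
            f[i] = top
    return f

# Toy transforms standing in for convolutional stages:
B = lambda i, x_prev, fb: x_prev + fb   # backbone stage: add feedback
F = lambda i, f_top, x: f_top + x       # top-down fusion: accumulate
R = lambda f: 0.5 * f                   # feedback transform
print(recursive_fpn([1.0, 1.0], B, F, R))
```

With `recursions=2`, the second pass reuses the first pyramid's outputs as feedback, mirroring how the paper caps the recursion depth at 2.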
2.2.2. Bounding Box Refinement
Non-maximum suppression (NMS) is a common post-processing step for object detection, which aims to suppress redundant predicted boxes in the detection results. However, limited by the clustered growth habit of pods, only part of the crowded pods can be successfully detected. Intuitively, the correct predicted bounding box belonging to one pod may be regarded as an offset predicted bounding box of an adjacent pod, and thus be suppressed as redundant by the NMS algorithm [23]. Increasing the NMS threshold can theoretically reduce the missed detection rate of pods, while it is challenging to manually set an appropriate threshold to handle unevenly distributed pods with different densities at different locations. To this end, we design the adaptive SDIoU-NMS and the Repulsion loss to refine the bounding boxes.
Adaptive SDIoU-NMS. Adaptive SDIoU-NMS first introduces DIoU [24] into the Soft-NMS algorithm, which better measures the similarity and overlap between two predicted boxes. Compared with the classical Soft-NMS, the adaptive SDIoU-NMS also considers the distance $\rho$ between the center points of the two boxes. The suppression function in SDIoU-NMS can be calculated as Equations (8)–(10):

$$s_i = \begin{cases} s_i, & \mathrm{DIoU}(M, B_i) < T \\ s_i\big(1 - \mathrm{DIoU}(M, B_i)\big), & \mathrm{DIoU}(M, B_i) \ge T \end{cases} \tag{8}$$

$$\mathrm{DIoU}(M, B_i) = \mathrm{IoU}(M, B_i) - \frac{\rho^2(b, b^{gt})}{c^2} \tag{9}$$

$$\mathrm{IoU}(M, B_i) = \frac{|M \cap B_i|}{|M \cup B_i|} \tag{10}$$

For the $i$-th object, $s_i$ is the classification score of its predicted box. $M$ and $B_i$ are the box with the highest score and the other predicted boxes, respectively. $b$ and $b^{gt}$ represent the center points of the predicted box and the ground truth box, and $\rho$ is the Euclidean distance between these two center points. $c$ is the diagonal length of the minimum enclosing area that contains both boxes. $T$ is the threshold indicating the maximum IoU with all ground truth boxes.
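The decay rule of Equations (8)–(10) can be sketched as a standalone function over axis-aligned boxes. This is a simplified illustration of DIoU-based soft suppression (here the DIoU penalty is computed between the two candidate boxes, and hypothetical parameter names like `score_thr` are ours, not the paper's):

```python
def diou(a, b):
    """DIoU between boxes (x1, y1, x2, y2), per Equations (9)-(10)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    iou = inter / (area(a) + area(b) - inter)
    # squared center distance over squared enclosing-box diagonal
    rho2 = ((a[0] + a[2]) / 2 - (b[0] + b[2]) / 2) ** 2 \
         + ((a[1] + a[3]) / 2 - (b[1] + b[3]) / 2) ** 2
    c2 = (max(a[2], b[2]) - min(a[0], b[0])) ** 2 \
       + (max(a[3], b[3]) - min(a[1], b[1])) ** 2
    return iou - rho2 / c2

def sdiou_nms(boxes, scores, T=0.5, score_thr=0.05):
    """Soft suppression per Equation (8): decay scores instead of deleting."""
    order = sorted(range(len(boxes)), key=lambda i: -scores[i])
    s = list(scores)
    keep = []
    while order:
        m = order.pop(0)               # box M with the highest current score
        keep.append(m)
        rest = []
        for i in order:
            d = diou(boxes[m], boxes[i])
            if d >= T:
                s[i] *= (1.0 - d)      # linear Soft-NMS decay
            if s[i] > score_thr:
                rest.append(i)
        order = sorted(rest, key=lambda i: -s[i])
    return keep, s
```

Because the decay is a function of DIoU rather than plain IoU, two boxes with the same overlap but farther-apart centers are penalized less, which is what lets adjacent pod boxes survive.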
For pods with an uneven number distribution, we expect a small threshold for sparse pods to remove more redundant boxes, and a large threshold for dense pods to improve recall. To this end, based on SDIoU-NMS, the adaptive SDIoU-NMS further designs an independent density prediction branch to estimate the pod density, so that the threshold $T$ can be dynamically adjusted according to the pod density. The density prediction branch adopts VGG16 as its backbone, whose network structure is shown in Figure 3. Note that, in order to consider more context information around the objects, a larger convolution kernel is used in the final convolution layer to increase the receptive field. The degree of density at the $i$-th target is defined as in Equation (11):

$$d_i := \max_{b_j \in G,\, b_j \neq b_i} \mathrm{IoU}(b_i, b_j) \tag{11}$$

where $b_i$ and $G$ are the generated bounding box and the set of ground truth boxes, respectively. At the inference stage, the density prediction network outputs the object density at each position. Substituting the predicted density values back into Equations (8) and (11), the adaptive SDIoU-NMS finally completes the non-maximum suppression operation.