Our research focused on the identification and accurate segmentation of individual goose instances against complex backgrounds, enabling the fine extraction of contour features and facilitating group counting. This is a typical instance segmentation task and a natural extension of it. In this paper, we applied a query-based network model to goose instance segmentation, combining the two subtasks of object detection (individual goose classification and localization) and semantic segmentation (identification of goose pixels) in a single framework.
2.3.1. QueryInst Network
QueryInst (Instances as Queries) is a query-based, end-to-end instance segmentation method consisting of a query-based object detector and six dynamic mask heads driven by parallel supervision. The algorithm primarily exploits the one-to-one correspondence inherent in object queries across different stages, as well as the one-to-one correspondence between mask RoI features and object queries in the same stage. This correspondence exists in all query-based frameworks, independent of the specific instantiation and application. The R-CNN head of QueryInst contains six stages in parallel. The mask head is trained by minimizing the dice loss [25]. The QueryInst model was trained with ResNet-50 [26,27] as the backbone. The dynamic head architecture of QueryInst is shown in Figure 4.
Query-based Object Detector
QueryInst can be built on any multistage query-based object detector but is instantiated with Sparse R-CNN [28] as the default, which has six query stages. The target detection pipeline for geese at stage $t$ is formulated as follows:
$$x_t^{box} = \mathcal{P}^{box}\left(x^{FPN}, b_{t-1}\right),$$
$$q_{t-1}^{*} = MSA_t\left(q_{t-1}\right),$$
$$x_t^{box*}, q_t = DynConv_t^{box}\left(x_t^{box}, q_{t-1}^{*}\right),$$
$$b_t = \mathcal{B}_t\left(x_t^{box*}\right),$$
where $q \in \mathbb{R}^{N \times d}$ represents an object query; $N$ and $d$ represent the length (number) and dimension of the queries $q$, respectively. In stage $t$, the pooling operator $\mathcal{P}^{box}$ extracts the current-stage bounding box features $x_t^{box}$ from the FPN features, guided by the bounding box predictions $b_{t-1}$ of the previous stage. A multihead self-attention module $MSA_t$ is applied to the input query $q_{t-1}$ to obtain the transformed query $q_{t-1}^{*}$. Then, a box dynamic convolution module $DynConv_t^{box}$ takes $x_t^{box}$ and $q_{t-1}^{*}$ as input, augments the box features by reading $q_{t-1}^{*}$, and generates the query $q_t$ for the next stage. Finally, the augmented bounding box features $x_t^{box*}$ are fed into the box prediction branch $\mathcal{B}_t$ for the current bounding box prediction $b_t$.
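To make the per-stage computation concrete, the query-conditioned step at the heart of the dynamic convolution can be sketched in plain numpy. This is only an illustration of the idea (an object query is mapped to per-instance convolution parameters that are applied to the pooled RoI features); the projection matrix `w_param`, the single-layer form, and all shapes are illustrative assumptions, not the model's actual layers.

```python
import numpy as np

def dyn_conv_box(roi_feat, query, w_param):
    """Minimal sketch of a box dynamic convolution step.

    roi_feat: (S*S, C) pooled bounding box features
    query:    (d,) transformed object query
    w_param:  (d, C*C) hypothetical linear map from query to
              per-instance 1x1 convolution parameters
    Returns augmented features of shape (S*S, C).
    """
    C = roi_feat.shape[1]
    params = (query @ w_param).reshape(C, C)  # query -> dynamic kernel
    out = roi_feat @ params                   # apply per-instance 1x1 conv
    return np.maximum(out, 0.0)               # ReLU

rng = np.random.default_rng(0)
S, C, d = 7, 8, 16
x = rng.standard_normal((S * S, C))
q = rng.standard_normal(d)
W = rng.standard_normal((d, C * C)) * 0.1
x_aug = dyn_conv_box(x, q, W)
```

Because the kernel is generated from the query, each instance filters its own RoI features differently, which is what ties the box features to their query one-to-one.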
Dynamic Mask Head
A query-based instance segmentation framework was implemented with parallel-supervision-driven dynamic mask heads. The dynamic mask head at stage $t$ consisted of a dynamic mask convolution module $DynConv_t^{mask}$, followed by a vanilla mask head $\mathcal{M}_t$. The mask generation pipeline was reformulated as follows:
$$x_t^{mask} = \mathcal{P}^{mask}\left(x^{FPN}, b_t\right),$$
$$x_t^{mask*} = DynConv_t^{mask}\left(x_t^{mask}, q_{t-1}^{*}\right),$$
$$m_t = \mathcal{M}_t\left(x_t^{mask*}\right),$$
where the pooling operator $\mathcal{P}^{mask}$ extracts the mask RoI features $x_t^{mask}$ guided by the current bounding box predictions $b_t$, and $m_t$ is the mask prediction of stage $t$. The communication and coordination of object detection and instance segmentation were realized through these dynamic mask heads.
2.3.2. Model Architecture—QueryPNet
Neck Design
To enhance the propagation of information flow in the instance segmentation framework, we chose to use a path aggregation network (PANet) in our model. High-level feature maps with rich segmentation information were used as an additional input for better performance.
Each building block took a higher-resolution feature map Ni and a coarser map Pi+1 through lateral connections and generated a new feature map Ni+1. Each feature map Ni was first passed through a 3 × 3 convolutional layer with a stride of 2 to reduce the spatial size. The feature map Pi+1 and each element of the down-sampled map were then summed through the lateral connection. The fused feature map was processed by another 3 × 3 convolutional layer to generate Ni+1 for subsequent subnetworks. This was an iterative process. In these building blocks, we always used feature maps with 256 channels, and all convolutional layers were followed by a ReLU. Then, the feature grids for each level were pooled from the new feature maps (i.e., {N1, N2, N3, N4}).
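The bottom-up building block described above can be sketched as a single-channel numpy toy, with a uniform averaging kernel standing in for the learned 3 × 3 convolutions (the real blocks use 256 channels and learned weights):

```python
import numpy as np

def conv3x3(x, stride=1):
    # naive 3x3 "convolution" with padding 1; a uniform averaging kernel
    # stands in for a learned kernel to keep the sketch dependency-free
    h, w = x.shape
    xp = np.pad(x, 1)
    out = np.zeros(((h + stride - 1) // stride, (w + stride - 1) // stride))
    for i in range(0, h, stride):
        for j in range(0, w, stride):
            out[i // stride, j // stride] = xp[i:i + 3, j:j + 3].mean()
    return out

def bottom_up_block(n_i, p_next):
    """One bottom-up building block: N_{i+1} = Conv(down(N_i) + P_{i+1})."""
    down = conv3x3(n_i, stride=2)          # stride-2 3x3 conv halves resolution
    fused = down + p_next                  # lateral connection: element-wise sum
    return np.maximum(conv3x3(fused), 0)   # second 3x3 conv + ReLU

n1 = np.ones((8, 8))   # higher-resolution map N_i
p2 = np.ones((4, 4))   # coarser map P_{i+1}
n2 = bottom_up_block(n1, p2)
```

Iterating this block from N1 upward produces the enhanced pyramid {N1, N2, N3, N4} used by the subsequent subnetworks.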
The implementation of the neck module in this paper was as follows (Figure 5):
The information path was shortened, and the feature pyramid was enhanced with the precise localization signals present in the lower layers. The resulting high-level feature maps were then further processed using bottom-up path augmentation.
Through adaptive feature pooling, the features of all levels were aggregated for each proposal, together with the highest-level features N5 obtained by the bottom-up path augmentation.
To capture different views of each task, our model used small fully connected layers to augment the predictions. For the mask branch, these layers had properties complementary to the FCN originally used by Mask R-CNN; by fusing predictions from these two views, the information diversity increased and better-quality masks were generated, while for the detection branch, better-quality boxes could be generated.
Proposal Region Generation and RoIAlign Operation
The obtained feature maps were sent to the RPN [29]; however, the tens of thousands of candidate anchors of a conventional region proposal network were no longer used. This paper chose to use 100 sparse proposals, which served to extract the regional features of the geese through RoIAlign. These proposal boxes were statistics of potential goose body locations in the images and only rough representations of the goose targets, lacking many informative details such as pose, shape, and contour integrity. Therefore, we set 256-dimensional proposal features (proposal_feature) to encode rich instance features. After that, a series of bounding boxes could be obtained, and where multiple bounding boxes overlapped each other, non-maximum suppression (NMS) [30] was used to retain the bounding boxes with higher foreground scores, which were passed to the next stage.
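Greedy NMS, as used here to resolve overlapping goose boxes, can be sketched in a few lines of numpy; the boxes, scores, and the 0.5 IoU threshold below are illustrative values, not the paper's settings:

```python
import numpy as np

def nms(boxes, scores, iou_thr=0.5):
    """Greedy non-maximum suppression over (x1, y1, x2, y2) boxes."""
    order = np.argsort(scores)[::-1]  # highest foreground score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        # IoU of the top-scoring box with the remaining boxes
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        # drop boxes overlapping the kept one beyond the threshold
        order = order[1:][iou <= iou_thr]
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 10, 10], [20, 20, 30, 30]], float)
scores = np.array([0.9, 0.8, 0.7])
kept = nms(boxes, scores)  # the second box is suppressed by the first
```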
In the backpropagation of the RoIAlign layer, $x^{*}(r, j)$ was a floating-point coordinate position (the sampling point calculated during forward propagation). In the feature map before pooling, each point whose horizontal and vertical distances from $x^{*}(r, j)$ were both less than 1 should receive the corresponding gradient.
The gradient of the RoIAlign layer was as follows:
$$\frac{\partial L}{\partial x_i} = \sum_r \sum_j \left[ d\left(i, i^{*}(r, j)\right) < 1 \right] (1 - \Delta h)(1 - \Delta w) \frac{\partial L}{\partial y_{rj}},$$
where $d(\cdot, \cdot)$ represents the distance between two points, and $\Delta h$ and $\Delta w$ represent the differences between the vertical and horizontal coordinates of $x_i$ and $x^{*}(r, j)$. Through the RoIAlign process, the extracted features were correctly aligned with the input image, which avoided losing information from the original feature map. The intermediate process was not quantized, ensuring maximum information integrity, and the problem of subpixel misalignment when defining the correspondence between the region proposal and the feature map was solved, resulting in more accurate pixel segmentation. Especially for small feature maps, more accurate and complete information could be obtained.
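The unquantized sampling that RoIAlign performs is bilinear interpolation at floating-point locations. A minimal numpy sketch of one such sample is shown below; the $(1 - \Delta h)(1 - \Delta w)$ weights are exactly the factors that propagate the gradient back to the four integer neighbours in the backward pass:

```python
import numpy as np

def bilinear_sample(feat, y, x):
    """Sample a 2D feature map at the floating-point location (y, x)."""
    # integer neighbours of the floating-point sampling point
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1 = min(y0 + 1, feat.shape[0] - 1)
    x1 = min(x0 + 1, feat.shape[1] - 1)
    dh, dw = y - y0, x - x0
    # each neighbour contributes with weight (1 - delta) / delta per axis
    return (feat[y0, x0] * (1 - dh) * (1 - dw)
            + feat[y0, x1] * (1 - dh) * dw
            + feat[y1, x0] * dh * (1 - dw)
            + feat[y1, x1] * dh * dw)

feat = np.array([[0.0, 1.0],
                 [2.0, 3.0]])
v = bilinear_sample(feat, 0.5, 0.5)  # midpoint: average of the four values
```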
Goose Target Detection and Instance Segmentation
This paper used five target detection heads and five dynamic mask heads, which reduced the number of training parameters and optimized performance to a certain extent. The features obtained by RoIAlign were fed to bbox_head to implement goose bounding box regression and to mask_head to predict goose segmentation masks (goose body regions). For network training, the loss function represented the difference between the predicted value and the true value and played an important role in the training of the goose segmentation model. For the loss functions of the two subtasks, we used the CIoU loss [31] for bbox_head, which was also an adjustment to the original model, and the dice loss for mask_head.
For the CIoU loss, the implementation was as follows:
$$\mathcal{R}_{CIoU} = \frac{\rho^2\left(b, b^{gt}\right)}{c^2} + \alpha v,$$
where $b$ and $b^{gt}$ denote the central points of the predicted box and the ground-truth box, $\rho(\cdot)$ is the Euclidean distance, $c$ is the diagonal length of the smallest enclosing box covering the two boxes, $\alpha$ is a positive trade-off parameter, and $v$ measures the consistency of the aspect ratio:
$$v = \frac{4}{\pi^2}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^2.$$
Then, the loss function can be defined as:
$$\mathcal{L}_{CIoU} = 1 - IoU + \frac{\rho^2\left(b, b^{gt}\right)}{c^2} + \alpha v,$$
and the trade-off parameter $\alpha$ is defined as:
$$\alpha = \frac{v}{(1 - IoU) + v}.$$
By this definition, the overlap area factor was given higher priority in regression, especially for non-overlapping cases.
Finally, the optimization of the CIoU loss was the same as that of the DIoU loss, except that the gradient of $v$ with respect to $w$ and $h$ needed to be considered:
$$\frac{\partial v}{\partial w} = -\frac{8}{\pi^2}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right) \times \frac{h}{w^2 + h^2},$$
$$\frac{\partial v}{\partial h} = \frac{8}{\pi^2}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right) \times \frac{w}{w^2 + h^2}.$$
For $h$ and $w$ in the range $[0, 1]$, the denominator $w^2 + h^2$ is usually a small value, which is likely to produce exploding gradients. Therefore, in the specific implementation, to stabilize convergence, the denominator $w^2 + h^2$ was simply removed, by which the step size $1/(w^2 + h^2)$ was replaced by 1 while the gradient direction remained unchanged.
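As a sanity check, the forward computation of the CIoU loss for a pair of axis-aligned boxes can be assembled in a few lines of numpy. This is a minimal sketch of the formulas in this section, not the training code; the small epsilon guarding the alpha division is an implementation assumption:

```python
import numpy as np

def ciou_loss(box, gt):
    """CIoU loss for two (x1, y1, x2, y2) boxes: 1 - IoU + rho^2/c^2 + alpha*v."""
    # IoU term
    xx1, yy1 = max(box[0], gt[0]), max(box[1], gt[1])
    xx2, yy2 = min(box[2], gt[2]), min(box[3], gt[3])
    inter = max(xx2 - xx1, 0) * max(yy2 - yy1, 0)
    area_b = (box[2] - box[0]) * (box[3] - box[1])
    area_g = (gt[2] - gt[0]) * (gt[3] - gt[1])
    iou = inter / (area_b + area_g - inter)
    # normalized center distance rho^2 / c^2 (c: enclosing-box diagonal)
    rho2 = (((box[0] + box[2]) - (gt[0] + gt[2])) / 2) ** 2 \
         + (((box[1] + box[3]) - (gt[1] + gt[3])) / 2) ** 2
    cw = max(box[2], gt[2]) - min(box[0], gt[0])
    ch = max(box[3], gt[3]) - min(box[1], gt[1])
    c2 = cw ** 2 + ch ** 2
    # aspect-ratio consistency v and trade-off parameter alpha
    wb, hb = box[2] - box[0], box[3] - box[1]
    wg, hg = gt[2] - gt[0], gt[3] - gt[1]
    v = (4 / np.pi ** 2) * (np.arctan(wg / hg) - np.arctan(wb / hb)) ** 2
    alpha = v / ((1 - iou) + v + 1e-9)  # epsilon added for numerical safety
    return 1 - iou + rho2 / c2 + alpha * v

loss_same = ciou_loss((0, 0, 10, 10), (0, 0, 10, 10))   # identical boxes
loss_shift = ciou_loss((0, 0, 10, 10), (5, 5, 15, 15))  # shifted box
```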
The dice loss is a loss function based on the dice coefficient, which is calculated by the following formula:
$$D = \frac{2\sum_{i}^{N} p_i g_i}{\sum_{i}^{N} p_i^2 + \sum_{i}^{N} g_i^2},$$
where the sums run over the $N$ voxels of the predicted binary segmentation volume $p_i \in P$ and the ground truth binary volume $g_i \in G$. This formulation of dice can be differentiated, yielding the gradient
$$\frac{\partial D}{\partial p_j} = 2\left[\frac{g_j\left(\sum_{i}^{N} p_i^2 + \sum_{i}^{N} g_i^2\right) - 2 p_j\left(\sum_{i}^{N} p_i g_i\right)}{\left(\sum_{i}^{N} p_i^2 + \sum_{i}^{N} g_i^2\right)^2}\right]$$
computed with respect to the $j$-th voxel of the prediction.
In this way, the imbalance between foreground and background pixels was handled.
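The mask_head objective can be sketched directly from the dice coefficient; the loss is $1 - D$, and the small epsilon guarding the denominator is an implementation assumption:

```python
import numpy as np

def dice_loss(pred, gt, eps=1e-6):
    """Dice loss = 1 - D, with D = 2*sum(p*g) / (sum(p^2) + sum(g^2))."""
    p, g = pred.ravel(), gt.ravel()  # flatten to the N "voxels" of the formula
    d = 2.0 * (p * g).sum() / ((p ** 2).sum() + (g ** 2).sum() + eps)
    return 1.0 - d

gt = np.zeros((4, 4))
gt[1:3, 1:3] = 1.0               # small foreground region on a large background
perfect = dice_loss(gt, gt)      # perfect mask -> loss near 0
wrong = dice_loss(1.0 - gt, gt)  # fully inverted mask -> loss 1
```

Because both sums run only over predicted and ground-truth foreground mass, the loss is insensitive to how many background pixels surround the goose, which is what mitigates the foreground/background imbalance.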
Figure 6 shows the main architecture of our model. After data augmentation, the data were sent to the ResNet backbone to extract rich features. To better utilize the features extracted by the backbone, the improved PANet was used. Additionally, we utilized a parallel detection scheme, allowing the target detection head and the dynamic mask head to detect and segment at the same time. Moreover, this part adopted a multihead attention mechanism to extend the capability of both detection and segmentation. Five pairs of parallel detection heads were used in this paper.