1. Introduction
Pest images captured with a light trap exhibit wide variability in pest appearance owing to severe occlusion, pose variation, and even scale variation. In particular, when a large number of pests are caught in a trap, they stick to each other and are obscured by other pests. Pest posture also varies widely; for example, wings may be folded or unfolded, and wings can be clipped, distorting their shape. Moreover, pests appear similar to one another because they share similar textures and colors, and the number of pests can vary significantly. These issues make pests difficult to distinguish.
Figure 1 shows an example of pest images captured using a light trap. The pest counting problem, which aims to predict the number of pests from a pest image, is extremely challenging because of pose variation, changes in the number of pests, occlusion, and similar appearance in color and texture.
Two approaches can be considered for pest counting [1]. One is object detection, which localizes bounding boxes in the pest image, and the other is crowd counting, which predicts a density map to determine the number of objects in an image.
Figure 2 shows an example of bounding boxes detected with an object detector and a density map predicted with a crowd counter. Thus far, these two approaches have been applied separately depending on the number of objects, such as cars or pedestrians: when the number of objects is significantly large, crowd counting is used; otherwise, object detection is chosen. However, the number of pests varies greatly, as shown in Figure 1. This raises the question of which approach is more suitable for trap-based pest counting. To the best of our knowledge, little research has been conducted on this topic thus far.
In this study, the two approaches mentioned above were tested and compared for trap-based pest counting; to the best of our knowledge, this is the first such attempt. In addition, to overcome challenging problems such as pose variation, occlusion, and similar appearance, a new pest-counting model, referred to as multiscale and deformable attention CenterNet (Mada-CenterNet), is proposed.
1.1. Proposed Mada-CenterNet vs. Conventional CenterNet
The new Mada-CenterNet is a substantially advanced version of the conventional CenterNet [2], optimized for trap-based pest counting. The first reason for choosing CenterNet as the base model for pest counting is that it can be viewed as a hybrid approach combining bounding box localization and density map generation. Unlike existing object detectors, such as Faster R-CNN [3] and RetinaNet [4], which focus on predicting bounding box parameters via a regression function, CenterNet additionally exploits heatmaps in which white pixels indicate the centroids of pests. Notably, heatmap generation is similar to the density map generation widely used for crowd counting. The minor difference is that the centroids of pests retain their white pixel values, preserving the peak values for more accurate localization after Gaussian filtering. This heatmap generation can handle severe occlusion and wide pose variation more robustly than conventional object detectors. Therefore, hypothetically, the hybrid approach of CenterNet is more suitable for trap-based pest counting than other object detection and crowd counting models. The second reason is that CenterNet is reported to outperform state-of-the-art object detection models in terms of speed and accuracy on object detection datasets (for example, the COCO dataset [5]). However, the following aspects of CenterNet must be revised for trap-based pest counting.
First, CenterNet predicts only a single-scale heatmap. However, as shown in Figure 1, the number of pests can vary significantly depending on the timing of pest outbreaks. Large variations in the number of pests cause scale problems: when the number of pests is small, a small-scale heatmap is more efficient and sufficient for pest counting; conversely, when the number is large, a large-scale heatmap is required to handle severe occlusion and wide pose variation. Therefore, CenterNet must adopt multiscale heatmap generation. To this end, in the proposed Mada-CenterNet, low-resolution (LR) and high-resolution (HR) backbones are constructed for small-scale guided heatmap generation in a two-step fashion;
Second, CenterNet uses stacked hourglasses as the backbone, but information does not flow between the stacked hourglasses. We hypothesize that jointly learning the internal LR and HR features produced inside the hourglasses can boost their discriminative power. In the proposed Mada-CenterNet, a new between-hourglass skip connection is designed based on deformable and multiscale attention to transfer internal LR feature information to the HR hourglass. This helps generate more accurate HR heatmaps and increases pest counting accuracy. In other words, a new LR and HR joint feature learning scheme is proposed for Mada-CenterNet;
Third, because CenterNet was developed for object detection datasets with mild pose variation and occlusion, it excludes geometric transformation. However, as shown in Figure 1, pest images can exhibit large pose variations and severe occlusions. To address these problems, the conventional CenterNet should incorporate geometric transformation to enhance the internal LR and HR features. In the proposed Mada-CenterNet, deformable convolution is newly adopted in the between-hourglass skip connection and applied to the internal LR features, which are jointly learned with the internal HR features through multiscale attention, thereby focusing on more attentive areas and boosting joint feature learning for more accurate pest counting.
1.2. Our Contributions
The contributions of this paper are twofold. First, numerous object detection models exist for pedestrians and cars, but few have been designed for trap-based pest counting. In this study, we propose a new Mada-CenterNet optimized for trap-based pest counting; in particular, we present a pest counting model that can overcome challenging problems such as pose variation, occlusion, and similar appearance. Second, our dataset and source code will be publicly accessible, making it easier to develop trap-based pest counting models via transfer learning. Moreover, the experimental results confirm that the proposed model outperforms existing state-of-the-art (SOTA) models, indicating that it can serve as a baseline for trap-based pest counting. Our code and dataset can be downloaded from https://github.com/cvmllab (accessed on 28 July 2023).
3. Background
Because the proposed Mada-CenterNet is a substantially advanced version of the conventional CenterNet [2], an introduction to CenterNet is necessary. CenterNet has demonstrated powerful performance for object detection; it outperforms state-of-the-art object detection models such as Faster R-CNN and RetinaNet in terms of speed and accuracy.
Figure 4 shows the architecture of the CenterNet model. As shown in Figure 4, CenterNet uses two hourglasses as backbones for feature extraction and predicts three types of maps: heatmaps, offset maps, and bounding box maps. The two hourglasses are arranged in series and have the same scale in the feature domain. Unlike conventional two-stage and single-shot object detectors, CenterNet additionally predicts two heatmaps of the same scale, in which white pixels indicate the centroids of the objects in the input image. Indeed, the centroids of the objects are identical to those of the bounding boxes that surround them. The centroid is referred to as the keypoint in [2]. The input image and heatmap are defined as
$$I \in \mathbb{R}^{W \times H \times 3}, \qquad (1)$$
$$Y \in [0, 1]^{\frac{W}{R} \times \frac{H}{R} \times C}, \qquad (2)$$
where $I$ denotes the input pest image, and $Y$ denotes the heatmap. $W$ and $H$ denote the width and height of the input image, respectively. $R$ denotes the stride used to determine the resolution of the heatmap, and $C$ denotes the number of object classes. $Y$ has a value of one at the keypoints and zero at all other pixel locations. Gaussian filtering is applied at the keypoints to smooth the heatmap ($Y$), according to Equation (3):
$$Y = \delta_p * G, \qquad (3)$$
where $\delta$ denotes the delta function, $G$ denotes the Gaussian filter, $*$ denotes the convolution operation, and $p$ denotes the keypoint. The generation of the heatmap is similar to that of the density map, which has been widely used for crowd counting [7]. However, a significant difference is observed between them. Compared with the density map, the pixel values at the keypoints in the heatmap, which correspond to the white pixels, remain unchanged after Gaussian filtering. Therefore, the summation of the heatmap is not equal to the number of pests in the pest image. The purpose of using the heatmap is to localize the centroids of the objects in the pest image. Therefore, the peak values should be maintained to effectively determine the keypoints as the brightest pixels.
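To make the peak-preserving property concrete, the following is a minimal NumPy sketch of this heatmap generation; splatting an unnormalized Gaussian with peak value one at each keypoint is equivalent to convolving a delta function with such a kernel. The kernel width sigma and the single-class setting are assumptions of this sketch, not values taken from the paper.

```python
import numpy as np

def make_heatmap(keypoints, height, width, sigma=2.0):
    """Peak-preserving heatmap: an unnormalized Gaussian (peak value 1) is
    splatted at each keypoint; overlaps are merged with an element-wise max,
    so every centroid stays white (value 1) after smoothing."""
    ys, xs = np.mgrid[0:height, 0:width]
    heatmap = np.zeros((height, width), dtype=np.float32)
    for kx, ky in keypoints:  # keypoint coordinates in heatmap space
        g = np.exp(-((xs - kx) ** 2 + (ys - ky) ** 2) / (2.0 * sigma ** 2))
        heatmap = np.maximum(heatmap, g)  # keep peaks at exactly 1
    return heatmap
```

Note that, unlike a normalized density map, the sum of this heatmap does not equal the object count, which is exactly the difference described above.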
To train CenterNet, a total loss function is defined as follows:
$$L_{total} = L_{hm} + \lambda_{size} L_{size} + \lambda_{off} L_{off}, \qquad (4)$$
where $L_{hm}$, $L_{size}$, and $L_{off}$ calculate the prediction errors for the ground-truth heatmap, bounding box map, and offset map, respectively. First, to model a loss function for the heatmap, the focal loss, which is a variant of cross-entropy, is used to address the class imbalance problem during training, as shown in Equation (5):
$$L_{hm} = \frac{-1}{N} \sum_{xyc} \begin{cases} \left(1-\hat{Y}_{xyc}\right)^{\alpha} \log\left(\hat{Y}_{xyc}\right) & \text{if } Y_{xyc} = 1 \\ \left(1-Y_{xyc}\right)^{\beta} \left(\hat{Y}_{xyc}\right)^{\alpha} \log\left(1-\hat{Y}_{xyc}\right) & \text{otherwise.} \end{cases} \qquad (5)$$
Here, $\hat{Y}$ denotes the predicted heatmap, and a prediction $\hat{Y}_{xyc}$ with $Y_{xyc}=1$ corresponds to a keypoint in the ground-truth heatmap. $N$ denotes the total number of keypoints in the input image. The focal loss downweights the loss for well-classified examples and focuses more on difficult, misclassified examples. In Equation (5), $\alpha$ and $\beta$ are set to 2 and 4, respectively. Second, in Equation (6), $s_k$ contains the width and height of the ground-truth bounding box at the $k$-th keypoint $p_k$, and $\hat{S}$ is the predicted bounding box map that has the same size as $\hat{Y}$ but two channels. Thus, $L_{size}$ is the sum of the errors between the predicted and ground-truth bounding boxes:
$$L_{size} = \frac{1}{N} \sum_{k=1}^{N} \left| \hat{S}_{p_k} - s_k \right|. \qquad (6)$$
Third, $L_{off}$ is required to reflect the discretization errors caused by downsampling at a ratio of $R$:
$$L_{off} = \frac{1}{N} \sum_{k=1}^{N} \left| \hat{O}_{\left(p_k/R\right)} - \left( \frac{p_k}{R} - \left( \frac{p_k}{R} \right) \right) \right|. \qquad (7)$$
In Equation (7), the parentheses imply rounding off to obtain an integer pixel location, and $\hat{O}$ denotes the offset map, which has the same size as $\hat{Y}$ and contains offsets for the 2D pixel coordinates. In Equation (4), $\lambda_{size}$ and $\lambda_{off}$ denote weights that are set to 0.1 and 1, respectively. To reduce the total loss in Equation (4) iteratively, gradient-based optimizers [26] can be used.
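As a concrete reference, below is a minimal PyTorch sketch of the penalty-reduced focal loss of Equation (5) with the stated α = 2 and β = 4; the tensor layout (B, C, H, W) is an assumption of this sketch.

```python
import torch

def heatmap_focal_loss(pred, gt, alpha=2, beta=4, eps=1e-6):
    """Focal loss of Equation (5): pred and gt are (B, C, H, W) heatmaps,
    and gt equals 1 exactly at the keypoints."""
    pred = pred.clamp(eps, 1.0 - eps)            # numerical safety for log()
    pos = gt.eq(1).float()                       # keypoint locations
    pos_loss = ((1 - pred) ** alpha) * torch.log(pred) * pos
    neg_loss = ((1 - gt) ** beta) * (pred ** alpha) * torch.log(1 - pred) * (1 - pos)
    n = pos.sum().clamp(min=1)                   # N: total number of keypoints
    return -(pos_loss.sum() + neg_loss.sum()) / n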
In the test phase, max pooling is first applied to the heatmap predicted by the latter hourglass to remove noise and determine the keypoints as the brightest pixels. Subsequently, at the keypoints, bounding boxes are recovered using the offset and bounding box maps. Therefore, in the case of CenterNet, locating the keypoints accurately is crucial for increasing pest counting accuracy.
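This peak extraction can be sketched as follows; the 3×3 pooling window and the top-k cutoff are assumptions rather than values taken from the paper.

```python
import torch
import torch.nn.functional as F

def extract_keypoints(heatmap, k=100):
    """Keep only local maxima of a predicted (B, C, H, W) heatmap: a pixel
    survives iff it equals the maximum of its 3x3 neighborhood, which acts
    as non-maximum suppression. The k brightest peaks are returned."""
    pooled = F.max_pool2d(heatmap, kernel_size=3, stride=1, padding=1)
    peaks = heatmap * (pooled == heatmap).float()       # suppress non-peaks
    scores, flat_idx = torch.topk(peaks.flatten(1), k)  # (B, k) each
    return scores, flat_idx  # flat_idx decodes to (class, y, x) positions
```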
4. Proposed Mada-CenterNet for Trap-Based Pest Counting
The conventional CenterNet has certain drawbacks. As shown in Figure 4, the two hourglasses have the same scale in the feature domain, feature information does not flow between them, and geometric deformation is not considered. In the case of trap-based pest counting, severe occlusion, wide pose variation, and changes in the number of pests appear in pest images, as shown in Figure 1. To cope with these problems, the conventional CenterNet must be advanced; thus, a multiscale and deformable model based on internal multiscale joint feature learning is required for more accurate pest counting.
The architecture of the proposed Mada-CenterNet, which incorporates deformable and multiscale attention based on internal LR and HR multiscale joint feature learning, is illustrated in Figure 5; noticeable differences exist compared to Figure 4. First, a between-hourglass skip connection, drawn with thick red lines in Figure 5, is newly constructed. The internal LR features produced inside the LR hourglass flow into the HR hourglass via the between-hourglass skip connection to realize deformable and multiscale attention. This design enables the transfer of internal LR feature information into the HR hourglass and focuses on more important areas in the HR feature domain, thereby alleviating pose deformation and occlusion problems. Second, the internal multiscale features of the LR and HR hourglasses are extracted and fused in the proposed Mada-CenterNet. As shown in Figure 5, the LR and HR hourglasses are used as the LR and HR feature extractors, respectively. In pest images, the number of pests varies widely. For a small number of pests, extracting small-scale features and predicting small-scale heatmaps is more efficient and sufficient; in the opposite case, larger-scale features are required owing to occlusion. Through the proposed between-hourglass skip connection, the LR hourglass provides small-scale internal features to the HR hourglass for multiscale-based attention, thereby boosting the discriminative power of the HR hourglass. In other words, the LR hourglass plays the role of a teacher network that transfers internal LR feature knowledge to the HR hourglass. Therefore, the proposed Mada-CenterNet can adapt to the number of pests in the input image, alleviating the scaling problem and increasing the discriminative power of the HR hourglass. Third, geometric deformation is incorporated into the between-hourglass skip connection, where internal LR features are sampled to determine more discriminative LR features and jointly learn HR features, agnostic to pest occlusion and pose variation. This approach enhances internal multiscale joint feature learning more effectively, thereby improving pest counting accuracy.
The proposed Mada-CenterNet largely consists of an LR hourglass, an upsampling feature transformation, a global residual skip connection, input feature fusion, a between-hourglass skip connection based on deformable and multiscale attention for internal multiscale joint feature learning, and an HR hourglass.
4.1. Prediction of LR Maps
The input pest image is first embedded into the LR feature domain through convolution and residual blocks and then fed into the LR hourglass for deep feature extraction. Specifically, the pest color image $I$ passes through the convolution block $f_{CB}$ and the residual block $f_{RB}$, whose composition (denoted by the symbol $\circ$) produces the shallow features $F_s = (f_{RB} \circ f_{CB})(I)$; the LR hourglass $f_{HG}^{LR}$ then produces the deep features $F_d = f_{HG}^{LR}(F_s)$. Thus, $F_s$ and $F_d$ correspond to the shallow and deep features before and after passing through the LR hourglass, respectively.
To map the deep features, $F_d$, to the LR predictions, that is, the three types of maps, one convolution block and one convolution operation are additionally applied. $\hat{Y}^{LR}$, $\hat{O}^{LR}$, and $\hat{S}^{LR}$ correspond to the predicted LR heatmap, offset map, and bounding box map, respectively. To train the three types of LR maps, a new LR loss function is defined following the form of Equation (4). Here, $S^{LR}$ stores the width and height of the ground-truth bounding box at each keypoint, and $O^{LR}$ stores the offset information. $\hat{S}^{LR}$ and $\hat{O}^{LR}$ have the same number of channels, that is, two. $\hat{Y}^{LR}$ is a grayscale map because the pests captured by the trap include only one species. $\lambda_{size}$ and $\lambda_{off}$ are set to 0.1 and 1, respectively.
To generate the ground-truth LR heatmap, the HR bounding boxes are first scaled down according to the stride $R$, and the LR keypoints are then redefined. The newly rendered LR heatmap has white pixels at the redefined keypoints. Subsequently, Gaussian filtering is applied to blur the LR heatmap, according to Equation (3). Notably, the white pixel values remain unchanged after Gaussian filtering to maintain the peaks, enabling easy determination of the keypoints; this is the main difference between the heatmap and the density map. The offset map is created using the discretized centroids of the LR bounding boxes.
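A short sketch of how the LR keypoints and the corresponding offset targets can be derived from the HR centroids, assuming the paper's convention that rounding yields the integer keypoint and the residual becomes the offset target:

```python
import numpy as np

def lr_keypoints_and_offsets(centroids, stride):
    """Scale HR centroids down by the stride, round them to integer LR pixel
    locations (the keypoints), and keep the sub-pixel residual as the
    offset-map target (the discretization error)."""
    scaled = np.asarray(centroids, dtype=np.float32) / stride
    keypoints = np.round(scaled).astype(np.int64)   # discretized centroids
    offsets = scaled - keypoints                    # values in [-0.5, 0.5]
    return keypoints, offsets
```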
4.2. Upsampling Feature Transformation
The deep features output by the last convolution block behind the LR hourglass, together with the three types of predicted maps, are exploited to predict the HR maps more accurately. However, a scale mismatch exists between the LR and HR maps; therefore, feature scaling must be performed. This is done by an upsampling block, $f_{up}$, consisting of upsampling and convolution layers that enlarge the feature maps. In this study, bicubic interpolation is used to implement the upsampling layer.
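A minimal PyTorch sketch of such an upsampling block, assuming a single 3×3 convolution after bicubic interpolation (the kernel size and channel counts are assumptions):

```python
import torch.nn as nn
import torch.nn.functional as F

class UpsampleBlock(nn.Module):
    """Bicubic upsampling followed by a convolution, used to enlarge the LR
    feature maps to the HR scale before fusion."""
    def __init__(self, in_ch, out_ch, scale=2):
        super().__init__()
        self.scale = scale
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, x):
        x = F.interpolate(x, scale_factor=self.scale, mode="bicubic",
                          align_corners=False)
        return self.conv(x)
```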
4.3. Global Residual Skip Connection and Input Feature Fusion
A better approach would be for the HR hourglass to also utilize information from the input pest image. To this end, a global residual skip connection (GRSC) is considered. In this study, residual and convolution blocks are used to design the GRSC, as shown in Figure 5. Through the GRSC, the visual information of the input pest image can be transferred to the HR hourglass.
Here, $F_{GR}$ represents the output feature map of the GRSC, and $[\,\cdot\,]$ represents the concatenation for feature fusion. In Equation (17), the upsampled LR feature maps, including the LR heatmap, offset map, and bounding box map, are fused with the input pest image in the feature domain, making the input features richer and more discriminative.
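The fusion itself then reduces to a channel-wise concatenation, sketched below; the argument names are hypothetical:

```python
import torch

def fuse_inputs(grsc_feat, up_heatmap, up_offset, up_bbox):
    """Channel-wise concatenation of the GRSC output with the upsampled LR
    heatmap, offset, and bounding box maps; all tensors are assumed to share
    the HR spatial resolution."""
    return torch.cat([grsc_feat, up_heatmap, up_offset, up_bbox], dim=1)
```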
4.4. Between-Hourglass Skip Connection Based on Deformable and Multiscale Attention
The fused input features, $F_{in}$, contain the three types of predicted LR maps, and the goal of the HR hourglass is to predict the corresponding HR maps from $F_{in}$. Therefore, the fused input features help to improve the performance of the HR map prediction. In addition, the architecture of the LR hourglass is the same as that of the HR hourglass; only the sizes of the internal feature maps differ. Unlike the conventional CenterNet, in this study, the LR hourglass is connected to the HR hourglass via the between-hourglass skip connection, as shown by the red lines in Figure 5. In other words, the internal LR features produced inside the LR hourglass are fed into the HR hourglass to be jointly learned with the internal HR features. To fuse the internal LR and HR features at different scales, deformable and multiscale attention is designed.
Figure 6 illustrates the detailed architecture of the proposed between-hourglass skip connection, which is built on deformable convolution and multiscale attention for internal multiscale joint feature learning.
4.4.1. Internal LR Feature Deformation
The standard convolution extracts local features with many filters in a DCNN and has been shown to exhibit powerful performance for feature learning, particularly in computer vision problems. However, standard convolution has an inherent limitation in modeling geometric deformation because it can only extract local features on the regular grid centered at each position of the sliding filter. To address this, deformable convolution was devised. Unlike standard convolution, deformable convolution adds 2D offsets to the regular grid, thereby enhancing the capability of the DCNN to model geometric transformation.
The pest images captured in a light trap, which are targeted in this study, exhibit severe occlusion and wide pose variation; therefore, deformable convolution is considered to model geometric deformation. In the proposed Mada-CenterNet, deformable convolution is inserted into the between-hourglass skip connection to apply geometric deformation to the internal LR features, as shown in Figure 6.
Here, $F_i^{LR}$ indicates the output feature map at the $i$-th residual block (RB) in the LR hourglass, and $f_{dc}$ denotes the deformable convolution. In the LR hourglass, $f_{dc}$ is not applied to the first and last two RBs because of the computational complexity of the subsequent multiscale attention. Equation (18) indicates that the LR hourglass produces the deformed version of the internal LR feature map, $D_i^{LR} = f_{dc}(F_i^{LR})$. The deformed LR features, $D_i^{LR}$, are transferred to the HR hourglass for internal multiscale attention fusion.
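A sketch of this step using torchvision's deformable convolution; pairing it with a plain convolution that predicts the 2D sampling offsets is the standard usage pattern, and the channel count is an assumption of this sketch:

```python
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformableLRBranch(nn.Module):
    """Applies deformable convolution to an internal LR feature map, as in
    the between-hourglass skip connection."""
    def __init__(self, channels, k=3):
        super().__init__()
        # A plain conv predicts two sampling offsets (dx, dy) per kernel tap.
        self.offset_pred = nn.Conv2d(channels, 2 * k * k, k, padding=k // 2)
        self.deform = DeformConv2d(channels, channels, k, padding=k // 2)

    def forward(self, feat_lr):
        offsets = self.offset_pred(feat_lr)   # learned geometric deformation
        return self.deform(feat_lr, offsets)  # deformed internal LR features
```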
4.4.2. Internal LR Feature-Guided Multiscale Attention
The details of the multiscale attention for internal LR and HR feature fusion are shown in Figure 7. Unlike the original attention model in the Transformer [27], the three types of inputs, key (K), query (Q), and value (V), are visual feature maps, and K has a different scale than Q and V. In other words, K is small scale and takes the linearly transformed internal LR features, $D_i^{LR}$, as input; that is, K represents the deformed LR features. In contrast, Q and V are large scale: the internal HR feature maps are assigned to Q and V after applying a linear transformation. In this study, scaled dot-product attention [27] was chosen to implement the multiscale attention. In Figure 6, the HR hourglass follows the encoder–decoder framework; thus, the multiscale attention is implemented slightly differently for the encoder and decoder.
For the encoder of the HR hourglass, multiscale attention is designed as follows:
$$Q = W_Q F_i^{HR}, \quad V = W_V F_i^{HR}, \qquad (19)$$
$$K = W_K D_i^{LR}, \qquad (20)$$
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right) V, \qquad (21)$$
where $F_i^{HR}$ indicates the output feature map at the $i$-th RB in the HR hourglass. Similar to the LR hourglass, the first two and last three RBs are excluded from the multiscale fusion. In Equation (19), the input feature map, $F_{in}$, passes through the HR hourglass, followed by the linear transformations $W_Q$ and $W_V$, to produce the internal HR features, $F_i^{HR}$, that are then assigned to Q and V. Similarly, the deformed internal LR features, $D_i^{LR}$, are assigned to K after the linear transformation $W_K$, as shown in Equation (20). Scaled dot-product attention is used to implement the multiscale attention of the encoder. In Equation (21), $d_k$ denotes the dimension of K, and $\mathrm{softmax}$ represents the softmax function used to calculate weights between 0 and 1. Here, the deformed internal LR features, $D_i^{LR}$, are used to calculate the similarity matrix, $QK^T$, and to determine the LR features that are more important for pest counting. In other words, they serve as a guide for learning the weights of the internal HR features, $F_i^{HR}$.
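A hedged PyTorch sketch of the encoder-side multiscale attention follows: Q and V are taken from the internal HR features and K from the deformed internal LR features, as in Equations (19)–(21). Interpolating the LR tokens onto the HR grid so that the token counts match, and sharing the channel count between the two hourglasses, are assumptions of this sketch; the paper's exact alignment may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiscaleAttention(nn.Module):
    """Scaled dot-product attention in which K comes from deformed LR
    features and Q, V come from HR features."""
    def __init__(self, channels, dim):
        super().__init__()
        self.wq = nn.Linear(channels, dim)   # W_Q
        self.wk = nn.Linear(channels, dim)   # W_K
        self.wv = nn.Linear(channels, dim)   # W_V
        self.scale = dim ** -0.5             # 1 / sqrt(d_k)

    def forward(self, hr_feat, lr_feat):
        b, c, h, w = hr_feat.shape
        # Align LR tokens to the HR grid (an assumption of this sketch).
        lr_up = F.interpolate(lr_feat, size=(h, w), mode="bilinear",
                              align_corners=False)
        # (B, C, H, W) -> (B, H*W, C) token sequences.
        q = self.wq(hr_feat.flatten(2).transpose(1, 2))
        v = self.wv(hr_feat.flatten(2).transpose(1, 2))
        k = self.wk(lr_up.flatten(2).transpose(1, 2))   # LR-guided keys
        attn = torch.softmax((q @ k.transpose(1, 2)) * self.scale, dim=-1)
        out = attn @ v                # LR-guided weighting of HR features
        return out.transpose(1, 2).reshape(b, -1, h, w)
```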
For the decoder of the HR hourglass, the multiscale attention additionally requires the internal HR features transferred by the encoder. The multiscale attention for the decoder is modified as follows:
$$\mathrm{Attention}_{dec}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right) V + E_i^{HR}, \qquad (22)$$
where $E_i^{HR}$ represents the internal HR features transferred by the encoder via the within-hourglass skip connection, as shown in Figure 6, which are added to the scaled dot-product attention result.
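The decoder variant thus differs only by the additive skip term; a one-line sketch under the same assumptions (the helper names are hypothetical):

```python
def decoder_attention(attn, hr_dec_feat, deformed_lr_feat, enc_skip_feat):
    """Decoder-side multiscale attention: the encoder's HR features arriving
    over the within-hourglass skip connection are added to the result."""
    return attn(hr_dec_feat, deformed_lr_feat) + enc_skip_feat
```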
In the proposed multiscale attention, the internal LR features are jointly learned with the internal HR features via the between-hourglass skip connection; the internal LR features serve as a guide for enhancing the internal HR features. Compared with other vision transformers (VTs) [28,29], in this study, two types of internal LR and HR features, which are the outputs of two backbones, are learned jointly based on scaled dot-product attention. Notably, the internal LR features are deformed to be more robust to pest occlusion and wide pose variation, enabling the internal HR features to become more discriminative. Other VTs use only one backbone; thus, the two types of internal multiscale features are not considered. This is a key difference between the proposed multiscale attention and other VTs.
4.5. Prediction of HR Maps
The internal HR features are made more discriminative through the proposed multiscale attention, in which the deformed internal LR features are jointly learned with the internal HR features to focus on the more important areas in the feature domain. The output feature map of the HR hourglass, $F_{out}^{HR}$, passes through a convolution block and is then transformed into the final predicted HR maps, $\hat{Y}^{HR}$, $\hat{O}^{HR}$, and $\hat{S}^{HR}$, which correspond to the predicted HR heatmap, offset map, and bounding box map, respectively.
To train the HR maps, an HR loss function is defined. It is the same as the LR loss function in Equation (4), except that the predicted LR maps are replaced with the predicted HR maps, $\hat{Y}^{HR}$, $\hat{O}^{HR}$, and $\hat{S}^{HR}$. Once again, the white pixel values in the ground-truth HR heatmap remain unchanged after Gaussian filtering to maintain the peak values, enabling easy determination of the keypoints; this is the key difference between the heatmap and the density map. $\lambda_{size}$ and $\lambda_{off}$ are again set to 0.1 and 1, respectively.