1. Introduction
Pest images captured with a light trap exhibit wide variability in pest appearance owing to severe occlusion, pose variation, and even scale variation. In particular, when a large number of pests are caught in a trap, they stick to each other and are obscured by other pests. Pest posture also varies widely; for example, wings may be folded or unfolded, and wings can be clipped, distorting their shape. Moreover, pests appear similar to one another because they share similar textures and colors, and the number of pests can vary significantly. These issues make pests difficult to distinguish.
Figure 1 shows an example of pest images captured using a light trap. The pest counting problem, which aims to predict the number of pests from a pest image, is extremely challenging because of pose variation, changes in the number of pests, occlusion, and similar appearance in color and texture.
Two approaches can be considered for pest counting [1]. One is object detection, which localizes bounding boxes in the pest image, and the other is crowd counting, which predicts a density map to determine the number of objects in an image.
Figure 2 shows an example of bounding boxes detected with an object detector and a density map predicted with a crowd counter. Thus far, these two approaches have been applied separately depending on the number of objects, such as cars or pedestrians: when the number of objects is significantly large, crowd counting is used; otherwise, object detection is chosen. However, the number of pests varies greatly, as shown in Figure 1. This raises the question of which approach is more suitable for trap-based pest counting. To the best of our knowledge, little research has been conducted on this topic thus far.
In this study, the two approaches mentioned above were tested and compared for trap-based pest counting; to the best of our knowledge, this is the first such attempt. In addition, to overcome challenging problems such as pose variation, occlusion, and similar appearance, a new pest-counting model, referred to as multiscale and deformable attention CenterNet (Mada-CenterNet), is proposed.
1.1. Proposed Mada-CenterNet vs. Conventional CenterNet
The new Mada-CenterNet is a substantially advanced version of the conventional CenterNet [2], optimized for trap-based pest counting. The first reason for choosing CenterNet as the base model for pest counting is that it can be viewed as a hybrid approach combining bounding box localization and density map generation. Unlike existing object detectors, such as Faster R-CNN [3] and RetinaNet [4], which focus on predicting bounding box parameters via a regression function, CenterNet additionally exploits heatmaps in which white pixels indicate the centroids of pests. Notably, heatmap generation is similar to the density map generation widely used for crowd counting. The minor difference is that the centroids of pests retain their white pixel values, preserving the peak values for more accurate localization after Gaussian filtering. This heatmap generation can handle severe occlusion and wide pose variation more robustly than conventional object detectors. Therefore, hypothetically, the hybrid approach of CenterNet is more suitable for trap-based pest counting than other object detection and crowd counting models. The second reason is that CenterNet is reported to outperform state-of-the-art object detection models in terms of speed and accuracy on object detection datasets (for example, the COCO dataset [5]). However, the following aspects of CenterNet must be revised for trap-based pest counting.
First, CenterNet predicts only a single-scale heatmap. However, as shown in Figure 1, the number of pests can vary significantly depending on the timing of pest outbreaks. Large variations in the number of pests cause scale problems: when the number of pests is small, a small-scale heatmap is more efficient and sufficient for pest counting; conversely, when the number is large, a large-scale heatmap is required to handle severe occlusion and wide pose variation. Therefore, CenterNet must adopt multiscale heatmap generation. To this end, in the proposed Mada-CenterNet, low-resolution (LR) and high-resolution (HR) backbones are constructed for small-scale guided heatmap generation in a two-step fashion;
Second, CenterNet uses stacked hourglasses as the backbone, but information does not flow between the stacked hourglasses. We hypothesize that jointly learning the internal LR and HR features produced inside the hourglasses can boost their discriminative power. In the proposed Mada-CenterNet, a new between-hourglass skip connection is designed based on deformable and multiscale attention to transfer internal LR feature information to the HR hourglass. This helps generate more accurate HR heatmaps and increases pest counting accuracy. In other words, a new LR and HR joint feature learning scheme is proposed for Mada-CenterNet;
Third, because CenterNet was developed for object detection datasets with mild pose variation and occlusion, it excludes geometric transformation. However, as shown in Figure 1, pest images can exhibit large pose variations and severe occlusions. To address these problems, the conventional CenterNet should incorporate geometric transformation to enhance the internal LR and HR features. In the proposed Mada-CenterNet, deformable convolution is newly adopted in the between-hourglass skip connection and applied to the internal LR features, which are jointly learned with the internal HR features through multiscale attention, thereby focusing on more attentive areas and boosting joint feature learning for more accurate pest counting.
1.2. Our Contributions
The contributions of this paper are twofold. First, numerous object detection models exist for pedestrians and cars, but few have been designed for trap-based pest counting. In this study, we propose a new Mada-CenterNet optimized for trap-based pest counting; in particular, we present a pest counting model that can overcome challenging problems such as pose variation, occlusion, and similar appearance. Second, our dataset and source code will be publicly accessible, making it easier to develop trap-based pest counting models via transfer learning. Moreover, the experimental results confirm that the proposed model outperforms existing state-of-the-art (SOTA) models, indicating that it can serve as a baseline for trap-based pest counting. Our code and dataset can be downloaded from https://github.com/cvmllab (accessed on 28 July 2023).
3. Background
Because the proposed Mada-CenterNet is a substantially advanced version of the conventional CenterNet [2], an introduction to CenterNet is necessary. CenterNet has demonstrated powerful performance for object detection; it outperforms state-of-the-art object detection models such as Faster R-CNN and RetinaNet in terms of speed and accuracy.
Figure 4 shows the architecture of the CenterNet model. As shown in Figure 4, CenterNet uses two hourglasses as backbones for feature extraction and predicts three types of maps: heatmaps, offset maps, and bounding box maps. The two hourglasses are arranged in series and have the same scale in the feature domain. Unlike conventional two-stage and single-shot object detectors, CenterNet additionally predicts two heatmaps of the same scale, in which white pixels indicate the centroids of the objects in the input image. Indeed, the centroids of the objects are identical to those of the bounding boxes that surround them. The centroid is referred to as the keypoint in [2]. The input image and heatmap are defined as
$$I \in \mathbb{R}^{W \times H \times 3}, \qquad (1)$$
$$Y \in [0, 1]^{\frac{W}{R} \times \frac{H}{R} \times C}, \qquad (2)$$
where $I$ denotes the input pest image, and $Y$ denotes the heatmap. $W$ and $H$ denote the width and height of the input image, respectively. $R$ denotes the stride used to determine the resolution of the heatmap, and $C$ denotes the number of object classes. $Y$ has a value of one at the keypoints and zero at all other pixel locations. Gaussian filtering is applied at the keypoints to smooth the heatmap ($Y$), according to Equation (3):
$$Y = \delta_p * G, \qquad (3)$$
where $\delta$ denotes the delta function, $G$ denotes the Gaussian filter, $*$ denotes the convolution operation, and $p$ denotes the keypoint. The generation of the heatmap is similar to that of the density map, which has been widely used for crowd counting [7]. However, a significant difference is observed between them. Compared with the density map, the pixel values at the keypoints in the heatmap, which correspond to the white pixels, remain unchanged after Gaussian filtering. Therefore, the summation of the heatmap is not equal to the number of pests in the pest image. The purpose of using the heatmap is to localize the centroids of the objects in the pest image. Therefore, the peak values should be maintained to effectively determine the keypoints as the brightest pixels.
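To make the peak-preserving property concrete, the following is a minimal NumPy sketch of this heatmap generation; splatting an unnormalized Gaussian with peak value one at each keypoint is equivalent to convolving a delta function with such a kernel. The kernel width sigma and the single-class setting are assumptions of this sketch, not values taken from the paper.

```python
import numpy as np

def make_heatmap(keypoints, height, width, sigma=2.0):
    """Peak-preserving heatmap: an unnormalized Gaussian (peak value 1) is
    splatted at each keypoint; overlaps are merged with an element-wise max,
    so every centroid stays white (value 1) after smoothing."""
    ys, xs = np.mgrid[0:height, 0:width]
    heatmap = np.zeros((height, width), dtype=np.float32)
    for kx, ky in keypoints:  # keypoint coordinates in heatmap space
        g = np.exp(-((xs - kx) ** 2 + (ys - ky) ** 2) / (2.0 * sigma ** 2))
        heatmap = np.maximum(heatmap, g)  # keep peaks at exactly 1
    return heatmap
```

Note that, unlike a normalized density map, the sum of this heatmap does not equal the object count, which is exactly the difference described above.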
To train CenterNet, a total loss function is defined as follows:
$$L_{total} = L_{hm} + \lambda_{size} L_{size} + \lambda_{off} L_{off}, \qquad (4)$$
where $L_{hm}$, $L_{size}$, and $L_{off}$ calculate the prediction errors for the ground-truth heatmap, bounding box map, and offset map, respectively. First, to model a loss function for the heatmap, the focal loss, which is a variant of cross-entropy, is used to address the class imbalance problem during training, as shown in Equation (5):
$$L_{hm} = \frac{-1}{N} \sum_{xyc} \begin{cases} \left(1-\hat{Y}_{xyc}\right)^{\alpha} \log\left(\hat{Y}_{xyc}\right) & \text{if } Y_{xyc} = 1 \\ \left(1-Y_{xyc}\right)^{\beta} \left(\hat{Y}_{xyc}\right)^{\alpha} \log\left(1-\hat{Y}_{xyc}\right) & \text{otherwise.} \end{cases} \qquad (5)$$
Here, $\hat{Y}$ denotes the predicted heatmap, and a prediction $\hat{Y}_{xyc}$ with $Y_{xyc}=1$ corresponds to a keypoint in the ground-truth heatmap. $N$ denotes the total number of keypoints in the input image. The focal loss downweights the loss for well-classified examples and focuses more on difficult, misclassified examples. In Equation (5), $\alpha$ and $\beta$ are set to 2 and 4, respectively. Second, in Equation (6), $s_k$ contains the width and height of the ground-truth bounding box at the $k$-th keypoint $p_k$, and $\hat{S}$ is the predicted bounding box map that has the same size as $\hat{Y}$ but two channels. Thus, $L_{size}$ is the sum of the errors between the predicted and ground-truth bounding boxes:
$$L_{size} = \frac{1}{N} \sum_{k=1}^{N} \left| \hat{S}_{p_k} - s_k \right|. \qquad (6)$$
Third, $L_{off}$ is required to reflect the discretization errors caused by downsampling at a ratio of $R$:
$$L_{off} = \frac{1}{N} \sum_{k=1}^{N} \left| \hat{O}_{\left(p_k/R\right)} - \left( \frac{p_k}{R} - \left( \frac{p_k}{R} \right) \right) \right|. \qquad (7)$$
In Equation (7), the parentheses imply rounding off to obtain an integer pixel location, and $\hat{O}$ denotes the offset map, which has the same size as $\hat{Y}$ and contains offsets for the 2D pixel coordinates. In Equation (4), $\lambda_{size}$ and $\lambda_{off}$ denote weights that are set to 0.1 and 1, respectively. To reduce the total loss in Equation (4) iteratively, gradient-based optimizers [26] can be used.
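As a concrete reference, below is a minimal PyTorch sketch of the penalty-reduced focal loss of Equation (5) with the stated α = 2 and β = 4; the tensor layout (B, C, H, W) is an assumption of this sketch.

```python
import torch

def heatmap_focal_loss(pred, gt, alpha=2, beta=4, eps=1e-6):
    """Focal loss of Equation (5): pred and gt are (B, C, H, W) heatmaps,
    and gt equals 1 exactly at the keypoints."""
    pred = pred.clamp(eps, 1.0 - eps)            # numerical safety for log()
    pos = gt.eq(1).float()                       # keypoint locations
    pos_loss = ((1 - pred) ** alpha) * torch.log(pred) * pos
    neg_loss = ((1 - gt) ** beta) * (pred ** alpha) * torch.log(1 - pred) * (1 - pos)
    n = pos.sum().clamp(min=1)                   # N: total number of keypoints
    return -(pos_loss.sum() + neg_loss.sum()) / n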
In the test phase, max pooling is first applied to the heatmap predicted by the latter hourglass to remove noise and determine the keypoints as the brightest pixels. Subsequently, at the keypoints, bounding boxes are recovered using the offset and bounding box maps. Therefore, in the case of CenterNet, locating the keypoints accurately is crucial for increasing pest counting accuracy.
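This peak extraction can be sketched as follows; the 3×3 pooling window and the top-k cutoff are assumptions rather than values taken from the paper.

```python
import torch
import torch.nn.functional as F

def extract_keypoints(heatmap, k=100):
    """Keep only local maxima of a predicted (B, C, H, W) heatmap: a pixel
    survives iff it equals the maximum of its 3x3 neighborhood, which acts
    as non-maximum suppression. The k brightest peaks are returned."""
    pooled = F.max_pool2d(heatmap, kernel_size=3, stride=1, padding=1)
    peaks = heatmap * (pooled == heatmap).float()       # suppress non-peaks
    scores, flat_idx = torch.topk(peaks.flatten(1), k)  # (B, k) each
    return scores, flat_idx  # flat_idx decodes to (class, y, x) positions
```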
4. Proposed Mada-CenterNet for Trap-Based Pest Counting
The conventional CenterNet has certain drawbacks. As shown in Figure 4, the two hourglasses have the same scale in the feature domain, feature information does not flow between them, and geometric deformation is not considered. In the case of trap-based pest counting, severe occlusion, wide pose variation, and changes in the number of pests appear in pest images, as shown in Figure 1. To cope with these problems, the conventional CenterNet must be advanced; thus, a multiscale and deformable model based on internal multiscale joint feature learning is required for more accurate pest counting.
The architecture of the proposed Mada-CenterNet, which incorporates deformable and multiscale attention based on internal LR and HR multiscale joint feature learning, is illustrated in Figure 5; noticeable differences exist compared to Figure 4. First, a between-hourglass skip connection, drawn with thick red lines in Figure 5, is newly constructed. The internal LR features produced inside the LR hourglass flow into the HR hourglass via the between-hourglass skip connection to realize deformable and multiscale attention. This design enables the transfer of internal LR feature information into the HR hourglass and focuses on more important areas in the HR feature domain, thereby alleviating pose deformation and occlusion problems. Second, the internal multiscale features of the LR and HR hourglasses are extracted and fused in the proposed Mada-CenterNet. As shown in Figure 5, the LR and HR hourglasses are used as the LR and HR feature extractors, respectively. In pest images, the number of pests varies widely. For a small number of pests, extracting small-scale features and predicting small-scale heatmaps is more efficient and sufficient; in the opposite case, larger-scale features are required owing to occlusion. Through the proposed between-hourglass skip connection, the LR hourglass provides small-scale internal features to the HR hourglass for multiscale-based attention, thereby boosting the discriminative power of the HR hourglass. In other words, the LR hourglass plays the role of a teacher network that transfers internal LR feature knowledge to the HR hourglass. Therefore, the proposed Mada-CenterNet can adapt to the number of pests in the input image, alleviating the scaling problem and increasing the discriminative power of the HR hourglass. Third, geometric deformation is incorporated into the between-hourglass skip connection, where internal LR features are sampled to determine more discriminative LR features and jointly learn HR features, agnostic to pest occlusion and pose variation. This approach enhances internal multiscale joint feature learning more effectively, thereby improving pest counting accuracy.
The proposed Mada-CenterNet largely consists of an LR hourglass, an upsampling feature transformation, a global residual skip connection, input feature fusion, a between-hourglass skip connection based on deformable and multiscale attention for internal multiscale joint feature learning, and an HR hourglass.
4.1. Prediction of LR Maps
The input pest image is first embedded into the LR feature domain through convolution and residual blocks and then fed into the LR hourglass for deep feature extraction. Specifically, the pest color image $I$ passes through the convolution block $f_{CB}$ and the residual block $f_{RB}$, whose composition (denoted by the symbol $\circ$) produces the shallow features $F_s = (f_{RB} \circ f_{CB})(I)$; the LR hourglass $f_{HG}^{LR}$ then produces the deep features $F_d = f_{HG}^{LR}(F_s)$. Thus, $F_s$ and $F_d$ correspond to the shallow and deep features before and after passing through the LR hourglass, respectively.
To map the deep features, $F_d$, to the LR predictions, that is, the three types of maps, one convolution block and one convolution operation are additionally applied. $\hat{Y}^{LR}$, $\hat{O}^{LR}$, and $\hat{S}^{LR}$ correspond to the predicted LR heatmap, offset map, and bounding box map, respectively. To train the three types of LR maps, a new LR loss function is defined following the form of Equation (4). Here, $S^{LR}$ stores the width and height of the ground-truth bounding box at each keypoint, and $O^{LR}$ stores the offset information. $\hat{S}^{LR}$ and $\hat{O}^{LR}$ have the same number of channels, that is, two. $\hat{Y}^{LR}$ is a grayscale map because the pests captured by the trap include only one species. $\lambda_{size}$ and $\lambda_{off}$ are set to 0.1 and 1, respectively.
To generate the ground-truth LR heatmap, the HR bounding boxes are first scaled down according to the stride $R$, and the LR keypoints are then redefined. The newly rendered LR heatmap has white pixels at the redefined keypoints. Subsequently, Gaussian filtering is applied to blur the LR heatmap, according to Equation (3). Notably, the white pixel values remain unchanged after Gaussian filtering to maintain the peaks, enabling easy determination of the keypoints; this is the main difference between the heatmap and the density map. The offset map is created using the discretized centroids of the LR bounding boxes.
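A short sketch of how the LR keypoints and the corresponding offset targets can be derived from the HR centroids, assuming the paper's convention that rounding yields the integer keypoint and the residual becomes the offset target:

```python
import numpy as np

def lr_keypoints_and_offsets(centroids, stride):
    """Scale HR centroids down by the stride, round them to integer LR pixel
    locations (the keypoints), and keep the sub-pixel residual as the
    offset-map target (the discretization error)."""
    scaled = np.asarray(centroids, dtype=np.float32) / stride
    keypoints = np.round(scaled).astype(np.int64)   # discretized centroids
    offsets = scaled - keypoints                    # values in [-0.5, 0.5]
    return keypoints, offsets
```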
4.2. Upsampling Feature Transformation
The deep features output by the last convolution block behind the LR hourglass, together with the three types of predicted maps, are exploited to predict the HR maps more accurately. However, a scale mismatch exists between the LR and HR maps; therefore, feature scaling must be performed. This is done by an upsampling block, $f_{up}$, consisting of upsampling and convolution layers that enlarge the feature maps. In this study, bicubic interpolation is used to implement the upsampling layer.
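A minimal PyTorch sketch of such an upsampling block, assuming a single 3×3 convolution after bicubic interpolation (the kernel size and channel counts are assumptions):

```python
import torch.nn as nn
import torch.nn.functional as F

class UpsampleBlock(nn.Module):
    """Bicubic upsampling followed by a convolution, used to enlarge the LR
    feature maps to the HR scale before fusion."""
    def __init__(self, in_ch, out_ch, scale=2):
        super().__init__()
        self.scale = scale
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, x):
        x = F.interpolate(x, scale_factor=self.scale, mode="bicubic",
                          align_corners=False)
        return self.conv(x)
```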
4.3. Global Residual Skip Connection and Input Feature Fusion
A better approach would be for the HR hourglass to also utilize information from the input pest image. To this end, a global residual skip connection (GRSC) is considered. In this study, residual and convolution blocks are used to design the GRSC, as shown in Figure 5. Through the GRSC, the visual information of the input pest image can be transferred to the HR hourglass.
Here, $F_{GR}$ represents the output feature map of the GRSC, and $[\,\cdot\,]$ represents the concatenation for feature fusion. In Equation (17), the upsampled LR feature maps, including the LR heatmap, offset map, and bounding box map, are fused with the input pest image in the feature domain, making the input features richer and more discriminative.
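The fusion itself then reduces to a channel-wise concatenation, sketched below; the argument names are hypothetical:

```python
import torch

def fuse_inputs(grsc_feat, up_heatmap, up_offset, up_bbox):
    """Channel-wise concatenation of the GRSC output with the upsampled LR
    heatmap, offset, and bounding box maps; all tensors are assumed to share
    the HR spatial resolution."""
    return torch.cat([grsc_feat, up_heatmap, up_offset, up_bbox], dim=1)
```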
4.4. Between-Hourglass Skip Connection Based on Deformable and Multiscale Attention
The fused input features, $F_{in}$, contain the three types of predicted LR maps, and the goal of the HR hourglass is to predict the corresponding HR maps from $F_{in}$. Therefore, the fused input features help to improve the performance of the HR map prediction. In addition, the architecture of the LR hourglass is the same as that of the HR hourglass; only the sizes of the internal feature maps differ. Unlike the conventional CenterNet, in this study, the LR hourglass is connected to the HR hourglass via the between-hourglass skip connection, as shown by the red lines in Figure 5. In other words, the internal LR features produced inside the LR hourglass are fed into the HR hourglass to be jointly learned with the internal HR features. To fuse the internal LR and HR features at different scales, deformable and multiscale attention is designed.
Figure 6 illustrates the detailed architecture of the proposed between-hourglass skip connection, which is built on deformable convolution and multiscale attention for internal multiscale joint feature learning.
4.4.1. Internal LR Feature Deformation
The standard convolution extracts local features with many filters in a DCNN and has been shown to exhibit powerful performance for feature learning, particularly in computer vision problems. However, standard convolution has an inherent limitation in modeling geometric deformation because it can only extract local features on the regular grid centered at each position of the sliding filter. To address this, deformable convolution was devised. Unlike standard convolution, deformable convolution adds 2D offsets to the regular grid, thereby enhancing the capability of the DCNN to model geometric transformation.
The pest images captured in a light trap, which are targeted in this study, exhibit severe occlusion and wide pose variation; therefore, deformable convolution is considered to model geometric deformation. In the proposed Mada-CenterNet, deformable convolution is inserted into the between-hourglass skip connection to apply geometric deformation to the internal LR features, as shown in Figure 6.
Here, $F_i^{LR}$ indicates the output feature map at the $i$-th residual block (RB) in the LR hourglass, and $f_{dc}$ denotes the deformable convolution. In the LR hourglass, $f_{dc}$ is not applied to the first and last two RBs because of the computational complexity of the subsequent multiscale attention. Equation (18) indicates that the LR hourglass produces the deformed version of the internal LR feature map, $D_i^{LR} = f_{dc}(F_i^{LR})$. The deformed LR features, $D_i^{LR}$, are transferred to the HR hourglass for internal multiscale attention fusion.
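A sketch of this step using torchvision's deformable convolution; pairing it with a plain convolution that predicts the 2D sampling offsets is the standard usage pattern, and the channel count is an assumption of this sketch:

```python
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformableLRBranch(nn.Module):
    """Applies deformable convolution to an internal LR feature map, as in
    the between-hourglass skip connection."""
    def __init__(self, channels, k=3):
        super().__init__()
        # A plain conv predicts two sampling offsets (dx, dy) per kernel tap.
        self.offset_pred = nn.Conv2d(channels, 2 * k * k, k, padding=k // 2)
        self.deform = DeformConv2d(channels, channels, k, padding=k // 2)

    def forward(self, feat_lr):
        offsets = self.offset_pred(feat_lr)   # learned geometric deformation
        return self.deform(feat_lr, offsets)  # deformed internal LR features
```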
4.4.2. Internal LR Feature-Guided Multiscale Attention
The details of the multiscale attention for internal LR and HR feature fusion are shown in Figure 7. Unlike the original attention model in the Transformer [27], the three types of inputs, key (K), query (Q), and value (V), are visual feature maps, and K has a different scale than Q and V. In other words, K is small scale and takes the linearly transformed internal LR features, $D_i^{LR}$, as input; that is, K represents the deformed LR features. In contrast, Q and V are large scale: the internal HR feature maps are assigned to Q and V after applying a linear transformation. In this study, scaled dot-product attention [27] was chosen to implement the multiscale attention. In Figure 6, the HR hourglass follows the encoder–decoder framework; thus, the multiscale attention is implemented slightly differently for the encoder and decoder.
For the encoder of the HR hourglass, multiscale attention is designed as follows:
$$Q = W_Q F_i^{HR}, \quad V = W_V F_i^{HR}, \qquad (19)$$
$$K = W_K D_i^{LR}, \qquad (20)$$
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right) V, \qquad (21)$$
where $F_i^{HR}$ indicates the output feature map at the $i$-th RB in the HR hourglass. Similar to the LR hourglass, the first two and last three RBs are excluded from the multiscale fusion. In Equation (19), the input feature map, $F_{in}$, passes through the HR hourglass, followed by the linear transformations $W_Q$ and $W_V$, to produce the internal HR features, $F_i^{HR}$, that are then assigned to Q and V. Similarly, the deformed internal LR features, $D_i^{LR}$, are assigned to K after the linear transformation $W_K$, as shown in Equation (20). Scaled dot-product attention is used to implement the multiscale attention of the encoder. In Equation (21), $d_k$ denotes the dimension of K, and $\mathrm{softmax}$ represents the softmax function used to calculate weights between 0 and 1. Here, the deformed internal LR features, $D_i^{LR}$, are used to calculate the similarity matrix, $QK^T$, and to determine the LR features that are more important for pest counting. In other words, they serve as a guide for learning the weights of the internal HR features, $F_i^{HR}$.
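A hedged PyTorch sketch of the encoder-side multiscale attention follows: Q and V are taken from the internal HR features and K from the deformed internal LR features, as in Equations (19)–(21). Interpolating the LR tokens onto the HR grid so that the token counts match, and sharing the channel count between the two hourglasses, are assumptions of this sketch; the paper's exact alignment may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiscaleAttention(nn.Module):
    """Scaled dot-product attention in which K comes from deformed LR
    features and Q, V come from HR features."""
    def __init__(self, channels, dim):
        super().__init__()
        self.wq = nn.Linear(channels, dim)   # W_Q
        self.wk = nn.Linear(channels, dim)   # W_K
        self.wv = nn.Linear(channels, dim)   # W_V
        self.scale = dim ** -0.5             # 1 / sqrt(d_k)

    def forward(self, hr_feat, lr_feat):
        b, c, h, w = hr_feat.shape
        # Align LR tokens to the HR grid (an assumption of this sketch).
        lr_up = F.interpolate(lr_feat, size=(h, w), mode="bilinear",
                              align_corners=False)
        # (B, C, H, W) -> (B, H*W, C) token sequences.
        q = self.wq(hr_feat.flatten(2).transpose(1, 2))
        v = self.wv(hr_feat.flatten(2).transpose(1, 2))
        k = self.wk(lr_up.flatten(2).transpose(1, 2))   # LR-guided keys
        attn = torch.softmax((q @ k.transpose(1, 2)) * self.scale, dim=-1)
        out = attn @ v                # LR-guided weighting of HR features
        return out.transpose(1, 2).reshape(b, -1, h, w)
```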
For the decoder of the HR hourglass, the multiscale attention additionally requires the internal HR features transferred by the encoder. The multiscale attention for the decoder is modified as follows:
$$\mathrm{Attention}_{dec}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right) V + E_i^{HR}, \qquad (22)$$
where $E_i^{HR}$ represents the internal HR features transferred by the encoder via the within-hourglass skip connection, as shown in Figure 6, which are added to the scaled dot-product attention result.
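The decoder variant thus differs only by the additive skip term; a one-line sketch under the same assumptions (the helper names are hypothetical):

```python
def decoder_attention(attn, hr_dec_feat, deformed_lr_feat, enc_skip_feat):
    """Decoder-side multiscale attention: the encoder's HR features arriving
    over the within-hourglass skip connection are added to the result."""
    return attn(hr_dec_feat, deformed_lr_feat) + enc_skip_feat
```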
In the proposed multiscale attention, the internal LR features are jointly learned with the internal HR features via the between-hourglass skip connection; the internal LR features serve as a guide for enhancing the internal HR features. Compared with other vision transformers (VTs) [28,29], in this study, two types of internal LR and HR features, which are the outputs of two backbones, are learned jointly based on scaled dot-product attention. Notably, the internal LR features are deformed to be more robust to pest occlusion and wide pose variation, enabling the internal HR features to become more discriminative. Other VTs use only one backbone; thus, the two types of internal multiscale features are not considered. This is a key difference between the proposed multiscale attention and other VTs.
4.5. Prediction of HR Maps
The internal HR features are made more discriminative through the proposed multiscale attention, in which the deformed internal LR features are jointly learned with the internal HR features to focus on the more important areas in the feature domain. The output feature map of the HR hourglass, $F_{out}^{HR}$, passes through a convolution block and is then transformed into the final predicted HR maps, $\hat{Y}^{HR}$, $\hat{O}^{HR}$, and $\hat{S}^{HR}$, which correspond to the predicted HR heatmap, offset map, and bounding box map, respectively.
To train the HR maps, an HR loss function is defined. It is the same as the LR loss function in Equation (4), except that the predicted LR maps are replaced with the predicted HR maps, $\hat{Y}^{HR}$, $\hat{O}^{HR}$, and $\hat{S}^{HR}$. Once again, the white pixel values in the ground-truth HR heatmap remain unchanged after Gaussian filtering to maintain the peak values, enabling easy determination of the keypoints; this is the key difference between the heatmap and the density map. $\lambda_{size}$ and $\lambda_{off}$ are again set to 0.1 and 1, respectively.