Remote Insects Trap Monitoring System Using Deep Learning Framework and IoT

Insect detection and control at an early stage are essential to the built environment (human-made physical spaces such as homes, hotels, camps, hospitals, parks, pavements, and food industries) and to agricultural fields. Currently, such insect control measures are manual, tedious, unsafe, and time-consuming, labor-dependent tasks. With recent advancements in Artificial Intelligence (AI) and the Internet of Things (IoT), many maintenance tasks can be automated, which significantly improves productivity and safety. This work proposes a real-time remote insect trap monitoring system and insect detection method using IoT and a Deep Learning (DL) framework. The remote trap monitoring system is constructed using IoT and the Faster RCNN (Region-based Convolutional Neural Network) ResNet50 (Residual Neural Network 50) unified object detection framework. The Faster RCNN ResNet50 object detection framework was trained on built environment insect and farm field insect images and deployed in the IoT. The proposed system was tested in real time using a four-layer IoT with built environment insect images captured through sticky trap sheets. Farm field insects were further tested through a separate insect image database. The experimental results proved that the proposed system can automatically identify built environment insects and farm field insects with an average accuracy of 94%.


Introduction
Environmental maintenance is essential in all urban centers to control various pests and insects. Lizards and crawling and flying insects such as cockroaches, ants, and flies are common in any built environment (human-made physical spaces such as homes, hotels, camps, parks, pavements, hospitals, and food industries). These insects and lizards may create health hazards such as allergies, asthma, and food contamination illnesses. Apart from health risks, economic loss due to insects is very high in agriculture and the food industries. Insects threaten the food supply at all stages: crop cultivation, processing, storage, and distribution. Early insect identification is crucial for effective and affordable control in urban environments, agricultural fields, and food processing industries. It helps protect humans from health issues, prevents economic loss in the food industry, and reduces the excessive use of harmful pesticides in agriculture.
Generally, pest management companies use manual inspection methods to monitor the insect population and plan control tasks accordingly. This is a time-consuming task that requires enormous manual effort. Earlier automated approaches applied a sliding window algorithm to detect insects from an extracted feature map. Liu et al. [28] developed a paddy field pest classification method using a Deep Convolutional Neural Network (DCNN) and a saliency map. Here, the AlexNet CNN architecture was customized for pest classification and localization; the network was trained with 5000 insect images and scored a mean Average Precision (mAP) of 0.951.
Liu et al. [29] developed a DL-based framework, 'PestNet', for large-scale multi-class pest detection. PestNet was trained with 80,000 pest images and obtained 75.46% mAP for multi-class pest detection. Nguyen and Phan [30] developed an insect detection scheme on traps using a DCNN. The authors used the Visual Geometry Group 16 (VGG16) CNN framework for feature extraction and the Single Shot Detector (SSD) object detection algorithm for detecting insects on the trap. The CNN framework was trained with 3000 insect images and obtained 84% detection accuracy. Xia et al. [31] proposed an improved CNN framework for agricultural insect detection and classification in which VGG19 and Region Proposal Network (RPN) modules were combined and trained with 4800 insect images. The experimental results indicate that the model took 0.083 s of inference time and scored an mAP of 0.8922. Gutierrez et al. [32] evaluated the efficiency of computer vision, machine learning, and deep learning algorithms for pest detection in tomato farms. The evaluation indicates that the deep learning framework provided a comparatively better solution among the three options.
The studies mentioned above indicate that IoT is a suitable technique for remote trap monitoring, and that a DL framework is an optimal method to detect and classify insects in images. This work combines IoT and a DL framework to establish a remote insect trap monitoring and insect detection system. The system is designed to detect the built environment lizard, crawling and flying insects such as cockroaches, ants, and flies, and selected farm field insects (Planthoppers, Colorado, Empoasca, Mole-cricket, Manduca, Rice Hispa, Stink-bug, and Whiteflies). This paper is organized as follows: after the introduction, motivation, and literature review in Section 1, Section 2 presents an overview of the proposed system. Section 3 discusses the experimental setup, results, and discussion. Finally, Section 4 concludes this research work. Figure 1 shows the outline of the IoT and DL based remote trap monitoring and insect detection framework. A four-layer IoT is used for constructing the remote trap monitoring system [33][34][35]. It comprises a perception layer, a transport layer, a processing layer, and an application layer. The DL framework is trained and run on the processing layer to carry out the remote insect identification task. The details of the IoT architecture and the configuration of the DL framework and other IoT components are described as follows.

Perception Layer
The perception layer is the end node, or lowest layer, in the four-layer IoT framework. It is constructed with smart wireless cameras and sticky-type insect trap sheets. The camera is tiny and requires very little power; it can be fixed on the roof or on a medium or large food storage container to continuously monitor the insects. The perception layer sends insect trap images to the processing layer through the transport layer communication protocols to establish the remote insect monitoring and identification task.

Transport Layer
The transport layer acts as a communication bridge between all the IoT layers. It accomplishes the following tasks: sending the insect trap image to the processing layer and transferring the processed data to the application layer. The transport layer uses WiFi communication and the Transmission Control Protocol/Internet Protocol (TCP/IP) transport protocol for sharing information.

Processing Layer
The processing layer is the central unit of our remote insect trap monitoring and detection system. It comprises high-speed computing devices for handling images and video frames, executing the insect identification algorithm, and processing application layer requests. The insect detection and classification framework is described as follows.

Insect Detection Module
A deep learning-based object detection algorithm is trained to identify the insects on the trap. Most insects are small in size, so the insect detection algorithm needs small object detection capability: the insects cover only a small number of pixels in an image, i.e., the area of interest when detecting an insect is very small.
In any object detection algorithm, extracting features from small objects is particularly challenging because smaller objects are likely to be overlapped or too pixelated to yield any useful information. Moreover, information extracted from smaller objects can be lost as it is passed through the multiple layers of the feature extractor [36]. Furthermore, the information extracted about smaller objects is very limited, as these objects' appearance is very restricted in a training image. This limited appearance leads to an incomplete feature extraction process and can contribute to additional errors in the detection results.
Due to the challenges discussed above in processing and detecting smaller objects in an image, insect detection needs a feature extractor and detection algorithm that are accurate and well suited to detecting small objects such as insects.
In this work, Faster Region-based Convolutional Neural Network (RCNN) ResNet50 [37,38] is used for the insect detection and classification task. Faster RCNN is a two-stage object detector framework comprising an RPN and a Fast RCNN module. It takes two stages to detect objects in an image: the first stage extracts the feature map and generates region proposals, and the second stage detects objects from each proposal. Faster RCNN ResNet50 is a unified deep learning based object detection approach that performs both object detection and classification. Figure 2 shows the block diagram of the Faster RCNN Residual Neural Network 50 (ResNet50) framework. The unified DL framework consists of three modules: ResNet50, RPN, and Fast RCNN. Here, ResNet50 [39] is a pre-trained deep CNN that performs the feature extraction task and generates 1024-dimensional feature maps; it is well suited to small object feature extraction. The RPN is a small fully convolutional network that produces multiple region proposals and their objectness scores from the ResNet-generated feature map. The final component is the Fast RCNN module, which extracts further features from the RPN-proposed regions and outputs the bounding boxes and class labels. The details of each module are described as follows.
Feature extractor: A pre-trained ResNet50 module (trained on ImageNet) is used in Faster RCNN for feature extraction instead of VGG16. Unlike VGG16 [40] and VGG19, ResNet is a much deeper network, as the detection targeted in this paper is more complex than a regular image classification process. As the network becomes deeper, there is also a possibility of over-fitting, and therefore a regularization process is applied to eliminate over-fitting on the training data. Moreover, the ResNet50 model uses average pooling layers, which reduce the spatial size of the data, further reducing the computational time and complexity of the algorithm. The model size is reduced to 102 MB for ResNet50.
The feature extraction module consists of five stages. The first stage consists of a 7 × 7 convolution, batch normalization, rectified linear unit (ReLU), and max-pooling operation. The next four stages consist of combinations of residual convolution blocks and identity blocks. The residual conv block comprises a series of three convolution layers (1 × 1, 3 × 3, 1 × 1) with Batch Normalization (BN) and ReLU, and a skip connection with a 1 × 1 convolution operation. Similarly, the identity block comprises three convolution layers (1 × 1, 3 × 3, 1 × 1) and a skip connection. The fifth-stage conv and identity block output is a 1024-dimensional feature map, which serves as input to the RPN module and the Fast RCNN detection network.
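The skip-connection arithmetic that defines these identity blocks can be illustrated with a toy sketch. This is only a conceptual illustration in pure Python: `residual_fn` stands in for the 1 × 1/3 × 3/1 × 1 convolution stack with BN, and real blocks operate on image tensors rather than flat vectors.

```python
def relu(vec):
    # element-wise rectified linear unit
    return [max(0.0, x) for x in vec]

def identity_block(x, residual_fn):
    """Sketch of a ResNet identity block: output = ReLU(F(x) + x),
    where F (here `residual_fn`) stands for the conv/BN branch and
    the bare `x` term is the skip connection."""
    f_x = residual_fn(x)
    return relu([a + b for a, b in zip(f_x, x)])
```

The key point the sketch captures is that the block learns a residual F(x) that is added to its input, which is what lets very deep networks such as ResNet50 train without the signal degrading across layers.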
Regional Proposal Network (RPN): In this unified DL framework, the RPN is a crucial component that uses the anchor box technique to generate bounding boxes on the output image. Anchor boxes are a set of predefined bounding boxes with certain heights and widths. Anchor boxes detect multiple objects, objects of different scales, and overlapping objects. The sizes of the anchor boxes are chosen automatically based on the object sizes in the training dataset. To generate the anchor boxes and a 512-dimensional feature map, the RPN takes the final feature map (the ResNet-generated feature map) as input and applies a 3 × 3 sliding window spatially. For each sliding window position, nine anchor boxes are generated based on the sliding window's center point, covering three different aspect ratios and three different scales. If the feature extractor's final feature map has width W and height H, then the total number of anchors generated is W × H × k (here k = 9). On the output side, an anchor box's actual position at the image level is determined by decoding the feature map output back to the input image size (applying a stride of 16 when generating anchor boxes at the image level). This process produces a set of tiled anchor boxes across the entire image. For each anchor, the RPN predicts two things: first, the probability that the anchor contains an object (i.e., whether it is foreground or background, without considering which class the object belongs to); second, the bounding box regression offsets that adjust the anchor to fit the object better, i.e., closer to the ground truth. The RPN uses the 512-dimensional feature map for this operation and applies 1 × 1 convolution operations to generate a low-dimensional vector, which is fed into the classification layer and the regression layer to predict these two outputs. Finally, Non-Maximum Suppression (NMS) is applied to the generated proposals to filter out overlapping boxes.
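The anchor tiling described above can be sketched as follows. This is a minimal pure-Python illustration: the stride of 16 follows the description above, but the scale and ratio values are hypothetical, since the paper selects anchor sizes automatically from the training data.

```python
def generate_anchors(feat_w, feat_h, stride=16,
                     scales=(64, 128, 256), ratios=(0.5, 1.0, 2.0)):
    """Return k = len(scales) * len(ratios) anchors (x1, y1, x2, y2)
    centred on every feature-map position, decoded to image level
    by the feature-map stride."""
    anchors = []
    for fy in range(feat_h):
        for fx in range(feat_w):
            cx, cy = fx * stride, fy * stride  # centre at image scale
            for scale in scales:
                for ratio in ratios:
                    # keep area = scale^2 while varying aspect ratio
                    w = scale * (ratio ** 0.5)
                    h = scale / (ratio ** 0.5)
                    anchors.append((cx - w / 2, cy - h / 2,
                                    cx + w / 2, cy + h / 2))
    return anchors
```

For a W × H feature map this yields exactly W × H × 9 anchors, matching the count given in the text.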
Detection Network: The second and final stage of Faster RCNN is the detection network, where the Fast RCNN components are adopted. The detection network comprises a Region of Interest (RoI) pooling layer and fully connected layers. The proposals generated by the RPN and the shared convolutional features are fed into the RoI pooling layer, which extracts a fixed-size feature map for each RPN-generated proposal. These fixed-size feature maps are forwarded to fully connected layers with a softmax network and a linear regression function to classify the object and predict the bounding box for each identified object. For accurate object localization, NMS (Non-Maximum Suppression) is applied to remove redundant bounding boxes and retain the best one.
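Greedy NMS, as used at the end of both the RPN and the detection network, can be sketched as follows. This is a simplified pure-Python version; production implementations are vectorized, and the 0.7 overlap threshold here is illustrative.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_threshold=0.7):
    """Greedy NMS: keep the highest-scoring box, discard boxes that
    overlap it by more than the threshold, and repeat on the rest."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order
                 if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep
```

Applied to near-duplicate proposals around one insect, this retains only the single most confident box, which is exactly the redundancy removal described above.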

Application Layer
The application layer delivers the insect trap status information to the end user. Smartphones and web interfaces are used to execute the application layer tasks. In this work, a web-based GUI and an Android mobile application were developed to track the status information. Figure 3 shows the layout of the Android mobile application developed for insect monitoring.

Experimental Results
This section describes the experimental design procedure and the results of the remote insect trap monitoring and insect detection method. Figure 4 shows the experimental design flow of the proposed system.

Dataset Preparation and Annotations
The dataset preparation process involves collecting insect images from different online sources. The built environment lizard, crawling and flying insects (ants, cockroaches, and flies), and the farm field insects (Planthoppers, Colorado, Empoasca, Mole-cricket, Manduca, Rice Hispa, Stink-bug, and Whiteflies) were adopted for dataset preparation. For each insect class, 1000 images were used to train the model. The insect images were collected from the following online insect image databases: [41], IP102 [42], rice knowledge bank [43], and bugwood [44]. Data augmentation was then applied to control over-fitting and improve CNN learning. Augmentation operations such as image rotation, flipping, and scaling were applied to the collected images. An image resolution of 640 × 480 was used for both training and testing the CNN model. After data augmentation, dataset labeling was performed using the bounding box and class annotation tool "LabelImg". LabelImg is a Graphical User Interface (GUI) based bounding box annotation tool used to mark the insect categories and rectangular bounding boxes in the insect images. The GUI is written in Python and uses Qt for its graphical interface. Annotations are saved as XML files in PASCAL Visual Object Classes (VOC) format; the YOLO format is also supported.
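The rotation and flipping augmentations can be sketched on a toy nested-list image as follows. This is a minimal illustration only: the actual pipeline operates on 640 × 480 photographs with an image library, and the scaling augmentation is omitted here.

```python
def flip_horizontal(img):
    # mirror each row left-to-right
    return [list(reversed(row)) for row in img]

def rotate_90(img):
    # rotate clockwise: reversed rows become the new columns
    return [list(row) for row in zip(*img[::-1])]

def augment(img):
    """Return simple augmented variants of one image:
    the original, its horizontal flip, and three 90-degree rotations."""
    variants = [img, flip_horizontal(img)]
    rotated = img
    for _ in range(3):
        rotated = rotate_90(rotated)
        variants.append(rotated)
    return variants
```

Each original training image thus yields several geometric variants, which is how augmentation enlarges the effective dataset and reduces over-fitting.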

Hardware Details
The unified DL-based insect detection model was developed in TensorFlow 1.9, an open source machine learning platform, running on Ubuntu 18.04. The model was trained and tested on a GPU-enabled workstation consisting of an Intel Core i7-8700K, 64 GB of Random Access Memory (RAM), and an Nvidia GeForce GTX 1080 Ti gaming Graphics Processing Unit (GPU) card (3584 NVIDIA CUDA cores and 11 Gbps memory speed). The same hardware was used to run the insect detection and classification task.

Training
The pre-trained ResNet50 model, trained on the ImageNet dataset, was used for the feature extraction task. The Stochastic Gradient Descent (SGD) algorithm was used for training the Faster RCNN modules (RPN and Fast RCNN), with a momentum of 0.9, an initial learning rate of 0.0002, and a batch size of 1. In the RPN training phase, 128 training samples are randomly chosen from each training image per iteration, with a 1:1 ratio between positive (object) and negative (background) samples. Further, the original Faster RCNN parameter values were used to fine-tune the NMS.
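One SGD-with-momentum weight update using the reported hyper-parameters can be sketched as follows. This is a minimal per-parameter illustration of the update rule, not the actual TensorFlow training loop.

```python
def sgd_momentum_step(weights, velocity, grad, lr=0.0002, momentum=0.9):
    """One SGD-with-momentum update using the paper's reported
    hyper-parameters (learning rate 0.0002, momentum 0.9).
    velocity accumulates a decaying sum of past gradient steps."""
    new_v = [momentum * v - lr * g for v, g in zip(velocity, grad)]
    new_w = [w + v for w, v in zip(weights, new_v)]
    return new_w, new_v
```

With zero initial velocity the first step is a plain gradient step of size lr; on later steps the momentum term carries over a fraction of the previous update, smoothing the trajectory.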
In the training phase, the loss for each prediction is estimated as the sum of the location loss L_location and the confidence loss L_confidence (Equation (1)). Here, the confidence loss is the error in the prediction of the object class and confidence level, and the location loss is the squared distance between the coordinates of the prediction and the ground truth. A parameter α is adopted to balance the two losses and their impact on the total loss. A Root Mean Squared (RMSProp) gradient descent algorithm is used to optimize this loss.
It calculates the weights w_t at any time t using the gradient of the loss L, g_t, and the accumulated squared gradient v_t (Equations (2)-(4)). The hyper-parameters β and η balance the terms used for momentum and gradient estimation, while ε is a small value close to zero that prevents division-by-zero errors.
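Since Equations (2)-(4) are not reproduced here, the following is a standard RMSProp update consistent with that description; the η, β, and ε values shown are illustrative defaults, not the paper's exact settings.

```python
def rmsprop_step(weights, v, grad, eta=0.001, beta=0.9, eps=1e-8):
    """One RMSProp update: v_t is an exponential moving average of
    squared gradients, and each weight step is scaled by
    1 / (sqrt(v_t) + eps) so noisy dimensions take smaller steps."""
    new_v = [beta * vi + (1 - beta) * g * g for vi, g in zip(v, grad)]
    new_w = [w - eta * g / (nv ** 0.5 + eps)
             for w, g, nv in zip(weights, grad, new_v)]
    return new_w, new_v
```

The ε term in the denominator is exactly the "small value close to zero" mentioned above: without it, a zero accumulated gradient would cause a division-by-zero error.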
The efficacy of the various sets of training data is determined by a K-fold (here K = 10) cross-validation process. The dataset is divided into K subsets; K − 1 subsets are used for training, and the remaining subset is used for evaluating performance. This process is run K times to obtain the mean accuracy and other quality metrics of the detection model. K-fold cross-validation verifies that the reported results are accurate and not biased toward a specific dataset split. The images shown are obtained from the model with good precision.
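The K-fold split can be sketched as follows, as a minimal index-based illustration of the procedure described above (real pipelines typically shuffle the indices first).

```python
def kfold_indices(n, k=10):
    """Split indices 0..n-1 into k folds and yield (train, test)
    index lists, holding out one fold per round."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for held_out in range(k):
        test = folds[held_out]
        train = [i for f in range(k) if f != held_out for i in folds[f]]
        yield train, test
```

Each of the k rounds trains on k − 1 folds and evaluates on the held-out fold; averaging the per-round metrics gives the reported mean accuracy.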

Evaluation Metrics
Standard statistical measures such as accuracy (Equation (5)), precision (Equation (6)), recall (Equation (7)), and F-measure (Equation (8)) were used to assess the detection and classification performance. The bounding box performance metrics are calculated by comparing the predicted bounding box with the actual bounding box. An Intersection over Union (IoU) operation is used to determine how close the predicted bounding box is to the actual bounding box; the accuracy of a bounding box is directly proportional to the IoU score calculated between the actual and predicted boxes. A confusion matrix is therefore generated from the IoU scores of the bounding boxes. If the IoU is above a specific threshold, the predicted bounding box is deemed to match the true bounding box and is counted as a true positive; if the IoU score is below the threshold, the predicted bounding box is deemed incorrect and counted as a false positive or false negative. In addition, the overall mean IoU value over all predicted bounding boxes was calculated to determine the model's performance. From the IoU output, we can calculate the Average Precision (AP) and recall metrics, which are common performance indicators for evaluating an object detector. Further, the performance metrics for object classification are calculated by constructing a confusion matrix from the actual and predicted image labels.
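The IoU computation between a predicted and an actual box can be sketched as follows, for axis-aligned boxes in (x1, y1, x2, y2) form.

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes (x1, y1, x2, y2):
    intersection area divided by the area of the union."""
    ix = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    iy = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = ix * iy
    union = ((box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
             + (box_b[2] - box_b[0]) * (box_b[3] - box_b[1]) - inter)
    return inter / union if union > 0 else 0.0
```

The score ranges from 0.0 (no overlap) to 1.0 (identical boxes); thresholding it, as described above, decides whether a prediction counts as a true positive.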
Here, tp, fp, tn, and fn represent the true positives, false positives, true negatives, and false negatives, respectively, as per the standard confusion matrix.
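The four metrics of Equations (5)-(8) follow directly from these counts. A minimal sketch, assuming the standard definitions since the equations themselves are not reproduced here:

```python
def detection_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall, and F-measure from confusion-matrix
    counts, using the standard definitions."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return accuracy, precision, recall, f_measure
```

The zero-denominator guards simply return 0.0 when a metric is undefined (e.g. no predicted positives), a common convention in evaluation code.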

Offline and Real Time Test
There were 150 test images used for each class. For the offline test, the images were collected from image databases and were not used during training. The detection results for the built environment lizard and insects are shown in Figure 5. For the real-time remote insect trap monitoring trial, insect trap sheets were fixed in three different locations: the kitchen, the food storage region, and the drainage water outlet. The traps were monitored with a Trek AI ball WiFi-enabled camera; the detailed specification of the AI ball camera is given in Table 1. The camera was fitted facing downward on the wall (vertically, at 90 degrees) to focus on the insect trap and connected to the internet through a 2.4 GHz home WiFi network for transferring the trap sheet images to the processing layer. Further, to carry out the insect detection task, the trained model was configured on the GPU-enabled workstation that runs the processing layer tasks. The experiment was performed for 14 days continuously to track the insects detected on the trap sheets. Cockroaches, lizards, and flies were trapped on our trap sheets, and the results are shown in Figure 6. The detection results confirm that the insect detection model running on the IoT processing layer accurately detected and classified the insects with a high confidence level in both the offline test and in real time. Further, statistical measures were computed to estimate the robustness of the detection model. Table 2 shows the statistical measures for built environment lizard and insect detection in both the offline and online experiments. The results indicate that the trained object detection framework detected the cockroach with an average accuracy of 96.83%, the lizard with 97.68%, and house flies with 95.16%. It is observed that the confidence level and statistical measure values for house fly detection are slightly lower than for the other classes.
This outcome is understandable, as house flies are less visible and have a complex texture. Further, the model's computation time was assessed through its inference time; in this analysis, the trained model took 0.02 s to process one 640 × 480 image.

Farm Field Insect Detection
For the farm field insect detection test, 150 test images were used for each class, collected from the farm field insect image database. Figure 7 shows the farm field insect detection results. The results show that the algorithm accurately localizes the insects with a high confidence level. Further, accuracy, precision, recall, and F1 measure were computed to evaluate the model's statistical performance; the performance for each class is shown in Table 3. In this analysis, we observe that the trained CNN model detected and classified the farm field insects with an average accuracy of 94%.

Comparison with Other Object Detection Frameworks
This section evaluates the performance of the proposed system against the popular single-shot object detection algorithms SSD and YOLO v2. Here, the MobileNet and Inception v2 [45] classifiers were used with SSD for the feature extraction task [46][47][48]. Similarly, the Darknet-19 feature extractor is used in the YOLO v2 module [49][50][51]. The three detection frameworks were trained with the same lizard and insect image dataset and a similar amount of training time. The comparison results (Table 4) indicate that the proposed system has better accuracy in localizing the insects compared to the SSD and YOLO v2 detection models. Furthermore, the classification results indicate that YOLO produced false classifications, and its accuracy is relatively low compared with the other models. Similarly, the missed detection and false classification ratios were slightly higher in the SSD MobileNet and SSD Inception trials.
Furthermore, the computational cost of both training and testing was estimated for each model. The training cost was estimated as the training time each model needed for the training error to reach its minimum, and the testing cost was estimated as the execution time per image. Table 5 shows the computational cost of both training and testing. In this analysis, we observe that the minimum execution time was achieved by YOLO v2 compared with Faster RCNN and SSD. However, Faster RCNN ResNet50 obtained the best detection accuracy. In our experiment, accurate insect detection is the crucial objective; hence, the Faster RCNN ResNet50 framework is more suitable for the insect identification field.

Comparison with Existing Work
This section presents a comparison with existing insect identification schemes reported in the literature. Table 6 shows the comparison of various insect detection schemes, reported in terms of the algorithm used to identify the insects and the detection accuracy.

| Case Study | Application | Algorithm | Detection Accuracy |
| Xia et al. [31] | Farm field | VGG19 + RPN | 0.89 |
| Nguyen et al. [30] | Pest detection on traps | SSD + VGG16 | 0.86 |
| Liu et al. [29] | Agriculture pest identification | PestNet | 0.75 |
| Rustia et al. [12] | Built environment insects | YOLO v3 | 0.92 |
| Ding et al. [27] | Farm field moth | ConvNet | 0.93 |
| Proposed system | Built environment and farm field | Faster RCNN ResNet 50 | 0.94 |
The Table 6 results indicate that the proposed scheme obtained better insect identification accuracy than the existing farm field and built environment insect identification methods. However, directly comparing the tabulated results with our proposed system is not entirely fair, because the authors used different CNN frameworks, different datasets, and different training parameters; hence, the performance cannot be compared exactly.
To the best of our knowledge, no IoT-combined CNN-based insect identification scheme has been reported in the literature. Our proposed system combines IoT with a CNN-based insect identification scheme; through it, insect management systems can easily monitor and identify insects remotely without human assistance.

Application and Future Work
The proposed system is useful to pest control industries for monitoring pests in various environments such as food storage regions, hospitals, and gardens. Further, early insect detection in farm fields can reduce yield loss by up to 20-40% and also help reduce the excessive use of harmful pesticides in agriculture. The current study focuses only on the built environment lizard, crawling and flying insects, and selected farm field insects. In future work, we plan to develop DL-enabled drones and inspection-class robots to detect various types of rodents, the developmental phases of built environment and farm field insects, and crop diseases.

Conclusions
A remote insect monitoring and automatic insect identification method was proposed in this paper using IoT and a deep learning framework. A four-layer IoT framework was used to construct the remote insect trap monitoring system, and the Faster RCNN ResNet object detection framework was used to identify the insect class automatically. The efficiency of the deep learning-based insect identification was tested offline and online with various insect image databases, including a built environment insect database and a farm field insect database. In contrast with other object detection frameworks, including SSD and YOLO, the proposed scheme scored better insect detection accuracy. The experimental results showed that the trained model obtained 96% insect identification accuracy for built environment insects and 94% identification accuracy for farm field insects, and took an average of 0.2 s to process one image. This case study proved that the IoT and DL based insect monitoring scheme automates the remote insect identification method through a trained CNN framework and overcomes the shortcomings of manual insect management systems.