Internet of Things Meets Computer Vision to Make an Intelligent Pest Monitoring Network

With the increase of smart farming in the agricultural sector, farmers have better control over the entire production cycle, notably in terms of pest monitoring. Pest monitoring has gained significant importance, since the excessive use of pesticides can cause great damage to crops, substantial environmental impact, and unnecessary costs in both material and manpower. Despite the potential of new technologies, pest monitoring is still carried out in a traditional way, leading to excessive costs, lack of precision, and intensive use of human labour. In this paper, we present an Internet of Things (IoT) network combined with intelligent Computer Vision (CV) techniques to improve pest monitoring. First, we propose the use of low-cost cameras at the edge that capture images of pest traps and send them to the cloud. Second, we use deep neural models, notably Faster R-CNN and YOLO models, to detect the Whitefly (WF) pest in yellow sticky traps. Finally, the predicted number of WF is analysed over time and the results are made accessible to farmers through a mobile app that allows them to visualise the pest counts in each specific field. The contribution is to make pest monitoring autonomous, cheaper, data-driven, and precise. Results demonstrate that, by combining IoT, CV technology, and deep models, it is possible to enhance pest monitoring.


Introduction
The Internet of Things (IoT) describes a network composed of several interconnected devices that obtain, transmit, and store data. Smart farming has grown exponentially with the use of IoT to produce better and cheaper food for a growing world population. In smart farming, data can be used, among other things, to make more efficient use of space, minimise waste, and monitor pests [1]. When data is in the form of images, the IoT can be used together with Computer Vision (CV), a field of research that enables machines to process, analyse, and obtain insights from images or videos. In this case, the IoT network obtains, transmits, and stores the images, and the CV models process them to obtain information or insights [2][3][4].
Smart farming uses IoT for several purposes, e.g., in irrigation systems to minimise water consumption or in animal feeding to optimise feeding [1]. Computer Vision (CV) is also used with different purposes, e.g., in crop monitoring to obtain insights about the best time to harvest or in animal counting [5,6].
Despite the potential of combining IoT and CV, these technologies are still not commonly used in traditional pest monitoring, which is a visual task and one of the biggest challenges for farmers [7]. In traditional pest monitoring, technicians have to go to the field to identify and count the pests in traps. If the number of pests exceeds a threshold, they need to apply pesticides or biological control. This may not be very precise, as technicians use extrapolation to estimate the number of pests, i.e., the pests in a small area are counted and used as a sample. It is also expensive, because it requires many hours of specialised work to count the pests, and it can lead to the overuse of pesticides if the visit to the field occurs too late, allowing the pests to propagate [8].
In this article, we present a system developed by combining IoT and CV to create an intelligent pest monitoring network. This network contributes to making pest monitoring:
• Autonomous: it requires almost no human intervention;
• Cheaper: it reduces the need for technicians to identify and count the pests;
• Data-driven: it makes the daily pest evolution known from images, and not only when technicians perform pest detection and counts;
• Precise: it replaces sampling with a total count of the pests.
These contributions allow farmers to optimise production quality and increase revenue. With that in mind, the proposed system can be implemented for all types of crops and farms (indoor and outdoor) to allow farmers to make informed decisions in an easy way.
For the implementation, we designed trapezoidal plastic structures to protect the traps and cameras from environmental conditions and to control the quality of the obtained images. At the base of these structures, we placed yellow sticky traps and, at the top, the cameras used for edge computing. The structures were placed in different crops and the images obtained from the traps were sent to the cloud. The cloud hosts a CV model trained to detect and count pests. The information resulting from the model is then processed and sent to an application, which allows farmers to easily make informed decisions, such as when to use pesticides, just by looking at a dashboard with the evolution of pests in each trap. We used low-cost/low-power cameras and solar panels for power, which allows for long-term monitoring in environments without an external power supply. The complex calculations are performed in the cloud, where all backend tasks/services are centralised.
The rest of the article is structured as follows. Section 2 presents material and methods, describing the types of computation, communication protocols, network types, and CV models. Section 3 describes the proposed approach, including the dataset and CV model used, and the proposed IoT network. In Section 4, we present the conclusions.

Material and Methods
In pest monitoring, most approaches focus only on CV techniques [9,10], not considering the process of obtaining images automatically and in large numbers, which hampers the application of CV models in real contexts due to the lack of information/images. Works that combine the two technologies, CV and IoT, in pest monitoring are therefore the most promising [11,12], since the images are obtained in a more realistic context rather than under extremely controlled conditions, which could raise issues with the generalisation ability of the CV models.
In the next subsections, we first introduce the most relevant aspects of the IoT, such as computation types, communication protocols, and network topologies; we then introduce the most relevant CV models, namely the Faster Region-based Convolutional Neural Network (Faster R-CNN) and You Only Look Once (YOLO).

Internet of Things
The IoT is a network composed of interconnected nodes/devices that send, receive, and store data. Those nodes can use edge, fog, and cloud computing (see Figure 1). The communication protocols define the communication rules between nodes, and the network topology defines the nodes' distribution and communication links [13].
Depending on the purpose, environmental, hardware or communication constraints, IoT networks can use different types of computation, communication protocols, and network topology. Computation types mainly influence computational power, storage capacity, and latency. Communication protocols mainly influence the security and payload size of communications. The network topology mainly influences the reliability and scalability of the network [13].

Computation Types
As depicted in Figure 1, edge, fog, and cloud are interconnected computation types that should cooperate to fulfil the final service [14].
Edge computing is the computing performed on devices used to collect the data, i.e., devices that are closer to data sources. The devices used at the edge usually have low computational power and low storage capacity, which makes it difficult to store or process large amounts of data at the edge. Furthermore, the edge has low latency, mobility of its devices (like the fog), and low bandwidth available [13,15]. Examples of edge devices include low-cost cameras or humidity sensors.
Fog computing is the computing used to bring data storage and processing closer to the edge, enhancing the edge ability. The fog can have several devices with different computational and storage resources, which allows it to store or process data locally and almost in real-time. Normally, the fog is a bridge between edge and cloud computing that allows the reduction of the resources used by the cloud, while enhancing edge capabilities [13]. Examples of fog devices include Raspberry Pi and Arduino.
Cloud computing allows us to perform more demanding calculations that are difficult at the edge or fog. This type of computation is efficient, scalable, and allows resources to be tailored as needed [13]. Its disadvantages include high latency, high bandwidth usage, and high energy consumption. This type of computing performs the data processing and analysis in the same place, usually in a centralised way. Clouds can be public, community, hybrid, or private [13,15].
As the IoT is a network composed of several interconnected devices, it is necessary to choose the most appropriate communication protocols to communicate between nodes in each scenario.

Communication Protocols
Communication protocols define the rules for communication between nodes, allowing data to be sent between devices and users in a safe and reliable way. We can divide communication protocols into two types: (1) request-response and (2) publish-subscribe.
The most used request-response protocol is the Hypertext Transfer Protocol (HTTP). This is one of the most stable and reliable protocols and allows the transmission of large amounts of data. Its disadvantages are high memory and energy consumption, making it challenging to use at the edge.
The most used publish-subscribe protocol is Message Queuing Telemetry Transport (MQTT), a lightweight protocol that allows scalability and simplifies communication between devices (despite not having a built-in encryption mechanism). It is suitable for sending small amounts of data in the low-bandwidth, high-latency scenarios typical at the edge. It is also optimised to work on devices with memory constraints and has further advantages, namely in terms of privacy: publishers and subscribers do not need to know of each other's existence, one subscriber can receive data from many devices, and publisher and subscriber do not need to be connected at the same time [16,17].
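The topic-based decoupling described above can be illustrated with a short sketch. The `topic_matches` function below is a hypothetical pure-Python illustration of MQTT's topic-filter semantics (`+` matches one level, `#` matches all remaining levels); it is not part of the system described in this paper:

```python
def topic_matches(filter_: str, topic: str) -> bool:
    """Check whether an MQTT topic filter matches a concrete topic.

    '+' matches exactly one topic level; '#' matches any number of
    trailing levels (it must be the last element of the filter).
    """
    f_parts = filter_.split("/")
    t_parts = topic.split("/")
    for i, f in enumerate(f_parts):
        if f == "#":
            return True  # multi-level wildcard swallows the rest
        if i >= len(t_parts):
            return False  # filter is longer than the topic
        if f != "+" and f != t_parts[i]:
            return False  # literal level mismatch
    return len(f_parts) == len(t_parts)

# A subscriber to 'farm/+/trap1' (hypothetical topic names) receives
# data from trap1 in any field, without knowing which publishers exist.
```

This decoupling is why a single subscriber (e.g., the cloud backend) can receive images from many traps without maintaining a connection to each one.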
After choosing the types of computation and communication protocols, it is necessary to define how to arrange the nodes of the IoT network and how to connect these nodes. This arrangement of nodes and how they communicate with each other defines the network topology.

Network Topology
The topology of an IoT network represents the distribution of its nodes in space and how they interconnect. The most used topologies are point-to-point, star, and mesh, as depicted in Figure 2 [18]. Point-to-point (P2P) is the simplest topology and has only two nodes. For this topology to work, none of its elements (the two nodes and the link between them) can fail, which is a tough constraint.
The star topology is one of the most common, since it is scalable and easy to deploy. In this topology, all nodes are connected to a central device that acts as a gateway. The nodes cannot communicate with each other, but only with the gateway. It is possible to add and remove nodes from the network without breaking it. It has the disadvantage of having a single point of failure (the gateway).
The mesh topology is not as common as might be assumed and is more expensive to deploy. All nodes are connected to each other, which causes redundancy in the connections. It also allows the network to work in case of a connection failure (because of the redundancy). Nevertheless, it needs more computational power and more memory than the star topology to move data from one node to the other [19,20].
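The cost difference between the two topologies follows directly from their link counts: a star with n nodes needs only n − 1 links to the gateway, while a full mesh needs n(n − 1)/2 pairwise links. A minimal arithmetic sketch (assuming a full mesh):

```python
def star_links(n: int) -> int:
    # Star: every node connects only to the central gateway.
    return n - 1

def mesh_links(n: int) -> int:
    # Full mesh: every pair of nodes shares a direct link.
    return n * (n - 1) // 2
```

For ten traps, a star needs 9 links while a full mesh needs 45, which illustrates why mesh deployments demand more hardware, power, and memory.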

Computer Vision
Computer Vision (CV) is an interdisciplinary research area that processes digital images or videos to extract information [21]. Currently, CV is strongly based on artificial intelligence, namely, on machine learning and deep learning models that replace human visual tasks, such as animal counting and crop or pest monitoring. In the specific context of this work, we can divide CV into image classification and object detection. Typically, image classification is used to classify an image, but it can be adapted to classify multiple objects in the same image. For that, we have to segment the objects in the image, extract their most relevant features, and then apply the classification model to each object. Object detection models incorporate all those steps into a single model and allow for the classification of multiple objects in the same image.
In pest monitoring, the goal is usually to detect and count the pests in each trap, and not to holistically classify the trap; hence, object detection models are usually more suitable. There are two main types of object detection models: (1) two-phase models, such as the Faster Region-based Convolutional Neural Network (Faster R-CNN), and (2) one-phase models, such as You Only Look Once (YOLO) [22].

Faster R-CNN
Faster R-CNN is a two-phase model: first, it looks for Regions of Interest (RoI) where an object may exist; then, for each RoI, it predicts the object's class and coordinates. The model has three main components. The backbone, where a Feature Pyramid Network (FPN) obtains the feature maps of the images. The Region Proposal Network (RPN), which predicts the RoIs, that is, the regions that contain objects and their coordinates. Lastly, the RoI classifier and the bounding box regressor: the former classifies each object inside the bounding boxes (regions that contain an object), while the latter refines the predicted dimensions of those boxes [23] (see Figure 3).
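Both the RPN and the evaluation of predicted bounding boxes rely on measuring the overlap between two boxes, usually via Intersection over Union (IoU). A minimal framework-independent sketch, with boxes given as (x1, y1, x2, y2) corners:

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Corners of the intersection rectangle
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    if inter == 0:
        return 0.0
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    return inter / (area_a + area_b - inter)
```

During training, region proposals with a high IoU against a ground-truth box are treated as positive examples; at evaluation time, a detection typically counts as a true positive only if its IoU with an annotated box exceeds a threshold.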

YOLOv5
YOLOv5 is one of the most recent YOLO models and, like the other YOLO models, is a one-phase model: it only takes one look at the images, i.e., it does not first look for RoIs and then process those areas, but instead makes the classification and prediction all at once. It has three main steps. First, a backbone, where a Cross Stage Partial Network (CSPN) extracts the most relevant features. Then, a neck generates feature pyramids using a Path Aggregation Network (PANet) to obtain new and better feature maps, i.e., feature maps that combine the same information at different scales. The final step is the head (see Figure 4), which is used to obtain the bounding boxes from the feature maps, classify them, and predict their coordinates [25].
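One-phase models such as YOLOv5 typically emit many overlapping candidate boxes per object, which are filtered with non-maximum suppression (NMS) before the remaining boxes are counted. A simplified greedy sketch (the threshold value is illustrative, not the model's actual setting):

```python
def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression.

    boxes: list of (x1, y1, x2, y2); scores: matching confidences.
    Returns the indices of the boxes kept, highest score first.
    """
    def iou(a, b):
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
        if inter == 0:
            return 0.0
        area_a = (a[2] - a[0]) * (a[3] - a[1])
        area_b = (b[2] - b[0]) * (b[3] - b[1])
        return inter / (area_a + area_b - inter)

    # Visit boxes from highest to lowest confidence
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Drop remaining boxes that overlap the kept one too much
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep
```

In a counting task such as WF detection, the length of the kept list is the predicted pest count, so a suitable IoU threshold matters: too low merges nearby insects, too high double-counts a single one.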

Proposed Approach
The main goal of this work is to propose and develop an IoT network that uses CV for pest monitoring. We start by constructing and validating the use of CV models for pest detection. Then, we define the type of devices and structures to capture images at the edge of the network. Next, we use the cloud to centralise the backend of the IoT network. This is where we store and process data and connect the edge and users of the network. Then, we develop and test a mobile application for users to easily monitor the pest's evolution. Finally, we connect all network nodes (edge, cloud, and application) and place the intelligent traps in different crops to test the whole proposed approach.

Computer Vision Model
We have developed CV models to detect the Whitefly (WF) pest. This type of pest is one of the most common and thus one of the most relevant to fight, especially in tomato crops. The life cycle of the WF starts as an egg, passes through a nymph stage, and ends as an adult that propagates through the entire crop. There are several species of whiteflies that are visually similar, such as Trialeurodes vaporariorum and Bemisia tabaci. The next subsections describe the dataset used to train several CV models (object detection models), the obtained results, and the CV model chosen for the IoT network.

Dataset
To train the computer vision models, we used a public dataset [27] that contains 284 images (5184 × 3456) of yellow sticky traps with WF and other insects (Figure 5). The dataset contains 4940 annotated WF instances and, as described by the dataset authors [27], some unannotated or improperly annotated instances. This affects the measured precision of CV models, as the detection of an unannotated instance is counted as a False Positive. To improve the dataset, we contacted technicians and, with their help, annotated the missing WF instances, increasing the total from 4940 to 5863 WF instances. Finally, we divided the whole dataset into a training (80%) and a testing (20%) dataset.
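The 80%/20% split can be reproduced with a few lines. The filenames and seed below are hypothetical placeholders, not the dataset's real file names; a fixed seed is assumed so the split is reproducible:

```python
import random

def split_dataset(items, train_fraction=0.8, seed=42):
    """Shuffle reproducibly and split into train/test lists."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# 284 images, as in the dataset; names are illustrative
images = [f"trap_{i:03d}.jpg" for i in range(284)]
train, test = split_dataset(images)
```

Splitting at the image level (rather than the annotation level) keeps all WF instances of one trap photo on the same side of the split, avoiding leakage between training and testing.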

Model
We used the training dataset to train the Faster R-CNN, Scaled-YOLOv4, YOLOv5 (Small), and YOLOv5 (XLarge) models. The Faster R-CNN was published in 2017 [28], the YOLOv5 models were made public in June 2020 [29], and the Scaled-YOLOv4 (Large) in November 2020 [30]. YOLO models allow testing of different depths of the same architecture, from Tiny to XLarge (changing only the number of layers and neurons). This allows testing the state of the art of object detection and evaluating the trade-off between performance and network depth. The Faster R-CNN allows testing a two-phase model and comparing its performance with one-phase models.
To evaluate those object detection models, we considered three metrics. Mean Average Precision (mAP) is one of the most used metrics to assess the precision of object detection models [31]. We also considered memory consumption and the average detection time per image (using an NVIDIA Tesla T4 GPU).
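For a single class, Average Precision summarises the precision-recall curve built from the ranked detections; mAP is the mean of the per-class APs. A minimal sketch of the non-interpolated form (note that YOLO and COCO tooling use interpolated variants, so the exact numbers differ):

```python
def average_precision(detections, n_ground_truth):
    """AP from a ranked list of detections.

    detections: list of booleans (True = true positive), already
    sorted by descending confidence.  Sums precision at each recall
    step of 1/n_ground_truth (non-interpolated AP).
    """
    tp = 0
    ap = 0.0
    for rank, is_tp in enumerate(detections, start=1):
        if is_tp:
            tp += 1
            precision = tp / rank
            ap += precision / n_ground_truth
    return ap
```

This makes the effect of missing annotations concrete: a correct detection of an unannotated WF instance enters the list as `False`, lowering precision at every later rank and hence the AP.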
As described in Table 1, YOLOv5 (XLarge) presented the best performance (best mAP) in the test dataset and therefore was the model deployed in the IoT network. The memory consumption was not an issue because we used the cloud, and average time for each detection was not relevant because we did not have the goal of performing real-time detection.

IoT Network
As depicted in Figure 6, we developed a star topology network with cameras (at the edge) placed in crops, the backend centralised in a private cloud, a pest monitoring application for users to obtain insights, and all communications using the HTTP protocol. The chosen architecture (star topology) allowed the development of a very efficient and inexpensive network. Furthermore, this type of topology allows us to easily scale the network to other crops just by adding new edge devices. The cameras used at the edge (ESP32-CAM), although low cost, capture images with enough quality not to influence the performance of the CV models. If we had used a fog layer, the cost would have increased significantly: each crop would need a fog device and solar panels to power it. The goal of this system is not to obtain images in real-time, so the lower latency of the fog (over the cloud) was not a decision factor. Another advantage of the fog that was not important to us is the mobility of its devices, as we wanted to leave the devices in the same location for a long time (as autonomous as possible). However, introducing a fog layer could be a good solution for crops without internet access or if the backend is on a public cloud.
The chosen communication protocol (HTTP) allowed us to send the images to the cloud without the payload size restriction that MQTT sometimes has when dealing with images. HTTP power consumption was not an issue because each edge node sent only one daily image to the cloud to perform daily monitoring. HTTP memory consumption was not a constraint, as we did not send high-resolution images. The use of MQTT would result in lower memory and energy consumption, but as this protocol was developed to send small amounts of data, the size of the images would have to be reduced to make transfers viable. Furthermore, as MQTT was designed to be lightweight, it does not use a security mechanism for data transmission; we would have to implement, for instance, Transport Layer Security (TLS) to guarantee the encryption of the transmitted images.
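The daily HTTP upload can be sketched as a plain POST carrying the JPEG bytes. The endpoint URL and the `X-Node-Id` header below are hypothetical, used only to illustrate how a backend could identify the sending trap; the request is built with the Python standard library and not actually sent:

```python
import urllib.request

def build_upload_request(image_bytes: bytes, node_id: str, url: str):
    """Build (but do not send) the daily HTTP POST of a trap image.

    The URL and header names are illustrative assumptions, not the
    paper's actual API.
    """
    return urllib.request.Request(
        url,
        data=image_bytes,           # raw JPEG payload, no size cap
        method="POST",
        headers={
            "Content-Type": "image/jpeg",
            "X-Node-Id": node_id,   # lets the backend tag the trap
        },
    )

req = build_upload_request(b"\xff\xd8...jpeg bytes...", "field1-trap1",
                           "https://example.com/api/images")
# urllib.request.urlopen(req) would transmit it; omitted here.
```

An equivalent MQTT publish would need the image chunked or downscaled to respect broker payload limits, which is the trade-off discussed above.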
Using the application for automatic monitoring, the user only needs to visualise the graph with the daily evolution of the pest in order to choose which intervention to take. Nevertheless, this limits the monitoring to the available nodes at the edge. With that in mind, we introduced a feature that lets the user take a picture of a trap with their phone's camera, so they can point the camera at any yellow sticky trap and are not limited to the ones at the edge. However, this changes how images are obtained and introduces the possibility of human error, which can influence the performance of the CV models.
The next subsections describe the edge and cloud of the IoT network and show part of the application.

Edge
The number of edge nodes to use on each farm mainly depends on the size of the farm. With more nodes, it is possible to know more quickly and reliably how the number of pests is evolving. However, as the pests are attracted by the traps inside each structure, just one node is enough to see the daily evolution of the pests.
For each farm, we used one ESP32 microcontroller with a camera attached to obtain the trap images. This is a low-cost, low-power microcontroller suitable for when there is no external power source. It has only 520 kilobytes (KB) of memory, with 320 KB of DRAM (for storage) and 192 KB of IRAM (for instruction execution); the rest is RTC memory (which persists when the device is on standby) [32].
To test if the ESP32-CAM had the capacity to support pest detection, we counted the WF in the testing dataset (1371 WF) and then resized those pictures to the different dimensions and sizes allowed by the ESP32-CAM. Next, we used the YOLOv5 (XLarge) model to make detections on those pictures, as shown in Figure 7. The previous table shows that it is possible to use the ESP32-CAM with 1280 × 1024 images for pest detection without losing significant performance in YOLOv5 (XLarge) (between 10.3% and 16.3% fewer detections).
To power the ESP32-CAM, we used two 18650 batteries together with solar panels to extend battery life (see Figure 8). The daily power consumption of each edge node is 24 mAh and the batteries used have a capacity of 3350 mAh, providing more than four months of operation per battery. The solar panels, which supply 150 mA, combined with an average of 5 h of direct sun exposure per day over the year (in Coimbra, where these edge nodes are placed), allow the edge nodes to run without the need to change batteries. In crops, the edge is subject to adverse conditions that can affect the durability of the devices and the quality of the images. With that in mind, we designed a structure (see Figure 9 in this paper and Figure A1 in Appendix A) to protect the ESP32-CAM from atmospheric conditions and to better control the image quality (such as brightness). At the base of this structure, we placed the yellow sticky trap, with the ESP32-CAM at the top.
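The stated battery life follows from back-of-envelope arithmetic, reading the node's daily consumption as 24 mAh against one 3350 mAh cell (self-discharge and solar input ignored in this sketch):

```python
def battery_life_days(capacity_mah: float, daily_draw_mah: float) -> float:
    """Days of operation on one battery charge (self-discharge ignored)."""
    return capacity_mah / daily_draw_mah

days = battery_life_days(3350, 24)   # one 18650 cell, 24 mAh drawn per day
months = days / 30                   # comfortably above four months
```

With two cells and the 150 mA solar panel charging for about 5 h per day, the daily energy budget is replenished many times over, which is why the batteries never need replacing in practice.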

Cloud
The backend of the IoT network is in the cloud and is available through an Application Programming Interface (API) that makes the images available to the CV model (YOLOv5 (XLarge)) and stores them. When an image arrives at the cloud, the model detects and counts the pests (see Figure 10 in this paper and Figure A2 in Appendix A). These detections, the metadata of the images, and the plantation sites are stored in a relational database. The pest monitoring application, through the same API, can access the CV models, the database, and the stored images in the cloud.
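The flow of storing each detection and serving the daily evolution to the app can be sketched against a relational database. The table schema, node identifiers, and the use of SQLite as a stand-in below are illustrative assumptions, not the paper's actual backend:

```python
import sqlite3

def store_detection(conn, node_id, day, image_path, wf_count):
    """Persist one day's WF count for a trap (schema is illustrative)."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS detections (
               node_id TEXT, day TEXT, image_path TEXT, wf_count INTEGER)"""
    )
    conn.execute(
        "INSERT INTO detections VALUES (?, ?, ?, ?)",
        (node_id, day, image_path, wf_count),
    )
    conn.commit()

def pest_evolution(conn, node_id):
    """Daily WF counts for one node, as plotted by the mobile app."""
    rows = conn.execute(
        "SELECT day, wf_count FROM detections WHERE node_id = ? ORDER BY day",
        (node_id,),
    )
    return list(rows)

conn = sqlite3.connect(":memory:")
store_detection(conn, "field1-trap1", "2022-06-01", "/imgs/0601.jpg", 12)
store_detection(conn, "field1-trap1", "2022-06-02", "/imgs/0602.jpg", 19)
```

Keeping the counts keyed by node and day is what allows the application's dashboard to render the per-trap evolution with a single query.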

Mobile Application
The mobile application was developed using React Native and allows the user to monitor pests through a mobile device (see Figure 11). The user can choose a plantation site and a specific edge node (one of the ESP32-CAMs) on a map and, for that node, see the evolution of the detected pests and the images obtained each day. The middle image shows a pest evolution in which the traps were changed almost every day. To complement this approach, it is also possible to take a photo of a trap, which is sent to the backend of the IoT network; the CV model then makes the detections and returns the results to the application.

Figure 11. Part of the mobile app that allows users to monitor pests.

Conclusions and Future Work
The presented approach aims to contribute to more sustainable and efficient farming. We presented an IoT network to automate pest monitoring and to enhance part of the technicians' work with new technologies. The star topology of the network allows its easy extension to new plantations and other pest types. The network is composed of edge devices, a private cloud, and a mobile application, and uses HTTP to communicate between nodes. The low-cost devices (such as the ESP32-CAM) are associated with low power consumption, allowing the use of the network in crops with external power constraints. It is possible to run the edge nodes for four months using 18650 batteries, and without the need to change those batteries when solar panels are added. Furthermore, the use of low-cost devices at the edge does not significantly decrease the performance of the CV models (between 10.3% and 16.3% fewer detections).
We tested four CV models (Faster R-CNN, Scaled-YOLOv4, YOLOv5 (Small), and YOLOv5 (XLarge)) and used three metrics for their evaluation: mAP, memory consumption, and average time for each detection. The model with the best mAP was YOLOv5 (XLarge), with a mAP of 89.70%. The model with the lowest memory consumption and average time for each detection was YOLOv5 (Small), with 15 MB and 0.01 s, respectively.
As the model runs in the cloud and the goal is to monitor pests daily, we used the model with the highest mAP (YOLOv5 (XLarge)) for the monitoring system. For a real-time monitoring system, the model with the lowest memory consumption (to run closer to the edge) and the lowest average detection time would be the best option, in this case YOLOv5 (Small). For a monitoring system with a CV model in the fog, Scaled-YOLOv4 constitutes the best choice, as it is more balanced.
The mobile application allows users to easily monitor the pest's evolution, thus supporting more informed decisions. The application was presented to potential users, who have shown interest in such a pest monitoring application, namely for integration into their own current monitoring systems.
Most approaches to pest monitoring only address CV models. Advances in IoT, such as reduced hardware costs and increased computational power, make it especially relevant to use IoT networks for the deployment of those CV models, giving them a purpose and a use. Additionally, most proposed approaches use images obtained in controlled conditions, e.g., always with the same light or without noise, making it difficult to develop CV models with good performance outside the lab. This work shows that the use of an IoT network allows images to be acquired in real contexts and in large quantities, which can then contribute to improving the real performance of CV models.
Despite the advantages of an automatic monitoring system, there are some disadvantages that must be taken into account. The lack of precision of the CV model or poor image quality can lead to false positives and/or false negatives, both of which can be tackled with large annotated datasets covering the pest's different development stages and obtained under real conditions. Future work will mostly focus on overcoming these disadvantages. For that, we will focus on the expansion of the IoT network to obtain more images and on the creation of a dataset with ESP32-CAM images to improve the CV models' performance under uncontrolled conditions. Lastly, we will evaluate the performance of the monitoring system during multiple crop seasons.