A Human-Detection Method Based on YOLOv5 and Transfer Learning Using Thermal Image Data from UAV Perspective for Surveillance System

At present, many illegal activities are being carried out, such as illegal mining, hunting, logging, and forest burning. These activities can have a substantial negative impact on the environment and are increasingly rampant because of the limited number of officers and the high cost of monitoring. One possible solution is to create a surveillance system that uses artificial intelligence to monitor the area. Unmanned aerial vehicles (UAVs) and NVIDIA Jetson modules (general-purpose GPUs) can be inexpensive and efficient because they use few resources. The difficulty in object detection from a drone's perspective is that the objects are small relative to the observation space, and there are also illumination and environmental challenges. In this study, we demonstrate the use of the state-of-the-art object-detection method you only look once (YOLO) v5 on a dataset of visual (RGB) images taken from a UAV, along with thermal infrared (TIR) information, to find poachers. We employ seven training scenarios with RGB and thermal infrared data to find the best model, which we later deploy on the Jetson Nano module. The experimental results show that transfer learning from a model pre-trained on the MS COCO dataset improves the ability of YOLOv5 to detect human objects in the RGBT image dataset.


Introduction
Since 2012, the United Nations has proclaimed 21 March as World Forest Day [1]. The goal is to make people aware of the importance of forest sustainability. Based on data from the Food and Agriculture Organization of the United Nations [2], Indonesia is the eighth most forested country, with a total forest area of more than 50% of the total land area, or about 93 million hectares. However, many illegal activities occur in large forest areas in Indonesia, such as land clearing without a permit, forest fires, and illegal hunting [3]. Illegal activities carried out in the forest environment may cause natural disasters such as landslides and floods, as well as the loss of habitat for many animals [4]. Various efforts have been made to preserve Indonesia's forests and their biological species, yet many illegal activities still occur. Because of a lack of personnel and of thorough area coverage, the traditional technique of patrolling and monitoring these areas has not been able to resolve this issue.
One answer to this problem is to develop a surveillance system that monitors the area using artificial intelligence. Unmanned aerial vehicles (UAVs, often known as drones) and NVIDIA Jetson modules, which are general-purpose GPUs, are an affordable and effective solution because they need only a small amount of resources. The proposed solution for a surveillance system using drones and Jetson can be seen in Figure 1. UAV technology is currently very advanced and is the most realistic solution today because it is flexible, fast, relatively inexpensive, lightweight, and easy to use [5]. In several fields of study, UAVs have been employed as tools for area and target coverage, path and trajectory planning, image analysis and vision-based techniques, networking, and flight control [6]. Despite the massive use of UAVs in these fields, many challenges remain to be solved, including weather conditions, shadows, illumination, and other variations. To overcome these challenges, RGBT images, that is, red, green, and blue images with thermal infrared information, are utilized.
Conceptually, a thermal infrared (TIR) image represents data that capture information outside the spectrum visible to the human eye. It captures wavelengths outside the visible light spectrum, as shown in Figure 2. This helps TIR to overcome changes in light intensity that affect the color captured by the human eye. However, TIR also has weaknesses: it is sensitive to temperature changes and does not contain as much detailed information as visual RGB images [7]. Because of the small size of humans in UAV videos, the UAV's motion, and the low resolution, detecting poachers in UAV video, particularly thermal infrared footage, is an important topic of research. In the present study, several scenarios have been used to enhance the you only look once (YOLO) [8] object-detection method, with a focus on small human-object detection from a UAV perspective. The target presents a harder challenge for object detection because of its various shapes and dense crowds. Therefore, the YOLOv5 model was trained using the RGB and TIR image datasets in order to evaluate how well it performs when identifying humans in aerial-perspective data.
The main contributions of this paper are as follows:
• Optimizing the YOLOv5s algorithm for a small human-object detection dataset via the transfer learning method.
• Developing a method to handle different environmental issues, including illumination and mobility change, using thermal infrared (TIR) images in addition to RGB (RGBT) images.
• Manually annotating the original dataset to be YOLO-format-compatible; the annotation will be made available to the public.
• Proposing a surveillance system for wildlife conservation using the NVIDIA Jetson Nano module.
This paper is organized as follows: Section 2 describes object detection for surveillance and provides a brief overview of the NVIDIA Jetson modules. Section 3 consists of the methodology and necessary background information, as well as the evaluation method. Sections 4 and 5 consist of the experimental results and the conclusion, respectively.

Related Work
The technique of object detection in UAVs or drones has been developed for use in a variety of contexts, including aerial image analysis, monitoring agents, delivery routing agents, intelligent surveillance, and air force security. Hengstler et al. [9] introduced MeshEye, a new approach to the distribution model of surveillance cameras that uses a low-resolution stereo camera to calculate, from all the captured images, the position, range, and dimension that UAVs use. Widiyanto et al. [10] introduced a PSO algorithm for the odor-source localization model of automatic robotic movement by reconstructing two different points of robotic sensing. Zhao et al. [11] proposed a new mixed YOLOv3-LITE for image detection precision and speed, which can be used on a non-GPU computing system such as a mobile or portable device.
Several studies have been conducted in the field of object detection, especially with the availability of large datasets online and increasing computing power, which have led to extraordinary achievements in the field of computer vision [12]. Object detection has been able to solve both general and specific problems. Two examples of single-stage detectors are you only look once (YOLO) and the single-shot multi-box detector (SSD) [13]. Meanwhile, the RCNN family, which includes RCNN [14], Fast RCNN [15], and Faster RCNN [16], is categorized as two-stage detectors. These two categories of deep-learning-based detectors are distinguished by their accuracy and processing time.

You Only Look Once (YOLO)
The first YOLO method was introduced by Redmon et al. in 2016 [8]. This single-convolutional-network object detector can predict object categories and locations at up to 45 fps. The YOLO algorithm takes the whole image in one pass and divides it into an S × S grid. Each grid cell of the input image is responsible for detecting and predicting the category of the object inside the bounding box that contains the class probability. The YOLO architecture has 24 convolutional layers for performing feature extraction and two fully connected layers for predicting the bounding box of the predicted object. In addition, YOLO is renowned for combining high performance with a small model size, which makes it an ideal candidate for real-time object detection in on-device deployment.
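The grid assignment described above can be illustrated with a minimal sketch. It assumes that a cell is "responsible" for an object when the center of the object's bounding box falls inside that cell; the function name `responsible_cell` and the default `S=7` are illustrative choices for this example, not values taken from this study.

```python
def responsible_cell(x_center, y_center, img_w, img_h, S=7):
    """Return (row, col) of the S x S grid cell that contains a box center.

    x_center and y_center are in pixels; S=7 is only an illustrative default.
    """
    # Scale the center into grid units, then clamp so that a center lying
    # exactly on the right or bottom image border maps to the last cell.
    col = min(int(x_center / img_w * S), S - 1)
    row = min(int(y_center / img_h * S), S - 1)
    return row, col


if __name__ == "__main__":
    # A box centered in the middle of a 640 x 480 image.
    print(responsible_cell(320, 240, 640, 480))
```

Because only one cell owns each center, small objects that are close together may compete for the same cell, which is one reason small-object detection from a UAV perspective is difficult.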
By late 2021, YOLO had been upgraded to version 5. The first three YOLO versions were released in 2016, 2017, and 2018, respectively, and two further versions, YOLOv4 and YOLOv5, were released within a few months of each other in 2020. YOLO version 2 (YOLOv2) replaced the original architecture with a 19-layer feature extractor called Darknet-19 [17]. In the third version (YOLOv3), the network architecture was updated again to a deeper architecture known as Darknet-53 [18]. Furthermore, YOLO version 4 (YOLOv4), whose backbone is referred to as CSP Darknet-53, used the same Darknet-53 backbone architecture with additional cross stage partial (CSP) connections [19]. YOLOv4 was released in 2020 with several additional features that have been shown to enhance accuracy.

NVIDIA Jetson Modules
Embedded machine learning is evolving rapidly. NVIDIA is recognized as a manufacturer of graphics-processing units for gaming and professional markets and of system-on-chip units for mobile computing. It has also produced the NVIDIA Jetson modules, a family of embedded computers with integrated GPUs designed for high-performance computing, which make it easy to create an embedded AI system [20]. Jetson Nano is the cheapest of all NVIDIA Jetson modules, and with its 128 parallel processing cores, it can handle a real-time video feed. The main technical parameters of the Jetson Nano modules are summarized in Table 1. The NVIDIA Jetson Nano uses the compute unified device architecture (CUDA) as its parallel computing platform. CUDA is a development and execution platform designed by NVIDIA for general-purpose computing on graphical processing units (GPUs) [22]. It allocates to the GPU those tasks that can run in parallel and do not need to be executed sequentially. Furthermore, it supports many programming languages, such as C, C++, Fortran, and Python. CUDA is useful in domains that require a lot of computing power or in situations where parallelization is possible and high performance is required. NVIDIA Jetson modules have been widely used in computer vision research because NVIDIA Jetson general-purpose GPUs have become a viable platform for the efficient execution of some computational models [23].
In the current study, the NVIDIA Jetson Nano was used to detect human appearance from a UAV perspective for the surveillance system. The best YOLOv5 model trained on the RGBT dataset was deployed on the Jetson Nano. An overview of the Jetson Nano design used is shown in Figure 3.

Object Detector
YOLOv5 [24] is the latest major version of YOLO to date. Jocher released YOLOv5 publicly on 9 June 2020, and it is still being updated. The release of YOLOv5 includes four main model sizes: YOLOv5s, the smallest; YOLOv5m, medium; YOLOv5l, large; and YOLOv5x, the largest. When it was released, YOLOv5 was intended only for an image size of 640 pixels, but it now also offers 1280 pixels.
Furthermore, the architecture of YOLOv5 has a cross stage partial (CSP) backbone and a PANet neck, just like YOLOv4. However, YOLOv5 is implemented in PyTorch instead of the original Darknet framework. The significant improvements in YOLOv5 include mosaic data augmentation and auto-learning bounding box anchors. The architecture of YOLOv5 is shown in Figure 4.

Dataset
During the experimental design, the VisDrone 2021 RGBT dataset was used [25]. This dataset was originally part of the VisDrone 2021 Crowd Counting Challenge, which aims to estimate the number of people in each image. VisDrone 2021 provides a dataset with pairs of RGB and TIR images. It is important to note that the VisDrone 2021 RGBT dataset was collected by the AISKYEYE team from the Lab of Machine Learning and Data Mining at Tianjin University, China.
These data consist of 1807 pairs of RGB and TIR images; an example of such a pair can be seen in Figure 5. The team collected the data from an actual UAV under several different scenarios as well as various lighting and weather conditions. The ground truth of the dataset is the object's target point in XML format. Before using these data in the experiment, some data preprocessing was performed to make them compatible with the YOLO format. In this study, the data were divided into training and test sets in the ratio of 80:20, respectively.
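The preprocessing and split described above can be sketched as follows. This is a minimal illustration, not the procedure used in this study: it assumes that a pixel-coordinate bounding box is already available for each person (for example, from manual annotation), and the function names `to_yolo_line` and `split_pairs`, the fixed random seed, and the image size used in the usage example are all hypothetical. A YOLO label line stores the class index followed by the box center, width, and height, each normalized by the image dimensions.

```python
import random


def to_yolo_line(class_id, x_min, y_min, x_max, y_max, img_w, img_h):
    """Convert a pixel-coordinate box to one YOLO-format label line."""
    x_c = (x_min + x_max) / 2 / img_w   # normalized box center, x
    y_c = (y_min + y_max) / 2 / img_h   # normalized box center, y
    w = (x_max - x_min) / img_w         # normalized box width
    h = (y_max - y_min) / img_h         # normalized box height
    return f"{class_id} {x_c:.6f} {y_c:.6f} {w:.6f} {h:.6f}"


def split_pairs(pair_ids, train_ratio=0.8, seed=0):
    """Shuffle image-pair identifiers and split them into train and test lists."""
    ids = sorted(pair_ids)
    random.Random(seed).shuffle(ids)    # seeded so the split is reproducible
    cut = int(len(ids) * train_ratio)
    return ids[:cut], ids[cut:]


if __name__ == "__main__":
    # Hypothetical box in a hypothetical 640 x 512 image.
    print(to_yolo_line(0, 100, 200, 140, 280, 640, 512))
    train, test = split_pairs(range(1807))
    print(len(train), len(test))
```

Splitting by pair identifier rather than by individual file keeps the RGB and TIR images of the same scene in the same subset.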

Experiment Setup
In this research, the YOLOv5 model was trained on a host machine with an NVIDIA RTX 3060 GPU with 12 GB of VRAM, an Intel Core i9-10900K processor (3.70 GHz, 20 MB), and 32 GB of memory. The best model from the training stage was converted to a TensorRT model and deployed to the Jetson Nano. Finally, the model was tested with the NVIDIA DeepStream SDK. The training parameters used can be seen in Table 2. Since this study aims to deploy the inference model on the Jetson Nano module, the smallest model version of YOLOv5 (YOLOv5s) was chosen. The seven aforementioned scenarios are intended to investigate the impact of combining the transfer-learning approach with the dataset used, so that the best scenario may be selected and applied to the Jetson Nano device.

Evaluation
The training scenarios for VisDrone RGB, TIR, and RGBT images were evaluated on both the RGB and TIR test sets. The evaluation measurements used include precision (P), recall (R), and average precision (AP). The AP measures a combination of recall and precision for ranked retrieval results and is the average precision at various recall values [26]. The formulas to calculate P and R are as follows:

$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}$$

where:
• TP denotes true positive;
• FP denotes false positive;
• FN denotes false negative.
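The relationship between these measurements can be made concrete with a short sketch. It computes P and R from the TP, FP, and FN counts, and computes AP as the sum of precision weighted by the increase in recall over detections ranked by confidence. This non-interpolated form is one common variant of AP and is an assumption here; the evaluation in this study may use a different interpolation. The function names and the sample values are illustrative only.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def average_precision(ranked_hits, num_ground_truth):
    """Non-interpolated AP over detections sorted by descending confidence.

    ranked_hits: list of booleans, True where the detection is a true positive.
    num_ground_truth: total number of ground-truth objects.
    """
    if num_ground_truth == 0:
        return 0.0
    tp = fp = 0
    ap = 0.0
    prev_recall = 0.0
    for hit in ranked_hits:
        if hit:
            tp += 1
        else:
            fp += 1
        precision = tp / (tp + fp)
        recall = tp / num_ground_truth
        # Recall only increases on a true positive, so false positives add nothing.
        ap += precision * (recall - prev_recall)
        prev_recall = recall
    return ap


if __name__ == "__main__":
    print(precision_recall(tp=8, fp=2, fn=4))
    print(average_precision([True, False, True], num_ground_truth=2))
```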

Experiments and Results
The experimental pipeline consists of two main stages. The first is a model search, or training process, to find the best model for the human-detection task from a UAV perspective. This model search was performed on the host machine mentioned earlier. The second stage is the execution, or inference, on the Jetson Nano module. The flow of this experiment can be seen in Figure 6.

Model Search
The original YOLOv5 provided by Ultralytics can detect small objects in both the RGB and TIR images, but it classifies them incorrectly. For example, in Figure 7, YOLOv5 classifies small human beings as a bird and a kite. Intuitively, this occurs because the original YOLOv5 is a model trained on the MS COCO dataset, which has 80 classes and different viewing perspectives. This indicates that a model trained on the MS COCO dataset alone is insufficient to solve the human classification problem from the standpoint of an unmanned aerial vehicle (UAV).
After training was conducted using RGB, TIR, and RGBT images from the VisDrone 2021 dataset, the models were tested using the RGB and TIR test-set images. The results of the experiments for the seven scenarios can be seen in Tables 3 and 4. Table 3 shows the comparison results for each of the seven training scenarios and the original YOLOv5 model when applied to the RGB image test set. All trained models performed better than the original YOLOv5.
The best model in this scenario was the YOLO-RGB-TL model, with an average precision of 79.8%; meanwhile, the YOLO-TIR model failed on the RGB image test set, as it produced a lower performance value. Table 3 also shows that the performance of both YOLO-RGB and YOLO-RGBT improved when transfer learning with pre-trained weights from the MS COCO dataset was employed. This is evident as the model performance increased from 70% to 79.8% and from 71.4% to 79.1% for YOLO-RGB and YOLO-RGBT, respectively. Furthermore, Table 4 shows the comparison results for each of the seven training scenarios and the original YOLOv5 model when applied to the TIR image test set. The YOLO-TIR and YOLO-RGBT models with transfer-learning weights achieved an AP of 88.8% on the TIR image test set. In Table 4, neither YOLO-RGB nor YOLO-RGB-TL matched the results of the YOLO-TIR and YOLO-RGBT models because the information in the TIR images is not as detailed as that in the RGB images. This limited information makes it difficult for these models, which were not trained with TIR images, to detect the objects. The performance results of each scenario for the RGB and TIR images are shown from

Inference on the Jetson GPUs
The best model obtained from the previous step was chosen and executed on the Jetson Nano module. The process of deploying the model on the Jetson Nano module includes converting the model to TensorRT and cloning the TensorRT project on the Jetson Nano. In the deployment process, NVIDIA DeepStream was installed, and then the model was executed on the Jetson Nano module. The best model was run on the platform using the Keras API with TensorFlow v2. The Jetson Nano module was switched to the highest performance mode (nvpmodel 0), and the model processed images from the test set.

The Limitation of This Study
One limitation of this study is that the model we used as a foundation is YOLOv5s, the smallest version of YOLOv5. Because we propose a method for applying the model to the Jetson Nano, we had to consider resource constraints such as memory, time, and energy consumption. Additionally, the proportion of RGB and TIR images has not been studied further to determine the optimal combination of these images for the most accurate object-detection method.

Figure 1. UAVs with NVIDIA Jetson Nano for surveillance system.

Figure 6. Model search and human-object detection on Jetson Nano workflow.

Table 3. Performance results on the RGB test-set images.

Table 4. Performance results on the TIR test-set images.