Feasibility Analyses of Real-Time Detection of Wildlife Using UAV-Derived Thermal and RGB Images

Abstract: Wildlife monitoring is carried out for diverse reasons, and monitoring methods have gradually advanced through technological development. Direct field investigations have been replaced by remote monitoring methods, and unmanned aerial vehicles (UAVs) have recently become the most important tool for wildlife monitoring. Many previous studies on detecting wild animals have used RGB images acquired from UAVs, with most of the analyses depending on machine learning–deep learning (ML–DL) methods. These methods provide relatively accurate results, and when thermal sensors are used as a supplement, even more accurate detection results can be obtained through complementation with RGB images. However, because most previous analyses were based on ML–DL methods, considerable time was required to generate training data and train detection models. This drawback makes ML–DL methods unsuitable for real-time detection in the field. To compensate for the disadvantages of the previous methods, this paper proposes a real-time animal detection method that generates a total of six applicable input images depending on the context and uses them for detection. The proposed method is based on the Sobel edge algorithm, which is simple but can detect edges quickly based on change values. The method can detect animals in a single image without training data. The fastest detection time per image was 0.033 s, and all frames of a thermal video could be analyzed. Furthermore, because of the synchronization of the properties of the thermal and RGB images, the performance of the method was above average in comparison with previous studies. With target images acquired at heights below 100 m, the maximum detection precision and detection recall of the most accurate input image were 0.804 and 0.699, respectively. However, the low resolution of the thermal sensor and its shooting height limitation were hindrances to wildlife detection. The aim of future research will be to develop a detection method that can improve these shortcomings.


Introduction
For wildlife detection and monitoring, traditional methods such as direct observation [1] and capture-recapture have been carried out for diverse purposes [2]. However, these methods require a large amount of time, considerable expense, and field-skilled experts [3,4] to obtain reliable results. Furthermore, performing a traditional field survey can result in dangerous situations, such as an encounter with wild animals. As technologies have developed, remote monitoring methods, such as those based on camera trapping [5], GPS collars [6], and environmental DNA sampling [7], have been used more frequently and have mostly replaced traditional survey methods. Camera-trapping methods can track the life cycle of animals at the nest level. Camera networks can be created by installing multiple cameras, and high-quality data can be acquired across the region of interest [8].
This paper proposes a new method for detecting animals. To address the limitations of previous research, there were three main objectives:
(1) Reduce the animal detection time. The main limitation of previous animal detection methods is that they cannot be applied in the field in real time. ML-DL-based methods need an enormous number of training images, and it takes a long time to train the detection model. Methods using thermal images require preprocessing to detect animals. To address these limitations, the proposed method detects animals based on single images, and image preprocessing is simplified.
(2) Enable detection in more environments. ML-DL-based methods are only suitable for certain species and land cover types or environments. To improve detection versatility, the proposed method considers target size and surface temperature when detecting animals. Theoretically, the method can be adapted to all homeothermic animals if the body size and surface temperature are known. Here, we focused on detecting mid-sized animals (alpaca).
(3) Use thermal and RGB images acquired from the same thermal camera. When a detection method needs both thermal and RGB images, separate thermal and RGB cameras are usually used. However, any thermal camera can save thermal and RGB images simultaneously, and the centroid is the same because the shooting time is the same. Therefore, by correcting the differences caused by focal length, shooting area, and spatial resolution, thermal and RGB images can be used simultaneously for research without the requirement of two cameras [35].
The main goal of this study was to develop an automated method for detecting animals using a thermal image dataset, to apply it under in situ conditions in real time, and to achieve detection ability comparable to that of previous methods. The fastest detection time was 0.033 s, the maximum detection precision was 0.804, and the detection recall rate was 0.699.

Study Site
An animal farm (37.827° N, 127.882° E) in the middle of a natural forest in Hongcheon, Republic of Korea, was used as the study site for data collection (Figure 1). To determine the animal species and their locations, the UAV was flown over the entire farm. Through this process, the distribution of land cover was also confirmed. The major species on the farm is Vicugna pacos (alpaca), so these animals were mainly used to develop the detection and analysis method. The farm also has a few Cervus nippon (sika deer), Struthio camelus (ostrich), and Camelus bactrianus (camel). The barns for each species are located on grassland or bare land, and the animals mainly move about on those land cover types. The area of the farm is approximately 12.02 ha, and the main cover type is forest (50%), followed by grassland (35%). The remaining land cover comprises bare land and artificial structures such as roads and buildings. The minimum and maximum elevations on the farm are 450.56 and 512.00 m, respectively.

Data Acquisition
UAV flights were conducted using a MATRICE 210 UAV (DJI, Shenzhen, China), and the thermal camera was a FLIR ZENMUSE XT2 (DJI). The thermal camera has both an RGB sensor and a thermal sensor, and images are captured by both sensors at different resolutions. Each RGB image contains 4000 × 3000 pixels, and each thermal image contains 640 × 512 pixels. The spatial resolution of each RGB image at 25 m above the ground is 0.59 cm/pixel, whereas the resolution of each thermal image is 2.24 cm/pixel. Because the thermal sensor has a longer focal length, each thermal image covers a narrower region [36].
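As a rough check on these figures, the ground sampling distance (GSD) of a downward-looking camera grows in proportion to flight height, so the 25 m values can be extrapolated to other heights. The sketch below assumes a vertical view over flat ground; the function names are illustrative and are not part of the original workflow.

```python
# Reference GSD values at 25 m above the ground, taken from the text (cm/pixel).
REF_HEIGHT_M = 25.0
GSD_RGB_CM = 0.59
GSD_THERMAL_CM = 2.24


def gsd_at_height(gsd_ref_cm, height_m, ref_height_m=REF_HEIGHT_M):
    """Scale a reference GSD linearly with height (vertical view, flat ground assumed)."""
    return gsd_ref_cm * height_m / ref_height_m


def footprint_m(width_px, height_px, gsd_cm):
    """Ground footprint (width, length) in metres of an image with the given pixel size."""
    return width_px * gsd_cm / 100.0, height_px * gsd_cm / 100.0


for h in (25, 50, 100):
    thermal_gsd = gsd_at_height(GSD_THERMAL_CM, h)
    width_m, length_m = footprint_m(640, 512, thermal_gsd)
    print(f"{h} m: thermal GSD {thermal_gsd:.2f} cm/pixel, "
          f"footprint {width_m:.1f} m x {length_m:.1f} m")
```

Under these assumptions a thermal frame taken at 25 m covers roughly 14 m × 11 m on the ground.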
The data were acquired on 25 November 2020. In Korea, November is considered to fall within the winter season, and snow typically falls from the middle of the month. Snow cover offers advantages: a lower land-surface temperature is beneficial for automated animal detection, and photographs can show not only the animals but also their tracks, thereby improving detection rates [37]. However, inadequate snow cover can inhibit animal detection and require the images to be acquired again [38]. Therefore, a shooting date was selected on which the air and land surface temperatures were low and there was no snow cover. This decision maximized the temperature difference between the targets and land cover types and facilitated more accurate detection of animals. Furthermore, by shooting images around noon, the shadow size of individual targets was minimized, which reduced the possibility of error from shadows.
After a programmed drone flight over the entire study site, the drone was controlled manually to capture the locations of the main target animals (alpaca). After a suitable spot was found, 26 images were acquired from heights of 25-275 m above the ground at 10-m intervals to aid the development of a method that can be used under various circumstances. The body lengths of the main target animals range from 80 to 100 cm when fully grown, and they have various fur colors, including black, gray, white, dark brown, and light brown. Based on the UAV images, the targets were sorted into four categories according to their visible condition. The category "isolated" indicated that the target stood alone, not touching any other target or obstacle. "Bordering" meant that two targets were touching each other, and "overlapping" meant that the body parts of two targets crossed over each other. "Partial" indicated that the target was only partly visible at the edge of the image (Figure 2).

Data Preprocessing
As the outputs of the XT2 sensors have different pixel sizes, spatial resolutions, and coverage areas (Figure 3), they need to be modified to have the same properties. Furthermore, to acquire accurate results, temperature correction of the thermal images and masking of non-target regions are required.

RGB Lens Distortion Correction and Clipping
Because the two sensors have different focal lengths, the distortion in their images also differs [39]. The RGB sensor of the XT2 has a focal length of 8 mm, whereas the thermal sensor has a focal length of 19 mm. An image taken with a shorter focal length is subject to more barrel distortion than one taken with a longer focal length [40]. Therefore, to use the thermal and RGB images together, we had to correct the distortion in the RGB images. Python and the OpenCV2 library [41] were used for this purpose. After correction, the RGB images were clipped and rescaled to have the same coverage as the thermal images (Figure 4).
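The clipping and rescaling step can be sketched as follows. The sketch assumes that the RGB frame has already been undistorted (for example with OpenCV's `cv2.undistort` and a calibrated camera matrix, which the text does not report) and that both sensors share the same image center. The crop size is derived from the GSD values quoted in the Data Acquisition section, and nearest-neighbor resampling is used only to keep the example NumPy-only; the function name is hypothetical.

```python
import numpy as np


def clip_rgb_to_thermal(rgb, gsd_rgb_cm=0.59, gsd_thermal_cm=2.24,
                        thermal_shape=(512, 640)):
    """Center-crop an undistorted RGB frame to the thermal footprint and
    resample it onto the thermal pixel grid (nearest neighbor)."""
    out_h, out_w = thermal_shape
    scale = gsd_thermal_cm / gsd_rgb_cm            # RGB pixels per thermal pixel
    crop_h = int(round(out_h * scale))
    crop_w = int(round(out_w * scale))
    h, w = rgb.shape[:2]
    if crop_h > h or crop_w > w:
        raise ValueError("thermal footprint is larger than the RGB frame")
    top = (h - crop_h) // 2
    left = (w - crop_w) // 2
    crop = rgb[top:top + crop_h, left:left + crop_w]
    # Index maps from each thermal pixel to the nearest RGB pixel in the crop.
    rows = np.minimum((np.arange(out_h) * scale).astype(int), crop_h - 1)
    cols = np.minimum((np.arange(out_w) * scale).astype(int), crop_w - 1)
    return crop[rows[:, None], cols]


rgb_frame = np.zeros((3000, 4000, 3), dtype=np.uint8)   # placeholder RGB image
aligned = clip_rgb_to_thermal(rgb_frame)
print(aligned.shape)
```

With the quoted resolutions, the crop is roughly 2430 × 1944 RGB pixels, which is then reduced to the 640 × 512 thermal grid.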

Thermal Image Correction by Fur Color
Although the body temperature of the target animals is the same across individuals, the surface temperature can differ because of the fur color. The surface temperatures of animals with brighter fur were lower [42] because of higher reflectance [43]. Surface temperature differences can cause errors in the detection process and must be corrected for.
The pixel values of the RGB channels were used to identify bright targets. Based on our measurements, the surface temperature of white animals was approximately 25% lower than that of animals with darker fur. Therefore, the pixels of the thermal images located at the same positions as white pixels in the RGB images were adjusted to have higher values (Figure 5).
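A minimal sketch of such a correction is given below. The text reports only the approximate 25% deficit, so the white threshold (`white_threshold`) and the multiplicative form of the adjustment are assumptions made for illustration, and the function name is hypothetical.

```python
import numpy as np


def correct_for_white_fur(thermal, rgb, white_threshold=200, deficit=0.25):
    """Raise thermal values at pixels that appear white in the co-registered RGB image.

    Assumptions: a pixel is "white" when all three channels reach white_threshold,
    and the reading is restored by dividing by (1 - deficit).
    """
    thermal = np.asarray(thermal, dtype=np.float64)
    white = np.all(np.asarray(rgb) >= white_threshold, axis=-1)
    corrected = thermal.copy()
    corrected[white] = thermal[white] / (1.0 - deficit)
    return corrected


# Two pixels: a white-furred animal reading 12 and a brown-furred animal reading 15.
thermal_px = np.array([[12.0, 15.0]])
rgb_px = np.array([[[250, 250, 250], [90, 60, 40]]], dtype=np.uint8)
print(correct_for_white_fur(thermal_px, rgb_px))
```

Only the pixel flagged as white is changed; all other thermal values pass through untouched.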

Unnatural Object Removal
The principle of animal detection using thermal images is to locate spots where the temperature is different, because homeothermic animals always have the same body temperature and this consistency creates a temperature gap between animals and their surrounding environment. However, artificial structures, e.g., buildings and roads, have a much higher surface temperature compared with animals or natural surfaces. Therefore, when these types of artificial land cover are included in a thermal image, numerous errors in animal detection occur [31]. To eliminate this error, artificial structures should be masked.
However, it is difficult to tell which parts of the image should be removed. One of the main purposes of this study was to develop a method that can be used for instant detection under in situ conditions, and this objective limited the time available to analyze images and locate artificial structures. Therefore, as an alternative to artificial cover detection, an unnatural color masking method was used. Fortunately, more than half of the artificial structures at the study site have unnatural colors, such as vivid red, vivid blue, and vivid orange (Figure 6). As in the fur color correction, temperature values were removed in this step according to pixel color. Many possible errors can be prevented by removing these high-temperature artificial structures.
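One way to express "vivid" colors numerically is through saturation and brightness, as in the sketch below. The thresholds (`min_saturation`, `min_value`) and the use of NaN for removed pixels are assumptions for illustration; the text does not state which color criteria were applied.

```python
import numpy as np


def mask_unnatural_colors(thermal, rgb, min_saturation=0.8, min_value=150):
    """Blank out thermal pixels whose RGB color is vivid (highly saturated and bright)."""
    rgb = np.asarray(rgb, dtype=np.float64)
    value = rgb.max(axis=-1)                       # brightest channel, 0-255
    spread = value - rgb.min(axis=-1)
    saturation = np.divide(spread, value, out=np.zeros_like(value), where=value > 0)
    vivid = (saturation >= min_saturation) & (value >= min_value)
    masked = np.asarray(thermal, dtype=np.float64).copy()
    masked[vivid] = np.nan                         # removed pixels carry no temperature
    return masked


# A hot, vivid red roof pixel next to a cool, dull green grass pixel.
thermal_px = np.array([[30.0, 8.0]])
rgb_px = np.array([[[230, 20, 20], [80, 120, 60]]], dtype=np.uint8)
print(mask_unnatural_colors(thermal_px, rgb_px))
```

In a full pipeline the NaN values would need to be replaced with a background temperature before edge detection, since gradients computed across NaN pixels are undefined.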

Methods
Our method requires both thermal and RGB images but especially thermal images, as these contain more useful information.
The open-source programming language Python was used in the Google Colab [44] environment to develop the proposed method. Google Colab, a cloud service based on Jupyter Notebooks, executes Python code using both CPU and GPU resources, thus enabling quantitative analysis on a scale that exceeds the limitations of personal computers. The main functions of the proposed method are Sobel edge creation [45] and contour drawing. OpenCV2, an optimized computer vision library, was used for image processing.
The automated detection results obtained using the proposed method were categorized based on shooting height and target shape, i.e., isolated, bordering, overlapping, or partial.

Sobel Edge Detection and Contour Drawing
The Sobel operator is a simple gradient operator for finding edges that is applied in the vertical and horizontal directions [46]. The larger the difference between neighboring pixel values, the higher the Sobel edge value. By combining the vertical and horizontal Sobel edges, a biaxial Sobel edge can be made. This biaxial Sobel edge was used to draw the binary contours: a threshold was applied to the biaxial Sobel edge, and the resulting segmented image was used for contouring. As the contours were drawn, the centroid point of each contour was marked on the images (Figure 7). The accuracy of the contours was high, but some were wrongly drawn around non-target objects, such as stones, wet soil, and artificial structures. Therefore, to eliminate these false-positive results, the contours had to be sorted.
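The edge step can be illustrated with a NumPy-only sketch of the 3 × 3 Sobel operator. Combining the two directions by gradient magnitude is an assumption, as the text does not say how the horizontal and vertical edges were merged, and the threshold value is arbitrary. In OpenCV, the corresponding functions are `cv2.Sobel`, `cv2.threshold`, `cv2.findContours`, and `cv2.moments`.

```python
import numpy as np


def sobel_biaxial(image):
    """Return the combined (gradient-magnitude) Sobel edge map of a 2-D image."""
    img = np.asarray(image, dtype=np.float64)
    p = np.pad(img, 1, mode="edge")                # replicate borders to keep the size
    # Horizontal derivative: right column minus left column, weighted 1-2-1.
    gx = (p[:-2, 2:] + 2 * p[1:-1, 2:] + p[2:, 2:]) \
        - (p[:-2, :-2] + 2 * p[1:-1, :-2] + p[2:, :-2])
    # Vertical derivative: bottom row minus top row, weighted 1-2-1.
    gy = (p[2:, :-2] + 2 * p[2:, 1:-1] + p[2:, 2:]) \
        - (p[:-2, :-2] + 2 * p[:-2, 1:-1] + p[:-2, 2:])
    return np.hypot(gx, gy)


def binarize(edges, threshold):
    """Binary edge mask: True where the edge strength reaches the threshold."""
    return edges >= threshold


# Synthetic thermal frame: a warm 3 x 3 block, 10 units above a flat background.
frame = np.zeros((9, 9))
frame[3:6, 3:6] = 10.0
edge_mask = binarize(sobel_biaxial(frame), threshold=20.0)
print(edge_mask.astype(int))
```

The mask forms a ring around the warm block, while the uniform interior and the background stay empty; this ring is what the contouring step then traces.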

Object Detection and Sorting
To eliminate wrongly drawn contours, size-temperature filtering was used. The mean body length of the target animal was approximately 0.9 m, and the top-view area was approximately 4500 cm². However, the body of the animal is fully covered with thick, curly fur, so its body heat is not shown clearly in the thermal image, making the animal look smaller than normal. Hence, the area filter was set to detect contours smaller than 3500 cm² and larger than 100 cm². The minimum criterion was set much smaller than the common size of the target to find segmented body parts such as overlapping or partial targets. Additionally, the size of drawn contours can be small because of the animal's body shape. Therefore, to obtain a high probability of animal detection, the area filter was set with a large range.
For contours that passed the area filter, a centroid temperature filter was then applied. The maximum and minimum body temperatures of the targets measured from 25 m above the animal's body were nearly 20 °C and 10 °C, respectively. Therefore, the filter was set to find contours warmer than 9 °C to ensure that every target was retained. In addition, the measured temperature changes with shooting height. To minimize this error, we corrected the temperature by height. The shooting height and maximum temperature of the targets are linearly related (Figure 8). Temperature filtering was adapted using Equation (1). The main target animal in this study was the alpaca. Therefore, size-temperature filtering was designed and adapted to this species. However, this object detection and sorting method can be adapted to target other species by changing the filter criteria.
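The size-temperature filter described above can be sketched as a single predicate. The area limits, the 9 °C threshold, and the 2.24 cm/pixel reference GSD come from the text. The height correction of the threshold is represented by a hypothetical parameter, `temp_slope_c_per_m`, because the coefficients of the linear relation are not reproduced here; it defaults to zero, meaning no correction.

```python
def passes_size_temperature_filter(
    area_px,
    centroid_temp_c,
    height_m,
    gsd_ref_cm=2.24,           # thermal GSD at 25 m (from the text)
    ref_height_m=25.0,
    min_area_cm2=100.0,
    max_area_cm2=3500.0,
    min_temp_c=9.0,            # threshold at the 25 m reference height
    temp_slope_c_per_m=0.0,    # hypothetical: threshold change per metre of height
):
    """Return True if a contour is plausibly a target animal."""
    # Convert the contour area from pixels to square centimetres at this height.
    gsd_cm = gsd_ref_cm * height_m / ref_height_m
    area_cm2 = area_px * gsd_cm ** 2
    # Shift the temperature threshold linearly with height.
    temp_threshold = min_temp_c + temp_slope_c_per_m * (height_m - ref_height_m)
    return min_area_cm2 < area_cm2 < max_area_cm2 and centroid_temp_c > temp_threshold


# A 400-pixel contour at 25 m covers about 2007 cm2 and has a warm centroid.
print(passes_size_temperature_filter(area_px=400, centroid_temp_c=15.0, height_m=25))
```

Adapting the method to another species would amount to changing the area limits and the temperature threshold passed to this function.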

Input Image Generation
As mentioned previously, six kinds of input images were used for the automated detection method (Figure 9). These input images were generated to enhance the detection ability, shorten the detection time, and determine which type of input image produces the most accurate detection performance. The six kinds of input images were corrected RGB images, original thermal images, thermal images corrected for fur color, thermal images with masked unnatural colors, corrected RGB images × original thermal images, and corrected RGB images × all correction-applied thermal images.
After Sobel edge creation and image binarization, the thermal and RGB images were processed by contour and centroid generation, size-temperature filtering, and target counting. For the combined input types, the thermal and RGB images were combined after image binarization, and each combined image could be used to generate contours corresponding to those of the two kinds of images. Therefore, combined images allowed for a more accurate detection ability.
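One reading of the "×" notation is a pixel-wise product of the two binarized edge maps, which keeps only pixels marked in both images. The sketch below illustrates that interpretation; it is an assumption about how the combination works, and the function name is hypothetical.

```python
import numpy as np


def combine_binary(rgb_edges, thermal_edges):
    """Pixel-wise product of two binary edge maps of identical shape."""
    rgb_edges = np.asarray(rgb_edges, dtype=bool)
    thermal_edges = np.asarray(thermal_edges, dtype=bool)
    if rgb_edges.shape != thermal_edges.shape:
        raise ValueError("edge maps must be co-registered and equally sized")
    # For binary maps, multiplication is equivalent to a logical AND.
    return (rgb_edges & thermal_edges).astype(np.uint8)


rgb_map = [[1, 1], [0, 0]]
thermal_map = [[1, 0], [1, 0]]
print(combine_binary(rgb_map, thermal_map))
```

Under this reading, an RGB edge with no thermal counterpart (such as a rock or a roof outline at background temperature) drops out of the combined image.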

Results
The automated detection results obtained using the proposed method were categorized based on shooting height and target shape, i.e., isolated, bordering, overlapping, or partial. The detection recall, precision, and time were also analyzed.
The detection results were assessed based on the detection precision and detection recall rate (Figure 10). The detection precision was calculated as the number of real animals among the automatic detections divided by the total number of detections. The detection recall rate was the number of real animals among the automatic detections divided by the number of animals in the image. These two values have the same range, from 0 to 1, and higher values indicate higher detection ability. To compare the detection precision and detection recall rates of the six kinds of images, the number of targets in each image was counted manually (Figure 11), and targets were labeled according to their shape category (i.e., isolated, bordering, overlapping, or partial). The number of targets in each of the 26 individual original images was about 40. However, for every type of input image, the numbers of targets detected tended to decrease with increased shooting height. At shooting heights greater than 100 m, fewer than 10 targets could be detected in each type of image, and at heights greater than 125 m, fewer than five targets could be detected.
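The two measures defined above can be written directly as functions. The counts in the usage lines are made-up numbers chosen for illustration and are not results from this study.

```python
def detection_precision(true_detections, total_detections):
    """Share of automatic detections that are real animals."""
    return true_detections / total_detections if total_detections else 0.0


def detection_recall(true_detections, animals_in_image):
    """Share of the animals in the image that were detected."""
    return true_detections / animals_in_image if animals_in_image else 0.0


# Hypothetical image: 40 animals present, 25 detections, 20 of them real animals.
print(detection_precision(20, 25))   # 0.8
print(detection_recall(20, 40))      # 0.5
```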
Based on the results of manual and automatic counting, the detection precision and recall rate were evaluated. As the detection precision decreased dramatically above a height of 100 m, we focused on detection results at shooting heights lower than 100 m (Table 1). The total number of targets was 316, consisting of 56 isolated targets, 243 bordering targets, 17 overlapping targets, and three partial targets. Of the 316 targets, five were ostriches.
When only RGB images were used for detection, the detection recall rate was 0.367, and the detection precision was 0.013. RGB images cannot be subjected to temperature filtering. Therefore, false-positive detection results such as soil, rocks, roofs, and roads could not be eliminated. This uncertainty in sorting led to the poor detection precision. When the input image contained thermal information, the detection precision and recall rate were higher. In particular, compared with the RGB-only detection results, the precision increased by at least 50-fold. The original thermal images and the two types of corrected thermal images produced similar precision results of approximately 0.8. However, detection recall increased by approximately 20% when images corrected for fur color and temperature were used. Moreover, there were two types of combined thermal and RGB images. The first type was created by multiplying a corrected RGB image with the original thermal image, and the second type was obtained by multiplying a corrected RGB image with the all-corrections-applied thermal image. The detection recall rate using these images exceeded 0.6, and the detection precisions were 0.200 and 0.804, respectively. Thus, when corrected thermal images were used, the detection precision was approximately four-fold higher.
Use of the six types of input images resulted in different detection times. To calculate the detection time of an individual image by image type, the detection times of the 26 images shot across the height range were summed, and then the sum was divided by 26. After repeating this process 50 times, the average detection time was calculated (Table 2). Of the four processing methods used in Google Colab, parallel processing had the fastest detection time for all image input types. The image type associated with the fastest detection time was the thermal image with unnatural color removal, which had a detection time of 0.033 s. The image type associated with the slowest detection time was the corrected RGB image × corrected thermal image combined image, whose detection time was approximately three times longer than the fastest time. In general, when the input image had RGB channels, more detection time was required. This occurred because the detection method had to consider more channels and because the large numbers of errors associated with RGB images prolonged the true-false decision-making time of the method. Converting the detection time to frames per second (FPS), the input images including RGB channels could be processed at approximately 9 FPS, and the other input images could be processed at 25-30 FPS.
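The timing procedure can be sketched as a small harness. Here `detect` stands for any detection function and is a placeholder, not the implementation used in this study; the harness times each pass over the image set as a whole, which equals the summed per-image time plus negligible loop overhead.

```python
import time


def mean_detection_time(detect, images, repeats=50, clock=time.perf_counter):
    """Average per-image detection time over several repeated passes."""
    per_image_times = []
    for _ in range(repeats):
        start = clock()
        for image in images:
            detect(image)
        # Total time for this pass divided by the number of images (26 in the text).
        per_image_times.append((clock() - start) / len(images))
    return sum(per_image_times) / len(per_image_times)


# Placeholder detector and 26 dummy "images", for illustration only.
average = mean_detection_time(lambda image: sum(image), [[1, 2, 3]] * 26)
print(f"{average:.6f} s per image")
```

The `clock` argument is there only so that the timer can be swapped out, for example when testing the harness itself.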

Detection Precision and Recall
For wildlife detection, detection precision and recall are fundamentally important. The 26 images shot across the height range were used to generate six kinds of input images. The same detection method was used for each of these input images, and it detected between one-third and two-thirds of the targets. When the input image had a thermal channel, the maximum detection precision increased approximately two-fold. Additionally, a detection recall rate of 0.699 was obtained when using the corrected RGB image and corrected thermal image together.
Previous studies have shown a diverse range of detection precision (Table 3). A method applied for hippopotamus detection performed best [30]. Studies of cattle [47], monkeys [48], and white-tailed deer [34] detected between 60% and 70% of their targets. Fur seal [49] and human [32] studies detected approximately 40% of their targets. Considering the differences in site environment, target size and shape, and thermal image shooting conditions, the detection method proposed here has above-average performance. Among the previous studies, only the white-tailed deer study [34] provided detection precision and recall results. As was found here, the previous study found very large numbers of false-positive detection results when RGB images were used as input images. This previous study used unsupervised pixel-based and object-based methods. When the unsupervised pixel-based classification method was used with RGB images, the detection precision was 0.046; when the object-based method was used, it was 1.0. However, the detection recall of the two methods had the same value of 0.484. Thus, according to this result, the object-based method did not detect more targets compared with the unsupervised pixel-based method. However, the method proposed here increased both the detection recall rate and detection precision by using different kinds of input images.

Instant Detection
Detection time is also a major factor in wildlife detection. To detect animals in real time, detection time is a more important factor than detection precision or recall. The government of the United States limits the capture rate of thermal video equipment for export to 9 FPS, and most products have this capture rate [50], including the thermal camera used here. To apply our method to 9 FPS videos, the detection time per frame should be no more than approximately 0.111 s (1/9 s). The full frame rate is 30 FPS; to detect animals in real time at this rate, the detection time should be no more than approximately 0.033 s (1/30 s).
The methods of previous studies based on machine learning and deep learning are difficult to use in real time, and the authors have discussed these limitations [51]. The studies listed in Table 3 did not provide detection times, and their methods require a preprocessing step, so they cannot be used in real time. A study of koalas [52] provided detection-time results. When shooting from altitudes of 20, 30, and 60 m, the detection times were 1.3, 1.6, and 2.1 s, respectively, but these times are insufficiently fast for real-time use.
With parallel processing, the fastest detection time for a single input image using the method presented here was 0.033 s, and the slowest was 0.111 s. Converted to FPS, these times correspond to 30 and 9 FPS, respectively. When the input image had only a thermal channel, the FPS range was 25-30; when the input image had RGB channels, the rate was 9 FPS. Thus, all input image types can be used to analyze exported thermal videos. With single-channel thermal input, almost every frame can be analyzed during real-time shooting.
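The conversion between detection time and frame rate is a simple reciprocal, as the following sketch shows. The function names are illustrative; the times and frame rates are those quoted in the text.

```python
def fps_from_detection_time(seconds_per_image):
    """Frames per second that can be processed at a given per-image detection time."""
    return 1.0 / seconds_per_image


def keeps_up_with(seconds_per_image, video_fps):
    """True if one image is processed within one frame interval of the video."""
    return seconds_per_image <= 1.0 / video_fps


for t in (0.033, 0.111):
    print(f"{t} s -> {fps_from_detection_time(t):.1f} FPS, "
          f"9 FPS video: {keeps_up_with(t, 9)}, 30 FPS video: {keeps_up_with(t, 30)}")
```

Both detection times fit within the 9 FPS frame interval, whereas only the fastest one fits within the 30 FPS interval.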
Furthermore, the sensor always shoots thermal and RGB images simultaneously, so both types of input image can be used according to preference. The best way to use the method developed here is to check for the presence of wildlife in a thermal image with unnatural colors removed. For the frames with a confirmed presence of wildlife, the corrected RGB image combined with the corrected thermal image can be used to clearly determine numbers and locations.

Using the Proposed Method to Supplement Previous Methods
The proposed method can detect animals regardless of color, shape, or size and does not need to generate a training dataset. This advantage reduces the total time needed for detection, and, at the same time, the method can be used to generate the training dataset itself. As a result of the automated detection process, our method marks the outline and centroid of each target. Then, instant target sorting can be used to form sets of images of detected animals, and this stacked result can be employed for ML-DL training, even while simultaneously conducting UAV surveys in the field.
This quick and in-field detection method can be used to supplement the relatively precise and advanced existing methods. Not only is our method useful for creating a training dataset but also, when a trained model is used for detection, the region of interest in the RGB image can be minimized. This areal reduction can lead to time saving.
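As an illustration of how detections could feed a training dataset, the sketch below cuts a fixed-size image chip around each detected centroid. The chip size and the function name are assumptions; the centroids would come from the contouring step.

```python
import numpy as np


def crop_chip(image, centroid_xy, size=64):
    """Cut a size x size chip centered on a detection centroid, clamped to the image."""
    h, w = image.shape[:2]
    if size > h or size > w:
        raise ValueError("chip is larger than the image")
    cx, cy = centroid_xy
    # Shift the window inward when the centroid lies close to an image border.
    left = int(min(max(round(cx) - size // 2, 0), w - size))
    top = int(min(max(round(cy) - size // 2, 0), h - size))
    return image[top:top + size, left:left + size]


thermal_frame = np.zeros((512, 640))                     # placeholder thermal image
centroids = [(320.0, 256.0), (5.0, 500.0)]               # hypothetical detections (x, y)
chips = [crop_chip(thermal_frame, c) for c in centroids]
print([chip.shape for chip in chips])
```

The same windows, scaled to RGB coordinates, could also serve as the reduced regions of interest passed to a trained model.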

Utility of Thermal Sensors
The use of thermal sensors provides several benefits for wildlife detection, especially time saving in the detection process [53], enhanced detection performance [32,34], and wider application across many species. However, thermal sensors still have limitations and drawbacks. An important limitation is that thermal sensors cannot sense through obstacles such as tree canopies, hideouts, and bushes. RGB image-based detection methods also have this limitation. However, thermal images can be used to detect the body temperature of a camouflaged target and may have an advantage over RGB images.
Another more critical drawback is the sparse resolution of the thermal camera. Compared with an RGB image, the image generated from a thermal sensor has approximately 40-fold fewer pixels and one-quarter of the coverage area. Furthermore, the spatial resolution at the same shooting height is approximately four times lower. This drawback limits the shooting height for obtaining images for use in detecting wildlife. At a height of 100 m, the pixel resolution is approximately 9 cm, and at a height of 200 m, the resolution is approximately 18 cm (Figure 12). If the shooting height is higher than 100 m, a target of the size in this paper will be represented by only a few pixels, and if multiple targets are in contact with each other or overlapped, their edges become more difficult to distinguish, and blurred targets are not detected properly. In this study, a height of 100 m seemed to be the maximum height for significant detection of wildlife. If the target size or shape was different or a high-quality thermal sensor could be used, the maximum height would be higher.
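The effect of height on target size in the image can be estimated from the figures given in this paper (2.24 cm/pixel at 25 m, a body length of about 90 cm, and a top-view area of about 4500 cm²). The sketch assumes that resolution scales linearly with height and ignores the apparent shrinkage caused by the animals' fur, so the real pixel counts would be somewhat lower.

```python
def pixels_on_target(height_m, body_length_cm=90.0, top_area_cm2=4500.0,
                     gsd_ref_cm=2.24, ref_height_m=25.0):
    """Approximate thermal pixels along the body and over the top-view area."""
    gsd_cm = gsd_ref_cm * height_m / ref_height_m
    return body_length_cm / gsd_cm, top_area_cm2 / gsd_cm ** 2


for h in (25, 100, 200):
    length_px, area_px = pixels_on_target(h)
    print(f"{h} m: about {length_px:.0f} pixels long, {area_px:.0f} pixels in area")
```

Under these assumptions a target spans roughly 10 pixels in length at 100 m and only about 5 pixels at 200 m, which is consistent with the loss of detectable targets at greater heights.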

Method Overview
Our method overcomes three limitations of previous studies: it can detect target animals in real-time with minimal data preprocessing, use two types of images for advanced detection ability, and be applied in diverse situations.
Size-temperature filtering enables our method to be applied to different species and land cover types. However, a lack of data means that further validation of its applicability to different land cover types is necessary. In addition, shooting altitude remains a limitation, similar to previous methods. Another drawback of the proposed method is that it cannot merge acquired images to minimize preprocessing and reduce detection time. However, although merging allows for more rapid detection of targets, partial targets might not be detected accurately. This could be overcome by using slower UAV flight speeds and higher frame rates.

Conclusions
This paper developed a new method for detecting animals using thermal and RGB images. The maximum detection precision was 0.804, and the recall rate was 0.699. The major improvement in detection time enables real-time usage.
This method has two limitations. First, the environments and conditions, such as the detection target, shooting time, and land cover, were not diverse when the raw data were acquired, so the method was developed using only limited data. Nevertheless, the method might be applicable to many species and circumstances. Second, the sparse resolution of the thermal sensor restricts the shooting height. Nonetheless, if high-resolution thermal images are used, the method may be able to detect smaller targets, and it may be possible to fly the UAV at a higher altitude.
The focus of future work will be to diversify the target species and shooting conditions to clarify how versatile a thermal sensor-mounted UAV system would be in conducting wildlife surveys. Using these data, an advanced method that can detect targets at greater heights will be proposed.