1. Introduction
Human–wildlife interactions, including conflict, are increasingly common because expanding urbanization worldwide creates more opportunities for people to encounter wildlife [1]. Consequently, compensation related to human–wildlife conflicts from 1980 to 2015 amounted to USD 222 million across 50 countries; livestock losses accounted for the majority, followed by crop damage [2]. Human–wildlife conflict is a serious problem at the blurred boundary between urban areas and wildlife habitats in Japan. Owing to the increase in the deer population, the damage to agriculture by sika deer (Cervus nippon) was approximately USD 40 million in 2020 (USD 1 = JPY 140) [3]. Wildlife conservation and management are required to solve this problem. The adaptive management of wildlife, a systematic approach for improving resource management by learning from management outcomes [4], is essential. Adaptive management optimizes outcomes through a cycle of formulating protection and culling plans, implementing measures, and tracking changes in wildlife populations based on population indices. However, population information is insufficient for large mammals, which are crepuscular animals with large home ranges [5]. To resolve this, remote sensing images have been used to estimate changes in wildlife populations. However, identifying wildlife in remote sensing images is difficult, even in open areas, because the shapes of objects viewed from above may differ markedly from the side views that humans are accustomed to. Moreover, there is the potential for oversight because substantial amounts of data must be analyzed [6]. To address this, automated wildlife detection methods for remote sensing imagery have been developed; for example, Reference [7] reviewed automated bird detection methods using remote sensing images.
A computer-aided detection of moving wild animals (DWA) algorithm was developed [8] and applied to pairs of time-difference thermal images to support the extraction of moving wild animals. This is a rule-based method that identifies moving wildlife by extracting candidate objects from each pair of thermal airborne images and comparing the candidates between the images. However, the producer accuracy was 77.3% and the user accuracy was 29.3%, which is insufficient for practical use [9]. Drones have become widespread in recent years, and obtaining high-resolution images has become relatively easy. Reference [10] used drone thermal images to detect the European hare (Lepus europaeus) by visual inspection. Deep learning has dramatically improved the accuracy of image recognition. Although training a deep learning model takes a long time, prediction is fast; deep learning is therefore generally considered easier to put into practical use than a rule-based approach in terms of processing time. Reference [11] used red, green, and blue (RGB) drone images to detect deer with a deep learning object detection model, You Only Look Once (YOLOv4); its mean average precision (mAP) was 69% when tested using only images of deer. Reference [12] fused thermal and visible images to detect white-tailed deer (Odocoileus virginianus), cows (Bos taurus), and horses (Equus caballus) as three classes with YOLOv5 and YOLOv7; the mAPs were 72%, 93%, and 99%, respectively, with YOLOv5, and 59%, 37%, and 64%, respectively, with YOLOv7.
However, because behavioral changes, such as wildlife escaping when drones approach, have been confirmed [13], drones may be unsuitable for determining population size. The use of visible and near-infrared images is limited to daytime, whereas many large mammals, such as sika deer, are crepuscular; the current study therefore used thermal images to identify wildlife in semi-dark conditions. However, few studies have been conducted on wildlife detection using thermal remote sensing images. Furthermore, it is difficult to distinguish wildlife from trees in thermal images under certain observation conditions [14,15] because the surface temperature contrast between the detection targets and the background is essential for extracting targets from thermal images. Consequently, existing studies applying thermal remote sensing images to wildlife monitoring [16,17,18] have been limited to open and cool areas [19]. Urban areas, in contrast, contain many hot spots, such as streetlights. The current study attempted to use pairs of overlapping thermal images obtained at different times to automatically extract moving wildlife: of the moving objects found in the time difference between image pairs, those smaller than cars were defined as moving wildlife. This study aims to develop a support system for extracting moving wildlife from thermal images captured from an airborne platform.
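As an illustration of this time-difference idea, a minimal sketch is shown below. It is not the algorithm evaluated in this paper: it assumes a co-registered pair of surface-temperature arrays, and the temp_thresh and max_area_px values are illustrative placeholders, not parameters from this study.

```python
import numpy as np
from scipy import ndimage

def extract_moving_hotspots(t1, t2, temp_thresh=2.0, max_area_px=200):
    """Sketch: flag hot spots that appear or disappear between two
    co-registered thermal frames and are smaller than a car."""
    diff = np.abs(t2.astype(np.float32) - t1.astype(np.float32))
    mask = diff > temp_thresh                      # candidate moving pixels
    labels, n = ndimage.label(mask)                # group pixels into blobs
    # Pixel area of each blob; blobs larger than a car are discarded
    areas = ndimage.sum(mask, labels, index=np.arange(1, n + 1))
    keep = np.flatnonzero(areas <= max_area_px) + 1
    return np.isin(labels, keep)                   # moving-wildlife candidates
```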
The two major goals were as follows:
- 1.
Classification of airborne thermal images using a deep learning classification model.
One of the proposed methods, which uses a deep-learning classification model, was applied to airborne thermal images, and its classification accuracies were evaluated.
- 2.
Detection of airborne thermal images using a deep learning object detection model.
One of the proposed methods, which uses a deep-learning object detection model, was applied to airborne thermal images, and its detection accuracy was evaluated.
The datasets used in this study are described in Section 2. The two proposed methods and the color-composite method are described in Section 3. The results of the proposed methods using the RGB drone and thermal airborne images, including the investigation of the two methods for detecting small objects in large images, are presented in Section 4. The discussion and conclusions are presented in Section 5 and Section 6, respectively.
5. Discussion
Regarding the proposed classification method, because a car is a moving hot spot, we expected that it would be difficult to distinguish from moving wildlife; however, the classification accuracy of “car” was better than that of “other.” This may be because cars are larger than wildlife, which makes them relatively easy to distinguish. Comparing Table 1 and Table 4, the classification accuracy of “without deer” using single thermal images was lower than that using color-composite images of pairs of thermal images. The cause is non-moving hot spots such as streetlights (Figure 7). These results indicate that even if training data were created, it would be difficult to separate wildlife from non-moving hot spots using single thermal images. A comparison of Table 2 and Table 4 indicates that the classification accuracies were not markedly different. Although standardization makes images easier to inspect visually, it has little impact on the classification accuracy of deep-learning models; the models probably learned to ignore differences in background temperature. Therefore, although standardization does not increase accuracy, it supports visual interpretation when creating a training dataset.
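A minimal sketch of the preprocessing discussed here is given below. It assumes that standardization means per-image z-scoring and assumes one plausible channel assignment (first frame to red, second frame to green and blue); the actual assignment is described in Section 3. With this assignment, a hot spot present only in the first frame appears red, one present only in the second appears cyan, and non-moving hot spots stay gray.

```python
import numpy as np

def standardize(img):
    # Per-image z-score, rescaled to 8 bits; the display scaling is arbitrary
    z = (img - img.mean()) / (img.std() + 1e-6)
    return np.clip(z * 64 + 128, 0, 255).astype(np.uint8)

def color_composite(t1, t2):
    # Pair of thermal frames -> 3-channel image; moving hot spots become
    # colored blobs, while static hot spots (e.g., streetlights) remain gray
    r = standardize(t1)
    g = b = standardize(t2)
    return np.dstack([r, g, b])
```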
In this study, the detection targets were extremely small (2–5 pixels), making classification a relatively difficult task. Therefore, two methods were investigated to increase the number of detection-target pixels: (1) enlarging the image bilinearly while increasing the number of neurons in the input layer, and (2) increasing the size of the detection target relative to the image by decreasing the grid size used to divide the image. The increase in classification accuracy for “with deer” when the number of neurons increased from 100 to 200 (Figure 9) was attributed to this effect. Conversely, the decrease in classification accuracy for “with deer” when the number of neurons increased from 500 to 700 was presumably because the edges of the deer region became smoother with bilinear enlargement. In other words, the optimal number of neurons in the input layer is 200–500. This study used the bilinear method; however, the results may change with other interpolation methods, so investigating the optimal method remains a future challenge. Comparing Table 3 and Table 4, the classification accuracies using color-composite images of 100 × 100 pixels were clearly greater than those using color-composite images of 224 × 224 pixels. When creating training data, humans check the movement of hot spots and their positions relative to the background and surrounding hot spots; if the image size were smaller than 100 × 100 pixels, creating the training dataset itself would not be possible. Based on these results, the classification accuracy for images containing small objects can be improved by optimizing the number of neurons in the input layer, dividing the image into grids, and making the grid size as small as practicable when creating the training data.
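The two measures could be combined as in the sketch below; tile and input_size mirror the grid size and the number of input-layer neurons per side discussed above, and a stride smaller than tile turns plain grid division into the moving-window variant mentioned in the next paragraph. The values are illustrative.

```python
import cv2

def grid_tiles(img, tile=100, stride=None):
    # stride == tile reproduces grid division; stride < tile yields a
    # moving window that avoids cutting deer at tile edges
    stride = stride or tile
    h, w = img.shape[:2]
    for y in range(0, h - tile + 1, stride):
        for x in range(0, w - tile + 1, stride):
            yield (y, x), img[y:y + tile, x:x + tile]

def enlarge_bilinear(tile_img, input_size=400):
    # Upsample so that a 2-5 px target occupies more pixels at the model
    # input; input_size plays the role of the input-layer neuron count per side
    return cv2.resize(tile_img, (input_size, input_size),
                      interpolation=cv2.INTER_LINEAR)
```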
When a deer was at the edge of an image (Figure 10a), “with deer” was not classified well. This can be solved by using a moving window instead of grid division (cf. the stride parameter in the sketch above). Figure 10b shows a case where the difference between the surface temperature of the deer and the background was small, and “with deer” was not classified well. This can be solved by shooting before sunrise or after sunset, when the gap between the surface and background temperatures increases. Another advantage of shooting at these times is that crepuscular wildlife is more active, making it easier to capture moving wildlife. The misclassification of “without deer,” in contrast, was due to failed image registration. This study used images captured at flight altitudes of 1000 and 1300 m; the case in Figure 10c can be solved by making the flight altitudes the same. In addition, the eastern side of the shooting location was hilly and the forest area was on a slope; the case in Figure 10d can be solved by image registration using a more accurate digital surface model derived from structure-from-motion or airborne light detection and ranging data.
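This study registered images using flight and terrain data; purely as an illustration, an intensity-based fallback with OpenCV's ECC alignment might look like the following sketch. The affine motion model and iteration settings are assumptions, not choices from this study.

```python
import cv2
import numpy as np

def register_pair(ref, mov):
    # Align the second thermal frame to the first with an affine ECC fit
    ref32, mov32 = ref.astype(np.float32), mov.astype(np.float32)
    warp = np.eye(2, 3, dtype=np.float32)
    criteria = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 200, 1e-6)
    _, warp = cv2.findTransformECC(ref32, mov32, warp, cv2.MOTION_AFFINE,
                                   criteria, None, 5)
    h, w = ref.shape
    return cv2.warpAffine(mov32, warp, (w, h), flags=cv2.INTER_LINEAR)
```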
For the proposed detection method, there was no statistical difference in AP between Faster R-CNN and YOLOv8. Comparing Table 5 and Table 7, there was no difference between the standardized and non-standardized images; this result matched that of the proposed classification method. Comparing Table 6 and Table 7, the APs using color-composite images of 100 × 100 pixels were clearly greater than those using color-composite images of 224 × 224 pixels, which also matched the classification results. The APs with YOLOv8n and 400 neurons in the input layer for FR1 and FR2 were 87.5% and 82.0%, respectively (Figure 12). In contrast, those with YOLOv8x were 85.5% and 76.2%, respectively. No statistical differences were observed between the two groups. In the case of YOLOv8x, the number of neurons in the input layer could not be increased beyond 400 because of VRAM capacity limitations. In such cases, selecting a smaller model and increasing the number of neurons in the input layer can increase AP.
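With the Ultralytics API, this trade-off reads as follows in outline; the dataset YAML name and training settings are hypothetical, and imgsz corresponds to the input-layer size discussed above.

```python
from ultralytics import YOLO

# A smaller model (yolov8n) leaves VRAM headroom to raise imgsz to 400,
# which increased AP more than switching to yolov8x at a capped imgsz.
model = YOLO("yolov8n.pt")
model.train(data="deer_thermal.yaml", imgsz=400, epochs=100, batch=16)
metrics = model.val()   # reports mAP on the validation split
```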
When putting the proposed methods into practical use, development should proceed while considering, in addition to the above findings, the factors clarified in previous research [8], together with the usual improvement methods for deep learning models, such as increasing the training data to cover more cases. The factors that determine whether moving wildlife can be extracted from remote sensing images have been discussed previously [8,9] and are as follows:
To automatically extract targets from remote sensing images, the spatial resolution must be finer than one-fifth of the body length of the target species; this yields two or more pure pixels that are not mixed with the background. The head and body lengths of deer are 90–190 cm, so a spatial resolution of <20 cm is ideal. To achieve this spatial resolution, shooting at an altitude of ≤500 m is ideal [9]; however, minimum flight altitudes are restricted. Therefore, high-resolution thermal sensors or fixed-wing drones that do not generate propeller noise should be used. Furthermore, fixed-wing drones can capture images more frequently than manned aircraft, and by flying several fixed-wing drones simultaneously, it is possible to capture time-difference images efficiently. However, fixed-wing drones resemble large birds; even if they make no sound, the deer may become alarmed and stop moving. Therefore, when fixed-wing drones are used, their effect on deer should be evaluated in advance. More accurate detection should be possible by obtaining higher-resolution images using these methods.
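Applying the one-fifth rule to the smallest head and body length gives the required ground sampling distance (GSD):

```latex
\[
\mathrm{GSD} \;\le\; \frac{L_{\min}}{5} \;=\; \frac{90\ \mathrm{cm}}{5} \;=\; 18\ \mathrm{cm} \;<\; 20\ \mathrm{cm}
\]
```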
Objects under tree crowns do not appear in aerial images, so the possibility of extracting moving wildlife decreases as the area of tree crowns in the image increases. Although the number of extracted moving wildlife must be corrected using the proportion of forest when estimating population size, no correction is needed when population changes are tracked using the number of extracted wildlife as a population index [8,9].
Wildlife exhibits well-defined activity patterns, such as sleeping, foraging, migrating, feeding, and resting. To identify moving wildlife, the target species must be moving when the survey is conducted. When the shooting interval is too short, the targets cannot be extracted because the movement distance within the interval must be longer than the body length. Shooting intervals should therefore be determined after surveying the movement speed of the target species during the observation period. Because the maximum walking speed of deer is 4 km/h [30] and the head and body lengths of sika deer are 90–190 cm, a shooting time difference of 2 s or more is required while the deer are moving (see the worked calculation after this paragraph). Because deer are crepuscular, it is better to shoot target areas multiple times in the early morning or evening, as in the present study [9]. Moreover, radiative cooling must be considered when determining shooting intervals. The surface temperature of hair-covered wildlife differs from the air temperature because the hair insulates body heat. Although the gap between the surface temperature of the wildlife and the background is not large, long shooting intervals cause a thermal gap between the two images owing to radiative cooling; therefore, shooting in the early morning is optimal [9]. The weather should be considered for the same reason: direct sunlight and shadows affect surface temperature patterns, and reflected sunlight increases the surface temperature. Therefore, capturing thermal images on cloudy days is recommended. This study used images captured twice; however, because RGB images have three channels, the proposed method can be applied to three shots without modification. The proposed method cannot detect non-moving animals from a pair of images. However, by capturing two images with a relatively short time difference and a third image after a longer interval, behavioral patterns such as foraging or moving can be grasped.
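The 2 s lower bound follows from requiring the movement distance to exceed the largest body length at the maximum walking speed:

```latex
\[
\Delta t \;\ge\; \frac{L_{\max}}{v_{\max}}
        \;=\; \frac{1.9\ \mathrm{m}}{4\ \mathrm{km/h}}
        \;=\; \frac{1.9\ \mathrm{m}}{1.11\ \mathrm{m/s}}
        \;\approx\; 1.7\ \mathrm{s}
        \;\Rightarrow\; \Delta t \ge 2\ \mathrm{s}
\]
```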
The classification test accuracies of “with deer” and “without deer” were >85% and >95%, respectively, whereas the AP, precision, and recall of detection were >85%. Therefore, as explained in Section 3.1, the detection accuracies of the classification models were higher than those of the detection models. Furthermore, by combining a classification model with a method that outputs activation maps, such as Grad-CAM [31], it is possible to show which objects in an image are wildlife. Users who want to count detected wildlife automatically need detection models; thus, users can select the method depending on their purpose. As mentioned in Section 3, the proposed classification and detection methods can also be combined. For example, if Faster R-CNN processes only the grids classified as “with deer” by VGG-19, multiplying the classification accuracy of “with deer” with VGG-19 by the recall of Faster R-CNN approximately matches the recall of the combined method; the calculated and measured values were 74.4% and 75.2%, respectively. Over-detection can be reduced substantially because the classification accuracy of “without deer” (the average of “car” and “other”) was 91.2%. Therefore, users can apply the proposed classification method when monitoring habitats, the proposed detection method when accurately counting deer, or the combination when monitoring increases or decreases rather than absolute numbers, depending on their objectives.
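In other words, the recall of the cascaded pipeline is approximately the product of the upstream classification accuracy and the downstream detection recall (here, the predicted 74.4% against the measured 75.2%):

```latex
\[
R_{\text{combined}} \;\approx\; A_{\text{with deer}}^{\text{VGG-19}} \times R_{\text{Faster R-CNN}}
\]
```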
The proposed methods are applicable to moving wildlife and objects larger than deer. However, when deer and wildlife of similar size appear together in an image, they are difficult to distinguish. If species identification is required in areas inhabited by wildlife of similar size (for instance, if the target animals are active during the daytime), the use of visible imagery should be considered. For small animals, airborne thermal imagery will not provide sufficient resolution without technological innovations such as higher-resolution sensors; in that case, the use of drones should be considered after assessing their impact on the target species. In this study, extraction was not affected by cars, presumably because of the size difference between deer and cars and because the car bodies had cooled to approximately 13 °C by the shooting time. Motorcycles, which are similar in size to deer, did not affect extraction because their engines and mufflers were considerably hotter. In some cases, a road-masking process may be used to remove such objects.
In this study, the effectiveness of the proposed methods was demonstrated. We will apply the proposed methods to other areas under various conditions to verify their generalization performance and work toward practical applications.