Comparison of Different Methods of Animal Detection and Recognition on Thermal Camera Images

Popek, Łukasz; Perz, Rafał; Galiński, Grzegorz

doi:10.3390/electronics12020270

Open Access Communication

by Łukasz Popek ¹,*, Rafał Perz ² and Grzegorz Galiński

¹ Faculty of Electronics and Information Technology, Warsaw University of Technology, 00-665 Warsaw, Poland

² Faculty of Power and Aeronautical Engineering, Warsaw University of Technology, 00-665 Warsaw, Poland

* Author to whom correspondence should be addressed.

Electronics 2023, 12(2), 270; https://doi.org/10.3390/electronics12020270

Submission received: 28 November 2022 / Revised: 28 December 2022 / Accepted: 29 December 2022 / Published: 5 January 2023

(This article belongs to the Special Issue Wireless Communication and Multimedia Technology – Theory and Applications)

Abstract

Most wild animals are most active at night, so observing them requires specialized equipment. Advances in technology and the availability of thermal sensing devices allow researchers to examine wildlife after dark. This study compares methods for animal detection in thermal camera images: a classical pipeline (HOG/SVM) and two deep neural networks (Faster R-CNN and YOLOv3). The methods were compared by mean Average Precision (mAP) at different Intersection over Union (IoU) thresholds, by sensitivity (Recall), and by training time. The results were evaluated on a dataset created from scratch that contains two animal families (Cervidae and Suidae). For the YOLOv3 network, the obtained mAP was above 90% for IoU > 50%; for Faster R-CNN, the obtained mAP was 87%; and for HOG/SVM, the obtained mAP was 40%. The training time for HOG/SVM was under 1 min, for YOLOv3 it was around 50 min, and for Faster R-CNN it was around 60 min. This research provides an important starting point for the creation of a ground-based system for autonomous observation of migration and population numbers, which is a unique approach in the field of wildlife research.

Keywords:

detection; deep neural networks; semantic segmentation; thermal imaging; wildlife surveillance; YOLO

1. Introduction

Increasing urbanization and the transformation of natural ecosystems result in frequent contact between humans and wild animals. Because most wild animals are active at night, the traditional approach based on RGB imaging cannot be applied: there is not enough light. The development and wide availability of thermal imaging technology has made it possible to survey and observe animals after dark. Most current research concerns wildlife inventory using drones, which allows a rough estimate of a species’ population [1]. This technique reduces the problem to localizing animal silhouettes in images and then deciding, by binary classification, whether each Region of Interest (RoI) is an animal. Unmanned aerial vehicles are also used for other environmental research, such as precision agriculture [2]. However, for estimating the range of occurrence and studying the spread of invasive species, this approach is too inaccurate. The minimum altitude from which observations can be made without frightening the animals is several tens of meters, which is too high for interspecies classification. Therefore, surveillance should be conducted from the ground. Moreover, recognizing objects in the images should be as autonomous as possible to reduce the work time of experts. This paper addresses the automatic detection of animals in images. The main goal is to compare existing methods and select the best one for further study. An accurate method for classifying animals in thermal camera images is important for animal observation and for the management of livestock populations. Vision systems capable of identifying animals are also critical for the safe movement of autonomous vehicles, where they reduce the risk of collisions [3].

There are two fundamental problems in automating object detection: locating candidate regions of interest, and transforming a multidimensional tensor into a classifiable feature vector. Two approaches are considered in this work. The first is a classical pipeline. One of the most popular ways to find candidate regions is to slide a window across the image; however, this method is very time-consuming and computationally expensive. Images from a thermal imaging camera are monochromatic, which significantly simplifies their segmentation and permits more efficient algorithmic methods. After literature analysis and testing, the Felzenszwalb algorithm [4] was chosen to search for regions of interest. A Histogram of Oriented Gradients (HOG) [5] was then computed for each region found, and in the final stage a trained Support Vector Machine (SVM) [6] assigned the classes.

The problem of identifying and classifying animals using advanced algorithms has been addressed in many studies. The paper [7] demonstrates the use of a pre-trained Faster R-CNN + InceptionResNetV2 network to classify European mammals with a 94% detection accuracy and a 71% species classification accuracy. The authors of [8] conducted a comparative analysis of pre-trained Faster R-CNN and RetinaNet models in the detection and classification of boar and deer. The mAP evaluation suggested that both models learned to detect “boar” and “deer” with average precision exceeding 25%. The paper [3] compares three types of CNN (basic CNN, VGG16, HOG + CNN) and machine learning classifiers based on thermal images and the HOG transformation. The reported accuracy indicates that the most efficient technique for detecting wild animals was the combination HOG + CNN, at 91%. The study [9] provides a novel method for classifying images under semi-supervised learning (SSL) or few-shot learning (FSL) conditions. The authors propose a solution that uses only a generator (decoder) network, trained separately for each class; it proved effective for both SSL and FSL, with improvements of 3.04% and 1.50% in average accuracy relative to reference models.

The authors of [10] present WilDect-YOLO, a detection model based on YOLOv4 for accurate real-time detection of endangered wildlife. WilDect-YOLO integrates DenseNet blocks to improve the preservation and reuse of critical feature information, and it detects eight different endangered wildlife species under various complex and challenging environments. At a detection rate of 59.20 FPS, WilDect-YOLO achieved mAP, F1-score, and precision values of 96.89%, 97.87%, and 97.18%, respectively. A real-time object detection model based on the YOLOv4 algorithm was presented in paper [11], which described detecting four different diseases in tomato plants under various challenging environments. At a detection rate of 70.19 FPS, the proposed model obtained a precision value of 90.33%, an F1-score of 93.64%, and a mean average precision (mAP) value of 96.29%.

The study [12] presented a highly accurate K-complex detection system. Efficiency evaluation was performed using the deep transfer learning feature extraction model, and the result indicated consistently high values (i.e., up to 99.8% precision and 0.2% miss rate) over different testing scenarios.

A highly accurate K-complex detection system based on multiple convolutional neural networks and YOLOv3 was described in the paper [13]. The model achieved consistently high precision (89.84–99.44%) and a miss rate of 10.41–0.55%.

In parallel, popular convolutional neural network architectures were tested. A transfer learning approach was used: the pretrained YOLOv3 [14] and Faster R-CNN [13,15] models were fine-tuned to adapt the network parameters to the specific problem. The dataset on which the results were evaluated was created from scratch. The results of the two approaches were compared based on mean Average Precision (mAP) for different Intersection over Union (IoU) thresholds and on sensitivity (Recall).

2. Materials and Methods

2.1. Dataset

It should be emphasized that the main problem in building effective detection models is obtaining a correctly labelled dataset. For the present problem, no publicly available materials met the requirements. For this reason, a dataset containing thermal images of two families of animals was created: cervids (Cervidae), which included images of red deer (Cervus elaphus), European roe deer (Capreolus capreolus), and fallow deer (Dama dama), denoted hereafter as the “deer” class, and suids (Suidae), which mainly included images of European wild boar (Sus scrofa), denoted hereafter as the “wild_boar” class.

A Pulsar Helion 2 XP50 PRO camera was used to create the dataset. Part of the data was obtained from private collections courtesy of Polish hunters. The LabelImg program [16], which generates XML files with bounding box parameters, was used to mark objects in the images. The results presented in this paper were obtained on a dataset of 400 monochrome images: 250 containing objects from the “deer” class and 150 from the “wild_boar” class. Example images are shown in Figure 1. Every image contained at least one object instance, and usually more than one. Across the whole dataset, the “deer” category contained 332 subjects (different animals), while “wild_boar” had 561. The dataset was split into a training part (85% of the images), used with 6-fold cross-validation in the learning process, and a test part (the remaining 15% of the images), which was kept out of training and used to measure the final performance of the method.
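
The annotation files mentioned above can be read with a few lines of standard-library code. The following is a minimal sketch, assuming the annotations follow the Pascal VOC XML layout that LabelImg writes by default; the file name, image size, and box coordinates in the sample are hypothetical and are not taken from the dataset.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical annotation in the Pascal VOC layout (illustrative values only).
SAMPLE_XML = """
<annotation>
  <filename>frame_0001.png</filename>
  <size><width>640</width><height>480</height><depth>1</depth></size>
  <object>
    <name>deer</name>
    <bndbox><xmin>120</xmin><ymin>200</ymin><xmax>260</xmax><ymax>330</ymax></bndbox>
  </object>
  <object>
    <name>wild_boar</name>
    <bndbox><xmin>400</xmin><ymin>250</ymin><xmax>520</xmax><ymax>340</ymax></bndbox>
  </object>
</annotation>
"""


def parse_annotation(xml_text):
    """Return a list of (class_name, (xmin, ymin, xmax, ymax)) tuples."""
    root = ET.fromstring(xml_text.strip())
    boxes = []
    for obj in root.findall("object"):
        name = obj.findtext("name")
        bndbox = obj.find("bndbox")
        coords = tuple(
            int(bndbox.findtext(tag)) for tag in ("xmin", "ymin", "xmax", "ymax")
        )
        boxes.append((name, coords))
    return boxes


boxes = parse_annotation(SAMPLE_XML)
# Count instances per class, e.g. to check per-class totals for a dataset.
class_counts = Counter(name for name, _ in boxes)
print(boxes)
print(class_counts)
```

Summing such per-file counters over all annotation files is one way to obtain per-class instance totals like those reported above.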

2.2. HOG/SVM

The flowchart in Figure 2 shows the algorithm depicting the detection process using classical methods.

The first step was preprocessing the images, i.e., removing artifacts and symbols that different cameras leave in the images. The next step was to find regions of interest containing potential objects. Since the thermal camera images are monochromatic, segmentation based on the Felzenszwalb algorithm, which groups pixels of similar brightness, was used. The result of semantic segmentation with the Felzenszwalb algorithm is depicted in Figure 3.

The segments thus determined were then converted into bounding rectangles based on their extreme pixels. A first selection was then performed: elements that were too small (less than 100 pixels) were rejected, as were those whose length-to-width ratio deviated significantly from 1 (animals in the images fit into an envelope that is close to a square). The selected objects were first extended to equal length and width and then scaled to 128 × 128 pixels. Since the nature of the images allows object edges to be captured unambiguously, the feature vector for subsequent classification was calculated from a histogram of oriented gradients. This technique counts the occurrences of each brightness gradient orientation in localized portions of the image. As a result, a given area of the image can be represented as a one-dimensional vector, which makes it possible to use a support vector machine as a classifier. To train the classifier, 30 examples from each class were specially prepared. The Scikit-Image library [17] was used to implement the Felzenszwalb, HOG, and SVM algorithms. The final step was the evaluation and visualization of the results to assess the performance of the entire method. The tests were performed on the CPU (Intel(R) Core(TM) i5-8250U) of an Acer Nitro AN515-31 running the Windows 11 operating system.
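
The region selection and squaring steps described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the 100-pixel minimum comes from the text and is applied here to the bounding-box area, while the side-ratio limit of 1.5 and the helper names `keep_region` and `to_square` are hypothetical.

```python
MIN_AREA = 100        # from the text: elements under 100 pixels are rejected
MAX_SIDE_RATIO = 1.5  # assumed limit; the text only says "close to a square"


def keep_region(box):
    """Accept a box (xmin, ymin, xmax, ymax) that is large and square enough."""
    xmin, ymin, xmax, ymax = box
    width, height = xmax - xmin, ymax - ymin
    if width <= 0 or height <= 0:
        return False
    if width * height < MIN_AREA:
        return False
    return max(width, height) / min(width, height) <= MAX_SIDE_RATIO


def to_square(box, img_w, img_h):
    """Grow the shorter side around the centre, shifting to stay in the image.

    Assumes the longer side of the box fits inside the image.
    """
    xmin, ymin, xmax, ymax = box
    side = max(xmax - xmin, ymax - ymin)
    cx = (xmin + xmax) / 2
    cy = (ymin + ymax) / 2
    x0 = int(round(cx - side / 2))
    y0 = int(round(cy - side / 2))
    x0 = max(0, min(x0, img_w - side))
    y0 = max(0, min(y0, img_h - side))
    return (x0, y0, x0 + side, y0 + side)


# Hypothetical candidates: too small, too elongated, and acceptable.
candidates = [(10, 10, 15, 15), (0, 0, 200, 20), (100, 100, 160, 150)]
kept = [to_square(b, 640, 480) for b in candidates if keep_region(b)]
print(kept)
```

Each surviving square crop would then be resized to 128 × 128 pixels with an image library before the HOG descriptor is computed.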

2.3. Convolutional Neural Networks

In recent years, approaches based on deep neural networks have gained wide popularity for object analysis in digital images. Most currently used solutions rely on fine-tuning the parameters of networks pretrained on huge datasets. This approach yields good results even with a small number of training examples. It should be noted, however, that such models are most often trained on ordinary RGB images, so applying them to images with other characteristics can fail. This study uses two popular architectures: Faster R-CNN and YOLOv3.

Faster R-CNN is a network that passes an image through multiple convolutional layers to extract features, then uses a Region Proposal Network (RPN) to find areas that are potential objects, and finally classifies them and determines the envelope (bounding box). During the tests, the Faster R-CNN implementation from the Detectron2 Model Zoo library [18] was used with the R_50_FPN backbone. The StandardRPNHead was used with an anchor number of 4 and 3 aspect ratios.
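
To illustrate what anchors with several aspect ratios look like, the sketch below generates anchor shapes of equal area for a set of aspect ratios and places them at one location. The anchor size of 64 pixels, the ratios 0.5, 1.0, and 2.0, and the function names are assumptions chosen for illustration; they are not the configuration reported in this paper.

```python
import math


def anchor_shapes(sizes, aspect_ratios):
    """Return (width, height) pairs; each size s yields anchors of area s*s."""
    shapes = []
    for size in sizes:
        area = float(size) ** 2
        for ratio in aspect_ratios:  # ratio is height / width
            width = math.sqrt(area / ratio)
            height = ratio * width
            shapes.append((width, height))
    return shapes


def anchors_at(cx, cy, shapes):
    """Centre every anchor shape on one feature-map location (cx, cy)."""
    return [
        (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2) for w, h in shapes
    ]


# Assumed example values: one anchor size and three aspect ratios.
shapes = anchor_shapes(sizes=[64], aspect_ratios=[0.5, 1.0, 2.0])
boxes = anchors_at(320, 240, shapes)
for box in boxes:
    print([round(v, 1) for v in box])
```

The RPN scores each such anchor as object or background and regresses offsets from it, so a square-like animal silhouette is matched mainly by the anchor with ratio 1.0.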

YOLOv3 skips the separate search for regions of interest. A grid is generated for the input image, and the envelopes are then matched to its cells. In contrast to a sliding window over the extracted feature maps, this approach gives the YOLO model a holistic view of the image and helps it avoid false positives on the background. For the tests in this paper, a pretrained YOLO architecture from the Darknet library was used [19].
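
The grid-based matching can be sketched with the box parameterization from the YOLOv3 paper [14], in which the network predicts offsets relative to a grid cell and an anchor prior. The 416-pixel input size, the anchor of 116 × 90 pixels, and the function names below are assumptions for illustration only, not settings reported here.

```python
import math


def responsible_cell(cx, cy, img_size, stride):
    """Return the (column, row) of the grid cell containing a box centre."""
    grid = img_size // stride
    col = min(int(cx // stride), grid - 1)
    row = min(int(cy // stride), grid - 1)
    return col, row


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def decode_box(tx, ty, tw, th, col, row, anchor_w, anchor_h, stride):
    """Turn raw outputs into a box centre and size in input-image pixels."""
    bx = (sigmoid(tx) + col) * stride  # centre stays inside its grid cell
    by = (sigmoid(ty) + row) * stride
    bw = anchor_w * math.exp(tw)       # anchor prior scaled by the network
    bh = anchor_h * math.exp(th)
    return bx, by, bw, bh


# Assumed 416 x 416 input; strides 32, 16, and 8 give the three grid sizes.
grid_sizes = [416 // s for s in (32, 16, 8)]
cell = responsible_cell(208, 100, img_size=416, stride=32)
box = decode_box(0.0, 0.0, 0.0, 0.0, cell[0], cell[1], 116, 90, stride=32)
print(grid_sizes, cell, box)
```

With all raw outputs at zero, the decoded box sits at the centre of its cell and has exactly the anchor's size, which shows why the anchors act as priors that the network only has to adjust.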

It should be noted that the architectures were used almost entirely with default parameters, such as the number of anchors and the location of the feature extraction heads. The reason is that this is a first approach to the problem, and the aim was to select an optimal method rather than to optimize a specific one. Since training a network is computationally very expensive, the Google Colab application [20] was used. It allows the free use of virtual machines containing GPUs and enables the rapid training of implemented solutions. The hardware used for training was an NVIDIA V100 Tensor Core GPU.

3. Results

Tests were conducted for different configurations of network tuning hyperparameters. The main variables were the learning rate (Lr), which regulates the rate of change of coefficients within the network, and the number of iterations, which determines the learning time of the network. The initial hyperparameter values were determined experimentally. Metrics measuring the precision and sensitivity of the model were used to assess detection accuracy. The general formula for precision averaged over classes is:

\mathrm{mAP}_{t} = \frac{1}{n} \sum_{i = 1}^{n} \frac{TP_{i}}{TP_{i} + FP_{i}}

where t is the minimum overlap (IoU) between the envelope determined by the model and the reference envelope for the object to count as correctly found, n is the number of classes, TP_i is the number of correctly determined objects of class i, and FP_i is the number of false positives for class i. For sensitivity, the formula takes the form:

\mathrm{Recall} = \frac{1}{n} \sum_{i = 1}^{n} \frac{TP_{i}}{TP_{i} + FN_{i}}

with the difference that FN stands for the number of false negative cases. The dataset was divided into three parts: training (70%), validation (20%), and test (10%). The test part was utilized only for determining the final score of the method. The final results are shown in Table 1.
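
The two formulas can be evaluated as in the following sketch, which follows them as written in this section (class-averaged precision and recall at a single IoU threshold t). The greedy one-to-one matching rule, the function names, and the toy boxes are assumptions for illustration and do not reproduce the evaluation code used for Table 1.

```python
def iou(a, b):
    """Intersection over Union of two boxes (xmin, ymin, xmax, ymax)."""
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def count_matches(preds, truths, t):
    """Greedy one-to-one matching for one class; returns (TP, FP, FN).

    Assumes preds are already sorted by decreasing confidence.
    """
    unmatched = list(truths)
    tp = 0
    for pred in preds:
        best = max(unmatched, key=lambda g: iou(pred, g), default=None)
        if best is not None and iou(pred, best) >= t:
            unmatched.remove(best)
            tp += 1
    return tp, len(preds) - tp, len(unmatched)


def class_averaged(per_class, t):
    """Average TP/(TP+FP) and TP/(TP+FN) over the n classes."""
    precisions, recalls = [], []
    for preds, truths in per_class.values():
        tp, fp, fn = count_matches(preds, truths, t)
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    n = len(per_class)
    return sum(precisions) / n, sum(recalls) / n


# Toy data (hypothetical boxes): class -> (predictions, reference boxes).
per_class = {
    "deer": ([(0, 0, 10, 10), (50, 50, 60, 60)],
             [(0, 0, 10, 10), (100, 100, 110, 110)]),
    "wild_boar": ([(0, 0, 10, 10)], [(0, 0, 10, 8)]),
}
precision_50, recall_50 = class_averaged(per_class, t=0.5)
print(precision_50, recall_50)
```

In the toy data, "deer" has one hit, one false positive, and one miss, while "wild_boar" has a single hit with an IoU of 0.8, so both averages come out at 0.75.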

The visualization in Figure 4 and Figure 5 shows the generation of a bounding box around the silhouettes of the animals found and the probability that they belong to a given class.

4. Discussion

As can be seen, although the SVM is by far the quickest to train, the classical approach falls well short of both neural networks on the detection metrics. The main problem observed when visualizing the detection results is the non-uniform brightness of animal silhouettes, which can be seen clearly in Figure 6. Because of the characteristics of the images and of the animals themselves, the brightness of the body is inhomogeneous in the image. Changing the brightness of the image during preprocessing does not affect the quality of the results. Setting different binarization thresholds often distorts animal silhouettes to such an extent that they become unrecognizable. Moreover, enhancing image features, e.g., by morphological operations or filtering, also fails to have the desired effect due to the nature of the images: the pixels, especially in the background, are arranged in the shape of isotherms. Although these zones cannot be distinguished with the naked eye, the above-mentioned operations amplify them, which ultimately degrades image quality.

Another problem is the detection of animals standing in a compact group, so that their silhouettes overlap. Segmentation based on the Felzenszwalb algorithm groups pixels by color (here, brightness). The result therefore resembles semantic segmentation rather than instance segmentation. When two animals stand next to each other in the image, the algorithm cannot mark the border between their silhouettes. This phenomenon can be seen in Figure 7.
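
The merging effect can be reproduced on a toy example. The sketch below uses thresholded connected-component labelling as a simplified stand-in for brightness-based grouping; it is not the Felzenszwalb algorithm, and the tiny images and threshold are hypothetical.

```python
from collections import deque


def count_bright_regions(image, threshold):
    """Count 4-connected regions of pixels with brightness >= threshold."""
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    regions = 0
    for r in range(rows):
        for c in range(cols):
            if image[r][c] < threshold or seen[r][c]:
                continue
            regions += 1  # new region found; flood-fill all of its pixels
            seen[r][c] = True
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not seen[ny][nx]
                            and image[ny][nx] >= threshold):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
    return regions


# Two warm silhouettes separated by a cold background column.
apart = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 9, 9, 0, 9, 9, 0],
    [0, 9, 9, 0, 9, 9, 0],
    [0, 0, 0, 0, 0, 0, 0],
]
# The same silhouettes touching: the gap column is now warm as well.
touching = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 9, 9, 9, 9, 9, 0],
    [0, 9, 9, 9, 9, 9, 0],
    [0, 0, 0, 0, 0, 0, 0],
]
print(count_bright_regions(apart, 5), count_bright_regions(touching, 5))
```

Once the silhouettes touch, any grouping driven only by pixel similarity returns a single region, which is why one bounding box ends up covering both animals.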

The neural networks coped correctly with the task of finding the two given classes of objects. The main source of errors lowering the final metric values is the separation of multiple animal silhouettes. In contrast, classification errors practically do not occur, nor do false positives or detections that mark only the brighter part of a silhouette. The only drawback remains time: training takes about an hour, during which about 1000 iterations of network training are performed. It should also be noted that the “deer” class had higher detection precision and sensitivity than the “wild_boar” class.

One of the main problems that remains unsolved is finding a method to distinguish between the different species of the deer family. This is because the silhouettes are morphologically similar, and for different distances from the acquisition point the main differentiating factor, which is the size of the animal, cannot be captured. The method used in the work [19] is under consideration, which will use an appropriately trained architecture depending on the degree of silhouette coverage of the image.

5. Conclusions

In the course of the work, the basic goal of selecting the optimal method for further development of animal detection in thermal camera images was achieved. The neural networks coped correctly with classifying and finding objects in images of this nature, which cannot be said of the method based on HOG/SVM.

In the long run, the problem should be extended to other classes. Capturing the differences between animals of the same family but of different species (e.g., roe deer vs. red deer) can be a severe problem. Morphological differences such as coloration are impossible to capture with a thermal camera, and size, in the absence of a reference for comparison, may not be sufficient for correct species distinction. Another development path is to refine detection so that it can analyze video, for example, footage from camera traps placed at feeding grounds. To enhance security or for educational purposes, e.g., for nocturnal human activities such as the work of uniformed services at night or hunting, it is also possible to implement a module that recognizes objects in real time on such a camera.

Author Contributions

Conceptualization, Ł.P.; methodology, Ł.P. and G.G.; software, Ł.P. and R.P.; validation, Ł.P. and G.G.; formal analysis, G.G.; investigation, Ł.P.; resources, Ł.P. and R.P.; data curation, writing—original draft preparation, Ł.P.; writing—review and editing, Ł.P. and G.G.; visualization, Ł.P.; supervision, G.G.; project administration, R.P.; funding acquisition, R.P. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Ethical review and approval were waived for this study because no animals were harmed during the research.

Informed Consent Statement

Not applicable.

Data Availability Statement

The project is being developed commercially—for this reason access to the material, particularly the dataset, is limited at this stage. For further information or to collaborate, please write to popek.luka@gmail.com.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Witczuk, J.; Pagacz, S.; Zmarz, A.; Cypel, M. Exploring the feasibility of unmanned aerial vehicles and thermal imaging for ungulate surveys in forests - Preliminary results. Int. J. Remote Sens. 2018, 39, 5504–5521.
2. Perz, R.; Wronowski, K. UAV application for precision agriculture. Aircr. Eng. Aerosp. Technol. 2018, 91, 257–263.
3. Munian, Y.; Martinez-Molina, A.; Miserlis, D.; Hernandez, H.; Alamaniotis, M. Intelligent System Utilizing HOG and CNN for Thermal Image-Based Detection of Wild Animals in Nocturnal Periods for Vehicle Safety. Appl. Artif. Intell. 2022, 36, 1–29.
4. Felzenszwalb, P.F.; Huttenlocher, D.P. Efficient Graph-Based Image Segmentation. Int. J. Comput. Vis. 2004, 59, 167–181.
5. Russell, S.; Norvig, P. Learning from Examples. In Artificial Intelligence: A Modern Approach, Global Edition, 3rd ed.; Pearson: London, UK, 2016; Chapter 18; pp. 745–748.
6. Russell, S.; Norvig, P. Perception. In Artificial Intelligence: A Modern Approach, Global Edition, 3rd ed.; Pearson: London, UK, 2016; Chapter 24; p. 946.
7. Carl, C.; Schönfeld, F.; Profft, I.; Klamm, A.; Landgraf, D. Automated detection of European wild mammal species in camera trap images with an existing and pre-trained computer vision model. Eur. J. Wildl. Res. 2020, 66, 62.
8. Vecvanags, A.; Aktas, K.; Pavlovs, I.; Avots, E.; Filipovs, J.; Brauns, A.; Done, G.; Jakovels, D.; Anbarjafari, G. Ungulate Detection and Species Classification from Camera Trap Images Using RetinaNet and Faster R-CNN. Entropy 2022, 24, 353.
9. Khan, W.; Raj, K.; Kumar, T.; Roy, A.M.; Luo, B. Introducing Urdu Digits Dataset with Demonstration of an Efficient and Robust Noisy Decoder-Based Pseudo Example Generator. Symmetry 2022, 14, 1976.
10. Roy, A.; Bhaduri, J.; Kumar, T.; Raj, K. WilDect-YOLO: An efficient and robust computer vision-based accurate object localization model for automated endangered wildlife detection. Ecol. Inform. 2022, 101919.
11. Roy, A.M.; Bose, R.; Bhaduri, J. A fast accurate fine-grain object detection model based on YOLOv4 deep neural network. Neural Comput. Applic. 2022, 34, 3895–3921.
12. Khasawneh, N.; Fraiwan, M.; Fraiwan, L. Detection of K-complexes in EEG waveform images using faster R-CNN and deep transfer learning. BMC Med. Inform. Decis. Mak. 2022, 22, 297.
13. Wu, W.; Yin, Y.; Wang, X.; Xu, D. Face Detection With Different Scales Based on Faster R-CNN. IEEE Trans. Cybern. 2019, 49, 4017–4028.
14. Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767.
15. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. arXiv 2015, arXiv:1506.01497.
16. Tzutalin. LabelImg. Git Code. 2015. Available online: https://github.com/tzutalin/labelImg (accessed on 30 June 2022).
17. Documentation Scikit-Image, Version 0.19.2. Available online: https://scikit-image.org/docs/stable/ (accessed on 30 June 2022).
18. Documentation Detectron2. 2022. Available online: https://github.com/facebookresearch/detectron2 (accessed on 30 June 2022).
19. Documentation Darknet. 2022. Available online: https://github.com/pjreddie/darknet (accessed on 30 June 2022).
20. Bisong, E. Google Colaboratory. In Building Machine Learning and Deep Learning Models on Google Cloud Platform; Apress: Berkeley, CA, USA, 2019.

Figure 1. Examples of two distinguishable objects—(a) “wild_boar”, (b) “deer”.

Figure 2. A block diagram of the HOG/SVM-based detection process.

Figure 3. Region proposals obtained with Felzenszwalb’s algorithm.

Figure 4. Detection result for YOLOv3 with the specified probability of the given class.

Figure 5. Detection result for Faster R-CNN with the specified probability of the given class.

Figure 6. Failed detection using HOG/SVM: the algorithm found only the brightest parts of the animal.

Figure 7. Example of a poor detection result for HOG/SVM: the animals’ silhouettes overlap.

Table 1. The results obtained for the different methods.

Algorithm	Recall	mAP₅₀	mAP₇₅	Training Time
HOG/SVM	49%	40%	-	<1 min
YOLOv3	70%	94%	85%	~50 min
Faster R-CNN	68%	87%	77%	~60 min

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
