Train Fast While Reducing False Positives: Improving Animal Classification Performance Using Convolutional Neural Networks

Abstract: The combination of unmanned aerial vehicles (UAV) with deep learning models has the capacity to replace manned aircraft for wildlife surveys. However, the scarcity of animals in the wild often leads to highly unbalanced, large datasets for which even a good detection method can return a large number of false detections. Our objectives in this paper were to design a training method that would reduce training time, decrease the number of false positives and alleviate the fine-tuning effort of an image classifier in a context of animal surveys. We acquired two highly unbalanced datasets of deer images with a UAV and trained a ResNet-18 classifier using hard-negative mining and a series of recent techniques. Our method achieved very low false positive rates on two test sets (1 false positive per 19,162 and 213,312 negatives respectively), while training on small but relevant fractions of the data. The resulting training times were therefore significantly shorter than they would have been using the whole datasets. This high level of efficiency was achieved with little tuning effort and using simple techniques. We believe this parsimonious approach to dealing with highly unbalanced, large datasets could be particularly useful to projects with either limited resources or extremely large datasets.


Introduction
Accurate animal counts are the cornerstone of robust conservation and management plans [1]. For species prone to conflict with humans, or when population densities can vary greatly in time and space, counts need to be carried out frequently [2]. Many different techniques exist to assess animal populations, from indirect methods, like pellet counts, to direct visual counting [3][4][5]. Most often, animal censuses are species-specific and require substantial investments in time, money, and effort by wildlife management teams [6]. Whilst some species gather periodically in specific locations, making population assessment easier [5,7], others roam alone or in small groups across vast territories [8,9]. Perhaps the most commonly used technique in open or semi-open environments is direct visual counting. Be it from the ground or from a moving aircraft, it is relatively easy to set up and carry out. However, it is prone to errors due to animal movement, group sizes, poor lines of sight, or variations in the observers' capacities [5,10].
Unmanned aerial vehicles (UAV), commonly known as drones, have recently become more accessible to researchers [11]. They allow easy access to remote areas, are safer and less technically challenging than their manned counterparts, are less stressful for animals and offer the possibility to completely automate flights [2,12,13].
Hard-negative mining (HNM) iteratively adds the false positives returned by a trained model back into the training set, thereby selecting only the most informative samples from the training data. Because only the relevant samples are selected to form the training set, the number of samples it contains is kept to a minimum while maintaining good levels of performance and short training times. The main downside of this method is that it requires several rounds of training.
In this paper, we present a general method that simultaneously tackles two major hurdles of training neural networks in image classification for wildlife surveys: the high number of FP and the big size of datasets. More specifically, we showcase the effectiveness of HNM to reduce the number of FP while training quickly and efficiently, using a series of recent, simple, and available methods without needing extensive fine-tuning.

Data Acquisition
In order to train and test our models, we acquired images of red deer (Cervus elaphus), in a deer farm near La Chute, Québec, Canada. This setup allowed us to ensure the presence of around 250 deer in an open, but small, controlled environment. The site is composed of five enclosures of different sizes and vegetation cover (Figure 1b-d). We used an electric multirotor UAV from Microdrones (Berlin, Germany), the md4-1000 (Figure 1a), equipped with a Sony RXI RII camera and a 35 mm lens. The camera takes RGB images of 7852 × 5304 pixels.
We flew over the site twice, in the summer on 25 August 2017 and in the winter on 3 March 2018 in order to get two sets of images under very different environmental conditions, to ensure we had a variety of backgrounds. In the summer, some of the enclosures were very dry, covered with bare soil or dry grass. Only one enclosure had green grass. In the winter, the deer were grouped in three enclosures. Most of the ground was covered in snow but the rising temperatures were leaving patches of bare soil where the deer gathered (Figure 1c,d). From the UAV's perspective, the deer were either standing up or lying down, exposing their flanks. While this species can grow to be more than 2 m long, most of the herd were young individuals measuring around 1.6 m long or less.
Flights were done at 40 and 80 m above ground level in order to diversify the dataset. In the summer acquisition, we flew over each deer enclosure at least once, at both altitudes. Takeoff and landing were performed manually, away from the deer to avoid causing stress. The rest of the flights were carried out automatically along linear transects covering the whole enclosure. Images were taken with 80% frontal and lateral overlap. In the winter, the colder temperatures reduced battery life; for safety reasons, we flew over all the enclosures in every flight.

Image Pre-Processing
For each acquisition, the images were grouped by flight and were split between positive images containing deer, and negative images not containing any. The positive images were manually annotated in ArcMap 10.6 from Esri, where each deer was marked by a point in a vector layer. Individual images of deer were extracted from square windows (of 350 × 350 pixels and 175 × 175 pixels at 40 and 80 m respectively, or about 3 m², with ground sample distances (GSD) of 5.2 mm and 10.3 mm respectively) centered on each detection. These individual images were then sorted to only contain whole and unobstructed images of deer. Their negative counterparts were automatically generated by cropping the large negative images along a sliding window of the same size as the ones used for the individual images of deer. A random sample of the resulting small negative images was selected for each flight to match the number of positive images, thus creating a balanced binary classification dataset (hereafter cited as initial datasets).
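As an illustration, the negative-window generation described above can be sketched in a few lines of Python. This is a simplified, hypothetical sketch (non-overlapping windows, invented image identifiers, the 350-pixel window of the 40 m flights), not the actual extraction code used in the study:

```python
import random

def window_origins(img_w, img_h, win):
    """Top-left corners of a non-overlapping grid of win x win crops."""
    xs = range(0, img_w - win + 1, win)
    ys = range(0, img_h - win + 1, win)
    return [(x, y) for y in ys for x in xs]

def sample_negatives(negative_images, win, n_positives, seed=0):
    """Crop coordinates for a random subset of negative windows,
    matching the number of positive samples (balanced dataset)."""
    crops = [(img_id, xy)
             for img_id, (w, h) in negative_images.items()
             for xy in window_origins(w, h, win)]
    random.Random(seed).shuffle(crops)
    return crops[:n_positives]

# Hypothetical example: two negative images at the 40 m resolution.
negatives = {"IMG_001": (7852, 5304), "IMG_002": (7852, 5304)}
balanced = sample_negatives(negatives, 350, n_positives=100)
```

In practice one would then crop the pixels at each returned coordinate; only the coordinate bookkeeping is shown here.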
The training, validation, and test sets were made of images from separate flights to avoid testing the network on images very similar to the ones it had already been trained on (due to the high overlap).
Unbalanced datasets were also created for the training and validation sets, containing all the available negative images. These datasets acted as the hard-negative 'mines' on which we ran our trained model to retrieve hard samples. We will refer to these datasets as training and validation "pools" to water down the mining references in the paper. The resulting datasets and their imbalance factor (the number of negative samples per positive sample) are summarized in Table 1.

Proposed Approach
The training and testing were carried out using PyTorch 1.1.0 [35] on three graphics processing units (GPU) (NVIDIA GeForce GTX 1080 Ti) with a ResNet-18 [36] pre-trained on ImageNet, from the PyTorch model zoo. Using the smallest network of the ResNet family gave good results in a reasonable amount of time: for this binary classification task, its capacity was sufficient, and training was faster than with deeper architectures. Because the images on ImageNet are likely to be very different from aerial images, we decided to fine-tune the whole network instead of specific layers.
In our training, we used the Ranger optimizer, implemented by Less Wright (https://github.com/lessw2020/Ranger-Deep-Learning-Optimizer), which combines the recent techniques of Rectified Adam (RAdam) [37] and Lookahead [38], with a flat learning rate schedule and a batch size of 1100.
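To convey the Lookahead half of Ranger, here is a minimal pure-Python sketch of the algorithm on a toy one-dimensional problem, with plain gradient descent standing in for the RAdam inner optimizer; the function names and constants are illustrative, not the study's actual code:

```python
def lookahead_gd(w0, grad, inner_lr=0.1, k=5, alpha=0.5, steps=50):
    """Lookahead wrapper around plain gradient descent (the 'fast' optimizer).

    Every k inner steps, the slow weights move a fraction alpha toward the
    fast weights, and the fast weights restart from the slow weights.
    """
    slow = fast = w0
    for step in range(1, steps + 1):
        fast -= inner_lr * grad(fast)          # fast (inner) update
        if step % k == 0:                      # synchronization step
            slow += alpha * (fast - slow)      # slow weights interpolate
            fast = slow                        # fast weights restart from slow
    return slow

# Toy problem: minimize f(w) = (w - 3)^2, so grad(w) = 2 * (w - 3).
w = lookahead_gd(0.0, lambda w: 2.0 * (w - 3.0))
```

The slow/fast interpolation is what gives Lookahead its stabilizing effect; RAdam would replace the plain gradient step in a full implementation.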
As we planned to train on data that the network had already seen through the HNM process, we expected overfitting (the loss of generalization performance after training too long on the same data) to be a greater risk than if we were training from scratch each time. To address this issue, we used data augmentation at every epoch on the training data, composed of random flips, rotations, rescaling, and brightness changes, as well as early stopping [39] with a patience of 20 epochs. The small size of our validation set allowed us to assess the validation performance after each epoch.
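The early stopping logic described above can be sketched as follows; this is a minimal, hypothetical illustration tracking a validation score (e.g., the MCC), not the study's actual implementation:

```python
class EarlyStopping:
    """Stop training once the validation score has not improved
    for `patience` consecutive epochs, remembering the best epoch."""

    def __init__(self, patience=20):
        self.patience = patience
        self.best_score = float("-inf")
        self.best_epoch = None
        self.bad_epochs = 0

    def step(self, epoch, val_score):
        """Return True when training should stop."""
        if val_score > self.best_score:
            self.best_score = val_score
            self.best_epoch = epoch          # a checkpoint would be saved here
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# Toy run: validation MCC improves for three epochs, then plateaus.
stopper = EarlyStopping(patience=3)
scores = [0.5, 0.7, 0.9, 0.89, 0.88, 0.87, 0.86]
stopped_at = next(e for e, s in enumerate(scores) if stopper.step(e, s))
```

In this toy run, training stops at epoch 5 and the checkpoint from epoch 2 (score 0.9) would be kept.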
To select the learning rate, we used an implementation of the learning rate range test (LRRT) [40] (https://github.com/davidtvs/pytorch-lr-finder). This simple test allowed us to pick the learning rate without a grid search, therefore saving a lot of fine-tuning time.
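To convey the idea, here is a toy sketch of an LRRT-style search on a one-dimensional quadratic loss. Unlike the actual test [40], which increases the learning rate within a single training run, this simplified version probes each candidate rate independently; all functions and constants are illustrative:

```python
def lr_range_test(loss, grad, w0, lr_min=1e-5, lr_max=10.0, n=30, probe_steps=5):
    """Toy learning rate range test: probe geometrically spaced learning
    rates and record the loss after a few gradient steps at each rate."""
    ratio = (lr_max / lr_min) ** (1.0 / (n - 1))
    lrs, losses = [], []
    for i in range(n):
        lr = lr_min * ratio ** i
        w = w0
        for _ in range(probe_steps):
            w -= lr * grad(w)                 # plain gradient descent probe
        lrs.append(lr)
        losses.append(loss(w))
    return lrs, losses

def pick_lr(lrs, losses):
    """Heuristic: the rate with the lowest probe loss, divided by 10
    for safety (a margin often suggested when reading LRRT curves)."""
    best = min(range(len(lrs)), key=lambda i: losses[i])
    return lrs[best] / 10.0

# Toy problem: loss(w) = (w - 3)^2; gradient descent diverges for lr > 1.
lrs, losses = lr_range_test(lambda w: (w - 3.0) ** 2,
                            lambda w: 2.0 * (w - 3.0), w0=0.0)
chosen = pick_lr(lrs, losses)
```

Too-small rates barely move the loss and too-large rates make it explode; the usable range sits in between, which is exactly what the curve of `losses` against `lrs` reveals.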

Metric
The metric is the score by which the performance of the model is evaluated, either during training on the validation dataset or after training on the test set. The most common metrics for binary classification tasks are the accuracy and the F1 score [41]. However, these metrics can be misleading and generate overoptimistic results when applied to unbalanced datasets [41]. Instead, we chose to use the Matthews correlation coefficient (MCC), which requires the classifier to perform well on both negative and positive samples to get a good score, regardless of their ratio [41]. Because we expect the training data to become more unbalanced as the wildlife survey progresses, the MCC assures us that the score given will be consistent and unbiased by the imbalance throughout the process.
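For reference, the MCC is computed from the four confusion-matrix counts; the short sketch below also illustrates why accuracy can be overoptimistic on unbalanced data (the counts are invented for illustration):

```python
from math import sqrt

def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient from confusion-matrix counts.
    Returns 0 when any marginal is empty (the usual convention)."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# A classifier predicting "negative" everywhere looks excellent on
# accuracy for a 1:99 unbalanced set, but the MCC exposes it:
accuracy = (0 + 990) / 1000             # 99% accurate...
score = mcc(tp=0, fp=0, tn=990, fn=10)  # ...yet MCC = 0
```

A perfect classifier scores 1, random guessing scores about 0, and total disagreement scores -1, regardless of the class ratio.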

Proposed Workflow: Hard-Negative Mining (HNM)
For both acquisitions, we started our training procedure with balanced training and validation sets (Table 1) on which we ran our model with 10 different seeds. We called this round of training "round 0" as no hard sample had been found yet. We used the validation MCC score to select the best model and ran it in inference mode on the training and validation pools to select the hard samples. All retrieved hard samples were added to the training set, whereas only images not already present were added to the validation set. We repeated this process until the number of hard negatives found on the validation pool reached our acceptability threshold of 10 FP. To make sure that the gains transferred well to the test set, we ran the best model of each training round on the test set, but all the decisions regarding the training were made based on the validation set. We modified the inference process to also output the class activation maps [42] of the incorrectly classified images, in order to visualize the areas the network was basing its decisions on. The whole process is summarized in Figure 2.
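The HNM loop can be illustrated with a deliberately simplified toy model, where "images" are scalar scores and the "classifier" is a threshold; the numbers and functions are invented for illustration, but the mechanics (mine the false positives from the pool, add them to the training set, retrain) mirror the workflow above:

```python
def train_threshold(pos, neg):
    """Toy 'classifier': decision boundary halfway between the hardest
    training samples of each class."""
    return (min(pos) + max(neg)) / 2.0

def false_positives(threshold, pool):
    """Negatives from the pool that the model classifies as positive."""
    return [x for x in pool if x >= threshold]

# Positives cluster around 5; the pool is mostly easy negatives (~0)
# plus a few hard ones (~3.7) that resemble positives.
positives = [4.8, 5.0, 5.2, 5.4]
easy_negatives = [0.1, 0.2, 0.3, 0.4]
negative_pool = easy_negatives + [0.0] * 100 + [3.6, 3.7, 3.8]

# Round 0: balanced training on easy negatives only.
train_neg = list(easy_negatives)
t0 = train_threshold(positives, train_neg)
fp0 = false_positives(t0, negative_pool)

# Round 1: mine the hard negatives into the training set and retrain.
train_neg += fp0
t1 = train_threshold(positives, train_neg)
fp1 = false_positives(t1, negative_pool)
```

Round 0 misclassifies the three hard negatives; after they are mined into the training set, the retrained boundary moves past them and the false positives on the pool disappear without introducing false negatives.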

Training Times
For a given hardware setup, the training time directly depends on the number of images present in the training and validation sets. To compare the training time of our method with training on the full dataset, we ran a single training run on the full training and validation pools for both acquisitions, with the same techniques and optimizer used in HNM. For each acquisition, we performed a LRRT to pick a learning rate value, then trained the same pretrained network used in round 0 of HNM with an early stopping patience of five epochs.

Hard-Negative Mining
It took only one round of HNM for both acquisitions to reach our acceptability threshold. The details of the training and validation sets for both acquisitions can be found in Table 2.

The inference on the training and validation pools after round 0 of HNM added 2837 and 854 negative samples to the summer training and validation sets, bringing their imbalance factors from 1 to 1.86 and 1.94 respectively. Far fewer hard samples were found for the winter acquisition, with 53 added to the training set and 51 to the validation set. The imbalance factors went from 1 to 1.02 and 1.04 respectively. The proportions of hard samples in the full dataset were larger for the summer acquisition, with 0.45% and 0.99% for the training and validation pools respectively, against 0.01% and 0.06% for the winter dataset.
For round 1 of training, the summer training and validation sets were made of 0.98% and 2.06% of the training and validation pools respectively. For the winter acquisition, they represented 1.44% and 1.63% of the training and validation pools. Adding these hard samples had some notable impacts on the performance on the validation pools for both acquisitions after round 1 of training (Tables 3 and 4). The numbers of FP went from 854 to 4 and from 51 to 5 for the summer and winter sets respectively, thereby reducing the FP rates by 99.5% and 90.2%. The numbers of FN increased for both acquisitions between the two rounds of training, going from 0 to 9 for the summer acquisition and from 0 to 1 for the winter acquisition. However, round 1 of training still resulted in an overall gain in performance, as the MCC scores went from 0.715 to 0.993 for the summer acquisition and from 0.979 to 0.998 for the winter acquisition. The summer acquisition showed the highest gain from the HNM process, as the MCC on the validation pool improved by 28% between the two rounds, compared to the 1.9% increase on the winter validation pool. The FP rate on the validation pool at the end of round 1 of training was 1 FP for 21,276 negative images for the summer set and 1 FP for 15,385 negative images for the winter set. This transferred well to both test sets, which went from 1/185 and 1/13,761 to 1/19,162 and 1/213,312 for the summer and winter sets respectively (Tables 3 and 4).

Class Activation Maps
The class activation maps of most of the false negatives show that the decision to classify these images as negatives is based on the background surrounding the deer, avoiding the deer itself (Figure 3a).

The network failed to distinguish between the background and the deer on only two images on the summer test set and one on the winter test set (Figure 3b).


Training Times
The difference in training time between the two rounds of training of HNM and training on the full dataset for both acquisitions is summarized in Tables 5 and 6.
Table 5. Comparison of training times between the two rounds of hard-negative mining (HNM) and training on the full unbalanced dataset for the summer and winter acquisitions.
One training on the full summer dataset took 932 min (15 h and 32 min). This is 136 min less than it took to train the 20 models with HNM. The training on the full winter dataset was much faster than for the summer dataset (295 min or 4 h and 55 min), a little more than a third of the time it took to train the 20 models of the HNM on the same acquisition. For both acquisitions, the final models of the HNM process outperformed the models trained on the full datasets on the validation and test MCC scores, by 13.19% and 6.79% on the summer validation and test MCC respectively and by 0.17% and 0.36% for the winter acquisition. On average, it took 106.8 and 87.2 min to train a single model through the HNM method for the summer and winter acquisitions respectively.

Discussion
Our goal in this paper was to reduce training time and the number of FP when training on highly unbalanced data, with minimal fine-tuning of hyperparameters. Our method managed to achieve better performance than the same model trained on the whole dataset, in a fraction of the training time and with very low FP rates (around 1 FP per 6 hectares on the summer test set and 1 per 60 hectares on the winter test set).
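As a back-of-envelope check, these per-area figures follow directly from the per-negative FP rates, assuming every negative window covers roughly 3 m² as in the pre-processing step (the true footprint varies slightly with flight altitude, which is why the figures above are rounded):

```python
# Back-of-envelope conversion of per-negative FP rates into per-area
# rates, assuming each negative window covers roughly 3 m^2.
WINDOW_AREA_M2 = 3.0
M2_PER_HECTARE = 10_000

def hectares_per_fp(negatives_per_fp):
    """Surveyed area corresponding to one false positive."""
    return negatives_per_fp * WINDOW_AREA_M2 / M2_PER_HECTARE

summer = hectares_per_fp(19_162)   # about 5.7 ha per false positive
winter = hectares_per_fp(213_312)  # about 64 ha per false positive
```

Under this assumption the summer rate comes out near 6 ha per FP and the winter rate on the order of 60 ha per FP, consistent with the rounded figures above.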

Training the Models
Early stopping allowed us to simultaneously avoid overfitting, save the version of the network with the best generalization performance, and limit training time. However, shortening the training also limits the capacity of an optimizer such as RAdam to reach its best performance when using a suboptimal learning rate. The LRRT offers an interesting synergy with early stopping in this regard, as it ensures that the learning rate is picked within the range containing the optimal value. Therefore, we can expect good performance from the beginning of training and an end result not too far from what it would have been with the optimal value. While training parameters are generally given in the literature, little information is provided regarding the process to pick them [8,24,43,44]. The LRRT used in this article offers an interesting tool to standardize the search for good learning rate values at little cost. Moreover, it can also be used to find good values for other optimizer-related parameters, such as weight decay or momentum [27].
Slightly worse hyperparameter values than the ones chosen would likely have increased the number of hard samples and the time for the network to converge, but the performance between HNM rounds would likely have improved due to the addition of new, relevant data. In our eyes, the method presented here offers a good trade-off between shortening the training time and good generalization performance. An alternative approach to improve the performance on the final round of training (round 1 in this case) would be to spend more time fine-tuning the hyperparameters instead of using the same exact training method as in the previous rounds. However, in a case like ours where the performance on the validation set is already very good, we expect diminishing returns on the time invested in the fine-tuning.
Apart from the impact high levels of imbalance can have on generalization performance, training on large datasets requires a lot of computing time. Moreover, the hyperparameter tuning required to use any method that tackles imbalance on a full dataset also takes significantly longer on a large dataset. While HNM doesn't completely remove the imbalance, it greatly reduces its magnitude. The other techniques to mitigate the impact of imbalance mentioned earlier could still be used in conjunction with HNM, but the fine-tuning of their parameters would require far less time due to the smaller size of the training and validation sets.
We noticed high levels of variability between runs using different random initializations for a given set of hyperparameters, despite what can be read in [43], and therefore encourage practitioners to try several runs before settling on a final model (see Appendix A for more details).

Hard-Negative Mining
Whilst the FP rate was better on the winter acquisition than on the summer one, the HNM impact on the classification performance between the two rounds of training was much stronger on the summer acquisition (improvement of a factor 100 on the summer test set against an improvement of a factor 14 on the winter test set). We believe the reason for this to be the higher variety of objects present in the negative class of the summer dataset compared to its winter counterpart. In the winter, snow covered the majority of the objects present in the images, thereby reducing the intraclass diversity of the negative class while increasing the contrast between the deer and the background. With more objects to confuse the network in the summer acquisition, the first inference on the training and validation pools returned significantly more FP than for the winter dataset. Most of these FP were similar, the vast majority being rocks, tree trunks, or shadows, which happened to be almost absent from the initial training set. In this regard, the HNM ended up being a way of oversampling confusing objects within the negative class of the training set.
Applying the network to new areas whose background diversity differs from that previously encountered may decrease its performance [28]. In that case, the HNM process can retrieve only the informative examples needed to fine-tune the network. This would allow the training process to scale well in time as more data is acquired, without needing to retrain the network from scratch.
Unsurprisingly, the training of a model through HNM was significantly faster than a simple training on the full dataset and achieved better results. This highlights the negative impact a high number of easy samples can have on performance when nothing is done to mitigate the imbalance.
We believe this approach could be very beneficial to studies that use CNN to perform image classification on imbalanced datasets applied to different species, either on camera-trap images [43,44] or on UAV imagery [24]. The latter is a particularly good example as it has annotation and classification methodologies very similar to ours but with a much higher proportion of FP (1 FP for 530 negative images and an MCC score of 0.3526). The vast majority of their negative class is made of ocean, with very little intraclass diversity and therefore few objects that could confuse the network. Most of the negative images are likely to be easy samples that negatively impact the network's performance and could be removed from the training set through HNM. However, the difference in overall performance between our study and theirs doesn't only come from our HNM method. As explained in their article, other factors such as network depth (4 layers against 18 in ours), the use of a non-pretrained network, or the fact that they favor the recall against the precision may also have a significant impact on their network's performance. We could expect similar results to ours for images of similar GSD, of animals of comparable sizes to the red deer and in a similar environment, such as white-tailed deer (Odocoileus virginianus), caribou (Rangifer tarandus), or black bear (Ursus americanus).
When facing high levels of imbalance (75% of their images are negative), Norouzzadeh et al. [43] opted for a two-stage pipeline, first separating empty and full (containing animals) images then classifying the species present in the full images. To that end, they first randomly selected negative images to match the number of positive images, as we do to start our round 0. However, they then carried out their training without using the rest of the negative images, amounting to half of their available data. A single round of mining on the negative data might have brought new informative samples, improving the ability of the network to distinguish between empty and full images, without causing too much imbalance. Perhaps this alone might have helped improve the performance of their one stage pipeline and reduced the need for a two-stage approach.

Perspectives on Future Work
The nature of image classification as it is performed here can lead to mistakes when the network is more confident in the background area around the deer than the deer itself. When looking at the class activation maps of most of the FN (Figure 3), both deer and background are properly distinguished by the network, but the prediction is not the one we expect. Preliminary testing on these images has shown that cropping a significant portion of the background area around the deer led the network to classify it as deer. We interpret this as the consequence of forcing the network to give only one, non-spatially-specific label to images containing both classes, based on the one that gives the highest score. A promising avenue to improve on this is to use the same network to perform coarse semantic segmentation by transforming it into a fully convolutional network (FCN). This method outputs a raster per class, highlighting the areas in the image where the class is detected (Figure 4). Similar ideas in Kellenberger et al. [28] and Bowler et al. [26] achieved good levels of performance. We believe that this technique could be used as a detection method but additional work to fully automate the detection from the coarse segmentation map is needed to assess its effectiveness on full-size images.

Data Availability Statement:
The data presented in this study are available on request from the corresponding author. The data are not publicly available due to proprietary rights and could only be used for non-commercial purposes.

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results. Microdrones collected the images.

Appendix A
Randomness has an important impact on training, from the initialization of the network to the order in which the samples are picked. Setting this randomness from the start is necessary to ensure that training runs are reproducible. This is easily done in PyTorch by setting the seed at the beginning of training. The seed is a number from which Python generates pseudo-random numbers: for a given seed, the number returned by the Nth call to the random number generator will always be the same, thus ensuring run reproducibility. However, because of the randomness it generates throughout the training, it is impossible to predict which seed will yield the best results. It is therefore necessary to use different seeds for each training run and keep the one that performs the best on the validation set.
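The behavior described above can be demonstrated with a minimal pure-Python illustration, using random.Random as a stand-in for the framework's seeding (in PyTorch one would call torch.manual_seed); the "training run" here is a toy function, not a real model:

```python
import random

def noisy_training_run(seed):
    """Stand-in for a training run: the seed fixes every 'random' choice,
    so the same seed always reproduces the same final score."""
    rng = random.Random(seed)
    score = 0.9 + 0.1 * rng.random()   # toy 'validation MCC'
    return score

# Same seed, same result; different seeds, (almost surely) different results.
run_a = noisy_training_run(42)
run_b = noisy_training_run(42)
run_c = noisy_training_run(7)

# Try several seeds and keep the best model, as recommended below.
seeds = [42, 7, 1234]
best_seed = max(seeds, key=noisy_training_run)
```

The seed values themselves are arbitrary; what matters is that each run is reproducible and that the best-performing seed is selected on the validation set.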
We randomly picked 10 numbers between 1 and 10,000 to use as seeds. Every round of training was carried out 10 times, using a different seed from our list of seeds.
We noticed that by selecting the best models based on their validation MCC, the number of false positives on the validation and test pools was, in both cases, below the average over the ten different seeds (Tables A1 and A2). However, for both acquisitions, an unfavorable seed could more than double the number of false positives compared to the lowest value. The seed thus seems to have a non-negligible effect on the network's performance. Our understanding is that a good initialization, through a favorable seed, may position the network in a spot allowing it to reach a smaller loss faster than with another, less favorable, seed. Although this may not surprise the most experienced practitioners, we thought it could be beneficial to newcomers in the field.
Running a model with multiple seeds is a way to assess the model's average performance. In a real-world scenario, however, our end goal would not be to obtain a good assessment of the average performance of a tuned model but to pick the best performing model we could get in a reasonable time. In the case of an iterative process such as ours, selecting the best performing networks early is likely to save even more computing time down the line. Therefore, we would recommend trying out at least three seeds after the hyperparameter selection and keeping the one that performs the best on the validation set. In our case, the small size of our training sets allowed us to test 10 different seeds without committing too much computing time. The number of seeds to try is left at the discretion of the practitioner as it depends on the availability of resources and time constraints. However, we believe that the possible gain in performance compared to the little effort needed to try different seeds makes it worth testing.