Deep Learning Case Study for Automatic Bird Identification

An automatic bird identification system is required for offshore wind farms in Finland. Radar is the obvious choice for detecting flying birds, but external information is required for actual identification. We applied visual camera images as external data. The proposed system for automatic bird identification consists of a radar, a motorized video head and a single-lens reflex camera with a telephoto lens. A convolutional neural network trained with a deep learning algorithm is applied to the image classification. We also propose a data augmentation method in which images are rotated and converted in accordance with the desired color temperatures. The final identification is based on a fusion of parameters provided by the radar and the predictions of the image classifier. The sensitivity of the proposed system as an image classifier is 0.9463 on a data set of 9312 manually taken original images, which results in an augmented data set of 2.44 × 10^6 images. The area under the receiver operating characteristic curve for the two key bird species is 0.9993 (the White-tailed Eagle) and 0.9496 (the Lesser Black-backed Gull). We proposed a novel system for automatic bird identification as a real-world application. We demonstrated that our data augmentation method is suitable for the image classification problem and that it significantly increases the performance of the classifier.


Introduction
Several offshore wind farms are under construction on the Finnish west coast. The official environmental specifications require that the behaviour of bird species in the vicinity of wind turbines be monitored. This concerns two species in particular: the White-tailed Eagle (Haliaeetus albicilla) and the Lesser Black-backed Gull (Larus fuscus fuscus), which are explicitly mentioned in the environmental license. The only cost-efficient way to fulfil this demand is to automate the monitoring, and that requires automatic bird species identification at such a level that the aforementioned species are separable from all other species in the study area. The problem is how to identify bird species in flight automatically and in real time. A prototype system for automated bird identification has been developed and placed at a test location on the Finnish west coast. This system is still under construction.
The ultimate objective of bird monitoring in wind farms is to find suitable methods for collision detection [1,2], and especially to find possible deterrent methods [3]. The WT-Bird system of the Energy Research Centre of the Netherlands is, to our knowledge, the first published research on this subject. The principle of the WT-Bird system is that a bird collision can be detected from the sound of the impact and that the bird species can then be recognised from video footage by a non-real-time method [4,5].
However, the WT-Bird system is known to suffer from false alarms in high-wind conditions where larger bird species are concerned, and it has no automated species identification algorithm [6].
Radar is a feasible choice for the detection of birds since the identification need is restricted to flying birds only. If radar alone is used, the identification capability is limited to a few size classes, according to radar suppliers. External information is therefore required, and a conceivable method is to exploit visual camera images; thus a digital single-lens reflex (DSLR) camera with a telephoto lens is applied. This paper shows that a convolutional neural network (CNN) trained with a deep learning algorithm on real-world images is capable of achieving sufficient performance as an image classifier. At present, all the images are taken manually at the test location. In the final system, the images will be acquired automatically.

Radar System
We have used a radar system supplied by Robin Radar Systems B.V. (The Hague, The Netherlands) because they provide an avian radar system that is able to detect birds. They also provide tracker algorithms for tracking a detected object over time, i.e., between the blips. The model we use is the ROBIN 3D FLEX v1.6.3, which is actually a combination of two radars and a software package that implements various algorithms such as the tracking algorithms [7].

Video Head Control
We have used the PT-1020 motorized video head supplied by 2B Security Systems (Copenhagen, Denmark) [8]. The video head is operated with the Pelco-D control protocol [9], and we developed its control software in C on the Ubuntu Linux 16.04 platform. The video head steering is based on the height, latitude and longitude coordinates (WGS84) provided by the radar. No conversion from one coordinate reference system to another is needed because all calculations are performed in the WGS84 system. However, the geographical coordinates are converted to rectangular coordinates in accordance with the Finnish Geodetic Institute [10].
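The steering geometry can be illustrated with a minimal Python sketch. It is not the authors' C implementation: it assumes a spherical Earth and a local flat-earth approximation around the camera instead of the rectangular conversion of [10], and the function name `pan_tilt_angles` and its parameters are hypothetical.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius; spherical approximation (assumption)


def pan_tilt_angles(cam_lat, cam_lon, cam_h, tgt_lat, tgt_lon, tgt_h):
    """Return (pan, tilt) in degrees from the camera to the target.

    Pan is measured clockwise from true north, tilt upward from the horizon.
    Latitudes and longitudes are in degrees, heights in meters.
    """
    # Local north/east offsets in meters; adequate only for ranges of a few km.
    d_north = math.radians(tgt_lat - cam_lat) * EARTH_RADIUS_M
    d_east = (math.radians(tgt_lon - cam_lon) * EARTH_RADIUS_M
              * math.cos(math.radians(cam_lat)))
    d_up = tgt_h - cam_h

    pan = math.degrees(math.atan2(d_east, d_north)) % 360.0
    tilt = math.degrees(math.atan2(d_up, math.hypot(d_north, d_east)))
    return pan, tilt
```

The resulting angles would still have to be translated into Pelco-D pan and tilt commands, which this sketch does not attempt.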

Camera Control
We have manually collected the images at the test site with a Canon 7D Mark II camera (Tokyo, Japan) and a Canon 500/f4 IS telephoto lens (Tokyo, Japan). The software for controlling the camera is developed in C# in Microsoft Visual Studio 14.0 because this is the only environment supported by the Canon API at present. The Canon API library EDSDKLib-1.1.2 is applied. The code is developed in accordance with the instructions and functions of the API. The Canon API library can be applied for on the Internet [11].

Input Data
Input data for the identification system consist of digital images and parameters from the radar. The parameters from the radar are real numbers such as the velocity of a flying bird in m/s and the bearing in degrees (i.e., the heading: the horizontal angle between the direction of an object and that of true north). All images for training the CNN are of wild birds in flight, and they have been taken manually at the test location. There are also constraints concerning the area where the images have to be taken. Here, the area refers to the air space in the vicinity of the pilot wind turbine. We have used the wind turbine swept area (the diameter of the swept area is 130 m) as a suitable altitude constraint for taking the images, because birds flying below or above the swept area are not in danger. At this stage, the images are only taken at ranges of around 1350 m, which is the distance to the pilot wind turbine. There are 1164 images for each class and the number of classes is 8; thus the original training set size is 8 × 1164 = 9312. We applied data augmentation, as it is a well-known method for increasing the performance of an image classifier. In addition, the original (i.e., not augmented) data set includes plenty of images with various degrees of cloudiness in the background as well as images with a clear sky as the background.
When a CNN is applied, the number of images should be the same in each class [12]; therefore the image count of the smallest class is used for all classes. The number of classes (which includes both key species) is 8 at this phase. The eight classes for training the CNN are the Common Goldeneye (Bucephala clangula), the White-tailed Eagle (Haliaeetus albicilla), the Herring Gull (Larus argentatus), the Common Gull (Larus canus), the Lesser Black-backed Gull (Larus fuscus fuscus), the Black-headed Gull (Larus ridibundus), the Great Cormorant (Phalacrocorax carbo) and the Common/Arctic Tern (Sterna hirundo/paradisaea).

Data Augmentation
Our system operates in a natural environment, and therefore the prevailing weather has a significant influence on the tonality of the images taken at the test site. The lighting differs with the time of day and the time of year, and thus the toning of the images changes with the lighting. Color temperature is a property of a light source: it is the temperature of the ideal black-body radiator that radiates light of the same color as the light source. A black body is an opaque and non-reflective body, and black-body radiation is the thermal electromagnetic radiation that it emits. This radiation has a specific spectrum and intensity that depend only on the temperature of the black body, which is assumed to be uniform and constant. In our case, the light source is the sun, which closely approximates a black-body radiator. Even though the color of the sun may appear different depending on its position, the change of color is mainly due to the scattering of light and not to changes in the black-body radiation [13][14][15][16].
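The dependence of the black-body spectrum on temperature alone can be illustrated with Planck's law and Wien's displacement law. The following Python sketch is only an illustration of that physics and is not part of the augmentation algorithm; the function names are hypothetical.

```python
import math

H = 6.62607015e-34        # Planck constant, J s
C = 2.99792458e8          # speed of light in vacuum, m/s
K_B = 1.380649e-23        # Boltzmann constant, J/K
WIEN_B = 2.897771955e-3   # Wien displacement constant, m K


def planck_radiance(wavelength_m, temperature_k):
    """Spectral radiance of a black body (W sr^-1 m^-3) from Planck's law."""
    x = H * C / (wavelength_m * K_B * temperature_k)
    return (2.0 * H * C ** 2 / wavelength_m ** 5) / math.expm1(x)


def peak_wavelength_nm(temperature_k):
    """Wavelength of maximum spectral radiance (nm) from Wien's law."""
    return WIEN_B / temperature_k * 1e9


def blue_red_ratio(temperature_k):
    """Radiance at 450 nm relative to 650 nm; grows with temperature."""
    return (planck_radiance(450e-9, temperature_k)
            / planck_radiance(650e-9, temperature_k))
```

A higher color temperature shifts the balance towards blue, so `blue_red_ratio(9600)` is larger than `blue_red_ratio(5600)`, which matches the bluish and reddish toning of the augmented images.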
Color matching functions (CMFs) provide the absolute energy values of three primary colors which appear the same as each spectrum color. We applied the International Commission on Illumination (Commission internationale de l'éclairage, CIE) 10-deg color matching functions in our data augmentation algorithm [17].
The data augmentation is done according to the curves in Figure 1 [18] by converting an image into different color temperatures between 2000 K and 15,000 K with step size s, where s ∈ {50, 75, 100, 150, 200, 250, 300, 1000}. This makes the training set significantly larger; e.g., if s is 50, there are (15,000 − 2000)/50 + 1 = 261 color temperatures, so a class containing 1164 training examples yields 261 × 1164 = 303,804 augmented examples in addition to the original images. The augmented data set size for the various values of s is given in Table 1 for the original data set of size 8 × 1164 = 9312. After color conversion, the images are rotated by a random angle between −20° and 20° drawn from the uniform distribution. The maximum rotation angle has been reduced from 30° to 20° since our first publication because it was empirically noticed that the target birds were never at an angle this steep. The motivation for image rotation is that a CNN is invariant to small translations but not to rotation of an image [19]. Figure 2 presents one original image and two images produced from it by the augmentation algorithm with s = 200. The color temperature of the original image is 7600 K, and those of the two augmented images are 5600 K and 9600 K, respectively.
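The bookkeeping of the augmentation can be sketched in Python as follows. The sketch assumes that the color temperature grid includes both end points and that every original image is kept alongside its converted copies; the function names are hypothetical, and the color conversion itself (which needs the CIE curves of Figure 1) is not reproduced.

```python
import random

T_MIN_K, T_MAX_K = 2000, 15000  # color temperature range in kelvin


def color_temperatures(step):
    """Color temperatures from 2000 K up to at most 15,000 K with the given step."""
    return list(range(T_MIN_K, T_MAX_K + 1, step))


def augmented_set_size(n_classes, per_class, step):
    """Number of images after augmentation, counting the originals as well."""
    n_temps = len(color_temperatures(step))
    return n_classes * per_class * (n_temps + 1)


def random_rotation_angle(rng, max_deg=20.0):
    """Rotation angle in degrees drawn uniformly from [-max_deg, max_deg]."""
    return rng.uniform(-max_deg, max_deg)
```

For s = 50 this gives 261 color temperatures and `augmented_set_size(8, 1164, 50)` evaluates to 2,439,744, i.e., about 2.44 × 10^6 images.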

The Proposed System
The most important role of the radar is to detect flying birds, but it also provides parameters for bird identification (i.e., classification) [20,21]. The parameters provided by the radar system are the distance of a target in 3D (m), the velocity of a target (m/s) and the trajectory of a target. The distance of a detected bird is used to estimate the size of the bird in meters. The velocity of a target bird is used for the final classification. The system also includes the aforementioned camera with the telephoto lens and a motorized video head. The camera is controlled through the application programming interface (API) of the camera manufacturer. The system has three servers: the radar server, the video head steering server and the camera control server. The software for the radar server is supplied by the manufacturer of the radar, but the software for the other two servers is the result of our own development work.
We take a series of images of a single target bird, and each image is processed according to the schematic diagram of the system in Figure 3. Segmentation is computed in parallel with image classification in order to obtain an estimate of the target bird size in pixels. Although segmentation starts at the same time as the classification process, it is not part of the actual classification; its result is used only for assigning a value to the size estimate parameter. When the estimate in pixels is known, the target bird size estimate in meters can be calculated. We studied methods ranging from simple thresholding to fuzzy logic for solving the problem at hand, i.e., segmenting a dark figure against a bright background and vice versa. In the extreme case, the background and the target can share several colors in the RGB color space. We achieved the best results with fuzzy logic segmentation compared with threshold segmentation and edge detection segmentation [22,23]. In particular, we applied Mamdani's fuzzy inference method [24]. Figure 4a,b show an example of segmentation.
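The step from a segmented binary image to a size estimate in meters can be sketched as follows. This is a simplified Python illustration that assumes a pinhole-like geometry (the field of view at distance d spans 2 d tan(angle/2) meters) and takes the bounding box of the foreground as the pixel extent of the bird; all function names are hypothetical.

```python
import math


def mask_extents(mask):
    """Return (sigma_h, sigma_v): bounding-box width and height of the
    foreground pixels (truthy values) in a binary image given as a list of rows."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    if not rows:
        return 0, 0
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    return cols[-1] - cols[0] + 1, rows[-1] - rows[0] + 1


def pixels_per_meter(n_pixels, view_angle_deg, distance_m):
    """Pixels per meter at the target distance for one image axis."""
    field_m = 2.0 * distance_m * math.tan(math.radians(view_angle_deg) / 2.0)
    return n_pixels / field_m


def size_estimate_m2(mask, frame_w, frame_h, angle_h_deg, angle_v_deg, distance_m):
    """Rectangle-area size estimate of the target in square meters."""
    sigma_h, sigma_v = mask_extents(mask)
    width_m = sigma_h / pixels_per_meter(frame_w, angle_h_deg, distance_m)
    height_m = sigma_v / pixels_per_meter(frame_h, angle_v_deg, distance_m)
    return width_m * height_m
```

Because the estimate scales with the square of the distance, the accuracy of the radar range directly limits the accuracy of the size estimate.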

Classification
The classification process is presented in Figure 5. A series of images of a single target (i.e., a sequence of temporally consecutive frames of the same bird) is fed to the CNN, which is used for feature extraction. A two-step learning method is applied: the CNN is trained, the outputs of its first N−1 layers are viewed as feature maps, and these maps are used to train a Support Vector Machine (SVM) classifier [25]. The SVM classifier uses one-versus-all binary learners; for each binary learner, one class is positive and the rest are negative, so the total number of binary learners equals the number of classes. A linear classification model is applied, trained by stochastic gradient descent with a mini-batch size of 10 and the hinge loss function with regularization term 1/n, where n is the number of training examples [26,27]. The output of the SVM is presented as P-vectors as follows:

$$P_i = [c_1, c_2, \ldots, c_{nc}], \quad i = 1, \ldots, n, \tag{1}$$

where c_j is the probability of belonging to class j, nc is the number of classes and n is the number of images in each series; thus there is one P-vector for each image in any given image series.

There are also two parameters based on information provided by the radar system. The size of the target bird is estimated as follows. The frame size of the camera ([width x, height y], in pixels) and the angle of view (α) of the lens are known. The distance (d) to the target bird is provided by the radar. The maximum numbers of horizontal (σ_h) and vertical (σ_v) pixels of the target bird are calculated from the segmented image. The angle of view, b, at the distance, d, is calculated over a right-angled triangle (see Figure 6). The horizontal number of pixels/meter is given by

$$\rho_h = \frac{x}{2 d \tan(b_h / 2)} \tag{2}$$

and the vertical number of pixels/meter by

$$\rho_v = \frac{y}{2 d \tan(b_v / 2)}, \tag{3}$$

where b_h and b_v denote the horizontal and the vertical angles of view, respectively. The estimate for the size of the bird in a single image, in square meters as the area of a rectangle, is:

$$e = \frac{\sigma_h}{\rho_h} \cdot \frac{\sigma_v}{\rho_v}. \tag{4}$$

The size estimate is presented as a vector with nc elements placed according to the class order (the classes are ordered alphabetically by their names), i.e., class 1, class 2, ..., class nc, where nc denotes the number of classes. The vector is composed as follows: calculate the average, $\bar{e}$, of the size estimates of the image series, look up in the size look-up table all the classes whose size range contains this average, set those elements to one and set the others to zero, yielding the size estimate

$$E = [e_1, e_2, \ldots, e_{nc}], \tag{5}$$

with elements

$$e_j = \begin{cases} 1, & \text{if the size range of class } j \text{ contains } \bar{e}, \\ 0, & \text{otherwise.} \end{cases} \tag{6}$$

The velocity vector is composed in a similar way to the E-vector in (5): look up in the velocity look-up table all the classes whose velocity range contains the provided velocity, v, set those elements to one and the others to zero, yielding

$$V = [v_1, v_2, \ldots, v_{nc}], \tag{7}$$

with elements

$$v_j = \begin{cases} 1, & \text{if the velocity range of class } j \text{ contains } v, \\ 0, & \text{otherwise.} \end{cases} \tag{8}$$

The final classification is achieved by a fusion between the parameters provided by the radar and the predictions from the image classifier. The combined P-vector for a series of images is:

$$\bar{P} = \frac{1}{n} \sum_{i=1}^{n} P_i, \tag{9}$$

where n is the number of images in each series, and the fusion vector, Φ, is:

$$\Phi = \bar{P} \mathbin{.*} E \mathbin{.*} V, \tag{10}$$

where ".*" denotes element-wise multiplication. The score, S, for the final prediction is:

$$S = \max(\Phi) = \Phi_j, \tag{11}$$

where j is the index of the predicted class.
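The fusion step can be illustrated with a short Python sketch. It assumes that the combined P-vector is the average of the per-image P-vectors and that the look-up tables store one (low, high) range per class; the table values and the function names used here are hypothetical.

```python
def lookup_vector(value, ranges):
    """Binary vector with 1 for every class whose (low, high) range contains value."""
    return [1 if low <= value <= high else 0 for (low, high) in ranges]


def fuse(p_vectors, size_vec, velocity_vec):
    """Fuse image-classifier outputs with the radar-based binary vectors.

    Returns (index of predicted class, score, fusion vector).
    """
    n = len(p_vectors)
    n_classes = len(size_vec)
    # Combined P-vector: average class probability over the image series.
    mean_p = [sum(p[j] for p in p_vectors) / n for j in range(n_classes)]
    # Element-wise product with the size and velocity vectors.
    phi = [mean_p[j] * size_vec[j] * velocity_vec[j] for j in range(n_classes)]
    best = max(range(n_classes), key=lambda j: phi[j])
    return best, phi[best], phi


# Hypothetical three-class example: the radar data rule out classes 0 and 2.
size_ranges = [(0.1, 0.3), (0.2, 0.6), (0.5, 1.0)]        # m^2, made-up values
velocity_ranges = [(20.0, 30.0), (8.0, 16.0), (5.0, 12.0)]  # m/s, made-up values
E = lookup_vector(0.25, size_ranges)
V = lookup_vector(10.0, velocity_ranges)
predicted, score, phi = fuse([[0.6, 0.3, 0.1], [0.5, 0.4, 0.1]], E, V)
```

In this example the image classifier alone would favour class 0, but the radar-based vectors zero it out, so class 1 is predicted. This is the mechanism by which the radar parameters can correct a misclassification made from the images.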

Convolutional Neural Network
The CNN architecture is presented in Figure 7. With this architecture, one side of a feature map in the first convolution layer has (200 − 12 + 2 × 1)/2 + 1 = 96 elements, and because the feature maps are square, there are 96 × 96 = 9216 neurons in each feature map of the first convolution layer. Note that there is no max-pooling layer between the first and the second convolution layers. The motivation for this is that we wanted all of the finest edges to be included in the resulting feature maps.
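The feature map size follows the standard relation (W − K + 2P)/S + 1 for input size W, kernel size K, padding P and stride S. A minimal Python sketch, reading the numbers in the expression above as W = 200, K = 12, P = 1 and S = 2 (this reading is an assumption, since the values are not labelled in the text):

```python
def conv_output_size(input_size, kernel, padding, stride):
    """Side length of a convolution output: floor((W - K + 2P) / S) + 1."""
    return (input_size - kernel + 2 * padding) // stride + 1


side = conv_output_size(200, 12, 1, 2)   # 96
neurons_per_map = side * side            # 9216
```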
The input image is normalized and zero-centered before it is fed to the network. The CNN is trained in supervised mode with mini-batches, using stochastic gradient descent with momentum [28][29][30][31]. L2 regularization (i.e., weight decay) is also applied to reduce overfitting [30][31][32]. Due to limited computing resources, the network size in terms of free parameters is kept small, resulting in a total of 92 feature maps, which are extracted by the convolution layers with the kernel sizes indicated in Figure 7. Each convolution layer is followed by a Rectified Linear Unit (ReLU) nonlinearity layer [33], which simply applies a threshold operation to all the components of its input. This non-saturating nonlinearity makes the training of a deep CNN several times faster than the saturating hyperbolic tangent sigmoid transfer function does [33,34]. Cross-channel normalization layers follow the first and the second ReLU layers. These layers aid generalization, as their function may be seen as brightness normalization [34].
The purpose of the max-pooling layer is to build robustness to small distortions. This is achieved by filtering over local neighbourhoods as follows: divide the input into rectangular pooling regions and compute the maximum of each region, thus performing downsampling and reducing overfitting as well [35].
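The pooling operation described above can be sketched in a few lines of Python. This is a plain illustration on a single 2D feature map without padding; the pooling size and stride in the example are arbitrary and are not taken from Figure 7.

```python
def max_pool_2d(feature_map, pool, stride):
    """Max-pooling over square regions of a 2D list; no padding is applied."""
    height, width = len(feature_map), len(feature_map[0])
    pooled = []
    for r in range(0, height - pool + 1, stride):
        pooled.append([
            max(feature_map[r + i][c + j]
                for i in range(pool) for j in range(pool))
            for c in range(0, width - pool + 1, stride)
        ])
    return pooled


example = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
pooled = max_pool_2d(example, pool=2, stride=2)  # [[6, 8], [14, 16]]
```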
There are three fully-connected layers at the end of the network for making the final nonlinear combinations of features. The prediction is made by the last fully-connected layer followed by a softmax activation, which produces a distribution over the class labels and is trained with the cross-entropy loss function [31].
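For reference, the softmax activation and the cross-entropy loss for a single example can be written as the following Python sketch; the function names are hypothetical and the code is independent of any particular deep learning framework.

```python
import math


def softmax(logits):
    """Convert raw class scores into a probability distribution."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, true_index):
    """Cross-entropy loss for one example with a one-hot target."""
    return -math.log(probs[true_index])
```

The loss is small when the probability assigned to the true class is close to one and grows without bound as that probability approaches zero.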

Hyperparameter Selection
The data were split into a training set (70%) and a validation set (30%). The initial weights for all layers were drawn from a Gaussian distribution with mean 0 and standard deviation 0.01. Initial biases were set to zero. The L2 regularization factor was set to 0.0005 and the mini-batch size to 128. The values of all the previously mentioned hyperparameters were fixed, and we used manual tuning only for choosing the combination of the number of epochs and the learning rate drop period (LRDP). Two models with different values of these two parameters were trained on the original data set (i.e., no data augmentation applied). One model each was trained on the augmented data sets with s = 1100 and s = 350. Several models with various values of the two parameters were trained on the augmented data sets with s = 200 and s = 50. The results of training these models are presented in Table 2, in which performance is given as the true positive rate (TPR, i.e., sensitivity). The initial values of the two parameters for training on each data set were selected empirically. As a result of these tests, the best-performing model is the one trained on the augmented data set with s = 50 (i.e., 2,439,744 training examples), with 8 epochs and an LRDP of 3. The initial learning rate was set to 0.01; when the number of epochs and the LRDP had the same value, the learning rate stayed at its initial value throughout training. The learning rate decay schedule (LRDS) was applied when the number of epochs and the LRDP differed from each other. In the LRDS method, the learning rate is dropped by a factor of 0.1 (i.e., the updated learning rate is the current learning rate × 0.1) when a given number of epochs is reached. This number of epochs is the effective value of the LRDP. The motivation for using the LRDS method is that, as training proceeds with shorter leaps on the loss function surface from some point on, the optimal value for the weights (i.e., in terms of performance as a classifier) can be found more accurately. If only short leaps were applied, the number of epochs would have to be very large, resulting in a significant increase in training time. The challenge is to find the points from which the learning rate should be reduced. We approached this problem in two ways. First, we fixed the LRDP value and altered the number of epochs. Initially, the problem was to find a suitable starting value for the LRDP. It was intuitively clear that the LRDP value should increase as the number of epochs increases: a small LRDP combined with a large number of epochs would lead to substantial underfitting. Second, we fixed the number of epochs and altered the LRDP value instead. The same initial value problem concerns this approach as well. However, the size of the respective data set should give some guidance for choosing the initial values. Moreover, as the number of training examples increases, the number of epochs should decrease in order to avoid overfitting.
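The learning rate schedule can be sketched in Python as a piecewise-constant function of the epoch. The sketch assumes that the drop is repeated after every LRDP epochs and that epochs are counted from 1; the function name and parameter names are hypothetical.

```python
def learning_rate(epoch, initial_lr=0.01, drop_factor=0.1, drop_period=3):
    """Learning rate used during the given (1-based) epoch.

    The rate is multiplied by drop_factor after every drop_period epochs.
    """
    n_drops = (epoch - 1) // drop_period
    return initial_lr * drop_factor ** n_drops


# Schedule of the best model reported above: 8 epochs, LRDP = 3.
schedule = [learning_rate(epoch) for epoch in range(1, 9)]
```

Under these assumptions the rate is 0.01 for epochs 1 to 3, 0.001 for epochs 4 to 6 and 0.0001 for epochs 7 and 8. When the LRDP equals the number of epochs, no drop occurs and the rate stays at its initial value, which agrees with the behaviour described above.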
We applied the dropout technique to improve the performance of our CNN [34,36]. We trained models with fixed hyperparameter values both with and without dropout. If overfitting occurs, the classification performance should be better for the models trained with dropout than for those trained without it. These tests indicate that some overfitting occurs when the models are trained on the augmented data sets but not necessarily when they are trained on the original data set. Dropout was implemented after the first and the second fully-connected layers by randomly setting the output neurons to zero with a probability of 0.5.
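A minimal Python sketch of dropout on a vector of activations is given below. It uses the "inverted" variant, in which the surviving activations are rescaled during training so that nothing changes at test time; whether the original implementation rescales during training or at test time is not stated in the text, so this choice is an assumption.

```python
import random


def dropout(activations, p_drop, rng, training=True):
    """Set each activation to zero with probability p_drop during training.

    Surviving activations are divided by (1 - p_drop) so that the expected
    value of each output is unchanged (inverted dropout).
    """
    if not training:
        return list(activations)
    keep = 1.0 - p_drop
    return [a / keep if rng.random() >= p_drop else 0.0 for a in activations]


rng = random.Random(0)
dropped = dropout([1.0] * 1000, p_drop=0.5, rng=rng)
```

With p_drop = 0.5 roughly half of the outputs are zero and the rest are doubled, which prevents the fully-connected layers from relying on any single neuron.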

Results
The following results are based on images taken manually at the test site. The images have been taken at the same position where the camera will be installed. We trained two models on the original data set and several models on four different augmented data sets, in which s was 50, 200, 350 and 1100, respectively (see Table 2). The models with s ∈ {350, 1100} were trained only for testing the data augmentation algorithm. The effect of the data augmentation algorithm on classification performance is presented in Figure 8. The best performance (in TPR) of the two models trained on the original data set is 0.7362. The performance of the models trained on the augmented data sets varies between 0.8687 and 0.9984, which shows a clear improvement as the augmented training set size increases, especially compared with the models trained on the original set. Training with and without the dropout technique implied that overfitting occurs to some extent when data augmentation is applied and that dropout decreases this overfitting. The results were different for the original data set, in which case overfitting was insignificant. These results are logical because the enhancement in performance obtained by data augmentation is extracted from the original images, and thus the augmentation inevitably increases redundancy. The results for the original data set imply that the number of training examples was simply not large enough.
We tested the generalization of the models on 100 unseen images for each class, i.e., the data set for testing the models consisted of 8 × 100 = 800 images that the models had never seen before. According to these tests, the system achieves its best performance of 0.9463 with the augmented data set of size 2.44 × 10^6 (i.e., the color conversion step size s = 50), 8 epochs, an LRDP of 3, and dropout applied.
The receiver operating characteristic (ROC) curves and the area under the curve (AUC) for the 8 classes (i.e., bird species) are presented in Figures 9-12. The TPR values of the generalization tests are applied in these figures. The red curve is for the augmented data set and the blue curve is for the original data set.
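The two performance measures used here can be sketched in Python as follows. The AUC is computed through its rank interpretation (the probability that a randomly chosen positive example scores higher than a randomly chosen negative one, with ties counted as one half), which equals the area under the ROC curve; the function names and the toy data are hypothetical.

```python
def true_positive_rate(y_true, y_pred, positive):
    """Sensitivity: fraction of actual positives that are predicted positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp / (tp + fn)


def roc_auc(scores, labels):
    """Area under the ROC curve for one class treated as positive (label 1)."""
    pos = [s for s, l in zip(scores, labels) if l]
    neg = [s for s, l in zip(scores, labels) if not l]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


auc_perfect = roc_auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])   # 1.0
auc_partial = roc_auc([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0])   # 0.75
```

For the multi-class case, each species is treated in turn as the positive class against all others, which yields one curve and one AUC value per species.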

Discussion
We assembled a shallow CNN (i.e., with only three convolution layers) for image classification and demonstrated that the model is suitable for a real-world application, especially when the amount of training data is limited. We demonstrated that our data augmentation method significantly improves the performance of the classifier and that the desired performance as an image classifier can be achieved by applying it. Thus, we showed that data augmentation is crucial for the classification performance. We also showed that our model generalizes well to images never seen before and hence is applicable to a real-world problem. The number of images in the original data set has been increased since our first publication, resulting in a better performance of 0.9463 compared with the first result of 0.9100. It is noteworthy that this better result was achieved despite the increased number of classes, i.e., 8 compared with 6 [37].
The measured performance of the image classifier has been obtained without using the parameters supplied by the radar. Those parameters (i.e., the E- and V-vectors) provide additional and relevant a priori knowledge to the system, and they can turn a class misclassified from the images into the correct one. Data collection will continue at the test site, resulting in a larger original data set and thus, hopefully, better performance of the classifier. The number of classes will increase as more images of scarcer species are collected.
We are currently working on the collision detection problem, but no collisions have been observed so far, although the pilot wind turbine has been manually monitored for 30 months. It seems that collisions are quite rare in the research area, which makes the field testing of possible collision detection methods challenging. More research is required on possible deterrent methods, especially at the species or species group level.
We proposed a novel system for automatic bird identification as a real-world application. However, the system has restrictions; for example, images cannot be taken in pitch darkness or in poor visibility conditions. Infrared cameras may contribute to collision detection, but their contribution to classification is poor because all color information is lost. The proposed system is still in the installation phase, so we have not yet been able to test the complete system.

Figure 1. Color temperature and corresponding red, green and blue (RGB) values presented according to the Commission Internationale de l'Eclairage (CIE) 1964 10-degree color matching function.

Figure 2. Data example of the White-tailed Eagle. The image on the left is an augmented image with the color temperature 5600 K. The original image is in the middle with color temperature 7600 K. The image on the right is an augmented image with the color temperature 9600 K.

Figure 3. Schematic diagram of the system.

Figure 4. Example of a binary image acquired by the segmentation process. (a) An original image of the Herring Gull; (b) the respective binary image as a result of segmentation of the original image.

Figure 6. Diagram of the size estimate calculation.

Figure 7. The architecture of the convolutional neural network. The letters s and p in the max-pooling layers denote stride and padding, respectively. In the convolution layers, the first two numbers in the square brackets indicate the width and height of the respective convolution kernel and the third number is the depth. The number before the brackets is the number of feature maps in the respective convolution layer.

Figure 8. Effect of the data augmentation on classification performance. The red curve is for validation during training and the blue curve is for the generalization test. The actual True Positive Rate (TPR) values are used for s ∈ {350, 1100}, and the average TPR value of the models is used for s ∈ {50, 200}. The starting value for both curves is the average value of the two models trained on the original data set.

Figure 9. ROC curves for the White-tailed Eagle and the Lesser Black-backed Gull. (a) AUC for the original data set and for the augmented data set is 0.9137 and 0.9993, respectively; (b) AUC for the original data set and for the augmented data set is 0.7460 and 0.9496, respectively.

Figure 10. ROC curves for the Herring Gull and the Common Gull. (a) AUC for the original data set and for the augmented data set is 0.6926 and 0.9128, respectively; (b) AUC for the original data set and for the augmented data set is 0.6967 and 0.9644, respectively.

Figure 11. ROC curves for the Black-headed Gull and the Common/Arctic Tern. (a) AUC for the original data set and for the augmented data set is 0.7583 and 0.9972, respectively; (b) AUC for the original data set and for the augmented data set is 0.8111 and 0.9508, respectively.

Figure 12. ROC curves for the Great Cormorant and the Common Goldeneye. (a) AUC for the original data set and for the augmented data set is 0.8853 and 0.9870, respectively; (b) AUC for the original data set and for the augmented data set is 0.8807 and 0.9829, respectively.

Table 1. Number of images in the augmented data set for various values of the step, s.

Table 2. Convolutional Neural Network (CNN) performance (with the Support Vector Machine (SVM) as the actual classifier) for various numbers of epochs and Learning Rate Drop Periods (LRDP).