1. Introduction
Infrared camera traps are a non-intrusive monitoring method that can observe wildlife in nature reserves and collect monitoring images without causing disturbance. Fundamental studies, such as rare species detection, species distribution evaluation, and animal behavior monitoring, are supported by this method [1]. However, traditional analysis of the monitoring images takes a great deal of manual labor and time. Automatic wildlife identification based on deep learning can save time and labor costs, shorten the wildlife monitoring cycle, and enable timely study of species' distribution and living conditions, which promotes effective protection.
Deep learning, which automatically learns feature representations from data, has effectively improved the performance of target detection [2]. Therefore, deep learning has been used by many researchers for wildlife identification. Kamencay et al. [
3] studied identification methods for wild animal images based on principal component analysis, linear discriminant analysis, and local binary pattern histograms. The experimental results show that Local Binary Patterns Histograms (LBPH) outperform Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA) on small training datasets. Okafor et al. [
4] proposed a wildlife identification method based on deep learning and visual vocabulary, which used grayscale information and different spatial aggregation methods in the training process and reached an accuracy of 99.93%. By incorporating knowledge about object similarities from the visual and semantic domains during the transfer process, Tang et al. [
5] proposed the object similarity-based knowledge transfer method, which achieved state-of-the-art detection performance using the semi-supervised learning method. Owoeye et al. [
6] used the convolutional neural network (CNN) and recurrent neural network (RNN) to recognize the collective activities of a herd of sheep. Despite a deviation in the distribution of herd activity, the collective activities of the herd were accurately identified. Manohar et al. [
7] proposed an effective system for animal recognition and classification based on texture features, which obtains the required features from the local appearance and texture of the animals; the animals are then classified using the K-nearest neighbor (KNN) algorithm and a support vector machine (SVM). The final accuracy of the KNN algorithm was 66.7%, better than that of the SVM. Zhang et al. [
8] proposed a method for selecting filters in deep networks using a weak detector and a spatially weighted Fisher vector, which achieved superior results in the fine-grained identification of birds. However, this method does not adopt an end-to-end structure and needs to be fine-tuned for differences in data distribution in its intermediate stages. Liu et al. [
9] first used the "you only look once" (YOLO) model to detect wildlife areas and then proposed a two-channel convolutional neural network to identify the wildlife, achieving an accuracy of more than 90%. This method requires accurate marking of target positions, which reduces the efficiency of wildlife recognition. The above studies all detect and identify wildlife under certain conditions, but problems remain, such as unsatisfactory accuracy, high calibration cost, and high algorithm complexity.
The Deep Residual Network (ResNet) was proposed by He et al. [
10]. The 152-layer network was successfully trained and won the ILSVRC2015 championship with a Top-5 error rate (the fraction of images whose correct class is not among the five most probable predictions) of 3.57%, an outstanding result. To simplify the calculation process and reduce the number of structural parameters, Xie et al. [
11] improved the structure of ResNet and proposed a new structure, ResNeXt. Thanks to its excellent performance, it is widely used in target detection, image recognition [
12], and other applications [
13]. Niki et al. [
14] combined deep residual blocks with slice convolution to generate classification scores for particular food categories, achieving an accuracy of 90.27% on the Food-101 dataset. Based on the residual block concept, Peng et al. [
15] added an attention branch to the last convolutional layer of each block of the network, aiming to focus on finger micro-gestures and reduce noise from the wrist and background; a better recognition effect was achieved on the holographic 3D micro-gesture database (HoMG). Nigati et al. [
16] used the Visual Geometry Group network (VGGNet) and ResNet to automatically classify high-resolution remote sensing plant community images. The highest classification accuracies of the ResNet and VGGNet models were 91.83% and 89.56%, respectively, so the ResNet50 model gave better classification results. Koné et al. [
17] designed a layered residual network model to address the problem of large, error-prone analysis of histological images, achieving an accuracy of 0.99 in pathological segmentation tests. Most importantly, training based on ResNeXt or ResNet is a weakly supervised learning method, which only needs image-level annotations. Tang et al. [
18] proposed a model enhancing weakly supervised deformable part-based models (DPMs) by emphasizing the importance of the location and size of the initial class-specific root filter, using only image-level annotations. Extensive experimental results demonstrated that the model has competitive localization performance in weakly supervised object detection.
In summary, weakly supervised training can improve the efficiency of target detection and recognition. As a mature deep convolutional neural network, the deep residual network is widely used in image classification applications and has achieved satisfactory recognition results. Therefore, an integrated model combining the Squeeze-and-Excitation network (SENet) and ResNeXt was adopted to improve the efficiency and accuracy of recognizing large volumes of monitoring images. The rest of this paper is organized as follows. Section 1 reviews related work on automatic wildlife recognition. Section 2 introduces the structure of the multi-branch aggregation transformation residual network and SENet. Section 3 presents the structure of SE-ResNeXt for wildlife recognition. Section 4 evaluates the performance of SE-ResNeXt on two different datasets. The conclusions are drawn in Section 5.
2. Related Work
The residual network is a structure based on a convolutional neural network [
19]. Compared with a traditional neural network, the key difference in ResNet is the "shortcut connection", realized by directly adding the input to the output. This is equivalent to fusing low-level feature information into the top level, which makes it possible to retain some characteristic information of targets with small receptive fields. In addition, the "degradation" problem of deep neural networks is addressed: as the number of layers increases, the error rate rises and Stochastic Gradient Descent (SGD) optimization becomes more difficult, reducing the learning efficiency of the model. At the same time, the deeper the network, the more serious the vanishing-gradient problem it may suffer. To some extent, this problem can be alleviated by normalized initialization and intermediate normalization layers, which ensure that deeper networks converge; however, degradation still exists in deeper networks. In response, a residual module is introduced into the deep network to perform "residual learning". The entire network only needs to learn the residual between input and output, which greatly simplifies the learning objective and reduces the training complexity.
2.1. The Construction of Residual Block
The basic structure of the residual module is shown in Figure 1.
In this figure, x denotes the input and H(x) denotes the desired underlying mapping. The stacked nonlinear layers are expected to fit another mapping, F(x) = H(x) − x, so the original mapping is recast into F(x) + x. It is easier to optimize the residual mapping than the original, unreferenced mapping. In the extreme, if an identity mapping were optimal, it would be easier to push the residual to zero than to fit an identity mapping with a stack of nonlinear layers. The formulation F(x) + x can be realized by feedforward neural networks with "shortcut connections", i.e., connections that skip one or more layers. This simple addition greatly increases the training speed of the model and improves the training results without adding extra parameters or computational complexity.
Residual learning is adopted for each stacked layer. The module is defined as:

y = F(x, {W_i}) + x,    (1)

where x and y denote the input and output of the module, respectively, and F(x, {W_i}) represents the residual mapping to be learned. The operation F + x is performed by a shortcut connection and element-wise addition. The dimensions of x and F must be equal in Equation (1). If the dimensions are inconsistent, a linear projection W_s is performed by the shortcut connection to match the dimensions:

y = F(x, {W_i}) + W_s x.    (2)

The projection W_s is used only for matching dimensions.
This residual skipping structure breaks the convention that the output of layer n − 1 can only feed layer n as its input, so that the output of a given layer can skip several layers and serve directly as the input of a later layer. In this way, the error rate of the whole model does not rise after stacking multi-layer networks, which makes the extraction and classification of high-level semantic features feasible.
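The shortcut computation of Equations (1) and (2) can be sketched in a few lines of NumPy. Here `residual_fn` stands in for the stacked nonlinear layers F, and `Ws` is the optional projection matrix; both names are illustrative, not from the paper:

```python
import numpy as np

def residual_block(x, residual_fn, Ws=None):
    """Compute y = F(x) + x (Eq. 1), or y = F(x) + Ws @ x when the
    dimensions of F(x) and x differ (Eq. 2)."""
    Fx = residual_fn(x)
    shortcut = x if Ws is None else Ws @ x
    return Fx + shortcut

# Toy example: F is a small two-layer transform with ReLU.
rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(8, 8)), rng.normal(size=(8, 8))
F = lambda v: W2 @ np.maximum(W1 @ v, 0.0)

x = rng.normal(size=8)
y = residual_block(x, F)          # identity shortcut
assert np.allclose(y - F(x), x)   # the input is carried through unchanged
```

Because the shortcut is a plain addition, gradients flow to `x` directly through the identity path, which is the mechanism the text credits for avoiding degradation.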
2.2. Multi-Branch Aggregation Transformation Residual Network
While depth and width keep increasing, they begin to yield diminishing returns for existing models. To avoid this problem, this paper adopts a multi-branch aggregation transformation residual network called ResNeXt. The architecture adopts a strategy of repeating layers while exploiting the split-transform-merge strategy in an easy, extensible way. A module in the network performs a set of transformations, each on a low-dimensional embedding, whose outputs are aggregated by summation. This design allows extension to any large number of transformations without specialized designs.
The basic unit structure is shown in Figure 2.
Each unit is a bottleneck. In each of the 32 branches, a 1 × 1 convolution first computes a reduction, converting the input feature map into a four-channel feature map; a 3 × 3 convolution is then performed, followed by a 1 × 1 convolution that increases the channel number back to 256. Finally, the feature maps of the 32 branches are aggregated. The structure is a 32 × 4d structure, where 32 is the new degree of freedom introduced by ResNeXt, called cardinality.
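The split-transform-merge idea can be illustrated with a NumPy sketch. For simplicity, dense matrices stand in for the 1 × 1 and 3 × 3 convolutions (i.e., we look at a single spatial position); the weight names and scaling are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
C_in, C_out, cardinality, width = 256, 256, 32, 4

# Per-branch weights: reduce (1x1), transform (3x3 stand-in), expand (1x1).
reduce_w = rng.normal(size=(cardinality, width, C_in)) * 0.05
trans_w  = rng.normal(size=(cardinality, width, width)) * 0.05
expand_w = rng.normal(size=(cardinality, C_out, width)) * 0.05

def resnext_unit(x):
    """Split-transform-merge: each of the 32 branches embeds x into a
    4-dimensional space, transforms it, projects back to 256 channels,
    and the branch outputs are aggregated by summation."""
    out = np.zeros(C_out)
    for k in range(cardinality):
        low = reduce_w[k] @ x                   # 256 -> 4
        low = np.maximum(trans_w[k] @ low, 0.0) # 4 -> 4, with ReLU
        out += expand_w[k] @ low                # 4 -> 256
    return out + x                              # residual shortcut

x = rng.normal(size=C_in)
y = resnext_unit(x)
```

In practice this per-branch loop is implemented efficiently as a single grouped convolution with 32 groups, which is why cardinality adds capacity without a specialized design.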
2.3. SENet
SENet is a convolutional neural network structure proposed by Hu et al. [
20] in 2017. The structure consists of Squeeze, Excitation, and Reweight. Different from the traditional method of introducing a new spatial dimension for the fusion of feature channels, the network structure adopts a “feature recalibration” strategy to explicitly construct the interdependence between feature channels.
Figure 3 is the structural diagram of SENet. The number of feature channels of the input x is c1, and a feature with channel number c2 is obtained by a general transformation such as a series of convolutions. Different from a traditional CNN, the following three operations are then used to "recalibrate" the obtained features.
The first operation is Squeeze. The feature maps obtained by the previous operations are aggregated to produce a channel descriptor: each two-dimensional feature map u_c is compressed by global average pooling into a single real number, so the c feature maps yield a real-number column of size 1 × 1 × c, as expressed in Equation (3) given in [20]:

z_c = F_sq(u_c) = (1 / (H × W)) Σ_{i=1}^{H} Σ_{j=1}^{W} u_c(i, j).    (3)
The next operation is Excitation, also known as adaptive recalibration. Hu et al. [20] employ a simple gating mechanism with a sigmoid activation to fully capture channel-wise dependencies. This operation provides the model with a collection of per-channel modulation weights, which are vital for SENet. As given in [20]:

s = F_ex(z, W) = σ(g(z, W)) = σ(W_2 δ(W_1 z)),    (4)

where δ refers to the Rectified Linear Unit (ReLU) [21] function, W_1 ∈ R^{(c/r) × c}, and W_2 ∈ R^{c × (c/r)}, with r the dimension-reduction ratio.
Firstly, the feature dimension is reduced by a fully connected layer and activated by ReLU. Then it is brought back to the original dimension by a fully connected layer.
The final operation is to re-scale the feature maps. The weight s obtained through Excitation is applied to the corresponding feature map by multiplication, and the model then obtains the final output of the block, that is, the c2-channel feature in Figure 3. The re-calibration of the original feature is completed as in Equation (5) given in [20]:

x̃_c = F_scale(u_c, s_c) = s_c u_c,    (5)

where x̃ = [x̃_1, x̃_2, …, x̃_c] and F_scale(u_c, s_c) refers to channel-wise multiplication between the scalar s_c and the feature map u_c.
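The three SE operations map directly onto a few lines of NumPy. The sketch below follows Equations (3)-(5); the weight matrices `W1` and `W2` correspond to the two fully connected layers, with illustrative sizes and random values:

```python
import numpy as np

def se_block(U, W1, W2):
    """Squeeze-Excitation on a feature map U of shape (C, H, W).
    Squeeze:    global average pooling per channel (Eq. 3).
    Excitation: sigmoid(W2 @ relu(W1 @ z)) (Eq. 4).
    Re-scale:   channel-wise multiplication by s (Eq. 5)."""
    z = U.mean(axis=(1, 2))                                      # (C,)
    s = 1.0 / (1.0 + np.exp(-(W2 @ np.maximum(W1 @ z, 0.0))))    # (C,), in (0, 1)
    return s[:, None, None] * U

C, H, W, r = 8, 4, 4, 2
rng = np.random.default_rng(0)
W1 = rng.normal(size=(C // r, C)) * 0.1   # dimension reduction by ratio r
W2 = rng.normal(size=(C, C // r)) * 0.1   # restoration to C channels
U = rng.normal(size=(C, H, W))
X = se_block(U, W1, W2)

# Every channel is uniformly scaled by a single gate value in (0, 1).
scale = X[0] / U[0]
assert np.allclose(scale, scale.flat[0]) and 0.0 < scale.flat[0] < 1.0
```

Note that the block adds only two small fully connected layers per module (c²/r · 2 parameters), which is why the recalibration is nearly free compared with the convolutions it gates.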
3. The Design of SE-ResNeXt
To extract the monitoring image features more accurately and adequately, and thereby achieve higher accuracy and efficiency in automatic recognition, SENet and the multi-branch aggregation transformation are merged and improved to form the SE-ResNeXt structure in this section.
Firstly, the features of x are extracted and merged by the convolution group in the module, yielding a feature map with c channels. The feature map is then compressed by a Squeeze operation using global average pooling to obtain a real-number column of 1 × 1 × c. Next, the interdependencies between the channels are captured through an adaptive recalibration process to obtain scale factors. Finally, the features are recalibrated through the Scale operation.
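The sequence of steps above can be sketched as one combined block: aggregate the convolution-group branches, recalibrate the result with the Squeeze-Excitation functions, then add the shortcut. Dense channel-mixing matrices stand in for the convolutions, and all weights and sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
C, H, W, r, cardinality = 16, 4, 4, 4, 4

# Illustrative per-branch channel-mixing weights (stand-ins for the
# 1x1 / 3x3 convolutions of the convolution group).
branch_w = rng.normal(size=(cardinality, C, C)) * 0.1
# SE weights: reduction to C/r, then restoration to C.
W1 = rng.normal(size=(C // r, C)) * 0.1
W2 = rng.normal(size=(C, C // r)) * 0.1

def se_resnext_block(x):
    # 1) Convolution group: aggregate branch transformations by summation.
    u = sum(np.einsum('oc,chw->ohw', branch_w[k], x) for k in range(cardinality))
    # 2) Squeeze: global average pooling over the spatial dimensions.
    z = u.mean(axis=(1, 2))
    # 3) Excitation: per-channel gate values in (0, 1).
    s = 1.0 / (1.0 + np.exp(-(W2 @ np.maximum(W1 @ z, 0.0))))
    # 4) Scale the feature map and add the identity shortcut.
    return s[:, None, None] * u + x

x = rng.normal(size=(C, H, W))
y = se_resnext_block(x)
assert y.shape == x.shape
```

The key design choice mirrored here is that the SE gate is applied to the aggregated branch output before the shortcut addition, so recalibration reweights the learned residual rather than the identity path.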
Table 1 shows the network structure configurations of ResNet-50 and SE-ResNeXt-50. The numbers following fc indicate C/r and C, respectively. In Table 1, both network structures are built from 16 basic modules, each consisting of three convolution layers: the first and the last are 1 × 1 convolution layers and the middle one is a 3 × 3 convolution layer. The difference between the two structures is that the SE-ResNeXt-50 module contains a convolution group rather than a single convolution block; additionally, each SE-ResNeXt-50 module includes a fully connected operation. The following section evaluates the SE-ResNeXt model based on the above network structure.
5. Conclusions
In this paper, an integrated model based on ResNeXt and SENet is employed to realize automatic identification of wildlife. The model combines multi-branch aggregation with a Squeeze-and-Excitation structure, which automatically learns the importance of each feature channel and extracts important features more efficiently without increasing computational complexity. The model is evaluated on two different datasets in three sets of experiments, and the results show that SE-ResNeXt performs best. In the identification experiments on our own wildlife database, the highest recognition accuracy of SE-ResNeXt reached 93.5%. In the experiments on the Serengeti dataset, SE-ResNeXt also outperformed ResNet101, especially for species lacking obvious characteristics. Compared with existing recognition algorithms that require marking the locations of wildlife, SE-ResNeXt slightly improves recognition accuracy without location labels. To conclude, SE-ResNeXt outperforms other state-of-the-art methods in the automatic recognition of wildlife monitoring images. Moreover, it improves the efficiency of automatic wildlife recognition by reducing the location-marking workload, which greatly shortens the processing cycle of monitoring data and raises the intelligence level of wildlife protection.