An Integrated Wildlife Recognition Model Based on Multi-Branch Aggregation and Squeeze-And-Excitation Network

Featured Application: The method proposed in the article can be used for intelligent monitoring and identification of wildlife in the natural environment.

Abstract: Infrared camera trapping, which helps capture large volumes of wildlife images, is a widely used, non-intrusive monitoring method in wildlife surveillance. This method can greatly reduce the workload of zoologists through automatic image identification. To achieve higher accuracy in wildlife recognition, an integrated model based on multi-branch aggregation and the Squeeze-and-Excitation network is introduced. This model adopts multi-branch aggregation transformation to extract features, and uses the Squeeze-and-Excitation block to adaptively recalibrate channel-wise feature responses based on explicitly modeled interdependencies between channels. The efficacy of the integrated model is tested on two datasets: the Snapshot Serengeti dataset and our own dataset. From the experimental results on the Snapshot Serengeti dataset, the integrated model applies to the recognition of 26 wildlife species, with the highest accuracies in Top-1 (when the correct class is the most probable class) and Top-5 (when the correct class is within the five most probable classes) at 95.3% and 98.8%, respectively. Compared with the ROI-CNN algorithm and ResNet (Deep Residual Network) on our own dataset, the integrated model shows a maximum improvement of 4.4% in recognition accuracy.


Introduction
Infrared camera-traps are a non-intrusive monitoring method that can monitor wildlife in nature reserves and collect monitoring images without causing disturbance. Fundamental studies, such as rare species detection, species' distribution evaluation, animal behavior monitoring, and other vital studies [1], are supported by this method. However, traditional analysis of the monitoring images requires a great deal of manual labor and time. An automatic wildlife identification method based on deep learning can save time and labor costs, shorten the monitoring cycle, and enable timely study of wildlife distribution and living conditions, which promotes effective protection.
The method of automatically learning representative features from data based on deep learning has effectively improved the performance of target detection [2]. Therefore, deep learning has been used by many researchers for wildlife identification. Kamencay et al. [3] studied identification methods for wild animal images based on principal component analysis, linear discriminant analysis, and local binary pattern histograms. Their experimental results show that Local Binary Patterns Histograms (LBPH) improve the efficiency and accuracy of recognizing large volumes of monitoring images.

The rest of this paper is organized as follows. Section 1 reviews the related work of automatic wildlife recognition. Section 2 introduces the structure of a multi-branch aggregation transformation residual network and SENet. Section 3 shows the structure of SE-ResNeXt for wildlife recognition. Section 4 evaluates the performance of SE-ResNeXt on two different datasets. The conclusions are drawn in Section 5.

Related Work
The residual network is a structure based on a convolutional neural network [19]. Compared with the traditional neural network, the key difference of ResNet is the "shortcut connection", which is realized by directly adding the input to the output. This is equivalent to fusing the underlying feature information into the top level, which makes it possible to retain some of the characteristic information of small targets. In addition, the problem of "degradation" of deep neural networks is addressed: when the number of layers increases, the error rate correspondingly increases, and Stochastic Gradient Descent (SGD) optimization becomes more difficult, affecting the learning efficiency of the model. At the same time, the deeper the network, the more likely it is to suffer from serious vanishing-gradient problems. To some extent, this problem can be solved by normalized initialization and intermediate normalization to ensure the normal convergence of deeper networks. However, in deeper networks, degradation still exists. In response to the above situation, a residual module is introduced to the deep network to perform "residual learning". The entire network is simplified to learn the residual between input and output directly, which greatly reduces the learning objective and the training complexity.

The Construction of Residual Block
The basic structure of the residual module is shown in Figure 1 (a basic block of ResNet, adapted from a previous study [9]).

In this figure, x denotes the input and H(x) denotes the desired underlying mapping. The stacked nonlinear layers are expected to fit another mapping, F(x) = H(x) − x, so the original mapping is recast into F(x) + x. It is easier to optimize the residual mapping than to optimize the original, unreferenced mapping. In the extreme, if an identity mapping is optimal, it is easier to push the residual to zero than to fit an identity mapping with a stack of nonlinear layers. The formulation of F(x) + x can be realized by feedforward neural networks with "shortcut connections", i.e., connections that skip one or more layers. This simple addition can greatly increase the training speed of the model and improve the training results without adding extra parameters or computational complexity.
Residual learning is adopted for each stacked layer. The module is defined as in Equation (1):

y = F(x, {Wi}) + x, (1)
where x and y denote the input and output of the module, respectively. The function F(x, {Wi}) represents the residual mapping to be learned. The operation F + x is performed by a shortcut connection and element-wise addition. The dimensions of x and F must be equal in Equation (1). If the dimensions are inconsistent, a linear projection Ws is performed by the shortcut connection to match the dimensions, as in Equation (2):

y = F(x, {Wi}) + Ws x. (2)
The projection Ws is used only for matching dimensions. This residual skipping structure breaks the convention that the output of layer n − 1 of a traditional neural network can only be given to layer n as input, so that the output of a certain layer can directly cross several layers to serve as the input of a later layer. In this way, the error rate of the whole learning model does not increase after superimposing the multi-layer network, which makes the extraction and classification of high-level semantic features feasible.
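The shortcut formulation can be sketched numerically. The following is a minimal illustration (assuming toy fully connected residual functions and NumPy arrays, not the paper's convolutional implementation):

```python
import numpy as np

def residual_block(x, residual_fn, W_s=None):
    """y = F(x) + x, with an optional linear projection W_s applied
    on the shortcut when input and output dimensions differ (Eq. 2)."""
    F_x = residual_fn(x)                      # residual mapping F(x)
    shortcut = x if W_s is None else x @ W_s  # identity or projection
    return F_x + shortcut

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8)) * 0.1
relu_layer = lambda v: np.maximum(v @ W, 0.0)  # toy stacked layer

x = rng.normal(size=(1, 8))
y = residual_block(x, relu_layer)              # Equation (1): y = F(x) + x

# When F changes the dimension (8 -> 4), a projection W_s matches it (Eq. 2).
W_f = rng.normal(size=(8, 4)) * 0.1
W_s = rng.normal(size=(8, 4)) * 0.1
y_proj = residual_block(x, lambda v: np.maximum(v @ W_f, 0.0), W_s=W_s)
```

The identity path adds no parameters; the projection path is needed only when the residual branch changes the feature dimension.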

Multi-Branch Aggregation Transformation Residual Network
As depth and width keep increasing, they start to yield diminishing returns for existing models. To avoid this problem, this paper adopts a multi-branch aggregation transformation residual network called ResNeXt. The architecture adopts the strategy of repeating layers while exploiting the split-transform-merge strategy in an easy, extensible way. A module in the network performs a set of transformations, each on a low-dimensional embedding, whose outputs are aggregated by summation. This design can be extended to any large number of transformations without specialized designs.
The basic unit structure is shown in Figure 2.
Each unit is a bottleneck. First, a 1 × 1 convolution computes reductions before the other operations, converting the input feature map into a four-channel feature map. Then a 3 × 3 convolution is performed, followed by a 1 × 1 convolution that restores the channel number to 256. Finally, the feature maps of the 32 branches are aggregated. The structure is a 32 × 4d structure, where 32 is the new degree of freedom introduced by ResNeXt, called cardinality.
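The split-transform-merge aggregation can be sketched as follows. This is an illustrative NumPy reduction of the unit, in which fully connected branches stand in for the 1 × 1 and 3 × 3 convolutions and all weights are arbitrary:

```python
import numpy as np

def resnext_unit(x, branches):
    """Aggregated transformation: y = x + sum_i T_i(x), where each
    T_i is a low-dimensional bottleneck branch (cf. Figure 2)."""
    return x + sum(T(x) for T in branches)

def make_branch(rng, c_in=256, d=4):
    # Reduce 256 -> 4, transform 4 -> 4, expand 4 -> 256
    # (stand-ins for the 1x1, 3x3, and 1x1 convolutions).
    W1 = rng.normal(size=(c_in, d)) * 0.1
    W2 = rng.normal(size=(d, d)) * 0.1
    W3 = rng.normal(size=(d, c_in)) * 0.1
    return lambda v: np.maximum(np.maximum(v @ W1, 0) @ W2, 0) @ W3

rng = np.random.default_rng(0)
branches = [make_branch(rng) for _ in range(32)]  # cardinality = 32
x = rng.normal(size=(1, 256))
y = resnext_unit(x, branches)
```

Because every branch has the same topology, the cardinality can be raised without any specialized per-branch design, matching the extensibility claim above.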


SENet
SENet is a convolutional neural network structure proposed by Hu et al. [20] in 2017. The structure consists of Squeeze, Excitation, and Reweight operations. Different from the traditional method of introducing a new spatial dimension for the fusion of feature channels, the network adopts a "feature recalibration" strategy to explicitly model the interdependence between feature channels. Figure 3 is the structural diagram of SENet. The number of feature channels of the input x is c1, and a feature with channel number c2 is obtained by a general transformation such as a series of convolutions. Different from the traditional CNN, the following three operations are used to "recalibrate" the previously obtained features.

The first operation is Squeeze. The feature maps obtained by the previous operations are aggregated to produce a channel descriptor. Each two-dimensional feature map is compressed by global average pooling into a real number, so the output over the c feature maps is a real number column of 1 × 1 × c, expressed as in Equation (3) given in [20]:

z_c = F_sq(u_c) = (1 / (H × W)) Σ_{i=1..H} Σ_{j=1..W} u_c(i, j). (3)

The next operation is Excitation, also known as adaptive recalibration. Hu et al. [20] employ a simple gating mechanism with a sigmoid activation to fully capture channel-wise dependencies. This operation yields a collection of per-channel modulation weights, which are vital for SENet, as in Equation (4) given in [20]:

s = F_ex(z, W) = σ(W2 δ(W1 z)), (4)

where σ refers to the sigmoid activation, δ refers to the Rectified Linear Units (ReLU) [21] function, W1 ∈ R^(C/r × C), and W2 ∈ R^(C × C/r). The feature dimension is first reduced by a fully connected layer and activated by ReLU, and then brought back to the original dimension by a second fully connected layer.

The final operation is to re-scale the feature maps. The s_c obtained through Excitation is weighted onto the corresponding feature map by multiplication, and the model obtains the final output of the block, that is, the feature with channel number c2 in Figure 3. The recalibration of the original feature is completed as in Equation (5) given in [20]:

x̃_c = F_scale(u_c, s_c) = s_c · u_c, (5)

where x̃ = [x̃_1, x̃_2, ..., x̃_c] and F_scale(u_c, s_c) refers to channel-wise multiplication between the scalar s_c and the feature map u_c ∈ R^(H×W).
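The Squeeze-Excitation-Scale pipeline can be sketched compactly. The following NumPy illustration assumes arbitrary gate weights and a small feature map; it is a sketch of the mechanism, not the trained SENet:

```python
import numpy as np

def se_block(u, W1, W2):
    """Squeeze-Excitation-Scale over a (C, H, W) feature map u:
    squeeze by global average pooling, gate with sigmoid(W2 @ relu(W1 @ z)),
    then re-scale each channel by its modulation weight s_c."""
    z = u.mean(axis=(1, 2))                                  # squeeze: (C,)
    s = 1.0 / (1.0 + np.exp(-(W2 @ np.maximum(W1 @ z, 0))))  # excitation: (C,)
    return u * s[:, None, None]                              # scale per channel

C, H, W, r = 8, 4, 4, 2
rng = np.random.default_rng(0)
W1 = rng.normal(size=(C // r, C))  # reduction by ratio r
W2 = rng.normal(size=(C, C // r))  # restore to C channels
u = rng.normal(size=(C, H, W))
x_tilde = se_block(u, W1, W2)
```

Since the sigmoid gate lies strictly in (0, 1), the block can only attenuate channels relative to the input, which is exactly the "recalibration" behaviour described above.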

The Design of SE-ResNeXt
To extract the monitoring image features more accurately and adequately, and thus achieve higher accuracy and efficiency in automatic recognition, SENet and the multi-branch aggregation transformation are merged and improved to form the SE-ResNeXt structure in this section. The basic module is shown in Figure 4.

Firstly, the features of x are extracted and merged by the convolution group in the module, and a feature map with channel number c is obtained. A Squeeze operation compresses the feature map by global average pooling into a real number column of 1 × 1 × c. Then, the interdependencies between the channels are captured to obtain scale factors through an adaptive recalibration process. Finally, the features are recalibrated through the Scale operation. Table 1 shows the network structure configuration of ResNet-50 and SE-ResNeXt-50.
The numbers following fc indicate C/r and C, respectively. In Table 1, the basic modules of both network structures consist of three convolution layers, with 16 modules in total. The first and the last are 1 × 1 convolution layers and the middle one is a 3 × 3 convolution layer. The difference between the two network structures is that a convolution group, rather than a single convolution block, exists in the SE-ResNeXt-50 module. Additionally, each module has a fully connected operation. This paper evaluates the SE-ResNeXt model based on the above network structure in the following section.
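Combining the two mechanisms gives the SE-ResNeXt unit. The following hedged sketch works on channel vectors (spatial dimensions are omitted, so the squeeze step is implicit), with a small cardinality and arbitrary weights purely for illustration:

```python
import numpy as np

def se_resnext_unit(x, branches, W1, W2):
    """SE-ResNeXt sketch on channel vectors: aggregate the branch
    transformations, recalibrate channels with the SE gate, then
    add the identity shortcut."""
    u = sum(T(x) for T in branches)                              # split-transform-merge
    s = 1.0 / (1.0 + np.exp(-(np.maximum(u @ W1.T, 0) @ W2.T)))  # SE gate
    return x + u * s                                             # scale + shortcut

rng = np.random.default_rng(0)
C, d, r = 16, 2, 4

def make_branch():
    Wa = rng.normal(size=(C, d)) * 0.1
    Wb = rng.normal(size=(d, C)) * 0.1
    return lambda v: np.maximum(v @ Wa, 0) @ Wb

branches = [make_branch() for _ in range(4)]  # small cardinality for the demo
W1 = rng.normal(size=(C // r, C))
W2 = rng.normal(size=(C, C // r))
x = rng.normal(size=(1, C))
y = se_resnext_unit(x, branches, W1, W2)
```

The SE gate sits between the aggregation and the shortcut addition, matching the module order described for Figure 4.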

Results and Discussion
Experiments for our study are performed on the PaddlePaddle platform. The system environment is Ubuntu 16.04 and the programming language is Python. The hardware configuration comprises an E5-2620 Central Processing Unit (CPU) and a GTX 1080 Ti Graphics Processing Unit (GPU).

Our Own Dataset
In this section, the images of wildlife were acquired by infrared cameras in the Saihanwula National Nature Reserve in Inner Mongolia from 2010 to 2019 [22]. Thousands of wildlife monitoring images were obtained and a sample library of wildlife monitoring images was established. The monitoring images are 24-bit RGB true color images with a resolution of 2560 × 1920. In this paper, five common terrestrial protected species [23] were selected: Red Deer, Goral, Roe Deer, Lynx, and Badger. Table 2 shows the image numbers for the five species in the image sample library. Figure 5 shows a sample of the wildlife monitoring images.

Figure 5a shows a monitoring image of the red deer, in which the individual target is complete and its outline is clear. However, high-quality images are few in the dataset, as most images have various defects, such as obstructions (Figure 5b), indistinguishable backgrounds (Figure 5c), and targets being too far away (Figure 5d). When such images are manually labeled, subjectivity affects the accuracy of feature extraction, which in turn affects the feature learning ability of the model.

Comparison Experiments of ResNet and SE-ResNeXt in Different Numbers of Layers
Based on our own dataset, ResNet and SE-ResNeXt with 50, 101, and 152 layers were used for comparison experiments. The learning rates in this section were set to 0.1 and 0.01. The comparison is conducted by analyzing the training results from two groups of experiments: one with the same number of layers under different structures, and the other with different numbers of layers under the same structure. Figure 6 shows the results of the comparison experiments where the number of layers differs.
As shown in the figures above, the maximum training accuracy of ResNet is approximately 83.6% regardless of the number of layers, and the same holds for SE-ResNeXt across different layer counts. The difference is that the maximum training accuracy of SE-ResNeXt is higher than that of ResNet, reaching close to 90.6%, indicating that SE-ResNeXt achieves better recognition performance. Even if the final accuracy is similar, the training process differs for models with different layers. Before 70 epochs, the deeper the model, the lower the training accuracy at the same number of iterations. This is because the model has more parameters when the number of layers increases, so it needs more iterations to reach higher precision.

Figures 7 and 8 show the training accuracy variations with epoch when the learning rates are 0.1 and 0.01, respectively. When the learning rate is 0.1, the training accuracy of SE-ResNeXt is higher than that of ResNet throughout the training process, regardless of the number of layers. Moreover, as the epoch increases, the difference in accuracy also increases. When the learning rate is reduced to 0.01, the difference decreases. During training, there is almost no difference between SE-ResNeXt and ResNet when the number of layers is 50. As the number of layers increases, the difference between SE-ResNeXt and ResNet grows marginally.

Table 3 displays the test accuracy of the six models in the two sets of experiments. It can be observed that the accuracy at a learning rate of 0.01 is higher than that at 0.1 for all six models. In addition, the learning rate has a significantly greater impact on ResNet than on SE-ResNeXt. In the two experiments with different learning rates, the accuracy differences of ResNet50 and ResNet152 both reached 2.7%, while the accuracy difference of SE-ResNeXt was only 1.6%.
This shows that reducing the learning rate is beneficial for improving recognition accuracy. At the same time, it can be seen intuitively from the table that the test accuracy of SE-ResNeXt is always higher than that of ResNet when the models have the same number of layers; the maximum difference reaches 4.4%. From the table, it can be concluded that SE-ResNeXt101 achieves the best test results when the learning rate is 0.01.

Comparison Experiments of Accuracy for Each Category
To further validate the advantages of the integrated model, the accuracy of each category is compared in this section. According to the above experiments, ResNet152 and SE-ResNeXt101, which achieved the highest test accuracy in their respective structures, were compared. A confusion matrix is used to compare the accuracy of the two models for each category. SE-ResNeXt101 has higher recognition accuracy for all five species than ResNet152, and the maximum difference is 4.4%. For both models, the performance in recognizing Roe Deer is worse than in the other categories. This is because Roe Deer resembles Red Deer, but the number of Roe Deer images is only half that of Red Deer and their image quality is lower. When identifying Red Deer, with the help of many high-quality images, the accuracy of SE-ResNeXt101 is much higher than that of ResNet152. The results prove that SE-ResNeXt achieves better performance in recognition tasks on our own dataset.
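Per-category accuracy from a confusion matrix can be computed as follows; this is a generic sketch with toy labels, not the evaluation code used in the paper:

```python
import numpy as np

def per_class_accuracy(y_true, y_pred, n_classes):
    """Build a confusion matrix and return per-class accuracy
    (diagonal count over row total, i.e., recall per class)."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return np.diag(cm) / cm.sum(axis=1)

# Toy labels for three classes; one of the two class-1 samples is
# confused with class 0, giving 50% accuracy for class 1.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 0, 2, 2]
acc = per_class_accuracy(y_true, y_pred, 3)
```

Row-normalizing the confusion matrix in this way is what exposes asymmetries such as Roe Deer being misread as Red Deer while the reverse direction stays accurate.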

Comparative Experiment with Existing ROI-CNN Algorithm
Liu et al. proposed an automatic recognition algorithm for wildlife based on ROI-CNN [9], which achieved a recognition accuracy of 91.2%. The algorithm first extracts the region of interest from the images to obtain ROI images, which are then imported into the model for training. The image labeling step involves a large amount of work, so the approach is regarded as strongly supervised. To verify the advantages of the integrated model, the same training samples are used, and the test accuracy of each network structure is shown in Table 4. From Table 4, ResNet50, SE-ResNeXt50, and SE-ResNeXt101 perform slightly worse on the same samples than the ROI-CNN algorithm, but their accuracies are all above 90%. The test accuracy of SE-ResNeXt152 reaches 91.4%, slightly higher than that of Liu's recognition algorithm. Considering labor and time costs, manual labeling does not need to be performed when using our integrated model, so the recognition efficiency of the model is better than that of Liu's algorithm.


Comparative Experiments with Snapshot Serengeti Dataset
To further validate the model, experiments on a public dataset called the Snapshot Serengeti dataset [24] were conducted. It was acquired with 225 camera-traps placed in Serengeti National Park, Tanzania. More than one million sets of pictures were acquired, each set containing 1–3 photographs. This dataset allows the computer science community to study and overcome the challenges presented in camera-trapping frameworks. Additionally, it helps ecology researchers and educators in their camera-trap studies. In this work, 26 classes were selected from the dataset for recognition (they are listed in Table 5). Gomez-Villa et al. [25] adopted six very deep convolutional neural networks (ConvNets) to train on the Snapshot Serengeti dataset and obtained satisfactory results. The six models are AlexNet [26], VGG Net [27], GoogLeNet [28], ResNet-50, ResNet-101, and ResNet-152, which are the state of the art in object recognition. Accuracy on the validation set is used as the performance metric. Top-1 (when the correct class is the most probable class) and Top-5 (when the correct class is within the five most probable classes) accuracies are presented to show how well the model performs in ranking possible species.
The best results in Gomez-Villa's experiments were obtained using ResNet-101, showing 88.9% for Top-1 and 98.1% for Top-5. Using the same dataset, this paper employs SE-ResNeXt50 and SE-ResNeXt101 to compare with Gomez-Villa's best results. Figure 10 shows the results of experiments, including Top-1 and Top-5 results alongside each experiment.
The results reach accuracies of 88.8% and 97.1% for Top-1 and Top-5, respectively, when the model is SE-ResNeXt50. When the model is SE-ResNeXt101, the results are 95.3% and 98.8% for Top-1 and Top-5, respectively. Figure 10 shows that the Top-1 and Top-5 of SE-ResNeXt50 differ only subtly from those of ResNet101. When trained with SE-ResNeXt101, the Top-1 accuracy reaches 95.3%, 6.4% higher than that of ResNet101. The Top-5 value is also slightly higher than that of ResNet101, which proves that SE-ResNeXt performs better than other state-of-the-art models in object recognition when using the same dataset.

Table 6 compares the accuracy for each species using the ResNet-101 in Gomez-Villa's experiments and the SE-ResNeXt101 in this paper. Table 6 shows that SE-ResNeXt101 has an accuracy of more than 90% for all 26 species. The lowest accuracy is 91.7% on Wildebeest and the highest is 99.8% on Human. As for ResNet-101, the lowest accuracy is 65% on Grants gazelle and the highest is 99.5% on Guinea fowl. For Grants gazelle, the accuracy of SE-ResNeXt101 reaches 96.2%, surpassing that of ResNet-101. According to the statistics, the average recognition accuracies of SE-ResNeXt101 and ResNet-101 over the 26 species are 95.3% and 88.9%, respectively. When the species characteristics are obvious, for example Giraffe and Human, the performance of the two models is similar. In thirteen classes, SE-ResNeXt101 outperformed ResNet-101, and the maximum difference reached 31.2% when recognizing Grants gazelle. The color of the Grants gazelle's hide is similar to that of the background, which makes recognition more difficult. Similarly, when an identification problem involves fine-grained classification, for example female and male Lions, the performance of ResNet-101 is obviously poor, while the accuracy of SE-ResNeXt101 remains above 90%.
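The Top-1 and Top-5 metrics used throughout this comparison can be computed generically as follows (a sketch with toy scores, not the paper's evaluation code):

```python
import numpy as np

def topk_accuracy(scores, labels, k):
    """Fraction of samples whose true class is among the k classes
    with the highest predicted scores."""
    topk = np.argsort(scores, axis=1)[:, -k:]   # indices of the k best classes
    hits = [label in row for row, label in zip(topk, labels)]
    return float(np.mean(hits))

# Toy example: sample 0 misses Top-1 but is caught by Top-2.
scores = np.array([[0.1, 0.7, 0.2],
                   [0.5, 0.3, 0.2],
                   [0.2, 0.2, 0.6]])
labels = np.array([2, 0, 2])
top1 = topk_accuracy(scores, labels, 1)
top2 = topk_accuracy(scores, labels, 2)
```

Top-k accuracy is monotonically non-decreasing in k, which is why Top-5 figures are always at least as high as Top-1 in Figure 10.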
Figure 10. Experiments for the Snapshot Serengeti dataset.

From the above results and discussions, it is concluded that SE-ResNeXt can obtain the differential characteristics of wildlife monitoring images through multi-branch aggregation and the "feature recalibration" strategy, which improves recognition performance without marking the locations of wildlife. Better experimental results are obtained on different datasets, which indicates that the integrated model has better generalization capabilities and performs more stably in recognition.

Conclusions
In this paper, an integrated model based on ResNeXt and SENet is employed to realize automatic identification of wildlife. This model combines multi-branch aggregation and a Squeeze-and-Excitation structure, which can automatically acquire the importance of each feature channel and extract the important features more efficiently without increasing computational complexity. The model is evaluated on two different datasets in three sets of experiments, with the results showing that the performance of SE-ResNeXt is the best. In the identification experiments on our own wildlife database, the highest recognition accuracy of SE-ResNeXt reached 93.5%. In the experiment on the Serengeti dataset, SE-ResNeXt also outperformed ResNet101, especially for species lacking obvious characteristics. Compared with the existing recognition algorithm based on marking locations of wildlife, SE-ResNeXt can slightly improve the recognition accuracy without labeling images. To conclude, SE-ResNeXt outperforms other state-of-the-art methods in automatic recognition of wildlife monitoring images. Moreover, it improves the efficiency of automatic wildlife recognition by reducing the location-marking workload, which greatly shortens the processing cycle of monitoring data and improves the intelligence level of wildlife protection.