Large-Scale Fine-Grained Bird Recognition Based on a Triplet Network and Bilinear Model

The main purpose of fine-grained classification is to distinguish among many subcategories of a single basic category, such as birds or flowers. We propose a model based on a triplet network and bilinear methods for fine-grained bird identification. Our proposed model can be trained in an end-to-end manner, which effectively increases the inter-class distance of the features extracted by the network and improves the accuracy of bird recognition. When experimentally tested on 1096 bird species in a custom-built dataset and on Caltech-UCSD (a public bird dataset), the model achieved accuracies of 88.91% and 85.58%, respectively. The experimental results confirm the high generalization ability of our model in fine-grained image classification. Moreover, our model requires no additional manual annotation information such as object-labeling frames and part-labeling points, which guarantees good versatility and robustness in fine-grained bird recognition.


Introduction
Image classification is a classical research topic in the computer vision field. Traditional image classification mainly categorizes semantic-level images or instance-level images. Semantic-level classification includes scene recognition [1,2] and object recognition [3,4], where the latter identifies different categories of objects such as cats and dogs. Meanwhile, instance-level classification distinguishes among individuals of an object, such as faces. Located between these two types, fine-grained image classification provides a more detailed class precision than coarse-grained image classification (such as object recognition) and detects subtle differences among the classes, often consisting of small local differences. For example, fine-grained classification distinguishes different types of birds [5], dogs [6], flowers [7], or any other object of interest. Fine-grained image classification is especially concerned with identifying the important distinguishing features. Such fine-grained features must be extracted against complex image backgrounds, and interference factors such as illumination, deformation, occlusion, and ambient noise should be reduced as far as possible [8]. The acquisition of fine-grained features is more complex than that of coarse-grained features, and relies on image annotation to determine the complex parameters of the model while avoiding the over-fitting problem caused by small amounts of data. Therefore, choosing an appropriate convolutional neural network structure and improving its feature extraction and connectivity are key steps.
Fine-grained image classification can be divided into strongly supervised and weakly supervised approaches. In the strongly supervised approach, the category labels of the images during model training are supplemented by additional manual annotation information such as object-labeling boxes and part-labeling points. The authors of [9] detected the object levels and local areas in fine-grained images using a region-based convolutional neural network (R-CNN) algorithm. During the training phase, the R-CNN algorithm must mark the object frame and the part-labeling point. The object-labeling frame is also required in the test images. In [10], local areas were detected by a pose normalization algorithm. The images were cropped around the detected label box, and the local information at different levels was extracted for pose alignment. Finally, the convolution characteristics of the different layers were obtained. The model was constructed from two modules: one for local positioning, the other for feature learning of the global and local image blocks. However, the practicality of pose normalization is limited by the high expense of acquiring the annotation information.
Weakly supervised fine-grained image classification uses labels alone, without requiring additional annotation information. As local-area information is essential in fine-grained image classification, the detection of local areas improves the performance of weakly supervised fine-grained image classification. The first algorithm with no reliance on annotation information was proposed in [11]. This algorithm uses two levels of features (object-level and local-level) and completes the fine-grained image classification using category labels only. The authors of [12] extracted local-area information based on several essential points derived from the CNN features [13]. A bilinear CNN model that performs local-area detection and feature extraction using two networks was proposed in [14].
Birds are an important part of natural ecosystems, and their correct recognition is very important for bird protection and research. However, classifying birds correctly is very difficult because the differences between species are very subtle. Therefore, this paper focuses on the fine-grained identification of birds. We propose a network based on the triplet network and the bilinear model. The inputs of the network consist of three images: a pre-selected image, another image of the same category, and an image from a different category. Based on our experiments, we adopted the deep-learning architecture Xception as the basic network and divided the whole architecture into two branches. One branch processes the features extracted by Xception and obtains the bilinear features; the output of its fully connected layer is then connected to an external output for category prediction. The other branch obtains a 2048-dimensional feature representation after global-average pooling and then computes the squared distances to an identical and to a different specimen. Based on these distances, two images are judged as being of the same or different species. The accuracy of this method was 88.91% on the large-scale bird dataset built by us, and 85.58% on the CUB200-2011 public bird dataset. Our model has good application prospects in bird recognition and protection.

Dataset and Evaluation Metric
Birds-1096: Our own database (Birds-1096) includes 1096 bird categories, each containing 200-350 images (giving a total of 459,828 images). The original dataset is composed of images of different resolutions; we resized each image to (299, 299, 3) for later use. We randomly selected 100 pictures from each category for testing, thereby assigning 109,600 images to the test set. Targets in the same category have diverse postures and large illumination changes, whereas the between-category differences are very subtle, with similar shapes and colors of the target. Some images from Birds-1096 are displayed in Figure 1.
CUB200-2011: The public Caltech-UCSD Birds-200-2011 dataset [15] contains 200 bird subcategories. Each subcategory consists of about 30 images for training and 11-30 images for testing, and each image has detailed annotations: a subcategory label, an object bounding box, 15 part locations, and 312 binary attributes. All attributes are visual in nature, pertaining to the color, pattern, or shape of a particular part. Some images from the CUB200-2011 dataset are displayed in Figure 2.
The classification performances of our approach and another method (for comparison) were evaluated by the accuracy metric, a widely used performance index in fine-grained image classification studies [16,17].

Xception
Xception is an improvement over InceptionV3 [18] that uses depthwise separable convolution [19]. In traditional convolutional networks, the convolution kernel is deep, and the convolutional layer looks for correlations across space and depth simultaneously [20]. The basic idea of Xception is that cross-channel correlation and spatial correlation can be completely separated, so it is best not to map them jointly. Instead of splitting the input data into several compressed data blocks, the spatial correlation is mapped separately for each output channel, and a 1 × 1 pointwise convolution is then performed to obtain the cross-channel correlation. In other words, the 3D mapping is replaced with a 2D + 1D mapping: a spatial convolution performed separately for each channel, followed by a 1 × 1 convolution across channels, which can be thought of as first finding correlations across a 2D space and then across a 1D (channel) space. The depthwise separable convolution of Xception increases the network width, which not only improves the classification accuracy but also enhances the network's ability to learn subtle features. Thus, it is a feasible module for fine-grained image classification. The structure of Xception is shown in Figure 3. Each image was resized to (299, 299, 3) and normalized. We used Inception v3 [21], Resnet-50 [22], and Xception [19] pre-trained on ImageNet [23] and fine-tuned on Birds-1096 and CUB200-2011. Table 1 shows the accuracies of the compared models trained on the two bird datasets, and the corresponding accuracy (Acc) curves are plotted in Figure 4.
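The 2D + 1D decomposition above can be sketched numerically. The following NumPy example (an illustrative sketch, not the actual Xception code; all shapes and names are our own choices) performs the depthwise step with one spatial kernel per channel, follows it with the 1 × 1 pointwise step, and compares the parameter counts with those of a standard convolution:

```python
import numpy as np

def depthwise_separable_conv(x, depth_kernels, point_weights):
    """Depthwise separable convolution on an (H, W, C_in) input.

    depth_kernels: (k, k, C_in) -- one spatial kernel per input channel.
    point_weights: (C_in, C_out) -- the 1 x 1 convolution mixing channels.
    """
    h, w, c_in = x.shape
    k = depth_kernels.shape[0]
    oh, ow = h - k + 1, w - k + 1  # 'valid' padding, stride 1

    # Depthwise step: each channel is convolved with its own 2D kernel.
    depth_out = np.zeros((oh, ow, c_in))
    for c in range(c_in):
        for i in range(oh):
            for j in range(ow):
                depth_out[i, j, c] = np.sum(
                    x[i:i + k, j:j + k, c] * depth_kernels[:, :, c])

    # Pointwise step: the 1 x 1 convolution captures cross-channel correlation.
    return depth_out @ point_weights  # (oh, ow, C_out)

rng = np.random.default_rng(0)
x = rng.random((8, 8, 3))
dk = rng.random((3, 3, 3))
pw = rng.random((3, 16))
y = depthwise_separable_conv(x, dk, pw)
print(y.shape)  # (6, 6, 16)

# Parameter comparison against a standard 3x3 convolution with C_out = 16:
standard = 3 * 3 * 3 * 16        # k*k*C_in*C_out = 432
separable = 3 * 3 * 3 + 3 * 16   # k*k*C_in + C_in*C_out = 75
```

The parameter count shows why the separation frees capacity: the same receptive field costs far fewer weights, which can be spent on network width instead.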
Based on the experimental results, we selected Xception as our basic network. The Xception model yielded higher accuracy than Resnet-50 and Inception-v3 on both datasets, indicating that the Xception model better learns the subtle features needed in fine-grained image classification. The lower accuracy on the CUB200-2011 dataset than on Birds-1096 is due to the lack of training samples in each category.

Fully Shared Bilinear Model
Deep learning is successful largely because it integrates the original decentralized processing (feature extraction and model training) into a complete system, enabling overall end-to-end optimization training. Lin et al. [24] designed an end-to-end network model called bilinear CNN (B-CNN), which achieved a very high accuracy on the CUB200-2011 dataset with a weakly supervised fine-grained classification model.
A B-CNN for image classification (see Figure 5) consists of a quadruple B = (f_A, f_B, P, C). Here, f_A and f_B are feature functions based on CNN A and CNN B, respectively, P is a pooling function, and C is a classification function. Each feature function is a mapping f : L × I → R^(K×D) that takes an image I ∈ I and a location l ∈ L and outputs a feature of size K × D. Outputs are combined at each location using the matrix outer product; i.e., the bilinear combination of f_A and f_B at location l is given by bilinear(l, I, f_A, f_B) = f_A(l, I)^T f_B(l, I). Both f_A and f_B must have the same feature dimension K to be compatible; the value of K depends on the particular model. The pooling function P aggregates the bilinear combination of features across all locations in the image to obtain a global image representation Φ(I). We use sum pooling in all our experiments, i.e., Φ(I) = Σ_{l∈L} bilinear(l, I, f_A, f_B). Note that pooling ignores the feature locations, so the bilinear feature Φ(I) is an orderless representation. If f_A and f_B extract features of size K × M and K × N, respectively, then Φ(I) has size M × N. The bilinear feature is a general-purpose image representation that can be used with a classifier C. Intuitively, the outer product conditions the outputs of f_A and f_B on each other by considering their pairwise interactions, similar to the feature expansion in a quadratic kernel.
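The sum of per-location outer products collapses to a single matrix product, which is how bilinear pooling is usually implemented. Below is a minimal NumPy sketch (shapes and names are illustrative choices of ours, with one feature vector per spatial location):

```python
import numpy as np

def bilinear_pool(fa, fb):
    """Sum-pooled bilinear feature Phi(I) of two feature maps.

    fa: (H, W, M) features from CNN A, fb: (H, W, N) features from CNN B,
    computed on the same H x W grid of locations. Returns an (M, N) matrix.
    """
    h, w, m = fa.shape
    n = fb.shape[2]
    fa = fa.reshape(h * w, m)  # one row per location l
    fb = fb.reshape(h * w, n)
    # Summing the outer products over all locations equals one matrix product.
    return fa.T @ fb

rng = np.random.default_rng(0)
fa = rng.random((14, 14, 128))
fb = rng.random((14, 14, 128))
phi = bilinear_pool(fa, fb)
print(phi.shape)  # (128, 128)
```

Because the locations are summed out, Φ(I) is indeed orderless: permuting the spatial grid of both feature maps in the same way leaves the result unchanged.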
The bilinear model can be divided into (a) non-shared, (b) partially shared, and (c) fully shared variants [25]. Here, we adopt a fully shared approach based on Xception (see Figure 6).

Triplet Networks
The triplet network (inspired by the Siamese network) [26,27] is composed of three instances of the same feedforward network (with shared parameters). When fed with three samples, the network outputs two intermediate values, namely, the L2 distances between the embedded representations of two of its inputs and the representation of the third input [28]. The three inputs are denoted as x, x+, and x−, and the embedded representation of the network is denoted as Net(x). The penultimate layer then outputs the following vector:

Net(x, x+, x−) = [ ||Net(x) − Net(x+)||_2, ||Net(x) − Net(x−)||_2 ] (3)

Equation (3) encodes the pair of distances between the x+ and x− inputs and the reference input x. The training process makes the distances between images of different categories larger than the distances between images of the same class [29]. Figure 7 shows the structure of a triplet network.
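The distance pair of Equation (3) can be sketched as follows. The embedding here is a stand-in random linear projection of our own choosing, not the trained network; the point is only the shared-parameter structure and the distance ordering that training aims to enforce:

```python
import numpy as np

def triplet_distances(net, x, x_pos, x_neg):
    """Pair of L2 distances output by a triplet network: from the reference
    x to the same-class sample x+ and to the different-class sample x-.
    `net` is the embedding function shared by all three inputs."""
    e, e_pos, e_neg = net(x), net(x_pos), net(x_neg)
    return np.array([np.linalg.norm(e - e_pos),   # d(x, x+)
                     np.linalg.norm(e - e_neg)])  # d(x, x-)

# Stand-in embedding: a fixed random projection (the real model would use
# the 2048-d global-average-pooled backbone features instead).
rng = np.random.default_rng(0)
W = rng.standard_normal((2048, 64))
net = lambda v: v @ W

x = rng.standard_normal(2048)
x_pos = x + 0.01 * rng.standard_normal(2048)  # same class: close to x
x_neg = rng.standard_normal(2048)             # different class

d = triplet_distances(net, x, x_pos, x_neg)
print(d[0] < d[1])  # the ordering training aims to enforce
```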

The Architecture
Our architecture combines the triplet network with a bilinear model. As an example, we assume a (299 × 299 × 3)-sized image. After being processed by Xception, we retrieve a (14 × 14 × 2048)-sized feature map. As mentioned above, one of the branches operates on the characteristics extracted by Xception. Adding a global-average pooling layer [30] remarkably improves the localization ability of the CNN, despite its training on image-level labels [31]. The pooled averaging outputs a 2048-dimensional vector. The distance between this vector and another 2048-dimensional vector is obtained, and whether this distance represents a positive or a negative sample pair is predicted by two neurons selected for that purpose. The other branch reduces the feature dimension through a (1 × 1 × 128)-sized filter, computes the bilinear vector, and passes it to a fully connected layer for output.
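The shape bookkeeping of the two branches can be sketched as follows. This is a NumPy mock-up with a random stand-in for Xception; all function names are ours, and only the tensor shapes follow the text:

```python
import numpy as np

rng = np.random.default_rng(1)

def backbone(img):
    # Stand-in for Xception: a (299, 299, 3) image maps to a
    # (14, 14, 2048) feature map (the shapes stated in the text).
    return rng.random((14, 14, 2048))

def forward(img, w_reduce):
    feat = backbone(img)                # (14, 14, 2048)

    # Branch 1: global average pooling -> 2048-d embedding used for
    # the triplet-distance computation.
    embedding = feat.mean(axis=(0, 1))  # (2048,)

    # Branch 2: 1 x 1 x 128 dimension reduction, then fully shared
    # bilinear pooling; the result feeds the fully connected layer.
    reduced = feat @ w_reduce           # (14, 14, 128)
    flat = reduced.reshape(-1, 128)     # one row per spatial location
    bilinear = (flat.T @ flat).ravel()  # (128 * 128,)

    return embedding, bilinear

w_reduce = rng.random((2048, 128))      # the 1x1 convolution as a matrix
emb, bil = forward(None, w_reduce)
print(emb.shape, bil.shape)  # (2048,) (16384,)
```

Since the bilinear branch is fully shared, the same reduced feature map plays the role of both f_A and f_B, which is why a single `flat.T @ flat` suffices.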

Train and Test
The training and testing steps are given below.
Step 1: Image data normalization. First, the image was scaled to (299 × 299) pixels, and each pixel of the image was converted to a floating-point type and normalized to [−1, +1] by the following formula:

J = I/127.5 − 1,

where I is the image pixel matrix, and J is the result of data type conversion and normalization.
Step 2: Model parameter initialization. Neural networks are commonly trained by fine-tuning, which extracts the features using a publicly pre-trained model and reuses them in the targeted classification. Fine-tuning does not require a complete retraining of the model, which improves the efficiency and achieves a good result with fewer iterations. We initialized the convolutional layers and the softmax layer using the pre-trained model parameters of the ImageNet classification [33] and Xavier's method [34], respectively.
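Step 1 can be sketched in a few lines, assuming the common mapping J = I/127.5 − 1, which sends the uint8 range [0, 255] exactly onto [−1, +1] (the exact constant used in the paper is an assumption here):

```python
import numpy as np

def normalize(img_uint8):
    """Convert uint8 pixels to float32 and map [0, 255] to [-1, +1].
    (A common choice for this normalization; assumed, not taken verbatim
    from the paper.)"""
    return img_uint8.astype(np.float32) / 127.5 - 1.0

img = np.array([[0, 255]], dtype=np.uint8)
out = normalize(img)
print(out.min(), out.max())  # -1.0 1.0
```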
Step 3: Model training. We first trained the branch of the bilinear model using the Adam optimizer [35] and then trained the two branches together using stochastic gradient descent (SGD) [36] with a learning rate of 0.001. Under this training approach, the network converged to a good result.
To improve the generalization ability and reduce the over-fitting of the network, we adopted the Dropout regularization technique with a dropout rate of 0.5.
The model is trained in an end-to-end manner. As shown in Figure 8, the first half of the model constitutes the convolutional and pooling layers; therefore, the whole model can be trained using the gradient values of the latter half of the model. Suppose that the feature extraction function f_A outputs a matrix A whose rows are the features at the locations l. The pooled bilinear feature is then x = A^T A. If dℓ/dx denotes the gradient of the loss function ℓ with respect to x, the chain rule gives dℓ/dA = A(dℓ/dx + (dℓ/dx)^T), so the gradient of the loss function at the network output can be propagated through the whole model, completing the end-to-end training of the model.
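This chain-rule step can be verified numerically. The sketch below builds a toy loss on the pooled bilinear feature x = A^T A (the weight matrix C is an arbitrary choice of ours) and checks the analytic gradient A(dℓ/dx + (dℓ/dx)^T) against central finite differences:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))   # 6 locations, 4-d features
C = rng.standard_normal((4, 4))   # arbitrary weights defining a toy loss

# x = A^T A, loss = <C, x>, so dl/dx = C.
loss = lambda A: np.sum(C * (A.T @ A))

# Analytic gradient via the chain rule: dl/dA = A (G + G^T) with G = dl/dx.
grad_analytic = A @ (C + C.T)

# Numerical gradient by central finite differences.
eps = 1e-6
grad_num = np.zeros_like(A)
for i in range(A.shape[0]):
    for j in range(A.shape[1]):
        Ap, Am = A.copy(), A.copy()
        Ap[i, j] += eps
        Am[i, j] -= eps
        grad_num[i, j] = (loss(Ap) - loss(Am)) / (2 * eps)

print(np.allclose(grad_analytic, grad_num, atol=1e-4))  # True
```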
The bilinear branch makes a prediction from an entered image, whereas the triplet network verifies a pair of images.

Loss Function
Our entire network is weakly supervised, learning from labels alone. As the loss function in each branch of our network, we adopted the weighted cross entropy [37]. The loss of the whole network is the sum of the weighted cross entropies of the different branches:

L = Σ_j W_j L_j, with L_j = −Σ_i y_i log(ŷ_i). (5)

In the above expressions, y_i denotes the label, ŷ_i is the predictive probability, j is the branch number, and W_j is the loss weight of branch j. By trial-and-error experiment, we found that the network achieved superior results when the bilinear and triplet-network branches were weighted by 1.0 and 0.5, respectively.
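The combined loss can be sketched as a weighted sum of per-branch cross entropies with the 1.0/0.5 weights from the text (a minimal NumPy version; the exact form of the paper's weighted cross entropy is assumed):

```python
import numpy as np

def cross_entropy(y, y_hat, eps=1e-12):
    """Categorical cross entropy for a one-hot label y and probabilities y_hat."""
    return -np.sum(y * np.log(y_hat + eps))

def total_loss(branch_outputs, labels, weights=(1.0, 0.5)):
    """Weighted sum of per-branch cross entropies: 1.0 for the bilinear
    branch, 0.5 for the triplet branch (as in the text)."""
    return sum(w * cross_entropy(y, y_hat)
               for w, (y, y_hat) in zip(weights, zip(labels, branch_outputs)))

# Toy example: the bilinear branch predicts over 3 classes, the triplet
# branch over 2 outcomes (same/different pair).
y_cls = np.array([0.0, 1.0, 0.0])
p_cls = np.array([0.1, 0.8, 0.1])
y_pair = np.array([1.0, 0.0])
p_pair = np.array([0.9, 0.1])

L = total_loss([p_cls, p_pair], [y_cls, y_pair])
print(round(L, 4))  # 0.2758
```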

Comparisons with Other Models
Gradient-weighted class activation mapping (Grad-CAM) improves the transparency of convolutional neural network (CNN)-based models by visualizing, through gradient-based localization, the regions of the input that are important for a prediction [38]. Saliency maps are feature maps that indicate how the pixels of the image affect the classification result [39]. We generated guided Grad-CAM visual explanations and saliency maps to better understand the focus of our deep networks [40] and why our architecture improves the classification results.
To identify the maximum stimulation corresponding to the species on the original image, we visualized the dense layer of the network. By combining the triplet network, our model captures the details in the original images better than the Xception+Bilinear model. Representative results of the two methods are compared in Figure 9. The toes, torso, limbs, and other details of the birds confirm that our network better extracts features that distinguish among the different species. Our network pays more attention to detail and achieves a better overall effect.

System Accuracy
We compared our model with other state-of-the-art weakly supervised algorithms on CUB200-2011. Compared with the recurrent attention convolutional neural network (RA-CNN) [41], the spatial transformer convolutional neural network (ST-CNN) [42], and picking deep filter responses (PDFR) [16], our model achieves a better performance. Besides achieving high classification accuracy on small-scale datasets such as CUB200-2011, our model classified the large-scale dataset Birds-1096 with an accuracy of 88.91%. Therefore, our model is both robust and generalizable, being applicable to both small and large datasets. The accuracies of our model (Xception+Bilinear+Triplet), Xception+Bilinear, and Xception are shown in Table 2 and in the Acc plots of Figure 10.

Discussion
In this paper, we propose a model based on a triplet network and bilinear methods for fine-grained bird recognition. Our proposed model was easily trained, effectively increased the inter-class distance of the features extracted by the network, and improved the accuracy of bird recognition. Compared with other weakly supervised methods such as RA-CNN, ST-CNN, and PDFR, our model achieved a better accuracy of 85.58% on CUB200-2011. The proposed model not only used the label information of the species but also constructed positive and negative sample pairs from the bird datasets, thereby exploiting unsupervised information. Experimental comparisons between our method and other existing approaches on two classification databases confirmed the superior classification accuracy of our approach. Moreover, our method also performed well on large-scale datasets. The model is generalizable, robust, and suitable for fine-grained image classification. In our future work, we will pursue two directions. First, we will investigate how best to combine the losses of the different branches. Second, we will study how to integrate an attention mechanism with our model for more complex fine-grained categories.

Figure 1 .
Figure 1. Images sampled from Birds-1096. Three images were randomly selected from each category (column).

Figure 3 .
Figure 3. An "extreme" version of our Inception module, with one spatial convolution per output channel of the 1 × 1 convolution [19].

Figure 4 .
Figure 4. Acc curves of the evaluated basic networks trained on Birds-1096 and CUB200-2011.

Figure 5.
Figure 5. Image classification by a bilinear convolutional neural network (B-CNN). An image is passed through CNNs A and B. The CNN outputs at each location are combined using the matrix outer product and then average-pooled to obtain the bilinear feature representation. After passing the feature representation through a linear softmax layer, the class prediction is obtained [24].

Figure 6.
Figure 6. The fully shared bilinear model based on Xception.

Figure 7 .
Figure 7. Structure of a triplet network.

Figure 8 .
Figure 8. Architecture of our model.

Figure 9 .
Figure 9. Representative results of our model and the Xception+Bilinear model.

Figure 10 .
Figure 10. Acc curves of the different evaluated models trained on Birds-1096 and CUB200-2011.

Table 1 .
Accuracies of the evaluated models trained on Birds-1096 and CUB200-2011.

Table 2 .
Accuracy comparisons of our model and other models.CNN: convolutional neural network.