Exploring Misclassification Information for Fine-Grained Image Classification

Fine-grained image classification is a hot topic that has been widely studied recently. Many fine-grained image classification methods ignore misclassification information, which is important for improving classification accuracy. To make use of misclassification information, in this paper we propose a novel fine-grained image classification method that explores the misclassification information (FGMI) of prelearned models. For each class, we harvest the confusion information from several prelearned fine-grained image classification models. For one particular class, we select a number of classes that are likely to be confused with this class. The images of the selected classes are then used to train classifiers. In this way, we can reduce the influence of irrelevant images to some extent. We use the misclassification information of all the classes by training a number of confusion classifiers. The outputs of these trained classifiers are combined to represent images and produce classifications. To evaluate the effectiveness of the proposed FGMI method, we conduct fine-grained classification experiments on several public image datasets. Experimental results prove the usefulness of the proposed method.


Introduction
Fine-grained image classification [1][2][3] has drawn much attention in recent years. Fine-grained images are very similar to one another, which makes them hard to distinguish. Many efficient fine-grained classification methods have been proposed.
Other researchers make use of the structural information of images with multiview correlations. Spatial as well as class-level information [22][23][24][25][26][27][28][29][30] is often used, but with intensive labeling requirements. Automatic detection of objects is also used; although effective, it introduces noisy information, especially when the number of fine-grained classes is large. Since the discriminative power of a single view is limited, one natural way is to combine multiview correlations [31][32][33][34][35][36][37][38][39][40][41][42][43][44]. This is achieved by ensuring the consistency of different views, and making use of multiview correlations can eventually improve the performance. Although many models have been proposed with good performance, they do not explore the discriminative information of a single image. In practice, images cannot always be classified correctly; they are often confused with several other classes. Moreover, the misclassification is not evenly distributed. For example, when classifying red flowers of one particular class, the probability of misclassification with red flowers of different classes is much larger than with white flowers. The classes being misclassified are biased. Misclassification information should therefore also be used to improve classification performance.
To make use of misclassification information, in this paper we propose a novel fine-grained image classification method that explores the misclassification information of images. The proposed method first uses prelearned fine-grained image classification models to obtain the misclassification information of images. Instead of using all images for classification, for one particular class we select a number of classes that are most likely to be confused with this class. The images of these selected classes, together with the particular class, are then used to train classifiers. The selection and training processes are conducted for each class. As the classifiers are trained with different images of varied classes, their outputs cannot be compared for direct classification. We therefore concatenate the outputs of these learned classifiers to form new image representations and use them for classifier training. We evaluate the proposed method on several datasets, and the classification performance proves its usefulness.
The main contributions of the proposed method lie in three aspects:
• First, we select a subset of images instead of using all the images by exploring the misclassification information for classification. This helps to get rid of noisy information and improves the discriminative power of the learned classifiers.
• Second, we construct new image representations by combining the outputs of classifiers for fine-grained image classification. In this way, we can make use of a number of prelearned models to boost the classification accuracy.
• Third, the proposed method has good generalization ability by making use of prelearned classification models for misclassification information extraction and classification.
The rest of this paper is organized as follows. We discuss related work in Section 2. Section 3 gives the details of the proposed method. In Section 4, experiments and analyses on several fine-grained image datasets are given. Finally, we conclude in Section 5.

Related Work
Fine-grained image classification tries to classify a number of subclass images that belong to a particular class (e.g., flower images). The state-of-the-art fine-grained image classification methods can be roughly divided into two schemes. The first scheme tried to design discriminative classifiers at the class level, while the second scheme made use of information beyond class-level supervision.
With the fast development of deep convolutional neural networks (e.g., AlexNet [9], VGG [11], and ResNet [12]), fine-grained classification performance has been greatly improved. Bilinear convolutional neural networks have also been introduced [3] to model the two-dimensional layouts of image pixels. Chai et al. [15] combined segmentation and classifier training for joint classification. Instead of using class-level supervision, Zhang et al. [10] used image-level classifiers by hierarchically learning the structure information. Semantic classifiers [13] were also used for image classification. Although these well-designed classifiers have proven useful for classification, they were often designed for general images instead of fine-grained images. Since fine-grained images often share similar appearances, the intrinsic correlations of fine-grained images should be well explored. A number of works have addressed this. For example, Wah et al. [17] used the correlations of different birds. To determine the locations of objects, Zhang et al. [18] combined image parts with R-CNNs; however, automatic detection introduced noisy information. To alleviate this problem, Cui et al. [19] made use of extra human labor to annotate the bounding boxes of objects. To avoid the influence of background areas, Zhang et al. [20] used objectness proposals to visually and semantically model object, context, and background separately, while He et al. [21] also spatially pooled information for classification. Although great improvements have been made, these methods all ignored misclassification information.
Using only class-level supervision is not enough for efficient classification. To alleviate this problem, researchers have tried to make use of extra information. For example, Russakovsky et al. [22] went one step beyond pyramid pooling by using object-centric spatial pooling. Chen et al. [23] contextualized object detection and classification, while Angelova and Zhu [26] combined detection, segmentation, and classification into a unified framework. Lin et al. [27] learned important regions automatically from images, while Xie et al. [28] made use of the hierarchical information of image parts. Farrell et al. [29] combined volumetric primitives and pose-normalized appearance for classification. Although effective, these methods' performance decreased when the number of classes increased.
Combining multiple sources of information can help alleviate the effect of an increasing number of classes to some extent. For example, Torresani et al. [31] used human-labeled information, while Yang et al. [32] explored web images to assist classification. However, this also introduced noisy correlations. Instead of using visual information alone, Farhadi et al. [33] represented images by attributes or semantics. Attributes were manually annotated by experts, which took a lot of human labor. To make use of previously learned knowledge for classification, Zhang et al. [36] generated explicit and implicit semantic representations [37]. Wei et al. [38] targeted multilabel image classification, while Zhang et al. [39] fused semantic information for event recognition. Wu and Ji [40] transferred information from other sources, while Zhang et al. [41] shared labels among different views. 3D information was also used for classification in [43]. Ren et al. [44] used region proposal networks for object detection to assist classification. These methods treated images of the same class as a whole instead of modeling each image separately. However, some classes are relatively more similar to each other than to others, and different classes of images should be treated separately.

Fine-Grained Image Classification with Misclassification Information
In this section, we give details of the proposed fine-grained image classification method that explores the misclassification information of images. We first use the misclassification information from prelearned models for misclassified image selection. The selected images are then used to train a number of classifiers. We concatenate the outputs of the learned classifiers to form new image representations, which are then used for fine-grained image classification, as shown in Figure 1. Moreover, to further improve the performance, we design several prelearned models with different backbones for image representation and classification, and combine these prelearned models to leverage their respective advantages.

Exploring the Misclassification Information
We can make use of prelearned models to improve classification performance. Formally, let x^m_n be the visual features of the n-th image used for the m-th prelearned model, n = 1, ..., N, m = 1, ..., M, where N is the number of images and M is the number of prelearned models; let y_n ∈ R^{C×1} be the corresponding label, with C the number of classes. Table 1 gives the symbols used in this paper. A prelearned model refers to a classifier, which can be learned using either local features or deep convolutional neural networks. For example, when local features are used, the prelearned model can be trained using a support vector machine classifier. When deep convolutional neural networks are used, the prelearned models refer to various state-of-the-art networks. For the m-th prelearned model f^m_c(·) corresponding to the c-th class, we use it to predict the classes of images as

ŷ^m_{n,c} = f^m_c(x^m_n),  (1)

where ŷ^m_{n,c} is the prediction for the n-th image on the c-th class given by the m-th prelearned model, i.e., the c-th dimension of ŷ^m_n. Ideally, the predicted classes ŷ^m_{n,c} should be the same as the ground truth labels. However, in practice the prelearned models cannot predict all the images correctly. Some images may be confused with different classes, and for images of the same class, the predictions scatter over many classes. The misclassification is not evenly distributed, because images differ from each other in both semantics and visual appearance. For example, flower images with a similar color and shape are often confused with each other, whereas the probability of misclassification is low when classifying flowers with different colors and varied shapes. This misclassification information is often discarded by previous models. However, we believe it can also be used to improve classification performance.
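To make the prediction step of Equation (1) concrete, the sketch below applies M prelearned, scikit-learn-style classifiers to their respective feature views and collects both the per-class scores and the hard predictions. The model and feature names are illustrative assumptions, not part of the paper.

```python
def predict_with_prelearned_models(models, feature_views):
    """Apply each prelearned model to its own feature view (Equation (1)).

    Assumptions: `models` is a list of M scikit-learn-style classifiers
    exposing `decision_function` that returns an (N, C) score matrix, and
    `feature_views` is a list of M matrices, one (N, D_m) array per model.
    """
    scores = [m.decision_function(x) for m, x in zip(models, feature_views)]
    hard_labels = [s.argmax(axis=1) for s in scores]  # predicted class per image
    return scores, hard_labels  # M arrays of shape (N, C) and M arrays of shape (N,)
```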
Suppose for one particular class that images are often misclassified with several classes. We should concentrate on these misclassified classes to mine useful information instead of taking all image classes into consideration. Besides, different prelearned models have varied misclassification information for each class. The classification performance can be improved by jointly modeling this information.
Specifically, for each class, we make use of this information by first selecting several classes that are most likely to be confused with this class. We calculate the class distribution of ŷ^m_n over all images of the c-th class with the m-th model and sort it in descending order. Let

d^m_c = [d^m_{c,1}, ..., d^m_{c,C}], with d^m_{c,j} ≥ d^m_{c,j+1},  (2)

be the sorted class distribution. We then select the top K (K < C) classes into which images have mostly been misclassified, i.e., the classes corresponding to the first K dimensions of d^m_c. Images of these K classes, along with the c-th class, are then selected to construct a misclassification subset. In this way, we obtain a subset of images x^m_{i,c}, i = 1, ..., N^m_c, with K + 1 classes, where N^m_c is the number of selected images corresponding to the c-th class and the m-th model. This avoids using too many noisy images. Since images are often confused among these K classes, we can improve the classification performance by separating the K + 1 classes of images. Using a subset of easily confused images is more efficient than classifying all the images: it gets rid of irrelevant images and increases the classification accuracy.
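As a rough illustration of this selection step, the following sketch counts, for every class c, how the prelearned model's hard predictions are distributed over the C classes, keeps the K classes it is most often confused with (the first K dimensions of d^m_c), and gathers the corresponding image indices. This is a minimal sketch under the assumption that hard predicted labels are available; ties are not handled specially.

```python
import numpy as np

def top_k_confused_classes(y_true, y_pred, num_classes, k):
    """For each class c, sort the distribution of predicted classes
    (Equation (2)) and keep the k classes most confused with c."""
    confusion = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        confusion[t, p] += 1
    selected = {}
    for c in range(num_classes):
        dist = confusion[c].astype(float)
        dist[c] = -1.0                              # exclude the class itself
        selected[c] = np.argsort(dist)[::-1][:k]    # top-k confused classes
    return selected

def misclassification_subset(selected, y_true, c):
    """Indices of training images belonging to class c or to one of its
    k confused classes, i.e., the K + 1 class subset for class c."""
    keep = set(selected[c].tolist()) | {c}
    return [i for i, t in enumerate(y_true) if t in keep]
```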

Confusion Information Based Image Representations and Classifications
To make use of the selected K + 1 classes of images, we train K + 1 one-vs-all classifiers to separate them. The advantages of using the selected images lie in three aspects. First, we can get rid of irrelevant images and concentrate on the classes that are most likely to be confused. Second, we can use various state-of-the-art image classification methods to improve the classification performance. Third, since the selection and training processes are conducted independently, they can be parallelized to save computational time and improve modeling efficiency.
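One possible realization of this step is sketched below, using LinearSVC as an illustrative stand-in for the one-vs-all classifiers; any classifier that produces a decision score would do.

```python
from sklearn.svm import LinearSVC

def train_subset_classifiers(features, labels, subset_indices, classes_in_subset):
    """Train K + 1 one-vs-all classifiers on the misclassification subset of one class.

    Assumptions: `features` is an (N, D) numpy array, `labels` an (N,) integer
    array, and `subset_indices` the image indices selected for this class;
    LinearSVC is only a convenient stand-in for the paper's classifiers."""
    X = features[subset_indices]
    y = labels[subset_indices]
    classifiers = []
    for cls in classes_in_subset:            # the K confused classes plus the class itself
        binary = (y == cls).astype(int)      # one-vs-all labels
        classifiers.append(LinearSVC().fit(X, binary))
    return classifiers
```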
For images corresponding to the c-th class with the m-th prelearned model, let g^m_{c,k}(·), k = 1, ..., K + 1, be the learned classifiers that separate the K + 1 classes of images. We can then make use of their predictions for fine-grained image classification, and various efficient prelearned models can be combined with the proposed method. However, since each g^m_{c,k}(·) is only used to classify the corresponding K + 1 classes of images, the predicted values cannot be directly compared.
To predict the class of a testing image, we can use the learned classifiers. The output of one learned classifier indicates the semantic similarity between the testing image and the class of the corresponding classifier. This information can be used for image representation, which has been proven by [6,13,49,68-70]. We use the learned classifiers g^m_{c,k}(·), k = 1, ..., K + 1, c = 1, ..., C, to build new image representations. For each image x^m_n, n = 1, ..., N, this is achieved by first using each learned classifier g^m_{c,k}(·) to predict its class as

h^m_{c,k,n} = g^m_{c,k}(x^m_n).  (4)
The predicted values are then concatenated as

h^m_n = [h^m_{1,1,n}; ...; h^m_{1,K+1,n}; ...; h^m_{c,k,n}; ...; h^m_{C,K+1,n}],  (5)

where h^m_n ∈ R^{(CK+C)×1}. Note that we use the learned classifiers to predict the classes of all the images instead of only the images that belong to particular classes, for two reasons. First, the selected top K classes cannot cover all the confused classes, especially when a relatively small K is used. Second, each image is predicted C times for the final classification, which makes the proposed method more robust and effective than a single classifier.
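The representation-building step of Equations (4) and (5) can be sketched as follows: every subset classifier scores every image, and the C × (K + 1) scores are concatenated into one vector per image. `all_classifiers` is assumed to be the nested list produced for all C classes by the training step above.

```python
import numpy as np

def confusion_representation(all_classifiers, features):
    """Concatenate the decision values of all C * (K + 1) subset classifiers
    (Equations (4) and (5)) into one (CK + C)-dimensional vector per image."""
    columns = []
    for per_class in all_classifiers:                        # c = 1, ..., C
        for clf in per_class:                                # k = 1, ..., K + 1
            columns.append(clf.decision_function(features))  # shape (N,)
    return np.stack(columns, axis=1)                         # shape (N, C * (K + 1))
```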
Making use of the new image representations for the final classification is straightforward. This can be achieved by learning C one-vs-all linear classifiers as

min_{w^m_c} Σ_{n=1}^{N} ℓ(w^m_c h^m_n, y^m_n) + α ||w^m_c||^2, ∀c = 1, ..., C,  (6)

where w^m_c ∈ R^{1×(CK+C)} is the classifier parameter to be learned, α is the parameter that controls the influence of the regularization term, y^m_n is the corresponding binary label of the n-th image with the m-th prelearned model, and ℓ(·, ·) is the hinge loss function ℓ(w^m_c h^m_n, y^m_n) = max(0, 1 − w^m_c h^m_n × y^m_n).
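The one-vs-all classifiers of Equation (6) can be approximated with a stochastic hinge-loss learner; the sketch below uses scikit-learn's SGDClassifier with an L2 penalty, where `alpha` plays the role of the regularization parameter α. This is an assumption about one convenient solver, not the paper's exact optimizer.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

def train_final_classifiers(representations, labels, num_classes, alpha=0.1):
    """Learn C one-vs-all linear classifiers with hinge loss and L2
    regularization on the new representations (Equation (6))."""
    classifiers = []
    for c in range(num_classes):
        binary = np.where(labels == c, 1, -1)  # one-vs-all labels in {-1, +1}
        clf = SGDClassifier(loss="hinge", penalty="l2", alpha=alpha, max_iter=2000)
        classifiers.append(clf.fit(representations, binary))
    return classifiers

def score_images(classifiers, representations):
    """Per-class decision values w_c h_n for every image, shape (N, C)."""
    return np.stack([clf.decision_function(representations) for clf in classifiers], axis=1)
```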
Finally, we predict the classes of images by linearly combining the predicted results of the M prelearned models as

y_{n,c} = Σ_{m=1}^{M} λ_{m,c} w^m_c h^m_n,  (8)

and assign the testing image to the class with the largest y_{n,c}. We set λ_{m,c}, m = 1, ..., M, to the same value, which is equivalent to using the mean of the predicted values for classification. Algorithm 1 gives the procedure of the proposed fine-grained image classification with misclassification information method. First, the prelearned classifiers are trained using the training data and are then used to predict the training images using Equation (1). Based on the prediction results, the misclassification information and class distribution are calculated using Equation (2). For each class, the classifiers are trained again on the selected K classes that are most likely to be confused with that class. The new image representation is obtained by concatenating the results of the newly trained classifiers using Equations (4) and (5). With the new image representation, the final classifiers are trained using Equations (6) and (8). Note that with Equation (6) the classification results are obtained using one type of prelearned model, while with Equation (8) the results of several or all types of prelearned models are combined to obtain the final results, which is expected to improve the performance.
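Since the λ_{m,c} are set to the same value, the fusion of Equation (8) reduces to averaging the per-model score matrices; a minimal sketch:

```python
import numpy as np

def fuse_and_predict(per_model_scores):
    """Average the (N, C) score matrices of the M prelearned pipelines
    (Equation (8) with equal lambda) and return the predicted classes."""
    fused = np.mean(np.stack(per_model_scores, axis=0), axis=0)  # shape (N, C)
    return fused.argmax(axis=1)                                  # class with the largest y_{n,c}
```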

Algorithm 1: Fine-grained image classification with misclassification information (FGMI).
Input: Training images x^m_n and labels y_n, prelearned classifiers f^m_c, the parameter K, and testing images.
Testing phase:
6: Calculate the misclassification information with the prelearned classifiers using Equations (1) and (2);
7: Concatenate the predicted results of the testing images using Equations (4) and (5);
8: Predict the classes of the testing images using Equations (6) and (8);
9: return the predicted classes of the testing images.

Experiments
To evaluate the proposed method (fine-grained classification with misclassification information, FGMI), we conduct fine-grained image classification experiments on the Flower-102 dataset [1], the CUB-200-2011 dataset [17], and the Cars-196 dataset [54]. Figure 2 shows some example images of the three datasets.

Experimental Setup
Both local-feature-based methods and deep convolutional neural network (CNN)-based methods have been widely used for fine-grained image classification. CNN-based methods have greatly improved over local-feature-based methods. The proposed method can be combined with various prelearned models. We first evaluate the proposed FGMI using local features (FGMI-LF) on the Flower-102 dataset. The Flower-102 dataset, the CUB-200-2011 dataset, and the Cars-196 dataset are also used to evaluate the performance of FGMI combined with various prelearned deep convolutional neural network models (FGMI-CNN).
To extract local features from the Flower-102 dataset, we followed the same procedure as [4] and densely extracted SIFT features, as in [55]. The minimum scale was 16 × 16 pixels with the overlap set to 6 pixels. We used the same local feature encoding strategies as the prelearned models, with the codebook size set to 1024. We used the same data splits as provided in [1]. We calculated the classification accuracy for each class, and the final performance was evaluated using the mean classification accuracy. For the deep convolutional neural network (CNN)-based methods evaluated on the CUB-200-2011 dataset and the Cars-196 dataset, we followed the same experimental setup as the prelearned models to obtain the trained classifiers [9,11,12]. We used the same type of deep convolutional neural network for classifier training as the corresponding prelearned model. Mean classification accuracy was used for performance evaluation. We used the reported results of the other baseline methods for direct comparison. The baseline models were selected for two reasons: some models are widely used and extended by researchers, while others have achieved state-of-the-art performance on these three datasets.
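For reference, dense SIFT extraction of the kind described above could look like the sketch below, which places keypoints on a regular grid (16 × 16 patches overlapping by 6 pixels, i.e., a 10-pixel stride) and computes SIFT descriptors with OpenCV; it only illustrates the configuration and does not reproduce the exact extraction code of [4,55].

```python
import cv2

def dense_sift(gray_image, patch_size=16, overlap=6):
    """Densely extract SIFT descriptors on a regular grid.

    A sketch assuming OpenCV with SIFT available (opencv-python >= 4.4) and a
    single-channel uint8 image; the stride is patch_size - overlap, matching
    16 x 16 patches that overlap by 6 pixels."""
    step = patch_size - overlap
    h, w = gray_image.shape
    keypoints = [cv2.KeyPoint(float(x), float(y), float(patch_size))
                 for y in range(patch_size // 2, h - patch_size // 2, step)
                 for x in range(patch_size // 2, w - patch_size // 2, step)]
    sift = cv2.SIFT_create()
    _, descriptors = sift.compute(gray_image, keypoints)
    return descriptors  # shape (num_keypoints, 128)
```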

The Flower-102 Dataset
This dataset has 102 classes of 8189 flower images with a predefined train/validate/test split (10/10/rest images per class). The number of images per class ranges from 40 to 258. The scale, pose, and lighting conditions vary between images, and some classes are visually similar and hard to separate. Table 2 gives the performance comparison of the proposed method with several baseline models [1,2,14,20,53,54,73]. We give the performances of the proposed method when combined with these baseline models. FGMI-LF-AFC, FGMI-LF-LR-GCC, FGMI-LF-OCB, FGMI-LF-ICAC, FGMI-LF-BR, and FGMI-LF-S3R represent the proposed method combined with the prelearned AFC, LR-GCC, OCB, ICAC, BR, and S3R models, respectively. We also give the performance of FGMI-LF when jointly combined with AFC, LR-GCC, OCB, ICAC, BR, and S3R for classification (FGMI-LF-Combined). We draw three conclusions from Table 2 when local-feature-based methods are combined. First, the proposed method is able to improve over these baseline models, because we concentrate on the easily confused classes of these prelearned models. Second, the performances of FGMI-LF-AFC, FGMI-LF-LR-GCC, FGMI-LF-OCB, FGMI-LF-ICAC, FGMI-LF-BR, and FGMI-LF-S3R vary because the discriminative power of these prelearned models differs. The performance can be improved by making use of other information (OCB) or by measuring the similarities of images more finely (BR and S3R), rather than simply using the training images with histogram similarities. The performances of these baseline models can be further boosted using the proposed method. Third, the performance can be improved by combining these models (FGMI-LF-Combined). The experimental results on the Flower-102 dataset show the effectiveness of the proposed method when combined with local-feature-based models.

The CUB-200-2011 Dataset
The CUB-200-2011 dataset has 200 bird classes with 11,788 images, divided into 5994 training images and 5794 testing images. The images are also labeled with bird locations along with class information; we only use the class information of the images.

The Cars-196 Dataset
There are 196 classes of 16,185 images in the Cars-196 dataset, divided into 8144 training images and 8041 testing images. Image labels and bounding box annotations are also provided, but we only use the class information of images, as on the CUB-200-2011 dataset.
Performances of the proposed method and comparisons with other baseline models are given in Table 4. To be consistent with the experimental setup on the CUB-200-2011 dataset, we also give the performances of FGMI-CNN-AlexNet, FGMI-CNN-VGG, FGMI-CNN-GoogleNet, and FGMI-CNN-BCNN along with FGMI-CNN-Combined. We can see from Table 4 that the proposed method is able to improve over many baseline models [3,11,60,65-67,75-77]. In particular, by using misclassification information, we improve the performance over AlexNet, VGG, GoogleNet, and BCNN, respectively. Besides, we are able to improve over [77], which makes use of the structural information of image regions to assist network construction. When analyzing the proposed method's performances on the different datasets, we find that the Cars-196 dataset is relatively easier to classify than the CUB-200-2011 dataset. This is because cars are rigid objects while birds are nonrigid, and rigid objects are relatively easier to classify. Nevertheless, by taking the misclassification information into consideration, we consistently improve the classification performance.

Influences of Parameters
The selected number of classes K influences the discriminative power of the new image representations. If we set K to 1, the proposed method is equivalent to using only the most easily confused class, while all the classes are used if we set K to C. To show the influence of K, we plot the performance changes with K on the Flower-102 dataset, the CUB-200-2011 dataset, and the Cars-196 dataset in Figure 3. We can see from Figure 3 that setting K/C to 0.1∼0.2 obtains satisfactory performance.
α controls the influence of the regularization term in Equation (6). We plot its influence on the three datasets in Figure 4. We can see from Figure 4 that α should be neither too large nor too small. If α is too small, the regularization term has very little influence; however, if α is set to a very large value, the optimization of Equation (6) degenerates. Setting α to 0.1∼10 seems to be a good choice, as shown in Figure 4.
The misclassification information also plays an important role for efficient classification. If we do not use the misclassification information, the proposed method simply equals the combination of prelearned models with averaged predictions. We give the influence of the misclassification information in Table 5 (no MI). We can see that the misclassification information is very useful for classification.
The new image representation scheme is also necessary for accurate classification, because the classifier outputs of different subsets cannot be compared directly. One alternative is to predict an image's class by voting, using the predicted classes (instead of the values) over all the selected subsets corresponding to the prelearned models. We give the performances without the new image representation scheme on the three datasets in Table 5 (no NIR). Since different subsets contain images of varied classes, the performance of this strategy is not as good as that of the proposed method.
Table 5. Influences of misclassification information and new image representations on the Flower-102 dataset, the CUB-200-2011 dataset, and the Cars-196 dataset. no MI: without misclassification information (a simple combination of prelearned models with averaged predictions); no NIR: without the new image representation (using the predicted classes over all the selected subsets corresponding to the prelearned models for voting).

Conclusions
In this paper, we proposed an efficient fine-grained image classification method that makes use of the misclassification information of prelearned models, which has been generally ignored by previous methods. We used the prelearned classifiers to select misclassified images for each class, and the selected images were then used to train misclassification classifiers. The selection and training processes were conducted for each class. We combined the outputs of these learned classifiers into new image representations and trained classifiers for the final predictions. The misclassification information contains discriminative features that are important for classifying similar classes with similar semantics and visual appearances. Specifically, for the fine-grained classification task, training the classifiers with misclassification information better extracts confused features, which is useful for discriminating similar classes. To evaluate the proposed method's effectiveness, we conducted fine-grained classification experiments on three fine-grained image datasets. Experimental results and analysis proved the effectiveness and usefulness of the proposed method.