CF-CNN: Coarse-to-Fine Convolutional Neural Network

In this paper, we present a coarse-to-fine convolutional neural network (CF-CNN) for learning multilabel classes. The basis of the proposed CF-CNN is a disjoint grouping method that first creates class groups with hierarchical association, and then assigns a new label to each class belonging to a group so that each class acquires multiple labels. CF-CNN consists of one main network and two subnetworks. Each subnetwork performs coarse prediction using the group labels created by the disjoint grouping method. The main network includes a refine convolution layer and performs fine prediction by fusing the feature maps acquired from the subnetworks. The generated class set in the upper level has the same classification boundary as that in the lower level. Since the classes belonging to the upper level label are classified with a higher priority, parameter optimization becomes easier. In the experiments, the proposed method is applied to various classification tasks and improves classification accuracy by up to 3% with a much smaller number of parameters and without modification of the baseline model.


Introduction
Since AlexNet won the 2012 ImageNet Large Scale Visual Recognition Challenge (ILSVRC) with a large jump in recognition performance [1], various types of deep convolutional neural network (CNN) models have been proposed for many applications including feature extraction, image enhancement, computer vision, medical imaging, and network security, to name a few [2][3][4][5][6][7][8]. Most deep neural networks adopt a CNN model because of its localization property with efficient computation and its end-to-end learning ability, which can detect various features from the input image. Recently, CNN-based deep learning research has tended to scale up the network to solve nonlinear problems [9][10][11][12][13][14]. In general, the CNN-based methods scale up the depth of layers [9][10][11][12] and the number of filters in each layer [14,15].
Simonyan et al. stacked several convolutional filters to increase the depth while decreasing the size of the feature map by increasing the size of pooling or stride [9]. The deep CNNs exhibited superior performance, and have been widely adopted with the control of the feature map size. The residual network introduced by He et al. adopted shortcut connections between residual units for residual learning to solve the gradient vanishing problem [16,17] in deeper network architectures [11,12]. Zagoruyko et al. experimented on the relationship between the depth and width of the residual network in various ways to improve the performance, and then proposed the wide residual network (WRN) with wider channels and reduced depth [14]. They also used the dropout method to prevent overfitting caused by the large number of parameters [18].
PyramidNet gradually increases the dimension of feature maps with zero-padded shortcut connections to preserve the deep network architecture [15]. However, it is difficult to optimize parameters if the number of parameters of the network indefinitely increases.
To solve these problems, hierarchical deep CNN algorithms have been proposed that group related classes into several categories and then classify the classes within each category using a CNN. This approach can improve the overall network performance because each classifier handles only the relatively small number of classes in its category. Wu et al. proposed CF-DRNet to classify five classes of diabetic retinopathy. CF-DRNet consists of a coarse network that determines the presence of diabetic retinopathy and a fine network that classifies four severity grades of diabetic retinopathy. Finally, the grade of diabetic retinopathy is determined using the aggregation module [8]. Yan et al. proposed a hierarchical deep CNN (HD-CNN) model that classifies coarse categories and applies a sub-deep-CNN to each category [19]. This method consists of (i) shared layers to extract low-level features that are shared across all subnetworks, (ii) a coarse category CNN to divide similar classes into a category, and (iii) an independent subnetwork for fine classification in each category. The final classification is performed by weighted averaging that combines the coarse predictions obtained from the lower layer and the fine predictions obtained from each subnetwork. However, in this method, the number of required subnetworks is proportional to the number of categories and each subnetwork requires pretraining. Zhu and Bain proposed a branch convolutional neural network (B-CNN). B-CNN adds a subnetwork that performs coarse prediction to the main network, so that the network produces both fine and coarse predictions, and obtains the final classification result using a weighted loss function [20]. Verma et al. proposed a three-stage hierarchical Yoga data set, and successfully validated the hierarchical data set by using a network similar to B-CNN except with a single fully connected layer [21]. Kim et al. proposed a hierarchical network model by grouping the weights of the network with high correlation with the class [22]. As a result, the resulting network structure is suitable for a distributed machine learning environment with a significantly reduced number of parameters while maintaining a similar performance. Figure 1 illustrates the network structures for classification with a hierarchical structure.
In this paper, we present a hierarchical learning method, called the coarse-to-fine convolutional neural network (CF-CNN). Figure 1d shows the concept of the proposed network. CF-CNN consists of a main network for fine classification and subnetworks for coarse prediction of classes. To predict a fine class, the last feature map of each subnetwork is used together with the last feature map of the main network through the refine convolution layers. Since each subnetwork of CF-CNN is shallower than the main network, the gradient vanishing problem can be alleviated in the learning process. The refine convolution layer fuses feature maps generated from the subnetworks for coarse prediction. The feature map created from the subnetwork serves as a guide for the main network to perform finer classification. Detailed network parameters are given in Tables 1-7. In the experimental results section, we show that the proposed method can be applied to various CNN-based multilabel classification problems with an improved performance over existing methods. Our code will be made available at https://github.com/dkskzmffps/CF-CNN.

Table 4. Structure of the CF-Wide-ResNet-28layer model for CIFAR-10 and CIFAR-100 datasets.

[Table 4 (CF-Wide-ResNet-28layer): table body not recovered; columns are Layer Name, Output Size, Block Structure, and Block Numbers for k = 10 (k = 12).]

Figure 2 shows the effect of the CF-CNN network structure on the CIFAR-100 dataset [23] compared with two standard preactivation ResNets with 326 layers and 1001 layers [12]. In CF-CNN, two subnetworks with 56 layers and the refine layers are added to the preactivation ResNet with 326 layers. The preactivation ResNet-326 and ResNet-1001 achieve accuracies of 78.02% and 80.36%, respectively, whereas the proposed CF-CNN achieves 80.77%. As a result, the CF-CNN structure effectively improves the performance by adding a small number of layers instead of simply increasing the depth of the network.

Coarse-to-Fine Convolutional Neural Network
The hierarchically structured approach first divides multiple classes into several categories by grouping related classes, and then performs fine classification in each category using the CNN [19,22,24]. As a result, classification performance is improved at the cost of additional subnetworks and the corresponding learning method for each class group [19]. Although this approach may reduce the computational load, the classification accuracy is not preserved [22,24]. To solve that problem, the CF-CNN has a main network for fine classification and subnetworks for coarse classification. The proposed method uses the predicted class scores obtained from the baseline CNN model to group classes that belong to the same upper-level label and have similar class scores, which yields new labels in the lower level. The created group label of each level is used as a classification label of each subnetwork, and both coarse and fine labels of each network are simultaneously trained. Figure 1d shows the architecture of the proposed CF-CNN. Given the group labels for the hierarchical structure, the labels in each hierarchical level are simultaneously trained by using the main network and the subnetworks. All feature maps of the last convolution layer in each subnetwork are used to predict fine classes via the refine layer. To obtain hierarchically structured group labels, we adopt the disjoint grouping regularization proposed by Kim et al. [22]. Let $x_i \in \mathbb{R}^d$ represent the input data instance, $y_i \in \{1, \ldots, C\}$ the class label, and $C$ the total number of classes. Given $M$ training samples, $\{(x_i, y_i)\}_{i=1}^{M}$, and the corresponding class scores, $\{s_i\}_{i=1}^{M}$, obtained by the pretrained deep CNN, the goal of the disjoint grouping method is to obtain hierarchical multilevel labels $Q^l = \{y_i^l\}_{i=1}^{M}$ for coarse classification at the subnetworks.

Loss Function for CF-CNN
The class score vector, denoted as $s_i \in \mathbb{R}^C$, is obtained by applying the softmax function to the output of the deep CNN, where $C$ can be considered as the dimension of the class score. Let $y_i^l \in \{1, \ldots, G_l\}$ and $G_l$ represent the label and the number of groups at the $l$-th hierarchy level, respectively. The label of the first level, that is $y_i^1$, is equal to the original class label $y_i$. To train the CF-CNN, we define the loss function as
$$\mathcal{L}(W, x, y) = \sum_{l=1}^{L} \mathcal{L}(W, x, y^{l}), \tag{1}$$
where $\mathcal{L}(W, x, y^{l})$ represents the cross-entropy loss of the hierarchically structured group labels at hierarchy level $l \in \{1, \ldots, L\}$ on the training data, $L$ is the total number of hierarchy levels, and $W$ denotes the weight parameters of the network.
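As a rough illustration of the multilevel loss, the sketch below sums one cross-entropy term per hierarchy level for a single sample. It assumes that the per-level terms are added with equal weight and that each level has its own prediction head; the function names, the 0-based labels, and the sample values are hypothetical and not taken from the released code.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Cross-entropy of one sample; `label` is a 0-based class index."""
    return -math.log(softmax(logits)[label])


def hierarchical_loss(logits_per_level, labels_per_level):
    """Sum the cross-entropy losses of all hierarchy levels for one sample.

    logits_per_level[l] holds the scores of the head for level l and
    labels_per_level[l] the matching group (or fine) label.
    Assumption: all levels contribute with equal weight.
    """
    if len(logits_per_level) != len(labels_per_level):
        raise ValueError("one label is required per hierarchy level")
    return sum(cross_entropy(z, y)
               for z, y in zip(logits_per_level, labels_per_level))


# Hypothetical sample: a fine head (4 classes) and one coarse head (2 groups).
fine_logits = [2.0, 0.5, -1.0, 0.1]
coarse_logits = [1.5, -0.5]
loss = hierarchical_loss([fine_logits, coarse_logits], [0, 0])
```

Because every head is trained against the same input, a mistake at a coarse level raises the total loss even when the fine prediction is correct, which is how the coarse labels guide the shared features.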

Disjoint Grouping Regularization
To obtain hierarchically structured group labels, we use the disjoint grouping regularization originally proposed by Kim et al. to divide classes belonging to an upper level group into lower level groups satisfying the disjoint property [22]. As shown in Figure 1a, Kim et al. proposed the disjoint grouping regularization to build a hierarchical network model by splitting the layers of the network. Their method expresses the weights of each layer as a matrix and creates a block diagonal weight matrix in which each block belongs to a group of highly related classes. Since only the diagonal blocks assigned to the class groups are learned during the learning process, the regularization reduces the number of parameters and yields a parallel model structure for distributed learning. In contrast, we apply this regularization to class scores from the pretrained model to generate hierarchical labels. Therefore, each class has a label with a hierarchical structure, and lower level classes have the same classification boundary as the upper level group.
Given the number of groups, denoted as $K$, let $i$ represent a class belonging to the upper level label $g$. The binary variable $p^g_{ki}$ indicates whether class $i$ is assigned to group $k$, $k = 1, \ldots, K$. The disjoint group assignment vector of dimension $K$, denoted as $p^g_k$, indicates whether the classes in the upper level label $g$ are assigned to group $k$. Since our goal is to create a hierarchical label without duplication between classes, we assume that there is no overlap between groups, which results in $\sum_{k=1}^{K} p^g_k = \mathbf{1}_K$, where $\mathbf{1}_K$ is the $K$ vector of ones. Let $s^g$ be the class scores belonging to label $g$. The proposed disjoint grouping method minimizes a weighted combination of three objective functions: the disjoint group assignment term, the orthogonality term weighted by $\lambda_O$, and the group balance term weighted by $\lambda_B$, where $\lambda_O$ and $\lambda_B$ represent regularization parameters.

Disjoint Group Assignment
To apply the gradient descent optimization method, we relax the binary variable $p^g_{ki}$ to a real variable in the range $[0, 1]$ with the constraint $\sum_{k=1}^{K} p^g_k = \mathbf{1}_K$. We use the softmax function to reparametrize $p^g_{ki}$ with unconstrained variables $z_{ki}$:
$$p^g_{ki} = \frac{\exp(z_{ki})}{\sum_{k'=1}^{K} \exp(z_{k'i})}. \tag{3}$$
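The softmax reparametrization can be sketched as follows. The code applies a softmax over the group index separately for each class, so the relaxed assignments of every class sum to one across the groups and the constraint holds by construction. The function name and the values of z are hypothetical.

```python
import math


def group_assignment(z):
    """Map unconstrained z[k][i] to soft assignments p[k][i] in [0, 1].

    The softmax runs over the group index k separately for every class i,
    so the assignments of each class sum to one across the K groups.
    """
    num_groups = len(z)
    num_classes = len(z[0])
    p = [[0.0] * num_classes for _ in range(num_groups)]
    for i in range(num_classes):
        column = [z[k][i] for k in range(num_groups)]
        m = max(column)  # subtract the max for numerical stability
        exps = [math.exp(v - m) for v in column]
        total = sum(exps)
        for k in range(num_groups):
            p[k][i] = exps[k] / total
    return p


# Hypothetical values: K = 2 groups, 3 classes.
z = [[4.0, -1.0, 0.0],
     [0.0, 3.0, 0.0]]
p = group_assignment(z)
column_sums = [sum(p[k][i] for k in range(2)) for i in range(3)]
```

A large gap between the entries of z for one class pushes its assignment toward a hard 0/1 choice, while equal entries leave the class split evenly between groups.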
The objective function used to create class groups satisfying the disjoint property is given in Equation (4), where $i \in \{1, \ldots, C_g\}$ represents a class belonging to the upper level group $g$, $C_g$ is the total number of classes in group $g$, and $s^g_{i,\mathrm{mean}}$ is the mean class score vector of class $i$ obtained from the pretrained baseline model. The method of Kim et al. aimed to learn the block diagonal weight matrix using the feature assignment vector and the class assignment vector, whereas the proposed method aims to group the classes to satisfy the disjoint property using the class score.
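The mean class score vector used by the grouping objective can be computed as in the sketch below, which averages the softmax score vectors of all samples that share a class label. The function name, the 0-based labels, and the toy scores are hypothetical; in practice the scores would come from the pretrained baseline model.

```python
def class_mean_scores(scores, labels, num_classes):
    """Average the softmax score vectors of the samples of each class.

    scores: list of per-sample score vectors (each of length num_classes)
    labels: list of 0-based class indices, one per sample
    Returns a list with one mean score vector per class.
    """
    sums = [[0.0] * num_classes for _ in range(num_classes)]
    counts = [0] * num_classes
    for s, y in zip(scores, labels):
        counts[y] += 1
        for c, value in enumerate(s):
            sums[y][c] += value
    means = []
    for c in range(num_classes):
        if counts[c] == 0:
            raise ValueError(f"class {c} has no samples")
        means.append([v / counts[c] for v in sums[c]])
    return means


# Hypothetical held-out scores for a 3-class problem.
scores = [[0.8, 0.1, 0.1], [0.6, 0.3, 0.1], [0.2, 0.7, 0.1], [0.1, 0.1, 0.8]]
labels = [0, 0, 1, 2]
means = class_mean_scores(scores, labels, num_classes=3)
```

Classes whose mean score vectors put mass on each other's entries are the ones the baseline model tends to confuse, which makes them natural candidates for the same group.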

Orthogonal Property
If we assume that there is no overlap between the groups, the group assignment vectors should be orthogonal, i.e., $p^g_k \cdot p^g_j = 0$, $\forall k \neq j$. The group assignment vectors obtained by Equation (4) also exhibit orthogonal properties, but we add a regularization term to obtain better results.
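The orthogonality condition can be checked numerically. The sketch below sums the inner products between all pairs of distinct group assignment vectors; this particular penalty form is an assumption made for illustration and is not necessarily the regularization term used in the paper. It is zero for a hard disjoint assignment and positive when classes are shared between groups.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def orthogonality_penalty(p):
    """Sum of inner products between all pairs of distinct group vectors.

    p[k] is the assignment vector of group k, with one entry per class.
    Assumed penalty form, used here only for illustration.
    """
    total = 0.0
    for k in range(len(p)):
        for j in range(k + 1, len(p)):
            total += dot(p[k], p[j])
    return total


# Hard, disjoint assignment of 4 classes to 2 groups: vectors are orthogonal.
hard = [[1.0, 1.0, 0.0, 0.0],
        [0.0, 0.0, 1.0, 1.0]]
# Soft assignment in which every class is shared equally by both groups.
soft = [[0.5, 0.5, 0.5, 0.5],
        [0.5, 0.5, 0.5, 0.5]]
hard_penalty = orthogonality_penalty(hard)
soft_penalty = orthogonality_penalty(soft)
```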

Group Balance
The group assignment vector obtained by Equations (4) and (5) may assign most classes to one group. In an extreme case, all classes can be assigned to one group. To avoid that problem, we add a regularization term to control the balance between groups. The corresponding regularization term is given in Equation (6).
Figure 3 shows the effect of the group balance regularization. Each color bar represents a group $k$, and the width of the bar represents the ratio of classes belonging to that group. With a large $\lambda_B$, the groups have similar ratios. On the other hand, if $\lambda_B$ is small, the group ratios are flexible. However, a very small $\lambda_B$ makes almost all the classes belong to one group.
Figure 4 shows the result of creating a hierarchical label using the CIFAR-10 dataset [23] with 10 classes. To obtain the class scores, we use the preactivation ResNet model [12], and the parameters $\lambda_O$ and $\lambda_B$ are set to 1 and $10^{-5}$, respectively. In each subnetwork, the group labels at each level are used to predict coarse labels, and the feature map of the last convolution layer of each subnetwork is combined with the feature map of the convolution layer of the main network. The combined feature map is fused through a refine convolution layer and used for fine prediction.
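Returning to the group balance term, the sketch below gives a rough illustration of why such a term discourages collapsed groupings. It uses the sum of squared soft group sizes as a stand-in penalty; this form is an assumption chosen for illustration, not necessarily the term in Equation (6). For a fixed number of classes the value is smallest when the groups have equal size and largest when one group holds every class.

```python
def balance_penalty(p):
    """Sum over groups of the squared soft group size.

    p[k][i] is the (soft) assignment of class i to group k. For a fixed
    number of classes the value is smallest when all groups are equal in size.
    Assumed penalty form, used here only for illustration.
    """
    return sum(sum(row) ** 2 for row in p)


# Six classes, two groups (hypothetical hard assignments).
balanced = [[1, 1, 1, 0, 0, 0],
            [0, 0, 0, 1, 1, 1]]
skewed = [[1, 1, 1, 1, 1, 0],
          [0, 0, 0, 0, 0, 1]]
collapsed = [[1, 1, 1, 1, 1, 1],
             [0, 0, 0, 0, 0, 0]]

penalties = [balance_penalty(a) for a in (balanced, skewed, collapsed)]
```

Scaling such a term by a larger $\lambda_B$ makes unequal groupings more expensive, which matches the behavior described for Figure 3.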

Experimental Results
In this section, we evaluate the performance of the proposed CF-CNN. We tested the proposed method with various classification models as baselines, including ResNet [11], WideResNet [14], preactivation ResNet [12], and PyramidNet [15]. The classification performance was evaluated on the CIFAR-10, CIFAR-100 [23], and ImageNet datasets [25]. Since the CIFAR-10, CIFAR-100, and ImageNet datasets contain the same number of samples for each class, methods for the data imbalance problem [6,26] were not considered. In addition, data augmentation methods such as color transformation, geometric transformation, rotation, and contrast transformation were not used in this experiment in order to focus on the performance difference between the proposed model and the baseline model [27][28][29].
Both CIFAR-10 and CIFAR-100 have 50,000 training and 10,000 test images. CIFAR-10 contains 10 classes and CIFAR-100 has 100 classes. In the training process, basic data augmentation was applied: horizontal flipping, padding of 4 pixels around the image, and random cropping of a 32 × 32 image. Each model was trained for 400 epochs using stochastic gradient descent (SGD) with Nesterov momentum. The learning rate started from 0.1 and decayed by a factor of 10 at 150, 250, and 300 epochs. The batch size was 128. When training PyramidNet and CF-PyramidNet, the initial learning rate was 0.25 and decayed by a factor of 10 every 120 epochs. The batch size was 64.
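The two CIFAR learning-rate schedules described above can be written as simple step functions. This is a minimal sketch that assumes 0-based epoch indices and that each decay takes effect at the start of the stated epoch; the function names are hypothetical.

```python
def step_lr(epoch, base_lr=0.1, milestones=(150, 250, 300), factor=10.0):
    """Learning rate at a 0-based `epoch` with decay at fixed milestones."""
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr / (factor ** drops)


def periodic_lr(epoch, base_lr=0.25, period=120, factor=10.0):
    """Learning rate that decays by `factor` once every `period` epochs."""
    return base_lr / (factor ** (epoch // period))


# Learning rates at a few epochs of the 400-epoch CIFAR schedule.
cifar_lrs = [step_lr(e) for e in (0, 149, 150, 250, 300, 399)]
# Learning rates for the PyramidNet and CF-PyramidNet schedule.
pyramid_lrs = [periodic_lr(e) for e in (0, 119, 120, 240, 360)]
```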
The ImageNet dataset includes 1000 classes and consists of about 1.28 million training images and 50,000 validation images. We trained each model for 200 epochs, starting with a learning rate of 0.05 and decaying it by a factor of 10 at 60, 90, and 120 epochs. The batch size was 64. The size of the images used for training and testing is 224 × 224.
The learning process of the proposed CF-CNN consists of four steps: (i) training the baseline model using 90% of the training dataset; (ii) computing the class scores of the remaining 10% of the training dataset using the trained network; (iii) generating multilabels using the disjoint grouping method; (iv) training the CF-CNN on the training dataset.
Table 8 shows the classification results of the baseline model and of hierarchical structure labels obtained by various grouping methods. Manually divided hierarchical labels were generated using the method proposed for B-CNN, and the numbers of classes for coarse 1 and coarse 2 were 8 and 20, respectively. In the Random method, the classes in each upper group were randomly selected and divided into 5 groups to form a hierarchical structure. The groups in each level had the same number of classes. For the experiments with the clustering method and the proposed method, we used the class scores obtained from the pretrained baseline model. The clustering method used k-means to divide the classes belonging to each upper group into 5 groups. For the disjoint grouping method, we set the parameters $\lambda_O$ and $\lambda_B$ to 1 and $10^{-5}$, respectively. The second column, called 'Number of labels', indicates the number of labels used for coarse 1, coarse 2, and fine classification, respectively. The label used for fine classification is the original label. The classification accuracy was evaluated using the CIFAR-100 dataset.
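The Random grouping baseline can be sketched as below. The code assumes that the 5-way random split is applied at two levels, first to all classes and then inside each resulting group, with equal group sizes at each level. The function names, the fixed seed, and the resulting label counts belong to this illustration and are not taken from Table 8.

```python
import random


def random_equal_groups(classes, num_groups, rng):
    """Randomly split `classes` into `num_groups` groups of equal size."""
    shuffled = list(classes)
    if len(shuffled) % num_groups != 0:
        raise ValueError("class count must be divisible by num_groups")
    rng.shuffle(shuffled)
    size = len(shuffled) // num_groups
    return [shuffled[g * size:(g + 1) * size] for g in range(num_groups)]


def random_hierarchy(num_classes, num_groups, rng):
    """Build two coarse label maps by splitting twice, top level first."""
    coarse1, coarse2 = {}, {}
    next_coarse2 = 0
    top = random_equal_groups(range(num_classes), num_groups, rng)
    for g1, members in enumerate(top):
        for sub in random_equal_groups(members, num_groups, rng):
            for c in sub:
                coarse1[c] = g1            # label at the coarsest level
                coarse2[c] = next_coarse2  # label at the intermediate level
            next_coarse2 += 1
    return coarse1, coarse2


rng = random.Random(0)  # fixed seed so the illustration is repeatable
coarse1, coarse2 = random_hierarchy(num_classes=100, num_groups=5, rng=rng)
```

Each class then carries three labels: its original fine label and the two coarse labels from the maps. Every intermediate group lies entirely inside one top-level group, which gives the labels their hierarchical structure.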
When a hierarchical structure is created using the k-means method and the disjoint group method, classes with similar characteristics form groups. When learning a subnetwork using such a group label, it shows better performance than the random grouping method or the manual grouping method because it is easier to find the optimal parameter in the process of fine prediction in the main network. The proposed method shows better performance than the k-means clustering method because it has stronger intergroup disjoint properties than the k-means method.
Tables 9-11 respectively summarize the classification accuracy for the CIFAR-100, CIFAR-10, and ILSVRC 2012 datasets obtained by applying the proposed method to various deep-learning models. B-CNN used the VGG-16 model as the baseline model [9]. HD-CNN adopted the NIN model [30] with the number of filters doubled in all convolutional layers in Table 9, and used the VGG-16 model in Table 11. In Tables 9 and 11, SplitNet used the WideResNet-16 (k = 8) and ResNet-18x2 models [22]. The rest of the models except for WideResNet used a bottleneck structure; detailed parameter information is shown in Tables 1-7. The parameter k of the WideResNet model represents the widening factor. In PyramidNet, $\alpha$ and $N$ represent the widening factor and the total number of blocks, respectively. In the proposed method, each subnetwork and the main network have different parameter values for feature map fusion. $D_k$ of PyramidNet represents the channel dimension of the $k$-th block and is defined as
$$D_k = \begin{cases} 16, & k = 1, \\ \lfloor D_{k-1} + \alpha / N \rfloor, & 2 \le k \le N + 1. \end{cases}$$
In the proposed CF-CNN structure, the subnetwork for classifying each group label was created using the same layer structure used in each deep-learning model. 'Number of labels' represents the number of group labels obtained using the disjoint grouping method, that is, the number of classes classified in coarse 1, coarse 2, and fine image classification, respectively. When compared with ResNet-326 in Table 9, CF-ResNet-164 performs better with a significantly smaller number of parameters. Likewise, compared with Pre-ResNet-1001, CF-Pre-ResNet-326 performs better with a much smaller number of parameters for the CIFAR-10 and ILSVRC 2012 datasets.
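The PyramidNet channel dimensions $D_k$ mentioned above can be computed with a short recursion. The sketch assumes the additive widening rule associated with PyramidNet [15], in which the first dimension is 16 and each subsequent block adds $\alpha / N$ channels before rounding down. The function name and the example values of $\alpha$ and $N$ are hypothetical.

```python
import math


def pyramid_channels(alpha, num_blocks, base=16):
    """Channel dimensions D_1 .. D_{N+1} under the additive widening rule.

    D_1 = base and D_k = floor(D_{k-1} + alpha / N) for 2 <= k <= N + 1.
    Assumption: this is the additive PyramidNet rule, with base = 16.
    """
    dims = [base]
    for _ in range(num_blocks):
        dims.append(math.floor(dims[-1] + alpha / num_blocks))
    return dims


# Hypothetical setting chosen so that alpha / N is an integer step of 4.
dims = pyramid_channels(alpha=48, num_blocks=12)
```

With these values the dimension grows linearly from 16 to 16 + $\alpha$ over the $N$ blocks, instead of doubling at a few downsampling stages.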

Conclusions
In this paper, we proposed a multilevel label augmentation method using a disjoint grouping method. We also proposed the coarse-to-fine convolutional neural network (CF-CNN) to learn the generated multilevel labels with a smaller set of network parameters. Multilevel labels created by the disjoint grouping method have a hierarchical structure and share the same classification boundary between levels. The CF-CNN has subnetworks that learn the multilevel labels simultaneously with the main network. In the experiments, the proposed method was applied to various classification models. The proposed method shows better performance than existing models with a much smaller number of parameters, without requiring structural changes of the building blocks constituting the network.