EnCaps: Clothing Image Classification Based on Enhanced Capsule Network

Clothing image classification is increasingly important to the development of online clothing shopping. Clothing category marking, clothing commodity retrieval, and similar-clothing recommendation are popular applications in current clothing shopping, all of which rely on accurate clothing image classification. The wide variety of clothing categories and styles makes accurate clothing image classification difficult. Traditional neural networks cannot capture the spatial structure information of clothing images, which leads to poor classification accuracy. To achieve high accuracy, the enhanced capsule (EnCaps) network is proposed, which exploits both image features and spatial structure features. First, a spatial structure extraction model is proposed to obtain the structural features of clothing based on the EnCaps network. Second, an enhanced feature extraction model is proposed to extract more robust clothing features based on a deeper network structure and an attention mechanism. Third, parameter optimization based on an inception mechanism is used to reduce the computation of the proposed network. Experimental results indicate that the proposed EnCaps network achieves high performance in terms of both classification accuracy and computational efficiency.


Introduction
With the development of electronic commerce, internet shopping for clothing has become a common lifestyle [1][2][3][4]. Before clothing information is uploaded to an online shopping mall, the category, texture, style, fabric, and shape of the clothing should be labeled. The purchaser then searches for suitable clothing by keyword retrieval. Manual labeling is costly in human effort, and the correctness of the labels depends on personal judgment; with thousands of clothing updates, mistakes of personal judgment are inevitable. Furthermore, it is difficult to distinguish fine-grained clothing classes by personal judgment. Thus, a high-efficiency clothing classification method [5,6] is urgently needed for the rapid development of clothing shopping.
Clothing classification has attracted considerable attention in academic circles. Classification methods for clothing are usually divided into two categories. First, traditional feature extraction methods for clothing classification can themselves be divided into two types: one based on global shape and texture features [7], such as Fourier descriptors, geometric invariant distance, and local binary patterns (LBP), and another based on local feature methods, including scale-invariant feature transform (SIFT), speeded-up robust features (SURF), and histogram of oriented gradients (HOG) [8][9][10][11]. The classification accuracy of traditional methods depends severely on the selected features. In specific cases, these methods may achieve high accuracy with stable and conspicuous features. In general, however, clothing images are varied, similar, and complex. The traditional

Materials and Methods
Three key issues should be considered in the EnCaps network. First, the input image size of the traditional capsule network [21] is only 28 × 28, so the network structure should be improved to accept larger, higher-quality images. Second, a robust feature should be extracted with an efficient network structure. Third, the capsule unit is represented by a vector, whereas the processing unit of a traditional CNN is the pixel, which results in a larger computational load; a parameter optimization strategy should therefore be used in the proposed network. To solve these issues, the EnCaps network is proposed with three novel strategies, and the overall architecture is shown in Figure 1.

Spatial Structure Extraction
Convolution is locally connected and shares parameters; as the layers of a convolutional network deepen, the network learns more global contextual information and uses it to make predictions. However, the extracted features contain no explicit spatial information, which is one of the reasons for prediction failure. The shape of the extracted feature is important for object identification: different types of clothing images have obvious structural features that can be used for classification. There are about seven common clothing profiles, such as 'A', 'H', 'X', 'T', 'Y', 'O', and 'V', and the clothing profile feature should enhance classification accuracy. Thus, spatial shape information is important for clothing classification. However, profile information alone is not enough. The traditional convolutional neural network cannot analyze spatial information, which makes it difficult to distinguish two types of clothing with only slight differences. For example, the skirt and the dress, which both have an 'A' shape, are extremely easy for the network to confuse because no spatial information (the localized spatial alignment between clothing and body) is identified.
To complement the capability of spatial feature extraction, the capsule network [21] is introduced to process the features further. The capsule network extracts the structural features of objects based on the capsule unit. The traditional capsule network is shown in Figure 2. The fundamental structure of the capsule network includes a convolution layer, an initial capsule layer, a convolution capsule layer, and a fully connected layer. Different from the traditional CNN, a feature vector $v_i$ replaces the scalar feature of the object. The "prediction vectors" $\hat{u}_{j|i}$ from the capsules are obtained by Equation (1):

$$\hat{u}_{j|i} = W_{ij} v_i, \quad (1)$$

where $W_{ij}$ is the weight matrix of a certain capsule layer, $i$ denotes the vector index of the input capsule layer, and $j|i$ denotes the index of the PrimaryCapsules capsule $j$ corresponding to vector $i$. Then, the high-level features $\hat{u}_{j|i}$ are combined by Equation (2) to realize the dynamic routing of features:

$$s_j = \sum_i c_{ij}\, \hat{u}_{j|i}, \quad (2)$$

where the parameter $c_{ij}$ indicates the routing probability from capsule $i$ of layer $L$ to capsule $j$ of layer $L+1$. The routing probability is computed by Equation (3):

$$c_{ij} = \frac{\exp(b_{ij})}{\sum_k \exp(b_{ik})}, \quad (3)$$

where $b_{ij}$ is the prior probability of routing from capsule $i$ to capsule $j$, which is iteratively updated during model training with an initial value of 0. Then, the vector $s_j$ is processed by the squashing function (Equation (4)) to obtain the $L+1$ layer capsule:

$$v_j = \frac{\|s_j\|^2}{1 + \|s_j\|^2}\, \frac{s_j}{\|s_j\|}. \quad (4)$$

Since the capsule network allows multiple classifications to co-exist, the traditional cross-entropy loss cannot be used directly; an alternative is the margin loss commonly used in SVMs. The capsule mechanism is used at the end of EnCaps, and the margin loss also helps maintain the performance of the model. It can be expressed as Equation (5):

$$L_c = T_c\, \max(0,\, m^+ - \|v_c\|)^2 + \lambda\, (1 - T_c)\, \max(0,\, \|v_c\| - m^-)^2, \quad (5)$$

where $c$ is a certain class, $T_c$ is the class indicator ('1' indicates the presence of class $c$, '0' its absence), $m^+$ is the upper margin, which penalizes $\|v_c\|$ falling below $m^+$ when class $c$ is actually present, $m^-$ is the lower margin, which penalizes $\|v_c\|$ exceeding $m^-$ when class $c$ is actually absent, and $\lambda$ is a scale factor. If class $c$ is present, $\|v_c\|$ should be no less than 0.9; if class $c$ is absent, $\|v_c\|$ should be no greater than 0.1. The input limitation of the traditional capsule network makes it hard to use for clothing classification: the input image size is only 28 × 28, and this limitation restrains the wide application of the capsule network. Thus, the input image size is enlarged in the EnCaps network. The larger image preserves more feature information, which is useful for more accurate classification. In our proposed network, 224 × 224 images are used, a 64-fold increase in input size.
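As an illustration of Equations (2)–(5), the squashing non-linearity, a plain softmax-based routing loop, and the margin loss can be sketched in NumPy. This is a simplified stand-in (random prediction vectors, no learned weights), not the paper's implementation:

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    # Equation (4): scale s_j so that short vectors shrink toward 0
    # and long vectors approach unit length, keeping the direction.
    sq_norm = np.sum(s ** 2, axis=axis, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)

def dynamic_routing(u_hat, num_iters=3):
    # u_hat: (in_caps, out_caps, dim) prediction vectors from Equation (1).
    b = np.zeros(u_hat.shape[:2])                  # prior logits b_ij start at 0
    for _ in range(num_iters):
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)  # Equation (3)
        s = (c[..., None] * u_hat).sum(axis=0)                # Equation (2)
        v = squash(s)                                         # Equation (4)
        b = b + (u_hat * v[None]).sum(axis=-1)     # update priors by agreement
    return v                                       # (out_caps, dim)

def margin_loss(v_norms, targets, m_plus=0.9, m_minus=0.1, lam=0.5):
    # Equation (5): v_norms and targets both have shape (batch, classes).
    present = targets * np.maximum(0.0, m_plus - v_norms) ** 2
    absent = lam * (1.0 - targets) * np.maximum(0.0, v_norms - m_minus) ** 2
    return np.sum(present + absent, axis=-1)
```

For a present class with $\|v_c\| \ge 0.9$ and absent classes with $\|v_c\| \le 0.1$, the loss is exactly zero, matching the bounds stated above.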

Enhanced Feature Extraction
The original capsule network only has two convolution layers, which cannot extract robust features of objects. In order to extract a more robust image feature, the enhanced feature extraction model is proposed with a deeper convolution network, as shown in Figure 3.
Figure 3. Overview of the enhanced feature extraction model, where $F_{t-1}$ is the input feature, $F_t$ is the middle feature, and $F_{t+1}$ is the output feature of the enhanced feature extraction model.
In the proposed model, a deeper network structure and an attention mechanism are used to extract robust features. The 25 × 25 × 384 high-level feature map is processed with a channel attention mechanism, which ignores irrelevant information and focuses on the key information in the image. The enhanced operation is defined in terms of the following quantities: $H$ denotes the height of the feature map, $W$ denotes its width, $W(\cdot)$ is the convolution operation, $\sigma$ and $\delta$ are different activation functions, and $M(\cdot)$ is the max-pooling operation. After a series of attentional enhancement operations, the feature map is reduced from 25 × 25 × 384 to a one-dimensional vector of 1 × 1 × 256; the detailed structure of the stem module is shown in Table 1.
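The channel attention step can be sketched in NumPy as follows. The exact enhanced operation of EnCaps is not fully specified above, so this is a hedged squeeze-and-excitation-style stand-in: the two projection matrices are random placeholders for learned weights, and the reduction ratio is an illustrative assumption.

```python
import numpy as np

def channel_attention(feat, reduction=4, rng=None):
    # feat: (H, W, C) feature map, e.g. 25 x 25 x 384.
    # Random stand-ins for learned projections -- illustrative only.
    rng = np.random.default_rng(0) if rng is None else rng
    h, w, c = feat.shape
    # M(.): global max-pooling over the spatial dimensions -> (C,)
    pooled = feat.reshape(h * w, c).max(axis=0)
    # delta: ReLU after a channel-reducing projection
    w1 = rng.standard_normal((c, c // reduction)) * 0.01
    hidden = np.maximum(0.0, pooled @ w1)
    # sigma: sigmoid after an expanding projection -> weights in (0, 1)
    w2 = rng.standard_normal((c // reduction, c)) * 0.01
    weights = 1.0 / (1.0 + np.exp(-(hidden @ w2)))
    # Reweight channels: irrelevant channels are suppressed.
    return feat * weights[None, None, :]
```

Because every channel weight lies in (0, 1), the attended map never amplifies a channel; it only suppresses the less relevant ones.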

Parameter Optimization
In the early stage of feature extraction, the model is required to use a lightweight decoding network to extract higher-level features. To extract robust features at low computational cost, we use 1 × 3 and 3 × 1 convolution kernels in place of the 3 × 3 convolution kernel, which significantly reduces the size of the model without losing information.
We refer to the stem module in the Inception network and improve it with three objectives: (1) to down-sample the images substantially in the stem phase to reduce the spatial resolution, (2) to reduce the number of parameters in the feature map while making it more semantically informative, and (3) to reduce the overall size of the stem to make the model more lightweight. We use an asymmetric convolution approach to optimize the network, decomposing the 3 × 3 convolution kernel into 3 × 1 and 1 × 3 kernels, which reduces the number of parameters by roughly 33% while maintaining the same accuracy, as shown in detail in Figure 4. As shown in Figure 5, we set up three consecutive 3 × 3 down-sampling layers and one fused down-sampling layer to convolve the image from 224 × 224 × 3 to 53 × 53 × 160. The extracted features are further fused by three sets of asymmetric convolutional parallel structures, deepening the feature map to 51 × 51 × 352; finally, the feature map is brought to 25 × 25 × 384 via a continuous down-sampling layer and one fused down-sampling layer. The detailed structure of the stem module is shown in Table 2. Figure 5. Overview of the Inception mechanism, which belongs to the parameter optimization, where C denotes 2D convolution, MP max-pooling, BN batch normalization, and LK ReLU Leaky ReLU.
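The 33% figure follows directly from counting kernel weights: a 3 × 3 kernel holds 9 weights per channel pair, while a 3 × 1 plus a 1 × 3 kernel hold 6. A small sketch (the channel counts are illustrative, not the paper's exact layer sizes):

```python
def conv_params(kh, kw, c_in, c_out):
    # Weight count of a 2D convolution with a kh x kw kernel (bias omitted).
    return kh * kw * c_in * c_out

c_in = c_out = 64                       # illustrative channel counts
full = conv_params(3, 3, c_in, c_out)   # one 3x3 kernel: 9 * c_in * c_out
asym = conv_params(3, 1, c_in, c_out) + conv_params(1, 3, c_in, c_out)  # 6 * c_in * c_out
saving = 1 - asym / full                # = 1 - 6/9, i.e. a one-third reduction
```

The saving is independent of the channel counts, since both terms scale with `c_in * c_out`.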

Dataset
The dataset used in the experiments is a subset of the DeepFashion dataset [28]; it consists of 5000 images in 10 categories: blouse, cardigan, dress, hoodie, jeans, romper, short, skirt, tank, and tee, with 500 images per category. The resolution of each image is 224 × 224. Of the whole dataset, 4500 images are used for training and the remaining 500 for testing. In addition, to enhance the generalization ability and robustness of the model, data augmentation operations including flipping, random rotation, and random cropping are performed on the training samples. Figure 6 shows the data augmentation operations, where (b) represents flipping the original image from top to bottom or left to right; (c) represents rotating the original image by an arbitrary angle, with the excess area clipped and the missing area filled with white pixels; and (d) represents cropping the original image randomly, with the missing area filled with white pixels.
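The augmentation pipeline described above can be sketched in NumPy. The flip probability, crop ratio, and rotation choices here are illustrative assumptions (the paper does not specify them), and rotation is restricted to right angles to keep the sketch short, whereas the paper uses arbitrary angles:

```python
import numpy as np

def augment(img, rng):
    # img: (H, W, 3) square uint8 image, e.g. 224 x 224 x 3.
    if rng.random() < 0.5:                        # left-right or top-bottom flip
        img = img[:, ::-1] if rng.random() < 0.5 else img[::-1]
    img = np.rot90(img, k=rng.integers(0, 4))     # rotate by a multiple of 90 deg
    h, w = img.shape[:2]
    ch, cw = int(h * 0.9), int(w * 0.9)           # random 90% crop...
    y = rng.integers(0, h - ch + 1)
    x = rng.integers(0, w - cw + 1)
    crop = img[y:y + ch, x:x + cw]
    out = np.full_like(img, 255)                  # ...with the rest filled white
    out[:ch, :cw] = crop
    return out
```

Each call returns an image of the original size, so the augmented samples can be fed to the 224 × 224 input network unchanged.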

Evaluation Criterion
Accuracy, precision, recall, and $F_1$ are used to evaluate the performance of the classification model and can be expressed as follows:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \text{Precision} = \frac{TP}{TP + FP},$$

$$\text{Recall} = \frac{TP}{TP + FN}, \qquad F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}},$$

where TP, FN, FP, and TN denote true positive, false negative, false positive, and true negative, respectively.
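A minimal sketch of these metrics computed from raw confusion-matrix counts (binary case, with one class treated as positive):

```python
def classification_metrics(tp, fn, fp, tn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1
```

For a multi-class evaluation like the 10-category experiment here, the same counts would be taken per class (one-vs-rest) and averaged.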

Experiment Platform Setting
The experiments are conducted on an Ubuntu 16.04 system with the Python language and the TensorFlow framework. The hardware environment is an Intel Gold 5118 CPU with 128 GB of RAM and a 32 GB NVIDIA Tesla V100 GPU. By default, the Adam optimizer with β1 = 0.5 and β2 = 0.999 is used to train the model for 100 epochs, and the initial learning rate is set to 0.0003. The learning rate is automatically decayed by a factor of 0.1 when the validation loss stops decreasing significantly.
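The plateau-based decay described above can be sketched as follows. The patience and improvement threshold are illustrative assumptions; the text states only the decay factor of 0.1 and the initial rate of 0.0003:

```python
def plateau_decay(val_losses, lr=3e-4, factor=0.1, patience=5, min_delta=1e-4):
    # Scan epoch-wise validation losses; decay lr by `factor` whenever
    # the loss has not improved by at least `min_delta` for `patience` epochs.
    best, wait = float("inf"), 0
    for loss in val_losses:
        if loss < best - min_delta:
            best, wait = loss, 0        # significant improvement: reset counter
        else:
            wait += 1
            if wait >= patience:
                lr *= factor            # plateau reached: decay and restart
                wait = 0
    return lr
```

With the defaults above, three improving epochs followed by five flat ones would trigger exactly one decay, taking the rate from 3e-4 to 3e-5.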

The Comparison with Other Methods
The accuracy of a convolutional neural network in image classification still depends on the number of samples, the data augmentation strategy, and so on. How to obtain the same performance with insufficient data annotation is also a concern; our goal is to achieve better performance with smaller amounts of data and less reliance on supervised learning and prior human annotation. To demonstrate the superiority of EnCaps for small-sample detection, we verify the accuracy of the 10 networks as the dataset is gradually incremented, as shown in Figure 7. As the training set grows from 500 to 4500 images, the accuracy of EnCaps on the 500-image validation set holds the leading value of 0.56 at first and, as the dataset is incremented, remains consistently higher than that of the other networks while maintaining very stable performance.

During training, we use ablation experiments to verify the functions of the attention mechanism and the advanced feature extraction mechanism, respectively. First, we validate the channel attention module by examining the validation loss and validation accuracy while gradually increasing the number of attention modules. As shown in Figure 8, as the number of attention modules increases from 0 to 4, the validation loss decreases and reaches its minimum at 4, while the validation accuracy increases and peaks at 4. As the number increases from 4 to 9, the validation accuracy decreases, some performance is lost, and the validation loss bounces back, which in turn reduces the overall efficiency of the model. To further investigate the role of each module, we examine the accuracy and loss on the validation set after removing the stem module and the reduced module.
There are three cases in the experiments: (1) removal of the stem module, (2) removal of the reduced module, and (3) removal of both the stem module and the reduced module. As shown in Table 3, when we remove the stem module, the accuracy on the validation set decreases from 0.842 to 0.682, and the validation loss increases from 27.58 to 47.42. Intuitively, the effect of the stem module is not only to reduce the amount of data but also to contribute to the performance of the model by extracting higher-level semantic features from the images. When we remove the reduced module, the accuracy on the validation set drops from 0.842 to 0.778, and the validation loss increases from 27.58 to 43.10. Clearly, the reduced module plays a bridging role in EnCaps: it further enhances the advanced features of the stem module and normalizes them so that they connect more logically to the final capsule module. The attention module within the reduced module succeeds in highlighting the key weights in the feature map, and it also underlines the importance of the stem module. When we remove both the stem and reduced modules, i.e., directly using the capsule module, the accuracy on the validation set decreases from 0.842 to 0.628, and the validation loss increases from 27.58 to 52.76, demonstrating the importance of both modules for classification performance. Without these two modules, the number of network parameters and operations increases by a factor of 0.3, and the efficiency plummets. We compare our method with several state-of-the-art methods, including VGG [29], GoogLeNet [30], ResNet [31], DenseNet [32], MobileNetV2 [33], and the Capsule network [21]. The quantitative evaluation results are shown in Table 4.
We use the validation set to evaluate each network separately. The number of parameters and the computational cost of MobileNetV2 are minimal, but its accuracy is correspondingly reduced due to the lack of computation. The accuracy of EnCaps, 0.842, is the highest of all the networks. The second highest is ResNet, at 0.820, but with more parameters and operations than EnCaps. The accuracy of the original Capsule network is 0.628, much lower than all other networks, and its numbers of parameters and operations are about 0.3 times higher than those of the EnCaps network. In summary, EnCaps obtains very high results in classification, especially clothing image classification, while maintaining a lightweight model that remedies the shortcomings of the traditional capsule network.

We use visualization to compare the computation, number of parameters, and validation accuracy among the 10 models, as shown in Figure 9, where the horizontal axes represent the MFLOPs (millions of floating-point operations) and the number of trainable parameters, respectively, and the vertical axis represents the validation accuracy. The closer a model lies to the y-axis and the farther from the x-axis, the higher its performance. EnCaps sits at the highest point and relatively close to the y-axis, which means that our model achieves the best performance level, combining a lightweight structure with high accuracy.

From the previous conclusions, it is clear that the performance of ResNet-18 is outstanding among all 10 networks. Therefore, on the validation set, we examine the detection results of 500 clothing images with EnCaps and ResNet-18, respectively, and draw confusion matrices that count the detection results per category. As shown in Figure 10, the x-coordinate represents the predicted label, the y-coordinate represents the true label, and each square counts the predictions for the corresponding true label. Seven categories for EnCaps have more than 40 correct detections, with a relatively low false detection rate, while only five categories for ResNet-18 exceed 40 correct detections, with a relatively high false detection rate. Some garments differ only in local features, which nevertheless determine the clothing type, so the ability to detect images at a fine granularity becomes a key basis for judging whether a model has high performance. EnCaps has a high recognition accuracy for different types of clothing and a strong ability to screen fine-grained information, realizing high robustness.
To further validate the ability of the EnCaps network to classify clothing, we examine the accuracy, precision, recall, specificity, sensitivity, and $F_1$ metrics of the model on the validation set, as shown in Table 5. The metrics of EnCaps are at a relatively excellent level, and it can be seen that our proposed EnCaps has a great advantage over the other neural networks both in detection effectiveness and in network size. The actual classification results also demonstrate that EnCaps performs well in the experiments. We randomly select 10 different types of clothing, test them individually, and obtain their classification results together with the probability scores of the top two predictions. As shown in Figure 11, for clothing images with only fine-grained distinctions, such as the dress and the romper, the EnCaps network extracts the spatial features of their pixel distributions to draw a clear comparison and distinguish them. The proposed EnCaps network combines the advanced feature extraction capability of traditional convolutional neural networks with the spatial structure perception capability of the Capsule network, so that it can classify otherwise indistinguishable images well. Figure 11. The effectiveness of the EnCaps network in clothing classification.

Discussion and Future Directions
To realize more accurate classification of clothing images, we propose the EnCaps network, which uses spatial structure extraction, enhanced feature extraction, and parameter optimization to obtain both the spatial structure information and robust image features of clothing images. The EnCaps network achieves not only high classification accuracy but also low parameter computation. The experimental results demonstrate the superiority of the EnCaps network, which is attributed to its deeper and better-optimized network structure. Accuracy and computational complexity are the key metrics to consider in network design. The traditional classification network does not consider the spatial structure feature, yet this feature can improve classification accuracy according to previous works. Thus, the concept of the capsule network is fused into the designed network. The original capsule network is not suitable for clothing classification, so the proposed network is redesigned according to the demands of clothing classification. The input image size is enlarged by modifying the input network, and more robust feature extraction is obtained through the deeper and more efficient EnCaps structure. Compared with the traditional capsule network, the network structure is redesigned, and the classification accuracy and efficiency are remarkably improved.
The classical deep learning network focuses on image features but not on spatial location relationships. Improvements in network depth and structure are usually considered to extract robust object features, as in LeNet [34], AlexNet [35], VGGNet [29], GoogLeNet [30], ResNet [31], DenseNet [32], MobileNet [33], YOLO [36][37][38][39], and so on. Such improvements may be valid for classical object classification, such as pedestrians, vehicles, animals, and other objects with obvious image features. Clothing classification belongs to fine-grained classification, which involves inconspicuous features, and classification methods based on image features without spatial location features have difficulty achieving impressive results. The EnCaps network and other similar networks that can extract spatial location relationships may be the best choice for fine-grained classification tasks.
From the experimental results on clothing classification, two phenomena are worth discussing. First, the EnCaps network achieves the best top-1 accuracy among VGGNet, GoogLeNet, ResNet, DenseNet, MobileNet, and the traditional Capsule network. Generally, a more complex and deeper network may obtain better performance, but EnCaps, which has the lowest computational cost except for MobileNet (a lightweight network designed for mobile devices), obtains the best accuracy. This phenomenon suggests that spatial location information plays an important part in the classification procedure. Second, the EnCaps network obtains the best accuracy among all compared methods across a gradually increasing training set, from beginning to end. In general, the larger the training set, the better the performance. In our experiments, the recognition model is trained with 500, 1000, 1500, 2000, 2500, 3000, 3500, 4000, 4500, and 5000 samples, respectively. The EnCaps network obtains the best performance without exception, which may demonstrate that spatial location information can boost the efficiency of training models with small sample sets.
The development of artificial intelligence technology will change the nature of the clothing industry. In the future, more and more intelligent applications will be used for clothing; a system may identify your current clothing and recommend clothing you would like. Clothing image classification is the fundamental technology underlying more complex applications, such as evaluation of clothing compatibility, clothing recommendation, and fashion trend prediction. Clothing image classification will be widely applied in the future clothing industry, and it may improve the efficiency and convenience of clothing applications. Fine-grained classification methods should be further studied to improve accuracy.

Conclusions
A novel EnCaps network is proposed for clothing image classification. The proposed network adopts three strategies to obtain the spatial structure feature and a robust image feature: (1) the spatial structure extraction model obtains the spatial structure feature of clothing based on the improved capsule network, (2) the enhanced feature extraction model obtains the robust image feature based on a deeper network structure and an attention mechanism, and (3) parameter optimization is applied in the EnCaps network based on the inception mechanism. Experimental results indicate that the EnCaps network achieves the best comprehensive performance among classical deep learning networks, such as VGGNet, GoogLeNet, ResNet, DenseNet, MobileNet, and the original capsule network. The accurate clothing classification network may be used in clothing category marking, clothing commodity retrieval, and similar-clothing recommendation. In future work, a more efficient and robust network should be researched to obtain more accurate clothing classification.