Article

Hierarchical Prototypes Polynomial Softmax Loss Function for Visual Classification

1 School of Information and Control Engineering, China University of Mining and Technology, Xuzhou 221008, China
2 IOT Perception Mine Research Center, China University of Mining and Technology, Xuzhou 221008, China
3 School of Information Engineering, Xuzhou University of Technology, Xuzhou 221000, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2022, 12(20), 10336; https://doi.org/10.3390/app122010336
Submission received: 20 September 2022 / Revised: 8 October 2022 / Accepted: 10 October 2022 / Published: 13 October 2022

Abstract

A well-designed loss function can effectively improve the characterization ability of network features without increasing the amount of calculation in the model inference stage, and has therefore become a focus of recent research. Given that existing lightweight networks add a loss only to the last layer, which severely attenuates the gradient during back-propagation, we propose a hierarchical polynomial kernel prototype loss function in this study. In this function, a polynomial kernel loss is added to multiple stages of the deep neural network, which effectively strengthens the gradient flow during back-propagation; the multi-layer prototype losses are used only in the training stage and add no computation in the inference stage. In addition, the good non-linear expression ability of the polynomial kernel improves the feature expression performance of the network. Verification on multiple public datasets shows that lightweight networks trained with the proposed hierarchical polynomial kernel loss function achieve higher accuracy than with other loss functions.

1. Introduction

Deep convolutional neural networks have greatly advanced image classification [1,2,3,4]. However, due to their large number of parameters, deep neural networks are challenging to deploy on edge or mobile devices with limited resources. In recent years, studies have therefore aimed at designing lightweight neural networks with relatively few parameters and low computational cost [5,6,7]. These lightweight neural networks can be widely deployed on mobile terminals and embedded devices, and have become a current research hotspot.
Studies on lightweight neural networks focus on the design of the network structure and the improvement of the loss function [8,9]. Various advances in network structure design, such as MobileNetV1 [5], MobileNetV2 [10] and ShuffleNet [6], have been reported. MobileNetV3 [11] was designed with automatic architecture search based on evolutionary methods; however, because this approach requires a large amount of computing resources, it has not been widely evaluated in other studies. The loss function, in contrast, can improve model performance without increasing the number of parameters or the computation of the inference stage; therefore, we focus on the loss function of lightweight networks.
Lightweight convolutional neural networks and large-scale deep convolutional neural network models are both end-to-end methods; therefore, lightweight networks generally use the same loss functions as large-scale convolutional neural networks. Currently, there are three main types of loss functions used for network model training: cross-entropy loss functions based on sample labels and their variants, contrastive loss functions based on sample pairing, and loss functions based on regularization. For the cross-entropy loss function and its variants, the input image is first mapped to a vector space through a stack of convolutional layers to obtain a feature vector representation of the image. To determine the category of the input image, the similarity between the image feature vector and the class prototypes (also referred to as class weight vectors, class centers or class representations) is measured, and the class of the sample is then decided from its distances to the class prototypes. With an appropriate measure, such as inner-product similarity, cosine similarity [12] or Euclidean distance [13], the class logits are converted by Softmax into the probability of the sample belonging to each category. Besides Softmax, Bayesian radial basis networks, the Bayesian formula and Gaussian distributions have also been proposed for this conversion. In the training process, the class probability of the sample is continuously optimized by the cross-entropy loss to approach its ground truth.
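To make the prototype-based formulation concrete, the following sketch (in PyTorch, which is also used for the experiments in this paper) computes class logits from feature vectors and class prototypes under the three similarity measures mentioned above and applies the standard softmax cross-entropy; the function and variable names are illustrative, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def prototype_logits(x, W, metric="inner"):
    """Similarity between features x of shape (B, D) and class prototypes W of shape (C, D)."""
    if metric == "inner":       # inner-product similarity
        return x @ W.t()
    if metric == "cosine":      # cosine similarity between normalized vectors
        return F.normalize(x, dim=1) @ F.normalize(W, dim=1).t()
    if metric == "euclidean":   # negative squared Euclidean distance (larger = more similar)
        return -torch.cdist(x, W).pow(2)
    raise ValueError(f"unknown metric: {metric}")

# Toy usage: 8 samples, 64-dim features, 10 classes.
x = torch.randn(8, 64)
W = torch.randn(10, 64)                      # class prototypes (learnable parameters in practice)
y = torch.randint(0, 10, (8,))
loss = F.cross_entropy(prototype_logits(x, W, "cosine"), y)
```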
The contrastive loss function based on sample pairing cannot directly replace the Softmax loss function because efficient sample pairs are difficult to obtain. Subsequently, DeepID2 [14], center loss [1] and margin loss [15] combined a metric loss with the traditional Softmax loss to supervise model training, and made some progress in face recognition. However, the performance of these methods on other visual classification tasks is limited. In addition, the CPL [16] and LGM [9] algorithms add an intra-class distance regularization term to the Softmax function to regularize the feature distribution around the class prototype. These methods are still limited by the additional, unstable regularization losses.
Although the Euclidean distance and the inner product have good geometric interpretability, visual classification tasks require class prototype vectors and feature vectors to have good separability. In actual tasks, due to sample complexity and diversity, especially for fine-grained image classification, class prototype and feature vectors are poorly separable in a low-dimensional space. To overcome this challenge, it has been proven [17] that vectors in a low-dimensional space can be mapped to a high-dimensional space, so that features that are inseparable in the low-dimensional space become separable in the high-dimensional space. In addition, the top layers of a convolutional neural network tend to extract visual semantic features, while the features extracted at the bottom layers often contain a large amount of low-level visual information [18]. The main reason for this phenomenon is that the last layer of the network computes the loss between the logits and the manually annotated labels, and the network parameters are then optimized layer by layer through gradient-descent back-propagation. Manually annotated labels are essentially semantic symbols, so, driven by the optimization goal of the loss, the features extracted near the top of the network during training tend to be closer to the semantic information represented by the labels. To enable the lower part of a lightweight network to extract more semantic information, a loss function can be added to the bottom layers. This effectively strengthens gradient back-propagation and makes the low-level feature maps focus on the semantic information to be extracted. For this reason, we propose a novel hierarchical polynomial loss function. As shown in Figure 1, the main feature of this method is that, exploiting the multi-stage structure of the lightweight network, a polynomial prototype loss function is added to multiple stages during training to enhance semantic feature extraction at the bottom of the network. The inference stage uses only the backbone of the network, without increasing the number of parameters or the amount of calculation. Experiments on multiple datasets show that the method effectively improves the performance of lightweight networks.
The main contributions of this article are:
  • A hierarchical prototype loss function is proposed. By adding loss functions to different layers of the deep neural network, the performance of the semantic feature extraction at the bottom of the network is effectively improved;
  • The loss calculation method used is a polynomial function, which is a kernel method that can effectively improve linear separability of the low-dimensional space;
  • Through various experiments using multiple public datasets, it was proven that the proposed method is effective.
The remainder of this article is organized as follows: Section 2 reviews related studies, Section 3 elaborates on the proposed method, Section 4 presents the experimental evaluation, and Section 5 concludes the paper.

2. Related Works

We focus on the loss function of lightweight convolutional neural networks. Many lightweight networks have been proposed, most of which are trained with the Softmax loss function. Loss functions, however, are generally task-agnostic, and since visual classification underlies many other tasks, numerous loss functions have been proposed for it. We therefore review the loss functions used in visual classification. In general, the loss functions of deep neural network classification tasks can be divided into those based on a sample category label, those based on regular constraints and those based on sample pair labels.

2.1. Loss Function Based on Sample Labels

Most of the loss functions currently used in classification tasks are based on the Softmax loss and its improved variants. This type of algorithm relies on sample labels: the prediction vector of the sample category is output by the last fully connected (FC) layer of the network, the loss is computed against the sample label, and the network is then optimized through gradient descent and back-propagation. The standard cross-entropy loss only pays attention to separability between classes, not to compactness within classes. Therefore, an L2-constrained Softmax function was proposed in [19] to enhance the ability of the network to extract intra-class features. In addition, L-Softmax [20] replaces the dot-product operation of the last layer with the product of magnitude and angle, which enhances the expressive ability of intra-class features by adjusting the angle. Furthermore, A-Softmax [21] normalizes the weights on the basis of L-Softmax [20], which enhances the ability of the network to learn both inter-class and intra-class features.

2.2. Loss Function Based on Regular Constraints

The loss function based on regular constraints maintains the performance of the Softmax function while adding extra regularization terms to the network, thereby enhancing its expressive ability and preventing premature overfitting of the model. A center loss function used in conjunction with the Softmax loss has been proposed [1]; it makes the feature vectors extracted by the network aggregate toward the class centers in the feature space. Moreover, an improved center-invariant loss function based on the center loss has also been proposed [22]. It further constrains the class centers onto a hypersphere, which reduces the differences in feature expression caused by class imbalance.
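As an illustration of this family of losses, a minimal sketch of the center loss [1] is given below; it shows only the regularization term that pulls features toward their class centers and is meant to be added to the Softmax loss with a weighting coefficient (the names and the weighting scheme are illustrative assumptions).

```python
import torch
import torch.nn as nn

class CenterLoss(nn.Module):
    """Minimal sketch of the center loss: mean squared distance between each
    feature and the (learnable) center of its own class."""
    def __init__(self, num_classes, feat_dim):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_classes, feat_dim))

    def forward(self, feats, labels):
        # feats: (B, D), labels: (B,)
        return 0.5 * (feats - self.centers[labels]).pow(2).sum(dim=1).mean()

# Typical usage: total_loss = cross_entropy + lambda_c * center_loss(feats, labels)
```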

2.3. Loss Function Based on Sample Pair Label

Losses based on the Euclidean distance, including the contrastive loss [23] and the triplet loss [24], are also commonly employed. In these losses, the distances among samples are the optimization objectives: distances between samples of the same class are reduced, while distances between samples of different classes are enlarged. This design has some limitations, such as the difficulty of mining efficient sample pairs or triplets; the training behavior and final performance of the network are strongly influenced by the sampling method [25]. As a consequence, Euclidean-distance-based losses are often applied when fine-tuning rather than when training from scratch. A minimal sketch of the triplet loss is shown below for reference.
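The sketch below illustrates the sample-pair formulation of the triplet loss [24]; the margin value is illustrative.

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Minimal sketch of the triplet loss: the anchor-positive distance should be
    smaller than the anchor-negative distance by at least `margin`."""
    d_ap = F.pairwise_distance(anchor, positive)   # distance to a same-class sample
    d_an = F.pairwise_distance(anchor, negative)   # distance to a different-class sample
    return F.relu(d_ap - d_an + margin).mean()
```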
Studies on loss functions have solid mathematical foundations and strong interpretability, and have therefore attracted scholarly attention. However, most of them are not aimed at lightweight image classification, so the performance gain of the related work is limited when it is transferred to lightweight image classification tasks.

3. The Proposed Method

This section presents the proposed method. First, we review the main differences between traditional convolutional neural networks and lightweight networks. Then, we review the theoretical basis of classic prototype learning and kernel methods. Finally, we explain the proposed hierarchical polynomial class prototype loss function.

3.1. Lightweight Neural Networks

Traditional deep convolutional neural networks are usually built by stacking multiple layers of operations such as convolution, pooling, activation functions and fully connected layers. Most of these operations also apply to lightweight convolutional networks. The main difference is that, in order to reduce the number of parameters and calculations, a lightweight network replaces part of the convolutional layers with depth-wise separable convolutions and point-wise convolutions, as shown in Figure 2. Assuming that the convolution kernel of a certain layer has size D_k × D_k, the input feature map has size H × W × C_in and the output feature map has size H × W × C_out, the parameter and calculation amounts of the traditional convolution are D_k^2 C_in C_out and H W C_in C_out D_k^2, respectively, whereas the depth-wise convolution used by the lightweight network has parameter and calculation amounts of D_k^2 C_in and H W C_in D_k^2, respectively. The lightweight network thus has significantly fewer parameters and calculations. However, the feature expression ability of lightweight networks is not as good as that of traditional convolutional neural networks.
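The parameter reduction can be checked directly; the sketch below builds a standard 3 × 3 convolution and a depth-wise separable replacement in PyTorch and counts their parameters (the layer sizes are illustrative).

```python
import torch
import torch.nn as nn

def standard_conv(c_in, c_out, k=3):
    return nn.Conv2d(c_in, c_out, k, padding=k // 2, bias=False)

def depthwise_separable(c_in, c_out, k=3):
    return nn.Sequential(
        nn.Conv2d(c_in, c_in, k, padding=k // 2, groups=c_in, bias=False),  # depth-wise
        nn.Conv2d(c_in, c_out, 1, bias=False),                              # point-wise
    )

n_params = lambda m: sum(p.numel() for p in m.parameters())
print(n_params(standard_conv(64, 128)))        # D_k^2 * C_in * C_out = 9 * 64 * 128 = 73728
print(n_params(depthwise_separable(64, 128)))  # D_k^2 * C_in + C_in * C_out = 576 + 8192 = 8768
```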

3.2. Prototype Learning

For the classification tasks of deep convolutional neural networks, the Softmax loss is usually used as the loss function. Given a classification task with C classes, a feature vector x_i for each input sample and a class label y_i ∈ {1, …, C}, the cross-entropy loss of this sample can be easily calculated:
\ell_{\mathrm{Softmax}}(x_i) = -\log P_{i,y_i} = -\log \frac{e^{f_{i,y_i}}}{\sum_{k=1}^{C} e^{f_{i,k}}}
where f_{i,j} denotes the similarity between the feature vector x_i and the class prototype W_j, and P_{i,y_i} denotes the probability of x_i being predicted as its ground-truth class y_i. Currently, the Euclidean distance f_{i,j} = \alpha \lVert x_i - W_j \rVert_2^2 and the inner product f_{i,j} = W_j^T x_i are widely used to measure the similarity between feature vectors and class prototypes. Essentially, the class prototype is the representative feature vector of the class to which the sample belongs, and a good class prototype vector is generally located at the feature center of its class.
Although the Euclidean distance and the inner product have good geometric interpretability, visual classification tasks require class prototype vectors and feature vectors to have good separability. In actual tasks, due to sample complexity and diversity, especially for fine-grained image classification, class prototype and feature vectors are poorly separable in a low-dimensional space. To solve this problem, the theory of Support Vector Machines (SVM) shows that vectors in a low-dimensional space can be mapped into a high-dimensional space, so that feature vectors that are inseparable in the low-dimensional space become separable in the high-dimensional space:
f_{i,j} = \langle \varphi(x_i), \varphi(W_j) \rangle
where \varphi: \mathbb{R}^n \rightarrow \mathbb{R}^d (d \gg n) is a non-linear mapping function. This allows the prototype similarity f_{i,j} to be calculated in a higher-dimensional space. However, the computational complexity of this method is much larger than that of the Euclidean distance or the inner product in the low-dimensional space, and this shortcoming can be compensated by the kernel method:
f_{i,j} = \langle \varphi(x_i), \varphi(W_j) \rangle = \sum_{j} c_j \left( x_i^{T} W_j \right)^{j} = k(x_i, W_j)
where k(\cdot,\cdot): \mathbb{R}^n \times \mathbb{R}^n \rightarrow \mathbb{R} is a kernel function whose computational complexity is the same as that of the inner product of the vectors. The coefficients c_j are determined by \varphi or k(\cdot,\cdot).
K_{i,j} = K_{\mathrm{PKF}}(x_i, W_j) = \left( x^{T} w + c_p \right)^{d_p} = \sum_{j=0}^{d_p} \binom{d_p}{j} c_p^{\,d_p - j} \left( x^{T} w \right)^{j}
where K_{\mathrm{PKF}}(\cdot,\cdot) is a polynomial kernel function. The kernel we consider in this paper is the polynomial kernel of degree d_p, and c_p is a constant. Substituting the polynomial kernel into the Softmax loss gives the PKF-Softmax loss:
\ell_{\mathrm{PKF\text{-}Softmax}}(x_i) = -\log P_{i,y_i} = -\log \frac{e^{\,s \left( x_i^{T} W_{y_i} + c_p \right)^{d_p}}}{\sum_{k=1}^{C} e^{\,s \left( x_i^{T} W_k + c_p \right)^{d_p}}}
where the hyperparameter s is a scale factor that enlarges the range of the PKF scores. In SVMs, the polynomial kernel with d_p = 2 has been successfully used for natural language processing (NLP) tasks [27]; for the image recognition task, we also use d_p = 2. Essentially, the inner-product operation of the standard cross-entropy loss is a linear transformation and is a special case of the polynomial kernel loss; the polynomial kernel loss adds non-linear measurement ability.
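A minimal sketch of the polynomial-kernel Softmax loss described above is given below; the prototype initialization and the scale value s are illustrative assumptions, while c_p = 1 and d_p = 2 follow the best settings reported in Section 4.3.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PKFSoftmaxLoss(nn.Module):
    """Sketch of the polynomial-kernel Softmax loss: logits are
    s * (x^T W_k + c_p)^d_p instead of the plain inner product."""
    def __init__(self, feat_dim, num_classes, s=10.0, c_p=1.0, d_p=2):
        super().__init__()
        self.W = nn.Parameter(torch.randn(num_classes, feat_dim) * 0.01)  # class prototypes
        self.s, self.c_p, self.d_p = s, c_p, d_p

    def forward(self, x, labels):
        logits = self.s * (x @ self.W.t() + self.c_p) ** self.d_p
        return F.cross_entropy(logits, labels)
```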

3.3. Hierarchical Prototypes Polynomial Softmax Loss Function

In the training phase of deep convolutional neural networks, gradient descent and back-propagation are usually used to optimize the network parameters. However, as the number of network layers increases, the gradient usually attenuates severely during back-propagation. Therefore, during training, losses can be added to the intermediate layers of the network to assist the back-propagation of the gradient and thus optimize the network parameters. For this reason, the loss function proposed in this paper is added to different layers of the network.
Let F be any lightweight network, which usually has L stages. The output feature map of an intermediate stage is represented by F_l \in \mathbb{R}^{H_l \times W_l \times C_l}, where H_l, W_l and C_l are the height, width and number of channels of the feature map at the l-th stage, and l = 1, 2, \dots, L. Our objective is to impose classification losses on the feature maps extracted at different intermediate stages. The feature map of each stage is reduced by an average pooling operation to obtain the stage feature vector f_l:
f_l = \mathrm{Pool}_{\mathrm{avg}}(F_l)
Then, the polynomial similarity between f_l and the class prototype vectors of this stage is calculated to obtain the class logits, and finally the loss of this stage is obtained, as shown in Figure 3:
\ell_{\mathrm{Softmax}}(f_l) = -\log \frac{e^{\,s \left( f_l^{T} W_{y_i} + c_p \right)^{d_p}}}{\sum_{k=1}^{C} e^{\,s \left( f_l^{T} W_k + c_p \right)^{d_p}}}
Then, the losses of all stages are weighted and summed to obtain the total loss of the network. To keep the number of parameters and calculations of the lightweight network unchanged, the class prototype vectors and the down-sampling operations of each stage are removed in the inference phase.
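A sketch of how the stage losses can be combined is shown below; it assumes each stage has its own PKF-Softmax head (as in the previous sketch, with feat_dim equal to the stage's channel count) and that the stage weights are hyperparameters, since their values are not specified in the text.

```python
import torch
import torch.nn.functional as F

def hierarchical_pkf_loss(stage_maps, stage_heads, labels, weights=None):
    """Sketch of the hierarchical loss: each intermediate feature map is globally
    average-pooled and fed to its own PKF-Softmax head; the stage losses are
    weighted and summed (equal weights assumed if none are given)."""
    weights = weights or [1.0] * len(stage_maps)
    total = 0.0
    for w, fmap, head in zip(weights, stage_maps, stage_heads):
        f_l = F.adaptive_avg_pool2d(fmap, 1).flatten(1)   # f_l = Pool_avg(F_l)
        total = total + w * head(f_l, labels)
    return total
```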

3.4. Optimization of Polynomial Softmax Loss Function

Since the Softmax loss can be optimized by the commonly used stochastic gradient descent (SGD) method, the polynomial Softmax loss function must be compatible with SGD so that the total loss can be optimized with the same optimizer. The partial derivative of the polynomial kernel is easily obtained, as shown in Equation (8):
\nabla_{w} K_{\mathrm{PKF}}(x, W) = d_p \left( x^{T} w + c_p \right)^{d_p - 1} x
The polynomial order d_p is not trainable because it is restricted to integer values; a real-valued exponent may produce complex numbers, which would complicate the network.
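As a quick sanity check of the derivative, the following sketch compares PyTorch autograd against the analytic expression in Equation (8) for a random x and w (the dimensions and constants are illustrative).

```python
import torch

x = torch.randn(16)
w = torch.randn(16, requires_grad=True)
c_p, d_p = 1.0, 2

k = (x @ w + c_p) ** d_p
k.backward()                                          # autograd gradient w.r.t. w
analytic = d_p * (x @ w + c_p) ** (d_p - 1) * x       # Equation (8)
print(torch.allclose(w.grad, analytic))               # True
```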

3.5. The Pseudo Code of the Algorithm

Input: Training data x_i; initialized parameters Θ of the lightweight network backbone; prototype parameters W_1, …, W_L of stages 1 to L.
Output: Θ and the prototype weights W_1, …, W_L.
  • Initialization: set the number of iterations t ← 1;
  • While not converged do:
  • t ← t + 1;
  • Compute the total loss loss = \ell_{\mathrm{Softmax}}(f_1) + \dots + \ell_{\mathrm{Softmax}}(f_L) + \ell_{\mathrm{Softmax}}(f_c);
  • Compute the standard back-propagation error \partial loss / \partial x_t;
  • Update the prototype parameters W_1, …, W_L by W_{t+1}^{l} = W_{t}^{l} - \mu^{t} \, \partial L^{t} / \partial W_{t}^{l};
  • Update the parameters Θ by \Theta_{t+1} = \Theta_{t} - \mu^{t} \sum_{i}^{m} \frac{\partial L^{t}}{\partial x_{t}^{i}} \frac{\partial x_{t}^{i}}{\partial \Theta_{t}};
  • End.
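The pseudo code corresponds roughly to the following training-loop sketch; the backbone is assumed to return both the pooled stage feature vectors f_1, …, f_L and the final feature f_c, and the heads are PKF-Softmax modules as sketched earlier (all names are illustrative).

```python
import torch

def train(backbone, stage_heads, final_head, loader, epochs=300, lr=0.05):
    """Sketch of the training procedure: the backbone parameters Theta and the
    prototype parameters W_1..W_L of all heads are updated jointly by SGD on the
    summed hierarchical loss."""
    params = list(backbone.parameters())
    for head in list(stage_heads) + [final_head]:
        params += list(head.parameters())
    opt = torch.optim.SGD(params, lr=lr, momentum=0.9, weight_decay=1e-4)

    for _ in range(epochs):
        for images, labels in loader:
            stage_feats, final_feat = backbone(images)        # f_1..f_L and f_c
            loss = final_head(final_feat, labels)             # l_Softmax(f_c)
            for f_l, head in zip(stage_feats, stage_heads):
                loss = loss + head(f_l, labels)               # l_Softmax(f_l)
            opt.zero_grad()
            loss.backward()                                   # back-propagate through all stages
            opt.step()                                        # update Theta and W_1..W_L
```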

4. Experimental Results and Discussion

This section describes the experimental setup and the results.

4.1. Dataset

We used multiple image classification datasets to verify the proposed loss function. These datasets fall into two categories. The first is large-scale image classification, whose main characteristic is the large feature differences between categories. The second is fine-grained image classification, which is more challenging: the differences between class features are small, and more subtle inter-class differences need to be captured. For the conventional large-scale image classification task, we use the public ImageNet-100 dataset, which randomly selects 100 classes from the well-known ImageNet dataset [28]; the extraction method follows [29]. The training set contains 500 images per class and the test set 50 images per class, totaling about 55,000 images. In addition, the model was verified on fine-grained images using three widely used datasets: Stanford Dogs [30], Stanford Cars [31] and CUB-200-2011 [32].
The Stanford Dogs dataset contains 120 different dog categories, all belonging to the canine family; each category contains about 150 training samples and about 30 test samples. The Stanford Cars dataset contains 196 car categories with a total of 16,185 images: 8144 training images and 8041 test images. The CUB-200-2011 dataset has 11,788 bird images covering 200 bird subcategories; the training set has 5994 images, the test set has 5794 images, and each image provides a class label. Figure 4 shows sample images from the Stanford Dogs, Stanford Cars, CUB-200-2011 and ImageNet datasets. Intuitively, Stanford Dogs, Stanford Cars and CUB-200-2011 are fine-grained classification datasets: the visual characteristics of samples from different categories are very similar and the differences are very small, which complicates the feature extraction of the network.

4.2. Experimental Setup

To verify the performance of the proposed loss function on image classification tasks, we used the well-known PyTorch deep learning framework [33]. The experiments were performed on an UltraLAB graphics workstation equipped with 192 GB of memory and 8 NVIDIA GTX-2080 graphics processors, each with 8 GB of memory, running the Windows Server operating system.
To verify the performance of the proposed loss function in training lightweight networks, we selected several well-known lightweight convolutional neural networks. Due to the use of depth-wise separable convolutions, these networks greatly reduce the number of parameters and calculations. Their main characteristics are introduced below.
We adopted the cosine learning rate schedule for training all models. The initial learning rate was 0.05 and was scheduled to reach zero within a single cosine period of 300 epochs. The models used SGD with a momentum of 0.9 and a weight decay of 10−4 for optimization. To prevent overfitting during training, we used the same data augmentation for all datasets: the image is first resized to 256 × 256, then randomly cropped and horizontally flipped. The batch size was 96 for ImageNet-100 and Stanford Dogs, and 16 for Stanford Cars and CUB-200-2011.
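The setup above corresponds roughly to the following configuration sketch; the hyperparameter values are taken from the text, while the 224 × 224 crop size and the exact transform composition are assumptions.

```python
import torch
from torchvision import models, transforms

train_tf = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.RandomCrop(224),          # crop size assumed, not stated in the text
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

model = models.mobilenet_v2(width_mult=1.0)   # one of the evaluated backbones
optimizer = torch.optim.SGD(model.parameters(), lr=0.05,
                            momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=300)  # single cosine period over 300 epochs
```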
First, three lightweight networks were compared on the four datasets, as shown in Table 1. Each lightweight network uses different network widths: ×1.0 represents the original model, ×0.75 means that the number of channels in the convolutional layers is 0.75 times that of the original model, and ×0.5 means 0.5 times. Orig means that the network was trained with the standard cross-entropy loss, and Ours means that it was trained with the proposed hierarchical polynomial loss function. Table 1 shows that the network trained with our loss function does not increase the amount of calculation or the number of parameters in the inference stage of the lightweight network, while the accuracy of the three lightweight networks is greatly improved on the different datasets, especially Stanford Dogs and CUB-200-2011. The results verify that the proposed loss function enables the network to learn better features.
In addition to the standard cross-entropy loss, we compared the proposed loss function with other kinds of loss functions, such as the center loss and the contrastive loss, on the ImageNet100 dataset. To fairly compare the effect of the different loss functions on network performance, we trained the networks with the same hyperparameters. Table 2 shows that the network trained with the proposed hierarchical polynomial loss function achieves a clear improvement over the other loss functions. This is because the proposed loss function effectively strengthens the gradient flow in the back-propagation stage and mitigates the vanishing-gradient problem caused by very small gradients.

4.3. Ablation Study

Table 3 shows the ablation experiment in which the proposed loss function was added to different stages of the network. In the experiment, we adopted three lightweight models and added the proposed loss function at stage 5, stage 4, stage 3, stage 2 and the FC layer, respectively. Compared to the original cross-entropy loss, the network trained with the hierarchical polynomial loss function performed best when the loss was added simultaneously to stage 3, stage 2 and the FC layer. Of course, it is not advisable to add loss functions to all layers: the shallow part of the network needs to extract more low-level visual features, while the higher layers should pay more attention to semantic features.
According to Equation (6), the power and the constant term of the polynomial have different effects on the loss calculation. To explore their impact on network performance, we experimented with MobileNetV2 and MobileNetV3-small on the ImageNet100 and Stanford Dogs datasets. In Table 4, alpha represents different powers and gamma represents different constant terms. The results show that the inference performance of the network is best when the power is set to 2 and the constant term to 1.
As mentioned earlier, the proposed hierarchical loss function computes a loss at different stages of the network. However, the output of each stage is not a logit, so the loss cannot be calculated directly: the feature map of each stage must first be reduced to a feature vector, whose similarity to the class prototype vectors then yields the logits from which the loss is calculated with the label. Maximum pooling or average pooling can be used to reduce the dimensionality of the feature map. We evaluated the impact of the two pooling methods on the model, as shown in Table 5. The results show that the network's inference performance is better with average pooling at each stage than with maximum pooling.
To gain an intuitive understanding of the network behavior, we used the Class Activation Mapping (CAM) method [34] to visualize the feature map output at each stage. The CAM method effectively highlights the regions the network attends to. As shown in Figure 5, compared to cross-entropy, our method focuses more on the dominant feature regions of the object to be recognized at the different stages. The proposed loss function thus effectively enhances the features extracted at each stage of the network, thereby improving its inference performance.
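For reference, a minimal sketch of the classic CAM computation is given below: the channels of a stage's feature map are weighted by the classifier weights of the target class and upsampled (the function and argument names are illustrative; the figures in this paper use Grad-CAM-style visualizations, which replace these weights with gradient-derived ones).

```python
import torch
import torch.nn.functional as F

def class_activation_map(feature_map, fc_weight, class_idx, out_size):
    """Sketch of CAM [34]. feature_map: (1, C, h, w), fc_weight: (num_classes, C)."""
    weights = fc_weight[class_idx].view(1, -1, 1, 1)          # (1, C, 1, 1)
    cam = (feature_map * weights).sum(dim=1, keepdim=True)    # channel-weighted sum -> (1, 1, h, w)
    cam = F.relu(cam)
    cam = F.interpolate(cam, size=out_size, mode="bilinear", align_corners=False)
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # normalize to [0, 1]
    return cam
```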

5. Conclusions

We propose a hierarchical polynomial prototype loss function for lightweight deep neural networks. Extensive experiments on different public datasets verify that networks optimized with the proposed loss function achieve better performance without any increase in computation in the inference stage. However, further studies are required to evaluate the impact of assigning different weights to the losses at different layers on network performance; some optimization algorithms will be tested in future work to tune these weights.

Author Contributions

Conceptualization, C.X. and Z.L.; methodology, C.X. and Z.L.; software, C.X. and Z.L.; validation, C.X. and Z.L.; formal analysis, C.X. and Z.L.; investigation, C.X. and C.S.; resources, C.X.; data curation, C.X. and Z.L.; writing—original draft preparation, C.X.; writing—review and editing, C.X., Z.L. and C.S.; visualization, C.X. and Z.L.; supervision, X.L.; project administration, E.D.; funding acquisition, E.D. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China on research on the online identification of coal gangue based on terahertz detection technology (grant number NO. 52074273), and by the State Key Research Development Program of China (grant number NO. 2017YFC0804400, NO. 2017YFC0804401).

Data Availability Statement

Restrictions apply to the availability of these data. Data was obtained from [third party] and are available [from the authors/at URL] with the permission of [third party].

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Wen, Y.; Zhang, K.; Li, Z.; Qiao, Y. A Discriminative Feature Learning Approach for Deep Face Recognition. In Computer Vision—ECCV 2016; Leibe, B., Matas, J., Sebe, N., Welling, M., Eds.; Springer: Cham, Switzerland, 2016; pp. 499–515. [Google Scholar]
  2. Hu, J.; Shen, L.; Albanie, S.; Sun, G.; Wu, E. Squeeze-and-Excitation Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 2011–2023. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  3. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Commun. ACM 2017, 60, 84–90. [Google Scholar] [CrossRef] [Green Version]
  4. Masi, I.; Wu, Y.; Hassner, T.; Natarajan, P. Deep Face Recognition: A Survey. In Proceedings of the 2018 31st SIBGRAPI Conference on Graphics, Patterns and Images (SIBGRAPI), Parana, Brazil, 29 October–1 November 2018; pp. 471–478. [Google Scholar]
  5. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv 2017, arXiv:1704.04861. [Google Scholar]
  6. Ma, N.; Zhang, X.; Zheng, H.-T.; Sun, J. ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design. In Computer Vision—ECCV 2018; Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y., Eds.; Springer: Cham, Switzerland, 2018; pp. 122–138. [Google Scholar]
  7. Huang, Y.; Yu, W.; Ding, E.; Garcia-Ortiz, A. EPKF: Energy Efficient Communication Schemes Based on Kalman Filter for IoT. IEEE Internet Things J. 2019, 6, 6201–6211. [Google Scholar] [CrossRef]
  8. Tan, M.X.; Le, Q.V. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. In Proceedings of the 36th International Conference on Machine Learning (ICML), Long Beach, CA, USA, 9–15 June 2019. [Google Scholar]
  9. Wan, W.; Zhong, Y.; Li, T.; Chen, J. Rethinking Feature Distribution for Loss Functions in Image Classification. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 9117–9126. [Google Scholar]
  10. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. MobileNetV2: Inverted Residuals and Linear Bottlenecks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4510–4520. [Google Scholar]
  11. Howard, A.; Sandler, M.; Chen, B.; Wang, W.; Chen, L.C.; Tan, M.; Chu, G.; Vasudevan, V.; Zhu, Y.; Pang, R.; et al. Searching for MobileNetV3. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea, 27 October–2 November 2019; pp. 1314–1324. [Google Scholar]
  12. Liu, Y.; Li, H.; Wang, X. Learning Deep Features via Congenerous Cosine Loss for Person Recognition. arXiv 2017, arXiv:1702.06890. [Google Scholar]
  13. Lee, B.S.; Phattharaphon, R.; Yean, S.; Liu, J.G.; Shakya, M. Euclidean Distance based Loss Function for Eye-Gaze Estimation. In Proceedings of the 15th IEEE Sensors Applications Symposium (SAS), Kuala Lumpur, Malaysia, 9–11 March 2020. [Google Scholar]
  14. Sun, Y.; Chen, Y.H.; Wang, X.G.; Tang, X.O. Deep Learning Face Representation by Joint Identification-Verification. In Proceedings of the 28th Conference on Neural Information Processing Systems (NIPS), Montreal, QC, Canada, 8–13 December 2014. [Google Scholar]
  15. Gao, R.; Yang, F.; Yang, W.; Liao, Q. Margin Loss: Making Faces More Separable. IEEE Signal Process. Lett. 2018, 25, 308–312. [Google Scholar] [CrossRef]
  16. Yang, H.M.; Zhang, X.Y.; Yin, F.; Liu, C.L. Robust Classification with Convolutional Prototype Learning. In Proceedings of the 31st IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 3474–3482. [Google Scholar]
  17. Noble, W.S. What is a support vector machine? Nat. Biotechnol. 2006, 24, 1565–1567. [Google Scholar] [CrossRef] [PubMed]
  18. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 936–944. [Google Scholar]
  19. Ranjan, R.; Castillo, C.; Chellappa, R. L2 constrained Softmax Loss for Discriminative Face Verification. arXiv 2019, arXiv:1703.09507. [Google Scholar]
  20. Liu, W.Y.; Wen, Y.D.; Yu, Z.D.; Yang, M. Large-Margin Softmax Loss for Convolutional Neural Networks. In Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 20–22 June 2016. [Google Scholar]
  21. Liu, W.Y.; Wen, Y.D.; Yu, Z.D.; Li, M.; Raj, B.; Song, L. SphereFace: Deep Hypersphere Embedding for Face Recognition. In Proceedings of the 30th IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6738–6746. [Google Scholar]
  22. Wu, Y.; Liu, H.; Li, J.; Fu, Y. Deep Face Recognition with Center Invariant Loss. In Proceedings of the Thematic Workshop’17: Proceedings of the on Thematic Workshops of ACM Multimedia 2017, Mountain View, CA, USA, 23–27 October 2017; pp. 408–414. [Google Scholar]
  23. Chopra, S.; Hadsell, R.; LeCun, Y. Learning a similarity metric discriminatively, with application to face verification. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, 20–25 June 2005; Volume 531, pp. 539–546. [Google Scholar]
  24. Schroff, F.; Kalenichenko, D.; Philbin, J. FaceNet: A unified embedding for face recognition and clustering. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 815–823. [Google Scholar]
  25. Wu, C.Y.; Manmatha, R.; Smola, A.J.; Krähenbühl, P. Sampling Matters in Deep Embedding Learning. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2859–2867. [Google Scholar]
  26. Ding, E.; Cheng, Y.; Xiao, C.; Liu, Z.; Yu, W. Efficient Attention Mechanism for Dynamic Convolution in Lightweight Neural Network. Appl. Sci. 2021, 11, 3111. [Google Scholar] [CrossRef]
  27. Goldberg, Y.; Elhadad, M. splitSVM: Fast, space-efficient, non-heuristic, polynomial kernel computation for NLP applications. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies: Short Papers, Columbus, OH, USA, 16–17 June 2008; pp. 237–240. [Google Scholar]
  28. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Kai, L.; Li, F.-F. ImageNet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255. [Google Scholar]
  29. Han, K.; Wang, Y.; Zhang, Q.; Zhang, W.; Xu, C.; Zhang, T. Model Rubik’s Cube: Twisting Resolution, Depth and Width for TinyNets. arXiv 2020, arXiv:2010.14819. [Google Scholar]
  30. Khosla, A.; Jayadevaprakash, N.; Yao, B.; Fei-Fei, L. Novel Dataset for Fine-Grained Image Categorization: Stanford Dogs. In Proceedings of the CVPR Workshop on Fine-Grained Visual Categorization (FGVC), Providence, RI, USA, 16–21 June 2012. [Google Scholar]
  31. Krause, J.; Stark, M.; Deng, J.; Fei-Fei, L. 3D Object Representations for Fine-Grained Categorization. In Proceedings of the 2013 IEEE International Conference on Computer Vision (ICCV), Sydney, Australia, 1–8 December 2013; pp. 554–561. [Google Scholar]
  32. Welinder, P.; Branson, S.; Mita, T.; Wah, C.; Schroff, F.; Belongie, S.; Perona, P. The Caltech-UCSD Birds-200-2011 Dataset; Technical Report CNS-TR-2010-001; California Institute of Technology: Pasadena, CA, USA, 26 October 2011. [Google Scholar]
  33. Ketkar, N. Introduction to PyTorch. In Deep Learning with Python: A Hands-on Introduction; Apress: Berkeley, CA, USA, 2017; pp. 195–208. [Google Scholar]
  34. Zhou, B.; Khosla, A.; Lapedriza, A.; Oliva, A.; Torralba, A. Learning Deep Features for Discriminative Localization. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 2921–2929. [Google Scholar]
Figure 1. Schematic diagram of the hierarchical class prototype learning model. Symbols in the figure (stars, triangles, etc.) represent the output feature maps of different convolution layers in the CNN.
Figure 2. (a) Standard convolution filters panel, (b) depth-wise convolution filters, (c) point-wise convolution filters. In the figure, K × K represents the kernel size; C represents the number of input channels; H × W represents the size of spatial information; and N represents the number of output channels; * is a convolution operation [26].
Figure 3. Schematic presentation of the hierarchical polynomial loss function structure.
Figure 4. Sample instance images of the datasets.
Figure 5. Activation map of selected results on the CUB dataset with the Mobilenetv2 as the base model. From left to right: Column (a) are original images with CUB; Column (b) are Grad-CAM visualizations for the model trained with standard cross-entropy; Column (c) are Grad-CAM visualizations for the model trained with our proposed method.
Table 1. Comparisons of hierarchical polynomial loss function and cross-entropy loss function, Orig represents the standard cross-entropy used by the network while ours represents the loss proposed in this article. The result is the top 1 of the test set, the unit is %.
Model | ImageNet-100 (Orig/Ours) | Stanford Dogs (Orig/Ours) | Stanford Cars (Orig/Ours) | CUB-200-2011 (Orig/Ours)
Mobilenetv2 (×0.5) | 74.17/74.84 | 44.56/52.63 | 83.73/85.25 | 52.36/54.46
Mobilenetv2 (×0.75) | 76.87/78.71 | 45.81/56.33 | 85.81/86.62 | 59.47/61.60
Mobilenetv2 (×1.0) | 78.97/81.05 | 48.43/60.42 | 87.07/87.97 | 61.28/66.08
Mobilenetv3-small (×0.5) | 65.72/65.98 | 33.42/40.30 | 72.92/75.04 | 42.85/47.54
Mobilenetv3-small (×0.75) | 67.87/68.40 | 38.59/44.15 | 74.99/77.94 | 52.15/55.63
Mobilenetv3-small (×1.0) | 70.00/72.15 | 46.51/47.45 | 79.10/80.69 | 52.39/56.65
Mobilenetv3-large (×0.5) | 71.58/74.37 | 41.83/51.42 | 80.45/84.80 | 49.59/58.76
Mobilenetv3-large (×0.75) | 75.81/77.30 | 45.27/55.53 | 80.61/85.23 | 52.63/60.12
Mobilenetv3-large (×1.0) | 76.91/78.56 | 47.01/57.31 | 82.48/86.01 | 55.73/62.68
Table 2. Comparison with other loss functions. RBF denotes radial basis loss function. The result is the top 1 of the test set, the unit is %.
Model | Original Softmax Loss | RBF-Softmax | Ours
Mobilenetv2 (×1.0) | 78.97 | 79.51 | 81.05
Mobilenetv3-small (×1.0) | 70.00 | 70.68 | 72.15
Mobilenetv3-large (×1.0) | 76.91 | 77.31 | 78.56
Table 3. Add our proposed polynomial loss function to different stages of Mobilenetv2 and Mobilenetv3. The result is the top 1 of the test set, the unit is %.
Model | Added stages (FC / Stage 5 / Stage 4 / Stage 3 / Stage 2) | ImageNet100 (CE/PS) | Stanford Dogs (CE/PS)
Mobilenetv2 (×1.0) |  | 78.97 (Orig) | 48.43 (Orig)
 |  | 78.07/78.68 | 53.99/53.05
 |  | 79.70/78.66 | 54.54/56.75
 |  | 79.85/81.05 | 57.61/60.42
 |  | 79.23/80.54 | 58.86/59.73
 |  | 79.30/78.91 | 59.42/58.76
Mobilenetv3-small (×1.0) |  | 70.00 (Orig) | 42.75 (Orig)
 |  | 70.03/71.37 | 45.16/43.97
 |  | 70.21/70.37 | 45.03/45.31
 |  | 70.64/72.15 | 47.05/44.92
 |  | 69.92/70.21 | 47.75/47.53
 |  | 68.95/70.43 | 46.71/46.51
Mobilenetv3-large (×1.0) |  | 76.91 (Orig) | 47.01 (Orig)
 |  | 76.45/76.89 | 50.61/52.51
 |  | 76.22/76.71 | 51.13/51.90
 |  | 78.14/78.56 | 56.76/56.40
 |  | 77.42/77.73 | 56.40/57.31
 |  | 76.93/77.83 | 55.68/57.14
Table 4. The effect of different alpha and gamma values in the proposed polynomial loss function on network performance. The result is the top 1 of the test set, the unit is %.
Model | Alpha | Gamma | ImageNet100 | Stanford Dogs
Mobilenetv2 (×1.0) | 2 | 0.5 | 79.39 | 59.10
 | 2 | 0.8 | 79.91 | 59.75
 | 2 | 1 | 81.05 | 60.42
 | 3 | 0.5 | 78.91 | 57.32
 | 3 | 0.8 | 79.61 | 59.51
 | 3 | 1 | 79.97 | 59.37
Table 5. Ablation analysis on the pooling module using hierarchical loss function. GAP denotes Global average pooling while GMP denotes Global maximum pooling. The result is the top 1 of the test set, the unit is %.
Model | Pooling | ImageNet100 | Stanford Dogs
Mobilenetv2 (×1.0) | GAP | 81.05 | 60.42
 | GMP | 80.29 | 58.94
Mobilenetv3-small (×1.0) | GAP | 72.15 | 47.53
 | GMP | 71.52 | 46.60
Mobilenetv3-large (×1.0) | GAP | 78.56 | 57.31
 | GMP | 77.48 | 57.02
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
