1. Introduction
Over the past decade, deep networks have achieved remarkable success in computer vision. Extracting local features from an image and then pooling them into a global representation has become one of the most promising types of network architecture [1,2,3]. As one of the most common pooling methods, bilinear pooling aggregates local features into a matrix in which the correlations between different features are preserved. Thanks to this higher-order information, the enriched representations have been used in various tasks [4,5,6,7], especially in visual recognition [8,9,10,11,12].
Among bilinear pooling approaches, normalizing the bilinear representation is almost always applied and is considered an indispensable step to further boost performance [13]. Existing normalization approaches can be roughly categorized into element-wise normalization and structure-wise normalization.
Element-wise normalization was first proposed to suppress the burstiness of bag-of-words features [14], and was then found to be effective for bilinear pooling as well. The approach in [13] takes the square root of each element of the bilinear representation and then divides the result by the Frobenius norm of the matrix. This kind of normalization is simple yet effective and is therefore widely used in bilinear pooling approaches [9,13,15,16,17].
Structure-wise normalization builds on the fact that the bilinear representation is a symmetric positive definite (SPD) matrix whose space forms a Riemannian manifold; feeding the representation directly to a linear classifier is therefore clearly sub-optimal. The representation should be normalized by mapping the manifold into a Euclidean space. In particular, the O2P method [18] applies the matrix logarithm as the normalization, which improves performance. Subsequently, DeepO2P [19] proposed back-propagation through the matrix logarithm and achieved end-to-end learning of deep networks. However, computing the matrix logarithm and its gradient is inefficient on the GPU. To accelerate the normalization, other works [15,20,21] adopt the square root of the SPD matrix, which also accomplishes the mapping to Euclidean space [22]. More importantly, the square root can be approximated by Newton's iteration [15], which involves only basic matrix-matrix multiplications. The normalization with Newton's iteration is therefore GPU-supported, and its efficiency has made it one of the major normalization approaches for bilinear pooling. Although Newton's iteration offers strong theoretical support and good performance [15,22], its sequential matrix multiplications still result in a high computational complexity of $O(D^3)$, where $D$ is the feature dimension and is always large in modern deep architectures (e.g., $D = 2048$ for the ResNet-50 network [1]). Thus, more efficient structure-wise methods have been studied. For example, Run [23] only normalizes the largest eigenvalue and simplifies the computation into matrix-vector multiplications. PSMR [9] reduces the computational complexity so that it scales with the number of local features $N$, which in most cases is smaller than the feature dimension. iSICE [8] and iSQRT [20] adopt a convolutional layer to reduce the feature dimension before bilinear pooling. Despite these efforts, the aforementioned studies trade off efficiency against the recognition accuracy of the network.
Recently, a pioneering work [10,24] analyzed the effect of square root normalization and, for the first time, showed that the normalization reaches a good balance between feature decorrelation and information preservation. Based on this finding, its method, called DropCov, achieves normalization by applying an adaptive channel dropout to the features before bilinear pooling. Compared with square root normalization, this method has only linear complexity $O(D)$ and adds no cost at inference, largely improving efficiency while keeping good recognition accuracy.
Although DropCov is impressive, it depends strongly on the channel dropout rate: a small dropout rate leads to almost no normalization of the bilinear representation, while a high dropout rate can make the network fail to converge. To solve this problem, the work in [10] proposes a network branch that predicts the dropout rate, so that the rate can be set adaptively for each training sample as training proceeds. However, the branch is parametric and needs sufficient training data for accurate predictions. In Section 4.2, we observe its lower robustness on small-scale benchmarks (such as Caltech-UCSD Birds-200-2011 [25]). Furthermore, as a variant of the original dropout [26], DropCov samples a random subset of convolutional filters to train at every iteration, which makes the network converge slowly [27]. As a result, although DropCov has linear complexity $O(D)$ during training, in certain cases the training needs more epochs and takes more time than square root normalization.
The above disadvantages of previous works motivated us to design a more efficient and robust approach to normalize the bilinear representation matrix. Inspired by the original square root normalization, which mainly changes the distribution and magnitude of the eigenvalues, we hypothesize that the distribution and magnitude of the eigenvalues play an important role in the normalization. To examine this hypothesis, we quantified the distribution of eigenvalues (D.E.s) and the magnitude of eigenvalues (M.E.s) as two scalars; more details about D.E.s and M.E.s can be found in Section 3.3. We then trained two networks, with and without square root normalization, on the ImageNet1K dataset [28]. In Figure 1, we show four quantities as a function of training epochs: (1) training (Train.) accuracy, (2) validation (Val.) accuracy, (3) M.E.s, and (4) D.E.s. As illustrated, with the normalization (blue curves) the network avoids severe overfitting and achieves higher accuracy than the one without normalization (red curves). More interestingly, the normalization effectively suppresses the values of M.E.s and D.E.s, making the sum of the eigenvalues smaller and their distribution less peaky. This behavior is consistent with the change in the eigenvalues when the normalization is applied: the eigenvalues are square-rooted. Generally speaking, the normalization correlates with M.E.s and D.E.s, and suppressing M.E.s and D.E.s to some extent may avoid overfitting and yield better network performance. This leads to the main questions of this paper: is there a normalization other than the original square root that can suppress M.E.s and D.E.s? If so, can it replace the original square root normalization?
To answer these questions, in this paper we regularize the M.E.s and D.E.s of the bilinear pooling matrix and study the effect on the performance of deep networks. To this end, we propose an implicit normalization, namely RegCov, which encourages the network to align the current M.E.s and D.E.s with target values. Different from DropCov [10], RegCov normalizes the bilinear representation by adding two regularization terms to the objective function, and no additional parameters are introduced. This design avoids performance degradation when training data are insufficient; consequently, unlike DropCov, our method performs well on benchmarks of various scales, especially small ones. Secondly, RegCov adds no computational overhead at inference. During training, our method adopts a two-stage strategy in which square root normalization is applied initially because it achieves fast training convergence. We apply the proposed method to recent deep networks and run extensive experiments on datasets of different scales. All cases suggest that RegCov is a robust and efficient normalization for bilinear pooling.
3. Our Approach
As shown in Figure 2, a deep network, such as a Convolutional Neural Network (CNN) or a Vision Transformer (ViT), extracts local features $\mathcal{X} \in \mathbb{R}^{W \times H \times D}$ from the image, where $W$, $H$ and $D$ are, respectively, the width, height and feature dimension. After reshaping, the local features are flattened into a feature matrix $X \in \mathbb{R}^{N \times D}$ with $N = W \times H$ feature vectors. Bilinear pooling aggregates these local features into the bilinear representation $A \in \mathbb{R}^{D \times D}$. At the first training stage, $A$ is processed by square root normalization, and the output matrix $\hat{A}$ is sent to a fully connected layer for the final classification prediction. The network is trained with the cross-entropy loss $\mathcal{L}_{ce}$. At the second stage, the normalization is removed and the resulting network continues to be trained with $\mathcal{L}_{ce}$ and two newly proposed regularization terms, $\mathcal{L}_{M}$ and $\mathcal{L}_{D}$. At inference time, we use the network trained after the second stage, so no normalization is applied.
3.1. Bilinear Pooling
At both stages, a deep network extracts local features $X \in \mathbb{R}^{N \times D}$, where $N$ is the number of feature vectors and $D$ is their dimension. Bilinear pooling acts as a global pooling approach that aggregates $X$ into the bilinear representation $A \in \mathbb{R}^{D \times D}$:

$A = X^{\top} X + \epsilon I$,  (1)

where $\epsilon$ is a small constant added to the diagonal of $A$ for the sake of stability during training. Like the representations produced by average pooling and max pooling, the diagonal values of $A$ preserve representative features in each separate feature dimension. Furthermore, the off-diagonal values carry richer information, i.e., the correlation between each pair of dimensions. Hence, bilinear pooling is more powerful and has validated its effectiveness in many computer vision tasks [4,5,8,9,10,11,12,19].
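For concreteness, a minimal PyTorch sketch of this pooling step is given below. The batched layout, the default value of the stabilizing constant, and the absence of a 1/N scaling are our assumptions rather than details fixed by the paper.

```python
import torch

def bilinear_pool(x, eps=1e-5):
    """Aggregate local features into a bilinear (second-order) representation.

    x:   (B, N, D) batch of N local D-dimensional feature vectors per image.
    eps: small constant added to the diagonal for numerical stability
         (the exact value used in the paper is not specified here).
    Returns a (B, D, D) symmetric matrix per image, as in Equation (1).
    """
    a = x.transpose(1, 2) @ x                             # (B, D, D) channel correlations
    a = a + eps * torch.eye(x.shape[2], device=x.device)  # stabilize the diagonal
    return a

# Example: a 14 x 14 grid of 512-d features, i.e., N = 196, D = 512
feats = torch.randn(2, 196, 512)
print(bilinear_pool(feats).shape)  # torch.Size([2, 512, 512])
```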
3.2. First Stage
Despite being powerful, the representation $A$ lies on a Riemannian manifold and is therefore not suited to a linear classifier (fully connected layer), which operates in Euclidean space. Hence, at the first stage, we perform square root normalization on $A$ before the classifier. The normalization maps $A$ to its square-rooted counterpart $\hat{A} = A^{1/2}$, so that $\hat{A}\hat{A} = A$.
Originally, the normalization can be computed with the SVD [22,57]. Let the SVD of $A$ be expressed as

$A = U \Lambda U^{\top}$,  (2)

where $U$ is the matrix of eigenvectors and the diagonal matrix $\Lambda$ contains the eigenvalues. The normalized matrix is obtained by square-rooting the diagonal elements of $\Lambda$:

$\hat{A} = U \Lambda^{1/2} U^{\top}$.  (3)
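As a reference point, the eigendecomposition route of Equations (2) and (3) can be sketched in PyTorch as follows; the clamping of tiny negative eigenvalues is our own safeguard, not something prescribed by the paper.

```python
import torch

def sqrt_spd_eig(a):
    """Matrix square root of a symmetric positive (semi-)definite matrix via
    eigendecomposition, mirroring Equations (2) and (3):
    A = U diag(lam) U^T  ->  A^(1/2) = U diag(sqrt(lam)) U^T.
    """
    lam, u = torch.linalg.eigh(a)          # eigenvalues (ascending) and eigenvectors
    lam = lam.clamp_min(0.0)               # guard against tiny negative values
    return (u * lam.sqrt()) @ u.transpose(-2, -1)

x = torch.randn(196, 256)
a = x.t() @ x + 1e-5 * torch.eye(256)
a_sqrt = sqrt_spd_eig(a)
print(torch.dist(a_sqrt @ a_sqrt, a))      # close to zero
```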
Nevertheless, computing the square-rooted matrix $\hat{A}$ via SVD is inefficient on the GPU and slows down the whole training process. Alternatively, at the first stage we use Newton's iteration, as proposed in [15,20], to compute an approximation of $\hat{A}$. Specifically, it is dedicated to solving the equation $Y^2 = A$. The approach is an iterative process where each iteration is as follows:

$Y_{k} = \frac{1}{2} Y_{k-1} (3I - Z_{k-1} Y_{k-1})$,  (4)

$Z_{k} = \frac{1}{2} (3I - Z_{k-1} Y_{k-1}) Z_{k-1}$.  (5)

As illustrated in Figure 3, the inputs are $Y_0 = A$ and $Z_0 = I$ (in practice, $A$ is first scaled by its trace to guarantee convergence, and the result is compensated afterwards [20]). After $K$ iterations, $Y_K$ and $Z_K$ converge to $A^{1/2}$ and $A^{-1/2}$, respectively. Like previous works, $K$ is set to 5 throughout the paper. Furthermore, Equations (4) and (5) indicate that the normalization process consists only of matrix multiplications, which are well supported on the GPU. Therefore, we adopt it at the first stage to speed up the normalization.
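A compact sketch of this coupled iteration is shown below. The trace pre-scaling and the final re-scaling follow the recipe of iSQRT-COV [20]; whether the paper applies exactly this variant, and the dimensions used in the example, are assumptions.

```python
import torch

def newton_schulz_sqrt(a, num_iters=5):
    """Approximate A^(1/2) with the coupled Newton-Schulz iteration of
    Equations (4) and (5); only matrix-matrix products are used, so the
    whole procedure is GPU-friendly.
    """
    d = a.shape[-1]
    eye = torch.eye(d, dtype=a.dtype, device=a.device)
    tr = a.diagonal(dim1=-2, dim2=-1).sum(-1)   # tr(A): pre-normalization factor
    y = a / tr                                  # Y_0 (scaled so the iteration converges)
    z = eye.clone()                             # Z_0 = I
    for _ in range(num_iters):
        t = 0.5 * (3.0 * eye - z @ y)
        y = y @ t                               # Y_k -> (A / tr)^(1/2)
        z = t @ z                               # Z_k -> (A / tr)^(-1/2)
    return y * tr.sqrt()                        # undo the pre-scaling

x = torch.randn(196, 256)
a = x.t() @ x + 1e-5 * torch.eye(256)
a_sqrt = newton_schulz_sqrt(a)
print(torch.dist(a_sqrt @ a_sqrt, a) / torch.linalg.norm(a))  # small relative error
```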
Finally, the normalized representation $\hat{A}$ is flattened into a vector and sent to a linear classifier. The network is trained with the cross-entropy loss $\mathcal{L}_{ce}$:

$\mathcal{L}_{ce} = -\sum_{s}\sum_{c} y_{s,c} \log p_{s,c}$,  (6)

where $y_{s,c}$ and $p_{s,c}$ are, respectively, the target and predicted probability values at class $c$ of the $s$-th training sample.
3.3. Second Stage
At the second stage, we remove the square root normalization because, even with the efficiency gains above, Newton's iteration still has an expensive computational complexity of $O(D^3)$, where $D$ is the feature dimension.
However, as Equation (3) shows, square root normalization square-roots the eigenvalues of the matrix $A$, so removing the normalization changes both the magnitude and the distribution of the eigenvalues. To quantify this change, we define the magnitude of eigenvalues (M.E.) $m$ as the sum of the eigenvalues $\lambda_i$:

$m = \sum_{i=1}^{D} \lambda_i = \mathrm{tr}(A)$,  (7)

where $m$ can be obtained via the trace of $A$. Subsequently, $A$ is divided by $m$, giving $\tilde{A} = A / m$, whose eigenvalues always sum to 1. The distribution of eigenvalues (D.E.) $d$ can then be defined as the Frobenius norm of $\tilde{A}$:

$d = \lVert \tilde{A} \rVert_{F} = \sqrt{\sum_{i=1}^{D} (\lambda_i / m)^2}$.  (8)
The value of $d$ indicates the flatness of the eigenvalue distribution. For example, $d = 1$ corresponds to the most peaky distribution, where one eigenvalue equals 1 and the others are zero. On the contrary, $d = 1/\sqrt{D}$ corresponds to the flattest distribution, where all eigenvalues equal $1/D$. Within the interval $[1/\sqrt{D}, 1]$, a smaller value means a flatter distribution and vice versa.
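Both statistics can be computed without any eigendecomposition, as the following sketch illustrates; the function name `eigen_stats` and the batched layout are our own choices.

```python
import torch

def eigen_stats(a):
    """Magnitude (M.E.) and distribution (D.E.) of the eigenvalues of a
    bilinear representation, following Equations (7) and (8).

    a: (..., D, D) symmetric matrix (a leading batch dimension is allowed).
    M.E. = sum of eigenvalues = tr(A), so no eigendecomposition is needed.
    D.E. = ||A / tr(A)||_F, which lies in [1/sqrt(D), 1]: close to 1 for a
    peaky spectrum, close to 1/sqrt(D) for a flat one.
    """
    m = a.diagonal(dim1=-2, dim2=-1).sum(-1)                         # trace
    d = torch.linalg.matrix_norm(a / m[..., None, None], ord='fro')
    return m, d

x = torch.randn(196, 256)
a = x.t() @ x
m, d = eigen_stats(a)
print(float(m), float(d))   # d lies between 1/sqrt(256) = 0.0625 and 1
```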
For our approach, RegCov, when the first stage terminates we record the average distribution and magnitude values $\bar{d}$ and $\bar{m}$ over all training samples. After the removal of square root normalization, we add two regularization terms to the loss function during training, and the network is encouraged to output a bilinear representation whose $m$ and $d$ approach the targets $\bar{m}$ and $\bar{d}$:

$\mathcal{L} = \mathcal{L}_{ce} + \alpha \mathcal{L}_{M} + \beta \mathcal{L}_{D}$.  (9)

Here, $\mathcal{L}_{M}$ and $\mathcal{L}_{D}$ push the current $m_s$ and $d_s$ of the $s$-th training sample towards $\bar{m}$ and $\bar{d}$, and $\alpha$ and $\beta$ are loss weights whose values will be further studied in Section 4.6.
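A possible implementation of this second-stage objective is sketched below, reusing `eigen_stats` from the previous snippet. The absolute-difference penalty and the averaging over the batch are assumptions, since Equation (9) only fixes the overall form of the loss.

```python
import torch
import torch.nn.functional as F

def regcov_loss(logits, targets, a, m_target, d_target, alpha, beta):
    """Second-stage objective of Equation (9): cross-entropy plus two scalar
    regularizers pulling the eigenvalue magnitude/distribution of the current
    bilinear representation towards the targets recorded after stage one.

    logits:  (B, C) classifier outputs; targets: (B,) class indices.
    a:       (B, D, D) bilinear representations of the batch.
    """
    m, d = eigen_stats(a)                     # from the previous sketch
    ce = F.cross_entropy(logits, targets)
    loss_m = (m - m_target).abs().mean()      # L_M: align the magnitude
    loss_d = (d - d_target).abs().mean()      # L_D: align the distribution
    return ce + alpha * loss_m + beta * loss_d
```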
3.4. Discussion
We insist on explicit square root normalization at the first stage because our method, RegCov, needs accurate targets $\bar{m}$ and $\bar{d}$ for Equation (9). Across different datasets and network architectures, these targets can vary greatly, and inaccurate target values make the bilinear representation dissimilar to the normalized one, which eventually harms classification performance.
In addition, the computational complexity of Newton's iteration is $O(D^3)$, which is much higher than the $O(D)$ of the state-of-the-art approach DropCov in the training phase. This limitation slows down training in each epoch. However, Section 4.3 will show that Newton's iteration needs fewer training epochs, so the training time spent on our network remains acceptable. Moreover, by replacing Newton's iteration with a more efficient matrix square root, called MPA [51], the training of our approach can be accelerated further; the feasibility of MPA is studied in Section 4.7.
Another advantage of RegCov is its efficiency at the second stage. It simplifies the alignment of eigenvalues into the regularization of two scalar values, $m$ and $d$. As detailed in Section 4.5, RegCov yields approximately square-rooted eigenvalues. Additionally, RegCov avoids the explicit calculation of each eigenvalue via the computationally expensive SVD, which makes training faster.
RegCov first trains a teacher network whose bilinear pooling contains square root normalization. The normalization is then removed, and the resulting network is a student network that is trained with prior knowledge from the teacher. This methodology makes RegCov similar to knowledge distillation (KD) [58], where knowledge distilled from a teacher model improves a lightweight student model. Most KD methods use either the teacher's output logits [58,59,60] or features from intermediate layers [61,62,63,64,65] as the distilled knowledge. In contrast, RegCov only needs two scalar values from the bilinear representation, making it simple yet effective.
4. Experiments
In this section, we first describe the implementation details of RegCov in Section 4.1. In Section 4.2, we verify the generalization of the approach by implementing it in four common pre-trained deep network backbones and fine-tuning the networks on three fine-grained visual recognition datasets. Additionally, we train our method from scratch on a large-scale object classification dataset and compare it with other normalization methods in Section 4.3. We also evaluate our method on a remote sensing dataset in Section 4.4. We further investigate the effectiveness of our approach through its alignment of eigenvalues in Section 4.5. Finally, we perform hyper-parameter analysis in Section 4.6, computational complexity analysis in Section 4.7, and visualization analysis in Section 4.8.
4.1. Implementation Details
Models. For the experiments on the three fine-grained visual recognition datasets, following common settings [9,13,31], we used networks pre-trained on the ImageNet1K dataset and extracted the feature matrix $X$ after the last convolutional block in VGG-16 and before global average pooling in the other networks. Bilinear pooling as in Equation (1) then followed, except for ResNet-50, where we applied the compact bilinear pooling of PSMR [9] because the feature dimension $D$ is too large. When training from scratch, we adopted the CNN backbones from [8,10,21], where the feature dimension of $X$ was reduced to 256 with a 1 × 1 convolutional layer.
Training and Evaluation. For the fine-grained visual recognition tasks, we adopted the same image pre-processing as in [20,50] for a fair comparison. Input images were resized to 448 × 448 for both training and inference, and random horizontal flipping was the only data augmentation applied during training. The training is composed of two stages. At the first stage, we trained only the last FC layer for 100 epochs with the Stochastic Gradient Descent (SGD) optimizer with momentum and weight decay, and then fine-tuned the whole network for another 100 epochs. The batch size was set to 32, and the learning rate was divided by 10 whenever the training loss remained above a fixed multiple of the current minimum loss value for 10 epochs. At the second stage, we removed the normalization part and continued to fine-tune the network for another 15 epochs, reducing the initial learning rate by a factor of 10 at the 9th epoch. For inference, we averaged the classification scores of the input image and its flipped version as the final prediction; this setting is common in the bilinear pooling field [8,13].
For training a CNN on the large-scale ImageNet1K dataset, following refs. [10,20,21], a random crop of the original image was taken during training and resized to 224 × 224, and a random horizontal flip was also applied. At the first stage, the optimizer was SGD with an initial learning rate of 0.1. The network was trained for 80 epochs for ResNet-18 and 65 epochs for ResNet-50, with the learning rate decayed by 10 at epochs 30, 60 and 75 for ResNet-18 and at epochs 30, 45 and 60 for ResNet-50. At the second stage, the bilinear normalization was removed and the network was fine-tuned with a learning rate of 0.001, decayed by 10 after 9 of the 15 epochs in total. For inference, the input was a 224 × 224 center crop of the resized image whose shorter edge was matched to 256.
Throughout all the experiments, the loss weight parameters $\alpha$ and $\beta$ in Equation (9) were changed linearly at every training iteration. For the tasks on ImageNet1K, $\alpha$ and $\beta$ were each increased linearly from an initial to a final value; for the other tasks, $\alpha$ was increased over a different range and the schedule for $\beta$ was kept the same. The studies on the loss weight parameters are provided in Section 4.6.
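The schedule can be realized as a simple linear ramp, as in the sketch below; the start and end values in the usage comment are placeholders, not the values used in the paper.

```python
def ramp_weight(step, total_ramp_steps, start, end):
    """Linear schedule for the loss weights alpha and beta in Equation (9):
    the weight grows from `start` to `end` over `total_ramp_steps` iterations
    and then stays at `end`.
    """
    t = min(step / max(total_ramp_steps, 1), 1.0)
    return start + t * (end - start)

# Hypothetical usage inside the second-stage training loop:
# alpha = ramp_weight(global_step, total_ramp_steps, start=0.0, end=1.0)
# beta  = ramp_weight(global_step, total_ramp_steps, start=0.0, end=1.0)
```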
4.2. Experiments on Small-Scale Datasets
In this section, we evaluate the performance of the proposed approach on three fine-grained image classification benchmarks. Caltech-UCSD Birds-200-2011 (Birds) [25] includes 5994 training images and 5794 test images from 200 bird species. The FGVC-Aircraft benchmark (Aircrafts) [66] consists of 100 aircraft categories, each with 67 training images and 33 test images. Stanford Cars (Cars) [67] provides 16,185 images of 196 car classes with a roughly 50–50 split per class, resulting in 8144 training images and 8041 test images.
Table 1 presents the classification accuracy and inference speed of our approach and its counterparts. The accuracy results of the counterparts are all taken from the corresponding papers, except for DropCov, whose results on these benchmarks are absent from its paper. Our approach achieves a good balance between running speed and classification performance. When the backbone is VGG-16, our approach significantly improves on the performance of BCNN, even though the two works share the same deep network architecture. Similarly, the average accuracy increases by 6.9% for ResNet-50, where our network is identical to CBP. iSICE shows very competitive accuracy, especially for VGG-16 and ResNet-50, and on average its accuracy with VGG-16 is superior to ours. However, iSICE adopts Newton's iteration, which greatly decelerates the whole network. Even worse, to avoid high computation, iSICE reduces the feature dimension $D$ to 256. This reduction causes information loss from the original features and negatively impacts accuracy. The effect is more obvious for ResNet-50, ResNet-101, ConvNeXt-T [68], Swin-T [69] and Swin-B [69], whose original $D$ is large. Compared to iSICE, RegCov saves the time of the normalization while using the original features.
The most relevant counterpart to our approach is DropCov. Its network architecture during inference is identical to ours, so the running speed is the same. However, in terms of classification performance, ours is superior to DropCov by a clear margin. This validates our claim that, due to the lack of training samples, the network branch in DropCov is not trained sufficiently and has difficulty predicting appropriate dropout rates. In contrast, at the first training stage of our approach, Newton's iteration is adopted for square root normalization. Unlike DropCov, Newton's iteration is not data-driven, so it is naturally robust to small datasets. Thanks to this strong baseline, our networks are trained more effectively at the second stage, where the normalization is omitted.
4.3. Experiments on ImageNet1K
We evaluate our method on large-scale image classification using the ImageNet1K dataset, which includes 1.28 M training images, 50 K validation images and 100 K test images from 1000 classes. For a fair comparison with DropCov, we apply Global Covariance Pooling (GCP) instead of plain bilinear pooling: GCP centers each feature vector to zero mean, $\bar{x}_i = x_i - \frac{1}{N}\sum_{j=1}^{N} x_j$, before the bilinear product. We implemented our approach in ResNet-18 and ResNet-50 and trained the networks from scratch. The counterparts include Plain GCP [13] (i.e., GCP without normalization), iSQRT-COV [20], Layer Normalization (LN) [41] and DropCov [10].
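A minimal sketch of this centering step, mirroring the earlier bilinear pooling snippet, is given below; the 1/N scaling and the stabilizing constant are assumptions.

```python
import torch

def gcp(x, eps=1e-5):
    """Global Covariance Pooling: center each image's local features before
    the bilinear product. x: (B, N, D) local features.
    """
    x = x - x.mean(dim=1, keepdim=True)            # zero mean over the N locations
    a = x.transpose(1, 2) @ x / x.shape[1]         # (B, D, D) covariance matrix
    return a + eps * torch.eye(x.shape[2], device=x.device)
```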
In Table 2, the Top-1 classification accuracy on the validation set, inference speed and training time are reported. Neither our method nor plain GCP applies the normalization at inference, yet ours clearly outperforms plain GCP in accuracy at equal inference speed. Meanwhile, iSQRT-COV applies square root normalization with Newton's iteration and achieves competitive accuracy, but its running speed is unsatisfactory because the normalization has a computational complexity of $O(D^3)$. Layer Normalization is an element-wise method that normalizes each sample by its mean and variance and then applies a linear transformation to each element; its improvement in accuracy is marginal and it consumes more inference time than ours. DropCov and RegCov achieve the best balance between inference speed and accuracy. Regarding training time, at the first stage our method shares the same network as iSQRT-COV, whose square root normalization effectively normalizes the bilinear representation and leads to fast convergence [20]. DropCov, by contrast, repeatedly samples a random subset of features at every iteration, so a proportion of the features may not be updated for several epochs. This behavior makes training harder and requires more epochs to converge [27]. Therefore, although square root normalization is used in our method, its training time is acceptable and is even smaller than that of DropCov for ResNet-50.
4.4. Experiments on UC Merced Dataset
We evaluated the proposed approach on the UC Merced dataset [71], a benchmark dataset for remote sensing scene classification. This dataset comprises 21 land-use categories, with each category containing 100 scene images of size 256 × 256 × 3 pixels.
We compared the proposed method with IB-CNN [72], which explores the application of bilinear pooling to remote sensing scene classification. For a fair comparison, we adopted the experimental settings described in [72] and used ResNet-34 as the backbone network. Additionally, 50% of the images were randomly selected for training, while the remaining images were reserved for testing. Classification performance was evaluated using two metrics: overall accuracy (OA) and the Kappa coefficient ($\kappa$). To ensure statistical reliability, each experiment was repeated five times, and the average results are reported as the final accuracy.
Table 3 summarizes the classification results. In addition to IB-CNN, the results of a fine-tuned CNN (F-CNN) and a bilinear CNN (B-CNN) are also included in the table. Compared to F-CNN, B-CNN achieves better performance, demonstrating that bilinear pooling improves classification accuracy. By leveraging its joint pooling method, IB-CNN generates more compact bilinear representations while preventing an increase in the parameter size of the subsequent fully connected layer, which mitigates overfitting and thereby improves network performance. In contrast to IB-CNN, our RegCov method incorporates normalization, and its superior results highlight the robustness of our approach across different classification tasks. These findings further emphasize that normalization plays a crucial role in boosting performance.
4.5. Alignment of Eigenvalues
According to Equation (3), square root normalization produces the normalized bilinear representation $\hat{A}$ by square-rooting the eigenvalues of the matrix $A$. Our approach, RegCov, adds regularization terms to the objective function in Equation (9) and encourages the network without the normalization to output a normalized bilinear representation. To verify this, Figure 4 plots the distribution of the largest 25 eigenvalues of $A$, of $\hat{A}$, and of the matrix produced by our approach. For each dataset, we randomly selected 256 samples, calculated the eigenvalues of each sample, sorted them in descending order, and averaged the eigenvalues across the samples. Compared to the spectrum before square root normalization (red curves), the square root normalization considerably changes the distribution of the largest 25 eigenvalues (blue curves), making it more uniform. RegCov succeeds in bringing its spectrum (green curves) close to the normalized one (blue curves); for the large eigenvalues in particular, the green and blue curves almost overlap, and good normalization of the principal eigenvalues is key to enhancing the bilinear representation [23,46]. Finally, at the sample level, our method effectively reduces the distance to the eigenvalues after square root normalization, as shown in Table 4.
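The averaging procedure behind Figure 4 can be reproduced with a few lines of PyTorch, as sketched below; the function name and the batched eigenvalue call are our own choices.

```python
import torch

def average_spectrum(mats, top_k=25):
    """Average the largest eigenvalues over a set of bilinear matrices, as
    done for Figure 4: sort each sample's eigenvalues in descending order,
    keep the largest top_k, and average across samples.

    mats: (S, D, D) stack of symmetric matrices (e.g., S = 256 samples).
    """
    lam = torch.linalg.eigvalsh(mats)     # (S, D), ascending order
    lam = lam.flip(-1)[:, :top_k]         # descending order, largest top_k
    return lam.mean(dim=0)                # (top_k,) averaged spectrum
```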
4.6. Hyper-Parameter Analysis
Impact of the Number of Epochs. In Table 5, we analyze the impact of the number of epochs at the second stage. Throughout these experiments, the first portion of the epochs is used to linearly increase $\alpha$ and $\beta$. As observed, the accuracy increases steadily with the number of training epochs and saturates after 10 epochs. We choose the setting of 15 epochs for all experiments, considering the good balance between training time and network performance.
Impact of the Weights $\alpha$ and $\beta$. As indicated in Equation (9), the weights determine the regularization strength and should be chosen carefully. In Table 6, firstly, setting $\alpha$ and $\beta$ to zero is sub-optimal, showing the necessity of the regularization terms. Secondly, fixed values during training bring only marginal improvement. For example, high fixed values make the regularization terms $\mathcal{L}_{M}$ and $\mathcal{L}_{D}$ dominate Equation (9) and cause the network to minimize the classification loss $\mathcal{L}_{ce}$ slowly. On the other hand, low fixed values impose weak regularization on the bilinear representation and consequently hurt accuracy. For these reasons, we linearly increase the weights so that the network can focus on improving classification performance at the beginning while avoiding the subsequent overfitting that weak regularization would cause.
4.7. Computational Complexity Analysis
Our approach can be considered a variant of knowledge distillation, because RegCov employs the targets $\bar{m}$ and $\bar{d}$ recorded after the first training stage to guide a lighter network at the second stage. Hence, in this section, we use the VGG-16 backbone on the Birds dataset and compare our approach with other knowledge distillation approaches, as shown in Table 7. The Logits [58] and Features [61] approaches refer, respectively, to using the prediction logits and the normalized bilinear representation of the first-stage network as the distilled knowledge. These two approaches require more training time and more parameters because they must run the teacher network at every training iteration. In addition, they cannot transfer exact knowledge about the eigenvalues of the bilinear representation matrix, which limits the effect of the normalization and finally results in lower accuracy. SVD calculates the eigenvalues explicitly and consumes much more training time than ours due to its poor efficiency on the GPU.
Furthermore, at the first training stage, we selected Newton's iteration as the explicit square root normalization. However, as discussed in Section 3.4, it would be valuable to employ a more efficient approach to compute the matrix square root. To this end, we utilized the official implementation of a state-of-the-art method, MPA [51], and conducted experiments on the Birds dataset [25] using the VGG-16 network. As shown in Table 8, MPA is faster than Newton's iteration during the training phase, which is consistent with the findings in ref. [51]. However, MPA yields lower classification accuracy in Table 8. We found that this is attributable to the reduced accuracy of the square-rooted matrix computed by MPA, particularly when the matrix dimension is large (please see Appendix A for details). Given that the matrix dimension in RegCov is at least 256, and considering the inefficiency of the SVD variants shown in Table 8, we retain Newton's iteration.
4.8. Visualization Analysis
In Figure 5, we show heatmaps of the feature matrix $X$ obtained by computing its $\ell_2$-norm across the channel dimension. Square root normalization suppresses dominant features and exploits more discriminative regions, where the corresponding features are enhanced. Our approach regularizes the features and outputs similar feature heatmaps without explicit normalization.
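A small sketch of this visualization step follows; the min-max rescaling to [0, 1] for display is our assumption.

```python
import torch

def feature_heatmap(x):
    """Spatial heatmap as in Figure 5: the l2-norm of each local feature
    across the channel dimension, rescaled to [0, 1] for display.

    x: (W, H, D) local feature tensor of a single image.
    """
    h = x.norm(p=2, dim=-1)                              # (W, H) per-location magnitude
    return (h - h.min()) / (h.max() - h.min() + 1e-12)
```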