Article

Scale-Aware Network with Scale Equivariance

Electronic College of Engineering, Naval University of Engineering, Wuhan 430033, China
* Author to whom correspondence should be addressed.
Photonics 2022, 9(3), 142; https://doi.org/10.3390/photonics9030142
Submission received: 5 January 2022 / Revised: 17 February 2022 / Accepted: 24 February 2022 / Published: 28 February 2022

Abstract

The convolutional neural network (CNN) has achieved good performance in object classification due to its inherent translation equivariance, but its scale equivariance is poor. A Scale-Aware Network (SA Net) with scale equivariance is proposed to estimate the scale during classification. The SA Net learns samples of only one scale in the training stage; in the testing stage, the unknown-scale testing samples are up-sampled and down-sampled to generate a group of image copies with different scales that form an image pyramid. The up-sampling uses interpolation, and the down-sampling uses interpolation combined with the wavelet transform to avoid spectrum aliasing. The generated multi-scale test samples are sent to a Siamese network with weight sharing for inference. According to the position of the maximum value of the classification-score matrix, the testing sample is classified and its scale is estimated simultaneously. Results on the MNIST and FMNIST datasets show that the SA Net performs better than existing methods: when the scale is larger than 4, the SA Net has higher classification accuracy than other methods, and in the scale-estimation experiment it achieves a low relative RMSE at every scale. The SA Net has potential for effective use in remote sensing, optical image recognition and medical diagnosis in cytohistology.

1. Introduction

The convolutional neural network (CNN) has achieved good performance in object detection and classification due to its inherent translation equivariance [1,2]. Let $\Phi: U \to V$ denote the mapping from the image field to the feature field. For an input image $x \in U$, the sliding-window convolution and weight sharing in the CNN ensure that a feature can always be detected no matter where it appears. Besides reducing the number of network parameters, they also guarantee that the position of the maximum response on the feature map shifts accordingly with a translation of the input, as shown in Equation (1):
$\Phi(Tx) = T'\Phi(x)$ (1)
where $T$ and $T'$ represent the translation transformations of the image field and the feature field, respectively. Equation (1) shows that the CNN has translation equivariance [3,4,5,6].
In addition, the conventional CNN is resilient to slight deformations (translation, rotation, and isotropic scaling) due to the pooling operation. As long as the translation of pixels does not exceed the pooling kernel size, the output of max-pooling remains unchanged. However, the pooling kernel size is limited, so translation invariance is not evident in the shallow layers of the network [7,8]. As the network deepens, the translation invariance of the deep CNN grows with the receptive field. No matter where the input feature is, the CNN can detect it and predict the same label, and the data distribution remains approximately unchanged; the network then has global translation invariance. When $T'$ is the identity transformation, invariance can be regarded as a special case of equivariance [3], and Equation (1) can be rewritten as:
$\Phi(Tx) = \Phi(x)$ (2)
Translation equivariance and invariance are inherent properties of the CNN and depend on its architecture. However, in many practical applications, the scale equivariance of the CNN is more critical. For example, in cytohistology, the size of a tissue or cell reflects whether the cell is normal or cancerous [9], as shown in Figure 1. Different cell sizes must be distinguishable in the outputs, and a network that meets this requirement is a scale-equivariant CNN. In addition, in the real world, a change in the distance between an object and the observer directly changes the scale of the object's projection on the retina, producing the 'near large, far small' characteristic of human vision, which conveys the depth and volume of an object through its apparent size.
Scale variation is as common as translation. The existing CNN adopts both sliding-window convolution and pooling in order to adapt to scale changes, but its scale equivariance is poor [10]. For example, after the input image is scaled, the learned convolution kernel no longer matches the scaled image, as shown in Figure 2, resulting in weak equivariance to scaling.
In fact, for a single convolution-pooling layer, since the convolution is carried out by sliding a window over the image line by line, a pattern can be detected both before and after translation. However, image scaling and the translation performed by the sliding-window convolution are different spatial transformations, so scaling hinders equivariance. In addition, for the max-pooling layer, corresponding pixels before and after scaling generally do not fall in the same pooling window, so the responses before and after scaling often differ, which also hinders the invariance of the network. Thus, in a single convolution-pooling layer, scaling hinders both equivariance and invariance. As the network deepens, the receptive field gradually increases; unlike translation invariance and equivariance, which gradually strengthen, the degradation of the CNN's scale invariance and equivariance becomes more obvious [11]. Therefore, the CNN is quite vulnerable to scaling, and a scaling disturbance may cause misclassification. Due to this lack of equivariance, the CNN cannot estimate scale.
To solve these problems, we propose a Scale-Aware Network (SA Net) with scale equivariance that estimates scale during classification. The specific method is as follows. Only one scale pattern is learned in the training stage. In the testing stage, the testing sample with an unseen scale is zoomed in and out to generate a group of images with different scales that form an image pyramid. The image-zooming-in channels are up-sampled by bilinear interpolation. The image-zooming-out channels are down-sampled; unlike using Gaussian derivatives to limit the bandwidth [12], a combination of the dyadic discrete wavelet transform (DWT) and bilinear interpolation is used to avoid spectrum aliasing.
Dyadic zooming out uses the discrete wavelet transform directly, and non-dyadic zooming out up-samples the result of the next-level discrete wavelet transform by bilinear interpolation. The image pyramid with different scales is then sent to the multi-channel Siamese CNN with weight sharing for inference, and a two-dimensional classification-score matrix is obtained. Classification and scale estimation are carried out simultaneously from the position of the maximum value of the classification-score matrix.
Our main contributions can be summarized as follows.
1. In the testing stage, we scale the input image to form an image pyramid, in which the down-sampling of the image adopts dyadic DWT to avoid spectrum aliasing.
2. The weights between channels are shared, and multi-channel inference is carried out simultaneously. Learning only one scale yields generalization to multi-scale images in multi-scale target recognition, thus reducing the number of model parameters.
3. Different from the existing methods, the SA Net is sensitive to different scales of the same class during the testing stage. The SA Net establishes a scale-related inferencing branch without an additional scale-learning branch in order to enable the network to estimate unknown scales. Therefore, we can estimate the scale of the image while predicting the class.
The remainder of this paper is organized as follows. In Section 2, we introduce the related work. In Section 3, we present our proposed SA Net, including the details of the training and testing stages and how the image pyramid is formed. In Section 4, we present the experimental results and implementation details in order to validate the effectiveness of our proposed method. Finally, in Section 5, we conclude our work.

2. Related Work

Methods to improve the scale equivariance of the CNN mainly fall into two families: multi-scale training and single-scale training. The SA Net belongs to the second family.

2.1. Multi-Scale Training

Multi-scale training can be divided into transform feature representations [11,13,14,15,16] and transform convolution kernels [10,17,18,19].
Transforming feature representations is equivalent to implicitly increasing the number of training samples, and the most representative way is the conventional-data-augmentation (CDA) strategy, which can be executed online or offline. The purpose of online and offline CDA is the same, i.e., to improve the model's generalization performance by learning as many samples as possible. Given abundant randomly scaled training samples, a CNN may score the same sample consistently at different scales. In addition, Kanazawa et al. [13] proposed a Locally Scale-Invariant CNN (LSI-ConvNet), which scales the input image to multiple scales in a specified way, convolves the images of different scales with the same convolution kernel, normalizes the feature maps by undoing the scaling, and adopts max-pooling over scales at each spatial location to obtain locally scale-invariant representations. However, useful information such as the scale variation itself is discarded, which reduces the model's performance in some specific tasks. For example, in cytohistology, the size of a cell reflects whether its function is normal; the above methods are not scale sensitive and cannot be used in such fields.
For transforming convolution kernels, Xu et al. [20] proposed a representative Scale-Invariant ConvNet (SI-ConvNet), whose convolution kernels can be regarded as a set of filters. Although the filters have different sizes, they share the same parameters. Each filter corresponds to a channel and detects patterns at a different scale, and the last-layer outputs of the channels are stacked into a feature vector, which is then input to the softmax layer for classification. The results show that the SI-ConvNet is insensitive to scale changes and therefore less susceptible to object-scale variations. Ghosh et al. [21] proposed the Scale-Steerable ConvNet (SS-ConvNet), in which log-radial harmonics are introduced as the steerable basis and the filters of each convolution layer are linear combinations of the basis filters. These methods provide a new paradigm for enhancing a model's scale invariance, but they also focus only on invariance. In addition, they depend heavily on the training process, which inevitably increases the complexity of the network structure.

2.2. Single-Scale Training

Single-scale training uses single-scale samples instead of multi-scale samples in the training stage and thus reduces the number of parameters and the computational cost. Lindeberg et al. proposed the FovMax and FovAvg networks [22], which adopt weight sharing between different channels together with global max or average pooling over the outputs of all channels in order to achieve scale invariance for digit classification on the MNIST Large-Scale dataset. In addition, Lindeberg presented a theoretical analysis of the invariance and covariance properties of the proposed networks and, for the first time, explored their ability to generalize to scales that are not present in the training set. The proposed networks reduce the number of parameters and the computational cost, but using a Gaussian derivative to avoid spectrum aliasing runs counter to the simplicity of the network structure.
As mentioned above, Lindeberg's method cannot estimate unseen scales because of the global pooling, but it provides inspiration for scale equivariance. Based on single-scale training, the SA Net uses the wavelet transform and combines the outputs of all channels to achieve scale equivariance, thus estimating previously unseen scales. The results on the MNIST and FMNIST Large-Scale datasets show that it has good classification and scale-estimation performance.

3. Methods

To highlight the differences between the training and testing procedures, the SA Net is divided into two stages, training and testing, as shown in Figure 3.

3.1. Training Stage

The SA Net only needs to learn samples of one scale in the training stage and save the model, which reduces the scale diversity required of the training dataset. According to Figure 3, the classification-score vector $f$ is obtained after a training sample passes through the CNN. For the MNIST dataset, the length of $f$ is 10, corresponding to the ten Arabic numerals 0–9. The predicted class is:
$\hat{c} = \arg\max(f)$ (3)
In the training stage, the cross-entropy loss is calculated according to the predicted class and the ground-truth label, and the parameters are then updated by back-propagation.
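A minimal PyTorch sketch of this training stage is given below, assuming the model ends with a logsoftmax layer (as the benchmark CNN in Section 4.2 does); `model`, `train_loader` and `device` are illustrative names, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def train_one_epoch(model, train_loader, optimizer, device="cuda"):
    # The SA Net is trained on canonical-scale samples only, so the loader is
    # assumed to yield single-scale images and their ground-truth labels.
    model.train()
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        log_probs = model(images)             # logsoftmax scores, shape (batch, 10)
        loss = F.nll_loss(log_probs, labels)  # cross-entropy on log-probabilities
        loss.backward()                       # back-propagation
        optimizer.step()                      # parameter update (Adam in Section 4.2)
```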

3.2. Testing Stage

3.2.1. Discrete-Wavelet Transform

To avoid the spectrum aliasing caused by direct down-sampling, we use the dyadic DWT. For a two-dimensional image $f(x, y)$ with an input size of $P \times Q$, we define the two-dimensional (2D) DWT pair as:
$W_\varphi(a_0, p, q) = \frac{1}{\sqrt{PQ}} \sum_{x=0}^{P-1} \sum_{y=0}^{Q-1} f(x, y)\, \varphi_{a_0, p, q}(x, y)$ (4)
$W_\psi^b(a, p, q) = \frac{1}{\sqrt{PQ}} \sum_{x=0}^{P-1} \sum_{y=0}^{Q-1} f(x, y)\, \psi_{a, p, q}^b(x, y)$ (5)
where $\varphi_{a_0, p, q}(x, y)$ is the scaling function and $\psi_{a, p, q}^b(x, y)$ is the wavelet function, defined as follows:
$\varphi_{a, p, q}(x, y) = 2^{a/2}\, \varphi(2^a x - p,\ 2^a y - q)$ (6)
$\psi_{a, p, q}^b(x, y) = 2^{a/2}\, \psi^b(2^a x - p,\ 2^a y - q)$ (7)
where $a_0$ is the original scale of the image, $W_\varphi(a_0, p, q)$ is the low-frequency coefficient, which represents the approximation of the image at scale $a_0$, and $W_\psi^b(a, p, q)$, $b \in \{H, V, D\}$, are the high-frequency coefficients, containing the horizontal, vertical and diagonal details at scales $a \geq a_0$. Equation (4) shows that an approximate representation of the input image is obtained after the DWT, which plays the same role as traditional pooling. Equation (5) retains the high-frequency components, from which, together with the low-frequency component, the original image can be restored. The SA Net keeps only the low-frequency information after the first-level DWT decomposition, as shown in Equation (4), thereby effectively avoiding spectrum aliasing during down-sampling.
The fast wavelet transform (FWT) [23] can be expressed as follows:
$W_\psi(a, q) = \sum_{p} h_\psi(p - 2q)\, W_\varphi(a + 1, p)$ (8)
$W_\varphi(a, q) = \sum_{p} h_\varphi(p - 2q)\, W_\varphi(a + 1, p)$ (9)
Equations (8) and (9) give the relationship between DWT coefficients at adjacent scales. The approximation and detail coefficients at scale $a$ can be computed iteratively by convolving $W_\varphi(a + 1, p)$ with the time-reversed scaling and wavelet vectors, $h_\varphi(p - 2q)$ and $h_\psi(p - 2q)$. The 2D FWT is similar to the 1D FWT, except that the 2D FWT has three sets of detail coefficients, in the horizontal, vertical and diagonal directions [24,25].
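As a concrete illustration of this down-sampling step, the following sketch uses the PyWavelets library to perform a one-level 2D DWT and keep only the low-frequency (LL) band. The specific biorthogonal member 'bior2.2' is an assumption; the paper only names the bior family (Section 4.3).

```python
import numpy as np
import pywt

def dwt_downsample(image: np.ndarray, wavelet: str = "bior2.2") -> np.ndarray:
    """Dyadic down-sampling by a one-level 2D DWT, keeping only the LL band.

    The approximation (low-frequency) coefficients give a roughly half-resolution
    image without the spectrum aliasing of naive strided down-sampling.
    """
    ll, (lh, hl, hh) = pywt.dwt2(image, wavelet)  # LL and (horizontal, vertical, diagonal) details
    return ll                                     # high-frequency bands are discarded, as in Equation (4)
```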

3.2.2. Image Pyramid

The testing sample $x$ with an unseen scale is zoomed in and out to generate a group of images with different scales that form an image pyramid. The method is as follows: $x$ is rescaled according to the geometric scale set $S = \{s_1, s_2, \ldots, s_L\}$ to generate an image pyramid $\{x_{s_1}, x_{s_2}, \ldots, x_{s_L}\}$ composed of multi-scale copies, where $L$ is the number of channels (i.e., the number of layers of the image pyramid). For $s_i > 1$, that is, in the image-zooming-in channels, the testing sample is up-sampled by bilinear interpolation, as in [22]. For $s_i < 1$, that is, in the image-zooming-out channels, the testing sample is down-sampled; unlike using a Gaussian derivative to limit the bandwidth [22], the dyadic discrete wavelet transform (DWT) combined with interpolation is used to avoid the spectrum aliasing caused by down-sampling. Only the low-frequency components of the dyadic DWT are retained, as shown in Figure 4. Dyadic zooming out uses the DWT directly, and non-dyadic zooming out up-samples the result of the next-level DWT by bilinear interpolation.
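The following sketch, under the assumptions above (PyWavelets for the DWT, bilinear interpolation via PyTorch), illustrates one way such a pyramid could be built; function and parameter names are illustrative, not taken from the paper.

```python
import math

import numpy as np
import pywt
import torch
import torch.nn.functional as F

def rescale(image: np.ndarray, factor: float, wavelet: str = "bior2.2") -> np.ndarray:
    """Rescale one test image by `factor` (a sketch of Section 3.2.2).

    factor > 1: bilinear up-sampling only.
    factor < 1: repeated one-level DWT (keeping only the LL band) down to the next
                dyadic level, then bilinear interpolation to the exact target size.
    """
    target = (round(image.shape[0] * factor), round(image.shape[1] * factor))
    out = image.astype(np.float32)
    if factor < 1.0:
        levels = math.ceil(math.log2(1.0 / factor))   # next dyadic level at or below `factor`
        for _ in range(levels):
            out, _ = pywt.dwt2(out, wavelet)          # discard the high-frequency bands
    x = torch.from_numpy(np.ascontiguousarray(out, dtype=np.float32))[None, None]
    x = F.interpolate(x, size=target, mode="bilinear", align_corners=False)
    return x[0, 0].numpy()

def image_pyramid(image: np.ndarray, scales) -> list:
    """Generate the multi-scale copies {x_{s_1}, ..., x_{s_L}} of one testing sample."""
    return [rescale(image, s) for s in scales]
```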
After the previous step, $L$ images with different scales $\{x_{s_1}, x_{s_2}, \ldots, x_{s_L}\}$ are generated from the testing sample. These images are sent to the Siamese CNNs with weight sharing for inference. The Siamese CNNs have $L$ corresponding channels, and the output of channel $i$ is as follows:
$f_i = f(x_{s_i})^{\mathrm{T}}$ (10)
The output $f_i$ of each channel is a flattened vector, and the output of channel $i$ for class $j$ is $f_{ij}$. The flattened vectors are stacked into a two-dimensional matrix:
$F = [f_1, f_2, \ldots, f_i, \ldots, f_L] \in \mathbb{R}^{L \times C}$ (11)
where C is the number of classes of the corresponding dataset.
Softmax is widely used in CNNs for multi-class classification tasks; it maps multiple inputs into the interval $(0, 1)$. However, in the SA Net, the input is a two-dimensional array with many elements. When the input elements are too large or too small, they exceed the representable range of floating-point numbers, resulting in overflow or underflow. To ensure numerical stability and speed up the computation at the same time, we use the logsoftmax to obtain the classification-score matrix $F'$, which is defined as follows:
$F' = \operatorname{logsoftmax}(F) = \log\left(\dfrac{e^{f_{ij}}}{\sum_{i}^{L} \sum_{j}^{C} e^{f_{ij}}}\right) = \log\left(\dfrac{e^{f_{ij}} e^{-M}}{\sum_{i}^{L} \sum_{j}^{C} e^{f_{ij}} e^{-M}}\right) = \log\left(\dfrac{e^{f_{ij} - M}}{\sum_{i}^{L} \sum_{j}^{C} e^{f_{ij} - M}}\right) = (f_{ij} - M) - \log \sum_{i}^{L} \sum_{j}^{C} e^{f_{ij} - M}$ (12)
where $M = \max_{i,j}(F)$, that is, the maximum value of all elements in $F$.
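A short sketch of this numerically stable logsoftmax over the whole $L \times C$ matrix, written with PyTorch (names are illustrative):

```python
import torch

def classification_score_matrix(outputs: torch.Tensor) -> torch.Tensor:
    """Numerically stable logsoftmax over all L x C elements of F, as in Equation (12).

    Subtracting the global maximum M before exponentiation prevents overflow;
    the result is unchanged because the shift cancels between numerator and denominator.
    """
    M = outputs.max()                                      # M = max_{i,j}(F)
    shifted = outputs - M                                  # f_ij - M
    return shifted - torch.logsumexp(shifted.flatten(), dim=0)
```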

3.2.3. Classification and Scale Estimation

As shown in Figure 3, according to the maximum of the two-dimensional classification-score matrix $F'$, we can obtain the predicted class $\hat{c}$ and the estimated scale $\hat{s}$ of the testing sample.
The predicted class $\hat{c}$ and the scale estimate $\hat{s}$ are determined by the coordinate $(i^*, j^*)$ of the maximum of $F'$ and are represented as follows:
$(i^*, j^*) = \arg\max(F')$ (13)
$\hat{c} = j^*, \qquad \hat{s} = 1 / s_{i^*}$ (14)
For a testing set $D = \{x^{(n)}, y^{(n)}, s^{(n)}\}_{n=1}^{N}$ containing $N$ samples, the ground-truth class of $x^{(n)}$ is $y^{(n)}$, its ground-truth scale is $s^{(n)}$ and its estimated scale is $\hat{s}^{(n)}$. The RMSE shown in Equation (15) is generally used to evaluate estimation ability. However, to evaluate the performance of scale estimation at different scales, we use the relative value $\delta$ of the RMSE shown in Equation (16):
$\mathrm{RMSE} = \sqrt{\dfrac{1}{N} \sum_{n}^{N} \left(s^{(n)} - \hat{s}^{(n)}\right)^2}$ (15)
$\delta = \dfrac{\mathrm{RMSE}}{\hat{s}} \times 100\%$ (16)
The detailed method is shown in the pseudo code of Algorithm 1.
Algorithm 1: Classification and scale-estimation algorithm of the SA Net
Input: Testing set $D = \{x^{(n)}, y^{(n)}, s^{(n)}\}_{n=1}^{N}$
for $n = 1$ to $N$ do
  Select the sample $(x^{(n)}, y^{(n)}, s^{(n)})$ in order;
  for $i = 1$ to $L$ do
    $x_{s_i}^{(n)} \leftarrow T_i(x^{(n)})$;
    $f_i \leftarrow f(x_{s_i}^{(n)})^{\mathrm{T}}$;
  end
  // 2-D classification-score matrix
  $F'_{L \times C} \leftarrow \operatorname{logsoftmax}([f_1, f_2, \ldots, f_L]^{\mathrm{T}})$;
  $(i^*, j^*) \leftarrow \arg\max F'_{L \times C}$;
  $\hat{c}^{(n)} \leftarrow j^*$;  // label prediction
  $\hat{s}^{(n)} \leftarrow 1 / s_{i^*}$;  // scale estimation
  if $\hat{c}^{(n)} = y^{(n)}$ then
    $cls \leftarrow cls + 1$;
  end
end
$Acc_{cls} \leftarrow \dfrac{cls}{N} \times 100\%$
$\mathrm{RMSE} \leftarrow \sqrt{\dfrac{1}{N} \sum_{n}^{N} \left(s^{(n)} - \hat{s}^{(n)}\right)^2}$
$\delta \leftarrow \dfrac{\mathrm{RMSE}}{\hat{s}} \times 100\%$
Output: $Acc_{cls}$, $\delta$
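A PyTorch-style sketch of Algorithm 1 is given below. `image_pyramid` and `classification_score_matrix` are the illustrative helpers sketched earlier, `pad_or_crop` is a hypothetical helper that restores each rescaled copy to the network's fixed input size, and mapping the winning channel to $1/s_{i^*}$ follows the convention of Section 3.2.3.

```python
import torch

@torch.no_grad()
def evaluate(model, test_set, scales, input_size, device="cuda"):
    """Joint classification and scale estimation over a testing set (a sketch).

    `test_set` yields (image, label, scale) triples and `scales` is the
    geometric scale set S = {s_1, ..., s_L}.
    """
    model.eval()
    correct, sq_err, n = 0, 0.0, 0
    for image, label, true_scale in test_set:
        rows = []
        for copy in image_pyramid(image, scales):              # L rescaled copies
            x = pad_or_crop(copy, input_size)                  # hypothetical: fixed canvas for the CNN
            x = torch.as_tensor(x, dtype=torch.float32)[None, None].to(device)
            rows.append(model(x))                              # per-channel class scores
        F_mat = classification_score_matrix(torch.cat(rows, dim=0))   # (L, C) score matrix
        i, j = divmod(int(torch.argmax(F_mat)), F_mat.shape[1])       # row = channel, column = class
        pred_class, pred_scale = j, 1.0 / scales[i]            # Equation (14)
        correct += int(pred_class == label)
        sq_err += (true_scale - pred_scale) ** 2
        n += 1
    acc = 100.0 * correct / n                                  # Acc_cls
    rmse = (sq_err / n) ** 0.5                                 # Equation (15)
    return acc, rmse
```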

4. Experiments and Discussion

4.1. Scale Datasets

The traditional MNIST dataset [2] consists only of grayscale images of handwritten digits with a resolution of 28 × 28 at a single scale, so it cannot be used to evaluate the scale-estimation and classification ability of the SA Net on unknown-scale samples. We therefore use the MNIST Large-Scale dataset [22] proposed by Jansson and Lindeberg.
The MNIST Large-Scale dataset is a variant of the original MNIST dataset. Each original MNIST image is first rescaled to 112 × 112 using bicubic interpolation, and the rescaled image is then smoothed and threshold-softened to eliminate edge aliasing. The processed image is then resized by a given factor, with the same scaling ratio for width and height, and the result is embedded into a 112 × 112 image using zero-padding or cropping at the borders. After this processing, the MNIST Large-Scale dataset with a range of scale variations is obtained, as shown in Figure 5. Unlike the original MNIST dataset, which contains only one scale, its scaling ratio spans $s_D \in [1/2, 8]$.
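A rough sketch of this generation procedure (omitting the smoothing and threshold-softening step; names and the exact embedding policy are illustrative) might look as follows:

```python
import numpy as np
import torch
import torch.nn.functional as F

def make_scaled_sample(digit28: np.ndarray, scale: float, canvas: int = 112) -> np.ndarray:
    """Rescale a 28x28 MNIST digit and embed it in a fixed canvas (a sketch)."""
    # upscale to the canvas resolution with bicubic interpolation, then rescale by `scale`
    x = torch.as_tensor(digit28, dtype=torch.float32)[None, None]
    x = F.interpolate(x, size=(canvas, canvas), mode="bicubic", align_corners=False)
    size = max(1, round(canvas * scale))
    x = F.interpolate(x, size=(size, size), mode="bicubic", align_corners=False)[0, 0]
    # zero-pad small scales or center-crop large scales back to the canvas size
    if size <= canvas:
        out = torch.zeros(canvas, canvas)
        off = (canvas - size) // 2
        out[off:off + size, off:off + size] = x
    else:
        off = (size - canvas) // 2
        out = x[off:off + canvas, off:off + canvas]
    return out.numpy()
```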
The MNIST Large-Scale dataset consists only of images of handwritten digits, and existing results on it leave little room for improvement. To further validate and evaluate our proposed method's performance, we use a variant of Fashion MNIST, the FMNIST Large-Scale dataset. The raw Fashion MNIST dataset contains 70 K grayscale images (with integer pixel values between 0 and 255) associated with labels from 10 classes [26]. We generated the FMNIST Large-Scale dataset by processing Fashion MNIST in the same way as the MNIST Large-Scale dataset, as shown in Figure 6.

4.2. Implementation Details and Network Parameters

The deep-learning framework was PyTorch 1.8.1 (CUDA 11.0), and the workstation was equipped with 32 GB of memory and one Nvidia GeForce RTX 3090 graphics card (ASUS, Taiwan, China) with 24 GB of video memory. The model was trained on 50,000 canonical-scale images of the MNIST Large-Scale dataset for 20 epochs using the Adam optimizer [27], and 10,000 images of different scales were used for testing. The training batch size was 128. The learning rate started at 3 × 10⁻³ and decayed by a factor of 0.1 every 2 epochs until it reached 5 × 10⁻⁵. In this experiment, we used a standard CNN as the benchmark. The CNN's feature extractor comprised four Conv-BN [28]-ReLU [29]-Max-pooling blocks with 16, 16, 32 and 32 feature channels, respectively. After the feature extractor, the feature maps were sent to two fully connected (FC) layers for classification, followed by logsoftmax. The FC layers had 100 and 10 units, respectively, and the first FC layer used Dropout [30] with a probability of 15%.
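A PyTorch sketch of this benchmark CNN is given below; the 3 × 3 convolution kernels and the placement of Dropout after the first FC layer are assumptions not stated in the text.

```python
import torch.nn as nn

class BenchmarkCNN(nn.Module):
    """Baseline CNN: four Conv-BN-ReLU-Max-pooling blocks, two FC layers, logsoftmax."""

    def __init__(self, num_classes: int = 10):
        super().__init__()
        blocks, in_ch = [], 1
        for out_ch in (16, 16, 32, 32):                        # feature channels per block
            blocks += [nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
                       nn.BatchNorm2d(out_ch),
                       nn.ReLU(inplace=True),
                       nn.MaxPool2d(2)]
            in_ch = out_ch
        self.features = nn.Sequential(*blocks)
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(100),                                # infers the flattened size (32*7*7 for 112x112 inputs)
            nn.ReLU(inplace=True),
            nn.Dropout(p=0.15),                                # 15% dropout on the first FC layer
            nn.Linear(100, num_classes),
            nn.LogSoftmax(dim=1))                              # logsoftmax output

    def forward(self, x):
        return self.classifier(self.features(x))
```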

4.3. Classification Experiment

CNN, FovMax and FovAvg were trained on a single-scale-training dataset of scale 2 from the MNIST and FMNIST Large-Scale datasets with the same training settings and strategy. Figure 7 shows the loss curve during the training procedure on the MNIST Large-Scale dataset.
We use FovMax, FovAvg and the SA Net with nine channels for digit classification and the standard CNN as the benchmark to evaluate the classification performance for different testing scales on different datasets, as shown in Table 1. In this paper, bold data represent the best results of different methods under the same conditions.
We can see that the CNN has good classification performance only when the testing scale matches the training scale, and its accuracy decreases as the difference between the testing and training scales increases on both the MNIST and FMNIST Large-Scale datasets. The training-set scale is 2. When the testing scale is 2, the classification accuracy on the MNIST Large-Scale dataset is 99.38%; when the testing scale is 8, it drops to 12.42%, only slightly better than random guessing. The standard CNN does not learn scale-related features during training and adapts poorly to scale changes. FovAvg and FovMax greatly improve the network's resilience to scale variation: when the training and testing scales differ, they still classify well, but when the testing scale is large, their accuracy degrades severely. For example, with a training scale of 2.0, at a testing scale of 2⁻¹ the classification accuracies of FovMax and FovAvg are 98.51% and 98.75%, respectively; at a testing scale of 8.0 they are 77.75% and 88.44%, respectively. In contrast, the classification accuracy of the SA Net is 90.15%, which is 12.40 and 1.71 percentage points higher than FovMax and FovAvg, respectively. The SA Net overcomes the limitations of FovMax and FovAvg owing to its down-sampling method, which combines the dyadic DWT with interpolation.
Experiments on the two datasets show that the SA Net is more robust on the MNIST Large-Scale dataset, because handwritten digits are more regular than the images in the FMNIST Large-Scale dataset and their features are easier to extract.
Wavelet decomposition yields different results for different wavelet families. To select the optimal wavelet basis, we compare the classification results of the SA Net using the Haar (haar), Daubechies (db) and Biorthogonal (bior) families, and we can see that bior brings the greatest improvement in classification performance.

4.4. Scale-Estimation Experiment

4.4.1. Dataset

To explore the generalization ability of the SA Net to unknown-scale images over a larger scale range, we generated digits with scales in the range $[1/4, 4]$ and zero-padded them to a 224 × 224 resolution, following the production mechanism of the MNIST Large-Scale dataset.

4.4.2. Receptive Field

The receptive field of a unit in a CNN is the region of the original image that is mapped to that unit on the feature map. Both the convolution and pooling layers affect the size of the receptive field, which is defined as follows:
$RF_{l+1} = RF_l + (k - 1) \times S_l$ (17)
$S_l = \prod_{i=1}^{l} s_i$ (18)
where $RF_l$ and $RF_{l+1}$ represent the receptive fields of layer $l$ and layer $l+1$, respectively, $k$ represents the convolution-kernel size of layer $l+1$, $s_i$ represents the stride of layer $i$ and $S_l$ represents the product of the strides of the first $l$ layers. The stride of the current layer does not affect the receptive field of the current layer. To ensure that all patterns in the image can be detected, the receptive field should match the size of the original image [31]. However, the actual receptive field is often smaller than the theoretical receptive field [32]. For an image of 224 × 224 resolution, the receptive field of the network described in Section 4.2 does not meet the requirement above, so we added two Conv-BN-ReLU-Max-pooling blocks before the FC layers of the SA Net to increase the network depth and enlarge the receptive field. The parameters of the additional blocks are:
  • the numbers of feature channels are 32 and 64, respectively.
  • the convolution-kernel size is three for both blocks.
  • the other parameters follow those of the preceding convolution blocks.
The receptive field of the final layer of the network, obtained through Equations (17) and (18), is $RF_{final} = 226$, which meets the requirement of detecting all objects.
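The recursion in Equations (17) and (18) can be computed in a few lines; the layer list in the example below (3 × 3 convolutions with stride 1 alternating with 2 × 2 max-pooling with stride 2) is illustrative, and the resulting value depends on the kernel sizes actually used in each block.

```python
def receptive_field(layers):
    """Receptive field of the final layer via Equations (17) and (18).

    `layers` is a list of (kernel_size, stride) pairs in forward order.
    """
    rf, jump = 1, 1                 # RF_0 = 1; `jump` is the running stride product S_l
    for k, s in layers:
        rf += (k - 1) * jump        # RF_{l+1} = RF_l + (k - 1) * S_l
        jump *= s                   # S_{l+1} = S_l * s_{l+1}
    return rf

# Example: six Conv-BN-ReLU-Max-pooling blocks with assumed 3x3 convolutions
layers = [(3, 1), (2, 2)] * 6
print(receptive_field(layers))      # the value depends on the assumed kernel sizes
```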

4.4.3. Experiment Results

According to Section 3.2.3, the SA Net is trained on the canonical-scale training set; to estimate the scales of testing samples with unknown scales, we use the testing dataset generated in Section 4.4.1.
When the number of channels $L$ is 9 and the common ratio is $\sqrt{2}$, we obtain the scale-estimation performance against different scales $\hat{s}$ for 10,000 testing samples, as shown in Table 2. It can be seen that the SA Net performs scale estimation well; for example, when $\hat{s}$ is 4, $\delta$ reaches 13.9% on the MNIST Large-Scale dataset. Moreover, the SA Net also works on the FMNIST Large-Scale dataset, whose grayscale images are closer to real-world images than those of MNIST [33].
Similar to the classification experiment, in order to select the optimal wavelet family, we comprehensively compare the scale-estimation results of the SA Net using Haar (haar), Daubechies (db) and Biorthogonal (bior). As in Section 4.3, bior is again the optimal wavelet family.
In order to verify the effectiveness of the SA Net for scale estimation, we visualize the multi-channel output response and analyze the mis-estimated samples. For a testing sample of class '5' and scale $2^{1/2}$, the 3D response of the SA Net is shown in Figure 8.
To evaluate the effect of scale estimation more intuitively, we plot slices of the three-dimensional output response on the MNIST Large-Scale dataset through the maximum response, as shown in Figure 9. Along the class dimension $\hat{c}$, the curve is not smooth and exhibits several sidelobes. Sidelobes interfere with the classification task: the coordinates of the sidelobes correspond to wrong classes, indicating that the correlation between adjacent classes is weak. However, the sidelobe amplitude is always smaller than that of the main lobe, and the coordinate of the maximum response in the class dimension always corresponds to the ground-truth class, confirming the high classification performance observed in Section 4.3. Along the scale dimension $\hat{s}$, the curve is relatively smooth, with a single peak and no sidelobe interference, which is conducive to scale estimation, so the true scale of the testing sample can always be obtained.
Compared with a single-channel network, the multi-channel network certainly increases the inference time. However, the SA Net takes only about 26 ms per image frame (using one GeForce RTX 3090 GPU), so it has the potential to achieve real-time scale estimation and classification.

4.4.4. Visualization of Scale Estimation

Rescaling the testing samples according to the estimated scales is equivalent to the inverse of the image-pyramid generation process in Section 3.2.2, and it corrects the testing samples to the canonical scale. Several restored examples from the MNIST Large-Scale dataset are chosen for visualization, as shown in Figure 10.
It can be seen that, for testing images of any scale, the SA Net can correct the testing images to the canonical scale using the estimated scale value. In practical applications, the images collected by an optical imaging sensor can be processed more easily after this correction.

5. Conclusions

Existing methods can only classify scaled images and cannot estimate the scale of unknown-scale images. We propose a Scale-Aware Network with scale equivariance, which can estimate the scale while classifying without adding an additional scale-related learning task. Experimental results on different datasets and wavelet families verify the practicability of our proposed method for scale estimation and identify the optimal wavelet basis for our network. In addition, the classification experiments on the MNIST and FMNIST Large-Scale datasets show that the classification performance of the SA Net on unknown-scale samples is better than that of other existing methods, owing to the down-sampling by the wavelet transform combined with interpolation, which avoids spectrum aliasing.
The proposed method can be used in remote sensing, optical image recognition and medical diagnosis in cytohistology.

Author Contributions

Conceptualization, M.N. and J.T.; methodology and investigation, M.N., P.Z. and H.Z.; visualization, Z.Z.; formal analysis, M.N. and P.Z.; supervision, H.Z. and J.T.; funding acquisition, H.Z. and H.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China under grant numbers 42176187 and 41906162.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The MNIST Large-Scale dataset that supports the findings of this study is available in the public domain: https://zenodo.org/record/3820247 (accessed on 23 February 2022). The FMNIST Large-Scale dataset that supports the findings of this study is available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet Classification with Deep Convolutional Neural Networks. In Proceedings of the 26th Annual Conference on Neural Information Processing Systems (NIPS), Lake Tahoe, NV, USA, 3–6 December 2012; pp. 1106–1114.
  2. Lecun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324.
  3. Cohen, T.; Welling, M. Group Equivariant Convolutional Networks. In Proceedings of the 33rd International Conference on Machine Learning (ICML), New York, NY, USA, 19–24 June 2016; pp. 2990–2999.
  4. Marcos, D.; Volpi, M.; Komodakis, N.; Tuia, D. Rotation Equivariant Vector Field Networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 5058–5067.
  5. Lang, M. The Mechanism of Scale-Invariance. arXiv 2021, arXiv:2103.00620v1.
  6. Ma, T.; Gupta, A.; Sabuncu, M.R. Volumetric Landmark Detection with a Multi-Scale Shift Equivariant Neural Network. In Proceedings of the 2020 IEEE 17th International Symposium on Biomedical Imaging (ISBI), Iowa City, IA, USA, 3–7 April 2020; pp. 981–985.
  7. Azulay, A.; Weiss, Y. Why do deep convolutional networks generalize so poorly to small image transformations? J. Mach. Learn. Res. 2019, 20, 1–25.
  8. Sosnovik, I.; Moskalev, A.; Smeulders, A.W.M. Scale Equivariance Improves Siamese Tracking. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2021; pp. 2764–2773.
  9. Wang, Q.; Zheng, Y.; Yang, G.; Jin, W.; Yin, Y. Multi-Scale Rotation-Invariant Convolutional Neural Networks for Lung Texture Classification. IEEE J. Biomed. Health Inform. 2018, 22, 184–195.
  10. Hanieh, N.; Leili, G.; Shohreh, K. Scale Equivariant CNNs with Scale Steerable Filters. In Proceedings of the International Conference on Machine Vision and Image Processing (MVIP), Tehran, Iran, 18–20 February 2020; pp. 1–5.
  11. Jaderberg, M.; Simonyan, K.; Zisserman, A.; Kavukcuoglu, K. Spatial Transformer Networks. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015; pp. 2017–2025.
  12. Lindeberg, T. Scale-Covariant and Scale-Invariant Gaussian Derivative Networks. In Proceedings of the 8th International Conference on Scale Space and Variational Methods in Computer Vision (SSVM), Virtual Event, 16–20 May 2021; pp. 3–14.
  13. Kanazawa, A.; Sharma, A.; Jacobs, D. Locally Scale-Invariant Convolutional Neural Networks. arXiv 2014, arXiv:1412.5104.
  14. Laptev, D.; Savinov, N.; Buhmann, J.M.; Pollefeys, M. TI-POOLING: Transformation-Invariant Pooling for Feature Learning in Convolutional Neural Networks. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 289–297.
  15. Esteves, C.; Allen-Blanchette, C.; Zhou, X.; Daniilidis, K. Polar Transformer Networks. In Proceedings of the 6th International Conference on Learning Representations (ICLR), Vancouver, BC, Canada, 3 May 2018.
  16. Shorten, C.; Khoshgoftaar, T.M. A survey on Image Data Augmentation for Deep Learning. J. Big Data 2019, 6, 60.
  17. Marcos, D.; Kellenberger, B.; Lobry, S.; Tuia, D. Scale equivariance in CNNs with vector fields. arXiv 2018, arXiv:1807.11783.
  18. Cohen, T.S.; Welling, M. Steerable CNNs. In Proceedings of the 5th International Conference on Learning Representations (ICLR), Toulon, France, 24–26 April 2017.
  19. Sosnovik, I.; Szmaja, M.; Smeulders, A.W.M. Scale-Equivariant Steerable Networks. In Proceedings of the 8th International Conference on Learning Representations (ICLR), Addis Ababa, Ethiopia, 26–30 April 2020.
  20. Xu, Y.; Xiao, T.; Zhang, J.; Yang, K.; Zhang, Z. Scale-Invariant Convolutional Neural Networks. arXiv 2014, arXiv:1411.6369.
  21. Ghosh, G.; Gupta, A.K. Scale Steerable Filters for Locally Scale-Invariant Convolutional Neural Networks. arXiv 2019, arXiv:1906.03861.
  22. Jansson, Y.; Lindeberg, T. Exploring the ability of CNNs to generalise to previously unseen scales over wide scale ranges. In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021; pp. 1181–1188.
  23. Mallat, S.G. A theory for multiresolution signal decomposition: The wavelet representation. IEEE Trans. Pattern Anal. Mach. Intell. 1989, 11, 674–693.
  24. Ben Chaabane, C.; Mellouli, D.; Hamdani, T.M.; Alimi, A.M.; Abraham, A. Wavelet Convolutional Neural Networks for Handwritten Digits Recognition. In Proceedings of the 2017 International Conference on Hybrid Intelligent Systems (HIS), Delhi, India, 14–16 December 2017; pp. 305–310.
  25. Burrus, C.S. Introduction to Wavelets and Wavelet Transforms: A Primer, 1st ed.; Prentice-Hall: Englewood Cliffs, NJ, USA, 1997; pp. 34–36.
  26. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. In Proceedings of the 3rd International Conference on Learning Representations (ICLR), San Diego, CA, USA, 7–9 May 2015.
  27. Ioffe, S.; Szegedy, C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In Proceedings of the 32nd International Conference on Machine Learning (ICML), Lille, France, 6–11 July 2015; pp. 448–456.
  28. Xu, L.; Choy, C.-S.; Li, Y.-W. Deep sparse rectifier neural networks for speech denoising. In Proceedings of the IEEE International Workshop on Acoustic Signal Enhancement (IWAENC), Xi’an, China, 13–16 September 2016; pp. 1–5.
  29. Hinton, G.E.; Srivastava, N.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R. Improving neural networks by preventing co-adaptation of feature detectors. arXiv 2012, arXiv:1207.0580.
  30. Luo, W.; Li, Y.; Urtasun, R.; Zemel, R. Understanding the Effective Receptive Field in Deep Convolutional Neural Networks. In Proceedings of the 30th International Conference on Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016.
  31. Zhou, B.; Khosla, A.; Lapedriza, À.; Oliva, A.; Torralba, A. Object Detectors Emerge in Deep Scene CNNs. In Proceedings of the 3rd International Conference on Learning Representations (ICLR), San Diego, CA, USA, 7–9 May 2015.
  32. Xiao, H.; Rasul, K.; Vollgraf, R. Fashion-MNIST: A Novel Image Dataset for Benchmarking Machine Learning Algorithms. arXiv 2017, arXiv:1708.07747.
  33. Yang, Z.; Bai, Y.-M.; Sun, L.-D.; Huang, K.-X.; Liu, J.; Ruan, D.; Li, J.-L. SP-ILC: Concurrent Single-Pixel Imaging, Object Location, and Classification by Deep Learning. Photonics 2021, 8, 400.
Figure 1. Difference between equivariance and invariance. (a) equivariance and (b) invariance.
Figure 2. Influence of input deformation on CNN output.
Figure 3. Diagram of the network architecture of the SA Net.
Figure 4. First order decomposition of DWT.
Figure 5. Instances of MNIST Large-Scale dataset.
Figure 6. Images in the FMNIST Large Scale dataset.
Figure 7. The loss curve of training procedure.
Figure 8. Visualization of output response.
Figure 9. Slice figures of SA Net: (a) Class dimension and (b) Scale dimension.
Figure 10. Visualization results of scale correction. (a) unknown-scale testing samples. (b) samples corrected to canonical scale.
Table 1. Classification accuracy (%) comparison on different scales with different networks.

Dataset | Method | Wavelet basis | 2^{-1} | 2^{-1/2} | 1 | 2^{1/2} | 2 | 2^{3/2} | 4 | 2^{5/2} | 8
MNIST Large Scale | CNN | – | 20.86 | 38.67 | 68.35 | 96.60 | 99.38 | 94.98 | 45.83 | 21.84 | 12.42
MNIST Large Scale | FovMax | – | 98.51 | 99.26 | 99.31 | 99.32 | 99.32 | 99.29 | 99.30 | 97.40 | 77.75
MNIST Large Scale | FovAvg | – | 98.75 | 98.81 | 98.36 | 98.85 | 99.33 | 98.84 | 99.31 | 98.73 | 88.44
MNIST Large Scale | SA Net | bior | 98.81 | 99.26 | 99.30 | 99.38 | 99.41 | 99.37 | 99.31 | 99.12 | 90.15
MNIST Large Scale | SA Net | haar | 98.68 | 99.17 | 99.25 | 99.34 | 99.39 | 99.35 | 99.28 | 98.96 | 89.74
MNIST Large Scale | SA Net | db | 98.74 | 99.20 | 99.27 | 99.38 | 99.41 | 99.36 | 99.29 | 99.12 | 90.13
FMNIST Large Scale | CNN | – | 15.47 | 17.12 | 37.64 | 58.56 | 90.64 | 61.30 | 36.33 | 14.78 | 9.62
FMNIST Large Scale | SA Net | bior | 82.31 | 82.75 | 83.46 | 82.60 | 90.64 | 88.86 | 86.12 | 68.32 | 41.20
FMNIST Large Scale | SA Net | haar | 82.07 | 82.36 | 83.18 | 82.08 | 90.64 | 88.73 | 85.92 | 68.03 | 41.04
FMNIST Large Scale | SA Net | db | 82.24 | 82.79 | 83.39 | 82.23 | 90.64 | 88.81 | 86.03 | 68.16 | 41.13
Table 2. Performance for scale estimation (relative RMSE δ, %) against different scales on different datasets.

Dataset | Wavelet basis | 2^{-2} | 2^{-3/2} | 2^{-1} | 2^{-1/2} | 1 | 2^{1/2} | 2 | 2^{3/2} | 4
MNIST Large Scale | bior | 7.1 | 23.2 | 22.7 | 28.3 | 26.5 | 21.6 | 15.7 | 17.2 | 13.9
MNIST Large Scale | haar | 9.3 | 25.9 | 27.1 | 30.2 | 28.3 | 24.6 | 17.4 | 19.5 | 14.2
MNIST Large Scale | db | 7.2 | 21.8 | 23.9 | 29.5 | 26.5 | 21.3 | 15.9 | 17.4 | 14.1
FMNIST Large Scale | bior | 13.65 | 24.51 | 36.60 | 42.13 | 51.19 | 39.83 | 58.90 | 59.91 | 53.68
FMNIST Large Scale | haar | 14.10 | 23.16 | 36.81 | 42.31 | 51.32 | 40.32 | 60.80 | 62.30 | 55.73
FMNIST Large Scale | db | 13.74 | 24.59 | 36.60 | 42.07 | 51.23 | 40.02 | 59.18 | 60.08 | 53.71
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
