#### 2.2.1. VIC Theory

The HVS, which is the most intuitive tool for perceiving the world, has recently gained considerable attention in the field of image processing [36]. Scientists believe that the essence of visual recognition is the perception of the invariant structural characteristics of the observed objects. Cayley et al. first developed the theory of algebraic invariants and introduced it into the field of computational methods, which initially formed the theory of visual invariance. At the Iceland conference in 1991, the theory of VIC was formally proposed [37]. The main concepts of the theory are that: (1) images are composed of edge and texture details; and (2) the invariant is the essential description of the geometric structure of the object. The invariant is the most important geometric structure in the visual object, as it plays a key role in the recognition of the object.

The main reason why the theory of visual invariance is widely employed is that it resembles the visual essence of human beings. Human visual perception is based on invariant features, meaning that the human eye's perception and recognition of external objects do not change with rotation, scale variation, translation or brightness changes in the object, as shown in Figure 3 and Figure 4. This is the most significant characteristic of the HVS [38], which indicates that human eyes recognize and understand an object based on the characteristic information of the object itself, and this does not change with rotation or scaling. It is precisely because human vision captures the invariants of the same target that people can recognize objects.

Because of the influence of bionics on scientific progress, humans began to study visual cognition several decades ago. Visual object classification has been a long-term interest due to its important role in a variety of applications [39]. Because rolling bearings with the same failure mode may display similar image characteristics under variable operating conditions, we choose an image translation method and employ the VIC of the HVS to extract the invariant features of the same failure mode under different conditions.

#### 2.2.2. SURF Theory

Recognizing images under rotation, scaling and translation amounts to finding the stable points of the images. These points, such as corners, blobs, T-junctions and light spots in dark regions, do not disappear under rotation, scaling, translation or brightness changes. The scale-invariant feature transform (SIFT) is the corresponding computational method of VIC that can identify the invariant features and realize image matching. In 2006, Herbert Bay et al. improved on SIFT and presented a novel scale- and rotation-invariant detector and descriptor called speeded-up robust features (SURF). SURF approximates or even outperforms previously proposed schemes with respect to repeatability, distinctiveness and robustness, yet can be computed and compared much faster [40].

1. Theory of Scale Space

Scale space was first proposed by Iijima in 1962, and after popularization by Witkin and Koenderink, it gradually gained attention and became widely used in the field of computer vision. The basic idea of scale space is that a scale parameter is introduced into the pattern information model, and a multi-scale representation is obtained by continuously varying this parameter. The principal contours are extracted as eigenvectors to realize the detection of edges and corners and the extraction of features at different resolutions.
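As a minimal sketch of this idea, assuming the standard choice of a Gaussian smoothing kernel for the scale-space family (the function name and scale values below are illustrative, not from any particular implementation):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def gaussian_scale_space(image, sigmas):
    """Build a simple Gaussian scale space: one smoothed copy per scale.

    `sigmas` is an assumed list of scale parameters; each level is the
    convolution of the image with a Gaussian of that standard deviation.
    """
    return np.stack([gaussian_filter(image, sigma=s) for s in sigmas])

# Toy example: a 32x32 image with a single bright spot.
img = np.zeros((32, 32))
img[16, 16] = 1.0
space = gaussian_scale_space(img, sigmas=[1.2, 2.4, 4.8])
print(space.shape)  # (3, 32, 32)
```

Coarser scales progressively spread the bright spot out, which is exactly the behaviour the multi-scale representation relies on.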

2. Integral Image Generation

One of the main advantages of SURF is the integral image, which can be used to rapidly compute box-type convolution filters. The entry of an integral image ${I}_{{\displaystyle \sum}}\left(x\right)$ at a location $x={\left(x,y\right)}^{T}$ represents the sum of all pixels in the input image $I$ within the rectangular region formed by the origin and $x$ [36].

Once the integral image has been computed, it takes only three additions to calculate the sum of the intensities over any upright rectangular area, as shown in Figure 5 [41]. Hence, SURF uses box filtering instead of Gaussian filtering, which greatly improves the efficiency of the algorithm.
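The constant-time box sum can be sketched as follows; `integral_image` and `box_sum` are illustrative helper names, not part of any particular SURF implementation:

```python
import numpy as np

def integral_image(img):
    # Each entry holds the sum of all pixels above and to the left (inclusive).
    return img.cumsum(axis=0).cumsum(axis=1)

def box_sum(ii, r0, c0, r1, c1):
    """Sum of img[r0:r1+1, c0:c1+1] via three additions/subtractions
    on the integral image `ii`, regardless of the rectangle's size."""
    total = ii[r1, c1]
    if r0 > 0:
        total -= ii[r0 - 1, c1]
    if c0 > 0:
        total -= ii[r1, c0 - 1]
    if r0 > 0 and c0 > 0:
        total += ii[r0 - 1, c0 - 1]
    return total

img = np.arange(16, dtype=float).reshape(4, 4)
ii = integral_image(img)
print(box_sum(ii, 1, 1, 2, 2))  # equals img[1:3, 1:3].sum() = 30
```

Because the cost of `box_sum` never depends on the rectangle size, box filters of any scale are equally cheap, which is the property the scale pyramid below exploits.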

3. Interest Point Localization

SURF utilizes the local maxima of the determinant of the approximate Hessian matrix to locate the interest points: when the Hessian determinant attains a local maximum, the detected point is an interest point. At a point $x\left(x,y\right)$ in the original image, the Hessian matrix $H\left(x,\sigma \right)$ at $x$ with a scale of $\sigma $ is defined as follows:

$$H\left(x,\sigma \right)=\left[\begin{array}{cc}{L}_{xx}\left(x,\sigma \right)& {L}_{xy}\left(x,\sigma \right)\\ {L}_{xy}\left(x,\sigma \right)& {L}_{yy}\left(x,\sigma \right)\end{array}\right]$$

where ${L}_{xx}\left(x,\sigma \right)$ is the convolution of the Gaussian second-order derivative $\frac{{\partial}^{2}}{\partial {x}^{2}}g\left(\sigma \right)$ with the image $I$ at point $x$, and similarly for ${L}_{xy}\left(x,\sigma \right)$ and ${L}_{yy}\left(x,\sigma \right)$.

Simple box filters computed via the integral image are used to approximate the second-order Gaussian partial derivatives with less computational burden, as shown in Figure 6. Box filters can be evaluated quickly from the integral image, and the amount of calculation is independent of the template size, which improves the computational efficiency of SURF.

When we use a second-order differential Gaussian function with $\sigma =1.2$ for filtering and a template size of $9\times 9$ as the smallest scale space to detect the points, the determinant of the Hessian matrix is

$$\mathrm{det}\left(H\right)={L}_{xx}\left(x,\sigma \right){L}_{yy}\left(x,\sigma \right)-{L}_{xy}^{2}\left(x,\sigma \right)$$

After simplification, with the box-filter responses ${D}_{xx}$, ${D}_{yy}$ and ${D}_{xy}$ approximating the corresponding Gaussian derivatives, the determinant becomes

$$\mathrm{det}\left({H}_{approx}\right)={D}_{xx}{D}_{yy}-{\left(0.9{D}_{xy}\right)}^{2}$$

where the weight 0.9 is used to balance the Hessian determinant.
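A rough sketch of this blob response, using true Gaussian second derivatives in place of the box-filter approximations that SURF actually employs (so this is the quantity being approximated, not SURF's fast version):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def hessian_response(img, sigma=1.2, weight=0.9):
    """Determinant-of-Hessian response at one scale.

    Uses exact Gaussian derivative filters; real SURF replaces these
    with 9x9 (and larger) box filters evaluated on the integral image.
    """
    Lxx = gaussian_filter(img, sigma, order=(0, 2))  # 2nd derivative, columns
    Lyy = gaussian_filter(img, sigma, order=(2, 0))  # 2nd derivative, rows
    Lxy = gaussian_filter(img, sigma, order=(1, 1))  # mixed derivative
    return Lxx * Lyy - (weight * Lxy) ** 2

img = np.zeros((21, 21))
img[10, 10] = 1.0  # a blob-like bright spot
resp = hessian_response(img)
i, j = np.unravel_index(np.argmax(resp), resp.shape)
print(i, j)  # 10 10 -- the response peaks at the blob centre
```

At the blob centre both second derivatives are strongly negative and the mixed derivative vanishes, so the determinant is large and positive, which is why local maxima of this response mark blob-like interest points.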

To realize the scale invariance of the interest points, SURF applies box filters of different scales to the original image to obtain the Hessian matrix response over the structures of the scale pyramid, as shown in Figure 7.

As with the difference-of-Gaussians pyramid of SIFT, the resolution pyramid contains many layers referred to as octaves, and each octave contains several images of different scales. The image size remains unaltered, and the images in different octaves are obtained by changing the box filter size. In this way, SURF saves the time of down-sampling and improves the operating efficiency.

To determine the interest points, non-maximum suppression in a $3\times 3\times 3$ neighborhood, which is shown in Figure 8, is employed. Each pixel processed by the Hessian matrix is compared with the 26 points of its three-dimensional neighborhood to obtain the maxima or minima as the preliminary feature points.
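The 26-neighbour comparison can be sketched as a brute-force scan over a (scale, row, column) response volume; `nms_3x3x3` is an illustrative name for this suppression step:

```python
import numpy as np

def nms_3x3x3(response):
    """Keep only voxels that are strict maxima of their 3x3x3 neighbourhood
    (26 neighbours across scale and space)."""
    s, h, w = response.shape
    keep = []
    for k in range(1, s - 1):
        for i in range(1, h - 1):
            for j in range(1, w - 1):
                cube = response[k-1:k+2, i-1:i+2, j-1:j+2]
                centre = response[k, i, j]
                # Strict maximum: larger than all 26 neighbours.
                if centre == cube.max() and (cube == centre).sum() == 1:
                    keep.append((k, i, j))
    return keep

resp = np.zeros((3, 5, 5))
resp[1, 2, 2] = 1.0  # a single peak in the middle scale layer
print(nms_3x3x3(resp))  # [(1, 2, 2)]
```

A production detector would vectorise this scan and also keep strict minima, but the triple loop makes the 26-point comparison explicit.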

An extreme point detected in discrete space is not a real extreme point. Figure 9 shows the difference between the extreme point of a two-dimensional function in discrete and continuous space. SURF therefore utilizes linear interpolation to obtain accurate interest point locations.

4. Interest Point Description

To guarantee rotation invariance, the main direction of each interest point is required. We first calculate the Haar wavelet responses in the $x$ and $y$ directions within a circular neighborhood with a radius of $6\sigma $ around the interest point, where $\sigma $ is the scale at which the interest point was detected [40]. With the interest point as the center, the Haar wavelet responses of all points within a sliding orientation window of size 60° are summed to form a new vector. The longest such vector over all windows defines the orientation of the interest point, as shown in Figure 10.
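A sketch of the sliding-window orientation search, assuming the per-sample response angles and $\left({d}_{x},{d}_{y}\right)$ components inside the circular neighbourhood have already been computed (the window step of 5° is an assumption, not fixed by the method):

```python
import numpy as np

def dominant_orientation(angles, dx, dy, window=np.pi / 3):
    """Slide a 60-degree window over the response angles; sum the (dx, dy)
    responses inside each window and return the angle of the longest sum."""
    best_len, best_angle = -1.0, 0.0
    for start in np.linspace(0, 2 * np.pi, 72, endpoint=False):
        # Angular distance of each sample from the window's start edge.
        diff = (angles - start) % (2 * np.pi)
        inside = diff < window
        vx, vy = dx[inside].sum(), dy[inside].sum()
        length = np.hypot(vx, vy)
        if length > best_len:
            best_len, best_angle = length, np.arctan2(vy, vx)
    return best_angle

# Toy data: most responses point roughly along +x, one outlier along -x.
angles = np.array([0.0, 0.1, -0.1, np.pi])
dx = np.array([1.0, 1.0, 1.0, -0.2])
dy = np.array([0.0, 0.1, -0.1, 0.0])
print(dominant_orientation(angles, dx, dy))  # close to 0 (the +x direction)
```

The outlier never falls in the same 60° window as the three consistent responses, so the longest summed vector, and hence the assigned orientation, points along +x.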

To describe an interest point, the region around it is split up regularly into smaller $4\times 4$ square sub-regions. For each sub-region, we compute the Haar wavelet responses at $5\times 5$ regularly spaced sample points; Haar wavelet processing in each area yields the responses in the $x$ and $y$ directions, i.e., ${d}_{x}$ and ${d}_{y}$, as shown in Figure 11. Then, the wavelet responses ${d}_{x}$ and ${d}_{y}$ are summed over each sub-region and form the first set of entries in the feature vector, so the four-dimensional descriptor vector $\mathrm{v}=\left({\displaystyle \sum {d}_{x}},{\displaystyle \sum \left|{d}_{x}\right|},{\displaystyle \sum {d}_{y}},{\displaystyle \sum \left|{d}_{y}\right|}\right)$ is obtained. Concatenating this vector for all $4\times 4$ sub-regions results in a descriptor of length 64.
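The descriptor assembly can be sketched as follows, assuming 20×20 grids of Haar responses around the point (the 4×4 grid of 5×5-sample sub-regions described above); the final unit-length normalization is an extra step, included here because it is the usual way SURF descriptors gain contrast invariance:

```python
import numpy as np

def surf_descriptor(dx, dy):
    """Assemble the 64-dimensional descriptor from Haar responses.

    `dx` and `dy` are assumed 20x20 arrays of Haar wavelet responses;
    each 5x5 sub-region contributes (sum dx, sum |dx|, sum dy, sum |dy|).
    """
    v = []
    for i in range(4):
        for j in range(4):
            sx = dx[5*i:5*i+5, 5*j:5*j+5]
            sy = dy[5*i:5*i+5, 5*j:5*j+5]
            v += [sx.sum(), np.abs(sx).sum(), sy.sum(), np.abs(sy).sum()]
    v = np.array(v)
    return v / np.linalg.norm(v)  # normalise for contrast invariance

rng = np.random.default_rng(0)
dx, dy = rng.normal(size=(20, 20)), rng.normal(size=(20, 20))
desc = surf_descriptor(dx, dy)
print(desc.shape)  # (64,)
```

The 16 sub-regions times 4 entries each give exactly the 64 dimensions stated above; matching then reduces to comparing these vectors, e.g. by Euclidean distance.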