Mathematical Formula Image Screening Based on Feature Correlation Enhancement

Abstract: Scientific and technical documents and web pages contain both mathematical formula images and other images, and mathematical formula images can be classified as containing only mathematical formulas or as formulas interspersed with other elements, such as text and coordinate diagrams. To screen and collect images containing mathematical formulas for study or further research, a model for screening mathematical formula images based on feature correlation enhancement is proposed. First, the feature correlation enhancement (FCE) module was designed to improve the correlation degree of mathematical formula features and weaken other features. Then, the strip multi-scale pooling (SMP) module was designed to solve the problem of non-uniform image sizes while enhancing the focus on horizontal formula features. Finally, the loss function was improved to balance the dataset. The accuracy of the experiment was 89.50%, outperforming existing models. Using the model, a user can screen out images containing mathematical formulas; such screening helps to speed up the creation of a database of mathematical formula images.


Introduction
Mathematical language is an international, universal language that is not restricted by region or language. Its main form is the mathematical formula, and mathematical formulas are often the quintessence of technical documents. At present, there are a large number of mathematical formula images with research value on web pages and in scientific and technological documents. However, they are mixed with other images, and crawling a page's images directly obtains all of them; if only images containing mathematical formulas are needed, further screening is required.
The essence of mathematical formula image screening is to automatically classify a large number of images into two categories: images with mathematical formulas and images without them. Mathematical formula images can be further divided into two cases: those containing only mathematical formulas and those in which formulas are interspersed with text or coordinate diagrams. The key difficulty of mathematical formula image screening is correctly categorizing images in which formulas are interspersed with text, illustrations, and other elements as mathematical formula images.
Traditional image classification techniques [1] rely on the designer's prior knowledge and cognitive understanding of the classification task, resulting in poor experimental performance. In recent years, convolutional neural networks have performed prominently in image feature learning [2][3][4][5]. Convolutional neural networks extract features through autonomous learning, effectively circumventing the many drawbacks arising from complex feature extraction. LeCun et al. [6] proposed the LeNet-5 network, introducing convolutional neural networks into the field of image classification for the first time [7].

The main contributions of this paper are as follows:

1. For the influence of irrelevant features on the model, a feature correlation enhancement (FCE) module was designed. FCE enhances the internal correlation of mathematical formula features through the interaction of soft attention and self-attention to reduce the influence of other features on classification decisions.

2. Aiming at the problem of inconsistent image sizes and the horizontal writing characteristic of formulas, a strip multi-scale pooling (SMP) module was designed. SMP removes the size constraint by integrating spatial pyramid pooling (SPP) [20] into the network and then extracts rectangular horizontal features using a strip pooling module (SPM) [21] to increase attention to the horizontal structure.

3. To solve the problem of unbalanced datasets, this paper introduces regularization into the binary cross-entropy loss function [22]. By cascading regularizations, the improved loss function distributes weights equally across different image features, which avoids overfitting and speeds up model convergence.
The remainder of the paper is structured as follows: Section 2 presents the mathematical formula image screening method based on feature correlation enhancement; Section 3 analyzes and discusses the experimental results; and Section 4 concludes the paper.

Materials and Methods
When designing the network, ResNeSt-50 [13] was used as the basic framework to ensure that detailed information could be extracted. Compared with ResNet-50 [23], the essential improvement of ResNeSt-50 is the introduction of the split-attention module, which captures relationships across channels through a channel-based attention mechanism. ResNeSt has achieved excellent results in image classification, object detection, instance segmentation, and semantic segmentation tasks. The network structure of the mathematical formula image screening model (AttNeSt) based on feature correlation enhancement is shown in Figure 1. In the figure, "Self" represents the self-attention mechanism, and "Soft" denotes the soft attention mechanism. AttNeSt replaces the second convolution layer in the split-attention module with an FCE structure to strengthen the interaction and correlation of feature information in mathematical formulas and reduce the contribution of useless information. The SMP module is introduced after feature extraction; its main idea is to add a set of horizontal strip pooling operations after multi-scale pooling and before feature fusion to strengthen horizontal features.

Feature Correlation Enhancement (FCE) Module
The convolution operation processes features in a local receptive field but cannot correlate global information to establish long-distance dependencies. The self-attention mechanism [24] can capture global information and obtain a larger receptive field and contextual information. SKNet [18] uses convolution kernel attention so that the network adaptively adjusts the size of the receptive field according to the multiple scales of the input information and obtains features with weight information. The feature correlation enhancement (FCE) module weights the feature weight information into the self-attention to increase the degree of association of the self-attention with the mathematical formula features, thus emphasizing the relevance and global dependency of such features and reducing the influence of other features on the classification. The feature correlation enhancement structure is shown in Figure 2; it is divided into three parts: self-attention structure design, soft attention feature extraction, and feature correlation enhancement.

Self-Attention Structure Design
The self-attention mechanism aims to pay attention to certain details according to the target object. The core question is how to determine, based on the target, the parts that need attention, and then to analyze them further once found. This subsection describes the first stage of the self-attention process.
First, the feature space mapping functions f (x) = W f x, g(x) = W g x, and h(x) = W h x are used to transform G ∈ R C×N into three feature spaces f (G), g(G), and h(G), where N = H × W represents the number of pixels. Then, matrix multiplication and normalization are performed on f (G) and g(G) to obtain G ij , as shown in Equation (1), where a ij = f (G) T g(G) and G ij represents the degree of association between the ith dimension element in f (G) and the jth dimension element in g(G):
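The first-stage computation above can be sketched in a few lines of NumPy; the projection matrices stand in for the learned weights W f and W g, and all shapes are illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_map(G, W_f, W_g):
    """First stage of self-attention: project G into the feature spaces f(G)
    and g(G), then normalize the pairwise similarities a_ij = f(G)^T g(G)
    into the association map G_ij (Eq. (1))."""
    f, g = W_f @ G, W_g @ G       # f(G), g(G): shape (C', N)
    a = f.T @ g                   # (N, N) similarity between spatial positions
    return softmax(a, axis=1)     # each row sums to 1

rng = np.random.default_rng(0)
C, N, Cp = 8, 16, 4               # N = H * W flattened positions
G = rng.standard_normal((C, N))
G_ij = attention_map(G, rng.standard_normal((Cp, C)), rng.standard_normal((Cp, C)))
print(G_ij.shape)                 # (16, 16)
```

Row i of the resulting map gives the attention that position i pays to every other position, which is the "degree of association" described above.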

Soft Attention Feature Extraction
First, the feature map X ∈ R H×W×C is convolved with kernel sizes of 3, 5, and 7 to obtain three types of feature information from the different convolutional kernels, L̃, L̑, and L̄, respectively, which are summed to obtain L. Then, global average pooling (GAP) [25] is used to encode the convolutional layer L to get t, t ∈ R C : a compression calculation is performed over the H × W dimensions of L to obtain the cth element t c of t, as shown in Equation (2). Finally, the weight information of different spatial scales t = (t 1 , t 2 , . . . , t C ) is obtained. The softmax function is applied to t to obtain a, b, and p, which denote the soft attention channel weights of L̃, L̑, and L̄, respectively, as shown in Equation (3).

a c denotes the cth element of a; the same is true for b c and p c . A, B, and P ∈ R C×d , and A c ∈ R 1×d denotes the cth row of A; the same is true for B c and P c . d = max(C/r, K), where r denotes the shrinkage rate and K denotes the minimum value of d. Due to the different image sizes, GAP is used instead of the FC layer, with the advantage of reducing the number of parameters and receiving features of different scales.
The channel weights are multiplied by the feature information of the different convolution kernels to obtain the feature vector V, as shown in Equation (4). V fuses the information of multiple receptive fields and increases the weight of the mathematical formula features. V is transformed into a feature map G s after ReLU activation and 3 × 3 convolution.
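The soft attention feature extraction described above can be sketched as follows; the branch feature maps stand in for the conv-3/5/7 outputs, and all weight matrices are random placeholders for the learned parameters:

```python
import numpy as np

def soft_attention_fuse(branches, W_z, A, B, P):
    """SK-style soft attention over three branch feature maps of shape
    (C, H, W), standing in for the conv-3/5/7 outputs. W_z compresses the
    GAP descriptor to d dimensions (d = max(C/r, K)); A, B, P project back
    to C per-branch channel logits."""
    L = sum(branches)                       # element-wise sum of the branches
    t = L.mean(axis=(1, 2))                 # GAP over H x W -> t in R^C (Eq. (2))
    z = np.maximum(W_z @ t, 0.0)            # compressed descriptor in R^d
    logits = np.stack([A @ z, B @ z, P @ z])            # (3, C)
    e = np.exp(logits - logits.max(axis=0, keepdims=True))
    a, b, p = e / e.sum(axis=0, keepdims=True)          # softmax across branches (Eq. (3))
    V = (a[:, None, None] * branches[0]
         + b[:, None, None] * branches[1]
         + p[:, None, None] * branches[2])              # Eq. (4)
    return V, (a, b, p)

rng = np.random.default_rng(1)
C, H, W, d = 8, 6, 6, 4
branches = [rng.standard_normal((C, H, W)) for _ in range(3)]
V, (a, b, p) = soft_attention_fuse(branches, rng.standard_normal((d, C)),
                                   rng.standard_normal((C, d)),
                                   rng.standard_normal((C, d)),
                                   rng.standard_normal((C, d)))
print(V.shape)   # (8, 6, 6)
```

For every channel the three branch weights sum to 1, so V is a convex combination of the multi-scale branches, matching the description of Equation (4).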


Feature Correlation Enhancement
The correlation between f (G) and g(G) is established in the first stage of self-attention, where attention is given to elements with high similarity through element-wise interactions. Feature correlation enhancement is the second stage of self-attention. First, G ij is fused with the soft attention feature map G s , which carries weight information, to obtain G Z . G Z is the self-attention feature map with channel weights and convolution kernel weights. Then, the contribution of the convolution kernel and channel weights to the mathematical formula features is enhanced by calculating the degree of correlation between G Z and h(G). The specific calculation procedure is as follows.
The Concat [26] feature fusion mechanism splices two or more feature maps by channel or dimension. For the task of this paper, splicing on the channel dimension better expresses the channel weights of the soft attention features, which in turn enables more feature representations of the mathematical formula feature maps in the self-attention features and strengthens the contribution of such features. Concat feature fusion requires equal feature map width and height; in this paper, this is solved by upsampling to obtain feature maps G s and G ij with the same H and W.
Assuming that the channels of feature maps G s and G ij are G 1s , G 2s , G 3s , . . . , G cs and G 1ij , G 2ij , G 3ij , . . . , G cij , respectively, denote the number of channels of G s by C s and the number of channels of G ij by C ij , and let 1 c denote the 1 × 1 × c tensor. The output channel Z Concat is obtained after the Concat operation, as shown in Equation (5). At this point, the number of channels in the feature map becomes C s + C ij , and the feature map is denoted as G Z . Matrix multiplication is performed between G Z and h(G), followed by a 1 × 1 convolution, to obtain the correlation-enhanced self-attention feature map G FCE , as shown in Equation (6).
Electronics 2022, 11, 799
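Under assumed shapes (feature maps flattened to (channels, N)), the concatenation and correlation steps can be sketched as follows; how the correlation matrix is applied back to h(G) before the 1 × 1 convolution is an assumption of this sketch, since Equation (6) is not reproduced in the text:

```python
import numpy as np

def fce_second_stage(G_s, G_ij, h, W_out):
    """Sketch of the second FCE stage: channel-concatenate the soft-attention
    map G_s (C_s, N) and self-attention map G_ij (C_ij, N) into G_Z (Eq. (5)),
    correlate G_Z with h(G) by matrix multiplication, and apply a 1x1
    convolution (here a linear map W_out over channels) to get G_FCE."""
    G_Z = np.concatenate([G_s, G_ij], axis=0)    # (C_s + C_ij, N)
    corr = G_Z @ h.T                             # channel correlation with h(G)
    return W_out @ (corr @ h)                    # assumed re-weighting + 1x1-conv stand-in

rng = np.random.default_rng(2)
C_s, C_ij, C_h, C, N = 4, 4, 6, 8, 16
G_FCE = fce_second_stage(rng.standard_normal((C_s, N)),
                         rng.standard_normal((C_ij, N)),
                         rng.standard_normal((C_h, N)),
                         rng.standard_normal((C, C_s + C_ij)))
print(G_FCE.shape)   # (8, 16)
```

The point of the sketch is the data flow: concatenation preserves both sets of channel weights, and the correlation with h(G) is what lets those weights re-emphasize formula features.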

Strip Multi-Scale Pooling (SMP) Module
Spatial pyramid pooling [20] proposes a multi-scale pooling structure to unify the feature dimensions of different inputs, which is widely used in the field of image classification. Strip pooling [21] proposes a strategy that considers a long but narrow kernel, allowing the network to efficiently model long-range dependencies while focusing on horizontal or vertical features. Considering that the formula part of a mathematical formula image is written horizontally, this paper draws on SPP and SPM to design the SMP module. SMP can focus on horizontal information while unifying the feature dimensions, increasing the feature expression ability for mathematical formulas and thus improving the classification accuracy. The structure is shown in Figure 3 (to simplify the image, only two pooling scales are shown). In the process of unifying feature dimensions, the size of the pooling kernel varies with the size of the input image. The pooling kernel (Filter) and step (Stride) are given in Equations (7) and (8), where Filter is rounded up and Stride is rounded down.
An example of a three-level SPP is shown in Table 1. The input size in Table 1 indicates the size of the feature map input to the SPP structure, corresponding to m × m in Figure 3, and the output size indicates the desired output size, corresponding to l × l in Figure 3. With the three-level SPP structure, two feature maps of different sizes yield outputs of the same length. If the input size changes, the pooling kernel and step size also change to ensure that the output length stays the same. Taking an input size of 10 × 10 as an example, the feature map is subjected to three pooling operations with pooling kernels of 10, 4, and 2 and step sizes of 10, 3, and 2 to obtain outputs of 1 × 1, 3 × 3, and 5 × 5, respectively. Similarly, the 15 × 15 feature map uses different pooling kernels and step sizes to obtain 1 × 1, 3 × 3, and 5 × 5 outputs.
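Equations (7) and (8) and the Table 1 example can be verified with a short script, taking Filter = ⌈m/l⌉ and Stride = ⌊m/l⌋ together with the standard pooling output-size formula:

```python
import math

def spp_params(m, l):
    """Pooling kernel and stride for one SPP level: an m x m input map
    pooled down to an l x l output (Filter rounded up, Stride rounded
    down, as in Equations (7) and (8))."""
    return math.ceil(m / l), math.floor(m / l)

def spp_output(m, l):
    """Resulting output size for one level, using the usual pooling formula."""
    f, s = spp_params(m, l)
    return (m - f) // s + 1

# Reproduce the Table 1 example: a 10 x 10 map through a 3-level SPP (l = 1, 3, 5).
# Filters 10, 4, 2 with strides 10, 3, 2 give 1 x 1, 3 x 3 and 5 x 5 outputs;
# a 15 x 15 map yields the same three output sizes with different parameters.
for l in (1, 3, 5):
    print(l, spp_params(10, l), spp_output(10, l))
```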
After the multi-scale pooled feature maps are obtained, strip pooling is performed. The realization process of horizontal strip pooling is shown in the dotted box in Figure 3. First, the l × l-scale feature map is transformed into 1 × l-scale features after horizontal strip pooling. The implementation is to calculate the average pixel value on the horizontal feature map corresponding to the pooling kernel, as shown in Equation (9), where x ∈ R H×W . Then, a convolution operation with a kernel size of Filter is used to expand along the top and bottom, so that the expanded feature map has the same size as the original feature map. After a 1 × 1 convolution operation and Sigmoid activation, the feature map R is obtained by multiplying with the corresponding pixels of the original feature map. The feature map R is fused into a feature vector V R . Images of all sizes are unified into a fixed dimension in V R and input into the fully connected layer.
The horizontal strip pooling considers the horizontal range rather than the whole feature map, reinforcing the information about the position of the formulas written horizontally in the feature map. Since the weights of the target features (mathematical formula features) have been increased during feature extraction, SMP pays more attention to the horizontal formula features and less attention to the text, which is also a horizontal feature.
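A minimal NumPy sketch of the strip pooling step, following the 1 × l pooled shape stated above (the learned convolutions are omitted for brevity, so this is an illustration of the pooling and gating, not the full module):

```python
import numpy as np

def horizontal_strip_pool(x):
    """Strip pooling (Eq. (9)): each output element is the average of a
    strip of pixels, collapsing an l x l map to the 1 x l shape stated in
    the text (averaging over the vertical extent of each strip)."""
    return x.mean(axis=0, keepdims=True)          # (1, l)

def strip_attention(x):
    """Expand the pooled strip back to the original size, gate it with a
    sigmoid (the 1 x 1 convolution is omitted in this sketch), and
    re-weight the original feature map element-wise to obtain R."""
    expanded = np.repeat(horizontal_strip_pool(x), x.shape[0], axis=0)
    gate = 1.0 / (1.0 + np.exp(-expanded))        # sigmoid activation
    return x * gate

x = np.arange(9, dtype=float).reshape(3, 3)
print(horizontal_strip_pool(x))   # [[3. 4. 5.]]
print(strip_attention(x).shape)   # (3, 3)
```

Because the gate depends on strip-wise averages rather than the whole map, horizontally aligned content (such as a formula line) is reinforced uniformly along its extent.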

Loss Function
The images in the dataset in this paper are randomly crawled from the network and take various forms. To balance the dataset and improve the model's accuracy and generalization performance, this paper improves the binary cross-entropy loss function (BC). The binary cross-entropy loss function is shown in Equation (10), where N denotes the total number of samples, y i denotes the label of sample i, and p i denotes the probability of sample i being predicted as category 1. Regularization is then incorporated into BC [27,28]. The L 1 and L 2 regularizations [29,30] are shown in Equations (11) and (12), where β is the adjustment factor between the loss function and the regularization term, n is the number of samples in the training set, and δ is the weight parameter of the model. When only L 1 regularization is used, the same penalty is applied to all weight parameters. When only L 2 regularization is used, a large penalty is applied to parameters with larger weights and a small penalty to parameters with smaller weights. The improved binary cross-entropy loss function (IBC) is shown in Equation (13), where |δ| is the absolute value of the weight parameter, ∥δ∥ 1 is the 1-norm of the weight parameter δ, ∥δ∥ 2 2 is the square of its 2-norm, t is the adjustment factor between the loss function and the L 2 regularization, and p is the adjustment factor between the L 1 and L 2 regularizations, which degenerates to L 2 regularization if p = 0 and to L 1 regularization if p = 1. L 1 regularization randomly selects one of several correlated features and drops the others; L 2 regularization averages over features when the image features follow a Gaussian distribution.
Therefore, in IBC, L 1 regularization is introduced for feature selection, and L 2 regularization is then introduced to handle correlated image features; through the cascade of regularizations, the weights are divided equally among the various image features, retaining the useful ones.
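Since Equation (13) itself is not reproduced here, the sketch below shows one plausible reading of IBC: binary cross-entropy plus an elastic-net-style penalty in which p interpolates between the L 1 and L 2 terms, t additionally scales the L 2 term, and β scales the whole penalty. The exact combination is an assumption; the default parameter values follow the training settings reported later in the paper.

```python
import numpy as np

def bce(y, q, eps=1e-7):
    """Binary cross-entropy (Eq. (10)); q is the predicted probability of category 1."""
    q = np.clip(q, eps, 1 - eps)
    return -np.mean(y * np.log(q) + (1 - y) * np.log(1 - q))

def ibc(y, q, delta, beta=0.81, t=0.71, p=0.53):
    """Assumed form of the improved loss: BCE plus a cascaded L1/L2 penalty.
    p = 1 degenerates to pure L1 and p = 0 to pure L2, as described above."""
    l1 = np.sum(np.abs(delta))        # ||delta||_1
    l2 = np.sum(delta ** 2)           # ||delta||_2^2
    return bce(y, q) + beta * (p * l1 + t * (1 - p) * l2)

loss = ibc(np.array([1.0, 0.0]), np.array([0.9, 0.1]), np.array([0.5, -0.5]))
print(loss)
```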

Experimental Details on Image Classification
This section describes the experimental dataset and other experimental details.

Dataset and Data Augmentation
The public dataset im2latex-100k [31] collects a large number of mathematical expressions rendered in the real world, but its images contain only mathematical formulas (e.g., Figure 4a) and no other elements (e.g., the text, coordinate diagrams, and illustrations in Figure 4b-d). Using im2latex-100k as the dataset would lead to overfitting: the model would only be able to distinguish images of the type shown in Figure 4a. Since there is no image dataset containing images such as those in Figure 4b-d, a homemade dataset is used in this paper. Images are randomly crawled from web pages or scientific and technical documents and manually pre-classified into two categories: images containing mathematical formula elements (category 1) and images not containing mathematical formula elements (category 0). The experiment takes 6250 of these images, 3125 per class, of which 2188 images are used for the training set and 937 images for the validation set. Due to the small image sample in this paper [32][33][34], a data augmentation strategy [35,36] was used to avoid overfitting during training. Specifically, the training set was augmented to 21,880 images (10,940 each for class 0 and class 1) using random rotations of up to 40 degrees and horizontal flips, and the original size of the images was retained.
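The augmentation strategy amounts to drawing, for each training image, a rotation angle of at most 40 degrees and a flip decision. A sketch of the parameter sampling is shown below; applying the transforms themselves would be handled by the training framework (the paper uses Keras):

```python
import random

def sample_augmentation(seed=None):
    """Draw one augmentation matching the strategy described above: a random
    rotation of up to 40 degrees and a random horizontal flip, with the
    original image size retained. Only the parameters are sampled here."""
    rng = random.Random(seed)
    return {
        "rotation_deg": rng.uniform(-40.0, 40.0),
        "horizontal_flip": rng.random() < 0.5,
    }

print(sample_augmentation(0))
```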

Training Strategy
The experimental hardware environment is an AMD Ryzen 5 2600 Hexa-core processor, 24 GB of RAM, and an Nvidia GeForce GTX 1660 Ti GPU. The software environment is tensorflow-gpu1.14 and keras2.2.5 deep learning framework. The optimizer is Adam. The number of iterations is 200 epochs, the batch size is 64, the learning rate is 0.001, β1 = 0.9, β2 = 0.999, epsilon = 1 × 10 −8 . The loss function is IBC, where the parameters t = 0.71, p = 0.53, and β = 0.81. K = 32 in the shrinkage rate r.

Evaluation Indicators
To comprehensively evaluate the classification performance of the model, Precision, Recall, F1-score, the amount of image data (Support), and Accuracy were used as evaluation indicators. The training time (Time (step/s)) was used as a measure of time complexity.
The formulas are as follows, where TP (True Positive) is the correct prediction of category 1, FP (False Positive) is the prediction of category 0 as category 1, FN (False Negative) is the prediction of category 1 as category 0, and TN (True Negative) is the correct prediction of category 0.
Precision: The calculation is shown in Equation (14).
Recall: The calculation is shown in Equation (15).
F1-score: The calculation is shown in Equation (16). F1_1 denotes the harmonic mean of P1 and R1, and F1_0 denotes the harmonic mean of P0 and R0.
Support: The number of images categorized into a certain category.
Accuracy (ACC): The calculation is shown in Equation (17).
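The indicators can be computed directly from the confusion counts; the example numbers below are illustrative, not taken from the paper's tables:

```python
def binary_metrics(tp, fp, fn, tn):
    """Precision, recall, F1 and accuracy for one category from the confusion
    counts defined above (Equations (14)-(17)); the harmonic-mean F1 matches
    the F1_1 / F1_0 definitions."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f1, accuracy

# Illustrative counts (not from the paper's tables):
p1, r1, f1_1, acc = binary_metrics(tp=90, fp=10, fn=20, tn=80)
print(round(acc, 2))   # 0.85
```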

Experimental Results and Analysis
In this section, the experimental results of the image screening method (AttNeSt) are shown and compared with the experimental results of other methods.

Experimental Results of the AttNeSt Model
The AttNeSt model was trained following the experimental setup in Section 2.1.3. The experimental results are shown in Table 2. When the number of iterations is small, it can be seen from Support that the image imbalance is severe. As the number of iterations increased, all indicators improved. The experiment tried to continue training beyond 200 iterations, but the loss value tended to be constant and the ACC no longer improved; to reduce the training time and overhead, the maximum number of iterations was therefore set to 200 epochs. The ACC variation is shown in Figure 5. To verify whether IBC has an optimization effect on the model, the model was also trained using BC while keeping the other parameters consistent. When the model was trained using BC, the ACC was 87.73%, a decrease of 1.77% compared to using IBC. The variation of the loss values is shown in Figure 6. IBC converges faster and has lower loss values than BC, demonstrating that IBC has an optimization effect on the model.

Effect of SMP Structure on Results
Multiple pooling scales were set to explore the effect of the SMP structure on classification performance; the experimental results are shown in Table 3. As can be seen from Table 3, the SMP structure helps the model learn more horizontal information while unifying the feature dimensions. SMP_1, SMP_2, and SMP_3 improved in all evaluation indicators compared to Cropping and Warping. SMP_3 has the highest ACC of 89.50% and the best results, as the number of images in both categories tends to be balanced, indicating that the added 9 × 9 pooling kernel is better adapted to large-size images. SMP_4 and SMP_5 add 11 × 11 and 13 × 13 pooling kernels, respectively, with much lower ACC and F1 values and extremely unbalanced Support data. The likely reason is that there is an upper limit to the image size: continuously increasing the pooling kernel scale produces receptive fields that are too large, which acquire useless information and greatly increase the computational effort, thus affecting the results.
To explore the effect of strip pooling on classification performance, the strip pooling was removed from SMP_3, and the SPP structure was retained. The experimental results are shown in Table 4. There is no structure of strip pooling in SPP; instead, the features are directly dimensionally unified after multi-scale pooling. Some horizontal information may be lost in this process, and the ACC dropped by 0.69% although the training time is shorter. SMP_3 adds horizontal strip pooling before unifying the feature dimensions, which enables the model to focus more on horizontal features, and improve the ACC.

Effect of FCE Structure on Results
To verify the effect of the FCE module on the experiment, the FCE module was modified. SA ablates the soft-attention feature extraction stage in the FCE module and retains the self-attention mechanism; Soft ablates the self-attention mechanism and retains the soft-attention feature extraction. ResNeSt-50 is the original ResNeSt-50 network model. The experimental results are shown in Table 5. As can be seen from the table, AttNeSt, which adds the FCE module to ResNeSt-50, improves ACC by 7.18%; the F1 value also improves greatly, and the number of images in each category (Support) tends to balance. The ACC of SA and Soft is improved by 3.62% and 4.17% over ResNeSt-50, respectively, and all other metrics are also optimized. The ACC of AttNeSt is improved by 3.56% and 3.01% compared to SA and Soft, respectively. The experimental results show that adding soft-attention feature extraction or the self-attention mechanism alone can improve the ACC of the model; the FCE module integrates them, so that the soft attention feature weights emphasize the useful features in self-attention, focusing on similar features while ignoring the influence of useless features as much as possible, and the ACC is further improved.

AttNeSt Compared with Other Algorithms
To verify the superiority of the AttNeSt model and the added modules, this section compares the experimental results with other models.
The purpose of the comparison experiments is to verify the performance of AttNeSt in the mathematical formula image screening task. The dataset and data augmentation are consistent with those in Section 3.1.1. This section compares the results of this paper's algorithm with those of other algorithms: AlexNet [37], Inception-v3 [38], ResNet-50 [23], DenseNet-201 [39], and DSK-Net [40]. All of the above network models add the SMP module before the fully connected layer so that they are not constrained by the image size, and a softmax classifier is added after the fully connected layer to output the probability that an image belongs to each class, to suit our image classification task. The experiments used a learning rate of 0.001, the Adam optimizer, a batch size of 64, and the IBC loss function. The experimental results are shown in Table 6, which shows that the ACC of AttNeSt is 10.97%, 6.78%, 6.81%, 4.93%, and 5.96% higher than that of AlexNet, Inception-v3, ResNet-50, DenseNet-201, and DSK-Net, respectively. AttNeSt thus significantly improves ACC and balances the Support, while the training time does not increase significantly, remaining within a manageable range.
The loss variation of each algorithm is shown in Figure 7. It can be seen from the figure that the AttNeSt has good convergence. The above experimental results show that the classification effect of the algorithm in this paper is better than those of other algorithms.

Mathematical Formula Image Screening Using AttNeSt Model
To verify the validity of the AttNeSt model, this section uses the trained AttNeSt network model to screen the images containing mathematical formulas.

Prediction for a Single Image
Twelve typical original sample images crawled from the network are shown in Figure 8. The results of classifying these 12 images using the trained AttNeSt are shown in Table 7. Each predicted image is assigned to the class with the higher probability. From the data in Table 7, it can be seen that (a-d), (e), and (h) were assigned to category 1 and the rest were assigned to category 0. The effect was as expected.


Apply Other Models for Screening
All images on a web page or scientific document are crawled by keywords such as "math" or "formula", and each image is tagged with the category it belongs to. Of these, 2000 images are taken, 1000 of category 1 and 1000 of category 0. The images are classified using the models trained in Section 3.3, and the results are shown in Table 8, where Correct (1) indicates the number of images correctly assigned to category 1, Correct (0) indicates the number of images correctly assigned to category 0, Mistake (1) indicates the number of images belonging to category 0 but incorrectly assigned to category 1, and Mistake (0) indicates the number of images belonging to category 1 but incorrectly assigned to category 0. As can be seen from Table 8, AttNeSt correctly classified 921 images to category 1, correctly classified 967 images to category 0, misclassified 33 images from category 0 to category 1, and misclassified 79 images from category 1 to category 0. Compared with the results of the other models, AttNeSt screening is the best and can correctly classify most of the images containing mathematical formulas into category 1.
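The bookkeeping above can be checked directly from the Table 8 counts for AttNeSt:

```python
# Confusion bookkeeping for AttNeSt from Table 8 (category 1 = contains formulas):
tp, tn = 921, 967          # Correct (1), Correct (0)
fp, fn = 33, 79            # Mistake (1), Mistake (0)
accuracy = (tp + tn) / (tp + tn + fp + fn)
precision_1 = tp / (tp + fp)   # fraction of images screened into category 1 that belong there
recall_1 = tp / (tp + fn)      # fraction of true formula images recovered
print(accuracy)            # 0.944 over the 2000 crawled images
```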
Because the screened images differ from the training data, the actual accuracy varies when the trained models are used for classification. The error of AlexNet is larger because its results were already poor during training. Comparing the ACC obtained when screening images with each trained model against the ACC obtained during training (Table 6), the errors are within 5%.
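The four counts in Table 8 form a confusion matrix, from which the screening accuracy follows directly. A minimal sketch of this bookkeeping is given below; `screening_counts` is a hypothetical helper name, and the key names mirror the Correct/Mistake terminology of Table 8.

```python
def screening_counts(y_true, y_pred):
    # Correct (1): true 1 predicted 1; Correct (0): true 0 predicted 0;
    # Mistake (1): true 0 predicted 1; Mistake (0): true 1 predicted 0.
    counts = {"correct_1": 0, "correct_0": 0, "mistake_1": 0, "mistake_0": 0}
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            counts["correct_1"] += 1
        elif t == 0 and p == 0:
            counts["correct_0"] += 1
        elif t == 0 and p == 1:
            counts["mistake_1"] += 1
        else:
            counts["mistake_0"] += 1
    acc = (counts["correct_1"] + counts["correct_0"]) / len(y_true)
    return counts, acc
```

With AttNeSt's reported counts (921 + 967 correct out of 2000), this yields an accuracy of 0.944, consistent with the within-5% deviation from the training ACC noted above.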

Results of the im2latex-100k Dataset Screening
Inspection of the im2latex-100k dataset shows that all of its images have transparent backgrounds. Such images cannot be processed by the AttNeSt network model, so they must first be converted into images with white backgrounds and black text, as shown in Figure 9. The dataset contains 103,537 images in total, of which 10,000 were taken, and another 10,000 images were added in this paper. The trained model was loaded to predict these 20,000 images. Of the category-1 images, 9991 were predicted as category 1 and 9 as category 0; of the category-0 images, 9994 were predicted as category 0 and 6 as category 1, which is satisfactory.
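Converting a transparent-background image to white background and black text amounts to alpha-compositing each pixel over opaque white. A pure-Python sketch of that per-pixel operation (the paper does not specify its conversion code; `flatten_to_white` is a hypothetical helper operating on RGBA tuples) is:

```python
def flatten_to_white(pixels):
    # pixels: list of rows of (r, g, b, a) tuples, channels in 0-255.
    # Alpha-composite each pixel over opaque white, so fully transparent
    # pixels become white and opaque formula strokes stay black.
    out = []
    for row in pixels:
        new_row = []
        for r, g, b, a in row:
            alpha = a / 255.0
            new_row.append(tuple(round(c * alpha + 255 * (1 - alpha))
                                 for c in (r, g, b)))
        out.append(new_row)
    return out
```

In practice, an imaging library such as Pillow can perform the same compositing on whole files (e.g., `Image.alpha_composite` over a white RGBA canvas).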

Prediction of Subcategories of Mathematical Formula Images
As can be seen from the example dataset in Section 3.1.1, mathematical formula images can be further subdivided into several subcategories. This section divides the mathematical formula images into three subcategories, as shown in Figure 10. The 1_X category represents images containing only mathematical formulas, the 1_Y category represents images containing both text and mathematical formulas, and the 1_Z category represents images containing text, coordinate diagrams, and formulas.
One hundred images of each subcategory were taken, together with another 300 other images, and classified using the trained AttNeSt network model. The aim was to verify whether the model classifies images with formulas interspersed among other elements well. The results are shown in Table 9; Correct and Mistake have the same meanings as in Section 3.4.2. The results show that all images in subcategory 1_X were correctly classified, two images in subcategory 1_Y were misclassified to category 0, and 12 images in subcategory 1_Z were misclassified to category 0. In addition, 10 of the 300 other images were misclassified to category 1. Overall, the model performs excellently on images containing only mathematical formulas, and well on images where formulas are interspersed with text or coordinate diagrams.

Conclusions
To screen images containing mathematical formulas in web pages or scientific documents, we designed a network model, AttNeSt, that can screen images containing mathematical formulas from among many kinds of images. First, the feature correlation enhancement (FCE) module was designed to improve the contribution of mathematical formula features in the self-attention feature maps. Then, the strip multi-scale pooling (SMP) module was designed to allow input images to retain their original sizes and to focus on horizontal formula features. Finally, regularization was incorporated into the binary cross-entropy loss function to balance the dataset. The experimental results show that the ACC of AttNeSt is 7.18% higher than that of ResNeSt, demonstrating the superior performance of AttNeSt compared with other methods. Good results were obtained using the trained AttNeSt network model to screen the blended images, as shown in Sections 3.4.2 and 3.4.4. For images where mathematical formulas are interspersed with text or illustrations, the model is able to screen most of these images correctly.
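One common way to balance a skewed dataset through the loss function is to weight the binary cross-entropy terms per class. The sketch below illustrates that idea only; the exact regularized form used in AttNeSt is not reproduced here, and `weighted_bce` with its `w_pos`/`w_neg` parameters is an illustrative assumption.

```python
import math

def weighted_bce(y_true, p_pred, w_pos=1.0, w_neg=1.0, eps=1e-7):
    # Class-weighted binary cross-entropy: raising the weight of the
    # minority class counteracts class imbalance in the training set.
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(w_pos * y * math.log(p) + w_neg * (1 - y) * math.log(1 - p))
    return total / len(y_true)
```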
Although the model in this paper can accurately screen images containing mathematical formulas in most cases, there are errors. For example, mathematical formulas are also present in Figure 8g, but they were classified into category 0. The reason for the classification error is that the mathematical formula part of the figure is too small in proportion to the rest of the image. In addition, embedded formulas, edge formulas not easily recognized, and formulas interspersed with text where features are difficult to distinguish may also cause classification errors. The follow-up work will include continued theorizing about new approaches to improve the experiment and achieve better screening results.
The trained AttNeSt network model can screen out images containing mathematical formulas from a large number of images, which helps to facilitate the creation of a database of mathematical formula images. In subsequent work, images from a large number of relevant documents and web pages will be crawled, and those containing mathematical formulas will be screened using the AttNeSt model. This will increase the number of available mathematical formula images, supporting the training of models for mathematical formula retrieval, extraction, and recognition, and thus making it easier for readers to retrieve the mathematical formulas they want.

Data Availability Statement:
The data presented in this study are openly available in the im2latex-100k dataset at https://zenodo.org/record/56198#.YfYK7epBy3A (accessed on 10 October 2020).