An End-to-End Formula Recognition Method Integrating an Attention Mechanism

Abstract: Formula recognition is widely used in intelligent document processing, where it can significantly shorten the time needed to input mathematical formulas; however, the accuracy of traditional methods is limited. To reduce the complexity of formula input, an end-to-end encoder-decoder framework with an attention mechanism is proposed that converts formulas in pictures into LaTeX sequences. The Vision Transformer (VIT) is employed as the encoder to convert the original input picture into a set of semantic vectors. Because mathematical formulas are inherently two-dimensional, positional embedding is introduced to capture the characters' relative positions and spatial characteristics accurately and to ensure that each character position is unique. The decoder adopts the attention-based Transformer, which translates the input vectors into the target LaTeX characters. The encoder and decoder are trained jointly with Cross-Entropy as the loss function, and the model is evaluated on the im2latex-100k dataset and CROHME 2014. Experiments show that on the im2latex-100k dataset, BLEU reaches 92.11, MED is 0.90, and Exact Match (EM) is 0.62. This paper's contribution is to introduce machine translation into formula recognition and to realize the end-to-end transformation from a formula's trajectory point sequence to a LaTeX sequence, providing a new approach to formula recognition based on deep learning.


Introduction
In scientific and technical documents, formulas are essential for expressing relationships between variables concisely. However, formula input is complicated and error-prone because a formula is essentially a tree structure: a recognizer must identify not only the characters of the formula but also the relationships between them. Traditional methods cannot handle variable substitution and lack good generalization, which has prompted researchers to investigate vision-based automatic formula recognition. Traditional research, mainly built on a two-step process (character segmentation and relationship description), focused on recognizing superscript and subscript relations, special symbols, and fractions. For example, the INFTY system [1] aims to convert printed scientific formulas into LaTeX characters [2]. However, some disadvantages remain unsolved, such as the inability to generate complicated structures and the accumulation of errors.
With the development of deep learning, the original methods have gradually been replaced by end-to-end frameworks [3][4][5][6][7]. In recent years, optical character recognition (OCR) based on deep learning [8][9][10][11] has developed rapidly. However, employing existing OCR technology for scientific formula recognition is challenging because the two-dimensional structure of scientific formulas is very complicated: it is necessary to identify not only the characters accurately but especially the logical relationships among them. In other words, the formula recognition task can be regarded as transforming an image into mark-up. In Ref. [12], the authors constructed WYGIWYS to transform an image into a mark-up language. Karpathy et al. [13] proposed an image description framework that learns a semantic encoding of the image; the encoded semantic vector is then input to the decoder to generate each character.
This paper proposes an attention-based end-to-end encoder-decoder framework to realize the transformation of a formula from image to mark-up. For an input image, a feature map set is first extracted by the feature extractor and encoded into the context semantic vector C [14][15][16]. The proposed model combines the two steps into one end-to-end structure. Since formula recognition outputs a sequence of marks, this paper uses BLEU, Minimum Edit Distance (MED), and Exact Match (EM) as evaluation indicators.
The training and validation of the model were carried out on the im2latex-100k dataset.
The goals of this paper are as follows: first, the idea of machine translation is introduced into formula recognition to explore a new approach to formula OCR; second, a YOLO model is used to detect multi-line formulas and separate single-line from multi-line formulas to improve the model's accuracy. The specific contributions are as follows: 1. The formula's trajectory points are regarded as a particular language that is translated into a LaTeX sequence. 2. A preprocessing method for multi-line formulas is proposed: YOLOv4 detects the type of multi-line formula, the formula is segmented into lines that are recognized separately, and the results are then combined to increase recognition accuracy.
Following the introduction, Section 2 reviews work in the field of formula recognition; Section 3 introduces the models used in this paper; Section 4 describes the experimental environment, experimental methods, and evaluation indicators; Section 5 discusses the results of this method and presents the authors' opinions; Section 6 summarizes the methods proposed in this paper; the back matter states the availability of the data used in this article.

Related Work
Recovering mathematical formulas from images has always been considered a challenging task. The first step of formula recognition is to identify the characters in the picture accurately. According to each character's type, location, and size, the formula structure is then analyzed, and the formula is finally converted into a LaTeX sequence. Traditional methods usually include three independent stages: character segmentation, character recognition, and structure generation. With the development of artificial intelligence, people have become aware of the great application potential of deep learning in various fields, and researchers have begun to apply deep learning to formula recognition.

Traditional Methods
Okamoto et al. [17] proposed a three-stage formula recognition method, which uses horizontal and vertical projection to locate single characters and then uses regular expressions to recognize special symbols and numbers. In the structural analysis stage, logical structure analysis recognizes superscript, subscript, and radical expressions. This method works well on simple structural formulas but poorly on complex structures, such as matrices and nested structures. Berman et al. [18] proposed a method based on image-connected regions. Álvaro et al. [19] compared four types of formula symbol recognizers and found that the classification errors mainly involve overlines, fractions, and minus signs. In the work of Zanibbi et al. [20], a method based on a baseline structure tree was proposed to establish an operator tree describing the structure of a scientific formula. Lee et al. [21] proposed a method to identify formula areas in images so that formula and ordinary text areas can be processed separately. Twaakyondo et al. [22] proposed a method that divides the formula into several sub-formulas and then merges them into an overall tree structure. Suzuki et al. [23] proposed locating the formula characters and using a minimum-cost spanning tree algorithm to obtain the formula structure. This work made commercial formula recognition a reality and is also the core principle of the formula recognition software INFTY Reader.

Neural Methods for Formula Recognition
With the development of artificial intelligence, researchers began to apply deep learning to formula recognition. Gao et al. [24] proposed a deep neural network based on PDF character information combined with visual feature training to recognize familiar characters and formula areas in documents and process them separately. In Ref. [25], a seq2seq encoder-decoder framework is implemented to realize formula recognition.
The end-to-end formula recognition method based on deep learning combines the encoder and decoder into one step. Deng et al. [26] proposed an encoder-decoder framework with coarse attention to realize end-to-end image-to-markup generation. The authors use a convolutional neural network to extract feature information from the formula image and apply a coarse-to-fine scaling attention mechanism to each extracted feature vector in the decoder. Zhang et al. [27] proposed a gated recurrent unit (GRU) based encoder-decoder framework to realize handwritten formula recognition. On top of the GRU, the authors add an attention mechanism so that the output of the encoder is no longer a fixed-length context vector but is dynamically computed at different decoding times. In Ref. [28], the authors proposed the TAP model, which uses stroke information as input and a GRU with an attention mechanism as the decoder to generate the LaTeX character sequence. The literature [29] replaced the CNN with DenseNet [30] and enhanced the attention using a joint attention mechanism. In Ref. [31], Zhang et al. first enlarged the input image to twice its original size and then applied double attention to improve the model performance.
Peng et al. [32] proposed a large-scale pre-training model named MathBERT based on BERT, which is jointly trained on mathematical formulas and their corresponding contexts to improve accuracy. In addition, to further capture the semantic-level structural features of formulas, a new pre-training task is designed to predict masked formula substructures extracted from the Operator Tree (OPT), the semantic structural representation of formulas.
Wu et al. [33] proposed a graph-to-graph (G2G) codec framework and tested it on handwritten mathematical formula datasets. In that paper, the formula's embedding and LaTeX tags are used as input and output, respectively, with Graph Neural Networks (GNN) exploring the structural information and a novel sub-graph attention mechanism matching the primitives in the input and output graphs. The experimental results show the model reaches a new SOTA on the CROHME dataset.
Wang et al. [34] proposed a deep neural network model named MI2LS to convert pictures into LaTeX markers. The model consists of an encoder and a decoder. In the encoding stage, a convolutional neural network extracts formula features to generate a feature map, which is then fed into an encoding network composed of LSTMs to generate the semantic vector C. In the decoding phase, a bi-directional LSTM decodes the semantic vector C in turn to generate the LaTeX tags.

Methods
The model inference process can be regarded as a mapping from a picture to a character sequence; Figure 1 shows the recognition process. The mapping is expressed as f(χ) → Υ, where the model's input χ is a picture in H × W × C format. The pictures of the training data do not need to be collected separately: only the LaTeX sequences need to be obtained, and the formula pictures can then be rendered dynamically using a LaTeX library such as KaTeX. The output of the model is the LaTeX character sequence Υ = {y_1, ..., y_T}, where y_i denotes the i-th decoded character and T denotes the total length of the output LaTeX character sequence. The proposed model consists of an encoder and a decoder. The encoder adopts the Vision Transformer (VIT) model, which encodes the formula's point trajectory sequence into an abstract semantic vector. VIT divides the original input picture into a series of 16 × 16 image blocks [35] and adds positional embedding, because the relative position between different blocks significantly impacts the results. The decoder sequentially decodes the semantic vector and outputs the LaTeX sequence. The initial input of the decoder is the special mark <START>; each decoding time step t receives the context vector and the decoded output at time t − 1 as inputs to obtain the decoded character, until the decoder outputs the end character <EOS>. The overall architecture of the model proposed in this paper is shown in Figure 2, and the encoder architecture is shown in Figure 3.
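The autoregressive decoding loop described above (start from <START>, feed the previous output back in, stop at <EOS>) can be sketched as follows. This is a minimal illustration, not the paper's implementation; `step_fn` is a hypothetical stand-in for the trained decoder that returns the next token id given the prefix generated so far.

```python
def greedy_decode(step_fn, start_id, eos_id, max_len=50):
    """Greedy autoregressive decoding: feed the previously decoded
    tokens back into the decoder until <EOS> or the length limit."""
    seq = [start_id]
    for _ in range(max_len):
        nxt = step_fn(seq)        # stand-in for the trained decoder
        if nxt == eos_id:
            break
        seq.append(nxt)
    return seq[1:]                # drop the <START> token

# toy step function: emits token ids 1, 2, 3 and then <EOS> (id 0)
toy = lambda prefix: {1: 1, 2: 2, 3: 3}.get(len(prefix), 0)
print(greedy_decode(toy, start_id=9, eos_id=0))  # → [1, 2, 3]
```

In practice a beam search is often used instead of the greedy choice, but the feedback structure is the same.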

Encoder
The input of the standard Transformer is a sequence of token embeddings, whereas an image is two-dimensional structured data. To process the image with a Transformer, the two-dimensional data χ ∈ R^{H×W×C} must first be flattened into a 2D sequence of blocks χ_p ∈ R^{N×(P²·C)}, where P denotes the side length of each image block, C denotes the number of channels, (H, W) denote the height and width of the image, and N = H·W/P². The Transformer uses hidden vectors of fixed length D, so each P²·C-dimensional picture block is mapped to a D-dimensional vector with a mapping layer, i.e.,

z_0 = [x_class; x_p^1 ω; x_p^2 ω; ...; x_p^N ω] + ω_pos,

where x_class is the prepended category (class) vector, ω ∈ R^{(P²·C)×D} is the image block embedding, N denotes the number of image blocks, and x_p^i denotes the i-th flattened block. In the model proposed in this paper, the encoder scales the image to 224 × 224 × C with C = 3 and then divides it into (224/16)² = 196 image blocks of 16 × 16 pixels. Each block is expanded into a one-dimensional linear sequence, a category mark is attached, and a position vector is added before the sequence enters the Transformer encoder; the calculation process of the position vector is shown in Figure 4. The position vector ω_pos satisfies ω_pos ∈ R^{(N+1)×D}. Once the input vector is obtained, the data pass through L identical layers. Each layer includes two sub-layers: Multi-Head Self-Attention (MSA) and a Feed-Forward Network (FFN). Within each sub-layer, the data are normalized by Layer Normalization (LN), i.e.,

LN(x_i) = a · (x_i − µ)/√(σ² + ε) + b,

where µ and σ² denote the mean and variance, respectively, a and b are learnable parameters, and ε is a small constant for numerical stability. To improve the feature extraction ability, each layer adopts a residual structure. The attention computation is

Attention(Q, K, V) = softmax(QK^T/√d_k)V,

where Q, K, and V denote the learnable query, key, and value matrices, respectively. The output of the encoder layer includes not only the encoded vector but also the key matrix K and the value matrix V. In the prediction stage, the encoded picture block vector is used as the query matrix Q against the output K and V to obtain the prediction result.
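As a concrete sketch of the patch-flattening and embedding step, the following NumPy code splits a 224 × 224 × 3 image into 196 flattened 16 × 16 patches, projects them to D dimensions, prepends a class token, and adds a positional embedding. Random matrices stand in for the learned parameters here; this is an illustration of the shapes involved, not the paper's trained model.

```python
import numpy as np

def patchify(img, P=16):
    """Split an H x W x C image into N = (H/P)*(W/P) flattened patches
    of length P*P*C, as in ViT."""
    H, W, C = img.shape
    p = img.reshape(H // P, P, W // P, P, C)
    return p.transpose(0, 2, 1, 3, 4).reshape(-1, P * P * C)

rng = np.random.default_rng(0)
img = rng.standard_normal((224, 224, 3))
x = patchify(img)                            # (196, 768)

D = 512
W_e = rng.standard_normal((768, D)) * 0.02   # patch-embedding matrix (learned in practice)
cls = np.zeros((1, D))                       # class token (learned in practice)
tokens = np.concatenate([cls, x @ W_e])      # (197, D): class token + 196 patches
pos = rng.standard_normal((197, D)) * 0.02   # positional embedding, one per token
z0 = tokens + pos
print(z0.shape)  # → (197, 512)
```

The class token accounts for the "+1" in the positional embedding's shape R^{(N+1)×D}.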
Introducing an attention mechanism into the model can indeed improve the model's performance. Based on classical attention, several varieties of attention have been put forward, among which one of the most popular is the multi-head attention mechanism (MHA).
The classical self-attention mechanism can be regarded as a special case of multi-head attention with only one attention head.
Multi-head attention extracts features from multiple self-attention heads in parallel and combines all the heads' outputs into the final result. The multi-head attention mechanism does not introduce new parameters but divides the original Q, K, and V into several sub-parts. After splitting, each part is mapped to a different subspace of the high-dimensional space, and the weights are computed so that the model assigns different attention scores to different regions; the multi-head attention architecture is shown in Figure 5. The calculation process of multi-head attention is

MultiHead(Q, K, V) = Concat(head_1, ..., head_h)W^O, where head_i = Attention(QW_i^Q, KW_i^K, VW_i^V),

and W^O, W_i^Q, W_i^K, W_i^V are learnable linear transformation parameter matrices. Here, matmul denotes matrix multiplication and scale means multiplying by the scaling factor, as in the scaled dot-product attention of Figure 5.
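A NumPy sketch of the multi-head computation described above, with random matrices standing in for the learned projections W^Q, W^K, W^V, W^O (an illustration of the mechanics, not the paper's code):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, h):
    """Scaled dot-product attention with h heads: split Q, K, V along
    the feature dimension, attend per head, concatenate, project."""
    N, D = X.shape
    d = D // h                               # per-head dimension
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    heads = []
    for i in range(h):
        q, k, v = (M[:, i*d:(i+1)*d] for M in (Q, K, V))
        A = softmax(q @ k.T / np.sqrt(d))    # attention weights per head
        heads.append(A @ v)
    return np.concatenate(heads, axis=1) @ Wo

rng = np.random.default_rng(1)
X = rng.standard_normal((5, 64))             # 5 tokens, D = 64
Wq, Wk, Wv, Wo = (rng.standard_normal((64, 64)) * 0.1 for _ in range(4))
out = multi_head_attention(X, Wq, Wk, Wv, Wo, h=8)
print(out.shape)  # → (5, 64)
```

Note that the total parameter count is unchanged relative to single-head attention: the same D-dimensional projections are simply partitioned into h slices.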

Decoder
The decoder's function in this paper is to translate the encoded vector into a LaTeX character sequence. It is composed of several layers with the same structure; the decoder architecture is shown in Figure 6. Each layer consists of a masked multi-head self-attention layer, a fully connected layer, and a feedforward layer. The layers are connected by residual structures, with softmax as the activation function. During training, the input of the decoder is the ground truth. In the masked multi-head self-attention layer, the decoder additionally masks 15% of the characters at random so that the model can learn the internal structure of LaTeX. In the prediction stage, the encoded global semantic information serves as the decoder's input and, together with the previous step's prediction output, generates the output LaTeX character.
Let the word vector set of the ground truth be V = {v_1, ..., v_N}; when predicting y_i, the masked multi-head attention layer computes attention only over the preceding vectors {v_1, ..., v_{i−1}}. The input of the encoder-decoder attention layer includes two parts: the output of the masked multi-head self-attention layer, and the K and V matrices output by the encoder.
Thus, the input of the decoder includes not only the key and value vectors output by the encoder but also the ground truth word vectors as query vectors. The query, key, and value vectors are scaled, and dot-product attention is then computed to obtain the output vector.
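The masking described above can be illustrated with a small sketch (not the paper's code): a lower-triangular mask ensures that position t attends only to positions up to t, so the prediction of y_t never sees later ground-truth tokens. The additional random 15% training mask is omitted here for clarity.

```python
import numpy as np

def causal_mask(T):
    """Lower-triangular boolean mask: row t has True in columns 0..t,
    i.e. position t may attend only to itself and earlier positions."""
    return np.tril(np.ones((T, T), dtype=bool))

def masked_scores(scores, mask):
    # disallowed positions are set to -inf so softmax gives them weight 0
    return np.where(mask, scores, -np.inf)

m = causal_mask(4)
print(m.astype(int))
```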
The feedforward neural network layer includes two sub-layers whose input is the output of the attention layer. The output of the feedforward layer is passed through a linear transformation and a softmax function to produce the final output of the decoder.
The calculation process is FFN(x) = max(0, xW_1 + b_1)W_2 + b_2, where W_1, b_1, W_2, and b_2 are all learnable parameters.
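Under the standard Transformer formulation, this position-wise feed-forward computation can be sketched as follows, with random weights standing in for the learned parameters:

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    """Position-wise feed-forward layer: linear -> ReLU -> linear."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(2)
x = rng.standard_normal((3, 8))              # 3 positions, model dim 8
W1, b1 = rng.standard_normal((8, 32)), np.zeros(32)   # expand
W2, b2 = rng.standard_normal((32, 8)), np.zeros(8)    # project back
print(ffn(x, W1, b1, W2, b2).shape)  # → (3, 8)
```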

Loss Function
To verify the reliability of the model's predictions, this paper considers loss functions at two granularities: sequence level and character level. Since the sequence-level loss function cannot cover the initial random strategy, the proposed model adopts the character-level loss function.
The character-level loss is based on maximum likelihood estimation (MLE). A dataset contains the mapping from training pictures to LaTeX character sequences {χ_i, Υ_i}_{i=1}^{N}, where χ_i denotes a picture, Υ_i = {y_1, ..., y_T} denotes a LaTeX sequence, N indicates the size of the dataset, and T denotes the length of a LaTeX sequence.
The purpose of model training is to find parameters θ that maximize the probability of the correct characters, that is,

θ* = arg max_θ Σ_{i=1}^{N} Σ_{t=1}^{T} log p(y_t | y_{<t}, χ_i; θ),

which is equivalent to minimizing the cross-entropy loss function [36]:

L(θ) = − Σ_{i=1}^{N} Σ_{t=1}^{T} log p(y_t | y_{<t}, χ_i; θ).
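A minimal sketch of this character-level objective, assuming we already have the model's probability for the correct character at each decoding step (a stand-in for the softmax outputs of the decoder):

```python
import math

def sequence_nll(probs_per_step):
    """Character-level cross-entropy: negative log-likelihood of the
    ground-truth character at each decoding step, summed over the
    sequence.  probs_per_step[t] holds p(y_t | y_<t, x) for the true y_t."""
    return -sum(math.log(p) for p in probs_per_step)

loss = sequence_nll([0.9, 0.8, 0.95])
print(round(loss, 4))  # → 0.3798
```

Minimizing this quantity over the dataset is exactly the MLE objective above.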

Experiments

Preprocessed Data
Experimental tests were performed on the im2latex-100k dataset, which was built from various scientific and technical documents using regular expressions. A total of 103,356 actual scientific formulas were extracted from more than 60,000 documents. The dataset was divided into three parts: the training set (83,883 equations), the test set (9319 equations), and the validation set (10,354 equations). Each formula includes LaTeX code and a rendered PNG picture, forming label pairs that map LaTeX code to images. The length of each LaTeX sequence ranges from 38 to 997 characters. There is no clear boundary between the extracted LaTeX characters, so it is necessary to insert spaces between characters. Some formulas contain structural errors and cannot be rendered; these must be filtered out.
The data needs uniform preprocessing to improve prediction accuracy and reduce the number of model parameters. For example, a_{b} and a_{{b}} render identically, so removing a layer of braces decreases the computation of the model. Another situation concerns multi-character commands: \psi and its rendered character ψ denote the same symbol, but if the raw character sequence is used, the output must be predicted as '\', 'p', 's', 'i', which undoubtedly increases the computation of the model. In this paper, a normalization algorithm removes redundant characters and replaces multi-character symbols with single tokens. The original LaTeX can be converted into a symbol tree using a LaTeX parsing library; a classic one is KaTeX, written in JavaScript. After generating the symbol tree, the character-level LaTeX sequence is obtained by traversing the tree.
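As a toy illustration of the brace-reduction part of this normalization, the following regex pass repeatedly collapses doubled braces. This is a simplification for illustration only; the actual normalization operates on the KaTeX symbol tree rather than on raw strings.

```python
import re

def collapse_redundant_braces(s):
    """Toy normalization pass: repeatedly rewrite {{...}} groups with a
    single brace level, so inputs like a_{{b}} normalize to a_{b}.
    A real pass would walk the KaTeX parse tree instead."""
    prev = None
    while prev != s:
        prev = s
        s = re.sub(r"\{\{([^{}]*)\}\}", r"{\1}", s)
    return s

print(collapse_redundant_braces("a_{{b}}"))  # → a_{b}
```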
The algorithm execution flow is as follows. The input is an unprocessed LaTeX sequence (Before Normalization), and the output is the post-processed LaTeX sequence (After Normalization), as illustrated in Figure 7. In the algorithm, latexstring denotes the post-processed result, and tree denotes the formula tree produced by the KaTeX parsing library. tree[i].type denotes the node type in the formula, which is one of two kinds: structuralCharacter and ordinaryCharacter. A structuralCharacter node is processed recursively; an ordinaryCharacter node is appended to the variable latexstring as part of the final result. The algorithm process is shown in Algorithm 1.

In this paper, YOLOv4 is used as the target detection model. YOLOv4 first divides the original image into N × N grids and introduces anchors to locate the target more accurately.
The backbone used in YOLOv4 is CSPDarknet. To enhance the feature extraction ability, Spatial Pyramid Pooling (SPP) and a Path Aggregation Network (PANet) are added at the neck. The SPP contains pooling kernels at multiple scales of 1 × 1, 5 × 5, 9 × 9, and 13 × 13 to expand the receptive field of the network. PANet fuses the features of different layers to enrich the input information and obtain more accurate predictions. Suppose the input picture size is H × W, the downsampling factor is s, and the number of output categories is class_num. The dimension of the output vector is D = [H/s, W/s, 3 × (5 + class_num)], where 5 comprises the four coordinate values and one confidence score of the regression box. The YOLOv4 model recognition result is shown in Figure 8.
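The output dimension formula above can be checked with a tiny helper. The 416 × 416 input and class count below are illustrative assumptions, not necessarily the configuration used in the paper:

```python
def yolo_head_shape(H, W, s, class_num, anchors_per_cell=3):
    """Output tensor shape of a YOLO detection head: one grid cell per
    s x s input region; each anchor predicts 4 box coordinates,
    1 confidence score, and class_num class scores."""
    return (H // s, W // s, anchors_per_cell * (5 + class_num))

print(yolo_head_shape(416, 416, s=32, class_num=8))  # → (13, 13, 39)
```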
There are several kinds of multi-line formulas in the im2latex-100k dataset, such as matrices with square brackets, matrices with round brackets, round-bracket formulas, angle-bracket formulas, curly-bracket formulas, piecewise functions (left brace), piecewise functions (right brace), and general multi-line formulas. Examples of these multi-line formulas are given in Table 1. The first step of multi-line formula recognition is to identify the formula type, whose main distinguishing feature is the symbols on both sides of the formula. The YOLO model detects whether the picture contains multi-line characters and their type; the formulas of each line are then separated, recognized, and combined into a complete LaTeX sequence. Special characters (<start> and <end>) are added at the beginning and end of each LaTeX sequence so that the encoder can distinguish the start and end characters.
After adding the YOLO model, formula recognition first detects whether the picture contains multi-line identifiers. If a specific type of multi-line formula is present, the formula is divided into multiple single lines for recognition; if there is only one row, the VIT model is called directly to predict the result.
The difference between the segmented recognition result and the ground truth is that each member of the ground truth is wrapped in "{}", while in the segmented recognition the entries are separated by \quad spaces. At the same time, multi-line formulas often contain fixed-format characters such as \begin{array} and \end{array}. Once the individual lines have been recognized, they must be combined into a complete LaTeX sequence; the multi-line formula segmentation and recognition process is shown in Figure 9. Analysis of the LaTeX strings of multi-line formulas shows that they contain fixed structural parts: for example, a matrix usually contains \begin{bmatrix} and \end{bmatrix}. The algorithm splices the recognized strings with these fixed parts to obtain the complete recognition result.

Table 1. Examples of multi-line formula types, listing each formula type (e.g., matrix with square brackets) together with its formula picture and LaTeX expression.

Settings
In this experiment, the batch size, initial learning rate, and number of epochs are set to 45, 0.0001, and 1000, respectively. The hardware environment includes an Intel i7-12700H CPU, 60 GB of memory, and a Tesla V100 graphics card with 60 GB of graphics memory; the hard disk capacity is 100 GB. One epoch takes about 2 minutes on the Tesla V100.

Measurements
The formula recognition task can be regarded as a particular machine translation task: the input language is the trajectory point sequence of the formula, and the target language is the LaTeX sequence. Therefore, the evaluation can use the same indicators as machine translation.
BLEU (Bilingual Evaluation Understudy) [37] is a text evaluation algorithm often used to evaluate the correspondence between a machine translation and a professional human translation. BLEU's guiding design idea is that the higher the similarity between the machine translation and the human translation, the higher the score, so the score calculated by BLEU can serve as an index of machine translation quality. The modified n-gram precision is

p_n = Σ_i Σ_k min(h_k(c_i), max_{j∈m} h_k(s_{ij})) / Σ_i Σ_k h_k(c_i),

where Σ_i Σ_k min(h_k(c_i), max_{j∈m} h_k(s_{ij})) denotes the clipped number of occurrences of each n-gram of the prediction sequence with respect to the reference answers.
The n-gram matching degree changes as sentences get shorter, which leads to a problem: the model may accurately predict only part of the LaTeX sequence yet still obtain a high matching score. To avoid this bias, BLEU introduces a Brevity Penalty into the final score:
BP = 1 if l_c > l_s, and BP = exp(1 − l_s/l_c) otherwise, where l_c denotes the length of the predicted LaTeX sequence and l_s denotes the effective length of the ground truth; when there are multiple ground truths, the length closest to the predicted sequence is selected. When the predicted sequence is longer than the ground truth, the penalty coefficient is 1, meaning no penalty; the penalty factor is computed only when the prediction is shorter than the ground truth.
Since the accuracy of each n-gram statistic decreases exponentially as the order increases, the geometric mean is used to balance the statistics of each order, which is then multiplied by the length penalty factor. The final evaluation formula is

BLEU = BP · exp(Σ_{n=1}^{N} w_n log p_n),

where w_n = 1/N and N = 4, i.e., at most the accuracy of 4-grams is counted.
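A simplified single-reference BLEU along these lines can be sketched as follows. This is an illustration of the formula, not a production scorer; real evaluations use smoothed, multi-reference implementations such as those in NLTK or sacreBLEU.

```python
import math
from collections import Counter

def ngrams(seq, n):
    return Counter(tuple(seq[i:i+n]) for i in range(len(seq) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Simplified single-reference BLEU: geometric mean of clipped
    n-gram precisions (n = 1..max_n) times the brevity penalty."""
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        overlap = sum(min(c, ref[g]) for g, c in cand.items())  # clipped counts
        total = max(sum(cand.values()), 1)
        precisions.append(max(overlap, 1e-9) / total)           # crude smoothing
    bp = 1.0 if len(candidate) > len(reference) else \
         math.exp(1 - len(reference) / len(candidate))          # brevity penalty
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

ref = list("a_{b}+c")
print(round(bleu(ref, ref), 4))  # identical sequences score 1.0
```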
Another important indicator for measuring a formula recognition model is the minimum edit distance (MED) [38], proposed by the Russian scientist Vladimir Levenshtein in 1965. MED is usually used to calculate the similarity of two character strings. For two strings ψ_1 and ψ_2, their MED(ψ_1, ψ_2) is defined as the minimum number of single-character edits that transform ψ_1 into ψ_2. Only three types of single-character edits are considered in this article: insertion, deletion, and substitution. The distance is expressed as

lev_{a,b}(i, j) = max(i, j) if min(i, j) = 0, and otherwise
lev_{a,b}(i, j) = min(lev_{a,b}(i − 1, j) + 1, lev_{a,b}(i, j − 1) + 1, lev_{a,b}(i − 1, j − 1) + 1_{(a_i ≠ b_j)}).

Here, lev_{a,b}(i, j) denotes the distance between the first i characters of a and the first j characters of b. When the length of a or b is 0, the number of edits needed to convert an empty string into a non-empty string is the length of the non-empty string. In the implementation, M and N denote the lengths of the input character sequences STR_A and STR_B, respectively, and MATRIX_ED is the array storing the intermediate and final results. The algorithm process is shown in Algorithm 2.

In addition to BLEU and MED, Exact Match (EM) is used to calculate the degree of match between the predicted LaTeX sequence and the ground truth. The two sequences are traversed together, comparing whether the characters at each position are equal, based on the shorter of the two sequences. For example, the EM of a_{b} and a_b} is 0.5. To compare with other models, the normalized MED is used in this paper.
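The dynamic-programming recurrence above can be implemented directly; this sketch corresponds to the MATRIX_ED table of Algorithm 2:

```python
def edit_distance(a, b):
    """Levenshtein distance via dynamic programming: dp[i][j] is the
    minimum number of insertions, deletions, and substitutions turning
    the first i characters of a into the first j characters of b."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                     # delete all of a[:i]
    for j in range(n + 1):
        dp[0][j] = j                     # insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i-1] == b[j-1] else 1
            dp[i][j] = min(dp[i-1][j] + 1,       # deletion
                           dp[i][j-1] + 1,       # insertion
                           dp[i-1][j-1] + cost)  # substitution (or match)
    return dp[m][n]

print(edit_distance("a_{b}", "a_b}"))  # → 1 (delete the "{")
```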
The comparison between this model and other models is shown in Table 2. The results show that BLEU is about 2% higher than similar models, and MED matches similar models, but EM performance is somewhat lower.
The model's generalization is also verified on a dataset of handwritten mathematical formulas. CROHME 2014 is a dataset of 10,846 handwritten formulas, each from a real scene. The handwritten-formula recognition results of this model and other models are shown in Table 3; this model still performs well on handwritten mathematical formula recognition. Table 4 shows the overall performance of the model when no distinction is made between single-line and multi-line formulas on im2latex-100k. The experimental results show that on the handwritten dataset CROHME 2014, BLEU, MED, and EM reach 54.29, 57.80, and 60.20 for single-line formulas, and 55.39, 58.20, and 60.22 for multi-line formulas, respectively. On the im2latex-100k dataset, BLEU, MED, and EM reach 90.02, 90.34, and 70.24 for single-line formulas, and 71.45, 73.55, and 65.27 for multi-line formulas, respectively.
In addition, on the im2latex-100k dataset, BLEU reaches 0.92 and Exact Match 0.62 when no distinction is made between single-line and multi-line formulas.
To verify the effect of model parameters on the results, the model was tested under different parameter settings. The experimental results show that the main parameters affecting the model are the batch size and the learning rate. Figure 10 shows the influence of different values of these two parameters on the convergence of the model, along with the parameter change curve during training; the model achieves the best effect when the batch size is 45 and the learning rate is 0.0001.

Discussion and Implications
Formula recognition is an exciting research direction. The difficulty is that formula styles are changeable and the number of possible formulas is infinite: a new formula can be obtained simply by replacing a variable of an existing one. These characteristics mean that formula recognition cannot be realized by hand-crafted rules alone.
However, human understanding of formulas generalizes well. After learning a particular type of formula such as \frac{a}{b}, a person can naturally understand \frac{a+c}{b-d} and other expressions of the same form. We speculate that this human ability lies in understanding the structure of the formula: replacing some part of the formula does not change its structure for a human reader.
In this paper, we abandon the hand-designed features of older methods and adopt deep learning to compress the trajectory points of the formula into high-dimensional semantic vectors, encouraging the model to learn the tree-structured nature of the formula. As a result, the model recognizes formulas with the same structure but different forms well. We believe the main reasons for the good performance are: first, the end-to-end structure alleviates error accumulation; second, the attention mechanism aligns the model's current output with the currently attended region.
Our future work will introduce more information into the model to improve its accuracy, such as contextual information and spatial location information. We have noticed that others are actively working on this problem, so we intend to improve our model further to increase its accuracy.

Conclusions
This paper proposes an end-to-end printed formula recognition method based on the attention mechanism. To address the low accuracy of formula OCR, the model adopts end-to-end training to alleviate error accumulation. The main innovations of this paper are: first, the idea of machine translation is introduced into formula recognition, in particular using the Transformer as the encoder-decoder framework to improve the generalization and accuracy of the model; second, this paper proposes, for the first time, to identify the type of multi-line formula by target detection and then divide the multi-line formula into several single-line formulas. Compared with other models, the model in this paper achieves better performance. The experimental results show that the model generalizes well and is superior to traditional methods when dealing with formulas with complex structures.

Figure 1 .
Figure 1. An example of LaTeX sequence generation from an image.

Figure 2 .
Figure 2. Overall architecture of this model.

Figure 3 .
Figure 3. Architecture of the encoder.

Figure 5 .
Figure 5. Scaled dot product attention and multi-head attention.

Table 2 .
Comparison between the proposed model and other similar models on im2latex-100k.

Table 3 .
The effect of our model on the handwritten formula dataset. It can be seen from the table that our model is better than other similar models.

Table 4 .
The total efficiency of the model when there is no distinction between single-line and multi-line formulas on im2latex-100k.