Dual Model Medical Invoices Recognition

Hospitals need to invest a lot of manpower to manually input the contents of medical invoices (nearly 300,000,000 medical invoices a year) into the medical system. In order to help the hospital save money and stabilize work efficiency, this paper designed a system to complete the complicated work using a Gaussian blur and smoothing–convolutional neural network combined with a recurrent neural network (GBS-CR) method. Gaussian blur and smoothing (GBS) is a novel preprocessing method that can fix the breakpoint font in medical invoices. The combination of convolutional neural network (CNN) and recurrent neural network (RNN) was used to raise the recognition rate of the breakpoint font in medical invoices. RNN was designed to be the semantic revision module. In the aspect of image preprocessing, Gaussian blur and smoothing were used to fix the breakpoint font. In the period of making the self-built dataset, a certain proportion of the breakpoint font (the font of breakpoint is 3, the original font is 7) was added, in this paper, so as to optimize the Alexnet–Adam–CNN (AA-CNN) model, which is more suitable for the recognition of the breakpoint font than the traditional CNN model. In terms of the identification methods, we not only adopted the optimized AA-CNN for identification, but also combined RNN to carry out the semantic revisions of the identified results of CNN, meanwhile further improving the recognition rate of the medical invoices. The experimental results show that compared with the state-of-art invoice recognition method, the method presented in this paper has an average increase of 10 to 15 percentage points in recognition rate.


Detailed Introduction
A large number of paper medical invoices are produced in hospitals every day. If we only use manpower to identify and classify these medical invoices, it is not only a waste of manpower, but also cannot guarantee work efficiency after long working hours [1]. Compared to a similar field, such as for face recognition, one of the best methods is based on Gabor jet feature extraction, filters, and Borda count classification [2][3][4][5], which really point out that the faster the image features are processed, the better the recognition they will present. In order to improve and stabilize the recognition efficiency 1.
In terms of preprocessing, we abandoned the traditional methods, such as the wavelet-transform, because of the worse feedback in the breakpoint font. We designed a brand-new preprocessing method called Gaussian blur and smoothing (GBS), which can effectively repair the breakpoint font and improve recognition accuracy; 2.
Compared with a state-of-art text recognition system, such as optical character recognition (OCR), we used the dual model network of CNN and RNN as the recognition module in order to improve the recognition accuracy.
As a matter of fact, this system helps hospitals accomplish a large amount of complicated work.

The Background of Deep Learning
Today, deep learning is already becoming a much-needed technology in many areas, especially in recognition and further processing. As for the deep learning technology we currently put in use, it started from the BP model, well-known as the multi-layer perceptron. It is the primary model of the deep learning methods, but it is still limited by the lack of calculation power of computers. With the development of computing power, more complex methods with better accuracy have appeared. In 2006, Professor Geoffrey Hinton and his students published a paper on science, and pointed out that in many areas, the gathering features of the datasets were more useful for describing themselves. In order to overcome the difficulty of how to train a network model of deep learning, he presented a new idea called layer-wise pre-training, and these two points are still what we are using nowadays in works on deep learning. Deep-learning methods are representation-learning methods with multiple levels of representation, obtained by composing simple but non-linear modules. These modules transform each representation at one level (starting with the raw input), into a representation at a higher and more abstract level. With the composition of enough transformations, very complex functions are formed [24]. In 2012, Professor Geoffrey Hinton and his students performed a further study using CNN to recognize images using only pixels. This is an advanced idea in image processing as well as in all areas of image recognition, solving the problem of too many features of images that need to be collected and replaced with pixels. In particular, CNN has achieved the top performance on many image-based classification tasks, as its structure is very suitable for representing the image data. For example, image classification used Chinese handwriting recognition [25], and it has been used in the field of face recognition as well. For another framework, we used RNN. The recurrent neural network is a powerful model for sequential data, which has been used widely in speech recognition for combining the multiple levels of representation that have proven so effective in deep networks, with flexible use in a long-range context [26][27][28]. The RNN also fits in the sentences semantic area [29]. A study by Cho, Kyunghyun shows that the proposed model (based on RNN) learns a semantically and syntactically meaningful representation of linguistic phrases [30]. It has also been proven to play an important role in the recognition of fast continuous Mandarin speech [31]. These areas have many familiar parts with word or text recognition; in particular, the semantic part is still a very challenging part of invoice recognition, which is why we fitted the RNN into our work smoothly.

The Status Quo of Invoices Recognition
So far, the deep learning algorithm has been applied to replace the original matching algorithm. As for the text recognition, it has always been a long-standing topic in computer vision [32]. Recognition methods in text are put forward, but are rarely based on deep learning. Another blank area is in the area of medical invoices recognition. Traditional invoice recognition only focusses on general invoices and Sensors 2019, 19, 4370 4 of 27 bank invoices, for example, the bank bill automation recognition based on HMMs method invented by Gui-Xin Wang, the recognition algorithms of bank invoice images using a BP network invented by Meng-di Han, and the spiral recognition methodology for the recognition of Chinese bank bills [33].
For traditional recognition, it only can recognize the continuous characters where every word is printed clearly and smoothly. A lot of methods are based on OCR, but it is already out of date in the invoice recognition area. In recent years, deep learning methods have really outperformed traditional methods in the OCR research field, which proposes the idea that CNNs on large-scale annotated datasets can be efficiently transferred to other visual recognition tasks with a limited amount of training data [34], which gave us the idea of combining the deep learning methods with our dual model system.

The Innovation of GBS-CR
As we all know, for printed text, the main difficulty still comes from severely merged or degraded characters. Even now, the incorrect recognition of merged characters is still one of the main problems [35], both in printed letters and handwriting [36], but medical invoices are different. They not only have the problem above, but also medical invoices are mostly printed using the stylus printer. This means that there will many breakpoints reserved in every character, but at the moment, traditional recognition cannot recognize words with breakpoints very well. To perfectly solve the problem, we found that Gaussian blur and smoothing is an excellent method, by expanding the character itself while still being able to recognize it. In the field of face recognition, the Gaussian filter has been used as a tool to help deal with preprocessing and after blurring. We found a mature method named the scale-invariant feature transform (SIFT), proposed by David Lowe, which provided a solution to extract the features and re-match with the database. SIFT is a powerful set of local, invariant features or key-point descriptors for detecting local structures in different image views. The noises are blurred, but the details on edges remain unaffected in this scale space. Compared with traditional SIFT-based matching methods [37][38][39], the features extracted by AAg-SIFT [40][41][42][43] are more stable and precise [44][45][46][47]. This method has normally been seen in image recognition [48,49], and inspired us with how to preprocess the breakpoint character. In our work, we figured out a way to preprocess the image by using a Gaussian filter to process the words and to reach a point where the breakpoint was fixed smoothly and the necessary features were left over clearly. Thus, we could still extract the invariant features. We also applied the deep learning model to upgrade the speed in order to solve the problem of the low recognition rate of the traditional method.
This paper proposes a new image preprocessing method named GBS and a medical invoice recognition algorithm which combines CNN and RNN together, in order to recognize medical invoices, which greatly improves the performance of medical invoice recognition, after having studied the existing achievements in signal processing and deep learning in our laboratory [50][51][52]. In the algorithm combining the Alexnet with Adam optimization algorithm, using the Adam optimization algorithm can make the network convergence fast and reduce the loss of the advantages of network training, greatly improving the recognition rate.

Dual Model Medical Invoice Recognition Methods
This paper introduces the optimization and acceleration of the GBS-CR model from the following aspects, the framework of which is shown in Figure 1. As shown in Figure 1, the GBS-CR system proposed in this paper was divided into the following two modules: 1.
Recognition (online). In the first module, we only needed two models trained by Alexnet-Adam-CNN and RNN. Then, we could upload these models to the cloud and save it.
In the second module, as there is a large amount of professional terminology in medical invoices, this paper combined AA-CNN and RNN in sequence. After AA-CNN outputs recognized the results, RNN was used to carry out the semantic revisions obtained by the recognized results of CNN, and modified some professional terminology in the medical invoices.

AA-CNN Model Training for Recognition
After obtaining the font images of the medical invoices, we carried out the image preprocessing in order to obtain the training set required by the AA-CNN network, which integrates the Alexnet and Adam optimization algorithm, and we started network training with the primary character dataset. In the end, we could save the trained model to the local and identify the images. The specific process is shown in Figure 2.  As shown in Figure 1, the GBS-CR system proposed in this paper was divided into the following two modules:
In the first module, we only needed two models trained by Alexnet-Adam-CNN and RNN. Then, we could upload these models to the cloud and save it.
In the second module, as there is a large amount of professional terminology in medical invoices, this paper combined AA-CNN and RNN in sequence. After AA-CNN outputs recognized the results, RNN was used to carry out the semantic revisions obtained by the recognized results of CNN, and modified some professional terminology in the medical invoices.

AA-CNN Model Training for Recognition
After obtaining the font images of the medical invoices, we carried out the image preprocessing in order to obtain the training set required by the AA-CNN network, which integrates the Alexnet and Adam optimization algorithm, and we started network training with the primary character dataset. In the end, we could save the trained model to the local and identify the images. The specific process is shown in Figure 2.

AA-CNN Model Training for Recognition
After obtaining the font images of the medical invoices, we carried out the image preprocessing in order to obtain the training set required by the AA-CNN network, which integrates the Alexnet and Adam optimization algorithm, and we started network training with the primary character dataset. In the end, we could save the trained model to the local and identify the images. The specific process is shown in Figure 2.  First, we got the single Chinese character images from the TTF font file, then we put all of these images together to form a primary character dataset, which included 525,700 images from 3755 commonly used Chinese characters.
Second, because of the differences between the common invoices and medical ones, we added the appropriate percentage of simulated breakpoint fonts to build a new dataset, which included 225,300 images from 3755 commonly used Chinese characters.
In view of the different fonts between the normal notes and medical invoices, we designed a method to imitate the breakpoint font. The comparison of these two kinds of invoices is shown in Figure 3. First, we got the single Chinese character images from the TTF font file, then we put all of these images together to form a primary character dataset, which included 525,700 images from 3755 commonly used Chinese characters.
Second, because of the differences between the common invoices and medical ones, we added the appropriate percentage of simulated breakpoint fonts to build a new dataset, which included 225,300 images from 3755 commonly used Chinese characters.
In view of the different fonts between the normal notes and medical invoices, we designed a method to imitate the breakpoint font. The comparison of these two kinds of invoices is shown in Figure 3.  We can clearly see from Figure 4 that the font of the general invoice is continuous, while the medical one is broken.  Therefore, we simulated the breakpoint effect on the original font according to the principle of the printing breakpoint font, and used them to form a new dataset (self-built dataset), which included 751,000 images-525,700 images from the primary dataset, and 225,300 images from the new dataset (this kind of structure will be explained in Section 4.3.1). It is shown in Figure 5. We can clearly see from Figure 4 that the font of the general invoice is continuous, while the medical one is broken. First, we got the single Chinese character images from the TTF font file, then we put all of these images together to form a primary character dataset, which included 525,700 images from 3755 commonly used Chinese characters.
Second, because of the differences between the common invoices and medical ones, we added the appropriate percentage of simulated breakpoint fonts to build a new dataset, which included 225,300 images from 3755 commonly used Chinese characters.
In view of the different fonts between the normal notes and medical invoices, we designed a method to imitate the breakpoint font. The comparison of these two kinds of invoices is shown in Figure 3.  We can clearly see from Figure 4 that the font of the general invoice is continuous, while the medical one is broken.  Therefore, we simulated the breakpoint effect on the original font according to the principle of the printing breakpoint font, and used them to form a new dataset (self-built dataset), which included 751,000 images-525,700 images from the primary dataset, and 225,300 images from the new dataset (this kind of structure will be explained in Section 4.3.1). It is shown in Figure 5. Therefore, we simulated the breakpoint effect on the original font according to the principle of the printing breakpoint font, and used them to form a new dataset (self-built dataset), which included 751,000 images-525,700 images from the primary dataset, and 225,300 images from the new dataset (this kind of structure will be explained in Section 4.3.1). It is shown in Figure 5. Therefore, we simulated the breakpoint effect on the original font according to the principle of the printing breakpoint font, and used them to form a new dataset (self-built dataset), which included 751,000 images-525,700 images from the primary dataset, and 225,300 images from the new dataset (this kind of structure will be explained in Section 4.3.1). It is shown in Figure 5. To illustrate our figure clearly, we added English translations to help understand the figure. At the left side is an example of the original dataset (https://github.com/AstarLight/CPS-OCR-Engine/blob/ master/ocr/gen_printed_char.py). The Chinese character on the image means "content" in English. On the right side is an example of the Chinese characters from our self-built dataset-the meanings of the Chinese characters are "great", "real", "time", "boy", "modest", and "normal". Both datasets are expanded by rotating at different angels.
Finally, as we already know that so many CNN models have been used in many other areas, in order to find an appropriate one, we used some known CNNs, such as Googlenet, Caffenet, and Alexnet, to select a better network that is suitable for our goal.
From the data shown in Table 1, we can see that Alexnet has a higher accuracy when distinguishing the medical invoices; therefore, we chose Alexnet to continue further with our experiment. During the Alexnet training, any change of the parameters in a layer would result in changes in the following parameters, leading to the network needing to constantly adapt to the new data distribution. It required more parameters to adjust the vector, and it was harder to train a network because of the existence of the nonlinear problems in the activation function of the operation. From what we discussed above, the input data of each layer of the neural network were normalized to the standard normal distribution, which can solve the problems above, reduce the training time of the network significantly, and accelerate the network convergence effectively.
For further improving accuracy and reducing loss, in the second stage, image preprocessing methods such as binarization were added to obtain a small sample dataset, and the training was conducted again.
We used AA-CNN to train the character dataset and the Relu activation function to reduce the computing costs. The Relu equation is shown in Equation (1): Next, we will explain how the AA-CNN network worked. As shown in Figure 6, a Chinese character image with a size of 300 × 300 pixels was input into the input layer, and then, after five convolutions (the kernel size of the first convolutional layer was 11 × 11 pixels, while the others' kernel size was 3 × 3 pixels) and two max-poolings connected to two fully-connected layers, and it finally output 3755 categories-the validation split value was 375,500 and the batch size was 128. methods such as binarization were added to obtain a small sample dataset, and the training was conducted again.
We used AA-CNN to train the character dataset and the Relu activation function to reduce the computing costs. The Relu equation is shown in Equation (1): (1) Next, we will explain how the AA-CNN network worked. As shown in Figure 6, a Chinese character image with a size of 300 × 300 pixels was input into the input layer, and then, after five convolutions (the kernel size of the first convolutional layer was 11 × 11 pixels, while the others' kernel size was 3 × 3 pixels) and two max-poolings connected to two fully-connected layers, and it finally output 3755 categories-the validation split value was 375,500 and the batch size was 128. In this article we chose Alexnet in order to make this network more suitable for training the self-built dataset. We made a series of modifications to it, such as the kernel size of the convolutional layer. We also used the Adam optimization algorithm to further reduce the loss value of Alexnet.
The reasons for using AA-CNN for training in this paper were as follows: 1.
Local connection and weight sharing: to reduce a large number of parameters, the speed of seal recognition, and classification; 2.
Downsampling: to improve the robustness of the model, that is, to improve the accuracy of the medical invoice recognition and model stability; 3.
Local response normalization: this normalization was applied after using the nonlinear activation function of ReLu; 4.
Overlapping pooling: to reduce the overfitting of images caused by the operation of the non-overlapping adjacent units.
At the same time, three different optimization algorithms were selected based on the Alexnet network, which were as follows: 1.
In Figure 7, we can see the Adam optimization algorithm (Adam curve: train "accuracy/train" loss, SGD curve: train accuracy/train loss, AdaGrad curve: train accuracy"'/train loss"')-the loss rate was reduced and the recognition accuracy was improved. In this article we chose Alexnet in order to make this network more suitable for training the self-built dataset. We made a series of modifications to it, such as the kernel size of the convolutional layer. We also used the Adam optimization algorithm to further reduce the loss value of Alexnet.
The reasons for using AA-CNN for training in this paper were as follows: 1.
Local connection and weight sharing: to reduce a large number of parameters, the speed of seal recognition, and classification; 2.
Downsampling: to improve the robustness of the model, that is, to improve the accuracy of the medical invoice recognition and model stability; 3.
Local response normalization: this normalization was applied after using the nonlinear activation function of ReLu; 4.
Overlapping pooling: to reduce the overfitting of images caused by the operation of the non-overlapping adjacent units. At the same time, three different optimization algorithms were selected based on the Alexnet network, which were as follows: 1.
AdaGrad optimization algorithm. In Figure 7, we can see the Adam optimization algorithm (Adam curve: train "accuracy/train" loss, SGD curve: train accuracy/train loss, AdaGrad curve: train accuracy'''/train loss''')-the loss rate was reduced and the recognition accuracy was improved.
In the procedure of the back propagation, the chain rule was applied on the loss function. The In the procedure of the back propagation, the chain rule was applied on the loss function. The normalized layer was determined by the type of back propagation gradient, including a summation calculation with a small data batch. The calculation equation is shown below:

RNN Model Training for Semantic Revisions
This paper introduced the optimization and acceleration of the BPTT-RNN model from the following aspects. The framework is shown in Figure 8. First, we downloaded the medical terminology from the CNKI (China National Knowledge Infrastructure) and PubMed, making the linguistic dataset include more than 7000 medical terms.
Then, we needed to vectorize the linguistic dataset, preparing for network training. Finally, in order to obtain the linguistic model, the dataset was applied on the network training.
The RNN was applied to do the semantic revisions using the results of CNN. Then, we gave a brief introduction of BPTT-RNN.
The circulation layer structure is shown in Figure 9. As is shown in Figure 9, the output of the hidden layer was ot using xt as input data. The point is, the value of s depended not just on xt, but on st-1. We can use the following equations to express the calculation procedure of cyclic neural network: Equation (5) is the output layer calculation function. The output layer was a full connection layer. V is the weight matrix of the output layer, and g is the activation function. Equation (6) is the hidden layer (circulation layer) calculation function. U is the weight matrix for the input x, W is the value for the last time, st-1 is the input weight matrix for this time, and f is the activation function (ReLU).
The BPTT algorithm is a training algorithm for a cyclic layer. Its basic principle is the same as the BP algorithm, and it also contains the same three steps: 1.
Forward calculation of each neuron output value; First, we downloaded the medical terminology from the CNKI (China National Knowledge Infrastructure) and PubMed, making the linguistic dataset include more than 7000 medical terms.
Then, we needed to vectorize the linguistic dataset, preparing for network training. Finally, in order to obtain the linguistic model, the dataset was applied on the network training. The RNN was applied to do the semantic revisions using the results of CNN. Then, we gave a brief introduction of BPTT-RNN.
The circulation layer structure is shown in Figure 9. First, we downloaded the medical terminology from the CNKI (China National Knowledge Infrastructure) and PubMed, making the linguistic dataset include more than 7000 medical terms.
Then, we needed to vectorize the linguistic dataset, preparing for network training. Finally, in order to obtain the linguistic model, the dataset was applied on the network training.
The RNN was applied to do the semantic revisions using the results of CNN. Then, we gave a brief introduction of BPTT-RNN.
The circulation layer structure is shown in Figure 9. As is shown in Figure 9, the output of the hidden layer was ot using xt as input data. The point is, the value of s depended not just on xt, but on st-1. We can use the following equations to express the calculation procedure of cyclic neural network: Equation (5) is the output layer calculation function. The output layer was a full connection layer. V is the weight matrix of the output layer, and g is the activation function. Equation (6) is the hidden layer (circulation layer) calculation function. U is the weight matrix for the input x, W is the value for the last time, st-1 is the input weight matrix for this time, and f is the activation function (ReLU).
The BPTT algorithm is a training algorithm for a cyclic layer. Its basic principle is the same as the BP algorithm, and it also contains the same three steps: 1.
Forward calculation of each neuron output value; 2.
The error term value of each neuron is calculated reversely. It is the partial derivative of the error function E to the weighted input of the neuron j; Figure 9. Structure of the circulation layer.
As is shown in Figure 9, the output of the hidden layer was o t using x t as input data. The point is, the value of s depended not just on x t , but on s t−1 . We can use the following equations to express the calculation procedure of cyclic neural network: Equation (5) is the output layer calculation function. The output layer was a full connection layer. V is the weight matrix of the output layer, and g is the activation function. Equation (6) is the hidden layer (circulation layer) calculation function. U is the weight matrix for the input x, W is the value for the last time, s t−1 is the input weight matrix for this time, and f is the activation function (ReLU).
The BPTT algorithm is a training algorithm for a cyclic layer. Its basic principle is the same as the BP algorithm, and it also contains the same three steps:

1.
Forward calculation of each neuron output value; 2.
The error term δ j value of each neuron is calculated reversely. It is the partial derivative of the error function E to the weighted input net j of the neuron j; 3.
Calculate the gradient of each weight.
Finally, we used the stochastic gradient descent algorithm to update the weight, using the previous Equation (6) to carry out forward calculation for the circulation layer: Then, we expanded Equation (6) to get Equation (7): ).
The left side of Equation (7) is an output vector, s t , with a size of n.
The right side of Equation (7) has an input vector, x, with a size of m; the dimension of matrix U is n*m; and the dimension of matrix W is n*n. s t n represents the value of the n th element of the vector s at time t, u nm means the weight from the mth neuron in the input layer to the nth neuron in the circulation layer, and w nn means the weight from the nth neuron at time t−1 of the circulation layer to the nth neuron at time t of the circulation layer.

Recognition
After the CNN and RNN model training, the recognition module could carry out the offline operation. The specific procedure is shown in Figure 10.

Recognition
After the CNN and RNN model training, the recognition module could carry out the offline operation. The specific procedure is shown in Figure 10. We got the image of the original medical invoice obtained by a scanner, then extracted the parts we needed to identify through the color threshold.
Finally, extracted text preprocessing (breakpoint processing) was performed. Breakpoint processing is a process with two steps. The first one is Gaussian blur (in this paper, a Gaussian blur with a 0.3 pixel radius was adopted for the medical invoices). The Gaussian blur equations are shown as follows: In the equations above, σ means the radius of the pixels, the radius of the Gaussian distribution (σ) is 0.5, μ is the central point, which is generally zero, and (x,y) are the relative coordinates of the peripheral pixels to the center ones. In Equation (8), f(x) is the density function of the normal distribution in one dimension, also known as the Gaussian distribution. In Equation (9), G(x,y) is the two-dimensional Gaussian equation derived from Equation (8).
After the Gaussian blur processing, the image was further enhanced by a smoothing operation. In order to get a clearer outline (high frequency and intermediate frequency information of the image) of the breakpoint font, we used the mean value filtering to process the image after the Gaussian blur. The equations are shown in Equation (10) and (11): A kernel with the size of m*n (m, n is odd) was applied on the image as the mean filtering, and We got the image of the original medical invoice obtained by a scanner, then extracted the parts we needed to identify through the color threshold.
Finally, extracted text preprocessing (breakpoint processing) was performed. Breakpoint processing is a process with two steps. The first one is Gaussian blur (in this paper, a Gaussian blur with a 0.3 pixel radius was adopted for the medical invoices). The Gaussian blur equations are shown as follows: G(x, y) = 1 2πσ 2 e −(x 2 +y 2 )/2σ 2 .
In the equations above, σ means the radius of the pixels, the radius of the Gaussian distribution (σ) is 0.5, µ is the central point, which is generally zero, and (x,y) are the relative coordinates of the peripheral pixels to the center ones. In Equation (8), f(x) is the density function of the normal distribution in one dimension, also known as the Gaussian distribution. In Equation (9), G(x,y) is the two-dimensional Gaussian equation derived from Equation (8).
After the Gaussian blur processing, the image was further enhanced by a smoothing operation. In order to get a clearer outline (high frequency and intermediate frequency information of the image) of the breakpoint font, we used the mean value filtering to process the image after the Gaussian blur. The equations are shown in Equations (10) and (11): A kernel with the size of m*n (m, n is odd) was applied on the image as the mean filtering, and the value of the intermediate pixel was replaced by the pixel average of the region covered by the kernel, g(x, y) means the center point at (x, y) with the size of the m*n filter window, and f (s,t) means the activation function. It is calculated as Equation (10).
For the images of size M*N, the weighted mean filter whose window was the size of M*N is calculated as Equation (11).

Experimental Settings
The experiments conducted in this paper were as follows: the operating system was Ubuntu 16.04; the GPU was NVIDIA GEFORCE GTX 1060; the memory size was 16 GB; and the software platform was python tensorflow.
The workflow of the whole system is shown in Figure 11.

Comparisons of the Experimental Results
In Section 4.2, the experimental results are compared, as shown in Figure 12.

Comparison of Preprocessing
In this paper, in order to achieve a better recognition effect, we used two different preprocessing methods to process the medical invoices, namely:
GBS preprocessing: Gaussian blur and smoothing processing.

Wavelet-Transform Preprocessing
In traditional image preprocessing, we used wavelet-transform based on different wavelet basis functions to enhance the original image, such as haar, db2, coif, and bior. The results are shown in Figure 13.
Based on the images above, we found that the difference was not obvious between the four kinds of wavelet-transform, so we selected two of them (haar and db2) to continue the following experiment.
The detailed cumulative histogram images by wavelet-transform based on the haar and db2 wavelet basis functions are shown in the Figures 14 and 15. We observed that the image after the haar wavelet-transform had a more stable peak value from 0 to 25, but wavelet-transform based on the db2 basis function only had a peak value from 0 to 20. So, the wavelet-transform based on the haar basis function was more applicable to this experiment. However, the breakpoint font was still not addressed very well. Sensors 2019, 18, x FOR PEER REVIEW 11 of 26 Figure 11. The workflow of the whole system.

Comparisons of the Experimental Results
In Section 4.2, the experimental results are compared, as shown in Figure 12.

Comparison of Preprocessing
In this paper, in order to achieve a better recognition effect, we used two different preprocessing methods to process the medical invoices, namely: 1.
Original preprocessing: wavelet transform based on different wavelet basis functions; Figure 11. The workflow of the whole system.

Comparisons of the Experimental Results
In Section 4.2, the experimental results are compared, as shown in Figure 12.

Comparison of Preprocessing
In this paper, in order to achieve a better recognition effect, we used two different preprocessing methods to process the medical invoices, namely: 1.
Original preprocessing: wavelet transform based on different wavelet basis functions; 2. GBS preprocessing: Gaussian blur and smoothing processing.

Wavelet-Transform Preprocessing
In traditional image preprocessing, we used wavelet-transform based on different wavelet basis functions to enhance the original image, such as haar, db2, coif, and bior. The results are shown in Figure 13. Based on the images above, we found that the difference was not obvious between the four kinds of wavelet-transform, so we selected two of them (haar and db2) to continue the following experiment.
The detailed cumulative histogram images by wavelet-transform based on the haar and db2 wavelet basis functions are shown in the Figure 14-15. We observed that the image after the haar wavelet-transform had a more stable peak value from 0 to 25, but wavelet-transform based on the db2 basis function only had a peak value from 0 to 20. So, the wavelet-transform based on the haar basis function was more applicable to this experiment. However, the breakpoint font was still not addressed very well.   Based on the images above, we found that the difference was not obvious between the four kinds of wavelet-transform, so we selected two of them (haar and db2) to continue the following experiment.
The detailed cumulative histogram images by wavelet-transform based on the haar and db2 wavelet basis functions are shown in the Figure 14-15. We observed that the image after the haar wavelet-transform had a more stable peak value from 0 to 25, but wavelet-transform based on the db2 basis function only had a peak value from 0 to 20. So, the wavelet-transform based on the haar basis function was more applicable to this experiment. However, the breakpoint font was still not addressed very well.  In this paper, a breakpoint processing method was proposed, as shown in Figure 16, based on Gaussian blur and smoothing (GBS).

Breakpoint Processing
In this paper, a breakpoint processing method was proposed, as shown in Figure 16, based on Gaussian blur and smoothing (GBS).

Breakpoint Processing
In this paper, a breakpoint processing method was proposed, as shown in Figure 16, based on Gaussian blur and smoothing (GBS). We found that all of the breakpoint fonts were fixed after GBS preprocessing. In order to further confirm the validity of the breakpoint processing, we carried out the following experiments. We selected a medical invoice randomly, and identified it after the wavelet-transform and GBS preprocessing with our program. The whole process is shown in Figures 17-19.
We found that all of the breakpoint fonts were fixed after GBS preprocessing. In order to further confirm the validity of the breakpoint processing, we carried out the following experiments. We selected a medical invoice randomly, and identified it after the wavelet-transform and GBS preprocessing with our program. The whole process is shown in Figures 17-19.  We found that all of the breakpoint fonts were fixed after GBS preprocessing. In order to further confirm the validity of the breakpoint processing, we carried out the following experiments. We selected a medical invoice randomly, and identified it after the wavelet-transform and GBS preprocessing with our program. The whole process is shown in Figures 17-19.  In total, there were 157 Chinese characters in this medical invoice. We found that in Figure 19a, 138 Chinese characters were right, while in Figure 19b, only 115 Chinese characters were right. The recognition accuracy of these two methods is shown in Table 2. Table 2. The recognition accuracy of wavelet-transform preprocessing and Gaussian blur and smoothing (GBS) preprocessing.

Preprocessing Mode Recognition Accuracy (%)
Wavelet-transform preprocessing 73. 25 GBS preprocessing 87.91 From the data in Table 2, it is not hard for us to see that the GBS preprocessing obtained a higher recognition accuracy in this experiment.  In total, there were 157 Chinese characters in this medical invoice. We found that in Figure 19a, 138 Chinese characters were right, while in Figure 19b, only 115 Chinese characters were right. The recognition accuracy of these two methods is shown in Table 2. Table 2. The recognition accuracy of wavelet-transform preprocessing and Gaussian blur and smoothing (GBS) preprocessing.

Preprocessing mode
Recognition Accuracy (%) Wavelet-transform preprocessing 73.25 GBS preprocessing 87.91 From the data in Table 2, it is not hard for us to see that the GBS preprocessing obtained a higher recognition accuracy in this experiment.

Comparison of Normal CNN and AA-CNN
In this section, we will compare the normal CNN with AA-CNN in the same training procedures (training set: 751,000; test set: 37,550; base learning rate: 0.01; iteration: 9000). The training curve is shown in Figure 20.

Comparison of Normal CNN and AA-CNN
In this section, we will compare the normal CNN with AA-CNN in the same training procedures (training set: 751,000; test set: 37,550; base learning rate: 0.01; iteration: 9000). The training curve is shown in Figure 20. From Figure 20, we observed that the AA-CNN achieved a higher accuracy and lower loss compared with the normal CNN.

Comparison of Semantic Revisions (RNN)
In this experiment, CNN and RNN were combined to form a parallel computing system. After the CNN recognition result was obtained, RNN was applied to conduct a semantic revision based on the CNN recognition result, which could greatly improve the recognition accuracy of our system for medical invoices.
In order to further confirm the validity of the semantic revisions, we selected a medical invoice randomly, and carried out the following experiments.
From Figure 21a, we can see that some medical terms were wrongly recognized, such as "serum", "hepatitis B", and "nonesterified fatty acid", which were wrongly recognized in Figure 21a, while in Figure 21b, because of incorporating the semantic revisions module, all of the medical terms were recognized properly.

Comparison of Other Machine Learning Techniques
In this experiment, CNN and RNN were combined to form a parallel computing system. In order to further confirm the validity of our method, we selected the same medical invoice used in Section 4.2.3, and carried out the following experiments with HMMs and other DCNN [53,54].  There were a total of 209 Chinese characters in the selected medical invoice and 175 of them were recognized correctly in Figure 21a, while 191 of them were recognized correctly in Figure 21b. Table 3 shows the accuracy comparison of the recognition results with or without semantic revisions. As shown in Table 3, as we added a semantic revision module, it examined the recognition results of CNN by medical terminology, so that some characters that were difficult to recognize in CNN could be revised in this module. Thus, the recognition accuracy was higher.

Comparison of Other Machine Learning Techniques
In this experiment, CNN and RNN were combined to form a parallel computing system. In order to further confirm the validity of our method, we selected the same medical invoice used in Section 4.2.3, and carried out the following experiments with HMMs and other DCNN [53,54].
From Table 4, we can see that, compared with other machine learning techniques, the GBS-CR method proposed in this paper still maintained a good performance in recognition. This is because of two main reasons. Firstly, we cited a new preprocessing method (GBS), which could fix the special font in the medical invoices, while the other methods could not. Secondly, after the CNN output its recognition results, we used them as the input of RNN (the semantic revision part) and the RNN gave out the final revised results. In conclusion, there was an average increase of 7 to 12 percentage points in the recognition rate due to our new method.

Analysis of the Dataset
In this section, we analyzed the dataset in the same training procedures (network: AA-CNN; base learning rate: 0.01; iteration: 8000). We adopted two kinds of datasets, one was the primary font dataset including 3755 commonly used Chinese characters, all of which were continuous, and the other was the self-built dataset mixed with both a continuous and a breakpoint font. Throughout the experiment process, we found that the training accuracy of the model would vary with the different proportions of the breakpoint font, as shown in Figure 22.
It can be seen from Figure 22 that the precision of the model training reached the highest point when the proportion was 7:3 (the font of breakpoint was 3, the original font was 7), while the recognition accuracy was 93.49%.
So, in the other experiments carried out in this paper, we chose this proportion of dataset (self-built dataset).
In this section, we analyzed the dataset in the same training procedures (network: AA-CNN; base learning rate: 0.01; iteration: 8000). We adopted two kinds of datasets, one was the primary font dataset including 3755 commonly used Chinese characters, all of which were continuous, and the other was the self-built dataset mixed with both a continuous and a breakpoint font. Throughout the experiment process, we found that the training accuracy of the model would vary with the different proportions of the breakpoint font, as shown in Figure 22. It can be seen from Figure 22 that the precision of the model training reached the highest point when the proportion was 7:3 (the font of breakpoint was 3, the original font was 7), while the recognition accuracy was 93.49%.
So, in the other experiments carried out in this paper, we chose this proportion of dataset (self-built dataset).

Analysis of Generalization Ability
In daily life, invoices are not usually kept in a perfect environment. There are a lot of invoices that will be conveniently put in a place, such as in a pocket, and then soaked by water, or rubbed by hands. Therefore, in this section, we focus on the generalization ability of the proposed method, and have made a comparison of the recognition accuracy.

Hand-Rubbed Medical Invoices
In this part, we selected a medical invoice randomly, and carried out the following experiments. As shown in Figure 23b, it was a hand-rubbed medical invoice, and we did the contrast experiments with the original one in the Figure 23a. We first used the GBS-CR method proposed in this paper to identify the two invoices and to count their accuracy. Then, OCR was used to identify the same hand-rubbed invoice, and its accuracy was compared with the GBS-CR method.
As shown in Figure 24, we used the red box to interpret part of the recognition results. From the comparison of the red box in Figure 24a,b, we observed that some key information was omitted from the hand-rubbed invoice, and some of the results were wrong. According to the statistics, 163 of the 177 characters were correctly identified in the normal medical invoice, while 151 of the 177 characters were correct in the hand-rubbed one. The recognition accuracy is shown in Table 5.

Analysis of Generalization Ability
In daily life, invoices are not usually kept in a perfect environment. There are a lot of invoices that will be conveniently put in a place, such as in a pocket, and then soaked by water, or rubbed by hands. Therefore, in this section, we focus on the generalization ability of the proposed method, and have made a comparison of the recognition accuracy.

Hand-Rubbed Medical Invoices
In this part, we selected a medical invoice randomly, and carried out the following experiments. As shown in Figure 23b, it was a hand-rubbed medical invoice, and we did the contrast experiments with the original one in the Figure 23a. We first used the GBS-CR method proposed in this paper to identify the two invoices and to count their accuracy. Then, OCR was used to identify the same hand-rubbed invoice, and its accuracy was compared with the GBS-CR method.   As shown in Figure 24, we used the red box to interpret part of the recognition results. From the comparison of the red box in Figure 24a,b, we observed that some key information was omitted from the hand-rubbed invoice, and some of the results were wrong. According to the statistics, 163 of the 177 characters were correctly identified in the normal medical invoice, while 151 of the 177 characters were correct in the hand-rubbed one. The recognition accuracy is shown in Table 5.   It can be seen from Table 5 that although the invoice had been hand-rubbed, our system (GBS-CR) could still maintain a good performance (only 8 of the 177 characters were omitted or recognized wrongly).  It can be seen from Table 5 that although the invoice had been hand-rubbed, our system (GBS-CR) could still maintain a good performance (only 8 of the 177 characters were omitted or recognized wrongly).
Then, we did the same experiment with another text recognition method called optical character recognition (OCR), and compared the recognition accuracy with our method. According to the statistics, 151 of the 177 characters were correctly identified in the GBS-CR method, while 119 of the 177 characters were correct in OCR (a part of the recognition results is shown in Figure 25).
From Figure 25, we observed that many characters were omitted and even recognized wrongly with the OCR method, and the recognition accuracy is shown in Table 6.
Then, we did the same experiment with another text recognition method called optical character recognition (OCR), and compared the recognition accuracy with our method. According to the statistics, 151 of the 177 characters were correctly identified in the GBS-CR method, while 119 of the 177 characters were correct in OCR (a part of the recognition results is shown in Figure 25). From Figure 25, we observed that many characters were omitted and even recognized wrongly with the OCR method, and the recognition accuracy is shown in Table 6. In this part, we also randomly selected a medical invoice and carried out the following experiments. As shown in Figure 26b, it was a waterlogged medical invoice, and we did contrasting experiments with the original one in Figure 26a. We first used the GBS-CR method proposed in this paper to identify the two invoices and to count their accuracy. Then, OCR was used to identify the same waterlogged invoice, and its accuracy was compared with the GBS-CR method.

Waterlogged Medical Invoices
In this part, we also randomly selected a medical invoice and carried out the following experiments. As shown in Figure 26b, it was a waterlogged medical invoice, and we did contrasting experiments with the original one in Figure 26a. We first used the GBS-CR method proposed in this paper to identify the two invoices and to count their accuracy. Then, OCR was used to identify the same waterlogged invoice, and its accuracy was compared with the GBS-CR method.  We first used the GBS-CR method proposed in this paper to identify the two invoices and to count their accuracy. The results are shown in Figure 27a,b. We first used the GBS-CR method proposed in this paper to identify the two invoices and to count their accuracy. The results are shown in Figure 27a,b.
(a) Normal medical invoice.
(b) Waterlogged medical invoice. We first used the GBS-CR method proposed in this paper to identify the two invoices and to count their accuracy. The results are shown in Figure 27a  As shown in Figure 27, we also used the red box to interpret part of the recognition results. From the comparison of the red box in Figure 27a,b, we found that just a little information was As shown in Figure 27, we also used the red box to interpret part of the recognition results. From the comparison of the red box in Figure 27a,b, we found that just a little information was omitted from the waterlogged invoice; however, some results were recognized wrongly as well. According to the statistics, 116 of the 127 characters were correctly identified in the normal medical invoice, while 106 of the 127 characters were correct in the waterlogged one. The recognition accuracy is shown in Table 7. It can be seen from Table 7 that although the invoice was waterlogged, our system (GBS-CR) could still maintain a good performance (only 11 of the 127 characters were omitted or recognized wrongly).
Then, we did the same experiment with the OCR method, and compared the recognition accuracy with ours. According to the statistics, 83 of the 127 characters were correct with OCR (a part of the recognition results is shown in Figure 28).
It can be seen from Table 7 that although the invoice was waterlogged, our system (GBS-CR) could still maintain a good performance (only 11 of the 127 characters were omitted or recognized wrongly).
Then, we did the same experiment with the OCR method, and compared the recognition accuracy with ours. According to the statistics, 83 of the 127 characters were correct with OCR (a part of the recognition results is shown in Figure 28). From Figure 28, we observed that many characters were omitted and even recognized wrongly with the OCR method, and the recognition accuracy is shown in Table 8.  Table 8, we can see that, compared with the OCR method, the GBS-CR method proposed in this paper maintained a good performance in the generalization ability, and had an average increase of 10 to 15 percentage points in the recognition rate.

Summary of Experiment Analysis
From the analysis above, we have summarized the following points: 1.
By careful design, we adjusted and modified the existing CNN network, and combined it with an Adam optimization algorithm to derive a new CNN network, called AA-CNN; 2.
We proposed a novel preprocessing method, consisting of a Gaussian blur and smoothing operation (GBS), in order to fix the breakpoint font. This could improve the recognition accuracy to a certain extent; 3.
We designed a medical semantic revision module, which was presented in this paper. It had an average increase of seven to eight percentage points in the recognition rate; From Figure 28, we observed that many characters were omitted and even recognized wrongly with the OCR method, and the recognition accuracy is shown in Table 8. From Table 8, we can see that, compared with the OCR method, the GBS-CR method proposed in this paper maintained a good performance in the generalization ability, and had an average increase of 10 to 15 percentage points in the recognition rate.

Summary of Experiment Analysis
From the analysis above, we have summarized the following points:

1.
By careful design, we adjusted and modified the existing CNN network, and combined it with an Adam optimization algorithm to derive a new CNN network, called AA-CNN; 2.
We proposed a novel preprocessing method, consisting of a Gaussian blur and smoothing operation (GBS), in order to fix the breakpoint font. This could improve the recognition accuracy to a certain extent; 3.
We designed a medical semantic revision module, which was presented in this paper. It had an average increase of seven to eight percentage points in the recognition rate; 4.
We built a new dataset, with an optimal performance comparable to the state-of-the-art method, which used a self-built dataset combining the features of continuous Chinese characters and breakpoint font in medical invoices; 5.
In this paper, the generalization ability of our system also had a good performance.

Conclusions and Future Work
This paper proposed a dual model medical invoice recognition method based on deep learning, which can extract and accurately recognize many breakpoint fonts in real medical invoices, and can solve the problem of the complexity and high error rate of manual classification output in practical applications. In addition, for breakpoint font, a bidirectional approximation method was used to solve the problem of the difficult recognition of the breakpoint font in an impact printer. On the one hand, in the preprocessing section, GBS (Gaussian blur and smoothing) was used to preprocess the collected images of the medical invoices, and the font of the breakpoint can be well recovered. Furthermore, in the training of the AA-CNN model, a self-built dataset with a 7:3 proportion (the font of breakpoint is 3, the original font is 7) of the breakpoint font was used to further improve the recognition accuracy of the system for the typeface of the breakpoint printed by the impact printer. In the process of training the AA-CNN network, a fine-tuned CNN combined with an Adam optimization algorithm was selected for training. Compared with the traditional network, it greatly accelerated the network convergence speed, reduced the loss value, and improved the recognition accuracy. It solved the problem of the low recognition rate of the text recognition methods (OCR) in identifying the breakpoint font in medical invoices. The algorithm designed in this paper achieved a higher recognition accuracy by GBS preprocessing, and the combination of AA-CNN and the semantic revision module based on RNN. Both the theoretical tests and the results of the contrast experiments show that the method proposed in this paper has more advantages in processing the breakpoint font in medical invoices. In the future, we will also focus on improving the preprocessing methods such as restricted Boltzmann machines (RBMs) [55][56][57][58], and applying our approach to more areas like 3D object recognition and fMRI [59][60][61][62].