Large-Scale Printed Chinese Character Recognition for ID Cards Using Deep Learning and Few Samples Transfer Learning

Abstract: In the field of computer vision, large-scale image classification tasks are both important and highly challenging. With the ongoing advances in deep learning and optical character recognition (OCR) technologies, neural networks designed to perform large-scale classification play an essential role in facilitating OCR systems. In this study, we developed an automatic OCR system designed to identify up to 13,070 large-scale printed Chinese characters by using deep learning neural networks and fine-tuning techniques. The proposed framework comprises four components: training dataset synthesis and background simulation, image preprocessing and data augmentation, model training, and transfer learning. The training data synthesis procedure consists of a character font generation step and a background simulation process. Three background models are proposed to simulate the background noise and anti-counterfeiting patterns on ID cards. To expand the diversity of the synthesized training dataset, rotation and zooming data augmentation are applied. A massive dataset comprising more than 19.6 million images was thus created to accommodate the variations in the input images and improve the learning capacity of the CNN model. Subsequently, we modified the GoogLeNet neural architecture by replacing the fully connected (FC) layer with a global average pooling (GAP) layer to avoid the overfitting caused by a massive amount of training data; consequently, the number of model parameters was reduced. Finally, we employed transfer learning to further refine the CNN model using a small number of real data samples. Experimental results show that the overall recognition performance of the proposed approach is significantly better than that of prior methods, demonstrating the effectiveness of the proposed framework, which achieved a recognition accuracy as high as 99.39% on the constructed real ID card dataset.


Introduction
Image classification has long been one of the prominent topics in deep learning, and Chinese character recognition is one of its applications. Traditionally, optical character recognition (OCR) has been used for text recognition and has achieved good results. Large-scale image classification is an important and challenging task in the field of computer vision and plays an essential role in facilitating OCR methods. For example, the number of classes in a Chinese OCR system can be as high as 13,070. When performing large-scale classification, the amount of data in each category is considered the most important factor. Moreover, as a classifier is required to distinguish an ever larger set of characters, its accuracy tends to decrease.
ID card information verification is widely performed for multiple purposes on various occasions, such as opening bank accounts or making deposits, hotel check-in, clinic registration, identity verification at facility entrances, and pick-up services for purchased items. A system that automatically identifies personal data on ID cards is expected to provide considerable convenience for both customers and service providers; it saves human resources and reduces the possibility of errors. It not only saves time but also reduces physical contact, which is especially important when infectious diseases are prevalent. Hendra Dito Dwi et al. [1] developed an OCR system to identify new ID cards issued in Indonesia. Similarly, Angga Maulana et al. [2] used MSER to detect pre-processed Indonesian ID card images and locate the areas where the text was located. Wira et al. [3] applied a series of image processing techniques, such as image binarization, Sobel edge detection, and morphology, to mark the text areas on citizen ID cards; Google Tesseract was then used as the primary framework for character recognition, and their approach correctly identified citizen ID cards at a rate of over 90%. Niloofar et al. [4] designed a network model called the efficient and accurate scene text detector (EAST), which can accurately extract text areas; compared with MSER-based algorithms, it is more robust to natural noise and is faster.
On arriving at a hotel, one conventionally checks in via the reception staff. The traditional method involves manual data entry on a workstation computer after manual confirmation of identity documents. Although contemporary ID cards include a bar code that records personal information, only government agencies or specific institutions can access and use it owing to privacy issues; ordinary hotels cannot use this feature. Human error is typical in such settings. For example, during the peak tourist season, the influx of a large number of customers may easily cause the reception staff to panic or otherwise perform imperfectly, leading to data entry errors. Moreover, customers' privacy may easily be violated and their personal information exploited by malicious actions of staff at the reception desk. In this study, we aim to establish a self-service check-in system, as shown in Figure 1, utilizing the proposed large-scale printed Chinese OCR with deep learning and transfer learning. When a user presents their ID card to the camera device, the system automatically recognizes their personal information and completes the follow-up check-in procedures, which not only solves the manual error problem but also removes the possibility of malicious actions.

Although handwritten text recognition methods have been established, further research on text recognition on ID cards remains necessary to account for variations in backgrounds and lighting. In real situations, various factors must be considered: lighting conditions, paper material, color, and text font all introduce different kinds of interference. Various background patterns and anti-counterfeiting mechanisms are also being added to ID cards, and the resulting background noise tends to cause problems in character segmentation and recognition. Furthermore, after long periods of use, ID cards can accumulate all kinds of scratches and damage, and the lamination film of the cover can yellow with aging. The interweaving of these factors increases the difficulty of identification. Another challenge in Chinese character recognition is the large number of Chinese characters. Unlike digits or the English alphabet, there are tens of thousands of Chinese characters, and even the commonly used characters span more than 4000 categories. Moreover, most existing research focuses on handwritten text, and few studies have classified or identified printed text. Handwritten text is deliberately collected, mostly in a laboratory or a single environment; the text images are clear and of good quality, so there is no need to perform excessive preprocessing. In recent years, deep learning has been greatly applied to OCR (Table 1). Xu et al. [6] proposed an end-to-end subtitle recognition system.

After inputting a video with subtitles, the subtitle area is marked, and sliding windows are used to cut out the characters one by one at regular intervals and recognize them. Although an end-to-end system is adopted, the sliding window can easily cut out too many repetitive characters and cause failures. Zhong et al. [7] proposed HCCR-GoogLeNet, which reduces the parameters of the original GoogLeNet [5] and adds the traditional feature extraction method HOG (histogram of oriented gradients), obtaining gradient feature maps to enhance the recognition performance.

When performing large-scale classification, the most important factor is the amount of data in each category. Training a high-precision neural network requires a large amount of data; this requirement is commonly referred to as being data hungry. If the data are insufficient, the model will fail to converge, and if the amount of data differs across categories, the learning will deviate toward the larger classes. We evaluated the proposed method against classic architectures such as AlexNet [27], GoogLeNet [5], and ResNet [28]. The experimental results show that the proposed method effectively enhances the accuracy of Chinese character recognition. The steps executed are presented in the following subsections.

The deep learning mechanism relies on a large amount of training data. However, owing to the privacy issues associated with collecting personal ID card images, the availability of real training samples is limited.

After obtaining the stitched background, as shown in Figure 5b, it was combined with the character image, and Gaussian blurring was performed; Figure 6 illustrates the process.

Among normalization techniques such as min-max, averaging, and histogram equalization, we adopted the min-max normalization method (Equation (1)), which achieved the best recognition accuracy. Figure 7 shows an example of min-max normalization: compared with the original image (Figure 7a), the normalized image exhibits better contrast.

We replaced the FC layer of GoogLeNet with a GAP layer (Figure 9) to reduce the number of model parameters and improve its classification performance; Figure 9 shows the modified network architecture, GoogLeNet-GAP.

MelnykNet has been applied to handwritten Chinese character recognition [12]. In the study of Melnyk et al. [12], the handwritten character data were binary images with a clean background; hence, the MelnykNet model requires only a few parameters and can be trained to perform well. However, as our input images are grayscale images with noisy backgrounds, it was necessary to strengthen the feature extraction capabilities of the CNN model to achieve better recognition performance.
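Assuming Equation (1) follows the standard definition of min-max normalization, the grayscale image I is linearly rescaled to the full 8-bit intensity range:

\[ I'(x, y) = \frac{I(x, y) - I_{\min}}{I_{\max} - I_{\min}} \times 255 \]

where I_min and I_max denote the minimum and maximum pixel intensities of the image. Pixels at the darkest and brightest extremes map to 0 and 255, respectively, which accounts for the improved contrast visible in Figure 7.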
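As an illustration of the GoogLeNet-GAP idea, the following PyTorch sketch replaces an FC classifier with a 1 × 1 convolution followed by global average pooling; the 7 × 7 × 1024 input feature map follows standard GoogLeNet, and the exact head layout is an assumption rather than the paper's verified design:

```python
import torch
import torch.nn as nn

NUM_CLASSES = 13070  # large-scale printed Chinese character set

class GAPHead(nn.Module):
    """Classification head that replaces GoogLeNet's FC layer with
    global average pooling (GAP), in the Network-in-Network style.
    The 1024-channel feature map follows standard GoogLeNet; the
    paper's exact head design is an assumption here."""
    def __init__(self, in_channels=1024, num_classes=NUM_CLASSES):
        super().__init__()
        # A 1x1 convolution produces one score map per character class.
        self.score = nn.Conv2d(in_channels, num_classes, kernel_size=1)
        # GAP collapses each HxW score map to a single logit.
        self.gap = nn.AdaptiveAvgPool2d(1)

    def forward(self, x):            # x: (N, 1024, 7, 7)
        x = self.score(x)            # (N, 13070, 7, 7)
        x = self.gap(x)              # (N, 13070, 1, 1)
        return x.flatten(1)          # (N, 13070) logits

head = GAPHead()
logits = head(torch.randn(2, 1024, 7, 7))
print(logits.shape)  # torch.Size([2, 13070])
```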

According to Matthew et al. [30], the feature maps extracted by each convolutional layer differ; the higher the layer, the finer the feature maps, and vice versa. In the original architecture, the contours of the characters were extracted by the earlier convolutional layers, so we added a residual block [28] (plotted as the red dashed line in Figure 10) to the later layers to continue extracting the detailed features of the character. The modified MelnykNet model, MelnykNet-Res, is illustrated in Figure 10; a sketch of such a block is given below. We then compared the two models using a real testing dataset. Table 2 presents the performance of these two models, given the different numbers of training samples augmented for each character.
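The added block can be sketched as a standard residual block in the style of He et al. [28]; the channel width and its exact position inside MelnykNet-Res (Figure 10) are assumptions here:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Basic residual block in the style of He et al. [28]. The
    channel count is illustrative; the block's placement inside
    MelnykNet-Res follows Figure 10, not reproduced here."""
    def __init__(self, channels=256):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = x                      # skip connection
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = out + identity              # add the shortcut back
        return self.relu(out)
```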

As may be observed from Table 2, the number of augmented training samples per character has a clear effect on the recognition accuracy of both models. Table 5 shows the number of parameters used in the FC and GAP layers for the various classification tasks. One of the main advantages of transfer learning is that a pre-trained model can quickly adapt to new data through fine-tuning with small amounts of real data.
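The gap between the two head designs can be made concrete with a quick calculation; the 7 × 7 × 1024 final feature map assumed below follows standard GoogLeNet, and Table 5 reports the paper's exact figures:

```python
# Parameter counts for an FC head versus a GAP head on a 7x7x1024
# final feature map (dimensions assumed from standard GoogLeNet).
num_classes = 13070
c, h, w = 1024, 7, 7

# FC on the flattened feature map: every input unit connects to
# every class output.
fc_params = (c * h * w) * num_classes + num_classes   # weights + biases

# GAP first averages each channel to a single value, then a single
# 1024-to-13070 linear layer (equivalently a 1x1 convolution).
gap_params = c * num_classes + num_classes

print(f"FC head:  {fc_params:,} parameters")   # 655,813,390
print(f"GAP head: {gap_params:,} parameters")  # 13,396,750
```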
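A minimal sketch of this fine-tuning step, written in PyTorch; the optimizer, learning rate, and batch size are illustrative assumptions, not the paper's reported settings:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

def fine_tune(model, real_dataset, epochs=10, lr=1e-4, device="cpu"):
    """Refine a model pre-trained on synthetic characters using a
    small set of real ID card samples (transfer learning)."""
    model.to(device).train()
    loader = DataLoader(real_dataset, batch_size=64, shuffle=True)
    # A small learning rate nudges the pre-trained weights rather
    # than overwriting what was learned from the synthetic data.
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    criterion = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
    return model
```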

To evaluate the performance of transfer learning, we fine-tuned the various classification models using small amounts of real data. After the data are generated and balanced, the problems of insufficient data and learning deviation are solved simultaneously. Figure 11 shows that the character recognition performance improved dramatically when the models were fine-tuned using more real data.

To solve the recognition problem caused by character segmentation shifting, we utilized the projection method. First, the grayscale character image was converted to a binary image with a specific threshold to separate the character from the background.

Then, the binary pixels were projected in the horizontal and vertical directions separately to locate the area of the character in the image. However, ID card images may include noisy backgrounds, and the projection may be affected by noise disturbances. Therefore, we performed morphological processing using a 1 × 7 vertical bar-type erosion operator on the horizontal projection image to disconnect small adjacent blocks. Next, we searched for a continuous black area from the boundary toward the center, as indicated by the red lines in Figure 13. Meanwhile, the distance between the two red lines was required to be greater than 60% of the original width or height to enclose a complete character. As shown in Figure 13, a better segmentation result was obtained. Next, the OCR system was re-evaluated using the rectified dataset.
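A minimal OpenCV/NumPy sketch of this projection-based localization, assuming 8-bit grayscale input with dark characters on a light background; the threshold value and the handling of undersized spans are illustrative:

```python
import cv2
import numpy as np

def locate_character(gray, threshold=128, min_ratio=0.6):
    """Locate a character via horizontal/vertical projections.
    The 1x7 bar erosion and the 60% span requirement follow the
    text; the binarization threshold is an assumption."""
    # Binarize: character pixels become 1, background becomes 0.
    binary = (gray < threshold).astype(np.uint8)

    # Erode with a 1x7 vertical bar to disconnect small adjacent
    # blocks before taking the horizontal projection.
    eroded = cv2.erode(binary, np.ones((7, 1), np.uint8))

    h, w = binary.shape
    rows = eroded.sum(axis=1)   # horizontal projection profile
    cols = binary.sum(axis=0)   # vertical projection profile

    def span(profile, min_len):
        # Scan from both boundaries toward the center for the first
        # non-empty positions (the red lines in Figure 13).
        nz = np.nonzero(profile)[0]
        if len(nz) == 0 or nz[-1] - nz[0] < min_len:
            return 0, len(profile)   # span too small: keep full extent
        return nz[0], nz[-1] + 1

    top, bottom = span(rows, min_ratio * h)
    left, right = span(cols, min_ratio * w)
    return gray[top:bottom, left:right]
```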
Table 9 presents the new evaluation results of the proposed models on the 13,070-character classification task. It may be observed that the recognition accuracy was significantly improved.