Chinese Character Boxes: Single Shot Detector Network for Chinese Character Detection

This paper proposes a deep learning-based Chinese character detection network, an important component of character recognition and translation: detecting the correct character area is a prerequisite for both. Previous studies have focused on projection methods applied after image pre-processing, segmentation-based recognition, and methods that analyze and exploit hand-crafted features. Unfortunately, these results are vulnerable to noise. Recently, deep learning-based recognition or translation systems have treated everything from detection to translation as a single step, but they fail to consider the inaccurate localization problem that arises in the detectors. This paper proposes a Chinese character boxes (CCB) network, called CCB-SSD, that detects the character area more accurately using the single-shot multibox detector (SSD) as a baseline. The proposed CCB-SSD network has a single prediction layer structure in which unnecessary layers are removed from the feature-pyramid structure. An augmentation method for training is introduced, and the problem caused by the use of default boxes is solved with the proposed non-maximum suppression (NMS). On the character and caoshu data-sets used in this paper, the experimental results showed a 96.1% detection rate and 0.89 false positives per character (FPPC), a false positive index proposed here for character data-sets. This outperforms the conventional SSD, which achieved a 69.4% detection rate and 6.57 FPPC.


Introduction
Many studies have been conducted to solve optical character recognition (OCR) problems, but challenging problems remain due to various external factors such as the type and boldness of the handwriting, the size of the document, the degree of damage to the document, and the background of the document. Handwritten documents are particularly difficult to process because of small text, heavy noise, dense text, and the preservation environment; in particular, the preservation environment can cause damage or severe noise. Such documents are very difficult to detect and recognize using conventional methodologies. The documents covered in this paper are old documents written in caoshu. Caoshu is one of the cursive typefaces of Kanji, corresponding to a curve-oriented cursive script, and has the form shown in Figure 1b. Caoshu differs in shape according to the author and differs from the standard Chinese character type used today, shown in Figure 1a.
Old documents, written by hand, are of high historical value but remain untranslated because translation is difficult and time-consuming. It would be meaningful to uncover history that has not yet been revealed by accelerating the translation of these documents through deep learning, which has shown good performance in various fields. This paper proposes a Chinese character boxes (CCB) network for character area detection based on the single-shot multibox detector (SSD) [1], called CCB-SSD, to detect each character of an old document. In addition, this paper reports detection results according to the number of default boxes and aspect ratios used in SSD and introduces augmentation methods for the cropped and document-type data-sets. A novel non-maximum suppression (NMS) algorithm is also proposed to solve the overlap problem caused by the use of several default boxes. Finally, a new performance evaluation index, false positives per character (FPPC), is introduced and compared with false positives per image (FPPI), which evaluates per-image detection performance.
The contributions of this paper are as follows.
• A CCB-SSD network was proposed for Chinese character detection based on a single-scale structure that can be trained end-to-end. This makes the network deep enough without additional steps such as deconvolution-based up-sampling and feature map concatenation, overcoming the limitations of conventional methods. Therefore, semantic information can be fully exploited, and even smaller characters can be detected.

• Using the proposed single-scale structure and NMS, the problems of unnecessary layers and of ignoring the relationships between classifiers were solved, and the overlap problem caused by the use of several default boxes was resolved.

• The FPPC, a new evaluation index that is a more objective indicator than the FPPI, was proposed for the detection of Chinese characters in old documents. In addition, an augmentation method for document-type and cropped data-sets was introduced.
The remainder of this paper is organized as follows. Section 2 briefly introduces previous studies related to the detection and recognition of Chinese characters. Section 3 outlines the proposed detector network, called CCB-SSD, and its training methods. Section 4 presents various experimental results, and Section 5 reports the conclusions.

Related Works
OCR methods can be divided into deep learning-based methodologies and traditional methodologies that predate deep learning. Earlier research was based on traditional machine learning. Techniques include projection methods applied after image pre-processing [2], segmentation-based recognition [3][4][5][6], methods that analyze character features and exploit various hand-crafted features [7][8][9][10][11][12][13][14][15], methods that combine letters into words after detecting each letter [16], and tree-structure-based recognition and detection [17,18]. These conventional methods are vulnerable to various kinds of noise; in particular, they are affected greatly by noise such as spots or blurring.
In recent years, deep learning has been active in many areas of computer vision, and deep learning-based methodologies have been explored in the fields of character detection, recognition, and translation. Tsai et al. [19] used deep convolutional neural networks (CNN) to classify different types of letters; if the network is deep enough for the classification problem, the size of the fully connected layer does not affect the performance. Maidana et al. [20] fused four networks, AlexNet [21], VGG [22], LeNet [23], and ZFNet [24], and found that the fused network offered no significant performance improvement over a single sufficiently deep network. Jaderberg et al. [25] first used a network based on R-CNN [26] for detection and a CNN-based network for recognition. Edge Boxes [27] and aggregate channel feature (ACF) detectors [28] were used for the bounding box proposal, and the union of the boxes proposed by the two methods was used as the candidate set. The candidate boxes were then classified by a random forest classifier, and finally a CNN was used for bounding box regression. The considerable time needed to generate proposal regions by merging the Edge Boxes and weak ACF detectors, and to perform all the classification processes using a CNN after the regression process, is a significant problem. Liao et al. [29] proposed an end-to-end trainable neural network model for scene text detection. The model uses a fully convolutional structure of 28 layers, is inspired by SSD, and uses a single-stage network. Each layer of the feature pyramid has the same 12 default boxes to predict the same number of outputs, a 72-d vector. The prediction results from each layer are processed through concatenation and NMS, and the processing speed is fast. On the other hand, text is not detected well when the spacing between characters is large or the image is overexposed. Busta et al. 
[30] proposed a network that unified text localization and recognition into a single end-to-end network. The detector network was based on YOLO9000 [31], and information about the rotation θ was added to the bounding box (x, y, w, h) during prediction. Nevertheless, the network cannot detect a single character or a short snippet of digits and characters.
Existing deep learning-based text detection networks detect text by word or sentence. However, an old document written in caoshu cannot be divided into words or sentences because it is written with a direction and order from top to bottom and from right to left, unlike a typical scene composed of words such as street signs. Conventional work does not solve this character-unit detection problem well.

Proposed Method
This section introduces the CCB-SSD network for Chinese character detection, compares performance according to the restriction of the random parameters used for data synthesis, and introduces the augmentation methods for the cropped and document-type data. To solve the overlap problem caused by the use of several default boxes, a new NMS is proposed, and a new performance evaluation index, FPPC, is introduced. Figure 2 presents the structure of the conventional SSD. This single-stage detector encapsulates all computations in a single network without the region proposal or feature resampling stages used in two-stage detectors such as Faster R-CNN [32]. In particular, SSD uses a multi-scale feature map structure to detect objects of various sizes and uses default boxes of various aspect ratios in each feature map to detect objects with various aspect ratios.

Baseline Method: SSD
SSD is trained to detect small objects in the feature maps of conv4_3 and conv7, which are relatively large, and to detect large objects in the feature maps of conv9_2 and conv11_2, whose receptive fields are large. This paper uses scales from a minimum of 0.2 to a maximum of 0.9, so conv4_3, the largest feature map, must detect objects corresponding to 20% of the input image resolution. On the other hand, conv11_2, the smallest feature map, is trained to detect objects corresponding to 90% of the input image resolution. Separating the roles of the feature maps and default boxes in this way makes training easier. Three problems arise when the SSD structure is applied to detecting Chinese character regions in the old documents treated in this paper. The first problem is that the convolution layers that predict from the feature maps of conv7, conv9_2, and conv11_2 are unnecessary. The higher layers, far from the base network, are used to detect objects that occupy most of the image, such as cars at close distances, but the characters written in a document are not very large. Second, the relationships between the results detected from the several feature maps are not taken into account. This means that detection can occur multiple times on the same object, which can lead to duplicate detections [33]. Finally, with multi-scale feature maps, the convolution layers that detect in the feature maps near the base network lack semantic information because those feature maps are not deep enough.
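The per-layer scales from 0.2 to 0.9 mentioned above follow the linear scale rule of the SSD paper; a minimal sketch (the number of feature maps m is an assumption of the example, not a value taken from this paper):

```python
def default_box_scales(m, s_min=0.2, s_max=0.9):
    # SSD scale rule: scales spaced linearly between s_min and s_max,
    # s_k = s_min + (s_max - s_min) * (k - 1) / (m - 1), k = 1..m.
    return [s_min + (s_max - s_min) * (k - 1) / (m - 1) for k in range(1, m + 1)]
```

The first feature map then handles objects at 20% of the input resolution and the last at 90%, matching the role separation described above.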

Proposed CCB-SSD Network and NMS
Proposed Network This paper solves the three problems mentioned above using the proposed single-stage network and NMS. Figure 3 presents the network structure and the flow of the whole detection process from input to detection result. ResNet-34 [34] up to a 1/8 scale was used as the base network. The feature map from ResNet-34 passes through seven 3 × 3 convolution layers, is downsampled to half its size by one 3 × 3 convolution layer with a stride of 2, and then passes through eight 3 × 3 convolution layers. The input is therefore downscaled to 1/16 overall, and the offsets and confidence scores for the bounding boxes are computed with 3 × 3 convolutions on the last feature map. Finally, the proposed NMS is used to obtain the final detection results. The structure is fully convolutional except for the one pooling layer of ResNet-34. The orange and blue arrows denote the same feature map. Training Strategy The training strategy generally follows the conventional SSD. Each ground truth box is first matched to the default box with the best Jaccard overlap, and then default boxes are matched to any ground truth with a Jaccard overlap greater than 0.5. According to the conventional SSD, this matching strategy simplifies learning and allows the network to obtain high scores. The loss function for training is given in Equation (1).
x is an indicator matrix for matching: a value of 1 is given if the i-th default box matches the j-th ground truth and 0 otherwise. p is a specific category; N is the number of matched default boxes; l is the predicted box; g is the ground truth box; c is the confidence for category p; and d is the default box. The mean squared error (MSE) loss was used to obtain the localization loss instead of the smooth L1 loss of the conventional SSD. Therefore, the MSE term of L_loc in Equation (3) is expressed as (l_i^m − ĝ_j^m)^2, and the softmax of Equation (2) is expressed as ĉ. The cx, cy, w, and h of the default box and ground truth are encoded, and the detector learns the difference; the coordinates predicted by the detector are thus offsets. The network is initialized with the method of He et al. [35], known as Kaiming He initialization.
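The offset encoding and the MSE localization loss described above can be sketched as follows. The encoding is the standard SSD parameterization of (cx, cy, w, h); the function names are illustrative:

```python
import numpy as np

def encode(gt, default):
    # SSD-style encoding of a ground-truth box against a default box.
    # Boxes are (cx, cy, w, h); the detector regresses these offsets ĝ.
    g_cx, g_cy, g_w, g_h = gt
    d_cx, d_cy, d_w, d_h = default
    return np.array([(g_cx - d_cx) / d_w,
                     (g_cy - d_cy) / d_h,
                     np.log(g_w / d_w),
                     np.log(g_h / d_h)])

def mse_loc_loss(pred_offsets, gt, default):
    # CCB-SSD replaces the smooth L1 loss with mean squared error
    # between the predicted offsets l and the encoded targets ĝ.
    return float(np.mean((pred_offsets - encode(gt, default)) ** 2))
```

When the predicted offsets are exactly the encoded targets, the loss is zero; training minimizes this squared difference over all matched default boxes.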

Proposed NMS
Proposed Non-Maximum Suppression As mentioned above, the conventional SSD uses a multi-scale feature map structure and does not take the relationships between the multiple classifiers into account, thereby detecting the same object multiple times. The CCB-SSD network proposed in this paper is a single-scale detection network, but the same problem arises. Each part or stroke of a character can itself look like a character. Owing to this property, overlapping detections are likely: the regions of actual characters and the strokes or smaller characters within them are both detected, as shown in the right image of Figure 4a. This problem could be handled by a recognizer, but because this work studies more accurate detection, which precedes recognition, the problem is solved within the NMS. The conventional NMS sorts the boxes of a class by confidence in descending order and then repeatedly removes the boxes that overlap the box with the highest confidence, using the intersection over union (IoU) as the overlap criterion. To solve the problem shown in Figure 4a, this paper considers two ratios between the box with the highest confidence and each remaining box: whether the intersection divided by the area of the remaining box is greater than 0.6, and whether the intersection divided by the area of the highest-confidence box is greater than 0.6. If either ratio exceeds the threshold, only the box with the highest confidence is kept and the remaining box is removed. Considering the characteristics of characters, unlike ordinary objects, it is safe to say that characters do not overlap other characters. The proposed NMS handles both cases efficiently without modifying the network or adding extra steps. Figure 4b shows the results of the proposed NMS: unlike the conventional NMS, detections overlapping the same letter are removed.
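The two-ratio criterion above can be sketched as follows. Instead of dividing the intersection by the union (IoU), the sketch divides it by each box's own area and suppresses when either ratio exceeds the threshold, so a small box fully inside a larger character box is removed even though its IoU with the larger box is low. Function and variable names are illustrative:

```python
import numpy as np

def ccb_nms(boxes, scores, thresh=0.6):
    """Sketch of the proposed NMS. boxes is an (N, 4) array of
    (x1, y1, x2, y2); a box is suppressed when the intersection with the
    current highest-confidence box covers more than `thresh` of EITHER
    box's area (not of their union)."""
    order = np.argsort(scores)[::-1]          # indices by descending confidence
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        # Intersection of the top box with every remaining box.
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_top = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_rest = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        # Suppress if the intersection covers most of either box.
        overlap = np.maximum(inter / area_rest, inter / area_top)
        order = rest[overlap <= thresh]
    return keep
```

For example, a stroke-sized box lying entirely inside a character box has an intersection equal to its own area (ratio 1.0 > 0.6) and is suppressed, while ordinary IoU against the large box could stay far below 0.6 and the duplicate would survive.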

Experimental Results
This section introduces the FPPC, the data-set augmentation methods, and the experimental results according to the network parameters.

Proposed Criterion: False Positives per Character
This section introduces the new evaluation index, false positives per character. The conventional method of measuring false detections generally uses FPPI. However, an image in the data-sets and augmented data-sets used in this paper may contain fewer than 10 characters or, more often, 300 to 400 characters. This is not a problem limited to augmentation; it also holds for the actual data-sets. Because the number of characters per image varies greatly, measuring false positives relative to the number of characters in an image is more accurate than computing FPPI. This paper therefore proposes the false positives per character, which calculates false positives based on the characters per image, as shown in Equation (4). In Equation (4), FA is the false alarm count (true positives + false positives − detected GT) and GT denotes the ground truth. This paper takes 50 characters per image as a reference, so Equation (4) is multiplied by 50.
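A sketch of the FPPC computation, reconstructed from the prose description above (the equation image itself is not reproduced here, so this follows the stated definition of FA and the 50-character normalization; the function signature is an assumption):

```python
def fppc(true_positives, false_positives, detected_gt, gt, chars_per_image=50):
    # FA is the false-alarm count as defined in the text:
    # FA = true positives + false positives - detected GT.
    fa = true_positives + false_positives - detected_gt
    # Normalize by the number of ground-truth characters and scale to a
    # reference of 50 characters per image.
    return fa / gt * chars_per_image
```

For an image with 100 ground-truth characters, 95 of them detected, and 10 spurious boxes, this gives FA = 10 and an FPPC of 5.0.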

Kyungpook National University (KNU) Data-Set
The data-set used to train the detector network was the KNU data-set from Kyungpook National University. The character data-set contains approximately 200,000 cropped characters, and the caoshu data-set consists of 1000 scanned or photographed documents. In addition, there are 100 background images without letters, used for augmentation. Table 1 provides an example. The first, the cropped character data-set, is composed of approximately 200,000 samples; the second, the caoshu data-set, consists of documents with cursive text that were scanned or photographed. The last, the background data-set, consists of document-type backgrounds with no text.

Configuration of Data-Set    Cropped Character Data-Set    Caoshu Data-Set    Background Data-Set
# of data-set                200,000                       1000               100
Example                      (image)                       (image)            (image)

Augmentation Method
When the character data-set is augmented, each character is synthesized onto a background image. First, the background image is rotated randomly in units of 90 degrees and the characters to be synthesized are fetched. The number of characters in a background image and the size of each character are set randomly, and each character is resized to a random size while maintaining its original aspect ratio. The character is then written while moving along the line, with the boldness of each character set to a random value. Finally, when the position exceeds the image width, writing moves to the next line. The random parameters for augmentation can therefore be summarized as [background image rotation, number of characters, size of characters, boldness of characters]. Figure 5 shows examples of synthesized images; approximately 30,000 images were synthesized for training.
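The random-parameter sampling described above can be sketched as follows. The parameter names and numeric ranges are illustrative assumptions, not the paper's exact values; only the four random parameters and the line-wrapping rule come from the text:

```python
import random

def sample_augmentation_params(bg_w, bg_h, max_chars=400,
                               size_range=(10, 60), bold_range=(1, 4)):
    """Sketch of random-parameter sampling for character synthesis.
    Characters are written along a line and wrap to the next line when
    the position exceeds the background width."""
    rotation = random.choice([0, 90, 180, 270])   # background rotation (deg)
    n_chars = random.randint(1, max_chars)        # characters per image
    params = {"rotation": rotation, "chars": []}
    x, y, line_h = 0, 0, 0
    for _ in range(n_chars):
        size = random.randint(*size_range)        # original ratio kept on resize
        bold = random.randint(*bold_range)        # stroke boldness
        if x + size > bg_w:                       # wrap to the next line
            x, y = 0, y + line_h
            line_h = 0
        if y + size > bg_h:                       # background is full
            break
        params["chars"].append({"x": x, "y": y, "size": size, "bold": bold})
        x += size
        line_h = max(line_h, size)
    return params
```

Each entry in `params["chars"]` would drive the actual compositing step and directly yields the ground-truth box for that character.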
When controlling the boldness parameter, if a wide range of random values is used to cover both boldly written and very faintly written characters, ground truth boxes are created in empty space, as shown in Figure 6. The detector is then trained to detect empty space, which adversely affects its accuracy. When the same detector is trained on a data-set containing this problem and on a data-set whose random parameters are restricted so that the problem does not occur, the FPPC is reduced significantly, from 0.713 to 0.00238. Such problems must therefore be considered when using random parameters.
The caoshu data-set is given as whole documents and cannot be used with the above synthesis method, so random cropping is used instead. The given 1000 documents are cropped randomly and repeatedly to generate approximately 30,000 images, similar in number to the augmented character data-set. The resulting images are shown in Figure 7.
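A minimal sketch of the random-crop augmentation for the document-type data; the crop size, crop count, and seeding are illustrative assumptions:

```python
import numpy as np

def random_crops(image, crop_size, n_crops, seed=0):
    """Repeatedly crop random windows from a document image.
    image: (H, W[, C]) array; crop_size: (crop_h, crop_w)."""
    h, w = image.shape[:2]
    ch, cw = crop_size
    rng = np.random.default_rng(seed)
    crops = []
    for _ in range(n_crops):
        top = int(rng.integers(0, h - ch + 1))    # random top-left corner
        left = int(rng.integers(0, w - cw + 1))
        crops.append(image[top:top + ch, left:left + cw].copy())
    return crops
```

Applying this repeatedly to each of the 1000 documents yields the roughly 30,000 training crops mentioned above.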

Detection Performance Evaluation
Detection performance according to the default box size and ratios The size of the default boxes covers the range from the smallest to the largest boxes in the data-sets; the minimum and maximum sizes were set to 10 and 650, respectively. To determine the aspect ratios of the default boxes, the k-means clustering method used in [31] was applied, and the mean IoU according to the number of clusters was compared, as shown in Figure 8. Considering the characteristics of characters written in documents, similar aspect ratios are expected. The aspect ratio distribution from clustering with 8 clusters is [1.02, 1.0, 1.03, 1.07, 1.08, 1.6, 1.68, 1.97], and the actual aspect ratio distribution of the data-set is narrow. Nevertheless, default boxes with a wide range of aspect ratios performed better than those with the tight aspect ratios obtained from clustering. This is because the shape and size of each document vary with the author, so there is no uniformity. Although most aspect ratios in the data lie within a narrow range, better performance is achieved with fewer default boxes covering a wider range of aspect ratios, because the tightly clustered ratios do not account for author-dependent variation. In addition, a wide range of aspect ratios gives good results when a new type of data-set is received as input. Performance comparison based on the number of layers and whether the base network is pre-trained As the base network, ResNet-34, which is frequently used in detector networks, was used up to a 1/8 scale. Because pre-trained weights are trained to extract general features, pre-training was also omitted in one setting in order to obtain features optimized for detecting Chinese characters. Tables 2 and 3 present the performance. When using a pre-trained network, there is no strict relationship, but performance tends to improve with an increasing number of layers.
On the other hand, when using a network that is not pre-trained, performance decreases when more than 16 layers are used. Although the comparison of the two experiments did not yield a decisive conclusion, the pre-trained network has a relatively lower FPPC, and the pre-trained base network is more stable, particularly when the number of layers is small. Performance comparison of random parameter problems Table 4 lists the experimental results examining the effect of the pixel value threshold on the performance of the data-set augmentation mentioned in Section 4.3. When unrestricted random parameters are used, ground truth boxes are formed in empty space, so the network learns to respond where nothing should be detected. When the restricted random parameters are used, no ground truth is created in empty space, so the same detector yields a much smaller FPPC. These problems must therefore be considered during data augmentation with random parameters. In Tables 2 and 3, fine-tuning was used to compare the performance of the pre-trained base networks. In fine-tuning, the learning rate α is reduced to 1/10 or 1/3 at a specific epoch during training. As a result, the previously trained character data-set could no longer be detected correctly. Therefore, the character data-set and caoshu data-set were merged and trained together. Table 5 shows the performance of the detector trained on the augmented character data-set and the caoshu data-set with the restricted random parameters. The character data-set and caoshu data-set each contained approximately 30,000 images, so a total of 60,000 images were used for training. The detector was evaluated under the optimized conditions established above, including the data-set, network structure, NMS, and the aspect ratios and sizes of the default boxes.
To confirm the performance around the 16-layer configuration that performed best in the previous experiments, only variations of 10, 16, and 22 layers were compared. The best performance, a 96.1% detection rate and 0.89 FPPC, was obtained with 16 layers, which is better than the conventional SSD with a 69.4% detection rate and 6.57 FPPC. The detection rate is defined by Equation (5). Figure 9 presents the detection results of the conventional SSD and of the proposed CCB-SSD network. For the conventional SSD, the tuning settings were the same as for the proposed CCB-SSD network, while the entire network structure, such as the number of filters and layers, followed the conventional SSD architecture.
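The k-means box clustering from [31], used above to derive the default-box aspect ratios, can be sketched as follows. It clusters (w, h) pairs with 1 − IoU as the distance, where the IoU is computed as if the boxes shared a corner; the function names and the mean-IoU return value (the quantity plotted in Figure 8) are illustrative:

```python
import numpy as np

def iou_wh(whs, centroid):
    # IoU between boxes aligned at a common corner, as in the
    # YOLOv2-style clustering; whs is an (N, 2) array of (w, h).
    inter = np.minimum(whs[:, 0], centroid[0]) * np.minimum(whs[:, 1], centroid[1])
    union = whs[:, 0] * whs[:, 1] + centroid[0] * centroid[1] - inter
    return inter / union

def kmeans_boxes(whs, k, iters=100, seed=0):
    # k-means over (w, h) pairs with distance d = 1 - IoU.
    rng = np.random.default_rng(seed)
    centroids = whs[rng.choice(len(whs), size=k, replace=False)].astype(float)
    for _ in range(iters):
        dist = np.stack([1.0 - iou_wh(whs, c) for c in centroids], axis=1)
        assign = dist.argmin(axis=1)
        new = np.array([whs[assign == j].mean(axis=0) if np.any(assign == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    mean_iou = float((1.0 - dist.min(axis=1)).mean())
    return centroids, mean_iou
```

Running this for increasing k and plotting the returned mean IoU reproduces the kind of cluster-count comparison described for Figure 8; the centroid w/h ratios then give candidate default-box aspect ratios.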

Conclusions
This paper proposed a novel Chinese character boxes-single shot multibox detector (CCB-SSD) network for Chinese character detection. The CCB-SSD network has a single-scale structure, which allows the network to be deep enough for better small character detection without additional processes such as deconvolution-based up-sampling or feature map concatenation. The higher layers of the original SSD were removed because Chinese characters are small and dense. In addition, a new non-maximum suppression (NMS) method was proposed to remove duplicate detections of the same character, because the baseline SSD does not consider the relationships between the classifiers of the multi-scale feature maps. Finally, a new evaluation index, false positives per character (FPPC), was proposed instead of false positives per image (FPPI) to remove the effect of the number of characters per image. Extensive experiments produced an optimal network structure and showed good performance when the network was tested on old caoshu documents from an entirely different period. Most characters in old documents are detected well, but there remains a problem of detecting one character as two, or two characters as one, because there are no clear criteria for the boundaries between characters.