An End-to-End Classifier Based on CNN for In-Air Handwritten-Chinese-Character Recognition

Abstract: A convolutional neural network (CNN) has been successfully applied to in-air handwritten-Chinese-character recognition (IAHCCR). However, the existing CNN-based models for IAHCCR need to convert the coordinate sequence of a character into images. This conversion process increases training and classifying time.


Introduction
In the field of image processing and pattern recognition, the recognition of online handwritten characters is important work and many impressive achievements have been reported [1][2][3][4][5]. The data processed by online handwriting recognition are series of coordinate sequences, which are usually produced by human-computer interaction devices such as handwriting tablets, smart phones and pads. The input range of the traditional human-computer interaction method is limited by the size of the touch device, and damage to a local area can make the entire device unusable. In this context, a new mode of human-computer interaction, vision-based in-air handwriting, has attracted more and more researchers' interest [6][7][8]; e.g., we can use in-air handwriting to switch TV channels remotely or adjust the temperature of an air conditioner. Compared with traditional online handwriting using a touch pad or wearable device, vision-based in-air handwriting suffers fewer space constraints and allows writers to write freely in the air; however, the generated character has no pen-lift information and is finished in one stroke, the jitter of strokes is quite severe, and the strokes overlap each other. These characteristics of in-air handwritten characters bring extra difficulties to IAHCCR. Some examples of in-air handwritten Chinese characters and traditional handwritten Chinese characters can be seen in Figure 1.
As in-air handwriting is a new development of traditional online handwriting, the methods used for online handwritten-Chinese-character recognition (OLHCCR) are also available for IAHCCR. Traditional OLHCCR methods do not directly recognize the original coordinate sequence of an online handwritten Chinese character, but extract features from a preprocessed coordinate sequence according to specific classification rules and specific domain knowledge [9,10]. Before deep learning was introduced into OLHCCR, statistical features, classification algorithms based on them and preprocessing algorithms had long been the research hotspots of OLHCCR and showed excellent performance [11,12]. Related methods have been introduced into IAHCCR, e.g., linear or nonlinear normalization [9,11], eight-directional features [10], modified quadratic discriminant functions (MQDF) [13], etc.
In recent years, deep learning has achieved great success in the fields of computer vision and pattern recognition [14][15][16], and has also been successfully applied to OLHCCR [17][18][19][20]. Compared with the above-mentioned traditional classification models, models based on deep learning show an overwhelming advantage in OLHCCR. However, to our knowledge, all the existing CNN-based models for OLHCCR do not directly recognize the coordinate sequence of an online handwritten Chinese character, but convert the coordinate sequence into images or vectors [2,3,21-23]. As shown in Figure 2, this conversion process not only costs additional training and recognition time, but also utilizes only the spatial information of characters and loses the temporal information of the coordinate sequence, so it is difficult to obtain a higher recognition rate. Since CNN-based models use large-scale convolution kernels, they often require a large number of training patterns and consume more memory. Unlike CNN models, models based on recurrent neural networks (RNN) can directly process the coordinate sequences of online handwritten Chinese characters and outperform most CNN structures [4,24]. Although RNN-based models are suitable for processing sequence data, they are less suitable than CNN for processing long sequences and ignore the global structures of online handwritten Chinese characters. In-air handwritten Chinese characters usually contain hundreds of points, so RNN-based models consume a lot of computation time.
Based on the above analysis, we propose an end-to-end classifier based on CNN for IAHCCR in this paper, which has the advantages of both CNN and RNN. First, we directly use the preprocessed coordinate sequences of online handwritten Chinese characters as the input of the network. Then, the coordinate sequences are converted into one-dimensional feature maps through the first convolutional layer, and the range of the receptive field is expanded by stacking convolutional layers to obtain contextual connections. Finally, global average pooling is applied to the output of the convolutional layers to obtain a fixed-size feature vector, which is sent to the fully connected layer and classified with softmax. The end-to-end CNN can directly recognize the coordinate sequences of online handwritten Chinese characters. Compared with existing CNN models, the end-to-end CNN neither needs to convert the original data into images nor needs features designed with specific domain knowledge. Because it uses smaller convolution kernels, the end-to-end CNN needs fewer parameters and occupies less memory. Compared with RNN-based models, the end-to-end CNN can learn the global information of online handwritten Chinese characters and adapt to coordinate sequences of different lengths.
The rest of the paper is organized as follows. Section 2 briefly introduces the related works. Section 3 introduces the proposed method at length. The experimental results are reported in Section 4. We conclude this paper in Section 5.

Related Works
Research on IAHCCR has been underway for several years, and a complete recognition method usually consists of three stages: preprocessing, feature extraction and classification. The work related to different stages will be described in detail below.
In the preprocessing stage, as mentioned in Section 1, IAHCCR can use the methods of OLHCCR. Character normalization can reduce within-class variation and improve recognition accuracy. Traditional methods require the normalization of characters to a uniform size. Linear normalization, which causes character-shape changes, has been superseded by other methods such as nonlinear normalization [9], pseudo-2D normalization and line-density projection interpolation [11]. End-to-end methods do not require normalization to a fixed-size vector but, rather, normalize the distribution of coordinate sequences [24].
Feature extraction and classification are two separate stages in the traditional IAHCCR method. In the stage of manual feature extraction, compared with the character image drawn from the original coordinates, decomposing local strokes into different directions to form multiple feature maps achieves higher recognition accuracy, e.g., eight-directional feature maps [10]. Especially for IAHCCR, a whole Chinese character is more like a curve function defined on a two-dimensional plane that can be expanded as a Taylor series; on this basis, Qu et al. [25] proposed higher-order directional features. Representing Chinese characters as directional features has been the standard approach for a long time. Classifiers are also very important for IAHCCR. To further improve the recognition rate and recognition speed, a multi-level classification technique based on learning vector quantization [26] was reported for IAHCCR in [27]. Qu et al. introduced locality-sensitive sparse representation-based classifiers (LSRC) [28] into IAHCCR and achieved a higher recognition rate than MQDF [25]. To further improve the recognition accuracy of LSRC, a loss function that minimizes the reconstruction error of each training pattern and makes each reconstruction as close as possible to the optimized prototype of its class was suggested, which significantly improves recognition accuracy [29].
As deep-learning techniques have made great achievements in other fields, deep neural networks were introduced into IAHCCR. In deep-learning-based IAHCCR algorithms, feature extraction and classification are integrated. For IAHCCR, Qu et al. [23] proposed a nine-layer convolutional-neural-network model combined with data-augmentation technology, which significantly improved the recognition rate. Like other image-recognition tasks [30], this method requires a large amount of data to ensure its performance, so data-augmentation technology is needed to expand the data, at the cost of more memory consumption and a larger number of parameters. Ren et al. [20] proposed an end-to-end recognizer based on recurrent neural networks. This method directly recognizes character sequences without converting characters into feature vectors. To further improve the recognition accuracy, Ref. [4] proposed an RNN system with two new computing architectures added. Table 1 summarizes related works. In this paper, we combine the advantages of CNN and RNN to propose an end-to-end CNN model that directly recognizes sequences.

Proposed Method
Like other end-to-end recognition methods, the end-to-end CNN method consists of two parts, preprocessing and model architecture.

Preprocessing
As the writing styles of writers vary widely, the structure, position, shape, sampling-point density and stroke order of the finished in-air Chinese characters are different. These varied intra-class structures and the confusion between similar characters always reduce recognition accuracy [24]. In this paper, the primary purpose of the preprocessing is to eliminate redundant points and standardize the distribution of coordinate points, so as to improve the recognition accuracy for IAHCCR. The preprocessing steps in this paper are summarized as follows: (1) Remove redundant points in the coordinate sequence of in-air handwritten Chinese characters. (2) Normalize the coordinates to a unified coordinate system.

Remove Redundant Points
Any given in-air handwritten character P can be represented by its coordinate sequence as

P = [(x_1, y_1), . . . , (x_t, y_t), . . . , (x_T, y_T)],

where x_t and y_t are the XY coordinates of the tth point of P, t = 1, . . . , T, and T is the number of coordinate points of P. For the coordinate sequence of the character, except for the first point, if the Euclidean distance between the tth point (x_t, y_t) and its adjacent point (x_{t−1}, y_{t−1}) is less than the given threshold L, i.e.,

sqrt((x_t − x_{t−1})^2 + (y_t − y_{t−1})^2) < L,

then the point (x_t, y_t) is deleted as a redundant point. In the following experiments, L varies with the handwritten character P and is computed as L = 0.015 × max{h, w}, where h and w are the spatial height and width of P, respectively.
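The rule above can be sketched in a few lines of Python. This is an illustrative implementation, not the authors' code; in particular, it compares each point against the most recently kept point, which is one natural reading of the deletion rule.

```python
import math

def remove_redundant_points(points, ratio=0.015):
    """Drop points closer than L = ratio * max(h, w) to the last kept point.

    `points` is a list of (x, y) tuples; the first point is always kept.
    Sketch of the redundancy rule described above, not the authors' code.
    """
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    w = max(xs) - min(xs)          # spatial width of the character
    h = max(ys) - min(ys)          # spatial height of the character
    L = ratio * max(h, w)          # adaptive distance threshold

    kept = [points[0]]
    for x, y in points[1:]:
        px, py = kept[-1]
        if math.hypot(x - px, y - py) >= L:
            kept.append((x, y))
    return kept
```

Because L scales with the character's bounding box, densely sampled large characters and small characters are thinned comparably.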

Normalize Coordinates
Since in-air handwriting has fewer space constraints and an unstable writing position, the positions of written Chinese characters vary: some are higher, some are lower, some lean left and some lean right. The end-to-end CNN directly takes a coordinate sequence as input, so variation in the coordinate-point distribution would greatly reduce the recognition rate. In order to decrease the variations in the spatial sizes and positions of characters, we employed coordinate normalization, following the method presented in [24]. Let

I_t = sqrt((x_t − x_{t−1})^2 + (y_t − y_{t−1})^2)

be the distance between two consecutive points p_t = (x_t, y_t) and p_{t−1} = (x_{t−1}, y_{t−1}). In order to normalize the coordinates to a standard interval, it is first necessary to estimate the mean of the coordinates projected on each coordinate axis. We can calculate the means µ_x and µ_y of the XY coordinates, respectively, by

µ_x = (1 / Σ_{t=2}^{T} I_t) Σ_{t=2}^{T} I_t (x_t + x_{t−1}) / 2,
µ_y = (1 / Σ_{t=2}^{T} I_t) Σ_{t=2}^{T} I_t (y_t + y_{t−1}) / 2.

The standard deviation δ_x on the x axis can be estimated as

δ_x = sqrt( (1 / Σ_{t=2}^{T} I_t) Σ_{t=2}^{T} I_t [ (x_{t−1} − µ_x)^2 + (x_{t−1} − µ_x)(x_t − µ_x) + (x_t − µ_x)^2 ] / 3 ).

The normalized trajectory needs to keep the original character shape and stroke-writing direction, so the characters are only scaled by the standard deviation δ_x. For the tth point of P, we obtain the normalized point (x̄_t, ȳ_t) by

x̄_t = (x_t − µ_x) / δ_x,   ȳ_t = (y_t − µ_y) / δ_x.

Some examples processed through the above steps are shown in Figure 3. From Figure 3, we can see that the processed coordinates are evenly distributed on both sides of (0, 0) and the number of points has been reduced from 723 to 355.
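A NumPy sketch of this path-length-weighted normalization, under the formulas above, might look as follows. The function name and layout are our own; both axes are divided by δ_x only, so the character's aspect ratio is preserved.

```python
import numpy as np

def normalize_coords(points):
    """Path-length-weighted mean/std normalization of a trajectory.

    Each segment of length I_t contributes its midpoint, weighted by I_t,
    to the means; both axes are scaled by the x-axis standard deviation
    only, preserving the character shape.  Illustrative sketch.
    """
    p = np.asarray(points, dtype=float)                # shape (T, 2)
    seg = np.linalg.norm(np.diff(p, axis=0), axis=1)   # I_t, shape (T-1,)
    total = seg.sum()
    mid = (p[1:] + p[:-1]) / 2.0                       # segment midpoints
    mu = (seg[:, None] * mid).sum(axis=0) / total      # (mu_x, mu_y)
    # second moment of x along each straight segment (closed form)
    a = p[:-1, 0] - mu[0]
    b = p[1:, 0] - mu[0]
    var_x = (seg * (a * a + a * b + b * b) / 3.0).sum() / total
    delta_x = np.sqrt(var_x)
    return (p - mu) / delta_x
```

Weighting by segment length, rather than averaging the raw points, keeps densely sampled slow strokes from dominating the estimated center.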

Designing End-to-End CNN Architecture
For the proposed architecture, a brief overview is first given. As shown in Figure 4, the preprocessed sequence of coordinates is used as input to the recognizer. A larger receptive field can be obtained by stacking convolutional layers (Conv1 and Conv2) with small convolution kernels and reducing the sequence length by downsampling, so that more discriminative features can be learned. The fully connected layer (FC) requires a fixed-size feature vector input, but the length of the coordinate sequence is variable. Therefore, it is necessary to average over the coordinate sequence to obtain a fixed-size feature vector.
For the output of the convolutional layer, the sequence mean m_i of each feature map Z_i = [z_{i1}, . . . , z_{it}, . . . , z_{iT}] is estimated by

m_i = (1/T) Σ_{t=1}^{T} z_{it}.

Since the number of output feature maps is fixed, we can combine these means into a fixed-size feature vector f = [m_1, . . . , m_c, . . . , m_C], where C is the number of feature maps. Finally, the softmax function is used to estimate the probability distribution over all classes. Next, we introduce the specific configuration of the end-to-end CNN in detail. The configuration of the end-to-end classifier can be seen in Figure 5a. In Figure 5a, "Conv k:2 × 3, s:1, 64" denotes that the kernel size of a convolutional layer is 2 × 3, the stride is 1 and the number of feature maps is 64. PReLU stands for the parametric rectified linear unit [31]. "Max-pool k:1 × 2 s:2" denotes that the max-pooling size is 1 × 2 and the stride is 2. "Dropout 0.2" denotes that the dropout rate is 0.2 [32]. "FC 160" denotes a fully connected layer with 160 channels. "Block N" represents a residual block, which is illustrated in Figure 5b, with N = 64, 128, 256.
In more detail, the preprocessed in-air handwritten character P = [p_1, . . . , p_t, . . . , p_T] ∈ R^{2×T} is directly used as the input of the classifier, where T varies with each character. In order to recognize the coordinate sequence, the network stacks multiple convolutional layers to expand the receptive field, so that the model can extract as much spatial and temporal information from the sequence as possible. P is first processed by a convolutional layer in the time dimension, with kernel size 2 × 3 and stride 1, to obtain a series of 1 × T_1 feature maps, where T_1 changes with T. After that, dropout is employed to avoid overfitting, and max pooling is used to reduce the sequence feature length, which further increases the receptive-field range and reduces the impact of zero padding on sequence recognition. Since the constructed network is very deep, residual links [33] are used to train it efficiently. As shown in Figure 5b, we construct a residual block that contains three convolutional layers. In Figure 5b, "⊕" denotes the elementwise sum, and the three convolutional layers are denoted "Conv1", "Conv2" and "Conv3", respectively. Conv1 and Conv2 directly extract sequence features with kernel size 1 × 3 and stride 1. Conv3 is designed to make the input and output feature sizes the same for the sum operation; its kernel size is 1 × 1 and its stride is 1. The residual block is repeated until the number of feature maps has increased from 64 to 256. Then, global average pooling (GAP) [34] is employed to obtain a fixed-size feature vector f. Finally, f is input into the fully connected layer (FC) and softmax is used for classification.
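The pipeline above (stem convolution, residual blocks, GAP, FC) can be sketched in PyTorch. This is a simplified reading of Figure 5, not the authors' exact configuration: block counts, pooling placement and the number of layers per block are assumptions, and batch normalization is omitted for brevity.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Residual-block sketch: two 1x3 convs plus a 1x1 shortcut conv."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.conv1 = nn.Conv1d(c_in, c_out, 3, padding=1)
        self.conv2 = nn.Conv1d(c_out, c_out, 3, padding=1)
        self.conv3 = nn.Conv1d(c_in, c_out, 1)   # match channels for the sum
        self.act = nn.PReLU()

    def forward(self, x):
        y = self.act(self.conv1(x))
        y = self.conv2(y)
        return self.act(y + self.conv3(x))       # elementwise sum (the "⊕")

class EndToEndCNN(nn.Module):
    """Hedged sketch: coordinate sequence -> convs -> GAP -> FC."""
    def __init__(self, n_classes=3811):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv1d(2, 64, 3, padding=1),      # first conv over (x, y) rows
            nn.PReLU(), nn.Dropout(0.2), nn.MaxPool1d(2),
        )
        self.blocks = nn.Sequential(
            ResBlock(64, 128), nn.MaxPool1d(2),
            ResBlock(128, 256), nn.MaxPool1d(2),
        )
        self.fc = nn.Sequential(nn.Linear(256, 160), nn.PReLU(),
                                nn.Linear(160, n_classes))

    def forward(self, x):                        # x: (batch, 2, T), T variable
        z = self.blocks(self.stem(x))            # (batch, 256, T')
        f = z.mean(dim=2)                        # global average pooling
        return self.fc(f)                        # logits; softmax at loss time
```

Because GAP averages over the time axis, the same model accepts sequences of any length T and always produces a fixed-size feature vector f.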

Datasets
We conducted experiments on the IAHCC-UCAS2016 [23] which is an in-air handwritten Chinese character dataset and the GB1 dataset in the SCUT-COUCH2009 [35] database, which is a traditional handwritten Chinese character dataset. The samples in IAHCC-UCAS2016 consist of projections on a 2D plane of a sequence of 3D coordinates recorded by a sensor worn on the fingertip. The samples in the GB1 dataset are the trajectory coordinates written directly on the tablet. Both datasets are publicly available. The GB1 dataset involves 3755 character classes of the first level set of GB2312-80 and each class has 188 patterns. The IAHCC-UCAS2016 covers 3811 Chinese character classes, and each class contains 115 samples. For each class, 80% were randomly selected as training sets and the remaining 20% as test sets.

Model Training Strategy
Our network structure was implemented in PyTorch and initialized with the default parameters of the framework. The optimizer was Adam, with the initial learning rate set to 0.001. The learning rate was decreased, with a decay rate of 0.1, when the accuracy on the training set no longer increased or increased only slowly. All experiments were conducted on an RTX 2080 Ti GPU.
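This schedule can be set up with PyTorch's built-in plateau scheduler. A minimal sketch follows; the stand-in model, the placeholder accuracy metric and the patience value are our assumptions, not the paper's exact choices.

```python
import torch

# Adam with an initial learning rate of 0.001, decayed by a factor of 0.1
# when the monitored training accuracy stops improving.
model = torch.nn.Linear(2, 3811)            # stand-in for the end-to-end CNN
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", factor=0.1, patience=1)

for epoch in range(4):                      # skeleton of the training loop
    # ... forward pass, loss, optimizer.step() would go here ...
    train_accuracy = 0.9                    # placeholder: accuracy has stalled
    scheduler.step(train_accuracy)          # triggers the 0.1x decay on plateau
```

`mode="max"` tells the scheduler that larger metric values are better, so a stalled accuracy (rather than a stalled loss) drives the decay.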

Comparison Experiments
In order to determine the appropriate mini-batch size, with the learning rate set to 0.001, the mini-batch size was set to 64, 128, 256 and 512, respectively. Similar to other studies in this field, and since the amount of data for each class in the dataset was equal, accuracy was used as the evaluation criterion. The accuracy was calculated by

Accuracy = (TP + TN) / (TP + TN + FP + FN). (8)

In Equation (8), TP is the number of samples correctly assigned to the goal class, TN is the number of samples correctly not assigned to the goal class, FP is the number of samples wrongly assigned to the goal class, and FN is the number of goal-class samples wrongly assigned to other classes. We performed five-fold cross-validation on the dataset and report the average recognition rate for each mini-batch size in Figure 6. The model converged at 20 epochs and performed best when the mini-batch size was set to 128. For the convolution operation on P, we want the sequence length before and after the convolution to be the same, so the sequence must be zero-padded. We designed three padding methods: padding1, which pads evenly at both ends of the sequence; padding2, which pads at the head of the sequence; and padding3, which pads at the end of the sequence. The change in the recognition rate during training is shown in Figure 7, and the best recognition rates of the three methods are shown in Table 2. From Figure 7 and Table 2, we can see that padding1 is the best padding method. As with using CNNs to recognize images, when a CNN is used to recognize coordinate sequences, zero padding at the edges effectively retains the information at edge positions. Compared with padding2 and padding3, padding1 achieves a recognition accuracy higher by 0.07% and 0.28%, respectively. Padding2 and padding3 both lose edge-position features to some extent.
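The three padding schemes can be illustrated with `torch.nn.functional.pad`; the toy sequence below is our own example. For a width-3 kernel with stride 1, two zeros must be added in total so the convolution output keeps the input length; the schemes differ only in where those zeros go.

```python
import torch
import torch.nn.functional as F

x = torch.arange(1.0, 6.0).view(1, 1, 5)   # toy coordinate-feature sequence

padding1 = F.pad(x, (1, 1))   # evenly filled at both ends of the sequence
padding2 = F.pad(x, (2, 0))   # filled at the head of the sequence
padding3 = F.pad(x, (0, 2))   # padded at the end of the sequence
```

All three results have length 7, so a width-3 convolution returns length 5; padding2 and padding3 shift every feature relative to the stroke it came from, which is consistent with their lower accuracy in Table 2.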
For padding3 in particular, the head-position information of the character sequence is lost, which is quite important for recognizing the character. To verify the effectiveness of the end-to-end CNN, we compared the proposed method with traditional CNN architectures on both datasets. We used the nine-layer CNN presented in [23] as a benchmark (including Cov8d, which uses eight-direction feature maps; CovCd1, which uses a combination of higher-order direction feature maps and curved feature maps; and CovCd2, which uses a combination of eight-direction feature maps, higher-order eight-direction feature maps and curved feature maps). In addition, we also compared our method with the end-to-end RNN (RNN1) [20] and the RNN combined with new computing architectures (RNN2) [4] on the IAHCC-UCAS2016 dataset. In Tables 3-6, the column "DA" indicates whether the data-augmentation technique was adopted during training, and the column "Ensemble" indicates whether the recognition decision was made by an ensemble of multiple trained models. As shown in Tables 3 and 4, the traditional CNN that recognizes directional features [23] does not directly recognize coordinate sequences, but recognizes extracted directional feature images. This indirect learning affects recognition accuracy, and it is difficult to obtain a high recognition rate when the amount of data is insufficient, so data-augmentation techniques are needed to expand the data set during training. Our method does not use data augmentation and achieves a recognition accuracy of 96.07% on the IAHCC-UCAS2016 dataset and 98.02% on the GB1 dataset. Although an RNN can directly recognize the coordinate sequence and extract the time-series features between coordinates, it is difficult for it to consider global spatial features. RNN-based methods [4,20] often use multiple models to jointly participate in recognition decisions to improve accuracy.
However, this strategy greatly increases the storage cost while the improvement in recognition accuracy is limited. As shown in Table 3, the recognition accuracy of RNN2-Ensemble is 0.8% higher than that of RNN2, but its storage cost is about 3.67 times that of RNN2. Compared with RNN, the end-to-end CNN can extract more discriminative spatiotemporal features, requires only a single model and has a storage cost of only 6.48 MB. We also compared our proposed method with traditional methods on both datasets. These methods include the nearest prototype classifier (NPC) [36], the nearest prototype classifier trained by MCE (NPC-MCE) [28], multistage classifiers (Multi1) [29], discriminative multistage classifiers (Multi2) [26], modified quadratic discriminant functions (MQDF) [25], locality-sensitive sparse representation-based classifiers (LSRC) [27] and the locality-sensitive sparse representation toward optimized prototype classifier (LSROPC) [29]. Tables 5 and 6 summarize the recognition performance of the various methods on both datasets. As Tables 5 and 6 show, deep-learning technology has huge advantages over traditional machine-learning methods: the latter all classify hand-crafted features, which inevitably loses the timing information of the trajectory sequence to some extent. Therefore, using a CNN to directly recognize coordinate sequences achieves an overwhelming performance improvement.

Conclusions
This paper proposes an end-to-end classifier based on CNN for IAHCCR. Our method achieves 96.08% recognition accuracy on the IAHCC-UCAS2016 dataset with a storage cost of 6.48 MB. Compared with the directional feature-extraction strategy that identifies temporal features indirectly, direct recognition of trajectory sequences extracts more discriminative temporal features and obtains better results, without requiring a complex feature-extraction process. Unlike RNN, the macroscopic structure of trajectory sequences can also be considered. The experimental results show that the proposed method is also well suited to OLHCCR. In future work, we plan to explore more robust and efficient CNN architectures for IAHCCR.