An LSTM Based Generative Adversarial Architecture for Robotic Calligraphy Learning System

Abstract: Robotic calligraphy is a very challenging task for robotic manipulators, which can sustain industrial manufacturing. The active mechanism of writing robots requires a large training set that includes sequence information of the writing trajectory. However, manually labelling such training data is time-consuming for researchers. This paper proposes a machine calligraphy learning system using a Long Short-Term Memory (LSTM) network and a generative adversarial network (GAN), which enables robots to learn and generate the sequences of Chinese character strokes (i.e., writing trajectories). In order to reduce the size of the training set, a generative adversarial architecture combining an LSTM network and a discrimination network is established for a robotic manipulator to learn the strokes of Chinese calligraphy. In particular, this learning system converts Chinese character stroke images into trajectory sequences in the absence of stroke trajectory writing sequence information. Due to its powerful learning ability in handling motion sequences, the LSTM network is used to explore the trajectory point writing sequences. Each generation process of the generative adversarial architecture contains a number of loops of the LSTM. In each loop, the robot continues to write by following a new trajectory point, which is generated by the LSTM according to the previously written strokes. The written stroke in an image format is taken as input to the next loop of the LSTM network until the complete stroke is finally written. Then, the final output of the LSTM network is evaluated by the discriminative network. In addition, a policy gradient algorithm based on reinforcement learning is employed to aid the robot in finding the best policy. The experimental results show that the proposed learning system can effectively produce a variety of high-quality Chinese stroke writing.


Introduction
Robots are playing an increasingly important role in improving the efficiency of recycling and sustaining industrial manufacturing [1]. Furthermore, robotic manipulators show their advantages in garbage sorting, waste removal, component disassembling, and so on [2][3][4]. The writing of Chinese characters is a very challenging task for robotic manipulators, since a single Chinese character is formed by organizing a set of strokes in order within a certain structure [5]. Because of this structural complexity, robotic calligraphy is often used as a test bed for evaluating control methods.
Calligraphic robots are built to learn the ways calligraphers write and then perform their own calligraphy. These kinds of robots can help people learn the fundamental skills of Chinese calligraphy. Furthermore, they can engage in the repair of calligraphic collections in order to protect cultural heritage [6,7]. Traditional calligraphic robots only mimic the writing of calligraphers with respect to the shape of Chinese characters [8]. This leads to a lack of aesthetic preferences in robotic calligraphy. Therefore, these traditional calligraphic robots are not capable of developing new writing styles [9]. The situation is worsened by limited training data sets. A new framework of robotic calligraphy that allows writing robots to learn aesthetic preferences from a small number of human calligrapher samples is therefore very meaningful.
Many learning-based approaches to robotic calligraphy have attempted to build automatic calligraphic robots. However, these methods cannot generate the correct writing sequences for Chinese strokes. There have been two classes of solutions in the literature. One is to manually pre-define the robot's end joint angles for each writing action to write Chinese characters or letters [10,11]. However, such methods may require a lot of work from human engineers. The other is to use the learning from demonstration (LfD) approach [12] and the imitation learning method [5]. Methods of this type do not need an understanding of the control or programming model of robots. However, they incur high labour costs and generalize poorly.
Furthermore, many researchers have tried to combine generative adversarial nets (GANs) with other mainstream machine learning techniques to find writing sequence information. For example, Chao et al. [13] used a GAN-based method to produce stroke trajectories. Although this method can realize the writing of various strokes, the writing sequence of strokes was generated in accordance with rules predefined by humans. We noticed that the Long Short-Term Memory (LSTM) network [14] is effective for solving time series problems. Two groups of researchers, Gregor et al. [15] and Im et al. [16], attempted to achieve sequential painting by using the LSTM network. In the field of robotics, Rahmatizadeh et al. [12] tried to use a GAN to transform an input image into a low-dimensional space and an LSTM to predict each joint value of their robot. However, all of these methods require massive training data to obtain action sequence information.
To address the above challenges, we introduce an LSTM network into a GAN-based robotic calligraphy system [13], so as to implement an LSTM-based generative adversarial architecture. In this work, the generator network inside a GAN is replaced by an LSTM. Thus, within a single generation process, the LSTM network runs multiple loops, each of which generates a new trajectory point. A calligraphic robot then uses the point to write a segment of a stroke. The written stroke in an image format is taken as input to the next loop of the LSTM network, until the whole stroke is finally written. Additionally, a reinforcement learning algorithm is adopted, using the output of a discriminator network as a reward for training the LSTM network. The main contribution of this work is that, in the absence of a robot motion trajectory dataset, the generative adversarial architecture can convert a pixel stroke image into a vector trajectory that the robot can execute, so that the robot can write high-quality Chinese character strokes and pixel image information can ultimately be used to control the robot.
The rest of this article is organized as follows. Section 2 details the calligraphy robot's learning system. Section 3 specifies the experimental setup and discusses the experimental results. Section 4 concludes the paper and gives perspectives on future work.
Figure 1a shows the training procedure of the proposed architecture for the robotic calligraphy system. The architecture consists of an LSTM-based stroke generation module and a convolutional neural network (CNN)-based discriminator module. The generation module produces, in sequence, the probability distributions of the trajectory points of a stroke. The discriminator determines whether an input image is real (training data) or fake (written by the robot). Then, the generative adversarial training scenario is used to train the entire architecture.
However, since a robot system participates in the training process, the error back-propagation method of the traditional GAN cannot be applied to the architecture. To solve this problem, with reference to our previous work [13], the policy gradient method of reinforcement learning is employed to train the system. Policy gradient methods are normally used to solve reinforcement learning problems. They aim to model and optimize the policy directly, while the goal of reinforcement learning is to encourage the agent to obtain optimal rewards. The policy is often modelled as a function parameterized by $\theta$, written $\pi_\theta(a \mid s)$, where $a$ denotes actions and $s$ denotes observations.

Framework Architecture
In the stroke generation module, the input of the LSTM is a blank image. Then, the robot obtains the stroke position from the output of the LSTM by using Gaussian sampling. Afterwards, the robot uses inverse kinematics to convert the stroke position into the manipulator's joint values. The robot uses these joint values to continue writing the stroke by linking the last point of the previous loop to the new point of the current loop. The robot captures an image of the current stroke with a camera and transmits the image to the next loop of the LSTM network. This process is repeated until all the points are generated.
Figure 1b illustrates the usage of the trained robotic calligraphy system. Only the LSTM-based generation module is used during system operation. A user first inputs a stroke type and a stroke style into the generation module. The module then generates all the robot joint values of a stroke and the robot writes out the whole stroke in turn. A detailed description of the implementation of the three modules in the framework is given below.
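The generation loop described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: `policy_mean`, `write_segment` and `capture_image` are hypothetical callables standing in for the LSTM with its fully connected layers, the robot arm and the camera, respectively.

```python
import random


def rollout(policy_mean, write_segment, capture_image, k, seed=0):
    """Generate one stroke as k trajectory points, feeding each captured image back in.

    policy_mean, write_segment and capture_image are hypothetical stand-ins
    (assumptions) for the LSTM, the robot arm and the camera.
    """
    rng = random.Random(seed)
    image = [[0] * 28 for _ in range(28)]  # blank 28 x 28 input for the first loop
    points = []
    for _ in range(k):
        mean = policy_mean(image)  # predicted mean of the 3-D point
        # Unit-variance Gaussian sampling around the predicted mean.
        point = tuple(rng.gauss(m, 1.0) for m in mean)
        if points:
            # Link the last point of the previous loop to the new point.
            write_segment(points[-1], point)
        points.append(point)
        image = capture_image()  # the camera image becomes the next input
    return points, image
```

With `k` loops the sketch produces `k` points and `k - 1` written segments, since the first loop only generates a coordinate point.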

Stroke Generation Module
The stroke generation module is implemented using an LSTM network for a robot. An example of using a five-loop LSTM network for stroke generation is shown in Figure 2. The LSTM network generates a probability distribution at each loop. The robot obtains a three-dimensional coordinate value, $M_i$, by sampling from this distribution. The robot subsequently uses inverse kinematics to convert the stroke position into its joint values. The robot connects the previous trajectory point to the current trajectory point on the drawing board until all the trajectory values of the vector $M = [M_0, M_1, \ldots, M_{k-1}]$ are obtained. The number of loops, $k$, of the LSTM network is preset according to the complexity of the strokes. For example, for simple strokes, the LSTM network only undergoes two loops. In other words, the LSTM outputs two coordinate values, and then the robot connects the two coordinate values in sequence. For complex strokes, the number of loops of the LSTM network is set to a larger value. A complex stroke requires the LSTM network to undergo five loops. In this case, the robot obtains a trajectory vector $M = [M_0, M_1, \ldots, M_4]$, which means that the robot needs to write five times in succession to complete a stroke.
The LSTM network used in this work is a 60-dimensional, single-hidden-layer recurrent neural network for all strokes. In each loop of the LSTM, the input is labeled $p_{i-1}$. In the first loop, the input is a 28 × 28 pixel blank vector, $p_0$. The image samples are averaged as the global feature of the network and used as the initial value, $h_0$, of the LSTM hidden layer. $c_0$ is a random vector of 28 × 28 dimensions, also used as a global feature of the network. The inputs of the $i$th loop of the LSTM network are the 28 × 28 pixel vector $p_{i-1}$, $h_{i-1}$ and $c_{i-1}$, while the outputs are $h_i$ and $c_i$, which are formulated as follows:

$$h_i, c_i = \mathrm{LSTM}(p_{i-1}, h_{i-1}, c_{i-1})$$
where $h_i$ is used to predict the mean, $\mu_i$, of the Gaussian distribution of the three-dimensional coordinates through a fully connected layer. $\mu_i$ is defined as follows:
$$\mu_i = \mathrm{sigmoid}(f(h_i))$$
where $f(\cdot)$ represents the two fully connected layers of the neural network. The sigmoid function maps variables to values between 0 and 1. The covariance of the Gaussian distribution is fixed to the identity matrix, $E$, whose diagonal entries are 1. The robotic arm samples from the Gaussian distribution, $N(x \mid \mu_i, E)$, to generate the three-dimensional coordinates $M_i = (x_i, y_i, z_i)$ to be written, represented by:
$$M_i = t \cdot N(x \mid \mu_i, E)$$
where $t$ is the maximum value, [28, 28, 4], and $N$ is the Gaussian distribution.
Figure 3 shows the experimental system and the configuration of the robot. The robotic system used in this experiment includes a three-degree-of-freedom robot arm, a camera, and a writing board. The tip of the soft pen is mounted on the arm and operated within the working range of the arm. $l$ denotes a mechanical linking rod, $(x, y, z)$ denotes the coordinate axes of the robot, and $J$ denotes the steering gear of the robotic arm. The robot converts the three-dimensional coordinate point $M_i$ into three joint values $\theta_i = (\theta_1, \theta_2, \theta_3)$ by inverse kinematics. The camera is used to capture the completed characters written on the board, and the captured images are then sent back to the neural network. The calculation is expressed as:
$$\theta_i = T(M_i)$$
where $T(\cdot)$ represents the transformation process of inverse kinematics, in which $d_i^2 = x_i^2 + y_i^2$. The robot continues to write the stroke from the trajectory point generated in the previous loop, i.e., it connects the coordinate point $M_{i-1}$ of the previous loop to the coordinate point $M_i$ of the current loop. In the first loop of the LSTM, only the coordinate point is generated. In addition, the camera next to the robot captures the written result, which is binarized and trimmed to an image of 28 × 28 pixels. This process is expressed as $W(\cdot)$.
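To make the inverse kinematics step concrete, the following is a minimal sketch for a generic three-degree-of-freedom arm. It is not the robot's actual transformation: the geometry (a base rotation followed by a planar two-link arm with the shoulder at the origin) and the link lengths `l1` and `l2` are assumptions chosen for illustration. The planar distance `d`, with `d**2 = x**2 + y**2`, plays the same role as in the text.

```python
import math


def inverse_kinematics(x, y, z, l1=20.0, l2=20.0):
    """Joint angles for an assumed base-rotation plus two-link planar arm.

    l1 and l2 are hypothetical link lengths, not values from the paper.
    """
    d = math.hypot(x, y)  # planar distance, d^2 = x^2 + y^2
    theta1 = math.atan2(y, x)  # base rotation towards the target
    cos_elbow = (d * d + z * z - l1 * l1 - l2 * l2) / (2.0 * l1 * l2)
    if not -1.0 <= cos_elbow <= 1.0:
        raise ValueError("target is outside the arm's reachable workspace")
    theta3 = math.acos(cos_elbow)  # elbow angle (one of the two solutions)
    theta2 = math.atan2(z, d) - math.atan2(
        l2 * math.sin(theta3), l1 + l2 * math.cos(theta3)
    )
    return theta1, theta2, theta3


def forward_kinematics(theta1, theta2, theta3, l1=20.0, l2=20.0):
    """Inverse of the function above, used here only to check the result."""
    d = l1 * math.cos(theta2) + l2 * math.cos(theta2 + theta3)
    z = l1 * math.sin(theta2) + l2 * math.sin(theta2 + theta3)
    return d * math.cos(theta1), d * math.sin(theta1), z
```

Passing the returned angles through `forward_kinematics` should recover the target point up to floating-point error, which is a convenient sanity check before sending joint values to hardware.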
The image is used as the input $p_i$ of the LSTM for the next loop. The writing result of the robot system is expressed as:
$$p_i = W(T(M_i))$$
Finally, the output of the generation module, $G(p_0, h_0, c_0)$, is the image of the stroke written after the last loop.
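The image processing step $W(\cdot)$ can be illustrated with a small sketch. The details are assumptions: the threshold of 128, the interpretation of trimming as cropping to the bounding box of the ink, and the nearest-neighbour resize are illustrative choices, and images are plain nested lists of grey values with dark ink on a light board.

```python
def binarize(gray, threshold=128):
    """Mark pixels darker than the (assumed) threshold as ink (1), others as 0."""
    return [[1 if v < threshold else 0 for v in row] for row in gray]


def trim(binary):
    """Crop to the bounding box of the ink; return the input unchanged if blank."""
    rows = [r for r, row in enumerate(binary) if any(row)]
    if not rows:
        return binary
    cols = [c for c in range(len(binary[0])) if any(row[c] for row in binary)]
    return [row[cols[0]:cols[-1] + 1] for row in binary[rows[0]:rows[-1] + 1]]


def resize_nearest(img, size=28):
    """Nearest-neighbour resize to size x size pixels."""
    h, w = len(img), len(img[0])
    return [
        [img[r * h // size][c * w // size] for c in range(size)]
        for r in range(size)
    ]


def preprocess(gray, threshold=128, size=28):
    """Binarize, trim and resize a captured image to the LSTM input format."""
    return resize_nearest(trim(binarize(gray, threshold)), size)
```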

Stroke Discrimination Module
The stroke discrimination module is built on a CNN. The input of the stroke discrimination module falls into two categories. The first is the image $X_{fake}$, the writing result of the robot captured by the camera, which is binarized and trimmed. The second is the real stroke image $X_{real}$. The size of the input image layer is set to 28 × 28, while the size of the network's output is 1. The output predicts the probability that an input $X$ comes from the real images, $X_{real}$, rather than from the images generated by the robot, $X_{fake}$. The hidden part of the CNN consists of two convolutional layers and two fully connected layers. The network's structure is shown in Figure 4. The image is transformed by the convolutional layers into a 320-dimensional feature and passed through the fully connected layers to produce the one-dimensional output.
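As a worked check of the 320-dimensional feature size, the sketch below computes the flattened output of two convolutional layers applied to a 28 × 28 input. The kernel size (5 × 5, no padding), the 2 × 2 pooling after each convolution and the 20 output channels are assumptions; the text does not state them. They are simply one common configuration that yields 320.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def flattened_features(size=28, kernels=(5, 5), pool=2, last_channels=20):
    """Flattened feature count after conv + pool stages (assumed configuration)."""
    for k in kernels:
        size = conv_out(size, k) // pool  # convolution, then pooling
    return last_channels * size * size
```

Under these assumptions the spatial size goes 28 → 24 → 12 → 8 → 4, and 20 channels of 4 × 4 features give 20 × 16 = 320 values.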

Training Algorithm
The objective function of this architecture is expressed as:
$$\min_G \max_D V(D, G) = E[\log D(x)] + E[\log(1 - D(G(p_0, h_0, c_0)))]$$
where $D(\cdot)$ represents the output of the CNN, $G(\cdot)$ represents the output of the LSTM network and $E[\cdot]$ represents the expected value. The target of the CNN is expressed as the following loss function:
$$L_D = -E[\log D(x)] - E[\log(1 - D(G(p_0, h_0, c_0)))]$$
where $D(x)$ represents the score of the CNN for a real stroke sample, and $D(G(p_0, h_0, c_0))$ represents the score of the CNN for a stroke sample generated by the LSTM network, ranging from 0 to 1.
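A minimal sketch of the discriminator loss, assuming the standard binary cross-entropy form used for GAN discriminators. The function name and the small constant `eps`, which guards against taking the logarithm of zero, are illustrative choices.

```python
import math


def discriminator_loss(real_scores, fake_scores, eps=1e-12):
    """Cross-entropy loss from CNN scores in [0, 1] (standard GAN form, assumed)."""
    real_term = sum(math.log(s + eps) for s in real_scores) / len(real_scores)
    fake_term = sum(math.log(1.0 - s + eps) for s in fake_scores) / len(fake_scores)
    return -(real_term + fake_term)
```

A discriminator that scores real strokes near 1 and robot-written strokes near 0 obtains a lower loss than one that outputs 0.5 for everything, for which the loss equals 2 ln 2.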
For the LSTM network to obtain higher rewards, it must guarantee the quality of each trajectory point. Therefore, the goal of the LSTM network is to increase the occurrence probability of trajectories that score highly in the CNN. The loss function of the LSTM network is as follows:
$$J(\theta) = E\left[\prod_{i=0}^{k} \log \mathrm{prob}(\mathrm{LSTM}(p_i, h_i, c_i)) \cdot D(G(p_0, h_0, c_0))\right]$$
where $\log \mathrm{prob}(\mathrm{LSTM}(p_i, h_i, c_i))$ represents the probability of the output trajectory point of the LSTM for the $i$th loop; $\prod_{i=0}^{k} \log \mathrm{prob}(\mathrm{LSTM}(p_i, h_i, c_i))$ represents the occurrence probability of the stroke, calculated by multiplying the likelihoods of all the trajectory points in the stroke; and $D(G(p_0, h_0, c_0))$ represents the output of the CNN, whose values range from 0 to 1.
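The gradient of the objective function $J(\theta)$ with respect to the LSTM network parameters, $\theta$, is derived as:
$$\nabla_\theta J(\theta) = E\left[\nabla_\theta \prod_{i=0}^{k} \log \mathrm{prob}(\mathrm{LSTM}(p_i, h_i, c_i)) \cdot D(G(p_0, h_0, c_0))\right]$$
Since the expectation $E[\cdot]$ can be approximated by sampling, the parameters of the LSTM network are updated as follows:
$$\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$$
where $\alpha$ is the learning rate. The training procedure is presented as pseudocode in Algorithm 1.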
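The sampled policy gradient update can be illustrated on a toy problem. This sketch makes strong simplifying assumptions: the policy parameter is the Gaussian mean itself rather than the LSTM weights, a single trajectory point is generated, and the hypothetical `gaussian_reward` stands in for the CNN score. For a unit-covariance Gaussian, the gradient of the log-probability with respect to the mean is `x - mu`, which is what the update uses.

```python
import math
import random


def gaussian_reward(x, target=(2.0,)):
    """Hypothetical reward in (0, 1], standing in for the CNN score."""
    return math.exp(-0.5 * sum((a - b) ** 2 for a, b in zip(x, target)))


def reinforce_step(mu, reward_fn, alpha=0.1, n_samples=64, rng=None):
    """One sampled policy gradient step on the mean of a unit-variance Gaussian."""
    rng = rng or random.Random(0)
    grad = [0.0] * len(mu)
    for _ in range(n_samples):
        x = [rng.gauss(m, 1.0) for m in mu]  # sample an action from the policy
        r = reward_fn(x)
        for j in range(len(mu)):
            # Reward-weighted gradient of log N(x | mu, I) w.r.t. mu is (x - mu).
            grad[j] += r * (x[j] - mu[j]) / n_samples
    # Gradient ascent: move the mean towards highly rewarded samples.
    return [m + alpha * g for m, g in zip(mu, grad)]
```

Repeating `reinforce_step` with a shared random generator moves the mean towards the region where the reward is high, which mirrors how the LSTM is encouraged to produce trajectory points that the CNN scores highly.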

Algorithm 1 Training Procedure Pseudocode
Require: Real stroke images database $X_{real}$, mean of real stroke images $h_0$, random number $c_0$, blank vector $p_0$.
Initialize LSTM and CNN network with random weights;
repeat
    for g-step do
        Input $p_0$, $h_0$, $c_0$ into LSTM;
        for i in 0 : k do
            Use Equation (4) to sample a trajectory point $M_i$;
            Robot writes the trajectory, which is captured as an input image for the next cycle;
        end for
        Update LSTM parameters via Equation (15);
    end for
    for d-step do
        Combine the new stroke images $X_{fake}$ with real stroke images $X_{real}$;
        Train CNN by Equation (12);
    end for
until GAN converges
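The alternating structure of Algorithm 1 can be sketched as a plain training loop. All three callables are hypothetical placeholders for the stroke rollout, the LSTM update and the CNN update, and a fixed number of rounds replaces the convergence test, which is an assumption made to keep the sketch finite.

```python
def train(generate_stroke, update_generator, update_discriminator, real_images,
          rounds=3, g_steps=1, d_steps=1):
    """Alternate generator and discriminator updates for a fixed number of rounds.

    generate_stroke, update_generator and update_discriminator are hypothetical
    callables (assumptions), not functions defined by the paper.
    """
    history = []
    for _ in range(rounds):
        fake_images = []
        for _ in range(g_steps):
            points, image = generate_stroke()  # robot writes one full stroke
            update_generator(points, image)  # policy gradient update of the LSTM
            fake_images.append(image)
        for _ in range(d_steps):
            # Train the CNN on real images mixed with the newly written ones.
            update_discriminator(real_images, fake_images)
        history.append(len(fake_images))
    return history
```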

Training Data
The proposed architecture was applied to the task of robotic writing of Chinese character strokes, which was also used for system verification and evaluation. The stroke training images were extracted from Chinese character images, normalized and classified. Then, the CNN and the LSTM network were trained together with the robot's writing actions, and the learning performance of the policy gradient was obtained.
First, we adopt the method proposed in [17] to automatically extract the strokes of a character. Next, the stroke images are converted into binary form. Another CNN is used to classify the binary strokes into 31 categories, which are stored in the database. In addition, we calculate the mean of all the stroke images of a type, $S = [S_1, S_2, \ldots, S_m]$, as the $h_0$ of the LSTM to help the network learn the features of the stroke style, where $m$ is the number of images in this category. $h_0$ is given as:
$$h_0 = \frac{1}{m} \sum_{j=1}^{m} S_j$$
In this experiment, we selected six different Chinese character strokes to train and test our proposed architecture. Each class of strokes has 500 sample images. Figure 5 shows the training samples used in the experiment for the six types of stroke; each row shows one type of stroke with several variants. From Figure 5, we can see that the strokes of the same type are not exactly identical to each other. The types of strokes from top to bottom are: "horizontal stroke", "short left-falling stroke", "right-falling stroke", "vertical, turn-right and hook stroke" and "horizontal hook stroke".
In the early stage of training, the writing results of strokes were chaotic; the shapes were not close to the target stroke. In the middle stage of training, the written results came close to the target stroke. However, large differences in detail between the results and the target image still existed. In contrast, the final stage of writing shows a high level of quality, and the shapes produced at this stage are very similar to the target strokes.
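As a small illustration of the mean stroke image $h_0$ introduced earlier in this section, the sketch below averages a list of equally sized images pixel by pixel. Representing images as nested lists and the function name `mean_image` are illustrative assumptions.

```python
def mean_image(images):
    """Pixel-wise mean of m equally sized images given as nested lists."""
    m = len(images)
    height, width = len(images[0]), len(images[0][0])
    return [
        [sum(img[r][c] for img in images) / m for c in range(width)]
        for r in range(height)
    ]
```

For the stroke data described here, `images` would hold the binary stroke images of one category, each of 28 × 28 pixels.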

Training Process and Writing Results
In Figure 6, we also noticed that, during the training process, the writing sequences of some strokes were in accordance with human writing habits even though no intermediate sequence information was provided. This indicates that it is possible to write strokes that conform to human writing habits even when the LSTM has no access to the intermediate writing sequence. However, not all the strokes were written following the human writing sequence; future efforts will focus on this problem.
Figure 7 illustrates the evaluation results of the LSTM network's output for the "vertical, turn-right and hook stroke". The starting value for the LSTM network is about 0.7; that is, since the LSTM network has not been fully trained, the CNN can determine with high probability whether a stroke is from the robot or not. As the LSTM network continued to improve, the loss function of the CNN gradually decreased and became stable. Since the robot participated in the generation process of the LSTM network, some unexpected errors still existed in the written results; these errors prevented the CNN from achieving the standard loss. However, in this experiment, even though the loss cannot reach the lowest value, the written results are still acceptable to human users.
Figure 8 shows the final writing results of the six strokes, all of which are trimmed but not binarized. The variety of results was obtained from the random vector $c_0$ in the LSTM. The results show that the trajectories of the same stroke differ, and that part of the results are written in accordance with human writing habits. Meanwhile, we found that the trajectories of some strokes are very close to those written by humans. For example, the first stroke in (e) and the last stroke in (f) have a beautiful appearance.

Conclusions
This paper proposes a robotic calligraphy system based on an LSTM and a generative adversarial network. The system enables the robot to learn and generate trajectory writing motions of Chinese character strokes independently. Without writing sequence information, the method can convert input stroke images into robot motion sequences, thereby enabling the robot to write high-quality Chinese character strokes. Meanwhile, the robot system can write some strokes that conform to human writing sequences without any rules pre-defined by humans. Experimental results based on six strokes demonstrated the effectiveness of the proposed method, with several results attaining the human writing level.
Although our approach is effective, it can be improved in the following two aspects. First, our approach can currently write only Chinese character strokes; how to write a complete Chinese character remains an open problem that is worth further consideration. Second, the proposed work cannot guarantee that all strokes have a correct writing sequence; thus, further effort is required.