Improved Long Short-Term Memory Network with Multi-Attention for Human Action Flow Evaluation in Workshop

Featured Application: Our method can be used for worker action recognition and action flow evaluation in workshops, which can improve production standardization.

Abstract: As an indispensable part of workshops, the normalization of workers' manufacturing processes is an important factor that affects product quality. How to effectively supervise the manufacturing process of workers has always been a difficult problem in intelligent manufacturing. This paper proposes a method for action detection and process evaluation of workers based on a deep learning model. In this method, the human skeleton and workpiece features are separately obtained from the monitoring frame and then input into an action detection network in chronological order. The model uses the two inputs to predict frame-by-frame classification results, which are then merged into a continuous action flow and, finally, input into the action flow evaluation network. The network effectively improves the ability to evaluate action flow through an attention mechanism over the key actions in the process. The experimental results show that our method can effectively recognize operation actions in workshops and can evaluate the manufacturing process with 99% accuracy on the experimental verification dataset.


Introduction
In manufacturing, the key to quality control lies in the manufacturing process itself; thus, monitoring and control of the manufacturing process is the focus of manufacturing quality control [1]. Product quality is the result of multiple working steps, which are the basic units of the manufacturing process [2,3]. Each working step maps to a worker's process flow, that is, to the worker's processing actions on the workpiece, which consists of multiple separate processing actions [4]. Identifying workers' processing actions and processing action flows helps to judge the processing status of the workpiece and whether the processing flow is disordered, making it possible to control the quality of the final product [5]. Monitoring and controlling the quality of processes has therefore become the most important task of quality control. Manual observation, the main traditional monitoring method, is time-consuming and labor-intensive, and it is increasingly difficult to meet production needs with it. The emergence and continuous development of intelligent monitoring based on deep learning provides an effective way to monitor processing actions, and is also an important enabling technology for the future transformation and upgrading of manufacturing and management to digital and intelligent modes [6,7].
As the key technology of intelligent monitoring, action recognition has received significant attention in recent years [8][9][10]. Its research routes can be divided into two categories: image-based and human skeleton-based recognition methods.

• Image-based recognition methods

Wang et al. analyzed the changes of pixels in video sequences and used the dense trajectories (DT) [11] and improved dense trajectories (iDT) [12] methods to identify actions. Wang et al. proposed a method using a convolutional neural network (CNN) to classify trajectories [13]. Song et al. used 3D convolution to process temporal and spatial features at the same time [14]. Tran used a two-stream CNN, with one stream extracting spatial information from single-frame optical flow and the other extracting temporal information from multi-frame optical flow [15]. Image-based methods can extract more features from the whole frame, but because they pay excessive attention to the background, lighting, and other image information, much of the extracted information is redundant and these methods are less efficient and accurate.

• Human skeleton-based recognition methods

Ke et al. used the distances between human joints to generate grayscale images and then employed a CNN for classification [16]. Gaglio used 3D human posture data and three different machine learning techniques to recognize human activity [17]. Wei et al. considered high-order features such as relative motion between joints and proposed a novel high-order joint relative motion feature together with a novel human skeleton tree RNN network [18]. Yan et al. treated the skeleton as a graph and used a graph neural network for classification [19]. The accuracy of methods based on skeleton joint information is generally greater than that of image-based methods, but they consider only the movement of the human body, which is too simple a representation to further improve the accuracy of human action recognition.
Current research is still overly focused on the recognition of single actions rather than continuous action processes. However, in a complex workshop production scenario, analyzing a single action by a worker cannot meet current quality control needs. In many production processes, the continuous flow of workers' actions also needs to follow strict yet flexible process standards [20][21][22]. Therefore, it is equally important to analyze and judge the normalization of the action flow. On the other hand, the task of action recognition in a workshop differs from the same task in a natural scenario, because workers' actions mostly involve the use of tools to operate on the workpiece or the equipment. Information such as that relating to the workpiece has a vital influence on action recognition. Taking account of this situation, this paper proposes a workpiece attention-based long short-term memory (WA-LSTM) network for action detection and a key action attention-based long short-term memory (KAA-LSTM) network for action flow evaluation. The overall framework of the method is shown in Figure 1. The experimental results show that our methods can recognize workers' actions online with high accuracy and can effectively judge whether workers' action flows follow the requirements.

Methods
There are two requirements for the recognition of workers' manufacturing actions. Firstly, it is necessary to detect the individual manufacturing action of a worker in the monitoring video. We detect the action classification results frame-by-frame using the action detection network, and segment continuous actions by filtering and combining. Secondly, it is also necessary to evaluate the process level of the identified action flow to determine whether the worker's action flow meets the specifications. The recognized actions are sequentially input in chronological order and the action flow is evaluated in combination with all actions.

Manufacture Action Recognition Based on WA-LSTM
For the recognition of a single action, methods such as [10–12,17,18] need to divide the video segment in advance to obtain action fragments, which cannot meet the online recognition requirements of surveillance video. Temporal neural networks such as recurrent neural networks (RNNs) [23] and long short-term memory networks (LSTMs) [24] can effectively retain temporal information and make frame-level action classifications to meet the online needs of action recognition [25,26]. However, traditional human skeleton-based recognition methods ignore the presence of the workpiece and other factors that are critical to the recognition of manufacturing actions in the workshop. Thus, we propose the WA-LSTM method for action detection and recognition. The overall framework of WA-LSTM is shown in Figure 2.

Encoding of Worker Skeleton Sequence
A worker's action is composed of a series of static skeletons in temporal sequence. In this paper, the static skeletons of workers in each frame are defined as the action features of workers. The skeleton features obtained by different methods or formats differ. For example, the Kinect depth sensor captures the 3D coordinates of 25 joints of the human body [27], while the OpenPose framework captures the 2D coordinates of 18 joints [28]. We use OpenPose for the experiment. The skeleton feature is defined as

P = [(x_1, y_1), (x_2, y_2), ..., (x_18, y_18)]

where (x_i, y_i) represents the Cartesian coordinates of the ith joint. Thus, the skeleton feature sequence in the manufacturing process can be defined as

[P_1, P_2, ..., P_m]

where P_m represents the skeleton feature of the mth frame in the manufacturing process.
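As a concrete illustration, the per-frame encoding can be sketched as follows. This is a minimal sketch: the joint count of 18 follows the OpenPose format mentioned above, but the flat [x_1, y_1, x_2, y_2, ...] layout is an assumption made for illustration.

```python
# Sketch: encode one frame's OpenPose output as a flat skeleton feature
# vector, and stack frames into the sequence fed to the LSTM.
# The flat [x1, y1, x2, y2, ...] layout is an illustrative assumption.

NUM_JOINTS = 18  # OpenPose 2D body model

def encode_skeleton(joints):
    """joints: list of (x, y) tuples, one per joint -> flat feature vector."""
    assert len(joints) == NUM_JOINTS
    feat = []
    for x, y in joints:
        feat.extend([x, y])
    return feat  # length 2 * NUM_JOINTS = 36

def encode_sequence(frames):
    """frames: list of per-frame joint lists -> list of feature vectors."""
    return [encode_skeleton(f) for f in frames]

# Example: two frames of dummy coordinates
frames = [[(float(j), float(j)) for j in range(NUM_JOINTS)] for _ in range(2)]
seq = encode_sequence(frames)
```

Each frame thus becomes a fixed-length 36-dimensional vector, which matches the fixed-input-size requirement of the downstream LSTM.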

Feature Extraction of Workpiece and Fusion Method
Workers' actions are different from actions in a natural scenario; they are more closely related to the workshop environment, such as the workpiece and tools. These environmental factors often determine the actions of workers. Therefore, it is better to recognize workers' actions by taking these factors into account.
In this paper, we use a pretrained convolutional neural network to extract workpiece features. As shown in Figure 3, we use a simple fully convolutional network (FCN) [29], a common framework in semantic segmentation, to train a model with the capacity to segment frequently-used workpieces in one image. Our dataset consists of 500 frames extracted from the video data of the dataset that will be introduced in Section 3. We labeled the frames at the pixel level for every workpiece or tool for the image segmentation task. When extracting features from images, only the downsampling part of the network is used, with its parameters frozen. After that, a new fully connected (FC) neural network is added to acquire high-level semantic features, which can be defined as

W = [w_1, w_2, ..., w_n]

where n is the dimension of the feature vector and can be adjusted via the number of output-layer cells. The additional neural network is jointly trained with the LSTM.

Overall Workflow of WA-LSTM
For each frame in the video stream, the human skeleton features P i , representing the human skeleton features at frame i, are extracted through the OpenPose framework.
After smoothing with a Kalman filter, an algorithm for optimal time-sequence estimation [30,31], the skeleton features are input into the LSTM network, which maps the worker skeleton feature information into a higher-dimensional space to obtain a high-level feature. Meanwhile, the LSTM effectively preserves the temporal information for short-term temporal association learning. The output of the LSTM is denoted as

C_i = [c_i^1, c_i^2, ..., c_i^n]

representing the classification information of the skeleton features at moment i, where n is the number of action categories. Meanwhile, the workpiece feature W_i, representing the workpiece semantics, is extracted by the pretrained FCN with its additional FC layer.
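The smoothing step can be illustrated with a minimal one-dimensional Kalman filter applied independently to each joint coordinate. This is a sketch under a simple constant-position model; the noise parameters are illustrative assumptions, not the paper's values.

```python
# Sketch: 1D Kalman filter (constant-position model) used to smooth a
# noisy joint-coordinate track before it is fed to the LSTM.
# Process noise q and measurement noise r are illustrative assumptions.

def kalman_smooth(measurements, q=1e-3, r=1e-1):
    """Return a smoothed copy of a 1D coordinate track."""
    x = measurements[0]  # state estimate
    p = 1.0              # estimate variance
    out = []
    for z in measurements:
        # predict: constant-position model, variance grows by q
        p += q
        # update: blend prediction with the new measurement z
        k = p / (p + r)      # Kalman gain
        x += k * (z - x)
        p *= (1.0 - k)
        out.append(x)
    return out

track = [0.0, 0.1, 5.0, 0.2, 0.1]   # one outlier spike at index 2
smooth = kalman_smooth(track)
```

Running the filter over each of the 36 coordinate tracks suppresses single-frame detection spikes like the one above while following genuine motion with only a small lag.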
The output C_i of the LSTM network is fused with the workpiece semantic W_i by weighted summation, and softmax is used as the final activation function to generate the probability y_i^j of each action category j at time i:

y_i^j = softmax(λ C_i + (1 − λ) W_i)^j

where λ is the fusion weight. The generated frame-level classifications need to be further filtered and combined to obtain the temporal actions.
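Both steps, score fusion and frame-level merging, can be sketched as follows. The fusion weight and the minimum-length filter are illustrative assumptions; the paper does not specify its exact filtering rule.

```python
import math

def softmax(v):
    m = max(v)
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]

def fuse(c_i, w_i, lam=0.5):
    """Weighted sum of LSTM scores c_i and workpiece scores w_i, then softmax.
    lam is an illustrative fusion weight."""
    return softmax([lam * c + (1 - lam) * w for c, w in zip(c_i, w_i)])

def merge_frames(labels, min_len=3):
    """Collapse a per-frame label stream into (label, start, end) segments,
    dropping segments shorter than min_len (a simple illustrative filter)."""
    segments, start = [], 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            if i - start >= min_len:
                segments.append((labels[start], start, i))
            start = i
    return segments

probs = fuse([2.0, 1.0, 0.1], [1.5, 0.5, 0.2])
labels = [0, 0, 0, 1, 0, 0, 0, 2, 2, 2, 2]
segs = merge_frames(labels)
```

In the example, the isolated single-frame label 1 is discarded as noise, leaving three stable action segments, which is the kind of smoothing the frame-level output needs before sequence evaluation.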

Manufacturing Process Evaluation Based on KAA-LSTM
The manufacturing process is composed of several manufacturing actions, forming a typical sequence with obvious contextual information between actions. The importance of each action in the action sequence differs. Referring to research on semantic focus [32,33] in the natural language processing domain, we propose KAA-LSTM, which uses an attention mechanism to express these different degrees of importance. The overall framework is shown in Figure 4.

Encoding of Action Sequences
We use an LSTM for process evaluation; this requires the input to have a fixed dimension. Thus, we use the one-hot encoding method [34]. For a single action A_i, where i is the action classification category, we use the vector

A_i = [0, ..., 0, 1, 0, ..., 0]

to represent the action. The length of the vector is determined by the number of action categories; its elements are all set to zero, with only the position corresponding to i set to one. The action sequence can then be represented by a one-hot encoding sequence as follows:

[A_1, A_2, ..., A_t]

where A_t is the tth action in chronological order. The LSTM takes each one-hot vector as input in sequence. After the last input, the hidden-layer states c_i of each input are weight-summed and the sigmoid function is used to generate the probability that the process conforms to the specification. The output of the function is between 0 and 1: the higher the output, the more standardized the process.
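The one-hot step above can be sketched directly; the category count in the example is illustrative.

```python
# Sketch: one-hot encode recognized action categories so that every
# LSTM input has the same fixed dimension.

def one_hot(index, num_classes):
    """Encode action category `index` as a one-hot vector."""
    v = [0] * num_classes
    v[index] = 1
    return v

def encode_action_flow(action_ids, num_classes):
    """Map a chronological list of action ids to the LSTM input sequence."""
    return [one_hot(a, num_classes) for a in action_ids]

flow = encode_action_flow([0, 2, 1], num_classes=4)
```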

Key Action Attentional Mechanisms
In text classification, different words and sentences contain different amounts of information and the same words have different importance in different semantic contexts. The normative discrimination of the manufacturing action process has similar characteristics. The importance of each kind of manufacturing action in a particular process may be different, and the importance of the same kind of production action occurring many times may be different.
In view of the above characteristics, this paper proposes the key action attention mechanism [35] to extract the action information that is most critical to normative judgment in the production action process, and to assign attention weights according to importance so as to obtain the most identifiable feature vector. The formulas are expressed as follows:

u_i = tanh(W_s c_i + b_s)
α_i = exp(u_i^T u_s) / Σ_j exp(u_j^T u_s)
v = Σ_i α_i c_i

where u_i is an implicit representation of the state information c_i obtained through a simple FC layer, α_i is the importance of each action in the process calculated by the softmax function, and W_s, b_s, and u_s are parameters obtained by joint training. Finally, the weighted sum v is put through the sigmoid function. The key action attention mechanism quantifies the importance of each action in the action process, which assists in identifying further information for normative discrimination.
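The attention computation described above can be sketched in plain Python. This is a minimal sketch with toy dimensions: in the real network, W_s, b_s, and u_s are learned jointly, and the hidden states come from the LSTM; the values below are illustrative.

```python
import math

def softmax(v):
    m = max(v)
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]

def attention_pool(hidden_states, W_s, b_s, u_s):
    """u_i = tanh(W_s h_i + b_s); alpha_i = softmax(u_i . u_s); v = sum_i alpha_i h_i."""
    def matvec(M, x):
        return [sum(m * xi for m, xi in zip(row, x)) for row in M]
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    # implicit representation of each hidden state through an FC layer
    u = [[math.tanh(z + b) for z, b in zip(matvec(W_s, h), b_s)]
         for h in hidden_states]
    # attention weight of each action
    alpha = softmax([dot(ui, u_s) for ui in u])
    # attention-weighted sum of the hidden states
    dim = len(hidden_states[0])
    v = [sum(a * h[d] for a, h in zip(alpha, hidden_states)) for d in range(dim)]
    return v, alpha

# Toy 2-d hidden states for three actions; parameters are illustrative.
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_s = [[1.0, 0.0], [0.0, 1.0]]   # identity, for simplicity
b_s = [0.0, 0.0]
u_s = [1.0, 1.0]
v, alpha = attention_pool(H, W_s, b_s, u_s)
```

In the toy example, the third hidden state aligns most with the context vector u_s and therefore receives the largest attention weight; the pooled vector v is what is finally passed through the sigmoid.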

Case Description and Dataset
At present, there is no public dataset of manufacturing actions. This paper takes the precleaning procedure of the combustion chamber of a rocket engine as a test case. The propellant of rocket engines is a typical pyrotechnic product with strict process specification requirements for every manufacturing action. The standard process follows a fixed sequence of actions; however, in the actual manufacturing process, there is reworking and repetition. The present study is based on actual workers' precleaning processes gathered from interviews and simulated in a laboratory environment. We simulated 520 processes, including 300 standard processes and 220 nonstandard ones. Each process consists of several types of manufacturing actions, as shown in Figure 5. Our dataset was recorded using a camera with a resolution of 640 × 480 pixels and a frame rate of 15 FPS. Each video sample ranged from 10 to 18 s in length and included between four and ten actions. Standard samples were executed in accordance with the standard procedure, while nonstandard samples were not; in the latter, for example, cleaning was executed before cooling.
The main difference between our dataset and other common datasets is that each action recorded in our dataset is an operation performed on a workpiece. Datasets like MSR [36] are simply records of the body's own behavior. Furthermore, each sample of our dataset consists of several continuous actions. Samples in datasets like UCF101 [37], on the other hand, are snippets of single actions. Finally, our dataset provides a label of evaluation for every sample to produce a "good or not" classification of sequential actions. To the best of our knowledge, no other such dataset exists which meets our requirements.

Experiment and Result for WA-LSTM
The most notable feature of WA-LSTM is that it integrates the features of the workpiece and the skeleton of the worker and analyzes their temporal information through the LSTM network, so that each feature is used to maximum advantage. To verify its superiority, the following models are used for comparative experiments:

• DNN: the worker skeleton features and the workpiece features are directly input into the FC network to obtain the classification results, without considering temporal information.
• LSTM: only the skeleton features of the worker and the temporal information are considered, not the workpiece information.
Cross-entropy is used as the loss function during training. For the predicted result y_i with real label ŷ_i, the loss is defined as:

loss = − Σ_{j=1}^{n} ŷ_i^j log(y_i^j)

where n indicates the number of categories. The iterative process of loss and accuracy during training is shown in Figure 6.
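For reference, the per-frame cross-entropy can be computed as follows (a minimal sketch; the small epsilon guarding log(0) is a standard numerical precaution, not part of the paper's formulation):

```python
import math

def cross_entropy(y_pred, y_true, eps=1e-12):
    """-sum_j y_true[j] * log(y_pred[j]) over the n action categories."""
    return -sum(t * math.log(p + eps) for p, t in zip(y_pred, y_true))

# One-hot true label: only the probability of the correct class matters.
loss = cross_entropy([0.7, 0.2, 0.1], [1, 0, 0])
```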

Experiment and Result for KAA-LSTM
The manufacturing action process recognized by WA-LSTM is input into the KAA-LSTM model to discriminate standardization. To prove the performance of KAA-LSTM, we also set up two comparison experiments using the following models:

• a simple DNN model;
• an LSTM model without attentional mechanisms.
We set the true label of the standard process to ŷ = 1 and that of the nonstandard process to ŷ = 0. After activation by the sigmoid function, the model outputs a single value, y, as a normative evaluation score. Taking cross-entropy as the loss function, the loss of a single sample is expressed as:

loss = −[ŷ log(y) + (1 − ŷ) log(1 − y)]

During prediction, an output greater than 0.5 is classified as a standard sample, and the accuracy is calculated on this basis. The experimental results are shown in Table 1 [38].
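The binary loss and the 0.5-threshold accuracy can be sketched as follows (a minimal sketch; the epsilon and the example scores are illustrative):

```python
import math

def bce(y, y_true, eps=1e-12):
    """Binary cross-entropy for a sigmoid score y against label y_true in {0, 1}."""
    return -(y_true * math.log(y + eps) + (1 - y_true) * math.log(1 - y + eps))

def accuracy(scores, labels, threshold=0.5):
    """Fraction of samples whose thresholded score matches the true label."""
    preds = [1 if s > threshold else 0 for s in scores]
    return sum(p == t for p, t in zip(preds, labels)) / len(labels)

# Illustrative scores: the fourth standard sample scores below the threshold.
acc = accuracy([0.9, 0.2, 0.7, 0.4], [1, 0, 1, 1])
```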

Discussion on the Results of WA-LSTM
The convergence rate of the LSTM model is faster than that of the DNN model, and its converged loss value is smaller, which indicates that temporal information has an important influence on action recognition and promotes convergence. From the iterative process of accuracy, it can be seen that the LSTM model achieves 98% accuracy on the training set and more than 95% accuracy on the validation set, while the DNN model achieves only roughly 87% accuracy on both sets. This is because the DNN model does not take temporal information into account, making it difficult to distinguish ambiguous transitional actions. In addition, after adding the workpiece feature to the LSTM, the converged loss value is further reduced and the accuracy is further improved, indicating that the workpiece feature, used as attention information, effectively improves the performance of the original model.
Stacking the ground-truth label and the predicted label of each frame along the timeline, we obtain the result shown in Figure 7. We can see that, due to the lack of temporal information, the results of the DNN model are unstable and contain many small segments, which greatly affects the segmentation of actions. As it takes account of the context information of the action, the LSTM model produces results with better continuity, and its performance is further improved by the addition of the workpiece attention mechanism. Its predicted results better reflect the real values but show a small delay compared with the ground-truth label. At the same time, recognition mistakes occur easily at the transitions between actions, when tools are changed. We consider this to be because all kinds of action have similar probabilities while tools are being changed; the breakpoints of the action divisions marked in the ground truth also cause interference in the experimental results. In conclusion, the proposed WA-LSTM model can significantly improve accuracy and make the predicted results smoother.

Discussion on the Results of KAA-LSTM
It can be seen from Table 1 that DNN and LSTM have lower accuracy, while KAA-LSTM can achieve almost 100% accuracy in the verification set. This is due to the information mining of important actions by the attention mechanism, which enhances the discrimination ability of the model. Furthermore, the recall and precision levels of KAA-LSTM are much higher than those of the other two models. This indicates that our model is capable of a high level of discernment.

Prospect
Our research is an attempt at the normative analysis of worker action flow, which can be used to standardize work flows such as the assembly process in the workshop. Future research could develop the following points. First, the workpiece feature could be obtained in different ways, for example, via object detection or knowledge graphs. Second, our dataset is a simulation and simplification of real manufacturing processes, which involve more situations; a new dataset could be collected and analyzed in a real environment.

Conclusions
This paper discusses the application of monitoring video in the normative recognition of worker action flow. A sequence model based on workpiece attention is proposed to detect workers' manufacturing actions in the video. The experiment proves that the fusion of workpiece attention and human skeleton detection can effectively improve the accuracy and stability of action recognition. Another sequence model based on key action attention is proposed to evaluate the process of manufacturing action. Experiments show that this method can achieve 99.36% accuracy on the validation set, and can accurately identify some incorrect manufacturing processes. The method developed in this paper can be applied to monitor key stations in workshop processes and undertake the automatic supervision of workers' processing actions.
Author Contributions: Y.Y. conceived the idea; T.L. performed the static analyses; Y.Y. and J.W. designed the methodology; J.B. was responsible for project administration; X.L. provided resources; J.W. prepared the manuscript and checked the writing. All authors have read and agreed to the published version of the manuscript.