An Automated Recognition of Work Activity in Industrial Manufacturing Using Convolutional Neural Networks

Abstract: The automated assessment and analysis of employee activity in a manufacturing enterprise operating in accordance with the concept of Industry 4.0 is essential for a quick and precise diagnosis of work quality, especially in the process of training a new employee. In the case of industrial solutions, many approaches involving the recognition and detection of work activity are based on Convolutional Neural Networks (CNNs). Despite the wide use of CNNs, it is difficult to find solutions supporting the automated checking of work activities performed by trained employees. We propose a novel framework for the automatic generation of workplace instructions and real-time recognition of worker activities. The proposed method integrates a CNN, a CNN combined with a Support Vector Machine (SVM), and a CNN combined with a region-based CNN (YOLOv3-Tiny) for recognizing and checking the completed work tasks. First, video recordings of the work process are analyzed and reference video frames corresponding to work activity stages are determined. Next, work-related features and objects are determined using the CNN with SVM (achieving 94% accuracy) and the YOLOv3-Tiny network, based on the characteristics of the reference frames. Additionally, a matching matrix between the reference frames and the test frames was built, using the mean absolute error (MAE) as a measure of the error between paired observations. Finally, the practical usefulness of the proposed approach was demonstrated by applying the method to support the automatic training of new employees and to check the correctness of their work on solid fuel boiler equipment in a manufacturing company. The developed information system can be integrated with other Industry 4.0 technologies introduced within an enterprise.


Introduction
The Fourth Industrial Revolution, often known as Industry 4.0, is based on the Industrial Internet of Things (IIoT) and other technology enablers such as Artificial Intelligence (AI), digitization, and automation [1]. Its goal is to establish direct communication between industrial machinery, people, and processes. More importantly, by tapping and analyzing the data generated in this way, the IIoT can bring considerable gains in productivity, product quality, and safety through the proactive detection of problems. While most IIoT research is currently focused on the predictive maintenance of industrial machines (unplanned production stoppages result in significant increases in costs and lost plant productivity), monitoring, assessing, and improving worker productivity and performance is a future challenge of the Industry 4.0 system [2]. In fact, human workers are the most dynamic factor in any advanced intelligent manufacturing system, so any development in this area must account for the concept of human-centered intelligent manufacturing (HCIM) [3]. To develop such human-centered systems, the main task is to understand the human behavior that leads to optimal work performance. This field of research is especially significant for large manufacturing plants, where most production tasks are carried out by several workers on a production line that follows complex work routines. The factory workers' assembly labor is still at the heart of the industrial manufacturing system, and improving assembly work is one of the most critical tasks for increasing productivity [4]. Awareness of worker activities in spatially extended and distributed manufacturing facilities is a prerequisite for efficient workflow organization [5]. For example, a worker in a conveyor line-based production system repeatedly carries out predetermined operation procedures, each of which consists of a sequence of operations.
Automatic and accurate worker activity detection is important for work performance management [6], evaluating work efficiency [7], assessing workloads and reducing the risk of injuries [8], and preventing safety accidents [9]. In general, it contributes to the sustainability of work practices [10]. However, observing and analyzing the worker activity workflow together with machines and tools in the middle of the industrial production process in the real-world manufacturing environment is difficult.
The growing need for IT innovation in manufacturing enterprises, in line with Industry 4.0, has created the challenge of adopting advanced computer vision methods and Convolutional Neural Networks (CNNs) to recognize work activity and automatically generate work instructions in the workplace [11]. The use of deep CNNs to solve the problem of recognizing working practices is promising [12-14] due to the integration of three elements: (1) object detection, (2) human pose, and (3) the recognition of work activity [15,16]. Applications developed in this manner, and based on such manufacturing technologies, are capable of learning by doing and, more importantly, of self-improvement [17].
Nowadays, managers are looking for innovative solutions that help when hiring new employees by training them individually without involving other employees, that increase innovative working behaviour in the workplace [18], and that keep employees interested in boring day-to-day activities [19]. Current studies of manual manufacturing work often rely on security camera monitoring, checklists, and (often imprecise) work logs. The automated assessment and analysis of employee activity is essential for a quick and precise diagnosis of work quality, especially in the process of training a new employee. Overall productivity assessment, progress review, labor training programs, and safety and health management all require effective and timely analysis and tracking of personnel operations. Currently, the employee training process is costly and time-consuming, and it requires the involvement of experienced employees who supervise and control the process. The implementation of a dedicated information system based on deep learning (DL) techniques can streamline the training by checking the employee's progress and correcting any mistakes.
Human action recognition has been used as a method of automatically analyzing and comprehending worker activities to provide real-time help and facilitate worker-machine collaboration [20]. The integration of ambient sensing technologies such as wearable sensors or surveillance cameras, and artificial intelligence-based analysis using deep learning models leads to the rise of the Worker 4.0 [21] that implements the main principles and behavior of workers in an Industry 4.0 scenario.
The digitization of the industrial workplace through the ubiquity of sensors, combined with digital information systems and intelligent monitoring, generates huge amounts of data every day, which capture the factual manual workflows [22]. The creation of workflow models to control, analyze, and optimize such industrial workflows, if done manually, is time consuming and costly [23]. To make the most of available monitoring data, we need to create a workflow recognition framework that can automatically extract features from uncut videos to recognize human and machine behavior. In this work, we have developed an information system supporting the automatic training of new employees and checking their work. We have built a training set for workers' service procedure when checking a solid fuel boiler in a manufacturing company, by collecting data from videoed work activity sequences, using the DL approach. The camera is the most popular type of technology when it comes to recognizing activity and can be used for video or image-based recognition, using trained models to capture various activities [24]. According to Ijjina and Chalavadi [25] and Chen et al. [26], using a camera with deep learning techniques would be a good solution for detecting activity.
In our approach, the stages of the procedure and their durations were extracted and determined. Next, the features obtained from the video are given as input to a CNN to learn the discriminating features. We use the objects of the workplace to assist the detection of each stage of the procedure. To identify objects in an image, a series of experiments was carried out using a CNN with a support vector machine (SVM), a CNN with Softmax, and an R-CNN. The main contributions of this paper are as follows.
(1) For the workflow recognition problem, we propose an integrated approach combining CNN with SVM, CNN with Softmax, and R-CNN, thus improving the accuracy of recognition.
(2) We design a framework for the automatic generation of work instructions based on key objects in each stage of the procedure, which is the first attempt to combine real-time action recognition tasks with specific practical application scenarios.
(3) We demonstrate the practical usefulness of our proposed approach by applying the method to practical systems, supporting the automatic training of new employees, and checking their work on a solid fuel boiler, in a manufacturing company.

Research Literature
The domain of worker activity recognition and work process discovery in industrial environments has multiple dimensions, such as the type of recognition (supervised, unsupervised, and semi-supervised), the type of sensor used to recognize worker movement (motion-, vision-, sound-, and radio signal-based), and sensor location (wearable, ambient, and attached to objects) [27]. In the computer vision literature, the task of recognizing an image is defined as recognizing that an object belongs to certain classes [28]. Object detection algorithms 'recognize' on the basis of training data containing the objects, together with information about their classification [29,30]. In the deep learning (DL) approach, algorithms are designed to model complex levels of data abstraction using multiple layers of non-linear transformations [31], with artificial neural networks assuming the role of both generator and classifier of diagnostic features. Computer image recognition systems have found applications in industry, with solutions implemented using both machine learning methods and DL techniques [32]. Image recognition systems use feature classification and reduction techniques based on Kernel Principal Component Analysis (KPCA), Linear Discriminant Analysis (LDA), Support Vector Machine (SVM), Hidden Markov Model (HMM) [33], Decision Trees (DT), K-Nearest Neighbours (KNN), Random Forest (RF), and Discriminant Analysis (DA) [34].
The limitations related to the machine learning approach, consisting of processing natural data in its raw form, resulted in the introduction of solutions based on the DL techniques. Thanks to this approach, the data analyzed was transformed into an appropriate, internal representation (feature vector) from which the learning subsystem can detect or classify patterns [35]. The DL techniques based on Artificial Neural Networks (ANNs) are used in human activity recognition systems [36], systems with Myo bands [33,37], or with sensors such as Microsoft Kinect [38,39] and Intel RealSense [40], accelerometer, gyroscope, or sensors of mobile devices (such as smartphone) [41] or wearable devices [42] as well as in systems based on video analysis [11]. To recognize a worker in action, smart manufacturing systems may use the ultrasonic sensors, Inertial Measurement Unit (IMU) [43], and the surface electromyography (sEMG) signals obtained from a Myo armband with Discrete Fourier Transformation (DFT) and Convolutional Neural Network (CNN) [37].
Activity recognition in the manufacturing area with ultrasonic and IMU sensors has been used to recognize worker activity in bicycle maintenance scenarios, in car manufacturing and to capture arm movements for classification into five activities using signals from a smartwatch for industrial assembly lines, so that factory working time can be estimated [37]. One interesting approach, when analyzing human activity, is based on the Wi-Fi signal analysis method [44]. The principle of wireless detection is that human activities affect the propagation of the RF signal; different activities can cause different patterns of signal change. These techniques are used to process signals collected from ambient and wearable sensors (motion, proximity, microphone, video sensors, accelerometers, Body Sensor Networks, gyroscopes). The technique has been proposed to detect human activities [45], detect falls [46], recognize person's gait [47], predict body pose [48,49], classify gestures [50] and extract information on movement during interactive games and exercises [38]. Such systems may find applications in intelligent home healthcare systems and assisted living environments [51] and can monitor patients by diagnosing their health and controlling their drug intake. They can be used for the automatic surveillance of public places and to detect criminal activity [52], as well as to observe the worker's behaviour at workplace for the signs of fatigue or stress [53]. The activity signature patterns observed in the signals received from wearable motion sensors of ubiquitous smartphones are analyzed to recognize activities performed by different construction workers [54]. Previous studies have either employed remote-monitoring sensors such as RGB+D cameras [23] and RFIDs; or worker-attached wearable sensors such as accelerometers, gyroscopes and magnetometers [55].
Thanks to advancements in computer vision algorithms, practitioners and researchers may now use camera sensors to obtain semi-real-time information on worker actions at a low cost [56]. Object detection, object tracking, and hence activity recognition were all part of the development of camera-based systems [57]. In the case of industrial solutions, the DL approach is used in autonomous mobile robots, vision systems for cars, speech recognition, smartphones, digital cameras, and robot control. CNNs currently dominate almost all tasks related to recognition and detection. Despite the wide use of CNNs, it is difficult to find solutions supporting the automated control and evaluation of activities performed by an employee in enterprises. CNN models based on data from various types of sensors are a frequent element of activity control systems in the context of the performance and evaluation of work. Al-Amin et al. [20] presented a method in which workers wear two devices on their hands to acquire IMU data during assembly work. Two CNN action recognition models, a left-hand model and a right-hand model, were developed independently and fused together to produce a better activity recognition result. The action recognition models are refined through transfer learning, which allows them to adapt to new personnel. The approach was validated in the assembly of Bukito 3D printers. Angah and Chen [58] used Mask R-CNN for tracking multiple workers on a construction job site. Multi-Object Tracking Accuracy (MOTA), which takes the mismatch problem into account, was used to evaluate the tracking performance. Hu et al. [59] used structured two-stream convolutional neural networks (CNNs) to recognize the behavior of workers and machines. The CNNs extracted the spatial-temporal activity features and included an attention mechanism to detect important behavior.
The literature also includes models for recognizing employees' dangerous activities using a convolutional neural network and an LSTM network [60]. Zhao and Obonyo [61] proposed deep neural network models integrating a CNN with Long Short-Term Memory (LSTM) for recognizing construction workers' postures from motion data captured by wearable Inertial Measurement Unit (IMU) sensors. Similarly, Yang et al. [62] used on-body inertial sensors and deep learning to analyze the workload of the worker's lower body during a physical load-carrying task. The LSTM-based technique is also used to analyze emotions based on signal changes from EEG [63]. Gong et al. [64] proposed a deep learning model for recognizing the activities of workers on an offshore drilling platform, using the characteristics of the human body's key points, which remain unaffected by complex background noise, to assist the detection of the human target. HMM and Naive Bayes classifiers were used to recognize employee activity in production processes using the Kinect sensor. Zeng et al. [65] proposed isolating discriminant features to recognize activity, based on a CNN and the sensors of mobile devices. Jaouedi et al. [66] presented an approach based on an analysis of video content in which the features are based on all visual characteristics of each frame of a video sequence, using a Gated Recurrent Unit (GRU) recurrent neural network (RNN) model. Pohlt et al. [67] suggested the Inflated 3D Convolutional Network (I3D ConvNet) as an image encoder and Graph Convolutional Networks (GCN) as a keypoint extractor for worker activity recognition. Tao et al. [68] performed assembly operation recognition in an HCIM environment using image frames obtained from a visual camera. The recognition is performed in real time using a deep learning model trained with a transfer learning approach.
Son et al. [69] adopted very deep residual networks (ResNet-152) for feature map extraction and Faster R-CNN (regions with CNN features) for labeling, to detect construction workers with different poses and against varying backgrounds in industrial image sequences. Sun et al. [70] employed generative adversarial networks (GANs) for the estimation of human body joints and treated worker efficiency analysis as a temporal action localization problem. First, a teacher performs exemplar work activities, which are recorded in a reference video. Then, the video of activities performed by the worker is matched against the reference video using dynamic time warping on invariant spatio-temporal features extracted from the worker's body posture sequences. A detailed analysis of the literature in the form of a comparison study was made in our previous works [11,71]. In this work, the main contribution is to build Algorithm 5 and Algorithm 6 and to apply the proposed framework in the form of an integrated system to support the automatic training of new employees.
Summarizing, the DL approach is used in many fields, as the authors have presented in [11]. However, in the case of work quality and performance control systems on the market, there is a research gap regarding solutions that use the DL approach in the process of training and verifying employee skills in a manufacturing company and that employ the benefits of smart learning and artificial intelligence [72]. There are attempts to implement solutions checking the activity of people in the workplace (such as the correct sitting posture of office workers [40] and the postures of construction workers [73]), but no studies have addressed effective solutions for the automatic evaluation of the work of a newly hired employee.

The Service Used as a Case Study
The analyzed solid fuel boiler service procedure consists of five service activities: (1) stopping the furnace, (2) checking the solid fuel tank, (3) checking the gear motor and auger, (4) assembling the auger and the gear motor, and (5) tightening the mounting screws of the gear motor and mounting the cleanout.
Each service step consists of stages, based on the relevant reference frames of the reference video material. The total number of steps in the service procedure was 38. The general framework is presented in Figure 1. A set of graphic files was created, containing objects of the classes related to the process of servicing the solid fuel boiler. The dataset contained 3440 graphic files with objects from 11 classes, namely: person (925), hand (1065), solid fuel boiler (355), controller (407), shovel (32), bucket (758), plastic storage box (370), auger (317), solid fuel tank (716), gear motor (1253), and spanner (421). In total, the dataset contained 6619 marked objects for the 11 classes. Eight video sequences showing the correctly performed service procedure for checking a solid fuel boiler were tested. The training and validation sets were created automatically from entries in a configuration file created with Data Transformation Language. The configuration settings assumed that 80% of the dataset would be allocated to training data and 20% to validation data. Ultimately, 2787 (81.02%) images were included in the training set and 653 (18.98%) in the validation set.
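The object counts and split percentages reported above can be cross-checked with a few lines of Python. This is only an arithmetic check of the numbers stated in the text; the names `class_counts`, `n_train` and `n_val` are illustrative and are not taken from the authors' configuration file.

```python
# Per-class counts of marked objects, as reported in the text.
class_counts = {
    "person": 925, "hand": 1065, "solid fuel boiler": 355, "controller": 407,
    "shovel": 32, "bucket": 758, "plastic storage box": 370, "auger": 317,
    "solid fuel tank": 716, "gear motor": 1253, "spanner": 421,
}
total_objects = sum(class_counts.values())

# Image counts of the training/validation split, as reported in the text.
n_train, n_val = 2787, 653
n_images = n_train + n_val

train_share = round(100 * n_train / n_images, 2)
val_share = round(100 * n_val / n_images, 2)

print(len(class_counts), total_objects)  # 11 classes, 6619 objects
print(n_images, train_share, val_share)  # 3440 images, 81.02, 18.98
```

The check also makes the class imbalance visible: the shovel class has 32 marked objects, while the gear motor class has 1253.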

New Approach to the Automatic Generation of Work Instructions
The new approach to the automatic generation of work instructions has been designed according to a flowchart (Figure 2). The system's preparatory stages for supporting the automatic training of new employees and checking their work for clearance are marked in red, with a dashed line. Correct working practices in the system are marked in blue, with a dashed line. In the so-called preparatory stages (Figure 2), the reference video material, showing a service procedure being correctly performed, should be recorded. After determining the reference frames corresponding to the individual stages of the service procedure, the next step was to prepare the CNN to detect objects in the video sequence. For this purpose, the artificial neural network was trained on a dataset containing classes of objects appropriate for the service process.
The proposed CNN architecture uses SVM and Softmax to extract video frame features and information about the objects in the images, that is, to establish the features and objects of the reference frames and of the tested video frames. To train the CNN, an original dataset created by the authors was used. The dataset contains images of training materials used in the process of training employees to operate a solid fuel boiler, with selected object classes. The implementation of tasks in the system relating to the analysis and interpretation of the image requires the correct identification of the elements in the image. This depends not only on choosing the right network architecture but also on training it properly. Publicly available datasets made it possible to train the network to detect a person-class object, but they did not allow other objects to be detected, such as the gear motor, the controller, or the auger. Therefore, the training set had to be created from scratch, and the elements that the system was to identify had to be marked manually. This process turned out to be very time-consuming, as the effective operation of a CNN as an object detector requires training on a large dataset.
In the proposed method, six algorithms have been created and implemented, which allow video sequences to be processed in order to check the service activities performed.
Algorithms 1-4 were described in detail, in the form of pseudo-code, in the authors' previous publication [11]. The task of Algorithm 1 is to analyze the set of reference frames and divide it into the component activities of the service procedure, along with determining the stages of each activity. A START/STOP rule has been defined, according to which the set of reference frames is divided into subsets (each subset is a separate activity). The rule is a video frame in which the technician and the tool board are identified. The result of the algorithm is a set of service activities, each containing its corresponding steps. The task of Algorithm 2 is to extract the features of the previously determined reference frames and test frames using CNN and SVM. The acquired features make it possible to identify the activity currently performed by the technical employee, and the feature extraction algorithm is one element of the two-stage process of determining the stage of service activities [11]. The task of Algorithm 3 is to determine the data set containing the labels of the classes of objects located in the previously determined reference frames and test frames [11]. The technologies used in Algorithm 3 are based on CNN, R-CNN, and YOLOv3 networks. The set of labels generated for each frame, corresponding to the names of the classes of identified objects, is used in the two-stage process of identifying the currently performed activity. The task of Algorithm 4, described in [11], is to identify the step of the activity by comparing the object class labels and features of the test material frames with the features and object class labels of the reference frames. The features of the test video material were extracted using the CNN.
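The two-stage identification attributed to Algorithm 4 can be illustrated with a minimal sketch. It assumes that frame features are flat lists of numbers, that feature similarity is measured with the mean absolute error (MAE) named in the abstract, and that object agreement is the number of shared class labels. The function names, the rule for combining the two criteria, and the toy data are hypothetical; the authors' own pseudo-code is given in [11].

```python
def mean_absolute_error(a, b):
    """MAE between two equally long feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def match_stage(test_features, test_labels, references):
    """Return the index of the best-matching reference frame.

    references: list of (features, labels) pairs, one per stage.
    Ranking rule (an assumption): most shared object labels first,
    lowest MAE as the tie-breaker.
    """
    best_index, best_key = None, None
    for index, (ref_features, ref_labels) in enumerate(references):
        shared = len(set(test_labels) & set(ref_labels))
        error = mean_absolute_error(test_features, ref_features)
        key = (-shared, error)  # smaller is better for both parts
        if best_key is None or key < best_key:
            best_index, best_key = index, key
    return best_index


# Toy reference data for three stages (values are made up).
references = [
    ([0.1, 0.9, 0.3], ["Hand", "Controller"]),
    ([0.8, 0.2, 0.5], ["Gear motor", "Hand", "Bucket"]),
    ([0.7, 0.3, 0.5], ["Auger", "Hand"]),
]
test_frame = ([0.75, 0.25, 0.5], ["Gear motor", "Hand"])
print(match_stage(*test_frame, references))  # index of the matched stage
```

Ranking by shared labels before MAE is only one way to combine the two criteria; a weighted score or a threshold on each criterion would be equally plausible readings of the description.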
The pseudo-code of Algorithm 5, designed to generate a set of graphic instructions under the stage of the service operation, is presented as follows: The pseudo-code for Algorithm no. 6, which analyses the times of the stage of activities, based on which it will be possible to identify irregularities in the service works performed, is as follows:

Algorithm 6:
initialization;
activity = get(activity)
stage = get(stage)
time = find(act_time, next(activity, stage))
if time == 0 then
time = find(act_time, next(activity, stage))
a = time_now
while (a < time)
begin
    send → activity, stage
end;
send → next(activity, stage)
end

where:
activity: a variable that stores information about the current activity,
stage: a variable that stores information about the current stage of the activity,
time: a variable that stores information about the time allocated to a given stage of activity,
act_time: a one-dimensional array that stores the times specified for each stage,
next: a function that determines the next stage of the service procedure,
a: a variable that stores the current time.
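One possible reading of Algorithm 6 in Python is sketched below. It is an interpretation, not the authors' implementation: the current time is re-read on every loop iteration (the pseudo-code assigns it once), the time budget of the current stage is used, the clock is injectable so that the loop can be exercised without waiting, and `act_time`, `send` and `next_stage` are simplified stand-ins for the structures named in the pseudo-code.

```python
import time


def next_stage(activity, stage, act_time):
    """Return the (activity, stage) pair that follows the current one."""
    keys = sorted(act_time)
    position = keys.index((activity, stage))
    return keys[position + 1] if position + 1 < len(keys) else None


def control_stage(activity, stage, act_time, send, clock=time.monotonic,
                  poll=0.0):
    """Report the current stage until its time budget is used up.

    act_time maps (activity, stage) to the allotted time in seconds.
    Returns the following (activity, stage) pair, or None at the end.
    """
    budget = act_time[(activity, stage)]
    started = clock()
    while clock() - started < budget:
        send(activity, stage)      # still within the allotted time
        if poll:
            time.sleep(poll)
    following = next_stage(activity, stage, act_time)
    if following is not None:
        send(*following)           # time exceeded: move on
    return following


# Usage with a fake clock that advances one second per call.
ticks = iter(range(100))
log = []
act_time = {(1, 1): 3, (1, 2): 5}
result = control_stage(1, 1, act_time, lambda a, s: log.append((a, s)),
                       clock=lambda: next(ticks))
print(result, log)
```

In a real system the loop would also end as soon as the stage is detected in the video; that branch is left out here to keep the sketch focused on the time limit.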
The process of training the artificial neural network, testing the activity detection system, and generating the scenario of conduct was carried out on a machine with the following configuration: processor: AMD Ryzen 5 3600; motherboard: MSI B450 TOMAHAWK MAX; RAM: PC CRUCIAL SPORT DDR4, 16 GB, 3200 MHz; graphics card: Gigabyte GeForce RTX 2060 SUPER Gaming OC 8 GB GDDR6; SSD: 2 x Patriot Burst 240 GB in RAID 0.
Network training was performed on the Ubuntu 19.04 operating system with CUDA 10.2, cuDNN v7.6.4, and the Docker Engine installed. The operation of the employee activity detection system was tested on the Windows 10 operating system, with CUDA 10.2 and cuDNN v7.6.4 as well as the PyTorch and OpenCV frameworks installed. The image analysis tasks of the information system were implemented using the YOLOv3 network architecture. The YOLOv3 network has been used before for monitoring workers' activity in construction site video sequences [74].
The research used convolutional neural networks based on the YOLOv3 and YOLOv3-Tiny architectures. Architectures based on YOLOv3 were chosen because they ensured high efficiency in recognizing objects contained in the tested video material on a PC-class machine. The YOLOv3 architecture is based on the Darknet-53 model and has 53 convolutional layers. YOLOv3-Tiny is a faster variant that is based on the YOLOv3 architecture but is less complex and consists of 24 layers. The ANN based on this model made it possible to detect objects of all classes contained in the dataset, but the speed of operation of the IT system was limited by the technical capabilities of the machine on which the application was launched. After implementing the proprietary algorithms, the detection system ran at about 5 frames per second with the YOLOv3 model. Therefore, the network architecture based on the YOLOv3-Tiny model was adopted, which allowed a processing speed of about 26-27 frames per second to be obtained. Training the network on the created dataset took about 5 h for the YOLOv3-Tiny model and about 12 h for the YOLOv3 model.
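The reported frame rates translate into per-frame time budgets as follows. The snippet only restates the figures given above (about 5 frames per second for the full model, 26-27 for the Tiny model, and about 12 h versus 5 h of training); the helper name is illustrative.

```python
def frame_time_ms(fps):
    """Average processing time per frame, in milliseconds."""
    return 1000.0 / fps


full_fps = 5
tiny_fps_low, tiny_fps_high = 26, 27

print(frame_time_ms(full_fps))                 # 200.0 ms per frame
print(round(frame_time_ms(tiny_fps_low), 1))   # about 38.5 ms per frame
print(round(frame_time_ms(tiny_fps_high), 1))  # about 37.0 ms per frame

# Relative speed-up in detection and relative cost in training time.
print(tiny_fps_low / full_fps, tiny_fps_high / full_fps)  # 5.2 and 5.4
print(12 / 5)                                             # 2.4
```

On the reported hardware, the Tiny model is therefore roughly five times faster per frame and about 2.4 times cheaper to train.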
The accuracy of establishing the appropriate frames using CNN + SVM was about 94% (ResNet18: 94%, AlexNet: 87.51%), and with YOLOv3 it was 73.15% (Cifar10Net: 61.66%, AlexNet: 48.33%). Table 1 presents a comparison with state-of-the-art alternatives in order to highlight the new contribution of our work.

Research Results-An Integrated System to Support the Automatic Training of New Employees
Based on the proposed methodology (Figure 2), an information system to support the automatic training of new employees and verify their work was built. In the first stage, the system is prepared to work with test material while in the second stage, the test material is analyzed. The test material is a video recording of service activity correctly done.

Preparing the System to Work
The implementation of the IT system will facilitate the following: the extraction of reference frames (including START/STOP frames); the division of reference frames into sets of frames corresponding to individual activities and their stages; the extraction of reference frame features and their transformation from a three-dimensional tensor into a two-dimensional form; the extraction of objects in reference frames; the preparation of sets of graphic instructions; the storing of information about objects, reference frame characteristics, and file names for graphic instructions; and the designation of activities and steps to be controlled by the system.
From the reference material obtained, reference frames corresponding to the individual stages of the activity were separated manually. For each reference frame, features were extracted with an artificial neural network in the form of a three-dimensional tensor, which was converted into a two-dimensional form. Objects in reference frames were also identified. Information about the features and objects of the reference frames was saved in the system. Information about objects for each frame can be obtained automatically or manually; the decision is made by the system's user through the value of a program variable. The automatic detection of objects is performed using the built-in code and additionally implemented Python functions. The designated features and object class labels were used to identify the activity being trained at a given work position, by comparing the features and labels of the reference frame with those of the test frames. An example of an array of objects, for activity 1 only, is: ref_objects = ( ["Gear motor","Hand","Bucket"], [""],) A START/STOP frame was designated, based on which the activities and stages of the service procedure were distinguished. The appearance of this type of frame in the video means the end of service. The video frame features were extracted using an ANN, converted from a three-dimensional tensor to a two-dimensional form, and saved to a file. Next, graphic instructions were prepared corresponding to each stage of the service procedure. For testing purposes, the instructions contained only information about the activity and the stage of the activity performed.
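The conversion of a feature tensor from three to two dimensions, and the storage of per-frame object labels, can be sketched as follows. The text does not state which axes are merged, so the layout assumed here (channels x height x width flattened to channels x height*width) is only one plausible choice; `flatten_to_2d` and `reference_store` are hypothetical names, and plain Python lists stand in for real tensors.

```python
def flatten_to_2d(tensor):
    """Reshape a C x H x W nested list into C rows of H*W values."""
    return [[value for row in channel for value in row] for channel in tensor]


# Toy feature map with 2 channels of size 2 x 3 (values are made up).
features_3d = [
    [[1, 2, 3], [4, 5, 6]],
    [[7, 8, 9], [10, 11, 12]],
]
features_2d = flatten_to_2d(features_3d)
print(features_2d)  # [[1, 2, 3, 4, 5, 6], [7, 8, 9, 10, 11, 12]]

# Object labels per reference frame, in the style of the ref_objects example.
ref_objects = (["Gear motor", "Hand", "Bucket"], [""])
reference_store = [
    {"stage": index + 1, "objects": [name for name in labels if name]}
    for index, labels in enumerate(ref_objects)
]
print(reference_store)
```

With PyTorch tensors, the same layout could be obtained with `tensor.reshape(tensor.shape[0], -1)`, assuming the channel axis comes first.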
The production version of the system should contain graphic instructions suggesting the actions to be performed. While configuring the system, the user can select the activities to be controlled. The work activities and steps are determined through the appropriate configuration of the elements of the activities_to_proc array. An additional element of the system is the time control functionality for the completion of a service stage. In the mode in which the time of the service procedure is not controlled, the system requires each step to be completed. In the time control mode, if the time allotted for a stage is exceeded, the system displays information that the stage has not been completed and goes on to control the next stage. The stage is verified based on information about the time allocated to each stage, stored in a previously defined table.

The Correct Operation of the System
Implementation of the IT system will facilitate the following: START/STOP frame detection; analysis of the frames of the test video, based on the extracted frame features and on information about the objects in the frames of the test material; detection of user-defined activities and steps of the service procedure carried out in the test video material; generation of a set of graphic instructions corresponding to the scenario of conduct, created from the designated activities and their stages; the display of graphic instructions; the display of information about the activity and the step of the activity; the results of comparing the features of the reference frame and the test frames; and the results of the comparison of objects, that is, the common part of the set of object class labels of the reference frame and the set of object class labels of the test frames.
In the first stage of the system's operation, in detection mode, the features of the START/STOP frame are loaded. The beginning and end of the service action are determined on the basis of these features, and the test frames are compared against them. The actual stage of activity detection begins with the analysis of the video sequence. Test material in the form of a video file or a camera recording is captured and processed in a loop, and the features of each processed frame are analyzed. It was assumed that the service procedure, and each activity that is part of it, begins and ends with the appearance of the START/STOP frame; therefore, the system first compares the features of the test frame with the features of the START/STOP frame and keeps searching for this type of frame until it finds it. The feature analysis process for test frames begins with changing the tensor size (3D to 2D). The system then compares the features of the test frame currently being analyzed with the features of the START/STOP frame. If the features are similar, the system identifies the analyzed frame as a START/STOP frame, and the variable that counts the activities and stages of activities is incremented. Setting activity to 1 and stage to 1 means that the system has started searching for stage 1 of activity 1. In time control mode (mode = 1), the system measures the time from the appearance of the START/STOP frame and, if the time for a stage is exceeded, moves on to the next stage, recording that the stage has not been completed. In the mode that controls each stage, from the moment the system identifies the next START/STOP frame, the value of the activity variable is incremented and the stage variable is set to 1. This enables the search for the next service steps.
The system will analyze the image frames until it finds a frame corresponding to the last index of the table storing information about the reference frames.
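The START/STOP logic described above can be sketched as a small loop over frame features. This is a simplified illustration under several assumptions: features are treated as flat lists of numbers, similarity is decided by comparing MAE against a hypothetical threshold, and a run of consecutive START/STOP frames is counted once. The names mae and track_activities are not taken from the system.

```python
from typing import List, Sequence, Tuple


def mae(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean absolute error between two equally long feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def track_activities(frames: Sequence[Sequence[float]],
                     start_stop: Sequence[float],
                     threshold: float = 0.1) -> List[Tuple[int, int]]:
    """Return the (activity, stage) counters after each analyzed frame."""
    activity, stage = 0, 0
    previous_was_marker = False
    states = []
    for features in frames:
        is_marker = mae(features, start_stop) <= threshold
        # A new START/STOP frame opens the next activity at stage 1.
        if is_marker and not previous_was_marker:
            activity += 1
            stage = 1
        previous_was_marker = is_marker
        states.append((activity, stage))
    return states


# Hypothetical features: the marker frame is close to [1.0, 1.0].
start_stop_features = [1.0, 1.0]
test_frames = [[0.0, 0.0], [1.0, 1.0], [1.0, 0.95], [0.2, 0.1], [1.0, 1.0]]
states = track_activities(test_frames, start_stop_features)
```

Until the first START/STOP frame appears, the counters stay at (0, 0); the sketch leaves out stage advancement within an activity, which depends on matching the reference frames.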
When the process of detecting activities and steps has been initiated, that is, when the system has identified a test frame as the first START/STOP frame, the system acquires the features of the next reference frame. This is the frame corresponding to stage 1 of activity 1. For this purpose, the previously stored features of the reference frame are read from the file associated with the stage of activity being searched for. After the features of the reference frame have been loaded, they are compared with the features of the test frame. If the features are similar, the labels of the object classes in the test frame are determined. The identified objects (labels) are compared with the objects assigned to the reference frame. For matching, we use the mean absolute error (MAE) as the error metric. If the features and object labels of both compared frames match, the stage of the service procedure is identified, and a window is displayed with information about the stage of service activity found.
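The two-stage check described above, features first and object labels second, can be sketched as follows. The threshold value and the rule that every reference label must appear among the test labels are assumptions made for illustration; the text states only that features are compared using MAE and that the labels of both frames must match. The function names are hypothetical.

```python
from typing import Iterable, Sequence


def mae(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean absolute error between two equally long feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def stage_matches(ref_features: Sequence[float],
                  test_features: Sequence[float],
                  ref_labels: Iterable[str],
                  test_labels: Iterable[str],
                  threshold: float = 0.1) -> bool:
    """Stage 1: feature similarity. Stage 2: object label agreement."""
    if mae(ref_features, test_features) > threshold:
        return False  # features differ, so object labels are not checked
    # Assumed rule: all reference objects must be present in the test frame.
    return set(ref_labels) <= set(test_labels)


# Hypothetical flattened features and labels for one reference frame.
ref_f = [0.2, 0.4, 0.6]
ref_l = ["Gear motor", "Hand", "Bucket"]
```

For example, stage_matches(ref_f, [0.25, 0.4, 0.55], ref_l, ["Hand", "Bucket", "Gear motor"]) accepts the test frame, while the same features with the label "Bucket" missing are rejected at the second stage.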

The Results of the Experiments
Analysis of the system efficiency consisted in assessing the detection of activities, the stages of service activities, and the scenario of conduct, along with graphic instructions, generated at a given workstation (see Figure 3).
Once the system was operating properly, the recorded test video sequences showing the service procedure being performed by the trained employee were analyzed. Based on the stages of service activities identified, the system generates a scenario of conduct at a given workstation and displays a set of workstation instructions for four consecutive stages of the service procedure (Figure 1). If the features and object classes of the reference frame and the test frame currently being analyzed match (see the example of video frame matching results presented in Figure 4), the system starts to detect the next reference frame.

Matching matrix between the reference frames and the test frames using mean absolute error (MAE) as a measure of errors between paired observations. The matrix is used to detect the test frames most similar to the reference frames.
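A matching matrix of this kind can be sketched in a few lines. Entry (i, j) holds the MAE between reference frame i and test frame j, and the most similar test frame for each reference frame is the column with the smallest error. The feature values and function names below are hypothetical, and the features are assumed to be flattened to vectors of equal length.

```python
from typing import List, Sequence


def mae(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean absolute error between two equally long feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def matching_matrix(refs: Sequence[Sequence[float]],
                    tests: Sequence[Sequence[float]]) -> List[List[float]]:
    """Rows index reference frames, columns index test frames."""
    return [[mae(ref, test) for test in tests] for ref in refs]


def best_matches(matrix: Sequence[Sequence[float]]) -> List[int]:
    """Index of the lowest-error test frame for each reference frame."""
    return [min(range(len(row)), key=row.__getitem__) for row in matrix]


# Hypothetical features for two reference frames and three test frames.
reference_frames = [[0.0, 0.0], [1.0, 1.0]]
test_frames = [[0.9, 1.0], [0.1, 0.0], [0.5, 0.5]]

matrix = matching_matrix(reference_frames, test_frames)
closest = best_matches(matrix)
```

Here the first reference frame is closest to the second test frame and the second reference frame to the first, so closest holds the indices 1 and 0.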
The analysis of the test video sequences produced the following results for each of the tested sequences, which confirm the system's effectiveness: five service activities, a total of 38 steps, and 11 classes of objects appearing in image frames were detected. The proposed system thus allowed for full control of the service procedure implementation.
Moreover, the proposed system is a solution to the problem of high turnover among employees with specialist knowledge. Activity control systems for work performance and evaluation frequently rely on CNN implementations based on data from various types of sensors [39,75,76]. However, despite the wide use of CNNs, it is difficult to find solutions that automate the training of new employees without involving an experienced employee.
The proposed approach can be extended to cover knowledge management, specifically the automation of specialist knowledge transfer to a new employee. The implementation of an integrated system supports the automatic training of new employees in a company where training courses are conducted for employees taking up employment at a given position. During the activities, the new employee is recorded and the captured video frames are analyzed by a CNN for the similarity of features and the presence of objects. The information system based on the proposed model includes two modes of monitoring the implementation of the service phase. In the first mode, the system requires the implementation of each stage of activities and awaits the performance of the appropriate stage of service activities. In the second mode, if the time allotted for a stage is exceeded, the system displays information that the stage has not been completed and proceeds to control the next stage.

Conclusions
Despite the increasing automation levels in emerging Industry 4.0 manufacturing, acquiring and transferring the explicit knowledge of highly skilled manufacturing workers remains a strategic challenge. In this paper, we addressed this problem by employing deep learning techniques to capture human worker activities in an industrial setting.
The research results made it possible to build an integrated system that supports the automatic training of new employees and verifies their work using a deep learning approach based on data collected from a properly conducted service procedure. The originality of the proposed framework for the automatic generation of workplace instructions and real-time recognition of worker activities lies in integrating CNN, R-CNN, and YOLOv3 (Yolo Tiny), which makes it possible to generate the right scenario during the service procedure of a solid fuel boiler. The developed solution is based on a two-stage model for identifying the activities currently performed in the service procedure. The two-stage identification of the service activity stage is carried out by comparing the reference and test features of video frames and the information about the classes of objects located in the analyzed frames. The limitations of our work are the lack of simultaneous analysis of data obtained from a larger number of cameras and the lack of analysis of activities carried out for a different service procedure. In our further work, we plan to build algorithms that increase the effectiveness of verifying the correctness of the activities performed by a new employee, by developing a mechanism for controlling the location of objects and by optimizing the applied image analysis techniques and algorithms in order to accelerate the extraction and comparison of image features in real time.
Verification of the service activities performed by an employee is a responsible task; therefore, in our further work, we plan to extend the proposed system with simultaneous analysis of data obtained from a larger number of cameras in the workplace. This will allow for more precise control of the service work performed.
Currently, technology maturity, data and knowledge acquisition and sharing, new methods in information systems design, perception, and human-robot interaction are important challenges for traditional manufacturing companies. Moreover, in a manufacturing company that is changing to operate in accordance with the assumptions of Industry 4.0, mobile platforms can accomplish tasks in workspaces. Therefore, the proposed information system should be integrated with other Industry 4.0 technologies introduced within an enterprise, but this requires further work in this area.

Data Availability Statement: Data is available from the corresponding author upon reasonable request.

Conflicts of Interest:
There are no conflicts of interest.