Detection of Anomalous Behavior of Manufacturing Workers Using Deep Learning-Based Recognition of Human–Object Interaction

Abstract: The increasing demand for industrial products has expanded production quantities, leading to negative effects on product quality, worker productivity, and safety during working hours. Therefore, monitoring the conditions in manufacturing environments, particularly human workers, is crucial. Accordingly, this study presents a model that detects workers' anomalous behavior in manufacturing environments. The objective is to determine worker movements, postures, and interactions with surrounding objects based on human–object interactions using a Mask R-CNN, MediaPipe Holistic, a long short-term memory (LSTM) network, and a worker behavior description algorithm. The process begins by recognizing the objects within video frames using a Mask R-CNN. Afterward, worker poses are recognized and classified based on object positions using a deep learning-based approach. Next, we identified the patterns or characteristics that signified normal or anomalous behavior. In this case, anomalous behavior consists of anomalies correlated with human pose recognition (emergencies: a worker falls, slips, or becomes ill) and with human pose recognition combined with object positions (tool breakage and machine failure). The findings suggest that the model successfully distinguished anomalous behavior and attained the highest pose recognition accuracy (approximately 96%) for standing, touching, and holding, and the lowest accuracy (approximately 88%) for sitting. In addition, the model achieved an object detection accuracy of approximately 97%.


Introduction
In Industry 4.0, following the growing public demand for industrial products that significantly contribute to economic perspectives [1,2], manufacturers are striving to produce highly customizable products while maintaining production accuracy and speed [2,3]. In line with this, government regulations have been established with the purpose of regulating the responsibility of company owners to ensure worker safety during working hours [4]. Likewise, monitoring manufacturing environments, particularly human workers, is crucial [5][6][7]. Implementing these measures enhances quality control, worker productivity, and cost efficiency in addition to workplace safety. Hence, it is necessary to focus on aspects such as human error, machine malfunction, or other possibilities that significantly impact product quality, efficiency, and workplace safety.
As a matter of fact, human error significantly contributes to manufacturing production disturbances and safety problems [1][2][3][4][5][6][7][8][9][10][11]. In addition, humans are considered the most active objects; thus, detecting worker behavior is vital. Anomaly detection, which is a key task of modern automated monitoring systems, highly impacts diverse fields such as health care, sports analysis, industries, and security. Therefore, this study aims to develop a model that detects anomalous worker behavior based on human–object interactions. The outputs are in the form of descriptive texts that indicate detection results. This supports management teams in mitigating production failure and improving worker performance, as well as workplace safety. Accordingly, an algorithm has been developed for anomaly detection and text generation.
To provide a clear structure and logical flow, this paper is organized into different sections. Section 2 presents a description of related research, and Section 3 contains an explanation of the concept and detection of anomalous behavior. Next, the proposed method is discussed in Section 4, and the experimental results are provided in Section 5. Finally, Section 5 also presents the conclusion and future study directions.

Related Works
Computer vision related to the monitoring and detection of anomalous behavior has been studied under different conditions and environments for different purposes using different methods. Some of these studies are addressed below.
The authors of [7] present a system that analyzes human motion from the position and movement of objects within three-dimensional spaces in depth images. The system uses a depth sensor and a computer vision algorithm to detect human poses and movements and subsequently executes a machine learning model to recognize specific behaviors. In [12], computer vision monitors worker productivity and safety for the purpose of minimizing manual monitoring in industries using an object recognition algorithm (YOLOv3) trained on images. Likewise, ref. [13] presents a vision-based fault classification approach that utilizes cameras to capture real-time images of robot movements and components to monitor industrial robots. The aim is to ensure worker productivity and product quality. In contrast, ref. [14] focuses on designing and implementing a system that uses a combination of shape- and motion-based features to recognize human actions based on silhouettes. The system consists of preprocessing, silhouette extraction, feature extraction, and classification modules. A support vector machine classifier is implemented to classify human behavior. Furthermore, in [15,16], image analysis is performed on images where objects and body poses remain stable.
Subsequently, ref. [17] proposes a multi-view learning approach to detect anomalous human behavior in complex and dynamic environments using multiple data sources, such as sensor data, video feeds, and contextual information. In addition, the authors of [18] conduct video monitoring using intelligent behavior identification techniques to detect suspicious behavior and alert security personnel. The system utilizes a camera to capture footage that is subsequently processed to identify suspicious worker behaviors based on human body movement analysis. As a result, the proposed system reduces the number of false alarms and enables security personnel to promptly respond to potential threats. Apart from this, ref. [19] focuses on safety monitoring at construction sites to prevent accidents and injuries using an anomaly detection algorithm developed in a random finite set framework. Later, ref. [20] presents an unsupervised anomaly-based approach for pedestrian age classification in surveillance camera footage, and the proposed framework incorporates an adversarial model with skip connections. The results indicate the potential of this approach to classify pedestrian age and detect anomalous age patterns. In addition, in [21], an online and adaptive method is utilized to detect abnormal events in video surveillance using a spatiotemporal ConvNet. The method combines spatial and temporal information and incorporates an adaptive learning approach to dynamically update the model, thus improving the detection accuracy using normal events as training samples. A different study [22] highlights the importance of anomaly detection in the domain of elderly daily behavior. The study presents diverse methods and approaches as valuable monitoring tools to detect possible anomalous situations that potentially indicate warning signs of chronic illnesses or initial physical and cognitive decline. The study is specifically designed for caregivers, healthcare professionals, and family members to
enhance their overall safety and quality of daily life. Ultimately, the integration of advanced technologies and machine learning in eldercare conceivably strengthens the well-being and independence of the elderly population, while facilitating and supporting caregivers in delivering their professional duties. Moreover, ref. [23] presents a performance analysis of convolutional neural networks (CNNs) using a long short-term memory (LSTM) network to effectively capture and analyze spatiotemporal information from surveillance video data for surveillance systems. The image features are extracted from the image frame sequences using the CNN, whereas the LSTM uses its gate mechanism to maintain vital information. Afterward, the results are compared with existing detection models, including mixtures of probabilistic principal component analysis, motion deep net, social force, and dictionary-based models for performance evaluation. The authors of [24] propose a framework that utilizes deep learning techniques to automatically detect abnormal sea surface temperature (SST) events. The framework consists of two main components: a deep convolutional autoencoder and a classifier. The deep convolutional autoencoder is trained on a large dataset of normal SST patterns. The purpose is to learn the underlying features and patterns to detect abnormal SST events, thus monitoring and managing marine ecosystems and predicting extreme weather events. Subsequently, ref.
[25] aims to improve the accuracy and efficiency of SST prediction, as well as reduce computational costs, by combining two deep learning models: LSTM networks and CNNs. The hybrid approach consists of two main stages: feature extraction and prediction. During the feature extraction stage, a CNN extracts spatial features from satellite SST images. The CNN learns to capture patterns and spatial dependencies in the data, therefore allowing for the encoding of relevant information. The extracted features are subsequently fed into the LSTM networks in the prediction stage. In this case, LSTM networks are capable of capturing temporal dependencies and long-term patterns, making them suitable for time series prediction tasks. In [26], a valuable overview of deep learning approaches is provided for anomalous event detection in video surveillance that serves as a comprehensive resource for researchers and practitioners inclined toward understanding the advancements and current state-of-the-art techniques in this field.
Equally notable, existing studies have not specifically addressed the labeling of anomalous worker behavior using descriptive text. This study, therefore, presents a model that detects anomalous behavior performed by manufacturing workers based on human–object interaction in videos. In practice, the model combines object detection and human pose estimation using a Mask R-CNN (R-CNN stands for Region-Based Convolutional Neural Network), MediaPipe Holistic, a long short-term memory (LSTM) network, and a worker behavior description algorithm. Mask R-CNNs and LSTMs are among the most outstanding practices within deep learning. The Mask R-CNN is capable of detecting objects, whereas MediaPipe works as a pose tracker. In addition, the LSTM is utilized to train the model to recognize possible differences in worker behavior within manufacturing environments. Additionally, as an advancement of previous research [15,16], this study focuses on using videos to generate descriptive texts and distinguish normal and anomalous worker behavior.

Problem Statement
Anomalous behavior is defined as unusual or abnormal activities performed by a human. For example, sleeping in a bedroom is normal, whereas sleeping in a bathroom is unusual; likewise, lying on a floor motionless for an extended period is unusual. Similarly, anomalous worker behavior refers to actions or conduct exhibited by workers that deviate from the expected, normal, standard, or desired behavior at a manufacturing site. This behavior compromises a workplace's productivity, safety, and overall functions.
This study addresses anomalous behavior in manufacturing environments in particular. The anomalies are divided into anomalies related to human pose recognition (emergencies: worker falls, slippage, or illness) and anomalies related to the recognition of human poses combined with object positions (tool breakage and machine failure). In this case, an anomaly detection algorithm is developed to detect particular conditions resulting from recognizing unexpected human poses based on body movements. Moreover, a framework is proposed to detect human poses and interconnected object positions. Normal data consist of videos containing expected worker behavior, and test data contain both normal and abnormal behavior. Hence, the proposed method learns behavior patterns from normal data to estimate normal/abnormal responses.

Body Motion-Based Anomaly Detection
Body motion-based anomaly detection requires thorough human body motion information. The idea is to collect features from the expected body movements provided by the normal data. The model is trained to classify features as normal or anomalous and is tested to verify performance. Correspondingly, the use of MediaPipe Holistic and an LSTM to analyze body motion features is proposed. The method is pre-trained for classification purposes on the frames. For a frame t, f(t) is the set of all possible feature vectors associated with t. Additionally, the feature vectors collected from the normal data are subsequently embedded into a common multidimensional Cartesian space.
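The idea above can be sketched minimally as follows. This is an illustrative example that assumes the per-frame feature vectors f(t) have already been extracted; the nearest-neighbor distance measure and the fixed threshold are assumptions for illustration, not the paper's exact scoring rule.

```python
import numpy as np

def build_normal_bank(normal_features):
    """Stack per-frame feature vectors f(t) collected from normal videos."""
    return np.vstack(normal_features)

def anomaly_score(frame_feature, normal_bank):
    """Distance from a frame's feature vector to its nearest normal neighbor."""
    dists = np.linalg.norm(normal_bank - frame_feature, axis=1)
    return dists.min()

def is_anomalous(frame_feature, normal_bank, threshold):
    """Flag a frame as anomalous when it lies far from all normal features."""
    return anomaly_score(frame_feature, normal_bank) > threshold

# Toy 4-D feature vectors standing in for embedded pose features
normal = [np.array([0.0, 0.0, 0.0, 0.0]), np.array([0.1, 0.0, 0.1, 0.0])]
bank = build_normal_bank(normal)
print(is_anomalous(np.array([0.05, 0.0, 0.05, 0.0]), bank, threshold=0.5))  # False
print(is_anomalous(np.array([3.0, 3.0, 3.0, 3.0]), bank, threshold=0.5))    # True
```

In practice, the threshold would be tuned on held-out normal data rather than fixed by hand.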


Proposed Method
First, the input features (video frames) are extracted. In this case, sequential frames are extracted from the video from the top of the convolutional layer to identify the transformation of each movement in each frame. The entire dataset is divided into training and testing data at a 75%/25% ratio. The model classifies objects using an input image size of 224 × 224 × 3 (RGB).
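The 75%/25% split can be sketched as follows. This is a minimal illustration; the shuffling seed and the use of integer indices as stand-ins for frames (each assumed already resized to 224 × 224 × 3) are assumptions for the example.

```python
import random

def split_dataset(samples, train_ratio=0.75, seed=42):
    """Shuffle and split extracted frame samples into train/test sets (75%/25%)."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

# e.g. indices of 1000 extracted frames
frames = list(range(1000))
train, test = split_dataset(frames)
print(len(train), len(test))  # 750 250
```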
The next step is object detection, where each object in the sequential video frames is detected using the Mask R-CNN methodology by producing a fixed size from each feature map. Afterward, a fully connected layer with a similar input size is created to enable convolutional and deconvolutional neural networks to detect object classes (labels). Subsequently, the model creates bounding boxes (bbox) and segmentation masks. The underlying reason for utilizing Mask R-CNN is its capability to detect small and overlapping objects in a more specific and efficient way compared with other models such as Fast R-CNN, Faster R-CNN, You Only Look Once (YOLO), and the Single-Shot MultiBox Detector (SSD). Additionally, Mask R-CNN is suitable for the requirements and focus of the study.
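A typical post-processing step on Mask R-CNN outputs is confidence filtering, sketched below. The dictionary shape (`label`, `score`, `box`) and the 0.7 threshold are illustrative assumptions; in practice these detections would come from a trained model such as torchvision's `maskrcnn_resnet50_fpn`, which also returns per-instance segmentation masks.

```python
def filter_detections(detections, score_threshold=0.7):
    """Keep only detections whose confidence exceeds the threshold.

    `detections` mimics Mask R-CNN output: a list of dicts with
    'label', 'score', and 'box' (x1, y1, x2, y2).
    """
    return [d for d in detections if d["score"] >= score_threshold]

raw = [
    {"label": "worker",  "score": 0.98, "box": (120, 40, 260, 400)},
    {"label": "machine", "score": 0.95, "box": (300, 60, 620, 420)},
    {"label": "tool",    "score": 0.40, "box": (200, 300, 230, 330)},  # low confidence
]
kept = filter_detections(raw)
print([d["label"] for d in kept])  # ['worker', 'machine']
```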
In the next stage, poses are identified and classified to recognize worker body movement. Within the process, the model acquires pose landmarks from the MediaPipe Holistic framework and subsequently examines the overall body parts to determine the precise locations of human body key points. The purpose is to understand the articulation of an individual's joints and body parts. This stage is intended to generate a label for each part of the human body prior to extracting and classifying the workers' poses, such as standing, walking, touching, holding, bending, sitting, and squatting.
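Before classification, the landmarks are typically flattened into one fixed-length feature vector per frame. The sketch below assumes MediaPipe's 33 pose landmarks with (x, y, z, visibility) components and zero-fills frames where no person is detected; the exact encoding used in the paper may differ.

```python
def landmarks_to_feature(landmarks):
    """Flatten pose landmarks into one feature vector per frame.

    Each landmark is (x, y, z, visibility), as produced by MediaPipe
    Holistic's pose_landmarks (33 landmarks, normalized coordinates).
    Missing landmarks are zero-filled so the vector length stays fixed.
    """
    NUM_POSE_LANDMARKS = 33
    if landmarks is None:
        return [0.0] * (NUM_POSE_LANDMARKS * 4)
    vec = []
    for lm in landmarks:
        vec.extend(lm)
    return vec

frame_landmarks = [(0.5, 0.5, 0.0, 0.99)] * 33  # dummy detection
feature = landmarks_to_feature(frame_landmarks)
print(len(feature))  # 132
```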
Following the aforementioned pose classification stage, the model describes the behavior classification (determines possible anomalous behavior) and generates descriptive text based on the identified worker pose and interaction with surrounding objects. In this study, the behavior is identified as normal or anomalous after adjusting to the real conditions within a manufacturing environment. To illustrate this, the sitting pose is generally considered a neutral pose; however, when the sitting pose is performed on an object, the combination of the pose and the object is considered a behavior and is classified as normal or abnormal based on the location or domain. For example, a student sitting on a chair in a classroom is classified as normal, as sitting in the classroom is normal. However, when the same behavior is performed in a manufacturing environment, it is classified as anomalous, as the behavior is forbidden according to the working procedure. For a more detailed example, a worker sitting in front of a machine in a manufacturing environment is classified as anomalous behavior, as it deviates from the working procedure. In addition, Figure 2 displays anomalous conditions derived from video frames. Figure 2a–c presents normal conditions: a worker standing in front of a machine (a), a worker touching a control box in front of a machine (b), and a worker holding a tool in front of a machine (c). By contrast, Figure 2d–f presents anomalous behavior: a worker sitting in front of a machine (d), a worker sitting/slipping in front of a machine (e), and a worker standing behind a machine (unseen; f).
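The domain-dependent rule described above can be sketched as a small lookup over (pose, nearest object) pairs. The specific rule entries below are hypothetical examples of the working procedure, not the paper's actual rule set.

```python
# Hypothetical rule table: (pose, nearest object) pairs that violate
# the working procedure on the manufacturing floor.
ANOMALOUS_RULES = {
    ("sitting", "machine"),
    ("bending", "machine"),
}
# Poses treated as emergencies regardless of the nearby object.
EMERGENCY_POSES = {"falling", "slipping"}

def classify_behavior(pose, nearest_object):
    """Combine a recognized pose with the nearest object into a behavior label."""
    if pose in EMERGENCY_POSES:
        return "anomalous"
    if (pose, nearest_object) in ANOMALOUS_RULES:
        return "anomalous"
    return "normal"

print(classify_behavior("standing", "machine"))  # normal
print(classify_behavior("sitting", "machine"))   # anomalous
```

The same pair can thus flip between normal and anomalous simply by changing the rule table for a different domain (e.g., a classroom).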

Implementation Result
In practice, a customized dataset was trained by including objects created through a process augmented by rotation and occlusion operations. The objects comprised a worker, a machine, a control box, a toolbox, a tool, and a product, which were utilized to perform object detection. Subsequently, the collected data were divided into 75% training data and 25% testing data. In this particular context, the model was intended to enhance the previous model [15,16] by adding a dataset and layers. For a more detailed explanation, Figure 3a depicts the model detecting each object in the frames: a worker, a machine, a toolbox, and a tool; Figure 3b depicts a worker, a machine, a control box, and two toolboxes displayed using bounding boxes (bbox) and segmentation masks. In addition, the results signify high accuracy and fast recognition under normal conditions.
Moreover, pose classification was performed using an LSTM. In the process, a network was built consisting of three LSTM layers and three dense layers, producing an output layer of seven neurons representing dynamic poses or human behavior: standing, walking, touching, holding, bending, sitting, and squatting. Due to direct connections between the current and previous cells, the earlier-generated information was directly applied to predict target poses; thus, crossing multiple units was unnecessary. The long-range information persisted up to the stage of predicting the last pose. In addition, enhanced information was employed due to the direct use of the previous and last hidden states to predict the current pose. In this model, the previous LSTM cells, including the last cells, were directly connected to the current cells, where the attention mechanism integrated information in different hidden states. For a more precise description, the training and testing processes are presented in Figure 4.
Table 1 illustrates the hyperparameters for training the model. These parameters were obtained from continuous training of the model that produced the expected results. The model was tested on a video sequence to ensure each worker pose was accurately detected, as illustrated in Figure 5.
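Since the LSTM pose classifier consumes one sequence of per-frame features per prediction, consecutive frames must first be grouped into fixed-length windows. The sketch below illustrates this step; the window length of 30 frames and the stride of 1 are assumptions for illustration, not values from the paper.

```python
def make_sequences(frame_features, seq_len=30):
    """Group per-frame feature vectors into fixed-length, overlapping
    sequences (stride 1), one sequence per LSTM prediction."""
    return [frame_features[i:i + seq_len]
            for i in range(len(frame_features) - seq_len + 1)]

features = [[float(i)] * 4 for i in range(40)]  # 40 dummy 4-D frame features
sequences = make_sequences(features)
print(len(sequences), len(sequences[0]))  # 11 30
```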
Turning to another aspect, a worker behavior description algorithm was created to present the worker behavior derived from video images in the form of descriptive texts. In this practice, the logic was based on the worker's position relative to the coordinates of each object and the pose label within each frame. Each behavior was classified as normal or anomalous based on the given object position and pose label. Afterward, descriptive texts were created according to the results attained from human pose estimation and detection of the objects nearest to the worker. In this case, simple description texts were composed of a subject ("a" + "worker") + to be + verb + adverb ("in front of") + particle + object.
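The sentence template above can be assembled as a simple string composition, sketched below. The pose-to-verb mapping is an illustrative assumption; the paper's actual generation uses a language model over encoded object features.

```python
# Hypothetical mapping from pose labels to present-participle verbs.
POSE_TO_VERB = {
    "standing": "standing", "sitting": "sitting", "touching": "touching",
    "holding": "holding", "walking": "walking", "bending": "bending",
}

def describe_behavior(pose, obj, adverb="in front of"):
    """Compose the simple description text:
    subject ('a worker') + to be + verb + adverb + particle + object."""
    return f"a worker is {POSE_TO_VERB[pose]} {adverb} a {obj}"

print(describe_behavior("standing", "machine"))
# a worker is standing in front of a machine
```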
Additionally, adverbs were used to express the worker's position relative to other objects based on the camera's viewing angle. This process was conducted by encoding the object into a feature vector among the objects in the frame. Thereafter, the language model adopted the vectors to generate descriptive texts, as illustrated in Figure 6 ("*" indicates the beginning of the sentence structure). The proposed algorithm that describes worker behavior is presented in Algorithm 1, while the detection and labeling results are shown in Figure 7 (normal and anomalous conditions).
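One way to derive such an adverb from bounding-box geometry is sketched below. The front/behind heuristic (the box closer to the camera has the larger bottom-edge coordinate) and the "beside" fallback are illustrative assumptions, not the paper's exact logic.

```python
def relative_adverb(worker_box, object_box):
    """Choose an adverb from the worker's position relative to an object,
    given bounding boxes (x1, y1, x2, y2) in image coordinates."""
    wx = (worker_box[0] + worker_box[2]) / 2  # worker's horizontal center
    ox1, _, ox2, _ = object_box
    if ox1 <= wx <= ox2:
        # Worker overlaps the object horizontally: decide front/behind
        # from the bottom edge (closer to the camera = larger y2).
        return "in front of" if worker_box[3] >= object_box[3] else "behind"
    return "beside"

print(relative_adverb((100, 50, 200, 480), (80, 40, 400, 420)))  # in front of
```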

Performance Evaluation
The performance was evaluated by assessing the accuracy of the proposed model's object detection and human pose identification. Table 2 presents two experimental results for object detection accuracy, precision, and recall (with 1000 and 12,000 data samples). When implementing 1000 samples in real time, the model produced an approximate average accuracy of 96%. On the other hand, when implementing 12,000 samples, the average accuracy was approximately 97%. The results suggest that the two datasets provide insignificantly different results. Additionally, Figure 8 depicts a confusion matrix of six objects: worker, machine, control box, toolbox, tool, and product.
Table 3 details the model's performance on human poses in terms of accuracy, precision, and recall for 21 and 31 videos. In this case, the videos contain a total of 39,005 frames and 37 anomalous events: sitting, bending in front of the machine, and the worker not being visible in the frame. For training and testing, when implementing 21 videos, the model achieved satisfying accuracy rates of 94% and 93%, respectively. Moreover, when implementing 31 videos, the model achieved similarly satisfying accuracies of approximately 95% and 94%, respectively; in addition, no overfitting was found. Furthermore, the model achieved the highest accuracy for the standing, touching, and holding poses (approximately 95% for 21 videos and 96% for 31 videos). Of the overall poses, when implementing 21 and 31 videos, the sitting pose yielded the lowest accuracy of approximately 88% and 90%, respectively. For a clearer description, the confusion matrix of the seven poses (standing, walking, touching, holding, bending, sitting, and squatting) is illustrated in Figure 9.
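The accuracy, precision, and recall values reported in Tables 2 and 3 can be derived directly from a confusion matrix. The sketch below shows the standard computation on a toy two-class matrix (the numbers are illustrative, not the paper's results).

```python
def per_class_metrics(confusion):
    """Accuracy, per-class precision, and per-class recall from a square
    confusion matrix (rows = true class, columns = predicted class)."""
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(n))
    accuracy = correct / total
    precision, recall = [], []
    for c in range(n):
        pred_c = sum(confusion[r][c] for r in range(n))  # column sum
        true_c = sum(confusion[c])                       # row sum
        precision.append(confusion[c][c] / pred_c if pred_c else 0.0)
        recall.append(confusion[c][c] / true_c if true_c else 0.0)
    return accuracy, precision, recall

# Toy 2-class example
cm = [[90, 10],
      [4, 96]]
acc, prec, rec = per_class_metrics(cm)
print(acc)  # 0.93
```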



Conclusions
To resolve problems in manufacturing industries, this study provides a model that detects anomalous behavior in manufacturing environments. The focus is to classify anomalous worker behavior and subsequently label the behavior with descriptive texts. To provide the expected results, relevant scenarios were conducted that encompass recognizing behavior from object detection and human poses based on a threshold. The anomalies were then classified as those related to human pose recognition (emergencies: worker falls, slips, or illness) and those related to the recognition of human poses with object positions (tool breakage and machine failure). Next, worker behavior within the manufacturing environment was classified as normal or anomalous using the worker behavior description algorithm. The proposed model successfully detected anomalous behavior within the tested videos, consisting of anomalous worker behavior and unexpected worker interactions with surrounding objects, while effectively preserving the target with an acceptable object detection accuracy of approximately 97% and the highest pose recognition accuracy of approximately 96% (for standing, touching, and holding); the lowest accuracy was approximately 88% (for the sitting pose). The results did not exhibit significant differences, and the gap was no more than one percent. However, some poses were incorrectly detected. For example, in one case, the model could not properly align the positions of the overall body landmarks, leading to landmark misalignment. In another case, hand landmarks covered by body landmarks were not identified.
Therefore, future studies should improve the current work by expanding the scenarios to real-time localization of anomalies, including tracking multiple targets exhibiting anomalous behavior. In this context, the scope is expected to expand from individual workers to interactions between multiple workers and objects. The goal is to provide a more comprehensive understanding of the manufacturing environment, with larger and more diverse datasets to improve performance. In addition, implementing and deploying the model in real-time monitoring systems is required to create practical, useful models in real-world manufacturing environments. These practices involve optimizing the model's computational efficiency, ensuring low latency, and developing user-friendly interfaces that allow workers and supervisors to interpret and respond to anomaly alerts effectively. Moreover, pre-trained advanced Mask R-CNN versions could further enhance object detection.

Figure 1
Figure 1 illustrates the architecture of the proposed method to provide an overview of the developed model. The model consists of object detection, pose identification, pose classification, and behavior classification and description stages. The purpose is to identify anomalous worker behavior in a manufacturing environment. The output is descriptive text generated based on the identified worker pose and interactions with surrounding objects. The stages are detailed as follows.
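The four stages above can be sketched as a simple per-frame pipeline. The stage functions below are placeholders standing in for Mask R-CNN, MediaPipe Holistic, and the LSTM classifier; their interfaces and return values are assumptions for illustration, not the actual model.

```python
# Minimal sketch of the four-stage pipeline in Figure 1. Each stage function
# is a stub: detect_objects stands in for Mask R-CNN, extract_landmarks for
# MediaPipe Holistic (33 body landmarks), and classify_pose for the LSTM
# operating over a window of frames. All interfaces are hypothetical.

def detect_objects(frame):
    # Stage 1: object detection stub (label, bounding box)
    return [("machine", (60, 40, 220, 220))]

def extract_landmarks(frame):
    # Stage 2: pose identification stub; MediaPipe Pose yields 33 (x, y) points
    return [(0.5, 0.4)] * 33

def classify_pose(landmark_window):
    # Stage 3: pose classification stub over a temporal window of landmarks
    return "standing"

def describe_behavior(pose, objects):
    # Stage 4: behavior classification and descriptive text generation
    anomalous = pose in {"sitting", "bending"}
    label = objects[0][0] if objects else "scene"
    status = "Anomalous" if anomalous else "Normal"
    return f"{status} behavior: a worker is {pose} near the {label}"

def process_frame(frame, window):
    objects = detect_objects(frame)
    window.append(extract_landmarks(frame))
    pose = classify_pose(window)
    return describe_behavior(pose, objects)

result = process_frame(frame=None, window=[])
```

In the real system, the landmark window accumulates across consecutive frames so the LSTM can classify poses from motion over time rather than from a single frame.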


Figure 2.
Figure 2. Examples of normal and anomalous worker behavior: (a) a worker is standing, (b) a worker is touching the control box, (c) a worker is holding a tool, (d) a worker is sitting in front of a machine, (e) a worker is sitting/slipping, and (f) a worker is standing behind a machine (unseen).

Figure 3.
Figure 3. Examples of object detection: (a) bbox and segmentation results for a worker, machine, toolbox, and tool; (b) bbox and segmentation results for a worker, machine, control box, and two toolboxes.



Figure 6.
Figure 6. Behavior classification and recognition of the relation between workers and objects in creating descriptive texts.


Figure 7.
Figure 7. Examples of worker behavior and description (normal/anomalous) results within frames. Normal behavior: (a) the result in the form of a video frame, (b) a worker is standing in front of a machine, (c) a worker is walking, (d) a worker is touching the control box, and (e) a worker is holding a tool. Anomalous behavior: (f) a worker is bending in front of a machine and (g) a worker is sitting in front of a machine.


Figure 10.
Figure 10. Examples of output snapshots with incorrect classification: (a) incorrect alignment of the positions of the overall body landmarks and (b) failure to identify hand landmarks covered by body landmarks.

Table 1.
Hyperparameters of the study.


Table 2.
Evaluation of object detection performance.