Status Recognition Using Pre-Trained YOLOv5 for Sustainable Human-Robot Collaboration (HRC) System in Mold Assembly

Molds are still assembled manually because demand changes frequently and the operation requires comprehensive knowledge as well as high flexibility and adaptability. We propose the application of human-robot collaboration (HRC) systems to improve manual mold assembly. In existing HRC systems, humans control the execution of robot tasks, which causes delays in the operation. Therefore, we propose a status recognition system to enable the early execution of robot tasks without human control during the HRC mold assembly operation. First, we decompose the mold assembly operation into tasks and sub-tasks, and define the actions representing the status of sub-tasks. Second, we develop status recognition based on parts, tools, and actions using a pre-trained YOLOv5 model, a one-stage object detection model. We compared four YOLOv5 models with and without a frozen backbone. The YOLOv5l model without a frozen backbone gave the best performance, with a mean average precision (mAP) of 84.8% and an inference time of 0.0271 s. Given the success of the status recognition, we simulated the mold assembly operations in the HRC environment and reduced the assembly time by 7.84%. This study improves the sustainability of mold assembly from the point of view of human safety, with reductions in human workload and assembly time.


Introduction
The use of robots in manufacturing began in Industry 3.0, when industrial robots were introduced for automated mass production. However, there are challenges in expanding the application of industrial robot systems to mass personalization. Industrial robot systems can work fast with a low error rate. Still, industrial robots are less flexible and require high-cost reconfiguration to cope with the frequent demand changes in mass personalization production. In contrast, manual systems can adapt to changes with lower investment costs, but human workers tend to become fatigued and have a higher error rate [1]. Therefore, the application of human-robot collaboration (HRC) systems in manufacturing has gained attention in Industry 4.0. HRC systems combine the cognitive ability of humans with the consistency and strength of robots to increase the flexibility and adaptability of an automated system [2]. In addition, the key enabling technologies of Industry 4.0, such as artificial intelligence and augmented reality, are integrated into HRC systems to support interaction and collaboration between humans and robots [3]. Although we are still at the stage of realizing Industry 4.0, the term Industry 5.0, which focuses on bringing humans back into the production line and on collaboration between humans and machines, has been introduced [4]. Hence, HRC systems are foreseen to be an active research area in Industry 5.0 as well.
This paper focuses on the application of HRC systems in mold assembly. Most molds are still assembled manually, while full automation systems are implemented in various assembly systems, such as for automotive parts [5][6][7][8] and electronic parts [9,10]. A full automation assembly cell is insufficiently flexible to cope with the frequent changes in low-volume mold assembly production and the wide variety of mold components that vary in weight and geometry. However, musculoskeletal disorders (MSD) caused by heavy part handling and repetitive motion during mold assembly have increased the need for robots in mold assembly [11]. Therefore, we propose the application of the HRC mold assembly cell to overcome the problems in mold assembly. The implementation of collaborative robots can reduce repetitive strain injuries and relieve heavy part lifting for human workers. At the same time, the use of collaborative robots can increase the work consistency and productivity of the assembly cell by integrating computer vision into HRC assembly cells.
This study aims to develop a vision-based status recognition system for a mold assembly operation to achieve a sustainable HRC mold assembly operation by reducing assembly time and improving human working conditions. An assembly operation comprises tasks performed to join or assemble various parts to create a functional model. Each task consists of a series of sub-tasks executed to assemble a specific part at a defined location using a defined tool. The main contributions of this paper are as follows. First, we classify the mold assembly tasks into sub-tasks and actions. In this study, the status of a task is represented by the actions involved. For example, the sequence of actions of the screw-tightening sub-task is as follows: picking up the screwdriver, locating the screwdriver on a screw, tightening the screw by rotating the screwdriver, then returning the screwdriver. Second, we develop a vision-based status recognition system that includes part, tool, and action recognition. We apply the transfer learning technique using a YOLOv5 model (refer to Section 2.2.2) pre-trained on the Common Objects in Context (COCO) dataset [12] to recognize mold components and tools. Then, we identify the status of the sub-task by recognizing defined actions. The expected outcome of this paper is a status recognition method that enables robots to assist humans in the HRC mold assembly operation. In most practical applications, human workers control the robot's execution using a push button. With status recognition, however, the robot's task can be triggered early, before the manual task is completed. This minimizes the robot's idle time and the completion time of the recognized task. Furthermore, we can reduce the production time and the energy consumption to achieve a sustainable HRC mold assembly operation.
The structure of this paper is as follows: Section 2 provides the literature review related to this research, including deep learning-based recognition and transfer learning techniques. Section 3 explains the proposed status recognition system for HRC mold assembly and the methodology. Section 4 describes the experiment and results. Finally, Section 5 concludes our paper and provides the potential future work for our research.

Deep Learning-Based Recognition in HRC Assembly
The Convolutional Neural Network (CNN) is the most common deep learning method used in computer vision tasks. Image classification, object localization, and object detection are the three main computer vision tasks. Image classification seeks to classify the image by assigning it to a specific label [13]. AlexNet [14], ResNet [15], VGGNet [16], Inception Net, and GoogLeNet [17] are the most common CNN architectures that researchers on image classification have implemented. Object localization takes an image with one or more objects as input and identifies the objects' locations with bounding boxes. The combination of image classification and object localization results in object detection or recognition, which identifies the classes of the located objects [13]. The most common deep learning-based object recognition models are R-CNN (Region-based Convolutional Neural Network) [18], Fast R-CNN [19], Faster R-CNN [20], Mask R-CNN [21], and You Only Look Once (YOLO) [22][23][24][25] models. The R-CNN family consists of two-stage object detectors that extract the regions of interest (ROIs), then perform feature extraction and classify objects only within the ROIs. Hence, two-stage object detectors require longer detection times than one-stage object detectors. YOLO models are one-stage detectors that directly classify and regress the candidate bounding boxes without extracting ROIs. Our study detects the parts and tools required during an assembly operation so as to recognize a task and then estimates the task's progress based on the position of parts or tools. Therefore, we focus on object recognition that involves classification and detection tasks.
The training data used in the existing recognition systems were raw data, collected using wearable sensors, and image data, such as an image captured during operation, and images of spatial and frequency domains derived from sensors. Uzunovic et al. [26] introduced a conceptual task-based robot control system that received human activity recognition and robot capability inputs. They recognized the ten human activities in the car production environment based on the data from nineteen wearable sensors on both arms using machine learning models. However, the attachment of wearable sensors on the human worker caused discomfort during the practical assembly operation. Furthermore, deep learning algorithms in computer vision allow us to perform motion recognition better using assembly videos or images. Researchers have developed deep learning-based approaches using images to recognize common tasks based on different recognition algorithms: gestures or motion recognition and combinations of part and motion recognition in manufacturing assemblies [27,28]. Therefore, this study focuses on applying deep learning algorithms to assembly videos or image data for task recognition.
Research on action and phase recognition has been developed and applied widely for common human activities and surgical applications. Still, the related research on manufacturing assembly applications is worth exploring. Wen et al. [29] used a 3D CNN to recognize seven human tasks in visual controller assembly for the learning process of the robot. They split eleven collected assembly videos into seven labeled segments representing the seven tasks, and performed data augmentation to increase the dataset for training. The accuracy of the task recognition was only 82% because of the small training dataset and the environmental changes during the assembly operation. Wang et al. [28] used two AlexNets for human motion recognition and part/tool identification, respectively. They recognized grasping, holding, and assembling motions to identify human intention. For part/tool identification, a screwdriver, small parts, and large parts were the only classes included in their study. Consequently, they only tested the proposed method on a simple assembly that involved a single tool and limited types of parts.
Chen et al. [27] implemented the YOLOv3 algorithm to detect tools for assembly action recognition and the convolutional pose machine (CPM) to estimate the poses and operating times of the repetitive assembly actions. They tested the algorithm on three assembly actions: filling, hammering, and nut screwing. They only estimated the operating times using the cycle of the action curve, and not the progress or the remaining operating times. Action recognition based on the tool alone is inefficient for monitoring the assembly progress because different tasks may require the same tool. Chen et al. [30] extended the previous study [27] and proposed a 3D CNN model with batch normalization for assembly action recognition to reduce the environmental effect and improve recognition speed. In addition, they employed fully convolutional networks (FCN) to perform depth image segmentation, in order to recognize different parts from assembled products for assembly sequence inspection. They recognized parts using computer-aided design (CAD) models instead of original parts, and compared the accuracy and training time for RGB, binary, gray, and depth images. The results show that the RGB image data gave the highest accuracy, but the training time was longer than for the gray image dataset.
The performance of the developed recognition models in the existing research for assembly applications is worse than expected due to the limited dataset and the environmental changes during the assembly operation. Therefore, this study uses a transfer learning technique to overcome these problems.

Transfer Learning
Transfer learning is a technique that uses the pre-trained model on other large datasets, such as ImageNet [31] or COCO [12], to train the model on custom data for a new but related problem. This technique helps speed up the development and training process with a small dataset that limits the deep learning model's capacity to be trained from scratch [32].
Deep learning models with transfer learning have been applied in various fields, such as computer vision and natural language processing. Computer vision includes recognizing objects, activities, and scenes that usually require numerous labeled image datasets. However, it is difficult to obtain large-scale labeled data in most practical applications. This problem can be solved using the transfer technique to transfer the knowledge from a source domain to a target domain. The integration of transfer learning with a pre-trained CNN, such as AlexNet, ResNet, or VGG, often solves vision-based recognition tasks [33]. We can implement transfer learning on a convolutional neural network using two approaches. First, we freeze the convolutional layers and use the pre-trained model as a feature extractor. The second approach is fine-tuning, whereby we freeze the initial layers and unfreeze deeper convolutional layers. The unfrozen convolutional layers are trained to update the weights. If we have limited new data, we can apply the first approach to prevent overfitting. On the other hand, we can use the second approach with larger datasets to train the deeper layers to detect task-specific features [34].
In this paper, we focus on applying transfer learning in object and action recognition in manufacturing assemblies. Židek et al. [35] applied transfer learning to detect assembly parts and product features. They tested two pre-trained models trained on the COCO dataset: Mobilenet V2 and Fast RCNN Inception V2, for screw and nut recognition. To apply transfer learning in assembly action recognition, Liu and Wang [36] implemented a transfer learning-based human poses recognition method in a collision-free HRC system to recognize operator's poses with low computational expense. Besides this, Tao et al. [37] applied transfer learning to perform real-time operation recognition during desktop CNC carving machine assembly. They chose the pre-trained DenseNet model because it performed the best among the pre-trained VGG, ResNet, and DenseNet models. They used the pre-trained model trained on ImageNet to recognize ten sequential operations and achieved a 95% recognition accuracy. An assembly task contains parts to be assembled, the tool used, and the action. Therefore, Wang et al. [28] implemented two pre-trained AlexNet models trained on ImageNet to recognize three actions (grasping, holding, and assembling) and the part/tool (small, large parts, and screwdriver), respectively. They adapted AlexNet trained on ImageNet because ImageNet contains image categories of human actions and tools related to manufacturing.
The existing research shows that the application of transfer learning on a pre-trained model improved the accuracy, even with a small dataset. In this study, we aim to detect and localize the parts and tools used during the assembly operation. Therefore, we use the YOLO model instead of image classification CNN models.

YOLO Algorithm
An object detection task identifies the objects present in an image and determines their locations in the image. YOLO is an object detector that detects objects in images and localizes them directly as bounding box coordinates and class probabilities [38]. First, the image is divided into an S×S grid. The grid cell that contains the center of an object is responsible for detecting that object. Each grid cell predicts bounding boxes, the confidence scores of the boxes, and the class probabilities given that the grid cell contains an object. The first YOLO network had 24 convolutional layers followed by two fully connected layers [22]. Redmon et al. [23] introduced a few improvements in YOLOv2. They added batch normalization on all the convolutional layers and used a high-resolution classification network to increase the mean average precision. In addition, they used the k-means clustering method to cluster bounding boxes so that the grid cell could detect more than one object. YOLOv3 uses a feature extractor network known as Darknet-53 and improves the detection accuracy and speed [24]. However, YOLOv3 performed worse than the previous YOLO version on medium and large objects. Bochkovskiy et al. [25] proposed YOLOv4, consisting of CSPDarknet53 as a backbone network and spatial pyramid pooling (SPP), with PANet as the neck part and YOLOv3 as the head part. YOLOv5 is the latest YOLO version; it uses the CBL (Conv2D + batch normalization + LeakyReLU) module as the basic convolution module and the BottleneckCSP module for feature extraction [39,40]. YOLOv5 includes different models, such as YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x, which differ in the width and depth of the BottleneckCSP module [41].
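As a minimal sketch of the grid assignment described above, the following Python function returns the grid cell responsible for an object whose center is given in normalized image coordinates. The function name, the coordinate convention (values in [0, 1], origin at the top-left corner), and the value S = 7 in the usage line are illustrative assumptions and are not taken from the models used in this study.

```python
def responsible_cell(x_center, y_center, s):
    """Return (row, col) of the S x S grid cell that contains a box center.

    x_center and y_center are assumed to be normalized to [0, 1] relative
    to the image width and height; s is the number of cells per side (S).
    """
    if not (0.0 <= x_center <= 1.0 and 0.0 <= y_center <= 1.0):
        raise ValueError("center coordinates must be normalized to [0, 1]")
    # Clamp so that a center lying exactly on the right or bottom edge (1.0)
    # is assigned to the last cell instead of falling outside the grid.
    col = min(int(x_center * s), s - 1)
    row = min(int(y_center * s), s - 1)
    return row, col


# An object centered in the middle of the image, with an illustrative S = 7.
print(responsible_cell(0.5, 0.5, 7))  # (3, 3)
```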
YOLOv5 is an object detection model trained on the COCO dataset, which contains 80 classes and more than 200,000 labeled images. This study applies transfer learning on the pre-trained YOLOv5 model for part, tool, and action recognition in the context of status recognition in the HRC mold assembly operation. Figure 1 illustrates the proposed conceptual framework of status recognition. An assembly operation consists of tasks for joining or assembling various parts to create a functional model. We define a task as a series of sub-tasks executed to assemble a specific part at the designated item using a defined tool. Thus, we decompose the assembly operation into tasks and sub-tasks and define actions that represent the status of a sub-task. Status recognition identifies an assembly task based on the recognized unique part, and recognizes the status of the task based on the actions that are decomposed from sub-tasks.

Status Recognition for HRC Mold Assembly Operation

Decomposition of Mold Assembly Operation
This paper focuses on a two-plate mold assembly operation that consists of core and cavity sub-assemblies. Table 1 lists the sixteen mold assembly tasks and each corresponding part. Each task assembles a unique mold part and joining part, such as screws and pins. Since the unique part assembled in each task is not repeated in other tasks, we can recognize a task by recognizing the corresponding unique part.

Table 1. List of tasks in a two-plate mold assembly. (Columns: No., Task, Part.)

Figure 1. Proposed status recognition model for HRC system.

In this paper, we define a task as the assembly of one component, carried out as a series of sub-tasks that assemble and join components. We also categorize sub-tasks in mold assembly into nine categories [11]. Table 2 lists the tools used in each sub-task. We must perform a series of sub-tasks on the component and the corresponding joining components, such as screws and pins, to complete a task. We further decompose these sub-tasks into a series of actions for status recognition purposes. Mold assembly requires two types of tools: the hammer and hex-keys. Some sub-tasks are executed with a tool and others without any tool. For sub-tasks that require a tool, we must include the actions to handle the tool.

Table 2. Categories of sub-tasks for mold assembly and tool used. (Columns: Code, Description of Sub-Tasks, Tool.)

The identification of actions plays an essential role in status recognition. Generally, a sub-task starts with a hand approaching the part or tool, and ends with an empty hand leaving the assembly area or returning the tool. Based on the common actions, we can summarize that a sub-task starts when the hand approaches the part and ends when the hand leaves the assembly area. The common actions in a sub-task can be listed as follows:
• Picking or grasping part/tool;
• Positioning part;
• Assembly using a tool, such as tightening a screw or inserting a pin;
• Leaving assembly area with an empty hand.
This study aims to develop a status recognition system for HRC mold assembly based on object and action recognition. The status recognition consists of two stages, as shown in Figure 2. In the first stage, we recognize a task by recognizing parts and tools. In the second stage, we recognize the status of a sub-task based on the executed action. Figure 2 shows an example of the stages of the proposed status recognition model. We decompose the task "Assemble location ring" into two sub-tasks: "Lift and place location ring" and "Insert and tighten the screw". During the first stage, we recognize the location ring as the part, and the screws and hex-key as the tools, in order to identify the sub-tasks. Then, we recognize the status based on the actions defined for the sub-task. During the execution of "Insert and tighten the screw", the defined action sequence is "insert screws", "tighten screws" with the hex-key, then leave the assembly area. Status recognition plays an essential role in enabling the robot to identify the status of the manual task and execute the subsequent task, which we leave to a future study.

Figure 2. Decomposition of "Assemble location ring" task in proposed status recognition model.
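The two-stage procedure described above can be sketched in Python as two lookups: one from a detected unique part to its task, and one from a detected action to its position in the action sequence of a sub-task. This is only an illustration under stated assumptions: the dictionary names, the function names, and the exact label strings are hypothetical, and only the "Assemble location ring" example from the text is encoded.

```python
# Hypothetical lookup tables; the label strings are illustrative and only
# cover the "Assemble location ring" example given in the text.
TASK_BY_UNIQUE_PART = {
    "location ring": "Assemble location ring",
}
ACTION_SEQUENCE = {
    "Insert and tighten the screw": [
        "insert screws",
        "tighten screws",
        "leave assembly area",
    ],
}


def recognize_task(detected_labels):
    """Stage 1: identify the task from the unique part among the detections."""
    for label in detected_labels:
        if label in TASK_BY_UNIQUE_PART:
            return TASK_BY_UNIQUE_PART[label]
    return None  # no unique part detected, so the task is unknown


def recognize_status(sub_task, detected_action):
    """Stage 2: return (current step, total steps) for a detected action."""
    sequence = ACTION_SEQUENCE[sub_task]
    if detected_action not in sequence:
        return None  # the action does not belong to this sub-task
    return sequence.index(detected_action) + 1, len(sequence)


labels = ["hex-key", "screw", "location ring"]
print(recognize_task(labels))
print(recognize_status("Insert and tighten the screw", "tighten screws"))
```

In this sketch, a robot controller could start preparing the next task once the reported step approaches the end of the sequence, which is the early execution the status recognition is meant to enable.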

Implementation of YOLOv5 and Transfer Learning
In this paper, we use the YOLOv5 model to develop a status recognition model because YOLOv5 has been proven to perform better in detection speed compared to R-CNN families [42,43]. We aim to implement status recognition in real-time task reassignment and task execution in a future study. Therefore, the fast detection speed of the YOLO model is an important characteristic that enables us to recognize objects and actions in real-time during the assembly operation. Since we do not have a large assembly parts and tool images dataset, we implement a pre-trained YOLOv5 model instead of building a model from scratch. In other words, we apply the transfer learning technique using a pre-trained YOLOv5 model to recognize assembly parts and tools based on small image datasets.

Data Collection and Processing
In this paper, we focus on a two-plate mold assembly operation, as shown in Figure 2. We need three image datasets to train the status recognition model: parts, tools, and hand actions. For the parts, we categorized the mold parts into seven types based on their geometric shape. In addition, we collected images of tools, such as pins, screws, guide pins, sprue bushings, and location rings, from the internet. The mold assembly operation requires two types of tools: the hammer and the hex-key. Then, we captured images from a YouTube video for the actions representing the status of the sub-task during mold assembly [44]. After gathering the images, we increased the number of training images by rotating them by 90, 180, and 270 degrees. We then used the LabelImg data annotation tool to label the images and create annotation files in the YOLO format [45]. Finally, we partitioned the dataset into training and testing sets containing 80% and 20% of the data, respectively. We also implemented k-fold cross-validation with the YOLOv5m model to evaluate its effect on model performance. We divided the dataset into five folds (i.e., k = 5), with 80% of the data used for training and 20% for validation in each fold.
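The rotation augmentation and the five-fold split can be illustrated with a short Python sketch. In this study the images were labeled after rotation, so the label transformation below is not the authors' procedure; it only shows how a YOLO-format box (x_center, y_center, width, height, all normalized) would change if existing annotations were rotated together with the image. The counterclockwise direction, the function names, and the contiguous (unshuffled) fold layout are assumptions.

```python
def rotate_yolo_box(box, quarter_turns):
    """Rotate a YOLO-format box by quarter_turns * 90 degrees counterclockwise.

    box is (x_center, y_center, width, height), normalized to [0, 1], with
    the image origin at the top-left corner and the y axis pointing down.
    """
    x, y, w, h = box
    for _ in range(quarter_turns % 4):
        # One counterclockwise quarter turn maps the point (x, y) to
        # (y, 1 - x) and swaps the box width and height.
        x, y, w, h = y, 1.0 - x, h, w
    return (x, y, w, h)


def k_fold_indices(n_samples, k):
    """Yield (train_indices, val_indices) for k contiguous folds.

    Shuffle the sample order beforehand if the data are sorted by class.
    """
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        stop = start + size
        val = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, val
        start = stop


box = (0.25, 0.5, 0.25, 0.5)
for turns in (1, 2, 3):  # 90, 180, and 270 degrees
    print(turns * 90, rotate_yolo_box(box, turns))

for train, val in k_fold_indices(10, 5):  # each fold is an 80/20 split
    print(len(train), len(val))
```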

Transfer Learning
We trained the models on the Windows 10 operating system using the PyTorch 1.7.0 framework with a single NVIDIA GeForce RTX 2080 Ti GPU. In this paper, we used the YOLOv5 models pre-trained on the COCO dataset. We trained the YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x models on our datasets to compare their performances. We downloaded the weights of the pre-trained models and used them as the initial weights for training. In the YOLOv5 model, the backbone acts as the feature extractor, and the head locates the bounding boxes and classifies the objects in each box. Therefore, we froze the backbone to use the YOLOv5 models as feature extractors and trained the head using the collected training datasets.
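Freezing the backbone amounts to excluding its parameters from the weight update. The sketch below shows one way to select those parameters by name in plain Python. It assumes that the backbone occupies the first ten top-level modules and that parameter names start with "model.<index>.", which is consistent with the setting F = 10 used later in the comparison but is not confirmed by the text; the function name and the example parameter names are hypothetical. In PyTorch, the selected parameters would then have `requires_grad` set to `False`.

```python
def split_frozen(param_names, n_frozen_modules):
    """Partition parameter names into (frozen, trainable) lists.

    A parameter is treated as frozen when its name starts with
    "model.<i>." for some module index i < n_frozen_modules.
    """
    prefixes = tuple(f"model.{i}." for i in range(n_frozen_modules))
    frozen = [name for name in param_names if name.startswith(prefixes)]
    trainable = [name for name in param_names if not name.startswith(prefixes)]
    return frozen, trainable


# Hypothetical parameter names, for illustration only.
names = [
    "model.0.conv.weight",
    "model.9.cv1.conv.weight",
    "model.10.conv.weight",
    "model.24.m.0.weight",
]
print(split_frozen(names, 10))  # frozen backbone (F = 10)
print(split_frozen(names, 0))   # nothing frozen (F = 0)
```

The trailing dot in each prefix matters: without it, "model.1" would also match "model.10" and freeze part of the head.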
For YOLOv5 training parameter setting, we set the image size to 640 × 640 because the images collected from the internet have different sizes close to 640 × 640 pixels. We trained the models by employing different batch sizes and numbers of epochs with early stopping conditions. We obtained the best precision and weight from trial-and-error experiments by setting the batch size as 8 and using 600 epochs, with a learning rate = 0.01.

Comparison and Results
In this section, we compare the performances of different pre-trained YOLOv5 models and two conditions of transfer learning. The four YOLOv5 models included in the comparison are YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x. We compared the performances of models with freezing of the backbone (F = 10) and without freezing any layers (F = 0). We evaluated the performance of the recognition models based on the mean average precision (mAP). The mAP value is calculated by averaging the average precision over all classes, and it can additionally be averaged over several intersection over union (IoU) thresholds. The higher the value, the better the average detection accuracy [40]. Tables 3 and 4 compare the mAP values and inference times of four different YOLOv5 models with pre-trained weights, respectively. Figure 3 shows the results of tool detection for the hammer and hex-key. Some examples of part detection are shown in Figure 4. Figure 5 illustrates the status recognition of the "Assemble location ring" task (Figure 2), which consists of three actions: "Position plate", "Insert screw", and "Tighten screw". Figure 5d shows the completion of the "Assemble location ring" task, where no action is detected. Table 5 compares the performances of the pre-trained YOLOv5 models with and without freezing of the backbone layers in order to detect the parts, tools, and statuses of the "Assemble location ring" task. We evaluated the performances of the YOLOv5 models by comparing the class probabilities of the statuses and parts/tools detected. The values in Tables 5 and 6 indicate the class probabilities. For YOLOv5s (F = 10) in Table 5, the value of 0.71 indicates that the YOLOv5s model with a frozen backbone detected the "Position plate" status with a probability of 0.71. However, the YOLOv5s (F = 10) model could not detect the hex-key tool, which is indicated by "X".
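To make the two quantities behind the mAP concrete, the following Python sketch computes the IoU of two boxes and the mean of per-class average precision values. It assumes corner-format boxes (x_min, y_min, x_max, y_max) and takes the per-class AP values as given; the function names are hypothetical, and the numbers in the usage lines are placeholders, not results from this study.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def mean_average_precision(ap_per_class):
    """Mean of the per-class average precision values (dict: class -> AP)."""
    return sum(ap_per_class.values()) / len(ap_per_class)


# Two 2 x 2 boxes overlapping in a 1 x 1 square: IoU = 1 / (4 + 4 - 1).
print(iou((0, 0, 2, 2), (1, 1, 3, 3)))
# Placeholder AP values, not taken from the tables in this paper.
print(mean_average_precision({"hammer": 0.5, "hex-key": 1.0}))  # 0.75
```

A detection counts as correct for a given IoU threshold only when its IoU with a ground-truth box of the same class reaches that threshold, which is why the reported mAP depends on the threshold or range of thresholds used.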
We set the confidence threshold to 0.4, meaning that any class probability lower than 0.4 is considered "not detected", which is indicated by "X" in Tables 5 and 6. All the models in Table 5 were trained using image datasets that consist of all part, tool, and action classes, and we ran inference using the resulting weights (W_all). The image dataset for action classes was smaller than that for part and tool classes. Hence, we trained the YOLOv5m model without freezing again using a dataset consisting of action classes only, and then performed inference using both the weights trained on all classes and the weights trained on action classes only (W_all & W_action). As mentioned in Section 3.2.1, we also trained YOLOv5m using the k-fold cross-validation method, where k = 5. Thus, we used the weights obtained from 5-fold cross-validation training (W_k=5) to evaluate the effect of cross-validation on inference. Table 6 compares the recognition ability of the YOLOv5m model without freezing using different weights.

Table 5. Performance comparison of detecting parts, tools, and statuses during the "Assemble location ring" sub-task using different YOLOv5 models (F = 0: without freezing; F = 10: freeze backbone). Columns: Status, Part/Tool, YOLOv5s, YOLOv5m, YOLOv5l, YOLOv5x.

Table 6. Performance comparison of detecting parts, tools, and statuses during the "Assemble location ring" sub-task using different trained weights (W_all: weights trained using all classes; W_action: weights trained using action classes only; W_k=5: weights trained using 5-fold cross-validation).
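The thresholding convention used in Tables 5 and 6 can be sketched as follows. This is an illustrative helper only: the names `to_table_cell` and `summarize` are hypothetical, and apart from the 0.71 value quoted in the text, the probabilities in the example are made up.

```python
from typing import Dict, Optional

CONF_THRESHOLD = 0.4  # confidence threshold stated in the text


def to_table_cell(class_probability: Optional[float],
                  threshold: float = CONF_THRESHOLD) -> str:
    """Format one table entry: the probability if detected, otherwise "X"."""
    if class_probability is None or class_probability < threshold:
        return "X"
    return f"{class_probability:.2f}"


def summarize(detections: Dict[str, Optional[float]]) -> Dict[str, str]:
    """Apply the threshold to every class in a detection result."""
    return {name: to_table_cell(p) for name, p in detections.items()}


# 0.71 is the value quoted in the text; the other entries are hypothetical.
print(summarize({"Position plate": 0.71, "Hex-key": 0.32, "Location ring": None}))
```

A probability exactly equal to the threshold is kept as detected here, since the text treats only values lower than 0.4 as "not detected".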

Discussion
From Table 3, we see that the pre-trained YOLOv5x model without freezing has the best average mAP score (85.6%) among the YOLOv5 models. All the models except YOLOv5s performed better without freezing layers than with the backbone frozen. As mentioned in Section 3.2.2, the backbone acts as the feature extractor, and the head locates bounding boxes and classifies the objects. This result suggests that training the backbone enables the larger YOLOv5 models to adapt their feature extraction to the new dataset before locating the bounding boxes and classifying the objects.
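For readers less familiar with the metric, the mAP scores compared here can be understood through the following minimal sketch. It assumes a VOC-style all-point interpolation at a single IoU threshold; evaluation toolkits differ in the interpolation details, so this is not necessarily how the reported scores were computed. The precision-recall points in the example are hypothetical.

```python
from typing import Dict, List, Tuple


def average_precision(recalls: List[float], precisions: List[float]) -> float:
    """Area under the precision-recall curve (all-point interpolation).

    Points must be sorted by increasing recall.
    """
    r = [0.0] + list(recalls) + [1.0]
    p = [1.0] + list(precisions) + [0.0]
    # Build the precision envelope: precision never increases with recall.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))


def mean_average_precision(curves: Dict[str, Tuple[List[float], List[float]]]) -> float:
    """Mean of the per-class average precision values."""
    aps = [average_precision(r, p) for r, p in curves.values()]
    return sum(aps) / len(aps)


# Hypothetical precision-recall points for two tool classes.
curves = {
    "hammer": ([0.5, 1.0], [1.0, 0.5]),
    "hex-key": ([1.0], [1.0]),
}
print(mean_average_precision(curves))  # 0.875
```

Averaging over several IoU thresholds would repeat this calculation at each threshold and take the mean of the results.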
We set the confidence threshold to 0.4 and used the best weights from training to run the inference for all models. The YOLOv5l models, both with and without freezing, can recognize all the parts, tools, and statuses (Table 5). The YOLOv5s model achieved the fastest average inference time, 0.0148 s, but it could not detect some statuses of the sub-tasks. The YOLOv5x model without freezing achieved the best mAP score (85.6%) and could detect all parts and statuses, but it had a longer inference time, 0.035 s. Compared to the YOLOv5x model, the YOLOv5l model without freezing achieved a slightly lower mAP score (84.8%) but a shorter inference time, 0.0271 s. The inference time of all models was less than 0.04 s, whereas the quickest action execution time of a mold assembly operation in the simulation was 5 s, so inference is much shorter than the duration of a single action. We tested the YOLOv5l model's ability to recognize the statuses of the "Assemble return pin" and "Assemble ejection plate" tasks to show the model's compatibility with other tasks. The two statuses of the "Assemble return pin" task are "Insert pin" and "Hammering", as shown in Figure 6. Figure 7 shows the status recognition of "Assemble ejection plate", which consists of "Position plate", "Insert screw", and "Tighten screw". Therefore, we can conclude that the pre-trained YOLOv5l model without freezing performed the best, and can be implemented in practical HRC assembly operations for task and status recognition.
In this study, we mixed images from the internet and images from an assembly video to train the model. Our collected datasets had different backgrounds, and the number of images for each class was uneven. These limitations may explain why the mAP score was lower than 90% for all the models. Smaller YOLOv5 models, such as YOLOv5s and YOLOv5m, failed to detect some actions when using images from assembly videos because of the lower number of action images, different scenes, and environmental changes.
The number of images available for part/tool classes was greater than that available for action classes. We therefore trained the YOLOv5m model without freezing separately using only the actions dataset. Then, we combined the weights obtained from action-only training (W_action) with the previous weights (W_all) to investigate the improvement in performance (see Table 6). Inference using W_all only was unable to detect the hex-key and location ring. By adding W_action to the inference, we could detect the hex-key with a class probability of 0.57, but the location ring was still not detected. Inference using the weights trained with 5-fold cross-validation (W_k=5) increased the class probability of the hex-key to 0.81, compared with 0.57 using W_all & W_action. In addition, inference using W_k=5 detected all the statuses, parts, and tools with class probabilities higher than 0.67. We found that inference using both weights (W_all & W_action) was better than using W_all alone, but the inference time increased from 0.0253 s to 0.0399 s. We also applied 5-fold cross-validation to the YOLOv5m model without freezing to increase the recognition performance. The average mAP score using 5-fold cross-validation was 91.48%, an improvement of 7.75% in the model's mAP score, but the inference time increased 2.4-fold, as shown in Table 6. Both methods improved the detection performance. However, the training time and inference time were longer than those of the basic models, especially when using 5-fold cross-validation. We aim to implement this study in real-time assembly operations that recognize tasks and statuses as fast as possible. Thus, in the future, we will collect images from an HRC mold assembly operation for model training to improve recognition accuracy.
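The text does not spell out how the outputs of the two weight sets were merged at inference, so the following is only an illustrative assumption rather than the method used in this study: each model is run on the same image and, for every class, the highest confidence above the threshold is kept. The function name `combine_detections` and all probabilities in the example are hypothetical.

```python
from typing import Dict


def combine_detections(*model_outputs: Dict[str, float],
                       threshold: float = 0.4) -> Dict[str, float]:
    """Merge per-class confidences from several models (assumed max rule)."""
    combined: Dict[str, float] = {}
    for output in model_outputs:
        for name, conf in output.items():
            # Keep a class only if it passes the threshold, and keep the
            # highest confidence seen across the models.
            if conf >= threshold and conf > combined.get(name, 0.0):
                combined[name] = conf
    return combined


# Hypothetical outputs of a model trained on all classes and a model
# trained on action classes only.
w_all_output = {"Tighten screw": 0.45, "Hex-key": 0.30}
w_action_output = {"Tighten screw": 0.80}
print(combine_detections(w_all_output, w_action_output))
```

Running more than one model per image is consistent with the longer inference time reported for the combined weights, since each additional model adds its own forward pass.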
Our previous study developed a task allocation model for HRC mold assembly composed of one human and two robots with a flexible collaboration mode [46]. Based on the task allocation results of that study, the robot tasks that follow manual tasks were identified as "pick and place" and "screw tightening". In the simulation, we divided the robot's "pick and place" task into pick, move, and place motions. The pick and move motions do not interfere with the human in the assembly area. In other words, a robot can start to pick and move a part for the next task after the human worker has moved the part for the current task, even while the human worker is still assembling that part in the assembly area. Therefore, we can reduce the time a robot takes to pick and move parts based on the recognized status. Since the robot picks and moves parts while the manual assembly task is in progress, the effective time of a robot task is only the time required for the robot to place and position a part in the assembly area. In the simulation, the average time for a robot to place a part was five seconds. For "screw tightening", the robot tightens the screws after the human worker inserts them. In the previous simulation, the human worker inserted all the required screws (four screws to assemble the bottom clamp plate), and the robot began screw tightening only after all the screws had been inserted. With status recognition, the robot can instead begin tightening the first screw once the human's hand has left the first screw position. Based on the previous study (refer to Table 7), the human worker required twenty seconds to pick and insert four screws (t_i), while the robot required sixteen seconds to tighten the screws (t_j). This means that the time required for both actions (t_i + t_j) was thirty-six seconds. Using the status recognition of screw insertion by the human, we designed the robot to tighten the first screw after the human worker had inserted the second screw, for safety reasons.
In other words, the robot starts tightening screws ten seconds earlier. Therefore, the execution time for both actions was reduced from thirty-six seconds to twenty-six seconds. There were four "screw tightening" tasks performed by the robot after the human worker had inserted screws, and two "pick and place" tasks performed by the robot(s) after the human worker had assembled the part (see Table 7). The assembly time without early execution was 638 s (10 min 38 s). With early execution, we eliminated 50 s of the robot's idle time. Hence, we achieved a 7.84% reduction in assembly time (50 s out of 638 s) based on the simulation. Since this result is based on simulation time, we are in the process of establishing an HRC mold assembly testbed to evaluate the practical applications of this study in the future. This study shows that task and status recognition during mold assembly operation is achievable using transfer learning on a pre-trained YOLOv5 model, even with small image datasets. The criteria for measuring sustainability introduced in Sustainable Artificial Intelligence (AI) include the use of reusable data and the training of the algorithm [47].
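The early-execution timing figures reported above can be reproduced with a short calculation. This is only a sketch under stated assumptions: screw insertion and tightening are taken to be evenly spaced (5 s and 4 s per screw, derived from the 20 s and 16 s totals), and the function names are hypothetical.

```python
def early_execution_time(t_insert_total: float, t_tighten_total: float,
                         n_screws: int, start_after: int) -> float:
    """Finish time when the robot starts after `start_after` screws are inserted.

    Assumes evenly spaced insertion and tightening times (an assumption
    made for this sketch, not a figure from the study).
    """
    per_insert = t_insert_total / n_screws
    per_tighten = t_tighten_total / n_screws
    # The robot may begin once the chosen number of screws has been inserted.
    robot_free = per_insert * start_after
    for k in range(1, n_screws + 1):
        inserted_at = per_insert * k
        # A screw can only be tightened after it has been inserted.
        robot_free = max(robot_free, inserted_at) + per_tighten
    return robot_free


def percent_reduction(total_time: float, saved_time: float) -> float:
    """Time saved as a percentage of the original total."""
    return 100.0 * saved_time / total_time


print(early_execution_time(20, 16, 4, start_after=4))  # 36.0 (no early execution)
print(early_execution_time(20, 16, 4, start_after=2))  # 26.0 (early execution)
print(round(percent_reduction(638, 50), 2))            # 7.84
```

Under these assumptions the robot never waits for a screw once it has started, which is why the saving equals the ten-second head start.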
A common rule of thumb is about 1000 images per class for developing object recognition using deep learning. Furthermore, training a YOLO model from scratch may require up to several days. However, this study reduced the number of images and the training time by using pre-trained YOLOv5 models. We spent one and a half hours training the YOLOv5s model without freezing, and up to seven and a half hours training the YOLOv5x model without freezing. With the developed recognition model, the robot involved in HRC mold assembly can identify the status of the current manual task in order to execute the subsequent task earlier without a signal from the human worker. In addition, the robot can avoid collision with a human by detecting human hands in the assembly area. In the future, we will implement this recognition model in a two-robot HRC mold assembly testbed to reduce the human workload and improve the efficiency of the mold assembly operation. Therefore, this study supports sustainability in terms of human safety, working conditions, and reductions in assembly time.

Conclusions
This study presents the development of task and status recognition for an HRC mold assembly operation. The proposed recognition model consists of a task recognition stage, which utilizes part and tool detection, and a status recognition stage, which identifies the status of a task based on the human action.
Before developing the recognition model, we decomposed the assembly operation into tasks, sub-tasks, and actions. The sub-tasks contain information on the parts and tools used. Then, we decomposed the sub-tasks into a series of actions that define the status of the task. We collected images of the parts, tools, and defined actions to train the recognition model. We used a pre-trained YOLOv5 model to develop the model because of the limited dataset available. We selected the pre-trained YOLOv5l model without freezing layers to implement the task and status recognition because it showed the best balance of accuracy and inference time among all YOLOv5 models. In addition, smaller YOLOv5 models could not detect all the statuses and parts because of the uneven number of images in each class. We re-trained the YOLOv5m model with only images of action classes and with a 5-fold cross-validation method. Then, we combined the weights during inference to investigate the detection ability. We found that the 5-fold cross-validation method improved the average mAP score and detection ability, but the inference time increased 2.4 times.
We are currently pursuing physical experiments using an HRC assembly cell testbed to further evaluate and verify the real-time practical implementation of this study. In this study, we focused on recognizing the manual task. However, it is also necessary to recognize robot tasks to enable progress estimation and communication between humans and robots. Furthermore, we will expand the developed task recognition model to estimate the progress of the recognized task based on object tracking, and to estimate the completion time of the recognized task.

Data Availability Statement: Data available on request due to restrictions. The data presented in this study are available on request from the corresponding author. The data are not publicly available because they are also part of the authors' ongoing research.