Evaluation of Vision-Based Hand Tool Tracking Methods for Quality Assessment and Training in Human-Centered Industry 4.0

Smart industrial workstations for the training and evaluation of workers are an innovative approach to face the problems of manufacturing quality assessment and fast training. However, such products do not implement algorithms that are able to accurately track the pose of a hand tool that might also be partially occluded by the operator's hands. In the best case, the already proposed systems roughly track the position of the operator's hand center assuming that a certain task has been performed if the hand center position is close enough to a specified area. The problem of the pose estimation of 3D objects, including the hand tool, is an open and debated problem. The methods that lead to high accuracies are time consuming and require a 3D model of the object to detect, which is why they cannot be adopted for a real-time training system. The rise in deep learning has stimulated the search for better-performing vision-based solutions. Nevertheless, the problem of hand tool pose estimation for assembly and training procedures appears to not have been extensively investigated. In this study, four different vision-based methods based, respectively, on ArUco markers, OpenPose, Azure Kinect Body Tracking and the YOLO network have been proposed in order to estimate the position of a specific point of interest of the tool that has to be tracked in real-time during an assembly or maintenance procedure. The proposed approaches have been tested on a real scenario with four users handling a power drill simulating three different conditions during an assembly procedure. The performance of the methods has been evaluated and compared with the HTC Vive tracking system as a benchmark. Then, the advantages and drawbacks in terms of the accuracy and invasiveness of the method have been discussed. The authors can state that OpenPose is the most robust proposal arising from the study. The authors will investigate the OpenPose performance in more depth in further studies. The framework appears to be very interesting regarding its integration into a smart workstation for quality assessment and training.


Introduction
The Industry 4.0 revolution has affirmed the centrality of the human operator, proposing a new way to automate production processes based on human and robotic expertise synergy. The impact of new technologies, such as collaborative robots (co-bots), virtual reality (VR) and augmented reality (AR) [1], wearable devices, Internet of Things (IoT) and artificial vision [2,3], allows the companies to streamline their processes, ensuring better quality and enhancing the flexibility of their production potential [4,5]. From this perspective, the human-centered approach has once again been reaffirmed in the automotive sector in the era of electric vehicles [6]; as an example, the technological processes of battery pack assembly require several manual procedures, especially in the completion of electrical connections.
The continuous evolution of production processes, not only in the automotive industry [7] but in the whole world of automation, together with the problem of employee turnover, has led to a progressive and ceaseless need to train novel operators on new techniques and to assess the quality of their job. The assembly and maintenance of industrial assets are complex tasks that require a large number of training hours in classrooms and hands-on sessions. Furthermore, especially in times of restrictions related to the pandemic, only a few trainees can access training on a physical asset at the same time, and their availability is always a trade-off with production line efficiency. Training all the operators of a plant on a piece of new equipment in a reasonable lead time is a huge challenge.
In addition to the personnel training, one of the key factors of success in the new industry is the traceability of manual operations to ensure product quality [8]. Indeed, in most of the manufacturing processes, it is important to collect data from the manual assembly and maintenance interventions and to store these data in a centralized data management system for subsequent analysis of the performance of the plants [9]. In this scenario, the companies that can accurately understand workers' behaviors and evaluate their performance in real-time will outperform their competitors.
Smart industrial workstations for training and evaluating the performance of the workers are drawing attention as an innovative approach to facing the problem. These systems are designed to guide a non-experienced operator in achieving the same level of precision and performance as the expert [10,11]. In addition, they ensure that procedures are carried out in the right way while collecting anonymous real-time data that can be used to verify technical parameters and to improve the working efficiency where needed.
The market offers some commercial solutions for virtual guidance, such as Bosch's Active Assist system (https://www.boschrexroth.com/en/xc/products/product-groups/assembly-technology/news/activeassist-assistance-system/index/, accessed on 20 January 2022), Arkite HIM (https://arkite.com/product/, accessed on 20 January 2022), Rhinoassembly Light Guide System (https://www.rhinoassembly.com/en/catalog/product/-Light-Guide--LGS--LGS/, accessed on 20 January 2022) and Vir.GIL (https://www.comau.com/it/competencies/digital-initiatives/technologies/vir-gil/, accessed on 20 January 2022) (Figure 1). They are powerful and flexible systems that are able to guide the operator during the training of a new assembly, inspection or maintenance procedure by means of digital information projected on a workbench or directly on the asset to be manipulated. However, such innovative products do not implement algorithms that are able to accurately track the pose of a hand tool that might also be partially occluded by the operator's hands. In fact, in the best case, these industrial solutions roughly track the position of the hand center and assume that a certain task has been performed if the hand center position is close enough to a specified area. This can clearly lead to a rough performance evaluation and the learning of bad habits. In order to make a difference, it is fundamental to track, with high accuracy, the pose of the tools the worker handles during the learning session of the procedure [12]. As an example, if a worker is required to tighten several screws in sequence, an evaluation of whether the screws were tightened following the correct sequence without missing any screws might be requested. The problem of the pose estimation of 3D objects, including the hand tool for allowing autonomous manipulation [13], has been studied for years and remains open and debated.
The approaches that allow for high accuracies rely on a 3D model of the object that has to be detected within a 3D point cloud [14]. However, such methods are time consuming and may not be applicable due to the unavailability of the 3D model [14]. The unstoppable rise of deep learning in recent years has pushed the pose estimation research beyond the boundaries of classical computer vision techniques, and new approaches based on deep neural networks have been exploited [15]. Novel methods require minimal human intervention, improve the performance of 3D data-based approaches [16,17] and, in some cases, avoid the usage of 3D models for estimating the pose of an object [18,19]. To the best of the authors' knowledge, even if the problem has also been extended to the general situation of occluded objects, hand tool pose estimation and tracking in the industrial environment during assembly and maintenance procedures has not been extensively investigated.
Detecting and tracking a tool handled by an operator performing an assembly or maintenance procedure is not easy, mainly because of the tool's occlusion due to the operator's hands. In the authors' opinion, the problem of hand tool pose estimation can be solved with a direct or indirect approach. In a direct approach, the tool is detected and then tracked using some robust features that are visible even if it is occluded. In an indirect approach, body joints of the operator and their hands are detected and tracked; the tool pose is derived assuming a unique handling pose of the tool. In order to investigate both paradigms, four artificial vision-based systems have been designed, implemented and compared. Each developed system has been evaluated with a set of experiments replicating real industrial scenarios, and their performances have been compared to analyze the pros and cons according to the task properties. Even though, in this work, the authors proposed general methodologies, the developed systems have been tested with a specific use case considering the estimation of the 3D position of a cordless power drill.
This study has the ambition to identify the best method(s) to integrate in the next cutting-edge workstations for training and assessment in Industry 4.0. Such systems will be designed with the following objectives:
• Training an operator on assembly and maintenance procedures with a recorded sequence of actions;
• Tracking an operator activity to validate each manual operation and certify the quality of the job;
• Detecting risky actions and behaviors in time and alerting the operator;
• Monitoring ergonomics during the procedure in order to always propose to the operator the least stressful posture to perform the operations.
The main constraints of these solutions should be:
• Real-time execution;
• High accuracy.
The real-time execution is a key factor of a virtual guidance system because it has to be ready to warn the operator as soon as possible when their safety is compromised, when ergonomics guidelines are not respected [3,20] or when the process is going to be completed in the wrong way. On the other hand, a high level of accuracy is required to control the operator's movement in the assembling process to avoid errors.
The paper is organized as follows: in Section 2, the authors explain the technical characteristics of the selected methods, how they have been implemented, the experiments designed to evaluate the relative performance, the metrics computed and the statistical analysis conducted for the quantitative comparisons; Section 3 reports the results of the metrics computed on the data acquired during the experiment and the statistical analysis; in Section 4, the results are extensively discussed; Section 5 reports the conclusions of this paper and the future scope.

Systems for Hand Tool Tracking
In this work, four different systems based on either open-source or commercial solutions have been developed and compared:
1. The first system is a marker-based solution using the ArUco markers [21,22];
2. The second system is based on a deep learning model engineered for 2D detection problems and is called YOLO v4 [23,24];
3. The third system is based on the Azure Kinect Body Tracking (AKBT) SDK [25,26];
4. The fourth system is based on OpenPose [27][28][29][30].
Each system has been designed to estimate a single point of interest of a hand-held tool, even though three out of four systems have the intrinsic capability of estimating the entire pose of the tool. It is worth noting that all four systems are based on an RGBD camera, except for the method based on ArUco, which does not need the depth information. For this reason, the new Azure Kinect camera has been used both for acquiring RGBD data and for estimating the human skeleton configuration with the Microsoft SDK "Azure Kinect Body Tracking".

ArUco-Based System
The system based on ArUco markers applies a 2D planar marker to the hand tool; the marker must always be visible to an RGB 2D camera. The main benefit of this approach is that a single marker provides enough correspondences to obtain the camera pose and, from that, the pose of the tracked object. Once the position of the tracked tool's point within the marker reference frame is known, the path of that point can be reconstructed from the pose of the marker frame. The strength of this approach is that it yields a robust 3D pose estimation using a simple 2D camera and some printed markers, keeping computational costs low and therefore saving hardware resources for other essential tasks. On the other hand, it is the most invasive of the proposed approaches because it introduces a new object, i.e., the marker, into the scenario and requires installing it on the tool. Furthermore, if the marker is not visible, it is not possible to detect the object and, as a consequence, to establish the correct pose.
ArUco [21,22] is a popular open-source library for the detection of square fiducial markers and camera pose estimation, mostly used in augmented reality applications. An ArUco marker is a synthetic square marker composed of a wide black border and an inner binary matrix that determines its identifier (id). The black border facilitates its fast detection in the image, and the binary codification allows for its identification and the application of error detection and correction techniques. The marker size determines the size of the internal matrix. For instance, a marker size of 4 × 4 is composed of 16 bits. The ArUco decoding algorithm can locate, decode and estimate the pose in real time of any ArUco marker in the camera's field of view, as shown in Figure 2. It is based on the knowledge of the matrix encoded in the square. Multiple matrices are encoded in a group of markers, collected in dictionaries. It is possible to choose among several predefined dictionaries or to generate a custom one. For the sake of simplicity, in this study, a wooden support for the power drill was built and three ArUco square markers with a 5 cm side were stuck on it. A group of markers was preferred to a single marker in order to provide a more robust pose estimation: a single marker might be visible but not recognized quickly in the scene, or not recognized at all because of an occlusion. When more than one marker of the group is detected in the same frame, only one marker's pose needs to be chosen as the reference for the power drill chuck pose estimation. The area in pixels of a marker was used as the selection criterion for the best marker because ArUco algorithms perform better the larger the detected marker appears in the frame.
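The marker selection rule described above can be sketched as follows. This is a minimal illustration in plain Python, not the authors' C++ implementation: the function names, the corner layout (four (x, y) pixel pairs per marker) and the sample values are assumptions made for the example, and the marker pose (rotation and translation) is assumed to be supplied by the marker library.

```python
def polygon_area(corners):
    """Shoelace area (in square pixels) of a polygon given as [(x, y), ...]."""
    area = 0.0
    n = len(corners)
    for i in range(n):
        x0, y0 = corners[i]
        x1, y1 = corners[(i + 1) % n]
        area += x0 * y1 - x1 * y0
    return abs(area) / 2.0


def select_reference_marker(detections):
    """Return the id of the detected marker with the largest image area.

    `detections` maps a marker id to its four corner pixels.
    Returns None when no marker is visible.
    """
    if not detections:
        return None
    return max(detections, key=lambda mid: polygon_area(detections[mid]))


def point_in_camera(rotation, translation, point_in_marker):
    """Map a point from the marker frame to the camera frame: R * p + t."""
    return tuple(
        sum(rotation[r][c] * point_in_marker[c] for c in range(3)) + translation[r]
        for r in range(3)
    )


# Hypothetical detections: marker 1 appears larger in the image than marker 0.
detections = {
    0: [(100, 100), (140, 100), (140, 140), (100, 140)],  # 40 x 40 px
    1: [(300, 200), (360, 200), (360, 260), (300, 260)],  # 60 x 60 px
}
best_id = select_reference_marker(detections)

# Hypothetical pose of the selected marker and chuck offset in its frame.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
chuck = point_in_camera(identity, (0.0, 0.0, 0.5), (0.1, 0.0, 0.0))
```

Choosing the marker by area is a simple proxy for detection quality; other criteria (for instance, reprojection error) would fit the same interface.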

System Based on Deep Detection Model (YOLO)
The system that uses a deep neural network, i.e., YOLO, is based on two subsequent steps: (1) a deep neural network is trained to localize within a 2D RGB image the specific tool's area (or point) that has to be tracked, and (2) the 3D position of the tracked tool's point is estimated by computing the spatial centroid of the point cloud underlying the ROI found in the previous step.
You Only Look Once (YOLO) [23,24] is a family of convolutional neural networks (CNNs) that achieve near state-of-the-art results with a single end-to-end model that can perform object detection in real-time. Compared to the approach taken by previous object detection algorithms, YOLO proposes the use of an end-to-end neural network that makes predictions of bounding boxes and class probabilities all at once (Figure 3). Starting from YOLO's pre-trained models, it is possible to re-train the last layers to introduce new classes that are not available in the original dataset, so as to fit the object detector capabilities to new objects. In the re-training process, it is also possible to tune the settings of the net and to trade off accuracy against performance, training time and batch iterations, choosing the best weight candidate. YOLO's performance has improved over time, starting from the v1 version. The original YOLO v1 was born as the first object detection network to combine the problem of drawing bounding boxes and identifying class labels in one end-to-end differentiable network. YOLO v4 outperforms most of the other object detection models [31] by a significant margin while keeping the frame rate high, which makes it well suited to detecting objects in real-time. In this study, YOLO was implemented in its fourth version. The last layers of the model were re-trained to recognize the power drill chuck as a new class, and then the performance of the model inference was evaluated. In order to train the model, a dataset consisting of 735 images was created. The images were acquired by the Azure Kinect RGB camera with the power drill in different poses, with heterogeneous light conditions and in different environments. Considering the objective of the study, the model was trained with images containing only the particular power drill used for the experiment, leading the net to recognize only the chuck of this tool and not that of other power drills.
Data augmentation was performed in order to improve the performance of the model using upside-down flip, left-right flip, rotation and Gaussian blur. The training process was based on transfer learning, and the CNN model was initialized with the weights retrieved from the GitHub page of YOLO v4 based on the Darknet framework. The net was trained with images of 512 × 512 resolution and max batches set to 2000. The starting weights were obtained from the training on the COCO dataset, containing 80,000 images and 80 different object classes.
The trained model was able to recognize the power drill chuck in every frame of the real-time acquisition, returning the pixel coordinates x and y of the top-left corner of a box that contains it, the width and the height of the box and the confidence, i.e., the probability that the object was classified correctly. In order to estimate the 3D position of the power drill chuck, the center point (x,y) of the box returned by the YOLO model was calculated and the 3D coordinates of that point in the scene were computed by using the functions of the Azure Kinect SDK.
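The two geometric steps above (box center, then a 3D point) can be illustrated with a short sketch. It is only an approximation of the pipeline: the study relies on the Azure Kinect SDK for the pixel-to-3D conversion, whereas the example below substitutes a generic pinhole back-projection, and the intrinsics (fx, fy, cx, cy), the depth value and the box are hypothetical. The `roi_centroid` helper illustrates the centroid of the points underlying the ROI.

```python
def box_center(x, y, width, height):
    """Center pixel of a box given its top-left corner and its size."""
    return (x + width / 2.0, y + height / 2.0)


def back_project(u, v, depth_m, fx, fy, cx, cy):
    """Pinhole back-projection of pixel (u, v) at depth `depth_m` (metres)."""
    return ((u - cx) * depth_m / fx, (v - cy) * depth_m / fy, depth_m)


def roi_centroid(points):
    """Centroid of the valid 3D points under the ROI (depth z > 0).

    Returns None when the ROI contains no valid depth sample.
    """
    valid = [p for p in points if p[2] > 0]
    if not valid:
        return None
    n = len(valid)
    return tuple(sum(p[i] for p in valid) / n for i in range(3))


# Hypothetical detection and camera intrinsics, for illustration only.
u, v = box_center(x=300, y=200, width=40, height=20)
point = back_project(u, v, depth_m=1.0, fx=600.0, fy=600.0, cx=320.0, cy=240.0)

# Hypothetical ROI points; the second one has an invalid (zero) depth.
centroid = roi_centroid([(0.0, 0.0, 1.0), (5.0, 5.0, 0.0), (0.2, 0.0, 1.2)])
```

Averaging over the ROI rather than reading a single pixel makes the estimate less sensitive to missing depth samples, at the cost of including background points when the box is loose.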

System Based on Azure Kinect Body Tracking
The system based on the Azure Kinect Body Tracking exploits the 3D body configuration estimation performed by the AKBT SDK of Microsoft. The system tracks the main segments of the upper limb to estimate a specific tool's point, provided that the relative position of such a point with respect to the hand reference system is known.
The Azure Kinect from Microsoft is a cutting-edge spatial computing developer kit with sophisticated computer vision and speech models, advanced AI sensors and a range of powerful SDKs. It is equipped with several sensors in order to sense the surrounding environment; the device integrates a 12-megapixel RGB camera supplemented by a 1-megapixel depth camera, a 360-degree seven-microphone array and an orientation sensor. The main modules of the SDKs are the Sensor SDK and the Body Tracking SDK. The first one is designed for interfacing with the sensors and for managing the data provided by them, automatically handling the problem of RGB and depth camera data alignment; the Body Tracking SDK is based on a complex deep learning model that supports body segmentation, human skeleton reconstruction, human body instance recognition and body tracking in real-time. The model recognizes 32 joints of the human body, each joint with its own reference system organized in a hierarchical structure, as shown in Figure 4. The deep learning model is able to reconstruct the entire human body model, and the better the visibility of the body, the higher the accuracy; the model has a good tolerance to occlusions only for the highest joints of the hierarchical structure. The Body Tracking SDK is able to detect and track all joints of a human body in the scene, returning position and orientation (in quaternion form) for each of them [25,26].
Intuitively, the Kinect hand joints, i.e., n.8-left hand and n.15-right hand, should be the most useful to estimate the power drill chuck pose. Unfortunately, several tests led the authors to exclude the above-mentioned joints because experimental trials proved the poor quality of the orientation prediction. In fact, most of the time, the model predicted the power drill in the scene as part of the hand of the operator. Thus, the joints of the wrists (n.7-left and n.14-right) have been considered. The downside of this choice is that the tool pose changes relative to the wrist with flexion/extension and radial-ulnar deviation movements, which are not considered.
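As an illustration of how a tool point can be derived from a tracked joint, the sketch below rotates a fixed offset by the joint orientation and adds the joint position. It is a minimal example under stated assumptions: the quaternion is taken to be a unit quaternion in (w, x, y, z) order, the offset is assumed to come from a calibration step, and all names and values are hypothetical rather than taken from the authors' application.

```python
import math


def quat_to_matrix(w, x, y, z):
    """Rotation matrix of a unit quaternion given as (w, x, y, z)."""
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - z * w), 2 * (x * z + y * w)],
        [2 * (x * y + z * w), 1 - 2 * (x * x + z * z), 2 * (y * z - x * w)],
        [2 * (x * z - y * w), 2 * (y * z + x * w), 1 - 2 * (x * x + y * y)],
    ]


def tool_point_from_joint(joint_position, joint_quaternion, offset):
    """Tool point in the camera frame: joint position + R(q) * offset."""
    rot = quat_to_matrix(*joint_quaternion)
    return tuple(
        joint_position[r] + sum(rot[r][c] * offset[c] for c in range(3))
        for r in range(3)
    )


# Hypothetical wrist pose: rotated by 90 degrees about the z axis.
half = math.sqrt(0.5)
wrist_position = (0.2, 0.3, 1.0)
wrist_quaternion = (half, 0.0, 0.0, half)
# Hypothetical chuck offset expressed in the wrist frame (metres).
chuck = tool_point_from_joint(wrist_position, wrist_quaternion, (0.1, 0.0, 0.0))
```

Because the offset is fixed, any wrist flexion or deviation that moves the tool relative to the wrist frame shows up directly as an estimation error, which is the limitation noted above.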

System Based on OpenPose
This system, like the one based on the AKBT, uses the pose of the operator's upper limb to obtain the position of the tool's point. In detail, it relies on the 3D position of some keypoints of the hand. OpenPose [27][28][29][30] can be considered the state-of-the-art approach for real-time human pose estimation. It is the first framework that can jointly detect human body, hand, facial and foot keypoints on single images. OpenPose is a multi-stage CNN that uses a bottom-up approach: it finds every instance of a keypoint and then attempts to assemble groups of keypoints into skeletons of distinct humans. The model follows a precise pipeline: in the first step, it computes confidence maps for every body part detection; in the second step, it predicts part affinity fields (PAFs) in order to associate the keypoints of every person in the frame; in the third step, bipartite matching is computed, i.e., the candidate connections between keypoints are evaluated in order to choose the best-performing ones; in the last step, the results are parsed and the skeleton is reconstructed. A confidence map is the 2D representation of the belief that a particular body part can be located at each position of the image. A single body part is represented on a single map. Therefore, the number of maps is the same as the total number of body parts, which depends on the dataset the model is trained on. A PAF, instead, is a set of 2D vector fields that encode the location and orientation of limbs over the frame domain. The framework integrates three different trained models for body, hand and face keypoint estimation. They return, respectively, 25 keypoints for the body, 21 × 2 keypoints for the hands and 70 keypoints for the face (135 keypoints in total).
For this study, the authors focused on the OpenPose capability to track up to twenty keypoints of the hand, consisting of the wrist, the finger knuckles and the phalanges. The adopted configuration for the framework has the following characteristics: Body Network input size equal to 160 × 160, Hands Net input size equal to 368 × 368 and Face Net disabled. The keypoints are recognized only in the 2D frame, and the output is expressed in pixel coordinates within the image. Hence, a registered 3D point cloud is used to obtain the 3D position of the keypoints.
Three out of all detected keypoints of the hand have been considered to compute the estimated hand's pose. The reference frame of the operator's hand was computed as follows. First, two vectors were computed: v1, from the wrist point (kp0) to the middle finger knuckle point (Equation (1)), and v2, from the wrist point to the little finger knuckle point (Equation (2)). The cross-product between v1 and v2 returns v3, as shown in Equation (3); it is a vector orthogonal to v1 and v2, with a direction given by the right-hand rule, pointing out of the back of the hand. Then, the vector pointing toward the thumb (v4) was calculated as the cross-product between v1 and v3 (Equation (4)).
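The construction of the hand reference frame can be sketched in a few lines. The example is a plain-Python illustration, not the authors' code: the function names, the choice to normalize the axes and the sample keypoint coordinates are assumptions.

```python
import math


def sub(a, b):
    """Component-wise difference a - b of two 3D points."""
    return tuple(a[i] - b[i] for i in range(3))


def cross(a, b):
    """Cross product a x b of two 3D vectors."""
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def normalize(v):
    """Unit vector with the same direction as v."""
    norm = math.sqrt(sum(c * c for c in v))
    return tuple(c / norm for c in v)


def hand_frame(kp_wrist, kp_middle_knuckle, kp_little_knuckle):
    """Orthonormal axes (v1, v3, v4) of the hand frame, origin at the wrist."""
    v1 = sub(kp_middle_knuckle, kp_wrist)   # wrist -> middle finger knuckle
    v2 = sub(kp_little_knuckle, kp_wrist)   # wrist -> little finger knuckle
    v3 = cross(v1, v2)                      # normal to the plane of v1 and v2
    v4 = cross(v1, v3)                      # orthogonal to both v1 and v3
    return normalize(v1), normalize(v3), normalize(v4)


# Hypothetical 3D keypoints (metres) lying in the z = 0 plane.
axis_1, axis_3, axis_4 = hand_frame(
    (0.0, 0.0, 0.0), (0.0, 0.09, 0.0), (0.04, 0.08, 0.0)
)
```

Note that v2 is used only to define the plane of the hand: it is generally not orthogonal to v1, which is why the third axis is rebuilt as v1 × v3 instead of reusing v2. In this sample, the little finger knuckle lies on the +x side, so v4 points along −x.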

Calibration of the Proposed Systems
All of the proposed methodologies consider the estimation of the 3D position of a specific point of interest of the hand tool. The system based on YOLO is the only one that directly estimates such a position. The other three systems are based on the knowledge of the position (that can be considered fixed) of such a point with respect to the estimated reference system. A specific calibration procedure is thus needed to acquire the position of such a point that, in this work, has been experimentally found by positioning the tool's point (while the tool was hand-held by the operator) at a known position. However, other model-based techniques might be investigated.
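One way to read this calibration step is as the inverse of the rigid transform of the estimated reference frame: given the frame pose (R, t) at the instant the tool's point sits at a known position p, the fixed offset in the local frame is Rᵀ(p − t). The sketch below assumes this formulation; the function name and the numerical values are hypothetical.

```python
def offset_in_local_frame(rotation, translation, known_point):
    """Fixed offset of the tool point in the local frame: R^T * (p - t).

    `rotation` is the 3x3 rotation of the local frame (marker, wrist or hand)
    in the camera frame, `translation` its origin, and `known_point` the known
    camera-frame position of the tool point during calibration.
    """
    d = [known_point[i] - translation[i] for i in range(3)]
    # Multiply by the transpose of R: column c of R dotted with d.
    return tuple(sum(rotation[r][c] * d[r] for r in range(3)) for c in range(3))


# Hypothetical calibration sample: local frame rotated 90 degrees about z.
rotation = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
translation = (0.2, 0.3, 1.0)
known_point = (0.2, 0.4, 1.0)
offset = offset_in_local_frame(rotation, translation, known_point)
```

In practice, averaging the offset over several frames would reduce the effect of noise in the estimated frame; that refinement is left out of the sketch.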

Experimental Validation
In order to evaluate and compare the proposed systems, the authors selected a specific scenario in which an operator holds a cordless power drill, whose chuck was considered as the point of interest to be tracked. As discussed in detail below, the pose of the power drill and the position of the chuck were also acquired with an accurate system in order to quantitatively evaluate the performance of each system.

Evaluation System Setup
The experiment environment was set up with the Azure Kinect positioned at approximately 1.8 m from the ground and two meters from a wall, tilted 15 degrees downwards with respect to the horizon. The distance of the operator from the Kinect camera could be in a range between 0.80 m and 2 m, compatible with the range defined in the official documentation of the device.
In order to evaluate the performance of the four frameworks under investigation, a highly accurate tracking of the power drill chuck pose was needed as reference. The HTC Vive (Figure 6) was selected as benchmark for the experiment since the precision of its tracking technologies has been tested to be around RMS 1.5 mm, and its accuracy around RMS 1.9 mm [32][33][34]. The HTC Vive system is used for rendering 3D virtual reality and is developed by HTC in partnership with Valve. The headset (Figure 6-Top-left) uses room scale tracking technology (Lighthouse, as shown in Figure 6-Middle) for virtual reality experiences that allow users to freely move around a play area, accurately tracking the position and orientation of the user's head-mounted display and controllers and reflecting all real-life movement in the VR simulation environment. The tracking is possible thanks to two infrared signal emitters called base stations (Figure 6-Top-right) and to special active sensors that cover the surface of the headset and controllers: by intercepting the infrared pulses, these devices can autonomously track their own position and orientation in the workspace determined by the field of view of the stations. The HTC Vive capabilities can be extended by means of small devices called Vive trackers (Figure 6-Bottom) [35] that implement Lighthouse technology too. They allow for a high degree of flexibility, making it possible to track items or body parts if correctly configured [36].
HTC Vive base stations were placed in the space between Azure Kinect, delimiting a workspace that completely contained the field of view of the Kinect cameras to ensure the operator movements were in a monitored space. An HTC Vive tracker was placed on the power drill to track its pose. The position of the chuck with respect to the Vive tracker reference frame was found using the calibration procedure described above. The power drill Vive tracker was installed by means of a wooden support (Figure 7-Right) designed for the purpose. It is worth noting that the position of the tool's point of interest estimated by the proposed systems is expressed in the reference frame of the Kinect camera. Since the accurate measure of the power drill pose is expressed in the reference system of the HTC Vive, it is necessary to know the relative pose between the two reference systems in order to compare the accurate measure with the estimated one. Hence, a second Vive tracker was fixed on the Azure Kinect chassis by using a 3D-printed support (Figure 8). Such a support, designed by the authors, allowed for the positioning of the Vive tracker with a well-known pose with respect to the camera frame.
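The change of reference frame described above amounts to composing two rigid transforms and inverting the result. The sketch below shows one possible formulation with 4 × 4 homogeneous matrices; the matrix names, the helper functions and the numerical values are assumptions made for the example, not quantities reported in the study.

```python
def compose(a, b):
    """Product of two 4x4 homogeneous transforms stored as nested lists."""
    return [[sum(a[r][k] * b[k][c] for k in range(4)) for c in range(4)]
            for r in range(4)]


def invert_rigid(t):
    """Inverse of a rigid transform [R | p], i.e. [R^T | -R^T p]."""
    rot_t = [[t[c][r] for c in range(3)] for r in range(3)]
    pos = [-sum(rot_t[r][c] * t[c][3] for c in range(3)) for r in range(3)]
    return [rot_t[r] + [pos[r]] for r in range(3)] + [[0.0, 0.0, 0.0, 1.0]]


def apply(t, point):
    """Apply a 4x4 homogeneous transform to a 3D point."""
    return tuple(sum(t[r][c] * point[c] for c in range(3)) + t[r][3]
                 for r in range(3))


def translation(x, y, z):
    """Pure-translation transform (identity rotation)."""
    return [[1.0, 0.0, 0.0, x], [0.0, 1.0, 0.0, y],
            [0.0, 0.0, 1.0, z], [0.0, 0.0, 0.0, 1.0]]


def vive_to_kinect(vive_T_tracker, tracker_T_camera, point_vive):
    """Express a point measured in the Vive frame in the Kinect camera frame."""
    vive_T_camera = compose(vive_T_tracker, tracker_T_camera)
    return apply(invert_rigid(vive_T_camera), point_vive)


# Hypothetical poses: tracker on the camera chassis, and camera frame
# relative to that tracker (known from the support geometry).
vive_T_tracker = translation(1.0, 0.0, 2.0)
tracker_T_camera = translation(0.0, -0.05, 0.0)
chuck_in_kinect = vive_to_kinect(vive_T_tracker, tracker_T_camera, (1.0, 0.45, 3.0))
```

The example uses pure translations to keep the numbers easy to check by hand; the same functions accept transforms with arbitrary rotations.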

Experiment Description
In order to evaluate and compare the pose estimation performance of the proposed systems, some experiments reproducing typical manual tasks performed by a worker were conducted. As already reported, a power drill (Figure 7-Left) was selected for the experiment because it is one of the most frequently used tools in industrial assembly/disassembly procedures.
Three different scenarios have been designed:
• In the first scenario, the operator holds the power drill in a static position on a workbench (stationary condition) (Figure 9-Left);
• In the second scenario, the operator follows a trajectory with the power drill on the platform, keeping the speed of the movement low (slow-motion condition) (Figure 9-Right);
• In the last scenario, the previous trajectory is followed at a higher speed (fast-motion condition).
For the second and third scenarios, the trajectory was defined as a path from a starting point A to another point B on a horizontal workbench. Two different velocities were adopted for performing the same trajectory (approximately 4 cm/s for the slow-motion condition; approximately 8 cm/s for the fast-motion condition). Visual feedback on a monitor was used as a virtual reference to follow. Four users were involved in the experiment. Each user was asked to perform three trials in each condition, for a total of twelve trials per condition.
The execution of all of the proposed systems was performed by a real-time C++ application that both integrated all of the frameworks and synchronized the data acquired by the HTC Vive system with the estimation positions computed by each system (Figure 10).
The application ran on a computer with the following configuration: Intel i7-8750H, GTX 1060 with 6 GB memory and 16 GB RAM. The frame rates, in frames per second (FPS), of the frameworks on this machine were the following: YOLO 30+ FPS, AKBT 10 FPS, OpenPose 10 FPS and ArUco 30+ FPS. All of the frameworks therefore had to be aligned to the lowest frame rate (10 FPS). The final performance of the application was under 10 FPS.
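The text does not detail how the benchmark and the estimates were synchronized. As a purely illustrative sketch, one common option is to pair each estimate with the benchmark sample closest in time and to discard pairs that are too far apart; the function name, the (timestamp, value) layout and the `max_gap` threshold below are all assumptions.

```python
import bisect


def pair_by_timestamp(reference, estimates, max_gap):
    """Pair each estimate with the reference sample closest in time.

    Both inputs are lists of (timestamp_s, value) sorted by timestamp.
    Pairs whose time difference exceeds `max_gap` seconds are dropped.
    """
    ref_times = [t for t, _ in reference]
    pairs = []
    for t, value in estimates:
        i = bisect.bisect_left(ref_times, t)
        # Only the neighbours on either side of the insertion point can be closest.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(ref_times)]
        if not candidates:
            continue
        j = min(candidates, key=lambda k: abs(ref_times[k] - t))
        if abs(ref_times[j] - t) <= max_gap:
            pairs.append((reference[j][1], value))
    return pairs


# Hypothetical streams: a faster reference and a slower estimator.
reference = [(0.00, "r0"), (0.05, "r1"), (0.10, "r2"), (0.15, "r3")]
estimates = [(0.01, "e0"), (0.11, "e1"), (0.40, "e2")]
pairs = pair_by_timestamp(reference, estimates, max_gap=0.02)
```

The last estimate is dropped because no reference sample lies within the allowed gap.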

Evaluation Metrics
The performance of each proposed methodology was evaluated considering both the root mean square point to point distance (D.RMS) and the multivariate R² between the estimated and measured tool path. In particular, such metrics have been independently computed for each trajectory of a trial. It is worth remembering that the tool's pose acquired by the HTC Vive system was considered as the measured pose.
The multivariate R² index represents how much variability of the estimated path components is explained by the variability of the measured path components. It is then a global indicator of the goodness of the estimation. The R² was computed as follows:

\[
R^2 = 1 - \frac{SSE}{SST} = 1 - \frac{\sum_{C \in \{X,Y,Z\}} \sum_{k} \left( path_C(t_k) - path^{est}_C(t_k) \right)^2}{\sum_{C \in \{X,Y,Z\}} \sum_{k} \left( path_C(t_k) - \overline{path}_C \right)^2} \qquad (6)
\]

where path_C(t_k) is the value of the measured path component (on the X, Y or Z axis) at the specific sample time t_k, path^est_C(t_k) is the value of the estimated path component (on the X, Y or Z axis) at the specific sample time t_k, SSE is the sum of the squared errors and SST is the sum of the squared residuals from the mean of path_C.
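Both metrics can be sketched in a few lines of plain Python. The example assumes that D.RMS is the square root of the mean squared Euclidean distance between paired samples and that the multivariate R² pools SSE and SST over the three axes, as the definitions above suggest; the function names and the sample paths are hypothetical.

```python
import math


def d_rms(measured, estimated):
    """Root mean square point-to-point distance between paired 3D samples."""
    total = sum(
        sum((m[i] - e[i]) ** 2 for i in range(3))
        for m, e in zip(measured, estimated)
    )
    return math.sqrt(total / len(measured))


def multivariate_r2(measured, estimated):
    """R^2 = 1 - SSE / SST, with both sums pooled over the X, Y and Z axes."""
    n = len(measured)
    means = [sum(p[i] for p in measured) / n for i in range(3)]
    sse = sum((m[i] - e[i]) ** 2
              for m, e in zip(measured, estimated) for i in range(3))
    sst = sum((m[i] - means[i]) ** 2 for m in measured for i in range(3))
    return 1.0 - sse / sst


# Hypothetical paths: the estimate is offset by 0.1 along Y at every sample.
measured = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
estimated = [(0.0, 0.1, 0.0), (1.0, 0.1, 0.0), (2.0, 0.1, 0.0)]
distance = d_rms(measured, estimated)
r2 = multivariate_r2(measured, estimated)
```

When the measured path barely moves, SST approaches zero and R² becomes unstable or undefined, so the index is informative mainly for the dynamic conditions.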

Statistics
For a deeper look into the differences among the methods, the authors performed the Friedman test and the Dunn's pairwise post hoc tests with Bonferroni correction on the root mean square point to point distance (D.RMS) and multivariate R² data for each condition, comparing the four methods. The significance level was set to 0.05. Non-parametric tests were adopted since the assumptions underlying parametric tests were violated for all sets of data. All the analyses were performed using the SPSS software (Version 21). The authors left out the correlation data of the stationary condition from this analysis, considering it pointless to apply the test in that condition because its variability is already described by the standard deviation previously calculated.
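For readers unfamiliar with the Friedman test, the sketch below computes its statistic and the mean rank of each method from a table with one row per trial and one column per method. It is an illustration only: the analysis in this study was run in SPSS, the data below are invented, and the statistic is computed without the correction for ties.

```python
def rank_row(values):
    """Ranks within one row (1 = smallest), averaging the ranks of ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average = (pos + end) / 2.0 + 1.0
        for k in range(pos, end + 1):
            ranks[order[k]] = average
        pos = end + 1
    return ranks


def friedman(table):
    """Friedman statistic (no tie correction) and mean rank per column."""
    n = len(table)       # number of trials (blocks)
    k = len(table[0])    # number of methods (treatments)
    rank_sums = [0.0] * k
    for row in table:
        for j, r in enumerate(rank_row(row)):
            rank_sums[j] += r
    q = 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)
    return q, [r / n for r in rank_sums]


# Invented D.RMS values; columns are four methods, rows are four trials.
table = [
    [1.0, 1.5, 2.5, 3.0],
    [0.8, 1.4, 2.0, 3.5],
    [1.2, 1.6, 2.2, 2.9],
    [0.9, 1.3, 2.7, 3.1],
]
q, mean_ranks = friedman(table)
```

Under the null hypothesis, the statistic is approximately chi-square distributed with k − 1 degrees of freedom; pairwise post hoc comparisons then work on the differences between the mean ranks.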

Results
For each trajectory estimated by the four proposed systems during the 12 trials, the root mean square point to point distance (D.RMS) and the multivariate R² (see Equation (6)) between the measured and estimated trajectory were computed.
The results obtained for the stationary, slow-motion and fast-motion conditions are reported in Tables 1, 2 and 3. In order to compare the performance of the methods side by side in the three different conditions, the authors represented the results in boxplot charts, which make it easy to visualize differences between groups and to read off information such as the median and the data dispersion. The resulting boxplots are shown in Figures 11 and 12. Then, the Friedman test and Dunn's pairwise post hoc tests with Bonferroni correction were performed on the data to compare the methods one-to-one and to understand whether their distributions were significantly different (Tables 4 and 5).
The results show that the ArUco method performs very well in all of the conditions. Indeed, it shows a significantly lower root mean square point-to-point distance (D.RMS) than AKBT and YOLO (stationary ArUco vs. YOLO: t-stat = −2.154, p < 0.001; stationary ArUco vs. AKBT: t-stat = −2.308, p < 0.001; slow-motion ArUco vs. YOLO: t-stat = −1.769, p = 0.003; slow-motion ArUco vs. AKBT: t-stat = −2.538, p < 0.001; fast-motion ArUco vs. YOLO: t-stat = −1.917, p = 0.002; fast-motion ArUco vs. AKBT: t-stat = −2.833, p < 0.001), but not than OpenPose; the ArUco variability increases in dynamic conditions. OpenPose also obtained notable results, showing a significantly lower D.RMS than AKBT and YOLO in the slow-motion condition (OpenPose vs. AKBT: t-stat = 2.077, p < 0.001; OpenPose vs. YOLO: t-stat = −1.308, p = 0.05) and than AKBT only in the fast-motion condition (t-stat = 1.917, p = 0.002). In terms of variability, OpenPose appears to be the most reliable of all the methods.
The multivariate R² results (Figure 12, Table 5) show that the ArUco performance is significantly better than that of YOLO and AKBT (slow-motion ArUco vs. YOLO: t-stat = 1.846, p = 0.002; slow-motion ArUco vs. AKBT: t-stat = 2.462, p < 0.001; fast-motion ArUco vs. YOLO: t-stat = 1.667, p = 0.009; fast-motion ArUco vs. AKBT: t-stat = 2.583, p < 0.001) but is comparable with that of OpenPose in dynamic conditions. They also confirm the difference between OpenPose and AKBT (slow-motion OpenPose vs. YOLO: t-stat = 1.385, p = 0.037; slow-motion OpenPose vs. AKBT: t-stat = −2.0, p < 0.001; fast-motion OpenPose vs. AKBT: t-stat = −2.167, p < 0.001).
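The statistical procedure can be sketched as follows. This is a minimal illustration, not the software used in the study: it computes the Friedman chi-square statistic without tie correction, then compares each pair of methods through the difference in mean ranks, standardized with the usual standard error sqrt(k(k+1)/(6n)) and Bonferroni-adjusted by the number of pairs. The function names and the D.RMS values below are hypothetical and are not the data of Tables 1 to 5.

```python
import math
from itertools import combinations

def average_ranks(values):
    """Rank values in ascending order (1 = smallest); ties share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = shared
        i = j + 1
    return ranks

def friedman_with_pairwise(scores):
    """scores maps each method name to its per-trial values (same trial order).
    Returns the Friedman statistic, the mean ranks and, for each pair,
    (mean-rank difference, z statistic, Bonferroni-adjusted p-value)."""
    methods = list(scores)
    k = len(methods)
    n = len(scores[methods[0]])
    rank_sums = [0.0] * k
    for trial in range(n):
        trial_ranks = average_ranks([scores[m][trial] for m in methods])
        rank_sums = [s + r for s, r in zip(rank_sums, trial_ranks)]
    # Friedman chi-square (no tie correction), k - 1 degrees of freedom.
    chi2 = 12.0 / (n * k * (k + 1)) * sum(s * s for s in rank_sums) - 3.0 * n * (k + 1)
    mean_rank = {m: s / n for m, s in zip(methods, rank_sums)}
    std_error = math.sqrt(k * (k + 1) / (6.0 * n))
    n_pairs = k * (k - 1) // 2
    pairwise = {}
    for a, b in combinations(methods, 2):
        diff = mean_rank[a] - mean_rank[b]
        z = diff / std_error
        p_two_sided = math.erfc(abs(z) / math.sqrt(2))  # normal approximation
        pairwise[(a, b)] = (diff, z, min(1.0, p_two_sided * n_pairs))
    return chi2, mean_rank, pairwise

# Hypothetical D.RMS values (cm) for six trials; illustrative only.
example = {
    "ArUco":    [2.0, 2.5, 3.0, 2.2, 2.8, 2.4],
    "OpenPose": [9.0, 10.0, 11.0, 9.5, 10.5, 9.8],
    "YOLO":     [12.0, 13.0, 14.0, 12.5, 13.5, 12.8],
    "AKBT":     [20.0, 22.0, 25.0, 21.0, 23.0, 24.0],
}
chi2, mean_rank, pairwise = friedman_with_pairwise(example)
```

With only six trials and a perfectly consistent ordering, adjacent methods do not reach significance after the Bonferroni correction, while the extreme pair does. This mirrors how a method can be significantly better than the worst performers without being distinguishable from its nearest competitor.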

Discussion
In order to evaluate and compare the ArUco-, OpenPose-, YOLO- and AKBT-based methods, the authors designed a system composed of an Azure Kinect device, an HTC Vive tracking system (the benchmark) and a single application integrating all of the frameworks, running on a computer with the following configuration: Intel i7-8750H, GTX 1060 with 6 GB of memory and 16 GB of RAM.
The frames per second (FPS) achieved by each framework on this machine were the following: YOLO 30+ FPS, AKBT 10 FPS, OpenPose 10 FPS and ArUco 30+ FPS. All of the methods therefore had to be leveled to the lowest frame rate (10 FPS). The final performance of the application was below 10 FPS because of the overhead of the post-processing phases and the simultaneous execution of the methods.
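One simple way to bring streams with different frame rates onto a common rate is nearest-sample resampling on a fixed clock. The sketch below is an assumption for illustration only; the study does not describe how the leveling was implemented, and the function name and the synthetic 30 FPS stream are hypothetical.

```python
def resample_to_rate(samples, rate_hz):
    """samples: list of (timestamp_s, value) pairs sorted by timestamp.
    Returns, for each tick of a clock running at rate_hz, the sample
    whose timestamp is closest to that tick."""
    if not samples:
        return []
    period = 1.0 / rate_hz
    start, end = samples[0][0], samples[-1][0]
    resampled = []
    i = 0
    tick_index = 0
    while True:
        tick = start + tick_index * period
        if tick > end + 1e-9:  # small tolerance for floating-point drift
            break
        # Advance while the next sample is at least as close to the tick.
        while (i + 1 < len(samples)
               and abs(samples[i + 1][0] - tick) <= abs(samples[i][0] - tick)):
            i += 1
        resampled.append((tick, samples[i][1]))
        tick_index += 1
    return resampled

# Synthetic 30 FPS stream lasting one second; the value is the frame index.
stream_30fps = [(k / 30.0, k) for k in range(31)]
stream_10fps = resample_to_rate(stream_30fps, 10.0)
frame_indices = [value for _, value in stream_10fps]
```

In this example every third frame of the faster stream is retained, so a 30 FPS method is compared with a 10 FPS method on the same time instants.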
The authors designed an experiment in which a user handling a power drill simulated three different conditions during an assembly procedure: holding the tool in a static position (stationary condition) and moving the tool at two different velocities (slow-motion condition, 4 cm/s, and fast-motion condition, 8 cm/s).
The performance of the four methods has been evaluated on two metrics: the root mean square point-to-point distance (D.RMS) and the multivariate R² of a trajectory compared to the benchmark system (HTC Vive tracking system). The results have been reported as boxplot charts and a statistical analysis has been performed.
As reported in Figure 11, ArUco proved to be a very accurate method in the stationary condition, showing a D.RMS lower than all the other methods and a low variability. The boxplot charts for the slow-motion and fast-motion conditions (Figure 11) show that the better accuracy of ArUco is also preserved in dynamic conditions, even if its variability increases, becoming comparable with that of OpenPose and YOLO in the slow-motion condition and the worst overall in the fast-motion condition. The correlation boxplots in Figure 12 confirm the same picture, and the overall ranking of the methods appears to be the following: ArUco, OpenPose, YOLO, AKBT.
The authors expected ArUco to be the most accurate of the frameworks because, unlike the others, it implements a marker-based technique. This confirms the results of other studies, such as [37]. Nevertheless, accuracy alone is not enough, because an invasive method cannot be seriously considered in a real industrial application; for safety and ergonomics reasons, it is rarely acceptable to install markers on a hand tool and to handle it in a way that always keeps them visible.
Although OpenPose does not have the best performance in terms of D.RMS, it could be considered a valid framework for building a virtual guidance system; the low variability of the D.RMS suggests it is a reliable method if the user can accept a D.RMS of around 10 cm. Furthermore, its non-invasiveness and flexibility make OpenPose an attractive option. Its performance was probably reduced by the limited hardware used for the experiment; a network with a higher resolution should improve it. YOLO v4 did not show an exceptional performance, but the variability of its D.RMS is good in dynamic conditions. The main problem of the framework is that it must be retrained with many pictures of the hand tool under consideration (735 images were used in this study) in order to make the model able to recognize the point(s) of interest. This framework would require a deeper investigation, exploiting different points of interest for the selected hand tool and considering a more effective training strategy for the DL model.
Azure Kinect Body Tracking (AKBT) returned the worst results. It is nevertheless an innovative, compact, lightweight and well-documented framework, and the capabilities and accuracy of its DL model improve over time thanks to continuous support from Microsoft. A study could be designed to improve the model performance by tracking selected points of the upper limb in order to derive the hand tool position more accurately.

Conclusions
In this study, the authors conducted a literature review and market research, looking for studies and out-of-the-box solutions that proposed or implemented methodologies for hand tool pose estimation during assembly and maintenance procedures in the industrial field. They found that, even though there are plenty of studies concerning static and dynamic object pose estimation in the literature, the problem of occluded hand tool pose estimation and tracking in the industrial environment has not been extensively investigated. Furthermore, it emerged that none of the commercial solutions actually implements algorithms that are able to accurately track the pose of a tool partially occluded by the operator's hands; instead, they roughly derive it from the hand pose or by using color matching techniques.
For this reason, the authors selected four of the most promising computer vision and deep learning frameworks in the field (ArUco, OpenPose, YOLO and AKBT) to evaluate their performance in the task of industrial hand tool detection and pose estimation in real-time during an assembly or maintenance procedure. Two different approaches have been considered: a direct approach, in which the tool is directly detected and tracked using some robust visible features, and an indirect approach, in which the body joints and hands of the operator are detected and tracked, and the tool pose is derived by considering the unique handling pose of the tool. To the best of the authors' knowledge, this study is the first that compares the performance of ArUco-, OpenPose-, YOLO- and AKBT-based methods side by side in the task of occluded hand tool tracking.
In order to fairly compare these frameworks, the objective was redefined, reducing the pose estimation problem to 3D position estimation only, because a method such as YOLO, which only returns a position estimate, needs to be complemented with another technique for orientation estimation. A complete pose estimation investigation is worth a further study.
The results of the study suggest that OpenPose is the most complete proposal, thanks to its acceptable root mean square point-to-point distance, D.RMS (approximately 12 cm), and low variability in dynamic conditions, even when a limited network resolution is adopted. This framework is worth a dedicated study in order to exploit all of the model capabilities and extract the information with a higher accuracy. OpenPose could be used to implement a tool pose estimation module in a smart workstation for training and assessment. A system designed with this feature would have a great impact in the automotive industry, especially for critical procedures that require highly accurate monitoring of the movements of the operator and of the correct usage of hand tools, such as battery-pack assembly, engine repair and overhaul, glue smearing and adhesive and sealant application. In future studies, the authors aim to investigate the problem of complete pose estimation (including orientation) and to conduct a deeper study on OpenPose, implementing a pose estimation module based on it and evaluating its performance in experiments that simulate procedures at different levels of complexity.