Comparative Analysis of Skeleton-Based Human Pose Estimation

Abstract: Human pose estimation (HPE) has become a prevalent research topic in computer vision. The technology can be applied in many areas, such as video surveillance, medical assistance, and sport motion analysis. Due to the growing demand for HPE, many HPE libraries have been developed in the last 20 years. In the last 5 years, more and more skeleton-based HPE algorithms have been developed and packaged into libraries to provide ease of use for researchers. Hence, the performance of these libraries is important when researchers intend to integrate them into real-world applications for video surveillance, medical assistance, and sport motion analysis. However, a comprehensive performance comparison of these libraries has yet to be conducted. Therefore, this paper investigates the strengths and weaknesses of four popular state-of-the-art skeleton-based HPE libraries for human pose detection: OpenPose, PoseNet, MoveNet, and MediaPipe Pose. A comparative analysis of these libraries based on images and videos is presented. The percentage of detected joints (PDJ) was used as the evaluation metric in all comparative experiments to reveal the performance of the HPE libraries. MoveNet showed the best performance for detecting different human poses in static images and videos.


Introduction
Human pose estimation (HPE) aims to locate all of the human body parts from input images or videos. Nowadays, HPE has become a popular task in the field of computer vision. It is widely used in video surveillance [1-6], medical assistance [7-15], and sport motion analysis [16-28]. Human keypoints are used to classify the poses and measure the correctness of poses in these applications. Using an intelligent video surveillance system, the human keypoints can be extracted from human body parts to classify poses in kidnapping and child abuse cases. In the aspect of medical assistance, detected keypoints from body parts can be used to evaluate the correctness of postures for physiotherapy exercises, fall detection, and in-home rehabilitation. In addition, the performance and correctness of an athlete's movements can be evaluated by comparing the detected keypoints from body parts with reference poses (ground truth).
HPE can be classified into two-dimensional (2D) HPE and three-dimensional (3D) HPE. It can also be classified into single-person HPE and multi-person HPE based on the number of people captured in the input image. Both single-person and multi-person HPE can be further classified into top-down and bottom-up methods, based on the ways of detecting the skeleton keypoints [29]. This paper focuses on a comparative study of 2D single-person HPE.
As the demand for HPE increases, many skeleton-based HPE algorithms have been developed and packaged into libraries to provide ease of use for researchers. The performance of these HPE libraries is important to ensure the reliability of the different practical applications into which they are integrated. For instance, when an HPE library is applied to an in-home rehabilitation system, it needs to accurately detect the poses of patients freely performing rehabilitation exercises in different home environments to ensure the reliability of the application. This situation is even more complicated when common challenges, such as inappropriate camera position and self-occlusion [11,20,22,23], affect skeleton keypoint detection. Recently, four state-of-the-art HPE libraries have been applied in various applications, namely PoseNet [30], MoveNet [31], OpenPose [32], and MediaPipe Pose [33]. Table 1 lists applications in different domains that have utilized these four HPE libraries in the last 5 years.

Video Surveillance

OpenPose [6] 2018 Kidnapping detection-using HPE to classify kidnapping cases and normal cases in an intelligent video surveillance system.
OpenPose [5] 2019 A child abuse prevention decision-support system-using OpenPose to classify adults and children in CCTV.

Medical Assistance
OpenPose [15] 2020 A fall detection system-using OpenPose to extract features of human body.
OpenPose [11] 2021 Measure joint angles and conduct semi-automatic ergonomic postural assessments to evaluate the risk of musculoskeletal disorders.
MoveNet [14] 2021 A healthcare system that measures a patient's strength, balance, and range of motion during physical therapy activities.
MediaPipe Pose [12] 2022 A fall detection system.
MediaPipe Pose [13] 2022 A posture corrector system-to notify people who are spending most of their time sitting in front of the computer with bad posture to avoid long-term health issues.

Sport Motion Analysis
OpenPose [28] 2018 A basketball free-throw shooting prediction system-using OpenPose to generate body keypoints.
OpenPose [21] 2020 A real-time push-up counter-to classify the correct and incorrect push-ups.
OpenPose [20] 2021 A system to evaluate baseball swinging poses and help baseball players correct their poses.
MediaPipe Pose [24] 2021 A mobile application-to analyze, improve, and track cricket players' batting performance.
PoseNet [25] 2021 A real-time workout analyzer-allows fitness enthusiasts to perform their workouts accurately at home and with proper guidance.
PoseNet [26] 2021 A fitness tutor-to maintain the correctness of the posture during workout exercises.
PoseNet [27] 2021 A fitness application-provides instant feedback to users to ensure the accuracy of their workout exercise poses.
MediaPipe Pose [22] 2022 To score the human body's balance ability on the wobble board.
MediaPipe Pose [23] 2022 A free weight exercise tracking software-allows users to learn and correct their exercise poses.
The four HPE libraries face two common challenges in pose estimation: inappropriate camera position and the self-occlusion effect. Even though each HPE library uses different approaches to overcome these challenges, the strengths and weaknesses of these four existing HPE libraries have yet to be discovered. Therefore, a comparative performance evaluation of these libraries should be carried out to investigate their robustness in detecting different human poses. This paper aims to compare the performance of these four state-of-the-art HPE libraries for human pose detection and analyze the strengths and weaknesses of each HPE library. Hence, a comparative analysis of these four HPE libraries based on images and videos was carried out. To the best of the authors' knowledge, this paper is the first attempt to compare and analyze the performance of PoseNet, OpenPose, MoveNet, and MediaPipe Pose using both image and video datasets.
The rest of this paper is organized as follows. Section 2 reviews the existing comparative analyses of the different HPE libraries and summarizes the functionalities of PoseNet, OpenPose, MoveNet, and MediaPipe Pose. The methodology used to evaluate the performance of each HPE library is presented in Section 3. Next, Section 4 provides the experimental results in terms of the image and video datasets. Section 5 analyzes the strengths and weaknesses of each HPE library. The last section presents the conclusion of the study.

Existing Comparative Analysis
In two of the existing comparative analysis studies, the researchers compared the performance of a few HPE libraries using image datasets. In [34], the AR dataset was used to compare the performance of OpenPose and BlazePose [33] using PCK@0.2 as the evaluation metric. The results showed that OpenPose achieved slightly better performance than BlazePose, with scores of 87.8 and 84.1, respectively. Ref. [31] used OpenPose, PoseNet, MoveNet Lightning, and MoveNet Thunder in their study. The researchers used two image datasets: the COCO [35] and MPII [36] datasets. The performances of the HPE libraries were measured using their own proposed evaluation metric. PoseNet had the best performance, while MoveNet Lightning had the worst performance. The comparative analyses in both existing studies were limited to a few HPE libraries evaluated on image datasets. The performance of the four state-of-the-art HPE libraries on video datasets has yet to be investigated. In this paper, the authors conducted experiments using both image and video datasets.

HPE Libraries
Four state-of-the-art HPE libraries are discussed in this section and their specifications are summarized in Table 2. Among the four HPE libraries, the total number of commonly detected keypoints is 17. The commonly detected keypoints of the head include the ears, eyes, and nose (5 keypoints). The 6 commonly detected keypoints of the shoulders, elbows, and wrists are categorized as the upper body, while the lower body includes 6 keypoints from the hips, knees, and ankles. In addition, OpenPose and MediaPipe Pose provide more annotations of the keypoints at the face, hand, and foot to reach maximum totals of 135 and 33 keypoints, respectively. OpenPose provides an additional 70 keypoints of the face, 20 keypoints of both hands, 1 keypoint of the upper body, and 7 keypoints of the lower body. MediaPipe Pose provides 6 additional keypoints of the head, 6 keypoints of the upper body, and 4 keypoints of the lower body. The approach of keypoint detection in the HPE libraries can be classified into top-down and bottom-up methods. In the top-down method, the number of people is first detected from the given input and each person is assigned to a separate bounding box [37]. Subsequently, keypoint estimation is performed in each bounding box. In contrast, the bottom-up method performs keypoint detection in the first step [38]. After that, the keypoints are grouped into human instances. Among these four libraries, PoseNet and MediaPipe Pose employ the top-down method, while OpenPose and MoveNet use the bottom-up method to perform human pose estimation. The four HPE libraries use different underlying networks for pose estimation. OpenPose uses a VGG-19 backbone pretrained on ImageNet, PoseNet uses ResNet [39] and MobileNet [40], MediaPipe Pose uses a Convolutional Neural Network (CNN), and MoveNet uses MobileNetV2. OpenPose is the first open-source library available since 2017 for 2D multi-person HPE [32]. OpenPose employs a
non-parametric representation known as Part Affinity Fields (PAFs) to detect the body parts associated with the person in an input image. The PAFs describe a list of 2D vector fields in an image, encoding both the orientation and location of the body limbs. In 2019, Cao et al. [41] released a new version of OpenPose that combined the body and foot keypoint detectors. The combined detector needs less inference time than running the body and foot keypoint detectors independently, while maintaining the accuracy rate. Hence, OpenPose became the first open-source library that can detect body, hand, foot, and facial keypoints in a single image, with a total of 135 keypoints. Furthermore, OpenPose is also able to perform vehicle keypoint detection by utilizing the same network architecture [41].
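Before turning to the remaining libraries, the ordering difference between the top-down and bottom-up methods described earlier can be made explicit with a toy sketch. Every function here is an illustrative stub (not a real detector or estimator); only the order of the two steps reflects the methods as described.

```python
# Illustrative stubs -- the return values are placeholders; only the
# ordering of detection vs. grouping is meaningful here.
def detect_person_boxes(image):
    return image["boxes"]                       # one bounding box per person

def estimate_keypoints_in_box(image, box):
    return [(box, k) for k in range(17)]        # 17 keypoints per person

def detect_all_keypoints(image):
    return [(b, k) for b in image["boxes"] for k in range(17)]

def group_keypoints_by_person(keypoints):
    people = {}
    for box, k in keypoints:
        people.setdefault(box, []).append((box, k))
    return list(people.values())

def top_down(image):
    # Top-down: locate each person first, then estimate keypoints per box.
    return [estimate_keypoints_in_box(image, b) for b in detect_person_boxes(image)]

def bottom_up(image):
    # Bottom-up: detect all keypoints first, then group them by person.
    return group_keypoints_by_person(detect_all_keypoints(image))
```

Both strategies end with the same result (17 keypoints per person); they differ only in whether person detection or keypoint detection happens first.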
Similar to OpenPose, PoseNet was also released in 2017 [30]. It was built on the TensorFlow machine learning platform and provides a real-time HPE implementation in the browser. There are two versions of the algorithm in PoseNet: one is used to estimate a single pose and the other is used to estimate multiple poses from the input image or video. Both algorithms are able to detect 17 keypoints on a single person. The computational time of the multi-person HPE algorithm is slightly slower than that of the single-person HPE algorithm; however, it is not affected by the number of detected persons. When using the single-person algorithm, the keypoints might be conflated if there is more than one person in the input image or video. Moreover, there are two architectures in PoseNet, which are ResNet [39] and MobileNet [40]. MobileNet is designed for mobile devices. It is more lightweight than ResNet, but has lower accuracy. Although ResNet achieves higher accuracy than MobileNet, its larger number of layers requires longer loading and inference times.
In 2020, a solution called MediaPipe Pose was released to achieve higher-fidelity human body pose tracking using a machine learning approach [33]. It utilizes BlazePose and the ML Kit Pose Detection API to infer a maximum of 33 keypoints (3D landmarks) from an RGB input. It can run in real-time on mobile phones, desktops, or laptops. BlazePose employs a two-step detector-tracker pipeline for single-person pose estimation [33]. The first step of this pipeline locates the region-of-interest (ROI) of the person inside the image frame. Subsequently, the tracker uses the ROI from the detector as the input to predict the position of each keypoint within the ROI. If the input is a video, the detector is invoked on the first frame to extract the human ROI, followed by keypoint extraction using the tracker. The tracker uses the same ROI to estimate the keypoints of the human in the next frame. When the algorithm loses track of the person, the detector is invoked again to generate a new ROI.
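The detector-tracker loop described above can be sketched in a few lines. Here `detect_roi` and `track_keypoints` are hypothetical stand-ins for BlazePose's person detector and landmark model, and the frame dictionaries are invented for illustration; only the control flow (detector on the first frame, re-detection whenever tracking is lost) follows the pipeline as described.

```python
# Hypothetical stand-ins for the real models; frames are toy dicts.
def detect_roi(frame):
    return {"roi": frame["person_box"]}

def track_keypoints(frame, roi):
    # Returns (keypoints, found): found is False when the person is lost.
    found = frame["person_box"] is not None
    return ({"keypoints": frame.get("keypoints")} if found else None), found

def detector_tracker_pipeline(frames):
    """Two-step detector-tracker loop: the detector runs on the first
    frame and whenever tracking is lost; otherwise the tracker reuses
    the ROI from the previous frame."""
    roi, results, detector_calls = None, [], 0
    for frame in frames:
        if roi is None:
            roi = detect_roi(frame)["roi"]   # (re)invoke the detector
            detector_calls += 1
        keypoints, found = track_keypoints(frame, roi)
        if not found:
            roi = None                        # lost track: re-detect next frame
        results.append(keypoints)
    return results, detector_calls
```

The point of the design is that the (cheaper) tracker runs every frame, while the (more expensive) detector runs only when needed.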
MoveNet [31], which was released in 2021, is a pose detection model that detects 17 keypoints of a single person in real-time. There are two variants of MoveNet: Lightning and Thunder. The accuracy of Lightning is lower than that of Thunder; however, the inference time of Lightning is faster. MoveNet uses heatmaps to accurately localize the human keypoints. Its architecture consists of two components: a feature extractor and a set of prediction heads. The prediction technique of MoveNet loosely follows that of CenterNet [42] to improve its accuracy and speed. CenterNet is an object detector that uses keypoint estimation networks to find the center points and regress to the object size, location, and orientation. The feature extractor is MobileNetV2 with an attached feature pyramid network [43] to produce a high-resolution and semantically rich feature map output. MobileNetV2 is a neural network designed for mobile devices to extract features for object detection, classification, and semantic segmentation. There are four parts in the prediction heads: the person center heatmap, keypoint regression field, person keypoint heatmap, and 2D per-keypoint offset field. They are responsible for predicting the human keypoints using heatmaps.

Methodology
In this section, the descriptions of the datasets used in the experiments are presented. After selecting the datasets, data pre-processing was carried out. The HPE procedure was then performed. Next, an evaluation metric was used to evaluate the performance of each HPE library. The evaluation metric is also discussed in this section.

Datasets
The image and video datasets used in this experiment were the Microsoft Common Objects in Context (COCO) [35] and Penn Action [44] datasets, respectively. Table 3 shows the characteristics of the COCO and Penn Action datasets. Both datasets have 6 common upper body keypoints and 6 common lower body keypoints. However, the difference between these two datasets is the number of keypoints annotated for the head. COCO has 5 keypoints, including the nose, eyes, and ears, while Penn Action only provides 1 keypoint at the head position. Figures 1 and 2 show sample images of the COCO and Penn Action datasets, respectively.
COCO [35] is a large-scale object detection, segmentation, and captioning dataset. It is commonly used in experiments for HPE [45-47]. It consists of 330,000 images, 1.5 million object instances, 80 object categories, 91 stuff categories, and 250,000 people with keypoints. This dataset provides annotations for body keypoint detection, where each instance of a person is labeled with 17 keypoints. There are various versions of COCO; COCO 2017 is commonly selected for HPE experiments.
Penn Action [44] is a video dataset consisting of 2326 video sequences covering 15 actions. It is commonly used in HPE experiments [18,48-50]. Each video sequence contains RGB image frames and annotations. The annotations include the human action label, 2D bounding boxes to locate human positions, and skeleton keypoints of the body parts. Each instance of a human is labeled with 13 keypoints.

Data Pre-Processing
Before evaluating the performance of the HPE libraries, data pre-processing was conducted to filter out irrelevant data in both datasets. There are three types of images in COCO: images of a single person, images of multiple people, and images without people. This experiment focused on single-person HPE; hence, images with multiple people and without people were removed. In addition, images with only half of the human body were removed. Thus, the 1100 remaining images were used in the experiment. In order to compare the performance of the four HPE libraries, the 17 commonly detected keypoints from the human body were matched with the 17 annotations provided by the dataset (ground truth).
For the Penn Action videos, the action of guitar strumming was removed since its video frames only contained the upper half of the human body. The first 14 actions were used in this experiment (refer to Figure 2). Since the Penn Action dataset only provides 1 keypoint annotation for the head, which differs from the four HPE libraries (refer to Table 3), the head annotation was removed from the experiments to maintain a fair comparison among all libraries. Thus, the 12 remaining keypoints were used as the ground truth.
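The keypoint-matching step above can be sketched as follows. The keypoint names follow the standard COCO ordering and the Penn Action annotation set described in the text; the helper function is an illustrative sketch of the matching, not the authors' actual pre-processing code.

```python
# Standard 17 COCO keypoint names, in COCO annotation order.
COCO_KEYPOINTS = [
    "nose", "left_eye", "right_eye", "left_ear", "right_ear",
    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_ankle", "right_ankle",
]

# Penn Action annotates 13 keypoints: a single head point plus the
# 12 upper/lower body joints shared with the libraries.
PENN_KEYPOINTS = [
    "head",
    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_ankle", "right_ankle",
]

def common_keypoints(predicted_names, ground_truth_names):
    """Keep only keypoints present in both the library output and the
    dataset annotations, preserving the ground-truth order."""
    predicted = set(predicted_names)
    return [name for name in ground_truth_names if name in predicted]
```

Applying `common_keypoints(COCO_KEYPOINTS, PENN_KEYPOINTS)` drops the unmatched head point and leaves the 12 joints used as ground truth for the video experiments.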

Evaluation Metrics
Evaluation metrics play an important role in evaluating the quality of HPE libraries. The evaluation metric used in this study was the percentage of detected joints (PDJ), which is able to measure the performance of an HPE library [37,51,52]. PDJ uses the Euclidean distance between the ground truth and predicted keypoints in pixel(s) to measure the detection accuracy of the HPE libraries. The higher the value of PDJ, the higher the accuracy rate. The calculation of the Euclidean distance $d_i$ between the ground-truth keypoint $(x_i^{gt}, y_i^{gt})$ and the predicted keypoint $(x_i^{pred}, y_i^{pred})$ is shown in Equation (1):

$d_i = \sqrt{(x_i^{pred} - x_i^{gt})^2 + (y_i^{pred} - y_i^{gt})^2}$ (1)

The PDJ threshold was set to 0.05 of the torso diameter $D$. The torso diameter was computed as the Euclidean distance from the left shoulder $(x_{ls}, y_{ls})$ to the right hip $(x_{rh}, y_{rh})$, as shown in Equation (2):

$D = \sqrt{(x_{ls} - x_{rh})^2 + (y_{ls} - y_{rh})^2}$ (2)

When the distance $d_i$ between a predicted keypoint and its ground-truth keypoint was smaller than the threshold $0.05D$, the predicted keypoint was considered to be correctly detected. Hence, the PDJ can be deduced as shown in Equation (3), where $n$ represents the total number of predicted joints:

$\mathrm{PDJ} = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\left(d_i < 0.05D\right) \times 100\%$ (3)
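The PDJ metric can be implemented directly from these definitions. The following is a minimal sketch in Python; the function and variable names are illustrative.

```python
import math

def euclidean(p, q):
    # Equation (1): Euclidean distance between two 2D keypoints, in pixels.
    return math.hypot(p[0] - q[0], p[1] - q[1])

def pdj(predicted, ground_truth, left_shoulder, right_hip, threshold=0.05):
    """Percentage of Detected Joints (Equation (3)).

    predicted, ground_truth: lists of (x, y) keypoints in pixels.
    A joint counts as detected when its error is below
    threshold * torso diameter (Equation (2))."""
    torso = euclidean(left_shoulder, right_hip)   # Equation (2)
    correct = sum(
        1 for p, g in zip(predicted, ground_truth)
        if euclidean(p, g) < threshold * torso
    )
    return 100.0 * correct / len(ground_truth)
```

For example, with a torso diameter of 50 pixels, the threshold is 2.5 pixels: a prediction 0.2 pixels from its ground truth counts as detected, while one 5 pixels away does not.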

Experiment Results
Three experiments were conducted to evaluate the performances of the HPE libraries. For the image dataset, an experiment was conducted to compare the performance of each HPE library on each image. For the video dataset, the first experiment compared the performance of each HPE library on each video frame, whereas the second experiment investigated how well each HPE library performed for each body part of each action.

Image Dataset
The PDJ value for each image was calculated to compare the performance of the HPE libraries on each image. The results are presented in a box plot, as shown in Figure 3. Among the four HPE libraries, MoveNet (orange box) achieved the highest PDJ in terms of the lower fence, first quartile, median, and third quartile values. The second best performing HPE library was OpenPose (blue box), which achieved the same median and third quartile values as MoveNet; however, OpenPose had lower first quartile and fence (outlier) values. The third best performing HPE library was PoseNet (green box), followed by MediaPipe Pose (red box). The minimum values of MediaPipe Pose and PoseNet were 0. Meanwhile, the outlier values for MoveNet and OpenPose were also 0. A value of 0 indicated that the keypoints detected in some of the images were incorrectly matched with the ground truth provided by the COCO dataset. Table 4 divides the performance of the HPE libraries into five groups: 0%, 0-25%, 25-50%, 50-75%, and 75-100%. Each group shows the number of images that were recognized by the HPE library within the specific range of PDJ values. MoveNet had the least number of images that achieved 0% PDJ (5 images). OpenPose achieved the second lowest number of images at 0% (8 images), followed by PoseNet (30 images) and MediaPipe Pose (240 images). In the range of 0-25%, MoveNet had the least number of images (33 images), followed by OpenPose (87 images), PoseNet (128 images), and MediaPipe Pose (342 images). Likewise, MoveNet had the least number of images that achieved a PDJ within the range of 25-50% (68 images).
MoveNet achieved superior performance because it had the highest number of images in the last two groups (50-100%), which indicated that more than 50% of the detected keypoints from 994 images were correctly matched with the ground truth. In contrast, MediaPipe Pose was found to have the poorest performance because it had the highest number of images in the first to third groups (0-50%); in a total of 698 images, less than 50% of the detected keypoints were correctly matched with the ground truth. Overall, MoveNet achieved the top performance, showing the highest number of images in the fifth group: 747 out of 1100 images were recognized by MoveNet with 75-100% of keypoints detected. In contrast, MediaPipe Pose showed the poorest performance, as it achieved the lowest number of images in the range of 75-100% and the highest number of images in the 0% group. OpenPose achieved the second highest performance, slightly lower than MoveNet. PoseNet and MediaPipe Pose achieved the third and fourth highest performances, respectively. In the overall comparison, MoveNet was the most robust because it achieved the top performance in terms of lower fence, first quartile, median, and third quartile values compared to the other HPE libraries, and achieved more than a 50% PDJ value for 994 out of 1100 images. MediaPipe Pose showed the poorest performance on the image dataset, with the lowest PDJ in terms of first quartile, median, and third quartile values; less than 50% of detected keypoints could be correctly matched with the ground truth in 698 out of 1100 images. OpenPose and PoseNet achieved the second best and third best performances, respectively.
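The grouping used in Table 4 (and later in Table 5) can be sketched as a small binning helper. The handling of boundary values is an assumption, since the paper does not state into which group a value lying exactly on a boundary (e.g., 25%) falls.

```python
GROUPS = ("0%", "0-25%", "25-50%", "50-75%", "75-100%")

def pdj_group(pdj_value):
    """Assign a PDJ value (in %) to one of the five groups used in
    Table 4. Boundary handling (upper bound inclusive) is an assumption."""
    if pdj_value == 0:
        return "0%"
    for upper, label in ((25, "0-25%"), (50, "25-50%"), (75, "50-75%")):
        if pdj_value <= upper:
            return label
    return "75-100%"

def count_per_group(pdj_values):
    # Count how many images (or actions) fall into each PDJ range.
    counts = {g: 0 for g in GROUPS}
    for v in pdj_values:
        counts[pdj_group(v)] += 1
    return counts
```

Feeding the per-image PDJ values into `count_per_group` reproduces the kind of tally reported in Table 4.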

Video Dataset
The mean PDJ for each action was calculated and the PDJ values for all actions are shown in Figure 4. Similar to Table 4, Table 5 groups the performance of the HPE libraries for all actions into five ranges: 0%, 0-25%, 25-50%, 50-75%, and 75-100%. No action showed 0% for any HPE library. In the range of 75-100%, MediaPipe Pose achieved the greatest number of actions (5 actions). MoveNet and PoseNet achieved 4 actions in this group, while OpenPose showed 0 actions. In the fourth group (50-75%), MoveNet achieved the greatest number of actions (10 actions), followed by PoseNet (7 actions), MediaPipe Pose (6 actions), and OpenPose (4 actions). In the first three groups (0-50%), OpenPose showed the greatest number of actions (10 actions), MediaPipe Pose and PoseNet each showed 3 actions, while MoveNet showed 0 actions in these groups. Overall, MoveNet achieved the best performance because it achieved above 50% PDJ in all actions. OpenPose showed the worst performance, since the PDJ values for 10 actions were lower than 50%. To get a better understanding of the performance of the HPE libraries for each action, the highest and lowest average PDJ values of each library for each action are highlighted in Table 6. Among the 14 actions, the libraries achieved the best performance in Action 8 (jumping jacks). MediaPipe Pose showed the poorest performance in Action 12 (squat) among all actions. Coincidentally, both OpenPose and PoseNet showed the poorest performance in Action 2 (bench press). On the other hand, MoveNet showed the poorest performance in Action 4 (bowling). The overall average PDJ values for all actions are highlighted in the last row of Table 6. The performance rank from the highest to lowest PDJ values was MoveNet, MediaPipe Pose, PoseNet, and OpenPose. The performances of MoveNet, MediaPipe Pose, and PoseNet fell between 65% and 70%. The performance of OpenPose was much lower than the others, at approximately 37%. Although MediaPipe
Pose successfully detected 5 actions (refer to Table 5) with between 75% and 100% PDJ, which was more than MoveNet, its overall performance was 1.38% lower than that of MoveNet. The ranking of each library for each action is listed in Table 7. MediaPipe Pose achieved the top performance in 7 of 14 actions. It achieved the second highest performance in 3 actions and the third highest performance in 4 actions. MoveNet achieved the highest performance in 6 actions, the second highest performance in 5 actions, and the third highest performance in 3 actions. PoseNet showed the highest performance in only Action 4 (bowling), the second highest performance in 6 actions, and the third highest performance in 7 actions. OpenPose showed the lowest performance in all actions among all the HPE libraries. Since the results in Table 6 show that the four HPE libraries achieved the best performance in Action 8 (jumping jacks) among all actions, a closer analysis was performed. MediaPipe Pose, MoveNet, and PoseNet reached approximately 92%, while OpenPose achieved approximately 72% for Action 8.
Figure 5 shows the video frame and detection results of Action 8, where the two challenges, self-occlusion and inappropriate camera position, are absent. Hence, the performance of HPE should be better when fewer challenges affect keypoint detection. On the other hand, the performance of HPE is reduced when there are more challenges. MediaPipe Pose showed the poorest performance in Action 12 (squat); OpenPose and PoseNet showed the worst performance in Action 2 (bench press); while MoveNet showed the poorest performance in Action 4 (bowling). Figures 6-8 show the sample video frames and detection results of each HPE library for Actions 12, 2, and 4, respectively. In addition, the PDJ values for each body part in each action were also calculated. Frequent changes in keypoint positions also affect the performance of keypoint detection. In Actions 1, 3, 4, 6, 13, and 14, the PDJ values for the elbows and wrists in all the tested libraries were lower than those of other body parts, as shown in Figures 9-11. This was due to more frequent changes in keypoint positions in the elbows and wrists compared to other body parts. For instance, a person bowling mainly uses the elbow and wrist to release the ball onto the bowling lane, and hence shows fewer changes of movement in other body parts. Likewise, OpenPose (blue line) showed the lowest PDJ among these actions, while the fluctuations in PDJ values between the other HPE libraries were relatively small. When the common challenges (self-occlusion and inappropriate camera position) occurred in the videos, the performance of keypoint detection was affected in all HPE libraries. Figure 12 shows the video frames of Action 9 (pull up), illustrating the self-occlusion effect in the action. In this action, all libraries showed much lower performance in some of the body parts compared to others due to this self-occlusion effect. The corresponding PDJ values are reported in Figure 13. The PDJ values for ankles
(black dotted box) were much lower than those for other parts because of self-occlusion from the crossed ankles, as shown in Figure 14. OpenPose (blue line) achieved the lowest performance for all body parts, particularly for the elbows and wrists. Additionally, the performance of MoveNet for the ankles was lower than that of MediaPipe Pose and PoseNet when facing the self-occlusion effect.

In addition, inappropriate camera position is one of the common challenges for human pose estimation. In Action 7 (jump rope), the camera was placed on the right side of the person and the video frames were recorded from the right side, as shown in Figure 15. Hence, self-occlusion occurred in the left body parts, which reduced the performance of keypoint detection for the left body parts. Figure 16 clearly shows that the performance of all HPE libraries for the right body parts was higher than that for the left body parts, where the PDJ values for the right shoulder and right elbow were always higher than those for the left shoulder and left elbow. In the overall comparison, MoveNet achieved the highest PDJ values in terms of minimum, first quartile, and third quartile values compared to the other HPE libraries. The PDJ values for MoveNet in all actions were more than 50%. MediaPipe Pose achieved the highest PDJ values in terms of maximum and median values. All HPE libraries achieved superior performance in Action 8 (jumping jacks), which involves fewer challenges. Based on the ranking of the HPE libraries in each action, MediaPipe Pose achieved the top performance in 7 actions, followed by MoveNet (6 actions) and PoseNet (1 action). OpenPose showed the poorest performance in all actions. MoveNet achieved the highest average PDJ value, while OpenPose showed the lowest average PDJ value. However, MediaPipe Pose and PoseNet were still competitive with MoveNet because the average PDJ values for these three libraries were in the range of 65-68%.

Discussion
The main findings of this study are summarized in this section. MediaPipe Pose had the lowest overall performance on the image dataset. However, it performed well on the video dataset, with an overall performance slightly lower than the top overall performer, MoveNet. It achieved the best performance in 7 out of 14 actions.
MoveNet achieved the highest overall performance for keypoint detection on both the image and video datasets. For the video dataset, the PDJ values for all actions were greater than 50%, which was the best performance compared to the other HPE libraries. Additionally, it achieved the best performance in 6 actions, compared to 7 actions for MediaPipe Pose. Nevertheless, its overall average PDJ was the highest (69.85%).
OpenPose achieved the second highest performance on the image dataset. However, the weakness of OpenPose was in detecting the keypoints in continuous video frames. Its overall performance on the video dataset was the lowest, approximately 30% lower than the other HPE libraries. Overall, PoseNet had the third highest performance on both the image and video datasets.
The performance of the HPE libraries is reduced when they are constrained by challenges such as inappropriate camera position and the self-occlusion effect. OpenPose was the least robust in detecting video frames when facing these challenges. This is because OpenPose always loses track when self-occlusion occurs in the video frames. Figure 17 illustrates this point with a line graph of the PDJ values for OpenPose in each frame of Action 1 (baseball pitch). Based on the results, the PDJ values decreased from frame 84 to 102 (refer to the green box). Due to the bottom-up method used by OpenPose, the detected keypoints cannot be grouped into a human instance when there is self-occlusion in these frames, resulting in tracking failure. The original video frames and results of the OpenPose detection from frame 78 to 103 are shown in Figure 18 for illustration. OpenPose clearly performed poorly in the frames with self-occlusion (frames 84 to 103).
Based on the results of this experiment, MoveNet is suitable for detecting both images and videos, while OpenPose is more suitable for detecting images. This study focused on 2D single-person HPE, and PDJ was the evaluation metric used to evaluate the quality of the HPE libraries.
As a result, MoveNet showed superior performance, while MediaPipe Pose showed the lowest performance in detecting images. In addition, MoveNet also showed top performance, while OpenPose showed the lowest performance in detecting videos. However, OpenPose showed the second highest performance in detecting images. PoseNet showed average performance in detecting both images and videos. When facing challenges such as inappropriate camera position or self-occlusion, the performance in detecting body parts is reduced. MoveNet, MediaPipe Pose, and PoseNet can handle these challenges well, but OpenPose shows the poorest performance under these conditions. In detecting videos, OpenPose had the lowest robustness because it loses track when self-occlusion occurs with body parts.
The limitation of this study is that the experiments focused on analyzing the performance of the four HPE libraries using only the PDJ metric. The memory consumption, inference time, and detection speed of each HPE library will be compared in our future work.

Figure 3. Box plot of PDJ of all images in the image dataset.
Among three of the HPE libraries, MoveNet (orange box), MediaPipe Pose (red box), and PoseNet (green box) achieved fairly similar performances in terms of minimum, first quartile, median, third quartile, and maximum values. MediaPipe Pose (red box) had the highest PDJ in terms of maximum and median values, while MoveNet (orange box) scored the highest in terms of minimum, first quartile, and third quartile values. OpenPose (blue box) showed a weak performance, with the lowest values for all quartiles.

Figure 4. PDJ in all actions in the video dataset.

Figure 5. Sample video frame and detection result of Action 8 (jumping jacks). Green lines indicate the ground truth, red lines indicate the tested HPE library.

Figure 6. Sample video frame and detection result of Action 12 (squat). Green lines indicate the ground truth, red lines indicate the tested HPE library.

Figure 7. Sample video frame and detection result of Action 2 (bench press). Green lines indicate the ground truth, red lines indicate the tested HPE library.

Figure 8. Sample video frame and detection result of Action 4 (bowling). Green lines indicate the ground truth, red lines indicate the tested HPE library.

Figure 9. PDJ for each body part in Action 1 (baseball pitch).

Figure 10. PDJ for each body part in Action 4 (bowling).

Figure 11. PDJ for each body part in Action 13 (tennis forehand).

Figure 16. PDJ for each body part in Action 7 (jump rope).

Table 2. Specifications of each HPE library.

Table 3. The characteristics of the COCO and Penn Action datasets.
COCO (17 keypoints): Nose, Left Eye, Right Eye, Left Ear, Right Ear, Left Shoulder, Right Shoulder, Left Elbow, Right Elbow, Left Wrist, Right Wrist, Left Hip, Right Hip, Left Knee, Right Knee, Left Ankle, Right Ankle.
Penn Action (13 keypoints): Head, Left Shoulder, Right Shoulder, Left Elbow, Right Elbow, Left Wrist, Right Wrist, Left Hip, Right Hip, Left Knee, Right Knee, Left Ankle, Right Ankle.

Table 4. Number of images recognized by each HPE library in each specific range of PDJ values.

Table 5. Number of actions recognized by each HPE library in each specific range of PDJ values.

Table 6. Overall average of PDJ (%) in each action. Green text represents the highest PDJ of each HPE library among all actions. Red text represents the lowest PDJ of each HPE library among all actions. The green cell represents the highest value of average PDJ. The red cell represents the lowest value of average PDJ.

Table 7. Ranking of each HPE library in each action.