This study utilizes the YOLOv8-Pose model to enhance the accuracy of acupoint detection on the human back. Specifically, we introduce a non-local attention mechanism to capture the spatial dependencies between acupoint keypoints while integrating global features by calculating the correlations among keypoint locations within the feature map. Additionally, we constructed a keypoint regression bounding box that allows the OKS loss function to incorporate geometric information, such as the area, width, and height of the bounding box, based on the Euclidean distance. This approach effectively reduces the prediction error of the model.
2.2.1. Dataset Construction
To build a high-quality acupoint dataset, this study collected images of the back region from diverse individuals, covering variations in BMI, age, gender, and skin color. Example images from the dataset are shown in Figure 11. Back acupoints are primarily distributed along the Bladder Meridian of Foot-Taiyang, the Small Intestine Meridian of Hand-Taiyang, and the Governor Vessel. Acupoints on the Governor Vessel are mainly located in the depressions below the spinous processes of the vertebrae, representing key areas of the spinal column; those on the Bladder Meridian of Foot-Taiyang are situated approximately 1.5 or 3 cun lateral to the spinous processes, following a consistent positioning pattern; and those on the Small Intestine Meridian of Hand-Taiyang are concentrated around the scapular spine and can be precisely located from the relative positions of the thoracic vertebrae and the scapula. Additionally, the back region includes several fixed extra-meridian points that have unique therapeutic functions and significance. During dataset construction, 49 fundamental acupoint feature points were selected and annotated according to their spatial distribution.
The annotation of acupoint locations is a crucial step in dataset construction, as it directly affects the accuracy and effectiveness of model training. In this study, the acupoints on the back were annotated with Labelme (version 5.0.1), an open-source image annotation tool that supports various annotation forms. The annotation interface is shown in Figure 12. During annotation, the positioning of the acupoints strictly followed the national standards, and the accuracy of each annotated acupoint was ensured under the guidance of traditional Chinese medicine practitioners.
During the dataset construction process, a total of 420 back images were annotated, covering a diverse population to ensure broad adaptability of the dataset. In the annotation process, all acupoint locations were stored as keypoints, with the two-dimensional pixel coordinates of each acupoint recorded in the image. These coordinate data accurately reflect the positions of acupoints within the images and are used for supervised learning in subsequent model training.
The annotated images are shown in Figure 13, where different colors represent the 49 different acupoints. All annotation information is stored in standard JSON files. Each annotated JSON file contains the image file information, the keypoint information, and the acupoint numbers.
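As a minimal sketch of how such annotation files could be read, the snippet below extracts point annotations from a Labelme-style JSON record and converts the pixel coordinates to normalized coordinates. The keys `shapes`, `label`, `points`, `shape_type`, `imageWidth`, and `imageHeight` follow the Labelme JSON layout; storing the acupoint number as the shape label, the file name, and the sample values are assumptions made for illustration only.

```python
import json


def load_acupoint_keypoints(labelme_json_text):
    """Return {label: (x, y)} in pixels for every point shape in the record."""
    record = json.loads(labelme_json_text)
    keypoints = {}
    for shape in record.get("shapes", []):
        if shape.get("shape_type") != "point":
            continue  # ignore rectangles, polygons, and other shape types
        x, y = shape["points"][0]
        keypoints[shape["label"]] = (float(x), float(y))
    return keypoints


def normalize_keypoints(keypoints, width, height):
    """Scale pixel coordinates to the [0, 1] range used by many training formats."""
    return {k: (x / width, y / height) for k, (x, y) in keypoints.items()}


# Hypothetical record with two annotated acupoints (labels are assumed numbers).
example_record = json.dumps({
    "imagePath": "back_001.jpg",
    "imageWidth": 640,
    "imageHeight": 480,
    "shapes": [
        {"label": "17", "points": [[320.0, 120.0]], "shape_type": "point"},
        {"label": "1", "points": [[160.0, 96.0]], "shape_type": "point"},
    ],
})

pixel_points = load_acupoint_keypoints(example_record)
unit_points = normalize_keypoints(pixel_points, 640, 480)
```

Keeping the pixel coordinates and the normalized coordinates separate makes it easy to check annotations visually before exporting them for training.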
2.2.2. Improved YOLOv8-Pose Model
The YOLO algorithm is a deep-learning-based object detection model widely applied in real-time object detection tasks [23]. Its core concept treats object detection as a regression task, directly predicting all object categories and their localization information within an image in a single forward pass. Specifically, YOLO divides the input image into an S × S grid, with each grid cell responsible for predicting whether an object exists within that cell while simultaneously regressing the object’s bounding box and category probabilities.
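The grid assignment described above can be illustrated with a short sketch. It only covers the S × S formulation of the original YOLO as summarized in the text; the function name and the sample image size are hypothetical, and later YOLO versions assign targets differently.

```python
def responsible_cell(cx, cy, img_w, img_h, s):
    """Return (row, col) of the grid cell that contains the object center."""
    col = min(int(cx / img_w * s), s - 1)  # clamp centers lying on the right edge
    row = min(int(cy / img_h * s), s - 1)  # clamp centers lying on the bottom edge
    return row, col


# An object centered at (320, 120) in a 640 x 480 image with a 7 x 7 grid.
cell = responsible_cell(320, 120, 640, 480, 7)
```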
YOLOv8 is a variant within the YOLO series, adopting the CSPDarknet53 backbone architecture with enhanced feature extraction capabilities [24]. In the head section, it adopts an optimized Decoupled-Head design. YOLOv8-Pose is a branch of the YOLOv8 model designed to simultaneously perform human bounding box detection and human keypoint localization [25]. Its network architecture primarily consists of three components: backbone, neck, and head. The backbone serves as the main network, responsible for feature extraction from input images through the CBS, C2f, and SPPF modules. The CBS module comprises a convolutional layer, a batch normalization layer, and an activation function. It first extracts features through convolution, then applies batch normalization to accelerate training, and finally employs the SiLU activation function to enhance the network’s nonlinear representation capabilities. The module structure is illustrated in Figure 14. This design helps mitigate the vanishing gradient problem commonly encountered during small object detection.
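The convolution, batch normalization, and SiLU sequence of the CBS module can be sketched on a one-dimensional signal. This is a simplified illustration, not the model's implementation: the normalization here has no learnable scale or shift and uses the statistics of a single signal, and the kernel and input values are arbitrary assumptions.

```python
import math


def silu(x):
    """SiLU activation x * sigmoid(x), written to avoid overflow for large |x|."""
    if x >= 0:
        return x / (1.0 + math.exp(-x))
    e = math.exp(x)
    return x * e / (1.0 + e)


def conv1d(signal, kernel):
    """Valid cross-correlation, the operation deep-learning libraries call convolution."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]


def batch_norm(values, eps=1e-5):
    """Normalize to zero mean and unit variance (no learnable parameters)."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]


def cbs(signal, kernel):
    """Convolution, then batch normalization, then SiLU."""
    return [silu(v) for v in batch_norm(conv1d(signal, kernel))]


cbs_out = cbs([1.0, 2.0, 4.0, 7.0, 11.0], [1.0, 0.0, -1.0])
```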
The C2f module first performs preliminary processing on the input feature map via a convolution, then divides the feature map into two branches: one part preserves the original features, while the other undergoes layer-by-layer processing through N Bottleneck modules. Each Bottleneck module consists of two CBS modules and one Concatenate (Concat) module. After each Bottleneck module, the feature stream splits into two paths: one path transforms the features and feeds them into the next Bottleneck module, while the other retains the current features as input for the subsequent feature concatenation. After processing by the N Bottleneck modules, features from all paths are fused. The network architecture of the C2f module is illustrated in Figure 15.
The SPPF module expands the network’s receptive field by applying multi-scale pooling operations to input feature maps, thereby enhancing the model’s ability to process objects of diverse sizes and improving detection performance with little additional computational overhead. Specifically, input features undergo preliminary processing through a CBS module before branching into two paths: one path preserves the original features and feeds them into a Concat module, while the other path passes the features sequentially through three max-pooling layers. The pooled features are also fed into the Concat module for fusion. The concatenated features are then processed through another CBS module to generate the final output. The network structure of the SPPF module is illustrated in Figure 16.
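The pooling branch of SPPF can be sketched on a single-channel feature map. The sketch assumes a 5 × 5 kernel with stride 1 and same-size padding, as in common SPPF implementations, so the spatial size is preserved; the surrounding CBS modules are omitted and the function names are hypothetical.

```python
def max_pool_same(fmap, k=5):
    """Stride-1 max pooling that keeps the spatial size (windows clipped at borders)."""
    pad = k // 2
    h, w = len(fmap), len(fmap[0])
    out = []
    for r in range(h):
        row = []
        for c in range(w):
            window = [fmap[i][j]
                      for i in range(max(0, r - pad), min(h, r + pad + 1))
                      for j in range(max(0, c - pad), min(w, c + pad + 1))]
            row.append(max(window))
        out.append(row)
    return out


def sppf_branches(fmap, k=5):
    """Return the four maps that SPPF concatenates: input plus three chained pools."""
    p1 = max_pool_same(fmap, k)
    p2 = max_pool_same(p1, k)
    p3 = max_pool_same(p2, k)
    return [fmap, p1, p2, p3]


# A 7 x 7 map with a single active position in the center.
demo_map = [[0] * 7 for _ in range(7)]
demo_map[3][3] = 1
branches = sppf_branches(demo_map)
```

Chaining two 5 × 5 pools covers the same neighborhood as one 9 × 9 pool, which is how the sequential design reaches several receptive-field sizes while reusing intermediate results.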
The neck integrates and further processes the features extracted by the backbone through C2f, Concat, and Upsample modules, which enhances the model’s ability to generate clearer prediction results. The head processes the fused features through two CBS modules, followed by a Conv2D layer that outputs the final result.
However, in human acupoint detection, acupoints on the back are numerous and densely distributed. Particularly in high-resolution images, the small size of the acupoints makes them difficult for the model to distinguish. Therefore, this study proposes an improved network architecture to address the limitations of the existing model, as illustrated in Figure 17. In the red area (a) of Figure 17, we integrate the non-local attention mechanism. After the features pass through the SPPF module, they enter the non-local module, which captures the long-range dependencies of the acupoints by calculating the correlations among the keypoints. This integration allows for a more comprehensive feature representation. In the red area (b) of Figure 17, we enhance the original OKS loss function by incorporating geometric information into the bounding box regression. This modification shifts the focus from single keypoint regression to keypoint bounding box regression. Consequently, the model can account for geometric attributes such as width, height, and area during regression, thereby improving the accuracy of acupoint detection by more effectively evaluating the discrepancies between predicted results and actual labels in the head prediction layer.
2.2.3. Improved Non-Local Attention Mechanism
In the detection of acupoints on the human back, the distribution of acupoints exhibits inherent long-range spatial dependencies. First, acupoints along the Bladder Meridian of Foot-Taiyang (e.g., Feishu, Xinshu, Ganshu, Shenshu) are separated by large longitudinal distances, which can span tens to hundreds of pixels in images, forming typical long-range dependencies. Second, bilaterally symmetric acupoints (e.g., left and right Dachangshu) maintain correlations across the body midline. Conventional local attention mechanisms model feature dependencies through convolutional kernels with limited receptive fields, so their effective modeling distance is constrained by the number of convolutional layers and the kernel size. Consequently, they struggle to capture these long-range dependencies. In contrast, the non-local attention mechanism directly models global dependencies by computing pairwise similarities between all positions in the feature map, making it more suitable for back acupoint detection. To enable the model to better capture correlations among acupoints, this paper introduces the non-local attention mechanism into the backbone network of the YOLOv8 model. This mechanism dynamically adjusts the feature representation at each location, enabling the model to extract local features while incorporating global contextual information. The network architecture of the non-local attention mechanism is illustrated in Figure 18.
After SPPF, the feature X is input into the non-local attention mechanism for linear mapping. X has dimensions T × H × W × 1024, where T represents the number of time steps, H and W denote the height and width of the feature, respectively, and 1024 is the number of channels. Initially, the channels of X are compressed using 1 × 1 × 1 convolutions, resulting in three new features, $\theta$, $\phi$, and $g$, each with 512 channels. Subsequently, the THW × THW similarity matrix is derived by performing a dot-product operation between features $\theta$ and $\phi$ to assess the correlation between positions. This similarity matrix is then normalized using the Softmax function, transforming it into a weight matrix. Each element in this matrix signifies the similarity weight of one position in relation to the others, with values ranging from 0 to 1. The weight matrix is then multiplied by feature $g$ to produce a new feature $y$ of dimensions T × H × W × 512. Finally, a 1 × 1 convolution is applied to restore the number of channels to 1024, followed by a residual connection with the original input feature X, resulting in an output feature $z$ that encapsulates the long-range dependencies of the image. The formula for the matrix dot-product operation between features $\theta$ and $\phi$ is as follows:

$$f(x_i, x_j) = e^{r^{\top} s}, \qquad r = \theta(x_i), \quad s = \phi(x_j)$$

where $x_i$ and $x_j$ represent the $i$-th and $j$-th positions in the feature map X, respectively. After mapping $x_i$ and $x_j$ to a new feature space by 1 × 1 convolution, the representations $r$ and $s$ are obtained, respectively. The superscript $\top$ denotes transposition, while T remains the number of time steps. The new feature $y_i$ of the $i$-th position can be derived as:

$$y_i = \frac{1}{C(x)} \sum_{\forall j} f(x_i, x_j)\, g(x_j)$$

In the formula, $y_i$ is the new feature obtained after the dot-product operation between the weight matrix and feature $g$, $\sum_{\forall j}$ is the weighted sum over all positions $j$ in the input feature map, $C(x)$ is the normalization factor, and $g(x_j)$ is the result obtained by the linear transformation of the position features. With $C(x) = \sum_{\forall j} f(x_i, x_j)$, this normalization corresponds to the Softmax operation described above. The output feature $z_i$ of the non-local module is defined as follows:

$$z_i = W_z y_i + x_i$$

where $z_i$ is the new feature of position $i$ after the fusion of long-range dependencies, $W_z$ is the linear transform used to project the new feature into the same dimension as the original input feature, and $x_i$ is the original feature before it enters the non-local module.
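The computation of the non-local module can be sketched with plain Python lists. The sketch treats the feature map as N flattened positions with C channels, models each 1 × 1 convolution as a per-position matrix multiplication, and uses tiny hand-picked weight matrices; the function names and all numeric values are assumptions for illustration, not the trained model.

```python
import math


def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax_rows(m):
    """Row-wise Softmax, so every row becomes weights that sum to 1."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def non_local(x, w_theta, w_phi, w_g, w_z):
    """x: N positions x C channels. Returns (output z, N x N weight matrix)."""
    theta = matmul(x, w_theta)            # N x C'
    phi = matmul(x, w_phi)                # N x C'
    g = matmul(x, w_g)                    # N x C'
    weights = softmax_rows(matmul(theta, transpose(phi)))  # pairwise similarities
    y = matmul(weights, g)                # weighted sum over all positions
    wy = matmul(y, w_z)                   # project back to C channels
    z = [[wy[i][c] + x[i][c] for c in range(len(x[0]))]   # residual connection
         for i in range(len(x))]
    return z, weights


# Three positions with two channels each; C' = 1 in this toy example.
x_demo = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
z_demo, w_demo = non_local(x_demo, [[1.0], [0.0]], [[1.0], [0.0]],
                           [[1.0], [1.0]], [[0.5, 0.5]])
```

Because the weight matrix has one row and one column per position, its memory cost grows quadratically with the feature-map size, which is one reason such a block is usually placed on a low-resolution feature map.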
To compare the performance of various attention mechanisms in acupoint detection, this study selected the Convolutional Block Attention Module (CBAM) [26] and the Simple Attention Module [27]. Additionally, two common attention modules, namely SimAM [28] and the Activation Enhancement (AE) attention module [29], were compared with the non-local attention module. The performance evaluation criteria included precision, recall, and mAP.
Table 1 presents the experimental results of different attention mechanisms. Among these, the precision of CBAM reached 87.6%, although its recall was lower at 83.2%. SimAM exhibited a higher recall of 85.5%, while its precision and mAP were 85.4% and 83.8%, respectively. The metrics for AE were relatively balanced, with precision, recall, and mAP values of 85.9%, 85.4%, and 83.2%, respectively. The precision of the non-local attention mechanism was 88.1%, with a recall of 85.7% and an mAP of 85.8%, demonstrating superiority over other attention mechanisms across various metrics. By analyzing the relationship between the location of each keypoint in the input feature map and the locations of all other keypoints, the non-local attention mechanism can generate more discriminative feature representations, thereby enabling the model to more accurately identify and locate dense and small acupoints. Consequently, the precision, recall, and mAP50 are notably superior.
The superiority of non-local attention over CBAM, SimAM, and AE (Table 1) can be attributed to the unique spatial distribution of back acupoints. Unlike general object parts in natural images, back acupoints exhibit long-range dependencies and bilateral symmetry. Local attention mechanisms, with their limited receptive fields, cannot effectively model such dependencies. In contrast, non-local attention computes pairwise similarities across all spatial positions, making it inherently suitable for back acupoint detection.
2.2.4. Optimization Loss Function
In the task of detecting keypoints on acupoints, the primary role of the loss function is to measure the discrepancy between the model’s predictions and the ground-truth annotated data. The loss function of YOLOv8-Pose comprises multiple components, including Classification Loss, Keypoint Confidence Loss, Bounding Box Loss, keypoint pose localization loss, and Focal Loss. Among these, the keypoint pose localization loss evaluates the accuracy of keypoint predictions. It minimizes the Euclidean distance between the model’s predicted keypoint locations and the true keypoint locations, thereby enhancing the accuracy of keypoint detection.
In the original YOLOv8-Pose model, the OKS loss function serves as the keypoint pose localization loss. This function evaluates each keypoint by the Euclidean distance between the predicted acupoint and the actual acupoint. The calculation principle of the keypoint pose localization loss is illustrated in Figure 19, which depicts four sets of predicted keypoint regression paths. In this figure, the blue sphere represents the actual keypoint position, while the red sphere denotes the predicted keypoint position. The loss is defined in terms of the Euclidean distance between the predicted keypoint and the ground-truth keypoint. The OKS loss is computed as follows:

$$L_{OKS} = \sum_{i=1}^{n} w_i d_i, \qquad d_i = \lVert p_i - \hat{p}_i \rVert_2$$

where $p_i$ is the predicted keypoint and $\hat{p}_i$ is the real keypoint of the $i$-th set, and $n$ is the number of keypoint sets. $d_i$ is the Euclidean distance between the $i$-th set of predicted keypoints and the ground-truth keypoint (for example, $d_1$ for the first set). $w_i$ is the weight coefficient corresponding to each group of keypoints, and the weight coefficient is determined by the number of keypoint groups. $L_{OKS}$ is the total loss, obtained by summing the products of the Euclidean distance and the weight coefficient over all sets of keypoints.
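The weighted distance loss described above can be sketched as follows. The text states only that the weights are determined by the number of keypoint groups, so the uniform default of 1/n is an assumption, as are the function name and the sample coordinates.

```python
import math


def weighted_distance_loss(predicted, ground_truth, weights=None):
    """Sum of weight * Euclidean distance over paired 2D keypoints."""
    if len(predicted) != len(ground_truth):
        raise ValueError("predicted and ground_truth must have the same length")
    n = len(predicted)
    if weights is None:
        weights = [1.0 / n] * n  # assumed: uniform weights from the group count
    distances = [math.hypot(px - gx, py - gy)
                 for (px, py), (gx, gy) in zip(predicted, ground_truth)]
    return sum(w * d for w, d in zip(weights, distances))


# Two keypoints: one predicted exactly, one off by a 3-4-5 triangle.
loss_demo = weighted_distance_loss([(0.0, 0.0), (3.0, 4.0)],
                                   [(0.0, 0.0), (0.0, 0.0)])
```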
In the keypoint detection task, the original OKS loss function primarily relies on the Euclidean distance between the predicted and ground-truth points, neglecting the geometric relationships among these points. In contrast, the EIOU loss function, as an enhanced version of the IOU loss, not only accounts for the overlap area between the predicted box and the ground-truth bounding box but also integrates the width, height, and center-to-center distance of the bounding boxes. This approach provides a more comprehensive evaluation of the matching degree between the predicted and actual boxes.
Based on the concept of the EIOU loss function [28], this paper constructs a keypoint regression bounding box and introduces width, height, and center point distance losses in addition to the Euclidean distance loss to enhance the regression accuracy of acupoint keypoints. Specifically, the original OKS loss only computes the Euclidean distance between predicted and ground-truth keypoints, which cannot encode the relative positional constraints inherent in acupoint distribution. Inspired by EIOU, we construct a virtual bounding box for each keypoint, as illustrated in Figure 20, and introduce width difference, height difference, and center distance as penalty terms. The width and height terms implicitly enforce the standardized inter-acupoint spacing (e.g., 1.5 cun or 3 cun along the Bladder Meridian), while the center distance term ensures the spatial consistency of symmetric acupoints (e.g., left and right acromion points). The weighting coefficients α and β in Equation (12) were determined via grid search on the validation set: among the combinations of {0.3, 0.5, 0.7} tested, α = β = 0.5 achieved the highest mAP.
The loss function is reconstructed by considering both bounding box regression and the Euclidean distance between the predicted point and the ground-truth keypoint. The EIOU loss of the keypoint bounding box is defined as follows:

$$L_{EIOU} = 1 - IOU + \frac{\rho^2(b, b^{gt})}{c^2} + \frac{\rho^2(w, w^{gt})}{C_w^2} + \frac{\rho^2(h, h^{gt})}{C_h^2}$$

In this study, we define the Intersection Over Union (IOU) as the ratio of the area of overlap between the predicted bounding box and the ground-truth bounding box to the area of their union. Let $\rho^2(b, b^{gt})$ represent the square of the Euclidean distance between the center point $b$ of the predicted bounding box and the center point $b^{gt}$ of the true bounding box. We denote $c$ as the diagonal length of the minimum enclosing region that encompasses both the predicted and actual bounding boxes. The variables $w$ and $h$ refer to the width and height of the predicted bounding box, respectively, while $w^{gt}$ and $h^{gt}$ denote the width and height of the actual bounding box, so that $\rho^2(w, w^{gt}) = (w - w^{gt})^2$ and $\rho^2(h, h^{gt}) = (h - h^{gt})^2$. Additionally, $C_w$ and $C_h$ are defined as the maximum width and the maximum height, respectively, of the predicted bounding box and the actual bounding box. The improved loss function is thus formulated as follows:

$$L_{EOKS} = \alpha L_{EIOU} + \beta L_{OKS} \tag{12}$$

where $L_{EOKS}$ represents the keypoint regression loss, $L_{EIOU}$ denotes the keypoint bounding box EIOU loss, and $L_{OKS}$ signifies the Euclidean distance loss. α and β are the weighting factors used to balance the contribution of the two loss components, both set to 0.5 in our experiments. To evaluate the impact of the improved loss function $L_{EOKS}$ on the model, this paper conducts comparative experiments under consistent configuration conditions.
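A minimal sketch of the box term and the weighted combination is given below, assuming boxes in (x1, y1, x2, y2) form with positive width and height. Following the wording of the text, the width and height normalizers are taken as the larger of the two box widths and heights; the original EIOU formulation uses the width and height of the enclosing box instead. How the virtual box is built around each keypoint is not specified here, so the boxes are simply passed in, and all names and sample values are hypothetical.

```python
def eiou_loss(pred_box, gt_box):
    """EIOU-style loss for two boxes given as (x1, y1, x2, y2)."""
    px1, py1, px2, py2 = pred_box
    gx1, gy1, gx2, gy2 = gt_box
    pw, ph = px2 - px1, py2 - py1
    gw, gh = gx2 - gx1, gy2 - gy1
    if min(pw, ph, gw, gh) <= 0:
        raise ValueError("boxes must have positive width and height")

    # Intersection over union.
    iw = max(0.0, min(px2, gx2) - max(px1, gx1))
    ih = max(0.0, min(py2, gy2) - max(py1, gy1))
    inter = iw * ih
    iou = inter / (pw * ph + gw * gh - inter)

    # Squared diagonal of the minimum enclosing region.
    ew = max(px2, gx2) - min(px1, gx1)
    eh = max(py2, gy2) - min(py1, gy1)
    c2 = ew ** 2 + eh ** 2

    # Squared distance between the two box centers.
    rho2 = ((px1 + px2) / 2 - (gx1 + gx2) / 2) ** 2 \
         + ((py1 + py2) / 2 - (gy1 + gy2) / 2) ** 2

    cw, ch = max(pw, gw), max(ph, gh)  # assumed reading of C_w and C_h
    return (1.0 - iou + rho2 / c2
            + (pw - gw) ** 2 / cw ** 2 + (ph - gh) ** 2 / ch ** 2)


def combined_keypoint_loss(l_eiou, l_oks, alpha=0.5, beta=0.5):
    """Weighted sum of the box term and the Euclidean distance term."""
    return alpha * l_eiou + beta * l_oks


shifted = eiou_loss((0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0))
total = combined_keypoint_loss(shifted, 2.0)
```

For two equally sized boxes the width and height penalties vanish, so the loss is driven by the overlap and the center distance alone.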
Figure 21 shows the comparison results between the original OKS loss function and the improved EOKS loss function. Lower loss values indicate that the model progressively optimizes during training, reducing prediction errors. Comparing the two pose loss curves reveals that both loss functions decrease significantly within the first 50 training iterations, with $L_{OKS}$ showing slightly higher values than the improved function. As the number of iterations increases, the loss reduction rate slows and the curves stabilize. The loss value of the improved $L_{EOKS}$ loss function consistently remains below that of $L_{OKS}$ and exhibits smaller fluctuations. $L_{OKS}$ eventually stabilizes around 2.25, while the optimized $L_{EOKS}$ stabilizes around 1.75, a reduction of 0.5 compared to the original model. The improved loss function showed lower values throughout the training process, suggesting that the optimized loss function enhances the accuracy of keypoint detection and may improve generalization performance.
2.2.5. Acupoint Inference Mechanism Design
To address the decline in model recognition accuracy caused by the rapid increase in acupoint quantity, this study designed an acupoint inference mechanism for the model detection phase. Based on the existing 49 fundamental feature point acupoints, the remaining 92 acupoint locations are inferred and localized according to anatomical relationships between acupoints and traditional Chinese medicine theory. The principle of the acupoint inference module is illustrated in Figure 22. First, the original image is input into the keypoint detection model. After feature extraction, the keypoint information of the 49 basic benchmark acupoints is obtained, including the detected acupoint numbers and acupoint coordinates. Through this layered processing strategy, we ensure the accuracy of acupoint detection while mitigating the performance degradation that occurs when the number of detected acupoints is expanded.
According to the bone proportional measurement method used in traditional Chinese acupuncture point localization, and referencing the principles of human body proportion calculation, the distance between the left and right acromions is defined as 16 cun [30]. By traversing the fundamental acupoint feature points and identifying the left acromion point ($K_1$) and the right acromion point ($K_2$), the horizontal pixel difference between these two points is calculated. Dividing this difference by 16 yields the pixel distance corresponding to one cun. This distance serves as the basis for establishing the correlation between the fundamental acupoint feature points and the acupoints to be inferred. Finally, the basic acupoint feature points and the inferred acupoint feature points are superimposed and annotated on the input image to achieve the localization and display of acupoints on the human back.
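The scale computation described above amounts to one division, sketched here with hypothetical pixel coordinates for the two acromion points.

```python
def pixels_per_cun(left_acromion, right_acromion, span_cun=16.0):
    """Pixel length of one cun from the horizontal acromion-to-acromion distance."""
    dx = abs(right_acromion[0] - left_acromion[0])
    if dx == 0:
        raise ValueError("acromion points must differ horizontally")
    return dx / span_cun


# Hypothetical detections: only the x-coordinates enter the calculation.
scale = pixels_per_cun((100.0, 50.0), (420.0, 55.0))
```

Because the scale is derived per image, the same cun offsets adapt automatically to subjects of different body size and to different camera distances.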
This study established a reasoning model system for locating acupoints on the human back based on the distribution patterns of the associated meridians. According to the regions of the meridians where the acupoints are situated, the acupoints on the human back were classified into four categories: the Extraordinary Points group, the Bladder Meridian of Foot-Taiyang acupoints group, the Small Intestine Meridian of Hand-Taiyang acupoints group, and the Huatuo Jiaji Points group.
Utilizing the pixel distance S corresponding to one ‘cun’ in the images obtained during parameter solving, and referencing the record in the “Name and Location of Meridian Points” (GB/T 12346-2021) [31], we identify the Dazhui acupoint as situated along the posterior median line of the human body, specifically in the depression beneath the spinous process of the seventh cervical vertebra. The Dingchuan point is located beneath the spinous process of the seventh cervical vertebra, at a distance of 0.5 ‘cun’ to the left and right of the posterior median line. Among the characteristic points of basic acupoints, $K_{17}$ is identified as the Dazhui point, with its coordinate information designated as $(X_{17}, Y_{17})$. Consequently, since 0.5 ‘cun’ corresponds to 0.5S pixels, the pixel coordinates for the left Dingchuan point are $(X_{17} - 0.5S, Y_{17})$, while those for the right Dingchuan point are $(X_{17} + 0.5S, Y_{17})$. Based on these principles, the pixel coordinate table for the inferred acupoints is presented in Table 2.
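The inference rule for the Dingchuan points can be sketched as a lateral offset from the detected Dazhui point. The sketch uses the 0.5 cun lateral distance stated in the text and assumes that the left side of the body appears on the left of the back-view image; the function name, dictionary keys, and sample values are hypothetical.

```python
def infer_dingchuan(dazhui, s, offset_cun=0.5):
    """Infer left and right Dingchuan from Dazhui (x, y) and s pixels per cun."""
    x, y = dazhui
    dx = offset_cun * s  # lateral offset converted from cun to pixels
    return {"Dingchuan_L": (x - dx, y), "Dingchuan_R": (x + dx, y)}


# Dazhui detected at (320, 120) with 20 pixels per cun.
inferred = infer_dingchuan((320.0, 120.0), 20.0)
```

Other inferred acupoints can be handled the same way by tabulating, for each one, a reference keypoint and its horizontal and vertical offsets in cun.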
In Table 2, $K_{11}$ and $K_{12}$ serve as the initial points for the Huatuo Jiaji acupoint group. $K_{13}$, $K_{14}$, $K_{15}$, and $K_{16}$ represent the starting points of the Bladder Meridian of Foot-Taiyang. $K_{52}$ and $K_{53}$ correspond to the initial points of the Small Intestine Meridian of Hand-Taiyang. Lastly, $K_{50}$ and $K_{51}$ are identified as the starting points of the Extraordinary Points (Jingwai Qixue) group.