1. Introduction
Positron emission tomography (PET) is a leading technology in nuclear medicine and is widely recognized as one of the most advanced large-scale medical diagnostic imaging modalities [1]. PET imaging plays an irreplaceable role in the diagnosis and pathological research of tumors, cardiovascular diseases, and brain disorders, significantly improving diagnostic accuracy for these diseases [2]. However, PET imaging still faces several challenges, such as relatively low spatial resolution, long image acquisition times, complex operation, and difficulties in image interpretation [3]. Typically, a PET scan takes 10 to 15 min to complete [4]. Since PET imaging relies on the distribution of radioactive tracers within the body [5], patients are required to remain as still as possible during the examination. This requirement poses a significant challenge, especially for patients with low tolerance for prolonged immobility, such as children or other populations prone to movement. During image acquisition, bodily motion (including overall body movement and the physiological movement of internal organs) can cause artifacts, severely affecting image fusion quality and diagnostic accuracy [6]. The head and neck are among the most commonly imaged regions in PET scans. However, compared to torso motion, head movement is more difficult to control and has a more significant impact on image quality [7].
Currently, to address the issue of artifacts in PET head and neck imaging, the primary solution relies on manual screening of the imaging results by doctors to eliminate image segments with significant artifacts that are unsuitable for diagnosis [6]. This process is not only time-consuming (usually taking 5 to 10 min) but also demands a high level of expertise from the doctors. Therefore, real-time detection of head and neck movement in patients and the automatic filtering of PET imaging results to assist doctors have become key research directions for improving clinical diagnostic efficiency and imaging quality.
To address the issue of artifacts caused by head and neck movement, it is necessary to effectively monitor and detect the subject’s movement during the scanning process, thereby enabling the automated screening of PET scan images. One approach is to attach a large, curved marker to the patient’s forehead and use an external optical camera to track the movement. By recognizing encoded symbols on the marker, six-degree-of-freedom motion data can be recorded in real time [8]. To reduce the discomfort caused by such a marker, some studies use feature-rich ink stamps as markers, combined with a stereoscopic optical camera system and feature detection algorithms, to achieve close-range head movement tracking [9]. Additionally, in the field of assistive devices and human–computer interaction, some research has fixed an Inertial Measurement Unit (IMU) to the subject’s head to track six-degree-of-freedom head movement in real time [10]. Nevertheless, methods based on external markers still have numerous limitations. For the subject, fixing the marker may cause discomfort and even induce involuntary movements. Furthermore, the process of affixing the marker is time-consuming and labor-intensive, and once the marker shifts, it becomes difficult to accurately estimate the movement [11].
To address this issue, we propose a marker-free motion detection and recognition system for PET scans. The system monitors and detects motion from natural images captured by a depth camera, and it is engineered across structure, hardware, software, and algorithms. It provides image acquisition, facial landmark analysis, head pose estimation, data post-processing, and motion intensity evaluation. Specifically, the system uses a structured-light depth camera deployed within the PET system to detect the patient’s motion in real time during the scan. The depth and RGB images collected by the camera are registered, and the registration results are output to the host system. The software on the host system decodes, stores, and processes the acquired images in real time, and extracts robust facial landmarks by analyzing the facial RGB images. By combining these landmarks with the depth registration results, the system obtains the three-dimensional coordinates of the landmarks in space. Through coordinate transformation and the establishment of a local coordinate system, the system calculates the translational and rotational amplitudes of the head and generates a comprehensive metric for assessing head and neck motion intensity. Based on this metric, the system can identify periods prone to motion artifacts and output the detection results, assisting doctors in quickly screening PET scan data.
The structure of this study is as follows: Section 1 introduces the principles and applications of PET imaging, as well as the existing challenges. It further elaborates on the motivation and objectives of this study, emphasizing the research approach and content in relation to the issues addressed in this field. Section 2 introduces the principles of PET imaging, reinforcing the project’s background information. This section also provides a detailed literature review, exploring the current research status and existing methods in this area, and compares and selects methods in the context of this research. Section 3 presents the technical roadmap of the proposed system, explaining each technical module, conducting feasibility analyses, and providing corresponding mathematical derivations or performance demonstrations. It integrates RGB-D structured-light camera registration and image fusion techniques, facial landmark detection technology, and head pose estimation techniques. Moreover, it introduces the motion intensity evaluation metric in the context of PET scanning, building an integrated system tailored for the target environment. Section 4 explains the data source of the validation experiments, introduces the developed software framework, and presents a detailed analysis of experimental results from both phantom-based and volunteer-based experiments. This section also explores the accuracy and efficiency of the system in line with clinical requirements. Section 5 discusses the main advantages of the system, areas for improvement, and prospects for future research. Finally, Section 6 analyzes the findings of this study, summarizes the innovative points and major contributions, and discusses the potential for clinical application. In conclusion, this study aims to validate the feasibility and clinical value of the proposed motion monitoring-assisted image selection system through literature review, system development, and experimental investigations. The subsequent sections will provide detailed explanations.
  2. Literature Review
This section is divided into subsections to provide a clearer analysis and discussion of the relevant literature. Section 2.1 introduces the principle of PET scan imaging, laying the foundation for the system’s research background; Section 2.2 focuses on image information acquisition, comparing different camera technologies and discussing the requirements and selection of hardware based on the target context; Section 2.3 discusses head feature point detection and motion tracking techniques, analyzing the strengths and weaknesses of existing methods; Section 2.4 introduces the main methods for spatial motion monitoring of rigid bodies.
  2.1. Principle of PET Scan Imaging
Positron emission tomography (PET) is a highly specific molecular imaging technique that provides functional information about organs and their lesions, primarily used in molecular-level medical imaging [1]. PET most commonly uses 18F-FDG, a glucose analog labeled with the short-lived positron-emitting radionuclide 18F, as the main tracer, which allows for high-precision, quantitative detection of abnormal increases in metabolic processes, producing clear images [12]. Therefore, PET provides crucial early insights into disease progression, particularly in the early detection of tumors. During PET imaging, a molecular probe labeled with a positron-emitting radioactive isotope is injected into the body. When the unstable nuclei decay and release positrons, these positrons encounter electrons in the tissue and annihilate, generating two 511 keV gamma photons that travel in nearly opposite directions [13]. The PET scanner detects these annihilation photons using a ring of photon detectors and reconstructs a three-dimensional image of the distribution of the molecular probe within the body based on the paths of the photon pairs.
  2.2. Image Information Acquisition
To achieve high-quality and stable image acquisition, selecting the appropriate camera is crucial. Common camera types include monocular cameras, binocular cameras, Time-of-Flight (ToF) cameras, and RGB-D structured-light cameras [14], each with its specific application scenarios, advantages, and disadvantages.
Monocular cameras are the most commonly used type due to their low cost and ease of operation. However, due to scale uncertainty, a single-frame image cannot directly recover the three-dimensional information of objects. To improve accuracy, multiple monocular cameras are typically required to capture images from different viewpoints, and multi-view fusion is used to estimate the spatial pose of the object [15]. Additionally, in recent years, deep learning methods have been widely applied to monocular camera pose estimation [16], where neural networks are trained to predict the three-dimensional pose. However, the robustness of this method in complex environments still needs improvement, and the accuracy remains relatively low.
Binocular stereo cameras obtain depth information through the disparity between two cameras, enabling relatively accurate object pose estimation [17]. While binocular cameras provide high-precision pose estimation in regular environments, their accuracy in image matching and pose estimation may degrade in low-texture, uneven-lighting, or occluded environments [18].
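As a brief worked illustration of why stereo accuracy depends on matching quality, the standard relation for a rectified stereo pair is sketched below. The symbols (focal length $f$ in pixels, baseline $B$, disparity $d$) and the numbers are assumptions chosen for illustration and do not come from the cited works.

```latex
% Depth from disparity for a rectified stereo pair (textbook relation)
Z = \frac{f\,B}{d}
\qquad\Longrightarrow\qquad
\left|\frac{\partial Z}{\partial d}\right| = \frac{Z^{2}}{f\,B}

% Illustrative numbers: f = 600\ \text{px},\; B = 0.06\ \text{m},\; d = 30\ \text{px}
Z = \frac{600 \times 0.06}{30} = 1.2\ \text{m},
\qquad
\left|\frac{\partial Z}{\partial d}\right| = \frac{1.2^{2}}{600 \times 0.06} = 0.04\ \text{m/px}
```

Under these assumed values, a one-pixel matching error shifts the estimated depth by roughly 4 cm, which is why low-texture or occluded regions that disturb matching degrade the pose estimate.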
Time-of-Flight (ToF) cameras calculate object depth information by emitting pulsed light and measuring the reflection time. They can maintain high accuracy over long distances, making them suitable for pose estimation in dynamic scenes [19]. However, the high cost of ToF cameras may increase the overall system cost.
RGB-D structured-light cameras acquire object depth information by actively projecting structured light and capturing images with a camera. These cameras achieve high accuracy over short distances and are particularly suited for pose estimation in confined spaces [20]. However, the accuracy of depth information deteriorates over long distances or under strong lighting conditions. To address these limitations, researchers often integrate deep learning techniques, combining image features and depth information to enhance the stability and robustness of pose estimation [21].
Despite the low cost and ease of operation of monocular cameras, they cannot accurately recover three-dimensional information, typically requiring multiple cameras to capture images from different viewpoints in order to improve precision. In contrast, binocular cameras, ToF cameras, and RGB-D structured-light cameras can achieve higher precision in three-dimensional pose estimation by directly or indirectly acquiring depth information. Given the specific conditions of a PET scanning environment, such as complex indoor settings, uneven lighting, and the proximity between the camera and the subject, RGB-D structured-light cameras are more suitable for this system after considering factors such as detection accuracy, hardware deployment complexity, and cost-effectiveness.
  2.3. Detection and Recognition of Head Feature Points
During the PET scanning process, to detect the motion of the patient’s head in real time, it is necessary to perform feature point recognition and motion tracking over a certain period on the frame-by-frame images transmitted to the terminal from the communication equipment. Currently, the algorithms addressing this issue can be categorized into the following types based on their underlying principles and hardware devices: traditional vision-based methods, tracking-based methods, multimodal information fusion-based methods, and deep learning-based methods.
Traditional vision-based methods mainly rely on manually designed facial features and techniques from image processing and geometry, such as Haar cascade classifiers [22], feature point matching algorithms (e.g., SIFT [23], SURF [24]), and optical flow methods [25]. The advantages of these methods lie in their fast processing speed and low computational requirements. However, their performance tends to degrade in complex environments, under significant pose changes, or in the presence of occlusions [26].
Tracking-based methods include both traditional and deep learning-based target tracking algorithms, with representative algorithms such as the Kalman filter [27], particle filter [28], and Siamese network [29]. These tracking-based algorithms are suitable for real-time scenarios but are not robust enough in the presence of complex occlusions or rapid movements, and they generally require substantial computational overhead.
Multimodal information fusion-based methods combine information from multiple sensors, such as RGB, depth, and thermal infrared, for feature point recognition. The advantage of these methods lies in the complementary information provided by different sensors, which enhances robustness in complex environments [30]. However, the use of various types of sensors requires complex hardware support, resulting in higher system costs and significant challenges in sensor calibration [31].
The mainstream deep learning algorithms for motion tracking utilize convolutional neural networks (CNNs) [32] and recurrent neural networks (RNNs), such as LSTM [33], for feature learning and motion tracking. The development of such methods includes CNN-based face recognition, keypoint detection, and motion tracking using RNN/LSTM. Mature open-source toolkits are also available, such as the DLIB library developed in C++, which can achieve stable face detection, feature localization, and landmark tracking [34]. Prados et al. [35] proposed the SPIGA network, which combines a CNN with a cascade of graph attention network regressors and performs well in identifying blurry contours and edge points of faces. These methods typically offer more accurate detection of human body keypoints and motion tracking, but larger models may impose certain hardware requirements when deployed.
The PET scanning environment calls for a robust, real-time feature point recognition algorithm that adapts to complex environments, can be deployed on small-to-medium hardware systems (such as integration into PET scanning devices), and is economically feasible. A lightweight learning-based feature point recognition algorithm, as provided by the DLIB library, is therefore selected.
  2.4. Space Motion Monitoring of Rigid-like Objects
Monitoring the spatial motion of rigid bodies requires the use of spatial feature point information to estimate the translational and rotational movements of the object in different directions. The main methods currently employed include Euler angle-based methods [36], quaternion-based methods [37], Denavit–Hartenberg (D-H) matrix methods [38], and rotation matrix-based methods [39].
The Euler angle-based method estimates rotation by describing the rotation angles of an object around the X, Y, and Z axes in three-dimensional space, and uses changes in these angles to estimate the rotational motion of the object. This method has the simplest computational principle. However, it suffers from the gimbal lock problem when two rotation axes approach parallel alignment, and its computational load is relatively high, making it difficult to convert angles into distance metrics [36]. The quaternion-based method describes spatial rotation using a quaternion consisting of a scalar part and a three-component vector part, which avoids the gimbal lock problem found in Euler angles. However, the selection of the rotation axis in this method is challenging, and the mathematical transformations involved are complex [37]. The Denavit–Hartenberg (D-H) matrix method represents the relative position and orientation of the object with respect to the reference coordinate system using D-H parameterization. It is effective for estimating the spatial pose of pure rigid bodies, but it is less robust when feature points are lost or experience jitter [38]. The spatial rotation matrix-based method uses a rotation matrix, combined with a translation component, to represent both the translation and rotation of the object, with the matrix elements indicating the spatial translation–rotation relationship. Its advantages include high accuracy in pose estimation and computational stability. However, it suffers from a relatively high computational load during complex movements [39].
The motion of the head and neck during PET scanning is primarily rotational, involving mainly pitch and yaw movements, and the scanning process takes a relatively long time, so the algorithm for estimating spatial pose must exhibit high stability and accuracy. After weighing computational resources, accuracy, and stability, the spatial rotation matrix-based method is selected as the most suitable.
  3. Research Methods
In this study, we developed a software system for human head motion monitoring during PET scans, with its architecture illustrated in Figure 1. The system is primarily composed of three components: an image processing algorithm, a motion detection algorithm, and visualization software.
The image processing algorithm serves as the foundation of the system. Its core task is to register depth images and RGB images captured by the RGB-D camera, enabling precise recognition of facial feature points and calculation of local coordinate systems. This component provides accurate image data and positioning information, which are essential for subsequent motion monitoring. The image processing algorithm is depicted in Figure 1a and corresponds to Section 3.1, Section 3.2 and Section 3.3, which detail the stages involved in processing the image data.
The motion detection algorithm acts as the central part of the system, directly determining the accuracy and robustness of motion monitoring. This component comprises three key stages: spatial point monitoring, spatial pose estimation, and motion intensity evaluation. By accurately tracking the trajectory and intensity of head motion, the system effectively assesses the impact of motion artifacts on PET scan results. The motion detection algorithm is illustrated in Figure 1b and corresponds to Section 3.4 and Section 3.5, where the process of motion detection and evaluation is thoroughly explained.
The visualization software, implemented using a Qt-based interactive interface, aims to provide an intuitive and user-friendly operating platform for medical professionals and non-engineering personnel. It facilitates real-time evaluation and analysis of motion artifacts. The software supports two working modes: real-time image acquisition mode and image loading mode, catering to different application scenarios. The visualization software is shown in Figure 1c and corresponds to Section 4.1.2, which provides a detailed description of the user interface and its functionality.
In terms of hardware configuration, the system utilizes an Orbbec Astra Pro Plus RGB-D monocular structured-light camera for image acquisition and operates on a Windows x64 platform. The processing system is equipped with an Intel i7-14700HX CPU and an NVIDIA RTX 4060 GPU to ensure efficient data processing. The specific layout of the test environment is illustrated in Figure 1d.
  3.1. Camera Registration and Image Acquisition
To acquire head motion data from PET subjects, the first step is to obtain depth and RGB images of the subject during scanning. This study employs the Astra Pro Plus camera module from Orbbec, a high-precision, low-power 3D camera based on structured-light technology. The effective working distance of the camera ranges from 0.6 m to 8 m. The module consists of an infrared camera, an infrared projector, and a depth computation processor. The infrared projector projects a structured-light pattern (speckle pattern) onto the target scene, while the infrared camera captures the reflected infrared structured-light image. The depth computation processor processes the captured infrared image using depth calculation algorithms to generate depth images of the target scene.
For motion estimation, the camera is described by a mathematical model that maps 3D coordinates to a 2D pixel plane. This study adopts the pinhole camera model, which can be expressed in matrix form as follows:

$$
Z_c \begin{bmatrix} u \\ v \\ 1 \end{bmatrix}
= \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} X_c \\ Y_c \\ Z_c \end{bmatrix}
= K \begin{bmatrix} X_c \\ Y_c \\ Z_c \end{bmatrix} \tag{1}
$$

where $u$ and $v$ represent the coordinates of the target point in the pixel coordinate system, while $(X_c, Y_c, Z_c)$ denotes the coordinates of the target point in the camera coordinate system. $f_x$ and $f_y$ are the focal lengths of the camera, and $c_x$ and $c_y$ are the coordinates of the principal point. In this equation, the matrix $K$, formed by these parameters, is referred to as the camera intrinsic matrix and is considered a constant property of the camera.
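The mapping between camera coordinates and pixels can be illustrated with a minimal sketch. The function names and the intrinsic values below are assumptions made for illustration; they are not the calibrated parameters of the camera used in this study.

```python
def project(point_cam, fx, fy, cx, cy):
    """Project a 3D point (X, Y, Z) in the camera frame to a pixel (u, v)."""
    X, Y, Z = point_cam
    if Z <= 0:
        raise ValueError("point must lie in front of the camera (Z > 0)")
    return fx * X / Z + cx, fy * Y / Z + cy


def back_project(u, v, depth, fx, fy, cx, cy):
    """Recover the camera-frame 3D point from a pixel and its depth Z."""
    return (u - cx) * depth / fx, (v - cy) * depth / fy, depth


# Illustrative intrinsics only (hypothetical values, in pixels).
FX, FY, CX, CY = 580.0, 580.0, 320.0, 240.0

u, v = project((0.10, -0.05, 0.80), FX, FY, CX, CY)
point = back_project(u, v, 0.80, FX, FY, CX, CY)
```

Back-projection is the direction needed later in the pipeline: a landmark pixel plus its registered depth value gives the landmark's 3D position in the camera frame.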
Based on the pinhole camera model, Zhang’s calibration method [40] is employed for camera calibration. This method requires only a flat checkerboard calibration board to complete the entire calibration process. In this study, we calibrated the structured-light depth camera using the MATLAB R2023b Camera Calibrator, and the calibration interface is shown in Figure 2. Figure 2a shows the interface of the calibration software, where the green circles represent detected points, the orange circles represent the pattern origin, and the red dots represent the reprojected points. Figure 2b,c are automatically generated by the calibration software. The former presents the error analysis based on 20 calibration images, while the latter shows the variation in different calibration images relative to the camera’s initial pose.
Generally, if the reprojection error of the camera calibration is less than 0.5 pixels, the calibration result is considered accurate. In this study, the maximum reprojection errors for both RGB and depth images did not exceed 0.25 pixels, with an average reprojection error of 0.09 pixels, which meets the accuracy criteria, indicating high calibration precision.
  3.2. Head Feature Point Recognition
This project employs the DLIB facial landmark detection model [41] for identifying and tracking human head landmarks, consisting of two main components: face detection and face alignment. The face detection algorithm leverages Histogram of Oriented Gradients (HOG) for feature extraction combined with a Support Vector Machine (SVM) for classification. The face alignment algorithm is based on the Ensemble of Regression Trees (ERT) method, which optimizes the process through gradient boosting to iteratively fit the facial shape.
The face detection and alignment algorithm used in this study is based on an open-source pre-trained model that has been trained and validated on a large-scale facial database. This model is capable of meeting the requirements for facial feature point detection in most scenarios. Although the target application scenario of this study is the PET scanning environment, the RGB images collected and processed are still facial images of human subjects. Therefore, the characteristics of the data remain largely unchanged in this scenario. Extensive experiments on various phantoms and different volunteers demonstrated that the pre-trained model achieves satisfactory performance in the target environment, including high accuracy and stability of feature point recognition. Additionally, the model is highly efficient, with a size of only 85 MB and an average processing time of 0.017 ms per frame. This makes it particularly suitable for scenarios with limited hardware resources and stringent real-time requirements.
The main implementation steps of the facial detection and landmark recognition algorithm are as follows:
- Use the HOG-based cascaded classifier to extract all feature vectors, including HOG features, from patient images; 
- Input the extracted feature vectors into the SVM model inherited from the CPP-DLIB library [41] to classify and extract features around the facial region, thereby identifying and annotating the location of the face in the image; 
- Pass the annotated facial region as input to the 68-point alignment model to achieve real-time detection of 68 facial landmarks. The alignment standard, shown in Figure 3, includes key facial regions such as the contours, eyes, eyebrows, nasal triangle, and mouth. Among them, the 8 red dots and 4 star points represent the 12 feature points used for pose calculation, with the 4 star points specifically serving as the initial selection points. 
- In the detected video stream, the landmark information of each frame is recorded in real time. A filtering algorithm is applied to extract 12 robust landmarks, typically located in regions such as the nasal triangle and eye corners. These selected landmarks are used as input data for the spatial pose estimation algorithm to achieve precise motion state evaluation. 
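The steps above can be sketched with dlib's Python bindings. This is a minimal illustration rather than the study's implementation: the helper names and the `predictor_path` argument are hypothetical, and the landmark numbers are assumed to be 1-based as in the 68-point annotation figure, so they are shifted by one for 0-based indexing.

```python
# Landmark numbers used for pose calculation (assumed 1-based).
POSE_LANDMARKS_1BASED = (22, 23, 30, 31, 37, 40, 43, 46, 49, 52, 55, 58)


def select_pose_landmarks(landmarks_68):
    """Pick the 12 pose landmarks from a list of 68 (x, y) tuples."""
    if len(landmarks_68) != 68:
        raise ValueError("expected 68 landmarks")
    return [landmarks_68[n - 1] for n in POSE_LANDMARKS_1BASED]


def detect_landmarks(image, predictor_path):
    """Run dlib's HOG+SVM face detector and 68-point shape predictor.

    `image` is an 8-bit grayscale or RGB numpy array. `predictor_path` is a
    hypothetical path to a pre-trained 68-point model file.
    Returns 68 (x, y) tuples for the first detected face, or None.
    """
    import dlib  # imported lazily so the helper above works without dlib

    # For a video stream, build these two objects once and reuse them.
    detector = dlib.get_frontal_face_detector()
    predictor = dlib.shape_predictor(predictor_path)

    faces = detector(image, 1)  # 1 = upsample once to find smaller faces
    if len(faces) == 0:
        return None
    shape = predictor(image, faces[0])
    return [(shape.part(i).x, shape.part(i).y) for i in range(shape.num_parts)]


# Usage with synthetic landmarks (no camera or model file needed).
fake = [(i, 2 * i) for i in range(68)]
selected = select_pose_landmarks(fake)
```

Keeping the index selection separate from detection makes it straightforward to swap in a different subset when some landmarks become unreliable.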
The feasibility of the proposed algorithm was validated on both the constructed phantom experimental platform and the collected facial dataset. Experimental results demonstrated that the algorithm effectively tracks and monitors facial landmarks, outputting the position of each landmark for every frame in the video stream. Furthermore, the results were synthesized into motion detection videos using video processing techniques.
  3.3. Fusion Registration of RGB Images and Depth Images
After calibrating the structured-light RGB and depth cameras, discrepancies in their intrinsic and extrinsic parameters can lead to significant errors when directly overlaying the captured RGB and depth images. Therefore, image registration between the two views is essential to ensure accurate alignment of images captured at the same moment. Based on the known depth map, RGB image, and camera intrinsic parameters, depth and RGB images can be registered and fused.
Using the intrinsic matrix of the depth camera, the 3D coordinates of a point in the depth camera coordinate system can be obtained from the depth map. Subsequently, the point is transformed from the depth camera coordinate system to the RGB camera coordinate system using the rotation matrix $R$ and translation vector $t$. Finally, the 3D coordinates are converted into pixel coordinates in the RGB image through the intrinsic matrix of the RGB camera. The critical step in this process is solving for $R$ and $t$. By obtaining the extrinsic matrices of a checkerboard pattern in both the depth camera and RGB camera coordinate systems, the transformation linking the two camera coordinate systems can be computed. The computation of $R$ and $t$ is as follows:

$$
R = R_{rgb} R_{d}^{-1}, \qquad t = t_{rgb} - R\, t_{d} \tag{2}
$$

where $R_{rgb}$ and $t_{rgb}$ represent the rotation matrix and translation vector transforming point coordinates from the world coordinate system to the RGB camera coordinate system, respectively. Similarly, $R_{d}$ and $t_{d}$ represent the rotation matrix and translation vector transforming point coordinates from the world coordinate system to the depth camera coordinate system. Equation (2) follows from eliminating the world coordinates $P_w$ between $P_{rgb} = R_{rgb} P_w + t_{rgb}$ and $P_{d} = R_{d} P_w + t_{d}$. The registration effect is shown in Figure 4.
Therefore, for a given scene, the extrinsic parameter matrices of the checkerboard in the depth camera and RGB camera coordinate systems are sufficient to compute the transformation between the two cameras. Although the extrinsic matrices obtained from different scenes may vary, a front-facing checkerboard calibration image typically yields satisfactory results.
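The registration chain (depth pixel to depth-camera point, to RGB-camera point, to RGB pixel) can be sketched in a few lines. The function names, the `(fx, fy, cx, cy)` tuple layout, and all numeric values are assumptions for illustration; the sketch also relies on the inverse of a rotation matrix being its transpose.

```python
def transpose(A):
    return [list(row) for row in zip(*A)]


def mat_mul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]


def mat_vec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def depth_to_rgb_extrinsics(R_rgb, t_rgb, R_d, t_d):
    """Combine two world-to-camera extrinsics into a depth-to-RGB transform."""
    R = mat_mul(R_rgb, transpose(R_d))  # R_d^{-1} equals R_d^T for rotations
    t = [a - b for a, b in zip(t_rgb, mat_vec(R, t_d))]
    return R, t


def map_depth_pixel_to_rgb(u, v, z, intr_d, intr_rgb, R, t):
    """Map a depth pixel (u, v) with depth z to RGB pixel coordinates."""
    fx, fy, cx, cy = intr_d
    p_d = [(u - cx) * z / fx, (v - cy) * z / fy, z]       # depth-camera frame
    p_rgb = [a + b for a, b in zip(mat_vec(R, p_d), t)]   # RGB-camera frame
    fx, fy, cx, cy = intr_rgb
    return fx * p_rgb[0] / p_rgb[2] + cx, fy * p_rgb[1] / p_rgb[2] + cy


# Illustrative case: parallel cameras offset by 25 mm along x.
I3 = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
R, t = depth_to_rgb_extrinsics(I3, [0.025, 0, 0], I3, [0, 0, 0])
u_rgb, v_rgb = map_depth_pixel_to_rgb(
    320, 240, 1.0, (500, 500, 320, 240), (500, 500, 320, 240), R, t
)
```

In this assumed setup, the principal-point pixel of the depth image at 1 m lands 12.5 pixels to the right in the RGB image, which is the kind of offset that direct overlay without registration would leave uncorrected.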
  3.4. Calculation of Head Space Pose
After obtaining the precise 3D spatial coordinates of the feature points, these data are utilized to compute and track the head motion of the subject. The method employed in this study monitors the rotational displacement of the head using a rotation matrix about a fixed coordinate system, while independently tracking its translational displacement. The core of the pose detection algorithm lies in selecting an appropriate rigid-body coordinate system on the subject’s head to derive a set of suitable orthogonal vectors.
Although head feature point detection algorithms can stably identify 68 facial landmarks, significant variations in recognition accuracy across different facial regions occur under extreme conditions, such as when the yaw angle exceeds 60°. Therefore, to establish the head coordinate system, it is essential to select feature points with high recognition accuracy and robust performance. Specifically, priority is given to points that are distant from facial contour edges, exhibit significant depth variations, and demonstrate strong geometric invariance. The feature points selected in this study are shown in Figure 3b as the eight red dots and the four yellow stars, corresponding to the numbers 22, 23, 30, 31, 37, 40, 43, 46, 49, 52, 55, and 58.
Among these 12 points, the left outer canthus, right outer canthus, center of the upper lip, and the tip of the nose (marked as yellow stars in Figure 3b) are used to describe the derivation of the formulas in this subsection. For most individuals, the plane defined by the two outer canthi and the center of the upper lip is generally parallel to the face. Therefore, the vector perpendicular to this plane can be used to estimate the position of the head’s center of mass by integrating the spatial position of the nose tip and the head dimensions. Consequently, an orthogonal coordinate system for rigid-body motion can be constructed using these three points. The process of establishing the coordinate system is described by Equations (3) and (4).
Equations (3) and (4) describe the fundamental constraints of rigid-body rotation: all vectors are three-dimensional, and the three basis vectors of the head coordinate system are mutually orthogonal unit vectors forming a right-handed coordinate system. Since the spatial coordinates of feature points on the rigid body are expressed relative to the camera coordinate system, the matrix formed by these basis vectors effectively corresponds to the spatial rotation matrix of the rigid body about the fixed axes of the camera coordinate system.
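As an illustration of how such a frame can be built from three landmarks, the sketch below constructs one possible right-handed orthonormal basis. The axis assignment (first axis along the line joining the canthi, third axis normal to the facial plane) and all names are assumptions and may differ from the convention of Equations (3) and (4).

```python
import math


def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def normalize(a):
    n = math.sqrt(dot(a, a))
    if n == 0:
        raise ValueError("degenerate (zero-length) vector")
    return tuple(x / n for x in a)


def head_frame(canthus_a, canthus_b, upper_lip):
    """Return unit vectors (i, j, k) of a right-handed head frame.

    The sign of k depends on the order of the two canthi, so the same
    ordering must be used for every frame of the video stream.
    """
    i = normalize(sub(canthus_b, canthus_a))            # along the eye line
    k = normalize(cross(i, sub(upper_lip, canthus_a)))  # normal to the face plane
    j = cross(k, i)                                     # completes the triad
    return i, j, k


# Illustrative landmark positions in arbitrary units.
i, j, k = head_frame((-1.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, -1.0, 0.0))
```

Stacking `i`, `j`, and `k` as the columns of a 3x3 matrix gives the head orientation relative to the camera frame.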
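Moreover, consider the spatial rotation matrix $R$, which represents a rigid body rotating by $\alpha$ radians around the $x$-axis of the fixed coordinate system, then by $\beta$ radians around the $y$-axis, and finally by $\gamma$ radians around the $z$-axis. The definitions of the directional rotation matrices and the physical interpretation of $R$ yield Equations (5) through (7).

$$
R_x(\alpha) = \begin{bmatrix} 1 & 0 & 0 \\ 0 & \cos\alpha & -\sin\alpha \\ 0 & \sin\alpha & \cos\alpha \end{bmatrix}, \quad
R_y(\beta) = \begin{bmatrix} \cos\beta & 0 & \sin\beta \\ 0 & 1 & 0 \\ -\sin\beta & 0 & \cos\beta \end{bmatrix}, \quad
R_z(\gamma) = \begin{bmatrix} \cos\gamma & -\sin\gamma & 0 \\ \sin\gamma & \cos\gamma & 0 \\ 0 & 0 & 1 \end{bmatrix} \tag{5}
$$

$$
R = R_z(\gamma) R_y(\beta) R_x(\alpha) =
\begin{bmatrix}
\cos\gamma\cos\beta & \cos\gamma\sin\beta\sin\alpha - \sin\gamma\cos\alpha & \cos\gamma\sin\beta\cos\alpha + \sin\gamma\sin\alpha \\
\sin\gamma\cos\beta & \sin\gamma\sin\beta\sin\alpha + \cos\gamma\cos\alpha & \sin\gamma\sin\beta\cos\alpha - \cos\gamma\sin\alpha \\
-\sin\beta & \cos\beta\sin\alpha & \cos\beta\cos\alpha
\end{bmatrix} \tag{6}
$$

$$
R = \begin{bmatrix} r_{11} & r_{12} & r_{13} \\ r_{21} & r_{22} & r_{23} \\ r_{31} & r_{32} & r_{33} \end{bmatrix} \tag{7}
$$

where the columns of $R$ in Equation (7) are the basis vectors of the head coordinate system expressed in the camera coordinate system.
By comparing Formula (6) with Formula (7), it can be concluded that, for $|\beta| < \pi/2$,

$$
\alpha = \operatorname{atan2}(r_{32}, r_{33}), \qquad \beta = -\arcsin(r_{31}), \qquad \gamma = \operatorname{atan2}(r_{21}, r_{11})
$$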
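The comparison of the composed rotation with the measured matrix can be checked numerically. The sketch below assumes the fixed-axis order described in the text (x, then y, then z), so that $R = R_z(\gamma) R_y(\beta) R_x(\alpha)$; the angle names and function names are illustrative.

```python
import math


def rotation_xyz_fixed(alpha, beta, gamma):
    """Rotation about fixed x, then y, then z axes: R = Rz(gamma) Ry(beta) Rx(alpha)."""
    ca, sa = math.cos(alpha), math.sin(alpha)
    cb, sb = math.cos(beta), math.sin(beta)
    cg, sg = math.cos(gamma), math.sin(gamma)
    return [
        [cg * cb, cg * sb * sa - sg * ca, cg * sb * ca + sg * sa],
        [sg * cb, sg * sb * sa + cg * ca, sg * sb * ca - cg * sa],
        [-sb, cb * sa, cb * ca],
    ]


def euler_from_rotation(R):
    """Recover (alpha, beta, gamma) from R, assuming |beta| < pi/2."""
    alpha = math.atan2(R[2][1], R[2][2])
    beta = math.atan2(-R[2][0], math.hypot(R[2][1], R[2][2]))
    gamma = math.atan2(R[1][0], R[0][0])
    return alpha, beta, gamma


# Round trip with illustrative angles (radians).
angles_in = (0.10, -0.20, 0.30)
angles_out = euler_from_rotation(rotation_xyz_fixed(*angles_in))
```

Using `atan2` with `hypot` for the middle angle is a common numerically safer alternative to a bare arcsine, since it tolerates matrix entries that drift slightly outside [-1, 1] because of noise.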
By continuously monitoring $\theta_x$, $\theta_y$, and $\theta_z$ (the rotation angles about the x-, y-, and z-axes), the rotational amplitudes of the human head in three directions can be obtained. The translation of the head is determined by estimating the spatial position of the head’s centroid. The fundamental principle involves subtracting a normal vector perpendicular to the facial plane from the spatial position of the nasal tip. The magnitude of this normal vector represents the average human head radius, which is set to 80 mm based on national standard data from the National Bureau of Statistics of China. The mathematical representation is given in Equation (8), which relates the average head radius to the estimated centroid coordinate of the rigid body, represented as a three-dimensional vector.
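The two monitoring steps described here (reading the three rotation angles from the rotation matrix, and stepping back from the nose tip to the centroid) can be sketched as follows. This is a sketch under stated assumptions: the rotation order is x, then y, then z about fixed axes, so the matrix is taken as Rz·Ry·Rx; the face normal is assumed to be a unit vector pointing from the face toward the camera; the function names are hypothetical.

```python
import math

def rotation_xyz_fixed(tx, ty, tz):
    """Rotation by tx about x, then ty about y, then tz about z (fixed axes): R = Rz @ Ry @ Rx."""
    cx, sx = math.cos(tx), math.sin(tx)
    cy, sy = math.cos(ty), math.sin(ty)
    cz, sz = math.cos(tz), math.sin(tz)
    return [
        [cz * cy, cz * sy * sx - sz * cx, cz * sy * cx + sz * sx],
        [sz * cy, sz * sy * sx + cz * cx, sz * sy * cx - cz * sx],
        [-sy,     cy * sx,                cy * cx],
    ]

def euler_xyz_fixed(R):
    """Recover (theta_x, theta_y, theta_z) in radians from R = Rz @ Ry @ Rx.

    Valid away from gimbal lock, i.e. for |theta_y| < pi/2.
    """
    theta_y = -math.asin(max(-1.0, min(1.0, R[2][0])))
    theta_x = math.atan2(R[2][1], R[2][2])
    theta_z = math.atan2(R[1][0], R[0][0])
    return theta_x, theta_y, theta_z

def head_centroid(nose_tip, face_normal, radius_mm=80.0):
    """Subtract one head radius along the unit face normal from the nose-tip position."""
    return [nose_tip[i] - radius_mm * face_normal[i] for i in range(3)]

# Round trip on hypothetical angles (radians).
angles = euler_xyz_fixed(rotation_xyz_fixed(0.1, -0.2, 0.3))

# Camera at the origin looking along +z; the face normal points back toward the camera.
centroid = head_centroid([0.0, 30.0, 570.0], [0.0, 0.0, -1.0])
```

With the normal pointing toward the camera, the estimated centroid lies one head radius farther from the camera than the nose tip, which matches the geometric description above.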
		
By utilizing the three-dimensional spatial pose data of facial feature points, head rotation and translation can be monitored. However, the complexity of motion in real clinical environments may lead to certain errors in the algorithm. In particular, two prominent issues are feature point occlusion when the yaw angle is large and the presence of outliers during motion synthesis. To address these issues, the following solutions are proposed in this study:
- Feature Point Occlusion: During head rotation, when the yaw angle becomes large, some facial feature points may move out of the depth camera’s view. To solve this problem, a feature point compensation method is proposed. When a feature point (such as the outer corner of the eye) experiences significant fluctuations in its spatial pose across consecutive frames, it is determined that the feature point is no longer suitable for input into the spatial pose estimation algorithm. Other stable feature points are then used to supplement the calculation, ensuring the continuity and accuracy of the spatial pose estimation. 
- Outliers in Motion Synthesis: When synthesizing the motion intensity curve from consecutive frames, outliers may occur, causing the curve to exhibit abnormal fluctuations. To address this issue, a low-pass filtering method is applied to smooth the calculated motion intensity curve, eliminating noise interference. This results in a stable and continuous motion intensity curve, which is then used for motion pattern classification. 
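Both remedies can be sketched in a few lines. The source does not specify the fluctuation criterion or the filter type, so the per-frame jump threshold (`max_jump_mm`), the first-order exponential filter, and its coefficient `alpha` are all assumptions chosen for illustration.

```python
def is_stable(track, max_jump_mm=15.0):
    """Return True if no coordinate of a landmark jumps more than max_jump_mm between consecutive frames.

    track: list of [x, y, z] positions in millimetres, one per frame.
    The 15 mm default is a hypothetical threshold, not a value from the study.
    """
    return all(
        abs(b[i] - a[i]) <= max_jump_mm
        for a, b in zip(track, track[1:])
        for i in range(3)
    )

def low_pass(signal, alpha=0.2):
    """First-order exponential low-pass filter: y[n] = alpha * x[n] + (1 - alpha) * y[n-1]."""
    smoothed = []
    for x in signal:
        prev = smoothed[-1] if smoothed else x   # initialise with the first sample
        smoothed.append(alpha * x + (1 - alpha) * prev)
    return smoothed

# A landmark that drifts slowly is kept; one that jumps 40 mm in a frame is rejected.
steady = is_stable([[0.0, 0.0, 500.0], [1.0, 0.0, 501.0], [2.0, 1.0, 500.0]])
jumpy = is_stable([[0.0, 0.0, 500.0], [40.0, 0.0, 500.0]])

# A single-frame outlier in a motion intensity curve is attenuated by the filter.
raw = [1.0, 1.0, 9.0, 1.0, 1.0]
smooth = low_pass(raw)
```

A smaller `alpha` smooths more strongly but delays the response to genuine motion, so in practice it would need tuning against the camera frame rate.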
  3.5. Definition and Calculation of Head Motion Intensity
The primary objective of this study is to determine whether the subject’s head motion during PET scanning exceeds a specific intensity threshold that may compromise imaging quality, thereby enabling the selection of valid video segments. Due to the complexity of head motion, it is challenging to characterize motion intensity using either translation or rotation alone. To quantify the overall head motion, this study introduces a dimensionless motion intensity metric, Amplitude, validated through theoretical analysis and simulation testing. Amplitude is defined as follows:
		
where Rot represents the total rigid-body rotational displacement (unit: °), Trans represents the total rigid-body translational displacement (unit: mm), and $\lambda$ is a weighting factor ranging from 0 to 1.
Since the comprehensive head movement of the subject cannot be effectively evaluated by considering either translational or rotational motion alone, it is necessary to carefully define the translational component (Trans) and rotational component (Rot) to accurately characterize the overall motion. For the translational component, Trans, the displacements along the X, Y, and Z axes are combined to represent the overall translation. According to the principles of spatial vector composition, Trans is defined as follows:
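$$\mathrm{Trans} = \sqrt{\Delta x^{2} + \Delta y^{2} + \Delta z^{2}}$$
where $\Delta x$, $\Delta y$, and $\Delta z$ denote the displacements along the X, Y, and Z axes, respectively.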
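For rotational motion, which involves rotations around the X, Y, and Z axes, the rotational component, Rot, is derived using the properties of spatial rotation matrices. Assuming small-angle approximations, the total rotation matrix $R$ can be expressed as follows:
$$R = R_z(\theta_z)\,R_y(\theta_y)\,R_x(\theta_x), \qquad R_i(\theta_i) \approx I + \theta_i K_i, \quad i \in \{x, y, z\}$$
where $I$ is the identity matrix, $\theta_x$, $\theta_y$, and $\theta_z$ are the rotation angles around the X, Y, and Z axes, and $K_x$, $K_y$, and $K_z$ are the corresponding skew-symmetric matrices:
$$K_x = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 0 & -1 \\ 0 & 1 & 0 \end{bmatrix}, \quad K_y = \begin{bmatrix} 0 & 0 & 1 \\ 0 & 0 & 0 \\ -1 & 0 & 0 \end{bmatrix}, \quad K_z = \begin{bmatrix} 0 & -1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix}$$
By substituting these approximations into the overall rotation matrix and neglecting higher-order terms, the matrix simplifies to the following:
$$R \approx I + \theta_x K_x + \theta_y K_y + \theta_z K_z = \begin{bmatrix} 1 & -\theta_z & \theta_y \\ \theta_z & 1 & -\theta_x \\ -\theta_y & \theta_x & 1 \end{bmatrix}$$
According to the axis–angle representation of small-angle rotations, the rotation matrix can also be expressed as follows:
$$R \approx I + \mathrm{Rot} \cdot K$$
where Rot is the equivalent rotation angle, and $K$ is the skew-symmetric matrix corresponding to the unit rotation axis $\mathbf{k}$. By comparing the two forms, the equivalent rotation angle is derived as follows:
$$\mathrm{Rot} = \sqrt{\theta_x^{2} + \theta_y^{2} + \theta_z^{2}}$$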
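The quality of the small-angle approximation can be checked numerically. The sketch below composes exact rotations about the fixed X, Y, and Z axes, extracts the exact equivalent angle from the trace of the resulting matrix, and compares it with the root-sum-of-squares of the three angles. The test angles are hypothetical, and the helper names are not taken from the study.

```python
import math

def rot_x(t):
    c, s = math.cos(t), math.sin(t)
    return [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]

def rot_y(t):
    c, s = math.cos(t), math.sin(t)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]

def rot_z(t):
    c, s = math.cos(t), math.sin(t)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)] for i in range(3)]

def exact_angle(R):
    """Exact axis-angle magnitude of a rotation matrix, from trace(R) = 1 + 2*cos(angle)."""
    cos_angle = (R[0][0] + R[1][1] + R[2][2] - 1.0) / 2.0
    return math.acos(max(-1.0, min(1.0, cos_angle)))

# Hypothetical small head rotations, in degrees.
tx, ty, tz = (math.radians(a) for a in (2.0, -1.5, 1.0))

R = matmul(rot_z(tz), matmul(rot_y(ty), rot_x(tx)))   # x first, then y, then z (fixed axes)
exact_deg = math.degrees(exact_angle(R))
approx_deg = math.degrees(math.sqrt(tx**2 + ty**2 + tz**2))
error_deg = abs(exact_deg - approx_deg)
```

For angles of a few degrees the two values agree closely, because the neglected terms are second order in the angles; the gap grows as the rotations become large, which is where the approximation should be used with caution.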
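Thus, when the rotation angles ($\theta_x$, $\theta_y$, and $\theta_z$) about the fixed coordinate axes are small, the equivalent rotation angle can be calculated using the above formula. The translational and rotational components, Trans and Rot, are defined as follows:
$$\mathrm{Trans} = \lVert \mathbf{P} - \mathbf{P}_0 \rVert, \qquad \mathrm{Rot} = \lVert \boldsymbol{\Theta} - \boldsymbol{\Theta}_0 \rVert$$
where $\mathbf{P}$ is the estimated centroid position, $\boldsymbol{\Theta} = (\theta_x, \theta_y, \theta_z)$ is the vector of rotation angles, and the reference values $\mathbf{P}_0$ and $\boldsymbol{\Theta}_0$ are derived by averaging the elements corresponding to the first 15 sampling points of the sequence, thereby minimizing the error in the reference initial values.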
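The baseline step can be sketched as follows, assuming that Trans and Rot are taken as Euclidean distances from a reference obtained by averaging the first 15 samples. The function names and sample sequences are hypothetical.

```python
import math
from statistics import fmean

def baseline(samples, n_ref=15):
    """Component-wise mean of the first n_ref three-component samples."""
    ref = samples[:n_ref]
    return [fmean(s[i] for s in ref) for i in range(3)]

def displacement_norms(samples, n_ref=15):
    """Euclidean distance of every sample from the averaged reference."""
    ref = baseline(samples, n_ref)
    return [math.dist(s, ref) for s in samples]

# Hypothetical sequences: 15 resting frames followed by one displaced frame.
positions_mm = [[0.0, 0.0, 600.0]] * 15 + [[3.0, 4.0, 600.0]]
angles_deg = [[0.0, 0.0, 0.0]] * 15 + [[2.0, 0.0, 0.0]]

trans = displacement_norms(positions_mm)   # millimetres
rot = displacement_norms(angles_deg)       # degrees, valid for small angles
```

Averaging over several frames rather than using the first frame alone makes the reference less sensitive to depth noise in any single frame.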
Because rotation and translation have different units, this formula serves only as a numerical operation, and Amplitude is expressed as a dimensionless quantity. Theoretically, the subject’s head can be approximated as a sphere with a radius of 80 mm, rolling and sliding on the bed surface of the PET scanner. Given that rolling is the predominant motion, the value of the weighting factor $\lambda$ is set between 0.7 and 1, reflecting a primary focus on rotation with supplementary consideration of translation [42]. The metric is therefore sensitive to head rotation while still capturing the motion changes caused by head translation. Figure 5 illustrates the meaning of this metric in the two-dimensional case, showing the rotation angle synthesized by the pose monitoring algorithm, the displacement caused by rotation, and the displacement induced by translation.
Experimental and clinical tests indicate that  achieves optimal motion representation. Based on this, Amplitude thresholds are established: 10 as the warning value and 15 as the critical threshold. PET scans with Amplitude values exceeding 15 are excluded from imaging analysis.
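The thresholding step can be sketched as below. The functional form of Amplitude is an assumption here: a weighted sum of Rot and Trans is one form consistent with the description of a weighting factor between 0 and 1 that emphasizes rotation, but it should be checked against the definition given earlier. The default `weight=0.8` is a hypothetical value inside the stated 0.7 to 1 range, not the value selected in the study; only the thresholds 10 and 15 come from the text.

```python
def amplitude(rot_deg, trans_mm, weight=0.8):
    """Assumed form: weight * Rot + (1 - weight) * Trans, treated as a dimensionless number."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return weight * rot_deg + (1.0 - weight) * trans_mm

def classify(value, warning=10.0, critical=15.0):
    """Map an Amplitude value to a screening label using the stated thresholds."""
    if value > critical:
        return "exclude"    # segment is dropped from imaging analysis
    if value >= warning:
        return "warning"
    return "ok"

# Hypothetical (Rot in degrees, Trans in mm) pairs.
labels = [classify(amplitude(r, t)) for r, t in [(2.0, 3.0), (12.0, 20.0), (20.0, 10.0)]]
```

Whether a value exactly equal to a threshold counts as a warning or an exclusion is not stated in the text; the comparison operators above are one possible reading.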
  5. Discussion
In this study, we validated the proposed motion monitoring system through testing on a precisely constructed phantom platform and in a realistically replicated PET scanning room. The experimental results demonstrate that this system effectively addresses the significant problem of artifacts caused by head motion during PET scans. By evaluating the intensity of head motion, the system enables artifact screening and post-processing of PET imaging results, significantly improving identification efficiency and saving considerable time for physicians. This improvement enhances the comfort of medical services and contributes to the advancement of healthcare systems, highlighting the system’s potential for clinical application.
One notable advantage of the system is its reliance on facial feature recognition and tracking to monitor head motion. This approach reduces system complexity, avoids the need for external markers or complex hardware setups, and keeps computational requirements low, making it adaptable for deployment on various processors. Furthermore, the system achieves angular and translational displacement errors of less than 2.0° and 2.5 mm, respectively, well within the clinical thresholds of 5.0° and 5.0 mm. These results support the feasibility and reliability of the system for practical use. Despite these strengths, the system faces limitations when handling large head motion amplitudes (e.g., yaw angles exceeding 60°), which may reduce the stability of facial feature point recognition. While this instability slightly affects motion amplitude estimation, its impact on overall motion intensity remains manageable because periods of large motion are clearly distinguishable from stationary intervals.
The system’s real-time responsiveness further supports its applicability, as it generates rigid-body motion monitoring images and comprehensive motion intensity metrics within approximately 10 s post-scan. This efficiency surpasses manual image selection speeds, significantly improving diagnostic workflows. Additionally, the developed visualization software enhances user experience by supporting features such as real-time data acquisition, image loading, and intuitive motion artifact evaluations through RGB and fusion images.
Future research will focus on expanding this system to monitor full-body 3D motion by incorporating advanced three-dimensional reconstruction methods for facial feature points. This would enable comprehensive elimination of motion artifact periods in PET imaging and further support clinical applications. The use of higher-precision depth cameras will be explored to enhance detection accuracy, targeting motion amplitude errors within 2 mm. Collaboration with industry partners, such as Shanghai United Imaging Healthcare Co., Ltd., will also be prioritized to accelerate the clinical translation of this technology.
  6. Conclusions
This study addressed the challenges of prolonged PET scanning durations and motion artifacts by designing a motion detection and recognition system based on natural images. The system achieves contactless head motion monitoring and intensity estimation without relying on external markers, providing reliable criteria for artifact screening. This innovation simplifies the manual selection of imaging results, saving valuable time for physicians. The system employs an RGB-D monocular structured-light camera, avoiding complex hardware setups while balancing accuracy and real-time performance. Experiments conducted on both phantom models and human volunteers validated the system’s capability across various motion scenarios, achieving clinically acceptable displacement accuracy. This balance of performance and feasibility underscores the system’s potential for practical deployment.
In summary, the proposed motion monitoring system bridges a critical gap in contactless, marker-free PET motion monitoring using low-cost and non-invasive RGB-D cameras. By combining accuracy, real-time responsiveness, and user-friendly visualization tools, the system significantly enhances artifact identification efficiency, benefiting both physicians and patients. This framework lays a foundation for future developments in PET imaging and broader motion monitoring applications.