2. Related Works
In recent years, the field of computer vision has evolved into a cornerstone of intelligent systems research, providing machines with the capability to interpret and respond to both static and dynamic visual inputs. By leveraging advanced machine learning techniques, including both traditional machine learning algorithms and state-of-the-art deep learning architectures, computer vision systems can analyze complex image data to perform a wide array of tasks with high precision. These tasks typically include object recognition, anomaly detection, behavior analysis, and motion tracking, each tailored to meet the specific operational goals of the system.
At its core, computer vision extends beyond simple image processing; it represents an intelligent mechanism capable of perceiving and understanding the visual world in a manner that approximates human vision. Unlike conventional control systems, vision-based algorithms can adapt to contextual variations, enabling machines to make data-driven decisions in real time. This has allowed for the seamless integration of computer vision into a variety of real-world applications, ranging from industrial automation to healthcare, security, and transportation.
Notably, surveillance systems have increasingly adopted computer vision to enhance situational awareness and threat detection through live video analysis. Similarly, autonomous vehicles rely on computer vision to identify obstacles, interpret traffic conditions, and navigate dynamic environments safely. These applications often rely on the symmetrical spatial structure of objects and scenes to improve recognition accuracy and reduce computational complexity, particularly when paired with optimization strategies.
Beyond detection and classification, computer vision facilitates intelligent automation by enabling systems to learn from visual patterns and make predictive decisions. This ability to process and act upon visual input in real time has positioned computer vision as a pivotal component in next-generation intelligent infrastructure. As research in this domain advances, it continues to draw upon mathematical models of visual symmetry, geometric regularity, and optimization principles, especially in applications where precision and adaptability are paramount.
2.1. Concept of 3D Pose Estimation Method
The proposed system initiates its processing pipeline by employing a convolutional neural network (CNN) to extract spatial features from RGB video inputs. Specifically, the CNN is responsible for identifying two-dimensional (2D) skeletal keypoints that correspond to the posture and configuration of the human body in each video frame. These 2D pose estimations serve as the foundational representation for subsequent inference tasks. To derive three-dimensional (3D) pose information from the 2D keypoints, the system incorporates a Long Short-Term Memory (LSTM) architecture enhanced with Parametric Skip Connections (p-LSTM). This temporal modeling approach allows the network to implicitly infer depth information over time, leveraging learned dependencies across sequential frames rather than relying on explicitly defined depth cues. The Skip Connections embedded within the p-LSTM architecture serve to bridge non-consecutive layers, enabling the model to retain and reintroduce relevant feature information across time steps, thereby improving its depth prediction performance. By combining spatial feature extraction via the CNN with temporal–sequential learning through p-LSTM, the system achieves 3D pose reconstruction in a manner that does not require explicit ground-truth depth maps or externally calibrated 3D references. Instead, depth estimation is learned implicitly as part of the end-to-end training process, guided by pose consistency and temporal coherence. This method offers a practical and efficient solution for 3D human pose estimation from monocular video input, particularly in scenarios where depth sensors or multi-camera setups are unavailable. An overview of the conceptual framework underlying this architecture is presented in Figure 1.
Pose estimation is a key technique in computer vision, designed to detect and analyze the spatial configuration of individuals or objects through the localization of specific anatomical or structural keypoints. These keypoints, such as joints, facial landmarks, or object corners, serve as critical reference markers for interpreting posture, movement, and behavioral patterns. By evaluating the spatial relationships among these points, systems can infer dynamic poses over time, enabling advanced analyses of motion, activity, and symmetry. Applications of pose estimation span multiple disciplines, including human–computer interaction, sports analytics, healthcare monitoring, augmented reality, and robotics. In particular, symmetrical body configurations often serve as a basis for improving accuracy in classification and anomaly detection, as consistent patterns across mirrored joints (e.g., left and right arms or legs) provide valuable cues for verifying pose integrity. Several deep learning-based models have been developed to address pose estimation challenges, each incorporating unique architectures and optimization strategies. Notable frameworks include OpenPose, PoseNet, BlazePose, DeepPose, DensePose, and DeepCut, which differ in their keypoint localization granularity, inference speed, and structural modeling techniques. These pose estimation approaches are conceptually illustrated in Figure 2.
Figure 2 gives an example of pose estimation output, showing human skeleton keypoints and limb connections. White circles indicate anatomical landmarks such as the head, shoulders, elbows, wrists, hips, knees, and ankles. Color-coded lines represent limb segments: red for upper arms, blue for lower arms, yellow for shoulders and torso, green for legs, and purple for pelvis. This visual schema supports symmetry-based modeling by highlighting bilateral joint structures and their alignment.
2.2. Camera Surveillance Systems
Surveillance camera systems have become an essential component in elderly individual monitoring, leveraging advanced image processing and artificial intelligence (AI) algorithms. The integration of high-resolution cameras with real-time video analytics allows for detailed observations of elderly people’s activities, thereby enhancing safety through prompt alerts to caregivers [18,19,20]. Innovations in this domain include the use of infrared or thermal imaging to enhance monitoring capabilities during low-light or no-light conditions [21,22,23,24,25]. Recent advancements in machine learning, particularly with models like convolutional neural networks (CNNs), have greatly improved systems designed to interpret complex human behaviors [26,27]. These capabilities are essential not only for identifying emergencies such as falls or prolonged inactivity, but also for analyzing routine behaviors to detect potential health risks or well-being issues. Integrating edge computing into surveillance camera systems marks a transformative step in elderly individual monitoring. Processing data directly on the device enhances cost-efficiency, lowers latency, and accelerates response times, which are critical factors for timely interventions during emergencies [28,29,30,31]. Additionally, local data processing addresses privacy concerns by limiting the transfer of sensitive information to external servers. This is especially pertinent in elderly care, where privacy and data protection are critical. Nonetheless, while edge computing optimizes data handling by transmitting only key alerts or behavioral summaries to caregivers, it presents a trade-off. Caregivers may lack a full contextual understanding of situations, since fall detection and behavior analysis algorithms can still yield false positives or overlook certain events. The ongoing development in this field underscores the importance of balancing technological advancements with practical considerations in elderly care. The ability to provide accurate and timely information while ensuring privacy and security is vital for the effective implementation of these monitoring systems. As technology continues to evolve, it is essential to address these challenges to fully realize the potential benefits of surveillance systems in enhancing the quality of life of elderly people.
2.3. Advances in Artificial Intelligence for Elderly Care
In elderly care, artificial intelligence (AI) technologies have made remarkable strides, especially through the use of advanced algorithms and neural networks tailored for tasks like pose detection, fall detection, and activity recognition. At the core of these automated systems lies human pose estimation, which forms the essential basis for further analysis such as detecting falls or monitoring activities. Algorithms such as PoseNet and BlazePose have notably advanced this area by leveraging 2D and 3D imaging processed through convolutional neural networks (CNNs), yielding significant benefits in both academic and practical domains. These methods offer the real-time tracking and precise identification of human body positions, which are vital for effectively monitoring elderly individuals.
For fall detection, systems often depend on either object detection techniques or pose estimation frameworks. Typically, these solutions integrate both spatial characteristics—such as body posture and positioning—and temporal patterns that capture motion dynamics over time. Techniques like CNNs combined with Long Short-Term Memory (LSTM) networks process these dimensions to detect falls accurately. Some studies have proposed direct classification models that interpret pose or skeletal data to assess whether a fall has taken place, while others utilize bounding box strategies and object detection approaches for similar purposes. Innovative methodologies continue to emerge, improving the precision and dependability of fall detection systems, a key aspect in safeguarding elderly populations.
Furthermore, pose estimation models have spurred advancements in activity recognition research. By extracting skeletal information and integrating it with spatial–temporal analysis and other robust features, machine learning models have become more adept at distinguishing various human activities. Visual-based models tend to outperform motion sensor systems because of their ability to capture distinct visual cues that differentiate between actions, whereas sensor-based data often shows overlapping movement patterns, which can complicate classification efforts. As a result, many elderly individual monitoring frameworks now incorporate these models to automate behavior tracking. Some studies also enhance classification performance by combining pose data with handcrafted features such as the distances and angles between keypoints across frames, or by employing biometric features derived from these keypoints to train classifiers like random forests.
Despite these advancements, skeleton-based fall detection still faces challenges, particularly in terms of privacy and designing user-centered solutions for elderly care. Addressing privacy concerns is critical for broader acceptance and comfort among elderly users. Approaches such as anonymizing visual data and employing privacy-preserving techniques are being explored to build greater trust and ethical compliance in monitoring systems [32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53].
2.4. Symmetry-Aware Pose Estimation in Human Motion Analysis
The structural symmetry of the human body plays a pivotal role in improving the reliability of pose estimation systems. In most human motions, bilateral symmetry reflected in the mirroring of limbs and joints across the sagittal plane is a natural characteristic that can be exploited for model training, refinement, and error correction. By enforcing symmetry-based constraints during learning, pose estimation models are better equipped to handle occlusion, viewpoint variation, and partial body visibility. Several contemporary models incorporate symmetry either explicitly through regularization terms or implicitly through architectural design. For example, models such as OpenPose and BlazePose consider paired landmarks (e.g., left–right shoulders, elbows, and knees) and optimize their relative positions during inference to maintain geometric consistency. This ensures more accurate posture classification, particularly in dynamic scenarios where one side of the body is less visible. In real-world deployments, symmetry-aware mechanisms also serve as internal checks to reduce false positives in keypoint detection. When asymmetrical outputs are detected, the system can re-evaluate its predictions using mirrored templates or optimization-based correction heuristics. This is especially critical in elderly individual monitoring applications, where posture misclassification could delay emergency response or trigger false alarms. Moreover, symmetry is not only advantageous for accuracy, but also for computational efficiency. By leveraging mirrored joint relationships, models can reduce redundancy in feature computation and enable lighter architectures suitable for real-time processing on embedded or edge devices.
2.5. Optimization Strategies for Symmetry-Enhanced Pose Learning
Optimization lies at the heart of most learning-based pose estimation frameworks. Whether employing convolutional backbones or transformer-based architectures, these models rely on well-tuned objective functions to minimize positional error across keypoints. Incorporating symmetry into these objective functions can further enhance performance. Loss functions that penalize asymmetric keypoint deviations have been shown to improve generalization, particularly when training data includes noise or unbalanced poses. For example, some models integrate symmetry loss terms into the training objective to enforce equidistant relationships between mirrored joints. Others use multi-objective optimization to simultaneously minimize standard pose loss and maximize symmetry alignment across the predicted skeleton. In the context of elderly care, such optimization techniques are vital for ensuring a robust performance under variable lighting, body deformation due to age-related posture, and diverse environments. These adjustments contribute to more dependable gesture interpretation and safer decision making, ultimately advancing real-world applicability.
Building upon the technological advancements and research trends outlined in related works, this study proposes a real-time intelligent monitoring framework that integrates symmetry-aware pose estimation with temporal gesture analysis. The following section describes the methodology used to design, implement, and evaluate the proposed system, detailing the architecture, data processing pipeline, and machine learning models employed. This approach bridges the theoretical foundations with practical deployment for elderly care applications.
3. Methodology
This section presents the overall methodology used to develop the proposed intelligent monitoring system for elderly care. The framework integrates multiple components, including computer vision, symmetry-aware pose estimation, temporal gesture recognition, and IoT-based alerting, into a cohesive system capable of real-time operation. The methodology is structured into the following four main phases: data acquisition, data preparation, model training, and system evaluation. The following diagram provides an overview of the system architecture and workflow.
In this section, the researchers present their scholarly inquiry and data aggregation, incorporating the Internet of Things (IoT) to aid in the remote monitoring and surveillance of the elderly population, particularly in instances where caretakers are absent or unable to be physically present. Leveraging computer vision technology, the system conducts assessments of elderly individuals’ postures, including sitting, standing, and lying down, as well as their communication through predefined sign language gestures. These assessments are visualized on the system’s monitor display and communicated to caregivers via the LINE messaging platform, facilitating convenience, assistance, and vigilance in elderly care. Furthermore, the system aggregates diverse datasets to advance the development of real-time remote monitoring systems for elderly people, incorporating hand gesture recognition through computer vision. The conceptual framework is depicted in Figure 3. The process of pose estimation entails the identification and localization of key points within an image, commonly referred to as keypoints. These keypoints serve to denote significant landmarks or features of the subject, such as joints or distinctive anatomical landmarks. They are typically represented as coordinates in a two-dimensional (2D) or three-dimensional (3D) space, denoted by [x, y] or [x, y, z], respectively.
During the estimation stage, the detected keypoints are utilized to compute a dataset comprising the positions and orientations of the subject. This computation is facilitated through mathematical methodologies or models, which may involve geometric computations based on the keypoints data. In this research endeavor, a selection of hardware components is deemed necessary for the successful execution of the proposed study. The essential hardware elements encompass the following.
Figure 1 illustrates the conceptual architecture of a real-time intelligent monitoring system developed to detect postural behaviors and recognize hand gestures among elderly individuals. The system is organized into the following four main phases: data acquisition, data preparation, model training, and model evaluation. This modular structure facilitates both computational efficiency and generalizability across real-world environments. During implementation, the CNN backbone was configured with three convolutional layers employing 3 × 3 kernels, each followed by ReLU activation and max pooling to progressively extract hierarchical spatial features. The feature maps were then fed into a two-layer LSTM network, with each layer comprising 128 hidden units, to model temporal dependencies across consecutive frames. Model optimization was performed using the Adam optimizer with an initial learning rate of 0.001 and a mini-batch size of 32. Training was conducted over 20 epochs, with early stopping applied to prevent overfitting. These settings were chosen based on preliminary tuning experiments to balance training stability and computational efficiency. For classification, a random forest model was trained using 100 decision trees, with the maximum depth limited to 10 to reduce overfitting and improve interpretability.
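For concreteness, the sketch below shows one way the configuration described above could be assembled. It is an illustrative PyTorch outline rather than the authors' implementation: the channel widths, input shape, and eight-class output (five gestures plus three postures) are assumptions, while the three 3 × 3 convolutional blocks, two-layer 128-unit LSTM, and Adam settings follow the text.

```python
# Illustrative sketch of the CNN + LSTM configuration described above (PyTorch).
# Channel widths, input resolution, and the 8-class head are assumptions.
import torch
import torch.nn as nn

class FrameCNN(nn.Module):
    """Three 3x3 convolutional blocks, each followed by ReLU and 2x2 max pooling."""
    def __init__(self, in_channels=3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1),  # collapse the spatial grid to one 64-d vector per frame
        )

    def forward(self, x):             # x: (batch, 3, H, W)
        return self.features(x).flatten(1)

class CnnLstm(nn.Module):
    """Per-frame CNN features fed into a two-layer LSTM with 128 hidden units."""
    def __init__(self, num_classes=8):
        super().__init__()
        self.cnn = FrameCNN()
        self.lstm = nn.LSTM(input_size=64, hidden_size=128, num_layers=2, batch_first=True)
        self.head = nn.Linear(128, num_classes)

    def forward(self, clips):         # clips: (batch, T, 3, H, W)
        b, t = clips.shape[:2]
        feats = self.cnn(clips.flatten(0, 1)).view(b, t, -1)
        out, _ = self.lstm(feats)
        return self.head(out[:, -1])  # classify from the last hidden state

model = CnnLstm()
# Adam, learning rate 0.001, mini-batches of 32, 20 epochs with early stopping (as stated above).
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
```

The random forest stage (100 trees, maximum depth 10) would typically be trained separately on the features produced by this network, as illustrated later in this section.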
The system begins with a vision-based detection mechanism using RGB video streams to capture both human posture and hand gestures. Five predefined hand gestures are recognized, alongside three common elderly individual postures, including sitting, standing, and sleeping. The inclusion of hand gestures introduces a non-verbal communication channel that promotes autonomy and safety, especially for elderly users with speech or mobility limitations.
Captured image frames are preprocessed through several steps, including pixel normalization and color space conversion, to standardize the input data. These preprocessing techniques are crucial in enhancing symmetry consistency across datasets by aligning posture keypoints and hand contours, ensuring model robustness under diverse lighting and camera perspectives.
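As a small illustration of this step, the following sketch performs the colour-space conversion and pixel normalization mentioned above, assuming OpenCV-style BGR camera frames; the exact ordering and target ranges used in the deployed system are not specified in the text.

```python
# Minimal preprocessing sketch: colour-space conversion and pixel normalization.
import cv2
import numpy as np

def preprocess(frame_bgr):
    rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)  # MediaPipe-style models expect RGB input
    return rgb.astype(np.float32) / 255.0             # scale pixel values to [0, 1]
```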
- 3. Model Training (CNN + Mediapipe + LSTM Pipeline)
The core of the system is a hybrid learning framework combining convolutional neural networks (CNNs) for spatial feature extraction with Long Short-Term Memory (LSTM) networks for temporal regression modeling. The CNN component is responsible for extracting 2D features from each frame, while the LSTM model with pose depth cues infers spatiotemporal consistency to implicitly approximate 3D pose dynamics. Symmetry plays a key role here: the bilateral alignment of limbs and joint trajectories across time improves both training convergence and model generalization. This hybrid approach is further complemented by random forest classifiers, which operate on the extracted temporal features to classify postures and gestures based on feature importance and decision aggregation.
- 4. Model Evaluation
The performance of the model is benchmarked against baseline architectures (CNN-only and LSTM-only) using evaluation metrics centered on classification accuracy. The evaluation phase highlights the benefit of combining symmetry-informed vision features with optimization-based temporal modeling, where the CNN–Mediapipe–LSTM configuration demonstrates a higher accuracy and adaptability.
In this work, we propose a novel framework that integrates geometric symmetry and temporal optimization for robust human pose and hand gesture recognition in elderly individual monitoring systems. The architecture, illustrated in Figure 3, is structured into the following four stages: data acquisition, data preparation, model training, and performance evaluation. The core contribution lies in the mathematical formalization and exploitation of symmetry, both spatial and temporal, in a data-driven learning pipeline.
- 1. Feature Representation in Symmetric Space
Let the input video stream be represented as a sequence of image frames $I_1, I_2, \dots, I_T$. Each frame $I_t$ is processed to extract a set of 3D body landmarks (e.g., shoulders, elbows, wrists, and hips) and hand landmarks via a CNN-based pose estimation module (e.g., MediaPipe Pose). These are converted into high-dimensional feature vectors, as follows:

$$\mathbf{f}_t \in \mathbb{R}^{D}, \qquad t = 1, \dots, T.$$

Here, $D$ denotes the number of features extracted per frame. These features preserve the geometric structure of the body and are spatially normalized to ensure consistent alignment across subjects.
- 2. Geometric Symmetry Embedding
To explicitly encode geometric symmetry, we define a transformation operator $\mathcal{M}$ that reflects coordinates across the sagittal plane, as follows:

$$\mathcal{M}\left(p^{L}\right) = \left(-p^{L}_{x},\; p^{L}_{y},\; p^{L}_{z}\right) \approx p^{R},$$

where $p^{L}$ and $p^{R}$ are 3D coordinates of symmetric anatomical points (e.g., left and right wrists). During training, we impose a symmetry regularization term $\mathcal{L}_{\text{sym}}$ to enforce the reflective consistency, as follows:

$$\mathcal{L}_{\text{sym}} = \sum_{(L,R)} \left\lVert \mathcal{M}\left(p^{L}\right) - p^{R} \right\rVert^{2}.$$

This constraint encourages the model to learn features that are invariant under geometric reflection, which is essential for recognizing mirrored gestures or symmetrical posture deviations in elderly individuals.
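As a concrete illustration, a symmetry term of this kind could be computed as in the sketch below; the mirroring convention (negating the x coordinate about the body midline) and the tensor shapes are assumptions made for the example.

```python
# Sketch of a symmetry regularization term: mirrored left landmarks should match right landmarks.
import torch

def mirror_sagittal(points):
    """Reflect 3D points across the sagittal (y-z) plane by negating x (midline-centred coordinates)."""
    mirrored = points.clone()
    mirrored[:, 0] = -mirrored[:, 0]
    return mirrored

def symmetry_loss(left_points, right_points):
    """left_points, right_points: (num_pairs, 3) tensors of paired anatomical keypoints."""
    return torch.sum((mirror_sagittal(left_points) - right_points) ** 2)
```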
- 3. Temporal Optimization via Recurrent Modeling
Given the sequential nature of pose and gesture data, we model the temporal dependencies using a multi-layer LSTM.
The LSTM captures both short-term transitions and long-term temporal patterns in the behavior of an individual, allowing for discrimination between similar postures with different motion contexts (e.g., sitting down vs. standing up).
To enhance temporal smoothness, we introduce a temporal regularization loss, as follows:

$$\mathcal{L}_{\text{temp}} = \sum_{t=2}^{T} \left\lVert \mathbf{h}_t - \mathbf{h}_{t-1} \right\rVert^{2},$$

which minimizes abrupt feature transitions and reinforces behavioral continuity, a characteristic of natural human motion.
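A minimal sketch of such a term, assuming a hidden-state sequence of shape (T, hidden_dim), is as follows.

```python
# Temporal smoothness penalty: discourage abrupt changes between consecutive hidden states.
import torch

def temporal_loss(hidden_states):
    diffs = hidden_states[1:] - hidden_states[:-1]  # frame-to-frame transitions
    return torch.sum(diffs ** 2)
```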
- 4. Depth-Aware Regression Layer
For more accurate 3D interpretation, particularly in fall-risk situations, we incorporate depth cues through an auxiliary regression task. Given an input $\mathbf{x}_t$, the model estimates the pose depth $\hat{z}_t$ as follows:

$$\hat{z}_t = f_{\theta}\left(\mathbf{x}_t\right),$$

where $f_{\theta}$ is a CNN–LSTM network optimized with a regression loss, as follows:

$$\mathcal{L}_{\text{depth}} = \frac{1}{T} \sum_{t=1}^{T} \left( \hat{z}_t - z_t \right)^{2}.$$
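A corresponding objective could be written as below; the sketch assumes per-frame scalar depth targets are available during training.

```python
# Auxiliary depth-regression loss: mean squared error between predicted and reference depths.
import torch

def depth_regression_loss(z_pred, z_true):
    return torch.mean((z_pred - z_true) ** 2)
```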
- 5. Symmetry-Preserving Classification
The final classification is performed using a random forest on the temporally aggregated hidden states $\{\mathbf{h}_t\}_{t=1}^{T}$. Let the model $g$ map the sequence to the output label $\hat{y}$, as follows:

$$\hat{y} = g\left(\mathbf{h}_1, \mathbf{h}_2, \dots, \mathbf{h}_T\right).$$
The forest model is selected for its ability to handle non-linear interactions and its transparency in measuring feature importance, which we use to rank the symmetric body features that are most discriminative.
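An illustrative version of this stage, using scikit-learn with the tree count and depth reported earlier and placeholder features, is shown below.

```python
# Random-forest classification over temporally aggregated LSTM features (illustrative data).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

X_train = np.random.rand(200, 128)            # placeholder: one aggregated 128-d vector per clip
y_train = np.random.randint(0, 8, size=200)   # placeholder posture/gesture labels

forest = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
forest.fit(X_train, y_train)
print(forest.feature_importances_[:5])        # inspect which (symmetric) features are most discriminative
```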
This research introduces a symmetry-aware temporal modeling framework with the following novel elements.
Geometric Symmetry Integration: Unlike prior works, we explicitly encode symmetric body structures into the loss function, enhancing generalizability across mirrored actions.
Temporal Optimization Loss: A novel regularization term is introduced to stabilize LSTM outputs across time, improving motion continuity recognition.
Depth-Aware Enhancement: The inclusion of depth cues improves 3D perception in occluded or critical scenarios like falls.
Explainable Classification: The random forest provides interpretable insights into which symmetric joint features most affect the classification outcomes.
This integration of geometric and temporal symmetry principles into an AI-based gesture recognition system represents a significant step toward more intelligent, robust, and explainable monitoring of human activity, especially in the context of elderly care.
To support real-time human pose and gesture recognition within an elderly individual monitoring environment, a carefully selected hardware configuration was deployed. The system is designed to be low-cost, non-intrusive, and capable of running computer vision and deep learning algorithms locally or at the edge.
Figure 4 shows the system’s operational control using Raspberry Pi for real-time elderly individual monitoring. The Raspberry Pi serves as the central processing unit, executing the pose estimation and gesture recognition algorithms. The RGB camera module is connected via a USB 3.0 port for video streaming, while GPIO (General Purpose Input/Output) ports are used to interface with peripheral devices such as alert indicators or emergency buttons. The system can communicate with cloud services or local storage via Wi-Fi, allowing for remote monitoring and notification through IoT platforms such as LINE. This setup ensures a low latency and reliable real-time operation within the elderly care environment.
Figure 5 shows how the system’s operational control is facilitated by the integration of a Raspberry Pi, a versatile and cost-effective single-board computer renowned for its capabilities in various computational tasks. Acting as a central processing unit, the Raspberry Pi orchestrates the execution of algorithms and commands essential for system functionality. Its compact size, low power consumption, and GPIO (General-Purpose Input/Output) pins make it an ideal choice for interfacing with external hardware components and peripherals. Through the deployment of suitable software frameworks, the Raspberry Pi enables the seamless integration and control of the entire system, ensuring the efficient operation and effective coordination of its constituent elements.
The process of connecting Closed-Circuit Television (CCTV) cameras involves integrating cameras installed in various locations with systems for recording images or videos to capture scenes from the monitored areas. This process typically entails the utilization of light sensors such as Charge-Coupled Devices (CCDs) or Complementary Metal–Oxide–Semiconductor (CMOS) sensors to capture images or videos of the observed scenarios within objects or areas under surveillance by the CCTV cameras. Data transmission from CCTV cameras can be achieved through various communication channels, including analog signal systems utilizing coaxial cables to transmit image or video data to recording devices or other image management systems. Additionally, Internet Protocol (IP) network systems utilize Ethernet cables to transmit data to image-recording devices or other connected image management systems in the form of digital signals that are detected and transmitted through the network for convenient and efficient data recording or management processes.
Step 1: As depicted in Figure 5, the system initiates the image acquisition process by establishing a connection between the Raspberry Pi and the RGB camera module. Once the video stream is received, individual frames are extracted for further processing. To enhance computational efficiency and improve the precision of subsequent detection tasks, each frame undergoes a resolution downscaling procedure. The original input resolution of 2560 × 1440 pixels is resampled to 1920 × 1080 pixels using bilinear interpolation. This reduction not only preserves essential structural features, but also simplifies downstream computation by eliminating high-resolution redundancies. Furthermore, focusing on predefined Regions of Interest (ROIs) allows the system to concentrate on symmetrical body structures (e.g., face, torso, and limbs) while disregarding irrelevant background regions.
To formalize this optimization, let the original image $I_{0} \in \mathbb{R}^{W_{0} \times H_{0} \times 3}$ and the downscaled image $I_{d} \in \mathbb{R}^{W_{d} \times H_{d} \times 3}$, where $W_{0} = 2560$, $H_{0} = 1440$ and $W_{d} = 1920$, $H_{d} = 1080$. The spatial reduction factor $r$ is computed as follows:

$$r = \frac{W_{d} \times H_{d}}{W_{0} \times H_{0}}.$$

In this case, we derive the following:

$$r = \frac{1920 \times 1080}{2560 \times 1440} = 0.5625.$$

This corresponds to an approximately 44% reduction in pixel volume, significantly decreasing the input dimensionality while retaining sufficient symmetry information for detection tasks. Additionally, the detection region $\Omega$ can be defined by a binary mask $M(x, y)$ as follows:

$$M(x, y) = \begin{cases} 1, & (x, y) \in \Omega, \\ 0, & \text{otherwise.} \end{cases}$$
This mask identifies and isolates key subregions (e.g., face and hands) to be passed forward for pose or gesture recognition, as shown in Figure 6. By applying this targeted resolution optimization and spatial masking strategy, the system reduces processing latency while maintaining a high detection accuracy, aligning with the principles of spatial symmetry-aware optimization in visual computation.
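The downscaling, reduction-factor computation, and ROI masking can be sketched as follows; the rectangular ROI coordinates are purely illustrative.

```python
# Resolution reduction and ROI masking sketch (OpenCV).
import cv2
import numpy as np

frame = np.zeros((1440, 2560, 3), dtype=np.uint8)                 # stand-in for a 2560x1440 capture
small = cv2.resize(frame, (1920, 1080), interpolation=cv2.INTER_LINEAR)

r = (1920 * 1080) / (2560 * 1440)                                 # spatial reduction factor = 0.5625
print(f"retained pixel fraction: {r:.4f}")

mask = np.zeros(small.shape[:2], dtype=np.uint8)                  # binary mask M(x, y)
mask[200:900, 600:1300] = 1                                       # hypothetical ROI around face/torso
roi = cv2.bitwise_and(small, small, mask=mask)                    # keep only the region of interest
```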
For body and hand detection (Detect Pose and Hand), this research employs Mediapipe, which consists of a convolutional neural network (CNN) architecture comprising the following two main components: Feature Extraction CNN and Classification.
Figure 7 shows how the Feature Extraction CNN is utilized to extract features or characteristics from the images, while the Classification process involves using models to identify or classify objects or entities within the images. The classification process involves the following steps.
The input consists of images acquired from a camera or video in the form of image frames, with dimensions represented as Win × Hin × Din, where Win and Hin denote the width and height of the image, respectively, and Din represents the number of color channels (for RGB images, Din would be 3).
3.1. Body and Hand Detection Using CNN-Based MediaPipe
Figure 8 shows how the system utilizes the MediaPipe framework for real-time pose estimation, incorporating a convolutional neural network (CNN) architecture designed to detect both body posture and hand gestures. The architecture is composed of the following two key modules: a Feature Extraction CNN, which learns the spatial representations of body parts and hand shapes, and a Classification Head, which assigns labels to the detected keypoints.
The input to the network is an RGB image frame represented by a tensor $X \in \mathbb{R}^{W \times H \times D}$, where $W$ and $H$ denote the image width and height, and $D = 3$ for three color channels (RGB). The convolutional operation can be formalized as follows:

$$Y = \sigma\left(W_{c} * X + b\right),$$

where
- $Y$ is the resulting feature map,
- $W_{c}$ is the learned convolutional filter,
- $b$ is the bias term,
- $\sigma$ is a non-linear activation function such as ReLU,
- $*$ denotes the convolution operation.
These feature maps retain the structural symmetry inherent in human anatomy, such as the mirrored positions of limbs or bilateral hand gestures. This symmetry serves as a validation heuristic, reducing false detections from occlusion or lighting variation and strengthening robustness in real-time environments.
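A minimal sketch of per-frame body and hand landmark extraction with the MediaPipe solutions API is given below; the model options and thresholds are illustrative rather than the settings used in the deployed system.

```python
# Per-frame body and hand landmark detection with MediaPipe (illustrative settings).
import cv2
import mediapipe as mp

pose = mp.solutions.pose.Pose(static_image_mode=False)
hands = mp.solutions.hands.Hands(static_image_mode=False, max_num_hands=2)

def detect(frame_bgr):
    rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
    pose_result = pose.process(rgb)    # 33 body landmarks when a person is detected
    hand_result = hands.process(rgb)   # up to 21 landmarks per detected hand
    return pose_result.pose_landmarks, hand_result.multi_hand_landmarks
```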
3.2. Temporal Modeling with LSTM and Symmetry Regularization
Once the spatial features are extracted, a Long Short-Term Memory (LSTM) network is employed to model the temporal dynamics of movement across successive video frames. Let $\{\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_T\}$ denote the sequence of feature vectors over $T$ time steps. The LSTM processes this sequence as follows:

$$\mathbf{h}_t = \mathrm{LSTM}\left(\mathbf{x}_t, \mathbf{h}_{t-1}\right),$$

where
- $\mathbf{h}_t$ is the hidden state at time $t$,
- $\mathbf{x}_t$ is the spatial feature at frame $t$,
- $\mathbf{h}_{t-1}$ is the previous hidden state.

To promote spatial–temporal symmetry, especially in hand gesture and body pose consistency, a symmetry loss term is optionally introduced, as follows:

$$\mathcal{L}_{\text{sym}} = \sum_{t=1}^{T} \left\lVert \mathbf{f}_t^{L} - \mathrm{Mirror}\left(\mathbf{f}_t^{R}\right) \right\rVert^{2}.$$

Here, $\mathbf{f}_t^{L}$ and $\mathbf{f}_t^{R}$ refer to the extracted features of paired body parts (e.g., left and right wrists), and $\mathrm{Mirror}(\cdot)$ reflects the spatial alignment across the sagittal axis. This constraint enforces symmetrical consistency across both space and time, resulting in a more stable and accurate representation of posture and gesture.
Together, this step forms the backbone of the system’s symmetry-informed perception pipeline, capable of analyzing behavior patterns in real time with a high structural fidelity.
Convolution, the core operation of feature extraction, is applied to images whose pixel values range from 0 to 255. Figure 9 and Figure 10 show how the process utilizes mathematical operations to extract features by convolving the image with a filter, often referred to as a kernel. The kernel serves to highlight specific features of interest, producing a feature map or output. Mathematically, the feature map, denoted as $Y$, is obtained by convolving the image with the kernel, as depicted in Equation (2). The computation of $Y_{i,j}$ is expressed as follows:

$$Y_{i,j} = \sigma\left( \sum_{k=1}^{D} \sum_{u=1}^{F} \sum_{v=1}^{F} X_{i+u,\,j+v,\,k} \, W_{u,v,k} + b \right), \tag{2}$$

where
- $Y_{i,j}$ is the result at position $(i, j)$ of the feature map,
- $X_{i+u,\,j+v,\,k}$ is the pixel value at position $(i+u, j+v, k)$ of the input image,
- $W_{u,v,k}$ are the filter values at position $(u, v)$ in the $k$th color channel,
- $b$ is the bias value,
- $\sigma$ is the activation function ReLU.

Here, $\sigma$ is the Rectified Linear Unit (ReLU), $F$ represents the size of the filter, and $W_{u,v,k}$ are the values of the filter at position $(u, v)$ and in the $k$th color channel.
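A small worked example of Equation (2), assuming a single 3 × 3 × 3 filter applied with stride 1 and no padding, is as follows.

```python
# Worked example of the convolution-plus-ReLU operation in Equation (2).
import numpy as np

def conv2d_relu(image, kernel, bias=0.0):
    H, W, _ = image.shape
    F = kernel.shape[0]                                       # square F x F x D filter
    out = np.zeros((H - F + 1, W - F + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            region = image[i:i + F, j:j + F, :]
            out[i, j] = max(0.0, np.sum(region * kernel) + bias)  # ReLU activation
    return out

patch = np.random.rand(5, 5, 3)          # toy RGB patch
kernel = np.random.rand(3, 3, 3)         # toy filter
print(conv2d_relu(patch, kernel).shape)  # (3, 3) feature map
```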
Pooling, a dimensionality reduction technique, involves using small filters to extract the maximum value from a group of pixels within a feature map. This process employs a specified stride to continuously move the filter across the feature map. A common example of pooling is max pooling, where the maximum value within a 2 × 2 area of the feature map is selected, with the stride determining the extent of movement. Let Z denote the resulting feature map or output after pooling. Dimensionality reduction is typically achieved using max pooling, as illustrated in Figure 11.
The Fully Connected Layer represents the final layer in the convolutional neural network architecture. Its operation involves connecting the data from the preceding layer in a fully connected manner. Subsequently, the SoftMax technique is applied to classify the data, yielding $O$ as the result after the fully connected connection. The equation for the fully connected layer is depicted in Equation (3), as illustrated in Figure 12:

$$O_{j} = \sigma\left( \sum_{i} W_{i,j} \, Z_{i} + b_{j} \right), \tag{3}$$

where
- $O_{j}$ is the result in the $j$th fully connected unit,
- $W_{i,j}$ is the weight of the connection between feature map $i$ and unit $j$,
- $Z_{i}$ is the value of the feature map,
- $b_{j}$ is the bias value,
- $\sigma$ is the activation function ReLU.
The output (O) represents the result obtained after the fully connected layer, which is subsequently utilized for various tasks according to the objectives of person and hand detection. This is illustrated in Figure 13.
Step 3: Hand pose detection using Mediapipe is employed to facilitate communication through predefined sign language gestures. The process of Mediapipe hand landmark detection commences with palm detection utilizing a dedicated model. Subsequently, the crucial landmarks of the hand, totaling 21 points, are identified through the Hand Landmark Model. This model simulates hand poses based on image detection inputs, as shown in Figure 14.
The Mediapipe hand landmark detection system comprises the following three primary components: hand landmarks, handedness, and a palm detection model.
Hand landmarks: This component identifies keypoints on the hand, such as fingertips, finger joints, or other hand features. It consists of an array of points that are detected on the hand.
Handedness: This component is used to determine whether the detected hand is left- or right-handed. It provides information indicating “left” or “right” accordingly and is instrumental in distinguishing between left and right hands.
Palm detection model: This model is employed to detect the palm of the hand, aiding in determining the position and orientation of the hand in the image. It plays a crucial role in the process of detecting key points on the hand.
The system successfully classifies the detected hand as left-handed, with a confidence score of 0.98396. The result is part of the handedness inference module, where index = 0 corresponds to the left hand and a score above 0.9 indicates a high classification certainty. This information is critical for context-aware gesture interpretation and symmetry-based pose adjustment. The hand landmark model defines 21 three-dimensional key points, each characterized by normalized x, y, and z coordinates. The x and y values are normalized within the range [0.0, 1.0] relative to the image width and height, respectively. This normalization enables the z coordinate to represent the depth of each key point. The origin is defined at the wrist with a z value of zero, and smaller z values indicate points closer to the camera. For instance, Landmark #0 is located at (0.638852, 0.671197, −3.41 × 10−7) and Landmark #1 is at (0.634599, 0.536441, −0.06984). These spatial coordinates provide essential input features for downstream tasks such as gesture classification and symmetry-aware pose estimation. The system detects 21 world-space hand landmarks, where each point is described by its absolute 3D coordinates (x, y, z) in metric space rather than normalized image space. These world landmarks provide physically meaningful positions that are independent of image resolution or camera framing. For example, Landmark #0 is located at (0.067485, 0.031084, 0.055223) and Landmark #1 is at (0.063209, −0.00382, 0.020920). This world coordinate representation is essential for applications requiring spatial reasoning, gesture interaction with physical environments, and depth-aware gesture recognition.
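For reference, the handedness label, normalized landmarks, and world landmarks discussed above can be read from a MediaPipe Hands result roughly as follows; the field names follow the solutions API, and the blank stand-in frame is only for illustration.

```python
# Reading handedness and landmark coordinates from a MediaPipe Hands result.
import numpy as np
import mediapipe as mp

hands = mp.solutions.hands.Hands(static_image_mode=True, max_num_hands=1)
rgb_frame = np.zeros((480, 640, 3), dtype=np.uint8)   # stand-in; a real RGB camera frame is expected
result = hands.process(rgb_frame)

if result.multi_handedness:
    info = result.multi_handedness[0].classification[0]
    wrist = result.multi_hand_landmarks[0].landmark[0]              # normalized x, y, z (wrist = landmark #0)
    wrist_world = result.multi_hand_world_landmarks[0].landmark[0]  # metric-space coordinates
    print(info.label, info.score)                                   # e.g. "Left", 0.98
    print(wrist.x, wrist.y, wrist.z, wrist_world.x, wrist_world.y, wrist_world.z)
```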
The Mediapipe Pose Landmarker is employed for gesture recognition purposes. It utilizes a set of models to predict key landmarks indicative of gestures. The first model detects the presence of the human body within the image frame, while the second model identifies key landmarks on the body. The Pose Landmarker tracks the positions of 33 key landmarks on the body model, which approximate the locations of various body parts. This is illustrated in Figure 15.
The Pose Landmarker consists of two components, each comprising arrays of landmarks denoted by their x and y coordinates. These coordinates are normalized between 0.0 and 1.0 using the width (x) and height (y) of the image as the main reference. The z-coordinate represents the depth of the key landmarks, with the origin point located at the center of the hips. A lower z-value indicates that the landmark is closer to the camera. The scale of z is consistent with that of x. Additionally, the visibility parameter indicates the probability of a key landmark being visible in the image. The pose detection system extracts 33 three-dimensional body landmarks, with each point described by x, y, and z coordinates, as well as two confidence measures, visibility and presence. The x and y values are normalized within the image frame, while the z coordinate represents depth. Visibility indicates how likely it is that the landmark is visible to the camera, and presence reflects the confidence of its existence in the pose model. For instance, Landmark #0 is located at (0.638852, 0.671197, 0.129995), with a visibility of 0.99999976 and presence of 0.99999845. These enriched annotations provide a robust basis for real-time human activity recognition and symmetry-aware pose analysis. The world landmarks comprise x, y, and z coordinates, representing the three-dimensional coordinates in real-world units, typically measured in meters. These coordinates originate from the center of the hips, serving as the primary reference point. The visibility parameter denotes the probability of a key landmark being visible in the image. The system extracts 33 world-space pose landmarks, where each point is defined by its absolute x, y, and z coordinates in real-world metric space. These values are independent of the image resolution and are used for accurate spatial analysis. Each landmark is also associated with a visibility score, indicating how likely it is that it is visible to the camera, and a presence score, reflecting the model’s confidence in its detection. For example, Landmark #0 is located at (0.067485, 0.031084, 0.055223), with a visibility of 0.99999976 and presence of 0.99999845. Such data enables precise 3D body tracking, useful for applications in human movement analysis and intelligent monitoring systems.
Step 4: The process of training a model for hand and body pose classification and prediction involves several key steps, outlined as follows.
- 1. Importing various pose image datasets into the system, as illustrated in Figure 16.
- 2. Resizing the dataset to have uniform dimensions of 1920 × 1080 pixels, with a total of 100 images, as depicted in Figure 17.
- 3. Utilizing Mediapipe to determine the keypoints of each image and converting them into Landmarker values representing various parts of the body and hands, as illustrated in Figure 18.
- 4. The entire dataset is divided into the following two sets: 80% for the training dataset and 20% for the testing dataset. The training dataset is represented by the red boundary, while the testing dataset is represented by the green boundary, as shown in Figure 19.
Step 5: Train a machine learning model using the random forest algorithm to predict and classify hand and body poses, as described below.
Training a random forest model for hand and body pose prediction and classification: the utilization of machine learning algorithms, particularly the random forest algorithm, plays a pivotal role in predicting and classifying hand and body poses in various applications, such as sign language recognition, human–computer interaction, and gesture-based control systems. In this context, the random forest algorithm, a robust ensemble learning method, offers promising capabilities in handling multi-class classification tasks with complex and high-dimensional data.
Entropy is a measure utilized to quantify the uncertainty or randomness present within a dataset, particularly in the context of classification tasks. In the process of classifying data, entropy serves as a metric to assess the level of uncertainty associated with the distribution of classes within the dataset. It ranges between zero and one (for a two-class problem with a base-2 logarithm), where lower values indicate less uncertainty or higher orderliness in the data, often referred to as data having a discernible pattern. Conversely, higher values signify greater uncertainty or more confusion within the data, suggesting a lack of distinct patterns or significant disorder. Entropy reaches its maximum value when all classes within the dataset occur with equal frequency, indicating a state of maximum uncertainty. Conversely, it achieves its minimum value when there is only one class present with a frequency equal to the total number of instances, representing a state of perfect certainty or orderliness. Mathematically, entropy is calculated using the following formula:
$$H(S) = -\sum_{i=1}^{n} p_{i} \log_{2} p_{i},$$

where $p_{i}$ represents the probability of occurrence of class $i$ within the dataset $S$, and $n$ denotes the total number of classes. This formulation captures the degree of uncertainty associated with the class distribution and provides a quantitative measure of the dataset’s entropy, facilitating informed decision making in classification tasks.
Mean Absolute Error (MAE) is a method commonly used to measure prediction error in regression tasks, particularly in predicting numerical values. It quantifies the average magnitude of the differences between the predicted values and the actual target values across all instances in the test dataset. A lower MAE indicates that the model has a higher accuracy in its predictions, while a higher MAE suggests that the model’s predictions deviate more from the actual values. Mathematically, MAE is calculated as follows:

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_{i} - \hat{y}_{i} \right|,$$

where
- $n$ is the total number of instances in the test dataset,
- $y_{i}$ represents the actual target value for instance $i$,
- $\hat{y}_{i}$ represents the predicted value for instance $i$.
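Both measures can be computed directly, as in the short example below with illustrative values.

```python
# Worked example of the entropy and MAE measures defined above.
import numpy as np

def entropy(class_probs):
    p = np.asarray(class_probs, dtype=float)
    p = p[p > 0]                                  # convention: 0 * log(0) = 0
    return -np.sum(p * np.log2(p))

def mean_absolute_error(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.mean(np.abs(y_true - y_pred))

print(entropy([0.5, 0.5]))                          # 1.0: two equally frequent classes, maximum uncertainty
print(entropy([1.0]))                               # 0.0: a single class, perfect certainty
print(mean_absolute_error([1, 2, 3], [1.5, 2, 2]))  # 0.5
```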
Step 6: Gesture and hand gesture classification. The modeling process for classifying gestures and hand gestures in the long-distance elderly individual monitoring system via the Internet of Things encompasses the collection of datasets consisting of three body gestures and five hand gestures. Each gesture category comprises 100 images in the dataset, with variations in angle and degree for each image to ensure an optimal classification accuracy. Hand pose 1 is holding up one index finger, with the message "Water Please". Hand pose 2 is holding up two fingers, the index finger and middle finger, with the message "want food" (Hungry). Hand pose 3 is holding up three fingers, including the pinky finger and thumb, with the message "Miss you". Hand pose 4 is a four-finger gesture, with the message "help". Hand pose 5 is a thumbs-up, with the message "want to take a shower or go to the bathroom" (Bathroom). Posture 1 is a standing pose, with the message "Stand". Posture 2 is a sitting pose, with the message "Sit". Posture 3 is a lying position, with the message "Lying" (Sleep). These are shown in Figure 20 and Figure 21.
Step 7: System decisions and decision principles. For the classification of gestures and hand gestures, the original video is captured and processed, and each image is transformed into a landmark array with x, y, and z coordinates. The x and y coordinates are normalized to the range [0.0, 1.0] using the width and height of the image, respectively, while the z value represents the depth of the important points. These values are then matched against the prepared model, and when a matching or closest value is found, the corresponding predefined message is displayed. As a result, the text cannot always be displayed correctly, owing to factors such as angle, degree, and depth, as shown in the figures. The algorithm detailing this procedure is presented in Figure 22 and Figure 23.
As shown in Algorithm 1, the classification involves feature extraction, gesture modeling, and decision making.
Algorithm 1: Mathematical Representation of Gesture and Hand Gesture Classification
- 1. Step 1: Input Video Frame Capture. Let I represent the input frame from the video stream: I = capture_frame(video_stream).
- 2. Step 2: Preprocessing. Normalize the image dimensions to a standard size (W, H): I_resized = resize(I, W, H).
- 3. Step 3: Landmark Detection. Detect hand and body landmarks using Mediapipe, resulting in an array of coordinates L: L = detect_landmarks(I_resized).
- 4. Step 4: Normalization.
- 5. Step 5: Model Matching.
- 6. Step 6: Classification.
- 7. Step 7: Post-processing and Display. Display the identified gesture and its corresponding message: display(message).
Handle Variability: factors such as angle, degree, and depth may affect recognition accuracy. These factors can be represented by additional terms in the normalization and matching equations to account for variability.
Final Algorithm in Mathematical Notation: the final algorithm can be summarized as the sequential application of the steps above, from frame capture and preprocessing through landmark detection, normalization, model matching, classification, and display.
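To make the decision flow of Algorithm 1 concrete, the sketch below maps a detected landmark array to a predefined message using a trained classifier; the helper names and the label-to-message mapping are illustrative, not the authors' code.

```python
# Illustrative end-to-end decision step for Algorithm 1.
import numpy as np

GESTURE_MESSAGES = {0: "Water Please", 1: "Hungry", 2: "Miss you", 3: "Help",
                    4: "Bathroom", 5: "Stand", 6: "Sit", 7: "Sleep"}

def landmarks_to_features(landmarks):
    """Flatten normalized (x, y, z) landmark coordinates into one feature vector."""
    return np.array([[p.x, p.y, p.z] for p in landmarks]).flatten()

def classify_frame(landmarks, forest):
    """Match the current frame's landmarks against the trained model and return its message."""
    features = landmarks_to_features(landmarks).reshape(1, -1)
    label = int(forest.predict(features)[0])
    return GESTURE_MESSAGES.get(label, "Unknown")
```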