Article

Geometric Symmetry and Temporal Optimization in Human Pose and Hand Gesture Recognition for Intelligent Elderly Individual Monitoring

by Pongsarun Boonyopakorn 1 and Mahasak Ketcham 2,*
1 Department of Digital Network and Information Security Management, King Mongkut’s University of Technology North Bangkok, Bangkok 10800, Thailand
2 Department of Information Technology Management, King Mongkut’s University of Technology North Bangkok, Bangkok 10800, Thailand
* Author to whom correspondence should be addressed.
Symmetry 2025, 17(9), 1423; https://doi.org/10.3390/sym17091423
Submission received: 26 May 2025 / Revised: 9 July 2025 / Accepted: 18 July 2025 / Published: 1 September 2025

Abstract

This study introduces a real-time, non-intrusive monitoring system designed to support elderly care through vision-based pose estimation and hand gesture recognition. The proposed framework integrates convolutional neural networks (CNNs), temporal modeling using LSTM networks, and symmetry-aware keypoint analysis to enhance the accuracy and reliability of behavior detection under varied real-world conditions. By leveraging the bilateral symmetry of human anatomy, the system improves the robustness of posture and gesture classification, even in the presence of partial occlusion or variable lighting. A total of 21 hand landmarks and 33 body pose points are used to recognize predefined actions and communication gestures, enabling seamless interaction without wearable devices. Experimental evaluations across four distinct lighting environments confirm a consistent accuracy above 90%, with real-time alerts triggered via IoT messaging platforms. The system’s modular architecture, interpretability, and adaptability make it a scalable solution for intelligent elderly individual monitoring, offering a novel application of spatial symmetry and optimized deep learning in healthcare technology.

1. Introduction

Thailand’s progression into an aged society, officially recognized under the Elderly Person Act of 2003, reflects a demographic transformation that extends beyond statistical benchmarks. As of November 2023, individuals aged 60 and above accounted for approximately 20% of the total population, amounting to nearly 13 million citizens, placing increasing pressure on the nation’s healthcare and social support systems [1,2,3,4]. These demographic shifts, consistent with United Nations criteria, are further shaped by contextual factors such as gender-based longevity disparities and differences in access between rural and urban populations. Social perceptions also suggest that aging manifests uniquely across demographic contexts, influencing the lived experience of elderly individuals, as well as the type and quality of care they receive.
In Thailand, identifying an individual as elderly involves more than just chronological age. It includes observable changes in physical function and behavior, such as decreased mobility, cognitive decline, and emotional vulnerability, factors that often demand specialized, proactive intervention. Many elderly individuals must manage multiple chronic conditions, including hypertension, diabetes, and heart disease, leading to repeated cycles of medical treatment, dependence, and rehabilitation. These health challenges often result in periods of immobility that, if left unmonitored, can exacerbate frailty, delay recovery, and result in acute incidents such as falls or wandering [5,6,7,8,9,10,11].
Given these complex realities, it is increasingly recognized that aging care must extend beyond financial benefits such as pensions and address the broader dimensions of autonomy, dignity, and health security. This calls for the development of intelligent, technology-assisted systems that can support or enhance caregiving efforts, especially in situations where continuous human supervision is not feasible. One particularly promising direction is the deployment of non-intrusive monitoring systems driven by advancements in computer vision and the Internet of Things (IoT).
From a computational perspective, the daily behaviors of elderly individuals often display recurring, symmetrical patterns, particularly in postural transitions such as sitting, standing, and lying down. Recognizing these natural patterns provides a structural advantage for monitoring systems. For example, deviations from typical postural symmetry, such as a loss of balance or irregular movement, can be flagged for caregiver attention. These consistent behaviors serve as baselines for identifying anomalies and facilitating timely interventions. In addition, optimization techniques play a critical role in enhancing these systems’ adaptive capabilities. By fine-tuning classification thresholds, refining pose estimation accuracy, and calibrating gesture recognition across varied real-world environments, systems can maintain high accuracy and responsiveness while reducing latency. This is particularly important for real-time alerts delivered via messaging platforms such as Line, enabling remote caregivers to respond effectively based on detected anomalies or user-initiated gestural cues [12,13,14]. This research offers several novel contributions to the intersection of intelligent monitoring and elderly care technology.
First, it introduces a robust and scalable framework that integrates state-of-the-art computer vision techniques, namely pose landmark detection and hand landmark recognition, with real-time camera input, enabling the precise classification of body postures and hand gestures. Second, the system is uniquely tailored to the behavioral characteristics and practical needs of elderly individuals, incorporating gesture-based interaction as a non-verbal communication channel for safety and autonomy. Third, the system demonstrates a high adaptability and performance across a wide range of environments, including variable lighting conditions and unfamiliar spaces, with classification accuracies consistently exceeding 90%. These results validate the system’s reliability and practicality for real-world deployment. Collectively, the findings not only underscore the transformative potential of combining IoT infrastructure with vision-based symmetry analysis and optimization, but also establish a benchmark for future advancements in technology-enabled eldercare. In this study, we propose a real-time remote monitoring framework that integrates symmetry-based pose estimation and hand gesture recognition with an IoT-enabled alert system. Beyond its technical components, the system is designed to empower elderly individuals by providing non-invasive digital assistance that supports their dignity, independence, and well-being. Through the use of behavioral symmetry and optimization-driven learning, this research contributes a novel and practical approach to intelligent eldercare, bridging computational innovation with human-centered application.

2. Related Works

In recent years, the field of computer vision has evolved into a cornerstone of intelligent systems research, providing machines with the capability to interpret and respond to both static and dynamic visual inputs. By leveraging advanced machine learning techniques, including both traditional machine learning algorithms and state-of-the-art deep learning architectures, computer vision systems can analyze complex image data to perform a wide array of tasks with high precision. These tasks typically include object recognition, anomaly detection, behavior analysis, and motion tracking, each tailored to meet the specific operational goals of the system.
At its core, computer vision extends beyond simple image processing; it represents an intelligent mechanism capable of perceiving and understanding the visual world in a manner that approximates human vision. Unlike conventional control systems, vision-based algorithms can adapt to contextual variations, enabling machines to make data-driven decisions in real time. This has allowed for the seamless integration of computer vision into a variety of real-world applications, ranging from industrial automation to healthcare, security, and transportation.
Notably, surveillance systems have increasingly adopted computer vision to enhance situational awareness and threat detection through live video analysis. Similarly, autonomous vehicles rely on computer vision to identify obstacles, interpret traffic conditions, and navigate dynamic environments safely. These applications often rely on the symmetrical spatial structure of objects and scenes to improve recognition accuracy and reduce computational complexity, particularly when paired with optimization strategies.
Beyond detection and classification, computer vision facilitates intelligent automation by enabling systems to learn from visual patterns and make predictive decisions. This ability to process and act upon visual input in real time has positioned computer vision as a pivotal component in next-generation intelligent infrastructure. As research in this domain advances, it continues to draw upon mathematical models of visual symmetry, geometric regularity, and optimization principles, especially in applications where precision and adaptability are paramount.

2.1. Concept of 3D Pose Estimation Method

The proposed system initiates its processing pipeline by employing a convolutional neural network (CNN) to extract spatial features from RGB video inputs. Specifically, the CNN is responsible for identifying two-dimensional (2D) skeletal keypoints that correspond to the posture and configuration of the human body in each video frame. These 2D pose estimations serve as the foundational representation for subsequent inference tasks. To derive three-dimensional (3D) pose information from the 2D keypoints, the system incorporates a Long Short-Term Memory (LSTM) architecture enhanced with Parametric Skip Connections (p-LSTM). This temporal modeling approach allows the network to implicitly infer depth information over time, leveraging learned dependencies across sequential frames rather than relying on explicitly defined depth cues. The Skip Connections embedded within the p-LSTM architecture serve to bridge non-consecutive layers, enabling the model to retain and reintroduce relevant feature information across time steps, thereby improving its depth prediction performance. By combining spatial feature extraction via the CNN with temporal–sequential learning through p-LSTM, the system achieves 3D pose reconstruction in a manner that does not require explicit ground-truth depth maps or externally calibrated 3D references. Instead, depth estimation is learned implicitly as part of the end-to-end training process, guided by pose consistency and temporal coherence. This method offers a practical and efficient solution for 3D human pose estimation from monocular video input, particularly in scenarios where depth sensors or multi-camera setups are unavailable. An overview of the conceptual framework underlying this architecture is presented in Figure 1.
Pose estimation is a key technique in computer vision, designed to detect and analyze the spatial configuration of individuals or objects through the localization of specific anatomical or structural keypoints. These keypoints, such as joints, facial landmarks, or object corners, serve as critical reference markers for interpreting posture, movement, and behavioral patterns. By evaluating the spatial relationships among these points, systems can infer dynamic poses over time, enabling advanced analyses of motion, activity, and symmetry. Applications of pose estimation span multiple disciplines, including human–computer interaction, sports analytics, healthcare monitoring, augmented reality, and robotics. In particular, symmetrical body configurations often serve as a basis for improving accuracy in classification and anomaly detection, as consistent patterns across mirrored joints (e.g., left and right arms or legs) provide valuable cues for verifying pose integrity. Several deep learning-based models have been developed to address pose estimation challenges, each incorporating unique architectures and optimization strategies. Notable frameworks include OpenPose, PoseNet, BlazePose, DeepPose, DensePose, and DeepCut, which differ in their keypoint localization granularity, inference speed, and structural modeling techniques. These pose estimation approaches are conceptually illustrated in Figure 2.
Figure 2 gives an example of pose estimation output, showing human skeleton keypoints and limb connections. White circles indicate anatomical landmarks such as the head, shoulders, elbows, wrists, hips, knees, and ankles. Color-coded lines represent limb segments: red for upper arms, blue for lower arms, yellow for shoulders and torso, green for legs, and purple for pelvis. This visual schema supports symmetry-based modeling by highlighting bilateral joint structures and their alignment.

2.2. Camera Surveillance Systems

Surveillance camera systems have become an essential component in elderly individual monitoring, leveraging advanced image processing and artificial intelligence (AI) algorithms. The integration of high-resolution cameras with real-time video analytics allows for detailed observations of elderly people’s activities, thereby enhancing safety through prompt alerts to caregivers [18,19,20]. Innovations in this domain include the use of infrared or thermal imaging to enhance monitoring capabilities during low-light or no-light conditions [21,22,23,24,25]. Recent advancements in machine learning, particularly with models like convolutional neural networks (CNNs), have greatly improved systems designed to interpret complex human behaviors [26,27]. These capabilities are essential not only for identifying emergencies such as falls or prolonged inactivity, but also for analyzing routine behaviors to detect potential health risks or well-being issues. Integrating edge computing into surveillance camera systems marks a transformative step in elderly individual monitoring. Processing data directly on the device enhances cost-efficiency, lowers latency, and accelerates response times, which are critical factors for timely interventions during emergencies [28,29,30,31]. Additionally, local data processing addresses privacy concerns by limiting the transfer of sensitive information to external servers. This is especially pertinent in elderly care, where privacy and data protection are critical. Nonetheless, while edge computing optimizes data handling by transmitting only key alerts or behavioral summaries to caregivers, it presents a trade-off. Caregivers may lack a full contextual understanding of situations, since fall detection and behavior analysis algorithms can still yield false positives or overlook certain events. The ongoing development in this field underscores the importance of balancing technological advancements with practical considerations in elderly care. The ability to provide accurate and timely information while ensuring privacy and security is vital for the effective implementation of these monitoring systems. As technology continues to evolve, it is essential to address these challenges to fully realize the potential benefits of surveillance systems in enhancing the quality of life of elderly people.

2.3. Advances in Artificial Intelligence for Elderly Care

In elderly care, artificial intelligence (AI) technologies have made remarkable strides, especially through the use of advanced algorithms and neural networks tailored for tasks like pose detection, fall detection, and activity recognition. At the core of these automated systems lies human pose estimation, which forms the essential basis for further analysis such as detecting falls or monitoring activities. Algorithms such as PoseNet and BlazePose have notably advanced this area by leveraging 2D and 3D imaging processed through convolutional neural networks (CNNs), yielding significant benefits in both academic and practical domains. These methods offer the real-time tracking and precise identification of human body positions, which are vital for effectively monitoring elderly individuals.
For fall detection, systems often depend on either object detection techniques or pose estimation frameworks. Typically, these solutions integrate both spatial characteristics—such as body posture and positioning—and temporal patterns that capture motion dynamics over time. Techniques like CNNs combined with Long Short-Term Memory (LSTM) networks process these dimensions to detect falls accurately. Some studies have proposed direct classification models that interpret pose or skeletal data to assess whether a fall has taken place, while others utilize bounding box strategies and object detection approaches for similar purposes. Innovative methodologies continue to emerge, improving the precision and dependability of fall detection systems, a key aspect in safeguarding elderly populations.
Furthermore, pose estimation models have spurred advancements in activity recognition research. By extracting skeletal information and integrating it with spatial–temporal analysis and other robust features, machine learning models have become more adept at distinguishing various human activities. Visual-based models tend to outperform motion sensor systems because of their ability to capture distinct visual cues that differentiate between actions, whereas sensor-based data often shows overlapping movement patterns, which can complicate classification efforts. As a result, many elderly individual monitoring frameworks now incorporate these models to automate behavior tracking. Some studies also enhance classification performance by combining pose data with handcrafted features such as the distances and angles between keypoints across frames, or by employing biometric features derived from these keypoints to train classifiers like random forests.
Despite these advancements, skeleton-based fall detection still faces challenges, particularly in terms of privacy and designing user-centered solutions for elderly care. Addressing privacy concerns is critical for broader acceptance and comfort among elderly users. Approaches such as anonymizing visual data and employing privacy-preserving techniques are being explored to build greater trust and ethical compliance in monitoring systems [32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53].

2.4. Symmetry-Aware Pose Estimation in Human Motion Analysis

The structural symmetry of the human body plays a pivotal role in improving the reliability of pose estimation systems. In most human motions, bilateral symmetry reflected in the mirroring of limbs and joints across the sagittal plane is a natural characteristic that can be exploited for model training, refinement, and error correction. By enforcing symmetry-based constraints during learning, pose estimation models are better equipped to handle occlusion, viewpoint variation, and partial body visibility. Several contemporary models incorporate symmetry either explicitly through regularization terms or implicitly through architectural design. For example, models such as OpenPose and BlazePose consider paired landmarks (e.g., left–right shoulders, elbows, and knees) and optimize their relative positions during inference to maintain geometric consistency. This ensures more accurate posture classification, particularly in dynamic scenarios where one side of the body is less visible. In real-world deployments, symmetry-aware mechanisms also serve as internal checks to reduce false positives in keypoint detection. When asymmetrical outputs are detected, the system can re-evaluate its predictions using mirrored templates or optimization-based correction heuristics. This is especially critical in elderly individual monitoring applications, where posture misclassification could delay emergency response or trigger false alarms. Moreover, symmetry is not only advantageous for accuracy, but also for computational efficiency. By leveraging mirrored joint relationships, models can reduce redundancy in feature computation and enable lighter architectures suitable for real-time processing on embedded or edge devices.

2.5. Optimization Strategies for Symmetry-Enhanced Pose Learning

Optimization lies at the heart of most learning-based pose estimation frameworks. Whether employing convolutional backbones or transformer-based architectures, these models rely on well-tuned objective functions to minimize positional error across keypoints. Incorporating symmetry into these objective functions can further enhance performance. Loss functions that penalize asymmetric keypoint deviations have been shown to improve generalization, particularly when training data includes noise or unbalanced poses. For example, some models integrate symmetry loss terms into the training objective to enforce equidistant relationships between mirrored joints. Others use multi-objective optimization to simultaneously minimize standard pose loss and maximize symmetry alignment across the predicted skeleton. In the context of elderly care, such optimization techniques are vital for ensuring a robust performance under variable lighting, body deformation due to age-related posture, and diverse environments. These adjustments contribute to more dependable gesture interpretation and safer decision making, ultimately advancing real-world applicability.
Building upon the technological advancements and research trends outlined in related works, this study proposes a real-time intelligent monitoring framework that integrates symmetry-aware pose estimation with temporal gesture analysis. The following section describes the methodology used to design, implement, and evaluate the proposed system, detailing the architecture, data processing pipeline, and machine learning models employed. This approach bridges the theoretical foundations with practical deployment for elderly care applications.

3. Methodology

This section presents the overall methodology used to develop the proposed intelligent monitoring system for elderly care. The framework integrates multiple components, including computer vision, symmetry-aware pose estimation, temporal gesture recognition, and IoT-based alerting, into a cohesive system capable of real-time operation. The methodology is structured into the following four main phases: data acquisition, data preparation, model training, and system evaluation. The following diagram provides an overview of the system architecture and workflow.
In this section, the researchers present their scholarly inquiry and data aggregation, incorporating the Internet of Things (IoT) to aid in the remote monitoring and surveillance of the elderly population, particularly in instances where caretakers are absent or unable to be physically present. Leveraging computer vision technology, the system conducts assessments of elderly individuals’ postures, including sitting, standing, and lying down, as well as their communication through predefined sign language gestures. These assessments are visualized on the system’s monitor display and communicated to caregivers via the Line messaging platform, facilitating convenience, assistance, and vigilance in elderly care. Furthermore, the system aggregates diverse datasets to advance the development of real-time remote monitoring systems for elderly people, incorporating hand gesture recognition through computer vision. The conceptual framework is depicted in Figure 3. The process of pose estimation entails the identification and localization of key points within an image, commonly referred to as keypoints. These keypoints serve to denote significant landmarks or features of the subject, such as joints or distinctive anatomical landmarks. They are typically represented as coordinates in a two-dimensional (2D) or three-dimensional (3D) space, denoted by [x, y] or [x, y, z], respectively.
During the estimation stage, the detected keypoints are utilized to compute a dataset comprising the positions and orientations of the subject. This computation is facilitated through mathematical methodologies or models, which may involve geometric computations based on the keypoints data. In this research endeavor, a selection of hardware components is deemed necessary for the successful execution of the proposed study. The essential hardware elements encompass the following.
Figure 1 illustrates the conceptual architecture of a real-time intelligent monitoring system developed to detect postural behaviors and recognize hand gestures among elderly individuals. The system is organized into the following four main phases: data acquisition, data preparation, model training, and model evaluation. This modular structure facilitates both computational efficiency and generalizability across real-world environments. During implementation, the CNN backbone was configured with three convolutional layers employing 3 × 3 kernels, each followed by ReLU activation and max pooling to progressively extract hierarchical spatial features. The feature maps were then fed into a two-layer LSTM network, with each layer comprising 128 hidden units, to model temporal dependencies across consecutive frames. Model optimization was performed using the Adam optimizer with an initial learning rate of 0.001 and a mini-batch size of 32. Training was conducted over 20 epochs, with early stopping applied to prevent overfitting. These settings were chosen based on preliminary tuning experiments to balance training stability and computational efficiency. For classification, a random forest model was trained using 100 decision trees, with the maximum depth limited to 10 to reduce overfitting and improve interpretability.
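As a point of reference, the configuration described above can be summarized in the following minimal PyTorch-style sketch. Only the quoted hyperparameters (three 3 × 3 convolutional blocks with ReLU and max pooling, a two-layer LSTM with 128 hidden units, Adam at a learning rate of 0.001, a mini-batch size of 32, and 20 epochs) come from the text; the channel counts, feature dimension, class count, and module names (PoseCNN, PoseLSTM) are illustrative assumptions rather than the authors' exact implementation.
import torch
import torch.nn as nn

class PoseCNN(nn.Module):
    # Three 3x3 convolutional blocks, each followed by ReLU and max pooling, as described above.
    def __init__(self, in_channels=3, feat_dim=128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.pool = nn.AdaptiveAvgPool2d(1)      # collapse the spatial dimensions
        self.proj = nn.Linear(64, feat_dim)      # per-frame feature vector

    def forward(self, x):                        # x: (batch, 3, H, W)
        f = self.pool(self.features(x)).flatten(1)
        return self.proj(f)                      # (batch, feat_dim)

class PoseLSTM(nn.Module):
    # Two-layer LSTM with 128 hidden units modeling temporal dependencies across frames.
    def __init__(self, feat_dim=128, hidden=128, num_classes=8):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, seq):                      # seq: (batch, T, feat_dim)
        out, _ = self.lstm(seq)
        return self.head(out[:, -1])             # classify from the last time step

cnn, lstm = PoseCNN(), PoseLSTM()
optimizer = torch.optim.Adam(list(cnn.parameters()) + list(lstm.parameters()), lr=0.001)
# Training loop (sketch): 20 epochs, mini-batch size 32, early stopping on the validation loss.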
1. Data Acquisition
The system begins with a vision-based detection mechanism using RGB video streams to capture both human posture and hand gestures. Five predefined hand gestures are recognized, alongside three common elderly individual postures, including sitting, standing, and sleeping. The inclusion of hand gestures introduces a non-verbal communication channel that promotes autonomy and safety, especially for elderly users with speech or mobility limitations.
2. Data Preparation
Captured image frames are preprocessed through several steps, including pixel normalization and color space conversion, to standardize the input data. These preprocessing techniques are crucial in enhancing symmetry consistency across datasets by aligning posture keypoints and hand contours, ensuring model robustness under diverse lighting and camera perspectives.
3. Model Training (CNN + Mediapipe + LSTM Pipeline)
The core of the system is a hybrid learning framework combining convolutional neural networks (CNNs) for spatial feature extraction with Long Short-Term Memory (LSTM) networks for temporal regression modeling. The CNN component is responsible for extracting 2D features from each frame, while the LSTM model with pose depth cues infers spatiotemporal consistency to implicitly approximate 3D pose dynamics. Symmetry plays a key role here: the bilateral alignment of limbs and joint trajectories across time improves both training convergence and model generalization. This hybrid approach is further complemented by random forest classifiers, which operate on the extracted temporal features to classify postures and gestures based on feature importance and decision aggregation.
4. Model Evaluation
The performance of the model is benchmarked against baseline architectures (CNN-only and LSTM-only) using evaluation metrics centered on classification accuracy. The evaluation phase highlights the benefit of combining symmetry-informed vision features with optimization-based temporal modeling, where the CNN–Mediapipe–LSTM configuration demonstrates a higher accuracy and adaptability.
  • Proposed Method: Geometric Symmetry-Aware Temporal Modeling for Pose and Gesture Recognition
In this work, we propose a novel framework that integrates geometric symmetry and temporal optimization for robust human pose and hand gesture recognition in elderly individual monitoring systems. The architecture, illustrated in Figure 3, is structured into the following four stages: data acquisition, data preparation, model training, and performance evaluation. The core contribution lies in the mathematical formalization and exploitation of symmetry, both spatial and temporal, in a data-driven learning pipeline.
1. Feature Representation in Symmetric Space
Let the input video stream be represented as a sequence of image frames $\{I_t\}_{t=1}^{T}$. Each frame $I_t$ is processed to extract a set of 3D body landmarks (e.g., shoulders, elbows, wrists, and hips) and hand landmarks via a CNN-based pose estimation module (e.g., MediaPipe Pose). These are converted into high-dimensional feature vectors, as follows:
$$x_t \in \mathbb{R}^d, \quad t = 1, 2, \ldots, T$$
Here, d denotes the number of features extracted per frame. These features preserve the geometric structure of the body and are spatially normalized to ensure consistent alignment across subjects.
2. Geometric Symmetry Embedding
To explicitly encode geometric symmetry, we define a transformation operator $S$, as follows:
$$S(p_i) = p_j$$
where $p_i, p_j \in \mathbb{R}^3$ are the 3D coordinates of symmetric anatomical points (e.g., left and right wrists). During training, we impose a symmetry regularization term $\mathcal{L}_{sym}$ to enforce the reflective consistency, as follows:
$$\mathcal{L}_{sym} = \sum_{(i,j) \in P} \left\| p_i - S^{-1}(p_j) \right\|^2$$
This constraint encourages the model to learn features that are invariant under geometric reflection, which is essential for recognizing mirrored gestures or symmetrical posture deviations in elderly individuals.
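A minimal sketch of this regularization term is given below, assuming poses are stored as (N, 3) NumPy arrays and that the paired indices follow the MediaPipe Pose convention for left/right shoulders, elbows, wrists, and hips; the mirroring operator, which simply negates the x-axis after centering on a crude body midline, is an illustrative stand-in for $S^{-1}$.
import numpy as np

# Assumed paired symmetric landmark indices (MediaPipe Pose: shoulders, elbows, wrists, hips).
SYMMETRIC_PAIRS = [(11, 12), (13, 14), (15, 16), (23, 24)]

def mirror(point):
    # Reflect a 3D point across the sagittal plane by negating x: an illustrative stand-in for S^{-1}.
    return np.array([-point[0], point[1], point[2]])

def symmetry_loss(pose, pairs=SYMMETRIC_PAIRS):
    # L_sym = sum over (i, j) in P of ||p_i - S^{-1}(p_j)||^2 for one frame of shape (N, 3).
    centered = pose - np.array([pose[:, 0].mean(), 0.0, 0.0])   # center x on a rough body midline
    return sum(np.sum((centered[i] - mirror(centered[j])) ** 2) for i, j in pairs)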
3. Temporal Optimization via Recurrent Modeling
Given the sequential nature of pose and gesture data, we model the temporal dependencies using a multi-layer LSTM, as follows:
$$h_t = \mathrm{LSTM}(x_t, h_{t-1})$$
The LSTM captures both short-term transitions and long-term temporal patterns in the behavior of an individual, allowing for discrimination between similar postures with different motion contexts (e.g., sitting down vs. standing up).
To enhance temporal smoothness, we introduce a temporal regularization loss, as follows:
$$\mathcal{L}_{temp} = \sum_{t=2}^{T} \left\| h_t - h_{t-1} \right\|^2$$
which minimizes abrupt feature transitions and reinforces behavioral continuity—a characteristic of natural human motion.
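The temporal smoothness term can be computed directly from the LSTM hidden-state sequence; the sketch below assumes the hidden states are collected in a (T, d) tensor and that the term is weighted and added to the main training loss.
import torch

def temporal_loss(hidden_states):
    # L_temp = sum_{t=2}^{T} ||h_t - h_{t-1}||^2 for a hidden-state sequence of shape (T, d).
    diffs = hidden_states[1:] - hidden_states[:-1]     # consecutive hidden-state differences
    return (diffs ** 2).sum()

# Typical use (the weight lambda_t is an assumed hyperparameter):
# total_loss = classification_loss + lambda_t * temporal_loss(H)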
4. Depth-Aware Regression Layer
For more accurate 3D interpretation, particularly in fall-risk situations, we incorporate depth cues through an auxiliary regression task. Given an input $x_t$, the model estimates the pose depth $\hat{z}_t \in \mathbb{R}$ as follows:
$$\hat{z}_t = f_\theta(x_t)$$
where $f_\theta$ is a CNN–LSTM network optimized with a regression loss, as follows:
$$\mathcal{L}_{depth} = \frac{1}{T} \sum_{t=1}^{T} (z_t - \hat{z}_t)^2$$
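The auxiliary depth objective reduces to a mean squared error over the sequence; a minimal sketch, assuming the per-frame depth targets and predictions are 1-D tensors of length T:
import torch

def depth_loss(z_true, z_pred):
    # L_depth = (1/T) * sum_t (z_t - z_hat_t)^2, i.e., mean squared error over the sequence.
    return torch.mean((z_true - z_pred) ** 2)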
5. Symmetry-Preserving Classification
The final classification is performed using a random forest on the temporally aggregated hidden states $H = [h_1, \ldots, h_T]$. Let the model $F$ map the sequence to the output label $\hat{y}$, as follows:
$$\hat{y} = F(H)$$
The forest model is selected for its ability to handle non-linear interactions and its transparency in measuring feature importance, which we use to rank the symmetric body features that are most discriminative.
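A sketch of this classification stage using scikit-learn is shown below. The 100 trees and maximum depth of 10 repeat the settings reported earlier in this section, whereas the temporal mean pooling used to collapse the hidden-state sequence H into a fixed-length vector is an assumption, since the paper does not state the exact aggregation.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def aggregate_hidden_states(H):
    # Collapse a (T, d) hidden-state sequence into a single fixed-length vector (temporal mean).
    return H.mean(axis=0)

def train_symmetry_classifier(hidden_sequences, labels):
    # hidden_sequences: list of (T, d) arrays of LSTM hidden states; labels: posture/gesture classes.
    X = np.stack([aggregate_hidden_states(H) for H in hidden_sequences])
    clf = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=0)
    clf.fit(X, labels)
    # feature_importances_ can then be used to rank the most discriminative symmetric features.
    return clf, clf.feature_importances_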
  • Novel Contributions and Knowledge Advancement
This research introduces a symmetry-aware temporal modeling framework with the following novel elements.
Geometric Symmetry Integration: Unlike prior works, we explicitly encode symmetric body structures into the loss function, enhancing generalizability across mirrored actions.
Temporal Optimization Loss: A novel regularization term is introduced to stabilize LSTM outputs across time, improving motion continuity recognition.
Depth-Aware Enhancement: The inclusion of depth cues improves 3D perception in occluded or critical scenarios like falls.
Explainable Classification: The random forest provides interpretable insights into which symmetric joint features most affect the classification outcomes.
This integration of geometric and temporal symmetry principles into an AI-based gesture recognition system represents a significant step toward more intelligent, robust, and explainable monitoring of human activity, especially in the context of elderly care.
  • Hardware Implementation
To support real-time human pose and gesture recognition within an elderly individual monitoring environment, a carefully selected hardware configuration was deployed. The system is designed to be low-cost, non-intrusive, and capable of running computer vision and deep learning algorithms locally or at the edge.
Figure 4 shows the system’s operational control using Raspberry Pi for real-time elderly individual monitoring. The Raspberry Pi serves as the central processing unit, executing the pose estimation and gesture recognition algorithms. The RGB camera module is connected via a USB 3.0 port for video streaming, while GPIO (General Purpose Input/Output) ports are used to interface with peripheral devices such as alert indicators or emergency buttons. The system can communicate with cloud services or local storage via Wi-Fi, allowing for remote monitoring and notification through IoT platforms such as LINE. This setup ensures a low latency and reliable real-time operation within the elderly care environment.
Figure 5 shows the system’s operational control is facilitated by the integration of Raspberry Pi, a versatile and cost-effective single-board computer renowned for its capabilities in various computational tasks. Acting as a central processing unit, the Raspberry Pi orchestrates the execution of algorithms and commands essential for system functionality. Its compact size, low power consumption, and GPIO (General Purpose Input Output) pins make it an ideal choice for interfacing with external hardware components and peripherals. Through the deployment of suitable software frameworks, the Raspberry Pi enables the seamless integration and control of the entire system, ensuring the efficient operation and effective coordination of its constituent elements. The process of connecting Closed-Circuit Television (CCTV) cameras involves integrating cameras installed in various locations with systems for recording images or videos to capture scenes from the monitored areas. This process typically entails the utilization of light sensors such as Charge-Coupled Devices (CCDs) or Complementary Metal–Oxide–Semiconductor (CMOS) sensors to capture images or videos of the observed scenarios within objects or areas under surveillance by the CCTV cameras. Data transmission from CCTV cameras can be achieved through various communication channels, including analog signal systems utilizing coaxial cables to transmit image or video data to recording devices or other image management systems. Additionally, Internet Protocol (IP) network systems utilize Ethernet cables to transmit data to image-recording devices or other connected image management systems in the form of digital signals that are detected and transmitted through the network for convenient and efficient data recording or management processes.
Step 1: As depicted in Figure 5, the system initiates the image acquisition process by establishing a connection between the Raspberry Pi and the RGB camera module. Once the video stream is received, individual frames are extracted for further processing. To enhance computational efficiency and improve the precision of subsequent detection tasks, each frame undergoes a resolution downscaling procedure. The original input resolution of 2560 × 1440 pixels is resampled to 1920 × 1080 pixels using bilinear interpolation. This reduction not only preserves essential structural features, but also simplifies downstream computation by eliminating high-resolution redundancies. Furthermore, focusing on predefined Regions of Interest (ROIs) allows the system to concentrate on symmetrical body structures (e.g., face, torso, and limbs) while disregarding irrelevant background regions.
To formalize this optimization, let the original image be $I_{orig} \in \mathbb{R}^{H_0 \times W_0}$ and the downscaled image be $I_{scaled} \in \mathbb{R}^{H_1 \times W_1}$, where $H_1 < H_0$ and $W_1 < W_0$. The spatial reduction factor $R$ is computed as follows:
$$R = \frac{H_0 \times W_0}{H_1 \times W_1}$$
In this case, we derive the following:
$$R = \frac{2560 \times 1440}{1920 \times 1080} \approx 1.78$$
This corresponds to a reduction of the pixel count by a factor of approximately 1.78 (the downscaled frame retains roughly 56% of the original pixels), significantly decreasing the input dimensionality while retaining sufficient symmetry information for detection tasks. Additionally, the detection region $D(x, y) \subseteq I_{scaled}$ can be defined by a binary mask $M(x, y) \in \{0, 1\}$, as follows:
$$D(x, y) = I_{scaled}(x, y) \cdot M(x, y)$$
This mask identifies and isolates key subregions (e.g., face and hands) to be passed forward for pose or gesture recognition, as shown in Figure 6. By applying this targeted resolution optimization and spatial masking strategy, the system reduces processing latency while maintaining a high detection accuracy, aligning with the principles of spatial symmetry-aware optimization in visual computation.
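The downscaling and masking steps can be sketched with OpenCV and NumPy as follows; the ROI rectangles passed in are placeholders for the face, torso, and hand regions produced by the detector.
import cv2
import numpy as np

def downscale_and_mask(frame, roi_boxes, out_size=(1920, 1080)):
    # Resize (e.g., 2560x1440 -> 1920x1080) with bilinear interpolation, then keep only ROI pixels.
    scaled = cv2.resize(frame, out_size, interpolation=cv2.INTER_LINEAR)
    mask = np.zeros(scaled.shape[:2], dtype=np.uint8)
    for (x, y, w, h) in roi_boxes:                  # ROI boxes given in scaled-image coordinates
        mask[y:y + h, x:x + w] = 1
    detection_region = scaled * mask[..., None]     # D(x, y) = I_scaled(x, y) * M(x, y)
    reduction = (frame.shape[0] * frame.shape[1]) / (scaled.shape[0] * scaled.shape[1])
    return detection_region, reduction              # reduction ~ 1.78 for 2560x1440 -> 1920x1080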
  • Step 2: Feature Extraction and Symmetry-Aware Body–Hand Detection
For body and hand detection (Detect Pose and Hand), this research employs Mediapipe, which consists of a convolutional neural network (CNN) architecture comprising the following two main components: Feature Extraction CNN and Classification. As shown in Figure 7, the Feature Extraction CNN is utilized to extract features or characteristics from the images, while the Classification process involves using models to identify or classify objects or entities within the images. The classification process involves the following steps.
The input consists of images acquired from a camera or video in the form of image frames, with dimensions represented as Win × Hin × Din, where Win and Hin denote the width and height of the image, respectively, and Din represents the number of color channels (for RGB images, Din would be 3).

3.1. Body and Hand Detection Using CNN-Based MediaPipe

Figure 8 shows the system utilizes the MediaPipe framework for real-time pose estimation, incorporating a convolutional neural network (CNN) architecture designed to detect both body posture and hand gestures. The architecture is composed of the following two key modules: a Feature Extraction CNN, which learns the spatial representations of body parts and hand shapes, and a Classification Head, which assigns labels to the detected keypoints.
The input to the network is an RGB image frame represented by a tensor $I \in \mathbb{R}^{W_{in} \times H_{in} \times D_{in}}$, where $W_{in}$ and $H_{in}$ denote the image width and height, and $D_{in} = 3$ for the three color channels (RGB). The convolutional operation can be formalized as follows:
$$F = \sigma(W * I + b)$$
where
  • $F$ is the resulting feature map,
  • $W$ is the learned convolutional filter,
  • $b$ is the bias term,
  • $\sigma$ is a non-linear activation function such as ReLU,
  • $*$ denotes the convolution operation.
These feature maps retain the structural symmetry inherent in human anatomy, such as the mirrored positions of limbs or bilateral hand gestures. This symmetry serves as a validation heuristic, reducing false detections from occlusion or lighting variation and strengthening robustness in real-time environments.

3.2. Temporal Modeling with LSTM and Symmetry Regularization

Once the spatial features are extracted, a Long Short-Term Memory (LSTM) network is employed to model the temporal dynamics of movement across successive video frames. Let $\{F_t\}_{t=1}^{T}$ denote the sequence of feature vectors over $T$ time steps. The LSTM processes this sequence as follows:
$$h_t = \mathrm{LSTM}(F_t, h_{t-1})$$
where
  • $h_t$ is the hidden state at time $t$,
  • $F_t$ is the spatial feature at frame $t$,
  • $h_{t-1}$ is the previous hidden state.
To promote spatial–temporal symmetry, especially in hand gesture and body pose consistency, a symmetry loss term is optionally introduced, as follows:
$$\mathcal{L}_{sym} = \sum_{t=1}^{T} \left\| F_t^{left} - \mathrm{Mirror}(F_t^{right}) \right\|^2$$
Here, $F_t^{left}$ and $F_t^{right}$ refer to the extracted features of paired body parts (e.g., left and right wrists), and $\mathrm{Mirror}(\cdot)$ reflects the spatial alignment across the sagittal axis. This constraint enforces symmetrical consistency across both space and time, resulting in a more stable and accurate representation of posture and gesture.
Together, this step forms the backbone of the system’s symmetry-informed perception pipeline, capable of analyzing behavior patterns in real time with a high structural fidelity.
Convolution, used here for feature extraction, operates on images whose pixel values range from 0 to 255. As shown in Figures 9 and 10, the process uses mathematical operations to extract features by convolving the image with a filter, often referred to as a kernel. The kernel serves to highlight specific features of interest, producing a feature map or output. Mathematically, the feature map, denoted as Y, is obtained by convolving the image with the kernel, as depicted in Equation (2). The computation of $Y_{i,j}$ is expressed as follows.
$$Y_{i,j} = \sigma\left( \sum_{u=0}^{F-1} \sum_{v=0}^{F-1} \sum_{k=0}^{D_{in}-1} x_{i+u,\, j+v,\, k} \times W_{u,v,k} + b \right)$$
where
  • $Y_{i,j}$ is the value at position $(i, j)$ of the feature map,
  • $x_{i+u,\, j+v,\, k}$ is the pixel value at position $(i+u, j+v, k)$ of the input image,
  • $W_{u,v,k}$ is the filter value at position $(u, v)$ in the $k$-th color channel,
  • $b$ is the bias value,
  • $\sigma$ is the ReLU activation function.
Here, $\sigma$ is the Rectified Linear Unit (ReLU), $F$ represents the size of the filter, and $W_{u,v,k}$ are the values of the filter at position $(u, v)$ in the $k$-th color channel.
Pooling, a dimensionality reduction technique, involves using small filters to extract the maximum value from a group of pixels within a feature map. This process employs a specified stride to continuously move the filter across the feature map. A common example of pooling is max pooling, where the maximum value within a 2 × 2 area of the feature map is selected, with the stride determining the extent of movement. Let Z denote the resulting feature map or output after pooling. Dimensionality reduction is typically achieved using max pooling, as illustrated in Figure 11.
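The convolution of Equation (2) and the 2 × 2 max pooling step can be written directly in NumPy, as in the sketch below; a single filter, no padding, stride 1 for the convolution, and stride 2 for pooling are assumed.
import numpy as np

def conv2d_single_filter(x, W, b):
    # Y[i, j] = ReLU( sum_{u, v, k} x[i+u, j+v, k] * W[u, v, k] + b ), as in Equation (2).
    H, Wd, _ = x.shape
    F = W.shape[0]
    Y = np.zeros((H - F + 1, Wd - F + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            Y[i, j] = max(np.sum(x[i:i + F, j:j + F, :] * W) + b, 0.0)
    return Y

def max_pool_2x2(Y):
    # Select the maximum value in each non-overlapping 2 x 2 window (stride 2).
    H, Wd = (Y.shape[0] // 2) * 2, (Y.shape[1] // 2) * 2
    return Y[:H, :Wd].reshape(H // 2, 2, Wd // 2, 2).max(axis=(1, 3))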
The Fully Connected Layer represents the final layer in the convolutional neural network architecture. Its operation involves connecting the data from the preceding layer in a fully connected manner. Subsequently, the SoftMax technique is applied to classify the data, yielding the output O after the fully connected operation. The equation for the fully connected layer is depicted in Equation (3), as illustrated in Figure 12.
$$Z_j = \sigma\left( \sum_{i=0}^{n} W_{ij} \times Y_i + b_j \right)$$
where
  • $Z_j$ is the result for the $j$-th node of the fully connected layer,
  • $W_{ij}$ is the weight of the connection between feature map $i$ and node $j$,
  • $Y_i$ is the value of the feature map,
  • $b_j$ is the bias value,
  • $\sigma$ is the ReLU activation function.
The output (O) represents the result obtained after the fully connected layer, which is subsequently utilized for various tasks according to the objectives of person and hand detection. This is illustrated in Figure 13.
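A compact NumPy sketch of Equation (3) followed by the SoftMax step described above; the layer sizes (64 inputs, 8 output classes) are illustrative assumptions.
import numpy as np

def fully_connected(Y, W, b):
    # Z_j = ReLU( sum_i W_ij * Y_i + b_j ), as in Equation (3).
    return np.maximum(W.T @ Y + b, 0.0)

def softmax(Z):
    # Convert the fully connected output into class probabilities O.
    e = np.exp(Z - Z.max())           # subtract the maximum for numerical stability
    return e / e.sum()

# Example with assumed sizes: a flattened feature map of length 64 mapped to 8 classes.
Y = np.random.rand(64)
W, b = 0.01 * np.random.randn(64, 8), np.zeros(8)
O = softmax(fully_connected(Y, W, b))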
Step 3: Hand pose detection using Mediapipe is employed to facilitate communication through predefined sign language gestures. The process of Mediapipe hand landmark detection commences with palm detection utilizing a dedicated model. Subsequently, the crucial landmarks of the hand, totaling 21 points, are identified through the Hand Landmark Model. This model simulates hand poses based on image detection inputs, as shown in Figure 14.
The Mediapipe hand landmark detection system comprises the following three primary components: hand landmarks, handedness, and a palm detection model.
  • Hand landmarks: This component identifies keypoints on the hand, such as fingertips, finger joints, or other hand features. It consists of an array of points that are detected on the hand.
  • Handedness: This component is used to determine whether the detected hand is left- or right-handed. It provides information indicating “left” or “right” accordingly and is instrumental in distinguishing between left and right hands.
  • Palm detection model: This model is employed to detect the palm of the hand, aiding in determining the position and orientation of the hand in the image. It plays a crucial role in the process of detecting key points on the hand.
The system successfully classifies the detected hand as left-handed, with a confidence score of 0.98396. The result is part of the handedness inference module, where index = 0 corresponds to the left hand and a score above 0.9 indicates a high classification certainty. This information is critical for context-aware gesture interpretation and symmetry-based pose adjustment. The hand landmark model defines 21 three-dimensional key points, each characterized by normalized x, y, and z coordinates. The x and y values are normalized within the range [0.0, 1.0] relative to the image width and height, respectively. This normalization enables the z coordinate to represent the depth of each key point. The origin is defined at the wrist with a z value of zero, and smaller z values indicate points closer to the camera. For instance, Landmark #0 is located at (0.638852, 0.671197, −3.41 × 10−7) and Landmark #1 is at (0.634599, 0.536441, −0.06984). These spatial coordinates provide essential input features for downstream tasks such as gesture classification and symmetry-aware pose estimation. The system detects 21 world-space hand landmarks, where each point is described by its absolute 3D coordinates (x, y, z) in metric space rather than normalized image space. These world landmarks provide physically meaningful positions that are independent of image resolution or camera framing. For example, Landmark #0 is located at (0.067485, 0.031084, 0.055223) and Landmark #1 is at (0.063209, −0.00382, 0.020920). This world coordinate representation is essential for applications requiring spatial reasoning, gesture interaction with physical environments, and depth-aware gesture recognition.
The Mediapipe Pose Landmarker is employed for gesture recognition purposes. It utilizes a set of models to predict key landmarks indicative of gestures. The first model detects the presence of the human body within the image frame, while the second model identifies key landmarks on the body. The Pose Landmarker tracks the positions of 33 key landmarks on the body model, which approximate the locations of various body parts. This is illustrated in Figure 15.
The Pose Landmarker consists of two components, each comprising arrays of landmarks denoted by their x and y coordinates. These coordinates are normalized between 0.0 and 1.0 using the width (x) and height (y) of the image as the main reference. The z-coordinate represents the depth of the key landmarks, with the origin point located at the center of the hips. A lower z-value indicates that the landmark is closer to the camera. The scale of z is consistent with that of x. Additionally, the visibility parameter indicates the probability of a key landmark being visible in the image. The pose detection system extracts 33 three-dimensional body landmarks, with each point described by x, y, and z coordinates, as well as two confidence measures, visibility and presence. The x and y values are normalized within the image frame, while the z coordinate represents depth. Visibility indicates how likely it is that the landmark is visible to the camera, and presence reflects the confidence of its existence in the pose model. For instance, Landmark #0 is located at (0.638852, 0.671197, 0.129995), with a visibility of 0.99999976 and presence of 0.99999845. These enriched annotations provide a robust basis for real-time human activity recognition and symmetry-aware pose analysis. The world landmarks comprise x, y, and z coordinates, representing the three-dimensional coordinates in real-world units, typically measured in meters. These coordinates originate from the center of the hips, serving as the primary reference point. The visibility parameter denotes the probability of a key landmark being visible in the image. The system extracts 33 world-space pose landmarks, where each point is defined by its absolute x, y, and z coordinates in real-world metric space. These values are independent of the image resolution and are used for accurate spatial analysis. Each landmark is also associated with a visibility score, indicating how likely it is that it is visible to the camera, and a presence score, reflecting the model’s confidence in its detection. For example, Landmark #0 is located at (0.067485, 0.031084, 0.055223), with a visibility of 0.99999976 and presence of 0.99999845. Such data enables precise 3D body tracking, useful for applications in human movement analysis and intelligent monitoring systems.
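To illustrate how the 21 hand landmarks and 33 pose landmarks described above can be obtained in practice, the sketch below uses the classic MediaPipe Solutions Python API; the paper does not state its exact interface, so this should be read as one possible implementation rather than the authors' code.
import cv2
import mediapipe as mp

mp_hands, mp_pose = mp.solutions.hands, mp.solutions.pose

frame = cv2.imread("frame.jpg")                       # placeholder: one frame from the RGB camera
rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)

with mp_hands.Hands(static_image_mode=True, max_num_hands=2) as hands, \
     mp_pose.Pose(static_image_mode=True) as pose:
    hand_results = hands.process(rgb)
    pose_results = pose.process(rgb)

if hand_results.multi_hand_landmarks:
    for handedness, hand in zip(hand_results.multi_handedness, hand_results.multi_hand_landmarks):
        label = handedness.classification[0].label    # "Left" or "Right", with a confidence score
        hand_points = [(lm.x, lm.y, lm.z) for lm in hand.landmark]          # 21 normalized landmarks

if pose_results.pose_landmarks:
    body_points = [(lm.x, lm.y, lm.z, lm.visibility)
                   for lm in pose_results.pose_landmarks.landmark]          # 33 normalized landmarks
    world_points = [(lm.x, lm.y, lm.z)
                    for lm in pose_results.pose_world_landmarks.landmark]   # 33 metric-space landmarks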
Step 4: The process of training a model for hand and body pose classification and prediction involves several key steps, outlined as follows.
1. Importing various pose image datasets into the system, as illustrated in Figure 16.
2. Resizing the dataset to have uniform dimensions of 1920 × 1080 pixels, with a total of 100 images, as depicted in Figure 17.
3. Utilizing Mediapipe to determine the keypoints of each image and converting them into Landmarker values representing various parts of the body and hands, as illustrated in Figure 18.
4. Dividing the entire dataset into the following two sets: 80% for the training dataset and 20% for the testing dataset. The training dataset is represented by the red boundary, while the testing dataset is represented by the green boundary, as shown in Figure 19 (a sketch of this preparation pipeline is given below).
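The sketch below combines the resize, landmark extraction, and 80/20 split described in the steps above; the directory layout (dataset/<label>/*.jpg), the flattening of the 33 pose landmarks into one vector, and the use of folder names as class labels are assumptions.
import glob
import cv2
import numpy as np
import mediapipe as mp
from sklearn.model_selection import train_test_split

def frame_to_features(image, pose_detector):
    # Resize to 1920x1080, run MediaPipe Pose, and flatten the 33 landmarks into one feature vector.
    resized = cv2.resize(image, (1920, 1080))
    results = pose_detector.process(cv2.cvtColor(resized, cv2.COLOR_BGR2RGB))
    if results.pose_landmarks is None:
        return None
    return np.array([[lm.x, lm.y, lm.z] for lm in results.pose_landmarks.landmark]).flatten()

X, y = [], []
with mp.solutions.pose.Pose(static_image_mode=True) as pose_detector:
    for path in glob.glob("dataset/*/*.jpg"):          # assumed layout: dataset/<label>/<image>.jpg
        features = frame_to_features(cv2.imread(path), pose_detector)
        if features is not None:
            X.append(features)
            y.append(path.split("/")[-2])              # class label taken from the folder name

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)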
Step 5: Training a random forest model for hand and body pose prediction and classification. The utilization of machine learning algorithms, particularly the random forest algorithm, plays a pivotal role in predicting and classifying hand and body poses in various applications, such as sign language recognition, human–computer interaction, and gesture-based control systems. In this context, the random forest algorithm, a robust ensemble learning method, offers promising capabilities in handling multi-class classification tasks with complex and high-dimensional data.
Entropy is a measure utilized to quantify the uncertainty or randomness present within a dataset, particularly in the context of classification tasks. In the process of classifying data, entropy serves as a metric to assess the level of uncertainty associated with the distribution of classes within the dataset. For a two-class problem it ranges between zero and one (more generally, between zero and log2 of the number of classes), where lower values indicate less uncertainty or higher orderliness in the data, often referred to as data having a discernible pattern. Conversely, higher values signify greater uncertainty or more confusion within the data, suggesting a lack of distinct patterns or significant disorder. Entropy reaches its maximum value when all classes within the dataset occur with equal frequency, indicating a state of maximum uncertainty. Conversely, it achieves its minimum value when there is only one class present with a frequency equal to the total number of instances, representing a state of perfect certainty or orderliness. Mathematically, entropy is calculated using the following formula.
$$\mathrm{Entropy}(S) = -\sum_{i=1}^{n} p_i \cdot \log_2(p_i)$$
where $p_i$ represents the probability of occurrence of class $i$ within the dataset $S$, and $n$ denotes the total number of classes. This formulation captures the degree of uncertainty associated with the class distribution and provides a quantitative measure of the dataset’s entropy, facilitating informed decision making in classification tasks.
Mean Absolute Error (MAE) is a method commonly used to measure prediction error in regression tasks, particularly in predicting numerical values. It quantifies the average magnitude of the differences between the predicted values and the actual target values across all instances in the test dataset. A lower MAE indicates that the model has a higher accuracy in its predictions, while a higher MAE suggests that the model’s predictions deviate more from the actual values. Mathematically, MAE is calculated as follows.
$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|$$
where
  • $n$ is the total number of instances in the test dataset,
  • $y_i$ represents the actual target value for instance $i$,
  • $\hat{y}_i$ represents the predicted value for instance $i$.
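Both metrics can be computed in a few lines, as in the sketch below; the example labels and predictions are illustrative.
import numpy as np

def entropy(labels):
    # Entropy(S) = -sum_i p_i * log2(p_i) over the class distribution of dataset S.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def mean_absolute_error(y_true, y_pred):
    # MAE = (1/n) * sum_i |y_i - y_hat_i|.
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

print(entropy(["sit", "sit", "stand", "stand"]))              # balanced two-class set -> 1.0 bit
print(mean_absolute_error([1.0, 2.0, 3.0], [1.5, 2.0, 2.0]))  # -> 0.5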
Step 6: Gesture and hand gesture classification. The process of modeling the classification of gestures and hand gestures in the long-distance elderly individual monitoring system via the Internet of Things encompasses the collection of datasets consisting of three body gestures and five hand gestures, as illustrated in the images below. Each gesture category comprises 100 images in the dataset, with variations in angle and degree for each image to ensure an optimal classification accuracy. The predefined gestures and their messages are as follows:
  • Hand pose 1: one raised index finger, meaning “Water Please”.
  • Hand pose 2: two raised fingers (index and middle finger), meaning “want food” (Hungry).
  • Hand pose 3: three raised fingers, including the pinky finger and thumb, meaning “Miss you”.
  • Hand pose 4: a four-finger gesture, meaning “help”.
  • Hand pose 5: a thumbs-up gesture, meaning “want to take a shower or go to the bathroom” (Bathroom).
  • Posture 1: a standing pose, meaning “Stand” (Stand).
  • Posture 2: a sitting pose, meaning “Sit” (Sit).
  • Posture 3: a lying position, meaning “Lying” (Sleep).
These gestures are shown in Figures 20 and 21.
Step 7: System decisions and decision principles. For the classification of gestures and hand gestures, the system takes the original video and processes it, transforming each image into a landmark array with x, y, and z coordinates. The x and y coordinates are normalized to the range [0.0, 1.0] using the width and height of the image, respectively, while the z value represents the depth of the important points. These values are then matched against the prepared model, and when a matching or closest value is found, the corresponding predefined message is displayed. As a result, the text cannot be displayed correctly every time, since factors such as angle, degree, and depth are involved, as illustrated in Figures 22 and 23, which detail this procedure.
As shown in Algorithm 1, the classification involves feature extraction, gesture modeling, and decision making.
Algorithm 1: Mathematical Representation of Gesture and Hand Gesture Classification
1. Step 1: Input Video Frame Capture. Let $I$ represent the input frame from the video stream: $I = \mathrm{capture\_frame}(\mathrm{video\_stream})$.
2. Step 2: Preprocessing. Normalize the image dimensions to a standard size $(W, H)$: $I_{\mathrm{resized}} = \mathrm{resize}(I, W, H)$.
3. Step 3: Landmark Detection. Detect hand and body landmarks using MediaPipe, resulting in an array of coordinates $L$: $L = \mathrm{detect\_landmarks}(I_{\mathrm{resized}})$.
4. Step 4: Normalization. Normalize the coordinates $x, y$ to the range $[0.0, 1.0]$ and represent depth with $z$:
   $x_{\mathrm{norm}} = x / W, \quad y_{\mathrm{norm}} = y / H$
   $L_{\mathrm{norm}} = \{ (x_{\mathrm{norm}}, y_{\mathrm{norm}}, z) \mid (x, y, z) \in L \}$
5. Step 5: Model Matching. Compute the distance $D$ between the detected landmarks $L_{\mathrm{norm}}$ and the model landmarks $M$:
   $D(L_{\mathrm{norm}}, M) = \sum_{i=1}^{n} \sqrt{ (x_{\mathrm{norm},i} - x_{m,i})^2 + (y_{\mathrm{norm},i} - y_{m,i})^2 + (z_i - z_{m,i})^2 }$
   and select the model with the minimum distance: $M_{\mathrm{best}} = \arg\min_{M_j} D(L_{\mathrm{norm}}, M_j)$.
6. Step 6: Classification. Output the corresponding message for the best matching model: $\mathrm{message} = \mathrm{get\_message}(M_{\mathrm{best}})$.
7. Step 7: Post-processing and Display. Display the identified gesture and its corresponding message: $\mathrm{display}(\mathrm{message})$.
8. Handling Variability. Factors such as angle, rotation, and depth may affect recognition accuracy; these can be represented by an additional adjustment term in the normalization and matching steps: $L_{\mathrm{norm,adjusted}} = \mathrm{adjust\_for\_variability}(L_{\mathrm{norm}}, \mathrm{angle}, \mathrm{degree}, \mathrm{depth})$.
The full pipeline can be summarized as:
   1. $I = \mathrm{capture\_frame}(\mathrm{video\_stream})$
   2. $I_{\mathrm{resized}} = \mathrm{resize}(I, W, H)$
   3. $L = \mathrm{detect\_landmarks}(I_{\mathrm{resized}})$
   4. $L_{\mathrm{norm}} = \{ (x/W, y/H, z) \mid (x, y, z) \in L \}$
   5. $D(L_{\mathrm{norm}}, M_j) = \sum_{i=1}^{n} \sqrt{ (x_{\mathrm{norm},i} - x_{m,i})^2 + (y_{\mathrm{norm},i} - y_{m,i})^2 + (z_i - z_{m,i})^2 }$
   6. $M_{\mathrm{best}} = \arg\min_{M_j} D(L_{\mathrm{norm}}, M_j)$
   7. $\mathrm{message} = \mathrm{get\_message}(M_{\mathrm{best}})$
   8. $\mathrm{display}(\mathrm{message})$
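A minimal Python sketch of this pipeline is given below, assuming OpenCV and MediaPipe for capture and landmark detection (MediaPipe already returns x and y normalized to [0.0, 1.0]) and a nearest-template search as in Steps 5 and 6. The template dictionary, its keys, and the webcam source are illustrative placeholders rather than the exact models and setup used in this work.

```python
import cv2
import numpy as np
import mediapipe as mp

mp_hands = mp.solutions.hands

def extract_hand_landmarks(frame_bgr, hands):
    """Return a (21, 3) array of normalized hand landmarks, or None if no hand is found."""
    rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
    result = hands.process(rgb)
    if not result.multi_hand_landmarks:
        return None
    lm = result.multi_hand_landmarks[0].landmark
    # MediaPipe returns x, y already normalized to [0.0, 1.0]; z is a relative depth value.
    return np.array([[p.x, p.y, p.z] for p in lm], dtype=np.float32)

def match_gesture(landmarks, templates):
    """Sum of per-landmark Euclidean distances (Step 5), minimized over templates (Step 6)."""
    def distance(a, b):
        return float(np.sqrt(((a - b) ** 2).sum(axis=1)).sum())
    return min(templates, key=lambda name: distance(landmarks, templates[name]))

if __name__ == "__main__":
    # 'templates' would normally be loaded from the prepared models; random data is a placeholder.
    templates = {"Water Please": np.random.rand(21, 3).astype(np.float32),
                 "Hungry": np.random.rand(21, 3).astype(np.float32)}
    cap = cv2.VideoCapture(0)
    with mp_hands.Hands(static_image_mode=False, max_num_hands=1,
                        min_detection_confidence=0.5) as hands:
        ok, frame = cap.read()
        if ok:
            landmarks = extract_hand_landmarks(frame, hands)
            if landmarks is not None:
                print(match_gesture(landmarks, templates))  # display(message)
    cap.release()
```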

4. Experimental Results

The system’s output is divided into the following two sections: the first section for body gestures and the second section for hand gestures. The system classifies three different body gestures, as shown in Figure 24, and five different hand gestures. These gestures are matched with pretrained models, as shown in Figure 25. Upon successful matching, the system displays the corresponding message and sends an alert via the Line application along with an image. This process is illustrated in Figure 26. The dataset used in this study was custom-collected to reflect realistic scenarios in elderly care environments. It comprised a total of six physical activities and five hand gesture classes relevant to non-verbal communication and behavior monitoring. These included standing, sitting, lying down, walking, hand waving (left/right), and emergency or assistance gestures (e.g., “water please”, “hungry”, “help”, and “bathroom”). Data acquisition was conducted using a top-view RGB camera at 30 frames per second under the following four distinct lighting conditions: low light, medium light, bright light, and simulated environments. Each class contained from 100 to 200 samples, resulting in a balanced dataset across all categories. Participants were instructed to perform each gesture multiple times using both the left and right hands to ensure symmetrical representation. Captured frames were resized to a 1920 × 1080 resolution and preprocessed through pixel normalization and background reduction. Landmark extraction was performed using the MediaPipe framework, which detects 33 body keypoints and 21 hand landmarks per frame. The resulting landmark arrays were used as the primary features for model training. The dataset was split into 80% for training and 20% for testing, ensuring consistent evaluation across models. This setup enabled robust model generalization under varying lighting, pose angle, and movement speed conditions commonly found in real-world elderly individual monitoring contexts.
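The preprocessing and 80/20 split described above can be sketched as follows; the folder layout, label names, and use of scikit-learn's train_test_split are assumptions made for illustration, not a description of the authors' exact tooling.

```python
import glob
import cv2
import numpy as np
import mediapipe as mp
from sklearn.model_selection import train_test_split

TARGET_SIZE = (1920, 1080)                                # resolution used in the paper
LABELS = ["stand", "sit", "lie", "walk", "wave", "help"]  # illustrative label names

def frame_to_landmarks(frame_bgr, pose):
    """Resize the frame and extract the 33 body keypoints as a flat (33 * 3,) feature vector."""
    resized = cv2.resize(frame_bgr, TARGET_SIZE)
    rgb = cv2.cvtColor(resized, cv2.COLOR_BGR2RGB)
    result = pose.process(rgb)
    if result.pose_landmarks is None:
        return None
    return np.array([[p.x, p.y, p.z] for p in result.pose_landmarks.landmark],
                    dtype=np.float32).ravel()

features, targets = [], []
with mp.solutions.pose.Pose(static_image_mode=True) as pose:
    for class_idx, label in enumerate(LABELS):
        for path in glob.glob(f"dataset/{label}/*.jpg"):   # assumed folder layout
            vec = frame_to_landmarks(cv2.imread(path), pose)
            if vec is not None:
                features.append(vec)
                targets.append(class_idx)

# 80% training / 20% testing, stratified so every class is represented in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    np.stack(features), np.array(targets), test_size=0.2,
    stratify=targets, random_state=42)
```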

4.1. System Evaluation Results

The evaluation of the real-time remote monitoring system for elderly people and the hand gesture recognition system using computer vision is segmented into four parts.

4.1.1. Low-Light Environment Testing

Experiments were conducted under low-light conditions, with the ambient brightness depicted in Figure 27. The image shows a bedroom captured in very low light, likely at night. The room is dimly lit, making most objects appear dark and indistinct. A bed is visible in the center of the frame, covered with a blanket, with a person apparently sleeping underneath. Surrounding the bed are pieces of furniture, such as a door, cabinets, and a shelving unit, which are only faintly visible due to the limited lighting. This scene represents the type of nighttime environment in which the system must detect posture and behavior during rest.

4.1.2. Medium-Light Environment Testing

Experiments were conducted under medium-light conditions, with the ambient brightness depicted in Figure 28. This image shows a bedroom environment under moderate-lighting conditions, which is one of the experimental settings used to evaluate the behavior or posture recognition system. The image clearly reveals key objects in the room such as the bed, a blanket with a cat pattern, wall-mounted cabinets, a stuffed toy, and a storage shelf, indicating sufficient visibility without being overly bright or too dark. This level of illumination was intentionally chosen to simulate realistic indoor conditions, such as typical room lighting during the evening. It serves to test the system’s accuracy and robustness in detecting body landmarks or human activity when operating under non-optimal, everyday lighting scenarios. This ensures that the system can perform reliably in real-world applications.

4.1.3. High-Light Environment Testing

Experiments were conducted under high-light conditions, with the ambient brightness depicted in Figure 29. This image illustrates a well-lit indoor environment, representing the high-illumination condition used in the experimental setup. The lighting level is bright enough to clearly reveal all objects and surfaces within the room, including the bed, blanket, wall cabinets, toys, furniture, and floor details. This high-brightness condition was applied to evaluate the performance of the human posture or activity recognition system under optimal visual conditions. It ensures that the system’s detection and analysis algorithms can operate with maximum visual clarity, allowing for a benchmark assessment of accuracy when lighting is not a limiting factor.

4.1.4. Simulated Environment Testing

Experiments were conducted in a simulated environment, with the ambient brightness depicted in Figure 30.
The system’s accuracy in detecting hand gestures and body poses and displaying the appropriate messages was evaluated using computer vision. True Positive (TP) indicates the number of correctly identified hand gestures, while False Positive (FP) represents the number of incorrectly identified hand gestures. The evaluation of the real-time remote monitoring system for elderly people, combined with hand gesture recognition using computer vision, was conducted under low-light conditions. The testing involved five participants, each performing five different gestures, with each gesture repeated ten times using both the left and right hands. The system’s accuracy requirement for hand gesture recognition and real-time remote monitoring was set to a minimum of 90%. The results of these tests are depicted in Figure 31.
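Per-gesture accuracy is the fraction of correct identifications over the 50 trials per gesture (five participants, ten repetitions each, performed with both hands), and the overall score is the mean across gestures. The short sketch below reproduces the low-light hand gesture result reported in Table 1; the variable names are illustrative.

```python
# Correct counts out of 50 trials per hand gesture under low light (Table 1).
correct = {"gesture_1": 48, "gesture_2": 47, "gesture_3": 48,
           "gesture_4": 47, "gesture_5": 46}
trials_per_gesture = 50  # 5 participants x 10 repetitions (both hands used)

per_gesture_accuracy = {g: 100.0 * c / trials_per_gesture for g, c in correct.items()}
overall_accuracy = sum(per_gesture_accuracy.values()) / len(per_gesture_accuracy)

print(per_gesture_accuracy)        # {'gesture_1': 96.0, ..., 'gesture_5': 92.0}
print(round(overall_accuracy, 1))  # 94.4
```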
Table 1 presents the results of the hand gesture recognition test conducted under low-light conditions. The findings indicate that the accuracy for Hand Gesture 1 is 96%, Hand Gesture 2 is 94%, Hand Gesture 3 is 96%, Hand Gesture 4 is 94%, and Hand Gesture 5 is 92%. Overall, the system demonstrates an accuracy rate of 94.4% under low-light conditions.
The test results of the real-time remote monitoring system for elderly people and hand gesture recognition using computer vision in low-light conditions were evaluated by five participants. Each participant tested three different gestures, with each gesture being performed ten times, both from the front and the back. For the real-time remote monitoring system and hand gesture recognition using computer vision to be considered effective, the accuracy must not fall below 90%, as illustrated in Figure 32.
Table 2 presents the results of the gesture recognition tests conducted under low-light conditions. The findings indicate that the accuracy for Gesture 1 is 96%, Gesture 2 is 90%, and Gesture 3 is 96%. Overall, the system demonstrates an average accuracy of 94% for gesture recognition in low-light conditions.
The test results for the real-time remote monitoring system for elderly people and hand gesture recognition using computer vision under moderate-lighting conditions were evaluated by five participants. Each participant tested five different gestures, with each gesture being performed ten times using both the left and right hands. For the real-time remote monitoring system and hand gesture recognition using computer vision to be deemed effective, the accuracy must not fall below 90%, as illustrated in Figure 33.
In Table 3, the test results for various hand gestures are as follows: the gesture of raising the index finger is recognized as the phrase “Water Please”; the gesture of raising both the index and middle fingers is recognized as the phrase “Hungry”; the gesture of raising the index, middle, and ring fingers is recognized as the phrase “Miss you”; and the gesture of an open hand is recognized as the phrase “Help.” Table 3 presents the results of the hand gesture recognition tests conducted under moderate-lighting conditions. The findings indicate that the accuracy for Hand Gesture 1 is 100%, Hand Gesture 2 is 96%, Hand Gesture 3 is 96%, Hand Gesture 4 is 96%, and Hand Gesture 5 is 94%. Overall, the system demonstrates an average accuracy of 96.4% for hand gesture recognition under moderate-lighting conditions. The body posture recognition results under moderate-lighting conditions were also evaluated by five participants. Each participant tested three different postures, with each posture performed ten times from both the front and the back. For the real-time remote monitoring system to be considered effective, the accuracy must not fall below 90%.
From Figure 34, the test results indicate the following: for the standing posture, the accuracy rate is 96%; for the sitting posture, the accuracy rate is 94%; and for the lying posture, the accuracy rate is 96%. These findings correspond with Table 4, which presents the results of the posture recognition tests conducted under moderate-lighting conditions. The accuracy rates for Posture 1, Posture 2, and Posture 3 are 96%, 94%, and 96%, respectively. Consequently, the overall accuracy rate for posture recognition under moderate-lighting conditions is 95.33%.
The test results for the real-time remote monitoring system for elderly people and hand gesture recognition using computer vision under well-lit conditions were assessed by five participants. Each participant executed five distinct gestures, with each gesture being performed ten times using both the left and right hands. The system’s accuracy for real-time remote monitoring and hand gesture recognition must exceed 90% to be considered effective, as depicted in Figure 35. According to the test results depicted in Figure 35, the following interpretations are made: the gesture of raising the index finger is recognized as “Water Please”; the gesture of raising both the index and middle fingers is interpreted as “Hungry”; the gesture involving the index, middle, and ring fingers is identified as “Miss you”; and the gesture of an open hand is classified as “Help”.
Table 5 presents the outcomes of the hand gesture recognition assessments conducted under well-lit conditions. The results indicate that Hand Gestures 1 and 4 achieve an accuracy rate of 98%, Hand Gestures 2 and 3 attain accuracy rates of 96%, and Hand Gesture 5 registers an accuracy rate of 92%. Consequently, the aggregate accuracy of the hand gesture recognition system in well-lit settings is 96%.
The evaluation results of the real-time remote monitoring system and hand gesture recognition using computer vision for elderly individuals in well-lit environments were obtained with a sample size of five participants. Each participant performed three distinct postures, both facing forward and turning backward, for ten repetitions each. The system’s performance criteria necessitate an accuracy rate of no less than 90%, as depicted in Figure 36. Referring to the depicted figure, the test outcomes are organized by posture: standing, sitting, and lying.
Table 6 presents the outcomes of the posture recognition tests conducted in well-lit environments. The results indicate that Posture 1 achieves an accuracy rate of 98%, Posture 2 attains an accuracy rate of 96%, and Posture 3 achieves an accuracy rate of 96%. Consequently, the overall accuracy of the posture recognition system in well-lit environments is 96.67%.
The results of the simulated scenario testing area for the real-time remote monitoring system and hand gesture recognition using computer vision for elderly individuals were assessed with a sample size of five participants. Each participant executed five distinct gestures, with each gesture being performed ten times using both the left and right hands. The system’s performance criteria stipulate an accuracy rate of no less than 90%, as depicted in Figure 37.
From Figure 37, the interpretations of the test results are as follows: the gesture of raising the index finger is recognized as “Hungry”; the gesture of raising both the index and middle fingers is interpreted as “Miss you”; the combined gesture involving the index, middle, and ring fingers is classified as “Bathroom”; and the gesture of an open hand is categorized as “Water Please”.
Table 7 exhibits the outcomes of the hand gesture recognition tests conducted within the simulated testing area. The table summarizes the evaluation results of the hand gesture recognition system based on five different hand signs. Each gesture was tested 50 times, and the table reports the number of correct and incorrect classifications, along with the corresponding accuracy value. The first row represents a gesture with the index and middle fingers crossed, achieving a 96% accuracy with 48 correct predictions. The second gesture, showing a “V” sign, has an accuracy of 92%, with 46 correct out of 50. The third gesture, resembling the “I love you” hand sign, records a 90% accuracy with 45 correct predictions. The fourth and fifth gestures, both involving four or more fingers extended, achieve a 94% accuracy, each with 47 correct predictions and 3 misclassifications. These results demonstrate the system’s strong ability to generalize across multiple hand poses with a consistently high performance. The average accuracy across all gestures is approximately 93.2%, indicating the reliability of the proposed approach under controlled testing conditions.
Figure 38 illustrates the pose classification results for daily activities in a living room setting using a skeleton-based recognition system. Four different frames are shown. The top-left image captures the subject in a sitting posture on a sofa. The pose landmarks are accurately plotted, with the system correctly recognizing the activity as “Sit”. The top-right image shows the subject standing upright. The system identifies the alignment of the upper and lower body and classifies the activity as “Stand”. The bottom-left and bottom-right images demonstrate variations of a reclining or lying posture. Despite the difference in limb positions and camera angles, the system consistently detects the activity as “Sleep”, indicating strong generalization capabilities. The skeleton overlay and labels highlight the effectiveness of the proposed approach in detecting human poses under indoor environmental conditions with furniture and decorative objects in the background.
After each real-time remote monitoring and hand gesture assessment, the system sends a notification via the Line application. When a hand gesture changes or sign-language-style communication is required, the system transmits an image of the elderly subject together with the corresponding text message through the Line platform, as shown in Figure 39.
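A common way to implement such alerts is through LINE's Notify HTTP endpoint, sketched below with the requests library. The access token, message text, and the choice of LINE Notify (rather than the full LINE Messaging API) are assumptions made for illustration, not a description of the authors' exact integration.

```python
import requests

LINE_NOTIFY_URL = "https://notify-api.line.me/api/notify"
ACCESS_TOKEN = "YOUR_LINE_NOTIFY_TOKEN"  # placeholder token issued for the caregiver group

def send_alert(message, image_path=None):
    """Send a text alert (optionally with a captured frame) to the caregiver's LINE group."""
    headers = {"Authorization": f"Bearer {ACCESS_TOKEN}"}
    data = {"message": message}
    files = {"imageFile": open(image_path, "rb")} if image_path else None
    response = requests.post(LINE_NOTIFY_URL, headers=headers, data=data, files=files)
    return response.status_code  # 200 indicates the notification was accepted

# Example: triggered after the classifier recognizes the "Help" gesture.
# send_alert("Help requested by the elderly resident", "captured_frame.jpg")
```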
This computer vision-based real-time remote monitoring and hand gesture recognition system for elderly people has undergone testing using both hand gestures and body postures, aligning with the specified objectives and scope. Performance evaluation entails measuring accuracy in terms of percentages for both hand gestures and body postures. The test results reveal that, for hand gesture assessment, the system achieved an accuracy of 94.4% in low-light conditions, 96.4% in moderate-light conditions, 96.0% in well-lit conditions, and 93.2% in the simulated environment. Similarly, body posture assessment yielded an accuracy of 94.0% in low-light conditions, 95.3% in moderate-light conditions, 96.67% in well-lit conditions, and 93.3% in the simulated environment.
Table 8 provides a comprehensive comparison of the accuracies of various action recognition methods over a span of ten years, from 2014 to 2024. The analysis reveals a clear trend of increasing accuracy, demonstrating the advancements in action recognition techniques and technologies. In 2014, the Two Streams (RGB+OF) method achieved an accuracy of 88.0%. The following year, 2015, saw methods such as C3D+Linear SVM and LSTM30+OF+RGB, with accuracies of 85.2% and 88.6%, respectively. By 2016, significant improvements were observed with the introduction of S:VGG-16, T:VGG-16 (92.5%), and TSN (3 modalities), achieving 94.2%. However, the ST-LSTM+Trust Gate method exhibited a considerably lower accuracy of 69.2%, highlighting the variability in performance among different approaches within the same year. The year 2017 marked the emergence of highly accurate methods, such as LTC (92.7%), I3D (98.0%), T3D(+TSN) (93.2%), P3D ResNet (88.6%), and L2STM (93.6%). Despite this overall trend, some methods like STA-LSTM in 2018 showed a lower accuracy of 73.4%, contrasting with the high performance of methods like R(2+1)D-Two (97.3%) in the same year. In 2019, the R(2+1)D-152 method demonstrated a lower accuracy of 81.3%. By 2020, approaches like the Fully Connected Layer achieved an 88.55% accuracy, and a feature-threshold-based method focusing on object height/width ratio, ratio change speed, and MHI reached 95.16%. The most recent advancements in 2024 include the YOLOv8 model’s pose detection, which achieved remarkable accuracies of 98.86% (LE2I) and 96.23% (URFD). The proposed method also demonstrated a high accuracy of 98.0%, showcasing the effectiveness of the latest techniques in action recognition.
  • Dataset and Evaluation Metrics
Experiments were conducted using a custom-collected dataset consisting of six common activities performed by elderly individuals and four hand gestures, including standing, sitting, walking, lying down, hand waving (left/right), and help-seeking gestures. Each class included 100–200 samples captured from a top-view RGB camera at 30 fps. The evaluation metrics, illustrated by a brief computation sketch after the list, included the following:
  • Accuracy (%)
  • Precision, Recall, and F1-Score
  • Frame-Wise Latency (ms)
  • Confusion Matrix
  • Symmetry Consistency Index (SCI)
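Standard classification metrics can be computed from the test-set predictions as sketched below using scikit-learn; the prediction arrays are placeholders, and the Symmetry Consistency Index (SCI) is specific to this work and is therefore not reproduced here.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_recall_fscore_support,
                             confusion_matrix)

# Placeholder ground-truth labels and model predictions for the 20% test split.
y_true = np.array([0, 1, 2, 2, 1, 0, 3, 3, 4, 5])
y_pred = np.array([0, 1, 2, 1, 1, 0, 3, 3, 4, 5])

accuracy = accuracy_score(y_true, y_pred) * 100.0
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0)
cm = confusion_matrix(y_true, y_pred)

print(f"Accuracy: {accuracy:.1f}%  Precision: {precision:.2f}  "
      f"Recall: {recall:.2f}  F1: {f1:.2f}")
print(cm)  # rows: true classes, columns: predicted classes
```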
We compared the proposed symmetry-aware model (Sym-LSTM) with the following baselines in Table 9 and Table 10.
  • Experimental Analysis and Interpretation
The proposed framework was rigorously evaluated through a series of experiments designed to assess its effectiveness in recognizing human poses and hand gestures within an elderly individual monitoring context. Our focus was to analyze how the incorporation of symmetry, both geometric and temporal, enhanced the robustness and interpretability of the recognition process compared to baseline models.
  • Classification Accuracy and Performance Trends
Figure 40 presents a comparison of model performance in terms of overall classification accuracy and F1 scores. The baseline CNN-only model, which operates on individual frames without temporal context, achieved an accuracy of 82.4%. When sequential modeling was introduced via LSTM and Bi-LSTM, performance improved to 88.1% and 89.3%, respectively, reflecting the benefit of capturing temporal dynamics. In contrast, our symmetry-integrated model (Sym-LSTM) outperformed all baselines, attaining a classification accuracy of 92.7% and an F1 score of 0.91. These results demonstrate the efficacy of encoding reflectional consistency into the training process. The improvement can be attributed to the model’s ability to generalize across mirrored gestures (e.g., waving with either hand) and maintain coherent predictions during smooth transitions in human posture.
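For reference, a minimal PyTorch sketch of a landmark-sequence classifier in the CNN/LSTM family compared above is shown below; the layer sizes, sequence length, and class count are illustrative assumptions, not the exact Sym-LSTM configuration.

```python
import torch
import torch.nn as nn

class LandmarkLSTMClassifier(nn.Module):
    """Frame-wise landmark features -> (Bi)LSTM temporal model -> gesture/pose class."""
    def __init__(self, n_landmarks=33, n_classes=8, hidden=128, bidirectional=True):
        super().__init__()
        in_dim = n_landmarks * 3                          # (x, y, z) per keypoint
        self.frame_encoder = nn.Sequential(               # per-frame feature extractor
            nn.Linear(in_dim, 256), nn.ReLU(), nn.Dropout(0.2))
        self.lstm = nn.LSTM(256, hidden, batch_first=True,
                            bidirectional=bidirectional)
        out_dim = hidden * (2 if bidirectional else 1)
        self.head = nn.Linear(out_dim, n_classes)

    def forward(self, x):                                 # x: (batch, time, n_landmarks * 3)
        feats = self.frame_encoder(x)
        seq_out, _ = self.lstm(feats)
        return self.head(seq_out[:, -1])                  # classify from the last time step

# Example: a batch of 4 sequences, 30 frames each, 33 body keypoints.
model = LandmarkLSTMClassifier()
logits = model(torch.randn(4, 30, 33 * 3))
print(logits.shape)   # torch.Size([4, 8])
```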
  • Confusion Matrix Insights
As illustrated in the confusion matrix (Figure 41), most pose categories were classified with a high precision. The model showed a particularly strong discriminative power in distinguishing between symmetric gestures such as left- and right-hand waving, where standard models often exhibited ambiguity due to appearance similarity. Minor misclassifications were observed between the “sitting” and “lying down” classes, which share overlapping features in camera space. However, these instances were significantly reduced compared to the CNN-only and LSTM baselines, indicating the impact of geometric symmetry constraints in preserving body structure integrity.
  • Ablation Study and Loss Component Evaluation
To evaluate the contribution of each component in our model, we conducted an ablation study in which the individual modules geometric symmetry loss, temporal continuity loss, and depth estimation were selectively removed. The results showed that excluding the geometric loss term led to a 3.8% drop in accuracy, underscoring the role of structural consistency in pose representation. Similarly, omitting temporal regularization resulted in more erratic predictions, particularly during transitional motions such as sitting down or standing up. These findings affirm that the deliberate integration of symmetry constraints into the learning process not only enhances performance, but also enforces interpretable behavior aligned with the physical structure and rhythm of human motion.
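The two auxiliary loss terms examined in the ablation can be sketched as follows; the mirrored-joint index pairs, weighting factors, and exact formulation are assumptions used to illustrate the idea of reflectional and temporal consistency penalties, not the paper's precise loss functions.

```python
import torch

# Illustrative left/right joint index pairs (shoulders, elbows, wrists, hips, knees, ankles).
MIRROR_PAIRS = [(11, 12), (13, 14), (15, 16), (23, 24), (25, 26), (27, 28)]

def geometric_symmetry_loss(keypoints):
    """Penalize asymmetric deviation of mirrored joints about the body's vertical midline.

    keypoints: (batch, time, 33, 3) normalized (x, y, z) coordinates.
    """
    mid_x = keypoints[..., 0].mean(dim=2)                 # per-frame body midline, (batch, time)
    loss = 0.0
    for left, right in MIRROR_PAIRS:
        left_offset = keypoints[:, :, left, 0] - mid_x
        right_offset = mid_x - keypoints[:, :, right, 0]
        loss = loss + (left_offset - right_offset).pow(2).mean()
    return loss / len(MIRROR_PAIRS)

def temporal_continuity_loss(keypoints):
    """Penalize abrupt frame-to-frame jumps, encouraging smooth posture transitions."""
    velocity = keypoints[:, 1:] - keypoints[:, :-1]
    return velocity.pow(2).mean()

def total_loss(class_loss, keypoints, w_geo=0.1, w_temp=0.1):
    """Classification loss augmented with the two symmetry-aware regularizers."""
    return (class_loss
            + w_geo * geometric_symmetry_loss(keypoints)
            + w_temp * temporal_continuity_loss(keypoints))
```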
  • Real-Time Feasibility
Despite the increased complexity introduced by the additional loss terms and sequential modeling, the model maintained a practical inference speed of approximately 7–9 frames per second on the Jetson Xavier NX edge device (NVIDIA Corporation, Santa Clara, CA, USA). The model also remained lightweight (~18 MB), allowing for deployment in real-world elderly care settings without requiring cloud connectivity or high-end computational resources.
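Frame-wise latency and throughput on the edge device can be estimated with a simple timing loop such as the one below; the model call, input shape, and warm-up count are placeholders rather than the exact benchmarking procedure used in this work.

```python
import time
import torch

def benchmark(model, sample, n_warmup=10, n_runs=100):
    """Return average per-frame latency (ms) and frames per second for a model."""
    model.eval()
    with torch.no_grad():
        for _ in range(n_warmup):            # warm-up to stabilize caches and clocks
            model(sample)
        start = time.perf_counter()
        for _ in range(n_runs):
            model(sample)
        elapsed = time.perf_counter() - start
    latency_ms = 1000.0 * elapsed / n_runs
    return latency_ms, 1000.0 / latency_ms

# Example with the illustrative classifier sketched earlier:
# latency, fps = benchmark(LandmarkLSTMClassifier(), torch.randn(1, 30, 33 * 3))
# print(f"{latency:.1f} ms/frame, {fps:.1f} FPS")
```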
  • Theoretical and Practical Implications
From a theoretical standpoint, this study demonstrates that embedding geometric and temporal symmetry directly into deep learning architectures yields tangible performance gains. More importantly, it opens a pathway toward symmetry-aware AI models that are not only accurate, but also interpretable and biologically plausible. In practical applications, such models have direct implications for assistive technologies, offering the real-time recognition of postural changes, identifying abnormal gestures, and potentially predicting fall risks, all within a framework that respects the natural symmetry of human behavior.
Figure 41 shows the training and validation performance across 20 epochs. The left subfigure shows a consistent decline in both training and validation loss, indicating the stable convergence of the model during optimization. The right subfigure presents the corresponding accuracy curves, which demonstrate progressive improvement and alignment between the training and validation trends. This behavior suggests that the model generalizes well and does not exhibit overfitting, benefiting from the integration of geometric and temporal symmetry constraints into the learning process.
Figure 42 shows an illustration of symmetric keypoint detection in a human pose sequence. Left- and right-limb movements exhibit mirror symmetry, which is explicitly encoded in the model through geometric reflection constraints. This feature enhances the recognition accuracy for mirrored gestures and improves generalization to unseen subjects.
Figure 43 shows the Receiver Operating Characteristic (ROC) curve (left) and Precision–Recall (PR) curve (right) for the binary classification scenario. The area under the ROC curve (AUC) demonstrates the model’s ability to distinguish between positive and negative classes. The PR curve emphasizes the balance between Precision and Recall, which is particularly informative in imbalanced settings or when prioritizing detection sensitivity.
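Curves of this kind can be produced from binary scores with scikit-learn and matplotlib as sketched below; the label and score arrays are placeholders, not the experimental data behind Figure 43.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc, precision_recall_curve

# Placeholder ground-truth labels and predicted probabilities for the positive class.
y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0, 1, 0])
y_score = np.array([0.1, 0.3, 0.8, 0.7, 0.4, 0.9, 0.6, 0.2, 0.75, 0.35])

fpr, tpr, _ = roc_curve(y_true, y_score)
precision, recall, _ = precision_recall_curve(y_true, y_score)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(fpr, tpr, label=f"AUC = {auc(fpr, tpr):.2f}")       # ROC curve
ax1.set_xlabel("False Positive Rate"); ax1.set_ylabel("True Positive Rate"); ax1.legend()
ax2.plot(recall, precision)                                   # Precision-Recall curve
ax2.set_xlabel("Recall"); ax2.set_ylabel("Precision")
plt.tight_layout(); plt.show()
```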
Figure 44 shows a comparison of the inference runtime (milliseconds per frame) and model size (megabytes) across four architectures. Although the proposed Sym-LSTM model incurs a modest increase in computational cost compared to simpler models, it remains efficient enough for real-time deployment. The trade-off between performance and resource usage is justified by the substantial gain in recognition accuracy and robustness.
As shown in Table 11, the complete model yields the highest classification accuracy of 92.7%, validating the effectiveness of the integrated design. When the geometric symmetry constraint (Lgeo) is removed, accuracy drops sharply to 88.9%. This indicates that explicitly encoding reflectional symmetry in body pose significantly improves the model’s ability to generalize across mirrored gestures, such as distinguishing between left- and right-hand movements. Eliminating the temporal continuity loss (Ltemp) also results in a noticeable decline in performance. Without this constraint, the model becomes less robust to transitions between similar postures, such as sitting down versus lying down, which are often temporally smooth but spatially ambiguous. Lastly, removing the depth regression module marginally reduces accuracy to 90.1%. While the loss in performance is less severe, this component still contributes meaningfully by helping the model to disambiguate poses that appear similar in two-dimensional space but differ in spatial configuration, which is particularly important in applications such as fall detection or seated posture analysis. These findings affirm that each module, particularly those incorporating symmetry and temporal structure, plays a critical role in the overall success of the model. The ablation study not only supports the architectural choices made, but also highlights the importance of aligning model design with the inherent properties of human motion, such as bilateral symmetry and temporal consistency.

5. Conclusions

This study presents a real-time monitoring and gesture recognition system tailored for elderly care, evaluated under diverse environmental conditions to assess its robustness, adaptability, and real-world applicability. The system was tested across four lighting scenarios, including low-light, medium-light, high-light, and simulated environments, using five participants performing five hand gestures and three body postures, each repeated multiple times with both hands and from multiple orientations. Despite variability in environmental conditions, the system consistently demonstrated a strong recognition performance, with its overall accuracy across all scenarios exceeding 90%. The highest performance was observed in medium- and high-light conditions, where hand gesture recognition achieved 96.4% and 96%, respectively. Posture recognition in the same settings reached 95.33% and 96.67%. In low-light conditions, although there was a slight reduction in accuracy (94.4% for hand gestures and 94% for postures), the system retained an acceptable reliability given the challenges of visual degradation. In simulated environments, where variability in background and movement was introduced, the system maintained a 93.2% accuracy for hand gestures and 93.3% for body postures. The system also proved resilient to changes in user distance from the camera, maintaining a gesture recognition accuracy above 90% within a range from 30 to 150 cm. However, accuracy began to degrade beyond 150 cm, primarily due to the reduced resolution of spatial features necessary for keypoint detection. Additionally, while integration with the LINE messaging API allowed for seamless real-time notifications, each alert resulted in a minor performance dip (~0.1 s delay per message), which did not affect the stability of the overall system.
This research offers several novel contributions to the field of intelligent monitoring and human–machine interaction with the proposed symmetry-informed pose and gesture recognition framework. The system introduces a structured pipeline that integrates spatial symmetry, especially the bilateral relationships in body and hand structure, into both detection and classification processes. This improves robustness in noisy or visually degraded settings by leveraging inherent anatomical regularities. It also enables real-time vision-based monitoring without wearable sensors. Unlike systems that rely on on-body sensors, this solution offers a non-intrusive approach using only standard RGB cameras, providing greater comfort and scalability for elderly individuals in everyday settings. In addition, it provides environmental adaptability through robust feature normalization. The framework accounts for variability in lighting and spatial perspective through image preprocessing and landmark normalization, enabling the model to adapt without significant retraining across different deployment conditions. Furthermore, it incorporates an integrated alerting mechanism with intelligent filtering. The system integrates behavioral monitoring with a practical IoT messaging interface and intelligently triggers alerts based on classified gestures or postures, reducing caregiver response time and enhancing real-time safety interventions. Finally, we performed performance validation based on realistic use cases. The evaluation methodology focused not only on static accuracy metrics, but also on functional effectiveness under environmental and operational constraints.
This included frame rate impact, distance-based accuracy drop-off, and responsiveness to gesture transitions.

6. Discussion

The experimental findings presented in this study provide strong empirical support for the hypothesis that integrating geometric and temporal symmetry into deep learning architectures can significantly enhance the accuracy and robustness of human pose and gesture recognition systems, particularly in the context of elderly individual monitoring. A key aspect of the proposed model lies in its explicit treatment of geometric symmetry. By modeling the human body as a bilaterally symmetric structure, the network is trained to recognize corresponding patterns across mirrored joints, such as left and right limbs. This symmetry constraint not only improves the generalization to unseen gestures, but also increases the model’s tolerance to occlusions and irregular viewpoints, which are common in non-controlled environments like home settings. The ablation results clearly show that removing the geometric symmetry loss leads to a considerable drop in accuracy, highlighting the importance of preserving structural coherence during learning. Furthermore, the inclusion of temporal continuity constraints addresses the challenge of distinguishing between postures that may be spatially similar but evolve differently over time. For example, the act of sitting down and lying down may exhibit overlapping keypoint positions at certain frames, but their temporal trajectories diverge. The model’s ability to recognize such patterns relies on its capacity to learn motion consistency and temporal rhythm. This property aligns with the idea of temporal symmetry, where the transition between physical states adheres to predictable, cyclic, or balanced sequences. The consistent performance improvements observed across models with temporal regularization affirm this design decision. Another important dimension of the model’s effectiveness is its integration of depth estimation as an auxiliary task. While the contribution of this component is comparatively modest, it enhances the model’s understanding of three-dimensional spatial context, particularly useful in distinguishing between subtle variations in seated or reclined positions. This feature is particularly valuable in real-world elderly care applications, where identifying falls or abnormal sitting postures can be critical. From a broader perspective, the results suggest that symmetry is not merely a mathematical abstraction, but a powerful inductive bias that can be harnessed to inform neural network learning. In this regard, our work contributes to a growing body of research that seeks to align machine learning architectures with fundamental principles found in nature, such as anatomical balance and temporal regularity. Despite these strengths, there are limitations that warrant further investigation. For instance, gestures with highly asymmetric motion patterns (e.g., reaching with one hand while rotating the torso) may not fully benefit from symmetry constraints and could require adaptive weighting mechanisms. In addition, while the current model performs well in indoor environments, its robustness under varying lighting conditions and occlusion severity should be assessed in future work. In summary, this discussion underscores the value of incorporating domain-specific structural priors, namely symmetry and temporal smoothness, into deep learning frameworks for human behavior recognition. These findings open up new avenues for designing interpretable, efficient, and context-aware systems capable of operating in dynamic real-world settings.

Author Contributions

Conceptualization, P.B. and M.K.; methodology, M.K.; software, P.B. and M.K.; validation, P.B. and M.K.; formal analysis, P.B. and M.K.; investigation, M.K.; data curation, M.K.; writing—original draft preparation, P.B. and M.K.; writing—review and editing, M.K.; visualization, M.K.; supervision, M.K.; project administration, M.K.; funding acquisition, P.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The dataset used in this study was collected by the authors and is not publicly available due to privacy concerns. Access to the dataset can be granted upon request, subject to approval by the data owner and appropriate ethical review.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Department of Older Persons. Current Situation of the Elderly Society and Economy in Thailand. Available online: https://www.dop.go.th/th/know/15/926 (accessed on 17 July 2025).
  2. Department of Older Persons. Elderly Individuals Refer to Persons Aged 60 and Over, Entitled to Protection, Promotion, and Support in Various Aspects. Available online: https://www.dop.go.th/th/know/15/646 (accessed on 17 July 2025).
  3. Department of Older Persons. Elderly Care. Available online: https://www.dop.go.th/th/know/15/741 (accessed on 17 July 2025).
  4. Pimkrai, A.; Nantawong, P.; Muangkaew, R.; Tritipornrak, A.; Tipajak, N. Cost of Dementia Disease in Chiangmai Province; Research Report; Chiang Mai Neurological Hospital: Chiang Mai, Thailand, 2020. [Google Scholar]
  5. Busabong, K. Fall Detection System for the Elderly Using Earth Gravity. Ph.D. Thesis, Mahasarakham University, Maha Sarakham, Thailand, 2019. [Google Scholar]
  6. Kiatsin, G.; Kanokwan, B.; Nawaporn, L.; Noppachai, K. Network-Based Monitoring and Notification System for the Elderly. Bachelor’s Thesis, Mahasarakham University, Maha Sarakham, Thailand, 2019. [Google Scholar]
  7. Jantnipa, K.; Kornchanok, P. 3-Axis Accelerometer-Based Fall Detection System. Bachelor’s Thesis, Suranaree University of Technology, Nakhon Ratchasima, Thailand, 2013. [Google Scholar]
  8. Kittisak, B.; Supassara, J.; Wasawee, S.; Sansasri, M.; Manachai, T. A Real-Time Mobility-Related Activity Tracking System for Mobility and Fall Risk Assessment in Elderly People. JIST 2016, 6, 16–24. [Google Scholar]
  9. Tangwongcharoen, W.; Saklertwilai, S.; Pimpunchat, B. Internet-Based Assistance Request Detection Program for the Disabled. Bachelor’s Thesis, King Mongkut’s Institute of Technology Ladkrabang, Bangkok, Thailand, 2020. [Google Scholar]
  10. Kongsiriwattana, W.; Whangsirikunchok, P.; Chantigo, W. Student Class Attendance and Interest Assessment System with Facial Expression Detection via Webcam. Kmitl_SciJ. 2021, 30, 42–57. [Google Scholar]
  11. Utkamtiang, J.; Prisutsuntorn, N.; Bunpan, K.; Kumechai, P. Tracking Camera by Using Face Recognition in The Barracks. In 2nd National Academic Conference on Management in Disruptive Technologies Era; Chanthorakotikasika, K., Ed.; College of Management Innovation, Rajamangala University of Technology Rattanakosin: Nakhon Pathom, Thailand, 2020; ISBN 978-974-625-895-1. [Google Scholar]
  12. Paliyawan, P. Office Workers Syndrome Monitoring Using Kinect. Master’s Thesis, School of Information Technology, King Mongkut’s University of Technology Thonburi, Bangkok, Thailand, 2014. [Google Scholar]
  13. Kulkasem, P.; Rasmequan, S.; Jantarakongkul, B.; Chinnasarn, K.; Rodtook, A.; Lursinsap, C.; Yajai, A. Fall Detection System for Monitoring an Elderly Person in Elderly Care Center; Research Report; Faculty of Informatics, Burapha University: Chonburi, Thailand, 2015. [Google Scholar]
  14. Potha, S. Development of a Student Attendance System Using Face Detection Technique on Raspberry Pi; Research Report; Department of Electrical Engineering, Faculty of Engineering, Chiang Mai University: Chiang Mai, Thailand, 2021. [Google Scholar]
  15. Lee, K.; Lee, I.; Lee, S. Propagating LSTM: 3D Pose Estimation Based on Joint Interdependency. In Proceedings of the 15th European Conference, Munich, Germany, 8–14 September 2018. Proceedings, Part VII. [Google Scholar] [CrossRef]
  16. Samkari, E.; Arif, M.; Alghamdi, M.; AlGhamdi, M.A. Human Pose Estimation Using Deep Learning: A Systematic Literature Review. Mach. Learn. Knowl. Extr. 2023, 5, 1612–1659. [Google Scholar] [CrossRef]
  17. BenGamra, M.; Akhloufi, M.A. A Review of Deep Learning Techniques for 2D and 3D Human Pose Estimation. Image Vis. Comput. 2021, 114, 104282. [Google Scholar] [CrossRef]
  18. Buzzelli, M.; Albé, A.; Ciocca, G. A Vision-Based System for Monitoring Elderly People at Home. Appl. Sci. 2020, 10, 374. [Google Scholar] [CrossRef]
  19. Jansen, B.; Deklerck, R. Home Monitoring of Elderly People with 3D Camera Technology. In Proceedings of the First BENELUX Biomedical Engineering Symposium, Brussels, Belgium, 7–8 December 2006. [Google Scholar]
  20. Feng, W.; Liu, R.; Zhu, M. Fall Detection for Elderly Person Care in a Vision-Based Home Surveillance Environment Using a Monocular Camera. Signal Image Video Process. 2014, 8, 1129–1138. [Google Scholar] [CrossRef]
  21. Yang, Y.; Yang, H.; Liu, Z.; Yuan, Y.; Guan, X. Fall Detection System Based on Infrared Array Sensor and Multi-Dimensional Feature Fusion. Meas. J. Int. Meas. Confed. 2022, 192, 110870. [Google Scholar] [CrossRef]
  22. Ramanujam, E.; Padmavathi, S. Real Time Fall Detection Using Infrared Cameras and Reflective Tapes under Day/Night Luminance. J. Ambient. Intell. Smart Environ. 2021, 13, 285–300. [Google Scholar] [CrossRef]
  23. Park, J.; Chen, J.; Cho, Y.K.; Kang, D.Y.; Son, B.J. CNN-Based Person Detection Using Infrared Images for Night-Time Intrusion Warning Systems. Sensors 2020, 20, 34. [Google Scholar] [CrossRef]
  24. Cosar, S.; Yan, Z.; Zhao, F.; Lambrou, T.; Yue, S.; Bellotto, N. Thermal Camera Based Physiological Monitoring with an Assistive Robot. In Proceedings of the 2018 40th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), Honolulu, HI, USA, 18–21 July 2018; pp. 5010–5013. [Google Scholar]
  25. Riquelme, F.; Espinoza, C.; Rodenas, T.; Minonzio, J.-G.; Taramasco, C. eHomeSeniors Dataset: An Infrared Thermal Sensor Dataset for Automatic Fall Detection Research. Sensors 2019, 19, 4565. [Google Scholar] [CrossRef]
  26. Fernando, Y.P.N.; Gunasekara, K.D.B.; Sirikumara, K.P.; Galappaththi, U.E.; Thilakarathna, T.; Kasthurirathna, D. Computer Vision Based Privacy Protected Fall Detection and Behavior Monitoring System for the Care of the Elderly. In Proceedings of the 2021 26th IEEE International Conference on Emerging Technologies and Factory Automation (ETFA), Vasteras, Sweden, 7–10 September 2021; pp. 1–7. [Google Scholar]
  27. Beddiar, D.R.; Nini, B.; Sabokrou, M.; Hadid, A. Vision-Based Human Activity Recognition: A Survey. Multimed. Tools Appl. 2020, 79, 30509–30555. [Google Scholar] [CrossRef]
  28. Nikouei, S.Y.; Chen, Y.; Song, S.; Xu, R.; Choi, B.Y.; Faughnan, T.R. Real-Time Human Detection as an Edge Service Enabled by a Lightweight CNN. In Proceedings of the 2018 IEEE International Conference on Edge Computing (EDGE), San Francisco, CA, USA, 2–7 July 2018; pp. 125–129. [Google Scholar]
  29. Chen, Y.; Kong, X.; Meng, L.; Tomiyama, H. An Edge Computing Based Fall Detection System for Elderly Persons. Procedia Comput. Sci. 2020, 174, 9–14. [Google Scholar] [CrossRef]
  30. Kim, S.; Park, J.; Jeong, Y.; Lee, S.E. Intelligent Monitoring System with Privacy Preservation Based on Edge AI. Micromachines 2023, 14, 1749. [Google Scholar] [CrossRef] [PubMed]
  31. Williams, A.; Xie, D.; Ou, S.; Grupen, R.; Hanson, A.; Riseman, E. Distributed Smart Cameras for Aging in Place. In Proceedings of the ACM SenSys Workshop on Distributed Smart Cameras, Boulder, CO, USA, 31 October 2006. [Google Scholar]
  32. Oudah, M.; Al-Naji, A.; Chahl, J. Elderly Care Based on Hand Gestures Using Kinect Sensor. Computers 2021, 10, 5. [Google Scholar] [CrossRef]
  33. Kendall, A.; Grimes, M.; Cipolla, R. PoseNet: A Convolutional Network for Real-Time 6-DOF Camera Relocalization. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; Volume 2015, pp. 2938–2946. [Google Scholar]
  34. Bazarevsky, V.; Grishchenko, I.; Raveendran, K.; Zhu, T.; Zhang, F.; Grundmann, M. BlazePose: On-Device Real-Time Body Pose Tracking. arXiv 2020, arXiv:2006.10204. [Google Scholar]
  35. Li, S.; Man, C.; Shen, A.; Guan, Z.; Mao, W.; Luo, S.; Zhang, R.; Yu, H. A Fall Detection Network by 2D/3D Spatio-Temporal Joint Models with Tensor Compression on Edge. ACM Trans. Embed. Comput. Syst. 2022, 21, 83. [Google Scholar] [CrossRef]
  36. Egawa, R.; Miah, A.S.M.; Hirooka, K.; Tomioka, Y.; Shin, J. Dynamic Fall Detection Using Graph-Based Spatial Temporal Convolution and Attention Network. Electronics 2023, 12, 3234. [Google Scholar] [CrossRef]
  37. Noor, N.; Park, I.K. A Lightweight Skeleton-Based 3D-CNN for Real-Time Fall Detection and Action Recognition. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), Paris, France, 2–6 October 2023; pp. 2171–2180. [Google Scholar]
  38. Min, W.; Yao, L.; Lin, Z.; Liu, L. Support Vector Machine Approach to Fall Recognition Based on Simplified Expression of Human Skeleton Action and Fast Detection of Start Key Frame Using Torso Angle. IET Comput. Vis. 2018, 12, 1133–1140. [Google Scholar] [CrossRef]
  39. Kong, X.; Kumaki, T.; Meng, L.; Tomiyama, H. A Skeleton Analysis Based Fall Detection Method Using ToF Camera. Procedia Comput. Sci. 2021, 187, 252–257. [Google Scholar] [CrossRef]
  40. De Miguel, K.; Brunete, A.; Hernando, M.; Gambao, E. Home Camera-Based Fall Detection System for the Elderly. Sensors 2017, 17, 2864. [Google Scholar] [CrossRef]
  41. Lafuente-Arroyo, S.; Martín-Martín, P.; Iglesias-Iglesias, C.; Maldonado-Bascón, S.; Acevedo-Rodríguez, F.J. RGB Camera-Based Fallen Person Detection System Embedded on a Mobile Platform. Expert Syst. Appl. 2022, 197, 116715. [Google Scholar] [CrossRef]
  42. Alam, E.; Sufian, A.; Dutta, P.; Leo, M. Vision-Based Human Fall Detection Systems Using Deep Learning: A Review. Comput. Biol. Med. 2022, 146, 105626. [Google Scholar] [CrossRef]
  43. Gutiérrez, J.; Rodríguez, V.; Martin, S. Comprehensive Review of Vision-Based Fall Detection Systems. Sensors 2021, 21, 947. [Google Scholar] [CrossRef]
  44. Hbali, Y.; Hbali, S.; Ballihi, L.; Sadgal, M. Skeleton-Based Human Activity Recognition for Elderly Monitoring Systems. IET Comput. Vis. 2017, 12, 16–26. [Google Scholar] [CrossRef]
  45. Nguyen, H.-C.; Nguyen, T.-H.; Scherer, R.; Le, V.-H. Deep Learning for Human Activity Recognition on 3D Human Skeleton: Survey and Comparative Study. Sensors 2023, 23, 5121. [Google Scholar] [CrossRef]
  46. Alaoui, A.Y.; ElFkihi, S.; Thami, R.O.H. Fall Detection for Elderly People Using the Variation of Key Points of Human Skeleton. IEEE Access 2019, 7, 154786–154795. [Google Scholar] [CrossRef]
  47. Wang, Y.; Deng, T. Enhancing Elderly Care: Efficient and Reliable Real-Time Fall Detection Algorithm. Digit. Health 2024, 10, 20552076241233690. [Google Scholar] [CrossRef] [PubMed]
  48. Hoang, V.H.; Lee, J.W.; Piran, M.J.; Park, C.S. Advances in Skeleton-Based Fall Detection in RGB Videos: From Handcrafted to Deep Learning Approaches. IEEE Access 2023, 11, 92322–92352. [Google Scholar] [CrossRef]
  49. Xiao, H.; Peng, K.; Huang, X.; Roitberg, A.; Li, H.; Wang, Z.; Stiefelhagen, R. Toward Privacy-Supporting Fall Detection via Deep Unsupervised RGB2Depth Adaptation. IEEE Sens. J. 2023, 23, 29143–29155. [Google Scholar] [CrossRef]
  50. Cao, Y.; Erdt, M.; Robert, C.; Naharudin, N.B.; Lee, S.Q.; Theng, Y.L. Decision-Making Factors Toward the Adoption of Smart Home Sensors by Older Adults in Singapore: Mixed Methods Study. JMIR Aging 2022, 5, e34239. [Google Scholar] [CrossRef]
  51. Gochoo, M.; Alnajjar, F.; Tan, T.-H.; Khalid, S. Towards Privacy-Preserved Aging in Place: A Systematic Review. Sensors 2021, 21, 3082. [Google Scholar] [CrossRef] [PubMed]
  52. Demiris, G.; Hensel, B.K.; Skubic, M.; Rantz, M. Senior Residents’ Perceived Need of and Preferences for “Smart Home” Sensor Technologies. Int. J. Technol. Assess. Health Care 2008, 24, 120–124. [Google Scholar] [CrossRef]
  53. Pirzada, P.; Wilde, A.; Doherty, G.H.; Harris-Birtill, D. Ethics and Acceptance of Smart Homes for Older Adults. Informatics Health Soc. Care 2021, 47, 10–37. [Google Scholar] [CrossRef]
  54. Simonyan, K.; Zisserman, A. Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: New York, NY, USA, 2014; pp. 568–576. [Google Scholar]
  55. Tran, D.; Bourdev, L.; Fergus, R.; Torresani, L.; Paluri, M. Learning spatiotemporal features with 3D convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, Las Condes, Chile, 11–18 December 2015; pp. 4489–4497. [Google Scholar]
  56. Yue-Hei Ng, J.; Hausknecht, M.; Vijayanarasimhan, S.; Vinyals, O.; Monga, R.; Toderici, G. Beyond short snippets: Deep networks for video classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 4694–4702. [Google Scholar]
  57. Feichtenhofer, C.; Pinz, A.; Zisserman, A. Convolutional two-stream network fusion for video action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1933–1941. [Google Scholar]
  58. Wang, L.; Xiong, Y.; Wang, Z.; Qiao, Y.; Lin, D.; Tang, X.; Van Gool, L. Temporal segment networks: Towards good practices for deep action recognition. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 8–16 October 2016; pp. 20–36. [Google Scholar]
  59. Liu, J.; Shahroudy, A.; Xu, D.; Wang, G. Spatio-temporal LSTM with trust gates for 3D human action recognition. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 8–16 October 2016; pp. 816–833. [Google Scholar]
  60. Varol, G.; Laptev, I.; Schmid, C. Long-term temporal convolutions for action recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 1510–1517. [Google Scholar] [CrossRef] [PubMed]
  61. Carreira, J.; Zisserman, A. Quo vadis, action recognition? A new model and the kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6299–6308. [Google Scholar]
  62. Diba, A.; Fayyaz, M.; Sharma, V.; Karami, A.H.; Arzani, M.M.; Yousefzadeh, R.; Van Gool, L. Temporal 3D convnets: New architecture and transfer learning for video classification. arXiv 2017, arXiv:1711.08200. [Google Scholar] [CrossRef]
  63. Qiu, Z.; Yao, T.; Mei, T. Learning spatio-temporal representation with pseudo-3D residual networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 5533–5541. [Google Scholar]
  64. Sun, L.; Jia, K.; Chen, K.; Yeung, D.Y.; Shi, B.E.; Savarese, S. Lattice Long Short-Term Memory for Human Action Recognition. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017. [Google Scholar]
  65. Song, S.; Lan, C.; Xing, J.; Zeng, W.; Liu, J. Spatio-temporal attention-based LSTM networks for 3D action recognition and detection. IEEE Trans. Image Process. 2018, 27, 3459–3471. [Google Scholar] [CrossRef] [PubMed]
  66. Zhang, B.; Wang, L.; Wang, Z.; Qiao, Y.; Wang, H. Real-time action recognition with deeply transferred motion vector CNNs. IEEE Trans. Image Process. 2018, 27, 2326–2339. [Google Scholar] [CrossRef]
  67. Tran, D.; Wang, H.; Torresani, L.; Ray, J.; LeCun, Y.; Paluri, M. A closer look at spatiotemporal convolutions for action recognition. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6450–6459. [Google Scholar]
  68. Ghadiyaram, D.; Tran, D.; Mahajan, D. Large-scale weakly-supervised pre-training for video action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 12046–12055. [Google Scholar]
  69. Menacho, C.; Ordonez, J. Fall detection based on CNN models implemented on a mobile robot. In Proceedings of the 2020 17th International Conference on Ubiquitous Robots (UR), Kyoto, Japan, 22–26 June 2020; pp. 284–289. [Google Scholar]
  70. Thummala, J.; Pumrin, S. Fall Detection using Motion History Image and Shape Deformation. In Proceedings of the 2020 8th International Electrical Engineering Congress (iEECON), Chiang Mai, Thailand, 4–6 March 2020; pp. 1–4. [Google Scholar]
  71. Wang, C.-Y.; Lin, F.-S. AI-Driven Privacy in Elderly Care: Developing a Comprehensive Solution for Camera-Based Monitoring of Older Adults. Appl. Sci. 2024, 14, 4150. [Google Scholar] [CrossRef]
Figure 1. Conceptual framework of 3D pose estimation [15].
Figure 1. Conceptual framework of 3D pose estimation [15].
Symmetry 17 01423 g001
Figure 2. Various pose processing models [16,17].
Figure 2. Various pose processing models [16,17].
Symmetry 17 01423 g002
Figure 3. The proposed framework of a machine vision-based real-time remote monitoring system for elderly care using pose estimation and hand gesture recognition.
Figure 3. The proposed framework of a machine vision-based real-time remote monitoring system for elderly care using pose estimation and hand gesture recognition.
Symmetry 17 01423 g003
Figure 4. The system’s operational control.
Figure 4. The system’s operational control.
Symmetry 17 01423 g004
Figure 5. The procedure for importing data from the camera to the Raspberry Pi board.
Figure 5. The procedure for importing data from the camera to the Raspberry Pi board.
Symmetry 17 01423 g005
Figure 6. Reducing the pixel size.
Figure 6. Reducing the pixel size.
Symmetry 17 01423 g006
Figure 7. Structure of the convolutional neural network.
Figure 7. Structure of the convolutional neural network.
Symmetry 17 01423 g007
Figure 8. Viewing the subspace with feature extraction (feature).
Figure 8. Viewing the subspace with feature extraction (feature).
Symmetry 17 01423 g008
Figure 9. Input data and kernel.
Figure 9. Input data and kernel.
Symmetry 17 01423 g009
Figure 10. Feature map.
Figure 10. Feature map.
Symmetry 17 01423 g010
Figure 11. Feature map after pooling.
Figure 11. Feature map after pooling.
Symmetry 17 01423 g011
Figure 12. Making data fully connected.
Figure 12. Making data fully connected.
Symmetry 17 01423 g012
Figure 13. Class prediction results from CNN.
Figure 13. Class prediction results from CNN.
Symmetry 17 01423 g013
Figure 14. Twenty-one important hand positions.
Figure 14. Twenty-one important hand positions.
Symmetry 17 01423 g014
Figure 15. Thirty-two important body positions.
Figure 15. Thirty-two important body positions.
Symmetry 17 01423 g015
Figure 16. Example dataset.
Figure 16. Example dataset.
Symmetry 17 01423 g016
Figure 17. Example of adjusting the size of an image to the same size.
Figure 17. Example of adjusting the size of an image to the same size.
Symmetry 17 01423 g017
Figure 18. Example of converting images to Landmarker values for various body parts and hands.
Figure 18. Example of converting images to Landmarker values for various body parts and hands.
Symmetry 17 01423 g018
Figure 19. Example of dataset division.
Figure 19. Example of dataset division.
Symmetry 17 01423 g019
Figure 20. Hand gestures used in the experiment.
Figure 20. Hand gestures used in the experiment.
Symmetry 17 01423 g020
Figure 21. Gestures used in the experiment.
Figure 21. Gestures used in the experiment.
Symmetry 17 01423 g021
Figure 22. Decision principles of the hand gesture system.
Figure 22. Decision principles of the hand gesture system.
Symmetry 17 01423 g022
Figure 23. Decision principles of the body posture system.
Figure 23. Decision principles of the body posture system.
Symmetry 17 01423 g023
Figure 24. Shows hand gestures on the monitor.
Figure 24. Shows hand gestures on the monitor.
Symmetry 17 01423 g024
Figure 25. Shows the gesture on the monitor.
Figure 25. Shows the gesture on the monitor.
Symmetry 17 01423 g025
Figure 26. Notification via Line.
Figure 26. Notification via Line.
Symmetry 17 01423 g026
Figure 27. Low-light environment.
Figure 28. Medium-illumination environment.
Figure 29. Very bright environment.
Figure 30. Experiment in a simulated scenario under the lighting conditions used in the evaluation.
Figure 31. Example of test results in a poorly lit area (hand).
Figure 32. Example of test results in a poorly lit area (body).
Figure 33. Example of test results in a moderately lit area (hand).
Figure 34. Example of test results in a moderately lit area (body).
Figure 35. Example of test results in a brightly lit area (hand).
Figure 36. Example of test results in a brightly lit area (body).
Figure 37. Example of simulated-area test results.
Figure 38. Example of test results for the simulated area.
Figure 39. Example of test results for notifications via Line.
Figure 40. Quantitative evaluation of the symmetry-aware gesture recognition model.
Figure 41. Learning progress of the proposed symmetry-aware model.
Figure 42. Geometric symmetry in human pose.
Figure 43. ROC and Precision–Recall curves.
Figure 44. Runtime and model size comparison.
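Figure 42 visualises the geometric symmetry of the estimated pose. As an illustration of how a bilateral symmetry score can be computed from paired left/right landmarks, a small sketch is given below; the landmark indices follow MediaPipe's 33-point pose convention, and the scoring formula is an assumption for illustration rather than the paper's exact measure.

```python
# Illustrative bilateral symmetry score (not the paper's exact metric):
# reflect right-side landmarks across the body midline and measure how closely
# they coincide with their left-side counterparts.
import numpy as np

# Left/right landmark index pairs in MediaPipe's 33-point pose convention.
MIRROR_PAIRS = [(11, 12), (13, 14), (15, 16), (23, 24), (25, 26), (27, 28)]

def symmetry_score(landmarks: np.ndarray) -> float:
    """landmarks: (33, 2) array of normalised (x, y) coordinates."""
    mid_x = (landmarks[11, 0] + landmarks[12, 0]) / 2.0    # shoulder midline
    errors = []
    for left, right in MIRROR_PAIRS:
        mirrored_right_x = 2.0 * mid_x - landmarks[right, 0]
        errors.append(np.hypot(landmarks[left, 0] - mirrored_right_x,
                               landmarks[left, 1] - landmarks[right, 1]))
    return float(np.exp(-np.mean(errors)))                 # 1.0 = perfectly symmetric
```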
Table 1. Test results in the low-light (hand) scenario.
Hand Gesture | Number of Tests | Correct | Incorrect | Accuracy
Symmetry 17 01423 i001 | 50 | 48 | 2 | 96%
Symmetry 17 01423 i002 | 50 | 47 | 3 | 94%
Symmetry 17 01423 i003 | 50 | 48 | 2 | 96%
Symmetry 17 01423 i004 | 50 | 47 | 3 | 94%
Symmetry 17 01423 i005 | 50 | 46 | 4 | 92%
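The accuracy values in Tables 1–7 follow the standard per-class definition (correct detections over the total number of trials); for the first row of Table 1, for example:

$$\text{Accuracy} = \frac{\text{Correct}}{\text{Number of Tests}} \times 100\% = \frac{48}{50} \times 100\% = 96\%.$$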
Table 2. Test results in the low-illumination (body) scenario.
Body Posture | Number of Tests | Correct | Incorrect | Accuracy
Symmetry 17 01423 i006 | 50 | 48 | 2 | 96%
Symmetry 17 01423 i007 | 50 | 45 | 5 | 90%
Symmetry 17 01423 i008 | 50 | 48 | 2 | 96%
Table 3. Test results in the medium-illumination (hand) scenario.
Hand Gesture | Number of Tests | Correct | Incorrect | Accuracy
Symmetry 17 01423 i009 | 50 | 50 | 0 | 100%
Symmetry 17 01423 i010 | 50 | 48 | 2 | 96%
Symmetry 17 01423 i011 | 50 | 48 | 2 | 96%
Symmetry 17 01423 i012 | 50 | 48 | 2 | 96%
Symmetry 17 01423 i013 | 50 | 47 | 3 | 94%
Table 4. Test results in the simulated medium-illumination (body) scenario.
Body Posture | Number of Tests | Correct | Incorrect | Accuracy
Symmetry 17 01423 i014 | 50 | 48 | 2 | 96%
Symmetry 17 01423 i015 | 50 | 47 | 3 | 94%
Symmetry 17 01423 i016 | 50 | 48 | 2 | 96%
Table 5. Test results in the high-light (hand) scenario.
Hand Gesture | Number of Tests | Correct | Incorrect | Accuracy
Symmetry 17 01423 i017 | 50 | 49 | 1 | 98%
Symmetry 17 01423 i018 | 50 | 48 | 2 | 96%
Symmetry 17 01423 i019 | 50 | 48 | 2 | 96%
Symmetry 17 01423 i020 | 50 | 49 | 2 | 98%
Symmetry 17 01423 i021 | 50 | 46 | 3 | 92%
Table 6. Test results in the simulated high-light (body) scenario.
Body Posture | Number of Tests | Correct | Incorrect | Accuracy
Symmetry 17 01423 i022 | 50 | 49 | 1 | 98%
Symmetry 17 01423 i023 | 50 | 48 | 2 | 96%
Symmetry 17 01423 i024 | 50 | 48 | 2 | 96%
Table 7. Simulation test results.
Hand Gesture | Number of Tests | Correct | Incorrect | Accuracy
Symmetry 17 01423 i025 | 50 | 48 | 2 | 96%
Symmetry 17 01423 i026 | 50 | 46 | 4 | 92%
Symmetry 17 01423 i027 | 50 | 45 | 5 | 90%
Symmetry 17 01423 i028 | 50 | 47 | 3 | 94%
Symmetry 17 01423 i029 | 50 | 47 | 3 | 94%
Table 8. Comparison with the state-of-the-art methods.
Method | Year | Accuracy
Two streams (RGB+OF) [54] | 2014 | 88.0%
C3D+Linear SVM [55] | 2015 | 85.2%
LSTM30+OF+RGB [56] | 2015 | 88.6%
S:VGG-16, T:VGG-16 [57] | 2016 | 92.5%
TSN (3 modalities) [58] | 2016 | 94.2%
ST-LSTM+Trust Gate [59] | 2016 | 69.2%
LTC [60] | 2017 | 92.7%
I3D [61] | 2017 | 98.0%
T3D(+TSN) [62] | 2017 | 93.2%
P3D ResNet [63] | 2017 | 88.6%
L2STM [64] | 2017 | 93.6%
STA-LSTM [65] | 2018 | 73.4%
DTMV+RGB-CNN [66] | 2018 | 87.5%
R(2+1)D-Two [67] | 2018 | 97.3%
R(2+1)D-152 [68] | 2019 | 81.3%
Fully connected layer [69] | 2020 | 88.55%
Feature-threshold-based (object height/width ratio, ratio change speed, and MHI) [70] | 2020 | 95.16%
YOLOv8 model's pose detection on LE2I [71] | 2020 | 98.86%
YOLOv8 model's pose detection on URFD [71] | 2024 | 96.23%
The Proposed Method | 2025 | 98.0%
Table 9. Baseline models for comparison.
Model | Description
CNN-only | No temporal modeling, frame-wise classification
CNN + LSTM | Standard sequential model without symmetry constraints
CNN + Bi-LSTM | Bidirectional LSTM for better context
Sym-LSTM (ours) | Our proposed model with geometric + temporal symmetry losses
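As an illustration of the baseline architectures in Table 9, a compact PyTorch sketch of the "CNN + LSTM" configuration (frame-wise convolutional features followed by a recurrent layer) is shown below; layer sizes, input resolution, and the number of classes are hypothetical and are not taken from the paper.

```python
# Illustrative CNN + LSTM baseline (Table 9): per-frame CNN features fed to an
# LSTM, classified from the final hidden state. Dimensions are hypothetical.
import torch
import torch.nn as nn

class CnnLstmClassifier(nn.Module):
    def __init__(self, num_classes=8, feature_dim=128, hidden_dim=256):
        super().__init__()
        self.cnn = nn.Sequential(                      # frame-wise feature extractor
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, feature_dim),
        )
        self.lstm = nn.LSTM(feature_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, clips):                          # clips: (B, T, 3, H, W)
        b, t = clips.shape[:2]
        feats = self.cnn(clips.flatten(0, 1)).view(b, t, -1)
        _, (h_n, _) = self.lstm(feats)                 # h_n: (1, B, hidden_dim)
        return self.head(h_n[-1])                      # class logits per clip

logits = CnnLstmClassifier()(torch.randn(2, 16, 3, 64, 64))   # shape: (2, num_classes)
```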
Table 10. Quantitative results.
Model | Accuracy (%) | F1-Score | Latency (ms) | SCI
CNN-only | 82.4 | 0.81 | 95 | 0.72
CNN + LSTM | 88.1 | 0.86 | 135 | 0.79
CNN + Bi-LSTM | 89.3 | 0.88 | 145 | 0.81
Sym-LSTM (ours) | 92.7 | 0.91 | 138 | 0.91
Table 11. Ablation study.
Configuration | Accuracy (%)
Full Model (Sym-LSTM) | 92.7
w/o Geometric Loss | 88.9
w/o Temporal Loss | 89.4
w/o Depth Regression | 90.1
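The ablation in Table 11 removes a geometric loss and a temporal loss from the full model. Their exact formulations are not reproduced here; the sketch below shows one plausible way such auxiliary terms can be written, as mirror-pair consistency across the body midline and frame-to-frame smoothness over time, purely as an illustration of the idea.

```python
# Illustrative geometric + temporal penalties (not the paper's exact losses):
# mirror-pair consistency across the body midline and smoothness of predicted
# keypoints over time.
import torch

MIRROR_PAIRS = [(11, 12), (13, 14), (15, 16), (23, 24), (25, 26), (27, 28)]

def geometric_symmetry_loss(keypoints: torch.Tensor) -> torch.Tensor:
    """keypoints: (B, T, 33, 2). Penalise left/right pairs that are not mirror images."""
    mid_x = (keypoints[..., 11, 0] + keypoints[..., 12, 0]) / 2.0   # shoulder midline
    loss = torch.zeros(())
    for left, right in MIRROR_PAIRS:
        mirrored_x = 2.0 * mid_x - keypoints[..., right, 0]
        loss = loss + (keypoints[..., left, 0] - mirrored_x).abs().mean() \
                    + (keypoints[..., left, 1] - keypoints[..., right, 1]).abs().mean()
    return loss / len(MIRROR_PAIRS)

def temporal_smoothness_loss(keypoints: torch.Tensor) -> torch.Tensor:
    """Penalise large frame-to-frame jumps in the predicted keypoints."""
    return (keypoints[:, 1:] - keypoints[:, :-1]).pow(2).mean()

dummy = torch.rand(4, 16, 33, 2)
total_aux = geometric_symmetry_loss(dummy) + 0.5 * temporal_smoothness_loss(dummy)
```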