1. Introduction
Every year, approximately 30% of adults aged 65 and older experience at least one fall, and nearly 10% of these incidents result in severe injuries such as fractures or intracranial hemorrhages, making falls the leading cause of accidental injury-related deaths among the elderly [1,2,3,4]. As global populations age rapidly, the societal and economic costs of falls continue to escalate, posing urgent challenges to healthcare systems and family caregivers. While traditional elderly care relies heavily on manual supervision and wearable devices, both approaches have notable shortcomings. Manual monitoring often has “blind spots” and response delays. If a fall leads to a prolonged period of immobility (commonly referred to in medicine as “long-term immobilization” or “bed rest”), the resulting harm can far exceed the trauma of the fall itself, potentially triggering a chain reaction in multiple systems and even worsening the condition or increasing the mortality rate. In contrast, wearable sensors or smart bracelets rely on the active compliance of the user. In domestic settings, they may fail due to detachment or discomfort [5].
Given these limitations, video-based non-contact fall detection has emerged as a promising solution, enabling real-time monitoring without requiring user cooperation. Such systems can be seamlessly integrated into existing surveillance infrastructures, enhancing both safety and convenience [6,7,8,9,10,11]. Early methods, however, were predominantly based on classical computer vision techniques—such as HOG + SVM or optical flow—which rely on handcrafted features and are highly sensitive to environmental noise. These methods perform poorly in scenes with background clutter or varying illumination, leading to reduced detection accuracy [12]. The advent of deep learning, particularly object detection architectures like Faster R-CNN and the YOLO series, has transformed the field through end-to-end feature learning and high-speed inference [13]. Yet, despite their promise, these models are typically optimized for generic object detection and struggle to handle elderly falls characterized by atypical horizontal postures and small-scale targets. Consequently, they are prone to both false negatives (missed falls) and false positives (sitting or lying actions misclassified as falls).
Existing research can broadly be grouped into two technical paradigms: pose estimation–based and object detection–based approaches. Pose estimation methods, such as those utilizing HRNet and MediaPipe, infer joint coordinates to assess body angles or center-of-gravity shifts [14]. While precise, their multi-stage inference processes hinder real-time performance, limiting deployment in continuous monitoring scenarios. In contrast, object detection–based methods directly treat falls as a target category, offering end-to-end detection. For example, ref. [15] improved YOLOv4 with data augmentation for complex scenes, yet the model’s large size (about 60 MB) constrains its use on edge devices. YOLOv5s, a lightweight member of the YOLOv5 family open-sourced by the Ultralytics team in 2020, follows the YOLO series’ core idea of end-to-end real-time detection and strikes an excellent balance among model size, inference speed, and detection accuracy, making it a preferred choice for lightweight deployment. Accordingly, ref. [16] adopted YOLOv5s for lightweight edge deployment but reported limited recall (82.1%) for small, distant targets. Building on this, ref. [17] enhanced YOLOv5s by integrating CIoU Loss and attention mechanisms, improving bounding box precision and suppressing irrelevant background interference. Similarly, ref. [18] optimized YOLOv5s by introducing an Efficient Channel Attention (ECA) module and replacing the PANet structure with BiFPN for adaptive feature fusion, yielding a 2.7% accuracy gain on the Le2i public dataset.
Although these studies demonstrate meaningful progress, two critical limitations persist. First, most algorithms fail to simultaneously balance detection accuracy, model compactness, and real-time performance—a trade-off vital for real-world deployment in elderly care settings. Second, few models explicitly optimize for fall-specific spatial orientations and small-scale body features. As a result, the generalizability of existing approaches across diverse environments—such as occluded, low-light, or cluttered home scenes—remains limited.
To address these gaps, this study explores an improved lightweight YOLOv5s-based framework designed specifically for elderly fall detection. Building upon the works of [17,18], the proposed model integrates a Convolutional Block Attention Module (CBAM) to enhance feature saliency, refines multi-scale feature fusion to boost small-object detection, and re-clusters anchor boxes to better adapt to the horizontal morphology of falls. Moreover, a multi-scene elderly fall dataset is constructed to ensure robust performance under varying illumination, occlusion, and scale conditions. The overarching objective is to achieve a practical balance between accuracy, speed, and deployability, advancing both theoretical understanding of fall detection in AI-assisted healthcare and its practical application in real-world monitoring systems.
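For reference, the sketch below shows a minimal PyTorch implementation of CBAM in its standard form (channel attention followed by spatial attention). The reduction ratio and spatial kernel size are common defaults, not values reported in this paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CBAM(nn.Module):
    """Convolutional Block Attention Module: channel attention followed
    by spatial attention, applied sequentially to a feature map."""
    def __init__(self, channels: int, reduction: int = 16, spatial_kernel: int = 7):
        super().__init__()
        # Channel attention: a shared MLP scores globally avg- and max-pooled descriptors.
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1, bias=False),
        )
        # Spatial attention: a 7x7 conv scores the channel-wise avg and max maps.
        self.spatial = nn.Conv2d(2, 1, spatial_kernel, padding=spatial_kernel // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Channel attention weights of shape (B, C, 1, 1).
        ca = torch.sigmoid(self.mlp(F.adaptive_avg_pool2d(x, 1)) +
                           self.mlp(F.adaptive_max_pool2d(x, 1)))
        x = x * ca
        # Spatial attention weights of shape (B, 1, H, W).
        sa = torch.sigmoid(self.spatial(torch.cat(
            [x.mean(dim=1, keepdim=True), x.max(dim=1, keepdim=True).values], dim=1)))
        return x * sa
```

In the improved network, such a module is inserted into the backbone so that feature maps are re-weighted toward fall-relevant postural cues before multi-scale fusion.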
3. Results
3.1. Experimental Environment and Dataset
A multi-scene elderly fall dataset was constructed using a combination of controlled studio recordings and publicly shared home environment clips. Data were captured using four consumer-grade RGB cameras (Logitech C920, Xiaomi Mi Home Camera 2K, and two 1080p PTZ surveillance cameras) with resolutions of 1920 × 1080 or 1280 × 720 at 25–30 FPS. To ensure ethical compliance, all videos involved authorized volunteers simulating daily activities and fall scenarios indoors. Faces were anonymized using automatic blurring to protect identity.
Data labeling was performed using LabelImg following a strict definition: a fall was labeled only when the entire body transitioned from upright to near-ground contact with an unintended motion trajectory, reflecting medical guidelines for fall characterization. Sitting, resting, bending, crawling, and lying intentionally were labeled as normal. Each bounding box was independently annotated by two trained annotators and reviewed by a third annotator. Samples with disagreement or unclear postures were excluded.
To improve dataset reliability, filtering rules removed the following (see the screening sketch after this list):
- (1) frames with motion blur obscuring the torso or limbs,
- (2) partial bodies visible under occlusion > 70%,
- (3) recordings with severe overexposure/underexposure,
- (4) duplicate frames from static post-fall periods.
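As an illustration, the following Python sketch screens frames against rules (1), (3), and (4) using simple OpenCV heuristics; all thresholds are illustrative assumptions rather than values used in this study, and rule (2), occlusion above 70%, depends on annotation context and was checked manually.

```python
import cv2
import numpy as np

def keep_frame(frame, prev_frame=None,
               blur_thresh=60.0, dark=25, bright=230, dup_thresh=2.0):
    """Heuristic screening mirroring the filtering rules above.
    All thresholds are illustrative assumptions."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Rule (1): motion blur -- low variance of the Laplacian indicates blur.
    if cv2.Laplacian(gray, cv2.CV_64F).var() < blur_thresh:
        return False
    # Rule (3): severe over-/under-exposure via mean intensity.
    if not dark < gray.mean() < bright:
        return False
    # Rule (4): near-duplicate static frames via mean absolute difference.
    if prev_frame is not None:
        prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
        diff = np.abs(gray.astype(np.int16) - prev_gray.astype(np.int16)).mean()
        if diff < dup_thresh:
            return False
    # Rule (2), occlusion > 70%, requires annotation context (checked manually).
    return True
```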
The final dataset contains 11,314 images, stratified into:
- 8052 training samples,
- 1131 validation samples,
- 1131 test samples,

ensuring a consistent scene distribution (living rooms 38%, bedrooms 22%, corridors 25%, activity rooms 15%).
Data augmentation techniques—including random rotation (±15°), horizontal flipping, brightness shift (±25%), scaling (0.8–1.2), and Gaussian noise—were applied to enhance model generalization across illumination and occlusion variations.
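A minimal sketch of such an augmentation pipeline using the albumentations library is shown below; the numeric limits mirror the settings reported above, while the probability values and library choice are illustrative assumptions.

```python
import albumentations as A

# Augmentation pipeline mirroring the reported settings:
# rotation ±15°, horizontal flip, brightness shift ±25%,
# scaling 0.8–1.2, and additive Gaussian noise.
# YOLO-format bounding boxes are transformed alongside the image.
train_transform = A.Compose(
    [
        A.Rotate(limit=15, p=0.5),
        A.HorizontalFlip(p=0.5),
        A.RandomBrightnessContrast(brightness_limit=0.25, contrast_limit=0.0, p=0.5),
        A.RandomScale(scale_limit=0.2, p=0.5),  # 0.8–1.2 relative scaling
        A.GaussNoise(p=0.3),
    ],
    bbox_params=A.BboxParams(format="yolo", label_fields=["class_labels"]),
)

# Usage: out = train_transform(image=img, bboxes=boxes, class_labels=labels)
```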
The experiments were conducted on a workstation equipped with an Intel Core i7-12700K CPU, an NVIDIA RTX 3090 GPU (24 GB), and 32 GB of memory, running Ubuntu 20.04 with PyTorch 1.12.0 and Python 3.8.
This custom dataset was constructed to address the limitations of existing public datasets such as Le2i and UR, which do not adequately capture realistic indoor environments; in addition to the recordings described above, supplementary material was drawn from previous research, internet searches, and AI-generated imagery.
All models were trained under identical experimental settings to ensure fair comparison. The improved YOLOv5s was trained for 300 epochs using a batch size of 16 and an initial learning rate of 0.01, optimized by the Stochastic Gradient Descent (SGD) optimizer with momentum = 0.937 and weight decay = 0.0005. A cosine annealing schedule was applied to gradually reduce the learning rate during training. The input resolution was fixed at 640 × 640 for all experiments. Mosaic data augmentation, label smoothing (ε = 0.1), and early stopping were adopted to prevent overfitting. Hardware acceleration using CUDA and cuDNN ensured consistent inference benchmarking. These configurations were applied not only to the proposed model but also to baseline YOLOv5s and comparison methods to maintain consistency.
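As a reference, the following PyTorch sketch mirrors the reported optimizer and schedule settings; `model`, `train_loader`, and `compute_yolo_loss` are placeholders for the actual detector, the 640 × 640 data pipeline, and the composite YOLO loss.

```python
import torch

# Hyperparameters follow the reported training configuration.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.937, weight_decay=0.0005)
# Cosine annealing gradually reduces the learning rate over 300 epochs.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=300)
# Label smoothing (epsilon = 0.1) on the classification branch.
cls_criterion = torch.nn.CrossEntropyLoss(label_smoothing=0.1)

for epoch in range(300):
    model.train()
    for images, targets in train_loader:
        loss = compute_yolo_loss(model(images), targets, cls_criterion)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()
    # Early stopping on saturating validation mAP would be checked here.
```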
3.2. Evaluation Metrics
The model was evaluated using standard object detection metrics: Precision, Recall, Mean Average Precision (mAP@0.5), Frames Per Second (FPS), and False Alarm Rate (FAR). These metrics collectively assess detection accuracy, sensitivity, real-time performance, and reliability. High precision indicates low false positives, while high recall reflects effective identification of actual falls. The mAP@0.5 metric serves as a comprehensive measure of overall model performance, while FPS determines suitability for real-time monitoring.
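The paper does not spell out the exact FAR formula; the sketch below computes precision, recall, and one common FAR convention (the proportion of raised alarms that are false) from matched detection counts at IoU ≥ 0.5. mAP@0.5 additionally integrates precision over the precision-recall curve and is typically computed by the detection framework itself.

```python
def detection_metrics(tp: int, fp: int, fn: int):
    """Precision, recall, and false alarm rate from detection counts.
    tp: detections matching a ground-truth fall box at IoU >= 0.5;
    fp: detections with no matching ground truth; fn: unmatched falls.
    FAR here is the share of alarms that are false (one convention;
    FP/(FP+TN) over normal-activity windows is another)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    far = fp / (tp + fp) if (tp + fp) else 0.0
    return precision, recall, far

# Illustrative counts only, not the paper's evaluation tallies.
p, r, far = detection_metrics(tp=925, fp=41, fn=75)
```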
3.3. Comparative Performance Analysis
This paper assesses performance on the basis of image-based detection. Owing to hardware constraints, multi-camera triangulation was not used to cross-verify potential false negatives and false positives.
Table 3 summarizes the comparative performance of the proposed model against YOLOv4, YOLOv5s, and prior works [14,15]. The proposed algorithm achieved a Precision of 93.8%, Recall of 92.5%, and mAP@0.5 of 94.2%, outperforming the baseline YOLOv5s by 6.8 percentage points in mAP. While its frame rate (32 FPS) is slightly lower than the original YOLOv5s (35 FPS), it remains well above the minimum real-time monitoring threshold of 25 FPS. Importantly, the FAR was reduced to 4.2%, indicating a significant improvement in reliability for real-world deployments.
This performance enhancement can be attributed to three synergistic design improvements. First, the CBAM attention mechanism strengthened feature learning by emphasizing critical postural and spatial cues. Second, the multi-scale feature fusion allowed the model to detect both small distant targets and large close-range falls more effectively. Third, anchor box re-clustering improved IoU matching for horizontally oriented falls, reducing missed detections. Together, these improvements enhanced the model’s robustness across diverse environmental conditions.

In detail, the proposed algorithm exceeds the method of [14] by 0.3 percentage points and that of [15] by 0.5 percentage points in mAP@0.5, indicating that the improvements translate into measurable accuracy gains. The recall of 92.5%, 6.8 percentage points above the original YOLOv5s, is mainly attributable to anchor box re-clustering and multi-scale fusion, which mitigate missed detections of falls viewed in perspective. The 32 FPS frame rate also exceeds those of YOLOv4 and the method of [15]. Finally, the FAR of 4.2% is 2.3 percentage points lower than that of the original YOLOv5s, indicating that the multi-feature decision rule effectively suppresses false alarms.
To further position the proposed framework within the lightweight model landscape, we compared its performance with representative lightweight fall detection models, including recent YOLOv5s variants and infrared-based approaches [14,15,16,17,18,19,20,21]. As shown in Figure 2, our model demonstrates the most balanced performance across precision, recall, mAP@0.5, FPS, and FAR. Notably, the FAR (4.2%) is the lowest among the compared lightweight approaches, which is a critical factor in reducing unnecessary caregiver interventions in real elderly care environments. These results indicate that the proposed improvements reinforce detection reliability while retaining real-time constraints, validating its suitability for edge-deployable healthcare monitoring systems.
3.4. Comparative Evaluation with Lightweight Models
To evaluate computational deployment feasibility, we further compared the proposed framework with widely used lightweight detectors including YOLOv8-n, MobileNet-SSD, and EfficientDet-D0. As shown in Table 4, YOLOv8-n achieves the fastest inference speed due to its extremely compact architecture, while EfficientDet-D0 yields slightly higher COCO accuracy. However, both models exhibit lower recall when adapted to fall detection tasks in our domain-specific dataset. The proposed model delivers a more balanced trade-off between accuracy, latency, and memory footprint, enabling real-time inference (≥30 FPS) while preserving high detection reliability.
These experimental results confirm that our approach maintains state-of-the-art lightweight performance while tailoring model structures to fall-related spatial characteristics, thus achieving superior task-specific accuracy under resource-constrained deployment environments such as edge-based elderly monitoring systems.
3.5. Ablation Experimental Results
The ablation study (Table 5) demonstrates the incremental effect of each module. Introducing CBAM alone increased mAP by 2.7 percentage points, validating the importance of attention-guided feature refinement. Multi-scale fusion optimization alone yielded a 4.8-percentage-point recall improvement, underscoring its role in capturing small-scale features. Anchor box re-clustering alone contributed a 2.9-percentage-point mAP gain by aligning detection anchors to horizontal fall postures. When all three components were integrated, a cumulative improvement of 6.8 percentage points in mAP was achieved, confirming a synergistic effect between the modules (for example, multi-scale fusion of CBAM-strengthened features further improves small-object detection accuracy).
These findings align with recent advances in lightweight object detection frameworks [9,21], demonstrating that combining attention and adaptive anchor design can significantly enhance both accuracy and efficiency. The present model achieves an optimal trade-off between computational cost and detection reliability—a key requirement for elderly monitoring systems.
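For completeness, the anchor re-clustering step can be sketched as k-means over ground-truth box widths and heights with 1 − IoU as the distance, the standard recipe for fitting YOLO anchors; initialization and iteration counts below are illustrative.

```python
import numpy as np

def iou_wh(boxes: np.ndarray, anchors: np.ndarray) -> np.ndarray:
    """IoU between (N, 2) boxes and (k, 2) anchors compared by
    width/height only, both centred at the origin."""
    inter = np.minimum(boxes[:, None, 0], anchors[None, :, 0]) * \
            np.minimum(boxes[:, None, 1], anchors[None, :, 1])
    union = boxes[:, 0:1] * boxes[:, 1:2] + \
            anchors[None, :, 0] * anchors[None, :, 1] - inter
    return inter / union

def kmeans_anchors(wh: np.ndarray, k: int = 9, iters: int = 100, seed: int = 0):
    """Re-cluster anchors with k-means using 1 - IoU as the distance,
    so wide, horizontally oriented fall boxes pull anchors toward
    aspect ratios greater than 1."""
    rng = np.random.default_rng(seed)
    anchors = wh[rng.choice(len(wh), size=k, replace=False)]
    for _ in range(iters):
        assign = np.argmax(iou_wh(wh, anchors), axis=1)  # nearest = highest IoU
        new = np.array([wh[assign == i].mean(axis=0) if np.any(assign == i)
                        else anchors[i] for i in range(k)])
        if np.allclose(new, anchors):
            break
        anchors = new
    return anchors[np.argsort(anchors.prod(axis=1))]  # sorted by area
```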
3.6. Visualization of Typical Scene Detection Results
Figure 3 illustrates qualitative examples of fall detection across three representative scenarios: (1) close-range, unobstructed falls; (2) distant, partially occluded falls; and (3) interference scenes involving non-fall actions such as sitting. The improved model successfully detects falls with high confidence (e.g., probability = 0.92) in both near and occluded conditions. Notably, the enhanced algorithm eliminates false positives by using post-detection logical filtering, effectively distinguishing between genuine falls and benign activities.
However, under extreme lighting or heavy occlusion conditions (Figure 4), partial detection failures persist. The classification accuracy of visual models depends on the precise capture of features such as human contours, joint points, and movement trajectories, and changes in lighting can disrupt the integrity of these features, ultimately affecting classification results. In low-light conditions (illuminance < 100 lx), features become blurred and subject to noise interference; in strong light (illuminance > 1000 lx), overexposure and reflections interfere; and under sudden illumination changes (change rate > 500 lx/min), features become discontinuous between frames. These limitations highlight future research directions, including illumination-invariant feature extraction and multimodal sensor fusion (e.g., integrating infrared or depth data). Despite these constraints, the current model demonstrates strong robustness and generalizability across realistic monitoring environments.
In the left image of Figure 3 (a close-range, unobstructed fall), the improved algorithm accurately frames the fall region with a class probability of 0.92 and no missed detection. In the middle image (a distant fall in a corridor, with the target occupying about 5% of the frame and partially occluded by a chair), the improved algorithm still detects the fall target with a score of 0.89, whereas the original YOLOv5s misses it. In the right image (an elderly person sitting down), the improved algorithm correctly rejects the event via aspect-ratio screening (0.8 < 1.0) and position filtering, keeping the false alarm rate low. For some extreme scenes, however, such as insufficient illumination and severe occlusion, the algorithm's performance remains insufficient, as shown in Figure 4.
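The aspect-ratio and position screening can be summarized by a decision rule of the following form; the thresholds are illustrative assumptions consistent with the example above, not the exact values used by the system.

```python
def is_fall_box(x1, y1, x2, y2, frame_h, conf,
                conf_thresh=0.5, ar_thresh=1.0, floor_frac=0.5):
    """Multi-feature decision rule sketch: accept a 'fall' detection only if
    (a) confidence clears the threshold, (b) the box is wider than it is
    tall (w/h > 1, matching horizontal fall morphology), and (c) the box
    centre lies in the lower part of the frame. Thresholds are illustrative."""
    w, h = x2 - x1, y2 - y1
    if conf < conf_thresh:
        return False
    if w / max(h, 1e-6) <= ar_thresh:   # e.g. a seated person: 0.8 < 1.0 -> rejected
        return False
    cy = (y1 + y2) / 2
    return cy > floor_frac * frame_h    # near-ground position screening
```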
Misclassification Case Analysis
Figure 5 presents typical misclassification cases to illustrate the remaining weaknesses of the proposed model. In the top row, all three examples are false positives triggered by daily activities that temporarily produce a horizontal or highly flexed posture. In (a), the subject performs a rapid sitting motion onto the bed, and the torso orientation becomes similar to that of a fall. In (b), the subject squats outdoors to tie shoelaces, resulting in a compact posture close to the ground. In (c), an elderly passenger bends forward to pick up an object from the subway floor. In these cases, the model primarily focuses on body aspect ratio and ground proximity, leading to normal activities of daily living being misclassified as falls.
The bottom row of Figure 5 shows false negatives under visually challenging conditions. In (d), the subject falls on a dimly lit staircase; strong shadows and low contrast obscure body contours, and the detector fails to localize the fall. In (e), the subject is partially occluded by curtains. Although the network outputs a fall confidence of 0.51, the prediction does not surpass the decision threshold and is treated as a missed detection. In (f), the fall occurs at an awkward viewpoint close to the floor, where foreshortening and limited visible joints lead to a marginal confidence of 0.63 that is again filtered out. These examples confirm that the current model is still sensitive to illumination degradation, severe occlusion, and unconventional camera viewpoints, motivating the multimodal and temporal extensions discussed in the Discussion and Conclusions.
In future work, we will integrate depth sensing and temporal motion features to distinguish gradual intentional movements from abrupt fall transitions and utilize self-calibrating illumination normalization to mitigate lighting interference. This analysis provides clearer direction for enhancing real-world model reliability in diverse monitoring environments.
3.7. YOLOv5s-Based Elderly Fall Detection System (EFDS) Development
We developed the YOLOv5s-based Elderly Fall Detection System (EFDS) to provide an efficient, real-time, and user-friendly solution for monitoring and detecting fall incidents among the elderly. Designed with an emphasis on intuitiveness, conciseness, and informativeness, the system integrates advanced computer vision techniques with an interactive graphical user interface (GUI) built using the PyQt5 library. Its development aims to bridge the gap between high-performance deep learning models and practical usability, ensuring that both professionals and non-expert users can operate the system with ease. The EFDS interface combines visual clarity, responsive control mechanisms, and intelligent feedback to support accurate fall detection and timely intervention in elderly care environments.
Figure 6 shows the main interface of the EFDS. The layout was carefully structured into several key functional areas to enhance usability and interaction:
- The Model Input Source enables users to switch between input options, including “Camera,” “Video File,” “Folder,” and “Image,” with corresponding dialog boxes for file selection.
- The Detection Parameter Group, an optional advanced function, allows users to adjust the detection confidence and Intersection over Union (IoU) thresholds in real time, optimizing the balance between precision and recall.
- The System Operation and Exit Group contains the Run, Stop, and Exit buttons, which, respectively, initialize the video source and start detection, pause ongoing detection for frame inspection, and terminate the application.
- The Video Display Area forms the core of visual feedback, displaying processed video streams in real time, with detected human targets highlighted by bounding boxes labeled with detection results (e.g., “Normal” or “Fall”) and corresponding confidence levels.
- The Test Result Area records and displays the model’s output path and system operation logs in real time, supporting post-event analysis and enhancing system transparency.

During system testing, the application accurately handled camera invocation, video file reading, and operational controls such as start, pause, and exit. The improved YOLOv5s model was successfully loaded and performed real-time inference with a stable video stream processing rate of approximately 32 frames per second (FPS), fulfilling real-time performance requirements. Overall, the GUI design demonstrated a clear layout, straightforward control logic, and a high degree of user-friendliness, allowing non-expert users to operate the system efficiently. The integrated multi-level alarm mechanism effectively enhanced user attention during operation.
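For orientation, the sketch below outlines a minimal PyQt5 skeleton with the same functional areas; widget names and layout are illustrative, and the inference loop, parameter sliders, and alarm logic of the full system are omitted.

```python
import sys
from PyQt5.QtWidgets import (QApplication, QComboBox, QHBoxLayout, QLabel,
                             QPushButton, QTextEdit, QVBoxLayout, QWidget)

class EFDSWindow(QWidget):
    """Skeleton of the EFDS main window: input-source selector,
    run/stop/exit controls, video display area, and result log."""
    def __init__(self):
        super().__init__()
        self.setWindowTitle("Elderly Fall Detection System (EFDS)")
        root = QVBoxLayout(self)

        # Model input source group.
        self.source = QComboBox()
        self.source.addItems(["Camera", "Video File", "Folder", "Image"])
        root.addWidget(self.source)

        # System operation and exit group.
        controls = QHBoxLayout()
        for name, slot in (("Run", self.start), ("Stop", self.stop), ("Exit", self.close)):
            button = QPushButton(name)
            button.clicked.connect(slot)
            controls.addWidget(button)
        root.addLayout(controls)

        # Video display area: detection frames would be rendered onto this label.
        self.video = QLabel("Video Display Area")
        root.addWidget(self.video)

        # Test result area: output paths and operation logs.
        self.log = QTextEdit()
        self.log.setReadOnly(True)
        root.addWidget(self.log)

    def start(self):
        self.log.append("Detection started.")  # inference loop would be launched here

    def stop(self):
        self.log.append("Detection paused.")

if __name__ == "__main__":
    app = QApplication(sys.argv)
    window = EFDSWindow()
    window.show()
    sys.exit(app.exec_())
```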
3.8. Training Stability and Convergence Analysis
Figure 7 shows the training and validation convergence curves of the improved YOLOv5s model across 300 epochs, including loss, precision, recall, and mAP@0.5. The training losses (box, objectness, and classification) decrease steadily without oscillation, while validation losses converge after approximately epoch 210, at which point optimization is stable. Precision, recall, and mAP improve continuously before stabilizing at high levels, indicating good generalization; early stopping terminates training once validation improvement saturates, preventing overfitting. Additional implementation details, post-processing rules, and reproducibility information are provided in Appendix A.
3.9. Summary
The experimental results substantiate that the proposed improved YOLOv5s framework not only outperforms traditional CNN- and RNN-based fall detection systems but also competes favorably with more complex transformer-based architectures [22]. Unlike earlier models requiring high-end computation, this lightweight approach maintains a balance between speed, accuracy, and scalability. The CBAM-enhanced backbone captures subtle postural variations critical for distinguishing fall-like actions, while the anchor recalibration ensures accurate detection of horizontally oriented targets—a limitation often overlooked in general-purpose object detectors.
From a theoretical standpoint, this research extends the applicability of attention-based feature enhancement and anchor optimization to human activity recognition tasks involving non-standard poses. Practically, the findings offer valuable insights for developing edge-deployable, privacy-preserving fall detection systems that can be integrated into smart healthcare environments. In summary, the improved YOLOv5s model provides a technically efficient and context-aware solution for elderly safety monitoring, bridging the gap between laboratory research and real-world implementation.
4. Discussion
The experimental results demonstrate that the improved YOLOv5s framework achieved a significant performance enhancement over baseline models, confirming the effectiveness of the proposed architectural and logical optimizations. Specifically, the model attained a mean average precision (mAP@0.5) of 94.2%, a recall of 92.5%, and a precision of 93.8%, outperforming the original YOLOv5s by 6.8 percentage points in mAP and markedly reducing the false alarm rate to 4.2%. These results validate that integrating CBAM attention, optimizing multi-scale feature fusion, and re-clustering anchor boxes collectively enhance detection accuracy, robustness, and generalization across complex real-world environments.
This study builds upon and extends prior research that applied deep learning to fall detection. Traditional methods such as HOG + SVM and optical flow–based approaches are limited by handcrafted feature extraction and poor adaptability to variable lighting or occlusion [12]. In contrast, deep learning–based object detection algorithms such as YOLOv4 [15] and pose estimation models like HRNet [14] improved recognition capability but often traded precision against computational efficiency. Recent variants such as the improved YOLOv5s [16] and ECA-based architectures [18] demonstrated notable gains in accuracy but still faced difficulties handling horizontally oriented or small-scale fall targets. The present model effectively overcomes these deficiencies by explicitly redesigning the feature extraction and fusion pathways for fall-specific characteristics.
The ablation analysis further corroborates that each module contributed distinct benefits. The inclusion of CBAM enhanced feature saliency by directing the network’s attention to critical human body regions, yielding a 2.7-percentage-point improvement in mAP. The multi-scale fusion design addressed the scale-variance challenge in fall scenarios, improving recall by 4.8 percentage points. The anchor box re-clustering produced a 2.9-percentage-point mAP gain, aligning the detection anchors with the horizontal morphology of falls. The synergistic combination of all modules yielded an overall 6.8-percentage-point improvement in mAP, consistent with findings from [9,20], which emphasized the importance of cross-scale information flow and adaptive anchor design in lightweight detection frameworks.
To mitigate the additional computational costs of large augmented datasets, recent quality-aware dataset optimization techniques have been explored. These include selective augmentation based on sample difficulty scoring, curriculum-based augmentation scheduling that progressively refines data diversity, and efficient synthetic sample generation using lightweight generative models [23,24]. Such strategies enable improved robustness while maintaining real-time feasibility and will be integrated in our future work to further optimize model efficiency in deployment environments.
Qualitative visualizations reinforce these quantitative results. The improved YOLOv5s accurately identified falls across diverse contexts—close-range, occluded, and interference-rich scenes—demonstrating robustness under varying environmental conditions. However, as noted in the visualization analysis, performance degradation still occurs under extreme illumination and severe occlusion, suggesting opportunities for future work involving illumination-invariant features or multimodal fusion with infrared or depth sensors [25].
From a theoretical standpoint, this study advances the integration of attention mechanisms and scale-adaptive architectures within real-time safety monitoring systems. It bridges the methodological gap between general-purpose object detection and human activity recognition involving nonstandard postures, such as horizontal body orientations during falls. Practically, the results underscore the model’s suitability for edge computing environments, enabling lightweight deployment in hospitals, nursing homes, and smart living spaces while maintaining high detection reliability.
When compared with the problem framing in the Introduction, where traditional monitoring systems were criticized for blind spots and low compliance [5], the proposed framework directly addresses these limitations by offering a non-intrusive, camera-based alternative that combines accuracy and computational efficiency. The outcome confirms that the incorporation of context-aware visual attention and optimized anchor strategies can close the gap between laboratory performance and real-world application.
In addition, this study focuses on fall detection from a single camera perspective. Although it achieves good performance, it remains constrained by the inherent limitations of two-dimensional vision, such as severe occlusion and the misjudgment of actions viewed from particular angles. By deploying multiple cameras, a system can recover three-dimensional spatial information, fundamentally resolving the blind-spot problem of a single viewpoint and significantly reducing the false negatives and false positives caused by occlusion and misrecognition through cross-validation of multi-view data. Although this solution introduces higher system complexity and deployment costs, it is crucial for robust detection in complex scenarios and is therefore regarded as the core direction for advancing the technology toward high-demand clinical environments.
Despite the proposed framework’s favorable performance, the misclassification analysis and extreme-scene visualizations indicate that the model still struggles under low illumination, severe occlusion, and highly ambiguous body configurations. To address these limitations, several concrete improvements are planned. First, we will integrate multimodal sensing (e.g., RGB, infrared, and depth cameras) to reliably capture body contours and spatial occupancy, even when RGB images are degraded by lighting or shadows. Second, we will extend the present frame-based detector to a sequence-based framework by incorporating temporal motion modeling, enabling the system to distinguish between intentional slow movements (such as sitting or lying down) and abrupt, involuntary falls. Third, in high-risk clinical or institutional environments, we plan to deploy multi-camera configurations to recover three-dimensional spatial information and reduce blind spots caused by single-view occlusion. These directions are expected to substantially enhance robustness under extreme conditions while maintaining real-time feasibility on edge devices.
On the basis of high-precision detection, edge devices can push alarm information containing the fall location and time to a family member’s mobile app, or directly to the nursing-station display, through protocols such as low-power Bluetooth and 5G. The edge gateway can also be linked to smart speakers in the nursing home to broadcast the alarm, achieving a complete closed loop from early warning to rescue.
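As one possible realization, the sketch below publishes an alarm payload from the edge gateway over MQTT; the broker address, topic, and payload schema are assumptions for illustration, since the paper mentions Bluetooth/5G push channels but specifies no concrete protocol.

```python
import json
import time
import paho.mqtt.client as mqtt

def push_fall_alarm(camera_id: str, location: str, broker: str = "192.168.1.10"):
    """Publish a fall alarm with location and timestamp; subscribers
    (family app, nursing-station display, smart-speaker bridge) receive
    and escalate it. Topic and broker address are illustrative."""
    payload = {
        "event": "fall",
        "camera": camera_id,
        "location": location,
        "time": time.strftime("%Y-%m-%d %H:%M:%S"),
    }
    # paho-mqtt 1.x constructor; v2 additionally requires a CallbackAPIVersion argument.
    client = mqtt.Client()
    client.connect(broker, 1883, keepalive=60)
    client.publish("efds/alarms", json.dumps(payload), qos=1)
    client.disconnect()
```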
While the proposed framework demonstrates high accuracy and low false alarm rates in most practical environments, several limitations must be acknowledged. First, the model remains strongly dependent on RGB visual quality. Under conditions involving severe occlusion or rapid illumination changes, fall-related body features may be partially lost, leading to misclassification. Second, the current model infers from single-frame static images, which limits its ability to distinguish voluntary slow transitions (e.g., sitting or reclining) from abrupt fall events based on motion dynamics. Third, using a single camera introduces unavoidable blind spots, particularly when furniture or other obstacles obstruct visibility, which may reduce recall in cluttered indoor spaces.
Addressing these limitations requires technical enhancements that preserve real-time feasibility. Future work will incorporate multimodal sensor fusion (RGB + depth/infrared) to enhance resilience under adverse lighting conditions. We will also explore temporal modeling (e.g., lightweight 3D CNNs, TCNs, or transformer-based motion encoders) to leverage sequential motion cues for more fine-grained activity discrimination. In clinical and high-risk care environments, multi-camera 3D spatial reasoning will be implemented to mitigate visibility blind spots by ensuring cross-view consistency. These strategies can collectively improve robustness across diverse and challenging deployment scenarios.
5. Conclusions
This study proposed an improved YOLOv5s-based lightweight framework for elderly fall detection, integrating CBAM attention mechanisms, multi-scale feature fusion optimization, and fall-specific anchor box re-clustering. Experimental results confirmed substantial performance gains, with a mean average precision (mAP@0.5) of 94.2%, recall of 92.5%, and a false alarm rate of only 4.2%, outperforming existing YOLO-based and traditional vision models. The research contributes both theoretically and practically by addressing critical gaps in balancing detection accuracy, model compactness, and real-time feasibility—issues often neglected in prior studies [17,18]. Although the model exhibits minor limitations under extreme lighting or heavy occlusion, these constraints point toward future advancements through multimodal fusion and illumination-invariant feature extraction. Overall, this model not only enhances detection accuracy but also demonstrates the potential of deep learning in promoting autonomous elderly safety systems that balance performance, interpretability, and deployability across different operational environments. The proposed framework demonstrates a scalable and interpretable approach that bridges algorithmic development and practical implementation, promoting safer, more autonomous monitoring environments for the aging population and advancing the integration of AI-driven fall detection into real-world healthcare systems.
Future research will focus on integrating multimodal sensing, temporal motion learning, and multi-camera spatial constraints to address performance degradation in low-visibility and high-occlusion scenarios fundamentally. In addition, the system will be extended with privacy-preserving mechanisms, intelligent alarm management, and real-world deployment on edge hardware platforms in hospitals and nursing homes. Through these developments, the model can evolve into a clinically reliable, autonomous safety monitoring solution that supports rapid intervention and enhances the quality of care for aging populations.