Detecting Human Falls in Poor Lighting: Object Detection and Tracking Approach for Indoor Safety

Abstract: Falls are one of the leading causes of accidental death for all people, but the elderly are at particularly high risk. Falls are a severe issue in the care of elderly people who live alone and have limited access to health aides and skilled nursing care. Conventional vision-based systems for fall detection are prone to failure in conditions with low illumination. Therefore, an automated system that detects falls in low-light conditions has become an urgent need for protecting vulnerable people. This paper proposes a novel vision-based fall detection system that uses object tracking and image enhancement techniques. The proposed approach is divided into two parts. First, the captured frames are enhanced using a dual illumination estimation algorithm. Next, a deep-learning-based tracking framework that includes detection by YOLOv7 and tracking by the Deep SORT algorithm is proposed to perform fall detection. We evaluate the proposed method on the Le2i fall and UR fall detection (URFD) datasets and demonstrate the effectiveness of fall detection in dark night environments with obstacles.


Introduction
The World Health Organization ranks falls as the second leading cause of accidental death [1]. A fall is described as a rapid change from a normal state to a reclined or extended position of the whole body. It can be caused by discomfort or unsteadiness while standing [2]. Recent studies indicate that the death rate due to falls among the elderly is nearly three times higher than that of younger age groups [3,4]. Most older people, especially those living independently without a carer, fear severe injuries or even fatalities from falling because of delayed assistance. This demonstrates the importance of early warning and prompt management of falls. With practical and timely detection and warning mechanisms, the rate of severe injuries and fatalities from falls, especially among the elderly, can be expected to drop significantly [5,6].
Most current fall detection and warning systems employ sensors such as barometers, accelerometers, gyroscopes, and inertial sensors [7,8]. These sensors are usually embedded in wearable devices, such as smartwatches, shoes, and necklaces, to monitor the user's body parameters and detect a fall. However, the devices must be worn on the body, which is uncomfortable and makes constant wear impractical. Furthermore, sensor-based fall detection systems are expensive and raise data privacy and security concerns.
With the advent of vision-based systems, computer vision has witnessed a growing trend in its applications. Notably, a significant amount of research focuses on vision-based systems that monitor anomalies in body movement for fall detection [5,9-13]. Such methods can be divided into two categories based on the data used, namely, (1) RGB-based detection and (2) depth-based detection [14]. In this paper, we perform fall detection on an RGB-based dataset using a deep learning framework. Deep learning networks, in

• A novel deep-learning-based approach for vision-based fall detection is proposed, integrating YOLOv7 for object detection and the Deep SORT algorithm for tracking and trajectory analysis.

• The proposed method incorporates dual illumination estimation, utilizing a Retinex-based image enhancement algorithm, to effectively tackle the issue of inconsistent lighting conditions and exposure levels.

• The effectiveness and superiority of the proposed approach over current state-of-the-art methods are extensively demonstrated, offering a robust solution for vision-based fall detection.

Related Works
Elderly care requires continuous monitoring by health care staff, which is costly, time-consuming, and even considered an intrusive process that disrupts individuals' routines and privacy [33]. Thus, an automatic fall detection system is required in healthcare services to provide cost-effective, efficient, and timely care to the person affected. Researchers have explored many methods of fall detection to reduce the number of injuries caused by falls. These systems are broadly classified by the device used to detect the fall: wearable sensors, environment sensors, and vision-based sensors. In wearable devices, one or more sensors are placed on various parts of the body to distinguish falls from activities of daily living (ADL). These sensors detect any rapid increase in negative acceleration when the individual suddenly changes position from standing to lying down [34,35]. For environment-based sensors, acoustic [36], infrared [37], vibration [38], and pressure [39] signals are collected and analyzed to detect falls. These devices are less intrusive than wearable sensor-based devices and require less interaction with the individual, which reduces privacy and security concerns. However, environment-based sensors are prone to false alarms due to changes in the external environment. For example, different noises or other falling items in the room can impact the sensors' performance [40].
In recent years, the development of vision-based sensors has led to several computer vision-based approaches being proposed for fall detection. Vision-based approaches extract data from sensors that can be RGB-based, depth-based, thermal-based, or a combination of these. The extracted data is then used by computer vision algorithms to detect unusual changes in body trajectories, postures, or shapes and thereby identify falls. The most common approach in computer vision is based on hand-crafted feature extraction [9,11]. Sehairi et al. [11] apply a finite state machine (FSM) to a human silhouette extracted by background subtraction to determine whether a fall has happened. Albawendi et al. [9] perform fall detection based on hand-crafted features extracted from the projection histogram, motion information, and human shape variation. Recently, deep learning methods have gained popularity over hand-crafted feature-based methods because they can learn important features from high-dimensional data on their own.
Deep learning has been successfully applied in fall detection, demonstrating high performance. Lu et al. [12] combine a three-dimensional convolutional neural network (3D CNN) with a long short-term memory (LSTM) network in their fall prevention method. Han et al. [13] improve the fall detection speed using a two-stream approach with the lightweight MobileVGG. Recently, human skeleton joint coordinates extracted from RGB data have also been used to detect falls [10,28]. The authors of [10] propose a spatiotemporal network based on CNNs, GRUs, and fully connected layers to classify falls and ADL. Existing methods for fall detection thus rely on auxiliary tasks such as foreground and background separation or the extraction of human skeleton joint coordinates. In contrast, our proposed method uses spatiotemporal features from RGB data with a deep-learning-based tracking-by-detection approach.
A considerable amount of research has been conducted on tracking-by-detection approaches [41-44]. These methods use spatiotemporal data to extract relevant features and then use them as inputs to distinguish between the identified objects. Finally, a tracker follows each object through the video. Despite the promising results obtained by tracking-by-detection methods, they can still suffer from colour bias or complex underexposed conditions [45]. Therefore, this paper presents a two-stage fall detection framework. First, we perform exposure correction of the video frames using a dual illumination estimation method [46]. The method adaptively corrects the frames based on exposure conditions, including overexposed, underexposed, and partially over- and underexposed conditions. Second, we use the YOLOv7 algorithm [26] to detect the subject's activity and combine the appearance information from object-detection CNNs with a fast, improved Deep SORT [25] based multiple object tracking (MOT) method to extract motion features and compute trajectories.

Experimental Data
In this paper, we use two publicly available datasets: the Le2i Fall Detection dataset [47] and the UR Fall Detection dataset (URFD) [48]. The Le2i dataset is composed of falls captured by narrow-angle cameras. The dataset includes videos of various actors falling and not falling in different illumination scenarios. The videos depict real-world scenes of people falling while performing their daily activities. These scenes are captured in a variety of environments, including homes, workplaces, and pantries, which makes the sequences a realistic simulation of falls during everyday activities. The UR Fall Detection dataset [48] contains 60 videos taken by two cameras placed at different angles. Figure 1a,b shows some example video frames from the Le2i and URFD datasets, respectively.

Annotation
Our annotation method detects the person class in each frame using a pre-trained YOLOv5 model and stores the bounding box coordinates in a CSV table. In the next step, we examine each frame and manually correct the annotation to fall or upright, as shown in Table 1. Figure 2 shows the detection results of YOLOv5 and the manual post-correction.

Table 1 column headers: File Name, Xmin, Ymin, Xmax, Ymax, YOLOv5, Annotations
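The annotation table can be illustrated with a minimal sketch. It assumes that the "YOLOv5" column holds the raw detector label and the "Annotations" column holds the manually corrected label; that split, the file name, and the box values are illustrative assumptions, not taken from the dataset. The detector itself is not called here, so the sketch runs without model weights.

```python
import csv
import io

# Column names follow the Table 1 header; the row values below are made up.
COLUMNS = ["File Name", "Xmin", "Ymin", "Xmax", "Ymax", "YOLOv5", "Annotations"]


def write_annotations(rows, handle):
    """Write one CSV row per detected person box."""
    writer = csv.DictWriter(handle, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)


def apply_correction(row, corrected_label):
    """Return a copy of the row with the manually reviewed label filled in."""
    updated = dict(row)
    updated["Annotations"] = corrected_label
    return updated


# Hypothetical detection: the pre-trained detector only reports "person".
detection = {
    "File Name": "frame_0001.jpg",
    "Xmin": 120, "Ymin": 80, "Xmax": 260, "Ymax": 400,
    "YOLOv5": "person", "Annotations": "",
}
reviewed = apply_correction(detection, "upright")

buffer = io.StringIO()
write_annotations([reviewed], buffer)
csv_text = buffer.getvalue()
```

Keeping the detector output and the reviewed label in separate columns makes it possible to audit how often the manual pass changed a label.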


Proposed Framework
The proposed framework, shown in Figure 3, uses two stages to detect falls in low-light indoor environments, with an object detection model as the main component of the fall detection system. In the first stage, the captured video frames are pre-processed to improve their visual quality. This is accomplished using a dual illumination estimation-based exposure correction technique [40]. The goal of this step is to adjust the brightness, contrast, and detail of the video frames so that later stages can detect falls more accurately. In the second stage, a deep-learning-based approach for person detection and tracking is applied to the enhanced video frames. This is accomplished using people tracking by detection, in which an object detection model supplies the detections that a tracker links across frames. The algorithm detects when a person falls by analysing the motion patterns of the tracked objects. By improving the visual quality of the video footage before detection and tracking, the system improves the accuracy and reliability of fall detection in low-light indoor environments.
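The two-stage flow can be sketched as a small driver loop. The function names and the stand-in callables below are hypothetical; in the proposed system the three stages would be the exposure correction, the detector, and the tracker, none of which is reproduced here.

```python
def run_pipeline(frames, enhance, detect, track):
    """Apply enhancement, detection, and tracking to each frame in order."""
    results = []
    for index, frame in enumerate(frames):
        enhanced = enhance(frame)        # stage 1: exposure correction
        detections = detect(enhanced)    # stage 2a: per-frame detections
        tracks = track(detections)       # stage 2b: link detections over time
        results.append((index, tracks))
    return results


# Stand-in callables so the sketch runs without any model weights.
def fake_enhance(frame):
    return [min(1.0, value * 2.0) for value in frame]


def fake_detect(frame):
    return [{"box": (0, 0, 10, 10), "label": "upright"}]


def fake_track(detections):
    return [dict(det, track_id=1) for det in detections]


output = run_pipeline([[0.1, 0.2], [0.4, 0.5]], fake_enhance, fake_detect, fake_track)
```

Passing the stages in as callables keeps the enhancement step optional, which is convenient when comparing results with and without exposure correction.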


Object Detection
RetinaNet, proposed by Lin et al. [49] in 2018, consists of three parts: the ResNet backbone, the feature pyramid network (FPN), and the object classification and bounding box regression subnetworks. Deeper neural networks capture more information, and the ResNet backbone uses identity mappings to ensure that performance does not degrade as the network's depth increases. The FPN brings the benefits of multi-scale fusion by capturing coarse-to-fine semantic information, which provides more feature detail for small object detection and thus improves accuracy. The classification and regression sub-networks run in parallel on the outputs of the FPN module. The classification sub-network uses a sigmoid activation function to predict the class at each location. The regression sub-network outputs four bounding box coordinate values for each anchor. In object detection tasks [50-52], an imbalance between positive and negative samples is a major cause of classification difficulties. Focal Loss addresses this problem by down-weighting easy-to-classify samples so that training focuses on the difficult ones.
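The re-weighting performed by Focal Loss can be made concrete with a short sketch of its binary form, FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t). The values alpha = 0.25 and gamma = 2 are commonly used defaults and are an assumption here, since the settings used in the experiments are not stated.

```python
import math


def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss for one prediction.

    p: predicted probability of the positive class (after the sigmoid).
    y: ground-truth label, 1 for positive and 0 for negative.
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def cross_entropy(p, y):
    """Plain binary cross-entropy, for comparison."""
    p_t = p if y == 1 else 1.0 - p
    return -math.log(p_t)


easy = focal_loss(0.95, 1)  # well-classified positive sample
hard = focal_loss(0.10, 1)  # badly misclassified positive sample
```

The factor (1 - p_t)^gamma shrinks the loss of confident, correct predictions far more than that of misclassified ones; with gamma = 0 and alpha = 1 the expression reduces to ordinary cross-entropy.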

Redmon et al. [53] proposed You Only Look Once (YOLO), which treats object classification and bounding box regression as a single task. To accomplish this, the input image is divided into multiple grid cells, and the grid cell that contains an object's centre point is responsible for detecting that object. The YOLOv5 [54] and YOLOv7 [32] models used in this paper are optimized from YOLOv3 [55]. YOLOv7 uses the re-parameterization convolution method, which fuses a convolutional layer and its batch normalization into a single convolutional module and greatly improves the model's inference speed.
The main advantage of the one-stage models used in the proposed method is their higher computational efficiency and faster inference compared to two-stage models, which matters because vision-based fall detection systems need to run in real time. Furthermore, the one-stage models address the class imbalance problem in fall detection to improve the detection accuracy.
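The fusion of a convolution with its batch normalization can be checked numerically. The sketch below is a simplified illustration, not the YOLOv7 implementation: it uses a 1x1 convolution evaluated at a single pixel, with made-up parameters, and shows that folding the normalization scale and shift into the weights and bias gives the same output as running the two layers in sequence.

```python
import math


def conv1x1(x, weight, bias):
    """1x1 convolution at one pixel: a matrix-vector product plus bias."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def batch_norm(y, mean, var, gamma, beta, eps=1e-5):
    """Inference-time batch normalization with fixed running statistics."""
    return [g * (v - m) / math.sqrt(s + eps) + b
            for v, m, s, g, b in zip(y, mean, var, gamma, beta)]


def fuse_conv_bn(weight, bias, mean, var, gamma, beta, eps=1e-5):
    """Fold the batch-norm scale and shift into the convolution parameters."""
    fused_w, fused_b = [], []
    for row, b, m, s, g, be in zip(weight, bias, mean, var, gamma, beta):
        scale = g / math.sqrt(s + eps)
        fused_w.append([scale * w for w in row])
        fused_b.append(scale * (b - m) + be)
    return fused_w, fused_b


# Made-up parameters: 3 input channels, 2 output channels.
weight = [[0.2, -0.5, 0.1], [0.7, 0.3, -0.4]]
bias = [0.05, -0.1]
mean, var = [0.1, -0.2], [0.9, 1.5]
gamma, beta = [1.2, 0.8], [0.0, 0.3]
x = [1.0, 2.0, -1.0]

separate = batch_norm(conv1x1(x, weight, bias), mean, var, gamma, beta)
fused_w, fused_b = fuse_conv_bn(weight, bias, mean, var, gamma, beta)
fused = conv1x1(x, fused_w, fused_b)
```

Because batch normalization is an affine map at inference time, the fused layer does the work of two layers in one pass, which is where the speed-up comes from.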

Object Tracking
The Simple Online and Realtime Tracking (SORT) algorithm [30], proposed in 2016, is a straightforward and fast method for multiple object tracking (MOT). Object positions are propagated from one frame to the next using Kalman filtering, and the predicted positions are then matched to the new detections using the Hungarian algorithm [56]. The Deep SORT algorithm is an extension of the SORT algorithm that adds an appearance feature extraction network. The weights of this network are trained on a pedestrian dataset, which makes it well suited to human fall detection. By matching the extracted features to the object's nearest neighbours, it is well adapted for improving object tracking [57] and detection in obstacle-type environments [31]. The feature extraction network can significantly improve the Deep SORT algorithm's robustness to obstacles and object loss.
Combining object-tracking methods with fall detection systems is a promising approach, especially because most current vision-based studies do not account for the potential increase in false-negative detection rates when a person is occluded. Deep SORT helps address this issue by predicting the person's likely location in the next frame and calculating correlations, providing an early warning if a person becomes obscured from view. Incorporating this approach can improve the accuracy and reliability of fall detection systems, and thereby the safety and well-being of individuals at risk of falling.
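The predict-then-associate step can be sketched with three simplifications that are assumptions of this example. A constant-velocity shift stands in for the Kalman filter prediction, an exhaustive search over assignments stands in for the Hungarian algorithm, and the cost uses only box overlap (IoU) without the appearance features that Deep SORT adds. The box coordinates are made up.

```python
from itertools import permutations


def iou(a, b):
    """Intersection over union of two boxes given as (xmin, ymin, xmax, ymax)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def predict(box, velocity):
    """Constant-velocity prediction of a track's box in the next frame."""
    dx, dy = velocity
    return (box[0] + dx, box[1] + dy, box[2] + dx, box[3] + dy)


def associate(predicted, detections, min_iou=0.3):
    """Match tracks to detections by minimizing the total (1 - IoU) cost.

    Exhaustive search is only practical for a handful of boxes.
    Assumes len(predicted) <= len(detections).
    """
    best, best_cost = (), float("inf")
    for perm in permutations(range(len(detections)), len(predicted)):
        cost = sum(1.0 - iou(predicted[t], detections[d]) for t, d in enumerate(perm))
        if cost < best_cost:
            best, best_cost = perm, cost
    return [(t, d) for t, d in enumerate(best)
            if iou(predicted[t], detections[d]) >= min_iou]


# Two hypothetical tracks as (box, velocity) and two unordered detections.
tracks = [((10, 10, 50, 110), (5, 0)), ((200, 40, 240, 140), (0, 0))]
predicted = [predict(box, vel) for box, vel in tracks]
detections = [(201, 42, 241, 141), (16, 11, 56, 111)]
matches = associate(predicted, detections)
```

A track whose best match falls below `min_iou` is left unmatched; in a full tracker that is the situation in which the predicted position is carried forward while the person is occluded.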

Dual Illumination Estimation (DUAL)
Zhang et al. [40] proposed the dual illumination estimation algorithm for low-light image enhancement. This method is based on the core concept of Retinex: the colour of an object is determined by its ability to reflect light of three different wavelengths, a notion supported by colour constancy. Under this model, uneven lighting conditions do not affect the colour of the object, so the colour does not change whether the image is underexposed or overexposed. The illumination is therefore estimated in both the forward and reverse directions to recover a properly exposed image.
The objective is to recover the corrected image $I'$ by dividing each pixel value of the input image $I$ by the illumination map $L$. Inverting an overexposed image produces an underexposed image, so the overexposed portion of the input can be located and corrected by processing both the forward and the inverted image. The corrected image $I'$ is defined in Equation (1):
$$I' = I \times L^{-1} \tag{1}$$
For the inverted input, $I'_{inv}$ is calculated according to Equation (2), where $I_{inv} = 1 - I$ is the inverted image:
$$I'_{inv} = I_{inv} \times L_{inv}^{-1} \tag{2}$$
At this point, the illumination map $L_{inv}$ of the inverted image is estimated, and applying the underexposure correction to $I_{inv}$ yields the image $I'_{inv}$.
Before estimating the illumination of the image, the maximum channel value of each pixel in the three-channel RGB image is extracted. The initial illumination $L'$ is composed of these per-pixel maxima, as shown in Equation (3):
$$L'_p = \max_{c \in \{r,g,b\}} I^c_p \tag{3}$$
where $c$ is the colour channel, $p$ is the pixel, and $I^c_p$ is the value of colour channel $c$ at pixel $p$. The desired illumination map $L$ is obtained using the objective defined in Equation (4):
$$\arg\min_{L} \sum_p \left( (L_p - L'_p)^2 + \lambda \left( w_{x,p} (\partial_x L)_p^2 + w_{y,p} (\partial_y L)_p^2 \right) \right) \tag{4}$$
where $\lambda$ is a balancing coefficient and $w_{x,p}$ and $w_{y,p}$ are spatially varying smoothness weights.
In real-world conditions, the performance of deep learning models suffers when videos are captured under suboptimal lighting conditions caused by dim or uneven light. For example, in an indoor setting, a video captured at night has dark or underexposed regions due to insufficient illumination. This may result in the proposed systems failing to monitor individuals' activities, resulting in high false-negative rates. Figure 4 shows the results on Le2i FDD obtained by using the dual illumination estimation algorithm.
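The forward and inverted corrections can be illustrated on a tiny image. The sketch below is a simplified illustration under stated assumptions, not the full method: it uses the per-pixel maximum channel as the illumination directly, skips the smoothness refinement of Equation (4), and does not blend the two corrected images. The exponent `gamma` that softens the division and the floor `eps` are assumptions introduced for this example.

```python
def initial_illumination(pixel):
    """Initial illumination at a pixel: the maximum of its colour channels."""
    return max(pixel)


def correct_underexposure(image, gamma=0.6, eps=1e-3):
    """Divide each pixel by its gamma-adjusted illumination, clipped to [0, 1]."""
    corrected = []
    for row in image:
        new_row = []
        for pixel in row:
            illum = max(initial_illumination(pixel), eps) ** gamma
            new_row.append(tuple(min(1.0, c / illum) for c in pixel))
        corrected.append(new_row)
    return corrected


def invert(image):
    """Inverted image, 1 - I, applied per channel."""
    return [[tuple(1.0 - c for c in pixel) for pixel in row] for row in image]


def correct_overexposure(image, gamma=0.6, eps=1e-3):
    """Treat the inverted image as underexposed, correct it, and invert back."""
    return invert(correct_underexposure(invert(image), gamma, eps))


# A 1x2 RGB image with values in [0, 1]: one dark pixel and one bright pixel.
image = [[(0.10, 0.20, 0.05), (0.90, 0.95, 0.80)]]
brightened = correct_underexposure(image)
darkened = correct_overexposure(image)
```

Because the illumination never exceeds 1, the forward pass can only brighten a pixel and the inverted pass can only darken it, which is why the two passes target underexposed and overexposed regions respectively.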


Experimental Setup
This paper uses the RetinaNet, YOLOv5, and YOLOv7 object detection algorithms, implemented in the PyTorch framework, to evaluate the proposed approach. Experiments are conducted on a Tesla P100 GPU with 16,280 MB of video memory. The initial learning rate for the YOLO series models is set to 0.1 and gradually decreases to 0.01 during training, whereas RetinaNet uses a learning rate of 0.00025. As shown in Table 2, the batch size is set to 8 for YOLOv7 and RetinaNet and to 32 for YOLOv5. All the models are trained for 100 epochs each. For YOLOv7/w DUAL, we use the same model parameters as for YOLOv7. McNemar's significance test is used to statistically validate the performance of the object detection models. Next, the Deep SORT tracking algorithm is applied to the best performing object detection model. For evaluation, we use accuracy, mAP at an IoU threshold of 0.5, and precision.
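McNemar's test compares two models on the same test samples using only the cases on which they disagree. The sketch below uses the chi-square form with continuity correction; which variant of the test was used in the experiments is not stated, so this choice is an assumption, and the counts are hypothetical.

```python
import math


def mcnemar(b, c):
    """McNemar's chi-square test with continuity correction.

    b: samples model A classified correctly and model B incorrectly.
    c: samples model A classified incorrectly and model B correctly.
    Returns (statistic, p_value); the statistic has one degree of freedom.
    """
    if b + c == 0:
        return 0.0, 1.0
    statistic = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# Hypothetical disagreement counts, not taken from the paper's tables.
statistic, p_value = mcnemar(b=25, c=10)
```

A p-value below 0.05 would indicate that the two models' error patterns differ by more than chance; samples that both models classify correctly, or both misclassify, do not enter the statistic.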

Results
Table 3 shows the results of McNemar's significance test on the Le2i Fall Detection dataset (the p-values are reported relative to the best performing model); the results of YOLOv7 are statistically different from those of RetinaNet and YOLOv5. Table 4 shows the results of the different object detection algorithms on the Le2i Fall Detection dataset (Le2i FDD). YOLOv7 outperforms YOLOv5 and RetinaNet with an accuracy of 90.5% and an mAP of 0.966. RetinaNet performs the worst, with an accuracy of 59.02% and an mAP of 0.842. We then apply Deep SORT to the best performing object detection model, YOLOv7, which is trained and tested on enhanced video frames obtained by performing dual illumination estimation. The proposed method (YOLOv7 + Deep SORT/w DUAL) gives an accuracy of 94.5% and an mAP of 0.986, an improvement of 4.0 percentage points in accuracy over YOLOv7 alone. Moreover, YOLOv7 + Deep SORT/w DUAL is compared to the existing state-of-the-art (SOTA) methods on Le2i FDD. Table 4 shows that the proposed approach achieves the highest accuracy, outperforming the SOTA methods by Poonsri et al. [58] and Chamle et al. [59], which have fall accuracies of 91.38% and 79.41%, respectively. Poonsri et al. [58] and Chamle et al. [59] use background subtraction techniques. However, due to the insufficient illumination in the videos of Le2i FDD, the background subtraction used in these methods incorrectly extracts other objects as a human silhouette, which results in high false-positive rates for fall detection. The comparison with the traditional methods of Poonsri et al. [58] and Chamle et al. [59] is based on their annotated images and their reported results. Furthermore, the visual results of YOLOv7 and the proposed method are shown in Figure 5. When the environment is insufficiently illuminated, as shown in the first row of Figure 5, YOLOv7 + Deep SORT misclassifies the fall as upright. However, the proposed method correctly classifies it as a fall, as shown in the second row of Figure 5.

Table 5 shows the results of McNemar's significance test on the UR-Fall dataset: the performance of YOLOv7 differs significantly (p < 0.05) from that of RetinaNet, but it is not statistically different from that of YOLOv5. Table 6 shows the results of the different object detection algorithms on the UR Fall Detection dataset. The YOLO models achieve high accuracy and mAP, with YOLOv7 performing the best among all the models. However, RetinaNet did not perform well on the UR-Fall dataset, reporting the lowest accuracy of 40.9% as well as the lowest mAP of 0.464. As in the experiments on Le2i FDD, YOLOv7 is integrated with the Deep SORT tracking algorithm to report the results on the UR-Fall dataset. The proposed method outperforms all the other methods in terms of accuracy, mAP, and precision. As shown in the second row of Figure 6, the visual results on the UR-Fall dataset using the proposed method highlight the improved performance on the enhanced video frames compared to YOLOv7 + Deep SORT.

Conclusions
This paper proposes a vision-based fall detection system with an improved deep-learning-based tracking-by-detection method. The proposed method integrates dual illumination estimation with the YOLOv7 + Deep SORT tracking algorithm to enhance fall detection performance under suboptimal lighting conditions caused by dim or uneven light. The proposed method also incorporates exposure correction for fall detection in videos. The performance of the proposed method is validated on two fall detection datasets, namely, the Le2i FDD and UR-Fall datasets. In future work, we aim to implement a self-learning framework that automatically adapts to false alarms and incorporates corrected results to help current fall detection systems improve their performance.

Figure 1. Some example video frames from the publicly available datasets: (a) Le2i and (b) URFD.

Figure 3. Schematic representation of the proposed framework.

Figure 4. Examples of images from the Le2i Fall Detection dataset. (a) The original images. (b) The images after processing by the DUAL illumination estimation.

Figure 6. Visual results of object tracking on the UR Fall Detection dataset: (a) YOLOv7 + Deep SORT method; (b) the proposed method.

Table 1. Example bounding box coordinates and annotations.

Table 2. Details of the parameters of the models trained.

Table 3. p-Values of McNemar's significance test on the Le2i dataset. Here, p < 0.05 is statistically significant.

Table 4. Performance of the different state-of-the-art methods on the Le2i dataset.

Table 5. p-Values of McNemar's significance test on the UR-Fall dataset.

Table 6. Testing performance of the different models on the UR Fall Detection dataset.