eNightTrack: Restraint-Free Depth-Camera-Based Surveillance and Alarm System for Fall Prevention Using Deep Learning Tracking

Abstract: Falls are a major problem in hospitals, and physical or chemical restraints are commonly used to “protect” patients in hospitals and service users in hostels, especially elderly patients with dementia. However, physical and chemical restraints may be unethical, detrimental to mental health and associated with negative side effects. Building upon our previous development of the wandering behavior monitoring system “eNightLog”, we aimed to develop a non-contact restraint-free multi-depth camera system, “eNightTrack”, by incorporating a deep learning tracking algorithm to identify and notify about fall risks. The system was evaluated on 20 scenarios, with a total of 307 video fragments, and consisted of four steps: data preparation, instance segmentation with a customized YOLOv8 model, head tracking with MOT (Multi-Object Tracking) techniques, and alarm identification. Our system demonstrated a sensitivity of 96.8% with 5 missed warnings out of 154 cases. The eNightTrack system was robust to the interference of medical staff conducting clinical care in the region, as well as different bed heights. Future research should take in more information to improve accuracy while ensuring lower computational costs to enable real-time applications.


Introduction
Falls and their associated injuries pose significant challenges in hospitals, and healthcare institutions prioritize the delivery of safe, effective, and high-quality care to patients [1]. In the United States, it is estimated that there are between 0.7 and 1 million patient falls in hospitals, resulting in up to 250,000 injuries and 11,000 deaths [2]. Falls are a major safety concern and account for over 84% of all adverse incidents that occur in hospitals [3]. Nearly half of these falls occur in close proximity to the patient's bed [4]. Approximately 33% of hospital falls result in injuries, and 4 to 6% of these are severe enough to cause additional health problems, such as fractures and subdural hematomas, and even death [5]. Consequently, many hostels and hospitals have resorted to using restraints as a precautionary measure [6].
Physical restraint involves the use of devices or equipment to restrict an individual's movement, and these cannot be removed by the person themselves [7]. Physical restraints are employed to ensure the safety of individuals using medical devices and to manage aggressive or agitated behaviors [8][9][10]. Patients with dementia or cognitive impairments are often physically restrained to prevent harm to themselves or others [11]. Chemical restraints can also be used to achieve similar effects to physical restraints in clinical settings [12].
However, there is much debate surrounding the use of both physical and chemical restraints. Restraining individuals can prevent them from fulfilling basic needs like accessing water and using the restroom, which is unethical and detrimental to their mental health. Moreover, serious accidents such as strangulation can occur when individuals are restrained [13]. Those who have been restrained have reported experiencing unpleasant emotions and psychological damage, including hopelessness, sadness, fear, anger, and anxiety [14]. Adverse health effects of restraint include respiratory problems, malnutrition, urinary incontinence, constipation, poor balance, pressure ulcers, and bruises [15]. Similarly, chemical restraints also have negative consequences. For instance, antipsychotic medications can cause drowsiness, gait disturbances, chest infections, and other adverse effects. Some medications can also impact nutrition absorption, increasing the risk of hospitalization. Furthermore, a 1.7-times higher mortality rate over a two-year period has been observed, and the incidence of severe cerebrovascular events is approximately doubled with the use of antipsychotic medications [16]. Considering these drawbacks, there is a significant demand for alternatives to physical and chemical restraints.
With advancements in sensor and remote sensing technology, virtual restraints, such as infrared photodetectors [17], pressure sensors [18], wearable equipment [19], and associated telehealth items [20], are increasingly being used as alternatives to physical and chemical restraints. The current approaches for human action recognition using RGB-D data can be categorized into three groups based on the type of data modality employed: depth-based methods, skeleton-based methods, and hybrid feature-based methods [21]. Muñoz et al. created an RGB-D-based interactive system for upper limb rehabilitation [22]. However, it was limited due to the extra computational cost of RoI detection using whole depth sequences [21]. There are several studies of multi-modal interactive frameworks. For example, Avola et al. proposed a system to establish a connection between the activities initiated by the user and the corresponding reactions from the system [23]. By processing data in diverse modalities such as RGB images, depth maps, sounds, and proximity sensors, the system actively achieves real-time correlations between outcomes and activities [23]. Moreover, instrumented mattresses embedded with sensors may not be cost-effective or feasible in clinical settings, and wearable devices can present compliance issues, particularly for patients with dementia and agitated behaviors [24]. The Bed-Ex occupancy monitoring system utilizes weight-sensitive sensor mats attached to the bed to detect when a patient leaves. An alert is triggered on the inpatient ward and the central nursing station when a certain loss of weight is detected [25]. However, this type of virtual restraint system functions as a threshold decision system that only detects danger when people exit the bed. Falls can still occur when people engage in risky behaviors on the bed, such as leaning or hanging on the railings, because the weight sensor may still register the patient's weight. Therefore, a virtual restraint technique capable of continuously tracking the users' state is needed, rather than simply activating an alarm once they exit. The presence of caregivers performing routine services around the bed can introduce distortion to the sensing system, leading to false alarms. Infrared fences and pressure mats need to be turned off before conducting services, which increases the risk of forgetting to restart them and causing misalignment.
In terms of wearable sensors, an IMU (Inertial Measurement Unit) is an electronic device designed to be worn on a specific part of the body. It combines accelerometer and gyroscope sensors, and sometimes a magnetometer, to measure acceleration, angular rate, and, where a magnetometer is present, the magnetic field in the vicinity of the body [26]. The recent progress in MEMS (Micro-Electro-Mechanical Systems) technology has enabled the development of smaller and lighter sensors, allowing for continuous tracking of human motion and device orientation [27]. Electromyography (EMG) has found extensive application in human-machine interaction (HMI) tasks. In recent times, deep learning techniques have been utilized to address various EMG pattern recognition tasks, including movement classification and joint angle prediction [28]. Deep learning has gained significant popularity in EMG-based HMI systems.
However, most studies have primarily focused on evaluating offline performance using diverse datasets. It is crucial to give due consideration to online performance in real-world applications, such as prosthetic hand control and exoskeleton robot operation [29]. Similarly, in indoor applications, such as hospitals and hotels, Wi-Fi is considerably more practical than video or wearable technology. Jannat et al. developed a Wi-Fi-based human activity recognition method using adaptive antenna elimination, which required minimal computational resources to distinguish falls from other human activities based on machine learning [30]. Wang et al. [31] introduced a model called WiFall that can detect falls in elderly individuals and monitor certain activities. WiFall utilizes Channel State Information (CSI) for wireless motion detection. The model employs machine learning algorithms to learn patterns in CSI signal amplitudes. Initially, a Support Vector Machine (SVM) is used to extract features, and Random Forest (RF) is applied to enhance the system's performance. The approach achieves a detection precision of 90% with a false-alarm rate of 15% when using the SVM classifier. When the RF algorithm is employed, the accuracy is further improved and the false-alarm rate is reduced. However, it is important to note that this approach focuses on monitoring a single individual's motion. Overall, the academic community is also highly engaged in exploring innovative sensors for human activity recognition (HAR) and human behavior recognition (HBR), which involves novel sensors for HAR/HBR, creative designs and usages of traditional sensors, the utilization of non-traditional sensor categories that are applicable to HAR/HBR, etc. [32].
Machine learning (ML) methods, including Support Vector Machines (SVMs), Random Forest (RF), and Hidden Markov Models (HMMs), show significant potential for fall detection [33]. These models rely on handcrafted features and require extensive feature engineering [34]. HMMs are based on the concept of a Markov process, which is a stochastic process with the property that the future state depends only on the current state and not on the past states [33]. Liang et al. introduced an alarm system that utilizes an HMM-based SVM [35]. The model was trained and evaluated using a dataset consisting of 180 fall instances. Liu et al. proposed an innovative method for human activity recognition that involves partitioning activities into meaningful phases called motion units, similar to phonemes in speech recognition [36]. Hartmann et al. developed and assessed a concise set of six high-level features (HLFs) on the CSL-SHARE and UniMiB SHAR datasets [37]. They demonstrated that HLFs can be effectively extracted using ML methods, allowing for activity classification across datasets, even in imbalanced and limited training scenarios. Additionally, they identified specific HLF extractors responsible for classification errors.
However, deep learning (DL) models, in particular convolutional neural networks (CNNs) [38,39], recurrent neural networks (RNNs) [40], and their variations, have emerged as improved fall detection methods. The capacity of DL models to automatically extract pertinent features from unprocessed sensor data eliminates the need for manual feature extraction, which allows for quicker and more precise detection. Carneiro et al. employed high-level handcrafted features, including human pose estimation and optical flow, as inputs for individual VGG-16 classifiers [38]. Kasturi et al. developed a visual-based system that utilizes video information captured via a Kinect camera [39]. Multiple frames from the video are stacked to form a cube, which is fed into a 3D CNN. The 3D CNN effectively incorporates spatial and temporal characteristics, encoding both appearance and motion across frames [39]. Hasan et al. introduced a system for fall detection utilizing video data, employing a recurrent neural network (RNN) with two layers of Long Short-Term Memory (LSTM) [40]. The approach involved performing 2D pose estimation using the OpenPose [41] algorithm to provide body joint information. The extracted pose vectors were then input into a two-layer LSTM network, enabling fall detection.
We endeavored to identify bed exiting and other dangerous activities, which would serve as a measure to prevent falls and injuries and be more effective and desirable. Previously, we developed a depth camera and ultrawideband radar system, named "eNightLog", to monitor and classify the night wandering behaviors of older adults. We demonstrated that it outperformed the integrated pressure mattress and infrared fence system [42] and was effective in managing wandering behaviors in a field test [43]. We then developed deep learning models for the depth camera [44,45] and ultrawideband (UWB) radar [46,47] to better classify sleep postures. However, eNightLog had some limitations. It could not distinguish between users and medical staff, especially when they were taking care of the users. False alarms also happened when the bed was raised over the threshold height of the patient's head or the patient's service table height changed.
In view of this, we aimed to optimize the existing eNightLog system and extend its functions to detect potential fall events. We dubbed this new system "eNightTrack" as the successor of our previous system with enhanced tracking functions.

Materials and Methods
This section is composed of 6 subsections. Section 2.1 describes the data collection protocols. Section 2.2 shows the system setup of data collection. Section 2.3 explains the procedure of pre-processing to accommodate the format of input to the instance segmentation model. Section 2.4 describes instance segmentation based on the YOLOv8 model. Section 2.5 explains the head-tracking techniques. Finally, Section 2.6 presents the algorithm for raising an alarm when a user is in danger of falling. Figure 1 presents the overall structure and development of the eNightTrack system.

Data Collection
Twenty-four nurses (n = 24) from a local hospital participated in a role-play activity simulating fall-risk-related scenarios for data collection. The nurses reported no physical disabilities or chronic diseases. They were divided into 8 teams of 3 members. One member acted as a patient and the others as nurses. During the data collection process, the nurses performed their routine duties on the ward. The "patient" and "nurses" performed different activities according to the protocol listed in Table 1. The simulation was performed under the guidance and advice of nursing school instructors. Also, the protocol was built according to their real-world experience in the hospital after several experiment preparation meetings. Figure 2 illustrates screenshots of some scenarios taken. The Human Subjects Ethics Sub-committee of Hong Kong Polytechnic University approved the study (reference no. HSEARS20210127007). Written and oral descriptions of the experimental procedures were offered to all participants, and informed consent was obtained from all participants.


Sc05. Nurse helping adjust position scenario: nurse pulls sheets up to help patient to adjust their sleeping position.
Sc06. Kneeling on rear edge of bed scenario: patient kneels on the bed at the rear edge.
Sc07. Adjusting bed level scenario: nurse/patient adjusts the level of the bed from lying to sitting and raises the level of the bed and returns it to the original position.
Staying in bed: Yes
Sc08. Picking up belongings scenario: patient leans over the bed rail to look for personal belongings at the bottom of locker.
Sc09. Nurse helping turn scenario: nurse helps patient to turn and places a pillow for support.
Staying in bed: Yes
Sc10. Pillow mimicking scenario: patient exits bed when a supporting pillow similar to a human shape is still on the bed.
Sc11. Changing position scenario: patient changes from a lying to sitting position.
Sc12. Climbing exiting scenario: patient climbs over bed rails and leaves.
Sc13. Pushing table scenario: patient pushes table towards the rear end of bed.
Sc14. Leaning scenario: patient climbs over rail and leans their upper body out to pick up items.
Sc15. Drinking scenario: patient searches for personal belongings on top of the locker (only reaching hand out to pick up a cup of water).
Sc16. Sliding under the blanket scenario: patient slides under the blanket at the rear end of bed and leaves.
Sc17. Use of urinal scenario: male patient sits near the edge of the bed and uses urinal for voiding.
Staying in bed: No
Sc18. Leaning forward scenario: patient leans forward when sitting at the edge of bed.
Sc19. Use of bedpan scenario: patient uses bedpan in bed.
Staying in bed: Yes
Sc20. Sliding scenario: patient slides to the rear end of the bed and leaves without blanket.

System Setup
Three infrared stereo-based red-green-blue (RGB) depth cameras (Realsense D435i, Intel Corp., Santa Clara, CA, USA) were positioned to capture the entire scenario simulation process using the RealSense Software Development Kit (SDK) platform in a clinical teaching room. We used the D435i because its small minimum detection depth (minimum Z) of 28 cm ensured that the user could still be detected when standing up, as the depth cameras were installed 1.5 m above the bed. The 1.5 m height of the depth camera above the bed ensures a reasonable field of view to observe the patient's movements in bed and whether he or she left the bed. The data were transmitted and processed on a personal computer. Figure 3 presents the setup of the experiment equipment. Depth cameras were adopted to avoid ethical issues, as an RGB camera would capture the real scene and human appearances. Generally, the middle depth camera obtains the best performance individually compared to the others at the two ends [48]. Therefore, we used the data from the middle depth camera to conduct an initial experiment in this project. Multiple depth cameras provide a wider field of view, allowing for a more comprehensive monitoring of patients' movements that could be particularly beneficial in scenarios where a patient's movements are not confined to a single viewpoint. In the future, the data from the three depth cameras will be combined for 3D reconstruction to avoid line-of-sight issues.

Data Preprocessing
The dataset must be preprocessed before being used for instance segmentation. The workflow of data preprocessing is shown in Figure 4. The raw data (in bag format) covered 20 scenarios, comprising 10 negatives (patient staying in bed) and 10 positives (patient leaving bed area). The definition of each scenario is presented in Table 1. In total, 307 individual videos were successfully obtained after the original bag files were clipped at a frame rate of 6 frames/s, and the number of videos for each scenario is also displayed in Table 1. There were supposed to be an equal number of positive and negative video clips, but one of the positive samples was abnormal due to a power issue and was adopted for testing. To reduce the computation cost of achieving real-time analysis, the clipped data needed to be compressed before further processing. As the input of the instance segmentation model is in png file format, the bag format files of different scenarios were converted to mp4 files. In our project, the patient's head movement was taken as representative of patient movement because the head can be detected more easily most of the time, while other parts of the body may be covered by the quilt. Moreover, if the head is detected out of the bed, it is almost certain that the patient is at risk of falling.

Generally, thousands of samples are required for training most instance segmentation models; therefore, frames were extracted every 40 frames to cover the various poses of patients and 2127 png images were obtained in total. Before polygon labelling, the acquired images needed to be manually filtered. Images were disqualified and excluded if (a) the head was indistinguishable from the background; (b) the head was blocked; or (c) no head was in the scene (patient leaving). Polygon labelling was implemented on the online cloud platform named Roboflow where two classes, 0 for the head of the patient and 1 for the head of medical care personnel, were labelled.
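The sampling and exclusion rules described above can be illustrated with a minimal sketch. It assumes that the frames of a clip are indexed from 0 and that the manual screening outcome is recorded as three boolean flags per frame; the names `FrameReview`, `sample_frame_indices`, and `keep_for_labelling` are hypothetical and are not taken from the study's code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FrameReview:
    """Outcome of the manual screening of one extracted frame (hypothetical record)."""
    index: int
    head_indistinct: bool = False  # (a) head indistinguishable from the background
    head_blocked: bool = False     # (b) head blocked
    no_head: bool = False          # (c) no head in the scene (patient leaving)


def sample_frame_indices(total_frames: int, step: int = 40) -> list[int]:
    """Return the indices of every `step`-th frame, starting from frame 0."""
    if total_frames < 0 or step <= 0:
        raise ValueError("total_frames must be >= 0 and step must be > 0")
    return list(range(0, total_frames, step))


def keep_for_labelling(review: FrameReview) -> bool:
    """A frame is kept only if none of the three exclusion criteria applies."""
    return not (review.head_indistinct or review.head_blocked or review.no_head)


if __name__ == "__main__":
    reviews = [FrameReview(i) for i in sample_frame_indices(200)]
    print([r.index for r in reviews if keep_for_labelling(r)])
```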
The labels for each png sample were stored in a txt file, and the output format for a single row in the segmentation data is '<class-index> <x1> <y1> <x2> <y2> ... <xn> <yn>'. In this format, <class-index> represents the index of the class assigned to the object, and <x1> <y1> <x2> <y2> ... <xn> <yn> denote the coordinates of the polygon vertices bounding the object's segmentation mask. Once polygon labelling was finished, data augmentation, such as horizontal and vertical flips, was applied to make the model insensitive to object orientation and improve generalization. The training dataset is typically the largest subset, to support generalization of the model while leaving enough cases for validation and testing. A validation dataset is used to fine-tune the hyperparameters, monitor performance during training, and prevent overfitting. A relatively small ratio is used for the validation set. The test dataset is usually the smallest and is used for the final evaluation of the trained model, measuring its generalization capability. We originally selected a train/valid/test ratio of 7:2:1 on the Roboflow labelling platform. The system automatically suggested the final ratio after data augmentation, when more augmented data were included in the training dataset. Finally, 2552 images were obtained and were split as follows: training dataset 2152, 84.3%; validation dataset 266, 10.4%; and testing dataset 134, 5.3%.
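To make the label row format concrete, the following minimal sketch parses one row into a class index and a list of polygon vertices. It assumes that the coordinates are normalized to the range 0 to 1 relative to the image width and height, as is conventional for YOLO segmentation labels; the text above does not state this explicitly. The function names are hypothetical.

```python
def parse_segmentation_row(row: str) -> tuple[int, list[tuple[float, float]]]:
    """Parse '<class-index> <x1> <y1> ... <xn> <yn>' into (class, polygon)."""
    parts = row.split()
    # One class index plus at least three (x, y) pairs, so an odd count >= 7.
    if len(parts) < 7 or len(parts) % 2 == 0:
        raise ValueError("expected a class index followed by >= 3 x/y pairs")
    class_index = int(parts[0])
    coords = [float(value) for value in parts[1:]]
    polygon = list(zip(coords[0::2], coords[1::2]))
    return class_index, polygon


def to_pixels(polygon: list[tuple[float, float]],
              width: int, height: int) -> list[tuple[float, float]]:
    """Scale normalized vertices to pixel coordinates (assumes 0..1 inputs)."""
    return [(x * width, y * height) for x, y in polygon]


if __name__ == "__main__":
    # Class 0 is the patient's head and class 1 a medical staff head.
    cls, poly = parse_segmentation_row("0 0.25 0.5 0.75 0.5 0.5 1.0")
    print(cls, to_pixels(poly, width=640, height=480))
```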

Instance Segmentation
The YOLOv8 model builds on the success of earlier YOLO iterations and incorporates additional capabilities and enhancements that significantly increase its performance and versatility [49]. The pre-trained weight model "yolov8-seg.yaml" from GitHub was used as the initial model. Rather than splitting the task into two phases like Faster R-CNN, which first detects regions of interest before recognizing items in those areas, algorithms such as Single-Shot Detector (SSD) and You Only Look Once (YOLO) focus on locating every item in the image in a single forward pass [49][50][51]. Faster R-CNN is a comparatively slow detector that is unsuited to real-time tasks, although it offers slightly better accuracy when real-time processing is not necessary [52,53]. SSD is simpler than methods that require region proposals because it completely eliminates the proposal generation phase and the subsequent pixel or feature resampling phase, encapsulating all computations in a single network [50]. Therefore, YOLO is more suitable for our application to achieve real-time segmentation. To avoid the disturbance created by medical personnel or visitors entering the identification area, our customized YOLO model should be able to distinguish the head of a patient from the heads of non-patient participants. Thus, instance segmentation is preferred over object detection as it takes more morphological information into consideration to identify similar objects. The training phase adopted Python 3.10.11, torch 2.0.0, 100 epochs, a batch size of 16, a learning rate of 0.01, momentum of 0.937, and patience of 50. The computer (Centralfield Computer Ltd., Hong Kong, China) used for training ran the Windows 10 Education operating system (Microsoft Co., Redmond, WA, USA) and had 32 GB of RAM, a 2.1 GHz Intel® Core™ i7-12700 processor with 12 cores, and a 2 TB solid-state drive.
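As an illustration only, the reported training settings could be passed to the Ultralytics Python package roughly as sketched below. That this package was the training interface is an assumption; the text only states that the initial model was obtained from GitHub. The dataset file and initial model arguments are placeholders supplied by the caller, and `train_head_segmenter` is a hypothetical helper name.

```python
# Hyperparameters as reported in the text, mapped onto Ultralytics argument
# names (lr0 is the initial learning rate in that package).
TRAIN_ARGS = {
    "epochs": 100,
    "batch": 16,
    "lr0": 0.01,
    "momentum": 0.937,
    "patience": 50,  # early-stopping patience in epochs
}


def train_head_segmenter(dataset_yaml: str, initial_model: str):
    """Train a two-class head segmentation model (sketch, not the authors' script).

    `dataset_yaml` is a placeholder path to a dataset description file and
    `initial_model` a placeholder for the starting model file.
    """
    # Imported lazily so that this module loads without the package installed.
    from ultralytics import YOLO

    model = YOLO(initial_model)
    return model.train(data=dataset_yaml, **TRAIN_ARGS)
```

Calling `train_head_segmenter` requires the `ultralytics` package and a prepared dataset; the dictionary itself can be inspected without either.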

Multi-Object Detection
MOT (Multi-Object Tracking) is based on object detection and object re-identification (ReID) [54,55]. Three different MOT techniques, StrongSORT [54], ByteTrack [56,57], and DeepSORT [58], were adopted and compared based on their tracking performance. DeepSORT was one of the original methods to use a deep learning model for MOT. It is usually selected because of its generalization and effectiveness [59]. Although its tracking paradigm remains valuable, the performance of DeepSORT is no longer competitive due to its outmoded techniques. StrongSORT was developed using the fundamental elements of DeepSORT and advanced components. For instance, Faster R-CNN was applied in DeepSORT while YOLOX-X was chosen for StrongSORT. Also, a superior appearance feature extractor, BoT (Bottleneck Transformer) [60], was selected rather than a simple CNN [54]. ByteTrack is a tracking method based on the tracking-by-detection paradigm, for which a simple and efficient data association method called BYTE was proposed. The vital difference between it and other tracking algorithms is that it does not simply remove the low-score detection results but associates every detection box [57]. By using the similarity between the detection boxes and the trajectories, the background can be removed from the low-score detection results while real objects among them (difficult samples such as occluded or blurred objects) are recovered, thus reducing missed detections and improving the tracking coherence [56]. The MOT method with the best performance in this section is applied in Section 2.6 Alarm Identification.
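The two-stage idea behind BYTE can be sketched in a few lines. This is a simplified illustration and not the ByteTrack implementation: it uses greedy IoU matching in place of the Hungarian algorithm, omits Kalman-filter motion prediction and the lower score floor, and the threshold values are assumptions. All names are hypothetical.

```python
from dataclasses import dataclass

Box = tuple[float, float, float, float]  # (x1, y1, x2, y2)


@dataclass(frozen=True)
class Detection:
    box: Box
    score: float


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def greedy_match(tracks: dict[int, Box], dets: dict[int, Detection],
                 min_iou: float) -> dict[int, int]:
    """Greedily pair tracks with detections in order of decreasing IoU."""
    pairs = sorted(((iou(tb, d.box), t, i)
                    for t, tb in tracks.items() for i, d in dets.items()),
                   reverse=True)
    matches: dict[int, int] = {}
    used: set[int] = set()
    for overlap, t, i in pairs:
        if overlap < min_iou:
            break
        if t not in matches and i not in used:
            matches[t] = i
            used.add(i)
    return matches


def byte_associate(tracks: dict[int, Box], detections: list[Detection],
                   high_score: float = 0.5, min_iou: float = 0.3):
    """Return (track -> detection matches, new-track detections, discarded)."""
    high = {i: d for i, d in enumerate(detections) if d.score >= high_score}
    low = {i: d for i, d in enumerate(detections) if d.score < high_score}
    # Stage 1: match all tracks against high-score detections.
    first = greedy_match(tracks, high, min_iou)
    # Stage 2: match leftover tracks against low-score detections, which
    # recovers occluded or blurred objects instead of dropping them.
    leftover = {t: b for t, b in tracks.items() if t not in first}
    second = greedy_match(leftover, low, min_iou)
    new_tracks = sorted(i for i in high if i not in first.values())
    discarded = sorted(i for i in low if i not in second.values())  # background
    return {**first, **second}, new_tracks, discarded
```

In this sketch, an unmatched low-score detection is treated as background, whereas an unmatched high-score detection starts a new track.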
A widely used method for evaluating the performance and generalizability of a machine learning model is 5-fold cross-validation. The basic concept was to split the original dataset into five parts of equal size, of which four were used to train the model and one was used to validate it. After carrying out this process five times, the final assessment results were calculated by averaging the outcomes of the five performance evaluations.
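The 5-fold procedure described above can be sketched in a few lines. This is an illustrative sketch only: the function names are hypothetical, and whether the split was made per image, per video fragment, or per scenario is not stated in the text.

```python
import random
from typing import List, Tuple


def k_fold_splits(n_samples: int, k: int = 5, seed: int = 0) -> List[Tuple[List[int], List[int]]]:
    """Return k (train_indices, validation_indices) pairs.

    Indices are shuffled once, dealt into k folds of (nearly) equal size,
    and each fold serves as the validation set exactly once.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, validation))
    return splits


def mean_score(scores: List[float]) -> float:
    """Average the per-fold evaluation results."""
    return sum(scores) / len(scores)
```

Each fold would be trained and validated separately, and `mean_score` would then be applied to the five validation results.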

Alarm Identification
The algorithm should trigger an alarm when the head center of a patient goes beyond the dynamically defined safe region, which varies with the head height, as shown in Figure 5. The detected label information (head) of the tracked frames is output as a text file that can be accessed before further processing. Because the size of an object as viewed by the camera decreases as its distance from the lens increases, the safe region defined to limit patient movement should be adjusted according to the patient's head height, as shown in Figure 6. The bed height is the distance from the camera to the bed, and the head height is the distance from the head center to the bed. The initial state is assumed to be a patient lying in bed, at which point medical personnel manually initiate the tracking and alarm program, as depicted in Figure 7. The scale of the dynamic safe region, defined in Equation (1) in terms of the bed height Bh and the head height Hh, is used to calculate an instant safe region relative to the initial condition. The tolerable head height for patient safety was set to 1 m in our experiment, corresponding to the patient in a sitting position, and was used to limit the maximum safe region; the maximum scale of the safe region (Equation (2)) is obtained from Equation (1) at this head height. The patient was regarded as exiting the safe region when the head center was beyond the safe region, as shown in Figure 8, where the head center was defined as the 5% highest region in the detected head bounding box.
The algorithm flowchart for alarm identification is illustrated in Figure 9. When the subject or patient was detected and was in the safe region, there was no need to raise an alarm. When the subject was not in bed but had once been detected in bed, this was interpreted as the subject exiting the bed, and the alarm was raised if the duration exceeded the tolerant time. However, if the patient was never detected in the scenario, all the recorded parameters were reset as for a new video. When tracking of the subject was lost and they had not previously been detected in bed, it was assumed that no person was visible in the field of view and no alarm was required. If tracking of the subject was lost and they had previously been detected as not in bed, with a previous warning, this frame was regarded as a warning state. When the subject had previously been in bed and tracking was lost without a previous warning, the movement value was used to determine whether the patient was still in the safe region.
As displayed in Figure 10, the movement indicator diagram was designed using the movement value calculated by
$$\text{movement} = \sum_{i \,\in\, \text{valid pixels}} \text{depth}[i], \qquad \text{depth}[i] = \begin{cases} 0, & \text{when below the noise threshold} \\ \text{depth}[i], & \text{when above the noise threshold} \end{cases} \tag{3}$$
where the valid pixels are those pixels for which the depth camera is able to retrieve depth information (the left-side parts of Figures 7 and 8 contain black regions that are invalid pixels), and depth[i] is the difference between adjacent frames at pixel i. The threshold of movement was determined by the movement value of the frames when no patient was in bed, which here was 100,000.0. If there is no movement, the movement view diagram should theoretically be all black, as in Figure 10a. The red frames in Figure 10 mark the bed edge.
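A minimal sketch of how a height-dependent safe region could be computed is given below. It assumes an ideal pinhole depth camera looking straight down, under which an object at height Hh above a bed that lies Bh below the camera appears enlarged by Bh / (Bh - Hh). This form is an assumption made for illustration and may differ from the authors' Equations (1) and (2); the function names and the choice to scale the region about its own center are also hypothetical.

```python
from typing import Tuple

Rect = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def safe_region_scale(bed_height_m: float, head_height_m: float,
                      max_head_height_m: float = 1.0) -> float:
    """Scale factor of the safe region under a pinhole-camera assumption.

    bed_height_m:  camera-to-bed distance (Bh).
    head_height_m: head-center-to-bed distance (Hh).
    The head height is clamped to the tolerable maximum (1 m in the text),
    which bounds the largest safe region.
    """
    if bed_height_m <= 0:
        raise ValueError("bed height must be positive")
    clamped = min(max(head_height_m, 0.0), max_head_height_m)
    if clamped >= bed_height_m:
        raise ValueError("head height must be smaller than the bed height")
    return bed_height_m / (bed_height_m - clamped)


def scale_region(region: Rect, scale: float) -> Rect:
    """Scale a rectangle about its own center (a simplifying assumption)."""
    x1, y1, x2, y2 = region
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    half_w, half_h = (x2 - x1) / 2.0 * scale, (y2 - y1) / 2.0 * scale
    return (cx - half_w, cy - half_h, cx + half_w, cy + half_h)


def head_center_inside(head_center: Tuple[float, float], region: Rect) -> bool:
    """True if the head center lies within the (scaled) safe region."""
    x, y = head_center
    return region[0] <= x <= region[2] and region[1] <= y <= region[3]
```

For example, with a camera 2 m above the bed and a head 1 m above the bed, this sketch doubles the initial safe region, and any greater head height yields the same maximum region because of the clamp.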
To accommodate noise in the depth information, the noise threshold was set to 80.0, meaning that the difference between adjacent frames had to exceed 80.0 to be counted as movement. Movement in the bed with no patient detected could mean that tracking of the head was lost while the subject was still in bed. However, a warning was needed when the movement stayed below the movement threshold for longer than the tolerant time.
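The movement check described above can be sketched as follows. The noise threshold (80.0), the movement threshold (100,000.0), and the frame rate of 6 frames/s come from the text; everything else is an assumption for illustration, including the use of the absolute frame difference, the encoding of invalid pixels as 0, the tolerant time value, and the function names.

```python
from typing import List, Sequence

NOISE_THRESHOLD = 80.0          # per-pixel difference needed to count as movement
MOVEMENT_THRESHOLD = 100_000.0  # movement value observed with no patient in bed
INVALID_DEPTH = 0               # assumed encoding of pixels without depth data


def movement_value(prev_frame: Sequence[Sequence[float]],
                   curr_frame: Sequence[Sequence[float]]) -> float:
    """Sum the above-noise depth differences over all valid pixels."""
    total = 0.0
    for prev_row, curr_row in zip(prev_frame, curr_frame):
        for prev_px, curr_px in zip(prev_row, curr_row):
            if prev_px == INVALID_DEPTH or curr_px == INVALID_DEPTH:
                continue  # skip pixels the camera could not measure
            diff = abs(curr_px - prev_px)
            if diff > NOISE_THRESHOLD:
                total += diff
    return total


def warning_needed(movements: List[float], fps: float, tolerant_time_s: float) -> bool:
    """True if movement stayed below the threshold for longer than the tolerant time.

    movements holds one movement value per frame while head tracking is lost.
    """
    limit_frames = tolerant_time_s * fps
    still_frames = 0
    for value in movements:
        still_frames = still_frames + 1 if value < MOVEMENT_THRESHOLD else 0
        if still_frames > limit_frames:
            return True
    return False
```

With a hypothetical tolerant time of 2 s at 6 frames/s, a warning would be raised once more than 12 consecutive frames show movement below the threshold.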

Results
This section presents the experimental results and consists of four subsections. Section 3.1 introduces the evaluation metrics adopted to assess the results and performance of the series of processes described in Section 2. Section 3.2 presents the instance segmentation results and evaluates the ability to classify the heads of subjects and medical helpers. Section 3.3 compares the different tracking techniques, and Section 3.4 reports the final performance of the alarm algorithm.

Tracking Evaluation
The three MOT methods, StrongSORT, DeepSORT, and ByteTrack, are compared in terms of lost-tracking rate and the number of ID changes. The lost-tracking rate is calculated as follows:
$$L_{tr} = \frac{N_{nd}}{N_{f}} \tag{4}$$
where $L_{tr}$ is the lost-tracking rate, $N_{nd}$ is the number of frames without detection, and $N_{f}$ is the total number of frames. A higher lost-tracking rate means fewer frames are being tracked. Therefore, the MOT method with the lowest lost-tracking rate is preferred.
Moreover, the MOT models should ideally continue tracking a given subject with a constant ID number. However, the tracker may assign a new ID number to the same subject once the subject is re-tracked after a break, which means the tracker recognizes the subject under a different identification number. More frequent ID changes, or identity-switching errors, indicate that the tracker has a worse tracking ability and can lead to incorrect interpretations of object behavior and interactions [55]. Thus, the MOT method with fewer ID changes is superior.
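Both tracking measures above can be computed from a per-frame log of track IDs. The sketch below assumes a single tracked subject and a hypothetical log format in which each frame holds either the assigned track ID or `None` when nothing was tracked; it is not the authors' evaluation code.

```python
from typing import List, Optional


def lost_tracking_rate(frame_ids: List[Optional[int]]) -> float:
    """Fraction of frames without a detection (Nnd / Nf)."""
    if not frame_ids:
        raise ValueError("at least one frame is required")
    lost = sum(1 for track_id in frame_ids if track_id is None)
    return lost / len(frame_ids)


def count_id_changes(frame_ids: List[Optional[int]]) -> int:
    """Number of times the subject reappears or continues under a new ID."""
    changes = 0
    last_id = None
    for track_id in frame_ids:
        if track_id is None:
            continue  # lost frames do not count as ID changes themselves
        if last_id is not None and track_id != last_id:
            changes += 1
        last_id = track_id
    return changes
```

For a log such as `[1, 1, None, None, 2, 2, None, 2, 3, 3]`, three of ten frames are lost and the ID changes twice (1 to 2, then 2 to 3).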

Confusion Matrices
One of the most common performance measurements in pattern classification is accuracy, which is defined as the proportion of correct predictions in the whole dataset. Accuracy evaluates the overall classification results but cannot reflect which class a misclassification comes from. Therefore, a bias towards the majority class, ignoring the minority class, could occur. The confusion matrix, which also takes the particularities of the decisions into consideration, can be used to address this issue. With two conditions, positive and negative, four possible outcomes are defined: true positive (TP), true negative (TN), false positive (FP), and false negative (FN). TP is the number of positive cases correctly predicted as positive. TN is the number of negative cases correctly predicted as negative. FP is the number of negative cases incorrectly predicted as positive, while FN is the number of positive cases incorrectly predicted as negative.
In addition, sensitivity, specificity, and balanced accuracy are usually used alongside the confusion matrix. Sensitivity is the ratio of TP to all positive cases, specificity is the ratio of TN to all negative cases, and the balanced accuracy is the mean of sensitivity and specificity, which compensates for imbalance in the number of samples in the different classes.
$$\text{sensitivity} = \frac{TP}{TP + FN} \tag{5}$$
$$\text{specificity} = \frac{TN}{TN + FP} \tag{6}$$
$$\text{balanced accuracy} = \frac{\text{sensitivity} + \text{specificity}}{2} \tag{7}$$
$$\text{precision} = \frac{TP}{TP + FP} \tag{8}$$
where precision measures the proportion of correctly predicted positive instances out of all instances predicted as positive.
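The definitions in Equations (5) to (8) translate directly into code. The sketch below uses hypothetical function names, and the counts in the usage note are made-up illustrative values, not results from this study.

```python
def sensitivity(tp: int, fn: int) -> float:
    """TP / (TP + FN): share of positive cases that were detected."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """TN / (TN + FP): share of negative cases that were left alone."""
    return tn / (tn + fp)


def balanced_accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Mean of sensitivity and specificity."""
    return (sensitivity(tp, fn) + specificity(tn, fp)) / 2.0


def precision(tp: int, fp: int) -> float:
    """TP / (TP + FP): share of positive predictions that were correct."""
    return tp / (tp + fp)
```

With illustrative counts TP = 8, FN = 2, TN = 6, and FP = 4, the sensitivity is 0.8, the specificity is 0.6, and the balanced accuracy is 0.7.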
Recall, also known as sensitivity or true-positive rate, measures the proportion of correctly predicted positives out of all actual positives.
$$\text{recall} = \frac{TP}{TP + FN} \tag{9}$$
F1 is commonly used for the evaluation of a model's accuracy and is particularly suitable for imbalanced datasets. It can be expressed as follows:
$$F1 = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}} \tag{10}$$

Object Detection Performance Evaluation
mAP50 (mean Average Precision at an intersection-over-union (IoU) threshold of 0.50) is a performance metric widely employed in object detection and image retrieval that evaluates a machine learning model based on recall and precision. A prediction is counted as correct when its IoU with the ground truth is at least 0.50. For each class, the average precision summarizes the precision-recall curve by averaging the precision across recall levels from 0 to 1, and mAP50 is the mean of these average precision values over all classes [61].
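The F1 score and the average precision can be sketched as below. The sketch assumes that each detection has already been marked as a true or false positive at an IoU threshold of 0.50, and it integrates the precision-recall curve with all-point interpolation; evaluation toolkits may sample the curve differently, so the numbers need not match theirs exactly. Function names are hypothetical.

```python
from typing import Dict, List, Tuple


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (Equation (10))."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def average_precision(detections: List[Tuple[float, bool]], n_ground_truth: int) -> float:
    """Area under the precision-recall curve for one class.

    detections: (confidence, is_true_positive) pairs, where the flag was
    decided beforehand at IoU >= 0.50.
    """
    if n_ground_truth <= 0:
        raise ValueError("n_ground_truth must be positive")
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    tp = fp = 0
    precisions, recalls = [], []
    for _, is_tp in ranked:
        if is_tp:
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / n_ground_truth)
    # Make precision non-increasing from left to right (interpolation).
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


def mean_average_precision(per_class_ap: Dict[str, float]) -> float:
    """Mean of the per-class average precision values."""
    return sum(per_class_ap.values()) / len(per_class_ap)
```

Applied to per-class values, `mean_average_precision` simply averages them; for two classes with average precisions of 0.990 and 0.986, the mean is 0.988.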

Performance of YOLOv8 Instance Segmentation
As displayed in Table 2, the overall mAP50 for the segmentation of the heads of both medical personnel and patients reaches 98.8%, which indicates an acceptable ability of the customized YOLOv8 model. The mAP50 of medical personnel segmentation is 98.6%, which is slightly lower than that of patients, which is 99.0%. The subsequent steps depend on the ability to identify the patient's head, so a mAP50 of 99.0% is reasonable for the tracking and alarm identification process. Table 3 indicates the mean mAP50 values of 5-fold cross-validation for all classes, for medical personnel, and for the patient, which are 97.6%, 96.6%, and 98.5%, respectively. The cross-validation results demonstrated the strong performance and good generalization of the model.

Comparison of Different Tracking Techniques
The total number of frames in all video clips is 91,829. As the tracking results recorded in Table 4 show, StrongSORT has the highest lost-tracking rate of 23.4%, while DeepSORT and ByteTrack have the same rate of 8.9%. StrongSORT loses tracking in almost a quarter of all frames, which makes it unsuitable for further alarm identification. DeepSORT and ByteTrack, with lower lost-tracking rates, have no reaction delay when re-tracking a subject, while StrongSORT requires a certain number of frames to confirm the appearance of a detected subject. As for the count of ID changes, although StrongSORT has the fewest ID changes, this could be a consequence of its worse tracking performance: fewer frames are tracked, so there are fewer opportunities for an ID change to occur. Comparing the number of ID changes for DeepSORT (2109) and ByteTrack (1697), ByteTrack demonstrates a better ability to keep tracking a specific subject. However, this is not directly relevant to the accuracy of identifying a patient's head. Therefore, a further comparison of DeepSORT and ByteTrack is conducted in the next section.

Performance of Alarm Algorithm
To evaluate the performance of the alarm stage, bed exiting is regarded as the positive condition, while staying in bed is the negative condition. Therefore, the detection metrics are defined as follows:
• TP: the bed-exiting scenarios are predicted with warning.
• TN: the staying-in-bed scenarios are identified as safe.
• FP: the staying-in-bed scenarios are predicted with warning.
• FN: the bed-exiting scenarios are identified as safe.
The examples of the above four conditions are illustrated in Figure 11.The TP example was from scenario 17 when the patient exited from the bedside.The TN example was from scenario 1 when the medical personnel helped to put a safety vest on the patient.The FP example was from scenario 1 when the in-bed movement was identified as bed exiting.
The FN example was from scenario 2, where the alarm was not triggered because tracking of the patient's head was lost when the patient slipped away from the side of the bed. A possible cause of the lost tracking is that the patient moved too fast for the frame rate of the depth camera to capture the motion. Figure 12 displays the confusion matrices of the alarm algorithm combined with the tracking results from DeepSORT and ByteTrack. As shown in Table 5, the results for both MOT methods, DeepSORT and ByteTrack, have excellent sensitivity with satisfactory specificity. The false-negative scenario, in which the alarm is not raised while the patient is at risk of a fall, should be kept to the lowest possible level in real hospital applications; in our experiment, it occurs at a rate of 3.2%.
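As a check on the arithmetic, the abstract reports 5 missed warnings out of 154 cases. Assuming these correspond to FN = 5 and TP + FN = 154, the reported sensitivity and false-negative rate follow directly:

```latex
\[
\text{sensitivity} = \frac{TP}{TP + FN} = \frac{154 - 5}{154} = \frac{149}{154} \approx 0.968,
\qquad
\text{false-negative rate} = \frac{FN}{TP + FN} = \frac{5}{154} \approx 0.032 .
\]
```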

Discussion
The innovation of this study lies in the application and integration of depth cameras and deep learning to address the demand for a restraint-free bed-exiting alarm system for fall prevention, using real-time tracking and dynamic virtual fence techniques. It is worth noting that our previous system, eNightLog [40], had no tracking. The highest point of the defined region was regarded as the patient's head, so the interference of bed height would raise a false alarm in scenario 07, when the nurse or patient adjusts the level of the bed. As shown in Figure 13, with head tracking and a dynamic safe region that expands as the head height increases, our system could avoid the disturbance caused by different bed heights, which demonstrated the robustness of our model. Also, since our system could classify whether a person is medical personnel or a user, it is able to avoid mistaking medical personnel for users and raising false alarms when medical personnel leave the safe region. Therefore, in contrast to the previous system, the bed height and inclination can be adjusted without affecting the system performance. Although the accuracy of our system cannot match that of previous wearable sensors, it is not complicated to use and reduces the workload of nurses.

The misclassification of patient heads in YOLOv8 when identifying the patients and medical personnel is mainly due to head-like interference, such as pillows, shoulders, and knees, when viewed from certain levels. The depth information used in this experiment is only from a top-down viewpoint, so the information about the appearance of subjects in the visible area is relatively limited. Therefore, misclassification could occur when our customized YOLOv8 model misidentifies items with head-like shapes as a head. The issue could be addressed by taking into consideration depth information captured by multiple depth cameras from diverse perspectives. The multi-depth camera information could also be registered into a new volume to avoid the issue of overlapping objects. Some false negatives occur during tracking with a relatively large safe region. The largest safe region is already limited by the user's sitting height, so the user leaves the depth camera's view once they leave the safe region. This buffer zone can be enlarged by using a camera with a wider angle or by installing the camera in a higher position, so that the proportion of the predetermined region to the whole field of view is small enough to provide a reasonable buffer zone. In addition, a frame rate greater than the 6 frames/s used in this experiment could increase the flexibility of our eNightTrack system. Although the number of ID changes gave an indication of robustness, since it reflected situations of lost tracking and mis-tracking in the MOT techniques, it could not exactly quantify the robustness. As for the complexity, both DeepSORT and StrongSORT tracking contained approximately 6.46 M trainable parameters, while the YOLOv8 segmentation model contained 261 layers and approximately 3.41 M trainable parameters. Because only a small number of features were used in this study, it is not easy to remove any of them for an ablation study. In a future study, we could track all palms, shoulders, knees, and feet, where OpenPose can be used to recognize them, and improved tracking techniques could then be applied to track these additional features and enhance the performance. Compared with the bed-exiting identification system developed by Lu et al. in 2018 [62], which achieved an accuracy of 60.3% on 151 samples with a combination of DPCA (dynamic principal component analysis) and GMM (Gaussian mixture model) technologies, our system demonstrated higher accuracies of 79.8% and 79.1%.
In future implementations of our system in hospitals, the tolerable head height of the patient used to determine the maximum safe region should be adjusted to the patient's height. Furthermore, more subjects could be recruited in future work to reduce inter-subject interference and improve the generality of the eNightTrack system. Additionally, other YOLO models, such as YOLO-NAS, could be utilized for head detection. Similarly, an emerging MOT technique could replace the tracking model adopted in this experiment. The registered 3D data mentioned in Section 3.2 could be utilized for 3D tracking, which could be more comprehensive. Clinical trials should be carried out before the system is finally applied in hospitals.

Conclusions
In hospitals, some unattended patient bed-exiting events might result in falls, increasing the burden on medical staff. At present, the commonly used means of physical or chemical restraint might harm the physical and mental health of patients, and ordinary RGB camera monitoring systems raise privacy concerns. Therefore, our virtual monitoring system based on depth cameras has been developed for monitoring patients' movements and reducing fall risk.
We demonstrated that the eNightTrack system had a convincing sensitivity of 96.8% for detecting bed-exiting events, making it a potentially effective tool to prevent falls. Moreover, it offers several advantages, including the avoidance of privacy issues, and could serve as an alternative to current restraint measures. The system is robust to disturbances caused by bed height variations, furniture changes, and medical personnel entering the predefined region. However, there are still several limitations and concerns that should be addressed in future developments. A further modified eNightTrack system could be installed in hospital wards to support nurses in monitoring at any moment.

Figure 3. Setup and environment of data collection.

Figure 6. Scale determination of safe region.

Figure 7. Initial state of each scenario, where the red frame indicates the dynamic safe region and the pink frame indicates the head of the user.

Figure 8. Illustration of head center out of safe region.

Figure 9. Algorithm flowchart of alarm identification.

Figure 10. Movement indicator diagram: (a) no movement, so the red frame is entirely black; (b) movement present.

Figure 11. Illustration of four conditions of confusion matrices, where the large red frame indicates the dynamic safe region, the pink frame indicates the head of the user, and the small red frames indicate the heads of medical personnel: (a) true-positive example; (b) true-negative example; (c) false-positive example; (d) false-negative example.

Figure 13. Bed-raising scenario without false warning, where the red frame is the dynamic safe region.

Table 2. The performance of customized YOLOv8.

Table 4. Comparison of MOT methods.

Table 5. Performance of alarm algorithm based on DeepSORT and ByteTrack.
