1. Introduction
The Neonatal Intensive Care Unit (NICU) provides critical care for the most vulnerable newborn patients. These patients are in precarious health and require continuous monitoring, which in the NICU typically relies on sensors attached to the patient’s skin. Such sensors are susceptible to motion artifacts and may interfere with both clinical and parental care. Wired sensors can also irritate sensitive skin, and they sometimes must be removed and reapplied during medical interventions. This motivates the development of robust video-based noncontact patient monitoring [
1,
2,
3].
A patient may experience multiple periods of clinical intervention or routine care throughout their time in the NICU. These interventions can include a clinician or parent reaching into the scene to replace sensors, take readings, change a diaper, feed the patient, or otherwise move the patient. These periods of intervention are often excluded from analysis when studying novel noncontact techniques of monitoring neonates in the NICU (e.g., [
4,
5]). However, studies by Villarroel et al. [
3] and Souley Dosso et al. [
2,
6] attempt to detect these periods of intervention and, in the case of [
6], classify a subset of them (bottle-feeding interventions).
Deep learning has led to dramatic advancements in computer vision, which have translated into new forms of noncontact patient monitoring [
1]. Souley Dosso et al. [
2] used the VGG-16 CNN model introduced in ref. [
7] as the feature extractor for their method of intervention detection. They examined several forms of multi-modal (RGB and depth) fusion, resulting in similar performance between the RGB and RGB-D fusion models while observing significantly lower performance for depth-only models.
In this paper, we develop a model to detect periods of clinical or routine care intervention using only depth-based images, as this modality is more privacy-preserving than RGB or RGB-D images. Detecting such interventions is useful for several reasons. For example, vital sign estimation may be paused, or patient monitor alarms may be silenced automatically, during interventions since a clinician is already attending to the patient. Detecting interventions is also a step towards classifying interventions, which may ultimately lead to automated charting of patient care. Furthermore, an intervention detection system based strictly on depth data is robust to changes in lighting. Such changes occur frequently in the NICU during clinical interventions and parent time, and lighting is also varied regularly throughout the day for some premature patients to promote sleep and support development. This paper focuses on utilizing depth video alone for intervention detection, building on the preliminary results reported in the thesis [
8]. Note that portions of this manuscript previously appeared in the thesis [
9].
This study leverages vision transformers (ViTs) [
10], which have been shown to outperform CNNs for image classification in several application areas. The ViT divides each input image into a number of nonoverlapping patches which are flattened into vectors of pixel values and used as the input to the transformer’s encoder. The ViT culminates in a fully connected head layer for the task of image classification. Variations and extensions of this model have had success in image segmentation, object detection, and video action recognition [
11,
12,
13].
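The patch-splitting step described above can be illustrated with a minimal sketch. It uses plain nested lists rather than a tensor library, and the function name `patchify`, the toy image, and the 224 × 224 input with 16-pixel patches are assumptions chosen for illustration rather than details taken from ref. [10].

```python
def patchify(image, patch_size):
    """Split an H x W x C image (nested lists) into flattened, nonoverlapping patches."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            vector = []
            for row in range(top, top + patch_size):
                for col in range(left, left + patch_size):
                    vector.extend(image[row][col])  # append every channel value of this pixel
            patches.append(vector)
    return patches


# Toy 4 x 4 single-channel image split into 2 x 2 patches.
toy = [[[r * 4 + c] for c in range(4)] for r in range(4)]
toy_patches = patchify(toy, 2)
print(len(toy_patches), toy_patches[0])  # 4 patches; the first is [0, 1, 4, 5]

# A blank 224 x 224 three-channel input with 16-pixel patches (illustrative sizes).
blank = [[[0, 0, 0] for _ in range(224)] for _ in range(224)]
vit_patches = patchify(blank, 16)
print(len(vit_patches), len(vit_patches[0]))  # 196 patches of 768 values each
```

In a full ViT, each flattened vector would then be linearly projected and combined with a position embedding before entering the encoder; that stage is omitted here.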
When training a deep learning model, large amounts of data and compute resources are needed. For this reason, transfer learning is usually employed, where models are pre-trained on large datasets prior to fine-tuning the model to perform specific tasks with smaller training datasets. For image classification, several convolutional neural network (CNN) and vision transformer (ViT) models are available that have been pre-trained on the ImageNet dataset [
14] consisting of ∼14 million annotated images from 1000 classes. Given a downstream task, pre-trained models are normally chosen from the same or similar domains (e.g., RGB image classification, object detection, semantic segmentation). Transfer learning has been shown to improve the average accuracy of CNN models [
15] as well as ViT [
10] for image classification.
Pre-trained image classification models are generally trained on large amounts of labelled three-channel RGB data. HHA encoding is a method of encoding depth data using three channels for each pixel rather than just the one channel of depth [
16]. An example illustrating the three channels resulting from HHA encoding of a depth image can be seen in 
Figure 1. The three channels correspond to the horizontal disparity (H), the height above the ground (H), and the angle the pixel’s local surface normal makes with the inferred gravity direction (A). This has been shown to improve the performance of a network pre-trained on RGB data and fine-tuned with labelled HHA-encoded depth data when compared to fine-tuning on regular one-channel depth or disparity data. Gupta et al. suggest that this is because the disparity and angle channels may show edges that correspond to object boundaries that can be seen in the RGB images of the same scene [
16]. The authors verify this by fine-tuning a CNN originally trained for object detection and semantic segmentation from RGB images [
17].
The horizontal disparity can be calculated from depth data by using Equation (
1) [
18]:
\[ \text{Disparity} = \frac{\text{Focal Length} \times \text{Baseline}}{\text{Depth}} \tag{1} \]
where the 
Focal Length and 
Baseline are found from the camera’s intrinsic matrix [
19]. The height above the ground and the angle between the surface normal and inferred gravity direction can be found using the algorithms presented in ref. [
20] and implemented in refs. [
21,
22]. The algorithms require the point cloud representation of the depth image as well as the camera matrix. The direction of gravity is estimated by finding the direction that is best aligned to surface normals, under the assumption that most surfaces in the scene are horizontal. The direction of gravity is initialized to the camera’s Y-axis before iteratively refining the estimated direction by examining local surface normals in the depth data. The height above the ground can then be found by rotating the point cloud of the data to the horizontal direction then subtracting the smallest Y-coordinate value in the scene [
23]. The angle between the surface normal and the gravity direction can be found from the difference in the respective vectors. Finally, the values in each of the channels are normalized to the range of 0–255 (i.e., an 8-bit value).
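The per-pixel quantities described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation in refs. [21, 22]: the disparity uses the standard relation focal length × baseline / depth, the angle is computed from the normalized dot product of the two vectors (a common choice, assumed here), and the camera parameters and depth readings are hypothetical.

```python
import math


def disparity_from_depth(depth_m, focal_length_px, baseline_m):
    """Disparity in pixels from depth in metres: focal length x baseline / depth."""
    if depth_m <= 0:
        return 0.0  # treat missing or invalid depth readings as zero disparity
    return focal_length_px * baseline_m / depth_m


def angle_with_gravity(normal, gravity):
    """Angle in degrees between a surface normal and the gravity direction."""
    dot = sum(n * g for n, g in zip(normal, gravity))
    norm_n = math.sqrt(sum(n * n for n in normal))
    norm_g = math.sqrt(sum(g * g for g in gravity))
    cosine = max(-1.0, min(1.0, dot / (norm_n * norm_g)))  # clamp against rounding error
    return math.degrees(math.acos(cosine))


def to_uint8(values):
    """Linearly rescale a list of values to integers in the range 0-255."""
    low, high = min(values), max(values)
    if high == low:
        return [0 for _ in values]
    return [round(255 * (v - low) / (high - low)) for v in values]


# Hypothetical camera parameters and depth readings, for illustration only.
FOCAL_LENGTH_PX = 480.0
BASELINE_M = 0.0625
depths_m = [0.5, 1.0, 2.0]

disparities = [disparity_from_depth(d, FOCAL_LENGTH_PX, BASELINE_M) for d in depths_m]
print(disparities)            # [60.0, 30.0, 15.0]
print(to_uint8(disparities))  # [255, 85, 0]
print(angle_with_gravity((0.0, 1.0, 0.0), (0.0, 1.0, 0.0)))  # normal parallel to gravity
print(angle_with_gravity((1.0, 0.0, 0.0), (0.0, 1.0, 0.0)))  # normal perpendicular to gravity
```

Estimating the gravity direction itself and the height above the ground requires the iterative point-cloud procedure of ref. [20] and is not reproduced in this sketch.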
Despite the notable advancements in noncontact monitoring of patients in the NICU, there remains a critical gap in the literature concerning the automatic detection of clinical interventions and routine care events, particularly using depth data. Many existing studies [
2,
3,
6] have leveraged RGB (colour) or multi-modal RGB-D image data for such detection. Chaichulee et al. achieved excellent accuracy when detecting clinical interventions using RGB video [
24]. However, RGB (and RGB-D) video may be considered intrusive and is sensitive to ambient lighting. Therefore, in this study, we specifically focus on models restricted to privacy-preserving depth images rather than RGB (or RGB-D).
The contributions of this study are as follows: First, we propose an intervention detection method based solely on depth data, thereby increasing robustness to lighting changes and maintaining patient privacy. The method utilizes a vision transformer (ViT) model to interpret the depth data, an approach not previously explored for this application in the NICU setting. We also investigate several design parameters such as the encoding of depth data and the application of perspective transform to account for varying camera placement, hence offering a versatile solution suitable for different NICU environments. Finally, we evaluate our model using real-world NICU data, demonstrating its practical utility and efficacy. Utilizing real-world data for the evaluation is critical in this case, as a simulated environment may not accurately reflect the range of challenges that arise in a complex clinical environment. An example of several challenging scenarios can be seen in 
Figure 2. Our results not only confirm the feasibility of the proposed approach but also set the stage for future work in automatic classification of interventions and eventually automated charting of patient care.
  2. Materials and Methods
  2.1. Data Collection
To support our study, we collected two types of data: clinical data from neonatal patients and simulated data from a neonatal manikin. In the following subsections, we describe the data collection process for each dataset, including details on the data collection setup, data processing, and class labelling.
  2.1.1. NICU Data Collection
Data were collected from 27 neonatal patients in the NICU at the Children’s Hospital of Eastern Ontario (CHEO) following approval by the research ethics boards from the hospital and Carleton University. The data were collected as part of a larger research initiative to develop multi-modal noncontact patient monitoring methods and technologies. The dataset cannot be released publicly due to the restrictions set by the research ethics board.
Figure 3 shows an example of the setup in the NICU environment. An RGB-D camera (Intel RealSense SR300, Santa Clara, CA, USA) was placed above or beside the patient’s bed. The camera was chosen due to its small size, affordability, and suitable depth range to capture patients at a close distance. Recordings were captured at a resolution of 640 × 480 pixels at 30 frames per second. The cameras were placed such that the view planes were at nonuniform angles relative to the plane of the bed. The SR300 captures depth information using the coded-light method, in which an IR projector and an IR camera sensor are used together to generate a frame of depth pixels. The camera also includes a separate RGB camera sensor that can be used in conjunction with the depth stream to form an RGB-D image. Note that in the present study, all proposed methods use only privacy-preserving depth image data.
The gold standard respiratory rate signals of the patients were recorded from the bedside patient monitor (Dräger Infinity Delta). Custom Patient Monitor Data Import (PMDI) software Version 1.0, developed for the project, was used to import the data from the serial port on the monitor [
25]. A bedside annotation application was used to annotate events (clinical interventions, etc.) in real time. All data from the camera and patient monitor were saved on a data acquisition laptop.
Still images were extracted from the patient recordings every 30 s and labelled as either ‘Intervention’ (positive) or ‘No Intervention’ (negative). This resulted in 14,892 images in total, with 1260 in the positive class and 13,632 in the negative class (a class imbalance of 10.8:1 in favour of the negative class). The ‘Intervention’ class comprised images where a nurse or other practitioner was reaching into the camera’s view to tend to the patient, while the ‘No Intervention’ class included only the patient (
Figure 4).
The difficulty of intervention detection from depth data can sometimes be misrepresented. Looking at 
Figure 4, one would assume that the difference in the depth frame between the nurse’s hands and the patient/bed would be apparent; however, the task is often more difficult. In 
Figure 5, an intervention frame can be seen that is more challenging to classify by looking only at the depth channel (on the right). If the caregiver’s hands are near or at the height of the patient’s bed, the difference in depth can be sufficiently small to require more advanced methods. This is demonstrated by including a baseline approach in the present study.
  2.1.2. Simulated Data Collection
After the initial data collection stage, additional simulated data were collected to partially address the class imbalance between nonintervention/intervention frames in the clinical data. A neonatal manikin (StandInBaby [
26]) was placed on simulated clinical bedding, and the RealSense SR300 RGB-D camera was used to capture 600 depth images, as illustrated in 
Figure 6. A camera arm was used to place the camera at 5 different angles relative to the plane of the bed. Yellow gloves were worn during data collection to facilitate the use of the collected data in future image segmentation studies by providing a consistent colour reference for the hands (
Figure 7).
  2.2. Proposed Method
Vision transformers have demonstrated excellent accuracy in image classification tasks since their introduction in ref. [
10]. We propose the use of a ViT pre-trained on the ImageNet dataset [
14] and fine-tuned on a subset of our own set of 14,892 depth images. The model architectures were implemented using the PyTorch Image Models library [
27]. Two model sizes with similar architecture but different numbers of trainable parameters were chosen, ‘vit_tiny_patch_16_224’ and ‘vit_base_patch_16_224’, with ∼5.4 M parameters and ∼85 M parameters, respectively. Each of the models accepts input images with a resolution of 224 × 224 pixels and divides them into 16 × 16 patches for embedding. The difference in the number of trainable parameters comes from an increase in the dimensions of the hidden embedding layer and the number of heads in the attention mechanism when moving from the ‘tiny’ model to the ‘base’ model.
Performance evaluation: Throughout this study, we have used a repeated five-fold cross-validation approach, where the dataset of 27 patients was divided into five distinct “folds”. For each combination of system design parameters, five models were trained and evaluated, and the average performance across the five models was computed. Within each of the five folds, a classification model was trained on four folds (approximately 21 patients), while the remaining 5–6 patients not used to train the model were used to evaluate the model. In this way, all patient data were used to both train and evaluate models but never the same model. This entire process was repeated five times, with different patients assigned to each fold in each repetition. The mean across the five repetitions was reported as the final performance metric.
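The patient-level splitting described above can be sketched as follows. This is a minimal illustration, assuming integer placeholder patient IDs and a seeded shuffle per repetition; the function names and the seeding scheme are hypothetical and not taken from the study.

```python
import random


def patient_folds(patient_ids, n_folds, seed):
    """Shuffle the patients and deal them into n_folds disjoint groups."""
    shuffled = list(patient_ids)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::n_folds] for i in range(n_folds)]


def repeated_patient_cv(patient_ids, n_folds=5, n_repeats=5, base_seed=0):
    """Yield (repeat, fold, train_patients, test_patients) for every split."""
    for repeat in range(n_repeats):
        # A different seed per repetition assigns different patients to each fold.
        folds = patient_folds(patient_ids, n_folds, seed=base_seed + repeat)
        for fold_index, test_patients in enumerate(folds):
            train_patients = [p for i, fold in enumerate(folds)
                              if i != fold_index for p in fold]
            yield repeat, fold_index, train_patients, test_patients


patients = list(range(1, 28))  # 27 placeholder patient IDs
splits = list(repeated_patient_cv(patients))
print(len(splits))  # 25 train/test splits: 5 repetitions x 5 folds
repeat, fold, train, test = splits[0]
print(len(train), len(test))  # 21 training patients and 6 held-out patients
```

Splitting by patient rather than by image keeps all frames from one patient on the same side of each split, so no model is evaluated on a patient it was trained on.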
Hyperparameters: The training of the models utilized a mini-batch size of 16 and a learning rate set at 0.01 over a maximum of 15 epochs. Visual inspection of preliminary learning curves indicated no substantial reduction in validation loss beyond 15 epochs. Stochastic gradient descent was employed as the optimizer, with a momentum of 0.9. The input images were each resized to dimensions of 224 × 224 pixels, which altered their aspect ratio from 4:3 to 1:1. Finally, random rotations (ranging from 0° to 360°) and horizontal/vertical flips were applied to the images in the training sets.
Along with the size of the model, the effect of three other system design parameters on model performance was also explored. These parameters are described in the upcoming 
Section 2.2.1, 
Section 2.2.2 and 
Section 2.2.3, and a summary can be seen in 
Table 1. A visual representation of the data flow and utilization of the proposed system design parameters can be seen in 
Figure 8.
  2.2.1. Simulated Data
Since the data collected from the NICU contain more instances without interventions than those with, the resulting labelled data had a high class imbalance of 10.8:1 in favour of the negative (no-intervention) class. To help correct for this imbalance, simulated intervention data were collected as previously described (
Section 2.1). These data comprised 600 images of simulated interventions that were added to the positive class, bringing the class imbalance down to approximately 7.3:1. Both model sizes were trained without the addition of the simulated data, and then the process was repeated with the inclusion of the simulated data in each training fold (i.e., simulated data were used for training but not for testing).
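The class imbalance figures quoted above follow directly from the image counts reported in Section 2.1.1, as the short calculation below shows. The function name is an assumption introduced for illustration.

```python
def imbalance_ratio(n_negative, n_positive):
    """Number of negative images per positive image."""
    return n_negative / n_positive


N_NEGATIVE = 13632   # 'No Intervention' images from the NICU
N_POSITIVE = 1260    # 'Intervention' images from the NICU
N_SIMULATED = 600    # simulated intervention images added to the positive class

before = imbalance_ratio(N_NEGATIVE, N_POSITIVE)
after = imbalance_ratio(N_NEGATIVE, N_POSITIVE + N_SIMULATED)
print(f"{before:.1f}:1 -> {after:.1f}:1")  # 10.8:1 -> 7.3:1
```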
  2.2.2. Perspective Transformation
The effect of a perspective transformation (PT) algorithm on the performance of the models was explored. As discussed in ref. [
5], we previously demonstrated that perspective transformation can account for nonoptimal depth camera placement relative to the patient. In that study, perspective transform was shown to improve an ROI selection algorithm for subsequent respiration rate estimation. Based on those results, it was thought that applying the transformation to the data used to build the ViT-based intervention detection model might also improve its performance. The patient data collected from the NICU and the simulated data were transformed by manually selecting four registration points in the plane of the bedding for each new recording. The rotation matrix was found and applied to all frames extracted from the same recording. The experiments were then re-run using these transformed data as the input. Models were trained and tested with and without perspective transform to investigate its effect on intervention detection accuracy.
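One way to obtain a transformation from four registration points is to solve for a 3 × 3 perspective (homography) matrix that maps them onto the corners of a target rectangle. The sketch below assumes this standard four-point formulation; the exact procedure used in the study is the one described in ref. [5], and the point coordinates and function names here are hypothetical.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("registration points are degenerate")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    solution = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * solution[k] for k in range(row + 1, n))
        solution[row] = (aug[row][n] - tail) / aug[row][row]
    return solution


def perspective_matrix(src_points, dst_points):
    """3 x 3 matrix mapping four source points onto four destination points."""
    rows, rhs = [], []
    for (x, y), (u, v) in zip(src_points, dst_points):
        rows.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        rhs.append(u)
        rows.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        rhs.append(v)
    h = solve_linear_system(rows, rhs)
    return [h[0:3], h[3:6], [h[6], h[7], 1.0]]


def warp_point(matrix, point):
    """Apply the perspective matrix to one (x, y) pixel coordinate."""
    x, y = point
    u, v, w = (row[0] * x + row[1] * y + row[2] for row in matrix)
    return u / w, v / w


# Hypothetical registration points on the bedding, mapped to a 640 x 480 frame.
SRC = [(120, 90), (520, 110), (560, 400), (80, 380)]
DST = [(0, 0), (639, 0), (639, 479), (0, 479)]
H = perspective_matrix(SRC, DST)
print([tuple(round(c, 3) for c in warp_point(H, p)) for p in SRC])
```

Because the matrix depends only on the registration points, it can be computed once per recording and reused for every frame extracted from that recording. Image libraries such as OpenCV provide equivalent routines (`cv2.getPerspectiveTransform` and `cv2.warpPerspective`) for warping whole frames.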
  2.2.3. HHA Encoding
ViTs are not typically trained from scratch for specific image classification tasks. Rather, ViT models are typically pre-trained on large datasets using self-supervised learning techniques, such as masked auto-encoding (MAE) [
28]. Pre-trained ViTs are then fine-tuned for specific tasks through the addition of a task-specific prediction head. Such pre-training of ViT requires a large amount of data and extensive compute resources. Some ViT models pre-trained on large image datasets, such as ImageNet, have been released publicly by researchers at Google Research [
29] and other groups. As these models have been pre-trained on 3-channel RGB images, there is latitude as to how the single channel of depth data should be mapped to a 3-channel input. The effect of HHA encoding on the performance of the proposed intervention detection model was investigated.
Each of the datasets described previously was transformed to be HHA-encoded, and the experiments were re-run. Models were trained with and without HHA encoding to investigate its effect on intervention detection accuracy. Models trained without HHA encoding were modified to accept 1-channel images as inputs. The pre-trained input layer weights from each of the 3 channels normally used for R, G, and B were summed into a single channel.
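The channel-summing step can be illustrated with a small, library-free sketch. The kernel values, the depth patch, and the function names below are made up for illustration; in practice the same operation would be applied to the pre-trained patch-embedding weights.

```python
def collapse_rgb_kernels(kernels):
    """Sum each filter's three input-channel kernels into one kernel.

    kernels[f][c][i][j] is the weight of filter f, input channel c, at (i, j).
    """
    collapsed = []
    for filt in kernels:
        size = len(filt[0])
        summed = [[sum(channel[i][j] for channel in filt) for j in range(size)]
                  for i in range(size)]
        collapsed.append([summed])  # keep an explicit input-channel axis of length 1
    return collapsed


def filter_response(kernel_stack, patch_stack):
    """Dot product of one filter with one patch (both indexed [channel][i][j])."""
    return sum(w * x
               for k_chan, p_chan in zip(kernel_stack, patch_stack)
               for k_row, p_row in zip(k_chan, p_chan)
               for w, x in zip(k_row, p_row))


# One made-up 2 x 2 filter with separate R, G, and B kernels.
rgb_kernels = [[
    [[1.0, 0.0], [0.0, 1.0]],   # R
    [[0.5, 0.5], [0.5, 0.5]],   # G
    [[0.0, 2.0], [2.0, 0.0]],   # B
]]
depth_patch = [[3.0, 1.0], [4.0, 2.0]]

one_channel = collapse_rgb_kernels(rgb_kernels)
print(one_channel[0][0])                                    # [[1.5, 2.5], [2.5, 1.5]]
print(filter_response(one_channel[0], [depth_patch]))       # 20.0
print(filter_response(rgb_kernels[0], [depth_patch] * 3))   # 20.0
```

The last two lines show why summation is a natural choice: the summed single-channel filter responds to a depth patch exactly as the original filter would respond to that patch replicated across the R, G, and B channels.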
  2.3. Baseline Methods
The models explored in this study were compared against the best-performing CNN-based intervention detection model proposed by Souley Dosso et al. in ref. [
2]. Specifically, the model chosen for comparison was the multi-modal RGB-D fusion model, which used a VGG-16 CNN architecture [
7] and was pre-trained on the ImageNet dataset [
14] and fine-tuned on the intervention detection dataset described in 
Section 2.1.
Additionally, the exclusively depth-based model from Souley Dosso’s study was included for comparison given its shared reliance on the depth modality, though it resulted in lower performance metrics overall. For this model, the VGG-16 input layer was modified by removing two of its three input channels, allowing the pre-trained weights to be fine-tuned on the single depth channel. Further, a conventional (rules-based) method was also evaluated as an alternative baseline for comparison. The method consists of designating a known nonintervention frame as the reference for each patient recording and calculating the mean squared error between that reference and each of the remaining frames.
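The rules-based baseline can be sketched as follows. The mean squared error computation follows the description above; the decision threshold, the toy depth values, and the function names are assumptions added for illustration, since the decision rule is not specified in this section.

```python
def mean_squared_error(frame_a, frame_b):
    """Mean squared difference between two equally sized depth frames."""
    total, count = 0.0, 0
    for row_a, row_b in zip(frame_a, frame_b):
        for a, b in zip(row_a, row_b):
            total += (a - b) ** 2
            count += 1
    return total / count


def flag_interventions(reference, frames, threshold):
    """Flag frames whose error against the reference exceeds the threshold."""
    return [mean_squared_error(reference, f) > threshold for f in frames]


# Toy 2 x 3 depth frames (arbitrary units); values are illustrative only.
reference = [[100, 100, 100], [100, 100, 100]]   # known nonintervention frame
frames = [
    [[100, 101, 100], [99, 100, 100]],   # small fluctuations around the reference
    [[100, 60, 60], [100, 60, 100]],     # an object closer to the camera
]
THRESHOLD = 50.0  # hypothetical value; it would need tuning on real recordings

errors = [mean_squared_error(reference, f) for f in frames]
flags = flag_interventions(reference, frames, THRESHOLD)
print(errors, flags)
```

A single global threshold of this kind is sensitive to how close the caregiver’s hands are to the bed surface, which is consistent with the difficulty illustrated in Figure 5.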
As a final baseline model for comparison with our depth-based models, the RGB-D and depth-based models presented in ref. [
2] were also re-trained and evaluated using the design parameters outlined in 
Section 2.2.1, 
Section 2.2.2 and 
Section 2.2.3. This enabled direct comparisons between the depth-based vision transformer models proposed here and Souley Dosso’s depth-based CNN models for each of the design variables explored in this study.
  4. Discussion
Detecting periods of intervention from a recording presents a number of issues depending on the modality used. RGB video suffers from a decrease in performance during periods of low light or changing lighting. Models utilizing depth data may be confused when a nurse’s hands are at nearly the same distance from the camera as the patient or the patient’s bed. The difference in difficulty of identifying the period of intervention from depth can be seen between 
Figure 4 and 
Figure 5. The ‘base’-size vision transformer trained here for the task of intervention detection outperformed the baseline (state-of-the-art) models over all metrics, while the ‘tiny’ vision transformer was only slightly outperformed in sensitivity by the RGB-D fusion baseline model. When exploring variables that might affect the performance of the models, one of the models trained was a ‘base’ vision transformer that took advantage of HHA encoding of the depth data after applying the perspective transformation process. This model was overall the highest performing model, and its associated confusion matrix can be seen in 
Table 9. Model size and the encoding type of the depth data were found to have statistically significant effects on the performance of the models, where the ‘base’ model size and HHA encoding were advantageous. Based on the results of our study, we recommend using a vision transformer model with a larger number of trainable parameters applied on depth data that takes advantage of the perspective transformation process outlined in ref. [
5] and HHA encoding of depth data, since this approach was found to have the greatest performance in detecting periods of intervention compared to other models tested.
The methods developed here examined individual representative frames from each intervention event, sampled every 30 s, which is in line with the state of the art in RGB-based intervention detection [
24]. Our ability to classify the representative frame is expected to reflect the performance of the model when applied to all frames within a continuous period of intervention. We did examine a single period of intervention at greater temporal resolution. For this experiment, we extracted each frame of a 90 s period (2695 frames in total). The period began with no intervention. An intervention (vital sign check and re-swaddling) started after 48 s and continued until the end of the 90 s period. The model (trained on patients different from the test patient) was applied to all 2695 frames, and this performance was compared with the performance estimated from the 14,892 representative frames, originally extracted at 1 frame per 30 s. The resulting performance metrics (Sn = 96.25%, Sp = 100%, Acc = 98.26%, F1 = 98.09%) were equivalent to the performance metrics observed when using the representative frames, validating our approach of evaluating models using representative frames sampled at 1 frame per 30 s. Future work will examine the accuracy with which the precise start and end of each intervention can be determined by the proposed methods. This will be a somewhat nebulous task, since even a human annotator will have difficulty determining the precise start and end points of an intervention (e.g., is the start of the intervention when the clinician’s hands are first visible in frame or when the clinician first makes contact with the patient, etc.).
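The four reported metrics can be computed from confusion matrix counts as sketched below, assuming the standard definitions of sensitivity, specificity, accuracy, and F1 score. The counts in the example are hypothetical and are not the study’s results. The final lines check one internal consistency: when specificity is 100% there are no false positives, so precision is 1 and the F1 score reduces to 2·Sn/(1 + Sn).

```python
def classification_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity, accuracy, and F1 score from confusion counts.

    Assumes at least one actual positive, one actual negative, and one
    predicted positive, so that no denominator is zero.
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {"sensitivity": sensitivity, "specificity": specificity,
            "accuracy": accuracy, "f1": f1}


# Hypothetical counts, chosen only to exercise the formulas.
example = classification_metrics(tp=90, fn=10, tn=200, fp=0)
print(example)

# Consistency check on the reported values: Sp = 100% implies F1 = 2*Sn / (1 + Sn).
reported_sn = 0.9625
f1_from_sn = 2 * reported_sn / (1 + reported_sn)
print(f"{100 * f1_from_sn:.2f}")  # 98.09
```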
  Future Work
This paper studied the effect of multiple design parameters (separately and in conjunction) on the performance of ViT clinical intervention detection models. While the use of perspective transform and HHA encoding was found to be beneficial, supplementing the training data with simulated patient care scenes alone did not improve model performance. This outcome was unexpected, as it was thought that correcting for the class imbalance would improve classification accuracy. Although the best-performing model did not include the use of simulated data, the second best model overall did utilize it (ViT Base Simulated HHA PT). This suggests that simulated data may have promise when used in conjunction with other design parameters, and future research could explore the benefits of incorporating more diverse simulated data from a variety of care settings. Researchers could also consider repeating the experiment with simulated data that are captured in an environment that is more comparable to real-world NICU environments. Since the use of HHA encoding had a detrimental effect on some of the baseline depth-based CNN model’s metrics, a possible explanation for the lack of benefit from HHA-encoded data is that the hyperparameter search space may have been insufficient to fine-tune the model and fully leverage these new data. Future work should expand on this search space to re-examine the potential benefit from HHA encoding and consider identifying or training a foundation model pre-trained on HHA-encoded data. Another avenue of research may be to train and test ViT with patches of varying numbers and sizes to study their effect on the performance change that occurs when using perspective transformation. Perspective transformation was shown to increase the performance of the CNN architecture, though the improvement to the ViT model’s performance was not consistent. 
Future research may also look at background subtraction techniques, where a reference frame containing only the patient is used to highlight differences in depth during an intervention. Lastly, this study examined single-frame depth data; future research will extend this work to consider depth video, since the movements of a clinician in the scene will likely differ from those of the patient. ViT models have recently been extended to RGB video [
30,
31], and 3D CNN models [
32] have also shown great promise for this type of analysis.