Microsoft Kinect Visual and Depth Sensors for Breathing and Heart Rate Analysis

This paper is devoted to a new method of using Microsoft (MS) Kinect sensors for non-contact monitoring of breathing and estimation of heart rate to detect possible medical and neurological disorders. Video sequences of facial features and thorax movements are recorded by MS Kinect image, depth and infrared sensors to enable their time analysis in selected regions of interest. The proposed methodology includes the use of computational methods and functional transforms for data selection, denoising, spectral analysis and visualization in order to determine specific biomedical features. The obtained results verify the correspondence between the breathing frequency estimated from the image and infrared data of the mouth area and that estimated from the thorax movement recorded by the depth sensor. Spectral analysis of the time evolution of the mouth-area video frames was also used for heart rate estimation. Results estimated from the image and infrared data of the mouth area were compared with those obtained by contact measurements with Garmin sensors (www.garmin.com). The study proves that simple image and depth sensors can efficiently record biomedical multidimensional data with sufficient accuracy to detect selected biomedical features using specific methods of computational intelligence. The achieved accuracy of non-contact breathing rate detection was 0.26%, and the accuracy of heart rate estimation was 1.47% for the infrared sensor. The results further show how video frames with depth data can be used to distinguish different kinds of breathing. The proposed method enables data to be obtained and analysed for diagnostic purposes in the home environment or during physical activities, enabling efficient human–machine interaction.


Introduction
Recently developed computational technologies enable the use of non-contact and non-invasive systems for monitoring breathing features, detecting heart rate changes and analysing facial features as important diagnostic tools for studying neurological and sleep disorders, assessing stress and evaluating fitness level [1][2][3][4][5][6][7][8]. Figure 1 presents an example of the mouth area with the proposed use of capillaries for heart rate detection. The methodology based on subtle colour changes caused by blood circulation is complementary to the extraction of pulse rate from video records of the fine head oscillations that accompany the cardiac cycle [9,10]. Complex monitoring of selected biomedical features can be based on processing of data (Figure 2) recorded by Microsoft (MS) Kinect sensors [11][12][13][14][15], forming an inexpensive replacement for conventional devices. These sensors can be used to monitor breathing and heart rate changes during physical activities or sleep in order to analyse disorders by mapping chest movements during different kinds of breathing, with analysis of volume, or to detect neurological movement disorders and support their diagnostics. In general, the use of image and depth sensors [16][17][18][19] allows 3D modelling of the chest and abdomen volume changes during breathing as an alternative to conventional spirometry, complementing the use of video (RGB) cameras [20,21], infrared cameras [22][23][24] or Doppler multi-radar systems [25,26] in addition to ultrasonic and biomotion sensors [27][28][29]. The present paper is devoted to (i) the description of a methodology for breathing and heart rate monitoring based on image, depth and infrared video sequences acquired by MS Kinect in the face and chest area; (ii) the visualization of the obtained data; and (iii) the comparison of biomedical features evaluated from data recorded by non-contact image and depth sensors.
The proposed methodology assumes the use of digital filtering methods for rejection of noise components, resampling, data fusion and spectral analysis for detecting the required biomedical features. Specific statistical methods [30,31] are applied for analysis of the obtained features. The paper presents how new technologies, fusion of signals acquired from different sensors and specific computational intelligence methods can be used in human–machine interaction systems for data processing and for extracting and analysing biomedical features.
The paper provides motivation for further studies of methods related to big data processing problems, dimensionality reduction, principal component analysis and parallel signal processing to increase the speed, accuracy and reliability of the results. Methods related to motion recognition can be further used in multichannel biomedical data processing for the detection of neurological disorders, for facial and movement analysis during rehabilitation and for motion monitoring in assisted living homes [32][33][34]. Further studies are devoted to non-intrusive sleep monitoring systems for analysis of depth video data to track a person's chest and abdomen movements over time [35,36] for diagnosis of sleep disorders.

Data Acquisition
The sequence of separate images was recorded with associated time stamps indicating a video frame rate varying from 7 to 15 fps. Image resolution was 1920 by 1080 pixels, while the depth and infrared camera resolution was 512 by 424 pixels. Figure 2 presents data subregions that were analysed by an MS Kinect system [13,14].
The video camera allowed the colour image components to be followed, with selected results recorded at the mouth area of the individual, as presented in Figure 2a in contour form. Specific mouth detection algorithms and face recognition techniques can be used to observe this region [37,38] using different computer vision methods. Figure 2b presents the mapping area used by the depth sensor [39] to record the distance of individual pixels in mm, stored in a matrix whose size corresponds to the specified resolution. Figure 2c presents the infrared camera image. Rectangular boxes in each figure show the area used for specifying the time evolution of the object features.
Simple tests were limited by the size of the acquired data. With the image data scaled by a factor of 0.4, a 120 s record from the image, depth and infrared sensors occupied 1.2 GB of hard disk space. An overnight record of sleep activities requires about 250 GB of disk space, depending on the frame rate and resolution; specific methods of big data processing should therefore be applied to reduce the evaluation time.

Data Processing
Time stamps recorded with each video frame allowed resampling of the video sequence [40,41] to a constant sampling frequency in the preprocessing stage. A final frame rate of f_s = 10 fps with spline interpolation was selected to compensate for the sampling-time variations caused by the technical conditions of data acquisition [42].
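The resampling step above can be sketched as follows. This is a minimal illustration only: the function name, the use of SciPy's cubic-spline interpolation and the timestamp/signal arrays are assumptions, not the authors' implementation.

```python
import numpy as np
from scipy.interpolate import CubicSpline

def resample_to_uniform(t, x, fs=10.0):
    """Resample an unevenly sampled signal x(t) to a constant rate fs (Hz)
    using cubic-spline interpolation between the recorded time stamps."""
    t = np.asarray(t, dtype=float)
    x = np.asarray(x, dtype=float)
    t_uniform = np.arange(t[0], t[-1], 1.0 / fs)  # uniform time grid
    spline = CubicSpline(t, x)
    return t_uniform, spline(t_uniform)
```

For example, a signal recorded at a frame rate fluctuating between 7 and 15 fps can be converted to a uniform 10 fps series before filtering and spectral analysis.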
The sequences of image and depth maps acquired by the MS Kinect were analysed over a selected rectangular area of R rows and S columns that covered either the mouth or the chest area of the individual. For each matrix D_n(i, j) containing the values inside the region of interest at discrete time n, the mean value was evaluated by the equation

d(n) = (1 / (R S)) ∑_{i=1}^{R} ∑_{j=1}^{S} D_n(i, j).

In this way, the mean value of the pixels in the selected area of each video and depth frame formed a separate time series {d(n)}_{n=0}^{N−1} of length N. To optimize the final algorithm, the resampling mentioned above was applied to this sequence only and not to the complete image frames.
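The ROI averaging above reduces each frame to one number. A minimal sketch (the function name and the frame-stack layout are hypothetical):

```python
import numpy as np

def roi_mean_series(frames, r0, r1, s0, s1):
    """Reduce a stack of frames (shape N x H x W) to the time series d(n)
    of mean pixel values over the rectangular ROI rows r0:r1, cols s0:s1."""
    frames = np.asarray(frames, dtype=float)
    return frames[:, r0:r1, s0:s1].mean(axis=(1, 2))
```

The same routine applies to image, depth and infrared frames; only the ROI bounds differ (mouth or chest area).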
Finite impulse response (FIR) filtering of a selected order M (= 40) was then applied to each obtained signal to evaluate the new sequence {y(n)}_{n=0}^{N−1} using the equation

y(n) = ∑_{k=0}^{M−1} b(k) d(n − k),

with coefficients {b(k)}_{k=0}^{M−1} defined to form a band-pass filter with cut-off frequencies f1 = 0.2 Hz and f2 = 2 Hz, covering the estimated frequencies of breathing and heart rate and rejecting all other frequency components, including the mean signal value and additional noise. Filtering was applied in both forward and reverse directions to minimize start-up and ending transients. To detect the heart rate from the depth sensor data, infinite impulse response (IIR) band-pass filtering by a Butterworth filter of the 4th order was applied to extract frequency components in the range from 0.6 to 1.8 Hz.
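The two filtering stages can be sketched as below, assuming the 10 fps rate after resampling; SciPy's window-method FIR design and zero-phase `filtfilt` stand in for whatever design tools the authors used:

```python
import numpy as np
from scipy.signal import firwin, filtfilt, butter

FS = 10.0  # sampling frequency after resampling (Hz)

def fir_bandpass(x, f1=0.2, f2=2.0, order=40):
    """Order-40 FIR band-pass (0.2-2 Hz), applied forward and reverse
    (zero phase) to suppress start-up and ending transients."""
    b = firwin(order + 1, [f1, f2], pass_zero=False, fs=FS)
    return filtfilt(b, [1.0], x)

def iir_bandpass(x, f1=0.6, f2=1.8, order=4):
    """4th-order Butterworth band-pass for heart-rate extraction
    from the depth sensor signal."""
    b, a = butter(order, [f1, f2], btype="bandpass", fs=FS)
    return filtfilt(b, a, x)
```

Applying either filter removes the mean value of the sequence, so only the oscillatory breathing and cardiac components remain for spectral analysis.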
Spectral components of each recorded sequence were then evaluated by the discrete Fourier transform, forming the sequence

S(k) = ∑_{n=0}^{N−1} y(n) e^{−j 2π k n / N}

for k = 0, 1, ..., N−1, related to frequency f_k = (k / N) f_s. To obtain the time dependence of these values, a short-time Fourier transform with a selected window length was applied.
A local polynomial approximation of the evaluated spectral components was then applied in two specified frequency ranges corresponding to possible frequencies of breathing (F1 to F2 Hz) and heart rate (F3 to F4 Hz). Extremal values were then detected in these ranges by the smoothing polynomial g(x) = ∑_{p=0}^{P−1} c_p x^p of a selected order P, using the least squares method to minimize the summed squared differences between the values of the smoothing function g(f_k) and the K values of the spectral components s(f_k) for frequencies f_k.
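The polynomial smoothing of the spectrum inside a band can be sketched as below; `numpy.polyfit` performs the least squares fit, and the maximum of g(f) is located on a dense grid (the helper and its grid density are assumptions):

```python
import numpy as np

def smoothed_peak(freqs, spectrum, f_lo, f_hi, order=7):
    """Fit a least-squares polynomial g(f) of the given order to the spectral
    components inside [f_lo, f_hi] and return the frequency where g is maximal."""
    freqs = np.asarray(freqs, dtype=float)
    spectrum = np.asarray(spectrum, dtype=float)
    band = (freqs >= f_lo) & (freqs <= f_hi)
    coeffs = np.polyfit(freqs[band], spectrum[band], order)  # c_p coefficients
    f_dense = np.linspace(f_lo, f_hi, 1000)
    return f_dense[np.argmax(np.polyval(coeffs, f_dense))]
```

Smoothing before peak picking makes the estimate less sensitive to isolated noisy spectral bins than taking the raw maximum.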

Results
Analysis of a selected 120 s record of image, depth and infrared video frames in stable conditions is presented in Figure 3. Figure 3a presents the evolution of the red, green and blue mean image values in the mouth area selected according to Figure 2a. The estimation of the breathing rate was confirmed by the mean thorax movement observed in the selected area according to Figure 2b, with the resulting evolution of mean depth matrix values presented in Figure 3b. The dominant frequency peak indicates a frequency of 0.343 Hz in this case (representing 20.58 breaths per minute). The resulting values are summarized in Table 1. The accuracy of image and depth sensor data analysis is better than 0.26% of the breathing rate determined by the thorax movement in this case. The depth sensor is able to detect not only the respiratory rate but also the heart rate, represented by a small peak at a frequency of about 0.95 Hz (57 bpm) in this record, as presented in Figure 3b. To extract this information more precisely, the band-pass filter was applied.
A general methodology for the estimation of respiratory rate and heart rate is presented in Figure 4 for the spectral analysis of a 20 s infrared video sequence using mean values in the mouth region of interest (ROI). An approximation of the spectral components in the selected range by a 7th order polynomial was used to detect the frequency of the highest peak, using the least squares method to minimize the summed squared differences defined by Equation (4). The breathing rate was estimated in the selected frequency band from 12 to 38 breaths per minute. In a similar way, the heart rate was estimated using the local polynomial approximation in the selected frequency band from 55 to 90 bpm.

Figure 3. Time evolution of mean data values in selected frames with their spectra of (a) the red, green and blue components acquired by the image sensor; (b) the depth sensor data evolution with its spectrum; and (c) the infrared data detecting both the breathing frequency and the heart rate during a 120 s data segment.

Analysis of the heart rate estimate based upon a 180 s infrared video sequence is presented in Table 2. Video data were recorded after physical activity, following a decrease in heart rate. The region of interest (ROI) covered the mouth area according to Figures 1 and 2c. Each window was 20 s in duration, and spectral analysis was performed for the whole ROI and for three subregions (ROI1, ROI2, ROI3) selected according to Figure 1a. Results obtained by the MS Kinect non-contact infrared sensor are compared with the heart rate recorded by the Garmin heart rate sensor in the second column of Table 2. The accuracy of the heart rate estimation in the whole ROI of the infrared sensor related to the contact heart rate sensor is better than 1.5% in this case.

Table 2. Estimates of heart rate (HR) evaluated for a 20 s window length by the MS Kinect infrared sensor in the full region of interest (ROI) and its subregions in the mouth area, compared with Garmin records.
Different spectral components estimated in separate subregions over the whole period of 180 s are presented in Figure 5a. While the breathing frequency can be clearly detected in ROI1 and ROI3 (because of the oral corner movement), ROI2, covering the central part of the mouth area, includes more information about the heart rate. Estimates obtained in the whole ROI and the corresponding values recorded by the Garmin sensor are presented in Figure 5b. Average heart rate values evaluated in each recorded minute, presented in Figure 5c, show very good correspondence between these observations.

Figure 5. Heart rate estimation using MS Kinect infrared data and Garmin heart sensor data recorded with a sampling period of 1 s, presenting (a) the infrared image spectral components in separate regions over a period of 180 s; (b) the evolution of heart rate using 20 s video sequences together with Garmin records; and (c) the comparison of mean heart rate values averaged over one minute segments and infrared values averaged over the whole mouth ROI and its three subregions.

Figure 6 presents the mapping of chest movement in the selected area covered by a grid of specified density. The three-dimensional surface of the thorax recorded and analysed at a given time instant is presented in Figure 6a. The contour depth image of the chest with a grid of 10 by 10 values at the specified time is given in Figure 6b. Figure 6c presents the time evolution for 10 grid positions on the vertical axis of the chest, showing different breathing ranges at different locations on the chest. The corresponding video animation can be used to visualize the time evolution of the thorax movement and also provides the possibility of evaluating volume changes. The average range of chest movement at selected grid points over the selected time of 120 s is presented in Figure 7. A higher range can be observed in the upper chest area and smaller values in the abdomen area.
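The per-grid-point breathing range over the chest can be sketched as follows; the function name, grid reduction and frame-stack shape are illustrative assumptions rather than the authors' implementation:

```python
import numpy as np

def movement_range_map(depth_frames, grid=10):
    """Average the depth data over a grid x grid cell layout, then return
    the peak-to-peak range (max - min over time) of each cell in mm."""
    depth_frames = np.asarray(depth_frames, dtype=float)
    n, h, w = depth_frames.shape
    # crop so height and width divide evenly into the grid cells
    cells = depth_frames[:, : h - h % grid, : w - w % grid]
    cells = cells.reshape(n, grid, (h - h % grid) // grid,
                          grid, (w - w % grid) // grid).mean(axis=(2, 4))
    return cells.max(axis=0) - cells.min(axis=0)
```

A map of this kind makes the asymmetry of breathing movements visible, e.g. a larger range in the upper chest than in the abdomen area.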
This method can also be used for non-contact study of the symmetry of breathing movements over the chest.

The final parts of each record in Figure 8a,b present data acquired by the depth sensor during interrupted breathing simulating sleep apnea. While the frequency components associated with breathing disappeared, it is still possible to observe and detect the heart rate, as presented in Figure 8c,d, after evaluation of the spectral components of data in the middle pectoral area. Figure 8c presents the spectral components in window A with the dominant breathing frequency of 0.27 Hz (16.2 breaths per minute) in this case. The dominant frequency component in Figure 8d, evaluated in subwindow B1 of area B with the interrupted breathing, points to a heart rate frequency of 0.69 Hz (41.4 bpm). During the interrupted breathing the heart rate increases, as presented for subwindow B2, pointing to a heart rate frequency of 1.34 Hz (80.3 bpm). This result shows that, for the detection of sleep apnea, both the breathing and the heart rate frequency should be observed and analysed.

Table 2 columns: Starting Time (s); Garmin Record (bpm); Kinect HR Estimate (bpm).
MS Kinect provides a cheap alternative to the video monitoring of sleep abnormalities performed in sleep laboratories by polysomnography, the gold standard diagnostic tool. Polysomnography (PSG) records different biosignals, but it is an intrusive method that may disturb natural sleep. In comparison with other efficient methods, including infrared video monitoring [24], MS Kinect can be used as a cheap device to record video, depth and infrared video sequences not only for sleep monitoring but also as part of assistive home technologies.

Conclusions
This paper presents the use of MS Kinect image, depth and infrared sensors for non-contact monitoring of selected biomedical signals, evaluation of breathing and heart rate using recorded video frames, and verification of the obtained results. It is assumed that estimated features obtained from data retrieved from a natural environment will increase the reliability of such observations related to the diagnosis of possible neurological disorders and the fitness level of the individual. The purpose of this approach is to replace wired sensors by distance monitoring, allowing more convenient data acquisition and analysis.
The obtained data were used for the analysis of biomedical signals recorded after specified body activity. The results show no significant difference between biomedical features obtained by different biosensors and by non-contact MS Kinect technology. While the breathing rate can be recorded with high reliability, non-contact heart rate detection depends upon the visibility of the individual's blood vessels when the video sensor is used.
Results show how the MS Kinect sensors and selected digital signal processing methods can be used to detect the heart rate and to analyse breathing patterns during different kinds of breathing. Processing of video frames acquired during interrupted breathing points to the possible use of these sensors for sleep apnea analysis as well.
The methods described here form an alternative approach to biomedical data acquisition and analysis. Developing the abilities of different biosensors with possibilities of wireless data transmission increases the importance of remote data acquisition and signal analysis using computational intelligence and information engineering in the future. This approach has a wide range of applications not only in biomedicine but also in engineering and robotics. Specific applications based on the analysis of depth matrices allow gait analysis and early diagnosis of locomotor system problems.
Further research will be devoted to algorithms for more precise data acquisition and processing to detect biomedical feature changes for correct diagnosis and for proposing further appropriate treatment. It is assumed that infrared sensors will be used for non-contact analysis during sleep to detect sleep disorders.
Acknowledgments: Real data were kindly provided by the Department of Neurology of the Faculty hospital of Charles University in Hradec Kralove, Czech Republic. The project was approved by the Local Ethics Committee as stipulated by the Helsinki Declaration.
Author Contributions: Aleš Procházka and Oldřich Vyšata designed the study and reviewed relevant literature, Martin Schatz was responsible for algorithmic tools allowing data acquisition, and Oldřich Vyšata and Martin Vališ interpreted the acquired data and results obtained from the neurological point of view. All authors have read and approved the final manuscript.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: