Regressive Machine Learning for Real-Time Monitoring of Bed-Based Patients
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
The paper presents an ensemble machine learning approach to detect patient movements from sensor readings and thereby enhance patient safety. Although there is no theoretical contribution, the proposed method demonstrates the application of existing machine learning algorithms in a new application domain. Here are the strengths of this paper:
- The abstract provides an adequate problem description, a literature review summary, and the proposed methodologies, and it highlights the results and contributions of this paper.
- Section 2 (Materials and Methods) provided enough information about the proposed methodology, including the dataset characteristics, machine learning models to detect movement from sensors, and performance measurement.
- Section 3 (Results and Discussion) provided good coverage of dataset information, experiment results (tables), and discussions.
However, some major and minor problems in this paper need to be revised:
A. There was no description of the collected dataset. The authors should provide the following information:
- How many mannequins were involved in the data collection (to which the Movella DOT sensors were attached to gather the data)?
- What is the size of the dataset (per movement class)? What confused me was that the authors mentioned 200 instances in line 164, yet in the confusion matrix in Figure 3 the number of instances was 248.
B. There were some unclear explanations of the conducted experiments and result collection methods in section 3.
- Lines 163-163: SMOTE has been applied to give each class 200 instances. With an 80% training / 20% testing data split, I assumed each class would have 40 test instances. The confusion matrix in Figure 3 does not match this assumption.
- How many runs were performed for each method (Decision Tree Regressor (DTR), Gradient Boosting Regressor (GBR), and Bagging Regressor (BR))? I would prefer the median and standard deviation to be provided in the result tables to establish the statistical confidence of those experiments.
- Was any hyperparameter tuning performed for each method (DTR, GBR, and BR)? The hyperparameters for these experiments should be listed.
C. This research is limited by its small collected dataset, which raises serious questions about its applicability in real-world settings. The research also used the SMOTE technique to deal with the imbalanced data problem. However, the authors did not provide a justification or discuss how to evaluate the quality of the synthetic data produced by SMOTE.
D. There was no explicit Related Works or Literature Review section; it has been combined into section 1 (Introduction). Section 1 (Introduction) should introduce the problem setting and motivation of the research, summarize related works, briefly introduce the proposed methodology, and emphasize the contribution of this research. The authors should develop a separate Related Works section (and move some paragraphs from section 1 into this new section). The Related Works section can focus on two directions: methods to enhance safety in healthcare settings, and machine learning/deep learning models.
E. There is no explicit Conclusion section (the paper ended with the Discussion section). Some of the last paragraphs of the Results and Discussion section should be moved into the Conclusion section. This Conclusion section should again summarize the problem statement, proposed methodology, experiments, results, and discussions and explain limitations and future works.
Comments for author File: Comments.pdf
Many places in this paper should be improved in written English or formats. Here are some common problems:
- Missing articles, such as a/the, or punctuation between sentence phrases.
- Word choices and verb tenses.
- Singular or plural words.
- Acronyms not spelled out in full when first used in the text.
Please check the attached PDF file with the detailed highlighted comments to improve the writing of this article.
Author Response
We are profoundly thankful for the time and effort you have invested in reviewing our manuscript. Your detailed corrections to our English were especially beneficial; although somewhat humbling to acknowledge, they have deepened our appreciation of your comprehensive review. Your contribution is genuinely valued.
Comment 1: How many mannequins were involved in the data collection (to which the Movella DOT sensors were attached to gather the data)?
Response 1: In our study, we used a single high-fidelity mannequin for data collection. The reason for this choice was to maintain consistency in the data collected and to control for any potential variability that might arise from using multiple mannequins. The Movella DOT sensors were attached to this mannequin to simulate and capture the various movements typically experienced by inpatient individuals in a bed setting. This approach allowed us to focus on the effectiveness of the machine learning algorithms in detecting these movements, rather than the potential differences in data that could arise from using different mannequins. We added the following in the methods section to include that information: "In the data collection process, a single high-fidelity mannequin was utilized. This mannequin was equipped with Movella DOT sensors to simulate and capture the various movements typically experienced by inpatient individuals in a bed setting. The use of a single mannequin ensured consistency in the data collected and controlled for any potential variability that might arise from using multiple mannequins. This approach allowed the focus to be on the effectiveness of the machine learning algorithms in detecting these movements.".
Comment 2: What is the size of the dataset (per movement class)? What confused me was that the authors mentioned 200 instances in line 164, yet in the confusion matrix in Figure 3 the number of instances was 248.
Response 2: The dataset used in our study was indeed split into training, validation, and testing sets. This is a common practice in machine learning to ensure that the model is evaluated on unseen data, thereby providing a more realistic measure of its performance. The size of the training set, therefore, does not necessarily correlate with the number of instances in the raw dataset. Furthermore, we employed the Synthetic Minority Over-sampling Technique (SMOTE) to generate synthetic data for the minority class in the dataset. This was done to mitigate the issue of class imbalance, which could potentially lead to model bias. The synthetic data generated by SMOTE is included in the training set, which could explain why the number of instances in the confusion matrix (which includes both real and synthetic instances) is larger than the number of instances mentioned in line 164 (which likely refers to the real instances only). It is also worth noting that the learning curves show different training sizes because they are generated by incrementally increasing the size of the training set and evaluating the model's performance; this assesses how the model's performance improves with more training data. The horizontal axis of the learning curves, therefore, represents the size of the training set used to train the model, not the number of instances in the raw dataset. To address this point, the following was added at the end of subsection 2.6: "In ML studies, the dataset is divided into training and testing sets to ensure a robust evaluation of the model's performance. Furthermore, in this study, SMOTE was employed to generate synthetic data for the minority class, thereby balancing the class distribution in the training set. Note that this explains the discrepancy between the number of instances mentioned above and the number of instances shown in the confusion matrix below. Similarly, the learning curves were generated by incrementally increasing the size of the training set, which is why the horizontal axis shows different training sizes. Hence, the size of the training set used to train the model does not correlate with the number of instances in the raw dataset.".
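To illustrate the mechanism described above, here is a minimal sketch of how such learning curves are typically produced with scikit-learn; the placeholder data, fold count, and R2 scoring are our assumptions for illustration, not values taken from the manuscript.

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import learning_curve

rng = np.random.default_rng(0)
X = rng.normal(size=(240, 3))                    # placeholder angles about three axes
y = rng.integers(0, 6, size=240).astype(float)   # placeholder movement labels

# Incrementally larger training subsets yield one (train, validation) score
# pair per size -- these sizes form the horizontal axis of a learning curve.
sizes, train_scores, val_scores = learning_curve(
    BaggingRegressor(random_state=42), X, y,
    train_sizes=np.linspace(0.1, 1.0, 10), cv=5, scoring="r2")
```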
Comment 3: SMOTE has been applied to give each class 200 instances. With an 80% training / 20% testing data split, I assumed each class would have 40 test instances. The confusion matrix in Figure 3 does not match this assumption.
Response 3: The Synthetic Minority Over-sampling Technique (SMOTE) was indeed applied to balance the class distribution in the dataset, resulting in each class having 200 instances. However, it is important to note that SMOTE was applied before the data splitting process. Therefore, the synthetic instances generated by SMOTE are included in both the training and testing sets. The data was then split into training and testing sets, with 80% of the data used for training and 20% used for testing. However, this does not necessarily mean that each class would have exactly 40 test instances. The data splitting process is random, and the distribution of classes in the testing set depends on this randomness. Therefore, it is possible that some classes have more than 40 instances in the testing set, while others have fewer. The confusion matrix in Figure 3 represents the performance of the model on the testing set. The discrepancy between the number of instances in the confusion matrix and your assumption could be due to the randomness of the data splitting process and the inclusion of synthetic instances in the testing set. To address this point, we added the following paragraph in subsection 3.3: "In this study, SMOTE was employed to balance the class distribution in the dataset. Note that SMOTE was applied before the data splitting process, and therefore, the synthetic instances generated by SMOTE are included in both the training and testing sets. The data was then split into training and testing sets. However, the distribution of classes in the testing set is subject to the randomness of the data splitting process, and therefore, it does not necessarily reflect the exact ratio of the original class distribution. The confusion matrix in Figure 3 represents the performance of the model on the testing set, which includes both real and synthetic instances.".
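The order of operations described in this response can be sketched as follows; this is a minimal illustration with placeholder data, assuming scikit-learn and imbalanced-learn, and is not the exact pipeline from the manuscript.

```python
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(180, 3))     # placeholder sensor features
y = rng.integers(0, 6, size=180)  # placeholder integer-encoded movement labels

# Oversampling happens BEFORE the split, so SMOTE's synthetic points end up
# in both subsets; the split below is not stratified, so per-class test
# counts vary around (rather than equalling) 20% of each class.
X_bal, y_bal = SMOTE(random_state=42).fit_resample(X, y)
X_train, X_test, y_train, y_test = train_test_split(
    X_bal, y_bal, test_size=0.2, random_state=42)
```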
Comment 4: How many runs were performed for each method (Decision Tree Regressor (DTR), Gradient Boosting Regressor (GBR), and Bagging Regressor (BR))? I would prefer the median and standard deviation to be provided in the result tables to establish the statistical confidence of those experiments.
Response 4: In our study, we utilized a KFold Cross-Validation technique to evaluate the performance of our chosen model, the ensemble model, specifically the Bagging Regressor (BR). This method involves partitioning the dataset into 'k' subsets or folds. Each unique subset is taken as a test dataset, while the remaining subsets form the training dataset. A model is then fitted on the training set and evaluated on the test set. This process is repeated 'k' times, resulting in an array of 'k' scores, one for each run. Consequently, the number of runs performed for the BR method is equivalent to the number of folds used in the cross-validation. The resulting mean and standard deviations are already presented in Table 4. We did not calculate these statistics for the other models (Decision Tree Regressor (DTR) and Gradient Boosting Regressor (GBR)) before deciding to proceed with BR. Our decision to choose BR over DTR and GBR was based on specific reasons, which we have articulated and defended in the manuscript. The mean and standard deviations of the other models, which we decided not to use further, would not have influenced this decision. We added the following paragraph in subsection 3.1 to address your point: "The BR, our chosen model, was evaluated using a KFold Cross-Validation technique. This involved dividing the dataset into 'k' subsets. Each unique subset served as a test dataset, with the remaining subsets forming the training dataset. The model was fitted on the training set and evaluated on the test set, a process repeated 'k' times. Therefore, the number of runs for the BR method equates to the number of folds in the cross-validation. The resulting mean and standard deviations are detailed in Table 4. The DTR and GBR were not subjected to these calculations before the decision to proceed with BR was made. The choice of BR over DTR and GBR was based on specific considerations, which are outlined and defended above. The mean and standard deviations of DTR and GBR, which were not used further, did not influence this decision.".
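As a sketch of the procedure described above, with placeholder data; the number of folds and the R2 scoring are illustrative assumptions, since the response does not restate them.

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1200, 3))
y = rng.integers(0, 6, size=1200).astype(float)

# One fitted model and one score per fold; the mean and standard deviation
# of these k scores are the statistics reported in Table 4.
scores = cross_val_score(BaggingRegressor(random_state=42), X, y,
                         cv=KFold(n_splits=5, shuffle=True, random_state=42),
                         scoring="r2")
print(f"mean={scores.mean():.3f}, std={scores.std():.3f}")
```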
Comment 5: Was any hyperparameter tuning performed for each method (DTR, GBR, and BR)? The hyperparameters for these experiments should be listed.
Response 5: We conducted a series of experiments with various parameters in an attempt to enhance the performance of the DTR, GBR, and BR methods. However, these experiments did not yield any significant improvement over the default settings. The default hyperparameters provided by the Python libraries for these methods are well-optimized and have been extensively tested by the community. By adhering to these default settings, we were able to maintain a simpler methodology, which we believe is crucial for easier reproducibility and replicability of our study. This approach, we argue, strengthens the robustness and generalizability of our findings. We added the following paragraph at the end of subsection 2.3: "The default hyperparameters, as provided by the Python libraries for the DTR, GBR, and BR methods, were employed in this study. This selection was based on a series of experiments involving various parameters, none of which demonstrated a significant improvement over the default settings. These default hyperparameters, having been well-optimized and extensively tested by the ML community, were deemed appropriate. By adhering to these default settings, a simpler methodology was maintained without sacrificing the results, which is considered beneficial at this stage for facilitating easier reproducibility and replicability of the study.".
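For concreteness, the three models can be instantiated with their library defaults as below; the random_state values are illustrative additions, and the commented defaults are those documented by scikit-learn.

```python
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import GradientBoostingRegressor, BaggingRegressor

dtr = DecisionTreeRegressor(random_state=42)
# scikit-learn defaults: n_estimators=100, learning_rate=0.1, max_depth=3
gbr = GradientBoostingRegressor(random_state=42)
# scikit-learn defaults: n_estimators=10; a decision tree is also the
# default base estimator, made explicit here for clarity
br = BaggingRegressor(DecisionTreeRegressor(), random_state=42)
```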
Comment 6: This research is limited by its small collected dataset, which raises serious questions about its applicability in real-world settings. The research also used the SMOTE technique to deal with the imbalanced data problem. However, the authors did not provide a justification or discuss how to evaluate the quality of the synthetic data produced by SMOTE.
Response 6: The limitations of our study due to the small dataset size are acknowledged. However, it is important to note that the data collection process in this field is often challenging due to privacy concerns and the sensitive nature of the data. Despite the small size, we believe our dataset is representative and provides valuable insights. As for the use of the Synthetic Minority Over-sampling Technique (SMOTE), it was employed to address the issue of imbalanced data, a common problem in machine learning. We understand the concern about the quality of synthetic data. However, SMOTE is a widely accepted method for dealing with imbalanced datasets and has been proven to improve the performance of machine learning models in numerous studies. We have addressed this point in the discussion section as follows: "The constraints imposed by the limited size of the dataset are recognized in this study. Data collection in this domain can be challenging due to privacy considerations and the sensitive nature of the information. Although the dataset is of limited size, it is regarded as representative and provides useful insights. It has been utilized to train an ML model capable of detecting and classifying different types of movements in bed-based patients. To address the issue of imbalanced data, SMOTE was utilized. While the quality of synthetic data may raise concerns, it is important to highlight that SMOTE is a widely accepted approach for handling imbalanced datasets in ML. Its effectiveness in enhancing the performance of ML models has been demonstrated in numerous studies [21,22].".
Comment 7: The authors should develop a separate Related Works section (and move some paragraphs from section 1 into this new section). The Related Works section can focus on two directions: methods to enhance safety in healthcare settings, and machine learning/deep learning models.
Response 7: Thank you for bringing that to our attention. We have now created the new subsection as requested.
Comment 8: Some of the last paragraphs of the Results and Discussion section should be moved into the Conclusion section. This Conclusion section should again summarize the problem statement, proposed methodology, experiments, results, and discussions and explain limitations and future works.
Response 8: Done! We appreciate your feedback. By addressing your comments, we have been able to greatly improve our manuscript.
Reviewer 2 Report
Comments and Suggestions for Authors
While the manuscript exhibits technical depth, several critical areas require refinement to enhance its overall scientific rigor and practical applicability. The discussion of the limitations is notably superficial, with minimal attention to key challenges such as overfitting, sensor limitations, or the generalizability of results across diverse patient populations and environments. There is also a lack of statistical validation in the results, such as the absence of confidence intervals or p-values to verify the significance of differences between models, which weakens the robustness of the conclusions. Furthermore, there is insufficient consideration of the computational complexity and real-time feasibility of deploying the Bagging Regressor in a healthcare setting, a crucial factor for clinical adoption. The omission of specific clinical pathways for implementing this monitoring system, such as integration with current healthcare workflows, required staff training, and system maintenance, limits the practical guidance provided to healthcare practitioners. The limited exploration of alternative feature selection strategies, hyperparameter tuning processes, and detailed analysis of model interpretability leaves a gap in the understanding of how to further optimize or trust these models in high-stakes healthcare settings.
Below, the authors can find a detailed section-by-section report. I strongly suggest that the authors answer all the questions raised by the reviewer and properly insert all the answers into the final manuscript.
Abstract
It lacks sufficient technical depth to make it fully informative and impactful for a technical audience. By incorporating more quantitative data, specifying dataset characteristics, and better explaining the choice and benefits of the models used, the abstract could significantly enhance its clarity and scientific rigor. The impact of the proposed solution on practical healthcare applications should also be better articulated.
If there is no space in the abstract to properly answer the following questions, insert your answers within the body of the manuscript in the rest of the sections.
A1) What are the exact performance metrics (e.g., accuracy, R2 score) for each model, and how do they compare to one another in terms of both training and test datasets?
A2) What were the specific characteristics of the dataset used, including the distribution of each movement type, and how was the class imbalance managed beyond the use of SMOTE?
A3) How were the hyperparameters of each model tuned, and what were the key hyperparameters that significantly influenced model performance?
A4) Why was the Bagging Regressor selected as the final model, and what trade-offs did it present compared to the other models (e.g., computational cost, interpretability, scalability)?
A5) How does the proposed system address practical implementation challenges in real-time monitoring, such as data latency, false positives/negatives in predictions, and integration with hospital IT infrastructure?
1. Introduction
The section would benefit from a deeper exploration of previous work and a more explicit delineation of how the current study advances the field. The section could be further enhanced by incorporating a clear research hypothesis, explicitly stating the study’s objectives, and providing a detailed explanation of the technical and clinical challenges addressed by the proposed solution.
1.1) What specific limitations of previous fall detection methods are overcome by the proposed ensemble regression model, and how do these improvements impact real-world healthcare settings?
1.2) What unique data preprocessing steps are employed in this study to address challenges like sensor noise, data imbalance, or patient variability, and how are they different from those used in similar studies?
1.3) How does the proposed model intend to address the trade-off between high model complexity (which could lead to overfitting) and the generalizability required for real-time patient monitoring in diverse healthcare environments?
1.4) Can you elaborate on the specific feature selection process used to identify relevant features from the sensor data? How were these features validated to ensure they adequately represent patient movements?
1.5) How is the proposed model's computational efficiency characterized, and what are the hardware and software requirements for real-time application in a hospital setting?
1.6) It would be beneficial for the reader if the authors included some recent technologies in this section and compared such methods with their own: machine learning in health [Deep Learning Techniques and COVID-19 Drug Discovery: Fundamentals, State-of-the-Art and Future Directions. Emerging Technologies During the Era of COVID-19 Pandemic, 2021, Volume 348] and Generative AI applications for diagnostics [A Survey of Generative AI Applications. arXiv preprint arXiv:2306.02781].
2. Materials and Methods
This section would benefit greatly from a deeper analysis of the dataset and the features derived from it. The lack of hyperparameter tuning and computational assessment diminishes the rigor of the model evaluation, and these areas should be expanded to provide a more complete picture. Additionally, presenting further class-specific evaluation metrics (precision, recall, F1 score) would provide richer insights into model performance.
2.1) What specific feature extraction or dimensionality reduction techniques were considered (e.g., Principal Component Analysis, t-SNE), and how did they impact the overall performance of the model?
2.2) How was the hyperparameter optimization conducted for each model, and what were the specific ranges tested for key hyperparameters like learning rate, max depth, or number of estimators?
2.3) How did the distribution of movement classes change after applying SMOTE, and how did it affect the predictive power for minority classes compared to the original distribution?
2.4) Were there any additional preprocessing steps (e.g., data augmentation, noise filtering) performed on the sensor data to account for potential artifacts or outliers, and how were these determined to be necessary?
2.5) What is the computational complexity of training and deploying the Bagging Regressor in a real-time hospital setting, and were any methods employed to reduce computational overhead?
3. Results
The section lacks sufficient statistical analysis and detailed error insights, which are necessary for assessing the robustness and real-world utility of the findings. Additionally, a more in-depth exploration of the confusion matrix and practical clinical implications would greatly enhance the understanding of the model's applicability.
3.1) How do the performance metrics (R2, MSE, accuracy) vary across different movement classes, and what do these variations indicate about the model's ability to generalize?
3.2) What specific statistical significance tests could be applied to the obtained metrics, and what do these tests reveal about the reliability of differences between the models' performances?
3.3) How do sensor placement, noise levels, and sampling rates impact the accuracy and precision of the models, particularly in misclassifying movements like "Breathing" and "Seizure"?
3.4) Can you quantify the degree of overfitting observed in the models without SMOTE, and how do specific hyperparameter choices exacerbate or mitigate this effect?
3.5) How does the confusion between classes (e.g., "Breathing" vs. "Seizure") impact the clinical applicability of the model, and what strategies could mitigate such errors in a real-time monitoring environment?
4. Discussion
The section lacks sufficient depth in areas such as limitations, computational feasibility, and generalizability. There is also a need for more concrete recommendations for future work and a more comprehensive discussion of how the proposed system could be implemented in clinical settings.
4.1) How does the generalizability of the Bagging Regressor model hold up when applied to different patient populations, and what adaptations might be needed for diverse clinical settings?
4.2) What are the computational requirements for running the proposed model in real-time, and how do these requirements compare to those of more traditional monitoring systems in terms of cost-effectiveness?
4.3) Can you elaborate on the specific limitations of the sensors used in data collection, and how do these limitations affect the reliability and accuracy of movement detection?
4.4) How does the sensor-based monitoring approach compare to video monitoring in terms of detection latency, false positives, and real-time responsiveness, particularly in critical situations like falls?
4.5) What measures could be taken to enhance the interpretability of the Bagging Regressor model, and how important is this interpretability for gaining clinician trust in a real-world healthcare environment?
Author Response
We would like to express our sincere gratitude for your insightful and constructive comments on our manuscript. Your detailed feedback has been instrumental in identifying areas of improvement and has significantly contributed to enhancing the quality of our work.
Comments 1: Abstract
Response 1: The abstract has been revised to incorporate your points A1, A2, A4, and A5. Point A3 is addressed separately in subsection 2.3.
Comment 2: What specific limitations of previous fall detection methods are overcome by the proposed ensemble regression model, and how do these improvements impact real-world healthcare settings?
Response 2: The current manuscript presents an ensemble regression model that addresses several limitations of previous fall detection methods. Traditional methods, such as video monitoring, while effective, raise significant privacy concerns as they require constant surveillance of patients. Our approach, on the other hand, leverages sensor technology, which not only respects patient privacy but also allows for automated detection, eliminating the need for continuous manual monitoring. Furthermore, the majority of existing studies primarily focus on fall prevention, whereas our model goes a step further. It detects six different types of movements commonly performed by bed-based patients, providing a more comprehensive and practical solution for real-time patient monitoring. We've added this paragraph in the introduction to address this point: "Hence, the proposed model overcomes specific limitations of previous fall detection methods, thereby enhancing its applicability in real-world healthcare settings. Unlike video monitoring methods, which pose privacy issues due to constant surveillance, our model employs sensor technology, ensuring patient privacy. The sensor-based approach not only facilitates automated detection, a significant advancement over video methods, but also reduces the need for healthcare professionals to continuously monitor a screen. Most importantly, while other studies have concentrated solely on fall prevention, our model broadens the scope by detecting six different types of movements commonly performed by bed-based patients. This detection capability not only improves patient safety but also contributes to more efficient and effective patient care in healthcare settings.".
Comment 3: What unique data preprocessing steps are employed in this study to address challenges like sensor noise, data imbalance, or patient variability, and how are they different from those used in similar studies?
Response 3: Sensor noise, a common issue in similar studies, was not detected in our case, thus eliminating the need for specific preprocessing steps to handle it. Data imbalance, another prevalent problem, was addressed using the Synthetic Minority Over-sampling Technique (SMOTE), a method not commonly used in similar studies. As for patient variability, it was not specifically addressed in this study, and we acknowledge that future research needs to focus on this aspect. Unlike similar studies that primarily focus on fall detection, usually from videos, our study classifies six distinct movements based on sensor data, thereby providing privacy to the patients, a feature that is highly valued in healthcare settings. Based on your feedback, all these points were addressed across several sections, including the introduction, methods, and discussion.
Comment 4: How does the proposed model intend to address the trade-off between high model complexity (which could lead to overfitting) and the generalizability required for real-time patient monitoring in diverse healthcare environments?
Response 4: The learning curves of our model, as presented in the manuscript, show the convergence of the training and validation curves, with the validation curve consistently remaining slightly below the training curve. This pattern is a strong indicator that our model, despite its complexity, is not prone to overfitting. We acknowledge that the generalizability of our model is an aspect that requires further exploration and refinement. Our future research plans include testing the model on human subjects, as opposed to mannequins, and in real-life settings to ensure its applicability and effectiveness in diverse healthcare environments. We've added the following in the discussion section: "The learning curves of the BR model illustrate the convergence of the training and validation curves, with the validation curve consistently staying slightly below the training curve. This pattern indicates that the model, in spite of its complexity, does not appear to be overfitting. Nonetheless, it is acknowledged that the generalizability of the model is a subject that requires further exploration. Future research directions are set to include testing the model on human subjects and in real-life environments to verify its applicability in a variety of healthcare settings.".
Comment 5: Can you elaborate on the specific feature selection process used to identify relevant features from the sensor data? How were these features validated to ensure they adequately represent patient movements?
Response 5: Our study is fundamentally different from a typical classification problem, where specific features are selected and validated. Instead, our research is based on a regression model, which uses the entire dataset for training. The sensor data we collected captures the angles around the x, y, and z axes as the mannequin performed one of six distinct movements. This data forms the basis of our model, and the concept of feature selection is not applicable in the traditional sense. The model's performance is evaluated based on its ability to accurately predict the movements, not on the selection of specific features. To explain this, we added the following paragraph in subsection 2.2: "The methodology diverges from traditional feature selection processes due to the regression-based nature of the model. In contrast to the conventional feature selection methods often employed in classification problems, this study assesses the model's effectiveness based on its precise prediction of movements, using the entire dataset. This dataset is compiled from measurements captured by sensors as a mannequin executed six unique movements along three specific axes: east-west, north-south, and up-down. Consequently, the typical notion of feature selection, frequently observed in classification problems, is not directly applicable in this regression-based scenario. The model's performance is instead evaluated on its capacity to accurately forecast the movements, utilizing the entire dataset.".
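To make the regression framing concrete, the sketch below trains a regressor directly on integer-encoded movement labels and recovers a class by rounding the continuous prediction. The data are placeholders, and whether the manuscript maps predictions to classes in exactly this way is our assumption for illustration.

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))     # angles about the east-west, north-south, up-down axes
y = rng.integers(0, 6, size=300)  # six movements encoded as integers 0..5

model = BaggingRegressor(random_state=42).fit(X, y)
# The regressor outputs continuous values; rounding and clipping to the
# valid label range yields a movement class for each sample.
pred = np.clip(np.rint(model.predict(X)), 0, 5).astype(int)
```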
Comment 6: How is the proposed model's computational efficiency characterized, and what are the hardware and software requirements for real-time application in a hospital setting?
Response 6: The current stage of the study is primarily focused on research and the development of the model. The computational efficiency of the proposed model and the specific hardware and software requirements for real-time application in a hospital setting were not the primary focus of this stage. However, it is noteworthy that the model was trained, validated, and tested using a single PC unit within a matter of seconds to minutes. This suggests that the computational efficiency of the model is quite high and that the hardware requirements are not particularly demanding. While the specific software requirements have not been evaluated, the successful implementation of the model on a single PC unit indicates that it could feasibly be applied in a real-time hospital setting. We added the following paragraph in the discussion section: "The computational efficiency of the proposed model is characterized by its ability to be trained, validated, and tested within a short time frame on a single PC unit. This suggests that the model is computationally efficient and could potentially be implemented in a real-time hospital setting. However, the specific hardware and software requirements for such an application have not been evaluated in this study. Future research could focus on determining these requirements to facilitate the implementation of the model in a real-time hospital setting. It is important to note that the primary focus of this study was the development and testing of the model, and as such, the specific computational requirements for real-time application were not evaluated.".
Comment 7: Why was the Bagging Regressor selected as the final model, and what trade-offs did it present compared to the other models (e.g., computational cost, interpretability, scalability)?
Response 7: The Bagging Regressor was chosen as the final model due to its superior performance in terms of accuracy and robustness. The model demonstrated a high degree of predictive accuracy during the testing phase, outperforming the other models considered in the study. Furthermore, Bagging Regressor models are known for their robustness to overfitting, which is a significant advantage in predictive modeling. Regarding trade-offs, while the Bagging Regressor may have a higher computational cost compared to simpler models, its superior predictive performance justified this cost; moreover, as explained in the previous response, the computational cost is not high in any case. In terms of interpretability, the Bagging Regressor may not be as straightforward as simpler models, but the focus of this study was on predictive accuracy rather than model interpretability. Lastly, Bagging Regressors are highly scalable, making them suitable for future expansions of the study. We added the following at the end of subsection 3.0: "BR models are known for their robustness to overfitting, which is a significant advantage in predictive modeling. While the BR may present a higher computational cost compared to simpler models, its superior performance justifies this trade-off. Although the interpretability of the BR may not be as straightforward as that of simpler models, the primary focus of this study was on achieving high predictive accuracy. Furthermore, the BR is highly scalable, making it suitable for future expansions of the study.".
Comment 8: It would be beneficial for the reader if the authors included some recent technologies in this section and compared such methods with their own: machine learning in health [Deep Learning Techniques and COVID-19 Drug Discovery: Fundamentals, State-of-the-Art and Future Directions. Emerging Technologies During the Era of COVID-19 Pandemic, 2021, Volume 348] and Generative AI applications for diagnostics [A Survey of Generative AI Applications. arXiv preprint arXiv:2306.02781].
Response 8: Thank you for bringing those studies/reviews to our attention. Both are now integrated into our manuscript.
Comment 9: What specific feature extraction or dimensionality reduction techniques were considered (e.g., Principal Component Analysis, t-SNE), and how did they impact the overall performance of the model?
Response 9: Studies centered on regression machine learning do not typically use specific feature extraction or dimensionality reduction techniques such as Principal Component Analysis or t-SNE. The primary focus of regression machine learning is to predict a continuous outcome variable based on one or more predictor variables. Hence, our main concern is the relationship between these variables, rather than the extraction or reduction of features. Subsection 2.2 addresses this in the following paragraph: "The methodology diverges from traditional feature selection processes due to the regression-based nature of the model. In contrast to the conventional feature selection methods often employed in classification problems, this study assesses the model's effectiveness based on its precise prediction of movements, using the entire dataset. This dataset is compiled from measurements captured by sensors as a mannequin executed six unique movements along three specific axes: east-west, north-south, and up-down. Consequently, the typical notion of feature selection, frequently observed in classification problems, is not directly applicable in this regression-based scenario. The model's performance is instead evaluated on its capacity to accurately forecast the movements, utilizing the entire dataset.".
Comment 10: How was the hyperparameter optimization conducted for each model, and what were the specific ranges tested for key hyperparameters like learning rate, max depth, or number of estimators?
Response 10: We conducted a series of experiments with various parameters in an attempt to enhance the performance of the DTR, GBR, and BR methods. However, these experiments did not yield any significant improvement over the default settings. The default hyperparameters provided by the Python libraries for these methods are well-optimized and have been extensively tested by the community. By adhering to these default settings, we were able to maintain a simpler methodology, which we believe is crucial for easier reproducibility and replicability of our study. This approach, we argue, strengthens the robustness and generalizability of our findings. We added the following paragraph at the end of subsection 2.3: "The default hyperparameters, as provided by the Python libraries for the DTR, GBR, and BR methods, were employed in this study. This selection was based on a series of experiments involving various parameters, none of which demonstrated a significant improvement over the default settings. These default hyperparameters, having been well-optimized and extensively tested by the ML community, were deemed appropriate. By adhering to these default settings, a simpler methodology was maintained without sacrificing the results, which is considered beneficial at this stage for facilitating easier reproducibility and replicability of the study.".
Comment 11: How did the distribution of movement classes change after applying SMOTE, and how did it affect the predictive power for minority classes compared to the original distribution?
Response 11: SMOTE was implemented to address the evident overfitting in our model, as indicated by the non-convergence of the training and validation curves. Without the application of SMOTE, the model was prone to overfitting, which would have compromised its predictive power, particularly for minority classes. The application of SMOTE helped to alleviate this issue, leading to the convergence of the learning curves. While we did not specifically evaluate the changes in the distribution of movement classes after applying SMOTE, the overall enhancement in the model's performance suggests that SMOTE effectively improved the predictive power for minority classes. Additionally, we calculated the mean and standard deviations for the BR model both without and with SMOTE, further substantiating the positive impact of SMOTE on the model's performance. At the end of subsection 3.2, we added the following paragraph: "SMOTE was employed to mitigate the issue of overfitting, as evidenced by the non-convergence of the training and validation curves in the initial model (Figure 4(a)). The application of SMOTE led to the convergence of these curves (Figure 4(b)), indicating an improvement in the model's performance. Although the specific changes in the distribution of movement classes after applying SMOTE were not directly evaluated, the overall enhancement in the model's performance suggests that SMOTE effectively bolstered the predictive power for minority classes.".
Comment 12: Were there any additional preprocessing steps (e.g., data augmentation, noise filtering) performed on the sensor data to account for potential artifacts or outliers, and how were these determined to be necessary?
Response 12: Our data collection process was designed in a way that inherently minimized the potential for noise and outliers. We utilized a single mannequin to perform six distinct types of movements, during which we measured and collected the Euler angles. This controlled environment significantly reduced the likelihood of extraneous variables introducing noise or causing outliers. Consequently, additional preprocessing steps such as data augmentation or noise filtering were not deemed necessary for this particular study. We acknowledge that in real-life scenarios, the situation might be different and additional preprocessing methods may be required to account for potential artifacts or outliers. This is a consideration we plan to incorporate in our future research. We added the following paragraph in the discussion section as a limitation of the study and a need for future research: "The data collection process was designed to minimize the potential for noise and outliers. A single mannequin was employed to perform six distinct types of movements, and the Euler angles were measured and collected during these movements. This controlled setup significantly reduced the likelihood of extraneous variables introducing noise or causing outliers. Therefore, additional preprocessing steps such as data augmentation or noise filtering were not deemed necessary. However, it is acknowledged that in real-life scenarios, additional preprocessing methods may be required to account for potential artifacts or outliers. This is an area of focus for future research.".
Comment 13: What is the computational complexity of training and deploying the Bagging Regressor in a real-time hospital setting, and were any methods employed to reduce computational overhead?
Response 13: The current stage of the study is primarily focused on research and the development of the model. The computational efficiency of the proposed model and the specific hardware and software requirements for real-time application in a hospital setting were not the primary focus of this stage. However, it is noteworthy that the model was trained, validated, and tested using a single PC unit within a matter of seconds to minutes. This suggests that the computational efficiency of the model is quite high and that the hardware requirements are not particularly demanding. While the specific software requirements have not been evaluated, the successful implementation of the model on a single PC unit indicates that it could feasibly be applied in a real-time hospital setting. We added the following paragraph in the discussion section: "The computational efficiency of the proposed model is characterized by its ability to be trained, validated, and tested within a short time frame on a single PC unit. This suggests that the model is computationally efficient and could potentially be implemented in a real-time hospital setting. However, the specific hardware and software requirements for such an application have not been evaluated in this study. Future research could focus on determining these requirements to facilitate the implementation of the model in a real-time hospital setting. It is important to note that the primary focus of this study was the development and testing of the model, and as such, the specific computational requirements for real-time application were not evaluated.".
Comment 14: How do the performance metrics (R2, MSE, accuracy) vary across different movement classes, and what do these variations indicate about the model's ability to generalize?
Response 14: The performance metrics utilized in this study, namely R2, MSE, and accuracy, are typically employed to assess the overall performance of the model, not its performance on individual movement classes. As such, these metrics were not computed across different movement classes. These metrics are designed to measure the model's overall fit to the data, not its ability to differentiate between different classes. Consequently, variations in these metrics across different movement classes would not yield useful insights into the model's generalization capabilities.
Comment 15: What specific statistical significance tests could be applied to the obtained metrics, and what do these tests reveal about the reliability of differences between the models' performances?
Response 15: For regression machine learning models, performance is typically assessed using metrics such as R-squared, Mean Squared Error, and accuracy. These metrics offer a measure of the model's predictive power for the target variable. Statistical significance tests are primarily designed to determine if the observed differences in a sample could have occurred by chance, which isn't a relevant question when evaluating machine learning models. Consequently, such tests were not employed in this study.
Comment 16: How do sensor placement, noise levels, and sampling rates impact the accuracy and precision of the models, particularly in misclassifying movements like "Breathing" and "Seizure"?
Response 16: This is a limitation of our study. The factors of sensor placement, noise levels, and sampling rates indeed have significant implications on the accuracy and precision of the models. The placement of sensors can influence the quality of data collected, as different positions may capture varying aspects of the movements. Noise levels can introduce an element of variability in the data, potentially leading to misclassification of movements. The sampling rates can impact the granularity of the data, with higher rates providing more detailed data that could enhance the model's performance. However, it's crucial to note that in the context of our study, the movements "Breathing" and "Seizure" were simulated by a mannequin in a controlled environment, which mitigated the influence of these factors on the model's performance. We added the following paragraph in the discussion section as a limitation of our study, "Sensor placement, noise levels, and sampling rates are pivotal factors that can influence the accuracy and precision of machine learning models. The positioning of the sensor can affect the quality of the data collected, as different placements may capture varying aspects of the movements. Noise levels can introduce an element of variability in the data, potentially leading to misclassification of movements. The sampling rates can impact the granularity of the data, with higher rates providing more detailed data that could enhance the model's performance. However, in the context of this study, the movements were simulated by a mannequin in a controlled environment, which mitigated the influence of these factors on the model's performance.".
Comment 17: Can you quantify the degree of overfitting observed in the models without SMOTE, and how do specific hyperparameter choices exacerbate or mitigate this effect?
Response 17: We conducted experiments with various hyperparameter choices to ensure optimal convergence. Although we have a series of graphs demonstrating the influence of hyperparameters on convergence, we deemed that their inclusion might overcomplicate the manuscript. Our decision not to include the detailed analysis of hyperparameter choices and SMOTE was driven by our intent to maintain the simplicity and readability of the manuscript. Regarding the quantification of overfitting in models without SMOTE, it is indeed visually evident from the learning curves that they do not converge. However, quantifying this divergence is non-trivial, and we currently lack a robust method to do so. We appreciate the reviewer's suggestion and will consider developing a quantification method for overfitting in our future research.
Comment 18: How does the confusion between classes (e.g., "Breathing" vs. "Seizure") impact the clinical applicability of the model, and what strategies could mitigate such errors in a real-time monitoring environment?
Response 18: As with any machine learning model, a certain degree of misclassification is anticipated. This is particularly true in the healthcare domain, where the complexity and variability of human physiological patterns can lead to overlaps between different conditions, such as "Breathing" and "Seizure". However, these misclassifications do not detract from the clinical applicability of machine learning models. Instead, they underscore the necessity of using the model as a supportive tool, not a replacement, for human judgement. In real-time monitoring environments, the model's results should not be relied upon absolutely, but rather used to assist healthcare personnel in making more informed decisions. Strategies such as setting thresholds for alarm activation, incorporating feedback mechanisms, and providing training to healthcare personnel on the interpretation of the model's outputs could help mitigate such errors. We added the following in the conclusion section of the manuscript: "As with any ML model, a certain degree of misclassification is anticipated, particularly in the healthcare domain where the complexity and variability of human physiological patterns can lead to overlaps between different conditions. However, these misclassifications do not detract from the clinical applicability of ML models. Instead, they underscore the necessity of using ML as a supportive tool, not a replacement, for human judgement. In real-time monitoring environments, the model's results should not be relied upon absolutely, but rather used to assist healthcare personnel in making more informed decisions. Strategies such as setting thresholds for alarm activation, incorporating feedback mechanisms, and providing training to healthcare personnel on the interpretation of the model's outputs could help mitigate such errors.".
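As a purely hypothetical sketch of one mitigation strategy mentioned above (thresholding alarm activation), an alarm could fire only when a critical class is predicted over several consecutive windows, suppressing one-off misclassifications; the function and thresholds below are illustrative, not part of the manuscript.

```python
def should_alarm(window_predictions, critical_class, min_consecutive=3):
    """Return True once `critical_class` is predicted in `min_consecutive`
    consecutive windows; names and thresholds are hypothetical."""
    run = 0
    for predicted in window_predictions:
        run = run + 1 if predicted == critical_class else 0
        if run >= min_consecutive:
            return True
    return False

# A single 'Seizure' (class 4) window is ignored; a sustained run alarms.
print(should_alarm([0, 4, 0, 0], critical_class=4))     # False
print(should_alarm([0, 4, 4, 4, 1], critical_class=4))  # True
```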
Comment 19: How does the generalizability of the Bagging Regressor model hold up when applied to different patient populations, and what adaptations might be needed for diverse clinical settings?
Response 19: We acknowledge that the generalizability of our model is an aspect that requires further exploration and refinement. Our future research plans include testing the model on human subjects, as opposed to mannequins, and in real-life settings to ensure its applicability and effectiveness in diverse healthcare environments. We've added the following in the discussion section: "It is acknowledged that the generalizability of the model is a subject that requires further exploration. Future research directions are set to include testing the model on human subjects and in real-life environments to verify its applicability in a variety of healthcare settings.".
Comment 20: What are the computational requirements for running the proposed model in real-time, and how do these requirements compare to those of more traditional monitoring systems in terms of cost-effectiveness?
Response 20: The focus of our study was primarily on the development and testing of the proposed model, rather than on its specific computational requirements for real-time application or its cost-effectiveness. The model's computational efficiency is demonstrated by its ability to be trained, validated, and tested within a short time frame on a single PC unit. This suggests that the model is computationally efficient and could potentially be implemented in a real-time hospital setting. However, the specific hardware and software requirements for such an application have not been evaluated in this study. As for the cost-effectiveness, while it was not directly evaluated, the sensors used in the model are inexpensive and reusable, which suggests that the model could potentially be more cost-effective than traditional video monitoring systems. We added the following paragraph in the discussion section: "The computational efficiency of the proposed model is characterized by its ability to be trained, validated, and tested within a short time frame on a single PC unit. This suggests that the model is computationally efficient and could potentially be implemented in a real-time hospital setting. However, the specific hardware and software requirements for such an application have not been evaluated in this study. Future research could focus on determining these requirements to facilitate the implementation of the model in a real-time hospital setting. It is important to note that the primary focus of this study was the development and testing of the model, and as such, the specific computational requirements for real-time application were not evaluated. Regarding cost-effectiveness, while it was not directly evaluated in this study, the sensors used are inexpensive and reusable, suggesting that the model could potentially be more cost-effective than traditional video monitoring systems.".
Comment 21: Can you elaborate on the specific limitations of the sensors used in data collection, and how do these limitations affect the reliability and accuracy of movement detection?
Response 21: We added the following in the discussion section, "While the sensors utilized are effective in capturing patient movements, they may not be able to capture all types of patient behavior or conditions. For instance, they might not be able to detect subtle changes in a patient's condition that could be picked up by a human observer. Additionally, the sensors' effectiveness could potentially be influenced by factors such as their placement on the patient or the patient's position in bed. These limitations could affect the reliability and accuracy of movement detection, potentially leading to misclassifications or missed detections. However, it is important to note that these limitations are inherent in the use of sensor-based systems and do not necessarily undermine the overall utility of the model.".
Comment 22: How does the sensor-based monitoring approach compare to video monitoring in terms of detection latency, false positives, and real-time responsiveness, particularly in critical situations like falls?
Response 22: Our choice of a sensor-based monitoring approach was primarily driven by the potential for enhanced patient privacy. Many individuals may feel uncomfortable being under video surveillance while sleeping in a hospital room. Our study, however, did not explicitly compare sensor-based monitoring to video monitoring in terms of detection latency, false positives, and real-time responsiveness. This is primarily because our study was designed as a proof of concept rather than a comprehensive assessment with a fully operational clinical application in mind. While we can hypothesize about the potential advantages and limitations of sensor-based monitoring compared to video monitoring, these would not be grounded in the empirical data collected in this study. We added the following paragraph in the discussion section, "The sensor-based monitoring approach was selected primarily for its potential to offer enhanced patient privacy, as many individuals may feel uncomfortable being under video surveillance while sleeping in a hospital room. This study, however, did not explicitly compare sensor-based monitoring to video monitoring in terms of detection latency, false positives, and real-time responsiveness. This is primarily because our study was designed as a proof of concept rather than a comprehensive assessment with a fully operational clinical application in mind. While we can hypothesize about the potential advantages and limitations of sensor-based monitoring compared to video monitoring, these are not grounded in the empirical data collected in this study.".
Comment 23: What measures could be taken to enhance the interpretability of the Bagging Regressor model, and how important is this interpretability for gaining clinician trust in a real-world healthcare environment?
Response 23: Our choice of the Bagging Regressor model was primarily driven by its robustness and ability to handle complex data structures, which are prevalent in healthcare settings. While we acknowledge the importance of interpretability in gaining clinician trust, it is not the sole factor to consider. The primary objective of our study was to develop a model that can accurately predict patient outcomes, and the Bagging Regressor model has proven effective in achieving this. Although enhancing the interpretability of the model could involve techniques such as partial dependence plots, these were not the focus of our current study. Our study was designed as a proof of concept, and we believe that future work could certainly explore these aspects to further enhance clinician trust.
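For illustration, a partial dependence analysis of the kind mentioned above is straightforward to produce with scikit-learn. The sketch below is a minimal, hypothetical example: the data, feature count, and hyperparameters are placeholders, not the study's actual dataset or configuration.

```python
# A minimal sketch of partial dependence plots for a Bagging Regressor.
# All data and hyperparameters here are hypothetical, for illustration only.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import BaggingRegressor
from sklearn.inspection import PartialDependenceDisplay
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))  # stand-ins for three sensor channels
y = X[:, 0] ** 2 + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=500)

# 'estimator=' is the keyword in scikit-learn >= 1.2 (older: 'base_estimator=').
model = BaggingRegressor(estimator=DecisionTreeRegressor(), n_estimators=50,
                         random_state=0).fit(X, y)

# Average model response as each feature varies, marginalizing over the rest.
PartialDependenceDisplay.from_estimator(model, X, features=[0, 1, 2])
plt.show()
```

Such plots show how the ensemble's prediction changes, on average, with each input feature, which is one practical route toward the interpretability clinicians may expect.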
Reviewer 3 Report
Comments and Suggestions for Authors
This study presents an ensemble machine learning model for monitoring bedridden patients in healthcare settings. The researchers used a high-fidelity mannequin equipped with Movella DOT sensors to simulate common patient movements. Data from these sensors was processed using various machine learning techniques, including Decision Tree Regressor, Gradient Boosting Regressor, and Bagging Regressor. The Bagging Regressor, an ensemble model using Decision Tree Regressor as its base, demonstrated the best performance with an accuracy of 0.950 and R2 scores of 0.996 and 0.959 for training and test data, respectively. The study employed SMOTE to address class imbalance, which improved model performance. Learning curves and confusion matrices were used to evaluate the model's effectiveness in classifying different patient movements.
1. The methodology lacks details on sensor placement. Include a diagram or detailed description of exact sensor locations on the mannequin to ensure reproducibility.
2. Explain the rationale behind choosing six specific movement categories. Consider including additional clinically relevant movements or positions to enhance the model's practical applicability.
3. Provide more information on the data collection process, including the duration of simulations and the number of repetitions for each movement type. This information is crucial for assessing the robustness of the dataset.
4. The discussion section needs to address potential limitations of using a mannequin instead of real patients. Elaborate on how this might affect the model's performance in real-world scenarios.
5. Include a comparison of the proposed ensemble model with existing fall detection systems or patient monitoring technologies. This will better contextualize the significance of the study's findings within the current state of healthcare technology.
Author Response
We express our sincere gratitude for your insightful and comprehensive review of our study. Your understanding of the nuances of our research, including the use of Movella DOT sensors, the application of various machine learning techniques, and the implementation of SMOTE to address class imbalance, is deeply appreciated.
Comment 1: The methodology lacks details on sensor placement. Include a diagram or detailed description of exact sensor locations on the mannequin to ensure reproducibility.
Response 1: We added the following in the 2.1. subsection to include the requested information: "The sensor was positioned on the mannequin's torso, specifically at the midpoint of the sternum, a critical anatomical landmark. This placement ensures optimal data capture and reproducibility of the measurements. To secure the sensor in place and prevent any displacement during the experiments, a cross-shaped fixation method was employed using a high-strength, adhesive-backed material.".
Comment 2: Explain the rationale behind choosing six specific movement categories. Consider including additional clinically relevant movements or positions to enhance the model's practical applicability.
Response 2: We appreciate the suggestion to include additional clinically relevant movements or positions. Our study was designed with a specific focus on six movement categories, which were chosen based on their relevance to the safety and well-being of bed-ridden patients. These movements, as described in the manuscript, cover a wide range of potential scenarios that these patients might encounter. Compared to other studies that primarily focus on fall prevention only, our study provides a more holistic view of patient movements. While we acknowledge that there are other movements that could be considered, we believe that the inclusion of additional movements at this stage would require a significant expansion of the study scope and additional data collection. We propose that these additional movements, such as shifting positions in bed, reaching for objects, or attempting to sit up, could be explored in future research. We added the following paragraph in the discussion section to address this point, "The selection of the six specific movement categories in this study was based on their relevance to the safety and well-being of bed-ridden patients. These movements include breathing, seizures, rolling to the right side, rolling to the left side, rolling off the bed from the left, and rolling off the bed from the right. These categories were chosen to provide a comprehensive overview of potential scenarios that these patients might encounter. While there are other movements that patients might perform in bed, such as shifting positions or reaching for objects, these were not included in the current study. Future research could consider these additional movements to further enhance the model's practical applicability.".
Comment 3: Provide more information on the data collection process, including the duration of simulations and the number of repetitions for each movement type. This information is crucial for assessing the robustness of the dataset.
Response 3: Thank you for bringing to our attention that we had omitted this crucial information. We've added the following in the 2.1. subsection on data collection: "The sensor was placed on both adult-sized and infant-sized mannequins to simulate a range of patient demographics. Each movement was repeated approximately 100 times to gather a substantial amount of data for each movement type. For breathing, the mannequin performed full cycles of tidal volume inhalations and exhalations continuously for 3 minutes. For seizures, the mannequin simulated seizure activity continuously for 10 minutes. For the rolling and falling off the bed movements, the infant mannequin was used. The mannequin was started in the supine position and rolled approximately 90 degrees to its left or right side and back to the original supine position. This was repeated approximately 100 times for each side. The data for dropping off the bed from the left was collected by having the infant mannequin start in the supine position and rolling it beyond its left side to the point where it falls off the bed. This was also repeated approximately 100 times.".
Comment 4: The discussion section needs to address potential limitations of using a mannequin instead of real patients. Elaborate on how this might affect the model's performance in real-world scenarios.
Response 4: As requested, the discussion section has now been substantially expanded to address a multitude of limitations, including this one.
Comment 5: Include a comparison of the proposed ensemble model with existing fall detection systems or patient monitoring technologies. This will better contextualize the significance of the study's findings within the current state of healthcare technology.
Response 5: Thank you for this valuable suggestion. We have now included such a comparison in the discussion and conclusion sections, as requested.
Reviewer 4 Report
Comments and Suggestions for Authors
This paper explores the development and application of regressive machine learning models aimed at enhancing real-time monitoring for bedridden patients in healthcare settings.
“About 10% of fatal falls in the elderly occur in the hospital”: for this sentence, please add an example to specify what fatal falls are.
“Using machine learning to monitor bed bound patients can mitigate the current issues 94 associated with monitoring bed-bound patients.” How was this acquired? Please give an explanation.
The aim of your study should be stated in the last part of the introduction to make your text more logical.
six distinct labels: "Roll right" (0), "Roll left" (1), "Drop right" (2), "Drop left" (3), "Breathing" (4), and "Seizure" (5). Please give those movements specific definitions.
Please give flow charts or something similar to show the mechanisms of your machine learning models, not only formulas.
Do you have real-time data, beyond the high-fidelity mannequin, to evaluate your results?
“we opted for the BR”: what is BR? Please give the full name.
“Table 2. Performance Comparison of Decision Tree Regressor, Gradient Boosting Regressor, and their Ensemble, without SMOT.” What is SMOT? Please give the full name.
“This smaller gap suggests that the model is not 345 overfitting as much as in the case without SMOTE.” What is SMOTE?
As shown in the confusion matrix heatmap, some incorrect predictions were noted in labels such as seizure. Could the authors explain these incorrect predictions and the associated modifications in the discussion?
Comments on the Quality of English Language
Minor editing of English language required.
Author Response
We would like to express our sincere gratitude for your time and effort in reviewing our manuscript. Your insightful comments and suggestions have helped us in improving the quality of our work.
Comment 1: “About 10% of fatal falls in the elderly occur in the hospital”: for this sentence, please add an example to specify what fatal falls are.
Response 1: We added the following in the introduction: "Fatal falls in the elderly, particularly in hospital settings, often occur due to a combination of factors such as gait and balance disorders, cognitive impairment, frailty, deconditioning, and the use of certain medications. For instance, an elderly patient with cognitive impairment might not recognize the risk of getting out of bed unassisted, leading to a fall that could result in serious injury or even death."
Comment 2: “Using machine learning to monitor bed bound patients can mitigate the current issues 94 associated with monitoring bed-bound patients.” How was this acquired? Please give an explanation.
Response 2: This is precisely the purpose for which we trained the ensemble model, which uses a unique dataset capturing six typical movements of bed-bound patients. The model was trained using the Synthetic Minority Over-sampling Technique (SMOTE) to balance the dataset and prevent model bias. The Bagging Regressor model, in particular, demonstrated high accuracy and low error rates. This study illustrates the potential of machine learning in mitigating the current issues associated with monitoring bed-bound patients, such as data latency and false positives/negatives, and its seamless integration with hospital IT infrastructure. This approach not only facilitates automated detection, a significant advancement over video methods, but also reduces the need for healthcare professionals to continuously monitor a screen, thereby enhancing patient safety in healthcare settings.
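To make the balancing step concrete, the following is a minimal sketch using the SMOTE implementation from the imbalanced-learn package. The feature matrix and class counts are synthetic stand-ins, not the study's data.

```python
# Minimal sketch of class balancing with SMOTE (imbalanced-learn).
# The feature matrix and label counts are synthetic, for illustration only.
import numpy as np
from collections import Counter
from imblearn.over_sampling import SMOTE

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))                                 # stand-in sensor features
y = np.repeat([0, 1, 2, 3, 4, 5], [90, 75, 45, 45, 30, 15])   # imbalanced movement labels

X_bal, y_bal = SMOTE(random_state=0).fit_resample(X, y)
print(Counter(y), "->", Counter(y_bal))  # minority classes upsampled to match the majority
```

In practice, SMOTE should be fit only on the training split, so that synthetic neighbors of test samples do not leak into the evaluation.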
Comment 3: The aim of your study should be stated in the last part of the introduction to make your text more logical.
Response 3: We have now added the aims of the study at the requested location.
Comment 4: six distinct labels: "Roll right" (0), "Roll left" (1), "Drop right" (2), "Drop left" (3), "Breathing" (4), and "Seizure" (5). Please give those movement a specific definition.
Response 4: As requested, we have now defined them in more detail in the '2.1. Data Collection' section.
Comment 5: Please give flow charts or something similar to show the mechanisms of your machine learning models, not only formulas.
Response 5: The machine learning models utilized in our study, specifically the Decision Tree Regressor (DTR), Gradient Boosting Regressor (GBR), and Bagging Regressor (BR), encompass complex computations and decision-making processes that resist accurate simplification into a flowchart. These models, by their very nature, are abstract, and their performance is determined not solely by their structure but also by the specific data on which they are trained. This introduces an additional layer of complexity that a flowchart cannot encapsulate. While we acknowledge the appeal of a more visual representation, it is infeasible in this instance without risking oversimplification or the omission of crucial information. We are currently unable to conceive of suitable charts, but if the reviewer could provide suggestions to inspire us, we would be open to implementing them in the next revision.
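Although a faithful flowchart is difficult, the construction of the three model families is compact in code. The sketch below uses scikit-learn with synthetic data and default hyperparameters; none of it reflects the study's actual dataset or tuned settings.

```python
# Minimal sketch of the three model families (DTR, GBR, BR) in scikit-learn.
# Synthetic data and default hyperparameters, for illustration only.
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor, GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=400, n_features=6, noise=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "DTR": DecisionTreeRegressor(random_state=0),
    "GBR": GradientBoostingRegressor(random_state=0),
    # BR aggregates many DTRs, each trained on a bootstrap resample of the data.
    "BR": BaggingRegressor(estimator=DecisionTreeRegressor(), n_estimators=100,
                           random_state=0),
}
for name, m in models.items():
    m.fit(X_tr, y_tr)
    print(f"{name}: train R2 = {m.score(X_tr, y_tr):.3f}, "
          f"test R2 = {m.score(X_te, y_te):.3f}")
```

The bagging step, variance reduction by averaging many high-variance trees, is the mechanism behind the BR's robustness discussed throughout this response.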
Comment 6: Do you have real-time data, beyond the high-fidelity mannequin, to evaluate your results?
Response 6: The current study is a proof-of-concept. The primary objective was to demonstrate the feasibility and potential of our approach, rather than provide a comprehensive evaluation. The data used in this study was collected from a high-fidelity mannequin simulating patient movements. This allowed us to control the conditions and ensure the consistency of the data, which was crucial for the development and initial testing of our machine learning models. While we acknowledge the importance of real-time data from actual patients for a more thorough evaluation, obtaining such data involves practical and ethical considerations that are beyond the scope of this initial study. We have outlined our plans to acquire real-time data in our future work in the discussion section of the manuscript.
Comment 7: “we opted for the BR”: what is BR? Please give the full name. “Table 2. Performance Comparison of Decision Tree Regressor, Gradient Boosting Regressor, and their Ensemble, without SMOT.” What is SMOT? Please give the full name.
Response 7: We have now made sure that all abbreviations are defined the first time they are used. Thank you for catching that oversight for us.
Comment 8: “This smaller gap suggests that the model is not 345 overfitting as much as in the case without SMOTE.” What is SMOTE?
Response 8: We apologize for the oversight. In some places we wrote SMOT instead of SMOTE. We have now corrected it in all locations. Thank you for pointing that out and saving us the embarrassment.
Comment 9: As shown in the confusion matrix heatmap, some incorrect predictions were noted in labels such as seizure. Could the authors explain these incorrect predictions and the associated modifications in the discussion?
Response 9: Regarding the misclassifications observed in the confusion matrix heatmap, particularly between the labels "Breathing" (4) and "Seizure" (5), it is important to note that these two movements share significant similarities. Both are characterized by rhythmic patterns that differ primarily in frequency rather than magnitude, which can pose a challenge for machine learning models. This is a known limitation of such models, as they can struggle to differentiate between classes with similar features. Despite these misclassifications, the overall performance of the model is acceptable, as evidenced by the majority of predictions falling on the diagonal, indicating correct predictions. We have acknowledged these misclassifications in the manuscript and discussed their implications. We added the following paragraph in the conclusion section: "The confusion matrix heatmap reveals some misclassifications, particularly between the labels "Breathing" and "Seizure". These misclassifications can be attributed to the inherent similarities between these two movements, which involve rhythmic patterns that differ primarily in frequency rather than magnitude. This is a known limitation of machine learning models, which can struggle to differentiate between classes with similar features. Despite these misclassifications, the overall performance of the model is reasonable, as evidenced by the majority of predictions falling on the diagonal, indicating correct predictions.".
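For concreteness, here is a minimal sketch of how such a confusion matrix heatmap can be produced when a regressor's continuous outputs are snapped to the nearest class label. The predictions below are synthetic, and rounding-to-nearest-label is an assumed mapping, not necessarily the exact scheme used in the manuscript.

```python
# Minimal sketch of a confusion matrix heatmap for rounded regressor outputs.
# Label names follow the manuscript; predictions are synthetic, and the
# rounding scheme is an assumption, for illustration only.
import numpy as np
from sklearn.metrics import ConfusionMatrixDisplay
import matplotlib.pyplot as plt

labels = ["Roll right", "Roll left", "Drop right", "Drop left",
          "Breathing", "Seizure"]
y_true = np.repeat(np.arange(6), 40)  # hypothetical balanced test labels
noise = np.random.default_rng(0).normal(scale=0.4, size=y_true.size)
y_pred = np.clip(np.rint(y_true + noise), 0, 5).astype(int)  # snap to nearest class

ConfusionMatrixDisplay.from_predictions(y_true, y_pred, display_labels=labels,
                                        xticks_rotation=45)
plt.show()
```

Off-diagonal mass between adjacent rhythmic classes such as "Breathing" and "Seizure" is exactly the pattern the response above attributes to their similar frequency-dominated signatures.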
Reviewer 5 Report
Comments and Suggestions for Authors
Major comments:
1. The study utilized a "high-fidelity mannequin" to simulate patient movements instead of involving real patients. While it allows for controlled data collection, it might not fully represent the complexities and variations in movements exhibited by actual bedridden patients. This limitation suggests that the model's performance might differ when applied in a real-world setting, and the conclusion of this study may be affected.
2. The research focuses on classifying six specific movements: "Roll right", "Roll left", "Drop right", "Drop left", "Breathing", and "Seizure". While these are some common movements, they do not represent the full spectrum of actions a bedridden patient might make. This limited scope potentially restricts the model's ability to detect other movements that could indicate a fall risk or require medical attention.
3. The study primarily relies on internal validation techniques like K-fold cross-validation and learning curves to assess the model's performance. While these methods are essential, external validation, which involves testing the model on a completely independent dataset, especially real-world data, is missing in this study. A validation using an independent dataset would provide stronger evidence of the model's generalizability and robustness.
Comments on the Quality of English Language
Minor editing of English language required.
Author Response
Comment 1: The study utilized a "high-fidelity mannequin" to simulate patient movements instead of involving real patients. While it allows for controlled data collection, it might not fully represent the complexities and variations in movements exhibited by actual bedridden patients. This limitation suggests that the model's performance might differ when applied in a real-world setting, and the conclusion of this study may be affected.
Response 1: The decision to use a mannequin was made to ensure a controlled and consistent platform for data collection, which is useful in the initial stages of developing and testing our model. We acknowledge that a mannequin may not fully represent the complexities and variations in movements exhibited by actual bedridden patients. However, it provides a valuable starting point for our research, allowing us to refine our model before testing it in more complex and variable real-world settings. We have now clearly stated this limitation in the discussion section of our manuscript.
Comment 2: The research focuses on classifying six specific movements: "Roll right", "Roll left", "Drop right", "Drop left", "Breathing", and "Seizure". While these are some common movements, they do not represent the full spectrum of actions a bedridden patient might make. This limited scope potentially restricts the model's ability to detect other movements that could indicate a fall risk or require medical attention.
Response 2: The selection of the six specific movements was based on their prevalence and significance in the context of bedridden patients. We acknowledge that there are other movements that a bedridden patient might make (e.g., shifting positions or reaching for objects), and our model's current scope may not cover all of them. However, the chosen movements represent a substantial portion of the actions that could indicate a fall risk or require medical attention. We have clearly stated this limitation in the discussion section of our manuscript. We've added this paragraph now in the discussion section, "The selection of the six specific movement categories in this study was based on their relevance to the safety and well-being of bed-ridden patients. These movements include breathing, seizures, rolling to the right side, rolling to the left side, rolling off the bed from the left, and rolling off the bed from the right. These categories were chosen to provide a comprehensive overview of potential scenarios that these patients might encounter. While there are other movements that patients might perform in bed, such as shifting positions or reaching for objects, these were not included in the current study. Future research could consider these additional movements to further enhance the model's practical applicability.".
Comment 3: The study primarily relies on internal validation techniques like K-fold cross-validation and learning curves to assess the model's performance. While these methods are essential, external validation, which involves testing the model on a completely independent dataset, especially real-world data, is missing in this study. A validation using an independent dataset would provide stronger evidence of the model's generalizability and robustness.
Response 3: Our study primarily relied on internal validation techniques such as K-fold cross-validation and learning curves to assess the model's performance. These methods were chosen due to their robustness in evaluating the model's performance on the available data. We acknowledge that external validation on an independent dataset, particularly real-world data, would provide stronger evidence of the model's generalizability and robustness. However, due to constraints in data availability and the scope of this study, we were unable to perform external validation. We have now explicitly stated this limitation in the discussion section of our manuscript to ensure clarity and transparency, as in the following paragraph, "This study primarily relies on internal validation techniques, such as K-fold cross-validation and learning curves, to assess the model's performance. The learning curves of the BR model illustrate the convergence of the training and validation curves, with the validation curve consistently staying slightly below the training curve. This pattern indicates that the model, in spite of its complexity, does not appear to be overfitting. Nonetheless, it is acknowledged that the generalizability of the model is a subject that requires further exploration. Future research directions are set to include testing the model on human subjects and in real-life environments to verify its applicability in a variety of healthcare settings."
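As an illustration of the internal validation described above, the following is a minimal sketch of K-fold learning curves for a Bagging Regressor. The data is synthetic and the 5-fold split and training-size grid are illustrative assumptions, not the study's actual configuration.

```python
# Minimal sketch of K-fold learning curves for a Bagging Regressor.
# Synthetic data; fold count and train sizes are illustrative assumptions.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=600, n_features=6, noise=0.1, random_state=0)
model = BaggingRegressor(estimator=DecisionTreeRegressor(), random_state=0)

sizes, train_scores, val_scores = learning_curve(
    model, X, y, cv=5, scoring="r2",
    train_sizes=np.linspace(0.1, 1.0, 8))

plt.plot(sizes, train_scores.mean(axis=1), marker="o", label="training R2")
plt.plot(sizes, val_scores.mean(axis=1), marker="o", label="validation R2")
plt.xlabel("training set size")
plt.ylabel("R2 score")
plt.legend()
plt.show()
```

A validation curve that rises toward the training curve as the training set grows, as reported above, is the qualitative pattern one looks for when ruling out overfitting.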
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
Compared to the previous version, this revised version has demonstrated the following improvements:
A. Abstract: The proposed methodology and experimental results were emphasized.
B. Section 1 – Introduction: The authors have added more explanations for the motivations and created a separate subsection of Related Works.
C. Section 2 – Materials and Methods:
- More detailed explanations of data collection methods were provided. However, I am still not convinced by the reasons for using just one mannequin “to maintain consistency in the data collected and to control for any potential variability that might arise from using multiple mannequins”. Using multiple mannequins would provide diversity in the dataset, better reflecting real-world settings in which data is collected from sensors attached to multiple patients.
- The authors provided explanations for the hyperparameters of the models.
D. Section 3 – Results:
The authors provided better explanations for applying KFold Cross-Validation and the SMOT technique to deal with the overfitting issues.
E. The Discussion section is also significantly improved to discuss the strengths and weaknesses of the proposed methodology. The limitations were also discussed.
F. The authors also created the Conclusion section, which summarizes the problem statement, proposed methodology, experiments, results, and discussions and explains limitations and future works.
Based on the above improvements, I agree that the paper can be published.
Reviewer 2 Report
Comments and Suggestions for Authors
The authors clearly answered the reviewer’s concerns.
New interesting applications and future works can be envisioned.
Reviewer 5 Report
Comments and Suggestions for Authors
The authors have addressed my concerns and comments, and improved the manuscript.
Comments on the Quality of English Language
Minor editing of English language required.