by Orhun Vural, Bunyamin Ozaydin, James Booth, et al.

Reviewer 1: Siang Hiong Goh
Reviewer 2: Adnan Amin

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

A model that can accurately predict boarding counts 6 hours into the future is pretty impressive and will be very helpful to many acute hospitals with busy EDs. The devil lies in capturing certain variables such as waiting times and treatment times. Did they use time stamps based on the electronic medical records? How about treatments that can be concurrent, such as doing a minor surgical procedure along with procedural sedation? Waiting times can also be concurrent, such as waiting for lab tests and radiology investigations. All in all, such data needs to be properly captured electronically for such predictive models to be of practical use when implemented in real time in the real world. It will be good to hear how they clean such data or define the parameters to capture these measurements.

 

Did the authors also include the number of elective surgery cases scheduled that could affect bed availability for analysis?

 

I note that their approach does not utilize patient-level clinical data but instead relies solely on aggregate operational data - this might cause their model to be inaccurate when applied to children's or maternity hospitals, or to hospitals in communities with different populations, such as the very elderly or socially impoverished. The authors might want to comment on these possible limitations? I know they claimed that "our method simplifies data collection and improves generalizability, making it readily adaptable for use across diverse hospital systems", but these claims do need some clarification.

 

Good that they also include major sporting events and weather conditions. These do affect ED attendances, and I applaud them for adding them in. But it does mean the model needs good weather forecasting services. Good that they also excluded pandemic data, which would skew the study.

All in all, a good model to try out. Will they revisit the accuracy and validity after trying it out with international partners outside of Alabama? ED systems in Asia, Europe, and Africa could differ from US ED practices.

Author Response

Response to Reviewer 1 Comments

 

1. Summary

 

 

We sincerely thank the reviewer for taking the time to provide thoughtful and constructive feedback on our manuscript. Your comments and suggestions were extremely valuable and have significantly contributed to improving the clarity, quality, and overall strength of the paper. We carefully addressed each point raised, and we believe the revisions have resulted in a much stronger and more comprehensive manuscript.

 

2. Questions for General Evaluation

Reviewer’s Evaluation

Response and Revisions

Does the introduction provide sufficient background and include all relevant references?

Yes/Can be improved/Must be improved/Not applicable

 

Are all the cited references relevant to the research?

Yes/Can be improved/Must be improved/Not applicable

 

Is the research design appropriate?

Yes/Can be improved/Must be improved/Not applicable

 

Are the methods adequately described?

Yes/Can be improved/Must be improved/Not applicable

We agree with the reviewer that the methods section could be described in more detail. We have revised this section to provide clearer explanations and additional information to improve its completeness and readability.

Are the results clearly presented?

Yes/Can be improved/Must be improved/Not applicable

 

Are the conclusions supported by the results?

Yes/Can be improved/Must be improved/Not applicable

 

3. Point-by-point response to Comments and Suggestions for Authors

Comments 1: A model that can accurately predict boarding counts 6 hours into the future is pretty impressive and will be very helpful to many acute hospitals with busy EDs. The devil lies in capturing certain variables such as waiting times and treatment times. Did they use time stamps based on electronic medical records? How about treatments that can be concurrent, such as doing a minor surgical procedure along with procedural sedation? Waiting times can also be concurrent, such as waiting for lab tests and radiology investigations. All in all, such data needs to be properly captured electronically for such predictive models to be of practical use when implemented in real time in the real world. It will be good to hear how they clean such data or define the parameters to capture these measurements.

 

Response 1:


(i) Did they use time stamps based on electronic medical records?

 

 

Timestamps for patient movements are available in the ED tracking data, and for each visit ID we have detailed records that include location check-in and check-out times, room numbers, ED check-out time, Emergency Severity Index (ESI) levels, room types (waiting or treatment), event status (e.g., complete, request, cancel), event types (e.g., triage, admit bed request), and other ED-related information. However, each feature (e.g., waiting time, treatment time) must be calculated separately from the raw data. In the revised manuscript, we have now provided a clearer explanation of how these timestamps were used to create features such as boarding count, boarding time, waiting count, waiting time, treatment count, etc.
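For illustration only, a single event in such a tracking dataset might resemble the record below; the field names and values are hypothetical simplifications, not the actual schema of our data source.

```python
# Hypothetical example of one ED tracking event; field names and values are
# illustrative only and do not reflect the actual schema of the data source.
tracking_event = {
    "visit_id": "V-102938",
    "room_number": "ED-14",
    "room_type": "treatment",            # "waiting" or "treatment"
    "event_type": "admit bed request",   # e.g., "triage", "admit bed request"
    "event_status": "complete",          # e.g., "complete", "request", "cancel"
    "esi_level": 3,                      # Emergency Severity Index (1-5)
    "location_checkin": "2023-03-01 16:05",
    "location_checkout": "2023-03-01 18:35",
    "ed_checkout": "2023-03-01 18:35",
}
```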

 

 

(ii) How about treatments that can be concurrent, such as doing a minor surgical procedure along with procedural sedation?

 

Because patients undergoing these procedures occupy a treatment room, they are included in the treatment count for each hour during which they occupy that room. Similarly, the average treatment time is calculated based on the total time the treatment room is in use, regardless of whether multiple procedures occur concurrently. We cannot exclude these patients, as they represent a natural part of ED operations and will continue to present in varying numbers at different times. Additionally, our prediction model was developed without using patients' diagnoses, treatment history, or individual demographic information. This approach is one of the model’s key strengths, as it highlights its ability to generalize effectively across different hospitals.

 

(iii) Waiting times can also be concurrent, such as waiting for lab tests and radiology investigations. All in all, such data needs to be properly captured electronically to be useful in such predictive models to be of practical use when implemented in real time in the real world. It will be good to hear how they clean such data or define the parameters to capture these measurements.

 

First of all, each room transfer, each location entry and exit, the timestamps of all requests (e.g., admit bed requests), and key movements such as ED checkout are meticulously recorded in the data source. Regarding Average Waiting Time in the waiting room, there is no issue because we directly use the timestamps capturing when patients enter and leave the waiting room. All of these timestamps are reliably available in the dataset. Average Treatment Time may include periods where patients undergo laboratory tests, screening tests, or other concurrent procedures, as it represents the entire treatment period from entry into a treatment room until departure. These activities are part of the natural course of ED operations, and similar scenarios will continue to occur when the model is applied in real-world settings. In the revised manuscript, we have provided a more detailed explanation of how these features are calculated by specifying the timestamps used and by including examples.

 

Regarding comments (i) and (iii), we have updated the Feature Engineering and Preprocessing section on page 6 of the revised manuscript.

 

Action 1:

 

Feature Engineering and Preprocessing

 

Feature engineering and preprocessing were applied to combine and refine data from multiple sources, creating structured, time-aligned features for model training. Aggregated flow metrics were then calculated to represent hourly inpatient and ED activity, ensuring the models used well-defined and temporally consistent inputs.

From the ED tracking data source, nine aggregated PFMs were engineered to represent hourly patient activity within the department, while one additional feature was derived from the inpatient data source to capture hospital-wide census. These engineered variables form the core operational metrics used in this study, providing a comprehensive view of patient flow from ED arrival through inpatient admission. These include: (1) boarding count; the number of patients in the boarding phase, which is the period after an inpatient bed request is submitted and before the patient leaves the ED for the inpatient unit. This interval begins when the inpatient bed request is entered into the system and ends at the ED discharge timestamp marking the patient’s transfer. For instance, if a patient receives an admit bed request at 4:20 PM and is checked out of the ED at 6:35 PM, that patient would be included in the boarding count for the hours 4–5 PM, 5–6 PM, and 6–7 PM. (2) Boarding count by ESI level; boarding counts grouped into three Emergency Severity Index (ESI) categories (1&2, 3, and 4&5) to assess boarding distribution by acuity. (3) Average boarding time; the mean duration, in minutes, that patients spend in the boarding phase during each hour. To calculate this metric, we determine how many minutes of each patient’s boarding interval fall within a given hourly window, sum these minutes across all patients, and then divide by the number of patients boarding during that hour. For example, during the hour 01:00–02:00, three patients (IDs 1, 2, and 3) were boarding, and their combined boarding time totaled 135 minutes (30 + 60 + 45), resulting in an average boarding time of 45 minutes (135 ÷ 3). (4) Waiting count; the number of patients in the waiting room during each hour, calculated using the same logic as boarding count. The waiting period begins at the timestamp when the patient arrives at the waiting room location and ends at the timestamp when the patient leaves the waiting room location. This calculation is cumulative across hours. For example, if a patient arrives at the ED waiting room at 10:20 AM and departs at 12:50 PM, they would be counted in the waiting count for the hours 10–11 AM, 11 AM–12 PM, and 12–1 PM. (5) Waiting count by ESI level; the total waiting count broken into the same three ESI categories to reflect differences in waiting room congestion by acuity. (6) Average waiting time; the mean duration, in minutes, that patients spend in the waiting room during each hour. It is calculated using the same logic as average boarding time, but the timestamps used are the waiting room location arrival and departure times. For each hour, the total waiting minutes of all patients are summed and then divided by the number of patients waiting during that hour, ensuring that patients spanning multiple hours are only counted for the portion of time they were waiting in each hour. (7) Treatment count; the number of patients in treatment rooms during each hour, calculated using the same logic as boarding count. The treatment period is defined as the time between the timestamp when a patient arrives at a treatment room location and the timestamp when they leave that location. Patients whose treatment period spans multiple hours are counted in each applicable hourly interval. (8) Average treatment time; the mean duration, in minutes, that patients spend in treatment rooms during each hour, calculated using the same logic as average boarding time but with treatment room timestamps. 
(9) Extreme case indicator: a binary variable that takes the value 1 if the boarding count at a given hour exceeds its historical mean (μ) plus one standard deviation (σ), and 0 otherwise. Finally, from the inpatient data source, a single feature—hospital census—was engineered to capture the number of admitted patients present in the hospital at each hour. The feature set used in this study contains more than nine features, as shown in Table 1. However, these nine features required specific feature engineering because they were not directly available in the raw dataset and had to be derived through additional calculations.
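To make the hourly aggregation logic above concrete, the following is a minimal illustrative sketch (not our production code) of how boarding count and average boarding time could be computed from per-visit boarding intervals, assuming a pandas DataFrame with hypothetical column names:

```python
import pandas as pd

# Hypothetical per-visit boarding intervals (bed request -> ED checkout);
# column names and values are illustrative only.
boarding = pd.DataFrame({
    "visit_id": [1, 2, 3],
    "bed_request_time": pd.to_datetime([
        "2023-03-01 16:20", "2023-03-01 01:10", "2023-03-01 01:30"]),
    "ed_checkout_time": pd.to_datetime([
        "2023-03-01 18:35", "2023-03-01 02:10", "2023-03-01 02:15"]),
})

hours = pd.date_range("2023-03-01 00:00", "2023-03-01 23:00", freq="h")
rows = []
for hour_start in hours:
    hour_end = hour_start + pd.Timedelta(hours=1)
    # Patients whose boarding interval overlaps this hourly window
    # are counted for this hour (cumulative across hours).
    active = boarding[(boarding["bed_request_time"] < hour_end)
                      & (boarding["ed_checkout_time"] > hour_start)]
    # Minutes of each overlapping interval that fall inside the window.
    minutes = (active["ed_checkout_time"].clip(upper=hour_end)
               - active["bed_request_time"].clip(lower=hour_start)
               ).dt.total_seconds() / 60
    count = len(active)
    rows.append({
        "hour": hour_start,
        "boarding_count": count,
        "avg_boarding_time_min": round(minutes.sum() / count, 1) if count else 0.0,
    })

hourly_boarding = pd.DataFrame(rows)
```

The same windowing logic, with different timestamps, would apply to the waiting and treatment metrics described above.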

 

 

 

Comments 2: Did the authors also include the number of elective surgery cases scheduled that could affect bed availability for analysis?

Response 2:

 

Elective surgery case volumes were not included as a separate feature in the current analysis because our model was developed using aggregated operational metrics rather than patient-level data. While we recognize that the number of scheduled elective surgeries can influence hospital bed availability and consequently boarding counts, this level of detail was not available within the aggregated data used.

 

Comments 3: I note that their approach does not utilize patient-level clinical data but instead relies solely on aggregate operational data - this might cause their model to be inaccurate when applied to children's or maternity hospitals, or to hospitals in communities with different populations, such as the very elderly or socially impoverished. The authors might want to comment on these possible limitations? I know they claimed that "our method simplifies data collection and improves generalizability, making it readily adaptable for use across diverse hospital systems", but these claims do need some clarification.

Response 3:

 

We appreciate this valuable comment. In the revised manuscript, we have clarified that while our process is designed to be generalizable, the trained prediction module must be retrained using data from each individual hospital. We have also expanded the limitations section at the end of the discussion to reflect this point.

 

Action 3:

 

Regarding Comment 3, we have updated the Introduction section on page 3 and the Limitations and Future Work section on page 17 of the revised manuscript.

 

 

 

Introduction

 

In the fourth paragraph on page 3

 

Most existing models rely on a narrow set of input features, often limited to internal ED data. In contrast, our study integrates a wider range of features, combining operational indicators with contextual variables such as weather conditions and significant events (e.g., holidays and football games), which originate outside of the hospital system. Many prior models also depend on patient-level clinical data, including vital signs, demographics, diagnoses, ED laboratory results, and past medical history, which require extensive data sharing and raise privacy and regulatory concerns. Our approach is fundamentally different: it does not use patient-level clinical data and instead relies solely on aggregate operational data—structured, time-stamped numerical indicators that reflect system-level dynamics. Our design simplifies data integration and enhances model generalizability across settings.

 

In the fifth (last) paragraph on page 4

 

                                                  ...

Moreover, because our method does not use sensitive patient-level information, the approach is easier to implement and more adaptable to different hospital systems, though models must still be trained on each institution’s data.

...

 

 

Limitations and Future Work

 

In the fourth paragraph on page 17

 

From a deployment perspective, our approach offers broad generalizability and potential adoption in diverse ED settings because it relies on aggregate operational data rather than patient-level records. It assumes clearly defined waiting and treatment areas with accurate timestamp tracking; without these, the necessary feature engineering for PFMs would not be possible. The model must also be retrained for each site to ensure optimal performance, as healthcare policies, workflows, and patient flow patterns vary considerably between institutions, especially outside the United States. For example, some hospitals use fast-track systems for low-acuity patients, while others manage all patients through a unified treatment stream. Another example is that pediatric EDs follow different operational models than adult EDs.

 

 

Comments 4: Good that they also include major sporting events and weather conditions. These do affect ED attendances, and I applaud them for adding them in. But it does mean the model needs good weather forecasting services. Good that they also excluded pandemic data, which would skew the study.

Response 4: We thank the reviewer for recognizing the inclusion of weather conditions and major sporting events as factors that can influence ED attendances. We agree that relying on weather forecasting services introduces potential dependency risks. To address this, we have added a note in the Limitations and Future Work section of the revised manuscript.

 

Action 4:

 

Regarding Comment 4, we have updated the Limitations and Future Work section on page 18 of the revised manuscript.

 

Limitations and Future Work

 

                                                  ...

To further enhance reliability, we will establish infrastructure for real-time data ingestion and implement fallback strategies for external data sources (e.g., weather APIs). For instance, if the primary source becomes unavailable or returns anomalous data, the system will automatically switch to a secondary provider and validate consistency.
...
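As an illustration of the planned fallback behavior, the sketch below shows one possible primary/secondary switch with a simple plausibility check; the provider functions and thresholds are hypothetical placeholders, not the deployed implementation.

```python
def fetch_weather_with_fallback(primary_fetch, secondary_fetch,
                                min_temp_f=-30.0, max_temp_f=130.0):
    """Try the primary weather source; fall back to the secondary if the
    primary is unreachable or returns an implausible reading.

    Both arguments are hypothetical callables returning a dict such as
    {"temp_f": 72.0, "rain": 0}.
    """
    for fetch in (primary_fetch, secondary_fetch):
        try:
            obs = fetch()
        except Exception:
            continue  # provider unreachable; try the next one
        # Basic consistency check before accepting the observation.
        if obs and min_temp_f <= obs.get("temp_f", float("nan")) <= max_temp_f:
            return obs
    return None  # both sources failed; caller can impute or hold the last value
```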

 

Comments 5: All in all, a good model to try out. Will they revisit the accuracy and validity after trying it out with international partners outside of Alabama? ED systems in Asia, Europe, and Africa could differ from US ED practices.

Response 5:

 

We fully agree that ED systems can vary significantly across regions. In the revised manuscript, we have updated the Limitations section to clarify the basic requirements for applying our framework to other countries. Specifically, the ED must have clearly defined waiting and treatment areas, and patient movements through these areas must be accurately tracked with timestamps. We also note that some EDs may have a nested structure in which waiting and treatment areas are not distinctly separated or where movement tracking is not consistently available. In such settings, the lack of granular movement and location data would limit the applicability of our method, and adaptations would be necessary to account for these structural differences.

 

Action 5:

 

Regarding Comment 5, we have updated the Limitations and Future Work section on page 17 of the revised manuscript.

 

 

Limitations and Future Work

                                                 ...

From a deployment perspective, our approach offers broad generalizability and potential adoption in diverse ED settings because it relies on aggregate operational data rather than patient-level records. It assumes clearly defined waiting and treatment areas with accurate timestamp tracking; without these, the necessary feature engineering for PFMs would not be possible. The model must also be retrained for each site to ensure optimal performance, as healthcare policies, workflows, and patient flow patterns vary considerably between institutions, especially outside the United States. For example, some hospitals use fast-track systems for low-acuity patients, while others manage all patients through a unified treatment stream. Another example is that pediatric EDs follow different operational models than adult EDs.

 

Reviewer 2 Report

Comments and Suggestions for Authors
  • The abstract is slightly overloaded with technical details (e.g., Optuna, specific model names) that may not aid general readers.

  • The novelty claim ("without patient-level clinical data") could be emphasized more succinctly with justification.

  • Some references are dated or weak (e.g., simulations without real hospital data [17]).

  • Lack of a clearly stated research gap or hypothesis.

  • Novelty is embedded in a lengthy paragraph; needs distillation.

  • Lacks structured comparison (e.g., a table comparing features, methods, and results).

  • Some works (e.g., [15]-[19]) are discussed without clear connection to the authors’ contributions.

  • Does not explore limitations of deep learning in healthcare (e.g., interpretability, data drift).

  • Feature scaling/normalization methods are not detailed.

  • Potential data leakage concerns with engineered lag features are not addressed.

  • The justification for excluding COVID-19 data (April–July 2020) is weak—why not adjust for seasonality instead?

  • Results lack confidence intervals or statistical tests to validate significance.

  • Explainability analysis (Figure 2 & 3) could be supported by SHAP or other independent techniques.

  • No performance comparison with non-deep learning baselines (e.g., Random Forest).

  • Lacks patient-level clinical or operational validation from domain experts (e.g., ED staff interviews).

  • Does not evaluate whether attention correlates with human intuition.

  • Fails to discuss limitations of model deployment (e.g., latency, integration with EHRs).

  • No assessment of external validity or transferability across hospitals.

  • Too brief given the breadth of the study.

  • Lacks concrete future work suggestions (e.g., simulation-based impact studies, clinical integration).

Comments on the Quality of English Language
  • Reduce the similarity index, which is currently about 52% (Turnitin).
  • Minor typos: "Higly Extreme" (should be "Highly Extreme").

  • Repetitive phrasing in some sections (e.g., Dataset 4 description repeated).

Author Response

Response to Reviewer 2 Comments

 

1. Summary

 

 

We sincerely thank the reviewer for taking the time to provide thoughtful and constructive feedback on our manuscript. Your comments and suggestions were extremely valuable and have significantly contributed to improving the clarity, quality, and overall strength of the paper. We carefully addressed each point raised, and we believe the revisions have resulted in a much stronger and more comprehensive manuscript.

 

2. Questions for General Evaluation

Reviewer’s Evaluation

Response and Revisions

Does the introduction provide sufficient background and include all relevant references?

Yes/Can be improved/Must be improved/Not applicable

We appreciate the reviewer’s comment. We have revised the Introduction to provide additional background context and included several relevant references to better situate our study within the existing literature. These updates improve the clarity and completeness of the background information and strengthen the rationale for our research focus.

Are all the cited references relevant to the research?

Yes/Can be improved/Must be improved/Not applicable

 

Is the research design appropriate?

Yes/Can be improved/Must be improved/Not applicable

We have clarified the study design and model development process to address the concerns raised. Thank you for the helpful feedback.

Are the methods adequately described?

Yes/Can be improved/Must be improved/Not applicable

We agree with the reviewer that the methods section could be described in more detail. We have revised this section to provide clearer explanations and additional information to improve its completeness and readability.

Are the results clearly presented?

Yes/Can be improved/Must be improved/Not applicable

We agree with the reviewer that the results section could be described in more detail. We have revised this section to provide clearer explanations and additional information to improve its completeness and readability.

 

Are the conclusions supported by the results?

Yes/Can be improved/Must be improved/Not applicable

Thank you for the feedback. We have refined the conclusion section to better align with the results presented and to more clearly highlight the key findings and their implications.

Are all figures and tables clear and well-presented?

Yes/Can be improved/Must be improved/Not applicable

 

Thank you for the comment. We have revised the figures and tables to improve clarity, remove redundancies, and ensure consistent formatting throughout the manuscript.

3. Point-by-point response to Comments and Suggestions for Authors

 

Comments 1: The abstract is slightly overloaded with technical details (e.g., Optuna, specific model names) that may not aid general readers.

 

Response 1:

 

In the revised abstract, we have reduced the level of technical detail to improve readability for a general reader. We removed references to multiple model names (e.g., ResNetPlus, TSiTPlus) and the hyperparameter optimization tool (Optuna). We now only mention the best-performing model, TSTPlus, while focusing on the key results and contributions of the study.

 

Page 1, the Abstract section has been revised as shown below, with changes indicated in red.

 

 

Action 1:

 

Abstract

Emergency department (ED) overcrowding remains a major challenge for hospitals, resulting in worse outcomes, longer waits, higher costs, and greater strain on staff. Boarding count, which is the number of patients who have been admitted to an inpatient unit but are still in the ED waiting for transfer, is a key patient flow metric that affects overall ED operations. This study presents a deep learning-based approach to forecast ED boarding counts using only operational and contextual features—derived from hourly ED tracking, inpatient census, weather, holiday, and local event data—without patient-level clinical information. Different deep learning algorithms were tested, including convolutional and transformer-based time series models, and the best-performing model, TSTPlus, achieved strong accuracy at the 6-hour prediction horizon, with a mean absolute error of 4.30 and an R² score of 0.79. After identifying TSTPlus as the best-performing model, we tested its use to forecast boarding counts at additional horizons of 8, 10, and 12 hours. The model was also evaluated under extreme operational conditions, demonstrating robust and accurate forecasts. These findings highlight the potential of our forecasting approach to support proactive operational planning and reduce ED overcrowding. The source code and model implementation are available at https://github.com/drorhunvural/ED_OverCrowding_Predictions.

Comments 2: The novelty claim ("without patient-level clinical data") could be emphasized more succinctly with justification.

Response 2:

 

We thank the reviewer for this valuable comment. In the revised manuscript, we have strengthened and clarified the novelty of our approach regarding the use of patient-level clinical data. Specifically, we now explicitly state that many prior models depend heavily on patient-level data such as vital signs, demographics, diagnoses, ED laboratory results, and past medical history, which require extensive data sharing and raise privacy and regulatory concerns. We emphasize that our approach is fundamentally different because it does not use patient-level clinical data and instead relies solely on aggregate operational data.

 

Page 3, the Introduction section has been revised as shown below, with changes indicated in red. Page 18, the Conclusion section has been revised as shown below, with changes indicated in red.

 

 

 

 

Action 2:

 

Introduction

 

 

 

In the fourth paragraph on page 3

 

     Most existing models rely on a narrow set of input features, often limited to internal ED data. In contrast, our study integrates a wider range of features, combining operational indicators with contextual variables such as weather conditions and significant events (e.g., holidays and football games), which originate outside of the hospital system. Many prior models also depend on patient-level clinical data, including vital signs, demographics, diagnoses, ED laboratory results, and past medical history, which require extensive data sharing and raise privacy and regulatory concerns. Our approach is fundamentally different: it does not use patient-level clinical data and instead relies solely on aggregate operational data—structured, time-stamped numerical indicators that reflect system-level dynamics. Our design simplifies data integration and enhances model generalizability across settings. Additionally, to our knowledge, there is no existing model that performs hourly boarding count prediction using real-world data from US EDs with deep learning models, and there remains a clear research gap in this area. We hypothesize that a deep learning–based framework leveraging aggregate operational features can accurately forecast ED boarding counts across multiple prediction horizons, providing actionable insights that enable proactive management of ED resources and mitigation of overcrowding. Our results are intended to proactively inform FCP activation and serve as inputs to a clinical decision support system, enabling more timely and informed operational responses.

 

In the third paragraph on page 3

 

 ...

Similarly, Lee et al. [19] modeled early disposition prediction as a hierarchical multiclass classification task using patient-level ED data, including lab results and clinical features available approximately 2.5 hours prior to disposition. This model aims to reduce boarding delays but relies on patient-level clinical data, which is labor-intensive to collect, subject to privacy constraints, and often inconsistent across hospitals—limiting its scalability and real-time applicability. The short and variable prediction window, triggered only after lab results, may limit its effectiveness for enabling proactive resource allocation and operational decision-making.

 

 

 

 

 

 

Conclusion

In the first paragraph on page 18

 

The proposed framework does not rely on patient-level clinical data, thereby facilitating generalizability, protecting privacy, and making it applicable to diverse hospital environments.

 

 

   

 

Comments 3: Some references are dated or weak (e.g., simulations without real hospital data [17]).

Response 3:

We thank the reviewer for this observation. Reference [17] is a comprehensive PhD dissertation from the University of Texas at Arlington that provides valuable background on ED patient flow modeling, and we believe it remains relevant to our study. However, we have updated the manuscript to ensure we are not relying solely on this reference for horizon selection (e.g., the six-hour window). While we acknowledge that six hours is an important benchmark identified in prior research, we have strengthened the study by expanding our analysis to include prediction results at multiple horizons (6, 8, 10, and 12 hours). These updates have been incorporated throughout the revised manuscript.

Pages 3-4, the Introduction section has been revised as shown below, with changes indicated in red.

 

Action 3:

Introduction

In the fifth (last) paragraph on page 3-4

 

We developed a predictive model to estimate boarding count at multiple future horizons (6, 8, 10, and 12 hours) using real-world data from a partner hospital located in Alabama, United States of America, relying solely on ED operational flow and external features without incorporating patient-level clinical data. During this study, we worked closely with an advisory board consisting of representatives from various emergency department teams at our collaborating hospital. Based on recommendations from our advisory board and prior research identifying a six-hour window as a critical threshold for boarding interventions [17], we evaluated four prediction horizons (6, 8, 10, and 12 hours) to balance early-warning capability with operational feasibility.
...

 

Comments 4: Lack of a clearly stated research gap or hypothesis.

Response 4:

 

In response, we have revised the manuscript to more clearly articulate both the research gap and our study hypothesis. Specifically, we now highlight that there is no existing model, to our knowledge, that performs hourly boarding count prediction using real-world data from US EDs with deep learning methods—an area that remains underexplored in the literature. We have also explicitly stated our hypothesis: that a deep learning–based framework leveraging aggregate operational features can accurately forecast ED boarding counts across multiple prediction horizons. These updates are now reflected in the revised Introduction to clarify the novel contribution and direction of our work.

 

Page 3, the Introduction section has been revised as shown below, with changes indicated in red.

 

 

Action 4:

 

Introduction

 

In the fourth paragraph on page 3

 

Our design simplifies data integration and enhances model generalizability across settings. Additionally, to our knowledge, there is no existing model that performs hourly boarding count prediction using real-world data from US EDs with deep learning models, and there remains a clear research gap in this area. We hypothesize that a deep learning–based framework leveraging aggregate operational features can accurately forecast ED boarding counts across multiple prediction horizons, providing actionable insights that enable proactive management of ED resources and mitigation of overcrowding. Our results are intended to proactively inform FCP activation and serve as inputs to a clinical decision support system, enabling more timely and informed operational responses.

 

 

Comments 5: Novelty is embedded in a lengthy paragraph; needs distillation.

Response 5:

 

We have revised the final portion of the relevant paragraph in the introduction to clearly articulate the key innovations of our work. The updated paragraph now explicitly outlines the three primary contributions of our study: (1) the use of aggregate operational and contextual data without patient-level clinical inputs, (2) multi-horizon forecasting using real-world hospital data to support proactive ED resource management, and (3) enhanced model interpretability and generalizability through transparent feature engineering, attention-based analysis, and evaluation under extreme conditions. These changes are intended to better highlight the novelty and significance of our approach.

 

Page 4, the Introduction section has been revised as shown below, with changes indicated in red.

 

 

 

 

 

 

Action 5:

 

Introduction

 

 

In the fifth (last) paragraph of the Introduction section (page 4)

 

 

...

Overall, our framework offers three key innovations: it leverages only aggregate operational and contextual data, enhanced through extensive feature engineering to derive critical PFMs that were not directly available from the raw data; it performs multi-horizon forecasting using real-world hospital data to enable proactive, real-time ED resource management before critical thresholds are reached; and it promotes interpretability and generalizability through transparent feature design, attention-based analysis, and testing under extreme operational scenarios.

 

 

Comments 6: Lacks structured comparison (e.g., a table comparing features, methods, and results).

Response 6:

 

Thank you for the comment. We have addressed this by adding structured comparisons to enhance clarity and comprehensiveness. Specifically, Figure 2 now includes 8-, 10-, and 12-hour prediction results for the best-performing model, TSTPlus, allowing direct comparison across multiple prediction horizons. In addition, Table 4 incorporates a bootstrap analysis to provide a statistically robust evaluation of TSTPlus performance.

 

Page 13, the Results section has been revised as shown below, with changes indicated in red.

 

 

Action 6:

 

Results

In the fourth paragraph of the Results section (page 13)

 

 

Figure 2 presents the performance of the TSTPlus model across five datasets for 6-, 8-, 10-, and 12-hour prediction horizons. As expected, shorter prediction intervals produced more accurate and reliable forecasts, consistent with the general principle that predictive accuracy decreases as the prediction horizon lengthens and uncertainty increases. The best results for different prediction horizons were obtained from two datasets: Dataset 3 achieved the highest accuracy for the 6- and 12-hour horizons, while Dataset 5 performed best for the 8- and 10-hour horizons. Figure 2 shows that different feature sets can be more effective for different forecasting horizons. As a result, the best MAE values achieved by the TSTPlus algorithm were 4.30 for the 6-hour horizon, 4.72 for 8 hours, 5.14 for 10 hours, and 5.39 for 12 hours.

 

 

 

Figure 2. Performance of TSTPlus across five datasets, evaluated using MAE, MSE, RMSE, and R² metrics for 6-, 8-, 10-, and 12-hour prediction horizons.

 

 

Extreme Case and Bootstrap Analysis

 

 

In the fourth paragraph of the Extreme Case and Bootstrap Analysis  section (page 14)

 

      Bootstrap analysis is a resampling-based technique used to assess the stability and variability of model performance metrics. We employed a block bootstrap approach, sampling consecutive 12-hour segments for Datasets 1–4 and 24-hour segments for Dataset 5 to maintain the temporal structure of the test data. Across 500 bootstrap iterations, the model was evaluated on each resampled set, and performance metrics—MAE, MSE, and R²—were calculated. The interquartile range (IQR), reported as the 25th to 75th percentile values for each metric, reflects the spread of the middle 50% of bootstrap results, providing a robust evaluation of performance variability. These median and IQR values are summarized in Table 4. As shown in Figure 1, TSTPlus on Dataset 3 achieved a median MAE of 4.30. The narrow IQR range of 4.24–4.37 indicates high stability and low variability across bootstrap samples, underscoring the robustness of the results.

 

                                                   

 

 

 

 

Comments 7: Some works (e.g., [15]-[19]) are discussed without clear connection to the authors’ contributions.

Response 7:

To address this concern, we have revised the relevant section in the manuscript to more clearly explain the relationship between the cited studies and our own contributions. Specifically, we have added explanatory context that highlights the methodological differences, data requirements, and prediction horizons in each study compared to ours. Where appropriate, we now explicitly link these points to the novelty and relevance of our own approach. These additions help clarify how our work builds upon or diverges from previous efforts in ED boarding prediction.

 

Page 2, the Introduction section has been revised as shown below, with changes indicated in red.

 

Action 7:

 

 

Introduction

 

In the third paragraph of the Introduction section (page 2)

 

Several studies have addressed ED overcrowding by modeling or predicting boarding using both statistical and machine learning approaches. For example, Cheng et al. [15] developed a linear regression model to estimate the staffed bed deficit—defined as the difference between the number of ED boarders and available staffed inpatient beds—using real-time data on boarding count, bed availability, pending consults, and discharge orders. Their model predicts the number of beds needed 4 hours in advance of each shift change. However, the model uses limited features, cannot capture nonlinear patterns due to its linear regression design, and is constrained by a short 4-hour prediction window. Hoot et al. [16] introduced a discrete event simulation model to forecast several ED crowding metrics, including boarding count, boarding time, waiting count, and occupancy level, at 2-, 4-, 6-, and 8-hour intervals. The model, which used six patient-level features, achieved a Pearson correlation of 0.84 when forecasting boarding count six hours into the future. However, this approach does not employ machine learning; it is based on simulation techniques rather than data-driven predictive modeling. More recent studies have leveraged machine learning to enable earlier predictions. Suley [17] developed predictive models to forecast boarding counts 1-6 hours ahead using multiple machine learning approaches, with Random Forest regression achieving the best performance. They demonstrated that when boarding levels exceed 60% of ED capacity, average length of stay increases from 195 to 224 minutes. However, these findings were based on agent-based simulation modeling rather than real hospital data, which may not reflect actual ED operations. Additionally, the study lacks key descriptives (e.g., mean, standard deviation of hourly boarding counts), limiting full performance evaluation. Kim et al. [18] used data collected within the first 20 minutes of ED patient arrival—including vital signs, demographics, triage level, and chief complaints—to predict ED hospitalization, employing five models: logistic regression, XGBoost, NGBoost, support vector machines (SVM), and decision trees. At 95% specificity, their approach reduced ED length of stay by an average of 12.3 minutes per patient, totaling over 340,000 minutes annually. Unlike our study, which leverages aggregate operational data along with contextual features, their approach relies on early patient-level clinical information and employs traditional machine learning methods. Similarly, Lee et al. [19] modeled early disposition prediction as a hierarchical multiclass classification task using patient-level ED data, including lab results and clinical features available approximately 2.5 hours prior to disposition. This model aims to reduce boarding delays but relies on patient-level clinical data, which is labor-intensive to collect, subject to privacy constraints, and often inconsistent across hospitals—limiting its scalability and real-time applicability. The short and variable prediction window, triggered only after lab results, may limit its effectiveness for enabling proactive resource allocation and operational decision-making.

 

 

 

Comments 8: Does not explore limitations of deep learning in healthcare (e.g., interpretability, data drift).

Response 8:

 

We appreciate the reviewer’s comment regarding the limitations of deep learning in healthcare. In our newly added Limitations and Future Work section, we address several critical concerns, including the need to evaluate the model in real-world or simulation-based settings, the practical constraints of our 6–12 hour prediction windows, and the potential limitations of transformer architectures in capturing temporal causality. We also emphasize the importance of generalizability across hospitals with varying operational structures. Furthermore, in the Future Work section, we explicitly mention the need to maintain model robustness in the presence of data drift by incorporating automated retraining, continuous performance monitoring, and anomaly detection into the proposed platform.

 

Page 17, the Limitations and Future Work section has been revised as shown below, with changes indicated in red.

 

 

Action 8:

 

Limitations and Future Work

 

This study presents a predictive approach for ED operations, but like any data-driven method, it has limitations and opportunities for future improvement. While the model demonstrates strong performance using aggregate operational data, further developments are necessary to optimize its applicability across diverse healthcare settings.

One primary limitation is that this study did not evaluate how the predictions would perform in a real-world operational or simulation-based setting. Although the model provides accurate forecasts at 6-, 8-, 10-, and 12-hour intervals, its practical value—particularly whether these windows allow sufficient lead time for effective intervention—remains untested. Beyond operational considerations, the study also faces some technical limitations related to model design and training. Despite an extensive hyperparameter search, alternative or more comprehensive strategies could yield better-performing combinations. Likewise, exploring different feature set configurations within the datasets could further enhance predictive performance. While explainability approaches (e.g., Gradient SHAP, attention weight visualization) provide insights into feature importance, these methods cannot guarantee full transparency of the model’s decision-making, especially in cases involving complex nonlinear relationships between inputs and outputs. From a deployment perspective, our approach offers broad generalizability and potential adoption in diverse ED settings because it relies on aggregate operational data rather than patient-level records. It assumes clearly defined waiting and treatment areas with accurate timestamp tracking; without these, the necessary feature engineering for PFMs would not be possible. The model must also be retrained for each site to ensure optimal performance, as healthcare policies, workflows, and patient flow patterns vary considerably between institutions, especially outside the United States. For example, some hospitals use fast-track systems for low-acuity patients, while others manage all patients through a unified treatment stream. Another example is that pediatric EDs follow different operational models than adult EDs.

In future work, we plan to evaluate the model in simulation-based or real-world ED settings to assess its operational effectiveness. We will also extend the framework to predict additional PFMs used in this study and integrate these models into a unified prediction platform. This system will generate real-time forecasts for multiple critical PFMs, offering a comprehensive view of anticipated ED operational status and supporting proactive resource management. The deployment will include standardized data preparation, feature engineering, and preprocessing routines, along with automated model retraining, performance monitoring, and anomaly detection to maintain accuracy and robustness in the presence of data drift. Multiple predictive modules—each targeting a specific PFM—will run concurrently, coordinated by a central orchestration layer to ensure data consistency and manage dependencies. To further enhance reliability, we will establish infrastructure for real-time data ingestion and implement fallback strategies for external data sources (e.g., weather APIs). For instance, if the primary source becomes unavailable or returns anomalous data, the system will automatically switch to a secondary provider and validate consistency. Additionally, user-facing dashboards will be developed to integrate seamlessly with existing ED workflows while supporting actionable, data-driven decision-making as part of the full deployment process.

 

 

 

 

 

Comments 9: Feature scaling/normalization methods are not detailed.

Response 9:

 

We thank the reviewer for highlighting the need for greater detail in our feature scaling and normalization methods. In the revised manuscript, we have clarified that we applied standardization using scikit-learn’s StandardScaler. Specifically, the scaler was fit on the training set to compute the mean and standard deviation, and then the same transformation was applied to the validation and test sets to ensure consistency in data preprocessing.

 

Page 7, the Feature Engineering and Preprocessing section has been revised as shown below, with changes indicated in red.

 

 

Action 9:

 

Feature Engineering and Preprocessing

 

 

In the seventh paragraph of the Feature Engineering and Preprocessing section (page 7)

 

All non-binary features and the target variable were standardized using the StandardScaler from scikit-learn [30], with the scaling parameters computed on the training set and applied to the validation and test sets to avoid data leakage.
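The fit-on-training, transform-everything pattern described above can be sketched as follows; the arrays are placeholders, not our actual pipeline code.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Placeholder splits standing in for the chronological train/validation/test sets.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 5))
X_val = rng.normal(size=(20, 5))
X_test = rng.normal(size=(20, 5))

scaler = StandardScaler()
# Fit on the training split only, so the mean/std never see validation or test data.
X_train_scaled = scaler.fit_transform(X_train)
# Reuse the training-set parameters for the later splits.
X_val_scaled = scaler.transform(X_val)
X_test_scaled = scaler.transform(X_test)
```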

 

 

Comments 10: Potential data leakage concerns with engineered lag features are not addressed.

Response 10:

 

We appreciate the reviewer’s observation. To address this, we have clarified in the revised manuscript that all lagged and rolling features were generated using only past values relative to each prediction timestamp. This ensures that no future information was used in constructing input features for training, validation, or testing, thereby avoiding data leakage. A corresponding clarification has been added to the Feature Engineering and Preprocessing section of the manuscript.

 

Page 7, the Feature Engineering and Preprocessing section has been revised as shown below, with changes indicated in red.

 

Action 10:

 

Feature Engineering and Preprocessing

 

In the sixth paragraph of the Feature Engineering and Preprocessing section (page 7)

 

     Lagged and rolling features were computed using a custom function that systematically transformed each selected variable by generating lagged versions and rolling averages. For each variable, lag features were created by shifting the original values backward by 1 to N hours, producing a series of lagged inputs corresponding to different historical time steps. This enables the model to learn from recent historical values. To capture local trends and smooth out noise, rolling mean features were calculated using a centered moving average over a specified window size. To prevent data leakage, lagged and rolling features were generated strictly based on historical values prior to each prediction timestamp, ensuring that no future information was included in the input features during training, validation, or testing.
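For illustration, a simplified sketch of leakage-free lag and rolling-mean construction is shown below; it uses backward shifts and a trailing rolling mean over past values only, and the column names and window sizes are hypothetical rather than our exact implementation.

```python
import pandas as pd

def add_lag_and_rolling_features(df, column, n_lags=12, window=3):
    """Create lagged and rolling-mean versions of one column using only
    values that precede each timestamp (simplified illustration)."""
    out = df.copy()
    for lag in range(1, n_lags + 1):
        out[f"{column}_lag{lag}"] = out[column].shift(lag)
    # Trailing rolling mean of past values only; shift(1) excludes the current hour.
    out[f"{column}_rollmean{window}"] = (
        out[column].shift(1).rolling(window=window).mean()
    )
    return out

# Hypothetical hourly series of boarding counts.
hourly = pd.DataFrame(
    {"boarding_count": [12, 15, 18, 22, 19, 17, 20, 25]},
    index=pd.date_range("2023-03-01", periods=8, freq="h"),
)
features = add_lag_and_rolling_features(hourly, "boarding_count", n_lags=3)
```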

 

 

Comments 11: The justification for excluding COVID-19 data (April–July 2020) is weak—why not adjust for seasonality instead?

Response 11:

 

We appreciate the reviewer’s feedback. We have clarified in the manuscript that the decision to exclude data from April to July 2020 was not solely due to seasonality, but because this period exhibited atypical operational behaviors caused by the early impact of the COVID-19 pandemic. As shown in Appendix 1, both boarding counts and times were substantially lower than historical norms, reflecting systemic disruptions and operational anomalies not representative of routine ED functioning.

 

Comments 12: Results lack confidence intervals or statistical tests to validate significance.

Response 12:

 

Thank you for this valuable comment. To address this concern, we have added a bootstrap analysis that reports the interquartile range (IQR) of performance metrics for the best-performing model, TSTPlus, across all datasets. This analysis, now presented in Table 4, provides a statistical measure of variability and strengthens the reliability of our results.

 

Page 14, the Extreme Case and Bootstrap Analysis section has been revised as shown below, with changes indicated in red.

 

 

Action 12:

 

Extreme Case and Bootstrap Analysis

 

In the fourth paragraph of the Extreme Case and Bootstrap Analysis  section (page 14)

 

      Bootstrap analysis is a resampling-based technique used to assess the stability and variability of model performance metrics. We employed a block bootstrap approach, sampling consecutive 12-hour segments for Datasets 1–4 and 24-hour segments for Dataset 5 to maintain the temporal structure of the test data. Across 500 bootstrap iterations, the model was evaluated on each resampled set, and performance metrics—MAE, MSE, and R²—were calculated. The interquartile range (IQR), reported as the 25th to 75th percentile values for each metric, reflects the spread of the middle 50% of bootstrap results, providing a robust evaluation of performance variability. These median and IQR values are summarized in Table 4. As shown in Figure 1, TSTPlus on Dataset 3 achieved a median MAE of 4.30. The narrow IQR range of 4.24–4.37 indicates high stability and low variability across bootstrap samples, underscoring the robustness of the results.
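A minimal sketch of the block bootstrap procedure described above is given below for clarity; the data are synthetic placeholders and the code is illustrative, not the exact analysis script.

```python
import numpy as np

def block_bootstrap_mae(y_true, y_pred, block_len=12, n_iter=500, seed=0):
    """Block bootstrap of MAE: resample consecutive segments of length
    block_len to preserve temporal structure, then report the median and
    the 25th-75th percentile range (simplified illustration)."""
    rng = np.random.default_rng(seed)
    n_blocks = len(y_true) // block_len
    starts = np.arange(n_blocks) * block_len
    maes = []
    for _ in range(n_iter):
        chosen = rng.choice(starts, size=n_blocks, replace=True)
        idx = np.concatenate([np.arange(s, s + block_len) for s in chosen])
        maes.append(np.mean(np.abs(y_true[idx] - y_pred[idx])))
    maes = np.asarray(maes)
    return np.median(maes), np.percentile(maes, [25, 75])

# Synthetic hourly test-set values and model predictions (placeholders).
rng = np.random.default_rng(1)
y_true = rng.poisson(30, size=24 * 60).astype(float)
y_pred = y_true + rng.normal(0, 5, size=y_true.size)
median_mae, (q25, q75) = block_bootstrap_mae(y_true, y_pred)
```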

 

 

                                             

 

 

Comments 13: Explainability analysis (Figure 2 & 3) could be supported by SHAP or other independent techniques.

Response 13:

 

The attention-based explainability analysis has been expanded to include two additional, complementary techniques: Gradient SHAP and Input Weight Norm analysis. Gradient SHAP offers a model-agnostic, sensitivity-based measure of each feature’s contribution, while the Weight Norm approach estimates importance from the magnitude of learned embedding weights. Together, these methods provide independent validation of the attention-based findings, strengthening the interpretability of the model’s predictions. To ensure the results of these analyses were directly comparable, the input data shape was standardized across methods, which required updating the corresponding figures from the original manuscript.

 

Page 14 (Model Explainability) and Page 16 (Discussion) sections have been revised as shown below, with changes highlighted in red.

 

Action 13:

 

Model Explainability

 

   To enhance interpretability, we applied three complementary explainability techniques to our best-performing TSTPlus model: Input Weight Norm analysis, inspired by the principle that larger post-training weight magnitudes reflect greater variable influence [44], here adapted to compute the L₂ norm of each feature’s learned embedding vector to assess relative importance (Figure 3); attention analysis to identify which past time steps received the most focus during prediction (Figure 4); and Gradient SHAP to quantify each feature’s contribution to individual predictions through sensitivity analysis (Appendix C.1). For the attention analysis, we reshaped the dataset from a single time step (sequence length = 1) to 12-time steps, allowing lagged values to be represented as a sequence rather than as separate features. For example, Dataset 3 changed from (26051, 34, 1) to (26051, 22, 12), where each variable spans 12 consecutive hourly lags.
     The Weight Norm method measures feature importance by calculating the L2 norm of each input feature’s weight vector in the model’s first learnable layer, with higher norms indicating greater initial influence. These norms are normalized to sum to 1, providing comparable importance scores without requiring additional data. Figure 3 shows the ranked importance scores from this analysis, with Boarding Count having the highest importance (0.1772), followed by Day of Week (0.0880), Hospital Census (0.0697), Hour (0.0673), and Treatment Count (0.0626), indicating these features are most heavily weighted by TSTPlus at the model’s input stage. The three lowest-importance features were Year (0.0104), Weather: Rain (0.0174), and Football Game 2 (0.0195).
     In transformer models, the attention mechanism learns how much each time step should focus on others when making predictions. During training, we saved the model’s attention weights from a selected layer for later analysis. To calculate temporal importance, these weights were averaged across all heads and batches, summed for each time step to capture its total received attention, and normalized with a softmax function. Figure 4 shows that the current time step t (0.211) received the highest attention, followed by t-11 (0.175) and t-10 (0.099), while t-2 and t-7 (both 0.047) had the lowest influence on the model’s predictions.
     Gradient SHAP estimates each feature’s contribution to the model’s predictions by combining integrated gradients with Shapley values. Using a background dataset and evaluation samples, we computed SHAP values for the trained TSTPlus model and averaged their absolute magnitudes across all samples to obtain comparable importance scores. Appendix C.1 shows that Boarding Count had the highest mean absolute SHAP value (0.4318), followed by Day of the Week (0.1650), Treatment Count (0.1004), Federal Holiday (0.0964), and Average Waiting Time (0.0944), indicating these features had the greatest influence on the model’s outputs. The lowest-ranked features were Football Game 2 (0.0003), Weather Thunderstorm (0.0001), and Month (0.0030). Unlike weight-based metrics or attention patterns, Gradient SHAP attributes importance by directly measuring each feature’s marginal effect on the model’s output, making it more robust to the effects of feature scaling, correlated variables, and hidden-layer transformations.

Feature importance patterns from the Weight Norm method (Figure 3) and Gradient SHAP (Appendix C.1) showed strong agreement in identifying the most influential predictors. Both analyses ranked Boarding Count as the top feature (Weight Norm: 0.1772; SHAP: 0.4318), followed by high importance for Day of Week and Treatment Count. Weight Norm additionally emphasized Hospital Census and Hour, while SHAP highlighted Federal Holiday and Average Waiting Time. Lower-ranked features in both methods included weather variables and external event indicators, such as Football Game 2, indicating minimal impact on predictions. This convergence across two complementary methods strengthens confidence in the robustness of the identified top predictors.
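For readers interested in how the two aggregation steps described above could be computed, the following is a small illustrative sketch (with random placeholder arrays, not the trained TSTPlus weights) of the Weight Norm scores and the head/batch-averaged attention importance:

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Input Weight Norm (sketch) ---
# Placeholder first-layer weight matrix: one embedding vector per input feature.
n_features, embed_dim = 22, 64
W = rng.normal(size=(n_features, embed_dim))
weight_norms = np.linalg.norm(W, axis=1)        # L2 norm of each feature's weights
importance = weight_norms / weight_norms.sum()  # normalize so scores sum to 1

# --- Attention aggregation (sketch) ---
# Placeholder saved attention weights: (batches, heads, query_steps, key_steps).
attn = rng.uniform(size=(8, 4, 12, 12))
mean_attn = attn.mean(axis=(0, 1))              # average over batches and heads
received = mean_attn.sum(axis=0)                # total attention each time step receives
temporal_importance = np.exp(received) / np.exp(received).sum()  # softmax normalization
```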

 

 

Appendix C.1 (Gradient SHAP feature importance results)

Discussion

 

Explainability analysis provided complementary perspectives on the TSTPlus model’s decision-making, improving interpretability and supporting operational trust. Weight Norm analysis highlighted Boarding Count as the most influential feature, followed by Day of Week, Hospital Census, Hour, and Treatment Count, with minimal contributions from weather variables and external events such as football games. Attention analysis revealed that the model assigned the greatest weight to the current time step (t), as well as distant lags t–11 and t–10, indicating that both immediate conditions and patterns from the same time on the previous day play key roles in forecasting. Intermediate lags (e.g., t–2, t–7) had lower attention, suggesting less relevance for short-term fluctuations. Gradient SHAP results reinforced these findings, ranking Boarding Count, Day of Week, and Treatment Count as top drivers, with Federal Holiday and Average Waiting Time also contributing meaningfully. Across methods, weather variables and football game indicators consistently showed the least importance, indicating limited short-term predictive value.

 

 

Comments 14: No performance comparison with non-deep learning baselines (e.g., Random Forest).

Response 14:

 

We appreciate the reviewer’s suggestion. In this study, we intentionally limited the scope to deep learning-based models, as also reflected in the manuscript title, to explore and optimize state-of-the-art neural architectures tailored for multivariate time series forecasting in emergency department operations. Prior studies [1] have shown that traditional machine learning algorithms such as Random Forest and Gradient Boosting Machines are generally less effective in modeling long-range temporal dependencies and complex feature interactions compared to deep learning approaches, especially in healthcare time series settings.

 

[1] Vural, O., Ozaydin, B., Aram, K. Y., Booth, J., Lindsey, B. F., & Ahmed, A. (2025). An Artificial Intelligence-Based Framework for Predicting Emergency Department Overcrowding: Development and Evaluation Study. arXiv preprint arXiv:2504.18578.

 

 

Comments 15: Lacks patient-level clinical or operational validation from domain experts (e.g., ED staff interviews).

Response 15:

 

We appreciate the reviewer’s point regarding the importance of validation from domain experts. While we agree that structured interviews or patient-level validation could provide additional insights, our study involved close collaboration with an advisory board comprising representatives from multiple emergency department teams at our partner hospital, as noted in the Introduction. Furthermore, two of the co-authors bring direct clinical and operational expertise to the project: one is a practicing emergency physician serving as Associate Vice Chair and Chief Medical Information Officer, and another is the Associate Vice President of Patient Throughput. As acknowledged in the revised Limitations section—"One primary limitation is that this study did not evaluate how the predictions would perform in a real-world operational or simulation-based setting"—we recognize that further on-the-ground validation, including systematic feedback from clinical stakeholders, remains a critical direction for future work.

 

 

Comments 16: Does not evaluate whether attention correlates with human intuition.

Response 16:

 

While some studies in clinical decision support validate attention mechanisms through structured expert feedback, such validation is less common in time series forecasting research. In this study, our focus was on predictive performance and using attention analysis to provide qualitative insights. We agree that future work could incorporate structured evaluation with ED specialists.

 

 

Comments 17: Fails to discuss limitations of model deployment (e.g., latency, integration with EHRs).

Response 17:

 

We appreciate the reviewer’s observation regarding deployment-related limitations. This study is part of a larger, long-term, funded initiative focused on enhancing ED operations through predictive modeling. As outlined in the revised Future Work section, we detail the planned deployment framework, which includes standardized data pipelines, real-time ingestion infrastructure, backup strategies for external data sources (e.g., weather APIs), model retraining, performance monitoring, and integration with user-facing dashboards. This work will not be deployed in isolation; it is one component of a broader predictive system that will integrate forecasts for multiple patient flow metrics (PFMs) to support coordinated, real-time decision-making in the ED. Deployment itself is outside the scope of this research; a dedicated implementation study will address system integration, latency, and operational impact in detail. While this study focuses on model development and evaluation, full deployment and testing in live ED settings will occur in subsequent phases of the project.

 

On page 18, the Limitations and Future Work section has been revised as shown below, with changes indicated in red.

 

 

Action 17:

 

Limitations and Future Work

 

In future work, we plan to evaluate the model in simulation-based or real-world ED settings to assess its operational effectiveness. We will also extend the framework to predict additional PFMs used in this study and integrate these models into a unified prediction platform. This system will generate real-time forecasts for multiple critical PFMs, offering a comprehensive view of anticipated ED operational status and supporting proactive resource management. The deployment will include standardized data preparation, feature engineering, and preprocessing routines, along with automated model retraining, performance monitoring, and anomaly detection to maintain accuracy and robustness in the presence of data drift. Multiple predictive modules—each targeting a specific PFM—will run concurrently, coordinated by a central orchestration layer to ensure data consistency and manage dependencies. To further enhance reliability, we will establish infrastructure for real-time data ingestion and implement fallback strategies for external data sources (e.g., weather APIs). For instance, if the primary source becomes unavailable or returns anomalous data, the system will automatically switch to a secondary provider and validate consistency. Additionally, user-facing dashboards will be developed to integrate seamlessly with existing ED workflows and support actionable, data-driven decision-making as part of the full deployment process.
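As a purely illustrative sketch of the fallback idea described above (endpoints, field names, and plausibility thresholds are placeholders, not part of the planned system):

```python
import requests

PRIMARY_URL = "https://primary.example.com/weather"      # placeholder endpoint
SECONDARY_URL = "https://secondary.example.com/weather"  # placeholder endpoint

def fetch_weather(timeout: float = 5.0) -> dict:
    """Try the primary weather provider first; fall back to the secondary one
    if the primary is unreachable or returns implausible values."""
    for url in (PRIMARY_URL, SECONDARY_URL):
        try:
            resp = requests.get(url, timeout=timeout)
            resp.raise_for_status()
            data = resp.json()
            temp = data.get("temperature_c")
            # Simple consistency check before accepting the response.
            if temp is not None and -60.0 <= temp <= 60.0:
                return data
        except (requests.RequestException, ValueError):
            continue
    raise RuntimeError("No weather provider returned valid data")
```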

 

 

 

Comments 18: No assessment of external validity or transferability across hospitals.

Response 18:

 

We thank the reviewer for raising this important point. The revised Limitations and Future Work section addresses the issue of external validity and model transferability. Specifically, we note that while our approach—based on aggregate operational data—supports broader generalizability, the predictive model must still be retrained on data from each individual ED to ensure local applicability. We also explain that variations in patient flow structures, healthcare policies, and data infrastructure across institutions, particularly outside the United States, may limit direct transferability. These considerations underscore the need for site-specific adaptations and retraining prior to deployment in different ED settings.

 

 

 

 

Action 18:

 

Limitations and Future Work

 

One primary limitation is that this study did not evaluate how the predictions would perform in a real-world operational or simulation-based setting. Although the model provides accurate forecasts at 6-, 8-, 10-, and 12-hour intervals, its practical value—particularly whether these windows allow sufficient lead time for effective intervention—remains untested. Beyond operational considerations, the study also faces some technical limitations related to model design and training. Despite an extensive hyperparameter search, alternative or more comprehensive strategies could yield better-performing combinations. Likewise, exploring different feature set configurations within the datasets could further enhance predictive performance. While explainability approaches (e.g., Gradient SHAP, attention weight visualization) provide insights into feature importance, these methods cannot guarantee full transparency of the model’s decision-making, especially in cases involving complex nonlinear relationships between inputs and outputs. From a deployment perspective, our approach offers broad generalizability and potential adoption in diverse ED settings because it relies on aggregate operational data rather than patient-level records. It assumes clearly defined waiting and treatment areas with accurate timestamp tracking; without these, the necessary feature engineering for PFMs would not be possible. The model must also be retrained for each site to ensure optimal performance, as healthcare policies, workflows, and patient flow patterns vary considerably between institutions, especially outside the United States. For example, some hospitals use fast-track systems for low-acuity patients, while others manage all patients through a unified treatment stream. Another example is that pediatric EDs follow different operational models than adult EDs.

 

 

 

 

 

Comments 19: Too brief given the breadth of the study.

Response 19:

 

We thank the reviewer for this comment. In response, we have significantly expanded the manuscript to provide a more comprehensive and detailed presentation of our methodology, experimental design, model evaluation, and implementation context. Thanks to the reviewers’ constructive feedback, the revised version offers a thorough account of our approach and its implications, addressing the breadth and depth expected of a study in this domain.

 

 

Comments 20: Lacks concrete future work suggestions (e.g., simulation-based impact studies, clinical integration).

Response 20:

 

We appreciate the reviewer’s suggestion. In response, we have created a dedicated Limitations and Future Work section that outlines concrete next steps, including simulation-based impact evaluations, development of a unified prediction platform, real-time deployment architecture, and integration plans aligned with clinical operations. These updates provide a clearer roadmap for translating our predictive models into practical decision-support tools.

 

Action 20:

 

Limitations and Future Work

 

This study presents a predictive approach for ED operations, but like any data-driven method, it has limitations and opportunities for future improvement. While the model demonstrates strong performance using aggregate operational data, further developments are necessary to optimize its applicability across diverse healthcare settings.

One primary limitation is that this study did not evaluate how the predictions would perform in a real-world operational or simulation-based setting. Although the model provides accurate forecasts at 6-, 8-, 10-, and 12-hour intervals, its practical value—particularly whether these windows allow sufficient lead time for effective intervention—remains untested. Beyond operational considerations, the study also faces some technical limitations related to model design and training. Despite an extensive hyperparameter search, alternative or more comprehensive strategies could yield better-performing combinations. Likewise, exploring different feature set configurations within the datasets could further enhance predictive performance. While explainability approaches (e.g., Gradient SHAP, attention weight visualization) provide insights into feature importance, these methods cannot guarantee full transparency of the model’s decision-making, especially in cases involving complex nonlinear relationships between inputs and outputs. From a deployment perspective, our approach offers broad generalizability and potential adoption in diverse ED settings because it relies on aggregate operational data rather than patient-level records. It assumes clearly defined waiting and treatment areas with accurate timestamp tracking; without these, the necessary feature engineering for PFMs would not be possible. The model must also be retrained for each site to ensure optimal performance, as healthcare policies, workflows, and patient flow patterns vary considerably between institutions, especially outside the United States. For example, some hospitals use fast-track systems for low-acuity patients, while others manage all patients through a unified treatment stream. Another example is that pediatric EDs follow different operational models than adult EDs.

In future work, we plan to evaluate the model in simulation-based or real-world ED settings to assess its operational effectiveness. We will also extend the framework to predict additional PFMs used in this study and integrate these models into a unified prediction platform. This system will generate real-time forecasts for multiple critical PFMs, offering a comprehensive view of anticipated ED operational status and supporting proactive resource management. The deployment will include standardized data preparation, feature engineering, and preprocessing routines, along with automated model retraining, performance monitoring, and anomaly detection to maintain accuracy and robustness in the presence of data drift. Multiple predictive modules—each targeting a specific PFM—will run concurrently, coordinated by a central orchestration layer to ensure data consistency and manage dependencies. To further enhance reliability, we will establish infrastructure for real-time data ingestion and implement fallback strategies for external data sources (e.g., weather APIs). For instance, if the primary source becomes unavailable or returns anomalous data, the system will automatically switch to a secondary provider and validate consistency. Additionally, user-facing dashboards will be developed to integrate seamlessly with existing ED workflows while supporting actionable, data-driven decision-making as part of the full deployment process.

 

 

 

Comments 21: Reduce the similarity index, which is currently about 52% (Turnitin). Minor typos: "Higly Extreme" (should be "Highly Extreme").

Response 21:

 

We thank the reviewer for pointing this out. The similarity index is primarily due to overlap with our own preprint version of this study, which is publicly available on arXiv. Additionally, the noted typo (“Higly Extreme”) has been corrected to “Highly Extreme.”

 

Comments 22: Repetitive phrasing in some sections (e.g., Dataset 4 description repeated).

Response 22:

 

Thank you for bringing this to our attention. We have addressed the repetitive phrasing in the manuscript.