Systematic Review
Peer-Review Record

Drowsiness Detection in Drivers: A Systematic Review of Deep Learning-Based Models

Appl. Sci. 2025, 15(16), 9018; https://doi.org/10.3390/app15169018
by Tiago Fonseca 1 and Sara Ferreira 2,*
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3:
Submission received: 12 July 2025 / Revised: 11 August 2025 / Accepted: 12 August 2025 / Published: 15 August 2025

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

Overall assessment.

The peer-reviewed article is a systematic review of current research on driver drowsiness detection using deep learning models. The research is relevant and clearly structured. Although the review is only six paragraphs long, the authors manage to present an overview spanning different countries (even the reference to 1979 can be seen as demonstrating the longevity of the problem). The problem statement is formulated correctly.

 

Positive aspects

  1. Relevance of the topic: Driver drowsiness is a global problem that significantly affects road safety. The analysis of deep learning-based tools is extremely timely.
  2. Methodological transparency: The authors adhere to PRISMA and provide a PROSPERO registration number, which confirms the careful preparation of the review protocol.
  3. Scope and depth of analysis: 81 included studies from different countries, indicating model architectures, data sources, data types, and accuracy indicators; this is an impressive volume for a review work.
  4. Statistical description: Summarized values (medians, IQR, SD) for accuracy, F1-score, AUC are provided, allowing the reader to assess the typical performance of DL-models.
  5. Challenge analysis: A critical review of technical, ethical, and practical limitations is made, which is extremely important for future applied research.

 

The article provides a complete answer to all four research questions (RQ1–RQ4).

RQ1: The paper describes in detail the use of various DL architectures; Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs) (in particular LSTM), hybrid models, and Transformer approaches predominate. Statistics on the frequency of use of each architecture and the context of their use are provided (Section 3.3.1).
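A minimal, hypothetical sketch of the kind of hybrid CNN-LSTM classifier this category refers to is given below; the input shape, layer sizes, and two-class output are illustrative assumptions, not taken from the manuscript or any included study.

    # Hybrid CNN-LSTM drowsiness classifier (illustrative sketch only).
    import torch
    import torch.nn as nn

    class CnnLstmDrowsinessNet(nn.Module):
        def __init__(self, num_classes: int = 2, hidden_size: int = 128):
            super().__init__()
            # CNN part: per-frame spatial feature extractor.
            self.cnn = nn.Sequential(
                nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
                nn.MaxPool2d(2),
                nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1),          # -> (batch*frames, 32, 1, 1)
            )
            # LSTM part: temporal model over the per-frame features.
            self.lstm = nn.LSTM(input_size=32, hidden_size=hidden_size, batch_first=True)
            self.head = nn.Linear(hidden_size, num_classes)

        def forward(self, clips: torch.Tensor) -> torch.Tensor:
            # clips: (batch, frames, channels, height, width)
            b, t, c, h, w = clips.shape
            feats = self.cnn(clips.reshape(b * t, c, h, w)).reshape(b, t, 32)
            _, (last_hidden, _) = self.lstm(feats)
            return self.head(last_hidden[-1])     # logits: (batch, num_classes)

    # Usage: a batch of 4 clips, 16 frames each, 64x64 RGB.
    logits = CnnLstmDrowsinessNet()(torch.randn(4, 16, 3, 64, 64))
    print(logits.shape)                           # torch.Size([4, 2])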

RQ2: The authors provided aggregated performance indicators (medians):

  • Accuracy ≈ 0.95
  • F1-score ≈ 0.95
  • Precision, recall, and AUC-ROC are also high. In addition, the differences between results obtained in simulated and real conditions, error handling (false positives/negatives), and adaptation to different demographic groups and driving scenarios are discussed (Sections 3.4.1–3.4.4).
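The aggregation referred to above (medians, IQR, SD pooled across studies) can be illustrated with a short sketch; the per-study accuracy values below are made-up placeholders, not results from the included studies.

    # Pooling hypothetical per-study accuracies into a median, IQR, and SD.
    import statistics

    accuracies = [0.91, 0.93, 0.95, 0.96, 0.97, 0.98]   # placeholder per-study scores

    median = statistics.median(accuracies)
    q1, _, q3 = statistics.quantiles(accuracies, n=4)   # quartile cut points; IQR = Q3 - Q1
    sd = statistics.stdev(accuracies)

    print(f"median={median:.3f}, IQR={q3 - q1:.3f}, SD={sd:.3f}")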

RQ3: Public (NTHU-DDD, YawDD, etc.) and private datasets are described. The following modalities are considered:

  • Behavioral (face, eye images)
  • Physiological (EEG, ECG)
  • Vehicle data (steering wheel, pedals)
  • Multimodal combinations. Also, emphasis is placed on insufficient representativeness (age, gender, ethnicity) and on limited standardization and openness of data (Sections 3.5.1–3.5.3).

RQ4: Section 3.6 clearly outlines four types of problems:

  • Technical: overfitting, weak generalizability, small or unrepresentative datasets.
  • Implementation: difficulty of embedding models in real transportation systems.
  • Ethical and privacy: use of biometrics, video surveillance.
  • Methodological: lack of uniform protocols, comparable metrics, and openly reproducible solutions.

 

Although the article is informative and thorough, its considerable volume and text density may make it difficult for readers to grasp the key results.

I recommend that authors supplement the Discussion or Conclusions section with one or more summary tables or graphs that would visualize the main characteristics of the models, data types, and other comparative aspects. This approach will enhance the perception of the review and enable readers to navigate the generalized conclusions more efficiently.

In addition, it would be advisable to indicate which model the authors consider to be the most effective based on the results of the analysis, or present a rating of the models (from best to worst).

It is also worth summarizing in the Discussion or Conclusions which optimality criteria were key for the authors when evaluating the models (for example: accuracy, inference time, need for computational resources, adaptability to different conditions, etc.).

Publish after minor revision.

Author Response

Reviewer 1 - Comment 1:

"Although the article is informative and thorough, its considerable volume and text density may make it difficult for readers to grasp the key results.

I recommend that authors supplement the Discussion or Conclusions section with one or more summary tables or graphs that would visualize the main characteristics of the models, data types, and other comparative aspects. This approach will enhance the perception of the review and enable readers to navigate the generalized conclusions more efficiently."

Response to Comment 1:

We appreciate the reviewer’s suggestion and recognize the importance of presenting the key findings in a clear and accessible manner. The manuscript already includes three supplementary tables in the Appendix (Tables A1–A3) that were specifically designed for this purpose. Table A1 summarizes each study’s authorship, country of origin, driving context, model type, and inference mode. Table A2 compiles the reported performance metrics, and Table A3 presents details on datasets, data types, technical challenges, and recommendations for improving robustness and applicability. Together, these tables provide readers with a concise overview of the models, data types, and comparative aspects discussed in the review, thereby complementing the narrative in the Discussion and Conclusions.

Reviewer 1 - Comment 2:

"In addition, it would be advisable to indicate which model the authors consider to be the most effective based on the results of the analysis, or present a rating of the models (from best to worst)."

Response to Comment 2:

We thank the reviewer for this suggestion and agree that identifying the most effective model could be of interest to readers. However, as noted in Sections 3.4.1 and 3.4.2, the considerable methodological heterogeneity among the included studies—particularly in datasets, evaluation protocols, and reported metrics—precludes a fair and meaningful direct ranking. Performance outcomes vary depending on the testing context, with median accuracy and F1-scores differing between simulated and real-world evaluations (Table 3), and many studies reporting only a subset of key metrics (Table 2). For these reasons, and in line with systematic review best practices, we did not assign an overall best model or a ranked list. Instead, our analysis highlights models that achieved consistently high performance within comparable contexts, discussing the architectural features, data types, and evaluation conditions associated with these results. We believe this approach provides balanced, actionable insights without risking misleading conclusions due to variations in study design and reporting.

Reviewer 1 - Comment 3:

"It is also worth summarizing in the Discussion or Conclusions which optimality criteria were key for the authors when evaluating the models (for example: accuracy, inference time, need for computational resources, adaptability to different conditions, etc.)."

Response to Comment 3:

We thank the reviewer for this valuable suggestion. As outlined in Section 3.4.1, the evaluation of models in this review was based on five key performance metrics: accuracy, precision, recall, F1-score, and AUC-ROC. These metrics were chosen because they are the most frequently reported across the included studies and collectively provide a comprehensive view of predictive performance. Accuracy offers an overall measure of correct classifications, precision indicates the proportion of correctly identified drowsiness cases among all positive predictions, recall measures the ability to detect true positive cases, F1-score balances precision and recall—particularly useful under class imbalance—and AUC-ROC evaluates the model’s ability to discriminate between classes across varying thresholds. The aggregated values for these metrics are presented in Table 2, with detailed results for each study provided in Appendix Table A2.
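As an illustration of how these five metrics are typically computed, the following sketch uses scikit-learn on hypothetical labels and probabilities; it is not the authors' code, and the arrays are placeholders.

    # Computing accuracy, precision, recall, F1-score, and AUC-ROC on hypothetical data.
    from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                                 f1_score, roc_auc_score)

    y_true = [1, 0, 1, 1, 0, 0, 1, 0]                    # 1 = drowsy, 0 = alert (placeholder labels)
    y_prob = [0.9, 0.2, 0.7, 0.4, 0.1, 0.6, 0.8, 0.3]    # predicted drowsiness probabilities
    y_pred = [int(p >= 0.5) for p in y_prob]             # hard labels at a 0.5 threshold

    print("accuracy :", accuracy_score(y_true, y_pred))
    print("precision:", precision_score(y_true, y_pred))
    print("recall   :", recall_score(y_true, y_pred))
    print("F1-score :", f1_score(y_true, y_pred))
    print("AUC-ROC  :", roc_auc_score(y_true, y_prob))   # threshold-independent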

Reviewer 2 Report

Comments and Suggestions for Authors

This systematic review provides a comprehensive analysis of deep learning (DL) models for detecting driver drowsiness, a significant contributor to road traffic crashes. The review evaluates the performance, contexts of application, and implementation challenges of DL-based drowsiness detection systems, adhering to the PRISMA 2020 guidelines. It includes 81 peer-reviewed empirical studies published between 2015 and 2025, focusing on models that use behavioral, physiological, or multimodal inputs. However, the authors are advised to address the following comments and suggestions when preparing the revised version:

  1. As a review paper, I believe the traces of LLM are too pronounced, making it seem far from being written by a human. Moreover, during the analysis of research directions, there is a glaring omission: no references are cited, nor is there any commentary on the specific contributions of each included paper to these directions. This is simply a critical flaw.
  2. Many studies lack detailed descriptions of their model architectures, hyperparameters, and preprocessing steps, making it difficult to reproduce the results. This hampers the ability to compare different approaches effectively.
  3. The absence of standardized benchmarks and datasets makes it challenging to compare the performance of different models. The review highlights the need for public, large-scale, and diverse datasets to facilitate fair and consistent evaluations.
Comments on the Quality of English Language

/NA

Author Response

Reviewer 2 – Comment 1:

"As a review paper, I believe the traces of LLM are too pronounced, making it seem far from being written by a human. Moreover, during the analysis of research directions, there is a glaring omission: no references are cited, nor is there any commentary on the specific contributions of each included paper to these directions. This is simply a critical flaw."

Response to Comment 1:

We thank the reviewer for these valuable observations. In response, we have carefully revised the “Future research directions” section (Section 4.4) to improve the flow and make the writing feel more natural, avoiding overly uniform phrasing. We have also strengthened the discussion by directly linking each proposed research direction to specific studies included in the review, briefly outlining their individual contributions. These changes ensure that our recommendations are firmly grounded in the evidence presented. In addition, we removed generic statements and refined transitions to enhance clarity and coherence. We believe these revisions address the reviewer’s concerns and improve the readability of the paper.

Reviewer 2 – Comment 2:

"Many studies lack detailed descriptions of their model architectures, hyperparameters, and preprocessing steps, making it difficult to reproduce the results. This hampers the ability to compare different approaches effectively."

Response to Comment 2:

We thank the reviewer for raising this important point regarding the lack of methodological details in several primary studies. We have now addressed this in Section 4.2 by explicitly noting that, in addition to missing information on sample size, demographics, annotation procedures, and computational specifications, many studies also did not report critical details such as model architectures, hyperparameter configurations, and data preprocessing steps. We highlight that these omissions limit reproducibility and hinder the fair comparison of different approaches. This addition strengthens the discussion of the review’s limitations and aligns with the reviewer’s observation on the need for greater methodological transparency in the field.

Reviewer 2 – Comment 3:

"The absence of standardized benchmarks and datasets makes it challenging to compare the performance of different models. The review highlights the need for public, large-scale, and diverse datasets to facilitate fair and consistent evaluations."

Response to Comment 3:

We thank the reviewer for this observation and fully agree that the absence of standardized benchmarks and datasets poses a significant challenge for fair and consistent model evaluation. This point is already addressed in Section 4.4, where we emphasize the need for public, large-scale, and diverse datasets as a prerequisite for improving both generalizability and comparability of results. We believe the current text captures the reviewer’s concern and underlines its importance for future research in this field.

Reviewer 3 Report

Comments and Suggestions for Authors

This is a review paper, but the format and contents are somewhat weird. For example, the paper states, “Section 2 details the methodology, including eligibility criteria, search strategy, screening procedures, and data extraction methods. Section 3 presents the main findings, organized into subsections addressing the characteristics of included studies, evaluation contexts, datasets used, model architectures, performance metrics, and real-world feasibility. Section 4 offers an in-depth discussion of these findings, considers their implications, and identifies gaps and limitations in the existing literature. Section 5 concludes the review by summarizing its key contributions and outlining future research directions to improve the design, evaluation, and implementation of deep learning-based drowsiness detection systems.” This structure resembles that of a research paper.

In reviewing the current paper, it is important to summarize the key findings and highlight or inspire the innovations of deep learning-based models used in drowsiness detection for drivers.

The review questions presented below (introduced in the paper) are relevant, but they are not clearly reflected in the following sections of the paper:

  • RQ1: Which deep learning models are used to detect drowsiness in drivers?
  • RQ2: How precise and reliable are these deep learning models in detecting drowsiness?
  • RQ3: What types of datasets are used for training and validating deep learning models?
  • RQ4: What are the main challenges and limitations in developing deep learning-based drowsiness detection systems?

 

Author Response

Reviewer 3 – Comment:

"This is a review paper, but the format and contents are somewhat weird. For example, the paper states, 'Section 2 details the methodology…' This structure resembles that of a research paper.

In reviewing the current paper, it is important to summarize the key findings and highlight or inspire the innovations of deep learning-based models used in drowsiness detection for drivers.

The review questions presented below are relevant, but they are not clearly reflected in the following sections of the paper:

RQ1: Which deep learning models are used to detect drowsiness in drivers?
RQ2: How precise and reliable are these deep learning models in detecting drowsiness?
RQ3: What types of datasets are used for training and validating deep learning models?
RQ4: What are the main challenges and limitations in developing deep learning-based drowsiness detection systems?"

Response:

We are grateful to the reviewer for these constructive observations. We agree that ensuring a clear alignment between the research questions, the presentation of findings, and the identification of innovations is essential to strengthen the contribution of this review.

Regarding the structure, the current format follows the PRISMA 2020 guidelines for systematic reviews, which require a dedicated methodology section, a results section presenting the main findings, and a discussion that synthesizes these findings in relation to the research objectives. While this structure may resemble that of an experimental paper, it is in line with the reporting standards for systematic reviews and was adopted to ensure methodological transparency and reproducibility.

We also note the importance of explicitly linking the research questions to the subsequent sections. To improve clarity, we have revised the Results section headings to include direct references to the research questions, as follows:

  • 3.3. Deep learning models for drowsiness detection (RQ1)

  • 3.4. Model accuracy and reliability (RQ2)

  • 3.5. Datasets and data characteristics (RQ3)

  • 3.6. Challenges and limitations (RQ4)

With regard to summarizing key findings and highlighting innovations, we respectfully submit that the original Section 4.1 already meets this objective. This section consolidates the main results for each RQ and integrates them into a coherent synthesis, while also noting important developments in the field. For example, it describes advances in model architectures, the integration of multimodal data sources, approaches for deployment optimization, and initial steps toward explainable AI and personalized modeling. These elements are presented within the narrative to both reflect the state of the art and illustrate how the field is evolving toward more robust, generalizable, and deployable solutions.

We believe that, with the inclusion of RQ references in the results headings, the link between the research questions, the summarized findings, and the innovations identified will now be more apparent to readers, while preserving the systematic review standards that guided the paper’s structure.

Round 2

Reviewer 3 Report

Comments and Suggestions for Authors

No
