Article
Peer-Review Record

Modelling the Presence of Smokers in Households for Future Policy and Advisory Applications

by David Moretón Pavón 1,2, Sandra Rodríguez-Sufuentes 2, Alicia Aguado 2, Rubèn González-Colom 3, Alba Gómez-López 3, Alexandra Kristian 4, Artur Badyda 5, Piotr Kepa 5, Leticia Pérez 6 and Jose Fermoso 2,*
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Submission received: 30 June 2025 / Revised: 25 September 2025 / Accepted: 29 September 2025 / Published: 7 October 2025

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

After reading the manuscript intended for publication in the journal "Air", a significant missing ethical factor was identified: the voluntary consent of smoking households to participate in the study.

Lines 40–45: These are private households, which introduces a major conflict when attempting to implement regulations or policies, potentially violating human rights regarding conduct within private homes.

For decades, television and other media have communicated the harms of smoking, gradually increasing the proportion of the population that avoids the problem. Nevertheless, the article does not analyze this type of intervention to reduce the number of smokers, despite its stated aim of informing regulations and new policies.

Table 4 presents results from various statistical methods used to build predictive models. Later (lines 335‑342) it states that the XGBoost method yields the best results. However, the manuscript does not analyze why this model performs better. In other words, both the Discussion and Conclusions sections merely report the numerical outcomes without explaining them or examining trends across the predictive models. The lack of discussion about the results diminishes the scientific value of the work.

Overall, the Introduction frames smokers in private homes as affecting public health, particularly people with COPD, asthma, and similar conditions. Yet this is the only mention of POSSIBLE public‑health impact: the manuscript does not examine potential relationships between smokers in private homes and the health of “passive smokers,” namely other family members who might experience health effects from inhaling smoke indoors.

Author Response

Comment 1.1:
A significant missing ethical factor: the voluntary consent of smoking households to participate in the study.

Response 1.1:
Thank you for this observation. We have clarified in the revised manuscript that all participants signed informed consent forms prior to the beginning of any monitoring activity, and were informed that they could withdraw at any time. This is now explicitly stated in Section 2.8 (Ethical Statement), along with the ethical approval granted by the Hospital Clínic de Barcelona (Reference HCB/2023/0126). This information reinforces that participation was entirely voluntary and conducted in accordance with ethical standards and the Declaration of Helsinki.

Comment 1.2:
Lines 40–45: These are private households, which introduces a major conflict when attempting to implement regulations or policies, potentially violating human rights regarding conduct within private homes.

Response 1.2:
We appreciate the reviewer’s concern and have addressed this in the Ethical Statement (Section 2.8) and Discussion (Section 4). We clarified that the study does not seek to judge individual behaviour or promote enforcement in private settings. Rather, it aims to explore methods for identifying indoor pollution sources in a non-invasive and anonymised way, to inform public health strategies, particularly for protecting vulnerable populations. This clarification helps ensure the study is framed in terms of preventive health, not regulation or intrusion.

Comment 1.3:
The article does not analyze media or public awareness interventions to reduce smoking, despite its aim of informing regulations and new policies.

Response 1.3:
Thank you. While the focus of our study is methodological and limited to detecting environmental exposure using sensor data, we recognise the value of broader behavioural and media-based interventions. We have now included a brief mention in the Discussion noting that machine learning approaches may complement—not replace—existing strategies such as education and media campaigns. See Discussion, paragraph 4.

Comment 1.4:
Table 4 presents results from various statistical methods used to build predictive models. Later (lines 335–342) it states that the XGBoost method yields the best results. However, the manuscript does not analyze why this model performs better. The lack of discussion about the results diminishes the scientific value of the work.

Response 1.4:
We fully agree with this important observation. We have substantially revised the Discussion section (Section 4) to include a more detailed reflection on model performance, including possible reasons why SVM and XGBoost outperformed other classifiers. We also note the limitations of the current explanation and state that additional analysis (e.g. variable interactions, learning curves) will be explored in future work. These changes are located on page 14, paragraph 3 and 4.
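
For illustration, a minimal R sketch of the kind of gain-based feature-importance inspection referred to here is shown below; the simulated data, variable names, and hyperparameters are placeholders and do not reproduce the code used in the study.

    library(xgboost)

    set.seed(1)
    # Toy stand-in for the record-level sensor data (variable names are illustrative)
    X <- matrix(rnorm(500 * 5), ncol = 5,
                dimnames = list(NULL, c("PM2.5", "CO2", "Temperature", "RH", "TVOC")))
    y <- rbinom(500, 1, plogis(X[, "PM2.5"]))   # label driven mainly by PM2.5

    bst <- xgb.train(
      params = list(objective = "binary:logistic", eval_metric = "auc",
                    max_depth = 4, eta = 0.1),
      data = xgb.DMatrix(data = X, label = y),
      nrounds = 200,
      verbose = 0
    )

    # Gain-based importance: how much each feature contributes to the model's splits
    print(xgb.importance(feature_names = colnames(X), model = bst))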

Comment 1.5:
The manuscript does not examine potential relationships between smokers in private homes and the health of “passive smokers,” such as other family members.

Response 1.5:
This is an excellent point. In response, we have added a paragraph to the Discussion highlighting the well-established health risks associated with passive smoking in domestic settings, including its links to asthma, COPD exacerbations, and cardiovascular effects. We also added a supporting reference (Jordan et al., 2011, BMJ Open) to strengthen this point. See Section 4, page 13, paragraph 2.

The supplementary materials provided include the R scripts used for data integration and preprocessing, a pseudocode file that outlines the full analytical pipeline from raw sensor data to final model evaluation, and the final trained SVM model object. A README file is also included to guide reproduction of the results. These materials are provided to ensure transparency and reproducibility of the modelling approach. Additionally, we include a version of the revised manuscript with tracked changes enabled, in case the editor or reviewers wish to trace the specific modifications made during the revision process.

Reviewer 2 Report

Comments and Suggestions for Authors

The development of techniques for analyzing sensor results using machine learning is important. However, I believe that there are still few published case studies in academic literature where such techniques have been applied to real-world data. This manuscript reports on an actual case in which sensor measurements were analyzed using machine learning, and I believe that the publication of such data in this journal holds value.

However, the authors should revise the manuscript based on the following points:

  1. In the abstract, “pre-diction” and “con-tribute” should be corrected to “prediction” and “contribute” (lines 23 and 26).

  2. The principle, manufacturer, and model number of the VOC sensor used should be reported.

  3. The results in Tables 2 and 3 should include units such as ppm. Also, "T" should be described as "Temperature".

  4. Does "FA" in Table 2 refer to formaldehyde? If so, how it was measured should be described in the Materials and Methods section.

  5. In Figure 2, only the label for panel (d) appears on the left side of the figure. Please align it with the labels for (a), (b), and (c), which are on the right side.

  6. A space should be inserted between Table 4 and the sentence starting on line 303.

  7. In line 324, “Section 3.2” seems to be a mistake. Should it be “Section 3.4”?

  8. Tables 6 and 9 are split across two pages. Please ensure that each table appears entirely on a single page.

Author Response

Comment 2.1:
In the abstract, “pre-diction” and “con-tribute” should be corrected to “prediction” and “contribute” (lines 23 and 26).

Response 2.1:
Thank you for noticing this. We have corrected the hyphenation issues in the abstract and throughout the document.

Comment 2.2:
The principle, manufacturer, and model number of the VOC sensor used should be reported.

Response 2.2:
This information has been added to Section 2.3 (Monitoring and Data Collection). Specifically, we now mention that VOCs and formaldehyde were measured using the Sensirion SFA30 sensor, which is based on an electrochemical principle. We also provide similar details for the CO₂ and PM sensors. See lines [~125–140] of the revised manuscript.

Comment 2.3:
The results in Tables 2 and 3 should include units such as ppm. Also, "T" should be described as "Temperature".

Response 2.3:
We have revised Tables 2 and 3 to include units for each variable (e.g., CO₂ in ppm, PM in µg/m³, etc.). We have also changed the label "T" to “Temperature (°C)” in the tables and throughout the text to improve clarity.

Comment 2.4:
Does "FA" in Table 2 refer to formaldehyde? If so, how it was measured should be described in the Materials and Methods section.

Response 2.4:
Yes, “FA” refers to Formaldehyde. This has now been clarified in the text at first mention and in Table 2. We have also added the measurement method (electrochemical sensor, SFA30) in Section 2.3.

Comment 2.5:
In Figure 2, only the label for panel (d) appears on the left side of the figure. Please align it with the labels for (a), (b), and (c), which are on the right side.

Response 2.5:
We have adjusted Figure 2 so that all panel labels (a, b, c, d) are now aligned consistently and correctly positioned.

Comment 2.6:
A space should be inserted between Table 4 and the sentence starting on line 303.

Response 2.6:
This formatting issue has been corrected. Thank you for pointing it out.

Comment 2.7:
In line 324, “Section 3.2” seems to be a mistake. Should it be “Section 3.4”?

Response 2.7:
Correct. This has been updated to refer to the correct section number (“Section 3.4”) in the revised text.

Comment 2.8:
Tables 6 and 9 are split across two pages. Please ensure that each table appears entirely on a single page.

Response 2.8:
We have reformatted Tables 6 and 9 so that they appear completely on a single page each, ensuring better readability and compliance with journal standards.


Reviewer 3 Report

Comments and Suggestions for Authors

Recommendation: Major revision

The following recommendations are put forward for further improving the design and scientific quality of the paper before the editorial board may consider it for publication.

This article describes the development of a machine learning model to identify households with smokers based on indoor air quality (IAQ) data. Using low-cost sensors, the researchers collected data from 129 homes in Spain and Austria, including measurements of PM2.5, CO₂, temperature, humidity, and total VOCs. The final model, based on the XGBoost algorithm, demonstrated near-perfect accuracy in classifying households, reaching 100% in internal validation. The results indicate that PM2.5 is the most influential indicator of the presence of smokers, and the study highlights the potential of such tools for public health and policy-making. However, it is not common to achieve 100% internal validation (e.g., cross-validation) with a mathematical model like XGBoost, and this is usually a sign of overfitting.

  1. Can the authors confirm that the sensors were working properly and well calibrated?
  2. In some paragraphs, there is spacing between them, while in others there is continuity, giving the impression of a lack of attention to design.
  3. In general, the paragraphs are not well indented; the first one in each section is different, but the rest are aligned with the paragraph width.
  4. Line 263: Wrong position of a sentence.
  5. Tables 2 and 3 are poorly structured and are neither visually appealing nor easy to read.
  6. As much as possible, tables should be kept on a single page and not split across two pages.
  7. It is not normal to achieve 100% internal validation (e.g., cross-validation) with a mathematical model like XGBoost, and it is usually a sign of overfitting.
  8. The ROC curve shown appears unusually perfect, which raises concerns. The curve is nearly square-shaped, indicating near-perfect sensitivity and specificity — suggesting an AUC close to 1.0. While this could mean the model performs exceptionally well, it is more likely a sign of overfitting, data leakage, or evaluation on the training set rather than on unseen data. It’s important to confirm that the model was validated properly using a separate test set or cross-validation to ensure it generalizes well and that there is no information leakage from the features to the target.
  9. A household-level classification accuracy of 100% never occurs in real settings, which raises concerns that something might be wrong.
  10. It would be useful to know the reproducibility of the experiments.
  11. Conclusion section is missing.
  12. Delete lines 475 and 476 if they don’t add anything.
  13. Please check the formatting of the references; they should all follow the same format.

In order for the work to be published, in addition to the issues already raised, overfitting must be reconsidered, and all calculations should be reviewed and repeated to avoid it.

Author Response

Comment 3.1:
The model reaches 100% in internal validation, which is unusual and suggests potential overfitting. Please confirm that cross-validation was properly implemented and that the model generalizes well.

Response 3.1:
We fully agree with the reviewer. In the revised manuscript, we have re-expressed model performance results to clarify that 100% accuracy was only achieved on the training set, and that cross-validation and external validation were used to properly evaluate generalization. The Discussion section now includes a reflection on potential overfitting risks, and the best-performing model (SVM) achieved 83% accuracy at the household level under real validation conditions. See Section 4, paragraph 3 (lines ~480–495).

Comment 3.2:
The ROC curve appears nearly perfect and may suggest overfitting, data leakage, or improper validation.

Response 3.2:
Thank you. We have carefully checked the modelling process and confirmed that the ROC curve shown corresponds to the cross-validated performance, not the training set. However, we acknowledge the need for transparency and have added clarification in Section 3.4 (Model evaluation) and in the Discussion to highlight that even high AUC values can mask overfitting if not interpreted carefully. We also emphasize that a separate external validation set was used to further test generalizability.

Comment 3.3:
Household-level classification accuracy of 100% is unrealistic in real scenarios. Please review and clarify.

Response 3.3:
We agree. The manuscript has been revised to clarify that 100% household-level accuracy was not sustained in external validation, and that the final SVM model achieved ~83% accuracy when tested on unseen data. We removed any misleading wording that could imply overconfident performance. See updated Results (Section 3.4) and Discussion (Section 4).
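
A minimal R sketch of the household-level aggregation idea is shown below; the household identifiers, probabilities, and the 0.5 decision threshold are invented for illustration and do not reproduce the manuscript's exact rule.

    set.seed(1)
    # One predicted probability per sensor record, with its household and true label
    records <- data.frame(
      household_id = rep(c("H01", "H02", "H03"), times = c(40, 35, 50)),
      prob_smoker  = c(runif(40, 0.6, 0.9), runif(35, 0.1, 0.4), runif(50, 0.5, 0.8)),
      true_smoker  = rep(c(1, 0, 1), times = c(40, 35, 50))
    )

    # Aggregate to one mean probability (and one decision) per household
    agg <- aggregate(cbind(prob_smoker, true_smoker) ~ household_id,
                     data = records, FUN = mean)
    agg$pred_smoker <- as.integer(agg$prob_smoker >= 0.5)   # illustrative threshold

    mean(agg$pred_smoker == agg$true_smoker)   # household-level accuracy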

Comment 3.4:
Can the authors confirm that the sensors were functioning correctly and properly calibrated?

Response 3.4:
Yes. We have added this information in Section 2.3, specifying that the sensors were factory-calibrated and underwent harmonization across homes before deployment. Devices were tested prior to installation to ensure reliable operation.

Comment 3.5:
Reproducibility of the experiments should be addressed.

Response 3.5:
We agree with the importance of reproducibility. In response, we have included a Supplementary Materials section with the relevant scripts, pseudocode, and trained model. A README file has been added to facilitate replication of the results using the provided code and data structure. This is now mentioned in the manuscript (Section 5) and described in detail in the response letter.

Comment 3.6:
Formatting issues: inconsistent paragraph spacing, indentation, and alignment.

Response 3.6:
All formatting inconsistencies have been corrected. Paragraph spacing, indentation, and alignment have been made consistent throughout the manuscript.

Comment 3.7:
Line 263: Wrong position of a sentence.

Response 3.7:
This issue has been corrected and the structure of the section has been revised accordingly.

Comment 3.8:
Tables 2 and 3 are poorly structured, and some tables (e.g. Tables 6 and 9) are split across two pages.

Response 3.8:
We have reformatted Tables 2 and 3 to improve visual structure and readability. We have also ensured that all tables are contained on single pages, avoiding page splits.

Comment 3.9:
Conclusion section is missing.

Response 3.9:
We have added a Conclusion section (Section 5) that summarises the key findings, discusses implications for public health and indoor exposure monitoring, and outlines future work directions.

Comment 3.10:
Delete lines 475–476 if not relevant.

Response 3.10:
Lines 475–476 have been revised and shortened to ensure clarity and focus.

Comment 3.11:
Check reference formatting.

Response 3.11:
All references have been reviewed and revised to comply with the journal’s required citation style. Formatting has been unified.


Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

Thank you for taking the observations and comments into account.

Author Response

Thanks a lot for your feedback.

Best wishes

Jose

Reviewer 3 Report

Comments and Suggestions for Authors

Dear authors, I still find the results obtained very unusual, as it is not common to achieve 100% validation in the training set.

Striking aspects in Figure 3:

  • Stepped and abrupt shape
    The curve rises almost vertically from (1,0) to (1, ~0.85) and then remains almost flat.

Normally, in a realistic model ROC curve, one would expect a smoother curve that increases progressively. This suggests an issue in the initial calculations, which should be redone.

Therefore, I propose to reject this article.

Author Response

Reviewer comment:
“I still find the results obtained very unusual, as it is not common to achieve 100% validation in the training set.”

Response:
We understand the reviewer’s concern. After careful review of the analysis and code, we confirm that no data leakage occurred. However, we agree that the earlier presentation, based on a single test partition, could lead to the perception of overfitting or unrealistic performance. Therefore, we have revised the final model evaluation to rely on 100 iterations of stratified Monte Carlo cross-validation, which better reflects the model’s generalisation capacity across varying data splits. This approach provides a distribution of AUC scores instead of a single value, avoiding the overemphasis on one favorable result.
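
A condensed R sketch of this evaluation scheme is given below; the simulated data frame, SVM settings, and 80/20 split are illustrative placeholders rather than the exact pipeline provided in the supplementary scripts.

    library(caret)   # createDataPartition: stratified random splits
    library(e1071)   # svm
    library(pROC)    # roc, auc

    set.seed(42)
    # Toy stand-in for the monitored records (illustrative feature names and values)
    df <- data.frame(
      pm25   = c(rnorm(300, 25, 8), rnorm(300, 8, 3)),
      co2    = rnorm(600, 700, 150),
      tvoc   = c(rnorm(300, 400, 120), rnorm(300, 250, 90)),
      smoker = factor(rep(c("yes", "no"), each = 300))
    )

    n_iter     <- 100
    roc_list   <- vector("list", n_iter)
    auc_values <- numeric(n_iter)

    for (i in seq_len(n_iter)) {
      idx   <- createDataPartition(df$smoker, p = 0.8, list = FALSE)  # stratified split
      train <- df[idx, ]
      test  <- df[-idx, ]

      fit  <- svm(smoker ~ ., data = train, kernel = "radial", probability = TRUE)
      prob <- attr(predict(fit, test, probability = TRUE), "probabilities")[, "yes"]

      roc_list[[i]] <- roc(test$smoker, prob, quiet = TRUE)
      auc_values[i] <- as.numeric(auc(roc_list[[i]]))
    }

    summary(auc_values)                       # distribution of AUC across 100 splits
    quantile(auc_values, c(0.025, 0.975))     # empirical 95% interval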

This change is reflected in the revised Figure 3, which now shows the average ROC curves and variability in model performance, and is explained in the updated Section 3.4 (Model robustness evaluation).

Reviewer comment:
“Striking aspects in Figure 3. Stepped and abrupt shape. The curve rises almost vertically from (1,0) to (1, ~0.85) and then remains almost flat. Normally, in a realistic model ROC curve, one would expect a smoother curve that increases progressively. This suggests an issue in the initial calculations, which should be redone.”

Response:
We appreciate this observation. The unusual shape of the original ROC curve was the result of evaluating performance on a single random test set, which in our case consisted of a large number of highly similar time-series records from a limited number of households. This led to the classification of many "easy" cases with high confidence, producing an abrupt curve. We agree that this representation was suboptimal and have replaced it with a more realistic and statistically grounded evaluation based on repeated cross-validation.

In the revised manuscript:

  • We provide aggregated ROC curves and AUC distributions over 100 iterations.

  • We explain the reasons behind the original abrupt shape and highlight the importance of evaluating variability in model performance across samples.

These updates are now shown in Figure 3 and discussed in Section 3.4 and the Discussion section.
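
Continuing the sketch above, the per-iteration ROC curves stored in roc_list can be averaged onto a common false-positive-rate grid as follows; the grid resolution and quantile bands are arbitrary illustrative choices.

    library(pROC)

    # roc_list: one pROC::roc object per Monte Carlo iteration (collected above)
    fpr_grid <- seq(0, 1, by = 0.01)

    tpr_matrix <- sapply(roc_list, function(r) {
      approx(x = 1 - r$specificities, y = r$sensitivities,
             xout = fpr_grid, ties = max, rule = 2)$y
    })

    mean_tpr <- rowMeans(tpr_matrix)
    lower    <- apply(tpr_matrix, 1, quantile, probs = 0.025)
    upper    <- apply(tpr_matrix, 1, quantile, probs = 0.975)

    plot(fpr_grid, mean_tpr, type = "l", xlab = "False positive rate",
         ylab = "True positive rate", main = "Average ROC over 100 splits")
    lines(fpr_grid, lower, lty = 2)
    lines(fpr_grid, upper, lty = 2)
    abline(0, 1, lty = 3)   # chance diagonal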

Reviewer comment:
“Therefore, I propose to reject this article.”

Response:
We regret that the original submission led to this conclusion and appreciate the opportunity to revise. We believe that the improved methodology, based on a rigorous evaluation protocol, clearer discussion of the model's limitations, and more nuanced interpretation of the results, addresses the reviewer’s core concerns. We hope that the revised version demonstrates the scientific integrity of our work and its contribution to the field of indoor air quality and data-driven health assessment.

Author Response File: Author Response.pdf

Round 3

Reviewer 3 Report

Comments and Suggestions for Authors

Dear authors,

Please be advised that for the article to be published, you are required to provide the final PDF once the requested changes have been accepted, with these changes highlighted in yellow, rather than in the current format. Acceptance will be subject to the correction and approval of the proposed changes, ensuring that the article conforms to the format and standards of this journal.

Author Response

Thanks a lot for your corrections and suggestions.

Best wishes

Jose

Author Response File: Author Response.pdf
