Modelling the Presence of Smokers in Households for Future Policy and Advisory Applications

Pavón, David Moretón; Rodríguez-Sufuentes, Sandra; Aguado, Alicia; González-Colom, Rubèn; Gómez-López, Alba; Kristian, Alexandra; Badyda, Artur; Kepa, Piotr; Pérez, Leticia; Fermoso, Jose

doi:10.3390/air3040027


by David Moretón Pavón ¹,², Sandra Rodríguez-Sufuentes ², Alicia Aguado ², Rubèn González-Colom ³, Alba Gómez-López ³, Alexandra Kristian ⁴, Artur Badyda ⁵, Piotr Kepa ⁵, Leticia Pérez ⁶ and Jose Fermoso ²,*

¹ Science Faculty, Universidad de Valladolid, 47002 Valladolid, Spain
² CARTIF Technology Center, 47151 Boecillo, Spain
³ Institut d’Investigacions Biomèdiques August Pi i Sunyer, 08036 Barcelona, Spain
⁴ Department of Environmental Health, ZPH, Medical University of Vienna, 1090 Vienna, Austria
⁵ Faculty of Environmental Engineering, Warsaw University of Technology, 00-653 Warsaw, Poland
⁶ KVELOCE, 46003 Valencia, Spain
* Author to whom correspondence should be addressed.

Air 2025, 3(4), 27; https://doi.org/10.3390/air3040027

Submission received: 30 June 2025 / Revised: 25 September 2025 / Accepted: 29 September 2025 / Published: 7 October 2025


Abstract

Identifying tobacco smoke exposure in indoor environments is critical for public health, especially in vulnerable populations. In this study, we developed and validated a machine learning model to detect smoking households based on indoor air quality (IAQ) data collected using low-cost sensors. A dataset of 129 homes in Spain and Austria was analyzed, with variables including PM_2.5, PM₁, CO₂, temperature, humidity, and total VOCs. The final model, based on the XGBoost algorithm, achieved near-perfect household-level classification (100% accuracy in the test set and AUC = 0.96 in external validation). Analysis of PM_2.5 temporal profiles in representative households helped interpret model performance and highlighted cases where model predictions revealed inconsistencies in self-reported smoking status. These findings support the use of sensor-based approaches for behavioral inference and exposure assessment in residential settings. The proposed method could be extended to other indoor pollution sources and may contribute to risk communication, health-oriented interventions, and policy development, provided that ethical principles such as transparency and informed consent are upheld.

Keywords:

indoor air quality; smoking detection; machine learning; XGBoost; exposure assessment; air pollution sources; environmental sensors; residential environments; PM_2.5; behavioral inference

1. Introduction

Indoor air quality (IAQ) is increasingly recognized as a crucial determinant of human health due to the prolonged exposure of individuals to indoor environments, where pollutant concentrations often exceed outdoor levels. Among indoor pollutants, cigarette smoke remains particularly hazardous, as it contains thousands of toxic chemicals, including particulate matter (PM_2.5), volatile organic compounds (VOCs), aldehydes, polycyclic aromatic hydrocarbons (PAHs), heavy metals, and tobacco-specific nitrosamines (TSNAs), many of which have carcinogenic or mutagenic properties [1]. Although regulatory efforts have significantly reduced smoking in public spaces, private households persist as largely uncontrolled environments where tobacco smoke exposure continues to pose substantial health risks. Given this scenario, characterizing and predicting tobacco smoke presence indoors through advanced methodologies becomes essential for informing public health policies and intervention strategies.

However, IAQ and its associated parameters exhibit complex relationships with high temporal and spatial variability. Accurately predicting how tobacco emissions affect IAQ requires tools that extend beyond deterministic approaches. In this context, machine learning models have demonstrated high reliability in identifying intricate patterns within environmental data. Liang et al. (2020) [2] showed the effectiveness of various machine learning techniques, including Support Vector Machines (SVM), Random Forest, Adaptive Boosting, and linear regression, highlighting that ensemble methods such as Stacking and AdaBoost consistently outperform simpler models in terms of predictive accuracy for hourly AQI forecasts (measured by R² and RMSE). Similarly, Fan et al. (2024) [3] utilized the XGBoost algorithm to enhance predictions of atmospheric pollutants (NO₂, SO₂, O₃, PM_2.5), achieving correlation coefficients above R = 0.95 when trained on observational data. Their study also emphasized XGBoost’s capability to accurately identify and correct biases inherent in numerical atmospheric simulation models.

Within this framework, recent efforts from our research group have contributed significant advancements in IAQ monitoring, particularly through studies involving vulnerable populations. Gómez-López et al. [4] developed an innovative protocol aimed at enhancing the management of multimorbid patients with chronic obstructive pulmonary disease (COPD) and severe asthma by integrating continuous IAQ monitoring with advanced clinical interventions and digital health solutions. As part of this ongoing clinical research, special emphasis is being placed on investigating the potential of indoor environmental data to predict exacerbation episodes in respiratory patients. Tobacco smoke exposure is one of the scenarios being explored initially, alongside broader efforts to assess how household IAQ influences patient outcomes. This includes the development of predictive algorithms to identify high-risk pollution events and to characterize additional indoor pollutant sources relevant to disease progression. In parallel, Aguado et al. [5] conducted a rigorous verification study of low-cost IAQ monitoring tools, highlighting their usability for health-related applications despite certain limitations in accuracy. Together, these prior studies lay the groundwork for broader data-driven strategies aimed at reducing preventable exposures and improving health outcomes through environmental intelligence embedded in patient-centred care models.

Building upon these insights, this study performs a comparative analysis of several machine learning algorithms (Logistic Regression, SVM, Artificial Neural Networks, Random Forest, and XGBoost) to identify the model that best balances predictive accuracy with computational efficiency in detecting smoker presence within households. Once identified, the selected model is used to examine the specific influence of smoking habits on individual air quality parameters, focusing on differences in mean concentrations and temporal fluctuations associated with cigarette consumption. Additionally, instances in which the model produces significant errors are critically examined to interpret its limitations and further enhance its predictive performance. This comprehensive evaluation aims to validate the model’s practical applicability, facilitating the implementation of real-time alert systems or environmental diagnostics within residential settings.

In summary, the overarching goal of this research is not only to reliably detect the presence of smokers in domestic environments using advanced machine learning approaches, but also to deepen our understanding of how tobacco-related activities impact IAQ parameters. Ultimately, these efforts aim to promote healthier living environments through accessible, technology-driven monitoring solutions.

This research contributes to the objectives of the K-HEALTHinAIR project (https://k-healthinair.eu/, accessed on 30 September 2025), which aims to enhance our understanding of IAQ determinants and develop practical solutions to reduce exposure to harmful pollutants in indoor environments across Europe.

2. Materials and Methods

2.1. Study Design and Setting

This research is an observational, retrospective analysis aimed at predicting the presence of smokers from IAQ sensor data recorded in households located in Austria and Spain. Data were collected every 5 min per household in Austria and every 10 min per household in Spain. The study comprises two datasets covering the periods from 10 March 2023 to 9 March 2025 for Austria, and from 13 October 2023 to 9 March 2025 for Spain (Figure 1).

2.2. Participants

All households in the study participate in the K-HEALTHinAIR project, which involves continuous IAQ monitoring. Households were classified as smoking or non-smoking on the basis of self-reported information and follow-up verification.

All participants were required to sign an informed consent form prior to any monitoring or data collection procedures. The study protocol was approved by the Ethical Committee for Human Research at the Hospital Clínic de Barcelona on 29 June 2023 (HCB/2023/0126), and registered at ClinicalTrials.gov (Identifier: NCT06421402). The design followed the principles of the Declaration of Helsinki and the Spanish Biomedical Research Act 14/2007, ensuring the collection of only the data strictly necessary for the research objectives. Participants were informed of their right to withdraw from the study at any time without any consequences for their medical care or relationship with healthcare providers.

2.3. Variables and Endpoints

Household ID served as a grouping variable. The primary endpoint was the binary classification of households into smoking or non-smoking based on environmental parameters. Predictor variables included the following:

PM_2.5 (μg/m³);
CO₂ (ppm);
tVOCs (ppb);
Temperature (°C);
Relative humidity (%RH).

2.4. Data Sources and Measurements

Each household was equipped with a multi-sensor device capable of continuously recording several IAQ parameters, including temperature (T), relative humidity (RH), carbon dioxide (CO₂), formaldehyde (FA), total volatile organic compounds (tVOCs), and particulate matter (PM₁, PM_2.5, PM₁₀). The sensors used were InBiot MICA PLUS (Mutilva, Navarra, Spain; https://www.inbiot.es/products/mica-devices/mica-plus, accessed on 27 May 2025) in Spain and Kaiterra Sensedge Mini (Crans Montana, Switzerland; https://www.kaiterra.com/sensedge-mini-indoor-air-quality-monitor, accessed on 27 May 2025) in Austria, which combine electrochemical sensors (for VOCs and FA), a non-dispersive infrared (NDIR) sensor for CO₂, and optical laser scattering for particulate matter.

The devices were factory-calibrated, and additional spot checks were performed before deployment to ensure consistency between units. All data were obtained using standardized, calibrated IAQ sensors [5] deployed in each home. Devices recorded the above variables with time stamps and unique household codes (e.g., “HOMXXXXX”). The assessment methods were uniform across sites, with comparable sensor models and calibration protocols used in both countries. Data were logged at ten-minute intervals and aggregated for analysis. The sensors were installed in central locations within the main living areas of each home, away from direct heat sources or ventilation outlets, following a harmonized protocol to ensure comparability.

While absolute values may vary slightly due to sensor characteristics, the modelling approach focuses on relative changes and patterns within each household. This enables replication of the method in other settings where similar low-cost IAQ monitoring devices are available, supporting the scalability of the proposed approach.

2.5. Data Pre-Processing

All incomplete rows (those containing at least one missing value) were removed. This eliminated 21 and 21,476 observations for Austria and Spain, respectively, which together represent only 0.24% of the total data.

Binary categorical variables were numerically encoded (e.g., HOM_smoking = 1 for smoker households, 0 for non-smokers). Special attention was given to the variability in PM_2.5 levels across households, as smoking releases significant quantities of particulate matter [6], causing sharp increases in sensor readings. To quantify this variability meaningfully, a new variable, “PM25_centered”, was created by grouping PM_2.5 values by household and subtracting the respective household mean. This approach removes differences in absolute scale and instead highlights intra-household variance, which is more indicative of smoking activity.

Finally, all features were scaled to improve comparability and model performance.
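The row filtering and household centering described above can be illustrated with a minimal Python sketch. It is not the authors' R code (which is provided in the Supplementary Materials); the field names `household` and `pm25`, the helper names, and the toy readings are hypothetical and chosen only for illustration. The final scaling step is not shown, since the scaling method is not specified here.

```python
from collections import defaultdict


def drop_incomplete(rows):
    """Keep only rows in which every field has a value (no None)."""
    return [r for r in rows if all(v is not None for v in r.values())]


def add_pm25_centered(rows):
    """Add PM25_centered = PM2.5 minus the mean PM2.5 of the row's household."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for r in rows:
        sums[r["household"]] += r["pm25"]
        counts[r["household"]] += 1
    means = {h: sums[h] / counts[h] for h in sums}
    # Return new dicts so the input rows are left untouched.
    return [dict(r, PM25_centered=r["pm25"] - means[r["household"]]) for r in rows]


# Hypothetical toy readings (not study data).
rows = [
    {"household": "HOM_A", "pm25": 10.0, "co2": 600.0},
    {"household": "HOM_A", "pm25": 90.0, "co2": 650.0},
    {"household": "HOM_A", "pm25": None, "co2": 640.0},  # incomplete, dropped
    {"household": "HOM_B", "pm25": 8.0, "co2": 700.0},
    {"household": "HOM_B", "pm25": 12.0, "co2": 710.0},
]
clean = add_pm25_centered(drop_incomplete(rows))
centered = [r["PM25_centered"] for r in clean]
```

In this toy case the two households have very different spreads around their own means (±40 versus ±2 µg/m³), which is the kind of intra-household contrast the centred variable is meant to expose.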

The Supplementary Materials include the full R code used for data preprocessing, feature engineering, and model development, along with the final trained model object. The provided scripts (‘Fumadores.R’ and ‘Depurado.R’) perform data cleaning, integration of indoor air quality measurements with participant metadata, and creation of derived variables such as PM_2.5-centred values and smoking classification. The pseudocode document outlines the overall analytical pipeline, including data merging, cross-validation procedures, and training of five classification models (GLM, SVM, ANN, RF, XGBoost). The final trained SVM model is provided as an ‘.RData’ object for reproducibility and future testing.

2.6. Sample Size Justification

The study uses a convenience sample based on the number of homes enrolled in the broader K-HEALTHinAIR project. Given the high frequency of measurements (5–10 min) and long monitoring period, the sample size was considered sufficient for robust training and validation of machine learning models.

2.7. Quantitative Variables and Statistical Methods

With the exception of HOM_smoking, all variables were treated as continuous predictors and no categorization was applied [7]. The five supervised machine learning models chosen were: Generalized Linear Models (GLM), Support Vector Machines (SVM), Artificial Neural Networks (ANN), Random Forest (RF), and Extreme Gradient Boosting (XGBoost) [8].

To train and evaluate these models, random subsets of the dataset were used depending on the intended training size. For performance comparison across algorithms, a random sample of one million records was extracted for testing. All models were trained and evaluated using the same training and test partitions to ensure comparability across algorithms and dataset sizes. In the final model, one full month of data was used for training, while the remaining data were reserved for internal validation. A second, temporally independent dataset was used later for external validation.

To distinguish the performance of the classification algorithms used in this study, a range of metrics was applied to capture both overall predictive ability and robustness to class imbalance, a particularly relevant factor in this context. Key metrics included AUC-ROC, which assesses the model’s ability to distinguish between classes without relying on a fixed threshold, and accuracy, which was complemented by household-level accuracy to reflect real-world applicability. To better capture performance under imbalance, the evaluation also included the F1-score, which balances precision and recall, and the Matthews Correlation Coefficient (MCC), a metric known for its reliability in imbalanced binary classification tasks. Together, these measures provided a comprehensive assessment of model effectiveness and practical relevance.
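For reference, the threshold-based metrics named above follow directly from the confusion matrix, and AUC-ROC can be read as the probability that a randomly chosen smoking observation receives a higher score than a randomly chosen non-smoking one. The sketch below uses these standard definitions in plain Python; the function names and the toy labels are illustrative assumptions and are not taken from the study's R scripts.

```python
import math


def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for labels coded 1 = smoking, 0 = non-smoking."""
    tp = tn = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 0:
            tn += 1
        elif truth == 0 and pred == 1:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn


def _safe_div(num, den):
    """Divide, returning 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def classification_metrics(y_true, y_pred):
    """Accuracy, balanced accuracy, F1-score and MCC from hard predictions."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    sensitivity = _safe_div(tp, tp + fn)
    specificity = _safe_div(tn, tn + fp)
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": _safe_div(tp + tn, tp + tn + fp + fn),
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "f1": _safe_div(2 * tp, 2 * tp + fp + fn),
        "mcc": _safe_div(tp * tn - fp * fn, mcc_den),
    }


def auc_roc(y_true, scores):
    """Pairwise (Mann-Whitney) AUC: ties between a positive and a negative count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy imbalanced example: 2 smoking vs. 8 non-smoking observations.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
scores = [0.9, 0.4, 0.5, 0.1, 0.2, 0.3, 0.1, 0.2, 0.3, 0.6]
metrics = classification_metrics(y_true, y_pred)
auc = auc_roc(y_true, scores)
```

In this toy case accuracy is 0.80 while balanced accuracy is 0.6875, F1 is 0.50, and MCC is 0.375, which illustrates why accuracy alone can be misleading when non-smoking observations dominate.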

All analyses and model evaluations were performed in R using RStudio (version 2024.04.2+764) [9,10].

2.8. Ethical Statement

The core study protocol for K-HEALTHinAIR was approved by the Ethical Committee for Human Research at the Hospital Clínic de Barcelona on 29 June 2023 (Reference: HCB/2023/0126), and is registered at ClinicalTrials.gov (Identifier: NCT06421402). All procedures were carried out in accordance with the Declaration of Helsinki and national regulations (Biomedical Research Act 14/2007 of 3 July). The study design follows the principle of data minimisation, collecting only the information strictly necessary for the research. All participants provided signed informed consent before any procedures were initiated, and were informed that they could withdraw from the study at any time without any consequence for their medical treatment or relationship with healthcare providers.

Although the present work aims to detect smoking behaviour in households based on environmental sensor data, it does not seek to judge personal habits or enforce behavioural changes. Rather, this approach is intended to serve as a methodological framework for identifying sources of indoor air pollution that may be relevant for public health, particularly in vulnerable populations. Smoking is addressed here as an initial and illustrative case of such a source, but the underlying modelling approach can be extended to other indoor emission activities with health implications.

3. Results

3.1. Sample Description

The distribution of homes and observations (Table 1) between Austria and Spain reveals a notable difference in sample size, whereas the prevalence of smoking households was broadly similar. Although Spain contributed more homes overall (100 vs. 29), the proportion of smoker households was approximately 31% in Austria (9 out of 29) and 24% in Spain (24 out of 100). This balanced representation provides a useful cross-country comparison and supports the model’s ability to generalize across different geographical and cultural contexts. Additionally, the near-identical end dates and long data collection periods ensure consistency in temporal coverage, enhancing the robustness of both training and validation phases.

To complement the general description of the sample, a comparative analysis of IAQ parameters was conducted between smoking and non-smoking households in both Spain and Austria. This analysis focused on seasonal mean values and standard deviations for each environmental parameter recorded by the IAQ sensors, offering a more detailed characterization of pollution patterns and household behaviour. The results are summarized in Table 2 and Table 3.
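A seasonal summary of this kind can be sketched as a group-by over smoking status and season. The Python example below is only illustrative: the month-to-season mapping (meteorological seasons) is an assumption, since the exact season boundaries are not stated here, and the records are toy values rather than study data.

```python
import statistics
from collections import defaultdict
from datetime import datetime

# Assumption: meteorological seasons (Dec-Feb = winter, and so on).
SEASON_BY_MONTH = {
    12: "winter", 1: "winter", 2: "winter",
    3: "spring", 4: "spring", 5: "spring",
    6: "summer", 7: "summer", 8: "summer",
    9: "autumn", 10: "autumn", 11: "autumn",
}


def seasonal_summary(records):
    """Return {(smoking, season): (mean, sample SD)} for (timestamp, smoking, value) records."""
    groups = defaultdict(list)
    for timestamp, smoking, value in records:
        groups[(smoking, SEASON_BY_MONTH[timestamp.month])].append(value)
    return {
        key: (statistics.mean(vals), statistics.stdev(vals) if len(vals) > 1 else 0.0)
        for key, vals in groups.items()
    }


# Toy PM2.5 readings in µg/m³ (hypothetical values).
records = [
    (datetime(2024, 1, 5, 20, 0), 1, 60.0),
    (datetime(2024, 1, 6, 20, 0), 1, 100.0),
    (datetime(2024, 1, 5, 20, 0), 0, 9.0),
    (datetime(2024, 1, 7, 20, 0), 0, 11.0),
]
summary = seasonal_summary(records)
```

Reporting the standard deviation alongside the mean matters here because smoking episodes are expected to show up as intermittent peaks rather than as a uniformly raised baseline.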

In Spain (Table 2), clear differences were observed between smoking and non-smoking households, particularly in PM_2.5 and PM₁ concentrations, where smoking households exhibited consistently elevated mean values across all seasons. For example, PM_2.5 in winter reached an average of 85 µg/m³ in smoking homes, compared to 11 µg/m³ in non-smoking homes. These differences were accompanied by substantially greater standard deviations, suggesting increased temporal variability likely associated with smoking events.

CO₂ levels were slightly lower in smoking households across most seasons, which may reflect behavioural or ventilation differences. For instance, in autumn, average CO₂ concentrations were 683 ppm in non-smoking homes and 662 ppm in smoking homes. This trend could indicate that smokers tend to ventilate more during or after smoking activities, either intentionally or as a consequence of IAQ discomfort. However, this observation requires cautious interpretation, as differences are small and could also be influenced by occupancy or building characteristics.

Temperature and humidity showed minimal differences, indicating that these parameters are less influenced by smoking behaviour and may act as secondary contextual variables rather than primary predictors.

In Austria (Table 3), similar trends were observed, though the contrast was slightly less pronounced. Smoking households showed higher PM_2.5 values, especially in spring and winter, with PM_2.5 levels reaching 51 µg/m³ in spring compared to 9 µg/m³ in non-smoking homes. Regarding CO₂, smoking households tended to present slightly elevated values throughout the year—for example, 741 ppm in autumn versus 697 ppm in non-smoking households—which may point to different ventilation patterns, possibly associated with outdoor temperature or cultural factors.

Overall, these descriptive analyses reinforce the findings of the machine learning model by highlighting distinct IAQ signatures associated with smoking behaviour. The large intra-seasonal variability in pollutant levels—especially in PM-related metrics—supports the decision to include both mean-adjusted and centred variables in the modelling approach, as detailed in Section 2.5.

3.2. PM_2.5 Profile Analysis by Classification Outcome

To better understand the behavior of the machine learning model and the features that drive its predictions, we performed a visual inspection of the temporal evolution of PM_2.5 concentrations in selected households. Although the final classification model incorporated multiple environmental variables (PM_2.5, PM₁, CO₂, temperature, humidity, and tVOC), PM_2.5 and PM₁ consistently emerged as the most influential features in both the feature importance analysis (see Section 3.4) and the exploratory comparisons between smoking and non-smoking households (Section 3.1).

This analysis focuses on PM_2.5, given its particularly strong discriminative power and interpretability as a marker of smoking-related pollution. We selected four representative households, one for each of the following classification outcomes:

▪ True Positive (TP): a smoking household correctly classified as such.
▪ True Negative (TN): a non-smoking household correctly classified as non-smoker.
▪ False Negative (FN): a smoking household misclassified as non-smoker.
▪ False Positive (FP): a non-smoking household misclassified as smoker.

For each case, we extracted PM_2.5 data over a one-month period and plotted the daily temporal profile. The selected households reflect realistic environmental and behavioral conditions during the monitoring campaign.

The visual profiles shown in Figure 2 reveal distinct differences that help to contextualize the model’s predictions (panels are listed from top to bottom):

▪ Figure 2a (TP): Clear and frequent PM_2.5 peaks throughout the month, consistent with indoor smoking episodes.
▪ Figure 2b (TN): A very low and stable PM_2.5 signal, with only one or two isolated minor peaks, strongly aligned with a non-smoking pattern.
▪ Figure 2c (FN): Consistently low PM_2.5 levels despite the household being labelled as smoking. This may be due to low-frequency smoking, smoking in well-ventilated areas, or distance between the source and the sensor, resulting in minimal detection.
▪ Figure 2d (FP): Although initially considered as a non-smoking home, the household was later confirmed by the field nurse to present signs of smoking, suggesting that the model correctly identified a misclassified home. The PM_2.5 pattern shows high peaks concentrated in the first half of the month, followed by a sharp decrease, possibly indicating a cessation or change in behaviour.

These examples highlight both the strengths and limitations of the model. On one hand, it can detect exposure patterns that contradict self-reported data, adding value in real-world monitoring and public health surveillance. On the other hand, its performance may be constrained by the physical context of smoking behaviours, such as source location, intensity, or household ventilation strategies, which can limit the sensor’s ability to detect pollution events. These individual case analyses provide qualitative support for the modelling approach described next, where algorithmic performance is evaluated systematically across a range of training sizes.

3.3. Modelling Performance

Table 4 summarizes the results for GLM, SVM, ANN, Random Forest, and XGBoost at four training sizes: 100, 1000, 10,000, and 100,000 observations. Each configuration was evaluated with seven performance metrics: computational time, AUC-ROC (with optimal threshold), accuracy, balanced accuracy, household-level accuracy, F1-score, and the Matthews Correlation Coefficient (MCC). For every experiment, 100 of the 129 households were randomly selected. From each selected household, 1, 10, 100, or 1000 observations were drawn, producing the four training sizes above. A 5-fold cross-validation was applied, with the observations re-randomized at each fold, and all data from the remaining 29 homes were used to test the algorithms.
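The sampling design just described (100 training households, a fixed number of observations per household, and the remaining 29 homes held out) can be sketched as follows. This is a simplified Python illustration, not the authors' procedure: sampling without replacement, the fixed seed, and the synthetic household IDs are assumptions, and the 5-fold re-randomization is omitted.

```python
import random


def split_households(household_ids, n_train=100, seed=0):
    """Randomly pick n_train households for training; the rest are held out."""
    rng = random.Random(seed)
    shuffled = sorted(household_ids)
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]


def draw_training_rows(rows_by_household, train_ids, per_household, seed=0):
    """Draw up to per_household rows, without replacement, from each training household."""
    rng = random.Random(seed)
    sample = []
    for household in train_ids:
        rows = rows_by_household[household]
        sample.extend(rng.sample(rows, min(per_household, len(rows))))
    return sample


# Synthetic stand-in: 129 households with 1000 placeholder rows each.
rows_by_household = {f"HOM{i:05d}": list(range(1000)) for i in range(129)}
train_ids, test_ids = split_households(rows_by_household)
training_sizes = {
    k: len(draw_training_rows(rows_by_household, train_ids, k))
    for k in (1, 10, 100, 1000)
}
```

Splitting by household rather than by observation keeps all readings from a given home on one side of the split, so the test homes are genuinely unseen during training.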

As shown, Support Vector Machines (SVM) consistently demonstrate the strongest predictive capacity across dataset sizes, particularly at medium and large training sets (10,000 records), where the algorithm achieves the highest overall accuracy (91.14%), balanced accuracy (86.5%), and competitive F1-scores and MCC values. Although the computational cost of SVM grows considerably with very large datasets (100,000 samples), its predictive performance remains robust, highlighting its suitability for household-level smoking detection.

The advantage of SVM in this context likely arises from its ability to effectively separate classes in high-dimensional feature spaces. Smoking households can present subtle and non-linear patterns in particulate matter, CO₂ concentration, and indoor temperature fluctuations. These variables often overlap between smoking and non-smoking homes, but SVM can identify optimal decision boundaries that maximize the distinction between groups, even when differences are small. This capacity to capture fine-grained, complex relationships explains why SVM outperforms other algorithms in predicting smoking presence. While methods such as GLM and RF offer good baseline performance with low computational demands, and ANNs can model non-linearities effectively, SVM achieves a better balance between precision and generalizability across conditions.

Given the temporal resolution of the dataset (5–10 min intervals across 129 households) and the environmental variability inherent to indoor air measurements, SVM emerges as the most appropriate algorithm for this study’s objectives. Its superior classification performance across multiple evaluation metrics provides a solid foundation for its selection in the final model deployment and further detailed analysis, presented in Section 3.4.

3.4. Robustness Assessment of the Final Model

Following the large-scale exploration presented in Section 3.3, an additional evaluation was conducted to assess the robustness of the selected Support Vector Machine (SVM) model. The aim was to verify its ability to consistently discriminate between smoking and non-smoking households both at the level of individual measurements and at the aggregated household level (i.e., averaging predictions across all observations within each home). To this end, we applied a stratified Monte Carlo cross-validation procedure, ensuring proportional representation of smoking and non-smoking households in each iteration.
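As an illustration of stratified Monte Carlo cross-validation at the household level, the Python sketch below repeatedly draws a random test set that preserves the smoker/non-smoker proportion. The test fraction of 0.2 and the seed are assumptions made for the example (the split ratio is not stated here); the 33 smoking and 96 non-smoking households match the counts reported in Section 3.1, but the IDs are synthetic.

```python
import random


def stratified_mc_splits(labels, n_repeats=100, test_fraction=0.2, seed=0):
    """Yield (train_ids, test_ids) pairs, sampling each class separately.

    labels maps household ID -> 1 (smoking) or 0 (non-smoking).
    """
    rng = random.Random(seed)
    pos = sorted(h for h, y in labels.items() if y == 1)
    neg = sorted(h for h, y in labels.items() if y == 0)
    n_pos = max(1, round(len(pos) * test_fraction))
    n_neg = max(1, round(len(neg) * test_fraction))
    for _ in range(n_repeats):
        test = set(rng.sample(pos, n_pos)) | set(rng.sample(neg, n_neg))
        train = [h for h in labels if h not in test]
        yield train, sorted(test)


# Synthetic labels: 33 smoking and 96 non-smoking households.
labels = {f"HOM{i:05d}": int(i < 33) for i in range(129)}
splits = list(stratified_mc_splits(labels, n_repeats=100))
```

Unlike k-fold cross-validation, each repetition is an independent random split, so a household may appear in the test set of several repetitions; the spread of a metric across repetitions then gives an empirical view of its variability.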

Across 100 repetitions, the model achieved the following average performance at the observation level:

AUC-ROC: 0.83 (95% CI: 0.81–0.86);
Accuracy: 0.88 (95% CI: 0.87–0.90);
Balanced Accuracy: 0.78 (95% CI: 0.75–0.80);
F1-score: 0.69 (95% CI: 0.65–0.74);
Matthews Correlation Coefficient (MCC): 0.67 (95% CI: 0.63–0.71).

At the household level, where predictions were aggregated per home, performance remained comparable:

AUC-ROC: 0.84 (95% CI: 0.81–0.86);
Accuracy: 0.89 (95% CI: 0.83–0.90);
Balanced Accuracy: 0.78 (95% CI: 0.75–0.80);
F1-score: 0.70 (95% CI: 0.66–0.74);
MCC: 0.67 (95% CI: 0.63–0.71).

These findings indicate that the SVM model delivers stable discrimination between smoking and non-smoking households when evaluated across multiple random splits, and that this consistency holds both for raw sensor measurements and for aggregated household-level predictions. Figure 3 presents the ROC curves for the first 10 iterations of the stratified Monte Carlo validation, illustrating variability across runs while supporting the overall robustness of the model.
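The household-level aggregation mentioned above (averaging predictions across all observations within each home) can be sketched in a few lines of Python. The 0.5 decision threshold and the toy probabilities are assumptions for illustration only; the study reports AUC-ROC with an optimal threshold, which may differ.

```python
from collections import defaultdict


def aggregate_by_household(household_ids, probabilities, threshold=0.5):
    """Average per-observation smoking probabilities per home, then threshold."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for household, prob in zip(household_ids, probabilities):
        totals[household] += prob
        counts[household] += 1
    mean_prob = {h: totals[h] / counts[h] for h in totals}
    predicted = {h: int(p >= threshold) for h, p in mean_prob.items()}
    return mean_prob, predicted


# Toy per-observation outputs (hypothetical values).
household_ids = ["HOM_A", "HOM_A", "HOM_A", "HOM_B", "HOM_B", "HOM_B"]
probabilities = [0.9, 0.8, 0.2, 0.1, 0.3, 0.2]
mean_prob, predicted = aggregate_by_household(household_ids, probabilities)
```

Averaging smooths out isolated misclassified observations: in the toy data, one low-probability reading in HOM_A does not change that home's overall classification.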

The analysis of feature importance (Table 5) revealed that PM_2.5 and PM25_centered were by far the most influential variables in the model, consistent with their role as direct indicators of indoor smoking. tVOCs, while much less influential, provided complementary information, likely reflecting additional emission sources. Temperature, humidity, and CO₂ showed even smaller contributions, but their consistent presence suggests they provide relevant contextual information that enhances the model’s ability to make accurate classifications.

These results confirm the reliability and robustness of the SVM model in identifying smoking households based on sensor data collected at high frequency.

4. Discussion and Conclusions

The results presented in this study demonstrate the feasibility of detecting the presence of smokers in households with high accuracy using machine learning techniques applied to indoor air quality (IAQ) data. This work adds to the limited body of research exploring smoking behavior detection in real-world domestic environments by leveraging high-resolution, long-term sensor data across diverse household settings. The final SVM model, trained on 10,000 observations, achieved a classification accuracy of around 83% at the household level and retained reasonable performance under temporal shift conditions during external validation. The comparative analysis of different machine learning models revealed that both SVM and XGBoost outperformed other approaches such as GLM, ANN, and Random Forest, with SVM achieving the best overall classification accuracy. XGBoost also showed strong results, particularly in terms of robustness to variable reduction and high SHAP-derived feature interpretability. Notably, the model performance remained relatively stable even under reduced feature sets and threshold shifts, which suggests good generalization capability and resilience to input noise, both key aspects for practical deployment in heterogeneous household environments. These patterns underscore the feasibility of implementing such models in dynamic, real-world conditions where sensor reliability or completeness may vary.

While the SVM and XGBoost models outperformed other classifiers in this study, further analysis is required to fully understand the factors contributing to their superior performance. Possible explanations include their robustness to multicollinearity, ability to model non-linear relationships, and resilience to high-dimensional feature spaces. Additional evaluation of feature interactions and model behavior will be explored in future work.

These findings support the growing potential of data-driven approaches to complement—or even replace—traditional exposure identification methods. In contrast to manual inspections, periodic surveys, or passive sampling, machine learning models can process continuous and high-frequency sensor data to identify behavioral patterns such as smoking, which have clear and well-documented health implications. This capability opens the door to real-time alerts, personalized exposure monitoring, and integration into digital health platforms targeting at-risk populations.

Beyond the scope of smoking detection, the structure and logic of the proposed methodology are readily transferable to other indoor pollution sources. Events such as cooking, cleaning product use, incense or candle burning, and inadequate ventilation can all generate spikes in particulate matter or VOCs that resemble smoking signatures in isolation. With suitable data labeling or unsupervised learning extensions, similar models could be developed to classify or cluster such events, enabling a more granular understanding of indoor pollution dynamics. Future implementations could explore multiclass classification schemes or anomaly detection techniques to identify atypical IAQ patterns without relying solely on predefined categories.

Another important implication of this work lies in the field of exposure assessment. Rather than relying on static thresholds or averaged concentrations, probabilistic outputs from machine learning models allow for a more nuanced and dynamic evaluation of exposure risk. These models can estimate not only whether a person is exposed, but also how likely it is that a persistent pollutant source is present, how often it occurs, and under which contextual conditions. This level of granularity could be instrumental for developing personalized exposure profiles, improving clinical follow-up of patients with chronic respiratory diseases, or informing public health interventions. Furthermore, these probabilistic assessments could support policy design by identifying households or behaviors with consistently elevated indoor pollution profiles.

This line of research aligns closely with the objectives of the K-HEALTHinAIR project, which seeks to identify actionable IAQ determinants across European indoor environments. In this context, the ability to automatically detect pollution sources, characterize exposure episodes, and evaluate their relevance to health outcomes forms a foundational component of environmental intelligence systems. The integration of such approaches into building diagnostics, occupant feedback loops, or health-related decision-making tools could represent a significant step forward in preventive indoor environmental health.

Finally, the study highlights key operational and methodological considerations. Sensor placement, room use patterns, and ventilation behavior all influence detection performance—especially in borderline or ambiguous cases. As such, future work should consider combining IAQ data with occupancy tracking, contextual metadata, or even physiological monitoring to better interpret complex exposure profiles. Additionally, expanding the dataset to include labeled data from other pollutant sources would enable broader generalization and improved source attribution.

By integrating machine learning with longitudinal IAQ monitoring, this study demonstrates a scalable, interpretable, and robust approach for detecting hidden behaviors that impact IAQ and public health. The findings pave the way for smarter, more responsive indoor pollution detection systems—capable of supporting both individual care and broader environmental policy objectives.

Beyond its technical performance, the proposed modeling approach offers a versatile framework that could be adapted to detect other relevant sources of indoor pollution, such as cooking, cleaning products, or incense burning. These extensions would allow for broader classification capabilities and more comprehensive exposure profiles. At the same time, the potential to infer personal behaviors from environmental data raises important ethical considerations. It is essential that future applications ensure transparency, informed consent, and data protection, especially when deployed in sensitive contexts such as households, schools, or healthcare settings. As emphasized in the European Group on Ethics in Science and New Technologies (EGE) guidelines [11], data-driven tools must be designed not only for technical accuracy but also with respect for dignity, autonomy, and fairness. Integrating ethical foresight into the development and deployment of IAQ-based behavioral models will be crucial to ensure they serve both scientific advancement and societal good.

The methodological framework proposed in this study can be reproduced in other residential settings using similar low-cost indoor air quality sensors. Because the analysis focuses on relative patterns and correlations rather than on absolute pollutant concentrations, the approach remains robust to slight sensor-to-sensor variability. Moreover, the installation protocol and data processing steps are simple and adaptable, making the method suitable for large-scale implementations or citizen science initiatives aimed at characterizing indoor pollution sources.
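The point about relative patterns can be illustrated with a small Python sketch of per-household centering. It assumes mean-centering; the exact centering used for the PM_2.5-centred variable in the supplementary R scripts may differ, and `centre_by_household` and the readings are hypothetical.

```python
def centre_by_household(readings):
    """Subtract each household's own mean from its readings.

    readings: dict mapping a household id to a list of PM2.5 values (µg/m³).
    """
    centred = {}
    for household_id, values in readings.items():
        baseline = sum(values) / len(values)
        centred[household_id] = [v - baseline for v in values]
    return centred

# Two hypothetical sensors observing the same pattern; sensor B reads
# 5 µg/m³ higher than sensor A because of a constant calibration offset.
readings = {
    "sensor_A": [10, 12, 50, 8],
    "sensor_B": [15, 17, 55, 13],
}
centred = centre_by_household(readings)
print(centred["sensor_A"] == centred["sensor_B"])  # prints True
```

Centering removes a constant offset between devices, but it does not correct differences in sensor gain or drift over time.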

While the main objective of this work was to develop a methodological approach to detect indoor smoking based on low-cost sensor data, the public health relevance of such detection lies in its potential to protect cohabitants—especially vulnerable individuals such as children, elderly people, or patients with chronic respiratory diseases—from second-hand smoke exposure. Numerous studies have established the health risks of passive smoking in domestic environments, including increased incidence of asthma, COPD exacerbations, and cardiovascular issues. For instance, Jordan et al. (2011) reported a significant dose–response relationship between household passive smoking and COPD risk, with nearly double the odds of clinically significant COPD among never-smokers exposed for over 20 h per week [12]. By identifying indoor smoking through indirect measurements, this approach may contribute to raising awareness and encouraging preventive actions without the need for intrusive surveillance.

Although regulatory interventions in private households are ethically and legally limited, the insights provided by this type of modelling can support the development of targeted educational campaigns, voluntary mitigation strategies, or warning systems for sensitive groups. Furthermore, the methodology can be extended to other sources of indoor air pollution with public health implications, such as the use of wood stoves, cleaning products, or indoor combustion activities.

Supplementary Materials

The following supporting information can be downloaded at https://www.mdpi.com/article/10.3390/air3040027/s1. The Supplementary Materials include the full R code used for data preprocessing, feature engineering, and model development, along with the final trained model object. The provided scripts (‘Fumadores.R’ and ‘Depurado.R’) perform data cleaning, integration of indoor air quality measurements with participant metadata, and creation of derived variables such as PM_2.5-centred values and smoking classification. The pseudocode document outlines the overall analytical pipeline, including data merging, cross-validation procedures, and training of five classification models (GLM, SVM, ANN, RF, XGBoost). The final trained SVM model is provided as an ‘.RData’ object for reproducibility and future testing.

Author Contributions

Conceptualization, D.M.P., A.A., R.G.-C. and J.F.; methodology, D.M.P., R.G.-C. and J.F.; software, D.M.P.; validation, A.B., P.K. and R.G.-C.; formal analysis, A.A., S.R.-S., A.B., A.K. and P.K.; investigation, S.R.-S., A.G.-L., A.B. and J.F.; resources, S.R.-S., A.A., A.B. and A.K.; writing—original draft preparation, D.M.P., S.R.-S. and A.A.; writing—review and editing, D.M.P., A.A., A.B., P.K., A.K., L.P. and J.F.; visualization, A.A. and J.F.; supervision, J.F.; project administration, A.A. and J.F.; funding acquisition, A.A. and J.F. All authors have read and agreed to the published version of the manuscript.

Funding

This study was funded by the K-HEALTHinAIR project (Grant Agreement No. 101057693) under the European Union's call on Environment and Health (HORIZON-HLTH-2021-ENVHLTH-02).

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study, in accordance with the protocol approved by the Ethical Committee for Human Research at the Hospital Clínic de Barcelona (Reference HCB/2023/0126).

Data Availability Statement

The data supporting the findings of this study, including the graphs, datasets, sensor specifications, and details about the reference equipment used, will be made available upon reasonable request through the ZENODO platform in the K-HEALTHinAIR project profile folder. These materials will be shared to ensure transparency and reproducibility while adhering to any applicable privacy or ethical considerations.

Use of Artificial Intelligence

Artificial intelligence tools were employed in the preparation of this paper to expedite processes such as language translation, language editing, grammar correction, and text generation.

Conflicts of Interest

Authors Sandra Rodríguez-Sufuentes, Alicia Aguado, and Jose Fermoso were employed by the company CARTIF. Author Leticia Pérez was employed by the company KVELOCE. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

1. Soleimani, F.; Dobaradaran, S.; De-La-Torre, G.E.; Schmidt, T.C.; Saeedi, R. Content of toxic components of cigarette, cigarette smoke vs cigarette butts: A comprehensive systematic review. Sci. Total. Environ. 2022, 813, 152667.
2. Liang, Y.-C.; Maimury, Y.; Chen, A.H.-L.; Juarez, J.R.C. Machine Learning-Based Prediction of Air Quality. Appl. Sci. 2020, 10, 9151.
3. Fan, C.; Wang, R.; Song, G.; Teng, M.; Zhang, M.; Liu, H.; Li, Z.; Li, S.; Xing, J. Quantifying the Impact of Multiple Factors on Air Quality Model Simulation Biases Using Machine Learning. Atmosphere 2024, 15, 1337.
4. Gómez-López, A.; Arismendi, E.; Cano, I.; Farre, R.; Figols, M.; Hernández, C.; Montilla-Ibarra, A.; Sánchez-Ruano, N.; Sánchez, B.; Sisó-Almirall, A.; et al. Protocol for the enhanced management of multimorbid patients with COPD and severe asthma: Role of indoor air quality. BMJ Open Respir. Res. 2025, 12, e002589.
5. Aguado, A.; Rodríguez-Sufuentes, S.; Verdugo, F.; Rodríguez-López, A.; Figols, M.; Dalheimer, J.; Gómez-López, A.; González-Colom, R.; Badyda, A.; Fermoso, J. Verification and Usability of Indoor Air Quality Monitoring Tools in the Framework of Health-Related Studies. Air 2025, 3, 3.
6. Pitten, L.; Brüggmann, D.; Dröge, J.; Braun, M.; Groneberg, D.A. TAPaC—Tobacco-associated particulate matter emissions inside a car cabin: Establishment of a new measuring platform. J. Occup. Med. Toxicol. 2022, 17, 17.
7. Honeine, P. An Eigenanalysis of Data Centering in Machine Learning. arXiv 2014, arXiv:1407.2904.
8. Zaki, M.J.; Meira, W., Jr. Data Mining and Machine Learning: Fundamental Concepts and Algorithms, 2nd ed.; Cambridge University Press: Cambridge, UK, 2020; Available online: https://books.google.es/books?hl=es&id=oafDDwAAQBAJ (accessed on 27 May 2025).
9. R Core Team. R: A Language and Environment for Statistical Computing. Available online: https://www.R-project.org/ (accessed on 27 May 2025).
10. Shah, A.; Shah, D.; Wornell, G.W. A Computationally Efficient Method for Learning Exponential Family Distributions. In Proceedings of the 35th Conference on Neural Information Processing Systems (NeurIPS 2021), Virtual Event, 6–14 December 2021; Available online: https://proceedings.neurips.cc/paper_files/paper/2021/file/84f7e69969dea92a925508f7c1f9579a-Paper.pdf (accessed on 27 May 2025).
11. European Group on Ethics in Science and New Technologies (EGE). Ethical Principles and Democratic Prerequisites for AI-Based Systems; Publications Office of the European Union: Luxembourg, 2022.
12. Jordan, R.E.; Cheng, K.K.; Miller, M.R.; Adab, P. Passive smoking and chronic obstructive pulmonary disease: Cross-sectional analysis of data from the Health Survey for England. BMJ Open 2011, 1, e000153.

Figure 1. Timeline of the amount of data collected in Spain and Austria.

Figure 2. PM₁ (blue) and PM_2.5 (yellow) concentration profiles in four representative households corresponding to the four classification outcomes: (a) true positive (TP), (b) true negative (TN), (c) false negative (FN), and (d) false positive (FP). Each profile covers a one-month period and illustrates the temporal dynamics that influenced model predictions.

Figure 3. Receiver Operating Characteristic (ROC) curve for the final SVM model used to classify smoking households. The different colored lines represent the ROC curves obtained for each cross-validation fold, while the bold black line shows the mean ROC curve across all folds. The shaded area corresponds to the variability between folds, illustrating the robustness and generalizability of the model’s performance.

Table 1. Information about each dataset.

Country	Austria	Spain
Starting date	10-03-2023 12:02:44	13-10-2023 08:19:57
End date	09-03-2025 23:58:43	09-03-2025 23:59:54
Number of observations	4,512,604	4,412,555
Number of homes	29	100
Homes with smoker presence	9	24

Table 2. Seasonal average values and standard deviations of indoor air quality parameters in smoking and non-smoking households in Spain.

Parameter (mean)	Smoker presence	Autumn	Spring	Summer	Winter
CO₂ (ppm)	No	683	636	550	722
CO₂ (ppm)	Yes	662	571	483	700
PM_2.5 (µg/m³)	No	10	8	6	11
PM_2.5 (µg/m³)	Yes	72	54	21	85
PM₁ (µg/m³)	No	10	7	6	10
PM₁ (µg/m³)	Yes	64	49	19	76
Temperature (°C)	No	21.9	23.5	28.2	20.6
Temperature (°C)	Yes	21.7	23.3	28.3	20.0
tVOC (ppb)	No	1275	2165	732	2576
tVOC (ppb)	Yes	2133	3492	828	3664
Formaldehyde (µg/m³)	No	37	39	51	29
Formaldehyde (µg/m³)	Yes	29	30	36	23
Relative Humidity (%)	No	56	51	54	49
Relative Humidity (%)	Yes	56	51	55	50

Table 3. Seasonal average values and standard deviations of indoor air quality parameters in smoking and non-smoking households in Austria.

Parameter (mean)	Smoker presence	Autumn	Spring	Summer	Winter
CO₂ (ppm)	No	697	655	550	658
CO₂ (ppm)	Yes	741	704	619	705
PM_2.5 (µg/m³)	No	12	9	10	13
PM_2.5 (µg/m³)	Yes	34	51	21	34
PM₁ (µg/m³)	No	7	6	6	8
PM₁ (µg/m³)	Yes	21	30	12	21
Temperature (°C)	No	22.2	22.5	25.0	21.9
Temperature (°C)	Yes	21.6	21.8	24.1	21.6
tVOC (ppb)	No	2841	1007	964	2575
tVOC (ppb)	Yes	3540	1651	1311	2937
Relative Humidity (%)	No	49	49	56	40
Relative Humidity (%)	Yes	51	53	60	42

Table 4. Comparative performance of machine learning models for smoking household classification using varying training set sizes. Results are averaged across models after 5-fold cross-validation.

ML Algorithm	Training Set Size	Computational Cost (s)	AUC–ROC (Threshold)	Accuracy	Balanced Accuracy	F1-Score	MCC
GLM	100	0.0021	0.7937 (0.5959)	0.892	0.7896	0.7277	0.6889
GLM	1000	0.005	0.7724 (0.3939)	0.8522	0.7731	0.6542	0.659
GLM	10,000	0.0423	0.8572 (0.3282)	0.9066	0.8588	0.7344	0.7661
GLM	100,000	0.1518	0.7849 (0.4024)	0.7965	0.7619	0.5536	0.5701
SVM	100	0.0034	0.8158 (0.5169)	0.8913	0.7905	0.6578	0.6848
SVM	1000	0.0631	0.7863 (0.37)	0.851	0.775	0.6542	0.6567
SVM	10,000	6.489	0.8602 (0.3303)	0.9114	0.8654	0.7344	0.7785
SVM	100,000	994.5252 (16.5754 min)	0.7987 (0.4117)	0.8097	0.7725	0.5791	0.5678
ANNs	100	0.0059	0.7292 (0.6069)	0.8307	0.7397	0.7325	0.5123
ANNs	1000	0.0672	0.7436 (0.4434)	0.8363	0.7513	0.6542	0.613
ANNs	10,000	0.7427	0.8517 (0.3418)	0.8849	0.8411	0.7176	0.7095
ANNs	100,000	8.5871	0.7633 (0.4227)	0.7891	0.7414	0.5352	0.5368
RF	100	0.0077	0.7769 (0.453)	0.8334	0.7498	0.7277	0.5249
RF	1000	0.0898	0.7478 (0.409)	0.8046	0.7311	0.6542	0.521
RF	10,000	1.2665	0.8404 (0.464)	0.8722	0.8200	0.7314	0.6732
RF	100,000	16.1001	0.7375 (0.427)	0.7922	0.7246	0.5281	0.4854
XGBoost	100	0.1518	0.7359 (0.4169)	0.4169	0.7766	0.8966	0.701
XGBoost	1000	0.2616	0.7371 (0.4663)	0.8578	0.8003	0.7314	0.6298
XGBoost	10,000	0.5036	0.8173 (0.5206)	0.8225	0.7631	0.6811	0.5678
XGBoost	100,000	2.3375	0.7518 (0.3355)	0.7815	0.7292	0.5985	0.4573

The best-performing values for each dataset size are highlighted in bold.
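For readers less familiar with the metrics reported in Table 4, the Python sketch below shows how accuracy, balanced accuracy, F1-score, and the Matthews correlation coefficient (MCC) follow from a binary confusion matrix. These are the standard definitions; the confusion-matrix counts are hypothetical and are not taken from this study, and `classification_metrics` is an illustrative name.

```python
import math

def classification_metrics(tp, fp, tn, fn):
    """Compute common binary-classification metrics from confusion-matrix counts.

    Assumes every row and column total of the confusion matrix is non-zero.
    """
    sensitivity = tp / (tp + fn)   # recall on the positive (smoking) class
    specificity = tn / (tn + fp)   # recall on the negative class
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    balanced_accuracy = (sensitivity + specificity) / 2
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "accuracy": accuracy,
        "balanced_accuracy": balanced_accuracy,
        "f1": f1,
        "mcc": mcc,
    }

# Hypothetical counts: 10 smoking and 20 non-smoking households.
metrics = classification_metrics(tp=8, fp=2, tn=18, fn=2)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these illustrative counts, accuracy is about 0.87 while balanced accuracy is 0.85, F1 is 0.80, and MCC is 0.70, which shows why metrics beyond plain accuracy are informative when the classes are imbalanced.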

Table 5. Feature importance of the final SVM model based on the weights.

Parameter	Weights
PM_2.5	4.4879
PM_2.5 centered	4.0078
tVOCs	0.1832
temperature	0.0639
humidity	0.0423
CO₂	0.0388

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
