1. Introduction
Indoor air quality (IAQ) is increasingly recognized as a crucial determinant of human health due to the prolonged exposure of individuals to indoor environments, where pollutant concentrations often exceed outdoor levels. Among indoor pollutants, cigarette smoke remains particularly hazardous, as it contains thousands of toxic chemicals, including particulate matter (PM2.5), volatile organic compounds (VOCs), aldehydes, polycyclic aromatic hydrocarbons (PAHs), heavy metals, and tobacco-specific nitrosamines (TSNAs), many of which have carcinogenic or mutagenic properties [1]. Although regulatory efforts have significantly reduced smoking in public spaces, private households persist as largely uncontrolled environments where tobacco smoke exposure continues to pose substantial health risks. Given this scenario, characterizing and predicting tobacco smoke presence indoors through advanced methodologies becomes essential for informing public health policies and intervention strategies.
However, IAQ and its associated parameters exhibit complex relationships with high temporal and spatial variability. Accurately predicting how tobacco emissions affect IAQ requires tools that extend beyond deterministic approaches. In this context, machine learning models have demonstrated high reliability in identifying intricate patterns within environmental data. Liang et al. (2020) [2] showed the effectiveness of various machine learning techniques, including Support Vector Machines (SVM), Random Forest, Adaptive Boosting, and linear regression, highlighting that ensemble methods such as Stacking and AdaBoost consistently outperform simpler models in terms of predictive accuracy for hourly AQI forecasts (measured by R² and RMSE). Similarly, Fan et al. (2024) [3] utilized the XGBoost algorithm to enhance predictions of atmospheric pollutants (NO2, SO2, O3, PM2.5), achieving correlation coefficients above R = 0.95 when trained on observational data. Their study also emphasized XGBoost’s capability to accurately identify and correct biases inherent in numerical atmospheric simulation models.
Within this framework, recent efforts from our research group have contributed significant advancements in IAQ monitoring, particularly through studies involving vulnerable populations. Gómez-López et al. [4] developed an innovative protocol aimed at enhancing the management of multimorbid patients with chronic obstructive pulmonary disease (COPD) and severe asthma by integrating continuous IAQ monitoring with advanced clinical interventions and digital health solutions. As part of this ongoing clinical research, special emphasis is being placed on investigating the potential of indoor environmental data to predict exacerbation episodes in respiratory patients. Tobacco smoke exposure is one of the scenarios being explored initially, alongside broader efforts to assess how household IAQ influences patient outcomes. This includes the development of predictive algorithms to identify high-risk pollution events and to characterize additional indoor pollutant sources relevant to disease progression. In parallel, Aguado et al. [5] conducted a rigorous verification study of low-cost IAQ monitoring tools, highlighting their usability for health-related applications despite certain limitations in accuracy. Together, these prior studies lay the groundwork for broader data-driven strategies aimed at reducing preventable exposures and improving health outcomes through environmental intelligence embedded in patient-centred care models.
Building upon these insights, this study performs a comparative analysis of five machine learning algorithms (Generalized Linear Models (GLM), SVM, Artificial Neural Networks (ANN), Random Forest, and XGBoost) to identify the model that best balances predictive accuracy with computational efficiency in detecting smoker presence within households. Once identified, the selected model is used to examine the specific influence of smoking habits on individual air quality parameters, focusing on differences in mean concentrations and temporal fluctuations associated with cigarette consumption. Additionally, instances in which the model produces significant errors are critically examined to interpret limitations and further enhance its predictive performance. This comprehensive evaluation aims to validate the model’s practical applicability, facilitating the implementation of real-time alert systems or environmental diagnostics within residential settings.
In summary, the overarching goal of this research is not only to reliably detect the presence of smokers in domestic environments using advanced machine learning approaches, but also to deepen our understanding of how tobacco-related activities impact IAQ parameters. Ultimately, these efforts aim to promote healthier living environments through accessible, technology-driven monitoring solutions.
This research contributes to the objectives of the K-HEALTHinAIR project (https://k-healthinair.eu/, accessed on 30 September 2025), which aims to enhance our understanding of IAQ determinants and develop practical solutions to reduce exposure to harmful pollutants in indoor environments across Europe.
2. Materials and Methods
2.1. Study Design and Setting
This research is an observational, retrospective analysis for predicting the presence of smokers based on IAQ sensor data from households located in Austria and Spain. Data were collected every 5 min per household in Austria and every 10 min per household in Spain. The study comprises two datasets covering the periods from 10 March 2023 to 9 March 2025 for Austria, and from 13 October 2023 to 9 March 2025 for Spain (Figure 1).
2.2. Participants
All households in the study participate in the K-HEALTHinAIR project, which involves continuous IAQ monitoring. Households were classified as smoking or non-smoking based on self-reported information and follow-up verification.
All participants were required to sign an informed consent form prior to any monitoring or data collection procedures. The study protocol was approved by the Ethical Committee for Human Research at the Hospital Clínic de Barcelona on 29 June 2023 (HCB/2023/0126), and registered at ClinicalTrials.gov (Identifier: NCT06421402). The design followed the principles of the Declaration of Helsinki and the Spanish Biomedical Research Act 14/2007, ensuring the collection of only the data strictly necessary for the research objectives. Participants were informed of their right to withdraw from the study at any time without any consequences for their medical care or relationship with healthcare providers.
2.3. Variables and Endpoints
Household ID served as a grouping variable. The primary endpoint was the binary classification of households into smoking or non-smoking based on environmental parameters. Predictor variables included the following:
PM2.5 (µg/m³);
CO2 (ppm);
tVOCs (ppb);
Temperature (°C);
Relative humidity (%RH).
2.4. Data Sources and Measurements
Each household was equipped with a multi-sensor device capable of continuously recording several IAQ parameters, including temperature (T), relative humidity (RH), carbon dioxide (CO2), formaldehyde (FA), total volatile organic compounds (TVOCs), and particulate matter (PM1, PM2.5, PM10). The sensors used were InBiot MICA PLUS (https://www.inbiot.es/products/mica-devices/mica-plus; Mutilva, Navarra, Spain; accessed on 27 May 2025) in Spain and Kaiterra Sensedge Mini (https://www.kaiterra.com/sensedge-mini-indoor-air-quality-monitor; Crans Montana, Switzerland; accessed on 27 May 2025) in Austria, which combine electrochemical sensors (for VOCs and FA), a non-dispersive infrared (NDIR) sensor for CO2, and optical laser scattering for particulate matter.
The devices were factory-calibrated, and additional spot checks were performed before deployment to ensure consistency between units. All data were obtained using standardized, calibrated IAQ sensors [5] deployed in each home. Devices recorded the above variables with time stamps and unique household codes (e.g., “HOMXXXXX”). The assessment methods were uniform across sites, with comparable sensor models and calibration protocols used in both countries. Data were logged at five-minute (Austria) or ten-minute (Spain) intervals, as described in Section 2.1, and aggregated for analysis. The sensors were installed in central locations within the main living areas of each home, away from direct heat sources or ventilation outlets, following a harmonized protocol to ensure comparability.
While absolute values may vary slightly due to sensor characteristics, the modelling approach focuses on relative changes and patterns within each household. This enables replication of the method in other settings where similar low-cost IAQ monitoring devices are available, supporting the scalability of the proposed approach.
2.5. Data Pre-Processing
All incomplete rows (those containing at least one missing value) were removed. This eliminated 21 and 21,476 observations for Austria and Spain, respectively, which together represent only 0.24% of the total data.
Binary categorical variables were numerically encoded (e.g., HOM_smoking = 1 for smoker households, 0 for non-smokers). Special attention was given to the variability in PM2.5 levels across households, as smoking releases significant quantities of particulate matter [6], causing sharp increases in sensor readings. To quantify this variability meaningfully, a new variable, “PM25_centered”, was created by grouping PM2.5 values by household and subtracting the respective household mean. This approach removes differences in absolute scale and instead highlights intra-household variance, which is more indicative of smoking activity.
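The household-wise centering described above can be illustrated with a minimal sketch. It is written in Python for illustration only and is not the authors' R code; the field names `household` and `pm25` and the sample readings are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def center_pm25_by_household(rows):
    """Return copies of the rows with an added 'PM25_centered' field.

    Each PM2.5 reading has the mean of its own household subtracted,
    so the new variable reflects within-household deviations only.
    """
    by_home = defaultdict(list)
    for row in rows:
        by_home[row["household"]].append(row["pm25"])
    home_mean = {home: mean(values) for home, values in by_home.items()}
    return [
        {**row, "PM25_centered": row["pm25"] - home_mean[row["household"]]}
        for row in rows
    ]


# Hypothetical readings (µg/m³) for two homes with different baselines.
readings = [
    {"household": "HOM_A", "pm25": 10.0},
    {"household": "HOM_A", "pm25": 90.0},
    {"household": "HOM_B", "pm25": 8.0},
    {"household": "HOM_B", "pm25": 12.0},
]
centered = center_pm25_by_household(readings)
```

In this toy example the centered values for both homes are distributed around zero, while the much larger spread in HOM_A (±40 against ±2) is preserved, which is the property the variable is meant to expose.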
Finally, all features were scaled to improve comparability and model performance.
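The text does not state which scaling method was applied. As an illustration only, the sketch below assumes z-score standardization (zero mean, unit variance), a common choice for SVM and neural network inputs. It is written in Python rather than the authors' R, and the CO2 values are hypothetical.

```python
from statistics import mean, pstdev


def standardize(values):
    """Z-score a feature: subtract the mean and divide by the standard deviation.

    The population standard deviation is used here as an assumption;
    a constant feature is mapped to zeros to avoid division by zero.
    """
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical CO2 readings in ppm.
co2_ppm = [600.0, 700.0, 800.0]
scaled = standardize(co2_ppm)
```

In a train/test workflow the mean and standard deviation would normally be estimated on the training partition only and then reused for the test data, so that no information leaks from the test set.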
The Supplementary Materials include the full R code used for data preprocessing, feature engineering, and model development, along with the final trained model object. The provided scripts (‘Fumadores.R’ and ‘Depurado.R’) perform data cleaning, integration of indoor air quality measurements with participant metadata, and creation of derived variables such as PM2.5-centred values and smoking classification. The pseudocode document outlines the overall analytical pipeline, including data merging, cross-validation procedures, and training of five classification models (GLM, SVM, ANN, RF, XGBoost). The final trained SVM model is provided as an ‘.RData’ object for reproducibility and future testing.
2.6. Sample Size Justification
The study uses a convenience sample based on the number of homes enrolled in the broader K-HEALTHinAIR project. Given the high frequency of measurements (5–10 min) and long monitoring period, the sample size was considered sufficient for robust training and validation of machine learning models.
2.7. Quantitative Variables and Statistical Methods
With the exception of HOM_smoking, all variables were treated as continuous predictors and no categorization was applied [7]. The five supervised machine learning models chosen were: Generalized Linear Models (GLM), Support Vector Machines (SVM), Artificial Neural Networks (ANN), Random Forest (RF), and Extreme Gradient Boosting (XGBoost) [8].
To train and evaluate these models, random subsets of the dataset were used depending on the intended training size. For performance comparison across algorithms, a random sample of one million records was extracted from the testing partition. All models were trained and evaluated using the same training and test partitions to ensure comparability across algorithms and dataset sizes. In the final model, one full month of data was used for training, while the remaining data were reserved for internal validation. A second, temporally independent dataset was used later for external validation.
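The one-month training window can be sketched as a simple time-based split. The text does not specify which month was used or how a month was delimited, so the sketch assumes the first 30 days after the earliest timestamp. The function name, the `timestamp` field and the sample records are hypothetical, and the code is Python rather than the authors' R.

```python
from datetime import datetime, timedelta


def split_first_month(rows, days=30):
    """Split records into a training window and the remainder.

    Assumption: the training month is the first `days` days after the
    earliest timestamp; later records are kept for internal validation.
    """
    start = min(row["timestamp"] for row in rows)
    cutoff = start + timedelta(days=days)
    train = [row for row in rows if row["timestamp"] < cutoff]
    rest = [row for row in rows if row["timestamp"] >= cutoff]
    return train, rest


# Hypothetical records spanning several months.
records = [
    {"timestamp": datetime(2023, 10, 13), "pm25": 12.0},
    {"timestamp": datetime(2023, 11, 1), "pm25": 55.0},
    {"timestamp": datetime(2023, 11, 20), "pm25": 9.0},
    {"timestamp": datetime(2024, 1, 5), "pm25": 70.0},
]
train_rows, validation_rows = split_first_month(records)
```

Splitting on time rather than at random keeps the validation data strictly later than the training data, which is closer to how a deployed model would be used.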
To distinguish the performance of the classification algorithms used in this study, a range of metrics was applied to capture both overall predictive ability and robustness to class imbalance, a particularly relevant factor in this context. Key metrics included AUC-ROC, which assesses the model’s ability to distinguish between classes without relying on a fixed threshold, and accuracy, which was complemented by household-level accuracy to reflect real-world applicability. To better capture performance under imbalance, the evaluation also included the F1-score, which balances precision and recall, and the Matthews Correlation Coefficient (MCC), a metric known for its reliability in imbalanced binary classification tasks. Together, these measures provided a comprehensive assessment of model effectiveness and practical relevance.
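The threshold-dependent metrics named above all derive from the four cells of the confusion matrix. The following Python sketch uses the standard textbook definitions; it is not taken from the authors' R scripts, and the labels and predictions are made up to mimic an imbalanced sample.

```python
from math import sqrt


def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = smoker)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def classification_metrics(y_true, y_pred):
    """Accuracy, balanced accuracy, F1-score and MCC from hard predictions."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * sensitivity / (precision + sensitivity)
          if precision + sensitivity else 0.0)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "f1": f1,
        "mcc": mcc,
    }


# Hypothetical imbalanced sample: 3 smoking and 7 non-smoking observations.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]
metrics = classification_metrics(y_true, y_pred)
```

In this example accuracy is 0.80 while the F1-score (about 0.67) and MCC (about 0.52) are lower, which shows how the imbalance-aware metrics penalize errors on the minority class that plain accuracy hides.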
All analyses and model evaluations were performed in R using RStudio (version 2024.04.2+764) [9,10].
2.8. Ethical Statement
The core study protocol for K-HEALTHinAIR was approved by the Ethical Committee for Human Research at the Hospital Clínic de Barcelona on 29 June 2023 (Reference: HCB/2023/0126), and is registered at ClinicalTrials.gov (Identifier: NCT06421402). All procedures were carried out in accordance with the Declaration of Helsinki and national regulations (Biomedical Research Act 14/2007 of 3 July). The study design follows the principle of data minimisation, collecting only the information strictly necessary for the research. All participants provided signed informed consent before any procedures were initiated, and were informed that they could withdraw from the study at any time without any consequence for their medical treatment or relationship with healthcare providers.
Although the present work aims to detect smoking behaviour in households based on environmental sensor data, it does not seek to judge personal habits or enforce behavioural changes. Rather, this approach is intended to serve as a methodological framework for identifying sources of indoor air pollution that may be relevant for public health, particularly in vulnerable populations. Smoking is addressed here as an initial and illustrative case of such a source, but the underlying modelling approach can be extended to other indoor emission activities with health implications.
3. Results
3.1. Sample Description
The distribution of homes and observations (Table 1) between Austria and Spain differs notably in sample size but not in the prevalence of smoking households: Spain contributed more homes overall (100 vs. 29), whereas the proportion of smoker households was similar, at approximately 31% in Austria (9 out of 29) and 24% in Spain (24 out of 100). This balanced representation provides a useful cross-country comparison and supports the model’s ability to generalize across different geographical and cultural contexts. Additionally, the near-identical end dates and long data collection periods ensure consistency in temporal coverage, enhancing the robustness of both training and validation phases.
To complement the general description of the sample, a comparative analysis of IAQ parameters was conducted between smoking and non-smoking households in both Spain and Austria. This analysis focused on seasonal mean values and standard deviations for each environmental parameter recorded by the IAQ sensors, offering a more detailed characterization of pollution patterns and household behaviour. The results are summarized in Table 2 and Table 3.
In Spain (Table 2), clear differences were observed between smoking and non-smoking households, particularly in PM2.5 and PM1 concentrations, where smoking households exhibited consistently elevated mean values across all seasons. For example, PM2.5 in winter reached an average of 85 µg/m³ in smoking homes, compared to 11 µg/m³ in non-smoking homes. These differences were accompanied by substantially greater standard deviations, suggesting increased temporal variability likely associated with smoking events.
CO2 levels, in contrast, were slightly lower in smoking households across most seasons, which may reflect behavioural or ventilation differences. For instance, in autumn, average CO2 concentrations were 683 ppm in non-smoking homes and 662 ppm in smoking homes. This trend could indicate that smokers tend to ventilate more during or after smoking activities, either intentionally or as a consequence of IAQ discomfort. However, this observation requires cautious interpretation, as differences are small and could also be influenced by occupancy or building characteristics.
Temperature and humidity showed minimal differences, indicating that these parameters are less influenced by smoking behaviour and may act as secondary contextual variables rather than primary predictors.
In Austria (Table 3), similar trends were observed, though the contrast was slightly less pronounced. Smoking households showed higher PM2.5 values, especially in spring and winter, with PM2.5 levels reaching 51 µg/m³ in spring compared to 9 µg/m³ in non-smoking homes. Regarding CO2, smoking households tended to present slightly elevated values throughout the year (for example, 741 ppm in autumn versus 697 ppm in non-smoking households), which may point to different ventilation patterns, possibly associated with outdoor temperature or cultural factors.
Overall, these descriptive analyses reinforce the findings of the machine learning model by highlighting distinct IAQ signatures associated with smoking behaviour. The large intra-seasonal variability in pollutant levels, especially in PM-related metrics, supports the decision to include both mean-adjusted and centred variables in the modelling approach, as detailed in Section 2.5.
3.2. PM2.5 Profile Analysis by Classification Outcome
To better understand the behavior of the machine learning model and the features that drive its predictions, we performed a visual inspection of the temporal evolution of PM2.5 concentrations in selected households. Although the final classification model incorporated multiple environmental variables (PM2.5, PM1, CO2, temperature, humidity, and tVOC), PM2.5 and PM1 consistently emerged as the most influential features in both the feature importance analysis (see Section 3.4) and the exploratory comparisons between smoking and non-smoking households (Section 3.1).
This analysis focuses on PM2.5, given its particularly strong discriminative power and interpretability as a marker of smoking-related pollution. We selected four representative households, one for each of the following classification outcomes:
▪ True Positive (TP): a smoking household correctly classified as such.
▪ True Negative (TN): a non-smoking household correctly classified as non-smoker.
▪ False Negative (FN): a smoking household misclassified as non-smoker.
▪ False Positive (FP): a non-smoking household misclassified as smoker.
For each case, we extracted PM2.5 data over a one-month period and plotted the daily temporal profile. The selected households reflect realistic environmental and behavioral conditions during the monitoring campaign.
The visual profiles shown in Figure 2 reveal distinct differences that help to contextualize the model’s predictions (panels ordered from top to bottom):
▪ Figure 2a (TP): Clear and frequent PM2.5 peaks throughout the month, consistent with indoor smoking episodes.
▪ Figure 2b (TN): A very low and stable PM2.5 signal, with only one or two isolated minor peaks, strongly aligned with a non-smoking pattern.
▪ Figure 2c (FN): Consistently low PM2.5 levels despite the household being labelled as smoking. This may be due to low-frequency smoking, smoking in well-ventilated areas, or distance between the source and the sensor, resulting in minimal detection.
▪ Figure 2d (FP): Although initially considered as a non-smoking home, the household was later confirmed by the field nurse to present signs of smoking, suggesting that the model correctly identified a misclassified home. The PM2.5 pattern shows high peaks concentrated in the first half of the month, followed by a sharp decrease, possibly indicating a cessation or change in behaviour.
These examples highlight both the strengths and limitations of the model. On one hand, it can detect exposure patterns that contradict self-reported data, adding value in real-world monitoring and public health surveillance. On the other hand, its performance may be constrained by the physical context of smoking behaviours, such as source location, intensity, or household ventilation strategies, which can limit the sensor’s ability to detect pollution events. These individual case analyses provide qualitative support for the modelling approach described next, where algorithmic performance is evaluated systematically across a range of training sizes.
3.3. Modelling Performance
Table 4 summarizes the results for GLM, SVM, ANN, Random Forest, and XGBoost at four training sizes: 100, 1000, 10,000, and 100,000 observations. Each configuration was evaluated with seven performance metrics: computational time, AUC-ROC (with optimal threshold), accuracy, balanced accuracy, household-level accuracy, F1-score, and the Matthews Correlation Coefficient (MCC). For every experiment, 100 of the 129 households were randomly selected. From each selected household, 1, 10, 100, or 1000 observations were drawn, producing the four training sizes above. A 5-fold cross-validation was applied, with the observations re-randomized at each fold, and all data from the remaining 29 homes were used to test the algorithms.
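The household-based sampling scheme can be sketched as follows. This is a Python illustration, not the authors' R code. The household counts (129, 100, 29) come from the text, whereas the function name, the field names and the synthetic observations are hypothetical.

```python
import random


def household_split(observations, n_train_homes, n_per_home, seed=0):
    """Pick training homes at random and draw a fixed number of rows from each.

    All rows from the homes that were not selected form the test set,
    so no household contributes to both partitions.
    """
    rng = random.Random(seed)
    homes = sorted({obs["household"] for obs in observations})
    train_homes = set(rng.sample(homes, n_train_homes))

    by_home = {}
    for obs in observations:
        by_home.setdefault(obs["household"], []).append(obs)

    train = []
    for home in sorted(train_homes):
        rows = by_home[home]
        train.extend(rng.sample(rows, min(n_per_home, len(rows))))
    test = [obs for obs in observations if obs["household"] not in train_homes]
    return train, test


# Synthetic data: 129 homes with 20 readings each (values are placeholders).
observations = [
    {"household": f"HOM{h:03d}", "pm25": float(i)}
    for h in range(129)
    for i in range(20)
]
train, test = household_split(observations, n_train_homes=100, n_per_home=10, seed=1)
```

With 100 homes and 10 rows per home the training set has 1000 observations, matching the second training size in Table 4. Keeping whole households out of training is what makes the test estimate reflect performance on unseen homes rather than on unseen time points of known homes.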
As shown, Support Vector Machines (SVM) consistently demonstrate the strongest predictive capacity across dataset sizes, particularly at medium and large training sets (10,000 records), where the algorithm achieves the highest overall accuracy (91.14%), balanced accuracy (86.5%), and competitive F1-scores and MCC values. Although the computational cost of SVM grows considerably with very large datasets (100,000 samples), its predictive performance remains robust, highlighting its suitability for household-level smoking detection.
The advantage of SVM in this context likely arises from its ability to effectively separate classes in high-dimensional feature spaces. Smoking households can present subtle and non-linear patterns in particulate matter, CO2 concentration, and indoor temperature fluctuations. These variables often overlap between smoking and non-smoking homes, but SVM can identify optimal decision boundaries that maximize the distinction between groups, even when differences are small. This capacity to capture fine-grained, complex relationships explains why SVM outperforms other algorithms in predicting smoking presence. While methods such as GLM and RF offer good baseline performance with low computational demands, and ANNs can model non-linearities effectively, SVM achieves a better balance between precision and generalizability across conditions.
Given the temporal resolution of the dataset (5–10 min intervals across 129 households) and the environmental variability inherent to indoor air measurements, SVM emerges as the most appropriate algorithm for this study’s objectives. Its superior classification performance across multiple evaluation metrics provides a solid foundation for its selection in the final model deployment and further detailed analysis, presented in Section 3.4.
3.4. Robustness Assessment of the Final Model
Following the large-scale exploration presented in Section 3.3, an additional evaluation was conducted to assess the robustness of the selected Support Vector Machine (SVM) model. The aim was to verify its ability to consistently discriminate between smoking and non-smoking households both at the level of individual measurements and at the aggregated household level (i.e., averaging predictions across all observations within each home). To this end, we applied a stratified Monte Carlo cross-validation procedure, ensuring proportional representation of smoking and non-smoking households in each iteration.
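A stratified Monte Carlo split at the household level, together with household-level aggregation by averaging, can be sketched as below. The 20% test fraction, the function names and the household identifiers are assumptions made for illustration. The totals of 33 smoking and 96 non-smoking homes follow from Section 3.1, and the code is Python rather than the authors' R.

```python
import random


def stratified_household_split(labels, test_fraction, rng):
    """Split households into train/test, preserving the smoker proportion.

    `labels` maps household id -> 1 (smoking) or 0 (non-smoking).
    """
    train, test = [], []
    for cls in (0, 1):
        homes = sorted(h for h, y in labels.items() if y == cls)
        rng.shuffle(homes)
        n_test = max(1, round(len(homes) * test_fraction))
        test.extend(homes[:n_test])
        train.extend(homes[n_test:])
    return train, test


def household_scores(households, scores):
    """Average observation-level scores within each household."""
    sums = {}
    for home, score in zip(households, scores):
        total, count = sums.get(home, (0.0, 0))
        sums[home] = (total + score, count + 1)
    return {home: total / count for home, (total, count) in sums.items()}


# Hypothetical ids: the first 33 homes are labelled as smoking, the other 96 are not.
labels = {f"HOM{i:03d}": 1 if i < 33 else 0 for i in range(129)}
rng = random.Random(42)
splits = [stratified_household_split(labels, 0.2, rng) for _ in range(100)]

# Aggregating three observation-level scores into two household-level scores.
per_home = household_scores(["A", "A", "B"], [0.2, 0.6, 0.9])
```

Each of the 100 repetitions draws a fresh random split, but every test set contains the same number of smoking and non-smoking homes, so the class balance seen by the model does not drift between repetitions.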
Across 100 repetitions, the model achieved the following average performance at the observation level:
AUC-ROC: 0.83 (95% CI: 0.81–0.86);
Accuracy: 0.88 (95% CI: 0.87–0.90);
Balanced Accuracy: 0.78 (95% CI: 0.75–0.80);
F1-score: 0.69 (95% CI: 0.65–0.74);
Matthews Correlation Coefficient (MCC): 0.67 (95% CI: 0.63–0.71).
At the household level, where predictions were aggregated per home, performance remained comparable:
AUC-ROC: 0.84 (95% CI: 0.81–0.86);
Accuracy: 0.89 (95% CI: 0.83–0.90);
Balanced Accuracy: 0.78 (95% CI: 0.75–0.80);
F1-score: 0.70 (95% CI: 0.66–0.74);
MCC: 0.67 (95% CI: 0.63–0.71).
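The text does not state how the averages and 95% intervals above were derived from the 100 repetitions. One common option, shown here purely as an assumption, is the mean together with the 2.5th and 97.5th percentiles of the per-repetition values. The Python sketch uses hypothetical inputs and is not the authors' R code.

```python
from statistics import mean


def percentile(sorted_values, q):
    """Percentile by linear interpolation, for q between 0 and 1."""
    pos = q * (len(sorted_values) - 1)
    lower = int(pos)
    upper = min(lower + 1, len(sorted_values) - 1)
    frac = pos - lower
    return sorted_values[lower] * (1 - frac) + sorted_values[upper] * frac


def summarize(metric_values):
    """Return (mean, lower bound, upper bound) of a 95% percentile interval."""
    ordered = sorted(metric_values)
    return mean(ordered), percentile(ordered, 0.025), percentile(ordered, 0.975)


# Hypothetical per-repetition values of one metric.
toy_values = [0.0, 1.0, 2.0, 3.0, 4.0]
avg, low, high = summarize(toy_values)
```

A percentile interval describes the spread of the metric across random splits. A normal-approximation interval on the mean would be narrower and would answer a different question, so the choice should be reported alongside the numbers.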
These findings indicate that the SVM model delivers stable discrimination between smoking and non-smoking households when evaluated across multiple random splits, and that this consistency holds both for raw sensor measurements and for aggregated household-level predictions.
Figure 3 presents the ROC curves for the first 10 iterations of the stratified Monte Carlo validation, illustrating variability across runs while supporting the overall robustness of the model.
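For readers unfamiliar with how a ROC curve and its AUC are obtained from model scores, a minimal Python sketch follows. The threshold criterion shown (Youden's J, i.e., the threshold maximizing TPR minus FPR) is an assumption, since the text mentions an optimal threshold without naming the criterion, and the scores are made up.

```python
def roc_points(y_true, scores):
    """Return (threshold, FPR, TPR) triples, sweeping from the highest score down."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = [(float("inf"), 0.0, 0.0)]
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 0)
        points.append((thr, fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (_, x0, y0), (_, x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


def youden_threshold(points):
    """Threshold maximizing TPR - FPR (the first one is kept in case of ties)."""
    return max(points[1:], key=lambda p: p[2] - p[1])[0]


# Hypothetical labels (1 = smoking) and classifier scores.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
points = roc_points(y_true, scores)
area = auc(points)
best_threshold = youden_threshold(points)
```

In this toy case the AUC is 0.75: one of the four smoker/non-smoker pairs is ranked in the wrong order (the non-smoking observation scored 0.4 against the smoking observation scored 0.35).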
The analysis of feature importance (Table 5) revealed that PM2.5 and PM2.5 centered were by far the most influential variables in the model, consistent with their role as direct indicators of indoor smoking. tVOCs, while much less influential, provided complementary information, likely reflecting additional emission sources. Temperature, humidity, and CO2 showed even smaller contributions, but their consistent presence suggests they provide relevant contextual information that enhances the model’s ability to make accurate classifications.
These results confirm the reliability and robustness of the SVM model in identifying smoking households based on sensor data collected at high frequency.
4. Discussion and Conclusions
The results presented in this study demonstrate the feasibility of detecting the presence of smokers in households with high accuracy using machine learning techniques applied to indoor air quality (IAQ) data. This work adds to the limited body of research exploring smoking behavior detection in real-world domestic environments by leveraging high-resolution, long-term sensor data across diverse household settings. The final SVM model, trained on 10,000 observations, achieved a classification accuracy of around 83% at the household level, and retained reasonable performance under temporal shift conditions during external validation. The comparative analysis of different machine learning models revealed that both SVM and XGBoost outperformed other approaches such as GLM, ANN, and Random Forest, with SVM achieving the best overall classification accuracy at approximately 83%. XGBoost also showed strong results, particularly in terms of robustness to variable reduction and high SHAP-derived feature interpretability. Notably, the model performance remained relatively stable even under reduced feature sets and threshold shifts, which suggests good generalization capability and resilience to input noise, both key aspects for practical deployment in heterogeneous household environments. These patterns underscore the feasibility of implementing such models in dynamic, real-world conditions where sensor reliability or completeness may vary.
While the SVM and XGBoost models outperformed other classifiers in this study, further analysis is required to fully understand the factors contributing to their superior performance. Possible explanations include their robustness to multicollinearity, ability to model non-linear relationships, and resilience to high-dimensional feature spaces. Additional evaluation of feature interactions and model behavior will be explored in future work.
These findings support the growing potential of data-driven approaches to complement—or even replace—traditional exposure identification methods. In contrast to manual inspections, periodic surveys, or passive sampling, machine learning models can process continuous and high-frequency sensor data to identify behavioral patterns such as smoking, which have clear and well-documented health implications. This capability opens the door to real-time alerts, personalized exposure monitoring, and integration into digital health platforms targeting at-risk populations.
Beyond the scope of smoking detection, the structure and logic of the proposed methodology are readily transferable to other indoor pollution sources. Events such as cooking, cleaning product use, incense or candle burning, and inadequate ventilation can all generate spikes in particulate matter or VOCs that resemble smoking signatures in isolation. With suitable data labeling or unsupervised learning extensions, similar models could be developed to classify or cluster such events, enabling a more granular understanding of indoor pollution dynamics. Future implementations could explore multiclass classification schemes or anomaly detection techniques to identify atypical IAQ patterns without relying solely on predefined categories.
Another important implication of this work lies in the field of exposure assessment. Rather than relying on static thresholds or averaged concentrations, probabilistic outputs from machine learning models allow for a more nuanced and dynamic evaluation of exposure risk. These models can estimate not only whether a person is exposed, but also how likely it is that a persistent pollutant source is present, how often it occurs, and under which contextual conditions. This level of granularity could be instrumental for developing personalized exposure profiles, improving clinical follow-up of patients with chronic respiratory diseases, or informing public health interventions. Furthermore, these probabilistic assessments could support policy design by identifying households or behaviors with consistently elevated indoor pollution profiles.
This line of research aligns closely with the objectives of the K-HEALTHinAIR project, which seeks to identify actionable IAQ determinants across European indoor environments. In this context, the ability to automatically detect pollution sources, characterize exposure episodes, and evaluate their relevance to health outcomes forms a foundational component of environmental intelligence systems. The integration of such approaches into building diagnostics, occupant feedback loops, or health-related decision-making tools could represent a significant step forward in preventive indoor environmental health.
Finally, the study highlights key operational and methodological considerations. Sensor placement, room use patterns, and ventilation behavior all influence detection performance—especially in borderline or ambiguous cases. As such, future work should consider combining IAQ data with occupancy tracking, contextual metadata, or even physiological monitoring to better interpret complex exposure profiles. Additionally, expanding the dataset to include labeled data from other pollutant sources would enable broader generalization and improved source attribution.
By integrating machine learning with longitudinal IAQ monitoring, this study demonstrates a scalable, interpretable, and robust approach for detecting hidden behaviors that impact IAQ and public health. The findings pave the way for smarter, more responsive indoor pollution detection systems—capable of supporting both individual care and broader environmental policy objectives.
Beyond its technical performance, the proposed modeling approach offers a versatile framework that could be adapted to detect other relevant sources of indoor pollution, such as cooking, cleaning products, or incense burning. These extensions would allow for broader classification capabilities and more comprehensive exposure profiles. At the same time, the potential to infer personal behaviors from environmental data raises important ethical considerations. It is essential that future applications ensure transparency, informed consent, and data protection, especially when deployed in sensitive contexts such as households, schools, or healthcare settings. As emphasized in the European Group on Ethics in Science and New Technologies (EGE) guidelines [11], data-driven tools must be designed not only for technical accuracy but also with respect for dignity, autonomy, and fairness. Integrating ethical foresight into the development and deployment of IAQ-based behavioral models will be crucial to ensure they serve both scientific advancement and societal good.
The methodological framework proposed in this study can be reproduced in other residential settings using similar low-cost indoor air quality sensors. As the analysis focuses on relative patterns and correlations rather than on absolute pollutant concentrations, the approach remains robust across slight sensor-to-sensor variability. Moreover, the installation protocol and data processing steps are simple and adaptable, making the method suitable for large-scale implementations or citizen science initiatives aimed at characterizing indoor pollution sources.
While the main objective of this work was to develop a methodological approach to detect indoor smoking based on low-cost sensor data, the public health relevance of such detection lies in its potential to protect cohabitants, especially vulnerable individuals such as children, elderly people, or patients with chronic respiratory diseases, from second-hand smoke exposure. Numerous studies have established the health risks of passive smoking in domestic environments, including increased incidence of asthma, COPD exacerbations, and cardiovascular issues. For instance, Jordan et al. (2011) reported a significant dose–response relationship between household passive smoking and COPD risk, with nearly double the odds of clinically significant COPD among never-smokers exposed for over 20 h per week [12]. By identifying indoor smoking through indirect measurements, this approach may contribute to raising awareness and encouraging preventive actions without the need for intrusive surveillance.
Although regulatory interventions in private households are ethically and legally limited, the insights provided by this type of modelling can support the development of targeted educational campaigns, voluntary mitigation strategies, or warning systems for sensitive groups. Furthermore, the methodology can be extended to other sources of indoor air pollution with public health implications, such as the use of wood stoves, cleaning products, or indoor combustion activities.