Article

Leveraging Low-Cost Sensor Data and Predictive Modelling for IoT-Driven Indoor Air Quality Monitoring

by Patricia Camacho-Magriñán 1,2, Diego Sales-Lerida 1,2, Alejandro Lara-Doña 1,2 and Daniel Sanchez-Morillo 1,2,*
1 Department of Automation Engineering, Electronics and Computer Architecture and Networks, School of Engineering, University of Cadiz, 11519 Cádiz, Spain
2 Instituto de Investigación e Innovación Biomédica de Cádiz (INiBICA), 11009 Cádiz, Spain
* Author to whom correspondence should be addressed.
Smart Cities 2025, 8(6), 200; https://doi.org/10.3390/smartcities8060200
Submission received: 5 November 2025 / Revised: 25 November 2025 / Accepted: 27 November 2025 / Published: 28 November 2025

Highlights

What are the main findings?
  • A novel IoT-based indoor air quality monitoring unit has been designed and implemented following an open hardware approach.
  • The field test demonstrated that high-concentration, short-duration pollutant events can be overlooked by traditional 24-h averaging.
  • Predictive modelling approaches using data from low-cost IoT sensors can successfully identify, quantify, and predict short-term pollutant peaks in real-time.
What are the implications of the main findings?
  • Predictability was highly context-dependent. IAQ assessments should shift to event-based exposure metrics to more accurately evaluate health risks in residential settings.
  • This methodology enables the development of smart IoT-driven responses, such as automated ventilation control or real-time alerts, to actively reduce occupant exposure.

Abstract

Indoor air quality (IAQ) in residential settings is often dominated by high-concentration pollutant events from activities such as cooking and occupancy, which are overlooked by traditional 24 h average assessments. In this work, we designed and implemented a low-cost unit for remote IAQ monitoring. We deployed these units for high-resolution remote monitoring of CO2, particulate matter (PM), and volatile organic compounds (VOCs) in three different domestic environments: a kitchen, a living room, and a bedroom. The monitoring campaign confirmed that, while daily averages frequently remained below guideline limits, transient peaks (e.g., CO2 exceeding 2800 ppm in bedrooms and significant increases in PM during cooking) posed acute exposure risks. This dataset was used to train and evaluate machine learning models for 10 min ahead pollutant forecasting. Ensemble tree-based methods (Random Forest) and gradient boosting algorithms (XGBoost, LGBM, and CatBoost) were effective and robust. The predictability of the models correlated with room dynamics: performance improved under clear cyclical patterns (bedroom) and remained stable under stochastic events (kitchen). This work shows that integrating low-cost IoT sensing with machine learning enables proactive IAQ management, supporting health interventions driven by predictive risk rather than static averages.

1. Introduction

Air quality is a crucial factor in public health and human well-being, especially in urban environments where concentrations of atmospheric pollutants have reached alarming levels. The World Health Organization (WHO) states that breathing good-quality air daily is a fundamental right for everyone. However, almost 99% of the global population [1] breathes air that exceeds WHO guideline limits and contains high levels of pollutants, with low- and middle-income countries experiencing the highest exposures [2].
Although extensive research has focused on outdoor air pollution, the importance of indoor air quality (IAQ) has been increasing, as the majority of people spend up to 90% of their time in environments such as homes, residences, offices, schools, and shopping centers [3].
Indoor environments function as complex, ever-changing systems in which air quality is shaped by numerous influences, such as the infiltration of outdoor air, the characteristics of building materials, and everyday human activities. Extended exposure to indoor air pollutants can lead to adverse health effects. The effects can range from minor symptoms such as irritation of the eyes, nose, and throat, to chronic respiratory and cardiovascular diseases, including cancer [4,5]. In particular, recent studies underscore the significant impact of IAQ on respiratory health [2,3,4]. Epidemiological research has consistently shown that exposure to some indoor air pollutants, such as fine particulate matter (PM2.5), nitrogen dioxide (NO2), and ozone (O3), is associated with a higher incidence and severity of respiratory diseases such as chronic obstructive pulmonary disease (COPD) and asthma [5].
Factors such as inadequate ventilation, the use of contaminated building materials, and the presence of sources of internal pollution, such as cleaning products or heating systems, contribute to the accumulation of pollutants in indoor spaces [6]. Consequently, continuous monitoring of indoor environments is imperative to mitigate exposure to harmful pollutants.
IAQ assessment relies heavily on real-time monitoring technologies, particularly environmental sensors capable of continuously measuring key parameters. These include common indoor pollutants such as particulate matter of various sizes (PM1, PM2.5, PM10), ozone (O3), volatile organic compounds (VOCs), sulfur dioxide (SO2), carbon dioxide (CO2), and carbon monoxide (CO) [7]. The data generated by these systems are crucial for quantifying pollution levels, evaluating their impact on respiratory health, and enabling timely mitigation strategies. Although IAQ can also be affected by other significant agents such as radon, aldehydes (e.g., formaldehyde emitted from furnishings), and biological contaminants like mold and dust mites, comprehensive monitoring of all these factors remains technologically challenging and economically demanding.
IAQ presents challenges that differ significantly from outdoor ambient assessment. Indoor environments are dynamic microclimates characterized by high spatial variability (e.g., kitchen vs. bedroom) and intermittent emission sources (e.g., cooking, cleaning, human occupancy). Furthermore, historically, IAQ assessment has relied upon costly, research-grade instrumentation or passive samplers, none of which are typically capable of capturing real-time dynamics. In this context, low-cost sensors (LCSs) have acquired paramount importance. Their affordability and compact size provide a great opportunity for indoor environments, enabling the identification of potential emission sources in various household areas, the management and mitigation of IAQ issues, real-time alert systems, personal exposure monitoring, and building control to optimize energy efficiency and assess health risks [8].
The advantages of using LCSs are well-known in terms of cost, portability [9], and spatio-temporal resolution. In addition, the use of LCSs opens the door to the application of artificial intelligence techniques, enabling the development of highly valuable predictive models for public health, building management, and environmental sustainability [2]. However, some studies agree on their limitations regarding precision and accuracy [10], and the lack of regulation in the certification process [11]. Furthermore, they highlight the need for regular field calibrations and the importance of using reference-grade instruments for validation [8].
Nonetheless, integrating Machine Learning (ML) and IAQ monitoring systems based on LCSs and IoT is of utmost importance, as it transforms raw data into proactive, actionable information. The main advantage of ML is its ability to predict and forecast future air quality conditions [12,13,14,15,16]. ML leverages the large volume of quantitative data generated by low-cost IoT sensors to process, analyze, and build models that deliver reliable and cost-effective predictions to maintain optimal IAQ and occupant well-being. This forecast is crucial because it gives users more time to deliberate on how to improve air quality and prevent dangerous situations before they occur [14,15].
Although not compound-specific, VOC measurements can indirectly capture indoor activities involving chemical products, such as the use of disinfectants or cleaning agents. Similarly, increases in PM concentrations and in CO2, often used as a proxy for ventilation, can indicate emissions from combustion sources, including heating and cooking, particularly when ventilation is limited. Moreover, quantifying IAQ can offer potential benefits for individuals with respiratory conditions [12]. Lastly, the use of ML enables building managers to make informed decisions about ventilation and heating/cooling, which also contributes to improving the building’s energy efficiency [14].
In addition, the importance of ML also lies in its ability to handle and analyze the complex and non-linear nature of air quality data [13]. ML algorithms are ideal for forecasting IAQ data time series, as they overcome the limitations of traditional prediction models and achieve more accurate predictions. These complex models can be trained using historical IAQ data to determine future values. Furthermore, ML can be implemented in the back end of the system to detect anomalies or changes in trends through time series analysis, which is useful to modify occupant behavior [12].
The state of the art in the prediction of PM (mainly PM2.5) has evolved from initial mechanical models, which were inconvenient because they required many details of the building, such as the structure of the envelope [17], to a data-driven approach dominated by ML. Algorithms such as Artificial Neural Networks (ANNs) [18], Recurrent Neural Networks (RNNs) [19], and especially Random Forests (RFs) [20,21] have been widely and successfully used to predict pollutant concentrations due to their capacity to capture non-linear interactions between environmental variables, to model the complex temporal dependencies inherent in air quality dynamics, and to handle measurement noise while mitigating overfitting. However, only a few approaches have specialized in analyzing differences in air quality between the rooms of a dwelling. Machine learning models have also been developed to forecast indoor CO2 and VOC concentrations [22,23,24].
In this context, this study aims to design, implement, and evaluate a low-cost solution, from initial sensor characterization and selection to hardware integration, firmware programming, and IoT configuration. While IAQ is affected by a multitude of agents, this study focuses on PM, VOCs, CO2, temperature, and humidity. These parameters were selected because they serve as effective real-time proxies for ventilation (CO2), combustion and occupancy activities (PM, CO2), and general chemical contamination (VOCs), while remaining compatible with the cost and size constraints of scalable IoT deployments. In addition, they have been shown to play a critical role in the management of chronic respiratory diseases, since they represent the primary environmental triggers for exacerbations in asthma and COPD patients that can be reliably monitored in real time [2]. This work explicitly addresses the primary weaknesses of common LCSs by validating the system's performance and by deploying it in real-world residential rooms. Furthermore, we demonstrate the utility of high-quality data for developing predictive ML models that anticipate contaminant levels. The result is a scalable, accurate, and reliable monitoring solution that can contribute the data-driven insights needed for effective building management and personal exposure assessment.
The remainder of this paper is structured as follows: Section 2 details the methodology used for sensor selection and evaluation, for building the IAQ unit, and for the field test, as well as the predictive machine learning approach. Section 3 presents the results on the characterization and comparative evaluation of the IAQ sensors, the final IAQ prototype, the detailed analysis of the pollutant data gathered in the field test, and the results of the training and validation of the ML predictive models. These results are discussed in Section 4. Finally, Section 5 summarizes the conclusions.

2. Materials and Methods

2.1. Sensor Technology Background

The technology integrated into LCSs for IAQ monitoring varies by the target parameter. For gas detection (e.g., NOx, VOCs, CO, O3, SOx, NH3), the primary technologies are metal-oxide semiconductor (MOS, also abbreviated MOX) and electrochemical (EC) sensors. EC sensors generally offer higher sensitivity and selectivity, but at a higher cost, while MOS sensors are more affordable and have long lifespans, although they can suffer from cross-sensitivity [25]. It should be noted that many MOS sensors dedicated to measuring VOCs also provide an equivalent CO2 value (eCO2). This parameter is not a direct measurement; instead, it is algorithmically inferred from the correlation observed in indoor environments between certain VOCs and hydrogen with CO2 exhaled through respiration. Other sensor types for VOCs include photo-ionization detectors (PIDs), which offer higher sensitivity than MOS sensors, although with limited selectivity [26]. For CO2 measurement, the advent of non-dispersive infrared (NDIR) technology has been a significant advancement, providing highly precise, selective, and long-term stable measurements [27]. For PM (PM1, PM2.5, PM4, and PM10), the dominant LCS technology is laser scattering, as implemented in optical particle counters (OPCs). The reliability and performance of these low-cost OPCs have been extensively evaluated and validated in numerous studies, confirming their utility for IAQ monitoring when properly calibrated [28].

2.2. Sensor Identification

An experimental approach was used for the selection, evaluation, and calibration of LCSs intended for IAQ monitoring.
The target parameters, VOCs, PM, CO2, temperature, and humidity, were selected to balance cost and reliability, allowing a comprehensive assessment of IAQ and ventilation without incurring the high expenses of laboratory-grade, gas-specific analyzers. The selection process focused on commercially available off-the-shelf (COTS) sensors, which were evaluated against essential technical criteria: accuracy, measurement range, stability, cost, and calibration needs.
The experimental evaluation, summarized in Table 1, involved a comparative benchmark of eight distinct sensor modules selected to cover the target parameters. Three MOX VOC sensors (the SGX Sensortech MICS-VZ-89TE, ScioSense ENS160, and Sensirion SGP40) were compared against a high-precision, lab-calibrated EC sensor (the ECSense TB600B-TVOC-10). MOX sensors are typically factory-calibrated but are generally less precise. In contrast, EC sensors, while substantially more expensive, provide certified, traceable calibration derived from laboratory testing. For this reason, the EC sensor was used as the reference standard to evaluate the performance of the lower-cost MOX technologies.
For the CO2 measurement, three sensors based on NDIR technology were compared. This group included two NDIR photoacoustic sensors (the Sensirion SCD41 and Infineon XENSIV PAS CO2), factory calibrated up to 2000 ppm, and an NDIR optical sensor (Telaire T6793-5K) calibrated up to 5000 ppm. All three units feature self-calibration capabilities, providing a robust basis for comparing the two NDIR sub-technologies (photoacoustic vs. optical). Additionally, the estimated eCO2 values provided by MOX-based VOC sensors were included in this evaluation to assess their viability as a potential proxy or substitute for a dedicated NDIR sensor.
Finally, the Sensirion SEN54 was included as an all-in-one multi-parameter module. It was the only sensor evaluated for PM (PM1, PM2.5, PM4, PM10), which it measures using laser scattering technology. The SEN54 operates from 0 to 1000 µg/m3 with a specified precision of ±10% against its calibration reference (a TSI DRX 8533 aerosol monitor). The SEN54 also integrates a MOX sensor to provide a VOCs index, along with sensors for temperature and relative humidity, offering a comprehensive and cost-effective solution [29,30,31].

2.3. Sensors Evaluation

Sensor evaluation was carried out by means of experimental trials performed under controlled conditions. The block diagram of the test bench employed is shown in Figure 1. The evaluation protocol comprised co-location experiments in an open-door room. For the analysis of the sensors' response dynamics and cross-sensitivity, we employed controlled exposure to representative indoor emission sources rather than static standard gases, as our goal was to evaluate the sensors' behavior under acute, transient events typical of residential settings. For VOCs, ethanol-based aerosols (commercial hairspray) were used to generate sudden, high-concentration spikes. This allowed us to assess each sensor's rise time and recovery curve. For CO2, and to test for cross-sensitivity, we generated CO2 pulses via the stoichiometric acid-base reaction of sodium bicarbonate (NaHCO3) and acetic acid (vinegar). This method produces a clean CO2 plume without significant VOCs, which is ideal for verifying sensor selectivity.
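For reference, the acid-base reaction used to generate the CO2 pulses can be written out as follows. The molar masses are standard textbook values and are not taken from the article; the quantities of reagent actually used in the experiments are not reported here, so the ratio is given per unit mass only.

```latex
\mathrm{NaHCO_3} + \mathrm{CH_3COOH} \longrightarrow \mathrm{CH_3COONa} + \mathrm{H_2O} + \mathrm{CO_2}\uparrow

% One mole of bicarbonate releases one mole of CO2, so with
% M(NaHCO3) \approx 84.01 g/mol and M(CO2) \approx 44.01 g/mol:
\frac{m_{\mathrm{CO_2}}}{m_{\mathrm{NaHCO_3}}} = \frac{44.01}{84.01} \approx 0.524
```

Under complete reaction with excess acetic acid, roughly half a gram of CO2 is therefore released per gram of sodium bicarbonate.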
The prototype consisted of an Arduino MKR WiFi 1010 (Arduino, Monza, Italy) microcontroller board and each of the sensors in Table 1, with the corresponding wiring schematic depicted in Figure 1. For the prototype's firmware, the data-read latency of the most restrictive sensor (i.e., the slowest) was considered, and a unified sampling period of 6 s was established for the simultaneous data acquisition from all sensors.
Upon completion of the experiments, the collected data were analyzed to select the sensors with the optimal cost-reliability ratio. The selected sensors were then targeted for integration into a custom-designed project-specific device. This final device was engineered for ultra-low power consumption, several hours of energy autonomy (battery life), and NB-IoT (Narrowband-IoT) connectivity for autonomous data transmission to a remote server. This design renders the device fully independent of any end-user intervention; it operates using its own SIM card and does not rely on the end-user’s local network infrastructure.

2.3.1. Evaluation of VOCs Measurement Performance

For the reliability analysis of VOCs sensors, the MKR WiFi 1010 device was used, and custom firmware was developed in C++ within the Arduino framework, following a modular architecture. The firmware uses dedicated driver classes to poll the sensors via I2C. The transmission layer was configured to send averaged data packets over Wi-Fi to the ThingSpeak platform [32], enabling real-time remote validation. A sampling period of 6 s was set, and the one-minute mean of the acquired values was calculated and transmitted.
To assess sensor response in a wide dynamic range of concentrations of VOCs, ambient conditions were altered by applying ethanol-based aerosols (commercial hairspray). Concurrently, ambient CO2 levels were also deliberately increased to observe whether the known cross-correlation between VOCs and CO2 would introduce adverse effects on the measurement process of the VOCs sensors. The increase in CO2 concentration was generated via the acid-base reaction of NaHCO3 and acetic acid.
The comparison protocol consisted of placing the candidate sensors and the reference instrument under identical conditions. Both systems were simultaneously exposed to controlled variations in ethanol vapor to evaluate linearity and response time. Sensor performance was assessed through correlation analysis between the reference device and the MOX sensors under test. In addition, we examined whether combining the outputs of multiple MOX sensors could yield an improved calibration fit relative to the reference standard [33].

2.3.2. Evaluation of CO2 Measurement Performance

To perform a reliability analysis of the NDIR CO2 sensors, as well as the eCO2 values provided by the VOC sensors, data were collected through the serial port at intervals of 6 s, without minute averaging. The rationale for this methodological change was to capture the variation in ambient CO2 levels more accurately, as CO2 is more dynamic and dissipates more rapidly than VOCs during the generation process (mixing sodium bicarbonate and vinegar). In this experiment, ambient levels of CO2 and VOCs were again altered, for a purpose analogous to that of the previous experiment.
To validate the sensor response in concentration ranges representative of indoor spaces, specific tests were designed in two distinct scenarios. Scenario (a) involved the forced injection of CO2 to reach concentrations up to 5000 ppm. Scenario (b) involved restricted CO2 values, common in residential environments (<2000 ppm), reflecting expected real-world operating conditions.
Data evaluation was performed by calculating the coefficient of determination (R2), using a linear regression model between the candidate sensors and the reference instrument. In this assessment, it was assumed that the three NDIR sensors had high reliability and therefore could be considered reference standards against the eCO2 values provided by the VOCs sensors. The purpose of acquiring three different NDIR sensors, using different technologies and manufacturers, was twofold: first, to assess their degree of intercorrelation, and second, to provide a wider selection pool to identify the sensor that offered the best cost-performance ratio.

2.4. IAQ Unit

Following sensor selection, a printed circuit board (PCB) was designed to integrate the selected components into the final prototype (Figure 2).
The use of NB-IoT for data transmission makes the device independent of the user. It was designed for “plug-and-play” self-installation. The data transmission frequency was set to 10 min. Internally, the device acquires sensor readings every 6 s, enabling high temporal resolution monitoring; it then computes the 10-min average for each parameter, which is subsequently transmitted to a remote web server using a RESTful API service. This standardized communication facilitated the centralized storage of all data in JSON format within a secure and accessible environment for subsequent analysis.
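The acquisition and reporting cadence described above (one reading every 6 s, one averaged JSON record every 10 min) can be sketched in Python as follows. This is only an illustration of the arithmetic, not the device firmware: the field names, the `device_id` value, and the payload layout are hypothetical, and the actual REST endpoint and schema are not given in the text.

```python
import json
from statistics import fmean

SAMPLE_PERIOD_S = 6      # from the text: one sensor reading every 6 s
REPORT_PERIOD_S = 600    # from the text: one transmission every 10 min
SAMPLES_PER_REPORT = REPORT_PERIOD_S // SAMPLE_PERIOD_S  # 100 readings per record


def build_payload(readings, device_id, window_end_iso):
    """Average one full 10 min window and serialize it as JSON.

    `readings` is a list of dicts, one per 6 s sample. All keys
    (e.g. "co2_ppm") are hypothetical names chosen for this sketch.
    """
    if len(readings) != SAMPLES_PER_REPORT:
        raise ValueError("expected one complete 10 min window of readings")
    keys = readings[0].keys()
    means = {k: round(fmean(r[k] for r in readings), 2) for k in keys}
    record = {
        "device_id": device_id,        # hypothetical identifier
        "window_end": window_end_iso,  # timestamp of the last sample
        "n_samples": len(readings),
        "mean": means,
    }
    return json.dumps(record, sort_keys=True)


# Example window: CO2 rising linearly, PM2.5 constant.
window = [{"co2_ppm": 400.0 + i, "pm2_5_ugm3": 10.0} for i in range(SAMPLES_PER_REPORT)]
payload = build_payload(window, "unit-01", "2025-01-15T08:10:00Z")
```

Averaging on the device reduces the transmitted volume by a factor of 100 relative to sending every raw sample, at the cost of smoothing events shorter than the 10 min window.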

Open Hardware

The IAQ unit is available under open-source licenses: the hardware design is released under the CERN Open Hardware Licence v2 - Strongly Reciprocal; the firmware is available under the GNU General Public License v3 or later; and the documentation, user manual, infographics, and other materials are available under the GNU Free Documentation License v1.3 or later. Our project page is https://atari-researchlab.github.io/cicerone-airlink (accessed on 27 October 2025) [34].

2.5. Field Tests

For the evaluation of the unit in an operational environment, the devices were installed in various indoor spaces (kitchen, bedroom, and main living room) in nine independent dwellings located in the province of Cadiz (Spain). The deployment was conducted during the winter season, ensuring similar climatic conditions between the analyzed environments. A 7-day observation period was established for each dwelling, during which the devices performed continuous data collection of air quality and environmental variables, including PM (PM1, PM2.5, PM4, PM10), CO2, VOCs, temperature, and relative humidity. This real-world deployment allowed for the analysis of pollutant variations in relation to occupant activity in each environment, as well as the evaluation of the highest concentrations reached [35]. The resulting dataset was used to perform a statistical analysis of pollution patterns and develop ML algorithms to evaluate the predictive capacity of the system to forecast the variability of the contaminants.

2.6. Data Analysis

2.6.1. Exploratory Data Analysis

To evaluate the correlation and degree of agreement between the sensor measurements in the sensor evaluation step, the coefficient of determination (R2) was calculated using a univariate ordinary least squares (OLS) linear regression model ($y = \beta_1 x + \beta_0$). This metric was used to quantify the proportion of variance in the measurements of one sensor that is predictable from the measurements of the other, thus indicating the strength of the linear association between the two instruments.
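As a minimal sketch of this agreement metric, the following Python function fits the univariate OLS line and returns the slope, intercept, and R2. It uses only the closed-form least-squares solution; the function name and the example values are illustrative and are not taken from the study's data.

```python
def ols_r2(x, y):
    """Fit y = beta1 * x + beta0 by ordinary least squares.

    Returns (beta1, beta0, r2), where r2 is the coefficient of
    determination of the fitted line.
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    if sxx == 0 or ss_tot == 0:
        raise ValueError("x and y must both vary")
    beta1 = sxy / sxx
    beta0 = mean_y - beta1 * mean_x
    ss_res = sum((yi - (beta1 * xi + beta0)) ** 2 for xi, yi in zip(x, y))
    return beta1, beta0, 1.0 - ss_res / ss_tot


# Illustrative readings: candidate sensor (x) vs. reference (y).
slope, intercept, r2 = ols_r2([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 2.0, 3.0])
```

For a univariate fit, R2 equals the squared Pearson correlation, so it is unchanged if the roles of the two sensors are swapped.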
CO2 (ppm), PM1 (µg/m3), PM2.5 (µg/m3), PM4 (µg/m3), PM10 (µg/m3), VOCs (index), temperature, and humidity were acquired during the field tests. The analysis of the data gathered in the field tests was conducted using a methodology that combined preprocessing, descriptive statistical analysis, and the identification of temporal patterns. Data preprocessing included detecting and eliminating outliers using the interquartile range (IQR) method, removing records below $Q_1 - 1.5 \times \mathrm{IQR}$ or above $Q_3 + 1.5 \times \mathrm{IQR}$. Subsequently, the data were categorized into four time intervals to enable the study of the temporal evolution of pollutants: night (00:00–06:00), morning (06:00–12:00), afternoon (12:00–18:00), and evening (18:00–24:00). For each parameter, descriptive statistics (mean, standard deviation, minimum, maximum) were calculated both globally and segmented by time slot.
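The two preprocessing steps can be sketched as follows. The quartile convention is an assumption: the text does not state which interpolation method was used, so this sketch picks the "inclusive" method of Python's `statistics.quantiles`. The function names and sample values are illustrative.

```python
from datetime import datetime
from statistics import quantiles

SLOTS = ("night", "morning", "afternoon", "evening")  # 6 h each, from 00:00


def iqr_filter(values):
    """Keep values inside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].

    Quartiles use the 'inclusive' interpolation method (an assumption;
    other conventions give slightly different fences).
    """
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if low <= v <= high]


def time_slot(ts):
    """Map a timestamp to night/morning/afternoon/evening by its hour."""
    return SLOTS[ts.hour // 6]


# Illustrative series with one extreme reading.
cleaned = iqr_filter([10, 11, 12, 13, 14, 15, 16, 17, 18, 200])
slot = time_slot(datetime(2025, 1, 15, 19, 30))
```

Because the fences are computed from the same data being filtered, applying the filter per pollutant and per room (rather than on pooled data) changes which records are removed; the text does not specify the grouping used.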
The daily evolution of pollutants in each room was represented against time (hours). The mean value and confidence bands (mean ± standard deviations) of the time series were depicted. Additionally, the data distribution by pollutant and time slot, including the density, and key statistical markers (mean and range), were presented as violin plots. To compare the distribution of concentrations between slots, the non-parametric Kruskal–Wallis test [36] was applied, since it is appropriate when normality cannot be assumed.
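To make the test concrete, the sketch below computes the Kruskal–Wallis H statistic, with the standard correction for ties, from the pooled ranks of several groups (here, the four time slots would be the groups). It is a from-scratch illustration and not the routine used in the study; under the null hypothesis H is compared against a chi-square distribution with (number of groups − 1) degrees of freedom, and the p-value computation is omitted here.

```python
def kruskal_wallis_h(*groups):
    """Return the tie-corrected Kruskal-Wallis H statistic."""
    if len(groups) < 2 or any(len(g) == 0 for g in groups):
        raise ValueError("need at least two non-empty groups")
    # Pool all observations, remembering which group each came from.
    pooled = sorted((value, gi) for gi, g in enumerate(groups) for value in g)
    n = len(pooled)
    rank_sums = [0.0] * len(groups)
    tie_term = 0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of the 1-based ranks i+1 .. j
        for k in range(i, j):
            rank_sums[pooled[k][1]] += avg_rank
        t = j - i                   # size of this group of tied values
        tie_term += t ** 3 - t
        i = j
    h = 12.0 / (n * (n + 1)) * sum(
        rs ** 2 / len(g) for rs, g in zip(rank_sums, groups)
    ) - 3 * (n + 1)
    correction = 1.0 - tie_term / (n ** 3 - n)
    if correction == 0:
        raise ValueError("all observations are identical")
    return h / correction


# Three clearly separated illustrative groups.
h_stat = kruskal_wallis_h([1, 2, 3], [4, 5, 6], [7, 8, 9])
```

Because the statistic depends only on ranks, it is insensitive to the skewed, peak-dominated distributions typical of indoor pollutant series.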

2.6.2. Predictive Models

A supervised ML pipeline was implemented to predict IAQ pollutant concentrations.
While deep learning architectures such as attention-based long short-term memory (LSTM) are powerful for vast sequential datasets, tree-based ensemble methods were selected for this study, given their proven high performance on complex, non-linear, high-dimensional tabular data, their lower computational requirements suitable for IoT edge-deployment, and their robustness against overfitting on moderate-sized datasets.
The Random Forest (RF) model [16] was included as a fundamental and robust bagging (Bootstrap Aggregating) ensemble, and considered as the baseline. It is effective in reducing variance and mitigating overfitting, provides strong baseline performance, and is highly robust to noise, which is common in sensor data.
In addition, a suite of boosting algorithms was selected, as they represent the current state of the art for regression tasks on tabular data. Unlike the parallel tree-building of RF, boosting models build trees sequentially, where each subsequent tree is trained to correct the residual errors of its predecessors. Extreme Gradient Boosting (XGBoost) [37] was chosen for its well-established high performance and its regularized learning, which controls model complexity and prevents overfitting, a critical risk in high-dimensional feature spaces like ours. Light Gradient Boosting Machine (LGBM) [38] was included to evaluate computational efficiency alongside accuracy. LGBM employs a leaf-wise tree growth strategy, as opposed to the level-wise growth of other models. This approach allows it to converge significantly faster on large datasets, making it a methodologically sound choice for assessing the trade-off between training time and performance. Finally, Categorical Boosting (CatBoost) [39] was selected for methodological innovations that enhance robustness and combat overfitting. In particular, CatBoost implements an ordered boosting strategy, which reduces overfitting and improves the model's ability to generalize, a property that is critical for noisy time-series sensor data.
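The sequential residual-correcting idea behind boosting can be illustrated with a deliberately small sketch: plain gradient boosting for squared error, using single-split regression stumps on one input variable. This is a teaching example under simplifying assumptions and is not XGBoost, LGBM, or CatBoost, each of which adds regularization, tree-growth strategies, and other refinements on top of this scheme.

```python
def fit_stump(x, residuals):
    """Best single-threshold regression stump on 1-D inputs (squared error).

    Requires at least two distinct values in x.
    """
    best = None
    for threshold in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        lmean = sum(left) / len(left)
        rmean = sum(right) / len(right)
        sse = sum((r - lmean) ** 2 for r in left) + sum((r - rmean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, lmean, rmean)
    _, threshold, lmean, rmean = best
    return lambda xi: lmean if xi <= threshold else rmean


def boost(x, y, n_rounds=50, learning_rate=0.1):
    """Fit stumps one after another, each on the current residuals."""
    base = sum(y) / len(y)          # initial prediction: the mean
    pred = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        # Shrink each correction by the learning rate before adding it.
        pred = [pi + learning_rate * stump(xi) for xi, pi in zip(x, pred)]

    def predict(xi):
        return base + learning_rate * sum(s(xi) for s in stumps)

    return predict


# Illustrative step-shaped signal.
xs = list(range(10))
ys = [0.0] * 5 + [10.0] * 5
model = boost(xs, ys)
```

Each round removes only a fraction (the learning rate) of the remaining error, which is why boosting typically needs many trees but tends to generalize better than a single deep tree.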
  • Feature Extraction
To develop predictive models for 10 min ahead IAQ, a set of features was engineered from the raw time-series data. Table 2 details these features.
The current-time measurements (time t) for the pollutant variables were explicitly excluded from the feature set. This ensured that the model only used environmental data available at time t (temperature, humidity) and historical data (all lagged and rolling features) to predict pollutant concentrations at time t + 10 min.
This set of features allowed the models to capture both structural variability and short-term dynamic fluctuations associated with occupancy patterns, ventilation, and the accumulation or dissipation of pollutants in indoor environments.
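The exclusion of current-time pollutant values can be made explicit with a short sketch. The specific lags, window length, and feature names below are hypothetical placeholders (the actual feature set is the one listed in Table 2); the point illustrated is only that every pollutant-derived feature at step t is built from values strictly before t, while the target is the value one 10 min step ahead.

```python
from statistics import fmean


def make_rows(series, temperature, humidity, lags=(1, 2, 3), window=6):
    """Build (features, target) pairs for one-step-ahead forecasting.

    series[t] is the pollutant at step t (10 min records); the target is
    series[t + 1]. Lags and window size are illustrative assumptions.
    """
    rows = []
    start = max(max(lags), window)
    for t in range(start, len(series) - 1):
        # Environmental variables at time t are allowed as predictors.
        feats = {"temp_t": temperature[t], "rh_t": humidity[t]}
        for k in lags:
            feats[f"lag_{k}"] = series[t - k]
        past = series[t - window:t]  # slice ends before t: excludes series[t]
        feats[f"roll_mean_{window}"] = fmean(past)
        feats[f"roll_max_{window}"] = max(past)
        rows.append((feats, series[t + 1]))
    return rows


# Illustrative data: a ramp, with constant temperature and humidity.
pollutant = [float(v) for v in range(10)]
rows = make_rows(pollutant, [20.0] * 10, [50.0] * 10, lags=(1, 2), window=3)
```

With this construction the first usable row appears only after the longest lag or window has elapsed, so a few records at the start of each deployment cannot be used for training.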
  • Feature Selection Methodology
To identify the most impactful predictors, we implemented a model-dependent embedded feature selection methodology. This method was performed dynamically inside each fold of the 10-fold cross-validation loop. In each fold, the feature importance scores were extracted to quantify the contribution of each feature. The top-20 features of this ranked list were selected, thereby removing redundant predictors. The model for that specific fold was then trained only on this reduced subset of 20 features.
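The key property of this procedure is that the ranking is recomputed from the training portion of each fold only, so the held-out data never influence which features are kept. The sketch below shows that structure with a pluggable scoring function. As a stand-in, it scores features by absolute Pearson correlation with the target; this is a swapped-in filter score used purely so the example runs without external libraries, whereas the study extracts importance scores from the fitted model itself. The contiguous, unshuffled fold layout is also an assumption.

```python
import math


def kfold_indices(n, k):
    """Yield (train_idx, test_idx) for k contiguous folds (no shuffling)."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size


def abs_pearson(X, y):
    """Stand-in importance score: |Pearson r| of each column with y."""
    n = len(y)
    mean_y = sum(y) / n
    syy = sum((b - mean_y) ** 2 for b in y)
    scores = []
    for j in range(len(X[0])):
        col = [row[j] for row in X]
        mean_x = sum(col) / n
        sxx = sum((a - mean_x) ** 2 for a in col)
        sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(col, y))
        scores.append(abs(sxy) / math.sqrt(sxx * syy) if sxx > 0 and syy > 0 else 0.0)
    return scores


def select_per_fold(X, y, k_folds=10, top_k=20, importance_fn=abs_pearson):
    """Return, for each fold, the column indices of the top_k features."""
    selections = []
    for train_idx, _ in kfold_indices(len(y), k_folds):
        X_tr = [X[i] for i in train_idx]   # training rows only
        y_tr = [y[i] for i in train_idx]
        scores = importance_fn(X_tr, y_tr)
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        selections.append(sorted(ranked[:top_k]))
    return selections


# Illustrative data: columns 0 and 2 track the target, column 1 does not.
target = [float(i) for i in range(20)]
data = [[t, float(i % 2), -2.0 * t] for i, t in enumerate(target)]
chosen = select_per_fold(data, target, k_folds=5, top_k=2)
```

Because selection is repeated per fold, the retained subset can differ between folds; comparing the per-fold lists is a simple way to check how stable the ranking is.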
  • Metrics of Performance and Validation
RF, XGBoost, LGBM, and CatBoost models were trained for each target pollutant. To ensure a robust and unbiased performance assessment, each model was evaluated using a 10-fold cross-validation strategy. The predictive performance of the models was quantified using a set of metrics which included the cross-validated mean and standard deviation of the root mean square error (RMSE), R-squared (R2), symmetric mean absolute percentage error (SMAPE), and mean absolute error (MAE) [40]:
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$
The MAE measures the average error without penalizing for magnitude:
$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|$$
The Coefficient R2 determines the proportion of explained variance:
$$R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}$$
The SMAPE was used because it is symmetric, robust to zero values, and enables a fair comparison of models across different parameters.
$$\mathrm{SMAPE} = \frac{100\%}{n}\sum_{i=1}^{n}\frac{2\left|\hat{y}_i - y_i\right|}{\left|y_i\right| + \left|\hat{y}_i\right|}$$
With $y_i$ being the actual value and $\hat{y}_i$ the predicted value (model output).
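The four metrics translate directly into code. The following sketch implements them as defined above; the only added convention is in SMAPE, where a term whose actual and predicted values are both zero is counted as zero error (an assumption made here to avoid division by zero, not something stated in the text).

```python
import math


def rmse(y, y_hat):
    """Root mean square error."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y))


def mae(y, y_hat):
    """Mean absolute error."""
    return sum(abs(a - b) for a, b in zip(y, y_hat)) / len(y)


def r2(y, y_hat):
    """Coefficient of determination (requires y to vary)."""
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot


def smape(y, y_hat):
    """Symmetric mean absolute percentage error, in percent."""
    total = 0.0
    for a, b in zip(y, y_hat):
        denom = abs(a) + abs(b)
        if denom > 0:  # assumption: a 0/0 term contributes zero error
            total += 2.0 * abs(b - a) / denom
    return 100.0 * total / len(y)


# Illustrative values: three exact predictions and one miss.
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 2.0, 3.0, 6.0]
scores = {
    "RMSE": rmse(actual, predicted),
    "MAE": mae(actual, predicted),
    "R2": r2(actual, predicted),
    "SMAPE": smape(actual, predicted),
}
```

In this example the single large miss yields an RMSE (1.0) twice the MAE (0.5), which shows how RMSE weights large errors more heavily.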

2.7. Software

The PCB was designed using Autodesk Fusion 360 (Autodesk, CA, USA). The custom 3D printed enclosure was designed using SolidWorks 2020 (Dassault Systèmes, Suresnes, France). The firmware was developed in the Arduino IDE. The process required integrating and adapting official libraries from sensor manufacturers, as well as developing custom libraries from scratch for hardware modules that did not have them. Statistical analysis and ML model development were performed using Python 3.10.

3. Results

3.1. Sensors Evaluation Results

3.1.1. VOCs Measurement Performance

A dataset consisting of 328 records was built. It included minute average measurements of VOCs. Table 3 presents the results of the correlation analysis carried out to compare the evaluated sensors and the reference device TB600B-TVOC-10.
A correlation greater than 83% with respect to the reference device can be observed in two of the sensors (SGP40 and SEN54). This suggests a high reliability of the factory pre-calibrated MOX detectors. Representative data collected during these co-location experiments, showing the linear regression between the SEN54 and SGP40 low-cost sensors and the reference device (TB600B-TVOC-10), is presented in Figure 3. The SEN54 sensor was selected for its superior correlation with the reference measurements and its all-in-one capability, integrating sensors for VOCs, particulate matter, temperature, and relative humidity.
The evaluation of VOCs was conducted using the VOC index metric provided by the SEN54 module. This logarithmic scale serves as a qualitative indicator of the intensity of pollution events relative to the dynamic baseline of the environment. The VOC index was selected over absolute concentration estimation (e.g., ppb) because low-cost MOX sensors are subject to significant baseline drift over long-term deployments. The proprietary algorithm compensates for this drift and humidity variations, providing a robust, event-driven signal suitable for identifying activities such as cleaning or cooking without the need for frequent recalibration against reference gases. Although the VOC index is a relative unit, our co-location tests against a reference PID instrument demonstrated a high correlation (R2 = 0.89), validating that the index accurately captures the temporal dynamics and magnitude of pollutant peaks.

3.1.2. CO2 Measurement Performance

A total of 2394 records were obtained during the tests conducted for the selection of the CO2 sensor.
The devices evaluated included three NDIR CO2 sensors (SCD41, XENSIV PAS CO2, and T6793-5K) and two VOCs sensors capable of estimating eCO2 concentrations (MICS-VZ-89TE and ENS160). Figure 4 presents the temporal evolution of the CO2 and eCO2 values during the tests. All sensors, excluding the MICS-VZ-89TE, showed proportional responses to the induced changes in ambient CO2 concentration. The eCO2 signals displayed abrupt fluctuations in response to the induced changes in VOCs, revealing a significant cross-sensitivity that compromised their reliability for direct CO2 estimation.
Table 4 shows the correlation analysis conducted on the CO2 and eCO2 sensors. As shown, the eCO2 readings from the VOCs sensors (MICS-VZ-89TE and ENS160) exhibited a maximum correlation of 0.26 with the NDIR CO2 sensor SCD41. In contrast, a strong correlation of 0.97 was observed between the two photoacoustic NDIR sensors (XENSIV PAS CO2 and SCD41), and a correlation of 0.60 between the photoacoustic XENSIV PAS CO2 and the optical NDIR sensor T6793-5K.
This lower correlation was attributed to a mismatch in calibration ranges. The photoacoustic sensors saturated above 2000 ppm, their pre-calibrated limit, whereas the optical T6793-5K sensor was calibrated up to 5000 ppm.
From these results, it can be concluded that eCO2 values provided by the MOX sensors are not suitable for direct CO2 measurement, as they exhibit low correlation levels with dedicated NDIR CO2 sensors. Therefore, the NDIR CO2 sensor was selected for the final prototype to ensure the accuracy and reliability of the measurements.
To ensure a fair and relevant comparison between the NDIR sensors, a second experiment was conducted. It focused exclusively on the operating range common to all devices, limiting the generated concentrations of CO2 to values below 2000 ppm. This methodological decision was made for two key reasons. First, concentrations exceeding 2000 ppm are uncommon in typical residential or indoor environments. Second, this threshold aligns with the limit established by international standards such as ASHRAE 62.1 [41], which begins to classify indoor air quality as degraded at this level.
This new experiment yielded 869 data points for each sensor, revealing a strong linear correlation among the three NDIR CO2 sensors evaluated. This association was highly robust, as detailed in Table 5, with a lowest R2 of 0.88 (between the optical sensor and one of the photoacoustic sensors). The T6793-5K sensor was selected as it presented the best cost-effectiveness ratio.
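The effect of restricting the comparison to the common operating range can be sketched as follows. The study ran a separate experiment below 2000 ppm; the filter used here is only a stand-in that illustrates why saturation of one sensor depresses the correlation. The function names and the sample readings are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def r2_in_common_range(sensor_a, sensor_b, upper_ppm=2000.0):
    """Squared correlation using only pairs where both readings are below upper_ppm."""
    pairs = [(a, b) for a, b in zip(sensor_a, sensor_b)
             if a < upper_ppm and b < upper_ppm]
    xs = [a for a, _ in pairs]
    ys = [b for _, b in pairs]
    return pearson_r(xs, ys) ** 2, len(pairs)


# Hypothetical paired readings (ppm): the second sensor saturates at 2000 ppm
optical = [450.0, 800.0, 1200.0, 1600.0, 1900.0, 2600.0, 3400.0]
photoacoustic = [460.0, 790.0, 1230.0, 1580.0, 1920.0, 2000.0, 2000.0]

r2_full = pearson_r(optical, photoacoustic) ** 2
r2_restricted, n_kept = r2_in_common_range(optical, photoacoustic)
```

With these synthetic values, the saturated pairs lower the full-range R2, whereas the restricted comparison recovers a near-linear relationship.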

3.2. IAQ Unit Results

Once the sensors were selected, the prototype of the IAQ unit was designed and implemented. The device was conceived to integrate multiple environmental sensors and autonomous connectivity capabilities, featuring the following main characteristics:
  • Sensors for the measurement of PM1, PM2.5, PM4, PM10, VOCs, CO2, temperature, and relative humidity.
  • An RTC for synchronizing sensor data acquisition and the configuration of transmitted data packets.
  • Autonomous data transmission through an NB-IoT communication module, enabling periodic transmission (every 10 min) of average sensor readings without user intervention.
  • Energy autonomy of up to five hours of continuous operation.
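The 10-min averaging of sensor readings before transmission can be illustrated with a short sketch. The device firmware itself was written in the Arduino IDE and is not reproduced here; the Python function below, its name, the timestamp format, and the sampling interval of the example are assumptions used only to show the aggregation step.

```python
def ten_minute_means(readings, window_s=600):
    """Average (unix_timestamp_s, value) readings into fixed 10-min windows.

    Returns a dict mapping the start time of each window to the mean value.
    """
    buckets = {}
    for timestamp, value in readings:
        window_start = timestamp - timestamp % window_s
        buckets.setdefault(window_start, []).append(value)
    return {start: sum(values) / len(values)
            for start, values in sorted(buckets.items())}


# Hypothetical CO2 readings (ppm) with timestamps in seconds
readings = [(0, 400.0), (60, 410.0), (599, 420.0), (600, 500.0), (660, 520.0)]
packets = ten_minute_means(readings)
```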
Figure 5 provides a 3D-rendered view of the device, which includes:
  1. Arduino Nano 33 BLE Sense 2, as the core microcontroller.
  2. T6793-5K CO2 NDIR carbon dioxide sensor.
  3. SEN54 multi-parameter module for the measurement of PM, VOCs, temperature, and relative humidity.
  4. M5Stack U111 NB-IoT Module, to handle all cellular data transmission through the NB-IoT protocol.
  5. LiPo Rider Plus Power Manager to handle charging and power delivery, a 900 mAh LiPo battery, and a 5V input for an external power supply.
  6. DRF0641 high-precision RTC.
The IAQ unit (Cicerone AirLink) was registered with open-source licenses and published on GitHub. Device specifications, hardware files (gerbers) for PCB manufacturing, the files for the case 3D-printing, the schematic and board designs, as well as the device firmware in full detail, can be openly accessed together with information for the manufacturing, assembly, and configuration of the device. The final prototype is shown in Figure 6.

3.3. Field Tests Results

A field trial was conducted in nine residential dwellings to assess the performance of the IAQ unit in a practical environment. The units were strategically installed in various indoor locations for continuous monitoring for seven days. 2817, 3353, and 3083 samples of pollutants were gathered in the kitchens, living rooms, and bedrooms, respectively. A synthesis of the descriptive statistics derived from this analysis for all monitored dwellings is presented in Table 6.
Figure 7 illustrates the daily evolution of some pollutants in kitchens. Substantial increases in PM2.5, PM10, and VOCs concentrations were observed, coinciding with peak cooking times. Evening PM10 levels averaged 12.5 μg/m3, although peaks exceeded 25 μg/m3. At the same time, the average VOC index measured 149.6, with multiple instances exceeding the threshold for possible irritation [42,43]. CO2 concentrations also exhibited nocturnal increases, frequently exceeding 1000 ppm, although the overall mean concentration remained approximately 597.8 ppm.
The results obtained for the monitoring of the living rooms are presented in Figure 8. Peaks of CO2 were recorded at up to 1277.7 ppm, primarily in the evening and at night, with a daily average concentration of approximately 629 ppm. The VOC index reached maximum values of 359 in the early morning (low-level pollution), while particle concentrations remained at low levels, with averages below 5 μg/m3 for all fractions evaluated.
Figure 9 illustrates the temporal evolution of various pollutants present in bedrooms. The levels of CO2 were found to be significantly higher compared to other areas, reaching an overall average of 1300.5 ppm, with maximum values recorded at up to 2896 ppm during the morning period. The VOC index showed an average value of 146 at night, with peaks of up to 354.5. Similarly, a significant accumulation of particles was detected during the night, with average concentrations of PM10 of 8.7 μg/m3.
Figure 10, Figure 11 and Figure 12 show violin plots with the distributions of CO2, PM, and VOCs in each room and for each time slot. Kruskal-Wallis tests confirmed the significant differences between time intervals in all rooms.
For kitchens, significant differences were observed in the concentrations of PM, CO2, and VOCs most of the time. For the living rooms, these variations were especially marked for VOCs and CO2. Regarding PM, significant differences were observed between the night and the remaining periods. In the bedrooms, PM, VOCs, CO2, temperature, and relative humidity exhibited significant variations over all time intervals.
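For readers unfamiliar with the test, the Kruskal-Wallis H statistic compares the rank distributions of several groups, here the time slots. The following is a minimal sketch of the tie-corrected statistic in plain Python; the function name and the CO2 values per time slot are hypothetical, and the p-value computation is omitted.

```python
def kruskal_wallis_h(groups):
    """Tie-corrected Kruskal-Wallis H statistic for a list of samples."""
    # Pool all observations, remembering which group each one came from
    pooled = sorted((value, idx)
                    for idx, sample in enumerate(groups)
                    for value in sample)
    n = len(pooled)
    rank_sums = [0.0] * len(groups)
    tie_term = 0.0
    i = 0
    while i < n:
        # Locate the run of tied values starting at position i
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of the 1-based ranks i+1 .. j
        ties = j - i
        tie_term += ties ** 3 - ties
        for k in range(i, j):
            rank_sums[pooled[k][1]] += avg_rank
        i = j
    h = 12.0 / (n * (n + 1)) * sum(
        r * r / len(sample) for r, sample in zip(rank_sums, groups)
    ) - 3.0 * (n + 1)
    # The correction is undefined if every observation has the same value
    return h / (1.0 - tie_term / (n ** 3 - n))


# Hypothetical CO2 concentrations (ppm) for three time slots
morning = [610.0, 640.0, 655.0]
afternoon = [700.0, 720.0, 750.0]
night = [980.0, 1020.0, 1100.0]
h_stat = kruskal_wallis_h([morning, afternoon, night])
```

Under the null hypothesis, H approximately follows a chi-square distribution with k - 1 degrees of freedom for k groups, so the value obtained here would be compared against the critical value of 5.991 for two degrees of freedom at the 0.05 level. The approximation is only reliable for larger samples than in this toy example; in practice, a library routine such as `scipy.stats.kruskal` returns both the statistic and the p-value.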
Analysis of pollutant concentrations against established health guidelines revealed several key findings. Daily average concentrations for PM2.5 and PM10 remained below the respective WHO 24-h guidelines (15 μg/m3 and 45 μg/m3). However, transient PM2.5 peaks reached 29.8 μg/m3 in kitchens and 20.1 μg/m3 in bedrooms. These short-duration events, while not affecting the daily mean, can represent acute exposure risks, particularly for people with respiratory sensitivities.
In terms of CO2, the 1000 ppm threshold, a widely accepted indicator of inadequate ventilation, was systematically exceeded in the bedroom and frequently in the other rooms. These elevated concentrations demonstrated a strong correlation with prolonged occupancy periods and low air exchange rates. Regarding the VOC index, the threshold of 200 was exceeded at different times of the day in all rooms, with the kitchen and bedroom exhibiting the most frequent and pronounced exceedances.
Together, the results indicate that IAQ in residential settings exhibits highly variable dynamics, which are determined by the interaction between occupancy, activity, and ventilation. The differences found between rooms and time slots highlight the need for a differentiated and specific evaluation for each space, rather than a global assessment of the indoor environment.
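The contrast noted above between daily means and transient PM2.5 peaks can be illustrated with a short sketch. The synthetic series, the helper name `summarize_exposure`, and the reuse of the 24-h guideline value as a reference level for individual readings are assumptions for illustration only; the WHO value is defined for 24-h means, not for short-term samples.

```python
def summarize_exposure(samples, guideline):
    """Summarize one day of equally spaced concentration readings."""
    daily_mean = sum(samples) / len(samples)
    n_above = sum(1 for s in samples if s > guideline)
    return {
        "daily_mean": daily_mean,
        "mean_exceeds": daily_mean > guideline,
        "peak": max(samples),
        "n_above": n_above,
        "fraction_above": n_above / len(samples),
    }


# Synthetic day of PM2.5 (ug/m3) at 10-min resolution: 144 samples
pm25 = [4.0] * 144
# One hypothetical hour-long cooking event in the evening
pm25[108:114] = [18.0, 25.0, 29.8, 24.0, 19.0, 16.0]

summary = summarize_exposure(pm25, guideline=15.0)
```

In this synthetic case the daily mean stays below 5 μg/m3 even though six consecutive readings lie above the reference level, which is the pattern that a 24-h average alone would not reveal.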

3.4. Predictive Models Results

A comparative analysis of supervised machine learning regression models (RF, XGBoost, LGBM, and CatBoost) was conducted to evaluate their performance in predicting IAQ pollutant concentrations 10 min in advance in the three domestic environments: kitchen, living room, and bedroom.
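The 10-min-ahead prediction task can be framed as a supervised regression problem by pairing recent readings with the next value of the series. The sketch below shows one such framing; the function name, the number of lags, and the sample values are hypothetical and do not reproduce the feature set used in the study.

```python
def make_supervised(series, n_lags=3, horizon=1):
    """Build (features, target) rows from a series sampled every 10 min.

    features: the n_lags most recent values up to and including time t
    target:   the value observed `horizon` steps after t
    """
    rows = []
    for t in range(n_lags - 1, len(series) - horizon):
        features = series[t - n_lags + 1: t + 1]
        target = series[t + horizon]
        rows.append((features, target))
    return rows


# Hypothetical CO2 series (ppm), one value every 10 min
co2 = [600, 620, 650, 700, 760, 830]
rows = make_supervised(co2, n_lags=3, horizon=1)
```

Each row can then be passed to any of the regressors compared here, with `horizon=1` corresponding to a forecast one 10-min step ahead.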
The hyperparameter tuning was conducted using a hybrid two-stage optimization strategy. Initially, a systematic grid search was performed to explore the hyperparameter space and identify optimal regions. Subsequently, a manual fine-tuning stage was executed to adjust specific parameters. The parameters finally selected for each algorithm are detailed in Table 7.
Table 8 shows the metrics estimated for each room and model. The results reveal two significant high-level trends. First, the predictive accuracy varied substantially by pollutant type.
Gaseous pollutants (CO2 and VOCs) were consistently more predictable in all models and locations than PM. For example, in the bedroom dataset, all models achieved an R2 value of 0.97-0.98 for CO2 predictions, while R2 scores for PM fractions in the same room ranged from 0.81 (RF) to 0.93 (CatBoost).
Second, the performance of the models was highly dependent on the spatial environment. The kitchen proved to be the most challenging environment for prediction, obtaining the lowest R2 scores for all models, particularly for PM. In this location, R2 values for PM10 were 0.66 (RF), 0.64 (XGBoost), 0.51 (LGBM), and 0.64 (CatBoost). Notably, the LGBM model performed weakest in this setting, with R2 scores ranging from 0.49 to 0.52 for PM. In contrast, the living room and bedroom environments yielded significantly more accurate predictions across all models.
When comparing model architectures, gradient-boosted models (XGBoost, LGBM, and CatBoost) generally outperformed the RF baseline, particularly in the living room and bedroom. XGBoost and CatBoost emerged as the top-performing models.
In the living room, CatBoost demonstrated superior performance for all fractions of PM, achieving an R2 of 0.97 for PM1, PM2.5, PM4, and PM10. XGBoost was also highly competitive, achieving R2 scores between 0.94 and 0.96 for PM. For the prediction of CO2 in the living room, all models performed exceptionally well, with R2 values of 0.95 for the RF model and 0.97 for models based on boosting gradients.
In the bedroom, CatBoost maintained its performance advantage for the prediction of PM, achieving the highest R2 scores for PM1 (0.93), PM2.5 (0.92), and PM10 (0.91). This was closely followed by XGBoost, with R2 scores of 0.89 for PM1 and 0.88 for PM10. All boosting models excelled at predicting CO2 in the bedroom, each achieving an R2 of 0.98.

4. Discussion

4.1. IAQ Unit Evaluation

The IAQ unit has been shown to be an effective technical solution for the ongoing monitoring of indoor air quality. Its design incorporates low-cost sensors, a custom PCB, and NB-IoT connectivity, which allow for the collection of data at a high temporal resolution in domestic environments, striking a balance between accuracy and cost. The low power usage, potential for small-scale production, and modular structure make it replicable and adaptable to various settings. This scalability paves the way for its integration into public health initiatives or smart home projects.

4.2. Analysis of the IAQ

The results (Table 6 and Figure 10, Figure 11 and Figure 12) indicate that indoor air quality varies substantially within the residences according to the specific room, time of day, and usage patterns (Figure 7, Figure 8 and Figure 9). This finding is consistent with previous studies that emphasize the strong influence of human activities on the dynamics of pollutants within enclosed environments [44].
In kitchens, the results reveal a clear dual-driver pollution profile. First, cooking activities were identified as the primary emission source, causing spikes in the PM fractions and VOCs during peak usage times. Second, this emission problem is significantly compounded by insufficient nocturnal ventilation [45,46]. This poor air exchange was evidenced by respiration-driven CO2 concentrations frequently exceeding the 1000 ppm threshold. This lack of ventilation traps pollutants, explaining why residual PM10 and VOCs remained at elevated concentrations, with VOC index values repeatedly exceeding 200, long after cooking had ended.
In the living room, the accumulation of pollutants during the night indicated insufficient passive ventilation during this period. This poor air exchange was evidenced by respiration-driven CO2 concentrations frequently exceeding the 1000 ppm threshold. This trend was consistent with reports from studies on passive occupancy and air quality, where a progressive accumulation of CO2 was observed in the rest areas or in family gathering spaces [47]. The air quality issue in this room was apparently due mainly to the presence of people and their belongings in a poorly ventilated space during the night.
The bedroom, as a space typically enclosed during sleep, showed the highest concentrations of CO2, significantly exceeding the comfort and efficient ventilation threshold (1000 ppm). This finding was consistent with studies warning of the risk of accumulation of CO2 in closed rooms at night, which can affect sleep quality, cognitive performance, and general well-being [48]. Similarly, the persistence of VOCs in the early morning suggested a possible contribution from emission sources associated with furniture materials, bedding, or cleaning products, as has been described in similar residential environments [49].
From a regulatory point of view, although the mean concentrations of PM2.5 and PM10 did not exceed the daily WHO guideline values, the peaks recorded during specific events—particularly in the kitchen and bedroom—could pose a risk to people with respiratory diseases or elevated sensitivity [50]. It should be noted that the effects of brief but repeated exposures are not always addressed in current recommendations, raising the need to evaluate the cumulative impacts of these peaks.
Regarding VOCs, the absence of specific WHO guidelines for residential indoor environments complicates risk assessment. However, indicative thresholds (typically corresponding to a VOC index > 150) are often cited as comfort limits for sensitive individuals or those with respiratory pathologies [51]. In this study, this threshold was sporadically exceeded in all three rooms. These findings highlight the need for specific mitigation strategies, even in spaces traditionally considered passive, such as bedrooms.
Finally, the statistical results show a clear hourly and spatial differentiation in exposure levels, reinforcing the idea that ventilation or air purification strategies must be adapted in a segmented manner, considering both the type of usage and accumulation patterns. This approach, also proposed in recent works on smart ventilation and adaptive control [52], could effectively contribute to reducing the pollutant load at the most critical times and locations.

4.3. Insights from Predictive Models

The results demonstrated the robustness of the predictive approach used to model IAQ in residential environments using ML techniques. Two primary conclusions can be drawn from the results.
First, all architectures (RF, XGBoost, LGBM, and CatBoost) consistently and accurately modeled the pollutant dynamics. However, a clear performance hierarchy emerged, with advanced gradient boosting models (XGBoost, LGBM, and CatBoost) demonstrating a consistent and significant advantage over the baseline RF model. This finding aligns with a growing body of literature that identifies gradient boosting architectures (XGBoost, CatBoost) as state-of-the-art for tabular and time-series IAQ prediction, often outperforming traditional ensemble methods [39]. For gaseous pollutants (CO2 and VOCs), the boosting models usually matched or outperformed RF in all rooms (e.g., bedroom CO2 with R2 of 0.98 for all boosting models vs. 0.97 for RF). This performance gap was even more pronounced for PM. In the living room, CatBoost (R2 of 0.97) and LGBM (R2 of 0.96–0.97) significantly outperformed RF (R2 of 0.86–0.87). This superiority was also reflected in the absolute error metrics: for PM2.5, for example, CatBoost achieved a low MAE of 2.11 ± 0.35, while the average RF error was significantly higher at 3.00 ± 0.89. A similar trend was observed in the bedroom, where CatBoost (R2 of 0.91–0.93) and XGBoost (R2 of 0.88–0.89) were clearly superior to RF (R2 of 0.81–0.82). This confirms that the non-linear, event-driven (kitchen), and cyclical (bedroom) nature of the IAQ data is managed far more effectively by these advanced architectures, which are designed to sequentially correct errors and model complex interactions.
Second, predictability was highly context-dependent. The prediction of air quality in bedroom environments was straightforward. The accumulation of pollutants, including CO2, VOCs, and PM, occurred slowly, steadily, and in a highly cyclical manner throughout the night, rather than originating from stochastic spikes. This accumulation was followed by rapid dispersion in the morning. This regular pattern facilitated the learning process for predictive models, as evidenced by near-perfect R2 scores. Notably, CO2 predictions achieved an R2 of 0.98 (e.g., CatBoost, XGBoost, LGBM), although this still translated into a considerable absolute error, with MAE values around 58–63 ppm, reflecting the large scale of CO2 fluctuations. For PM, CatBoost also showed excellent performance with an R2 of 0.91–0.93 and a very low MAE, such as 1.56 ± 0.2 for PM2.5. This level of accuracy for CO2 is particularly notable, as other residential studies have reported R2 values of up to 0.90 [22].
Our results suggest that the cyclical, occupancy-driven nature of CO2 in bedrooms is highly predictable with the right features.
The most challenging environment was the kitchen, especially for PM. The metrics for PM were the lowest overall, with R2 values for RF, XGBoost, and CatBoost between 0.64 and 0.66, while LGBM dropped to 0.49–0.52. This weakness of LGBM was also evident in the error metrics: its SMAPE for PM2.5 was 29%, compared to 18–20% for the other models. Furthermore, its MAE (4.0 ± 0.7) was substantially higher than that of XGBoost (2.8 ± 0.6) or RF (2.9 ± 0.7), indicating a poorer ability to handle stochastic spikes. The difficulty in predicting these spikes is further highlighted by the large gap between RMSE and MAE for all models (e.g., for RF PM1, an MAE of 2.65 but an RMSE of 9.69), indicating the presence of large and infrequent prediction errors. This performance drop for event-driven PM is consistent with findings in the literature. Studies focusing on PM2.5 forecasting have reported R2 values in the 0.65–0.70 range, confirming that modeling transient high-amplitude spikes from sources such as cooking remains a significant challenge [53].
Living rooms showed very high performance, particularly for the boosting models. The prediction of occupancy-driven CO2 was very accurate (R2 of 0.97 for all boosting models). The prediction of PM was strongest in this room, with CatBoost achieving a near-perfect R2 of 0.97 for all fractions of PM, followed closely by LGBM (R2 of 0.94–0.95) and XGBoost (R2 of 0.94–0.96). This result outperforms those reported in recent PM2.5 prediction studies [21]. It also highlights the clear underperformance of RF in this context, which achieved an R2 of 0.86–0.87.
Quantitatively, this R2 gap corresponded to a significant difference in practical error: the MAE of RF for PM2.5 was 3.00 ± 0.89, while the MAE of the top-performing CatBoost model was only 2.11 ± 0.35. This gap may be associated with the superior ability of boosting models to capture the more erratic behavior of PM2.5 and PM10.
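The gap between RMSE and MAE discussed above for kitchen PM can be reproduced with synthetic residuals: a few large errors barely move the MAE but inflate the RMSE, because errors are squared before averaging. The residual values below are hypothetical and chosen only to show the effect.

```python
import math


def mae(errors):
    """Mean absolute error of a list of residuals."""
    return sum(abs(e) for e in errors) / len(errors)


def rmse(errors):
    """Root mean square error of a list of residuals."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))


# Uniform small residuals, as in a smooth, cyclical signal
steady = [2.0] * 100
# Mostly small residuals plus two badly missed spikes
spiky = [1.0] * 98 + [50.0, 60.0]

steady_gap = rmse(steady) / mae(steady)
spiky_gap = rmse(spiky) / mae(spiky)
```

Both residual sets have an MAE close to 2, yet the RMSE of the second set is several times larger, which is the signature of rare but large prediction errors.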
In summary, the models achieved high R2 scores in environments with regular cyclical patterns (e.g., CO2 in the bedroom), but found environments with stochastic spikes driven by events (e.g., PM in the kitchen) more difficult to predict. The superior predictability of gaseous pollutants compared to particulate matter is consistent with their more uniform behavior over time. The results, considering all metrics, clarify the model hierarchy. CatBoost and XGBoost emerged as the most robust and high-performance models overall. They consistently achieved not only the highest R2, but also the lowest MAE and RMSE, particularly in the predictions of the living room and bedroom PM (e.g., CatBoost PM2.5 MAE of 2.11 in the living room and 1.56 in the bedroom). The robustness of these models against multicollinearity, their superior ability to model non-linear relationships in complex environmental phenomena [54], and their ability to capture the system’s memory [55] may underlie this improved performance.
The adopted approach demonstrated that data from the proposed LCS can support good levels of predictive accuracy. In addition, the findings reinforce the utility of ML in developing location-specific early warning systems. The proliferation of low-cost monitors offers an opportunity for building automation [56], and the combination of these data with ML algorithms has great potential to predict air quality. The results of this study validate this potential. By demonstrating high accuracy in 10-min forecasting, our work confirms that these models can form the basis for more efficient operation of heating, ventilation, and air conditioning (HVAC) systems. More importantly, this predictive capability is key to identifying sources of pollution and, crucially, enabling timely interventions to prevent health issues associated with poor air quality [57].

4.4. Limitations and Future Work

The study presents some limitations. First, the current analysis focuses on a detailed case study that involved a limited number of households. Although this approach allows for high data granularity, a valuable avenue for future research will be to validate and extend these findings across a larger, more diverse cohort to assess the models’ generalizability. Second, the short duration of the experiment prevented capturing seasonal variability (i.e., differences between winter and summer), which may affect ventilation and pollutant concentrations. Future studies should aim to replicate this method with a larger group and conduct a longitudinal study that covers all seasonal variations. In addition, future work should explore the integration of the IAQ unit with active control strategies. Specifically, researching the occupants’ window-opening behavior [58] could help to refine ventilation models and deploy the unit as a state observer for reinforcement learning agents to optimize HVAC operations autonomously. Finally, hybrid approaches for real-time monitoring combined with periodic passive sampling should be explored to enable risk-based assessment, integrating continuous low-cost sensors for proxies and trends with targeted measurements of specific carcinogenic pollutants, such as radon and formaldehyde, that require specialized and high-cost instrumentation.

5. Conclusions

This study demonstrates that daily mean pollutant concentrations are insufficient to assess IAQ. Although 24 h averages often remained below established limit values, our high-temporal-resolution analysis revealed significant acute concentration peaks. These transient episodes, directly correlated with events such as cooking and nocturnal occupancy in poorly ventilated rooms, are often missed by traditional assessments. Such high-intensity short-term exposures represent a relevant primary risk to respiratory health and occupant comfort.
A central finding is that high-fidelity predictive models can be successfully developed using data streams from custom-developed, low-cost prototypes. These devices, which integrate sensors with IoT transmission capabilities for remote home monitoring, provide the high-resolution data necessary to train ML algorithms. The resulting models proved highly accurate, although the performance was context-dependent. These findings reinforce the need for specific, data-driven preventive measures. The results support the implementation of sensor-assisted or controlled natural ventilation; automated, event-triggered extraction systems in kitchens; active CO2 and VOC alerts in bedrooms and living areas; and smart air purifiers adapted to identified risk schedules and occupancy patterns.
In summary, evaluating indoor pollutant exposure requires a paradigm shift from static daily averages to a dynamic, predictive perspective. The synergy between affordable and scalable data capture and predictive ML analytics represents a promising strategy for proactive air quality management, ultimately improving health, comfort, and energy efficiency in residential environments.

Author Contributions

P.C.-M. Investigation; methodology; writing—original draft preparation; software; formal analysis; visualization; data curation; writing—review and editing. D.S.-L. conceptualization; Investigation; methodology; formal analysis; software; validation; writing—original draft preparation; writing—review and editing. A.L.-D. conceptualization; Investigation; methodology; formal analysis; software; visualization; validation; writing—original draft preparation; writing—review and editing. D.S.-M.: Investigation; methodology; conceptualization; writing—original draft preparation; formal analysis; software; writing—review and editing; visualization; supervision; project administration; funding acquisition. All authors have read and agreed to the published version of the manuscript.

Funding

This contribution has been supported by grant PID2021-126810OB-I00 funded by MICIU/AEI/10.13039/501100011033 and by ERDF/EU.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The source code for the software and the hardware design files (e.g., schematics, PCB layouts, and CAD models) presented in this study are openly available at: https://atari-researchlab.github.io/cicerone-airlink (accessed on 27 October 2025).

Acknowledgments

During the preparation of this manuscript, the authors used Google Gemini 2.5 for the purposes of improving grammar and phrasing to enhance the readability of the manuscript. After using this tool/service, the authors reviewed and edited the content as needed and take full responsibility for the content of the publication.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
ANN       Artificial Neural Network
CatBoost  Categorical Boosting
CO        Carbon Monoxide
CO2       Carbon Dioxide
COPD      Chronic Obstructive Pulmonary Disease
COTS      Commercial Off-The-Shelf
EC        Electrochemical
eCO2      Equivalent Carbon Dioxide
XGBoost   Extreme Gradient Boosting
IAQ       Indoor Air Quality
IoT       Internet of Things
IQR       Interquartile Range
MAE       Mean Absolute Error
ML        Machine Learning
MOS       Metal-Oxide Semiconductor
NaHCO3    Sodium Bicarbonate
NB-IoT    Narrowband Internet of Things
NDIR      Non-Dispersive Infrared
NH3       Ammonia
NO2       Nitrogen Dioxide
LCS       Low-Cost Sensors
LGBM      Light Gradient Boosting Machine
O3        Ozone
OPCs      Optical Particle Counters
PCB       Printed Circuit Board
PIDs      Photo-Ionization Detectors
PM        Particulate Matter
R2        Coefficient of Determination
RF        Random Forest
RMSE      Root Mean Square Error
RNN       Recurrent Neural Network
SMAPE     Symmetric Mean Absolute Percentage Error
VOCs      Volatile Organic Compounds
WHO       World Health Organization

References

  1. World Health Organization. Ambient (Outdoor) Air Pollution. Available online: https://www.who.int/news-room/fact-sheets/detail/ambient-(outdoor)-air-quality-and-health (accessed on 23 October 2025).
  2. Camacho-Magriñán, P.; Sales-Lerida, D.; León-Jiménez, A.; Sanchez-Morillo, D. Indoor environmental monitoring and chronic respiratory diseases: A systematic review. Technologies 2025, 13, 122. [Google Scholar] [CrossRef]
  3. Tran, V.V.; Park, D.; Lee, Y.C. Indoor air pollution, related human diseases, and recent trends in the control and improvement of indoor air quality. Int. J. Environ. Res. Public Health 2020, 17, 2927. [Google Scholar] [CrossRef]
  4. Kumar, P.; Singh, A.B.; Arora, T.; Singh, S.; Singh, R. Critical review on emerging health effects associated with the indoor air quality and its sustainable management. Sci. Total Environ. 2023, 872, 162163. [Google Scholar] [CrossRef] [PubMed]
  5. Raju, S.; Siddharthan, T.; McCormack, M.C. Indoor air pollution and respiratory health. Clin. Chest Med. 2020, 41, 825–843.
  6. Mannan, M.; Al-Ghamdi, S.G. Indoor Air Quality in Buildings: A Comprehensive Review on the Factors Influencing Air Pollution in Residential and Commercial Structure. Int. J. Environ. Res. Public Health 2021, 18, 3276.
  7. Sá, J.P.; Alvim-Ferraz, M.C.M.; Martins, F.G.; Sousa, S.I. Application of the low-cost sensing technology for indoor air quality monitoring: A review. Environ. Technol. Innov. 2022, 28, 102551.
  8. Ródenas García, M.; Spinazzé, A.; Branco, P.T.; Borghi, F.; Villena, G.; Cattaneo, A.; Sousa, S.I. Review of low-cost sensors for indoor air quality: Features and applications. Appl. Spectrosc. Rev. 2022, 57, 747–779.
  9. Castell, N.; Viana, M.; Minguillón, M.C.; Guerreiro, C.; Querol, X. Real-World Application of New Sensor Technologies for Air Quality Monitoring; ETC/ACM Technical Paper 16/2013; European Topic Centre on Air Pollution and Climate Change Mitigation: Roskilde, Denmark, 2013; p. 34.
  10. Castell, N.; Dauge, F.-R.; Schneider, P.; Vogt, M.; Lerner, U.; Fishbain, B.; Broday, D.; Bartonova, A. Can Commercial Low-Cost Sensor Platforms Contribute to Air Quality Monitoring and Exposure Estimates? Environ. Int. 2017, 99, 293–302.
  11. Lewis, A.; Von Schneidemesser, E.; Peltier, R. Low-Cost Sensors for the Measurement of Atmospheric Composition: Overview of Topic and Future Applications; WMO: Geneva, Switzerland, 2018.
  12. Wall, D.; McCullagh, P.; Cleland, I.; Bond, R. Development of an Internet of Things solution to monitor and analyse indoor air quality. Internet Things 2021, 14, 100392.
  13. Wang, B.; Kong, W.; Guan, H.; Xiong, N.N. Air Quality Forecasting Based on Gated Recurrent Long Short Term Memory Model in Internet of Things. IEEE Access 2019, 7, 69524–69534.
  14. Saini, J.; Dutta, M.; Marques, G. Indoor Air Quality Monitoring with IoT: Predicting PM10 for Enhanced Decision Support. In Proceedings of the 2020 International Conference on Decision Aid Sciences and Application (DASA), Sakheer, Bahrain, 8–9 November 2020; pp. 504–508.
  15. Samadi, S.; Kumawat, A.K. IoT Enabled Low-Cost Real-Time Remote Air Quality Monitoring and Forecasting System. In Proceedings of the 2023 International Conference on Device Intelligence, Computing and Communication Technologies (DICCT), Dehradun, India, 21–22 April 2023; pp. 253–258.
  16. Saini, J.; Dutta, M.; Marques, G. Machine learning for indoor air quality assessment: A systematic review and analysis. Environ. Model. Assess. 2024, 30, 417–434.
  17. Wei, W.; Ramalho, O.; Malingre, L.; Sivanantham, S.; Little, J.C.; Mandin, C. Machine learning and statistical models for predicting indoor air quality. Indoor Air 2019, 29, 704–726.
  18. Feng, X.; Li, Q.; Zhu, Y.; Hou, J.; Jin, L.; Wang, J. Artificial neural networks forecasting of PM2.5 pollution using air mass trajectory based geographic model and wavelet transformation. Atmos. Environ. 2015, 107, 118–128.
  19. Kim, M.; Kim, Y.; Sung, S.; Yoo, C. Data-driven prediction model of indoor air quality by the preprocessed recurrent neural networks. In Proceedings of the 2009 ICCAS-SICE, Fukuoka, Japan, 18–21 August 2009; pp. 1688–1692.
  20. Xu, C.; Xu, D.; Liu, Z.; Li, Y.; Li, N. Estimating hourly average indoor PM2.5 using the random forest approach in two megacities, China. Build. Environ. 2020, 180, 107025.
  21. Li, Z.; Tong, X.; Ho, J.M.W.; Kwok, T.C.; Dong, G.; Ho, K.F.; Yim, S.H.L. A practical framework for predicting residential indoor PM2.5 concentration using land-use regression and machine learning methods. Chemosphere 2021, 265, 129140.
  22. Taheri, S.; Razban, A. Learning-based CO2 concentration prediction: Application to indoor air quality control using demand-controlled ventilation. Build. Environ. 2021, 205, 108164.
  23. Kallio, J.; Tervonen, J.; Räsänen, P.; Mäkynen, R.; Koivusaari, J.; Peltola, J. Forecasting office indoor CO2 concentration using machine learning with a one-year dataset. Build. Environ. 2021, 187, 107409.
  24. Kim, J.; Hong, Y.; Seong, N.; Kim, D.D. Assessment of ANN Algorithms for the Concentration Prediction of Indoor Air Pollutants in Child Daycare Centers. Energies 2022, 15, 2654.
  25. Kanan, S.M.; El-Kadri, O.M.; Abu-Yousef, I.A.; Kanan, M.C. Semiconducting Metal Oxide Based Sensors for Selective Gas Pollutant Detection. Sensors 2009, 9, 8158–8196.
  26. Epping, R.; Koch, M. On-Site Detection of Volatile Organic Compounds (VOCs). Molecules 2023, 28, 1598.
  27. Pandey, S.K.; Kim, K.H. The Relative Performance of NDIR-based Sensors in the Near Real-time Analysis of CO2 in Air. Sensors 2007, 7, 1683–1696.
  28. Burkart, J.; Steiner, G.; Reischl, G.; Moshammer, H.; Neuberger, M.; Hitzenberger, R. Characterizing the performance of two optical particle counters (Grimm OPC1.108 and OPC1.109) under urban aerosol conditions. J. Aerosol Sci. 2010, 41, 953–962.
  29. Lopez de Ipiña, J.M.; Lopez, A.; Gazulla, A.; Aznar, G.; Belosi, F.; Koivisto, J.; Seddon, R.; Durałek, P.; Vavouliotis, A.; Koutsoukis, G. Field testing of low-cost particulate matter sensors for Digital Twin applications in nanomanufacturing processes. J. Phys. Conf. Ser. 2024, 2695, 012002.
  30. Rabuan, U.; Mohd Nadzir, M.S.; Sham, S.; Bahri, S.; Borah, J.; Majumdar, S.; Lei, T.; Ali, S.; Wahab, M.; Mohd Yunus, N. Evaluations of Low-cost Air Quality Sensors for Particulate Matter (PM2.5) under Indoor and Outdoor Conditions. Sens. Mater. 2023, 35, 2881–2895.
  31. Pietraru, R.N.; Olteanu, A.; Nicolae, M.; Crăciun, R.-A. Contributions to the Development of Fire Detection and Intervention Capabilities Using an Indoor Air Quality IoT Monitoring System. Sensors 2025, 25, 6375.
  32. The MathWorks Inc. ThingSpeak Internet of Things. Available online: https://thingspeak.mathworks.com/ (accessed on 22 October 2025).
  33. Sales-Lérida, D.; Bello, A.J.; Sánchez-Alzola, A.; Martínez-Jiménez, P.M. An approximation for metal-oxide sensor calibration for air quality monitoring using multivariable statistical analysis. Sensors 2021, 21, 4781.
  34. Lara-Doña, A.; Camacho-Magriñán, P.; Sanchez-Morillo, D.; Sales-Lerida, D. CICERONE AirLink v2.0.0. 2025. Available online: https://doi.org/10.5281/zenodo.17423312 (accessed on 27 October 2025).
  35. World Health Organization Regional Office for Europe. Selected Pollutants; World Health Organization: Copenhagen, Denmark, 2023; Available online: https://www.who.int/publications/i/item/9789289002134 (accessed on 11 November 2025).
  36. Kruskal, W.H.; Wallis, W.A. Use of ranks in one-criterion variance analysis. J. Am. Stat. Assoc. 1952, 47, 583–621.
  37. Friedman, J.H. Stochastic gradient boosting. Comput. Stat. Data Anal. 2002, 38, 367–378.
  38. Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Ye, Q.; Liu, T.-Y. LightGBM: A highly efficient gradient boosting decision tree. In Proceedings of the Advances in Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; pp. 3146–3154.
  39. Guo, Z.; Wang, X.; Ge, L. Classification prediction model of indoor PM2.5 concentration using CatBoost algorithm. Front. Built Environ. 2023, 9, 1207193.
  40. Sun, X.; Tian, Z. A novel air quality index prediction model based on variational mode decomposition and SARIMA-GA-TCN. Process Saf. Environ. Prot. 2024, 184, 961–992.
  41. American Society of Heating, Refrigerating and Air-Conditioning Engineers, Inc. (ASHRAE). ANSI/ASHRAE Standards 62.1 & 62.2. Available online: https://www.ashrae.org/technical-resources/bookstore/standards-62-1-62-2 (accessed on 11 November 2025).
  42. Fernández-Agüera, J.; Dominguez-Amarillo, S.; Fornaciari, M.; Orlandi, F. TVOCs and PM2.5 in naturally ventilated homes: Three case studies in a mild climate. Sustainability 2019, 11, 6225.
  43. Mula, V.; Bogdanov, J.; Petreska Stanoeva, J.; Zeneli, L.; Mehmeti, V.; Gelmini, F.; Daci, A.; Berisha, A.; Zdravkovski, Z.; Beretta, G. Semi-Quantitative Characterization of Volatile Organic Compounds in Indoor and Outdoor Air Using Passive Samplers: A Case Study of Milan, Italy. Atmosphere 2025, 16, 1088.
  44. Tham, K. Indoor air quality and its effects on humans—A review of challenges and developments in the last 30 years. Energy Build. 2016, 130, 637–650.
  45. Tang, R.; Sahu, R.; Su, Y.; Milsom, A.; Mishra, A.; Berkemeier, T.; Pfrang, C. Impact of cooking methods on indoor air quality: A comparative study of particulate matter (PM) and volatile organic compound (VOC) emissions. Indoor Air 2024, 2024, 6355613.
  46. Klein, F.; Baltensperger, U.; Prévôt, A.S.H.; El Haddad, I. Quantification of the impact of cooking processes on indoor concentrations of volatile organic species and primary and secondary organic aerosols. Indoor Air 2019, 29, 926–942.
  47. Molinier, B.; Arata, C.; Katz, E.F.; Lunderberg, D.M.; Ofodile, J.; Singer, B.C.; Nazaroff, W.W.; Goldstein, A.H. Bedroom concentrations and emissions of volatile organic compounds during sleep. Environ. Sci. Technol. 2024, 58, 7958–7967.
  48. Zendels, P. Indoor Air Quality, Sleep Quality and Next-Day Cognitive Performance. Master’s Thesis, University of North Carolina at Charlotte, Charlotte, NC, USA, 2022. Available online: http://hdl.handle.net/20.500.13093/etd:3081 (accessed on 27 October 2025).
  49. Mečiarová, Ľ.; Vilčeková, S.; Burdová, E.K.; Kiselák, J. Factors effecting the total volatile organic compound (TVOC) concentrations in Slovak households. Int. J. Environ. Res. Public Health 2017, 14, 1443.
  50. Nie, T.; Zhang, G.; Sun, Y.; Wang, W.; Wang, T.; Duan, H. Effects of indoor air quality on human physiological impact: A review. Buildings 2025, 15, 1296.
  51. Mølhave, L.; Nielsen, G. Interpretation and limitations of the concept ‘total volatile organic compounds’ (TVOC) as indicator of human responses to exposures of volatile organic compounds (VOC) in indoor air. Indoor Air 1992, 2, 65–77.
  52. Zheng, C.; Wang, Y. Review on demand control ventilation. In Proceedings of the World Sustainable Built Environment Conference 2017, Hong Kong, China, 5–7 June 2017; pp. 455–460.
  53. Hill, L.D.; Pillarisetti, A.; Delapena, S.; Garl, C.; Pennise, D.; Pelletreau, A.; Smith, K.R. Machine-learned modeling of PM2.5 exposures in rural Lao PDR. Sci. Total Environ. 2019, 676, 811–822.
  54. Zhao, X.; Wang, S.; Li, P.; Shi, X. Long-term indoor air quality monitoring in office buildings: Data-driven and goal-oriented recommendations for sensor placement and sampling frequency. Build. Environ. 2025, 283, 113392.
  55. Luoma, M.; Batterman, S.A. Autocorrelation and variability of indoor air quality measurements. AIHAJ 2000, 61, 658–668.
  56. Haase, J.; Alahmad, M.; Nishi, H.; Ploennigs, J.; Tsang, K.F. The IoT mediated built environment: A brief survey. In Proceedings of the 2016 IEEE 14th International Conference on Industrial Informatics (INDIN), Poitiers, France, 19–21 July 2016; pp. 1065–1068.
  57. Yu, W.; Nakisa, B.; Loke, S.W.; Stevanovic, S.; Guo, Y.; Rastgoo, M.N. Indoor PM2.5 forecasting and the association with outdoor air pollution: A modelling study based on sensor data in Australia. arXiv 2024, arXiv:2405.07404.
  58. Dai, X.; Liu, J.; Zhang, X. A Review of Studies Applying Machine Learning Models to Predict Occupancy and Window-Opening Behaviours in Smart Buildings. Energy Build. 2020, 223, 110159.
Figure 1. Block diagram of the test bench for evaluating and selecting sensors.
Figure 2. Overview of the custom-designed printed circuit board (PCB) used to integrate the selected components in the final prototype.
Figure 3. Calibration curves derived from co-location experiments. The linear regression between the low-cost sensor readings (x-axis) and the reference instrument measurements (y-axis) for VOCs is shown. The dotted line represents the Ordinary Least Squares (OLS) fit. R2 denotes the coefficient of determination.
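The Figure 3 caption describes an Ordinary Least Squares fit of reference measurements (y) against low-cost sensor readings (x). The following is a minimal sketch of such a fit in plain Python, not the authors' calibration code; the function names `ols_fit` and `calibrate` and the sample numbers are illustrative assumptions.

```python
def ols_fit(x, y):
    """Fit y = intercept + slope * x by ordinary least squares.

    Returns (slope, intercept, r2), where r2 = 1 - SS_res / SS_tot.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (>= 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - mean_y) ** 2 for b in y)
    return slope, intercept, 1.0 - ss_res / ss_tot


def calibrate(reading, slope, intercept):
    """Map a raw low-cost sensor reading onto the reference scale."""
    return intercept + slope * reading


# Hypothetical co-location data: sensor readings (x) vs. reference values (y).
sensor = [0.0, 1.0, 2.0, 3.0]
reference = [1.0, 3.0, 5.0, 7.0]
slope, intercept, r2 = ols_fit(sensor, reference)
```

Once fitted on co-location data, the slope and intercept can be applied to new raw readings with `calibrate`.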
Figure 4. Response of the CO2 sensors over time.
Figure 5. 3D Assembly of the air quality measurement unit.
Figure 6. Final prototype and housing.
Figure 7. Daily evolution of PM2.5, PM10, VOCs, and CO2 in the kitchens. The mean value and standard deviation of the time series are shown. Red boxes highlight some transient peaks.
Figure 8. Daily evolution of PM2.5, PM10, VOCs, and CO2 in the living rooms. The mean value and standard deviation of the time series are shown.
Figure 9. Daily evolution of PM2.5, PM10, VOCs, and CO2 in the bedrooms. The mean value and standard deviation of the time series are shown.
Figure 10. Air quality parameter values in kitchens segregated by time slots. Dashed red lines indicate the mean values. * p < 0.05; ** p < 0.01; *** p < 0.001.
Figure 11. Air quality parameter values in living rooms segregated by time slots. Dashed red lines indicate the mean values. * p < 0.05; ** p < 0.01; *** p < 0.001.
Figure 12. Air quality parameter values in bedrooms segregated by time slots. Dashed red lines indicate the mean values. * p < 0.05; ** p < 0.01; *** p < 0.001.
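The asterisks in Figures 10–12 flag significant differences between the four time slots. Assuming these come from the Kruskal–Wallis rank test listed in the references [36], the sketch below shows how the H statistic, its p-value for four groups, and the star labels could be computed in plain Python. The function names are hypothetical, the sample groups are invented, and the chi-square tail formula is the closed form for exactly three degrees of freedom (four groups).

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """Tie-corrected Kruskal-Wallis H statistic for a list of samples."""
    pooled = [v for g in groups for v in g]
    n = len(pooled)
    ranks = average_ranks(pooled)
    total, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    h = 12.0 / (n * (n + 1)) * total - 3 * (n + 1)
    counts = {}
    for v in pooled:
        counts[v] = counts.get(v, 0) + 1
    correction = 1 - sum(c ** 3 - c for c in counts.values()) / (n ** 3 - n)
    if correction == 0:
        raise ValueError("all observations are identical")
    return h / correction


def chi2_sf_df3(x):
    """Upper-tail probability of a chi-square variable with 3 degrees of freedom."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


def stars(p):
    """Map a p-value to the significance labels used in the captions."""
    if p < 0.001:
        return "***"
    if p < 0.01:
        return "**"
    if p < 0.05:
        return "*"
    return "ns"


# Invented readings for Night, Morning, Afternoon, Evening.
slots = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
h_stat = kruskal_wallis_h(slots)
p_value = chi2_sf_df3(h_stat)
label = stars(p_value)
```

The chi-square approximation is only reliable when each group holds more than a handful of observations, which the three-value groups here do not; they are kept small purely for readability.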
Table 1. Sensors used for comparative benchmarking.

| Device | Parameters | Manufacturer | Technology | Calibration |
|---|---|---|---|---|
| MICS-VZ-89TE | tVOC, eCO2 | SGX Sensortech (Neuchâtel, Switzerland) | MOX | No |
| ENS160 | tVOC, eCO2 | ScioSense (Eindhoven, The Netherlands) | MOX | No |
| SGP40 | tVOC | Sensirion (Stäfa, Switzerland) | MOX | Precalibrated |
| TB600B-TVOC-10 | tVOC | ECSense GmbH (Hohenschäftlarn, Germany) | Electrochem. | Calibrated in lab |
| SEN54 | VOC index, NOx index, T, RH, PM1, PM2.5, PM4, PM10 | Sensirion (Stäfa, Switzerland) | MOX, Algorithm, NDIR | Precalibrated and tested with DRX8533 |
| SCD41 | CO2 | Sensirion (Stäfa, Switzerland) | NDIR photoacoustic | Precalibrated and self-calibrated |
| XENSIV PAS CO2 + booster | CO2 | Infineon Technologies (Neubiberg, Germany) | NDIR photoacoustic | Precalibrated and self-calibrated |
| Telaire T6793-5K | CO2 | Amphenol Sensors (St. Marys, PA, USA) | NDIR optical | Factory calibrated up to 5000 ppm |
Table 2. Features used to train and validate the machine learning models.

| Feature | No. of features | Description |
|---|---|---|
| Hour of the day (0–23) | 1 | Provides temporal context related to routine human activities and time-dependent pollution patterns. |
| T, H at t − 10 min | 2 | Temperature and humidity 10 min earlier, used as modulators of human activity, ventilation dynamics, and pollutant reactivity. |
| PM1, PM2.5, PM4, PM10, CO2, VOC at t − 10 min | 6 | Pollutant levels measured 10 min prior, providing the model with very short-term lag information. |
| PM1, PM2.5, PM4, PM10, CO2, VOC, T, H at t − 60, t − 120, t − 240, t − 300, t − 720, and t − 1440 min | 48 | Historical context through lagged features representing pollutant and environmental conditions 1, 2, 4, 5, 12, and 24 h earlier. |
| 6-h moving average of PM1, PM2.5, PM4, PM10, CO2, VOC | 6 | Smoothing feature capturing accumulated trends over the previous 6 h for each pollutant. |
| Trend component of PM1, PM2.5, PM4, PM10, CO2, VOC | 6 | Short-term change indicator computed as the deviation from the moving average, highlighting abrupt variations. |
| VOC–CO2 interaction term | 1 | Product of the 10 min lagged values of VOC and CO2, allowing the model to capture nonlinear co-occurrence effects. |
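The 70 features listed in Table 2 can be assembled from regularly sampled series. The sketch below is an illustration under stated assumptions, not the authors' pipeline: it assumes one sample every 10 min, computes the 6-h moving average over the 36 samples preceding the target time, and defines the trend as the 10 min lagged value minus that average. The name `build_features` and the feature keys are hypothetical.

```python
POLLUTANTS = ["PM1", "PM2.5", "PM4", "PM10", "CO2", "VOC"]
ENVIRONMENT = ["T", "H"]
STEP_MIN = 10                                 # assumed sampling interval
LONG_LAGS_MIN = [60, 120, 240, 300, 720, 1440]  # lags listed in Table 2
MA_WINDOW_MIN = 360                           # 6-h moving average


def build_features(series, hours, i):
    """Build the Table 2 feature vector for the sample at index i.

    series: dict mapping variable name -> list of floats (10 min spacing)
    hours:  list with the hour of the day (0-23) of each sample
    """
    max_steps = max(LONG_LAGS_MIN) // STEP_MIN
    if i < max_steps:
        raise ValueError("need at least 24 h of history before index i")

    feats = {"hour": hours[i]}

    # Values 10 min earlier (one step back): 2 environmental + 6 pollutants.
    for name in ENVIRONMENT + POLLUTANTS:
        feats[f"{name}_lag10"] = series[name][i - 1]

    # Longer lags for all 8 variables: 8 x 6 = 48 features.
    for lag in LONG_LAGS_MIN:
        j = i - lag // STEP_MIN
        for name in POLLUTANTS + ENVIRONMENT:
            feats[f"{name}_lag{lag}"] = series[name][j]

    # Moving average over the previous 6 h and deviation from it (assumed form).
    window = MA_WINDOW_MIN // STEP_MIN
    for name in POLLUTANTS:
        past = series[name][i - window:i]
        moving_avg = sum(past) / window
        feats[f"{name}_ma6h"] = moving_avg
        feats[f"{name}_trend"] = series[name][i - 1] - moving_avg

    # Interaction of the 10 min lagged VOC and CO2 values.
    feats["VOCxCO2"] = feats["VOC_lag10"] * feats["CO2_lag10"]
    return feats
```

Using only samples strictly before index `i` keeps the target value out of its own predictors, which matters when the same series supplies both features and labels.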
Table 3. Coefficient of determination (R2) between the values of VOCs measured by the evaluated sensors and by the reference device TB600B-TVOC-10.

| | MICS-VZ-89TE | ENS160 | SGP40 | SEN54 |
|---|---|---|---|---|
| MICS-VZ-89TE | 1.00 | | | |
| ENS160 | 0.77 | 1.00 | | |
| SGP40 | 0.39 | 0.75 | 1.00 | |
| SEN54 | 0.27 | 0.67 | 0.86 | 1.00 |
| TB600B-TVOC-10 | 0.32 | 0.65 | 0.83 | 0.89 |
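A lower-triangular matrix like the one in Table 3 can be produced by computing R2 for every pair of sensors. The sketch below assumes R2 is the squared Pearson correlation, which equals the coefficient of determination of a simple linear regression with intercept; the function names and the toy readings are illustrative and are not data from the study.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two equally long series."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)


def pairwise_r2(readings):
    """Return {(row, col): R2} for the lower triangle, diagonal included.

    readings: dict mapping sensor name -> list of time-aligned values.
    """
    names = list(readings)
    table = {}
    for i, row in enumerate(names):
        for col in names[: i + 1]:
            table[(row, col)] = r_squared(readings[row], readings[col])
    return table


# Toy, time-aligned readings from three hypothetical sensors.
toy = {
    "sensor_a": [1.0, 2.0, 3.0, 4.0],
    "sensor_b": [2.0, 4.0, 6.0, 8.0],
    "sensor_c": [1.0, 3.0, 2.0, 4.0],
}
matrix = pairwise_r2(toy)
```

The series must be resampled to common timestamps before being compared, since sensors with different sampling rates would otherwise be paired point by point incorrectly.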
Table 4. Coefficient of determination (R2) between eCO2 and CO2 sensors. * denotes eCO2 sensors.

| | T6793-5K | SCD41 | XENSIV PAS CO2 | MICS-VZ-89TE * |
|---|---|---|---|---|
| T6793-5K | 1.00 | | | |
| SCD41 | 0.66 | 1.00 | | |
| XENSIV PAS CO2 | 0.60 | 0.97 | 1.00 | |
| MICS-VZ-89TE * | 0.03 | 0.26 | 0.26 | 1.00 |
| ENS160 * | 0.01 | 0.10 | 0.10 | 0.41 |
Table 5. Coefficient of determination (R2) between NDIR sensor measurements within the <2000 ppm range.

| | T6793-5K | SCD41 | XENSIV PAS CO2 |
|---|---|---|---|
| T6793-5K | 1.00 | | |
| SCD41 | 0.90 | 1.00 | |
| XENSIV PAS CO2 | 0.88 | 0.97 | 1.00 |
Table 6. Synthesis of descriptive statistics for the monitored environmental parameters: Particulate Matter (PM1, PM2.5, PM4, PM10), Volatile Organic Compounds (VOC index), Carbon Dioxide (CO2), Temperature (T), and Relative Humidity (H). Values are presented as Mean ± Standard Deviation, except for the Max, Min rows, which indicate the range. Units: PM concentrations in µg/m³; VOC in index points (0–500); CO2 in ppm; Temperature (T) in °C; Humidity (H) in %. Time slots: Night (00:00–06:00); Morning (06:00–12:00); Afternoon (12:00–18:00); Evening (18:00–24:00).

| Room | Time | PM1 | PM2.5 | PM4 | PM10 | VOC | CO2 | T | H |
|---|---|---|---|---|---|---|---|---|---|
| Kitchen | Total | 8.94 ± 5.46 | 9.58 ± 5.86 | 9.76 ± 5.99 | 9.84 ± 6.05 | 123.42 ± 84.40 | 550.63 ± 154.24 | 20.98 ± 0.95 | 60.86 ± 4.63 |
| Kitchen | Max, Min | 28.16, 1.70 | 29.76, 1.78 | 30.15, 1.79 | 30.58, 1.80 | 390.19, 0.0 | 1034.42, 242.87 | 23.52, 18.48 | 73.43, 46.52 |
| Kitchen | Evening | 11.27 ± 5.67 | 12.13 ± 6.16 | 12.39 ± 6.37 | 12.52 ± 5.66 | 149.61 ± 91.71 | 597.79 ± 175.04 | 21.08 ± 1.10 | 62.15 ± 4.77 |
| Kitchen | Night | 8.84 ± 5.53 | 9.47 ± 5.97 | 9.64 ± 6.12 | 9.72 ± 6.20 | 116.18 ± 66.28 | 570.59 ± 171.57 | 21.01 ± 0.99 | 60.50 ± 4.61 |
| Kitchen | Morning | 7.09 ± 4.28 | 7.56 ± 4.48 | 7.66 ± 4.48 | 7.71 ± 4.49 | 98.77 ± 69.45 | 506.89 ± 124.51 | 20.91 ± 0.75 | 59.75 ± 4.71 |
| Kitchen | Afternoon | 8.25 ± 5.29 | 8.82 ± 5.58 | 8.97 ± 5.63 | 9.04 ± 5.66 | 125.05 ± 95.89 | 520.98 ± 115.16 | 20.91 ± 0.91 | 60.83 ± 4.03 |
| Living Room | Total | 4.18 ± 1.89 | 4.46 ± 1.99 | 4.51 ± 2.01 | 4.54 ± 2.03 | 110.12 ± 77.11 | 629.47 ± 218.33 | 19.66 ± 0.91 | 64.76 ± 2.81 |
| Living Room | Max, Min | 11.20, 1.35 | 11.76, 1.48 | 11.89, 1.52 | 12.00, 1.54 | 359.22, 1.78 | 1277.71, 198.66 | 21.71, 17.57 | 70.62, 56.23 |
| Living Room | Evening | 4.10 ± 1.81 | 4.35 ± 1.94 | 4.39 ± 1.97 | 4.41 ± 1.99 | 107.08 ± 73.83 | 582.95 ± 194.61 | 19.84 ± 0.83 | 64.56 ± 2.83 |
| Living Room | Night | 4.64 ± 1.91 | 4.91 ± 2.01 | 4.94 ± 2.02 | 4.95 ± 2.02 | 139.22 ± 88.54 | 692.53 ± 218.08 | 19.64 ± 0.88 | 65.60 ± 2.15 |
| Living Room | Morning | 3.98 ± 1.86 | 4.26 ± 1.95 | 4.51 ± 2.01 | 4.37 ± 2.00 | 108.07 ± 72.84 | 689.40 ± 242.21 | 19.39 ± 0.91 | 64.95 ± 2.95 |
| Living Room | Afternoon | 4.04 ± 1.91 | 4.33 ± 2.01 | 4.41 ± 2.03 | 4.45 ± 2.05 | 87.70 ± 63.33 | 551.06 ± 173.06 | 19.79 ± 0.93 | 63.95 ± 2.94 |
| Bedroom | Total | 6.67 ± 3.63 | 7.08 ± 3.86 | 7.15 ± 3.90 | 7.18 ± 3.93 | 113.14 ± 77.93 | 1300.46 ± 726.51 | 20.84 ± 0.61 | 64.76 ± 7.08 |
| Bedroom | Max, Min | 19.16, 1.46 | 20.14, 1.54 | 20.31, 1.56 | 20.60, 1.57 | 354.49, 1.00 | 2896.04, 247.71 | 22.48, 19.17 | 78.25, 42.43 |
| Bedroom | Evening | 6.56 ± 3.55 | 6.94 ± 3.75 | 6.99 ± 3.77 | 7.01 ± 3.78 | 120.94 ± 69.24 | 874.69 ± 298.66 | 20.94 ± 0.62 | 63.18 ± 6.99 |
| Bedroom | Night | 8.21 ± 3.85 | 8.66 ± 4.06 | 8.71 ± 4.09 | 8.73 ± 4.10 | 146.21 ± 82.89 | 1899.47 ± 433.15 | 21.05 ± 0.45 | 68.30 ± 5.10 |
| Bedroom | Morning | 5.91 ± 2.62 | 6.30 ± 2.85 | 6.38 ± 2.97 | 6.42 ± 3.03 | 113.43 ± 76.42 | 1817.31 ± 669.12 | 20.64 ± 0.53 | 66.59 ± 6.28 |
| Bedroom | Afternoon | 5.97 ± 3.94 | 6.38 ± 4.20 | 6.49 ± 4.27 | 6.53 ± 4.30 | 68.77 ± 60.99 | 567.26 ± 228.41 | 20.71 ± 0.73 | 60.72 ± 7.29 |
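The per-slot rows of Table 6 follow from grouping readings by the four six-hour slots defined in the caption. A minimal sketch of that grouping is given below; the function names are hypothetical, the readings are invented, and the use of the sample standard deviation is an assumption, since the caption does not say which estimator was used.

```python
from collections import defaultdict
from statistics import mean, stdev

SLOTS = ("Night", "Morning", "Afternoon", "Evening")


def time_slot(hour):
    """Map an hour of the day (0-23) to the time slot named in Table 6."""
    if not 0 <= hour <= 23:
        raise ValueError("hour must be in 0..23")
    return SLOTS[hour // 6]


def summarise_by_slot(readings):
    """Return {slot: (mean, sd)} for an iterable of (hour, value) pairs."""
    groups = defaultdict(list)
    for hour, value in readings:
        groups[time_slot(hour)].append(value)
    return {
        slot: (mean(vals), stdev(vals) if len(vals) > 1 else 0.0)
        for slot, vals in groups.items()
    }


# Invented CO2 readings (ppm) tagged with the hour at which they were taken.
sample = [(1, 10.0), (2, 14.0), (13, 5.0)]
summary = summarise_by_slot(sample)
```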
Table 7. Machine learning models and hyperparameters used in this study.

| Model | Hyperparameters |
|---|---|
| Random Forest | n_estimators = 800, max_depth = 25, min_samples_split = 3, min_samples_leaf = 2, bootstrap = True |
| XGBoost | n_estimators = 3000, learning_rate = 0.05, max_depth = 8, subsample = 0.8, colsample_bytree = 0.8, reg_alpha = 0.1 (L1), reg_lambda = 1.0 (L2), objective = reg:absoluteerror, tree_method = hist |
| LGBMRegressor | n_estimators = 500, learning_rate = 0.05, max_depth = 8 |
| CatBoost | n_estimators = 800, learning_rate = 0.05, max_depth = 6 |
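The Table 7 settings can be kept in a single configuration mapping and passed to the corresponding regressors. The sketch below assumes the scikit-learn, xgboost, lightgbm, and catboost Python packages; the choice of those packages, the dictionary name `MODEL_PARAMS`, and the helper `build_models` are assumptions rather than details confirmed by the article. The third-party imports are deferred so that the configuration can be inspected without the libraries installed.

```python
# Hyperparameters transcribed from Table 7.
MODEL_PARAMS = {
    "Random Forest": {
        "n_estimators": 800,
        "max_depth": 25,
        "min_samples_split": 3,
        "min_samples_leaf": 2,
        "bootstrap": True,
    },
    "XGBoost": {
        "n_estimators": 3000,
        "learning_rate": 0.05,
        "max_depth": 8,
        "subsample": 0.8,
        "colsample_bytree": 0.8,
        "reg_alpha": 0.1,    # L1 regularisation
        "reg_lambda": 1.0,   # L2 regularisation
        "objective": "reg:absoluteerror",
        "tree_method": "hist",
    },
    "LGBMRegressor": {"n_estimators": 500, "learning_rate": 0.05, "max_depth": 8},
    "CatBoost": {"n_estimators": 800, "learning_rate": 0.05, "max_depth": 6},
}


def build_models():
    """Instantiate one regressor per Table 7 row (requires the four packages)."""
    from sklearn.ensemble import RandomForestRegressor
    from xgboost import XGBRegressor
    from lightgbm import LGBMRegressor
    from catboost import CatBoostRegressor

    classes = {
        "Random Forest": RandomForestRegressor,
        "XGBoost": XGBRegressor,
        "LGBMRegressor": LGBMRegressor,
        "CatBoost": CatBoostRegressor,
    }
    return {name: classes[name](**params) for name, params in MODEL_PARAMS.items()}
```

Any setting not listed in Table 7, such as random seeds, is left at each library's default here.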
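Table 8. Performance metrics of the machine learning regression models. Bold values indicate the best-performing model (highest R2 and lowest error) for each specific room and pollutant. Units: PM [µg/m³], CO2 [ppm], VOCs [index].

| Predictive Model | Room | Pollutant | RMSE | R2 | SMAPE | MAE |
|---|---|---|---|---|---|---|
| Random Forest | Kitchen | PM1 | 9.69 ± 7.52 | 0.65 | 18.18 ± 1.60 | 2.65 ± 0.56 |
| Random Forest | Kitchen | PM2.5 | 11.00 ± 8.27 | 0.63 | 18.13 ± 1.33 | 2.93 ± 0.69 |
| Random Forest | Kitchen | PM4 | 11.69 ± 9.73 | 0.63 | 18.21 ± 1.26 | 3.03 ± 0.82 |
| Random Forest | Kitchen | PM10 | **11.25 ± 9.23** | **0.66** | **18.18 ± 1.58** | **3.02 ± 0.70** |
| Random Forest | Kitchen | VOCs | 51.73 ± 5.65 | 0.82 | 25.38 ± 1.83 | 29.53 ± 2.57 |
| Random Forest | Kitchen | CO2 | 76.31 ± 18.29 | 0.88 | 6.32 ± 0.51 | 39.46 ± 4.30 |
| Random Forest | Living Room | PM1 | 12.88 ± 3.42 | 0.87 | 14.89 ± 1.59 | 2.79 ± 0.75 |
| Random Forest | Living Room | PM2.5 | 13.89 ± 4.22 | 0.86 | 15.10 ± 1.63 | 3.00 ± 0.89 |
| Random Forest | Living Room | PM4 | 13.99 ± 4.28 | 0.86 | 15.3 ± 1.63 | 3.05 ± 0.89 |
| Random Forest | Living Room | PM10 | 13.92 ± 4.19 | 0.86 | 15.36 ± 1.65 | 3.05 ± 0.86 |
| Random Forest | Living Room | VOCs | 39.35 ± 3.8 | 0.87 | 17.05 ± 1.49 | 18.76 ± 1.68 |
| Random Forest | Living Room | CO2 | 49.41 ± 5.30 | 0.95 | 4.60 ± 0.30 | 27.71 ± 1.96 |
| Random Forest | Bedroom | PM1 | 5.90 ± 1.91 | 0.83 | 14.11 ± 0.87 | 1.72 ± 0.26 |
| Random Forest | Bedroom | PM2.5 | 6.43 ± 2.08 | 0.82 | 14.67 ± 1.02 | 1.92 ± 0.29 |
| Random Forest | Bedroom | PM4 | 6.53 ± 2.14 | 0.82 | 14.87 ± 1.10 | 1.99 ± 0.32 |
| Random Forest | Bedroom | PM10 | 6.77 ± 2.18 | 0.81 | 14.99 ± 1.07 | 2.04 ± 0.32 |
| Random Forest | Bedroom | VOCs | 45.78 ± 4.78 | 0.84 | 22.49 ± 1.63 | 22.91 ± 2.01 |
| Random Forest | Bedroom | CO2 | 112.09 ± 11.11 | 0.97 | 6.12 ± 0.49 | 61.82 ± 5.38 |
| XGBoost | Kitchen | PM1 | **9.23 ± 6.79** | **0.69** | **17.63 ± 1.55** | **2.51 ± 0.56** |
| XGBoost | Kitchen | PM2.5 | 10.31 ± 8.29 | 0.65 | 17.78 ± 1.46 | 2.77 ± 0.65 |
| XGBoost | Kitchen | PM4 | 11.29 ± 8.78 | 0.65 | 17.66 ± 1.31 | 2.93 ± 0.74 |
| XGBoost | Kitchen | PM10 | 11.38 ± 8.91 | 0.64 | 17.82 ± 0.94 | 2.99 ± 0.70 |
| XGBoost | Kitchen | VOCs | 51.62 ± 5.60 | 0.82 | 25.23 ± 1.81 | 29.55 ± 2.66 |
| XGBoost | Kitchen | CO2 | 75.17 ± 17.81 | 0.88 | 6.16 ± 0.40 | 38.59 ± 3.65 |
| XGBoost | Living Room | PM1 | 8.51 ± 3.02 | 0.94 | 17.33 ± 1.30 | 2.16 ± 0.35 |
| XGBoost | Living Room | PM2.5 | 7.63 ± 2.55 | 0.96 | 17.03 ± 1.32 | 2.10 ± 0.43 |
| XGBoost | Living Room | PM4 | 8.51 ± 3.20 | 0.95 | 17.05 ± 1.68 | 2.20 ± 0.42 |
| XGBoost | Living Room | PM10 | 8.62 ± 2.13 | 0.95 | 17.95 ± 1.40 | 2.36 ± 0.45 |
| XGBoost | Living Room | VOCs | 35.40 ± 4.22 | 0.89 | 15.54 ± 1.07 | 16.66 ± 1.52 |
| XGBoost | Living Room | CO2 | 40.59 ± 4.23 | 0.97 | 4.04 ± 0.27 | 24.40 ± 1.76 |
| XGBoost | Bedroom | PM1 | 4.54 ± 1.42 | 0.89 | 13.09 ± 1.04 | 1.48 ± 0.16 |
| XGBoost | Bedroom | PM2.5 | 4.78 ± 1.43 | 0.89 | 13.34 ± 1.43 | 1.61 ± 0.20 |
| XGBoost | Bedroom | PM4 | 5.11 ± 1.76 | 0.88 | 13.66 ± 0.99 | 1.67 ± 0.27 |
| XGBoost | Bedroom | PM10 | 5.14 ± 1.94 | 0.88 | 13.93 ± 1.08 | 1.70 ± 0.28 |
| XGBoost | Bedroom | VOCs | 38.78 ± 3.70 | 0.88 | 19.83 ± 1.86 | 18.71 ± 1.69 |
| XGBoost | Bedroom | CO2 | 107.29 ± 13.17 | 0.98 | 5.70 ± 0.47 | 58.21 ± 3.51 |
| LGBMRegressor | Kitchen | PM1 | 11.48 ± 7.14 | 0.52 | 29.25 ± 2.42 | 3.71 ± 0.47 |
| LGBMRegressor | Kitchen | PM2.5 | 12.73 ± 8.6 | 0.49 | 28.88 ± 2.08 | 4.05 ± 0.68 |
| LGBMRegressor | Kitchen | PM4 | 13.12 ± 9.29 | 0.50 | 28.80 ± 2.71 | 4.13 ± 0.71 |
| LGBMRegressor | Kitchen | PM10 | 13.34 ± 9.47 | 0.51 | 28.97 ± 2.59 | 4.24 ± 0.68 |
| LGBMRegressor | Kitchen | VOCs | 52.92 ± 4.63 | 0.81 | 27.77 ± 1.88 | 32.12 ± 2.00 |
| LGBMRegressor | Kitchen | CO2 | 79.11 ± 13.45 | 0.88 | 7.06 ± 0.45 | 44.40 ± 3.38 |
| LGBMRegressor | Living Room | PM1 | 7.58 ± 2.11 | 0.95 | 27.85 ± 3.09 | 2.49 ± 0.34 |
| LGBMRegressor | Living Room | PM2.5 | 8.07 ± 2.37 | 0.95 | 27.53 ± 2.62 | 2.62 ± 0.38 |
| LGBMRegressor | Living Room | PM4 | 8.10 ± 2.43 | 0.95 | 28.73 ± 3.21 | 2.71 ± 0.40 |
| LGBMRegressor | Living Room | PM10 | 8.45 ± 2.47 | 0.94 | 28.61 ± 2.67 | 2.78 ± 0.41 |
| LGBMRegressor | Living Room | VOCs | 31.00 ± 3.44 | 0.91 | 17.97 ± 1.49 | 17.1 ± 1.06 |
| LGBMRegressor | Living Room | CO2 | **35.79 ± 5.62** | **0.97** | **3.97 ± 0.38** | **23.6 ± 2.25** |
| LGBMRegressor | Bedroom | PM1 | 3.52 ± 0.99 | 0.88 | 14.03 ± 0.86 | 1.38 ± 0.20 |
| LGBMRegressor | Bedroom | PM2.5 | 3.93 ± 1.17 | 0.87 | 14.99 ± 1.21 | 1.56 ± 0.25 |
| LGBMRegressor | Bedroom | PM4 | 4.22 ± 1.30 | 0.86 | 15.67 ± 1.06 | 1.67 ± 0.29 |
| LGBMRegressor | Bedroom | PM10 | 4.32 ± 1.38 | 0.86 | 16.40 ± 1.22 | 1.74 ± 0.29 |
| LGBMRegressor | Bedroom | VOCs | 38.79 ± 5.31 | 0.88 | 24.37 ± 1.20 | 20.91 ± 1.77 |
| LGBMRegressor | Bedroom | CO2 | **99.59 ± 17.55** | **0.98** | **5.78 ± 0.47** | **58.27 ± 4.35** |
| CatBoost | Kitchen | PM1 | 9.79 ± 6.76 | 0.65 | 20.45 ± 1.67 | 2.80 ± 0.51 |
| CatBoost | Kitchen | PM2.5 | **10.79 ± 7.80** | **0.66** | **20.09 ± 0.78** | **3.02 ± 0.64** |
| CatBoost | Kitchen | PM4 | **11.17 ± 9.61** | **0.65** | **20.90 ± 1.44** | **3.16 ± 0.71** |
| CatBoost | Kitchen | PM10 | 11.63 ± 9.00 | 0.64 | 20.44 ± 1.07 | 3.20 ± 0.63 |
| CatBoost | Kitchen | VOCs | **48.04 ± 4.89** | **0.84** | **27.59 ± 1.89** | **30.03 ± 2.26** |
| CatBoost | Kitchen | CO2 | **73.76 ± 15.04** | **0.89** | **6.80 ± 0.34** | **41.76 ± 3.70** |
| CatBoost | Living Room | PM1 | **6.40 ± 2.00** | **0.97** | **19.69 ± 1.14** | **1.95 ± 0.30** |
| CatBoost | Living Room | PM2.5 | **6.96 ± 2.30** | **0.97** | **19.77 ± 1.04** | **2.11 ± 0.35** |
| CatBoost | Living Room | PM4 | **7.24 ± 2.11** | **0.97** | **19.29 ± 0.90** | **2.09 ± 0.25** |
| CatBoost | Living Room | PM10 | **6.73 ± 2.08** | **0.97** | **19.63 ± 1.05** | **2.06 ± 0.27** |
| CatBoost | Living Room | VOCs | **29.44 ± 4.20** | **0.93** | **17.72 ± 1.20** | **17.02 ± 1.77** |
| CatBoost | Living Room | CO2 | 36.87 ± 2.54 | 0.97 | 4.04 ± 0.21 | 24.11 ± 1.24 |
| CatBoost | Bedroom | PM1 | **3.71 ± 1.46** | **0.93** | **14.89 ± 0.92** | **1.40 ± 0.17** |
| CatBoost | Bedroom | PM2.5 | **4.12 ± 1.59** | **0.92** | **15.47 ± 1.38** | **1.56 ± 0.2** |
| CatBoost | Bedroom | PM4 | **4.30 ± 1.66** | **0.92** | **15.77 ± 0.92** | **1.64 ± 0.20** |
| CatBoost | Bedroom | PM10 | **4.51 ± 1.84** | **0.91** | **16.03 ± 1.12** | **1.69 ± 0.20** |
| CatBoost | Bedroom | VOCs | **35.56 ± 4.84** | **0.90** | **22.75 ± 1.93** | **20.46 ± 1.96** |
| CatBoost | Bedroom | CO2 | 105.7 ± 11.37 | 0.98 | 6.38 ± 0.53 | 63.49 ± 4.64 |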
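The four metrics reported in Table 8 can be computed from paired observed and predicted values as sketched below. This is an illustration rather than the authors' evaluation code: SMAPE has several definitions in the literature, and the form used here, the mean of 2|y − ŷ| / (|y| + |ŷ|) expressed in percent, is an assumption, as is the function name `regression_metrics`.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return a dict with RMSE, R2, SMAPE (in %) and MAE."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    rmse = math.sqrt(sum(e ** 2 for e in errors) / n)
    mae = sum(abs(e) for e in errors) / n

    mean_true = sum(y_true) / n
    ss_res = sum(e ** 2 for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot

    # Terms where both values are zero contribute no error.
    smape_terms = [
        0.0 if (abs(t) + abs(p)) == 0 else 2 * abs(t - p) / (abs(t) + abs(p))
        for t, p in zip(y_true, y_pred)
    ]
    smape = 100.0 * sum(smape_terms) / n

    return {"RMSE": rmse, "R2": r2, "SMAPE": smape, "MAE": mae}


# Toy example with one imperfect prediction.
scores = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 5.0])
```

Because RMSE squares each error before averaging, a few badly missed peaks raise it far more than they raise MAE, which is one possible reading of the large gap between the two in the kitchen rows.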
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Camacho-Magriñán, P.; Sales-Lerida, D.; Lara-Doña, A.; Sanchez-Morillo, D. Leveraging Low-Cost Sensor Data and Predictive Modelling for IoT-Driven Indoor Air Quality Monitoring. Smart Cities 2025, 8, 200. https://doi.org/10.3390/smartcities8060200
