1. Introduction
Air pollution poses a significant threat to public health, driving a need for extensive air quality monitoring in both space and time. Traditionally, monitoring has relied on high-grade instruments at fixed stations (e.g., FRM/FEM reference monitors), which provide accurate data but are sparse due to high costs [
1,
2]. In recent years, low-cost sensor (LCS) networks have emerged as a complementary approach, enabling real-time, high-resolution observation of pollutants through dense deployment at a fraction of the cost [
3,
4]. These IoT-based sensor networks are being adopted worldwide to fill gaps in coverage and inform communities about local air quality conditions [
5]. However, a major drawback of LCS data is their inconsistent quality when compared with reference instruments. Discrepancies arise from differences in sensing principles and sensitivity to environmental factors, often leading to biased or noisy readings [
6,
7]. As a result, quality control (QC) measures are essential for detecting and correcting errors in raw sensor observations before they can be reliably used [
8,
9].
Machine learning (ML) has rapidly become a key tool for air quality data QC, enabling automated calibration, anomaly detection, and data correction that outperform traditional linear or rule-based methods [
10,
11]. Researchers have shown that ML models can substantially improve the agreement of low-cost sensor data with reference measurements, for example, boosting R² from 0.4 to 0.99 and reducing errors by an order of magnitude through proper calibration [12]. Beyond calibration, intelligent algorithms can learn normal patterns in sensor data and flag outliers or drifts in real time [
13,
14]. These advances are critical to smart air quality monitoring, where vast streams of sensor data must be automatically validated and corrected to ensure accuracy for end-users ranging from scientists and policymakers to citizens tracking personal exposure [
15].
This survey provides a comprehensive review of the past decade (approximately 2015–2025) of research on ML-based QC for air quality sensor networks, with an emphasis on spatiotemporal techniques and real-time systems. We cover peer-reviewed literature and real-world case studies that illustrate how various ML approaches, ranging from classical algorithms to deep learning, are employed to improve data quality in networked air pollution sensors. In particular, we highlight methods that exploit spatial and temporal correlations across sensors, as well as frameworks for on-line or in situ data correction suitable for real-time deployment. Key application domains such as personal exposure monitoring, integration with atmospheric models, and policy decision support are discussed to underscore the impact of these technologies.
To ensure transparency in this review process, we briefly describe the methodology used for literature selection. Publications from 2015 to 2025 were targeted to capture the most recent decade of research. The primary databases searched included Web of Science, Scopus, IEEE Xplore, and Google Scholar, using keyword combinations such as “air quality monitoring,” “low-cost sensors,” “machine learning,” “quality control,” and “calibration.” Studies were included if they were peer-reviewed or presented as well-documented case studies applying ML methods to sensor calibration, anomaly detection, or spatiotemporal QC. Papers focusing solely on hardware development or generic ML methods unrelated to air quality monitoring were excluded. This procedure yielded a representative body of work that forms the basis of the synthesis presented in the following sections.
The remainder of this paper is organized as follows:
Section 2 introduces air quality sensor networks and their characteristics and contrasts reference-grade monitors with low-cost sensors.
Section 3 catalogs common sources of error in atmospheric sensor observations.
Section 4 surveys ML approaches to QC (traditional ML, deep learning, and hybrid/unsupervised methods) and explains how these algorithms are applied in practice.
Section 5 focuses on spatiotemporal QC techniques, including inter-sensor correlation methods and representative frameworks.
Section 6 reviews real-time and online QC systems, covering streaming data handling, edge deployment, federated/cluster learning, drift detection and re-calibration triggers, and hybrid edge–cloud designs.
Section 7 outlines applications and impacts of improved data quality: personal exposure estimation, model/simulation integration, and policy/management relevance.
Section 8 discusses remaining challenges and future directions, including standardization, uncertainty quantification, scalability, and benchmarked validations. Finally,
Section 9 concludes the paper.
2. Overview of Air Quality Sensor Networks
Air quality sensor networks typically consist of numerous distributed sensing nodes that measure pollutants (e.g., PM2.5, NO2, and O3) and environmental parameters (temperature, humidity, etc.) across urban or regional areas. These nodes often utilize low-cost technologies such as electrochemical cells for gases or optical particle counters for particulates, transmitting data via IoT communication protocols to cloud servers for aggregation [
1,
3]. The appeal of such networks is their ability to provide spatially dense and temporally continuous data in contrast to the sparse coverage of traditional stations [
1]. For instance, community sensor networks and crowdsourced platforms (e.g., PurpleAir, as well as the U.S. EPA’s (Environmental Protection Agency) AirNow platform, which aggregates corrected sensor data) have deployed thousands of low-cost devices worldwide [
4]. Recent evaluations of large-scale deployments, such as in Imperial County (California) and other community-driven monitoring projects, confirm that citizen-led sensor initiatives can generate valuable hyper-local insights, albeit with varying levels of data quality [
16,
17]. This high-density monitoring enables detection of neighborhood-level pollution hotspots and short-term pollution episodes that would be missed by coarse networks [
5,
18]. Ultimately, networked sensors promise to improve public awareness and urban air quality management by offering hyper-local data and trending information [
3,
19]. To clarify the complementary roles and trade-offs in such networks, we briefly contrast reference-grade monitors and low-cost sensors in
Table 1.
Beyond these contrasts, low-cost sensor networks face important limitations. The data from these sensors are generally less reliable than those from reference stations, necessitating robust QC [
6,
23]. Many low-cost sensors are prone to measurement errors due to hardware limitations and environmental interferences. For example, low-cost optical PM sensors can overestimate concentrations at high humidity because water droplets scatter light, unlike federal monitors, which control humidity in the sample inlet [
5,
22]. Electrochemical gas sensors may drift or suffer from cross-sensitivity, responding to gases other than their target, which can cause spurious readings [
6,
7]. Recent systematic assessments further reveal that calibration accuracy strongly depends on algorithm choice, input duration, and environmental predictors, indicating that careful preprocessing is as important as the model itself [
26,
27]. Moreover, manufacturing variability means that each sensor unit may have a unique bias or offset, so one-size calibration does not fit all [
9,
12]. Network communications can also introduce issues (data dropouts, timestamp misalignment, etc.), leading to missing or inconsistent records that require imputation [
14,
28]. Some recent studies have demonstrated that incorporating weather covariates or multivariate statistical models can significantly enhance reliability in such cases [
28,
29].
The reliance on these networks for decision making makes QC critically important. Increasingly, city authorities, researchers, and even individual citizens incorporate sensor data into health alerts, policy formulation, or personal exposure tracking [
15]. For instance, smartphone apps and wearable air quality monitors use IoT sensor readings to advise users of high pollution exposure in real time [
3,
30]. Ensuring that such data are accurate and trustworthy is a major challenge. Basic quality assurance steps—like filtering out implausible values or applying sensor factory calibrations—are usually insufficient on their own [
8]. Therefore, advanced AI and data analytics techniques are being deployed on top of sensor networks to perform dynamic calibration, drift correction, anomaly detection, and data reconciliation across the network [
13,
31,
32]. Recent frameworks, such as HypeAIR and AIrSense, demonstrate that combining multiple anomaly detectors with calibration models can significantly improve the usability of live data streams in smart cities [
14,
30]. These approaches highlight a shift from passive sensing toward adaptive, self-correcting network architectures that continuously refine their outputs as conditions evolve, and
Section 4 details the ML-based QC methods that enable these capabilities.
3. Sources of Error in Atmospheric Observations
Air quality observations from sensors can be corrupted by various sources of error, which QC algorithms must identify and correct. Instrument bias and calibration error are fundamental issues: low-cost sensors often have systematic biases relative to true concentration due to manufacturing differences or simplistic factory calibration [
6,
9,
12]. Each device may read consistently high or low, requiring an offset or scaling correction. Studies have shown that even when calibration models are applied, unit-to-unit variability remains a significant barrier, underscoring the need for generalized or transferable calibration frameworks [
7,
11]. Calibration obtained under one set of conditions (e.g., laboratory or co-location testing) may not hold as conditions change, and sensors are also subject to gradual drift over time due to aging, fouling, or material degradation [
10,
23]. These issues highlight the need for adaptive re-calibration strategies and transferable models, which are further discussed in
Section 8.
Another major source of error is environmental interference. Unlike reference instruments that control sampling conditions, low-cost sensors are directly exposed to ambient environmental variability. Humidity is a well-known interferent, especially for optical PM sensors: high relative humidity can cause hygroscopic growth of particles and fogging, leading to overestimation of particle mass by the sensor [
3,
22]. In contrast, reference PM2.5 monitors often include heaters or dryers to maintain constant humidity in the sample stream, thus avoiding this issue [
5]. Temperature variations can similarly affect sensor baseline signals and amplifiers [
6]. Many gas sensors have internal temperature compensation, but extreme temperatures or rapid changes can still introduce noise. Additionally, cross-sensitivity to non-target species plagues low-cost gas sensors; for example, a NO2 electrochemical sensor might respond to ozone or to strong changes in humidity, confounding its readings [
8,
33]. Metal-oxide sensors (MOS) for gases are especially sensitive to temperature and humidity and also require a burn-in period; their resistance measurements can drift or be disrupted by the presence of other volatile compounds [
12,
27]. In some cases, global data scaling and environmental differentials have been employed to partially mitigate these issues, but they remain an active challenge [
32].
Physical malfunctions and outliers also occur. Sensors can suffer from malfunctions like saturation (e.g., a sudden very high reading when the sensor’s range is exceeded or a voltage spike occurs) or clipping at zero. Power or circuit issues may introduce spikes or dropouts in the data [
14]. Network and data handling errors can produce gaps or duplicate timestamps, which complicate downstream analysis. Some field studies have shown that wireless community networks are particularly vulnerable to such artifacts due to heterogeneous hardware and intermittent connectivity [
16,
17]. All of these manifest as anomalies in the time series that need to be detected. Standard meteorological QC practices often include rules for physical range checks (discarding values outside plausible bounds), time consistency checks (limiting the rate of change between readings), and persistency checks (ensuring a minimum variability) [
5]. For example, Table 1 in the work by Kim et al. [
5] defines realistic min/max limits and maximum rates of change for temperature, humidity, PM2.5, wind, etc., based on sensor specs and physical expectations. Values violating these thresholds are flagged as errors by basic QC. These rule-based filters catch gross errors, but more subtle issues (e.g., a sensor slowly drifting or slightly biased readings under certain weather conditions) require more sophisticated approaches [
13,
28]. Advanced statistical models, such as multivariate Tobit regression or Bayesian neural networks, have recently been applied to improve robustness under such conditions [
28,
34].
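To make these rule-based filters concrete, the sketch below applies range, step (rate-of-change), and persistence checks to a PM2.5 time series. It is a minimal illustration only; the thresholds, window length, and data are placeholders rather than the values defined by Kim et al. [5].

```python
import numpy as np
import pandas as pd

def rule_based_qc(series, vmin=0.0, vmax=1000.0, max_step=100.0,
                  persist_window=12, min_std=1e-3):
    """Flag basic errors in a pollutant time series (illustrative thresholds).

    Returns a DataFrame of boolean flags:
      range_fail   - value outside plausible physical bounds
      step_fail    - change from the previous sample exceeds max_step
      persist_fail - near-zero variability over persist_window samples
    """
    x = series.astype(float)
    range_fail = (x < vmin) | (x > vmax)
    step_fail = x.diff().abs() > max_step
    persist_fail = x.rolling(persist_window).std() < min_std
    flags = pd.DataFrame({"range_fail": range_fail,
                          "step_fail": step_fail,
                          "persist_fail": persist_fail})
    flags["any_fail"] = flags.any(axis=1)
    return flags

# Example: 1 min PM2.5 readings with an injected spike and a stuck segment
t = pd.date_range("2024-01-01", periods=120, freq="min")
pm = pd.Series(20 + 2 * np.random.randn(120), index=t)
pm.iloc[30] = 900.0        # implausible spike
pm.iloc[60:80] = 15.0      # stuck (zero-variability) segment
print(rule_based_qc(pm)["any_fail"].sum(), "samples flagged")
```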
In summary, the primary error sources to address in atmospheric sensor QC include sensor bias, long-term drift, environmental cross-effects (humidity, temperature, etc.), interference from other pollutants, and random anomalies or data dropouts [
8,
23]. Addressing these errors is the foundation upon which ML techniques are built, and the following sections describe how modern QC frameworks target each of these challenges to maintain data quality in smart monitoring networks.
4. Machine Learning Approaches to Quality Control
Machine learning (ML) provides a powerful arsenal of techniques to perform QC on air quality data. Broadly, ML-based QC methods can be categorized into traditional ML models (often supervised regression or classification algorithms), deep learning approaches (using neural network architectures to capture complex patterns), and hybrid or unsupervised methods (combining multiple algorithms or using data-driven discovery without labeled training data) [
8,
10]. These approaches are often complementary—for example, a pipeline may use unsupervised outlier detection followed by supervised calibration. This section surveys each category with representative examples from the literature.
Calibration (supervised regression): In LCS networks, supervised calibration models are typically trained on short co-location campaigns with reference monitors, using environmental covariates (temperature, relative humidity, co-pollutants, etc.) to correct cross-sensitivities and nonlinear biases. Tree ensembles (e.g., random forest and gradient boosting) and kernel/linear baselines (e.g., SVR and ridge) remain strong general-purpose choices for PM2.5 and NO2 [
9,
11,
Recent studies report near-reference performance when appropriate predictors and window lengths are used, with R² frequently exceeding 0.9 for well-instrumented deployments [
11,
27,
33]. Neural approaches further improve accuracy when interactions are complex or inputs are high-dimensional, including mixed scaling and extended inputs for particulate sensors [
12,
32].
Anomaly detection and data repair: Because LCS streams can contain spikes, dropouts, and device faults, unsupervised or semi-supervised detectors are layered on top of calibration. Deep sequence models (e.g., LSTM autoencoders and variational autoencoders) detect distributional shifts and recurrent artifacts, enabling automatic flagging and imputation [
13,
31]. Operational frameworks combine multiple detectors with repair modules so that downstream calibration is stabilized in real time (e.g., HypeAIR and AIrSense) [
14,
30].
Spatiotemporal consistency and network-level QC: Dense networks permit cross-sensor consistency checks, neighborhood-based filtering, and spatially informed regression. Studies have leveraged spatial correlations to interpolate and validate block-level exposure maps while correcting local biases via ML [
5,
10,
18,
35]. Best-practice summaries highlight the importance of choosing predictors, durations, and validation splits appropriate to the climatology and source mix [
7].
Online (real-time) vs. offline (post hoc) QC: Here, we use online QC to denote streaming corrections and anomaly handling performed on edge devices or in low-latency cloud services and offline QC to denote retrospective batch processing. Real-world systems increasingly adopt hybrid edge–cloud pipelines with drift-aware scheduling and automated re-calibration triggers tied to model diagnostics [
14,
23,
30]. Guidance from the U.S. EPA Air Sensor Toolbox and technical standards (e.g., CEN/TS 17660-1) provide procedures to document these steps and communicate uncertainty [
20,
21].
4.1. Traditional Machine Learning Methods
Traditional ML approaches to sensor QC typically involve supervised learning algorithms that learn a mapping from sensor inputs to a corrected output (or an error flag) based on reference data or historical patterns. A common application is sensor calibration via regression. Here, an ML regression model is trained on co-location datasets where low-cost sensors and reference instruments perform measurements side by side, so the model can learn to predict the reference-quality concentration from the raw sensor outputs and possibly additional features (temperature, humidity, etc.) [
9,
12]. Researchers have tried a wide range of algorithms for this task, from simple linear and multilinear regression to more flexible nonlinear models. Comparative studies show that ensemble methods like random forest and gradient boosting tend to outperform linear calibration, especially under variable environmental conditions [
6,
7]. For instance, Ravindra et al. [
11] calibrated low-cost PM2.5 sensors (PurpleAir and Atmos) using multiple ML models; the best model (a decision tree) raised R² values from ∼0.40 to ∼0.99 and reduced RMSE from tens of µg/m³ to <1 µg/m³, a substantial improvement in data quality. Similarly, Koziel and colleagues demonstrated that statistical preprocessing combined with regression can substantially reduce calibration error for NO2 sensors [
27], while their later works highlight how additive/multiplicative scaling and extended calibration inputs further improve robustness across sensor units [
32]. Such results underscore that traditional ML is not limited to “basic regression” but can also incorporate data transformation and feature engineering steps to handle sensor-specific variability. However, these approaches still depend heavily on the quality of reference co-location data and may fail to generalize across sensor types or environmental conditions, limiting their transferability beyond the calibration site.
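As a concrete illustration of co-location calibration, the following sketch trains a random forest on raw sensor output plus temperature and relative humidity to predict the reference concentration, using a time-ordered split to limit leakage. The file name and column names are hypothetical, and the model settings are illustrative rather than those of any cited study.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_squared_error

# Hypothetical co-location dataset: raw LCS signal, meteorology, reference PM2.5
df = pd.read_csv("colocation.csv", parse_dates=["timestamp"]).sort_values("timestamp")
features = ["pm25_raw", "temperature", "relative_humidity"]
target = "pm25_reference"

# Time-ordered split: first 70% for training, last 30% for evaluation
split = int(0.7 * len(df))
train, test = df.iloc[:split], df.iloc[split:]

model = RandomForestRegressor(n_estimators=300, min_samples_leaf=5, random_state=0)
model.fit(train[features], train[target])

pred = model.predict(test[features])
rmse = mean_squared_error(test[target], pred) ** 0.5
print(f"R2 = {r2_score(test[target], pred):.3f}, RMSE = {rmse:.2f} ug/m3")
```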
Beyond calibration, traditional ML has also been applied to anomaly detection and data cleaning. One approach is to train a regression or time-series model on a rolling basis to predict the expected sensor reading and then compare the prediction to the actual observation. If the actual value deviates beyond a certain threshold, it is flagged as an outlier. Kim et al. [
5] implemented this by training models on the past 10 min of data and defining an acceptable range centered on the ML-predicted value. Any new measurement falling outside this range is classified as an error and can be replaced or corrected. Lee et al. [
36] further demonstrated the utility of support vector regression (SVR) for anomaly detection in meteorological data, optimizing input variables with a multi-objective genetic algorithm. Their framework reduced RMSE by an average of 45% compared with baseline estimators while maintaining computational efficiency, illustrating that even relatively lightweight ML models can deliver substantial improvements when paired with optimization techniques. Similar approaches have been used in operational deployment; for example, Sousàn et al. [
37] reported that combining decision trees with adaptive thresholds improved detection of abnormal particulate readings in field networks. Classification-based methods have also been tested, where models such as decision trees or SVM classifiers are trained on labeled data to distinguish “normal” versus “faulty” observations. Although fault-labeled datasets are scarce, controlled co-location experiments and synthetic anomaly generation have been employed to bootstrap training sets [
29]. These examples show that even relatively lightweight models can act as sophisticated real-time validators, extending rule-based checks with data-driven expectations. Nonetheless, the scarcity of representative fault data and the reliance on synthetic anomalies raise concerns about how well these models will perform under unanticipated sensor failures or new environmental conditions.
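The rolling-prediction check described earlier in this subsection can be sketched as follows: a short-history model forecasts the current value, and readings outside a residual-based tolerance band are flagged. This is a minimal sketch assuming a linear short-history model; the window length and tolerance multiplier are illustrative, not the settings used by Kim et al. [5].

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def rolling_prediction_flags(values, window=10, k=3.0):
    """Flag points that deviate from a short-history linear forecast.

    For each step, fit a linear trend to the previous `window` samples,
    predict the current value, and flag it if the residual exceeds
    k times the standard deviation of recent clean residuals.
    """
    values = np.asarray(values, dtype=float)
    flags = np.zeros(len(values), dtype=bool)
    residuals = []
    for i in range(window, len(values)):
        t = np.arange(window).reshape(-1, 1)
        model = LinearRegression().fit(t, values[i - window:i])
        pred = model.predict(np.array([[window]]))[0]
        resid = values[i] - pred
        tol = k * (np.std(residuals) if len(residuals) >= window else np.inf)
        if abs(resid) > tol:
            flags[i] = True
        else:
            residuals.append(resid)  # update residual history with clean points only
    return flags

x = np.sin(np.linspace(0, 20, 300)) * 10 + 25
x[150] += 40                         # injected spike
print(np.where(rolling_prediction_flags(x))[0])
```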
4.2. Deep Learning Approaches
Deep learning (DL) techniques have increasingly been adopted for air quality data QC, as they can model complex nonlinear relationships and spatiotemporal patterns in large datasets. One area where deep learning shines is in handling time series and sequence data from sensor networks. Recurrent neural networks (RNNs), particularly long short-term memory (LSTM) networks, are well suited to capture temporal dependencies and trends in pollution data. They have been used to predict pollutant levels based on past readings, effectively learning the temporal dynamics [
13]. When used for QC, an LSTM model can play a similar role as the aforementioned regression predictor—forecasting the next value and identifying anomalies when the actual value deviates significantly. Unlike simpler models, an LSTM can leverage long-range dependencies and seasonality (diurnal cycles, weekly patterns, etc.) in the data. Convolutional neural networks (CNNs) have also been applied, sometimes by treating time-series segments or even multi-sensor data as “images” or matrices that the CNN can process. More commonly, CNNs appear in hybrid architectures (e.g., as part of a feature extractor before an LSTM network or in 1D form to capture local trends in a sequence). Recent experiments suggest that 1D CNNs coupled with environmental covariates can outperform standard regression under fluctuating weather conditions [
32,
33].
Deep models can combine spatial and temporal features. For example, a graph neural network or spatiotemporal CNN/LSTM can incorporate data from neighboring sensors and recent time steps to detect anomalies or fill missing values [
13]. These DL models effectively learn the expected multi-dimensional structure of the data. In one recent work, Allka et al. [
13] proposed a Pattern-Based Attention Recurrent Autoencoder for anomaly detection (PARAAD) in air quality sensor networks. Their model uses a bi-directional LSTM autoencoder with an attention mechanism, applied to blocks of time-series data rather than individual points. PARAAD achieved over 80% detection and localization of anomalous sensors, outperforming baseline models like standard autoencoders and even transformer-based approaches. Similar spatiotemporal DL pipelines have been explored in other urban deployment scenarios, where graph-based layers encode sensor neighborhood structures to stabilize anomaly detection and enable spatial interpolation [
18,
19]. However, while these architectures achieve state-of-the-art accuracy, they require large labeled datasets and significant computational resources, raising doubts about their practicality for low-power edge devices or for deployment in data-scarce regions.
Another important application of deep learning in QC is data imputation and denoising. Autoencoders (AEs) and variational autoencoders (VAEs) are unsupervised neural networks that learn a compressed representation (latent space) of the input data. They can be trained on historical sensor data so that the network learns the manifold of “normal” sensor behavior; if a new data point cannot be well reconstructed by the autoencoder, it is likely anomalous. Autoencoders and even Generative Adversarial Networks (GANs) have been used to impute missing values or repair faulty readings by essentially predicting what the sensor should have reported [
5]. For instance, Bachechi et al. [
30] integrated an autoencoder within their HypeAIR framework to perform on-device calibration and real-time anomaly filtering, showing the feasibility of DL at the network edge. Kim et al. [
10] also applied an AE to ensure data integrity by filling gaps in an urban air quality dataset. The VAE-based method by Osman et al. [
31] combined a VAE with a random forest (RF) classifier to decide if a given segment was anomalous. This hybrid deep-and-ensemble approach proved robust in identifying pollution anomalies across different scenarios without relying on extensive labeled data. Recent reviews highlight that VAE–RF and CNN–LSTM hybrids are among the most promising strategies for handling complex, multivariate air quality datasets [
15,
38]. Deep learning models have also demonstrated robustness to sensor noise; for instance, Zimmerman et al. [
6] found that a neural network model could inherently filter out some noise and improve calibration, while Villarreal-Marines et al. [
39] showed that hybrid DL calibration pipelines improved performance of field sensors in industrialized regions. Nevertheless, the complexity and opacity of deep models make them difficult to interpret and validate for regulatory acceptance, suggesting a need for explainable AI tools tailored to QC applications.
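A minimal reconstruction-error detector in the spirit of the autoencoder approaches above can be sketched with a small dense autoencoder (an MLP trained to reproduce its own input via scikit-learn). This is a simplified stand-in for the LSTM/attention and VAE architectures in the cited works, and the synthetic feature set is hypothetical.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical multivariate training data: [PM2.5, NO2, temperature, RH]
X_train = rng.normal(loc=[20.0, 30.0, 15.0, 60.0],
                     scale=[5.0, 8.0, 4.0, 10.0], size=(5000, 4))

scaler = StandardScaler().fit(X_train)
Xs = scaler.transform(X_train)

# Dense autoencoder: input -> 2-unit bottleneck -> input
autoencoder = MLPRegressor(hidden_layer_sizes=(8, 2, 8), max_iter=500, random_state=0)
autoencoder.fit(Xs, Xs)

# Threshold = high percentile of the training reconstruction error
train_err = np.mean((autoencoder.predict(Xs) - Xs) ** 2, axis=1)
threshold = np.percentile(train_err, 99)

def is_anomalous(sample):
    """Return True if the reconstruction error of a new sample exceeds the threshold."""
    s = scaler.transform(np.atleast_2d(sample))
    err = np.mean((autoencoder.predict(s) - s) ** 2)
    return err > threshold

print(is_anomalous([22.0, 28.0, 16.0, 58.0]))   # typical reading, likely False
print(is_anomalous([250.0, 30.0, 15.0, 95.0]))  # extreme PM2.5 at high RH, likely True
```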
4.3. Hybrid or Unsupervised Methods
Given the diverse nature of sensor errors, hybrid approaches that combine multiple techniques often yield the best results. One strategy is to use ensemble anomaly detection, where different algorithms detect outliers from different perspectives, and their results are combined (for example, via voting or aggregation). Rollo et al. [
14] introduced AIrSense, a framework that first applies three complementary anomaly detection algorithms to raw sensor signals before calibration. If at least two of the three algorithms agree a data point is anomalous, it is labeled an outlier (majority vote). After detecting anomalies, AIrSense then repairs them; if sufficient recent non-anomalous data exist, a local prediction model is trained on the past readings to estimate the true value, which is used to replace the anomaly. Finally, the cleaned data stream is passed through a calibration model to convert raw sensor units to pollutant concentrations, significantly improving calibration accuracy on real-world datasets [
14].
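The 2-of-3 voting and repair logic can be sketched as below, using three simple detectors (z-score, median absolute deviation, and rate of change) as stand-ins for the detectors in AIrSense [14]; the thresholds and the linear-trend repair model are illustrative assumptions, not the framework's actual components.

```python
import numpy as np

def detect_and_repair(x, window=30, z_k=3.0, mad_k=3.5, step_max=50.0):
    """Majority-vote anomaly detection with simple repair (illustrative).

    A point is an outlier if at least 2 of 3 detectors agree; flagged points
    are replaced by a linear extrapolation of recent non-anomalous values.
    """
    x = np.asarray(x, dtype=float).copy()
    clean = x.copy()
    flags = np.zeros(len(x), dtype=bool)
    for i in range(window, len(x)):
        hist = clean[i - window:i]
        z_vote = abs(x[i] - hist.mean()) > z_k * (hist.std() + 1e-9)
        mad = np.median(np.abs(hist - np.median(hist))) + 1e-9
        mad_vote = abs(x[i] - np.median(hist)) > mad_k * 1.4826 * mad
        step_vote = abs(x[i] - clean[i - 1]) > step_max
        if z_vote + mad_vote + step_vote >= 2:        # majority vote
            flags[i] = True
            t = np.arange(window)
            slope, intercept = np.polyfit(t, hist, 1)  # local trend on clean history
            clean[i] = slope * window + intercept      # repaired value
    return clean, flags

raw = 30 + np.cumsum(np.random.randn(500) * 0.3)
raw[200] += 120                  # simulated device fault
repaired, flags = detect_and_repair(raw)
print("flagged indices:", np.where(flags)[0])
```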
Another type of hybrid method involves combining clustering or other unsupervised learning with regression. Kim et al. [
10] used expectation–maximization (EM) clustering on smartphone barometer data to group data by time of day and trained separate regression models (SVR and MLP) on each cluster. This yielded better correction of atmospheric pressure data compared with a one-model-fits-all approach. More recently, Koziel and colleagues demonstrated that combining clustering with statistical preprocessing improved NO2 calibration robustness under variable meteorological conditions [
27]. Such approaches highlight the importance of context-aware calibration rather than relying on static global models. However, clustering-based models assume that regime boundaries are stable and identifiable, which may not hold under highly dynamic urban conditions.
In general, unsupervised QC methods like PCA or one-class SVMs are also used to detect anomalies without needing labeled examples of faults. Bayesian neural networks have also been investigated for calibration under uncertainty, providing probabilistic confidence intervals that can increase user trust in automated QC decisions [
34]. These methods are particularly attractive when labeled fault data are scarce or when networks operate in highly dynamic environments. Yet, their effectiveness often hinges on strong prior assumptions or careful tuning, which may reduce generality across deployment scenarios.
A recent example of combining deep unsupervised learning with classical ML is the VAE–RF hybrid by Osman et al. [
31]. There, a VAE was trained on historical multivariate data (NO2, PM2.5, O3, CO, and SO2 plus meteorological features) to capture the normal patterns of these correlated variables [
31]. The latent representations from the VAE were then input into a random forest classifier that distinguished anomalies from normal conditions. Hybrid approaches like this underscore a trend in recent research: rather than relying on any single algorithm, the best QC systems integrate multiple models and knowledge sources (statistical, physical, and machine-learning-based ones) [
14,
31]. This can also include hybrid physical–ML models, where known physical relationships (e.g., gas response vs. temperature or humidity correction curves) are embedded or used to inform the ML model [
7,
37]. For example, HypeAIR [
30] illustrates how AEs and physical consistency checks can be combined for real-time QC at the network edge. While these approaches offer strong performance and flexibility, their complexity can make deployment challenging, and interoperability across heterogeneous sensor platforms remains an open issue.
The overall goal of hybrid and unsupervised approaches is to maximize accuracy and reliability by using all available information within a cohesive ML-driven QC framework. These methods are particularly effective when deployed in large heterogeneous sensor networks, where unit variability, environmental effects, and data gaps coexist [
16,
17]. By unifying statistical models, physical constraints, and ML predictions, hybrid QC frameworks move closer to resilient, trustworthy smart air quality monitoring systems. At the same time, their dependence on computational resources and integration complexity highlights the importance of future work on lightweight, explainable, and standardized solutions.
To synthesize methodological differences,
Table 2 compares traditional ML, deep learning, and hybrid/unsupervised approaches in terms of representative models, strengths, limitations, and example studies.
Table 3 summarizes representative studies on ML-based quality control for low-cost air quality sensors published over the past decade. Calibration-focused works generally report substantial performance gains, with several achieving R² values above 0.9 [
11,
27,
33]. In particular, decision tree ensembles and neural networks have been shown to reduce biases in PM2.5 and NO2 measurements by an order of magnitude, bringing low-cost sensors close to reference-grade accuracy [
6,
29,
37]. Beyond calibration, recent studies highlight the role of anomaly detection frameworks, where advanced deep learning approaches such as autoencoders or variational methods achieve robust error detection and repair in dense sensor networks [
13,
14,
31].
5. Spatiotemporal Quality Control Techniques
Spatiotemporal QC techniques exploit spatial correlations and temporal dynamics to correct biases, enforce cross-sensor consistency, and fill gaps at the network level. These methods are primarily model-centric and can be executed offline or embedded within online systems; here we emphasize the algorithmic aspects (e.g., spatial regression/kriging, graph-based smoothing, spatiotemporal regularization, and cross-sensor reconciliation), while operational concerns such as latency, triggers, and fail-safes are deferred to
Section 6.
A distinctive advantage of networked sensors is the ability to leverage spatiotemporal redundancy for QC. In a dense network, neighboring sensors measuring the same pollutant should exhibit similar trends (after accounting for local source differences), and each sensor’s time series typically shows temporal continuity. An anomalous reading can, therefore, be detected by cross-checking against nearby sensors or by comparing it with the sensor’s recent temporal pattern [
3,
4]. Following prior work by Kim et al. [
5,
8], we describe three modes of ML-based spatiotemporal QC (MLQC): homogeneous temporal (HT), nonhomogeneous temporal (NT), and spatiotemporal (ST). For brevity, we use HT, NT, and ST hereafter.
In the HT mode, QC uses only the time series of a single sensor/variable. A short-history predictor (e.g., using the last several minutes) forecasts the current value; large deviations from the prediction are flagged as anomalies [
8]. This approach is computationally efficient and sensor-local, but it can miss issues that are only evident relative to neighbors, such as a single faulty node during a network-wide episode.
In the NT mode, the sensor’s recent history is augmented with other variables (meteorology, co-pollutants, or co-located channels). Incorporating multi-variable covariates often improves robustness to environmental confounding and cross-sensitivities [
5,
26,
27]. However, these models require synchronized multi-sensor or multi-channel data, which may be unavailable or noisy in community and citizen science deployment.
In the ST mode, simultaneous readings from spatial neighbors are combined with the temporal context. Incorporating data from a trusted anchor (e.g., an automatic weather station or a nearby reference monitor) can substantially improve detection and correction. Kim et al. [
5] reported that including anchor features reduced RMSE by approximately 17% relative to raw inputs. Beyond linear pooling, spatial machine learning (e.g., Gaussian processes and graph-based models) can explicitly model sensor-to-sensor correlations and stabilize network-level predictions at scale [
15,
18,
19]. The main trade-offs are computational cost and scalability for mega-city networks.
To provide a structured comparison,
Table 4 summarizes the three QC frameworks (HT, NT, and ST) in terms of input features, strengths, and limitations. This overview highlights the trade-offs between computational simplicity and robustness when leveraging temporal versus spatiotemporal information.
Correlation-based QC operationalizes these ideas in practice. Inter-sensor correlation is monitored over sliding windows, and alerts are raised when a node decorrelates from its peers beyond expected variability [
5,
14]. Such methods help distinguish network-wide episodes (all sensors spike) from device-specific faults (one sensor spikes) and have been used in both community networks and industrial regions [
28,
39]. Still, correlation metrics depend on sensor density and placement: sparse or irregular layouts weaken redundancy and reduce generalizability.
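A sliding-window decorrelation check of this kind might look like the sketch below, which compares each node against the median of its peers and raises an alert when the rolling correlation drops below a tolerance. The window length, threshold, and synthetic network are illustrative assumptions.

```python
import numpy as np
import pandas as pd

def decorrelation_alerts(readings, window=60, min_corr=0.5):
    """Flag sensors whose rolling correlation with the peer median drops too low.

    `readings` is a DataFrame indexed by time with one column per sensor.
    Returns a boolean DataFrame of the same shape (True = decorrelation alert).
    """
    peer_median = readings.median(axis=1)
    alerts = pd.DataFrame(False, index=readings.index, columns=readings.columns)
    for col in readings.columns:
        rolling_corr = readings[col].rolling(window).corr(peer_median)
        alerts[col] = rolling_corr < min_corr
    return alerts

# Synthetic network: 5 sensors tracking a common signal, one drifts into a fault
t = pd.date_range("2024-01-01", periods=720, freq="min")
base = 25 + 10 * np.sin(np.linspace(0, 12, 720))
data = pd.DataFrame({f"s{i}": base + np.random.randn(720) for i in range(5)}, index=t)
data.loc[t[400]:, "s3"] = 5.0 + np.random.randn(320) * 0.2  # stuck/faulty node
print(decorrelation_alerts(data).sum())
```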
Classical “buddy checks” from meteorology fit naturally into this framework [
8]: models learn expected relationships between nearby sites and flag violations. Probabilistic variants, such as Bayesian neural network calibration, attach uncertainty bounds to these relationships, yielding confidence-aware QC adjustments [
34]. Spatiotemporal redundancy also enables value recovery: faulty-node estimates can be reconstructed from neighbors, a principle extended by federated and cluster-based frameworks that calibrate within local groups to improve scalability [
30,
40]. Interoperability across heterogeneous hardware and communication protocols remains an open issue.
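Value recovery from neighbors can be sketched as a simple inverse-distance-weighted estimate; the coordinates, values, and weighting exponent below are illustrative, and real deployments would typically use learned spatial models (e.g., Gaussian processes or graph-based regressors) rather than this minimal scheme.

```python
import numpy as np

def idw_estimate(target_xy, neighbor_xy, neighbor_values, power=2.0):
    """Inverse-distance-weighted estimate of a faulty node's value from its neighbors."""
    neighbor_xy = np.asarray(neighbor_xy, dtype=float)
    neighbor_values = np.asarray(neighbor_values, dtype=float)
    d = np.linalg.norm(neighbor_xy - np.asarray(target_xy, dtype=float), axis=1)
    w = 1.0 / np.maximum(d, 1e-6) ** power   # closer neighbors get larger weights
    return float(np.sum(w * neighbor_values) / np.sum(w))

# Faulty node at (0, 0); neighbors at known positions report PM2.5 in ug/m3
neighbors = [(0.5, 0.2), (-0.3, 0.8), (1.0, -0.6)]
values = [18.4, 21.1, 17.0]
print(f"reconstructed value: {idw_estimate((0.0, 0.0), neighbors, values):.1f} ug/m3")
```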
Large urban networks often employ hierarchical designs with a few reference-grade “golden nodes” anchoring many low-cost sensors. In Breathe London, reference nodes provided citywide anchors and informed continuous adjustments to low-cost nodes [
1]. Similar “virtual calibration” services leverage nearby regulatory monitors to nudge baselines and mitigate drift [
3]. The effectiveness of such strategies depends on anchor availability, which is limited in many regions.
Overall, spatiotemporal QC harnesses redundancy in time (self-consistency) and space (cross-sensor consistency) to detect implausible readings and repair data streams more robustly than single-sensor approaches [
5,
8]. Remaining challenges (scalability, dependence on anchors, and uneven network density) motivate lightweight, interpretable, and transferable frameworks that can adapt across diverse deployment scenarios.
Section 6 discusses how these methods are operationalized in real-time systems with latency, drift, and reliability constraints.
6. Real-Time and Online QC Systems
Following the definitions in
Section 4, we use
online QC to denote streaming calibration, anomaly handling, and reconciliation performed on edge devices or low-latency cloud services, in contrast to offline post hoc processing. This section focuses on operational design: edge–cloud architectures, low-latency detectors and repair, automated re-calibration triggers tied to model diagnostics, and monitoring/machine learning operations (MLOps) practices that keep the spatiotemporal methods in
Section 5 reliable at scale.
As air quality sensor networks move toward live data delivery, QC must also operate in (near) real time. This imposes constraints on algorithms: they should be computationally efficient, adaptive to new data, and capable of continuous operation on embedded hardware [
41]. Online systems additionally face concept drift (
Section 3): models need to update as sensor characteristics and environmental patterns evolve [
23]. In what follows, we outline how streaming QC is implemented, including on-device processing, streaming model updates, and end-to-end architectures for live quality control.
A core requirement in real time is detecting and correcting anomalies on the fly. Many ML methods from earlier sections can be adapted to streaming [
5]. For example, sliding-window regressors can be re-trained on recent data to track shifting behavior, and lightweight models (e.g., small decision trees or compact neural nets) can update incrementally. When abrupt changes occur (e.g., a baseline jump), drift detectors monitoring input statistics or prediction errors can trigger alerts or re-training. D’Elia et al. [
23] studied drift mitigation for low-cost NO2 sensors: upon drift detection, strategies such as weighted incremental updates and ensembles of old/new models extended calibration validity by weeks. Research on field deployment of PM sensors likewise reports that periodic re-training (every 2–4 weeks) is often needed to preserve accuracy under changing meteorology [
29,
37]. In practice, brief field co-locations or continuous remote anchoring to references (see U.S. EPA guidelines and public platforms) are used to sustain accuracy in between maintenance cycles [
20,
25].
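A lightweight drift monitor of the kind described above can be sketched with a Page-Hinkley-style test on streaming calibration residuals (corrected sensor value minus a reference or anchor estimate), triggering re-calibration when the cumulative deviation grows too large. The parameters and synthetic residual stream are illustrative assumptions.

```python
import numpy as np

class DriftMonitor:
    """Page-Hinkley-style detector on streaming calibration residuals (illustrative)."""

    def __init__(self, delta=0.05, threshold=15.0):
        self.delta = delta          # tolerated drift per sample
        self.threshold = threshold  # cumulative deviation that triggers re-calibration
        self.mean = 0.0
        self.n = 0
        self.cum = 0.0
        self.cum_min = 0.0

    def update(self, residual):
        """Feed one residual; return True when re-calibration should be triggered."""
        self.n += 1
        self.mean += (residual - self.mean) / self.n   # running mean of residuals
        self.cum += residual - self.mean - self.delta
        self.cum_min = min(self.cum_min, self.cum)
        return (self.cum - self.cum_min) > self.threshold

monitor = DriftMonitor()
rng = np.random.default_rng(1)
for day in range(60):
    residual = rng.normal(0.0, 1.0) + 0.1 * day        # slowly growing bias
    if monitor.update(residual):
        print(f"re-calibration triggered on day {day}")
        break
```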
Edge computing has become central to online QC [
41]. Instead of shipping raw streams to the cloud, first-tier QC runs locally on the sensor or a nearby gateway; outlier filtering, basic calibration, and sanity checks can execute at millisecond-to-second latency, flagging or correcting data before transmission. Advances in microcontrollers and single-board computers make such deployment feasible, and recent frameworks demonstrate real-time autoencoder-based calibration and anomaly screening integrated into smart-city platforms [
30]. By performing cleaning at the source, edge processing also reduces bandwidth and provides resilience under intermittent connectivity.
Because edge devices are resource-constrained, many systems adopt a hybrid design. First-tier QC runs at the edge, while second-tier, computationally intensive analysis executes in the cloud, where a global view enables spatial consistency checks and cross-sensor reconciliation [
14]. Architectures that summarize locally and centralize only the necessary aggregates have been proposed for resource efficiency [
41], and hierarchical QC has been shown to support both indoor and outdoor monitoring at community scale with manageable overhead [
16]. Federated or cluster-based learning further reduces bandwidth and can improve privacy: models are trained locally and periodically synchronized, sometimes within sensor clusters to limit communication [
15,
40]. In a QC context, nodes refine local anomaly detectors and share parameter updates, approaching centralized accuracy while preserving data locality. To clarify these trade-offs,
Table 5 summarizes the characteristics of edge, cloud, and hybrid QC architectures, highlighting their main strengths and limitations. This structured view illustrates why many recent deployments favor hybrid systems that combine local responsiveness with global analytics.
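To make the federated/cluster-based update concrete, the sketch below fits linear calibration models locally on each node and averages their parameters into a shared model, a FedAvg-style aggregation step for linear models. The per-node data, equal weighting, and synchronization details are simplifying assumptions rather than a description of any cited system.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

def make_node_data(bias, n=500):
    """Hypothetical per-node co-location data: features [raw, temp, RH] -> reference."""
    X = rng.normal([20.0, 15.0, 60.0], [5.0, 4.0, 10.0], size=(n, 3))
    y = 0.8 * X[:, 0] + 0.05 * X[:, 2] + bias + rng.normal(0.0, 1.0, n)
    return X, y

nodes = [make_node_data(bias) for bias in (2.0, 3.0, 2.5, 1.5)]

# Local training step: each node fits its own calibration model on local data
local_models = [Ridge(alpha=1.0).fit(X, y) for X, y in nodes]

# Server-side aggregation: average parameters (equal node weights for simplicity)
avg_coef = np.mean([m.coef_ for m in local_models], axis=0)
avg_intercept = float(np.mean([m.intercept_ for m in local_models]))

def shared_calibration(x):
    """Apply the aggregated (federated) calibration to a raw feature vector."""
    return float(np.dot(x, avg_coef) + avg_intercept)

print("shared coefficients:", np.round(avg_coef, 3))
print("corrected example reading:", round(shared_calibration([22.0, 16.0, 55.0]), 2))
```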
Operationally, production systems rely on streaming pipelines to organize QC at scale. Message brokers and stream processors route sensor messages through chained operators for decoding, validation, anomaly tagging, and calibration; health metrics and drift diagnostics are logged for automated triggers and human oversight [
14,
30]. In industrialized regions, coupling such pipelines with spatiotemporal models has enabled continuous calibration and monitoring under rapidly changing conditions [
39].
Demonstrations from the literature illustrate these patterns. D’Elia et al. [
23] describe an autonomic calibration loop that adjusted electrochemical sensor baselines in real time using reference feeds, reducing months-long drift. The U.S. EPA’s Fire and Smoke Map integrates corrected low-cost sensor data for public use, supported by Toolbox guidance that documents calibration and QC steps for streaming integration [
20,
25]. The HypeAIR project showed that edge screening plus cloud reconciliation can be seamlessly embedded into city platforms, stabilizing live data streams for decision support [
30].
In summary, online QC systems combine algorithmic techniques with deployment engineering: sliding-window adaptation and drift detection to keep models current; edge computing for low-latency filtering and first-tier calibration; hybrid edge–cloud reconciliation for network-wide consistency; and MLOps practices for monitoring, automated re-calibration, and safe rollbacks [
15,
41]. The result is an end-to-end pipeline that converts raw sensor readings into quality-assured data products within seconds to minutes, enabling instant alerts that distinguish real events from sensor faults and maintaining accurate streams for public advisories [
15,
42]. Finally, sustained reliability benefits from drift-aware scheduling and MLOps-style automation for diagnostics and re-calibration, as detailed in
Section 8.
8. Challenges and Future Directions
Despite considerable progress in ML-based QC for air quality monitoring, there remain numerous challenges and open research directions. Ensuring data quality in ever-expanding sensor networks is a moving target, and both technical and practical hurdles must be overcome to realize the full potential of smart air quality monitoring [
15].
Data Quality and Reliability Gaps: A fundamental challenge is that many regions still have sparse monitoring coverage and inconsistent data quality standards [
3]. Low-cost sensors are proliferating, but not all implementations follow best practices for calibration or maintenance, leading to highly variable data quality. Comparative studies reveal that sensor drift is one of the most persistent issues: even with initial calibration, long-term deployment suffers from gradual accuracy loss [
23,
37]. Developing early drift detection methods and efficient re-calibration schemes is, therefore, an active area. Another gap is pollutant coverage: much of the research has focused on PM and a few gases like O3 or NO2, but newer sensors for SO2, VOCs, and ultrafine particles present unique interference patterns and require tailored ML-based QC [
7,
15]. Case studies such as that by Sayahi et al. [
22] on Plantower PMS sensors emphasize how environmental factors like humidity can undermine long-term stability, further motivating robust adaptive methods. Comparative evaluations indicate that QC performance is highly sensitive to predictor selection (e.g., meteorology and co-pollutants), the length of the calibration window, and the algorithm family; the lack of standardized protocols complicates cross-study comparisons [
7,
26]. Moreover, multi-year, publicly accessible benchmarks with agreed train/validation/test splits and rich metadata remain scarce, limiting reproducibility and the rigorous propagation of QC uncertainties into downstream analyses [
7,
40]. While these challenges are widely recognized, relatively few studies provide long-term validation data, meaning many proposed solutions remain promising in short trials but untested at scale.
Scalability and System Integration: As networks scale to hundreds or thousands of nodes, scalability of QC algorithms is vital. Techniques that work well for tens of sensors may face bottlenecks at city-wide scale. Graph-based and clustering approaches have been suggested to partition networks for tractable computation [
27,
40]. System integration challenges include reliable communication, power management, and model updates. Many sensor networks operate on limited power; running complex ML on-device could strain energy resources [
41]. As discussed in
Section 6, hybrid edge–cloud processing remains a key strategy for balancing local responsiveness with global analytics [
14,
30]. Operationalizing this at scale will likely require drift-aware scheduling and automated re-calibration triggers tied to model diagnostics, integrated within MLOps-style pipelines for sensor networks [
23,
40]. Modular frameworks that combine AI, edge computing, and multimodal data (traffic, meteorology, and satellite data) are emerging as a promising paradigm [
15,
42]. However, large-scale demonstrations are still rare, and questions remain about interoperability across heterogeneous sensor hardware and data platforms.
Adaptability and Transferability: ML models trained in one city or season often do not generalize well to other contexts due to differences in sources, climate, or sensor batches. Several field studies have reported that models calibrated under one set of conditions performed poorly when directly applied elsewhere, underscoring the limits of transferability [
44]. Developing transferable calibration models or applying transfer learning could reduce the need to restart training for each deployment. Domain adaptation and federated learning approaches (see
Section 6) show promise in this regard [
15,
42]. Semi-supervised learning is also key, since labeled “ground truth” data are scarce. Approaches like unsupervised anomaly detection, simulation-based anomaly synthesis, or physics-informed ML can reduce dependence on expensive labeled data [
5,
34]. Few-shot co-location protocols and transfer learning from canonical sites can further reduce the labeled-data burden while preserving site-specific biases [
44]. Explainability remains another challenge: policymakers and scientists may hesitate to trust black-box corrections. Methods like Bayesian neural networks or rule extraction from ensembles provide a way to attach interpretable confidence intervals to predictions [
32,
34]. Nevertheless, balancing accuracy and interpretability remains unresolved; highly interpretable models may underperform in complex environments, while state-of-the-art black-box models often struggle to gain policy acceptance.
Maintenance and Longevity: In practice, sensors require periodic cleaning, replacement, and re-calibration. QC algorithms could play a predictive role here—for example, identifying gradual baseline shifts as early indicators of sensor aging [
23]. Long-term field studies, such as Connolly et al.’s [
16] and Villarreal-Marines et al.’s [
39], demonstrate the importance of sustained monitoring partnerships to collect multi-year datasets. Such datasets are invaluable for developing next-generation QC models that explicitly account for seasonal cycles, material degradation, and long-term drift. However, these collaborative datasets are still geographically limited, raising concerns about the global representativeness of current QC strategies.
Interdisciplinary Integration: The future of smart monitoring likely involves integration with health data, traffic flows, and citizen engagement platforms. For instance, coupling quality-controlled exposure data with GPS and biometric data (heart rate, respiratory signals, etc.) could support personalized health interventions [
3,
17]. Ultra-reliable QC is required for critical applications such as issuing public health alerts or managing pollution-sensitive infrastructure [
15]. Hybrid deployment (indoor + outdoor) also raises new challenges for consistency across heterogeneous environments [
16]. Despite this promise, privacy concerns and governance of cross-domain data remain underexplored, limiting the near-term feasibility of such integrated systems.
Policy and Standardization: On the regulatory side, agencies such as the U.S. EPA and EU are exploring how calibrated sensor data could complement reference monitoring [
3]. Real-world policy integration, however, still requires standardized QC protocols and certification benchmarks. Clear thresholds for uncertainty must be defined before regulatory agencies can systematically adopt ML-corrected sensor data. Future frameworks are likely to embed uncertainty quantification. In practice, this means reporting not just corrected values but also confidence intervals [
15,
33]. For example, probabilistic calibration using Bayesian neural networks or ensemble-based predictive intervals can provide decision-ready uncertainty bounds [
7,
34]. For instance, the U.S. EPA has piloted the integration of corrected low-cost sensor data into public platforms such as the AirNow Fire and Smoke Map, supported by the Air Sensor Toolbox, which provides calibration and QC guidelines [
20,
25]. In Europe, the CEN technical committees are drafting performance evaluation standards for low-cost air quality sensors (e.g., CEN/TS 17660-1) to ensure comparability across devices and member states [
21]. Similarly, the WMO and UNEP have acknowledged the supplementary role of QC-enhanced LCS networks in expanding monitoring coverage in regions lacking reference stations [
46,
47]. These ongoing initiatives demonstrate that technical advances in ML-based QC are beginning to converge with institutional and regulatory efforts, laying the groundwork for globally recognized practices. A recent review emphasized that standardization will be critical to scaling community networks into regulatory decision-making processes [
38]. Yet, without consensus on benchmarks, there is a risk of fragmented standards across regions, which could undermine global comparability.
Addressing these gaps will hinge on shared benchmarks, standardized QC protocols, and field-scale validations that explicitly account for drift and uncertainty.
To provide a consolidated view,
Table 7 summarizes the major challenges in ML-based QC, representative current approaches, and the key limitations that remain unresolved. This overview highlights both the technical and institutional barriers that must be addressed for widespread adoption.
In summary, future work must focus on making ML-based QC more automated, scalable, and robust: handling data deluge with cloud–edge hybrids, maintaining accuracy despite drift via online learning, and ensuring that improved data quality leads directly to better decisions. Recent reviews highlighted data quality, scalability, and integration as the most pressing research directions [
15,
38]. By fostering interdisciplinary collaboration and establishing standards, the next decade should see QC techniques mature into a standard practice for proactive air quality management. However, the success of this vision will depend not only on technical advances but also on trust building, governance, and long-term sustainability of sensor networks.
9. Conclusions
Machine learning-based quality control (ML-based QC) has become a cornerstone of smart air quality monitoring over the past decade. A wide range of techniques—from regression and decision trees to deep neural networks—have been applied to calibrate low-cost sensors, detect anomalies, and leverage spatiotemporal correlations. These approaches have substantially improved the reliability of low-cost sensor data, in many cases approaching the accuracy of reference instruments. Real-time applications, supported by edge computing and online learning, are already enabling more responsive environmental management and more accurate assessments of personal exposure.
Despite these advances, several challenges remain. Generalization across contexts, interpretability of models, scalability to large networks, and the lack of standardized benchmarks continue to limit widespread adoption. Addressing these issues is essential to regulatory acceptance and the long-term sustainability of ML-based QC.
Looking ahead, QC frameworks must become more robust, scalable, and autonomous. Future networks should be able to self-calibrate continuously, detect sensor faults early, and integrate multimodal data sources such as traffic, meteorology, and health. Equally important is the development of explainable and standardized QC procedures so that scientists, policymakers, and regulators can trust ML-driven corrections. Establishing benchmarks and certification schemes will be critical to institutional uptake.
In summary, ML-based QC has shown strong potential but remains constrained by data quality, transferability, and governance gaps. By uniting technical innovation with institutional trust, it can evolve from a promising research field into a reliable foundation for environmental monitoring. Recent initiatives, such as the U.S. EPA AirNow Fire and Smoke Map and the European CEN/TS 17660-1 standard, already demonstrate how ML-corrected sensor data can inform real-world policy. This convergence of IoT sensing, AI-driven QC, and regulatory adoption signals a gradual but significant transformation in the way air quality is monitored and managed.