1. Introduction
Air pollution, as one of the most pressing environmental challenges worldwide, has long attracted extensive attention from both the scientific community and the public [
1,
2]. Numerous studies have demonstrated that air pollution not only disrupts ecosystems but also poses serious threats to human health, with fine particulate matter (PM
2.5) being closely associated with a range of respiratory and cardiovascular diseases [
3,
4,
5,
6]. To accurately assess air quality conditions and develop effective pollution control strategies, air pollution monitoring has traditionally relied on fixed monitoring stations deployed in urban and surrounding areas over the past decades [
7,
8]. These stations offer high measurement accuracy and reliable data; however, their high construction and maintenance costs significantly limit their large-scale and high-density deployment [
9]. In recent years, with rapid advances in sensor technology and the Internet of Things (IoT), low-cost sensors (LCS) have been increasingly adopted in air pollution monitoring. Compared with conventional fixed-site monitoring, LCS devices feature compact size, low cost, and flexible deployment, enabling large-scale and high spatiotemporal resolution observations across diverse urban settings [
10,
11,
12]. As a result, they serve as a valuable complement to traditional monitoring networks and provide new opportunities for characterizing the spatiotemporal distribution of air pollution and supporting more refined air quality management [
9,
12,
13].
Despite the promising prospects of LCS in air pollution monitoring, their widespread deployment and application are still hindered by a critical challenge: the relatively low quality of their measurement data [
14]. Compared with regular fixed-site monitoring, LCS devices typically utilize inexpensive sensing components, resulting in lower measurement accuracy, stability, and long-term reliability. Their performance is highly susceptible to various external influences, including environmental conditions, sensor aging, and hardware variability, all of which contribute to considerable measurement uncertainty and systematic bias [
15]. Consequently, uncalibrated or uncorrected sensor data are often unsuitable for direct use in scientific research or environmental policy-making, thereby limiting the practical value and application potential of LCS within air quality monitoring networks [
14,
16].
To fully harness the advantages of LCSs and address their inherent limitations in data quality, the development of robust calibration and correction approaches has become a key research focus. In recent years, significant progress has been made in developing calibration methods to improve the accuracy and reliability of LCS measurements [
12,
14,
16]. These approaches can generally be categorized into three major types. The first is physics-mechanism correction models, which compensate for systematic biases by explicitly accounting for the effects of known physical mechanisms, such as relative humidity, aerosol optical properties, and particle hygroscopic growth [
17]. These models remain a fundamental approach to LCS calibration, offering interpretability and robustness [
18]; however, their limited accuracy and adaptability under complex environmental conditions have led to their being increasingly complemented by data-driven methods to achieve higher accuracy and adaptability. The second category consists of parametric models, including linear and nonlinear regression approaches, which establish explicit functional relationships between sensor outputs and reference instrument measurements and often incorporate environmental covariates such as temperature and humidity [
19,
20,
21,
22]. An illustrative example of the application of parametric calibration approaches is provided by Zusmana et al. [
23], who developed metropolitan region-specific calibration models using a multi-linear regression (MLR) framework. The modest to good model performance (R
2 = 0.67–0.84) demonstrated the feasibility of deploying low-cost sensors for community-scale air quality monitoring after calibration. Parametric models are widely applied in LCS calibration for their simplicity, transparency, and ability to explicitly quantify relationships between sensor outputs and reference measurements; however, their reliance on predefined functional forms limits their capacity to capture more complex interactions [
22,
23]. The third category encompasses nonparametric frameworks based on advanced machine learning (ML) algorithms, such as random forest (RF), gradient boosting, and neural networks, which are capable of capturing complex nonlinear interactions among multiple predictors and adapting to site-specific conditions [
24,
25,
26]. Si et al. [
27] compared four calibration methods—simple linear regression (SLR), MLR, XGBoost, and deep neural networks (DNN)—against SHARP reference data, demonstrating that machine learning approaches significantly outperform linear regression in calibrating low-cost PM
2.5 sensors, with DNN achieving the best performance (R = 0.85, RMSE = 3.91). To further enhance the calibration accuracy of low-cost PM
2.5 sensors by integrating the strengths of multiple ML methods, Park et al. [
28] proposed an advanced hybrid model, HybridLSTM, which combines a DNN optimized for calibration tasks with a long short-term memory (LSTM) network specialized in capturing temporal dependencies. The proposed HybridLSTM model achieved an R
2 of 0.93, outperforming the DNN (R
2 = 0.90), MLR (R
2 = 0.80), and raw data (R
2 = 0.59), demonstrating its superior calibration performance compared with conventional approaches.
Although existing machine learning-based calibration approaches for low-cost PM
2.5 sensors have achieved remarkable progress, there remains considerable room for improvement. Most previous studies have incorporated meteorological variables into calibration models, but they typically consider only temperature and humidity [
24,
25,
26,
27,
28]. However, recent findings indicate that low-cost PM
2.5 sensors are influenced by a broader and more complex range of meteorological factors—such as temperature, humidity, wind speed, atmospheric pressure, etc.—which can substantially affect particulate matter measurements [
12,
14,
15]. This highlights a critical research gap: the need to develop calibration approaches that account for the comprehensive impacts of diverse meteorological variables to further enhance the accuracy and robustness of low-cost PM
2.5 sensor data.
In this study, we proposed an ML-based framework for PM2.5 LCS correction that integrates diverse meteorological factors. Multiple ML-based mathematical models were developed to regress the PM2.5 deviations (PM2.5 concentrations of the regular site−PM2.5 concentrations of the LCS) and the most important meteorological variables based on six kinds of ML algorithms: RF, support vector regression (SVR), LSTM, decision tree regression (DTR), Gated Recurrent Unit (GRU), and Bidirectional LSTM (BiLSTM), and the final model was selected from them with the optimal performance. This study offers the following novel contributions: (1) the modeling of PM2.5 measurement corrections integrates the comprehensive effects of multiple meteorological variables, surpassing the conventional approach of accounting solely for temperature and humidity adopted in most previous studies; (2) it represents the a few attempt to evaluate and correct measurements from the latest generation of commercially available LCS, specifically the American Temtop M10+ series, thereby extending current knowledge on the performance and calibration of emerging sensor technologies; and (3) through a comparative analysis of multiple ML algorithms, the study provides new reference results that, in combination with existing research, can guide the selection of appropriate models for correcting different types of LCSs. Our work is emerging as a promising direction to enhance the reliability and utility of PM2.5 LCS measurements. Beyond its methodological innovation, this research also promotes the development of new techniques for environmental research and sustainable urban environmental management.
2. Materials and Methods
2.1. The Co-Location Site
To calibrate the LCS, a co-location observation experiment was conducted. The experiment took place at a regular national air quality monitoring station located in Tongshan District, Xuzhou City, Jiangsu Province, China. The geospatial location of the co-location site is shown in
Figure 1. This fixed-site station, which continuously monitors ambient air pollutants including PM
2.5, provided reliable reference measurements for the calibration process. During the experiment, one LCS (model: Temtop M10+) was installed adjacent to the reference instrument at the monitoring station to ensure that both devices were exposed to the same atmospheric conditions [
29]. The co-location observation was carried out from 00:00 on 17 December 2021 to 00:00 on 22 December 2021. Throughout this period, both the LCS and the reference instrument were configured to output PM
2.5 mass concentrations at hourly intervals, generating time-synchronized datasets suitable for comparative analysis. The co-located measurements enable the evaluation of the performance of the LCS under real-world environmental conditions and provide the basis for subsequent correction of its readings.
2.2. Temtop M10+ LCS
In this study, the LCS used in the calibration experiment was the Temtop M10+ (upper right in
Figure 1) that reported PM
2.5 concentrations with a native 1-min time resolution. This device integrates a built-in Temtop M2000 2nd-generation sensor, which measures PM
2.5 mass concentrations based on the light-scattering principle. Some previous independent evaluations by South Coast AQMD’s AQ-SPEC program show that (i) in field deployments, the sensor’s PM
2.5 shows strong correlations with FEM reference instruments (R
2 ≈ 0.76–0.83 using 5-min means); (ii) under laboratory conditions (20 °C, 40% RH), PM
2.5 correlations with a FEM GRIMM are very strong (R
2 > 0.99), but the device tends to overestimate relative to the FEM [
30]. Given these characteristics, we treat the M10+/M2000 output as a stable, precise optical signal that benefits from co-location calibration to correct for systematic bias before use in our analyses. The monitoring campaign spanned six consecutive days, among which five full days provided valid paired measurements from both the LCS and the reference station. In total, 120 hourly data points were obtained for the subsequent model development.
2.3. Meteorological Variables
To account for the comprehensive effects of meteorological factors on PM
2.5 LCS measurements, this study incorporated multiple meteorological variables—including boundary layer height (BLH), 2 m temperature (T2m), 10 m u-component of wind speed (U10), surface pressure (SP), 10 m v-component of wind speed (V10), and relative humidity (RH)—in the development of the LCS correction models.
Figure 2 shows the hourly time series of these meteorological variables during the study period at the co-location site.
2.4. LCS Calibration Method
In this study, we applied an observation bias correction method to adjust the PM
2.5 measurements from the LCS. We assume that the PM
2.5 concentration measured by the reference fixed monitoring station (RFMS) at time
t is denoted as
; thus, the observational time series can be expressed as
. The PM
2.5 concentration measured by the LCS at time
t is denoted as
; thus, the observational time series can be expressed as
. Thus, a bias
exists between them as:
We simply assume that this part of the observational error is caused by ambient meteorological conditions (although in reality the situation is more complex, as LCS devices themselves have instrumental observation errors, which manufacturers have already addressed through other correction methods in this study). Under this theoretical assumption, the observation bias
can be further expressed as a mathematical model of various meteorological variables:
where
denotes the value of the
m-th meteorological variable at time
t, and
F(
X) is the mathematical model that regresses the PM
2.5 measurement bias and the most important meteorological variables. This mathematical model is constructed using four different types of ML algorithms, and the final model is selected from among them based on its optimal performance. The fundamental principles of these ML algorithms will be described in detail in the next subsection.
Once this model is established based on PM2.5 observations from the co-location site, it can be used to correct LCS observations at other locations under different monitoring environments. Specifically, we only need to obtain the relevant meteorological parameters at the current location of an LCS site and input them into the established model to calculate the corresponding correction bias. By adding this correction bias to the original LCS PM2.5 measurements, the corrected LCS PM2.5 concentration data can then be obtained.
2.5. Machine Learning Algorithms
In this study, a set of machine learning algorithms with varying levels of model complexity (from simple to complex) was selected to comprehensively evaluate the calibration performance for low-cost PM2.5 sensors. The chosen models represent a balanced combination of interpretability, computational efficiency, and predictive capability, making them suitable for both methodological comparison and practical deployment in real-world air quality monitoring applications. DTR was included as a simple and highly interpretable model that provides baseline performance with minimal computational requirements. RF and SVR were selected as medium-complexity traditional ML algorithms that are widely used and well established for nonlinear regression tasks. These models offer strong robustness and generalizability, making them representative of classical ML approaches applicable to sensor calibration. To incorporate more advanced modeling capabilities, LSTM networks were employed as a complex deep learning method capable of capturing temporal dependencies and nonlinear relationships inherent in atmospheric pollutant measurements. Furthermore, following suggestions from peer reviewers, two additional recurrent neural network architectures—GRU and BiLSTM—were included to explore whether deeper and more sophisticated neural structures could further enhance calibration accuracy.
2.5.1. Random Forest
Random forest (RF) is a non-parametric ensemble learning algorithm that performs regression by constructing multiple decision trees and aggregating their outputs [
31]. By introducing randomness in both data sampling and feature selection, RF effectively reduces the high variance and overfitting problems associated with single decision trees, thereby improving the model’s generalization ability and predictive performance [
32]. During model construction, the algorithm first employs bootstrap sampling to generate multiple subsets
from the original training dataset
, with replacement. Each subset is then used to train a regression tree. At each node split, instead of considering all features
F, a random subset
Fm (
m ≪ ∣
F∣) is selected to determine the best splitting feature and threshold that maximizes the split gain [
33,
34]. This “double-randomization” mechanism—randomness in both samples and features—reduces correlations among individual trees and enhances the robustness of the overall model.
Once all trees are trained, the prediction for a new input sample
x is obtained by averaging the outputs of all regression trees [
34]:
where
T is the total number of trees in the forest and
ht(
x) denotes the prediction of the
t-th regression tree for input
x. This ensemble averaging significantly reduces the instability of individual trees and improves predictive accuracy. RF also provides a natural way to evaluate feature importance. By calculating the decrease in mean squared error (MSE) caused by a feature when used for splitting, its contribution to the model can be quantified.
2.5.2. Support Vector Regression
Support Vector Regression (SVR) is an extension of the Support Vector Machine (SVM) for regression tasks. It aims to construct an optimal regression function in a high-dimensional feature space to capture the relationship between input variables and continuous target variables [
35]. The core idea of SVR is to fit the training data within a predefined tolerance margin while keeping the regression function as flat as possible, thereby enhancing the model’s generalization capability and robustness to noise [
36].
Given a training dataset
, where
is the input feature vector and
is the target variable, SVR seeks a regression function of the form [
37]:
Here, ϕ(x) is a nonlinear mapping function that projects the input data into a high-dimensional feature space, while w and b are the weight vector and bias term, respectively. SVR minimizes the model complexity, represented by ||w||2, while allowing prediction errors within an ϵ-insensitive tube. Due to its strong generalization ability, robustness to noise, and capability to model nonlinear relationships even with limited samples and high-dimensional data, SVR has been widely applied in the field of environmental modeling.
2.5.3. Long Short-Term Memory
Long Short-Term Memory (LSTM) is a special type of recurrent neural network (RNN) designed to model sequential data and capture long-range dependencies that standard RNNs often fail to learn due to vanishing or exploding gradient problems [
38]. By introducing a memory cell and a gating mechanism, LSTM effectively retains relevant historical information over extended time steps, making it particularly suitable for time series prediction tasks such as air quality modeling [
39].
An LSTM unit consists of three key gates—the input gate, the forget gate, and the output gate—which jointly regulate the flow of information through the network. At each time step
t, given the input vector
xt and the previous hidden state
ht−1, the gating operations are defined as follows [
40,
41]:
Forget gate: Controls how much of the previous cell state should be retained:
Input gate: Determines how much new information should be added to the cell state:
Cell state update: Combines retained memory and new information:
Output gate: Controls how much of the updated cell state contributes to the hidden state:
Here, σ() denotes the sigmoid activation function, tanh() is the hyperbolic tangent function, ⊙ represents element-wise multiplication, Ct is the cell state, and ht is the hidden state at time t. The gating mechanism allows LSTM to dynamically adjust the retention, removal, and integration of information, thereby overcoming the limitations of standard RNNs in capturing long-term temporal dependencies. Because of these properties, LSTM has been widely applied in sequential modeling tasks such as time series forecasting, natural language processing, and environmental data prediction.
2.5.4. Decision Tree Regression
DTR is a non-parametric supervised learning method that predicts a continuous target variable by recursively partitioning the feature space into a set of disjoint regions and fitting a simple prediction model (typically a constant value) within each region [
42]. Its core idea is to split the data into increasingly homogeneous subsets based on feature values, thereby minimizing prediction error at each node [
43].
Given a training dataset
, where
denotes the feature vector and
the target variable, the algorithm starts from the root node containing the entire dataset and searches for the feature
j and threshold
s that best split the data into two subsets [
43,
44]:
The optimal split is chosen to minimize the sum of squared errors (SSE) across the two child nodes [
44]:
where
and
are the mean values of the target variable in regions
R1 and
R2, respectively. This splitting process is applied recursively to each resulting node until a stopping criterion is met, such as reaching a maximum tree depth or a minimum number of samples per leaf node.
For a new input
x, the prediction
is determined by the average of the target values of the training samples in the terminal region
Rm to which
x belongs [
44]:
DTR is intuitive, interpretable, and capable of capturing nonlinear relationships without requiring feature scaling or prior assumptions about the data distribution.
2.5.5. Gated Recurrent Unit
The Gated Recurrent Unit (GRU) is an improved recurrent neural network architecture designed to more effectively model temporal dependencies in sequential data while reducing computational complexity relative to traditional LSTM networks [
45]. GRU introduces a streamlined gating mechanism composed of an update gate and a reset gate. The update gate controls the extent to which past information is retained, enabling the model to capture long-range temporal dependencies without suffering from vanishing gradients. The reset gate determines how much of the previous state should be ignored when incorporating new input information, allowing the network to adaptively focus on short-term patterns when necessary. Because GRU combines memory and gating operations into a simplified structure, it requires fewer parameters than LSTM, resulting in faster training and lower computational cost while maintaining competitive performance [
46]. This balance between efficiency and modeling capability makes GRU suitable for applications involving time-dependent environmental data, such as air pollution monitoring and sensor calibration.
2.5.6. Bidirectional Long Short-Term Memory
The Bidirectional Long Short-Term Memory (BiLSTM) network is an extension of the traditional LSTM architecture designed to capture temporal dependencies in both forward and backward directions [
47]. Unlike standard LSTM, which processes sequences only from past to future, a BiLSTM consists of two parallel LSTM layers: one that processes the input sequence in the forward direction and another that processes it in the reverse direction. By combining the outputs of these two layers, the model gains access to both preceding and succeeding contextual information at each time step [
48].
2.5.7. Algorithm Complexities
A computational complexity analysis of the above six ML algorithms based on their algorithmic properties was also provided. These models differ in training time complexity, memory cost, parameter size, and capacity to model nonlinear or temporal dependencies. Tree-based models (DTR and RF) typically have training complexities of O(n⋅d⋅log n) and O(T⋅n⋅log n), respectively, where n is the number of samples, d the feature dimension, and T the number of trees. Their memory costs are O(n) and O(T·n), respectively. SVR exhibits a higher training complexity of O(n3) due to kernel matrix inversion, making it more computationally demanding for large n. It has a prediction complexity of O(s·d) where s is the number of support vectors. Deep learning models (LSTM, GRU, BiLSTM) have complexities on the order of O(n⋅h2⋅L), where h is the number of hidden units, and L is the number of layers. The parameter size increases with the number of hidden units h and layers L. These objective complexity measures reflect the computational cost and scalability of different algorithms.
2.6. Hyperparameter Optimization
To ensure fair and reliable comparison among the machine learning and deep learning models, all algorithms were tuned through a unified hyperparameter optimization procedure. Key hyperparameters for each model were optimized using the Particle Swarm Optimization (PSO) algorithm, combined with the corresponding optimization functions in MATLAB (R2023a). Parameters not included in the optimization search space were kept at MATLAB’s default settings. This approach provides improved model stability and prevents biased performance caused by suboptimal or unbalanced parameter choices. For the RF model, the function optimize_fitrtreebag was used to optimize the number of trees and minimum leaf size, yielding optimal values of 180 trees and a leaf size of 1. For SVR, optimize_fitrsvm tuned the BoxConstraint, KernelScale, and Epsilon parameters, with final values of 2.22, 4.25, and 0.001, respectively. The DTR model was optimized using optimize_fitrtree, which identified a minimum leaf size of 4 as optimal. For deep learning models, customized optimization functions were applied. The LSTM model was tuned using optimize_fitrLSTM_att, resulting in 44 layers, 20 hidden units, and a dropout rate of 0.001. The GRU model was optimized with optimize_fitrGRU_att, producing 48 layers, 28 hidden units, and a dropout rate of 0.17. Finally, the BiLSTM model was optimized using optimize_fitrBiLSTM, which selected 86 layers and a dropout rate of 0.39 as the best-performing configuration.
2.7. Model Development and Validation
To ensure robust model training and fair performance evaluation, the complete dataset collected from the co-location experiment was randomly divided into three subsets: a training set (80%, 96 data points), a validation set (10%, 12 data points), and a testing set (10%, 12 data points). The training set was used to fit the ML models, the validation set was employed for model tuning and hyperparameter optimization, and the testing set was reserved for final performance evaluation. Given that the random data splitting process can introduce variability and uncertainty in model performance, the entire modeling and evaluation procedure was repeated ten times in a bootstrap fashion. In each repetition, the dataset was randomly resampled and re-split following the 8:1:1 ratio. The final performance metrics were reported as the mean across the ten bootstrap iterations.
Model performance was quantitatively assessed using four commonly adopted statistical indicators: mean absolute error (MAE), mean absolute percentage error (MAPE), root mean square error (RMSE), and the coefficient of determination (R2). MAE and RMSE measure the absolute and squared deviations between predictions and observations, respectively, indicating model accuracy in units consistent with the target variable. MAPE provides a dimensionless measure of relative error, facilitating comparison across different concentration ranges. R2 evaluates the proportion of variance in the reference observations that is explained by the model predictions, reflecting the overall goodness of fit. These metrics allowed a comprehensive evaluation of the predictive accuracy, bias, and reliability of each calibration model on the training, validation, and testing datasets.
The best-performing model for each algorithm was determined based on its average performance on the validation datasets across the ten bootstrap iterations. Specifically, the model configuration yielding the lowest mean RMSE and MAE, together with the highest mean R2, was selected as the optimal model. This model was subsequently applied to the independent testing dataset to assess its final predictive performance and generalization capability. All data preprocessing, machine learning model development, performance evaluation, and figure generation were conducted using MATLAB.
4. Discussion
In this study, we developed a comprehensive ML-based calibration framework for low-cost PM2.5 sensors that integrates diverse meteorological predictors to improve measurement accuracy. Using continuous co-location experiments conducted in Tongshan District of Xuzhou City, the hourly PM2.5 data obtained from a Temtop M10+ series LCS were compared against those from a co-location reference monitoring station. The biases between the two datasets were modeled as functions of key meteorological variables through multiple ML algorithms, including RF, SVR, LSTM, DTR, GRU, and BiLSTM. The optimal model was selected based on its superior predictive performance across training, validation, and testing datasets. For other models of LCS used in another city, we can also establish ML calibration models following this framework, and then add the corrected biases to the original measurements for data calibration.
Our study demonstrates that the latest LCS can provide a relatively reliable monitoring of ambient PM
2.5 concentrations under most conditions, even without calibration. Compared with previous studies, the results of this work indicate that the baseline performance of LCSs has substantially improved in recent years. For example, Park et al. [
28] reported that the raw PM
2.5 data from their LCS (Sensirion SPS 30) exhibited only a moderate correlation with reference-grade instruments (R
2 = 0.59). In contrast, our study found that even without calibration, the LCS measurements showed a strong linear relationship with the reference data (R
2 = 0.90). This improvement suggests that recent generations of LCSs have benefited from advancements in optical particle detection, onboard signal processing, and factory calibration, resulting in more reliable measurements across a broader range of concentrations. Although differences in sensor models, environmental conditions, and aerosol characteristics may also contribute to the performance gap between studies, the overall trend clearly reflects a technological evolution toward higher intrinsic accuracy of LCS devices. Nevertheless, our findings also demonstrate that applying ML calibration incorporating meteorological predictors can further minimize systematic bias and enhance model generalization, indicating that the combination of improved sensor hardware and data-driven calibration methods holds great promise for large-scale, high-resolution air quality monitoring.
Our study also demonstrates that incorporating a wider range of meteorological predictors significantly enhances the accuracy and robustness of LCS calibration compared with conventional approaches that typically consider only relative humidity (RH) and temperature. Building on prior work, we compiled the reported accuracies of different ML-based calibration schemes and contrasted them with our results. Park et al. [
28] evaluated raw LCS readings (R
2 = 0.59) and three calibration models—MLR (R
2 = 0.80), DNN (R
2 = 0.90), and a HybridLSTM (R
2 = 0.93)—and showed the HybridLSTM reduced RMSE by 41–60% versus raw data and by 30–51% versus MLR on the same test sets, using humidity and temperature as meteorological inputs alongside sensor PM
2.5. Kumar et al. [
49] assessed several algorithms seasonally with a PurpleAir LCS and reported that RF performed best when limited to a small set of meteorological covariates (typically RH, temperature, and wind speed), attaining R
2 improvements from ~0.66–0.73 pre-calibration to ~0.94–0.96 post-calibration and test-set RMSE on the order of ~0.7–1.2 µg/m
3 depending on season. In comparison, our framework—trained with a broader feature set that expands beyond RH, temperature, and wind speed to include six meteorological predictors—achieved the highest overall accuracy with RF (R
2 ≈ 0.99/0.94/0.94 for training/validation/testing, with consistently lower RMSE and MAE across bootstrap splits). By integrating six meteorological variables in our modeling framework, this study provides a more comprehensive description of the meteorological influences on LCS performance, allowing a more precise correction of PM
2.5 sensor readings under diverse atmospheric conditions. Additional meteorological variables help capture complex processes such as hygroscopic growth, dispersion dynamics, and boundary-layer stability, which jointly modulate the optical response of PM
2.5 sensors. Consequently, our results highlight the advantage of a multi-factor ML calibration framework that leverages richer meteorological information to achieve more reliable and transferable LCS corrections.
Our results suggest that tree-based/ensemble models (especially RF) outperform kernel-based and sequential models in this LCS calibration context. The consistently strong performance of the RF algorithm observed in this study aligns with previous research findings, confirming its suitability for calibrating low-cost PM
2.5 sensors. Wang et al. [
32] reported that their RF-based calibration of a Hike HK-B3 monitor achieved an R
2 of 0.98, substantially outperforming the traditional linear regression model (R
2 = 0.87) under various temperature and humidity conditions. Similarly, our RF model yielded an R
2 of 0.99 (training) and 0.94 (validation and testing), accompanied by the lowest RMSE, MAE, and MAPE values among all tested ML approaches, further verifying the robustness of RF in capturing complex nonlinear relationships between sensor deviations and meteorological covariates. The RF algorithm’s superior performance can be attributed to its ensemble learning nature, which integrates multiple decision trees to minimize overfitting and effectively handle nonlinear and multicollinear interactions among predictor variables. Unlike regression-based or kernel methods that assume specific data distributions, RF can flexibly model intricate dependencies between PM
2.5 measurement biases and meteorological factors such as humidity, temperature, wind, and pressure. This adaptability enables it to generalize well across varying pollution levels and meteorological conditions. Additionally, the variable importance metrics generated by RF provide valuable interpretability, revealing how different meteorological factors contribute to sensor measurement biases—a feature especially useful for optimizing sensor deployment and calibration strategies. Taken together, both the present study and previous literature demonstrate that RF offers a balanced combination of accuracy, robustness, and interpretability, making it one of the most reliable algorithms for PM
2.5 LCS calibration. Its capability to model complex meteorological influences supports its broader application in large-scale, high-resolution air quality monitoring networks.
Our results indicate that computational complexity and model capacity significantly influence calibration performance, especially under small-sample conditions. Although deep learning models such as LSTM, GRU, and BiLSTM have the capacity to capture nonlinear temporal dependencies, their performance is more sensitive to data availability, and overfitting may occur under limited sample conditions. Medium-complexity models (RF and SVR) provide a balance between computational efficiency and predictive power, with RF showing superior performance and stability. The simple DTR model also achieved strong accuracy with minimal computational requirements, demonstrating that low-complexity models may be sufficient when the data size is small. This analysis highlights the importance of selecting model complexity appropriate to the data scale and practical deployment conditions for LCS calibration.
Overall, this study demonstrates that low-cost PM2.5 sensors, after appropriate ML-based calibration, can achieve accuracy comparable to that of reference-grade instruments, supporting their use as a cost-effective and scalable monitoring solution. The proposed calibration framework effectively corrects measurement biases by integrating multiple meteorological predictors, thereby enhancing the reliability of PM2.5 LCS measurements under diverse meteorological conditions. This advancement highlights the potential of LCS networks to supplement regulatory monitoring systems, enabling dense spatial coverage and high-resolution exposure assessment at a fraction of the cost of conventional networks.
Although the present framework was only tested using co-location PM2.5 data from Xuzhou, its structure is inherently scalable and transferable. Because the method relies on generalizable statistical relationships between LCS deviations, reference measurements, and meteorological parameters, it can be readily adapted to other geographic regions and sensor models with minimal retraining effort. Moreover, in addition to PM2.5 calibrations, the ML–based correction framework developed in this study demonstrates strong potential for broader applicability to other air pollutants measured by low-cost sensors. Although this work focuses specifically on PM2.5, the structure of the proposed model and the multivariable integration strategy allow for adaptation to pollutants with similar physicochemical characteristics and measurement mechanisms. For example, PM10 shares comparable sources, atmospheric behaviors, and light-scattering measurement principles with PM2.5; therefore, the selected meteorological predictors are likely to remain relevant and effective when extending the model to PM10 calibration. For gaseous pollutants such as O3, VOCs, NO, NO2, CO, and SO2, the applicability of the framework also remains feasible but requires additional consideration. Unlike PM2.5 sensors, which typically rely on optical scattering techniques, most low-cost gas sensors use electrochemical or metal-oxide semiconductor technologies that exhibit different sensitivities to environmental conditions. Consequently, the optimal set of input variables may differ, as these sensors are often more strongly influenced by multiple environmental and meteorological factors, as well as cross-sensitivity to interfering gases. These factors may lead to greater variability in calibration performance when compared with particulate measurements. Nevertheless, by incorporating pollutant-specific environmental predictors and sensor behavior characteristics, the general modeling approach presented here can be adapted to support the calibration of a broad range of low-cost air quality sensors. Overall, the framework proposed in this study provides a transferable foundation for correcting multiple types of sensor measurements. Future work will include extending the model to additional pollutants and evaluating its performance across varying sensor technologies and environmental conditions to further validate its generalizability.
This study also holds broader implications for environmental sciences. Reliable and spatially resolved air pollution data are fundamental for understanding the dynamics of air pollution, aerosol–climate interactions, and human exposure to changing environmental conditions. By integrating various meteorological predictors into an ML–based calibration framework, this study provides a pathway to enhance the accuracy and representativeness of LCS PM2.5 observations. Such improvement in observation quality not only supports the assessment of air quality and environmental health but may also contribute valuable input data for studying regional and global environmental changes. Therefore, beyond methodological advancement, our work also strengthens the data foundation and analytical capacity necessary for developing environmental science and management.
A limitation of this study lies in the relatively limited size of the experimental dataset. It is important to emphasize that the dataset used in this study—although sufficiently representative for demonstrating the applicability of the proposed method—remains inherently limited in size. The constrained data volume may still introduce uncertainty into the evaluation process. Therefore, the generalizability of the model beyond the current experimental setting may not be fully verified at this stage. Future work will prioritize the integration of larger multi-site datasets or publicly accessible sensor–meteorology datasets to further assess cross-site robustness and improve model reliability. We acknowledge this limitation explicitly and encourage subsequent studies to validate and extend our framework under broader real-world conditions. To mitigate this issue, we adopted a rigorous validation strategy in which the entire modeling and assessment procedure was repeated ten times in a bootstrap manner, thereby reducing the variability caused by random data splitting. Nevertheless, the restricted dataset remains an inherent constraint that may limit the generalizability of the findings. Future work will benefit from incorporating additional publicly available datasets to further validate and strengthen the robustness of the proposed approach. It is also important to acknowledge that, in long-term air quality studies, the evolution of meteorological conditions across seasons can significantly influence PM2.5 concentrations. Variables such as precipitation, fog, solar radiation, and ultraviolet radiation are known to affect particle formation, dispersion, and removal processes. However, the dataset used in this study spans only six days, during which seasonal variability does not meaningfully manifest. As a result, the absence of seasonal indicators has limited impact on the calibration outcomes for this short-term co-location experiment. In addition, although a broader set of meteorological parameters could theoretically enhance calibration performance, the incorporation of too many input variables under a small-sample setting is likely to introduce overfitting and reduce model reliability. For this reason, we selected six meteorological variables that are most directly relevant to PM2.5 behavior and LCS performance. This selection strikes a balance between representing key atmospheric influences and maintaining the robustness of model training under limited data availability. Future research will extend this work using longer-term and larger datasets, which will allow the integration of additional meteorological factors and explicit consideration of seasonal cycles. Such efforts are expected to improve model generalizability and provide deeper insight into the dynamic interactions between pollutant levels, sensor behavior, and evolving meteorological conditions.
5. Conclusions
In this study, we developed an ML-based calibration framework for low-cost PM2.5 sensors that integrates multiple meteorological predictors to improve measurement accuracy and robustness. Continuous co-location measurements were conducted in Tongshan District, Xuzhou City, using a Temtop M10+ series LCS and an RFMS. The biases between the LCS and RFMS measurements were modeled using six ML algorithms—RF, SVR, LSTM, DTR, GRU, and BiLSTM. The comparative modeling results demonstrated substantial differences in predictive performance among these algorithms. RF achieved the best overall accuracy with R2 values of 0.99 (training), 0.94 (validation), and 0.94 (testing), and correspondingly low RMSE (0.42–1.07 µg/m3), MAE (0.30–0.79 µg/m3), and MAPE (0.27–0.72). DTR also performed well (R2 = 0.98–0.95, RMSE = 0.52–1.13 µg/m3, MAE = 0.32–0.65 µg/m3, MAPE = 0.26–0.76), while LSTM and SVR exhibited comparatively lower accuracy and larger errors (LSTM: R2 = 0.83–0.76, SVR: R2 = 0.71–0.58). In addition, the two more advanced deep learning models, GRU and BiLSTM, achieved competitive performance, with testing R2 values of 0.96 and 0.98, respectively, indicating that deep learning architectures can further enhance calibration accuracy when temporal dependencies are effectively captured. These results indicate that ensemble and tree-based models are particularly suitable for capturing the nonlinear and multivariate relationships between PM2.5 deviations and meteorological factors.
Furthermore, even without calibration, the Temtop M10+ measurements showed a strong linear relationship with the reference data (R2 = 0.90), though with a slight systematic underestimation. The proposed ML-based framework effectively reduced this bias and enhanced data reliability by integrating multiple meteorological factors.
Overall, this study demonstrates that the ML-based calibration approach—especially the RF model—offers a robust, interpretable, and scalable solution for improving the accuracy of low-cost PM2.5 sensors. The framework provides a feasible pathway for large-scale sensor network deployment, enabling spatially dense and cost-effective air quality monitoring to complement regulatory stations and support refined air pollution exposure assessment. This study could advance the development of innovative techniques that enhance air quality assessment and support environmental research.