Sensors
  • Article
  • Open Access

29 November 2025

Probabilistic Clustering for Data Aggregation in Air Pollution Monitoring System

The Artificial Intelligence Research Center, Novosibirsk State University, 630090 Novosibirsk, Russia
*
Author to whom correspondence should be addressed.
This article belongs to the Section Environmental Sensing

Abstract

Air pollution monitoring systems use distributed sensors that record dynamic environmental conditions, often producing large volumes of heterogeneous and stochastic data. Efficient aggregation of this data is essential for reducing communication overhead while maintaining the quality of information for decision making. In this paper, we propose an unsupervised learning approach for soft clustering of sensors in air pollution monitoring systems. Our method utilizes the Expectation–Maximization algorithm, which is an unsupervised machine learning method and probabilistic technique, to cluster sensors into distinct sets corresponding to normal and polluted zones. This clustering is driven by the need for a dynamic data transmission policy: sensors in polluted zones must intensify their operation for detailed monitoring, while sensors in clean zones can reduce reporting rates and transmit condensed data summaries to alleviate network load and conserve energy. The cluster membership probability enables a tunable trade-off between data redundancy and monitoring accuracy. The high efficiency of the proposed AI-based clustering is validated by the simulation results. Under common pollution scenarios and with adequate sample sizes, the EM algorithm exhibits a relative error below 5%. The presented approach provides a foundation for a wide range of intelligent and adaptive data aggregation protocols.

1. Introduction

Environmental degradation, particularly air pollution, ranks among the foremost threats to global health [1]. This crisis is primarily driven by the rapid expansion of urban industrial activity, a growing global transportation network, and large-scale burning of biomass, which together emit a complex mixture of hazardous particulate matter and gaseous pollutants. These insidious emissions significantly degrade air quality, leading to a marked increase in a wide range of diseases, from acute respiratory infections to chronic cardiovascular conditions and cancer. The most devastating impacts are concentrated in densely populated megacities, where intense emission sources and widespread human exposure converge with dangerous consequences. Consequently, the development and implementation of comprehensive, intelligent air quality monitoring (AQM) systems have become essential for public health protection. The key challenge lies not only in obtaining reliable data but also in ensuring its timeliness, broad geographic coverage, and continuous operation [2,3]. The data from these systems provide the essential foundation for effective environmental policies, proactive public health advisories, and rigorous regulatory measures. Ultimately, this systematic and technologically advanced approach is vital for safeguarding populations, guiding sustainable urban development, and securing a healthier quality of life for the millions of people in the world’s ever-expanding urban centers.
Traditional air pollution monitoring, which relies on fixed stations and temporary laboratories, is often hampered by high costs and an inability to provide dense spatial coverage. Modern systems overcome these limitations by deploying extensive networks of low-cost wireless sensors. A notable advancement is the integration of mobile sensors mounted on vehicles or drones, enabling dynamic, real-time data collection along transport routes and identifying pollution hotspots that fixed stations inevitably miss [4]. These pervasive sensor networks are a significant application of the Internet of Things (IoT), forming a core component of the smart city infrastructure [5]. Mobile air quality sensors transmit a continuous data stream to decision making centers, enabling the creation of highly detailed, real-time pollution maps. This capability facilitates intelligent urban management, including adaptive traffic control to reduce congestion-related emissions and the delivery of personalized air quality alerts to citizens. However, due to constraints from limited resources, these monitoring networks require careful optimization. A systematic review of the methods employed for this purpose is provided in [6]. An attractive optimization strategy is to cluster sensors into normal and polluted zones. This enables a dynamic transmission policy as follows. Sensors in polluted zones intensify operation for high-resolution data, while those in clean zones reduce reporting rates and send summaries to conserve energy and alleviate network load. A significant challenge, however, arises from the fuzzy and unstable boundaries of these clusters.
The distribution and concentration of air pollutants, including particles, volatile organic compounds, and microorganisms, depend largely on stochastic and highly dynamic meteorological factors [7,8,9]. Urban structure, specifically building density, green spaces, and the shape and size of buildings, contributes to air pollution patterns by either facilitating or inhibiting the dispersion of pollutants [10,11]. This is exemplified by the urban canyon effect, in which tall buildings along narrow streets trap emissions, creating localized pockets of dangerously high pollution concentrations [12,13]. The concentration of pollutants such as particulate matter (PM2.5), nitrogen dioxide (NO₂), and ozone (O₃) is characterized by pronounced spatiotemporal heterogeneity, and this variability is further compounded by vertical dynamics [14]. Therefore, from a geometric perspective, the spatiotemporal distribution of pollutant concentrations forms a heterogeneous and fluid patchwork. Transitions between “polluted” and “clean” zones can be ill-defined: sensors located just a few meters apart can detect significantly different pollutant levels. This dynamic and stochastic nature of pollution makes it unrealistic to divide mobile sensors into strictly polluted and unpolluted clusters, rendering hard clustering algorithms impractical.
This paper presents a probabilistic clustering framework for air quality sensor networks based on the Expectation–Maximization (EM) algorithm. The core practical contribution is a dynamic data transmission policy that utilizes soft cluster membership probabilities to intelligently allocate network resources. This approach achieves significant energy savings in clean zones by reducing transmission frequency and volume, while simultaneously enhancing monitoring resolution in polluted areas through intensified data collection. The intensification of operations in polluted zones enables high-frequency tracking of pollution dynamics and provides detailed data for source identification and analysis. Furthermore, an increased volume of data transmission ensures reliable delivery of critical environmental data [15]. Consequently, powering down sensors in clean zones becomes a strategically justified mechanism for energy conservation. Informed by the results from our prior work [16], this paper introduces an efficient and problem-specific realization of the EM clustering algorithm. To the best of our knowledge, this constitutes the first implementation of EM clustering in this context.
The rest of this paper is organized as follows. Section 2 reviews related work. Section 3 details the proposed methodology, providing a comprehensive description of the Expectation–Maximization algorithm, the specification of the probabilistic air pollution model, and the specific implementation of EM clustering. Section 4 presents a performance analysis, and Section 5 provides concluding remarks.

3. Methodology

3.1. Expectation–Maximization Algorithm

The power of EM clustering lies in its probabilistic interpretation. Each data point does not belong rigidly to a single cluster but is instead described by a distribution of memberships. This reflects uncertainty in the data and aligns well with real-world situations where boundaries between groups are diffuse rather than sharp. Moreover, EM provides a mathematically principled way to handle overlapping clusters, noisy measurements, and dynamic changes in the data distribution. These characteristics make it particularly well suited to modeling air quality monitoring data, where pollutant concentrations exhibit smooth gradients, temporal fluctuations, and heterogeneous spatial distributions.
The EM algorithm is a maximum likelihood estimation framework for models involving latent (unobserved) variables. In clustering problems, the latent variable is the cluster membership of each data point, which is not directly observed. Unlike hard clustering methods that assign each data point to exactly one cluster, EM adopts a probabilistic model in which every observation may belong to each cluster with some probability. This approach is especially powerful when data are generated from a mixture of probability distributions, and the goal is to estimate both the mixture parameters and the soft cluster assignments.
Let us introduce the formalism of the EM algorithm, following the classical formulations presented in [39,40]. Let the dataset be $X = \{x_1, x_2, \ldots, x_N\}$, and assume the data are drawn from a mixture of $K$ distributions, where the probability mass function (pmf) or probability density function (pdf) of the $k$-th distribution is denoted by $f(x \mid \theta_k)$. Thus, each distribution is parameterized by $\theta_k$ (an individual parameter or a set of parameters), and each cluster $k$ has a mixing proportion $\pi_k$, with
$$\sum_{k=1}^{K} \pi_k = 1. \tag{1}$$
The pdf/pmf of the mixture model is
$$p(x_i \mid \Theta) = \sum_{k=1}^{K} \pi_k f(x_i \mid \theta_k), \tag{2}$$
where $\Theta = \{\pi_1, \pi_2, \ldots, \pi_K, \theta_1, \theta_2, \ldots, \theta_K\}$.
The notion of cluster membership is formalized through the introduction of a set of latent variables $Z = \{z_1, z_2, \ldots, z_N\}$, one for each observation. Each $z_i$ is a $K$-dimensional binary random vector indicating which component generated the corresponding data point $x_i$. This vector uses a one-hot encoding, meaning
$$z_{i,k} = \begin{cases} 1, & \text{if } x_i \text{ belongs to class } k, \\ 0, & \text{otherwise.} \end{cases} \tag{3}$$
Assuming the data points are independent and identically distributed, the complete-data likelihood for the full dataset is then
$$L(X, Z \mid \Theta) = \prod_{i=1}^{N} \prod_{k=1}^{K} \left[ \pi_k f(x_i \mid \theta_k) \right]^{z_{i,k}}. \tag{4}$$
For computational simplicity, the model parameters are estimated by maximizing the expected value of the complete-data log-likelihood function:
$$\mathcal{L}(\Theta) = \ln L(X, Z \mid \Theta) = \sum_{i=1}^{N} \sum_{k=1}^{K} z_{i,k} \left( \ln \pi_k + \ln f(x_i \mid \theta_k) \right). \tag{5}$$
Since the latent variables $z_{i,k}$ are unobserved, the function $\mathcal{L}(\Theta)$ cannot be optimized directly. Instead, the EM algorithm maximizes its expected value, taken with respect to the posterior distribution of $Z$ given the observed data $X$ and the current parameter estimates $\hat{\Theta}$. Thus, the EM algorithm proceeds iteratively in two steps as follows.
  • E-step (Expectation):
    Compute the posterior probabilities, often referred to as responsibilities, that each data point belongs to each cluster, given the current parameter estimates $\hat{\Theta}$. This responsibility, denoted $\gamma_{i,k}$, is the conditional expectation of the latent variable $z_{i,k}$ given the observed data and the current parameters:
    $$\gamma_{i,k} = E\left[ z_{i,k} \mid x_i, \hat{\Theta} \right] = P\left( z_{i,k} = 1 \mid x_i, \hat{\Theta} \right). \tag{6}$$
    An application of Bayes’ theorem provides the closed-form expression for the posterior responsibility, quantifying the probability that component $k$ generated observation $x_i$:
    $$\gamma_{i,k}(\hat{\Theta}) = \frac{\hat{\pi}_k f(x_i \mid \hat{\theta}_k)}{\sum_{j=1}^{K} \hat{\pi}_j f(x_i \mid \hat{\theta}_j)}. \tag{7}$$
    These probabilities express the degree of membership of point $x_i$ in cluster $k$. Each point is thus softly assigned to all clusters, with weights summing to 1 across clusters. This expression can be further simplified by substituting into it a probability mass (or density) function of practical interest. Next, since $\mathcal{L}(\Theta)$ is a function of the unobserved latent variables $Z$, it is necessary to consider its expectation conditional on the observed data $X$ and the current parameter estimates $\hat{\Theta}$, which defines the Q-function:
    $$Q(\Theta, \hat{\Theta}) = E_{Z \mid X, \hat{\Theta}}\left[ \mathcal{L}(\Theta) \right] = \sum_{i=1}^{N} \sum_{k=1}^{K} \gamma_{i,k}(\hat{\Theta}) \left( \ln \pi_k + \ln f(x_i \mid \theta_k) \right). \tag{8}$$
  • M-step (Maximization):
    The M-step involves maximizing the Q-function, computed in the previous E-step, with respect to the model parameters $\Theta$ to obtain an updated estimate:
    $$\Theta^{new} = \arg\max_{\Theta} Q(\Theta, \hat{\Theta}). \tag{9}$$
    The Q-function can be separated into two independent parts: one relating to mixture weights and the other to the parameters of probability distributions. This separation allows us to address each optimization problem individually. Therefore, taking (1) into account and applying the method of Lagrange multipliers, we derive the update rules for the mixture weights:
    $$\pi_k^{new} = \frac{1}{N} \sum_{i=1}^{N} \gamma_{i,k}, \qquad k \in \{1, 2, \ldots, K\}. \tag{10}$$
    The update rules for the distribution parameters are derived by maximizing the corresponding term of the Q-function:
    $$\theta_k^{new} = \arg\max_{\theta_k} \sum_{i=1}^{N} \gamma_{i,k} \ln f(x_i \mid \theta_k), \qquad k \in \{1, 2, \ldots, K\}. \tag{11}$$
In other words, the parameters of each component distribution are re-estimated by weighted maximum likelihood, where the weights are the posterior probabilities $\gamma_{i,k}$. The E-step and M-step are alternated until convergence, typically measured by changes in the log-likelihood function or in the parameter set $\Theta$. Furthermore, an iteration limit can be imposed to keep the algorithm’s runtime within acceptable bounds. Each iteration is guaranteed not to decrease the observed-data likelihood, so the algorithm converges to a stationary point of the likelihood function, typically a local maximum.
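The two steps described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the implementation used in this paper: the names `e_step`, `m_step_weights`, and the `log_pdf` callback are introduced here for illustration, and the Gaussian components in the usage lines are an arbitrary choice of $f$. The component-parameter update is omitted because it depends on $f$.

```python
import numpy as np

def e_step(x, weights, thetas, log_pdf):
    """Responsibilities gamma[i, k] for a K-component mixture.

    log_pdf(x, theta) is a hypothetical callback returning ln f(x | theta)
    for every observation in x.
    """
    # ln(pi_k) + ln f(x_i | theta_k), arranged as an (N, K) array
    log_joint = np.stack(
        [np.log(w) + log_pdf(x, th) for w, th in zip(weights, thetas)], axis=1
    )
    # Subtract the row maximum before exponentiating to avoid underflow
    log_joint -= log_joint.max(axis=1, keepdims=True)
    gamma = np.exp(log_joint)
    return gamma / gamma.sum(axis=1, keepdims=True)

def m_step_weights(gamma):
    """Mixing proportions: the column means of the responsibility matrix."""
    return gamma.mean(axis=0)

def gauss_log_pdf(x, mu):
    """Log-density of a unit-variance Gaussian (illustrative choice of f)."""
    return -0.5 * (x - mu) ** 2 - 0.5 * np.log(2.0 * np.pi)

x = np.array([-2.1, -1.9, 0.0, 2.0, 2.2])
gamma = e_step(x, weights=[0.5, 0.5], thetas=[-2.0, 2.0], log_pdf=gauss_log_pdf)
new_weights = m_step_weights(gamma)
```

Each row of `gamma` sums to 1, and the observation at 0, which is equidistant from both component means, receives a responsibility of 0.5 for each cluster.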

3.2. Model Specification

In line with our previous work [16], we consider air pollution monitoring through a sensor network capable of mobility. A mobile air quality sensor traverses a geographic region containing both areas of normal background air quality and zones with elevated pollution levels. The sensor is equipped to detect a specific pollutant and generates a message upon each significant detection event. The message generation process is modeled as a Poisson process, where the transmission rate is a function of the sensor’s location. Specifically, the message generation rate $\lambda_1$ is low when the sensor is outside the polluted zone, but it switches to a higher rate $\lambda_2$ ($\lambda_1 < \lambda_2$) upon entering the polluted area. The Poisson distribution is a foundational model for data transmission and event count analysis in diverse systems, including communication networks. For example, in a typical scenario, the time to detect a critical event with a reusable mobile air quality sensor follows an exponential distribution [16]. The parameter $\mu$ of this distribution is determined by the specific characteristics of the sensor, including its mobility and performance. Consequently, the number of critical events detected within a fixed time interval $T$ is described by a Poisson distribution with rate parameter $\lambda = \mu T$. The probability of observing exactly $t$ events in this interval is given by the Poisson pmf:
$$P(\xi = t) = \frac{\lambda^{t}}{t!} e^{-\lambda}, \tag{12}$$
where $\xi$ is a random variable representing the number of events.
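The link between exponential detection times and Poisson counts can be checked numerically. The sketch below is an illustration only: the values $\mu = 0.5$ events per hour and $T = 9$ h, the seed, and the helper names `poisson_pmf` and `simulate_counts` are assumptions introduced here. It draws exponential waiting times, counts the events that fall in each interval of length $T$, and compares the empirical frequency of exactly three events with the Poisson pmf at $\lambda = \mu T$.

```python
import math
import numpy as np

def poisson_pmf(t, lam):
    """P(xi = t) for a Poisson random variable with rate lam > 0."""
    return math.exp(t * math.log(lam) - lam - math.lgamma(t + 1))

def simulate_counts(mu, T, n_intervals, rng):
    """Count events in [0, T] when waiting times are exponential with rate mu."""
    counts = np.empty(n_intervals, dtype=int)
    for j in range(n_intervals):
        elapsed, events = 0.0, 0
        while True:
            elapsed += rng.exponential(1.0 / mu)  # mean waiting time is 1 / mu
            if elapsed > T:
                break
            events += 1
        counts[j] = events
    return counts

rng = np.random.default_rng(0)
mu, T = 0.5, 9.0            # assumed values, for illustration only
lam = mu * T                # expected number of events per interval
counts = simulate_counts(mu, T, n_intervals=20_000, rng=rng)

empirical_p3 = np.mean(counts == 3)       # observed share of intervals with 3 events
theoretical_p3 = poisson_pmf(3, lam)      # Poisson pmf at t = 3
```

With this many intervals, the sample mean of `counts` should lie close to `lam`, and `empirical_p3` close to `theoretical_p3`.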
Analyzing the frequency of specific events, such as air pollution levels above a safety threshold, is a common approach in research. This count-based process is effectively modeled with a Poisson distribution, which is well-suited for rare events like hazardous pollution episodes. The logic behind this mirrors a standard method in environmental epidemiology. Emergency room visits for respiratory reasons are also counting processes, and their nature is very similar to the nature of hazardous pollution events [41,42]. The suitability of the Poisson model for air pollution exceedances is directly confirmed by works using real air quality data [43,44].
To complete the picture, we turned to real-world data. Data were collected by a sensor installed at road level in a highly polluted urban area in Italy [45], measuring the concentration of benzene (C6H6(GT)), an established air pollutant and human carcinogen. After removing missing values, the dataset contained 8991 records, with measured concentrations ranging from 0.1 to 63.7 µg/m³. We performed all calculations in this paper using Python (Version 3.11.13) and NumPy (Version 2.3.3). The descriptive statistics of the dataset are summarized in Table 1.
Table 1. Descriptive statistics for benzene concentration.
A threshold value was defined as 35% below the maximum recorded concentration, and the number of threshold exceedances was calculated for each consecutive 9 h interval. The Kolmogorov–Smirnov (K-S) test was applied to evaluate whether the exceedance counts followed a Poisson distribution. The resulting K-S statistic (the supremum distance between the empirical and theoretical cumulative distributions) was 0.0112, with a corresponding p-value of 0.998, indicating an excellent fit to the Poisson distribution.
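The procedure described above can be sketched as follows. This is a hedged illustration, not the code behind the reported values. It assumes one record per hour, so that a 9 h interval corresponds to nine consecutive records, and it fits the Poisson rate by the sample mean. The concentrations are synthetic stand-ins, not the dataset in [45], and the function names are introduced here. Only the K-S statistic is computed, not the p-value.

```python
import math
import numpy as np

def exceedance_counts(conc, window=9, frac_below_max=0.35):
    """Count threshold exceedances in each consecutive block of `window` records."""
    conc = np.asarray(conc, dtype=float)
    threshold = (1.0 - frac_below_max) * conc.max()
    n_windows = len(conc) // window              # an incomplete final block is dropped
    blocks = conc[: n_windows * window].reshape(n_windows, window)
    return (blocks > threshold).sum(axis=1)

def ks_statistic_poisson(counts):
    """Largest gap between the empirical CDF and a Poisson CDF with fitted rate."""
    counts = np.asarray(counts)
    lam = counts.mean()                          # maximum likelihood estimate of the rate
    support = np.arange(counts.max() + 1)
    pmf = np.empty(len(support))
    pmf[0] = math.exp(-lam)
    for t in support[1:]:
        pmf[t] = pmf[t - 1] * lam / t            # Poisson recurrence, avoids factorials
    ecdf = np.array([(counts <= t).mean() for t in support])
    return float(np.abs(ecdf - np.cumsum(pmf)).max())

rng = np.random.default_rng(1)
# Synthetic stand-in concentrations (assumption), same length as the cleaned dataset
synthetic_conc = rng.lognormal(mean=2.0, sigma=0.8, size=8991)
counts = exceedance_counts(synthetic_conc)
ks_stat = ks_statistic_poisson(counts)
```

Because both cumulative distributions are step functions that jump only at integers, evaluating the gap on the integer support up to the largest observed count is sufficient to obtain the supremum.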
This pattern is consistent across other pollutants. This trend is exemplified by non-methane hydrocarbons (NMHC), a key precursor to photochemical smog and thus a critical pollutant for this analysis. The accompanying descriptive statistics for this pollutant are provided in Table 2 for context. The K-S test, yielding a statistic of 0.0068 and a p-value of 1.0, demonstrates an excellent fit to the Poisson distribution. This statistical robustness of the NMHC concentration pattern holds regardless of specific geographical or meteorological conditions, as its primary source is localized vehicular traffic, whose emission profiles remain relatively constant.
Table 2. Descriptive statistics for non-methane hydrocarbons concentration.
The use of the Poisson pmf allows the derivation of closed-form expressions for both the Expectation and Maximization steps of the EM algorithm. This analytical tractability eliminates the need for numerical optimization or other computationally intensive procedures. As a result, the algorithm achieves high computational efficiency and rapid convergence.
The proposed EM algorithm is specifically designed to isolate the underlying emission processes that are fundamental drivers of air pollution. This core focus allows us to bypass secondary influences. Statistical validation, both within this work and as extensively documented in the literature, confirms that threshold-exceedance counts for a range of primary pollutants adhere to a Poisson distribution. This finding is critical, as it establishes the sole prerequisite for our model’s application: the EM algorithm is applicable to any dataset where the target variable conforms to a Poisson process, irrespective of sensor type or local environmental conditions. The model’s strength lies in its ability to decode the latent emission signature directly from concentration readings without requiring ancillary data. Therefore, this work deliberately establishes a foundational model that captures the core statistical nature of emissions. Subsequent enhancements, which may integrate geographic and meteorological variables, will build upon this robust, generalizable core to address more complex, site-specific dispersion forecasting.

3.3. Refinement of EM Clustering

In the context of air quality monitoring, we consider the problem of distinguishing between regular background activity and air pollution events based on the number of alerts, $x_i$, recorded by a sensor over a fixed time interval. The observed dataset $X$ represents the count of alerts from all sensors in the monitoring area. We model these data as a mixture of two Poisson distributions, where the first component ($k = 1$), parameterized by $\lambda_1$, models the low-rate Poisson process of normal background activity, and the second component ($k = 2$), parameterized by $\lambda_2$ (where $\lambda_2 > \lambda_1$), models the high-rate process of alert generation characteristic of a pollution event. This approach accounts for the fundamental uncertainty in attributing any individual observation with a high count to either a rare extreme value in the normal state or to a genuine pollution incident. The primary goal of applying the EM algorithm is to compute the responsibility for each sensor reading based on the estimated parameters (the mixing probabilities $\pi$, $1 - \pi$ and the rates $\lambda_1$, $\lambda_2$), thereby allowing each sensor’s reading to be probabilistically classified as either originating from a “polluted zone” or from the “normal state.” The properties of the Poisson distribution allow the update equations for the EM algorithm to be derived in a closed analytical form.
Therefore, we consider a mixture of two Poisson distributions parameterized by a mixing probability $\pi$, such that an observation belongs to class 1 with probability $\pi$ and to class 2 with probability $1 - \pi$. Following the structure of the general mixture model in (2), the specific probability mass function for a single data point $x_i$ under a two-component Poisson mixture model is defined as
$$p(x_i \mid \pi, \lambda_1, \lambda_2) = \pi \frac{\lambda_1^{x_i}}{x_i!} e^{-\lambda_1} + (1 - \pi) \frac{\lambda_2^{x_i}}{x_i!} e^{-\lambda_2},$$
where $\Theta = \{\pi, 1 - \pi, \lambda_1, \lambda_2\}$.
Substituting the Poisson probability mass function (12) into the responsibility formula and simplifying, we obtain the responsibility of cluster 1
$$\gamma_{i,1} = \frac{\hat{\pi} e^{-\hat{\lambda}_1} \hat{\lambda}_1^{x_i}}{\hat{\pi} e^{-\hat{\lambda}_1} \hat{\lambda}_1^{x_i} + (1 - \hat{\pi}) e^{-\hat{\lambda}_2} \hat{\lambda}_2^{x_i}}.$$
The responsibility of cluster 2 is consequently
$$\gamma_{i,2} = 1 - \gamma_{i,1}.$$
From a computational perspective, it is methodologically advantageous to structure the calculations in the following manner:
$$I_i = \frac{1}{\gamma_{i,1}} = 1 + \left( \frac{1}{\hat{\pi}} - 1 \right) \exp\left( x_i \ln \frac{\hat{\lambda}_2}{\hat{\lambda}_1} + \hat{\lambda}_1 - \hat{\lambda}_2 \right).$$
The inverse responsibility calculation is numerically more stable because it avoids underflow errors that arise when directly processing extremely small probability values. It structures the computation to work with larger, more manageable numbers instead of perilously tiny ones. This method is also more computationally efficient as it reduces the number of complex exponential calculations required per data point. The resulting speedup can be crucial for processing large datasets effectively in real time.
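The stability argument can be illustrated with a short comparison. The sketch below is an assumption-laden example rather than a benchmark: the function names and the test values ($\hat{\pi} = 0.5$, rates of 800 and 900, a count of 820) are chosen here only to force underflow and overflow in double precision. The direct form multiplies $e^{-\hat{\lambda}}$, which underflows to zero, by $\hat{\lambda}^{x_i}$, which overflows, and returns an undefined value. The inverse form needs a single exponential of a moderate argument.

```python
import numpy as np

def responsibility_direct(x, pi, lam1, lam2):
    """gamma_{i,1} evaluated literally from the two Poisson terms."""
    with np.errstate(all="ignore"):              # silence overflow and invalid warnings
        a = pi * np.exp(-lam1) * np.power(lam1, x)
        b = (1.0 - pi) * np.exp(-lam2) * np.power(lam2, x)
        return a / (a + b)

def responsibility_inverse(x, pi, lam1, lam2):
    """gamma_{i,1} = 1 / I_i, using one exponential per observation."""
    with np.errstate(over="ignore"):             # exp may saturate to inf, giving gamma = 0
        inv = 1.0 + (1.0 / pi - 1.0) * np.exp(x * np.log(lam2 / lam1) + lam1 - lam2)
    return 1.0 / inv

# Moderate values: both forms agree
x_small = np.array([3.0])
agree_direct = responsibility_direct(x_small, 0.5, 2.0, 10.0)
agree_inverse = responsibility_inverse(x_small, 0.5, 2.0, 10.0)

# Extreme values (assumed for illustration): the direct form breaks down
x_large = np.array([820.0])
broken = responsibility_direct(x_large, 0.5, 800.0, 900.0)
stable = responsibility_inverse(x_large, 0.5, 800.0, 900.0)
```

In the second case `broken` is not a number, while `stable` remains a valid probability.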
The performance of the EM algorithm is highly sensitive to its initial parameter values [46]. In our case, if the initial values for the rate parameters $\lambda_1$ and $\lambda_2$ are identical, the model enters a symmetric state from which it cannot escape. This leads the algorithm to converge immediately to a degenerate solution where the parameter estimates for both components remain equal. Consequently, the model fails to recover the underlying mixture structure. To ensure a robust start, the initial values of the Poisson distribution parameters are instead derived from the empirical data $X$, with the smaller parameter set to the dataset’s lower quartile and the larger one to the upper quartile.
Within the framework of the EM algorithm for a two-class mixture model, the computation defined by Formula (10) reduces to
$$\pi^{new} = \frac{1}{N} \sum_{i=1}^{N} \gamma_{i,1}.$$
The optimization problem (11) for finding parameter estimates of the distribution in this case reduces to the form:
$$\lambda_k^{new} = \arg\max_{\lambda_k} \sum_{i=1}^{N} \gamma_{i,k} \left( x_i \ln \lambda_k - \lambda_k - \ln x_i! \right), \qquad k = 1, 2.$$
The term $\ln x_i!$ can be safely ignored in the optimization problem because it is a constant additive term with respect to the model parameters $\lambda_k$, and, therefore, its removal does not change the location of the extremum of the objective function. Therefore, to find $\lambda_k^{new}$ we maximize the following function:
$$\tilde{Q}(\lambda_k) = \sum_{i=1}^{N} \gamma_{i,k} \left( x_i \ln \lambda_k - \lambda_k \right).$$
Set the derivative equal to zero to find the critical point:
$$\frac{\partial \tilde{Q}}{\partial \lambda_k} = \sum_{i=1}^{N} \gamma_{i,k} \left( \frac{x_i}{\lambda_k} - 1 \right) = 0.$$
Solving this equation yields the update rule for the parameter:
$$\lambda_k^{new} = \frac{\sum_{i=1}^{N} \gamma_{i,k} x_i}{\sum_{i=1}^{N} \gamma_{i,k}}, \qquad k = 1, 2.$$
Let us check the second derivative to confirm that this critical point is a maximum.
$$\frac{\partial^2 \tilde{Q}}{\partial \lambda_k^2} = -\frac{1}{\lambda_k^2} \sum_{i=1}^{N} \gamma_{i,k} x_i < 0 \qquad \forall\, \lambda_k.$$
The second derivative is negative for every $\lambda_k$ (unless all $\gamma_{i,k}$ or $x_i$ are zero, which is a degenerate case), so $\tilde{Q}(\lambda_k)$ is strictly concave. The critical point obtained above is therefore not merely a local maximum but the unique global maximum of $\tilde{Q}(\lambda_k)$ with respect to $\lambda_k$.
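The closed-form update can be verified numerically on a toy example. In the sketch below, the counts `x` and responsibilities `gamma_k` are made-up values chosen for illustration, and `q_tilde` is a name introduced here. The responsibility-weighted mean is compared with the maximizer found by a brute-force grid search over $\lambda_k$.

```python
import numpy as np

def q_tilde(lam, gamma_k, x):
    """Weighted Poisson log-likelihood with the constant ln(x_i!) term dropped."""
    return np.sum(gamma_k * (x * np.log(lam) - lam))

# Made-up counts and responsibilities, for illustration only
x = np.array([2.0, 3.0, 9.0, 11.0, 4.0])
gamma_k = np.array([0.9, 0.8, 0.1, 0.05, 0.7])

# Closed-form update: responsibility-weighted mean of the counts
lam_closed_form = np.sum(gamma_k * x) / np.sum(gamma_k)

# Brute-force check: evaluate the objective on a fine grid of candidate rates
grid = np.linspace(0.5, 15.0, 2001)
values = np.array([q_tilde(lam, gamma_k, x) for lam in grid])
lam_grid = grid[np.argmax(values)]
```

The two estimates should agree to within the grid spacing, and the objective at `lam_closed_form` should be at least as large as at any grid point.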
For completeness, we provide a compact pseudocode of the EM clustering algorithm used (Algorithm 1).
While the EM clustering algorithm does not have a definitive, universally optimal stopping rule, convergence is typically assessed by monitoring the relative increment in the observed data’s log-likelihood between iterations. The algorithm terminates once this change falls below a specified threshold, which indicates parameter stabilization near a local maximum. Additionally, a hard limit on the number of iterations can be set to prevent unnecessary computations should convergence be slow.
Algorithm 1. EM Clustering.
Input: dataset $X = \{x_1, \ldots, x_N\}$, stop_rule
Initialize:
    $\lambda_1$ ← first quartile of $X$
    $\lambda_2$ ← third quartile of $X$
    $\pi$ ← 0.5                          # mixing proportion for cluster 1
Repeat until stop_rule is satisfied:
    For each $i$:                        # E-step
        Calculate $I_i$
        $\gamma_{i,1}$ ← $1 / I_i$
        $\gamma_{i,2}$ ← $1 - \gamma_{i,1}$
    Update $\pi$                         # M-step
    Update $\lambda_1$, $\lambda_2$
    Prepare for convergence check
    Check stop_rule
Return $\pi$, $\lambda_1$, $\lambda_2$, $\gamma_{i,1}$, $\gamma_{i,2}$        # Output
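Algorithm 1 can be turned into a short NumPy routine. The following is a minimal sketch under stated assumptions, not the code used to produce the results in this paper. The function name, the default tolerance and iteration cap, the guard on the quartile initialization, and the synthetic test data are all introduced here. Convergence is assessed by the relative change in the observed-data log-likelihood.

```python
import math
import numpy as np

def poisson_mixture_em(x, tol=1e-7, max_iter=1000):
    """EM for a two-component Poisson mixture; component 1 is the low-rate one."""
    x = np.asarray(x, dtype=float)
    lam1, lam2 = np.percentile(x, [25, 75])       # quartile initialization
    if not 0.0 < lam1 < lam2:
        # Equal or zero starting rates would leave the algorithm in a degenerate state
        raise ValueError("quartile initialization needs 0 < Q1 < Q3")
    pi = 0.5
    log_fact = np.array([math.lgamma(v + 1.0) for v in x])   # ln(x_i!)
    prev_ll = -np.inf
    for _ in range(max_iter):
        # E-step: inverse responsibility I_i, then gamma_{i,1} = 1 / I_i
        with np.errstate(over="ignore"):
            inv = 1.0 + (1.0 / pi - 1.0) * np.exp(x * np.log(lam2 / lam1) + lam1 - lam2)
        g1 = 1.0 / inv
        g2 = 1.0 - g1
        # M-step: closed-form updates of the mixing proportion and the rates
        pi = g1.mean()
        lam1 = np.sum(g1 * x) / np.sum(g1)
        lam2 = np.sum(g2 * x) / np.sum(g2)
        # Observed-data log-likelihood under the updated parameters
        log_a = np.log(pi) + x * np.log(lam1) - lam1 - log_fact
        log_b = np.log1p(-pi) + x * np.log(lam2) - lam2 - log_fact
        ll = np.logaddexp(log_a, log_b).sum()
        if abs(ll - prev_ll) <= tol * abs(ll):    # relative change below the threshold
            break
        prev_ll = ll
    # g1 and g2 are the responsibilities from the last E-step
    return pi, lam1, lam2, g1, g2

# Synthetic example (assumed rates 2 and 10, equal subsample sizes)
rng = np.random.default_rng(42)
data = np.concatenate([rng.poisson(2.0, 500), rng.poisson(10.0, 500)])
rng.shuffle(data)
pi_hat, lam1_hat, lam2_hat, g1, g2 = poisson_mixture_em(data)
```

On well-separated synthetic data such as this, the estimates should land near the generating values. The routine does not handle the degenerate case in which one cluster receives zero total responsibility.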

4. Performance Analysis

This section presents the results of a simulation-based performance evaluation of the EM algorithm for the soft clustering of air sensors. The objective is to evaluate the algorithm’s ability to correctly classify sensors into “normal air” or “pollution” clusters, under the assumption that the rate of detection for events of interest differs for each case. Our experimental procedure involves generating separate samples for two air quality scenarios using pseudorandom number generators for the Poisson distribution with different parameters. To generate a sample corresponding to observations in the normal situation, we use parameter $\lambda_1$, and for a sample corresponding to air pollution, we use parameter $\lambda_2$, where $\lambda_1 < \lambda_2$. These samples are combined and randomly shuffled to create a dataset with an unknown underlying structure, simulating data from a real sensor network. The EM algorithm is applied to this combined sample to estimate the mixture parameters and calculate the cluster membership probabilities (responsibilities). To evaluate the algorithm’s performance, the results are analyzed separately for each of the original samples.
During simulation runs, a seed was initialized using pseudorandom integers uniformly distributed between 0 and 1000. Training was terminated when the convergence threshold of $10^{-7}$ was reached.
The EM clustering algorithm demonstrates a strong ability to accurately estimate the underlying mixture parameters $\lambda_1$, $\lambda_2$, and $\pi$. As illustrated in Figure 1, the relative error for each parameter remains low, generally not exceeding a few percent across the tested sample sizes. However, as the sample size grows, the error does not decrease monotonically but fluctuates.
Figure 1. Relative error of parameter estimates for the mixture model.
This non-monotonic behavior is expected because the EM algorithm converges to a local maximum of the likelihood function. The specific random sample drawn for a given size can slightly bias the initial conditions or the convergence path, leading to minor variations in the final estimates. Consequently, while larger samples provide more stable estimates on average, the stochastic nature of both the data generation and the EM optimization process results in natural fluctuations in accuracy.
In Figure 1, it is assumed that the sensors have an equal probability of being in either the clean or polluted air zones ($\pi = 0.5$). The relative error of the parameter estimates, for the situation in which 90% of sensors are within the air pollution zone ($\pi = 0.1$), is illustrated in Figure 2. The proposed approach yielded a highly accurate and stable estimate for the $\lambda_2$ parameter, corresponding to sensor data from the pollution zone. While the accuracy of the $\lambda_1$ estimate decreased compared to the equal sample size scenario, this only impacted a minor portion of the sample (10%). Figure 3 depicts the case where the majority of sensors (90%) are in unpolluted zones ($\pi = 0.9$). As expected, the parameter estimates show noticeable fluctuations in accuracy, though these consistently stay within a 5% margin.
Figure 2. Relative errors of parameter estimates for the mixture model when 90% of sensors are located in the air pollution zone.
Figure 3. Relative errors of parameter estimates for the mixture model when 10% of sensors are located in the air pollution zone.
In all cases considered, the EM algorithm correctly assigns sensors to their corresponding class with near certainty (responsibilities are very close to 1).
Let us consider an extreme scenario with a modest sample size ($N = 200$) and relatively close detection intensities ($\lambda_1 = 6$, $\lambda_2 = 10$). The dataset generated for this situation is presented in Figure 4 as a violin plot. This violin plot illustrates the smoothed density of the data distribution, where its width shows the frequency of values, and the inner red line indicates the median. The result of EM clustering for $\pi = 0.5$ is shown in Figure 5. A shift in $\pi$ improves the estimation accuracy for the predominant subsample.
Figure 4. Visualization of the dataset for a challenging scenario.
Figure 5. Cluster membership probabilities from the EM algorithm.
In this scenario, unlike hard clustering approaches (e.g., k-means), which would disable over 30% of sensors in the pollution zone, probabilistic clustering enables the involvement of all these sensors in intensive and detailed air pollution monitoring via parameterized data transmission policies. Even a simple activation policy, triggering intensive operation or a switch to energy-saving mode when a sensor’s probability exceeds a 0.5 threshold, ensures that a significant majority of sensors are correctly assigned an operational state commensurate with their true status. Dynamic resource management based on responsibility information ensures almost full activation of sensors in hazardous regions at the expense of temporary acceptance of additional short-term costs. Conversely, the same framework allows for a substantial reduction in monitoring intensity and data transmission volume within safe zones. This adaptive strategy ensures that limited resources are allocated to activities of the highest operational relevance. Although this may result in lower-resolution monitoring in non-critical areas, it substantially enhances overall cost-effectiveness. Moreover, it addresses an inherent flaw of hard clustering, where the misclassification of a subset of sensors can create critical blind spots that lead to the loss of essential data from contaminated regions and consequently to high economic and societal costs.
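A responsibility-driven transmission policy of this kind might look as follows. Both functions are hypothetical sketches introduced here for illustration. The mode labels, the linear interpolation rule, and the interval bounds of 10 s and 600 s are assumptions and are not values taken from this paper. The first function implements the simple 0.5-threshold activation rule mentioned above. The second shows one possible soft policy in which the reporting interval shrinks as the probability of being in the polluted zone grows.

```python
import numpy as np

def operating_mode(gamma_polluted, threshold=0.5):
    """Simple activation rule: intensive operation above the threshold."""
    gamma_polluted = np.asarray(gamma_polluted, dtype=float)
    return np.where(gamma_polluted > threshold, "intensive", "energy_saving")

def reporting_interval(gamma_polluted, min_s=10.0, max_s=600.0):
    """Hypothetical soft policy: interpolate the reporting interval (seconds).

    A sensor that is certainly in a clean zone reports every max_s seconds,
    and a sensor that is certainly in a polluted zone every min_s seconds.
    """
    gamma_polluted = np.asarray(gamma_polluted, dtype=float)
    return max_s - gamma_polluted * (max_s - min_s)

gamma2 = np.array([0.1, 0.5, 0.9])       # example polluted-zone responsibilities
modes = operating_mode(gamma2)
intervals = reporting_interval(gamma2)
```

The soft variant avoids an abrupt change of behavior for sensors whose responsibility is close to the threshold, which is where hard assignment is least reliable.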
Next, we compare the estimates derived from the proposed EM algorithm with those obtained through k-means clustering using Root Mean Square Error (RMSE). Although the parameters of the Poisson mixture ( λ 1 and λ 2 ) and k-means centroids are fundamentally different in nature, representing probabilistic parameters and geometric centers, respectively, this comparison is methodologically justified in the context of this paper. Since the data are generated from a known Poisson mixture, the k-means centroids can be interpreted as empirical estimates of the distribution means. Thus, evaluating both methods via RMSE provides meaningful insight into their relative effectiveness at recovering the true data-generating parameters. Figure 6 presents the results of performance comparison with a fixed λ 2 = 10 and a varying λ 1 .
Figure 6. RMSE of Parameter Estimates for EM Clustering and k-means.
The results demonstrate the superior performance of the EM algorithm over k-means. As expected, estimation accuracy for both methods degrades as $\lambda_1$ increases and the distributional dissimilarity diminishes. However, the EM algorithm’s performance degrades far more gradually: its RMSE increases at a significantly slower rate than that of the k-means estimator.
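The comparison can be reproduced in outline with the sketch below. It is not the authors' simulation code: the sample sizes, the seed, the initial values, the choice $\lambda_1 = 3$, and all function names are assumptions made for illustration, and `pi` is taken here to be the weight of the first (clean) component.

```python
import math
import random


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) variate (Knuth's method, fine for small lam)."""
    limit, k, prod = math.exp(-lam), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def log_pmf(x, lam):
    return x * math.log(lam) - lam - math.lgamma(x + 1)


def em_poisson(data, lam1, lam2, pi=0.5, n_iter=200, eps=1e-6):
    """EM for a two-component Poisson mixture; returns (lam1, lam2, pi)."""
    n = len(data)
    for _ in range(n_iter):
        # E-step: responsibility of component 2 for each observation.
        resp2 = []
        for x in data:
            a = math.log(pi) + log_pmf(x, lam1)
            b = math.log(1.0 - pi) + log_pmf(x, lam2)
            m = max(a, b)
            wa, wb = math.exp(a - m), math.exp(b - m)
            resp2.append(wb / (wa + wb))
        # M-step: responsibility-weighted means and mixing weight.
        n2 = min(max(sum(resp2), eps), n - eps)
        n1 = n - n2
        lam1 = max(sum((1.0 - r) * x for r, x in zip(resp2, data)) / n1, eps)
        lam2 = max(sum(r * x for r, x in zip(resp2, data)) / n2, eps)
        pi = n1 / n
    return lam1, lam2, pi


def kmeans_1d(data, c1, c2, n_iter=100):
    """Plain two-centroid k-means on scalar data."""
    for _ in range(n_iter):
        g1 = [x for x in data if abs(x - c1) <= abs(x - c2)]
        g2 = [x for x in data if abs(x - c1) > abs(x - c2)]
        if not g1 or not g2:
            break
        new1, new2 = sum(g1) / len(g1), sum(g2) / len(g2)
        if new1 == c1 and new2 == c2:
            break
        c1, c2 = new1, new2
    return c1, c2


def rmse(estimates, truth):
    return math.sqrt(sum((e - t) ** 2 for e, t in zip(estimates, truth)) / len(truth))


# Assumed setup: lambda_1 = 3, lambda_2 = 10, 1000 observations per component.
rng = random.Random(0)
truth = (3.0, 10.0)
data = [sample_poisson(truth[0], rng) for _ in range(1000)]
data += [sample_poisson(truth[1], rng) for _ in range(1000)]

mean = sum(data) / len(data)
lam1_hat, lam2_hat, pi_hat = em_poisson(data, 0.5 * mean, 1.5 * mean)
c1, c2 = kmeans_1d(data, float(min(data)), float(max(data)))

rmse_em = rmse((lam1_hat, lam2_hat), truth)
rmse_km = rmse((c1, c2), truth)
print(f"EM RMSE: {rmse_em:.3f}, k-means RMSE: {rmse_km:.3f}")
```

Repeating the run while moving the first intensity toward the second would trace out one curve per method, in the spirit of Figure 6.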
To further demonstrate the advantages of the EM algorithm, we introduce a scenario with a penalty function. To convert the soft assignments of the EM algorithm into hard clusters, an observation $i$ is assigned to class $k$ if its responsibility $\gamma_{i,k}$ exceeds a chosen threshold $h$. This threshold is selectable to suit different application needs. The penalty function is defined as follows:
$$f_p = N_{err}^{(1)} + c \cdot N_{err}^{(2)}$$
where $N_{err}^{(1)}$ is the number of misclassified sensors in the clean area, and $N_{err}^{(2)}$ is the number of misclassified sensors in the polluted area. The cost of a false alarm in the clean zone (e.g., unnecessary transmission of redundant data) is normalized to 1. The coefficient $c$ therefore represents the relative cost of a missed detection in the polluted area, which corresponds to the loss of valuable data.
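As a minimal sketch of how this penalty might be evaluated, the function below counts both error types after thresholding. It assumes the threshold is applied to the polluted-cluster responsibility; the function name, argument names, and example values are hypothetical.

```python
from typing import Sequence


def penalty(true_polluted: Sequence[bool],
            resp_polluted: Sequence[float],
            h: float = 0.5,
            c: float = 1.0) -> float:
    """Weighted misclassification count after thresholding responsibilities.

    Assumption: a sensor is labelled polluted when its polluted-cluster
    responsibility exceeds h. False alarms in the clean area cost 1 each,
    missed detections in the polluted area cost c each.
    """
    n_err_clean = 0      # clean sensors labelled polluted
    n_err_polluted = 0   # polluted sensors labelled clean
    for is_polluted, gamma in zip(true_polluted, resp_polluted):
        predicted_polluted = gamma > h
        if predicted_polluted and not is_polluted:
            n_err_clean += 1
        elif is_polluted and not predicted_polluted:
            n_err_polluted += 1
    return n_err_clean + c * n_err_polluted


# Hypothetical ground truth and responsibilities for five sensors.
truth = [False, False, True, True, True]
gamma = [0.1, 0.6, 0.4, 0.8, 0.2]

print(penalty(truth, gamma, h=0.5, c=2.0))  # 1 false alarm + 2 misses * 2 -> 5.0
print(penalty(truth, gamma, h=0.3, c=2.0))  # 1 false alarm + 1 miss * 2  -> 3.0
```

In this toy example, lowering the threshold recovers one polluted sensor without adding false alarms, so the penalty drops.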
Under an equal mixing coefficient ($\pi = 0.5$), which represents an ideal scenario for k-means, performance was evaluated as the ratio of the k-means penalty value to that of the EM algorithm. The optimal threshold $h$ is inversely related to the penalty weight $c$. We select the boundary value $c = 1$, at which both error types are equally penalized, and $h \in \{0.5, 0.3\}$. Figure 7 presents the results of this comparison.
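The inverse relation between the threshold and the penalty weight can be made explicit with a short expected-cost argument. This derivation is not taken from the paper: it assumes that $\gamma_i$ is the responsibility of the polluted component for sensor $i$ and that responsibilities can be read as calibrated posterior probabilities.

```latex
% Assumption: \gamma_i is the polluted-cluster responsibility of sensor i.
\begin{align*}
\mathbb{E}[\text{penalty} \mid \text{sensor } i \text{ labelled clean}]    &= c\,\gamma_i, \\
\mathbb{E}[\text{penalty} \mid \text{sensor } i \text{ labelled polluted}] &= 1 - \gamma_i, \\
\text{label polluted} \iff 1 - \gamma_i < c\,\gamma_i
  &\iff \gamma_i > h^{*} = \frac{1}{1 + c}.
\end{align*}
```

Under these assumptions, $c = 1$ gives $h^{*} = 0.5$, while $h^{*} = 0.3$ corresponds to $c = 7/3$, so a larger penalty for missed detections pushes the threshold down.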
Figure 7. Ratio of k-means to EM Penalty Function.
As expected, under typical high-pollution conditions, the EM algorithm significantly outperforms k-means across both threshold values, even with balanced cluster sizes. A more important finding is the inherent ambiguity in selecting an optimal threshold. This ambiguity reveals a key advantage of the EM-based approach: it provides flexibility to tailor the model to specific operational priorities.
The performance analysis reveals that the proposed EM clustering method is subject to certain limitations despite its overall effectiveness. Its performance can degrade when the underlying Poisson distributions exhibit significant overlap in their parameters, making the components difficult to distinguish. Furthermore, the method is sensitive to small sample sizes, which can lead to unstable parameter estimates. To reduce the inherent volatility of the EM algorithm when applied to small sample sizes, a reasonable strategy is to initialize the procedure from a set of random starting points, thereby protecting against spurious convergence to an unrepresentative local optimum and providing more reliable parameter estimates.
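The multi-start strategy mentioned above could look roughly like the following sketch, which runs EM from several random initial intensities and keeps the fit with the highest log-likelihood. The number of starts, the sampling range for the initial values, the small data set, and all names are assumptions for illustration only.

```python
import math
import random


def log_pmf(x, lam):
    return x * math.log(lam) - lam - math.lgamma(x + 1)


def log_likelihood(data, lam1, lam2, pi):
    """Log-likelihood of a two-component Poisson mixture (pi weights component 1)."""
    total = 0.0
    for x in data:
        a = math.log(pi) + log_pmf(x, lam1)
        b = math.log(1.0 - pi) + log_pmf(x, lam2)
        m = max(a, b)
        total += m + math.log(math.exp(a - m) + math.exp(b - m))
    return total


def fit_em(data, lam1, lam2, pi=0.5, n_iter=100, eps=1e-6):
    """One EM run; returns (log-likelihood, lam1, lam2, pi)."""
    n = len(data)
    for _ in range(n_iter):
        # E-step: responsibility of component 2.
        resp2 = []
        for x in data:
            a = math.log(pi) + log_pmf(x, lam1)
            b = math.log(1.0 - pi) + log_pmf(x, lam2)
            m = max(a, b)
            wa, wb = math.exp(a - m), math.exp(b - m)
            resp2.append(wb / (wa + wb))
        # M-step, with small guards against degenerate components.
        n2 = min(max(sum(resp2), eps), n - eps)
        n1 = n - n2
        lam1 = max(sum((1.0 - r) * x for r, x in zip(resp2, data)) / n1, eps)
        lam2 = max(sum(r * x for r, x in zip(resp2, data)) / n2, eps)
        pi = n1 / n
    return log_likelihood(data, lam1, lam2, pi), lam1, lam2, pi


def fit_multistart(data, n_starts=10, seed=0):
    """Run EM from several random starts and keep the best log-likelihood."""
    rng = random.Random(seed)
    low, high = max(min(data), 0.1), max(data) + 0.1
    best = None
    for _ in range(n_starts):
        # Sorted so that component 1 starts as the lower-intensity one.
        start1, start2 = sorted(rng.uniform(low, high) for _ in range(2))
        result = fit_em(data, start1, start2)
        if best is None or result[0] > best[0]:
            best = result
    return best


# Hypothetical small sample of alert counts from twelve sensors.
counts = [2, 3, 1, 4, 2, 3, 9, 11, 10, 12, 8, 2]
best_ll, lam1_hat, lam2_hat, pi_hat = fit_multistart(counts)
print(best_ll, lam1_hat, lam2_hat, pi_hat)
```

Selecting by log-likelihood is one common criterion; with very small samples it may still be worth inspecting how much the fitted parameters vary across starts.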

5. Conclusions

This paper addresses the critical challenge of data aggregation in large-scale air pollution monitoring networks, where the volume and stochastic nature of sensor data can lead to significant communication overhead. It investigates the application of unsupervised machine learning, a branch of artificial intelligence, to enhance the performance of air quality monitoring systems. We proposed and validated a modified Expectation–Maximization algorithm for the soft clustering of sensors. By modeling sensor alert signals as a mixture of Poisson distributions, our method probabilistically distinguishes between normal background activity and pollution events, assigning each sensor a cluster membership probability. The strength of this approach lies in its probabilistic foundation, which naturally handles the uncertainty and diffuse boundaries inherent in environmental data. Unlike hard clustering techniques, our model provides a graded view of the monitoring landscape, enabling the implementation of dynamic data transmission policies.
Simulation results confirm the high efficiency of this method, demonstrating that the cluster membership probability serves as a robust mechanism for controlling the fundamental trade-off between data redundancy and monitoring accuracy. As expected, EM clustering produces an almost perfect assignment of responsibilities when the underlying Poisson distributions are well separated. Notably, even in the more challenging scenario of closely spaced intensity parameters, the algorithm identifies the true cluster with a probability exceeding 0.5 in more than 75% of cases. Furthermore, in over half of these cases, the correct cluster is identified with a probability greater than 0.75. These results open promising avenues for future research into more sophisticated, self-organizing data aggregation protocols for environmental sensing and other distributed monitoring applications.

Author Contributions

Conceptualization, V.S. and O.S.; methodology, V.S.; software, V.S.; validation, V.S. and O.S.; formal analysis, V.S.; investigation, V.S.; resources, V.S. and O.S.; data curation, V.S.; writing—original draft preparation, V.S. and O.S.; writing—review and editing, V.S. and O.S.; visualization, V.S.; supervision, V.S.; project administration, V.S. and O.S.; funding acquisition, V.S. and O.S. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by a grant for research centers, provided by the Ministry of Economic Development of the Russian Federation in accordance with the subsidy agreement with the Novosibirsk State University dated 17 April 2025 No. 139-15-2025-006: IGK 000000C313925P3S0002.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors have no conflicts of interest to declare.

Abbreviations

The following abbreviations are used in this manuscript.
AQM: Air Quality Monitoring
WSNs: Wireless sensor networks
IoT: Internet of things
EM: Expectation–maximization
pmf: Probability mass function
pdf: Probability density function
ML: Machine learning
DL: Deep learning
RMSE: Root mean square error

References

  1. Campbell, L.D.; Pruss, U.A. Climate change, air pollution and noncommunicable diseases. Bull. World Health Organ. 2019, 97, 160–161.
  2. Sun, C.; Li, V.O.; Lam, J.C.; Leslie, I. Optimal Citizen-Centric Sensor Placement for Air Quality Monitoring: A Case Study of City of Cambridge, the United Kingdom. IEEE Access 2019, 7, 47390–47400.
  3. Sokolova, O.; Yurgenson, A.; Shakhov, V. Development of Air Quality Monitoring Systems: Balancing Infrastructure Investment and User Satisfaction Policies. Sensors 2025, 25, 875.
  4. Zarrar, H.; Dyo, V. Drive-by air pollution sensing systems: Challenges and future directions. IEEE Sens. J. 2023, 23, 23692.
  5. Pamula, A.S.P.; Ravilla, A.; Madiraju, S.V.H. Applications of the Internet of Things (IoT) in Real-Time Monitoring of Contaminants in the Air, Water, and Soil. Eng. Proc. 2022, 27, 26.
  6. Verghese, S.; Nema, A.K. Optimal design of air quality monitoring networks: A systematic review. Stoch. Environ. Res. Risk Assess 2022, 36, 2963–2978.
  7. Xiong, J.; Li, J.; Gao, F.; Zhang, Y. City Wind Impact on Air Pollution Control for Urban Planning with Different Time-Scale Considerations: A Case Study in Chengdu, China. Atmosphere 2023, 14, 1068.
  8. Nakyai, T.; Santasnachok, M.; Thetkathuek, A.; Phatrabuddha, N. Influence of Meteorological Factors on Air Pollution and Health Risks: A Comparative Analysis of Industrial and Urban Areas in Chonburi Province, Thailand. Environ. Adv. 2025, 19, 100608.
  9. Kumar, P.G.; Lekhana, P.; Musini, T.; Chandrakala, S. Effects of Vehicular Emissions on the Urban Environment—A State of the Art. Mater. Today Proc. 2020, 45, 738–745.
  10. Wang, F.; Dong, M.; Ren, J.; Luo, S.; Zhao, H.; Liu, J. The impact of urban spatial structure on air pollution: Empirical evidence from China. Environ. Dev. Sustain. 2022, 24, 5531–5550.
  11. Ruda Sarria, F.; Guerrero Delgado, M.; Monge Palma, R.; Palomo Amores, T.; Sánchez Ramos, J.; Álvarez Domínguez, S. Modelling Pollutant Dispersion in Urban Canyons to Enhance Air Quality and Urban Planning. Appl. Sci. 2025, 15, 1752.
  12. Miao, C.h.; Yu, S.h.; Hu, Y.; Bu, R.; Qi, L.; He, X.; Chen, W. How the morphology of urban street canyons affects suspended particulate matter concentration at the pedestrian level: An in-situ investigation. Sustain. Cities Soc. 2020, 55, 102042.
  13. Montalvo, M.; Horna, D. A Numerical Investigation of the Relationship Between Air Quality, Topography, and Building Height in Populated Hills. Buildings 2025, 15, 2145.
  14. Naizabayeva, L.; Kolesnikova, K.; Khrutba, V. Simulation-Based Assessment of Urban Pollution in Almaty: Influence of Meteorological and Environmental Parameters. Appl. Sci. 2025, 15, 6391.
  15. Shakhov, V.; Migov, D.; Chen, H.; Mishchenko, P.; Koo, I. Toward Reliability of Long Wireless Sensor Networks. IEEE Access 2024, 12, 124506–124516.
  16. Shakhov, V.; Materukhin, A.; Sokolova, O.; Koo, I. Optimizing Urban Air Pollution Detection Systems. Sensors 2022, 22, 4767.
  17. Aburukba, R.; El Fakih, K. Wireless Sensor Networks for Urban Development: A Study of Applications, Challenges, and Performance Metrics. Smart Cities 2025, 8, 89.
  18. Christakis, I.; Tsakiridis, O.; Kandris, D.; Stavrakas, I. Air Pollution Monitoring via Wireless Sensor Networks: The Investigation and Correction of the Aging Behavior of Electrochemical Gaseous Pollutant Sensors. Electronics 2023, 12, 1842.
  19. Wang, L. Design industrial 5.1 air quality monitoring system and develop smart city infra-structure. Meas. Sens. 2024, 35, 101292.
  20. Shahid, S.; Brown, D.J.; Wright, P.; Khasawneh, A.M.; Taylor, B.; Kaiwartya, O. Innovations in Air Quality Monitoring: Sensors, IoT and Future Research. Sensors 2025, 25, 2070.
  21. Chadalavada, S.; Faust, O.; Salvi, M.; Seoni, S.; Raj, N.; Raghavendra, U.; Gudigar, A.; Barua, P.D.; Molinari, F.; Acharya, R. Application of artificial intelligence in air pollution monitoring and forecasting: A systematic review. Environ. Model. Softw. 2025, 185, 106312.
  22. Colléaux, Y.; Willaume, C.; Mohandes, B.; Nebel, J.-C.; Rahman, F. Air Pollution Monitoring Using Cost-Effective Devices Enhanced by Machine Learning. Sensors 2025, 25, 1423.
  23. Wang, G.; Yu, C.; Guo, K.; Guo, H.; Wang, Y. Research of low-cost air quality monitoring models with different machine learning algorithms. Atmos. Meas. Tech. 2024, 17, 181–196.
  24. Liu, Y.; Yu, W.; Zhai, X.; Zhang, B.; McDonald-Maier, K.D.; Fasli, M. Multi-level CEP rules automatic extraction approach for air quality detection and energy conservation decision based on AI technologies. Appl. Energy 2024, 372, 123724.
  25. Bogdanffy, L.; Lorinț, C.R.; Nicola, A. Development of a Low-Cost Traffic and Air Quality Monitoring Internet of Things (IoT) System for Sustainable Urban and Environmental Management. Sustainability 2025, 17, 5003.
  26. Yin, P.-Y. Scheduling and Routing of Device Maintenance for an Outdoor Air Quality Monitoring IoT. Sustainability 2025, 17, 6522.
  27. Przystupa, K.; Bernatska, N.; Dzhumelia, E.; Drzymała, T.; Kochan, O. Ensuring Energy Efficiency of Air Quality Monitoring Systems Based on Internet of Things Technology. Energies 2025, 18, 3768.
  28. Lewandowski, M.; Płaczek, B. Data Transmission Reduction in Wireless Sensor Network for Spatial Event Detection. Sensors 2021, 21, 7256.
  29. Brito, T.; Azevedo, B.F.; Mendes, J.; Zorawski, M.; Fernandes, F.P.; Pereira, A.I.; Rufino, J.; Lima, J.; Costa, P. Data Acquisition Filtering Focused on Optimizing Transmission in a LoRaWAN Network Applied to the WSN Forest Monitoring System. Sensors 2023, 23, 1282.
  30. Chen, H.C.; Putra, K.T.; Tseng, S.S.; Chen, C.L.; Lin, J.C.W. A Spatiotemporal Data Compression Approach with Low Transmission Cost and High Data Fidelity for an Air Quality Monitoring System. Future Gener. Comput. Syst. 2020, 108, 488–500.
  31. Bogalecka, M. Probabilistic approach to modelling, identification and prediction of environmental pollution. Environ. Model. Assess. 2023, 28, 1–14.
  32. Christakis, N.; Drikakis, D. Unsupervised Learning of Particles Dispersion. Mathematics 2023, 11, 3637.
  33. Fu, L.; Li, J.; Chen, Y. An innovative decision making method for air quality monitoring based on big data-assisted artificial intelligence technique. J. Innov. Knowl. 2023, 8, 100294.
  34. Sabando-Bravo, K.E.; Navia, M.; Zambrano-Martinez, J.L. Optimizing CO2 Monitoring: Evaluating a Sensor Network Design. J. Sens. Actuator Netw. 2025, 14, 93.
  35. Pazhanivel, D.B.; Velu, A.N.; Palaniappan, B.S. Design and Enhancement of a Fog-Enabled Air Quality Monitoring and Prediction System: An Optimized Lightweight Deep Learning Model for a Smart Fog Environmental Gateway. Sensors 2024, 24, 5069.
  36. Bertocco, M.; Magalini, G.; Peruzzi, G.; Rigo, F.; Pozzebon, A. A Sensor Fusion Paradigm for Particulate Matter Monitoring Exploiting an Embedded Sound Level Meter for Virtual Sensing Techniques. In Proceedings of the IEEE International Instrumentation and Measurement Technology Conference (I2MTC), Chemnitz, Germany, 19–22 May 2025.
  37. Koziel, S.; Pietrenko-Dabrowska, A.; Wojcikowski, M.; Pankiewicz, B. Field calibration of low-cost particulate matter sensors using artificial neural networks and affine response correction. Measurement 2024, 230, 114529.
  38. Furtado, L.S.; Monteiro, N.; Gurjão, N.; Cavalcante, R.M.; Silva Filho, J.E.; da Silveira, J.A.N.; Santos, R.; Soares, J.B.; de Macedo, J.A.F. Low-Cost Smart Sensing Pipeline: Assembly, Calibration, and Interpretation of Air Quality Data. In Proceedings of the IEEE International Smart Cities Conference (ISC2), Pattaya, Thailand, 29 October–1 November 2024.
  39. Meng, X.L.; Van Dyk, D. The EM algorithm—An old folk-song sung to a fast new tune. J. R. Stat. Soc. Ser. B. Stat. Methodol. 1997, 59, 511–567.
  40. Dempster, A.P.; Laird, N.M.; Rubin, D.B. Maximum Likelihood from Incomplete Data Via the EM Algorithm. J. R. Stat. Soc. 1977, 39, 1–38.
  41. Cromar, K.; Gladson, L.; Jaimes Palomera, M.; Perlmutt, L. Development of a Health-Based Index to Identify the Association between Air Pollution and Health Effects in Mexico City. Atmosphere 2021, 12, 372.
  42. Zhao, Y.; Chen, Y.; Liu, Y.; Tang, S.; Han, Y.; Fu, J.; Chang, Z.; Zhao, X.; Zhuang, Y.; Lei, J.; et al. Short-Term Exposure to Air Pollution Associated with an Increased Risk of ST-Elevation and Non-ST-Elevation Myocardial Infarction Hospital Admissions: A Case-Crossover Study from Beijing (2013–2019), China. Atmosphere 2025, 16, 715.
  43. Khan, M.R.; Sarkar, B. Change Point Detection for Airborne Particulate Matter (PM2.5, PM10) by Using the Bayesian Approach. Mathematics 2019, 7, 474.
  44. Gyarmati-Szabó, J.; Bogachev, L.V.; Chen, H. Modelling threshold exceedances of air pollution concentrations via non-homogeneous Poisson process with multiple change-points. Atmos. Environ. 2011, 45, 5493–5503.
  45. UCI Air Quality Dataset. Available online: https://archive.ics.uci.edu/dataset/360/air+quality (accessed on 11 November 2025).
  46. Panić, B.; Klemenc, J.; Nagode, M. Improved Initialization of the EM Algorithm for Mixture Model Parameter Estimation. Mathematics 2020, 8, 373.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
