1. Introduction
The concept of the Internet of Things (IoT) revolves around linking physical objects, including embedded systems, to enable the collection and exchange of data. IoT serves as the foundation for seamlessly integrating sensors, actuators, and communication devices, facilitating real-time data collection and the remote control of actuators [1]. This interconnected network of physical objects has created a vast ecosystem in which these objects communicate with one another to enable a wide range of applications. This has unlocked opportunities across various sectors, including smart industries, smart transportation, smart agriculture, smart healthcare, and many more [2].
All these applications of IoT in various sectors have increasingly contributed to the generation of large amounts of data [3,4]. For example, in smart waste management, large volumes of data are generated from various distributed IoT devices, such as cameras, RFIDs, and odor sensors, and these data are transmitted to the cloud for further analysis [5,6]. Similarly, in smart traffic prediction systems, large volumes of data—vehicle speed and location, traffic data from surveillance cameras, and so on—are transmitted continuously from the generating sources to the cloud for analysis [7,8].
However, owing to the fragility of IoT devices and harsh environmental conditions, these raw data may be lost during transmission and storage [9]. Moreover, this loss can also result from failures of sensors/actuators, processing units, embedded software, or the service and application layers [10]. Furthermore, the growing use of renewable energy resources to power IoT devices can induce discontinuous data collection and missing data [11].
This data loss can have significant consequences, sometimes leading to serious failures. For example, in smart healthcare, if technology fails to work as intended, a patient could be injured, or sensitive personal health information may be exposed [12]. Missing data thus leaves insufficient data for performing meaningful processing and analysis in the corresponding applications. Furthermore, a lack of sufficient data can result in an analysis that lacks statistical significance, potentially leading to erroneous conclusions or flawed decision-making when the missing data include crucial and sensitive features [11]. Therefore, because the effectiveness of numerous statistical and machine learning algorithms depends on having complete data, it is vital to address missing data appropriately.
There are various types of missing-data mechanisms, and identifying the type of missing data is crucial to finding solutions to address them. Missing data have been categorized into three different types: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR).
MCAR signifies that the absence of data occurs in a completely random pattern, independent of any observed or unobserved values; i.e., the probability of missingness depends neither on the observed values of any variable in the dataset nor on its unobserved part [11,13]. MAR indicates that missingness depends solely on the observed values and is unrelated to the missing values themselves; i.e., the probability of missingness depends on the observed values but not on the unobserved ones [14]. MNAR suggests that data are missing in a non-random manner, with the missingness depending on both observed and unobserved values; i.e., the probability that a data point is missing depends on the value of that data point or on other unobserved variables [14].
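To make the three mechanisms concrete, the following minimal Python sketch (our own illustration, not drawn from the cited works) generates each missingness pattern on a toy series; the drop probability and thresholds are arbitrary assumptions:

```python
import random

random.seed(0)

def mask_mcar(x, p=0.3):
    """MCAR: each value is dropped with fixed probability, independent of any data."""
    return [None if random.random() < p else v for v in x]

def mask_mar(x, y, threshold=0.5):
    """MAR: x is dropped whenever the fully observed covariate y exceeds a threshold."""
    return [None if yi > threshold else xi for xi, yi in zip(x, y)]

def mask_mnar(x, threshold=0.5):
    """MNAR: x is dropped based on its own (unobserved) value."""
    return [None if xi > threshold else xi for xi in x]

x = [random.random() for _ in range(10)]
y = [random.random() for _ in range(10)]
print(mask_mcar(x))
print(mask_mar(x, y))
print(mask_mnar(x))
```

Note that only MCAR leaves the observed sample unbiased; under MNAR, the observed values are systematically unrepresentative, which is why identifying the mechanism matters before choosing an imputation strategy.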
The methods that have been developed to handle missing data fall into two groups: (1) discard-based and (2) imputation-based methods. In a discard-based approach, the records containing missing data are simply removed from the dataset [15]. Although this is simple and easy to implement, it is not applicable when the proportion of missing values is large [16]. Furthermore, when the application is sensitive to time and to values generated at specific times, discard methods will not solve the problem. In an imputation-based method, the missing value is predicted, and the predicted value is used for analysis [16]. In other words, synthetic data are generated in place of the missing values. Conventionally, imputation methods are divided into two types: (1) statistics-based imputation and (2) model-based imputation. In statistics-based imputation, missing data are usually imputed using mean, median, mode, or linear regression methods [17,18]. In model-based imputation, appropriate machine learning algorithms are used to impute missing data.
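As a brief illustration of statistics-based imputation (a generic sketch, not a specific method from the cited works), a column's missing entries can be replaced by the mean or median of its observed values:

```python
from statistics import mean, median

def impute_column(values, strategy="mean"):
    """Statistics-based imputation: replace None entries with the mean or
    median of the observed values in the same column."""
    observed = [v for v in values if v is not None]
    fill = mean(observed) if strategy == "mean" else median(observed)
    return [fill if v is None else v for v in values]

temps = [21.0, None, 23.0, 22.0, None]
print(impute_column(temps, "mean"))    # -> [21.0, 22.0, 23.0, 22.0, 22.0]
```

Such fills ignore all other variables, which is exactly the limitation that motivates the model-based alternatives discussed next.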
These imputation methods depend on two types of datasets to generate synthetic data: (1) historical datasets and (2) datasets from multi-sensor data fusion of the same target. However, using historical data for data fusion is unreliable, as it struggles to capture new or emerging trends and patterns that have developed since the data were collected. Multi-sensor data fusion of the same target shows promising results; however, multiple sensors observing the same target may not be present in all applications. Moreover, these sensors may share a single point of failure, in which case no sensor can generate data and failure recovery (imputation) becomes difficult.
In this article, we propose a failure-aware data fusion framework that leverages data from independent and geographically distributed IoT networks to support data imputation under sensor or gateway failures. While cross-domain and distributed sensor fusion have been investigated in prior studies, existing approaches primarily focus on correlation-aware decision-level fusion. In contrast, this work specifically targets feature-level fusion triggered by missing data events, enabling imputation even when no redundant sensors are available at the target location.
In addition to the proposed fusion framework, we integrate a lightweight and effective data imputation strategy suitable for resource-constrained IoT environments. The main contributions of this work are summarized as follows:
- (1) We propose an efficient data fusion method in which multi-sensor data from different targets are fused to facilitate imputation in the case of a single point of failure. Whenever the application encounters missing data and has no redundant sensor within its own application area, it looks for similar data from other networks (outside its application reach) and utilizes them if necessary.
- (2) We propose a lightweight imputation algorithm, KNN with an Iterative PCA-based imputation method, and compare its efficiency with other imputation approaches.
- (3) We experimented with weather station datasets from eight different US states and performed data fusion of weather values across these states; the results show that this approach performs on par with the conventional approach, with performance depending on correlation strength.
2. Related Work
In this section, we review data fusion methods for data imputation—historical-data-based data fusion and sensor/feature-based fusion from the same target location—and data imputation methods from the perspective of statistics-based imputation, model-based imputation, and K-Nearest Neighbors-based imputation.
2.1. Historical Data-Based Data Fusion with Statistics-Based Imputation
Utilizing historical data-based data fusion methods includes the use of past data for data imputation. Various statistics-based imputation methods—mean, median, mode, linear regression, and others—are then applied to the historical dataset [19]. The author of [20] applied mean and hot deck imputation methods to impute data from a real breast cancer dataset. The author of [17] compared mean-, median-, and mode-based imputations of a gas emission dataset, and the mean outperformed the other approaches. Mean-, median-, and mode-based imputation replaces missing values with the mean, median, and mode of the observed values, respectively. Regression-based imputation replaces the missing values of each variable with values predicted from a regression of that variable on other variables. Hot deck imputation replaces each missing value with a random draw from a “donor pool” consisting of the observed values of those variables [14,19].
Statistics-based imputation methods applied to historical datasets typically estimate missing values by considering only the affected feature, without leveraging relationships with other variables. Although such approaches are simple to implement and may appear suitable for handling isolated failures, they often introduce bias and may alter the underlying data distribution, particularly when the proportion of missing values is large [17]. Moreover, when applied to historical datasets, these methods generally perform worse than model-based imputation techniques that exploit multivariate dependencies. Consequently, relying solely on statistics-based imputation with historical data often leads to reduced accuracy and increased bias, highlighting the need for more robust imputation strategies that incorporate additional contextual information.
2.2. Multi-Sensor-Based Data Fusion from Same Target with Model-Based Imputation
In multi-sensor-based data fusion, different features of the same target location are collected and utilized to make decisions. Thus, various kinds of features of an environment are analyzed together to make a prediction. Since there is a correlation among these features, as they represent different parameters of the same environment, when any one of those features goes missing, then the remaining environmental features can be utilized to impute the missing features. For this kind of multi-sensor-based data fusion technique, model-based imputation is popular since it can analyze the various environmental parameters together to make decisions.
The author of [21] showed that using MICE boosted recognition accuracy from 87% to 98%. The authors of [22] proposed a framework to improve the popular multivariate imputation by chained equations (MICE) method to deal with large data gaps. They demonstrated the efficiency of their framework using data from continuous water quality monitoring stations in Vermont. The authors of [23] introduced model selection to improve multiple imputation for handling high-rate missingness in a water quality dataset. They proposed a robust method for selecting the best algorithm to combine with MICE to handle multiple relationships between a high number of features of interest and a high rate of missingness. Thus, their main contribution was improving MICE by taking advantage of ML models such as Random Forest (RF), K-Nearest Neighbors (KNN), Support Vector Regression (SVR), and Boosted Regression Trees (BRT), hybridizing them with MICE. They found that MICE-SVR gives a good trade-off between performance and computing time.
The author of [24] utilized the Random Forest (RF) method to impute missing weather data, facilitating critical agricultural decision-making; 8.5 years of air temperature, relative humidity, wind, and solar radiation data from Washington state were used to impute the missing values. The authors of [25] predicted corn variety yield by imputing missing data using a graph neural network. Various ecological zones in China were considered to gather corn trait features; features such as corn variety, corn strain, corn cob type, axis color, seeding lead, sheath color, and stay-green trait were chosen and labeled to create the complete dataset. The author of [26] presented an example of imputing missing water quality data using a Deep Neural Network (DNN). The water quality data consisted of features such as water temperature, pH, electric conductivity, dissolved oxygen, chlorophyll-a, and nitrate; the missing values in these feature sets were imputed using the DNN. The authors of [27] proposed a comprehensive method to forecast the AQI. Initially, they predicted hourly ambient concentrations of PM2.5 and PM10 using artificial neural networks. This was later extended to the prediction of the other criteria pollutants, i.e., O3, SO2, NO2, and CO, to predict the AQI. Moreover, these features contained missing gaps, which were handled using missForest, a machine learning-based imputation technique that employs a Random Forest (RF) model.
The above approaches rely on multi-sensor data collected from the same target environment, assuming the availability of redundant or complementary features to support imputation. Consequently, when missing data occur due to single-point failure, sensor malfunctions, or gateway outages, these methods become ineffective or inapplicable, as no redundant sensors are available at the target location. This limitation highlights a critical gap in existing multi-sensor fusion-based imputation techniques, underscoring the need for alternative strategies that can leverage information beyond the local sensing environment.
2.3. KNN-Based Imputation
KNN-based imputation is a method used to fill in missing values in a dataset based on the K-Nearest Neighbors algorithm. In this approach, for each missing value, the algorithm identifies the k nearest data points (neighbors) with known values for the feature(s) of interest. It then imputes the missing value by calculating a weighted average or a majority vote of the known values from these nearest neighbors. The weights are typically determined based on the distance or similarity between the incomplete data point and its neighbors. KNN-based imputation is commonly used in data preprocessing and is particularly effective for datasets with numerical or categorical features where missing values need to be addressed before further analysis or modeling. The authors of [28] presented combined KNN and BiLSTM methods to address noise and variance challenges in data imputation. Initially, the influence of each wind power dimension is assessed using Pearson correlation, and a CNN extracts information. A BiLSTM learns the time-series data representation. Missing data are imputed using the CNN-BiLSTM and KNN models, forming the final dataset. The combined model improves imputation, increasing R^2 by 0.02. The authors of [29] compared KNN, Sequential KNN (SKNN), and other statistics-based imputation methods to impute missing air quality values collected from peninsular Malaysia; KNN and SKNN showed better results than statistics-based imputation. Similarly, the author of [30] performed KNN-based imputation to impute missing weather data. The missing weather data include various weather attributes from several weather stations in Pakistan, and these attributes are utilized together to predict the missing values. The authors of [31] assessed categorical variable imputation methods using Ugandan maternal health records, where KNN imputation stood out for predicting missing values. The results reveal KNN’s superior precision at multiple levels of missingness, with RF also performing well at lower missing data proportions. This study highlighted the importance of selecting methods based on data characteristics for effective imputation strategies.
Existing KNN-based imputation methods primarily operate within a single dataset or target environment and rely on the availability of locally correlated features to estimate missing values. As a result, their effectiveness is significantly reduced in scenarios where missing data arise from sensor- or gateway-level failures that eliminate access to relevant local features. Furthermore, standard KNN-based approaches may struggle to capture higher-order covariance structures present in multivariate time-series data. These limitations motivate our proposed KNN with Iterative Principal Component Analysis (PCA) approach: a regular KNN-based imputation pass is performed first, after which Iterative PCA is applied. We implemented KNN with Iterative PCA to impute the missing data in our proposed data fusion method and compared the results with other algorithms implementing conventional data fusion methods.
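As a concrete illustration of the basic KNN imputation step described above (a simplified sketch, not the exact implementation of any cited work), a missing entry can be filled with the mean of that feature over the k complete rows nearest in Euclidean distance on the commonly observed features:

```python
import math

def knn_impute(rows, k=2):
    """Impute None entries: for each incomplete row, find the k nearest complete
    rows (Euclidean distance over features observed in both) and fill each
    missing entry with the mean of the neighbors' values for that feature."""
    def dist(a, b):
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    complete = [r for r in rows if None not in r]
    out = []
    for r in rows:
        if None not in r:
            out.append(list(r))
            continue
        neighbors = sorted(complete, key=lambda c: dist(r, c))[:k]
        out.append([sum(n[j] for n in neighbors) / len(neighbors) if v is None else v
                    for j, v in enumerate(r)])
    return out

data = [[1.0, 2.0], [1.1, 2.1], [5.0, 6.0], [1.05, None]]
print(knn_impute(data, k=2))
```

Because the neighbor search relies entirely on locally observed features, this sketch also makes plain why the approach fails outright when a gateway-level fault removes all local features at once.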
3. Proposed Data Fusion Approach
This article presents an imputation architecture that integrates cross-location data fusion, correlation estimation, and KNN with Iterative PCA-based imputation to address missing data in IoT systems. When missing data is detected, the framework identifies candidate features from other gateways and selects only those that are sufficiently correlated with the missing feature for fusion. The resulting fused dataset is then used for imputation via an initial KNN estimate followed by Iterative PCA refinement. The individual components are described in detail in the following subsections.
3.1. Proposed Data Fusion Method
Conventional sensor-based data fusion methods collect multiple features from sensors deployed within the same target environment, typically organized through an IoT architecture consisting of gateways, nodes, and sensors. When data gaps occur in one feature, the remaining locally collected features are analyzed to generate synthetic values for the missing data. However, such approaches assume the availability of redundant or complementary sensors within the same environment.
In contrast, the proposed framework relaxes this assumption by allowing features to be sourced not only from sensors within the same target environment but also from geographically distributed gateways sensing similar feature types in different environments. This design enables data fusion and imputation even in the absence of local sensor redundancy.
Let gns represent gateway n at location s, where n = 1, 2, …, N and s = 1, 2, …, S. Each gateway gns hosts p sensor features of type t, where p can vary among gateways, and the collection of these p features for gateway gns can be defined as the set Fn = {fnt1, fnt2, …, fntp}, where t = 1, 2, …, T. The collection of all feature sets across all gateways can be represented as Xn = {F1, F2, …, FN}. Thus, f122 represents feature number 2 of feature type 2 from gateway 1. In our approach, the system looks for a similar feature type t from the network, ensuring that it has a significant correlation with the missing feature data. Let us assume that feature f122 (an instance of fntp), i.e., feature number 2 of type 2 from dataset X1 and gateway 1 located in geographical location 1 (g11), is missing. In this case, the proposed framework looks for a similar feature of type 2 in the datasets Xn of gateways gns with n ≠ 1. These similar features (fntp, if and only if n ≠ 1 and t = 2) are then transmitted to g11.
Thus, when a feature is missing at a gateway, the proposed framework searches for semantically similar features of the same type from other gateways in the network using a correlation-based feature selection mechanism that is described in the following subsection. Only those features that exhibit a statistically significant correlation with the missing feature are selected and transmitted to the gateway for fusion and subsequent imputation.
Figure 1 illustrates this process. When Data 1 in the first standalone network experiences missing values, the system searches for a corresponding feature in another independent IoT network. If a comparable feature (e.g., Data 2) is identified in the second network, it is transferred to the first network to facilitate data fusion and synthetic data generation.
3.2. Correlation Estimation
Correlation estimation aims to approximate the degree of statistical association between two variables without transmitting the complete dataset. Instead, it relies on summary statistics or partial information to obtain a sufficiently accurate estimate of correlation while minimizing communication overhead. In this work, we employ a summary statistic-based correlation estimation strategy to support efficient feature selection (for data fusion) across distributed IoT gateways.
To elaborate on the summary statistics technique, let us consider a situation where the feature f143, i.e., feature number 3 of type 4 from gateway 1, is not available. In this case, we compute summary statistics, namely the mean and standard deviation, for feature f143. These summary statistics are then transmitted to the other gateways (gns). If other gateways, for example, g22, g33, and g55, possess similar feature types, then the summary statistics of the corresponding features are calculated within their respective gateways. Subsequently, correlation estimation is performed by comparing the summary statistics of feature f143 with those of g22, g33, and g55, respectively. If a significant correlation is identified, then the entire feature is sent to g11 for fusion.
Pearson correlation is used as a lightweight screening mechanism in this process. A correlation is considered significant if the associated p-value is below 0.05, which serves as the threshold for selecting candidate features for fusion. This threshold represents a practical balance between false positives and computational overhead and is widely adopted in statistical analysis. Pearson correlation was selected due to its low computational cost, interpretability, and suitability for deployment on resource-constrained IoT gateways.
Although temporal dependencies are not explicitly modeled during the correlation estimation stage, the underlying datasets are time-indexed, and temporal structure is preserved throughout the fusion and imputation pipeline. Temporal relationships are implicitly captured during the imputation phase through KNN neighborhood selection and Iterative PCA decomposition, which exploit covariance patterns across temporally ordered observations. This design enables effective handling of time-series data while maintaining low computational complexity suitable for decentralized IoT environments.
The Pearson correlation coefficient (commonly denoted as ρ or r) between two variables X and Y can be estimated using summary statistics, namely the means (μ) and standard deviations (σ) together with the cross-moment of the paired observations, as follows:

$$ r = \frac{\frac{1}{n}\sum_{i=1}^{n} x_i y_i - \mu_X \mu_Y}{\sigma_X \sigma_Y} \quad (1) $$

where x_i and y_i are the individual data points; μ_X and μ_Y are the means (averages) of X and Y, respectively; σ_X and σ_Y are the standard deviations of X and Y, respectively; and n is the number of data points.
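Equation (1) can be evaluated from a handful of running sums, so gateways only need to exchange a few scalars per feature rather than the raw series. The sketch below is our own illustration of this idea (the protocol details in the paper are more elaborate); note that the cross-term Σxy requires the paired, time-aligned observations to meet at one gateway:

```python
import math

def pearson_from_sums(n, sx, sy, sxx, syy, sxy):
    """Pearson r from the five running sums of paired observations:
    sx=Σx, sy=Σy, sxx=Σx², syy=Σy², sxy=Σxy."""
    mean_x, mean_y = sx / n, sy / n
    var_x = sxx / n - mean_x ** 2
    var_y = syy / n - mean_y ** 2
    cov = sxy / n - mean_x * mean_y
    return cov / math.sqrt(var_x * var_y)

x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 8.1]   # roughly 2*x, so r should be close to 1
sums = (len(x), sum(x), sum(y),
        sum(v * v for v in x), sum(v * v for v in y),
        sum(a * b for a, b in zip(x, y)))
print(round(pearson_from_sums(*sums), 4))
```

Exchanging only these five scalars per candidate feature keeps the communication overhead of the screening stage low, which is the design goal stated above.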
3.3. KNN with Iterative PCA for Synthetic Data Generation
3.3.1. Dataset Preparation
Following the data fusion process described in Section 3.1 and illustrated in Figure 2, features of the same type collected from multiple gateways are combined to form a fused dataset. An initial summary statistics-based correlation screening, described in Section 3.2, is then applied to determine whether a statistically meaningful correlation exists between the missing feature and candidate features obtained from external gateways.
Subsequently, a second stage of Pearson correlation estimation is performed on the fused dataset to identify the most relevant features for synthetic data generation. Only the top four features exhibiting the highest correlation with the missing feature are selected, balancing imputation accuracy and computational efficiency.
For simplicity, let us consider the fused dataset consisting of the N features fused from the outer gateways and the missing feature M for which synthetic data are to be generated. Let X be the set of all N features, and let C_jM denote the correlation coefficient between feature j and M. The Pearson correlation coefficient, as in Equation (1), is used to estimate the pairwise correlation between the features; the correlation coefficient for each feature pair is calculated as shown in Equation (2):

$$ C_{jM} = r(f_j, f_M), \quad j = 1, 2, \dots, N. \quad (2) $$

After the correlation coefficients of all features with M are calculated, the second step selects the top four features that exhibit the highest correlations, i.e., the features with the strongest linear associations with M. This selection can be represented as Equation (3):

$$ S = \{\, j \neq M : C_{jM} \text{ is among the four largest coefficients} \,\}. \quad (3) $$

The selected features from Equation (3) form the set used to generate synthetic data. The arg max over j finds the indices of the four highest correlation coefficients C_jM, considering only features other than M; in other words, feature M is excluded from this selection process.
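The top-four selection step can be sketched in a few lines of Python (our illustration; ranking by absolute correlation is an assumption, since strongly negative correlations are also informative for imputation):

```python
def top_k_features(corr, k=4):
    """Given {feature_name: correlation with the missing feature M},
    return the k feature names with the largest absolute correlation."""
    return sorted(corr, key=lambda f: abs(corr[f]), reverse=True)[:k]

corr = {"temperature": 0.95, "dew_point": 0.80, "humidity": -0.62,
        "wind_speed": 0.10, "cloud_cover": -0.45, "visibility": 0.30}
print(top_k_features(corr))   # four strongest linear associations
```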
3.3.2. Synthetic Data Generation Approach
After selecting the highly correlated features, the resulting fused dataset is used to generate synthetic data for the missing feature. The complete imputation workflow is summarized in Algorithm 1, which integrates cross-location data fusion, correlation-based feature selection, and a two-stage imputation process consisting of KNN initialization followed by Iterative Principal Component Analysis (I-PCA).
Let X denote the fused dataset containing both observed and missing values. Prior to imputation, feature selection is performed using Pearson correlation, where only features that satisfy a statistical significance threshold (p < 0.05) are retained. This step ensures that only strongly related features contribute to the imputation process and provides a clear link between the correlation estimation stage and the subsequent imputation pipeline.
In the first imputation stage, missing values are initially estimated using the K-Nearest Neighbors (KNN) method, which imputes each missing entry based on the values of its nearest neighbors in the feature space. This step produces a complete but coarse estimate of the dataset, denoted as XKNN, which serves as the initialization for the iterative refinement stage. KNN is selected for initialization due to its robustness, simplicity, and ability to preserve local similarity structures in the data.
Following the initial KNN imputation, an Iterative PCA-based imputation process is applied to refine only the originally missing values. At each iteration t, PCA decomposes the current dataset estimate X_t into a low-dimensional representation X_t ≈ Y_t ⋅ Z_t, where Y_t contains the principal component scores and Z_t represents the corresponding loadings. The reconstructed matrix Y_t ⋅ Z_t is then used to update only the missing entries, while the observed values are kept fixed using a binary mask matrix. This constraint prevents drift in the observed data and limits uncontrolled error propagation.
The iterative process continues until convergence is achieved, defined as the relative change in reconstructed values between successive iterations falling below a predefined tolerance (ε = 10⁻⁴). These criteria ensure numerical stability and reproducibility of the imputation results.
Sensitivity to initialization is acknowledged, as Iterative PCA can be influenced by the quality of the initial imputation. Inaccurate initial estimates may propagate errors during iterative refinement.
However, in the proposed framework, this risk is mitigated in three ways:
(i) KNN initialization leverages local neighborhood similarity rather than global assumptions;
(ii) Only features with statistically significant correlation are included in the fused dataset, reducing noise;
(iii) Observed values remain fixed throughout the iterative process, preventing error accumulation from affecting known data.
Algorithm 1. Imputing missing values using KNN-based Iterative PCA

Require: the fused dataset X after initial KNN imputation.
Ensure: the imputed dataset with missing values filled in.
 1: Initialize X₀ with the initial imputed values from KNN.
 2: Set the iteration counter t ← 0, ε ← 10⁻⁴, and max_iter ← 50.
 3: repeat
 4:   Compute Y_t and Z_t from the current estimate X_t via PCA decomposition:
 5:     Y_t, Z_t ← PCA_Decomposition(X_t)
 6:   Compute the residuals between the observed values in X and the values reconstructed from Y_t and Z_t:
 7:     E ← X − Y_t ⋅ Z_t
 8:   Update X_t: replace the missing entries with the corresponding values of Y_t ⋅ Z_t while keeping the observed entries at their values in X:
 9:     X_{t+1} ← (¬M) ⋅ (Y_t ⋅ Z_t) + M ⋅ X   ▷ apply the mask matrix M (M_ij = 1 if observed)
10:   t ← t + 1
11: until ‖E‖ < ε or t ≥ max_iter
12: return X_t
Here, let xi be the ith row of X and xij the jth element of xi. Let N be the number of rows in X and p the number of features in X. Let Y and Z form the PCA decomposition of X such that X ≈ Y ⋅ Z, where Y is an N × k matrix of the k principal component scores of X and Z is a k × p matrix of the corresponding loadings; yij is the jth element of the ith row of Y, and zij is the jth element of the ith row of Z. Let M be a binary mask matrix of the same dimensions as X, where Mij = 1 if xij is observed and 0 if it is missing. Let Xt be the imputed dataset at iteration t.
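A minimal NumPy sketch of this refinement loop follows; for brevity, a column-mean fill stands in for the KNN initialization, a truncated SVD serves as the PCA decomposition, and the rank k, tolerance, and iteration cap are illustrative assumptions rather than the paper's exact settings:

```python
import numpy as np

def iterative_pca_impute(X, k=2, eps=1e-4, max_iter=50):
    """Refine missing entries (np.nan) of X by repeated rank-k PCA
    reconstruction; observed entries stay fixed via the mask, as in Algorithm 1.
    A column-mean fill stands in here for the KNN initialization step."""
    mask = ~np.isnan(X)                                  # M: True where observed
    Xt = X.copy()
    col_means = np.nanmean(X, axis=0)
    Xt[~mask] = np.take(col_means, np.where(~mask)[1])   # coarse initialization
    for _ in range(max_iter):
        mu = Xt.mean(axis=0)
        U, s, Vt = np.linalg.svd(Xt - mu, full_matrices=False)
        recon = U[:, :k] @ np.diag(s[:k]) @ Vt[:k] + mu  # rank-k Y_t · Z_t (+ mean)
        Xnew = np.where(mask, X, recon)                  # observed fixed, missing updated
        if np.max(np.abs(Xnew - Xt)) < eps:              # convergence on update size
            return Xnew
        Xt = Xnew
    return Xt

rng = np.random.default_rng(0)
t = rng.normal(size=(40, 1))
data = np.hstack([t, 2 * t, -t]) + 0.01 * rng.normal(size=(40, 3))
data[5, 1] = np.nan                                      # remove one correlated entry
filled = iterative_pca_impute(data, k=1)
print(round(filled[5, 1], 3), "vs true", round(2 * t[5, 0], 3))
```

Because the mask reinstates the observed entries after every reconstruction, errors from a poor initialization can only affect the originally missing cells, mirroring the drift-prevention argument above.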
3.3.3. Evaluation
The evaluation is based on applying the proposed synthetic data generation algorithm to both the proposed fused datasets and the unfused datasets. To simplify the presentation of the evaluation method, let us consider a dataset D, which can be either fused or unfused, with R rows and C columns and containing missing values. drc (1 ≤ r ≤ R, 1 ≤ c ≤ C) is the value in the r-th row and c-th column of D. Similarly, D̂ denotes the imputed version of D, and d̂rc (1 ≤ r ≤ R, 1 ≤ c ≤ C) is the value in the r-th row and c-th column of D̂. If drc is not a missing value, then d̂rc = drc holds. Thus, the goal is to minimize the difference between d̂rc and drc. We use the root-mean-square error (RMSE) over the originally missing positions to describe this error, as shown in Equation (4):

$$ \mathrm{RMSE} = \sqrt{\frac{1}{|\Omega|} \sum_{(r,c) \in \Omega} \left(\hat{d}_{rc} - d_{rc}\right)^2 }, \quad (4) $$

where Ω denotes the set of originally missing positions for which ground-truth values are available.
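The RMSE of Equation (4) restricted to the artificially removed cells can be computed as in the following sketch (our illustration; the coordinate-list representation of Ω is an assumption):

```python
import math

def rmse_on_missing(original, imputed, missing_positions):
    """RMSE between ground-truth and imputed values, computed only over the
    cells that were removed before imputation (the set Ω in Equation (4))."""
    errs = [(imputed[r][c] - original[r][c]) ** 2 for r, c in missing_positions]
    return math.sqrt(sum(errs) / len(errs))

truth   = [[1.0, 2.0], [3.0, 4.0]]
imputed = [[1.0, 2.5], [3.0, 3.0]]
print(rmse_on_missing(truth, imputed, [(0, 1), (1, 1)]))
```

Restricting the sum to Ω matters: since observed entries satisfy d̂rc = drc by construction, including them would only dilute the error estimate.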
The performance of the data fusion method and the synthetic data generation method are both compared using RMSE values. Synthetic data are generated using the proposed KNN with Iterative PCA approach, and its errors are obtained on both types of datasets to compare the efficiency of the data fusion methods. Further, RMSE values are evaluated for different synthetic data generation algorithms to compare the efficiency of the proposed algorithm.
4. Dataset and Results
To evaluate the effectiveness of the proposed data fusion method against a conventional unfused approach, two types of datasets are considered, both designed to facilitate synthetic data generation for the same target feature. Specifically, for a given missing feature F, we construct (i) a conventional fusion dataset, where imputation relies solely on features collected from the same target location, and (ii) a proposed fusion dataset, where features of the same type are fused from geographically distributed gateways. These two dataset configurations enable a direct and fair comparison of imputation performance for identical missing features under different fusion strategies. The proposed KNN with Iterative PCA imputation method is applied consistently across both dataset types.
The experimental evaluation is conducted using real-world weather monitoring data collected from eight geographically diverse locations across the United States, namely Colorado, California, Arizona, Las Vegas, Washington, Salt Lake City, Texas, and Oregon. These locations exhibit distinct climatic patterns even within the same time period, providing natural variability that is well-suited for assessing cross-location correlation and fusion effectiveness. The datasets were obtained independently for each location and cover a three-month period (from February to April 2023).
While the evaluation focuses on weather sensor data, which represents a common and well-studied IoT application with continuous temporal characteristics, the proposed data fusion and imputation framework is not limited to this domain. Weather data are used as a representative testbed due to their availability, temporal structure, and established use in imputation studies. Extending the evaluation to other IoT domains such as industrial monitoring or healthcare sensing is an important direction for future work.
4.1. Dataset Before Fusion (DS1)
Table 1 and Table 2 show a glimpse of the conventional-style fused dataset under consideration. Similar datasets exist for the other states as well. As observed in these tables, each dataset comprises a total of nine features—temperature, feels-like temperature, dew point, humidity, wind gust, wind speed, wind direction, cloud cover, and visibility—obtained from the same target locations, i.e., Arizona and Las Vegas, respectively. All eight weather station datasets share an identical number of features and the same feature types. In aggregate, 93 data points are available for each feature. These states were selected because of the noticeable variations in weather conditions across them.
If any one of these features is missing, the remaining features from the same location are used to generate synthetic data; this constitutes the conventional data fusion technique for imputing missing values.
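As an illustration, the DS1-style configuration can be sketched as follows. This is a minimal NumPy sketch under our assumptions, not the authors' code: the feature names and the toy data are hypothetical, and only three of the nine features are shown.

```python
import numpy as np

def build_conventional_dataset(station, target):
    """DS1-style table for one location: the target feature as the
    column to be imputed, and the remaining same-location features
    as predictors. `station` maps feature name -> 1-D array of equal
    length (93 readings in the paper's setting)."""
    predictors = {f: v for f, v in station.items() if f != target}
    X = np.column_stack(list(predictors.values()))  # same-location features
    y = station[target]                             # feature to be imputed
    return X, y, list(predictors)

# Hypothetical toy station with three of the nine features.
rng = np.random.default_rng(0)
station = {
    "temperature": rng.normal(20, 5, 93),
    "humidity":    rng.uniform(10, 90, 93),
    "wind_speed":  rng.uniform(0, 15, 93),
}
X, y, names = build_conventional_dataset(station, target="temperature")
print(X.shape, y.shape, names)  # (93, 2) (93,) ['humidity', 'wind_speed']
```

When the target column is missing, the predictor matrix X supplies the information used to generate its synthetic replacement.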
4.2. Dataset After Fusion (DS2)
Table 3 and
Table 4 show a glimpse of the proposed fused dataset for “feels like” and “average temperature”, respectively. For example, from
Table 3, whenever the “feels like” feature is missing from, say, the Arizona gateway, the proposed fusion approach gathers the “feels like” feature from the gateways at the other locations. Similarly,
Table 4 shows the fused dataset for average temperature.
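The DS2-style configuration can be sketched analogously. Again, this is a minimal NumPy sketch with hypothetical data: a shared regional signal plus per-gateway noise stands in for the real multi-location measurements, and only four of the eight locations are shown.

```python
import numpy as np

def build_fused_dataset(gateways, feature, target_location):
    """DS2-style table for one feature: the target location's column
    (to be imputed) alongside the same feature collected from every
    other gateway. `gateways` maps location -> {feature -> 1-D array}."""
    others = {loc: g[feature] for loc, g in gateways.items()
              if loc != target_location}
    X = np.column_stack(list(others.values()))  # same feature, other gateways
    y = gateways[target_location][feature]      # column to be imputed
    return X, y, list(others)

# Hypothetical data: a shared regional signal plus local noise.
rng = np.random.default_rng(1)
base = rng.normal(25, 4, 93)
gateways = {loc: {"feels_like": base + rng.normal(0, 1, 93)}
            for loc in ["Arizona", "Las Vegas", "Texas", "Oregon"]}
X, y, locs = build_fused_dataset(gateways, "feels_like", "Arizona")
print(X.shape, locs)  # (93, 3) ['Las Vegas', 'Texas', 'Oregon']
```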
4.3. Experiments
We performed the experiments in two parts: (1) evaluating the effectiveness of the proposed data fusion technique and (2) comparing KNN + Iterative PCA against other imputation approaches. The effectiveness of KNN + Iterative PCA is demonstrated first, and the data fusion techniques are then compared using KNN + Iterative PCA as the common imputation method.
4.4. Correlation Statistics Analysis
For both DS1 and DS2, we obtained the top-four and top-five correlations for each feature, as shown in
Figure 3 and
Figure 4. In both figures, the correlation between “temperature” and “feels like” is the highest. Conversely, “wind direction” has the lowest correlation in both datasets, while “humidity” shows a moderate correlation in both. The quality of the generated synthetic data is assessed through these correlations: a higher correlation indicates more effective synthetic data generation under the corresponding fusion method. The conventional data fusion method shows higher correlations than the proposed method for most features, with the exceptions of the “visibility”, “humidity”, and “feels like” features, for which the proposed method is higher. For the “cloud cover”, “wind direction”, and “dew” features, the conventional method shows significantly higher correlation values; for the remaining features, the two methods are almost on par.
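The correlation screening step can be sketched as follows: for a given target feature, rank all other columns by absolute Pearson correlation and keep the top k (four or five in our experiments). This is a minimal NumPy sketch with hypothetical feature names and synthetic toy data.

```python
import numpy as np

def top_k_correlated(data, target, k=4):
    """Rank features by |Pearson r| against `target` and return the
    top k as (name, r) pairs. `data` maps feature name -> 1-D array."""
    t = data[target]
    scores = {}
    for name, col in data.items():
        if name == target:
            continue
        scores[name] = np.corrcoef(t, col)[0, 1]
    ranked = sorted(scores, key=lambda n: abs(scores[n]), reverse=True)
    return [(n, round(scores[n], 3)) for n in ranked[:k]]

# Hypothetical toy data mirroring the observed correlation pattern:
# "feels like" strongly, "humidity" moderately, "wind dir" weakly correlated.
rng = np.random.default_rng(2)
temp = rng.normal(20, 5, 93)
data = {
    "temperature": temp,
    "feels_like":  temp + rng.normal(0, 0.5, 93),
    "humidity":    -0.5 * temp + rng.normal(0, 4, 93),
    "wind_dir":    rng.uniform(0, 360, 93),
}
top = top_k_correlated(data, "temperature", k=2)
print(top)
```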
4.5. KNN + Iterative PCA with Other Approaches
To evaluate the effectiveness of the proposed imputation strategy, we compared the KNN with Iterative PCA method against a set of widely used statistical and classical machine learning-based imputation techniques on the proposed fused dataset (DS2). The evaluated methods include machine learning approaches such as KNN, Random Forest (RF), RF + Iterative PCA, Decision Tree (DT), and DT + Iterative PCA, as well as statistical techniques, including Mean Imputation, PMF, MICE, and MICE + Iterative PCA. More complex deep learning-based imputation models (e.g., DNNs, GANs, and transformer-based approaches) were not considered, as they typically require large-scale training data, extensive hyperparameter tuning, and substantial computational resources, which are not well-suited to the moderate dataset size and the data-scarce conditions representative of the real-world IoT deployments considered in this study.
In our study, across all evaluated features, the imputation performance of the compared methods exhibits a clear dependence on the degree of cross-location feature correlation. For the “temperature” feature, which shows the highest correlation across locations, the proposed KNN with Iterative PCA method consistently achieves the lowest RMSE values (
Figure 5). This indicates that when strong multivariate relationships exist, the combined use of neighborhood-based initialization and covariance-aware refinement effectively captures both local and global data structures.
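The two-stage mechanism described above can be sketched as follows. This is our reading of the KNN + Iterative PCA idea as a minimal NumPy sketch, not the authors' implementation: missing cells are first filled from the k nearest rows, then refined by repeatedly replacing them with a low-rank SVD reconstruction; the hyperparameters (k, rank, tolerance) are illustrative.

```python
import numpy as np

def knn_iterative_pca_impute(X, k=5, rank=2, iters=50):
    """Fill np.nan entries of 2-D array X: KNN initialisation followed
    by iterative low-rank (PCA/SVD) refinement of the missing cells."""
    X = X.astype(float).copy()
    miss = np.isnan(X)
    col_mean = np.nanmean(X, axis=0)

    # KNN initialisation: average the k rows closest on the observed
    # columns (rows with their own gaps contribute partial distances).
    for i in np.where(miss.any(axis=1))[0]:
        obs = ~miss[i]
        d = np.nansum((X[:, obs] - X[i, obs]) ** 2, axis=1)
        d[i] = np.inf
        nbrs = np.argsort(d)[:k]
        for j in np.where(miss[i])[0]:
            vals = X[nbrs, j]
            vals = vals[~np.isnan(vals)]
            X[i, j] = vals.mean() if vals.size else col_mean[j]

    # Iterative PCA refinement: re-estimate the missing cells from a
    # rank-`rank` reconstruction until the fill stops changing.
    for _ in range(iters):
        mu = X.mean(axis=0)
        U, s, Vt = np.linalg.svd(X - mu, full_matrices=False)
        low = (U[:, :rank] * s[:rank]) @ Vt[:rank] + mu
        new = np.where(miss, low, X)
        if np.max(np.abs(new - X)) < 1e-6:
            return new
        X = new
    return X

# Hypothetical low-rank toy data with 10% of entries masked.
rng = np.random.default_rng(3)
true = rng.normal(0, 1, (93, 4)) @ rng.normal(0, 1, (4, 9))
mask = rng.random(true.shape) < 0.10
Xm = true.copy()
Xm[mask] = np.nan
filled = knn_iterative_pca_impute(Xm, k=5, rank=4)
rmse = np.sqrt(np.mean((filled[mask] - true[mask]) ** 2))
print(f"imputation RMSE on masked cells: {rmse:.3f}")
```

The KNN stage exploits local row similarity, while the SVD stage enforces the global covariance structure, matching the local/global decomposition discussed above.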
For the “humidity” feature, which exhibits moderate correlation, the proposed method continues to outperform or closely match alternative approaches (
Figure 6). Although the overall RMSE values are higher than those observed for temperature, the performance gap between KNN + Iterative PCA and other methods remains evident, demonstrating robustness under reduced correlation strength.
In contrast, the “wind direction” feature exhibits weak cross-location correlation, leading to increased RMSE values for all evaluated methods (
Figure 7). Under these conditions, the advantage of cross-location fusion is diminished, and the proposed method achieves performance comparable to, rather than substantially better than, conventional approaches. This behavior highlights an inherent limitation of correlation-driven fusion methods and underscores the importance of correlation strength in determining imputation accuracy.
Overall, the results demonstrate that the proposed approach provides consistent improvements when cross-location correlations are strong or moderate, while maintaining competitive performance in low-correlation scenarios. These findings confirm that correlation-aware fusion combined with KNN-Iterative PCA imputation is effective under favorable conditions and degrades gracefully when correlation is weak, rather than introducing instability or excessive error.
4.6. Comparison of Proposed Data Fusion Methods with Other Fusion Methods
The comparative evaluation of the proposed cross-location data fusion method (DS2) and the conventional fusion approach (DS1) reveals a consistent relationship between feature correlation strength and imputation accuracy. To provide a structured analysis, the evaluated features can be broadly categorized into high-, medium-, and low-correlation groups based on their cross-location Pearson correlation values.
To simulate realistic missing-data scenarios under limited data availability, missing values are introduced at 10% and 20% levels. Higher missing rates are not considered due to the moderate size of the available datasets, as excessive removal of data would reduce statistical reliability and distort correlation estimation. For higher missing percentages (20%), a larger set of correlated features (top five instead of top four) is used to compensate for increased information loss.
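The masking protocol can be sketched as follows: a fixed fraction of one feature column is hidden, an imputer fills the gaps, and RMSE is computed on the hidden cells only. This is a minimal NumPy sketch; the `mean_impute` baseline and the toy temperature series are illustrative stand-ins for the methods and data compared in the figures.

```python
import numpy as np

def evaluate_at_missing_rate(col, rate, impute_fn, seed=0):
    """Mask a fraction `rate` of `col`, impute with `impute_fn`
    (column-in, column-out), and return RMSE on the masked cells."""
    rng = np.random.default_rng(seed)
    masked = col.astype(float).copy()
    idx = rng.choice(col.size, size=int(rate * col.size), replace=False)
    masked[idx] = np.nan
    filled = impute_fn(masked)
    return np.sqrt(np.mean((filled[idx] - col[idx]) ** 2))

def mean_impute(x):  # simplest baseline for illustration
    out = x.copy()
    out[np.isnan(out)] = np.nanmean(out)
    return out

rng = np.random.default_rng(4)
temperature = rng.normal(20, 5, 93)
for rate in (0.10, 0.20):  # the two missing levels used in the paper
    rmse = evaluate_at_missing_rate(temperature, rate, mean_impute)
    print(f"{int(rate * 100)}% missing -> RMSE {rmse:.2f}")
```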
Features such as “temperature” (
Figure 8 and
Figure 9) and “feels like” (
Figure 10 and
Figure 11) exhibit high cross-location correlation, reflecting strong and consistent relationships across geographically distributed weather stations. For these features, the proposed data fusion method consistently achieves lower RMSE values than the conventional approach under both 10% and 20% missing-data scenarios. This behavior indicates that fusing correlated external features effectively enhances the available information for imputation, allowing the KNN with Iterative PCA method to accurately capture both local similarity and global covariance structure.
Features including “humidity” (
Figure 12 and
Figure 13), “wind gust” (
Figure 14 and
Figure 15) and “wind speed” (
Figure 16 and
Figure 17) demonstrate moderate correlation across locations. For this group, the proposed fusion method yields either improved or comparable RMSE values relative to the conventional approach. Although the performance gains are less pronounced than in the high-correlation case, the results show that cross-location fusion remains beneficial when partial but meaningful correlations exist.
In contrast, features such as “wind direction” (
Figure 18 and
Figure 19), “dew” (
Figure 20 and
Figure 21), “visibility” (
Figure 22 and
Figure 23) and “cloud cover” (
Figure 24 and
Figure 25) exhibit low or highly variable cross-location correlation, largely due to their strong dependence on localized meteorological conditions. For these features, the advantage of cross-location fusion is diminished. In some cases, such as “dew” and “cloud cover”, the proposed method produces higher RMSE values than the conventional approach, indicating that incorporating weakly correlated external data can introduce noise rather than useful information. Nevertheless, the proposed framework does not exhibit instability and maintains performance comparable to baseline methods for low-correlation features.
Overall, this analysis confirms that the effectiveness of the proposed cross-location data fusion approach is strongly governed by the strength of inter-location feature correlation. The method performs best for highly correlated features, provides robust performance for moderately correlated features, and degrades gracefully for weakly correlated features, thereby validating the design choice of correlation-aware feature selection in the proposed framework.
5. Conclusions and Limitations
This study evaluated a correlation-aware cross-location data fusion framework for generating synthetic data in IoT systems experiencing sensor or gateway failures. Using real-world weather data collected from eight geographically diverse locations across the United States, the proposed approach was compared with conventional same-location fusion strategies under controlled missing-data scenarios. The results demonstrate that the proposed method achieves improved or comparable imputation accuracy when sufficient cross-location feature correlation exists and remains stable when correlation is weak.
A detailed feature-level analysis shows that the effectiveness of the proposed framework is strongly governed by inter-location correlation strength. Highly correlated features benefit most from cross-location fusion, moderately correlated features show robust and consistent performance, and weakly correlated features exhibit limited improvement or performance comparable to conventional fusion methods. These observations highlight both the strengths and inherent limitations of correlation-driven fusion and underscore the importance of correlation-aware feature selection.
The proposed KNN with Iterative PCA imputation method consistently outperforms baseline statistical approaches such as Mean Imputation and PMF and demonstrates robust behavior across varying correlation levels. Advanced deep learning-based imputation methods (e.g., DNNs, GANs, and transformer-based models) were not considered in this work due to their reliance on large training datasets, extensive hyperparameter tuning, and high computational demands. Given the moderate dataset size and intermittent data availability typical of real-world IoT deployments, such models may suffer from overfitting and unstable training, making lightweight and interpretable approaches more suitable in this context.
From a systems perspective, the proposed fusion framework reduces reliance on redundant sensor deployment by selectively fusing correlated features from external IoT networks. While the current implementation prioritizes imputation accuracy, theoretical analyses of convergence, computational complexity, and scalability have not been formally derived and represent an important limitation of this study. Similarly, practical considerations such as energy consumption, communication overhead, and latency introduced by cross-network data exchange are not explicitly quantified. Although the framework mitigates overhead by exchanging only summary statistics during correlation screening and transmitting full data only when a significant correlation is detected, a comprehensive system-level evaluation remains for future work.
The experimental evaluation is further limited to weather monitoring data and missing data rates of 10% and 20%, which were selected to preserve statistical reliability given the limited dataset size. Higher missing rates, large-scale deployments, and highly dynamic or non-stationary environments were not explored. Future research will focus on formal convergence and complexity analysis, large-scale scalability studies, energy- and latency-aware fusion strategies, and validation across diverse IoT domains such as industrial monitoring and healthcare sensing.
Overall, this work demonstrates that correlation-aware cross-location data fusion combined with lightweight imputation techniques provides a practical and effective solution for handling missing data in resource-constrained IoT environments while also identifying clear directions for advancing theoretical rigor and real-world applicability.