1. Introduction
The concept of the Internet of Things (IoT) revolves around linking physical objects, including embedded systems, to enable the collection and exchange of data. IoT serves as the foundation for seamlessly integrating sensors, actuators, and communication devices, facilitating real-time data collection and the remote control of actuators [1]. This interconnected network of physical objects has created a vast ecosystem in which these objects communicate with one another to enable a wide range of applications. This has unlocked opportunities across various sectors, including smart industries, smart transportation, smart agriculture, smart healthcare, and many more [2].
All these applications of IoT in various sectors have increasingly contributed to the generation of large amounts of data [3,4]. For example, in smart waste management, large volumes of data are generated from various distributed IoT devices, such as cameras, RFIDs, and odor sensors, and these data are transmitted to the cloud for further analysis [5,6]. Similarly, in smart traffic prediction systems, large volumes of data—vehicle speed and location, traffic data from surveillance cameras, and so on—are transmitted continuously from the generating sources to the cloud for analysis [7,8].
However, owing to the fragility of IoT devices and harsh environmental conditions, these raw data may be lost during transmission and storage [9]. Moreover, this loss can also result from failures of sensors/actuators, processing units, embedded software, or the service and application layers [10]. Furthermore, the growing use of renewable energy resources to power IoT devices can induce discontinuous data collection and missing data [11].
This data loss can have significant consequences, sometimes leading to serious failures. For example, in smart healthcare, if technology fails to work as intended, a patient could be injured, or sensitive personal health information may be exposed [12]. Missing data thus leaves insufficient data for performing meaningful processing and analysis in the corresponding applications. Furthermore, a lack of sufficient data can result in an analysis that lacks statistical significance, potentially leading to erroneous conclusions or flawed decision-making when the missing data include crucial and sensitive features [11]. Therefore, because the effectiveness of numerous statistical and machine learning algorithms depends on having complete data, it is vital to address missing data appropriately.
There are various types of missing-data mechanisms, and identifying the type of missing data is crucial to finding solutions to address them. Missing data have been categorized into three different types: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR).
MCAR signifies that the absence of data occurs in a completely random pattern, independent of any observed or unobserved values; i.e., the probability of missingness depends neither on the observed values of any variable in the dataset nor on its unobserved part [11,13]. MAR indicates that missingness depends solely on the observed values and is unrelated to the missing values themselves; i.e., the probability of missingness depends on the observed values but not on the unobserved ones [14]. MNAR suggests that data are missing in a non-random manner, with the missingness depending on both observed and unobserved values; i.e., the probability that a data point is missing depends on the value of that data point or on other unobserved variables [14].
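To make the three mechanisms concrete, the following minimal Python sketch (our own illustration, not drawn from the cited works) generates each missingness pattern on a toy series; the drop probability and thresholds are arbitrary assumptions:

```python
import random

random.seed(0)

def mask_mcar(x, p=0.3):
    """MCAR: each value is dropped with fixed probability, independent of any data."""
    return [None if random.random() < p else v for v in x]

def mask_mar(x, y, threshold=0.5):
    """MAR: x is dropped whenever the fully observed covariate y exceeds a threshold."""
    return [None if yi > threshold else xi for xi, yi in zip(x, y)]

def mask_mnar(x, threshold=0.5):
    """MNAR: x is dropped based on its own (unobserved) value."""
    return [None if xi > threshold else xi for xi in x]

x = [random.random() for _ in range(10)]
y = [random.random() for _ in range(10)]
print(mask_mcar(x))
print(mask_mar(x, y))
print(mask_mnar(x))
```

Note that only MCAR leaves the observed sample unbiased; under MNAR, the observed values are systematically unrepresentative, which is why identifying the mechanism matters before choosing an imputation strategy.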
The methods that have been developed to handle missing data fall into two groups: (1) discard-based and (2) imputation-based methods. In a discard-based approach, the records containing missing data are simply removed from the dataset [15]. Although this is simple and easy to implement, it is not applicable when the proportion of missing values is large [16]. Furthermore, when the application is sensitive to time and to values generated at specific times, discard methods will not solve the problem. In an imputation-based method, the missing value is predicted, and the predicted value is used for analysis [16]. In other words, synthetic data are generated in place of the missing values. Conventionally, imputation methods are divided into two types: (1) statistics-based imputation and (2) model-based imputation. In statistics-based imputation, missing data are usually imputed using mean, median, mode, or linear regression methods [17,18]. In model-based imputation, appropriate machine learning algorithms are used to impute missing data.
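As a brief illustration of statistics-based imputation (a generic sketch, not a specific method from the cited works), a column's missing entries can be replaced by the mean or median of its observed values:

```python
from statistics import mean, median

def impute_column(values, strategy="mean"):
    """Statistics-based imputation: replace None entries with the mean or
    median of the observed values in the same column."""
    observed = [v for v in values if v is not None]
    fill = mean(observed) if strategy == "mean" else median(observed)
    return [fill if v is None else v for v in values]

temps = [21.0, None, 23.0, 22.0, None]
print(impute_column(temps, "mean"))    # -> [21.0, 22.0, 23.0, 22.0, 22.0]
```

Such fills ignore all other variables, which is exactly the limitation that motivates the model-based alternatives discussed next.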
These imputation methods depend on two types of datasets to generate synthetic data: (1) historical datasets and (2) datasets from multi-sensor data fusion of the same target. However, using historical data for data fusion is unreliable, as it struggles to capture new or emerging trends and patterns that have developed since the data were collected. Multi-sensor data fusion of the same target shows promising results; however, multiple sensors observing the same target may not be present in all applications. Moreover, these sensors may share a single point of failure, in which case no sensor can generate data and failure recovery (imputation) becomes difficult.
In this article, we propose a failure-aware data fusion framework that leverages data from independent and geographically distributed IoT networks to support data imputation under sensor or gateway failures. While cross-domain and distributed sensor fusion have been investigated in prior studies, existing approaches primarily focus on correlation-aware decision-level fusion. In contrast, this work specifically targets feature-level fusion triggered by missing data events, enabling imputation even when no redundant sensors are available at the target location.
In addition to the proposed fusion framework, we integrate a lightweight and effective data imputation strategy suitable for resource-constrained IoT environments. The main contributions of this work are summarized as follows:
- (1) We propose an efficient data fusion method in which multi-sensor data from different targets are fused to facilitate imputation in the case of a single point of failure. Whenever the application encounters missing data and has no redundant sensor within its own application area, it looks for similar data from other networks (outside its application reach) and utilizes them if necessary.
- (2) We propose a lightweight imputation algorithm, KNN with an Iterative PCA-based imputation method, and compare its efficiency with other imputation approaches.
- (3) We experimented with weather station datasets from eight different US states and performed data fusion of weather values across these states; the results show that this approach performs on par with the conventional approach, with performance depending on correlation strength.
2. Related Work
In this section, we review data fusion methods for data imputation—historical-data-based data fusion and sensor/feature-based fusion from the same target location—and data imputation methods from the perspective of statistics-based imputation, model-based imputation, and K-Nearest Neighbors-based imputation.
2.1. Historical Data-Based Data Fusion with Statistics-Based Imputation
Utilizing historical data-based data fusion methods includes the use of past data for data imputation. Various statistics-based imputation methods—mean, median, mode, linear regression, and others—are then applied to the historical dataset [19]. The author of [20] applied mean and hot deck imputation methods to impute data from a real breast cancer dataset. The author of [17] compared mean-, median-, and mode-based imputations of a gas emission dataset, and the mean outperformed the other approaches. Mean-, median-, and mode-based imputation replaces missing values with the mean, median, and mode of the observed values, respectively. Regression-based imputation replaces the missing values of each variable with values predicted from a regression of that variable on other variables. Hot deck imputation replaces each missing value with a random draw from a “donor pool” consisting of the observed values of those variables [14,19].
Statistics-based imputation methods applied to historical datasets typically estimate missing values by considering only the affected feature, without leveraging relationships with other variables. Although such approaches are simple to implement and may appear suitable for handling isolated failures, they often introduce bias and may alter the underlying data distribution, particularly when the proportion of missing values is large [17]. Moreover, when applied to historical datasets, these methods generally perform worse than model-based imputation techniques that exploit multivariate dependencies. Consequently, relying solely on statistics-based imputation with historical data often leads to reduced accuracy and increased bias, highlighting the need for more robust imputation strategies that incorporate additional contextual information.
2.2. Multi-Sensor-Based Data Fusion from Same Target with Model-Based Imputation
In multi-sensor-based data fusion, different features of the same target location are collected and utilized to make decisions. Thus, various kinds of features of an environment are analyzed together to make a prediction. Since there is a correlation among these features, as they represent different parameters of the same environment, when any one of those features goes missing, then the remaining environmental features can be utilized to impute the missing features. For this kind of multi-sensor-based data fusion technique, model-based imputation is popular since it can analyze the various environmental parameters together to make decisions.
The author of [21] showed that using MICE boosted recognition accuracy from 87% to 98%. The authors of [22] proposed a framework to improve the popular multivariate imputation by chained equations (MICE) method to deal with large data gaps. They demonstrated the efficiency of their framework using data from continuous water quality monitoring stations in Vermont. The authors of [23] introduced model selection to improve multiple imputation for handling high-rate missingness in a water quality dataset. They proposed a robust method for selecting the best algorithm to combine with MICE to handle multiple relationships between a high number of features of interest and a high rate of missingness. Thus, their main contribution was improving MICE by taking advantage of ML models such as Random Forest (RF), K-Nearest Neighbors (KNN), Support Vector Regression (SVR), and Boosted Regression Trees (BRT), hybridizing them with MICE. They found that MICE-SVR gives a good trade-off between performance and computing time.
The author of [24] utilized the Random Forest (RF) method to impute missing weather data, facilitating critical agricultural decision-making; 8.5 years of air temperature, relative humidity, wind, and solar radiation data from Washington state were used to impute the missing values. The authors of [25] predicted corn variety yield by imputing missing data using a graph neural network. Various ecological zones in China were considered to gather corn trait features; features such as corn variety, corn strain, corn cob type, axis color, seeding lead, sheath color, and stay-green trait were chosen and labeled to create the complete dataset. The author of [26] presented an example of imputing missing water quality data using a Deep Neural Network (DNN). The water quality data consisted of features such as water temperature, pH, electric conductivity, dissolved oxygen, chlorophyll-a, and nitrate; the missing values in these feature sets were imputed using the DNN. The authors of [27] proposed a comprehensive method to forecast the AQI. Initially, they predicted hourly ambient concentrations of PM2.5 and PM10 using artificial neural networks. This was later extended to the prediction of the other criteria pollutants, i.e., O3, SO2, NO2, and CO, to predict the AQI. Moreover, these features contained missing gaps, which were handled using missForest, a machine learning-based imputation technique that employs a Random Forest (RF) model.
The above approaches rely on multi-sensor data collected from the same target environment, assuming the availability of redundant or complementary features to support imputation. Consequently, when missing data occur due to single-point failure, sensor malfunctions, or gateway outages, these methods become ineffective or inapplicable, as no redundant sensors are available at the target location. This limitation highlights a critical gap in existing multi-sensor fusion-based imputation techniques, underscoring the need for alternative strategies that can leverage information beyond the local sensing environment.
2.3. KNN-Based Imputation
KNN-based imputation is a method used to fill in missing values in a dataset based on the K-Nearest Neighbors algorithm. In this approach, for each missing value, the algorithm identifies the k nearest data points (neighbors) with known values for the feature(s) of interest. It then imputes the missing value by calculating a weighted average or a majority vote of the known values from these nearest neighbors. The weights are typically determined based on the distance or similarity between the incomplete data point and its neighbors. KNN-based imputation is commonly used in data preprocessing and is particularly effective for datasets with numerical or categorical features where missing values need to be addressed before further analysis or modeling. The authors of [28] presented combined KNN and BiLSTM methods to address noise and variance challenges in data imputation. Initially, the influence of each wind power dimension is assessed using Pearson correlation, and a CNN extracts information. A BiLSTM learns the time-series data representation. Missing data are imputed using the CNN-BiLSTM and KNN models, forming the final dataset. The combined model improves imputation, increasing R^2 by 0.02. The authors of [29] compared KNN, Sequential KNN (SKNN), and other statistics-based imputation methods to impute missing air quality values collected from peninsular Malaysia; KNN and SKNN showed better results than statistics-based imputation. Similarly, the author of [30] performed KNN-based imputation to impute missing weather data. The missing weather data include various weather attributes from several weather stations in Pakistan, and these attributes are utilized together to predict the missing values. The authors of [31] assessed categorical variable imputation methods using Ugandan maternal health records, where KNN imputation stood out for predicting missing values. The results reveal KNN’s superior precision at multiple levels of missingness, with RF also performing well at lower missing data proportions. This study highlighted the importance of selecting methods based on data characteristics for effective imputation strategies.
Existing KNN-based imputation methods primarily operate within a single dataset or target environment and rely on the availability of locally correlated features to estimate missing values. As a result, their effectiveness is significantly reduced in scenarios where missing data arise from sensor- or gateway-level failures that eliminate access to relevant local features. Furthermore, standard KNN-based approaches may struggle to capture higher-order covariance structures present in multivariate time-series data. These limitations motivate our proposed KNN with Iterative Principal Component Analysis (PCA) approach: a regular KNN-based imputation pass is performed first, after which Iterative PCA is applied. We implemented KNN with Iterative PCA to impute the missing data in our proposed data fusion method and compared the results with other algorithms implementing conventional data fusion methods.
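As a concrete illustration of the basic KNN imputation step described above (a simplified sketch, not the exact implementation of any cited work), a missing entry can be filled with the mean of that feature over the k complete rows nearest in Euclidean distance on the commonly observed features:

```python
import math

def knn_impute(rows, k=2):
    """Impute None entries: for each incomplete row, find the k nearest complete
    rows (Euclidean distance over features observed in both) and fill each
    missing entry with the mean of the neighbors' values for that feature."""
    def dist(a, b):
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    complete = [r for r in rows if None not in r]
    out = []
    for r in rows:
        if None not in r:
            out.append(list(r))
            continue
        neighbors = sorted(complete, key=lambda c: dist(r, c))[:k]
        out.append([sum(n[j] for n in neighbors) / len(neighbors) if v is None else v
                    for j, v in enumerate(r)])
    return out

data = [[1.0, 2.0], [1.1, 2.1], [5.0, 6.0], [1.05, None]]
print(knn_impute(data, k=2))
```

Because the neighbor search relies entirely on locally observed features, this sketch also makes plain why the approach fails outright when a gateway-level fault removes all local features at once.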
3. Proposed Data Fusion Approach
This article presents an imputation architecture that integrates cross-location data fusion, correlation estimation, and KNN with Iterative PCA-based imputation to address missing data in IoT systems. When missing data is detected, the framework identifies candidate features from other gateways and selects only those that are sufficiently correlated with the missing feature for fusion. The resulting fused dataset is then used for imputation via an initial KNN estimate followed by Iterative PCA refinement. The individual components are described in detail in the following subsections.
3.1. Proposed Data Fusion Method
Conventional sensor-based data fusion methods collect multiple features from sensors deployed within the same target environment, typically organized through an IoT architecture consisting of gateways, nodes, and sensors. When data gaps occur in one feature, the remaining locally collected features are analyzed to generate synthetic values for the missing data. However, such approaches assume the availability of redundant or complementary sensors within the same environment.
In contrast, the proposed framework relaxes this assumption by allowing features to be sourced not only from sensors within the same target environment but also from geographically distributed gateways sensing similar feature types in different environments. This design enables data fusion and imputation even in the absence of local sensor redundancy.
Let gns represent gateway n at location s, where n = 1, 2, …, N and s = 1, 2, …, S. Each gateway gns hosts p sensor features of type t, where p can vary among gateways, and the collection of these p features for gateway gns can be defined as the set Fn = {fnt1, fnt2, …, fntp}, where t = 1, 2, …, T. The collection of all feature sets across all gateways can be represented as Xn = {F1, F2, …, FN}. Thus, f122 represents feature number 2 of feature type 2 from gateway 1. In our approach, the system looks for a similar feature type t from the network, ensuring that it has a significant correlation with the missing feature data. Let us assume that feature f122 (an instance of fntp), i.e., feature number 2 of type 2 from dataset X1 and gateway 1 located in geographical location 1 (g11), is missing. In this case, the proposed framework looks for a similar feature of type 2 in the datasets Xn of gateways gns with n ≠ 1. These similar features (fntp, if and only if n ≠ 1 and t = 2) are then transmitted to g11.
Thus, when a feature is missing at a gateway, the proposed framework searches for semantically similar features of the same type from other gateways in the network using a correlation-based feature selection mechanism that is described in the following subsection. Only those features that exhibit a statistically significant correlation with the missing feature are selected and transmitted to the gateway for fusion and subsequent imputation.
Figure 1 illustrates this process. When Data 1 in the first standalone network experiences missing values, the system searches for a corresponding feature in another independent IoT network. If a comparable feature (e.g., Data 2) is identified in the second network, it is transferred to the first network to facilitate data fusion and synthetic data generation.
3.2. Correlation Estimation
Correlation estimation aims to approximate the degree of statistical association between two variables without transmitting the complete dataset. Instead, it relies on summary statistics or partial information to obtain a sufficiently accurate estimate of correlation while minimizing communication overhead. In this work, we employ a summary statistic-based correlation estimation strategy to support efficient feature selection (for data fusion) across distributed IoT gateways.
To elaborate on the summary statistics technique, let us consider a situation where the feature f143, i.e., feature number 3 of type 4 from gateway 1, is not available. In this case, we compute summary statistics, namely the mean and standard deviation, for feature f143. These summary statistics are then transmitted to the other gateways (gns). If other gateways, for example, g22, g33, and g55, possess similar feature types, then the summary statistics of the corresponding features are calculated within their respective gateways. Subsequently, correlation estimation is performed by comparing the summary statistics of feature f143 with those of g22, g33, and g55, respectively. If a significant correlation is identified, then the entire feature is sent to g11 for fusion.
Pearson correlation is used as a lightweight screening mechanism in this process. A correlation is considered significant if the associated p-value is below 0.05, which serves as the threshold for selecting candidate features for fusion. This threshold represents a practical balance between false positives and computational overhead and is widely adopted in statistical analysis. Pearson correlation was selected due to its low computational cost, interpretability, and suitability for deployment on resource-constrained IoT gateways.
Although temporal dependencies are not explicitly modeled during the correlation estimation stage, the underlying datasets are time-indexed, and temporal structure is preserved throughout the fusion and imputation pipeline. Temporal relationships are implicitly captured during the imputation phase through KNN neighborhood selection and Iterative PCA decomposition, which exploit covariance patterns across temporally ordered observations. This design enables effective handling of time-series data while maintaining low computational complexity suitable for decentralized IoT environments.
The Pearson correlation coefficient (commonly denoted as ρ or r) between two variables X and Y can be estimated using summary statistics, namely the means (μ) and standard deviations (σ) together with the cross-moment of the paired observations, as follows:

$$ r = \frac{\frac{1}{n}\sum_{i=1}^{n} x_i y_i - \mu_X \mu_Y}{\sigma_X \sigma_Y} \quad (1) $$

where x_i and y_i are the individual data points; μ_X and μ_Y are the means (averages) of X and Y, respectively; σ_X and σ_Y are the standard deviations of X and Y, respectively; and n is the number of data points.
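Equation (1) can be evaluated from a handful of running sums, so gateways only need to exchange a few scalars per feature rather than the raw series. The sketch below is our own illustration of this idea (the protocol details in the paper are more elaborate); note that the cross-term Σxy requires the paired, time-aligned observations to meet at one gateway:

```python
import math

def pearson_from_sums(n, sx, sy, sxx, syy, sxy):
    """Pearson r from the five running sums of paired observations:
    sx=Σx, sy=Σy, sxx=Σx², syy=Σy², sxy=Σxy."""
    mean_x, mean_y = sx / n, sy / n
    var_x = sxx / n - mean_x ** 2
    var_y = syy / n - mean_y ** 2
    cov = sxy / n - mean_x * mean_y
    return cov / math.sqrt(var_x * var_y)

x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 8.1]   # roughly 2*x, so r should be close to 1
sums = (len(x), sum(x), sum(y),
        sum(v * v for v in x), sum(v * v for v in y),
        sum(a * b for a, b in zip(x, y)))
print(round(pearson_from_sums(*sums), 4))
```

Exchanging only these five scalars per candidate feature keeps the communication overhead of the screening stage low, which is the design goal stated above.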
3.3. KNN with Iterative PCA for Synthetic Data Generation
3.3.1. Dataset Preparation
Following the data fusion process described in Section 3.1 and illustrated in Figure 2, features of the same type collected from multiple gateways are combined to form a fused dataset. An initial summary statistics-based correlation screening, described in Section 3.2, is then applied to determine whether a statistically meaningful correlation exists between the missing feature and candidate features obtained from external gateways.
Subsequently, a second stage of Pearson correlation estimation is performed on the fused dataset to identify the most relevant features for synthetic data generation. Only the top four features exhibiting the highest correlation with the missing feature are selected, balancing imputation accuracy and computational efficiency.
For simplicity, let us consider the fused dataset consisting of the N features fused from the outer gateways and the missing feature M for which synthetic data are to be generated. Let X be the set of all N features, and let C_jM denote the correlation coefficient between feature j and M. The Pearson correlation coefficient, as in Equation (1), is used to estimate the pairwise correlation between the features; the correlation coefficient for each feature pair is calculated as shown in Equation (2):

$$ C_{jM} = r(f_j, f_M), \quad j = 1, 2, \dots, N. \quad (2) $$

After the correlation coefficients of all features with M are calculated, the second step selects the top four features that exhibit the highest correlations, i.e., the features with the strongest linear associations with M. This selection can be represented as Equation (3):

$$ S = \{\, j \neq M : C_{jM} \text{ is among the four largest coefficients} \,\}. \quad (3) $$

The selected features from Equation (3) form the set used to generate synthetic data. The arg max over j finds the indices of the four highest correlation coefficients C_jM, considering only features other than M; in other words, feature M is excluded from this selection process.
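The top-four selection step can be sketched in a few lines of Python (our illustration; ranking by absolute correlation is an assumption, since strongly negative correlations are also informative for imputation):

```python
def top_k_features(corr, k=4):
    """Given {feature_name: correlation with the missing feature M},
    return the k feature names with the largest absolute correlation."""
    return sorted(corr, key=lambda f: abs(corr[f]), reverse=True)[:k]

corr = {"temperature": 0.95, "dew_point": 0.80, "humidity": -0.62,
        "wind_speed": 0.10, "cloud_cover": -0.45, "visibility": 0.30}
print(top_k_features(corr))   # four strongest linear associations
```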
3.3.2. Synthetic Data Generation Approach
After selecting the highly correlated features, the resulting fused dataset is used to generate synthetic data for the missing feature. The complete imputation workflow is summarized in Algorithm 1, which integrates cross-location data fusion, correlation-based feature selection, and a two-stage imputation process consisting of KNN initialization followed by Iterative Principal Component Analysis (I-PCA).
Let X denote the fused dataset containing both observed and missing values. Prior to imputation, feature selection is performed using Pearson correlation, where only features that satisfy a statistical significance threshold (p < 0.05) are retained. This step ensures that only strongly related features contribute to the imputation process and provides a clear link between the correlation estimation stage and the subsequent imputation pipeline.
In the first imputation stage, missing values are initially estimated using the K-Nearest Neighbors (KNN) method, which imputes each missing entry based on the values of its nearest neighbors in the feature space. This step produces a complete but coarse estimate of the dataset, denoted as XKNN, which serves as the initialization for the iterative refinement stage. KNN is selected for initialization due to its robustness, simplicity, and ability to preserve local similarity structures in the data.
Following the initial KNN imputation, an Iterative PCA-based imputation process is applied to refine only the originally missing values. At each iteration t, PCA decomposes the current dataset estimate X_t into a low-dimensional representation X_t ≈ Y_t ⋅ Z_t, where Y_t contains the principal component scores and Z_t represents the corresponding loadings. The reconstructed matrix Y_t ⋅ Z_t is then used to update only the missing entries, while the observed values are kept fixed using a binary mask matrix. This constraint prevents drift in the observed data and limits uncontrolled error propagation.
The iterative process continues until convergence is achieved, defined as the relative change in reconstructed values between successive iterations falling below a predefined tolerance (ε = 10⁻⁴). These criteria ensure numerical stability and reproducibility of the imputation results.
Sensitivity to initialization is acknowledged, as Iterative PCA can be influenced by the quality of the initial imputation. Inaccurate initial estimates may propagate errors during iterative refinement.
However, in the proposed framework, this risk is mitigated in three ways:
(i) KNN initialization leverages local neighborhood similarity rather than global assumptions;
(ii) Only features with statistically significant correlation are included in the fused dataset, reducing noise;
(iii) Observed values remain fixed throughout the iterative process, preventing error accumulation from affecting known data.
Algorithm 1. Imputing missing values using KNN-based Iterative PCA

Require: the fused dataset X after initial KNN imputation.
Ensure: the imputed dataset with missing values filled in.
 1: Initialize X₀ with the initial imputed values from KNN.
 2: Set the iteration counter t ← 0, ε ← 10⁻⁴, and max_iter ← 50.
 3: repeat
 4:   Compute Y_t and Z_t from the current estimate X_t via PCA decomposition:
 5:     Y_t, Z_t ← PCA_Decomposition(X_t)
 6:   Compute the residuals between the observed values in X and the values reconstructed from Y_t and Z_t:
 7:     E ← X − Y_t ⋅ Z_t
 8:   Update X_t: replace the missing entries with the corresponding values of Y_t ⋅ Z_t while keeping the observed entries at their values in X:
 9:     X_{t+1} ← (¬M) ⋅ (Y_t ⋅ Z_t) + M ⋅ X   ▷ apply the mask matrix M (M_ij = 1 if observed)
10:   t ← t + 1
11: until ‖E‖ < ε or t ≥ max_iter
12: return X_t
Here, let xi be the ith row of X and xij the jth element of xi. Let N be the number of rows in X and p the number of features in X. Let Y and Z form the PCA decomposition of X such that X ≈ Y ⋅ Z, where Y is an N × k matrix of the k principal component scores of X and Z is a k × p matrix of the corresponding loadings; yij is the jth element of the ith row of Y, and zij is the jth element of the ith row of Z. Let M be a binary mask matrix of the same dimensions as X, where Mij = 1 if xij is observed and 0 if it is missing. Let Xt be the imputed dataset at iteration t.
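A minimal NumPy sketch of this refinement loop follows; for brevity, a column-mean fill stands in for the KNN initialization, a truncated SVD serves as the PCA decomposition, and the rank k, tolerance, and iteration cap are illustrative assumptions rather than the paper's exact settings:

```python
import numpy as np

def iterative_pca_impute(X, k=2, eps=1e-4, max_iter=50):
    """Refine missing entries (np.nan) of X by repeated rank-k PCA
    reconstruction; observed entries stay fixed via the mask, as in Algorithm 1.
    A column-mean fill stands in here for the KNN initialization step."""
    mask = ~np.isnan(X)                                  # M: True where observed
    Xt = X.copy()
    col_means = np.nanmean(X, axis=0)
    Xt[~mask] = np.take(col_means, np.where(~mask)[1])   # coarse initialization
    for _ in range(max_iter):
        mu = Xt.mean(axis=0)
        U, s, Vt = np.linalg.svd(Xt - mu, full_matrices=False)
        recon = U[:, :k] @ np.diag(s[:k]) @ Vt[:k] + mu  # rank-k Y_t · Z_t (+ mean)
        Xnew = np.where(mask, X, recon)                  # observed fixed, missing updated
        if np.max(np.abs(Xnew - Xt)) < eps:              # convergence on update size
            return Xnew
        Xt = Xnew
    return Xt

rng = np.random.default_rng(0)
t = rng.normal(size=(40, 1))
data = np.hstack([t, 2 * t, -t]) + 0.01 * rng.normal(size=(40, 3))
data[5, 1] = np.nan                                      # remove one correlated entry
filled = iterative_pca_impute(data, k=1)
print(round(filled[5, 1], 3), "vs true", round(2 * t[5, 0], 3))
```

Because the mask reinstates the observed entries after every reconstruction, errors from a poor initialization can only affect the originally missing cells, mirroring the drift-prevention argument above.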
3.3.3. Evaluation
The evaluation is based on applying the proposed synthetic data generation algorithm to both the proposed fused datasets and the unfused datasets. To simplify the presentation of the evaluation method, let us consider a dataset D, which can be either fused or unfused, with R rows and C columns and containing missing values. drc (1 ≤ r ≤ R, 1 ≤ c ≤ C) is the value in the r-th row and c-th column of D. Similarly, D̂ denotes the imputed version of D, and d̂rc (1 ≤ r ≤ R, 1 ≤ c ≤ C) is the value in the r-th row and c-th column of D̂. If drc is not a missing value, then d̂rc = drc holds. Thus, the goal is to minimize the difference between d̂rc and drc. We use the root-mean-square error (RMSE) over the originally missing positions to describe this error, as shown in Equation (4):

$$ \mathrm{RMSE} = \sqrt{\frac{1}{|\Omega|} \sum_{(r,c) \in \Omega} \left(\hat{d}_{rc} - d_{rc}\right)^2 }, \quad (4) $$

where Ω denotes the set of originally missing positions for which ground-truth values are available.
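The RMSE of Equation (4) restricted to the artificially removed cells can be computed as in the following sketch (our illustration; the coordinate-list representation of Ω is an assumption):

```python
import math

def rmse_on_missing(original, imputed, missing_positions):
    """RMSE between ground-truth and imputed values, computed only over the
    cells that were removed before imputation (the set Ω in Equation (4))."""
    errs = [(imputed[r][c] - original[r][c]) ** 2 for r, c in missing_positions]
    return math.sqrt(sum(errs) / len(errs))

truth   = [[1.0, 2.0], [3.0, 4.0]]
imputed = [[1.0, 2.5], [3.0, 3.0]]
print(rmse_on_missing(truth, imputed, [(0, 1), (1, 1)]))
```

Restricting the sum to Ω matters: since observed entries satisfy d̂rc = drc by construction, including them would only dilute the error estimate.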
The performance of the data fusion method and the synthetic data generation method are both compared using RMSE values. Synthetic data are generated using the proposed KNN with Iterative PCA approach, and its errors are obtained on both types of datasets to compare the efficiency of the data fusion methods. Further, RMSE values are evaluated for different synthetic data generation algorithms to compare the efficiency of the proposed algorithm.
4. Dataset and Results
To evaluate the effectiveness of the proposed data fusion method against a conventional unfused approach, two types of datasets are considered, both designed to facilitate synthetic data generation for the same target feature. Specifically, for a given missing feature F, we construct (i) a conventional fusion dataset, where imputation relies solely on features collected from the same target location, and (ii) a proposed fusion dataset, where features of the same type are fused from geographically distributed gateways. These two dataset configurations enable a direct and fair comparison of imputation performance for identical missing features under different fusion strategies. The proposed KNN with Iterative PCA imputation method is applied consistently across both dataset types.
The experimental evaluation is conducted using real-world weather monitoring data collected from eight geographically diverse locations across the United States, namely Colorado, California, Arizona, Las Vegas, Washington, Salt Lake City, Texas, and Oregon. These locations exhibit distinct climatic patterns even within the same time period, providing natural variability that is well-suited for assessing cross-location correlation and fusion effectiveness. The datasets were obtained independently for each location and cover a three-month period (from February to April 2023).
While the evaluation focuses on weather sensor data, which represents a common and well-studied IoT application with continuous temporal characteristics, the proposed data fusion and imputation framework is not limited to this domain. Weather data are used as a representative testbed due to their availability, temporal structure, and established use in imputation studies. Extending the evaluation to other IoT domains such as industrial monitoring or healthcare sensing is an important direction for future work.
4.1. Dataset Before Fusion (DS1)
Table 1 and Table 2 show a glimpse of the conventional-style fused dataset under consideration. Similar datasets exist for the other states as well. As observed in these tables, each dataset comprises a total of nine features—temperature, feels-like temperature, dew point, humidity, wind gust, wind speed, wind direction, cloud cover, and visibility—obtained from the same target locations, i.e., Arizona and Las Vegas, respectively. All eight weather station datasets share an identical number of features and the same feature types. In aggregate, 93 data points are available for each feature. These states were selected because of the noticeable variations in weather conditions across them.
If any one of these features is missing, the remaining features from the same location are used to generate synthetic data; this constitutes the conventional data fusion technique for imputing missing values.
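As an illustration, the DS1-style configuration can be sketched as follows. This is a minimal NumPy sketch under our assumptions, not the authors' code: the feature names and the toy data are hypothetical, and only three of the nine features are shown.

```python
import numpy as np

def build_conventional_dataset(station, target):
    """DS1-style table for one location: the target feature as the
    column to be imputed, and the remaining same-location features
    as predictors. `station` maps feature name -> 1-D array of equal
    length (93 readings in the paper's setting)."""
    predictors = {f: v for f, v in station.items() if f != target}
    X = np.column_stack(list(predictors.values()))  # same-location features
    y = station[target]                             # feature to be imputed
    return X, y, list(predictors)

# Hypothetical toy station with three of the nine features.
rng = np.random.default_rng(0)
station = {
    "temperature": rng.normal(20, 5, 93),
    "humidity":    rng.uniform(10, 90, 93),
    "wind_speed":  rng.uniform(0, 15, 93),
}
X, y, names = build_conventional_dataset(station, target="temperature")
print(X.shape, y.shape, names)  # (93, 2) (93,) ['humidity', 'wind_speed']
```

When the target column is missing, the predictor matrix X supplies the information used to generate its synthetic replacement.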
4.2. Dataset After Fusion (DS2)
Table 3 and
Table 4 show a glimpse of the proposed fused dataset for “feels like” and “average temperature”, respectively. For example, from
Table 3, whenever the “feels like” feature is missing from, say, the Arizona gateway, the proposed fusion approach gathers the “feels like” feature from the gateways at the other locations. Similarly,
Table 4 shows the fused dataset for average temperature.
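The DS2-style configuration can be sketched analogously. Again, this is a minimal NumPy sketch with hypothetical data: a shared regional signal plus per-gateway noise stands in for the real multi-location measurements, and only four of the eight locations are shown.

```python
import numpy as np

def build_fused_dataset(gateways, feature, target_location):
    """DS2-style table for one feature: the target location's column
    (to be imputed) alongside the same feature collected from every
    other gateway. `gateways` maps location -> {feature -> 1-D array}."""
    others = {loc: g[feature] for loc, g in gateways.items()
              if loc != target_location}
    X = np.column_stack(list(others.values()))  # same feature, other gateways
    y = gateways[target_location][feature]      # column to be imputed
    return X, y, list(others)

# Hypothetical data: a shared regional signal plus local noise.
rng = np.random.default_rng(1)
base = rng.normal(25, 4, 93)
gateways = {loc: {"feels_like": base + rng.normal(0, 1, 93)}
            for loc in ["Arizona", "Las Vegas", "Texas", "Oregon"]}
X, y, locs = build_fused_dataset(gateways, "feels_like", "Arizona")
print(X.shape, locs)  # (93, 3) ['Las Vegas', 'Texas', 'Oregon']
```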
4.3. Experiments
We performed the experiments in two parts: (1) evaluating the effectiveness of the proposed data fusion technique and (2) comparing KNN + Iterative PCA against other imputation approaches. The effectiveness of KNN + Iterative PCA is demonstrated first, and the data fusion techniques are then compared using KNN + Iterative PCA as the common imputation method.
4.4. Correlation Statistics Analysis
For both DS1 and DS2, we obtained the top-four and top-five correlations for each feature, as shown in
Figure 3 and
Figure 4. In both figures, the correlation between “temperature” and “feels like” is the highest. Conversely, “wind direction” has the lowest correlation in both datasets, while “humidity” shows a moderate correlation in both. The quality of the generated synthetic data is assessed through these correlations: a higher correlation indicates more effective synthetic data generation under the corresponding fusion method. The conventional data fusion method shows higher correlations than the proposed method for most features, with the exceptions of the “visibility”, “humidity”, and “feels like” features, for which the proposed method is higher. For the “cloud cover”, “wind direction”, and “dew” features, the conventional method shows significantly higher correlation values; for the remaining features, the two methods are almost on par.
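The correlation screening step can be sketched as follows: for a given target feature, rank all other columns by absolute Pearson correlation and keep the top k (four or five in our experiments). This is a minimal NumPy sketch with hypothetical feature names and synthetic toy data.

```python
import numpy as np

def top_k_correlated(data, target, k=4):
    """Rank features by |Pearson r| against `target` and return the
    top k as (name, r) pairs. `data` maps feature name -> 1-D array."""
    t = data[target]
    scores = {}
    for name, col in data.items():
        if name == target:
            continue
        scores[name] = np.corrcoef(t, col)[0, 1]
    ranked = sorted(scores, key=lambda n: abs(scores[n]), reverse=True)
    return [(n, round(scores[n], 3)) for n in ranked[:k]]

# Hypothetical toy data mirroring the observed correlation pattern:
# "feels like" strongly, "humidity" moderately, "wind dir" weakly correlated.
rng = np.random.default_rng(2)
temp = rng.normal(20, 5, 93)
data = {
    "temperature": temp,
    "feels_like":  temp + rng.normal(0, 0.5, 93),
    "humidity":    -0.5 * temp + rng.normal(0, 4, 93),
    "wind_dir":    rng.uniform(0, 360, 93),
}
top = top_k_correlated(data, "temperature", k=2)
print(top)
```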
4.5. KNN + Iterative PCA with Other Approaches
To evaluate the effectiveness of the proposed imputation strategy, we compared the KNN with Iterative PCA method against a set of widely used statistical and classical machine learning-based imputation techniques on the proposed fused dataset (DS2). The evaluated methods include machine learning approaches such as KNN, Random Forest (RF), RF + Iterative PCA, Decision Tree (DT), and DT + Iterative PCA, as well as statistical techniques, including Mean Imputation, PMF, MICE, and MICE + Iterative PCA. More complex deep learning-based imputation models (e.g., DNNs, GANs, and transformer-based approaches) were not considered, as they typically require large-scale training data, extensive hyperparameter tuning, and substantial computational resources, which are not well-suited to the moderate dataset size and the data-scarce conditions representative of the real-world IoT deployments considered in this study.
In our study, across all evaluated features, the imputation performance of the compared methods exhibits a clear dependence on the degree of cross-location feature correlation. For the “temperature” feature, which shows the highest correlation across locations, the proposed KNN with Iterative PCA method consistently achieves the lowest RMSE values (
Figure 5). This indicates that when strong multivariate relationships exist, the combined use of neighborhood-based initialization and covariance-aware refinement effectively captures both local and global data structures.
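The two-stage mechanism described above can be sketched as follows. This is our reading of the KNN + Iterative PCA idea as a minimal NumPy sketch, not the authors' implementation: missing cells are first filled from the k nearest rows, then refined by repeatedly replacing them with a low-rank SVD reconstruction; the hyperparameters (k, rank, tolerance) are illustrative.

```python
import numpy as np

def knn_iterative_pca_impute(X, k=5, rank=2, iters=50):
    """Fill np.nan entries of 2-D array X: KNN initialisation followed
    by iterative low-rank (PCA/SVD) refinement of the missing cells."""
    X = X.astype(float).copy()
    miss = np.isnan(X)
    col_mean = np.nanmean(X, axis=0)

    # KNN initialisation: average the k rows closest on the observed
    # columns (rows with their own gaps contribute partial distances).
    for i in np.where(miss.any(axis=1))[0]:
        obs = ~miss[i]
        d = np.nansum((X[:, obs] - X[i, obs]) ** 2, axis=1)
        d[i] = np.inf
        nbrs = np.argsort(d)[:k]
        for j in np.where(miss[i])[0]:
            vals = X[nbrs, j]
            vals = vals[~np.isnan(vals)]
            X[i, j] = vals.mean() if vals.size else col_mean[j]

    # Iterative PCA refinement: re-estimate the missing cells from a
    # rank-`rank` reconstruction until the fill stops changing.
    for _ in range(iters):
        mu = X.mean(axis=0)
        U, s, Vt = np.linalg.svd(X - mu, full_matrices=False)
        low = (U[:, :rank] * s[:rank]) @ Vt[:rank] + mu
        new = np.where(miss, low, X)
        if np.max(np.abs(new - X)) < 1e-6:
            return new
        X = new
    return X

# Hypothetical low-rank toy data with 10% of entries masked.
rng = np.random.default_rng(3)
true = rng.normal(0, 1, (93, 4)) @ rng.normal(0, 1, (4, 9))
mask = rng.random(true.shape) < 0.10
Xm = true.copy()
Xm[mask] = np.nan
filled = knn_iterative_pca_impute(Xm, k=5, rank=4)
rmse = np.sqrt(np.mean((filled[mask] - true[mask]) ** 2))
print(f"imputation RMSE on masked cells: {rmse:.3f}")
```

The KNN stage exploits local row similarity, while the SVD stage enforces the global covariance structure, matching the local/global decomposition discussed above.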
For the “humidity” feature, which exhibits moderate correlation, the proposed method continues to outperform or closely match alternative approaches (
Figure 6). Although the overall RMSE values are higher than those observed for temperature, the performance gap between KNN + Iterative PCA and other methods remains evident, demonstrating robustness under reduced correlation strength.
In contrast, the “wind direction” feature exhibits weak cross-location correlation, leading to increased RMSE values for all evaluated methods (
Figure 7). Under these conditions, the advantage of cross-location fusion is diminished, and the proposed method achieves performance comparable to, rather than substantially better than, conventional approaches. This behavior highlights an inherent limitation of correlation-driven fusion methods and underscores the importance of correlation strength in determining imputation accuracy.
Overall, the results demonstrate that the proposed approach provides consistent improvements when cross-location correlations are strong or moderate, while maintaining competitive performance in low-correlation scenarios. These findings confirm that correlation-aware fusion combined with KNN-Iterative PCA imputation is effective under favorable conditions and degrades gracefully when correlation is weak, rather than introducing instability or excessive error.
4.6. Comparison of Proposed Data Fusion Methods with Other Fusion Methods
The comparative evaluation of the proposed cross-location data fusion method (DS2) and the conventional fusion approach (DS1) reveals a consistent relationship between feature correlation strength and imputation accuracy. To provide a structured analysis, the evaluated features can be broadly categorized into high-, medium-, and low-correlation groups based on their cross-location Pearson correlation values.
To simulate realistic missing-data scenarios under limited data availability, missing values are introduced at 10% and 20% levels. Higher missing rates are not considered due to the moderate size of the available datasets, as excessive removal of data would reduce statistical reliability and distort correlation estimation. For higher missing percentages (20%), a larger set of correlated features (top five instead of top four) is used to compensate for increased information loss.
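The masking protocol can be sketched as follows: a fixed fraction of one feature column is hidden, an imputer fills the gaps, and RMSE is computed on the hidden cells only. This is a minimal NumPy sketch; the `mean_impute` baseline and the toy temperature series are illustrative stand-ins for the methods and data compared in the figures.

```python
import numpy as np

def evaluate_at_missing_rate(col, rate, impute_fn, seed=0):
    """Mask a fraction `rate` of `col`, impute with `impute_fn`
    (column-in, column-out), and return RMSE on the masked cells."""
    rng = np.random.default_rng(seed)
    masked = col.astype(float).copy()
    idx = rng.choice(col.size, size=int(rate * col.size), replace=False)
    masked[idx] = np.nan
    filled = impute_fn(masked)
    return np.sqrt(np.mean((filled[idx] - col[idx]) ** 2))

def mean_impute(x):  # simplest baseline for illustration
    out = x.copy()
    out[np.isnan(out)] = np.nanmean(out)
    return out

rng = np.random.default_rng(4)
temperature = rng.normal(20, 5, 93)
for rate in (0.10, 0.20):  # the two missing levels used in the paper
    rmse = evaluate_at_missing_rate(temperature, rate, mean_impute)
    print(f"{int(rate * 100)}% missing -> RMSE {rmse:.2f}")
```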
Features such as “temperature” (
Figure 8 and
Figure 9) and “feels like” (
Figure 10 and
Figure 11) exhibit high cross-location correlation, reflecting strong and consistent relationships across geographically distributed weather stations. For these features, the proposed data fusion method consistently achieves lower RMSE values than the conventional approach under both 10% and 20% missing-data scenarios. This behavior indicates that fusing correlated external features effectively enhances the available information for imputation, allowing the KNN with Iterative PCA method to accurately capture both local similarity and global covariance structure.
Features including “humidity” (
Figure 12 and
Figure 13), “wind gust” (
Figure 14 and
Figure 15) and “wind speed” (
Figure 16 and
Figure 17) demonstrate moderate correlation across locations. For this group, the proposed fusion method yields either improved or comparable RMSE values relative to the conventional approach. Although the performance gains are less pronounced than in the high-correlation case, the results show that cross-location fusion remains beneficial when partial but meaningful correlations exist.
In contrast, features such as “wind direction” (
Figure 18 and
Figure 19), “dew” (
Figure 20 and
Figure 21), “visibility” (
Figure 22 and
Figure 23) and “cloud cover” (
Figure 24 and
Figure 25) exhibit low or highly variable cross-location correlation, largely due to their strong dependence on localized meteorological conditions. For these features, the advantage of cross-location fusion is diminished. In some cases, such as “dew” and “cloud cover”, the proposed method produces higher RMSE values than the conventional approach, indicating that incorporating weakly correlated external data can introduce noise rather than useful information. Nevertheless, the proposed framework does not exhibit instability and maintains performance comparable to baseline methods for low-correlation features.
Overall, this analysis confirms that the effectiveness of the proposed cross-location data fusion approach is strongly governed by the strength of inter-location feature correlation. The method performs best for highly correlated features, provides robust performance for moderately correlated features, and degrades gracefully for weakly correlated features, thereby validating the design choice of correlation-aware feature selection in the proposed framework.
5. Conclusions and Limitations
This study evaluated a correlation-aware cross-location data fusion framework for generating synthetic data in IoT systems experiencing sensor or gateway failures. Using real-world weather data collected from eight geographically diverse locations across the United States, the proposed approach was compared with conventional same-location fusion strategies under controlled missing-data scenarios. The results demonstrate that the proposed method achieves improved or comparable imputation accuracy when sufficient cross-location feature correlation exists and remains stable when correlation is weak.
A detailed feature-level analysis shows that the effectiveness of the proposed framework is strongly governed by inter-location correlation strength. Highly correlated features benefit most from cross-location fusion, moderately correlated features show robust and consistent performance, and weakly correlated features exhibit limited improvement or performance comparable to conventional fusion methods. These observations highlight both the strengths and inherent limitations of correlation-driven fusion and underscore the importance of correlation-aware feature selection.
The proposed KNN with Iterative PCA imputation method consistently outperforms baseline statistical approaches such as Mean Imputation and PMF and demonstrates robust behavior across varying correlation levels. Advanced deep learning-based imputation methods (e.g., DNNs, GANs, and transformer-based models) were not considered in this work due to their reliance on large training datasets, extensive hyperparameter tuning, and high computational demands. Given the moderate dataset size and intermittent data availability typical of real-world IoT deployments, such models may suffer from overfitting and unstable training, making lightweight and interpretable approaches more suitable in this context.
From a systems perspective, the proposed fusion framework reduces reliance on redundant sensor deployment by selectively fusing correlated features from external IoT networks. While the current implementation prioritizes imputation accuracy, theoretical analyses of convergence, computational complexity, and scalability have not been formally derived and represent an important limitation of this study. Similarly, practical considerations such as energy consumption, communication overhead, and latency introduced by cross-network data exchange are not explicitly quantified. Although the framework mitigates overhead by exchanging only summary statistics during correlation screening and transmitting full data only when a significant correlation is detected, a comprehensive system-level evaluation remains for future work.
The experimental evaluation is further limited to weather monitoring data and missing data rates of 10% and 20%, which were selected to preserve statistical reliability given the limited dataset size. Higher missing rates, large-scale deployments, and highly dynamic or non-stationary environments were not explored. Future research will focus on formal convergence and complexity analysis, large-scale scalability studies, energy- and latency-aware fusion strategies, and validation across diverse IoT domains such as industrial monitoring and healthcare sensing.
Overall, this work demonstrates that correlation-aware cross-location data fusion combined with lightweight imputation techniques provides a practical and effective solution for handling missing data in resource-constrained IoT environments while also identifying clear directions for advancing theoretical rigor and real-world applicability.