Article

A Hybrid LSTM–Attention Model for Multivariate Time Series Imputation: Evaluation on Environmental Datasets

School of Computing, Engineering & Digital Technologies, Teesside University, Middlesbrough TS1 3BX, UK
Authors to whom correspondence should be addressed.
These authors contributed equally to this work.
Mach. Learn. Knowl. Extr. 2026, 8(1), 18; https://doi.org/10.3390/make8010018
Submission received: 19 November 2025 / Revised: 20 December 2025 / Accepted: 8 January 2026 / Published: 12 January 2026
(This article belongs to the Section Learning)

Abstract

Environmental monitoring systems generate large volumes of multivariate time series data from heterogeneous sensors, including those measuring soil, weather, and air quality parameters. However, sensor malfunctions and transmission failures frequently lead to missing values, compromising the performance of downstream analytical and predictive models. To address this challenge, this study presents a comprehensive and systematic evaluation of a previously proposed hybrid architecture that interleaves Long Short-Term Memory (LSTM) layers with a Multi-Head Attention mechanism in a “sandwiched” setting (LSTM–Attention–LSTM) for robust multivariate data imputation in environmental IoT datasets. The first LSTM layer captures short-term temporal dependencies, the attention layer emphasises long-range relationships among correlated features, and the second LSTM layer re-integrates these enriched representations into a coherent temporal sequence. The model is evaluated using multiple environmental datasets of soil temperature, meteorological (precipitation, temperature, wind speed, humidity), and air quality data across missingness levels ranging from 10% to 90%. Performance is compared against baseline methods, including K-Nearest Neighbour (KNN) and Bidirectional Recurrent Imputation for Time Series (BRITS). Across all datasets, the Hybrid model consistently outperforms baseline methods, achieving MAE reductions exceeding 50% and reaching over 80% in several scenarios, along with RMSE reductions of up to approximately 85%, particularly under moderate to high missingness conditions. An ablation study further examines the contribution of each layer to overall model performance. Results demonstrate that the proposed Hybrid model achieves superior accuracy and robustness across datasets, confirming its effectiveness for environmental sensor data imputation under varying missing data conditions.


1. Introduction

The rapid expansion of machine learning (ML) applications has dramatically increased the demand for large-scale, high-quality datasets that enable reliable and generalizable model training. The growing accessibility of cost-effective and scalable sensing technologies, such as Internet of Things (IoT) devices, remote sensing nodes, and wireless sensor networks, has transformed how data are collected across domains. These systems continuously generate time-stamped data in areas ranging from healthcare monitoring and environmental systems to industrial automation and urban infrastructure, thereby providing a rich source of information for predictive modelling and decision support. However, this exponential increase in data acquisition also amplifies challenges related to data completeness, reliability, and consistency, which directly affect the robustness of ML-driven insights. In particular, data missingness has emerged as one of the most prevalent and detrimental issues, often arising from sensor drift, calibration errors, power failures, communication disruptions, or environmental interference.
Missing or incomplete data is not confined to any single domain but appears across virtually all data-driven fields. It is particularly common in healthcare systems [1,2,3], Geographic Information Systems (GIS) [4], transportation and traffic management [5], environmental surveillance networks [6,7], and industrial process monitoring [8]. In these contexts, data gaps can lead to unreliable trend detection, erroneous forecasts, and reduced interpretability of analytical models. Furthermore, incomplete datasets can introduce statistical bias, distort model parameters, and reduce predictive accuracy when used in training or validation stages [9,10]. These challenges are magnified in time-dependent domains, where data continuity and temporal correlations are essential for learning meaningful representations.
To mitigate the effects of missingness, two general strategies have traditionally been adopted: downsampling and imputation [11]. Downsampling eliminates incomplete observations entirely, simplifying analysis but at the cost of discarding potentially valuable information. This approach can be especially harmful in time series analysis, where removing data points disrupts temporal dependencies that are crucial for forecasting and anomaly detection [7]. In contrast, imputation methods attempt to estimate or reconstruct missing values using statistical inference, probabilistic reasoning, or machine-learning-based modelling [12]. The effectiveness of an imputation approach depends not only on the chosen algorithm but also on the mechanism of missingness, which can be Missing Completely at Random (MCAR), Missing at Random (MAR), or Missing Not at Random (MNAR) [13,14,15]. Understanding these mechanisms is vital for selecting appropriate techniques that minimise bias and preserve underlying data distributions. Under MCAR, the probability that a value is missing does not depend on either observed or unobserved data. MAR allows the probability of missingness to depend on observed variables (e.g., network congestion, sensor battery level, or timestamp), which is common in IoT deployments where contextual factors influence data availability. MNAR applies when the probability of missingness depends on the unobserved value itself, such as sensors that fail more often during extreme high or low readings [7,15]. In many practical IoT environments, MAR or MNAR mechanisms are more plausible than MCAR. Nonetheless, MCAR remains useful for isolating the essential ability of an imputation model to reconstruct missing data and is widely used in the literature to report the performance of imputation models [6,16].
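To make the distinction between the three mechanisms concrete, the following minimal sketch generates missingness masks under each of them. It is an illustration only and not part of the study's pipeline: the function names, the `covariate` argument (standing in for an observed variable such as network load), and the rank-based drop probabilities are all assumptions introduced here.

```python
import numpy as np

def rank01(values):
    """Map values to evenly spaced ranks in [0, 1] (smallest -> 0, largest -> 1)."""
    ranks = np.argsort(np.argsort(values))
    return ranks / (len(values) - 1)

def make_missing_masks(target, covariate, rate=0.3, seed=0):
    """Return boolean masks (True = missing) under MCAR, MAR and MNAR.

    `covariate` is a hypothetical, always-observed variable such as network load.
    The rank-based probabilities are an illustrative choice and need rate <= 0.5.
    """
    target = np.asarray(target, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(target)

    # MCAR: the same drop probability everywhere, independent of any data.
    p_mcar = np.full(n, rate)
    # MAR: drop probability grows with an observed covariate.
    p_mar = 2 * rate * rank01(covariate)
    # MNAR: drop probability grows with how extreme the (unobserved) value is.
    p_mnar = 2 * rate * rank01(np.abs(target - np.median(target)))

    return {name: rng.random(n) < p
            for name, p in [("MCAR", p_mcar), ("MAR", p_mar), ("MNAR", p_mnar)]}
```

In all three cases the expected overall missing fraction equals `rate`; what differs is which points tend to be lost, which is why an imputation model that performs well under MCAR may still behave differently under MAR or MNAR.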
Traditional imputation techniques are often categorised into Single Imputation (SI) and Multiple Imputation (MI) frameworks. SI approaches, including mean substitution, median filling, and zero imputation, offer simplicity and computational efficiency but tend to underestimate data variability and uncertainty [13]. MI, introduced through iterative modelling, generates multiple plausible versions of the complete dataset, capturing the inherent uncertainty in the imputation process [17]. Classical statistical models such as regression, Expectation Maximisation (EM), and stochastic simulation have been widely used; however, their performance is typically limited when facing nonlinear or high-dimensional data [15,18]. Consequently, recent years have witnessed a paradigm shift towards machine learning and deep-learning-based solutions that can capture complex interdependencies among variables and temporal patterns in data.
A substantial body of research has benchmarked both conventional and learning-based imputation strategies across diverse applications. Kang [13] surveyed traditional clinical data imputation methods such as Last Observation Carried Forward (LOCF), regression, and EM, while subsequent studies explored data-driven techniques including K-Nearest Neighbour (KNN), Random Forests (RFs), Decision Trees (DTs), Support Vector Machines (SVMs), and ensemble learners [18]. Comparative analyses have demonstrated that model performance varies significantly with dataset characteristics and the type of missingness, suggesting there is no universal best method. For instance, Hegde et al. [19] observed that Probabilistic Principal Component Analysis (PPCA) outperformed Multiple Imputation by Chained Equations (MICE) in healthcare datasets with MCAR missingness, while Stekhoven et al. [20] showed that the non-parametric missForest algorithm achieved superior results on mixed-type biological datasets. Early neural-network-based methods also demonstrated potential advantages: Gupta et al. [21] reported better results using backpropagation networks than with mean or regression imputation, and Silva-Ramírez et al. [22] validated Multilayer Perceptrons (MLPs) for robust imputation across both simulated and real-world datasets. More advanced models, such as deep autoencoders and variational autoencoders (VAEs), have since emerged as powerful tools capable of reconstructing complex patterns of missingness in nonlinear, high-dimensional spaces [23,24].
With the rapid evolution of IoT infrastructures and real-time sensing systems, time series imputation has become a central research theme. Unlike cross-sectional data, time series are inherently sequential and temporally correlated, making traditional static imputation methods suboptimal. Decorte et al. [6] investigated the performance of recurrent models such as Multi-Directional RNNs (M-RNN) and Bidirectional Recurrent Imputation for Time Series (BRITS) in environmental monitoring data with simulated MCAR conditions. Maksims et al. [25] benchmarked multiple deep learning architectures, including attention-based autoencoders, LSTM variants, and adversarial RNNs, across five healthcare datasets. Okafor and Delaney [7] further demonstrated that variational autoencoder frameworks achieved superior reconstruction accuracy and improved calibration in greenhouse gas sensor networks when compared with traditional algorithms such as MICE, missForest, and KNN.
Recently, researchers have explored hybrid frameworks that integrate the strengths of deep sequential models with complementary algorithms. Examples include coupling Long Short-Term Memory (LSTM) networks with DTs [26], optimising learning with Genetic Algorithms [27], or employing Transfer Learning strategies, for example, TrAdaBoost, to handle domain adaptation challenges [28]. Moreover, advances in attention mechanisms pioneered by transformer architectures in natural language processing [29,30,31,32,33,34] have led to a surge of interest in attention-augmented imputation models. These models effectively learn context-aware temporal representations, capturing both short- and long-range dependencies within multivariate sequences.
Several studies that combine transformer mechanisms with other neural network architectures to solve complex problems inspired us to explore the imputation of missing data in environmental datasets. A closely related study by Dagtekin et al. supports the potential of this combination: integrating LSTM networks with self-attention mechanisms improved imputation performance and prediction accuracy for water quality data [16]. Similarly, Shang et al. combined LSTM, attention, and AdaBoost algorithms to impute missing data in a traffic flow dataset and reported promising results [35].
LSTM and attention hybrids are used extensively in NLP and healthcare, but they remain underexplored in environmental IoT monitoring. Moreover, many studies place the attention mechanism either before or after the LSTM layer(s). In this paper, we present a hybrid LSTM–Attention model for missing data imputation in environmental monitoring datasets. We present a novel “sandwiched” LSTM–Attention–LSTM design. This interleaving enables the first LSTM layer to capture local temporal dependencies, the attention mechanism to highlight long-range relationships, and the second LSTM to re-integrate these enriched features into a coherent sequential representation. This ordering was deliberately chosen to balance sequential memory retention with global context modelling. Many prior works train models on complete data and then test performance on missing data. In our study, we trained the proposed model and the baseline models on data with 30% missingness to align them with real-world scenarios. We evaluated imputation performance using MAE and RMSE across 10–90% missingness, which provides a broad robustness analysis from low to extreme missingness. The proposed model preserves accuracy under extreme missingness scenarios (70–80%), which is beneficial when IoT sensors fail in real-world deployments.
The subsequent section presents a detailed description of the proposed model, together with comprehensive information on the datasets and the methodology adopted in this study. Section 3 reports the experimental results and provides their corresponding interpretations. Section 4 discusses the key findings, their implications, and the limitations of the proposed approach, and also outlines potential directions for future research. Finally, Section 5 concludes the paper by summarising the main outcomes and discussions.

2. Materials and Methods

2.1. Dataset Description

To evaluate the performance and generalizability of the proposed Hybrid model, multiple environmental monitoring datasets were used. These datasets represent varied sensing scenarios, including hydrological, meteorological, and air quality measurements collected from field-deployed sensor networks. Each dataset contains multivariate time series data with varying temporal resolutions, enabling a comprehensive assessment of the model’s imputation capabilities. For each dataset, the target variable, which is the feature to be predicted, is identified alongside a set of correlated auxiliary parameters that serve as input features. The relationships among these variables are visualised using correlation matrices, which help to illustrate the degree of association between the target and auxiliary features. Detailed descriptions of each dataset are provided in the following subsections.

2.1.1. Soil Temperature Dataset

The soil temperature dataset (SoilTemp) is a subset of the dataset originally published by [6]. It comprises continuous time-series measurements of soil temperature recorded at three different vertical positions using the in-situ soil sensors proposed in [36]. The sensors sample every 15 min and report temperature in degrees Celsius. The time series spans 6 months of observations, from 19 April 2021 at 00:00:00 to 26 September 2021 at 23:45:00, providing a clearly defined study period.
The target feature in this dataset is the soil temperature measured 12 cm above ground level, named temp_top in the dataset, as it exhibited the greatest variability among the three temperature sensors. This variability makes it a suitable candidate for training and evaluating the proposed model, thereby enhancing the generalizability of the obtained results. Input features include the soil temperature at ground surface level (temp_mid), the subsurface soil temperature 10 cm below ground (temp_bot), and relative humidity. The heatmap in Figure 1 shows the correlations between the target and input features. It indicates a strong correlation between the three temperature measurements, suggesting they can be useful predictors for the target feature, whereas the correlation between the target feature and relative humidity is weak. Among the temperature sensors, the target feature correlates most strongly with temp_mid.

2.1.2. Meteorological Dataset

The meteorological dataset is collected from an automatic weather station operating in the monitoring region under the SuDS+ project in Stanley, Durham, UK. The dataset contains records of nine different weather parameters at an hourly temporal resolution. The time series spans 16 months of data recorded from 11 June 2024 to 14 October 2025 to capture temporal patterns.
The target feature predicted in this dataset is the average air temperature (tempAvg) measured in degrees Celsius, which is a key climatic parameter. Other predictive features include average humidity, wind direction, wind speed, dew point, wind gust, heat index, wind chill, and average rainfall in millimeters. Figure 2 shows the correlations among all these features.
As expected, temperature exhibits a negative correlation with humidity and a strong positive correlation with heat index, dew point, and wind chill parameters. The correlation matrix in Figure 2 highlights these relationships, which the proposed Hybrid model leverages to improve the temporal estimation and imputation of air temperature values.

2.1.3. Air Quality Dataset

The air quality dataset utilised in this study is derived from the publicly available dataset presented by [37], hosted on the U.S. Environmental Protection Agency (EPA) Environmental Dataset Gateway [38]. The dataset comprises measurements collected from low-cost gas sensors co-located with Federal Equivalent Method (FEM) reference monitors at an urban regulatory site in Denver, Colorado, USA. The monitoring period spanned approximately six months, from 8 September 2015 to 22 February 2016, with data aggregated at 15 min intervals.
Data from two gas sensors are considered for analysis: ozone (O3) and combined nitrogen dioxide/ozone (NO2/O3) concentrations, both expressed in parts per billion (ppb). The target features selected for ozone (O3) and combined nitrogen dioxide/ozone (NO2/O3) are Sensor1 and Sensor3, respectively. For both sensors, the corresponding temperature, relative humidity, and reference measurements from the FEM monitors were also obtained. The gas sensors demonstrated moderate to strong correlations with other auxiliary features (Figure 3).
Specifically, the correlation coefficients between O3 sensors and their reference exceeded 0.9, while those for NO2/O3 sensors ranged between 0.33 and 0.49. As illustrated in the correlation heatmap, the O3 sensors showed strong inter-sensor consistency and positive associations with temperature, alongside negative correlations with relative humidity. These relationships provide valuable contextual dependencies that the proposed Hybrid model can exploit for accurate imputation and prediction of missing air quality measurements.

2.2. Proposed Hybrid Model

The Hybrid model presented in this paper is a sequential neural network designed for time-series imputation tasks, using a combination of LSTM layers and a Multi-Head Attention mechanism to capture temporal dependencies and enhance feature representation. Figure 4 shows the overall architecture of the proposed model. The proposed architecture sandwiches the attention layer between the two LSTM layers instead of cascading the Attention and LSTM layers on top of each other. The model maps an input sequence $X_{1:T}$ to a scalar prediction $\hat{y}$ via a first LSTM encoder producing hidden states $H$. A multi-head self-attention module is then applied to $H$, returning attention outputs $A$. The concatenation $C$ of outputs from the first LSTM and Attention layers is passed through a second LSTM, which re-encodes $C$ to produce the final hidden state $s_T$. The output is obtained by a dense linear layer, and the network is trained by minimising the mean squared error (MSE) as the loss measure.
Before feeding the data into the model, missing values in the time series are reconstructed using linear interpolation to ensure temporal continuity. Given two observed data points $(t_1, x_{t_1})$ and $(t_2, x_{t_2})$, the interpolated value at time $t$, where $t_1 < t < t_2$, is computed as
$$x_t = x_{t_1} + (x_{t_2} - x_{t_1}) \cdot \frac{t - t_1}{t_2 - t_1}.$$
Linear interpolation was selected because environmental sensor measurements (e.g., temperature, humidity, gas concentrations) exhibit smooth short-term changes, making linear reconstruction an unbiased and stable choice. More complex interpolators, e.g., cubic splines, may distort short-term fluctuations or cyclical patterns present in meteorological and soil temperature data, whereas linear interpolation preserves the local trend and magnitude. Importantly, interpolation is used only for minimal gap filling required to prepare windowed input sequences; the primary imputation task is performed by the proposed Hybrid LSTM–Attention–LSTM model.
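The interpolation rule can be sketched in a few lines with NumPy. This is a minimal illustration rather than the study's preprocessing code; the function name and the use of NaN to mark missing entries are assumptions. Note that `np.interp` holds the first or last observed value for gaps at the edges of the series, a behaviour the text does not specify.

```python
import numpy as np

def linear_interpolate(values):
    """Fill NaNs by linear interpolation between the nearest observed neighbours."""
    values = np.asarray(values, dtype=float)
    t = np.arange(len(values))
    observed = ~np.isnan(values)
    # For t1 < t < t2, np.interp evaluates
    #   x_t1 + (x_t2 - x_t1) * (t - t1) / (t2 - t1);
    # before the first / after the last observation it holds the edge value.
    return np.interp(t, t[observed], values[observed])

series = [10.0, np.nan, np.nan, 16.0, np.nan, 20.0]
print(linear_interpolate(series))  # values 10, 12, 14, 16, 18, 20
```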
The interpolated data are subsequently normalised using Min-Max scaling to restrict all feature values within the range $[0, 1]$, which improves numerical stability and accelerates convergence during training. Each feature value $x_t$ is scaled as
$$x_t' = \frac{x_t - x_{\min}}{x_{\max} - x_{\min}},$$
where $x_t'$ is the scaled value, and $x_{\min}$ and $x_{\max}$ represent the minimum and maximum values of the feature, respectively.
Min-Max scaling is a standard practice in environmental deep learning and IoT sensor modelling, specifically in widely cited LSTM and attention-based architectures. Many state-of-the-art works employ Min-Max scaling for environmental or sensor-based time series because it improves gradient stability, reduces training time, prevents exploding/vanishing gradients in recurrent layers, and ensures consistent magnitude across heterogeneous sensor channels. Min-Max scaling therefore provides a stable, interpretable, and model-compatible preprocessing strategy across all datasets used in this study.
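A minimal sketch of the scaling step and its inverse, which is needed later to return predictions to the original units, is given below. The function names are hypothetical, and estimating the minimum and maximum on the training portion only is an assumption made here to avoid leaking test information; the text does not state which portion was used.

```python
import numpy as np

def fit_min_max(train):
    """Per-feature minimum and maximum, estimated on the training rows (assumption)."""
    train = np.asarray(train, dtype=float)
    return train.min(axis=0), train.max(axis=0)

def scale(x, x_min, x_max):
    """Map each feature into [0, 1]; a constant feature would divide by zero."""
    return (np.asarray(x, dtype=float) - x_min) / (x_max - x_min)

def unscale(x_scaled, x_min, x_max):
    """Inverse transform back to the original units."""
    return np.asarray(x_scaled, dtype=float) * (x_max - x_min) + x_min

train = np.array([[10.0, 40.0], [20.0, 80.0], [15.0, 60.0]])  # toy values
x_min, x_max = fit_min_max(train)
print(scale(train, x_min, x_max))  # rows: [0, 0], [1, 1], [0.5, 0.5]
```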
To capture sequential dependencies, feature engineering is applied using a sliding window technique with a window size of 30, denoted as $T$ and represented by the parameter window_size in this paper. Each input sample at time step $t$ is expressed as
$$X = [x_1, x_2, \ldots, x_T],$$
where $x_i \in \mathbb{R}^d$, with $d = 1$ for univariate data and $d > 1$ for multivariate cases. This preprocessing pipeline ensures that the input sequences provided to the model are both complete and properly scaled, enabling effective learning of temporal relationships across multiple correlated features.
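The windowing step can be sketched as follows. The function name is hypothetical, and pairing each window with the target value at the step immediately after it is an assumption; the text fixes only the window length of 30.

```python
import numpy as np

def make_windows(features, target, window_size=30):
    """Build (samples, window_size, d) inputs and one target value per window."""
    features = np.asarray(features, dtype=float)
    target = np.asarray(target, dtype=float)
    X, y = [], []
    for end in range(window_size, len(features)):
        X.append(features[end - window_size:end])  # T consecutive time steps
        y.append(target[end])                      # step right after the window (assumption)
    return np.stack(X), np.array(y)

features = np.arange(400, dtype=float).reshape(100, 4)  # toy data: 100 steps, d = 4
X, y = make_windows(features, features[:, 0])
print(X.shape, y.shape)  # (70, 30, 4) (70,)
```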

2.2.1. First LSTM Layer (LSTM1)

The preprocessed input is then passed to the first LSTM layer to learn long-term dependencies in sequential data [39]. It utilises 64 hidden units to learn local temporal dependencies and transforms the input sequence into a sequence of hidden states:
$$h_t^{(1)} = \mathrm{LSTM}_1\left(x_t, h_{t-1}^{(1)}, C_{t-1}^{(1)}\right).$$
In our implementation, the standard hyperbolic tangent activation functions used in the candidate and output computations of the LSTM were replaced by ReLU to enhance gradient propagation and model stability. The overall LSTM formulation and gating mechanism remain consistent with the original definition by [39]. Accordingly, the operations for the forget gate, input gate, candidate state update, and output gate are defined by the following equations.
$$f_t = \sigma\left(W_f [h_{t-1}, x_t] + b_f\right)$$
where $f_t$ is the forget gate activation, $\sigma$ is the logistic sigmoid, $W_f$ and $b_f$ are the weight matrix and bias, $h_{t-1}$ is the previous hidden state, and $x_t$ is the input at the current time step.
$$i_t = \sigma\left(W_i [h_{t-1}, x_t] + b_i\right), \qquad \tilde{C}_t = \mathrm{ReLU}\left(W_C [h_{t-1}, x_t] + b_C\right),$$
where $i_t$ is the input gate activation and $\tilde{C}_t$ represents the candidate values to add to the cell state.
$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$$
where $\odot$ represents element-wise multiplication, $f_t$ is the output of the forget gate, $C_{t-1}$ is the previous cell state, $i_t$ is the output of the input gate, and $C_t$ is the new cell state.
$$o_t = \sigma\left(W_o [h_{t-1}, x_t] + b_o\right), \qquad h_t = o_t \odot \mathrm{ReLU}(C_t)$$
where $o_t$ is the activation of the output gate, and $h_t$ is the new hidden state.
Hidden states produced at each time step are the output of this layer and are collected as
$$S = \left[h_1^{(1)}; \ldots; h_T^{(1)}\right] \in \mathbb{R}^{T \times d_h}.$$
To mitigate overfitting, a Dropout layer is applied immediately after, with a dropout rate $p$ selected as 0.3, to produce the following output.
$$\tilde{S} = \mathrm{Dropout}(S; p)$$
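The gate equations of this section can be traced step by step with a small NumPy forward pass. This is a didactic sketch and not the TensorFlow implementation used in the study: the parameter dictionary, the random initialisation in `init_params`, and the zero initial states are assumptions, and dropout is omitted because it is inactive at inference time.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(z, 0.0)

def init_params(d_in, d_h, seed=0):
    """Small random weights; a hypothetical initialisation for illustration only."""
    rng = np.random.default_rng(seed)
    params = {}
    for gate in ("f", "i", "C", "o"):
        params[f"W_{gate}"] = rng.normal(0.0, 0.1, size=(d_h, d_h + d_in))
        params[f"b_{gate}"] = np.zeros(d_h)
    return params

def lstm_relu_step(x_t, h_prev, c_prev, params):
    """One time step of the ReLU-variant LSTM cell."""
    z = np.concatenate([h_prev, x_t])                    # [h_{t-1}, x_t]
    f_t = sigmoid(params["W_f"] @ z + params["b_f"])     # forget gate
    i_t = sigmoid(params["W_i"] @ z + params["b_i"])     # input gate
    c_tilde = relu(params["W_C"] @ z + params["b_C"])    # candidate (ReLU, not tanh)
    c_t = f_t * c_prev + i_t * c_tilde                   # new cell state
    o_t = sigmoid(params["W_o"] @ z + params["b_o"])     # output gate
    h_t = o_t * relu(c_t)                                # new hidden state
    return h_t, c_t

def run_lstm1(X, params, d_h):
    """Collect the hidden state of every time step: S with shape (T, d_h)."""
    h, c = np.zeros(d_h), np.zeros(d_h)
    states = []
    for x_t in X:
        h, c = lstm_relu_step(x_t, h, c, params)
        states.append(h)
    return np.stack(states)

X = np.random.default_rng(1).random((30, 4))   # one window: T = 30, d = 4 (toy values)
S = run_lstm1(X, init_params(d_in=4, d_h=64), d_h=64)
print(S.shape)  # (30, 64)
```

One consequence of the ReLU substitution is visible in the sketch: starting from a zero cell state, both the cell state and the hidden states remain non-negative, unlike in the tanh formulation.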

2.2.2. Multihead Attention Layer

An Attention mechanism follows the initial LSTM layer to enhance feature representation by focusing on the most relevant parts of the sequence. The attention mechanism is a key element of transformer-based neural networks, enabling effective modelling of contextual dependencies. The multi-head attention variant extends this concept by projecting the input into multiple subspaces, one per head, to compute parallel attention scores that capture diverse contextual relationships within the sequence [30]. In the proposed model, six attention heads are used, with key vector dimensions equal to the LSTM output size, allowing the model to learn multiple attention patterns in parallel. For each attention head $i \in \{1, \ldots, N\}$, queries, keys, and values are computed as:
$$Q_i = \tilde{S} W_i^Q, \qquad K_i = \tilde{S} W_i^K, \qquad V_i = \tilde{S} W_i^V,$$
where $W_i^Q, W_i^K, W_i^V \in \mathbb{R}^{d_{\text{model}} \times d_k}$ are learnable projection matrices and $d_k = d_v = d_{\text{model}} / N$.
The attention output for $\text{head}_i$ is computed through:
$$\text{head}_i = \mathrm{softmax}\left(\frac{Q_i K_i^{\top}}{\sqrt{d_k}}\right) V_i,$$
and the outputs of all heads are concatenated and linearly projected to form the output $A$.
$$A = \mathrm{Concat}(\text{head}_1, \ldots, \text{head}_N)\, W^O.$$
To integrate global contextual information with local temporal patterns, the attention outputs and the hidden states of the first LSTM are concatenated at each time step:
$$c_t = \left[h_t^{(1)}; a_t\right].$$
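The attention computation and the subsequent concatenation can be sketched in NumPy as below. This is an illustration under stated assumptions rather than the study's code: the weights are random, the function names are hypothetical, and the per-head dimension is set to 64 following the statement that the key dimension equals the LSTM output size.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(S, W_q, W_k, W_v, W_o):
    """S: (T, d_model); W_q, W_k, W_v: (N, d_model, d_k); W_o: (N * d_k, d_model)."""
    heads = []
    for Wq_i, Wk_i, Wv_i in zip(W_q, W_k, W_v):
        Q, K, V = S @ Wq_i, S @ Wk_i, S @ Wv_i
        d_k = Q.shape[-1]
        weights = softmax(Q @ K.T / np.sqrt(d_k))   # (T, T), each row sums to 1
        heads.append(weights @ V)                    # (T, d_k)
    return np.concatenate(heads, axis=-1) @ W_o      # A: (T, d_model)

rng = np.random.default_rng(0)
T, d_model, N, d_k = 30, 64, 6, 64                   # d_k = 64 is an assumption
S = rng.random((T, d_model))                         # stand-in for the LSTM1 states
W_q, W_k, W_v = (rng.normal(0, 0.1, (N, d_model, d_k)) for _ in range(3))
W_o = rng.normal(0, 0.1, (N * d_k, d_model))

A = multi_head_self_attention(S, W_q, W_k, W_v, W_o)
C = np.concatenate([S, A], axis=-1)                  # c_t = [h_t ; a_t] for every t
print(A.shape, C.shape)  # (30, 64) (30, 128)
```

The concatenation doubles the feature dimension seen by the second LSTM, so each time step carries both its local recurrent state and a context vector aggregated over the whole window.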

2.2.3. Second LSTM Layer (LSTM2)

The concatenated sequence $C = [c_1; \ldots; c_T]$ is then passed through a second LSTM layer with the same configuration as LSTM1:
$$s_t = \mathrm{LSTM}_2\left(c_t, s_{t-1}, \tilde{C}_{t-1}\right),$$
where $\tilde{C}_{t-1}$ here denotes the previous cell state of LSTM2.
This layer returns only the final hidden state $s^* = s_T$, capturing high-level temporal representations. A Dropout layer with the same rate as the one following the first LSTM layer is applied at this stage for additional regularisation.
$$\tilde{s}^* = \mathrm{Dropout}(s^*; p)$$

2.2.4. Output Layer

A final dense layer maps the final feature representation $\tilde{s}^*$ to the single scalar output $\hat{y}$, which is then inverse-transformed to the original scale.
$$\hat{y} = \sigma\left(W_d \tilde{s}^* + b_d\right),$$
where $W_d$ and $b_d$ are trainable parameters, and $\sigma(\cdot)$ denotes the linear (identity) activation function used to predict the final value.
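Putting the four components together, the architecture described in Section 2.2 could be assembled in Keras roughly as follows. This is a sketch under stated assumptions, not the authors' released code: the function name, the `n_features` default, and the Adam optimiser are hypothetical, and concatenating the pre-dropout LSTM1 states with the attention output is one reading of the concatenation equation.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_hybrid_model(window_size=30, n_features=4, units=64, heads=6, dropout=0.3):
    """LSTM -> multi-head self-attention -> LSTM -> dense, as a functional model."""
    inputs = layers.Input(shape=(window_size, n_features))

    # LSTM1: ReLU activation, one hidden state per time step.
    h = layers.LSTM(units, activation="relu", return_sequences=True)(inputs)
    h_drop = layers.Dropout(dropout)(h)

    # Self-attention over the (dropped-out) LSTM1 states; query = value = key.
    a = layers.MultiHeadAttention(num_heads=heads, key_dim=units)(h_drop, h_drop)

    # c_t = [h_t ; a_t], then LSTM2 returns only the final hidden state.
    c = layers.Concatenate(axis=-1)([h, a])
    s = layers.LSTM(units, activation="relu")(c)
    s = layers.Dropout(dropout)(s)

    # Dense layer with linear activation gives the scalar prediction.
    outputs = layers.Dense(1, activation="linear")(s)

    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="mse")  # optimiser choice is an assumption
    return model

model = build_hybrid_model()
print(model.output_shape)  # (None, 1)
```

The prediction is produced on the Min-Max scale, so it still has to be inverse-transformed before errors are reported in physical units.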

2.3. Experimental Design

This section outlines the methodological framework employed to evaluate the proposed hybrid deep learning model for missing data imputation. The experiments extend our previously published univariate results [40] by investigating the model’s performance on multivariate sensor datasets. The complete workflow includes missing data simulation, model training with cross-validation, and comparative evaluations against multiple baseline and ablation variants.
Environmental IoT datasets frequently exhibit moderate missingness rates (10–40%) due to communication failures and sensor malfunctions. Therefore, to reflect typical real-world conditions rather than an idealised complete dataset, the proposed model and the baseline methods were trained using datasets with 30% artificially induced missingness. In a prior evaluation of the proposed architecture on univariate data [40], training with 30% missing values resulted in lower imputation error and better generalisation than training on complete data. This moderate level of missingness provides the model with sufficient exposure to missing patterns during training while still retaining enough observed values to learn temporal dynamics effectively. The proposed Hybrid model does not assume a fixed missingness rate and allows for generalisation to varying levels of missingness during inference.
The missing indices in the training and testing datasets were generated randomly through Bernoulli sampling according to the MCAR mechanism across the time steps for the target feature in each dataset to mimic realistic stochastic data loss in sensor networks. MCAR was chosen as this is the most widely used missing mechanism in the literature to assess the performance of the imputation models [6,16,19] and assists in isolating and evaluating the imputation capability of the model under random deletion patterns. The implications and limitations of this choice, including the need to evaluate MAR and MNAR settings in future work, are discussed in Section 4. The auxiliary features remained complete and were used to support temporal and cross-variable correlations during imputation.
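The masking procedure can be sketched as below: each time step of the target column is dropped independently with a fixed probability, while the auxiliary columns are left intact. The function name, the seed handling, and the toy array are assumptions for illustration.

```python
import numpy as np

def mask_target_mcar(data, target_col, rate, seed=0):
    """Set the target column to NaN with probability `rate` at each time step."""
    rng = np.random.default_rng(seed)
    corrupted = np.array(data, dtype=float)          # copy; the input stays untouched
    missing = rng.random(len(corrupted)) < rate      # one Bernoulli(rate) draw per step
    corrupted[missing, target_col] = np.nan
    return corrupted, missing

data = np.random.default_rng(42).random((5000, 4))   # toy data: target in column 0
for rate in (0.1, 0.3, 0.5, 0.7, 0.9):
    corrupted, missing = mask_target_mcar(data, target_col=0, rate=rate)
    print(rate, round(missing.mean(), 3))            # realised fraction is close to rate
```

Returning the mask alongside the corrupted array makes it straightforward to score the imputed values only at the positions that were actually removed.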
Each dataset was partitioned into training and test sets with a ratio of 80:20 without shuffling to preserve the temporal sequence. The training data were further partitioned into training and validation sets, and the proposed Hybrid model was trained using five-fold cross-validation to ensure robustness and generalizability. Mean Squared Error (MSE) was used as the loss measure during the training process. The best-performing model, as determined by validation loss, was then evaluated on unseen test data under varying levels of missingness from 10% to 90%. Performance was quantified using the Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), and $R^2$ score, defined as
$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)^2}$$
$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left|\hat{y}_i - y_i\right|$$
$$R^2 = 1 - \frac{\sum_{i=1}^{N} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{N} (y_i - \bar{y})^2}$$
where $\hat{y}_i$ and $y_i$ denote the imputed and true values, respectively, $\bar{y}$ is the mean of the true values, and $N$ is the number of evaluated points. These metrics were computed across all missingness levels to evaluate the model’s robustness under increasing data degradation.
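The three metrics translate directly into NumPy. The sketch below uses a hypothetical function name and toy numbers; in practice the inputs would be the true and imputed values at the masked positions, after the inverse Min-Max transform.

```python
import numpy as np

def imputation_metrics(y_true, y_pred):
    """RMSE, MAE and R^2 between true and imputed values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true
    ss_res = np.sum(err ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return {
        "RMSE": float(np.sqrt(np.mean(err ** 2))),
        "MAE": float(np.mean(np.abs(err))),
        "R2": float(1.0 - ss_res / ss_tot),
    }

# Toy check: a single error of 2 among four points.
print(imputation_metrics([1, 2, 3, 4], [1, 2, 3, 6]))
# RMSE = 1.0, MAE = 0.5, R2 = 0.2 (up to floating-point rounding)
```

The toy case also shows why RMSE and MAE are reported together: one large error raises RMSE (1.0) twice as much as MAE (0.5), so a widening gap between the two signals occasional large misses rather than uniformly worse imputations.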
The Hybrid model is implemented in Python 3.12 using the open-source libraries pandas, NumPy, and TensorFlow. Table 1 summarises the configuration and training parameters for the proposed model, which are used in all experiments. We employed a two-stage approach to hyperparameter selection. First, we fixed a set of structural parameters commonly used in similar architectures, such as attention dimensionality, LSTM units, learning rate, batch size, and activation functions, to constrain the search space. Second, we tuned a subset of influential hyperparameters using a random search on a validation split derived from the training portion of each dataset. The parameters tuned include the number of attention heads, the dropout rate, and the temporal lookback window (window_size). The final selected configuration, along with the justification of choices or search ranges, is reported in Table 1. Although alternative optimisation strategies such as grid search may yield further improvements, our tuning procedure produced stable performance across all four real-world datasets.
The proposed model was benchmarked against two widely used imputation baselines: K-Nearest Neighbour (KNN) and Bidirectional Recurrent Imputation for Time Series (BRITS). For KNN imputation, the number of nearest neighbours was set to k = 5, following established practice in environmental sensor data imputation (e.g., Decorte et al., 2024 [6]). Uniform weights were assigned to the selected neighbours, and Euclidean distance was used as the similarity metric. Neighbours were selected only from samples with complete feature observations to ensure valid distance computation. This configuration was adopted as a standard and reproducible baseline rather than performing dataset-specific optimisation.
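The KNN configuration described above can be sketched as follows. This is an illustration only, since the text does not say which library or code was used for the baseline: the function name is hypothetical, and computing distances on the complete auxiliary features is an assumption consistent with the statement that neighbours come from fully observed samples.

```python
import numpy as np

def knn_impute_target(aux, target, k=5):
    """Fill NaNs in `target` with the mean target of the k nearest complete rows."""
    aux = np.asarray(aux, dtype=float)
    target = np.asarray(target, dtype=float)
    missing = np.isnan(target)
    donors_aux, donors_y = aux[~missing], target[~missing]  # fully observed rows only

    filled = target.copy()
    for idx in np.flatnonzero(missing):
        # Euclidean distance in auxiliary-feature space.
        dist = np.linalg.norm(donors_aux - aux[idx], axis=1)
        nearest = np.argsort(dist)[:k]
        filled[idx] = donors_y[nearest].mean()               # uniform weights
    return filled

aux = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [10.0]])  # toy auxiliary feature
target = np.array([0.0, 1.0, 2.0, np.nan, 4.0, 10.0])
print(knn_impute_target(aux, target, k=2))  # the gap is filled with (2 + 4) / 2 = 3
```

Because the neighbours are chosen without regard to time, this baseline exploits cross-variable similarity only, which is the main contrast with the recurrent models it is compared against.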
BRITS, a deep-learning-based approach using bidirectional recurrent dynamics to model temporal dependencies [41], was implemented using the authors’ publicly available reference implementation. To maintain an unbiased comparison, most architectural and training parameters were kept consistent with those of the proposed Hybrid model. Only two hyperparameters were tuned: the learning rate was selected from the set $\{10^{-3}, 10^{-4}, 10^{-5}\}$, and the number of training epochs was chosen from $\{50, 100\}$. Hyperparameter selection was based on the validation MSE, with particular care taken to avoid overfitting associated with near-zero training loss. The final configuration, using a learning rate of $10^{-5}$ and 50 training epochs, provided stable and competitive performance across all datasets and was therefore adopted for all experiments.
All comparative models were trained on the same datasets and evaluated under identical experimental conditions, namely five-fold cross-validation and the same performance metrics, to ensure a fair comparison. Table 2 summarises the cross-validation performance of the Hybrid model in comparison with KNN and BRITS across all datasets. The Hybrid model consistently achieves lower MAE and RMSE values than BRITS and demonstrates competitive performance against KNN, particularly for the SoilTemp and NO2/O3 datasets, while maintaining high R² values close to unity. Although KNN attains lower errors for certain datasets such as Meteorological and O3, the Hybrid model exhibits greater robustness overall, as reflected by its stable error statistics and consistently high coefficients of determination.
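One way to guarantee that every method sees identical splits is to generate the fold indices once and reuse them for all models. The sketch below is an assumption-laden illustration: the text states only that five-fold cross-validation was used, so the choice of contiguous, unshuffled folds and the helper name `k_fold_indices` are hypothetical.

```python
def k_fold_indices(n_samples, n_folds=5):
    """Yield (train_idx, test_idx) pairs for contiguous, non-overlapping folds."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        stop = start + base + (1 if fold < extra else 0)
        test_idx = list(range(start, stop))
        train_idx = list(range(0, start)) + list(range(stop, n_samples))
        yield train_idx, test_idx
        start = stop


# Build the splits once, then pass the same list to every method under test.
folds = list(k_fold_indices(12, n_folds=5))
for train_idx, test_idx in folds:
    print(len(train_idx), test_idx)
```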
To investigate the contribution of each module within the Hybrid model, three ablation variants were designed, as detailed in Table 3. Each variant omits or alters specific components to assess their impact on model performance, and each follows the same training and evaluation procedure as the full Hybrid model. The resulting comparisons clarify the role of temporal sequencing and of feeding the combined outputs of the first LSTM and the attention layer into the second LSTM layer, which motivates the sandwich design of the proposed architecture.

3. Results

3.1. Multivariate Evaluation

Table 4 presents the MAE and RMSE values for four environmental datasets: SoilTemp, Meteorological, Air Quality NO2/O3, and Air Quality O3 across missingness levels ranging from 10% to 90%. Across all datasets, the model demonstrates strong robustness at lower levels of missingness, with relatively small changes in MAE and RMSE up to around 70%. Beyond this point, both metrics begin to rise more noticeably, and the degradation becomes substantial.
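The missingness levels above can be simulated by hiding a fixed fraction of known values and scoring the imputations against them. The following is a minimal sketch that assumes values are removed completely at random (MCAR) as scattered individual points; `mask_mcar` and the toy series are hypothetical and do not reproduce the authors' masking code.

```python
import random


def mask_mcar(values, missing_rate, seed=0):
    """Hide a fixed fraction of entries completely at random.

    Returns the masked series (None marks a hidden value) and the hidden
    positions, so that imputations can later be scored against the truth.
    """
    if not 0.0 <= missing_rate < 1.0:
        raise ValueError("missing_rate must be in [0, 1)")
    rng = random.Random(seed)
    n_hidden = round(missing_rate * len(values))
    hidden = sorted(rng.sample(range(len(values)), n_hidden))
    masked = list(values)
    for i in hidden:
        masked[i] = None
    return masked, hidden


# Toy series (hypothetical) masked at three of the evaluated levels.
series = [float(t) for t in range(100)]
for rate in (0.1, 0.5, 0.9):
    masked, hidden = mask_mcar(series, rate)
    print(rate, len(hidden))
```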
For the Soil Temperature dataset, the model exhibits the most stable behaviour. As shown in Table 4 and Figure 5a, both MAE and RMSE remain nearly constant up to 70% missingness. A substantial increase is observed only beyond 80%, indicating that soil temperature values are easier to reconstruct even when a large portion of data is missing. The Meteorological dataset displays similarly consistent performance at moderate missingness levels. Interestingly, MAE slightly improves up to 40%; however, Figure 5b shows a clear increase in both metrics once missingness exceeds 60%, with a sharp rise at 80–90%. In the Air Quality NO2/O3 dataset, the model maintains stability up to approximately 40% missingness, but errors increase steadily after that.
As illustrated in Figure 5c, MAE nearly doubles between 10% and 90%, reflecting the higher variability and nonlinearity typically associated with air pollutant dynamics. Finally, the Air Quality O3 dataset shows small absolute error values due to its lower numerical range, but the upward trend with increasing missingness is still evident (Figure 5d). While errors remain minimal up to 60%, both MAE and RMSE rise consistently beyond 60%, confirming that even low-magnitude datasets eventually exhibit degradation as missingness becomes severe.
The coefficient of determination (R²) values reported in Table 4 indicate a strong agreement between the imputed and ground-truth observations across all datasets. An R² score consistently close to 1 demonstrates that the proposed Hybrid model is able to explain the majority of variance in the original data, even under increasing levels of missingness. This suggests that the model not only minimises absolute and squared errors (as reflected by MAE and RMSE) but also preserves the underlying temporal and inter-variable relationships. Overall, the results show that the model performs reliably under moderate to high missingness conditions for all environmental datasets, but predictable degradation occurs at extreme levels of missing data.
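For reference, the three metrics can be computed as below. These are the standard definitions of MAE, RMSE, and R²; the function names and the toy values are illustrative, and the assumption that scores are computed on the artificially hidden entries is ours.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - residual SS / total SS."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy ground truth and imputations (hypothetical values).
y_true = [10.0, 12.0, 14.0, 16.0]
y_pred = [10.5, 11.5, 14.5, 15.5]
print(mae(y_true, y_pred), rmse(y_true, y_pred), r2_score(y_true, y_pred))
```

In this toy case every imputation is off by 0.5, so MAE and RMSE coincide at 0.5, while R² is 0.95 because the residual sum of squares (1.0) is small relative to the total sum of squares (20.0).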

3.2. Comparative Study

Table 5 reports the Mean Absolute Error (MAE) comparison, and Table 6 shows the corresponding RMSE values for the Hybrid model, KNN and BRITS across four environmental datasets and missingness levels from 10% to 90%. Visual comparison is available in Figure 6 and Figure 7.
For the SoilTemp dataset, the proposed Hybrid model consistently achieves the lowest MAE and RMSE across most missingness levels, demonstrating strong robustness to data loss. For example, at 30% missingness, the Hybrid MAE is 0.0962 while KNN and BRITS record 0.4045 and 0.5733, respectively; the RMSE trend is also similar. These differences are clearly visible in Figure 6a and Figure 7a.
The Meteorological dataset exhibits a different pattern: in terms of MAE, KNN outperforms the Hybrid model across the reported missingness levels. However, the RMSE results (Table 6 and Figure 7b) provide a more nuanced picture. The Hybrid model attains slightly lower RMSE than KNN at low to moderate missingness, indicating that although KNN gives smaller average absolute errors, the Hybrid produces fewer very large errors at these missingness levels. At higher missingness (80–90%), KNN’s RMSE becomes lower than the Hybrid’s, reflecting KNN’s relative resilience in limiting large residuals for this dataset at extreme data loss. These contrasting MAE/RMSE behaviours suggest that, for Meteorological data, KNN yields smaller average deviations while the Hybrid model reduces large errors at low-to-moderate missingness.
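The reason one method can win on MAE yet lose on RMSE is that squaring gives large residuals more weight. The sketch below uses two made-up residual sets, chosen purely to illustrate the effect; they are not taken from Table 5 or Table 6, and the helper names are hypothetical.

```python
import math


def mean_abs(errors):
    """MAE computed directly from residuals."""
    return sum(abs(e) for e in errors) / len(errors)


def root_mean_sq(errors):
    """RMSE computed directly from residuals."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))


# Hypothetical residuals, chosen only to illustrate the effect.
uniform_errors = [1.0, 1.0, 1.0, 1.0]  # many medium-sized errors
spiky_errors = [0.0, 0.0, 0.0, 3.2]    # mostly exact, one large miss

print(mean_abs(uniform_errors), root_mean_sq(uniform_errors))
print(mean_abs(spiky_errors), root_mean_sq(spiky_errors))
```

The second set has the lower MAE (0.8 versus 1.0) but the higher RMSE (about 1.6 versus 1.0), which mirrors the kind of MAE/RMSE disagreement described above.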
For the NO2/O3 target, KNN initially achieves lower MAE than the Hybrid at low missingness, as presented in Table 5 and Figure 6c. However, this advantage diminishes as missingness increases, and the Hybrid overtakes KNN from roughly 50% missingness onward. The RMSE series shows that the Hybrid already attains lower RMSE than KNN from as early as 20% missingness (Figure 7c), indicating that the Hybrid model is better at limiting larger errors even when KNN has lower MAE at the very lowest missingness levels. In short, KNN is initially better on MAE for NO2/O3, but the Hybrid becomes superior as missingness grows; RMSE shows that the Hybrid model reduces large residuals earlier than MAE suggests. For O3, absolute errors are very small for all methods; nevertheless, the Hybrid maintains a small but consistent advantage in MAE across most missingness levels, and the RMSE values follow the same pattern.

3.3. Ablation Study

An ablation study was conducted to compare the three ablation variants (A, B, C) defined in Table 3 against the full Hybrid model (referred to as the Baseline) across four datasets (SoilTemp, Meteorological, Air Quality NO2/O3, and Air Quality O3) using MAE and RMSE under missingness levels of 10–90%. The results for all four datasets are summarised in Table 7 and Table 8, and visualised in Figure 8 and Figure 9.
For the SoilTemp dataset, the Baseline achieves the lowest MAE and RMSE across all missingness levels, confirming that the full hybrid structure is essential for accurate reconstruction. Ablation B performs the worst, with very large errors at all missingness levels, indicating that the LSTM component is critical for capturing temporal dependencies. Ablation A and Ablation C perform moderately but consistently worse than the Baseline. RMSE results in Table 8 and Figure 9a follow the same trend, reinforcing the dominance of the Baseline model.
A similar pattern is observed in the Meteorological dataset. Ablation B again yields extremely high error values across all missingness levels (Table 7 and Figure 8b), showing that temporal modelling is indispensable for the Meteorological dataset as well. Ablation A and Ablation C are comparable to each other but remain consistently inferior to the Baseline. RMSE comparisons reflect the same ordering. The Baseline retains a clear advantage across all missingness levels.
For the NO2/O3 dataset, the Baseline again outperforms all ablated variants across both metrics. Ablation B remains the poorest performer (MAE increasing to 3.3611 at 90%), while Ablation A and Ablation C track more closely. Notably, Ablation C performs relatively closer to the Baseline at very low missingness (10%) but diverges at higher missingness. RMSE differences in Table 8 and Figure 9c show a sharper degradation for all ablations, with Ablation B exhibiting the steepest rise.
The O3 dataset exhibits very small absolute error magnitudes, but the relative behaviour remains consistent with the other datasets. The Baseline consistently achieves the best MAE and RMSE values. At low missingness levels, Ablation A and Ablation C perform comparably to the Baseline, but Ablation B again shows noticeably higher errors. The RMSE curves in Figure 9d show similarly small but consistent gaps.

4. Discussion

The results of the multivariate evaluation of the proposed Hybrid model demonstrate that it provides reliable imputation performance across diverse environmental datasets, maintaining stable MAE and RMSE values up to approximately 40–70% missingness. This robustness indicates that the architecture effectively captures temporal dependencies, allowing it to reconstruct missing patterns even when substantial portions of the sequence are absent. However, the sharp increase in error beyond 70–80% missingness highlights the limitations of the model when the available contextual information becomes insufficient. Air quality datasets, particularly NO2/O3, show more rapid deterioration due to their higher variability, whereas Soil Temperature remains stable for longer. These findings underscore that while the model is well-suited for real-world environmental monitoring scenarios with moderate to high data loss, extreme missingness still presents a significant challenge that future work should address through enhanced temporal reasoning or external information integration.
The comparative evaluation highlights the strong performance and robustness of the proposed Hybrid model relative to KNN and BRITS across the four environmental datasets. In the SoilTemp and O3 datasets, the Hybrid model consistently yields the lowest errors, even under extreme missingness. These datasets exhibit smoother temporal variations, enabling the Hybrid model to effectively leverage both local feature extraction and sequence learning to reconstruct missing values with high stability. Across all datasets and missingness levels, the proposed Hybrid model demonstrates consistently superior imputation performance compared to both KNN and BRITS baselines. In terms of MAE, the hybrid approach achieves reductions exceeding 50% across all datasets, with improvements frequently surpassing 80% under moderate to high missingness levels, particularly for the SoilTemp and Meteorological datasets. For example, at 50% missingness, MAE is reduced from 0.5137 to 0.0968 for SoilTemp and from 5.5871 to 0.9137 for Meteorological data when compared to KNN and BRITS, respectively. Similar trends are observed for RMSE, where reductions of up to approximately 85% are achieved across datasets, even under extreme missingness conditions (up to 90%). These results indicate that the proposed architecture maintains stable reconstruction accuracy as missingness increases, whereas baseline methods exhibit rapid performance degradation. The consistent gains across heterogeneous environmental datasets highlight the robustness of the hybrid LSTM–Attention framework and its suitability for real-world IoT sensor data imputation.
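The percentage reductions quoted above follow from a simple relative-error calculation. As a worked check on the two MAE examples at 50% missingness (the helper name `relative_reduction` is ours):

```python
def relative_reduction(baseline_error, model_error):
    """Percentage by which `model_error` undercuts `baseline_error`."""
    return 100.0 * (baseline_error - model_error) / baseline_error


# MAE values at 50% missingness quoted in the text above.
soiltemp = relative_reduction(0.5137, 0.0968)        # KNN vs Hybrid
meteorological = relative_reduction(5.5871, 0.9137)  # BRITS vs Hybrid
print(f"SoilTemp: {soiltemp:.1f}%  Meteorological: {meteorological:.1f}%")
```

Both examples correspond to reductions of a little over 80% (about 81.2% and 83.6%), consistent with the range stated in the text.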
In the Meteorological dataset, however, KNN demonstrates competitive behaviour at low missingness levels, marginally outperforming the Hybrid model in both MAE and RMSE. This result suggests that meteorological variables contain short-range correlations that KNN can exploit effectively through distance-based interpolation. Figure 10 compares the distributions of temp_top and tempAvg, the target features in the SoilTemp and Meteorological datasets, respectively. The temp_top distribution may show skewness, reflecting localised fluctuations, and spans a wider range, suggesting more variability in sensor readings. The tempAvg distribution, in contrast, is smoother, less skewed, and more concentrated, indicating the stability that results from averaging; neighbouring values therefore provide a good basis for imputation. As missingness increases, KNN’s neighbourhood becomes sparse, and its performance deteriorates, allowing the Hybrid model to surpass it. This behaviour illustrates the advantage of the Hybrid architecture in maintaining reconstruction accuracy under progressively degraded information availability.
A similar pattern emerges in the NO2/O3 dataset, where KNN initially achieves lower error metrics than the Hybrid model at 10–30% missingness. Pollutant concentrations often exhibit strong local covariation, explaining KNN’s effectiveness in early stages. However, as the missingness grows, the Hybrid model becomes the most accurate method, consistently outperforming both KNN and BRITS. The rapid performance decline of BRITS across all datasets and missingness levels indicates that, despite being designed for irregular time series, it struggles to maintain stability under the types of data gaps and nonlinear dynamics present in environmental monitoring.
Overall, the comparative study shows that the Hybrid model generalises more effectively than KNN and BRITS across diverse environmental datasets, particularly when missingness becomes substantial. Its ability to integrate local, sequential, and contextual information enables strong performance.
The proposed architecture offers several advantages: (i) the recurrent layers effectively capture long-range temporal dependencies in sensor data; (ii) the attention mechanism identifies informative past time steps, increasing interpretability; and (iii) the model generalises well across varying levels of missingness. However, the approach also has limitations. Computational cost is higher than classical methods such as KNN or spline interpolation, making the proposed model less suitable for deployment on resource-limited devices. Performance may degrade when observation rates are extremely sparse or when MNAR missingness dominates. In settings where data are highly stationary or when missingness is infrequent, simpler methods may outperform neural approaches. The LSTM–Attention–LSTM architecture is particularly suitable for multivariate IoT time-series where both short-term temporal dynamics and long-range dependencies are present. It is most effective under moderate to high missingness levels with dispersed missing points and in settings where strong correlations exist across sensor variables. However, the model may offer limited advantage for datasets having weak temporal dependencies, and its current implementation focuses on single-point missingness rather than consecutive gaps. These factors should be considered when selecting imputation techniques for practical deployments.
The ablation study confirms that each component of the Hybrid model, multihead attention and the two LSTM temporal modules, contributes positively to reconstruction accuracy. The Baseline consistently achieves the lowest MAE and RMSE values and shows the most stable behaviour across all missingness levels. These results validate the necessity of the full hybrid design and highlight the complementary strengths of its constituent components. Across datasets and metrics, Ablation B (removal of the second LSTM) is the most detrimental modification. The consistently poor performance of Ablation B across the SoilTemp, Meteorological, and NO2/O3 datasets demonstrates that the post-attention sequence module (LSTM2) is critical for producing accurate reconstructions when attention outputs are to be re-integrated into a sequential representation.
Ablation A (removal of the first / pre-attention LSTM) and Ablation C (removal of the attention mechanism) both produce moderate degradations relative to the Baseline but do not cause the extreme failure observed for Ablation B. For NO2/O3, Ablation C is relatively close to the Baseline at low missingness (MAE 1.7326 vs. Baseline 1.6996 at 10%), but its error grows faster with increasing missingness (see Figure 8c), indicating that attention contributes increasingly with larger data gaps. The O3 target (small-magnitude values) reflects the same qualitative pattern but with much smaller absolute errors: the Baseline maintains the best MAE/RMSE across missingness levels, and Ablation B again produces the largest deterioration.
In summary, removing the post-attention LSTM (Ablation B) causes the largest and most consistent performance drop across datasets and metrics, highlighting the importance of re-integrating attention-enhanced features into the sequential model. Omitting the pre-attention LSTM (Ablation A) or the attention module (Ablation C) leads to moderate but measurable degradations; attention becomes progressively more important as missingness increases. The full Baseline configuration consistently attains the lowest MAE and RMSE values, confirming that the combined sandwiched pipeline (LSTM1 + Attention + LSTM2) provides the most reliable reconstruction performance under varying degrees of data loss and varying types of data distributions.
Although MCAR simulations provide controlled insight into model behaviour, real-world IoT systems often exhibit MAR or MNAR patterns. Under MAR, incorporating covariates related to missingness (e.g., battery level, timestamp, or neighbouring sensor status) could further improve performance. MNAR scenarios require explicit modelling of the missingness process, such as incorporating masked value predictors or joint generative models. Extending the proposed model to these mechanisms is therefore an important next step. Gated Recurrent Units (GRUs) provide a more efficient alternative and can achieve similar performance with lower computational cost, making GRU–Attention–GRU architectures suitable for resource-constrained settings. A controlled comparison between LSTM- and GRU-based variants, using identical attention mechanisms and comparable parameters, is planned for future work.
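The efficiency argument for GRUs can be made concrete by counting trainable parameters per recurrent layer. The sketch below uses the textbook LSTM and GRU formulations with one bias vector per gate block; the input dimension of 8 features is a hypothetical value, and some library implementations add a second bias to the GRU, which increases its count slightly.

```python
def lstm_params(input_dim, units):
    # Four gate blocks (input, forget, cell candidate, output), each with
    # input weights, recurrent weights, and a bias vector.
    return 4 * (units * (input_dim + units) + units)


def gru_params(input_dim, units):
    # Three gate blocks (update, reset, candidate) in the classic GRU
    # formulation with one bias vector per block.
    return 3 * (units * (input_dim + units) + units)


units = 64     # LSTM units from Table 1
input_dim = 8  # hypothetical number of input features
print(lstm_params(input_dim, units), gru_params(input_dim, units))
```

Under these assumptions a GRU layer carries three quarters of the parameters of an LSTM layer of the same width (14,016 versus 18,688 here); whether accuracy is preserved is the open question that the planned controlled comparison would address.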
In this study, we focused on imputing single-value missingness patterns, which frequently occur in IoT sensor streams due to packet drops and transient communication failures. A potentially valuable extension of this work is to evaluate the proposed hybrid LSTM–Attention–LSTM model under consecutive missing segments, which represent a more challenging imputation scenario, as addressed by Zhang et al. [29]. As part of future work, we plan to investigate the performance of our model on consecutive-gap imputation tasks and to consider comparative evaluations with dual-attention approaches across additional environmental and non-environmental domains. Another future direction is to interpret the model's inferences by analysing the attention weights, to identify which features and which time steps contribute most to the imputations. Testing the model on datasets from other domains and against other methods would further help to evaluate the generalisability and robustness of the proposed model.

5. Conclusions

In this study, we presented a hybrid deep learning framework for robust missing data imputation in environmental monitoring datasets. By integrating LSTM with an attention mechanism, in a sandwiched interleaved manner, our model effectively captures both temporal and spatial dependencies, outperforming baseline and ablation variants in terms of MAE and RMSE. The results demonstrate that the proposed approach not only enhances imputation accuracy under varying levels of missingness but also maintains stability across different environmental datasets. These findings highlight the potential of hybrid architectures for improving data quality in environmental monitoring systems, which is critical for reliable decision-making and downstream tasks.

Author Contributions

Funding acquisition, U.A. and J.L.; supervision, U.A. and J.L.; methodology, J.L., A.L. and U.A.; implementation, A.L.; writing—original draft preparation, A.L., U.A. and J.L.; writing—review and editing, A.L., U.A. and J.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work has been funded by the research project “Community SuDS Innovation Accelerator”, as part of the Flood and Coastal Resilience Innovation Programme (FCRIP) by Environmental Agency (EA), UK.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data used and source code developed to reproduce the results of this study have been made publicly available on GitHub and can be accessed at the following repository: https://github.com/ammara-tees/LSTM-attention-model (accessed on 18 November 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Wells, B.; Nowacki, A.; Chagin, K.; Kattan, M. Strategies for handling missing data in electronic health record derived data. Gener. Evid. Methods Improv. Patient Outcomes (eGEMs) 2013, 1, 1035.
  2. Li, J.; Yan, X.S.; Chaudhary, D.; Avula, V.; Mudiganti, S.; Husby, H.; Shahjouei, S.; Afshar, A.; Stewart, W.F.; Yeasin, M.; et al. Imputation of missing values for electronic health record laboratory data. NPJ Digit. Med. 2021, 4, 147.
  3. Zuo, Z.; Li, J.; Xu, H.; Al Moubayed, N. Curvature-based feature selection with application in classifying electronic health records. Technol. Forecast. Soc. Change 2021, 173, 121127.
  4. Griffith, D.A.; Liau, Y.T. Imputed spatial data: Cautions arising from response and covariate imputation measurement error. Spat. Stat. 2021, 42, 100419.
  5. Kaur, M.; Singh, S.; Aggarwal, N. Missing traffic data imputation using a dual-stage error-corrected boosting regressor with uncertainty estimation. Inf. Sci. 2022, 586, 344–373.
  6. Decorte, T.; Mortier, S.; Lembrechts, J.J.; Meysman, F.J.R.; Latré, S.; Mannens, E.; Verdonck, T. Missing value imputation of wireless sensor data for environmental monitoring. Sensors 2024, 24, 2416.
  7. Okafor, N.U.; Delaney, D.T. Missing data imputation on IoT sensor networks: Implications for on-site sensor calibration. IEEE Sens. J. 2021, 21, 22833–22845.
  8. Carbery, C.M.; Woods, R.; McAteer, C.; Ferguson, D.M. Missingness analysis of manufacturing systems: A case study. Proc. Inst. Mech. Eng. 2022, 236, 1406–1417.
  9. Escobar, C.; Arinez, J.; Macias, D.; Morales-Menendez, R. Learning with missing data. In Proceedings of the IEEE International Conference on Big Data (Big Data), Atlanta, GA, USA, 10–13 December 2020.
  10. Hasan, M.K.; Alam, M.A.; Roy, S.; Dutta, A.; Jawad, M.T.; Das, S. Missing value imputation affects the performance of machine learning: A review and analysis of the literature (2010–2021). Inform. Med. Unlocked 2021, 27, 100799.
  11. Emran, N.A. Data completeness measures. In Pattern Analysis, Intelligent Security and the Internet of Things; Abraham, A., Muda, A.K., Choo, Y.H., Eds.; Springer International Publishing: Cham, Switzerland, 2015; pp. 117–130.
  12. Das, D.; Nayak, M.; Pani, S. Missing value imputation—A review. Int. J. Comput. Sci. Eng. 2019, 7, 548–558.
  13. Kang, H. The prevention and handling of the missing data. Korean J. Anesthesiol. 2013, 64, 402–406.
  14. Jäger, S.; Allhorn, A.; Bießmann, F. A benchmark for data imputation methods. Front. Big Data 2021, 4, 693674.
  15. Brown, M.; Kros, J. Data mining and the impact of missing data. Ind. Manag. Data Syst. 2003, 103, 611–621.
  16. Dagtekin, O.; Dethlefs, N. Imputation of partially observed water quality data using self-attention LSTM. In Proceedings of the 2022 International Joint Conference on Neural Networks (IJCNN), Padua, Italy, 18–23 July 2022; pp. 1–8.
  17. Jamshidian, M.; Mata, M. Advances in analysis of mean and covariance structure when data are incomplete. In Handbook of Latent Variable and Related Models; North-Holland: Amsterdam, The Netherlands, 2007; pp. 21–44.
  18. Emmanuel, T.; Maupong, T.; Mpoeleng, D.; Semong, T.; Mphago, B.; Tabona, O. A survey on missing data in machine learning. J. Big Data 2021, 8, 140.
  19. Hegde, H.; Shimpi, N.; Panny, A.; Glurich, I.; Christie, P.; Acharya, A. MICE vs PPCA: Missing data imputation in healthcare. Inform. Med. Unlocked 2019, 17, 100275.
  20. Stekhoven, D.; Bühlmann, P. MissForest: Non-parametric missing value imputation for mixed-type data. Bioinformatics 2012, 28, 112–118.
  21. Gupta, A.; Lam, M.S. Estimating missing values using neural networks. J. Oper. Res. Soc. 1996, 47, 229–238.
  22. Silva-Ramírez, E.L.; Pino-Mejías, R.; López-Coello, M.; Cubiles-de-la Vega, M.D. Missing value imputation on missing completely at random data using multilayer perceptrons. Neural Netw. 2011, 24, 121–129.
  23. Beaulieu-Jones, B.K.; Moore, J.H. Missing data imputation in the electronic health record using deeply learned autoencoders. In Proceedings of the Pacific Symposium on Biocomputing, Kohala Coast, HI, USA, 4–8 January 2017; Volume 22, pp. 207–218.
  24. McCoy, J.T.; Kroon, S.; Auret, L. Variational autoencoders for missing data imputation with application to a simulated milling circuit. IFAC-PapersOnLine 2018, 51, 141–146.
  25. Kazijevs, M.; Samad, M.D. Deep imputation of missing values in time series health data: A review with benchmarking. J. Biomed. Inform. 2023, 144, 104440.
  26. Haibin, C.; Yongliang, H. A hybrid LSTM and decision tree model: A novel machine learning architecture for complex data classification. In Proceedings of the IEEE International Conference on Sensors, Electronics and Computer Engineering (ICSECE), Jinzhou, China, 18–20 August 2023; pp. 1441–1446.
  27. Tsoulos, I.; Charilogis, V.; Tsalikakis, D. Train neural networks with a hybrid method that incorporates a novel simulated annealing procedure. AppliedMath 2024, 4, 1143–1161.
  28. Chen, Z.; Xu, H.; Jiang, P.; Yu, S.; Lin, G.; Bychkov, I.; Hmelnov, A.; Ruzhnikov, G.; Zhu, N.; Liu, Z. A transfer learning-based LSTM strategy for imputing large-scale consecutive missing data and its application in a water quality prediction system. J. Hydrol. 2021, 602, 126573.
  29. Zhang, Y.; Thorburn, P.J. A dual-head attention model for time series data imputation. Comput. Electron. Agric. 2021, 189, 106377.
  30. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2017; Volume 30, pp. 5998–6008.
  31. Domhan, T. How much attention do you need? A granular analysis of neural machine translation architectures. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); Association for Computational Linguistics: Melbourne, Australia, 2018; Volume 1, pp. 1799–1808.
  32. Strubell, E.; Verga, P.; Andor, D.; Weiss, D.; McCallum, A. Linguistically-Informed Self-Attention for Semantic Role Labeling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018; Association for Computational Linguistics: Stroudsburg, PA, USA; pp. 5027–5038.
  33. Tao, C.; Gao, S.; Shang, M.; Wu, W.; Zhao, D.; Yan, R. Get the point of my utterance! learning towards effective responses with multi-head attention mechanism. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence (IJCAI-18), Stockholm, Sweden, 13–19 July 2018; International Joint Conferences on Artificial Intelligence Organization; pp. 4418–4424.
  34. Tang, G.; Müller, M.; Rios, A.; Sennrich, R. Why self-attention? A targeted evaluation of neural machine translation architectures. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018; Association for Computational Linguistics: Stroudsburg, PA, USA; pp. 4263–4272.
  35. Shang, Q.; Tang, Y.; Yin, L. A Hybrid model for missing traffic flow data imputation based on clustering and attention mechanism optimizing LSTM and AdaBoost. Sci. Rep. 2024, 14, 26473.
  36. Wild, J.; Kopecký, M.; Macek, M.; Šanda, M.; Jankovec, J.; Haase, T. Climate at ecologically relevant scales: A new temperature and soil moisture logger for long-term microclimate measurement. Agric. For. Meteorol. 2019, 268, 40–47.
  37. Feinberg, S.; Williams, R.; Hagler, G.S.W.; Rickard, J.; Brown, R.; Garver, D.; Harshfield, G.; Stauffer, P.; Mattson, E.; Judge, R.; et al. Long-term evaluation of air sensor technology under ambient conditions in Denver, Colorado. Atmos. Meas. Tech. 2018, 11, 4605–4615.
  38. U.S. Environmental Protection Agency. EPA Environmental Dataset Gateway. Available online: https://edg.epa.gov (accessed on 11 November 2025).
  39. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780.
  40. Laeeq, A.; Adeel, U.; Li, J.; Starkey, E. A Hybrid LSTM–Attention Approach for Missing Data Imputation in IoT Time Series. In Intelligent Data Engineering and Automated Learning–IDEAL 2025; Springer Nature: Cham, Switzerland, 2026; pp. 301–312.
  41. Cao, W.; Wang, D.; Li, J.; Zhou, H.; Li, Y.; Li, L. BRITS: Bidirectional recurrent imputation for time series. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, Montreal, QC, Canada, 3–8 December 2018; Curran Associates Inc.: Red Hook, NY, USA, 2018; NIPS’18, pp. 6776–6786.
Figure 1. Correlation matrix of the features in the SoilTemp dataset.
Figure 2. Correlation matrix of the features in the meteorological dataset.
Figure 3. (a) Correlation matrix of the features in the NO2/O3 dataset. (b) Correlation matrix of the features in the O3 dataset.
Figure 4. Proposed Hybrid model Architecture.
Figure 5. MAE and RMSE trends across increasing missingness levels for (a) SoilTemp, (b) Meteorological, (c) Air Quality NO2/O3, and (d) Air Quality O3 datasets.
Figure 6. MAE comparison of the Hybrid model, KNN, and BRITS across different levels of induced missingness for (a) SoilTemp, (b) Meteorological, (c) Air Quality NO2/O3, and (d) Air Quality O3 datasets.
Figure 7. RMSE comparison of the Hybrid model, KNN, and BRITS across different levels of induced missingness for (a) SoilTemp, (b) Meteorological, (c) Air Quality NO2/O3, and (d) Air Quality O3 datasets.
Figure 8. MAE comparison of different ablation variants for (a) SoilTemp, (b) Meteorological, (c) Air Quality NO2/O3, and (d) Air Quality O3 datasets.
Figure 9. RMSE comparison of different ablation variants for (a) SoilTemp, (b) Meteorological, (c) Air Quality NO2/O3, and (d) Air Quality O3 datasets.
Figure 10. Distribution comparison of target features in SoilTemp and Meteorological datasets.
Table 1. Model configuration and training parameters.

| Parameter | Value | Search Ranges/Justification |
|---|---|---|
| window_size | 30 | 10, 20, 30 |
| LSTM units | 64 | Widely used default |
| LSTM activation function | ReLU | Enhances gradient propagation and model stability |
| dropout_rate | 0.3 | 0.1, 0.3 |
| Number of attention heads | 6 | 2, 4, 6 |
| learning_rate | 0.001 | Widely used default |
| batch_size | 32 | Balances training stability and execution time |
| epochs | 100 | 50, 100 |
| Loss function | MSE | Penalises large errors and is useful for predictive tasks |
| Optimiser | Adam | Converges fast and requires less manual tuning |
| Evaluation metrics | RMSE, MAE, R² | Help assess overall accuracy and error distribution |

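The settings in Table 1 can be collected in a single configuration object. The sketch below is illustrative only: the class name `HybridConfig`, the helper `make_windows`, and the stride-1 sliding window are assumptions introduced here, since Table 1 gives the window length but not how the windows are constructed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HybridConfig:
    """Hyperparameter values as listed in Table 1 (names are hypothetical)."""
    window_size: int = 30
    lstm_units: int = 64
    lstm_activation: str = "relu"
    dropout_rate: float = 0.3
    num_attention_heads: int = 6
    learning_rate: float = 0.001
    batch_size: int = 32
    epochs: int = 100
    loss: str = "mse"
    optimiser: str = "adam"


def make_windows(series, window_size):
    """Split a sequence of time steps into overlapping windows.

    Assumption: a stride of 1 is used purely for illustration.
    Returns an empty list when the series is shorter than one window.
    """
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    return [series[i:i + window_size]
            for i in range(len(series) - window_size + 1)]


config = HybridConfig()

# 100 hypothetical time steps, each holding 3 feature values.
steps = [[float(t), float(t) * 2.0, float(t) * 3.0] for t in range(100)]
windows = make_windows(steps, config.window_size)

# Number of windows, time steps per window, features per time step.
print(len(windows), len(windows[0]), len(windows[0][0]))
```

With 100 time steps and a window of 30, this produces 71 windows of shape 30 × 3, which is the kind of input a recurrent layer would typically consume.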
Table 2. Cross-validation performance comparison of different imputation methods across datasets. Results are reported as mean ± standard deviation.

| Dataset | Method | MAE | RMSE | R² |
|---|---|---|---|---|
| SoilTemp | Hybrid | 0.1365 ± 0.0674 | 0.1620 ± 0.0720 | 0.9980 ± 0.0008 |
| SoilTemp | KNN | 0.2346 ± 0.1830 | 0.6178 ± 0.3454 | 0.9893 ± 0.0005 |
| SoilTemp | BRITS | 0.8799 ± 0.0330 | 1.9036 ± 0.0466 | 0.6936 ± 0.0034 |
| Meteorological | Hybrid | 1.2635 ± 0.4668 | 1.6526 ± 0.5918 | 0.9693 ± 0.0231 |
| Meteorological | KNN | 0.4783 ± 0.2083 | 1.2927 ± 0.3159 | 0.9821 ± 0.0081 |
| Meteorological | BRITS | 2.4775 ± 0.0231 | 5.5508 ± 0.0542 | 0.6899 ± 0.0064 |
| NO2/O3 | Hybrid | 0.9331 ± 0.1092 | 1.4211 ± 0.0891 | 0.9813 ± 0.0030 |
| NO2/O3 | KNN | 0.3682 ± 0.2145 | 1.0875 ± 0.3374 | 0.9844 ± 0.0079 |
| NO2/O3 | BRITS | 2.2976 ± 0.0199 | 4.9455 ± 0.0171 | 0.7037 ± 0.0020 |
| O3 | Hybrid | 0.0010 ± 0.0001 | 0.0017 ± 0.0001 | 0.9599 ± 0.0017 |
| O3 | KNN | 0.0005 ± 0.0002 | 0.0012 ± 0.0005 | 0.9770 ± 0.0164 |
| O3 | BRITS | 0.0023 ± 0.0001 | 0.0049 ± 0.0002 | 0.6943 ± 0.0031 |

Table 3. Ablation study design and purpose.

| Model | LSTM1 | LSTM2 | Attention | Purpose |
|---|---|---|---|---|
| Baseline | ✔ | ✔ | ✔ | Full proposed Hybrid model, used as the baseline for comparison |
| Ablation A | – | ✔ | ✔ | Test importance of pre-attention LSTM layer (LSTM1) |
| Ablation B | ✔ | – | ✔ | Test importance of post-attention LSTM layer (LSTM2) |
| Ablation C | ✔ | ✔ | – | Test contribution of attention and concatenation layers |

✔ indicates the component is included; – indicates the component is removed.
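The ablation design can be expressed as a small lookup from variant name to removed component. This is a minimal sketch rather than the authors' code: it assumes each variant removes exactly the component named in its "Purpose" entry in Table 3, and the identifiers `FULL_STACK`, `ABLATIONS` and `layers_for` are hypothetical. The concatenation layer mentioned for Ablation C is treated as part of the attention block.

```python
# Layer order of the full Hybrid model (LSTM-Attention-LSTM).
FULL_STACK = ("lstm1", "attention", "lstm2")

# Assumption: each ablation removes exactly the component named in the
# "Purpose" column of Table 3; the baseline removes nothing.
ABLATIONS = {
    "Baseline": set(),
    "Ablation A": {"lstm1"},
    "Ablation B": {"lstm2"},
    "Ablation C": {"attention"},
}


def layers_for(variant):
    """Return the ordered list of components kept in the given variant."""
    removed = ABLATIONS[variant]
    return [layer for layer in FULL_STACK if layer not in removed]


for name in ABLATIONS:
    print(f"{name}: {' -> '.join(layers_for(name))}")
```

Keeping the variants in one table-like structure makes it straightforward to loop over them with identical training and evaluation settings, so that any difference in error can be attributed to the removed component.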
Table 4. Performance of the model across varying missingness rates for different environmental datasets.

| Missing % | SoilTemp MAE | SoilTemp RMSE | SoilTemp R² | Meteorological MAE | Meteorological RMSE | Meteorological R² | Air Quality NO2/O3 MAE | Air Quality NO2/O3 RMSE | Air Quality NO2/O3 R² | Air Quality O3 MAE | Air Quality O3 RMSE | Air Quality O3 R² |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10 | 0.0960 | 0.1366 | 0.9764 | 0.9219 | 1.2350 | 0.9679 | 1.6996 | 2.2337 | 0.9265 | 0.0013 | 0.0021 | 0.9385 |
| 20 | 0.0962 | 0.1369 | 0.9763 | 0.9148 | 1.2264 | 0.9683 | 1.6876 | 2.2233 | 0.9273 | 0.0013 | 0.0021 | 0.9396 |
| 30 | 0.0962 | 0.1368 | 0.9764 | 0.9061 | 1.2159 | 0.9692 | 1.6846 | 2.2214 | 0.9279 | 0.0013 | 0.0021 | 0.9401 |
| 40 | 0.0966 | 0.1370 | 0.9765 | 0.9047 | 1.2091 | 0.9697 | 1.6867 | 2.2317 | 0.9271 | 0.0013 | 0.0021 | 0.9405 |
| 50 | 0.0968 | 0.1374 | 0.9765 | 0.9137 | 1.2219 | 0.9693 | 1.7101 | 2.2627 | 0.9251 | 0.0013 | 0.0022 | 0.9372 |
| 60 | 0.0984 | 0.1387 | 0.9764 | 0.9649 | 1.2992 | 0.9648 | 1.7508 | 2.3456 | 0.9194 | 0.0014 | 0.0023 | 0.9298 |
| 70 | 0.1020 | 0.1417 | 0.9767 | 1.0662 | 1.4390 | 0.9572 | 1.8198 | 2.4350 | 0.9127 | 0.0015 | 0.0025 | 0.9186 |
| 80 | 0.1125 | 0.1525 | 0.9762 | 1.2890 | 1.8602 | 0.9321 | 2.0577 | 2.7507 | 0.8909 | 0.0019 | 0.0031 | 0.8699 |
| 90 | 0.1818 | 0.2839 | 0.9391 | 1.6563 | 2.2737 | 0.9011 | 2.4585 | 3.2772 | 0.8432 | 0.0025 | 0.0039 | 0.7904 |

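The three metrics reported in Table 4 can be computed on artificially hidden entries as sketched below. This is an illustration under stated assumptions, not the evaluation code used in the study: the helpers `mask_at_random` and `imputation_metrics` are hypothetical, values are hidden uniformly at random (the actual missingness mechanism may differ), and a simple mean fill stands in for the imputation model.

```python
import math
import random


def mask_at_random(values, missing_rate, seed=0):
    """Hide a fraction of entries; return (masked_copy, hidden_indices).

    Assumption: entries are hidden uniformly at random. Hidden entries
    are replaced by None in the returned copy.
    """
    rng = random.Random(seed)
    n_hidden = round(len(values) * missing_rate)
    hidden = sorted(rng.sample(range(len(values)), n_hidden))
    masked = list(values)
    for i in hidden:
        masked[i] = None
    return masked, hidden


def imputation_metrics(truth, imputed, hidden):
    """Return (MAE, RMSE, R2) computed only on the hidden positions.

    R2 is undefined when the hidden true values are all identical.
    """
    y = [truth[i] for i in hidden]
    y_hat = [imputed[i] for i in hidden]
    errors = [a - b for a, b in zip(y, y_hat)]
    mae = sum(abs(e) for e in errors) / len(errors)
    ss_res = sum(e * e for e in errors)
    rmse = math.sqrt(ss_res / len(errors))
    mean_y = sum(y) / len(y)
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    r2 = 1.0 - ss_res / ss_tot
    return mae, rmse, r2


# Hypothetical signal with 30% of its values hidden.
truth = [math.sin(t / 5.0) for t in range(200)]
masked, hidden = mask_at_random(truth, missing_rate=0.3)

# Stand-in imputer: fill every gap with the mean of the observed values.
observed = [v for v in masked if v is not None]
fill = sum(observed) / len(observed)
imputed = [fill if v is None else v for v in masked]

mae, rmse, r2 = imputation_metrics(truth, imputed, hidden)
print(f"MAE={mae:.4f} RMSE={rmse:.4f} R2={r2:.4f}")
```

Restricting the metrics to the hidden positions matters: including observed entries, which the imputer simply copies, would make every method look better as the missingness rate falls.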
Table 5. MAE comparison of Hybrid, KNN, and BRITS imputation performance across varying missing rates for multiple environmental datasets.

| Missing % | SoilTemp Hybrid | SoilTemp KNN | SoilTemp BRITS | Meteorological Hybrid | Meteorological KNN | Meteorological BRITS | Air Quality NO2/O3 Hybrid | Air Quality NO2/O3 KNN | Air Quality NO2/O3 BRITS | Air Quality O3 Hybrid | Air Quality O3 KNN | Air Quality O3 BRITS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10 | **0.0960** | 0.2846 | 0.4136 | 0.9219 | **0.4915** | 3.0287 | 1.6996 | **0.8736** | 2.9713 | 0.0013 | **0.0012** | 0.0028 |
| 20 | **0.0962** | 0.3439 | 0.4963 | 0.9148 | **0.5537** | 3.6159 | 1.6876 | **1.1250** | 3.5071 | **0.0013** | 0.0018 | 0.0033 |
| 30 | **0.0962** | 0.4045 | 0.5733 | 0.9061 | **0.6051** | 4.2326 | 1.6846 | **1.3790** | 4.0402 | **0.0013** | 0.0024 | 0.0037 |
| 40 | **0.0966** | 0.4606 | 0.6561 | 0.9047 | **0.6744** | 4.9117 | 1.6867 | **1.6065** | 4.6357 | **0.0013** | 0.0030 | 0.0043 |
| 50 | **0.0968** | 0.5137 | 0.7269 | 0.9137 | **0.7324** | 5.5871 | **1.7101** | 1.8245 | 5.2438 | **0.0013** | 0.0036 | 0.0048 |
| 60 | **0.0984** | 0.5767 | 0.8080 | 0.9649 | **0.8205** | 6.2903 | **1.7508** | 2.0901 | 5.8750 | **0.0014** | 0.0042 | 0.0054 |
| 70 | **0.1020** | 0.6355 | 0.8800 | 1.0662 | **0.8679** | 6.9295 | **1.8198** | 2.3585 | 6.4875 | **0.0015** | 0.0049 | 0.0059 |
| 80 | **0.1125** | 0.6901 | 0.9573 | 1.2890 | **0.9477** | 7.6072 | **2.0577** | 2.5980 | 7.0493 | **0.0019** | 0.0054 | 0.0064 |
| 90 | **0.1818** | 0.7540 | 1.0380 | 1.6563 | **1.0320** | 8.2899 | **2.4585** | 2.8471 | 7.5503 | **0.0025** | 0.0060 | 0.0069 |

Bold values denote the best-performing method for each missing % and dataset.
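The percentage reductions quoted in the abstract follow directly from pairs of entries in Table 5. As a worked example, the sketch below computes the relative MAE reduction of the Hybrid model over each baseline for SoilTemp at 50% missingness; the function name `relative_reduction` is introduced here for illustration.

```python
def relative_reduction(baseline_error, model_error):
    """Percentage by which model_error is lower than baseline_error.

    A negative result means the baseline has the lower error.
    """
    return 100.0 * (baseline_error - model_error) / baseline_error


# MAE values copied from Table 5: SoilTemp, 50% missingness.
hybrid, knn, brits = 0.0968, 0.5137, 0.7269

print(f"vs KNN:   {relative_reduction(knn, hybrid):.1f}%")    # 81.2%
print(f"vs BRITS: {relative_reduction(brits, hybrid):.1f}%")  # 86.7%
```

The same calculation yields a negative value wherever a baseline is ahead. For the Meteorological dataset at 50% missingness, for instance, KNN (0.7324) has a lower MAE than the Hybrid model (0.9137), giving a reduction of about −24.8%.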
Table 6. RMSE comparison of Hybrid, KNN, and BRITS imputation performance across varying missing rates for multiple environmental datasets.

| Missing % | SoilTemp Hybrid | SoilTemp KNN | SoilTemp BRITS | Meteorological Hybrid | Meteorological KNN | Meteorological BRITS | Air Quality NO2/O3 Hybrid | Air Quality NO2/O3 KNN | Air Quality NO2/O3 BRITS | Air Quality O3 Hybrid | Air Quality O3 KNN | Air Quality O3 BRITS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10 | **0.1366** | 0.6698 | 0.8168 | **1.2350** | 1.2569 | 6.2026 | 2.2337 | **1.9988** | 6.4608 | **0.0021** | 0.0032 | 0.0053 |
| 20 | **0.1369** | 0.7338 | 0.8941 | **1.2264** | 1.3149 | 6.7579 | **2.2233** | 2.3625 | 6.9669 | **0.0021** | 0.0041 | 0.0058 |
| 30 | **0.1368** | 0.7996 | 0.9599 | **1.2159** | 1.3406 | 7.3109 | **2.2214** | 2.6911 | 7.4573 | **0.0021** | 0.0049 | 0.0061 |
| 40 | **0.1370** | 0.8460 | 1.0246 | **1.2091** | 1.4005 | 7.9559 | **2.2317** | 2.8881 | 8.0107 | **0.0021** | 0.0056 | 0.0066 |
| 50 | **0.1374** | 0.8919 | 1.0784 | **1.2219** | 1.3953 | 8.4931 | **2.2627** | 3.0765 | 8.5289 | **0.0022** | 0.0062 | 0.0070 |
| 60 | **0.1387** | 0.9520 | 1.1379 | **1.2992** | 1.4730 | 9.0417 | **2.3456** | 3.3174 | 9.0690 | **0.0023** | 0.0067 | 0.0075 |
| 70 | **0.1417** | 1.0026 | 1.1845 | **1.4390** | 1.4569 | 9.4659 | **2.4350** | 3.5606 | 9.5044 | **0.0025** | 0.0072 | 0.0078 |
| 80 | **0.1525** | 1.0453 | 1.2347 | 1.8602 | **1.5076** | 9.9458 | **2.7507** | 3.7530 | 9.8974 | **0.0031** | 0.0077 | 0.0081 |
| 90 | **0.2839** | 1.0947 | 1.2862 | 2.2737 | **1.5633** | 10.4219 | **3.2772** | 3.9535 | 10.2315 | **0.0039** | 0.0081 | 0.0084 |

Bold values denote the best-performing method for each missing % and dataset.
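Table 7. MAE comparison across datasets for different ablation variants.

(a) Ablation results: SoilTemp and Meteorological datasets

| Missing % | SoilTemp Ablation A | SoilTemp Ablation B | SoilTemp Ablation C | SoilTemp Baseline | Meteorological Ablation A | Meteorological Ablation B | Meteorological Ablation C | Meteorological Baseline |
|---|---|---|---|---|---|---|---|---|
| 10 | 0.1959 | 1.6806 | 0.1578 | **0.0960** | 1.0407 | 3.2101 | 1.2480 | **0.9219** |
| 20 | 0.1958 | 1.6807 | 0.1579 | **0.0962** | 1.0388 | 3.2064 | 1.2432 | **0.9148** |
| 30 | 0.1960 | 1.6807 | 0.1578 | **0.0962** | 1.0336 | 3.2022 | 1.2379 | **0.9061** |
| 40 | 0.1960 | 1.6807 | 0.1578 | **0.0966** | 1.0356 | 3.1929 | 1.2397 | **0.9047** |
| 50 | 0.1957 | 1.6811 | 0.1572 | **0.0968** | 1.0538 | 3.1913 | 1.2533 | **0.9137** |
| 60 | 0.1951 | 1.6817 | 0.1564 | **0.0984** | 1.1272 | 3.1968 | 1.2926 | **0.9649** |
| 70 | 0.1938 | 1.6803 | 0.1580 | **0.1020** | 1.2346 | 3.2467 | 1.3969 | **1.0662** |
| 80 | 0.1925 | 1.6798 | 0.1594 | **0.1125** | 1.4817 | 3.2897 | 1.5808 | **1.2890** |
| 90 | 0.2083 | 1.6609 | 0.2016 | **0.1818** | 1.8781 | 3.3611 | 1.9170 | **1.6563** |

(b) Ablation results: Air Quality NO2/O3 and O3 datasets

| Missing % | Air Quality NO2/O3 Ablation A | Air Quality NO2/O3 Ablation B | Air Quality NO2/O3 Ablation C | Air Quality NO2/O3 Baseline | Air Quality O3 Ablation A | Air Quality O3 Ablation B | Air Quality O3 Ablation C | Air Quality O3 Baseline |
|---|---|---|---|---|---|---|---|---|
| 10 | 1.8329 | 2.1210 | 1.7326 | **1.6996** | 0.0016 | **0.0013** | 0.0015 | **0.0013** |
| 20 | 1.8297 | 2.0909 | 1.7224 | **1.6876** | 0.0016 | **0.0013** | 0.0015 | **0.0013** |
| 30 | 1.8237 | 2.0764 | 1.7160 | **1.6846** | 0.0016 | **0.0013** | 0.0015 | **0.0013** |
| 40 | 1.8318 | 2.0726 | 1.7151 | **1.6867** | 0.0016 | **0.0013** | 0.0015 | **0.0013** |
| 50 | 1.8453 | 2.1054 | 1.7220 | **1.7101** | 0.0017 | 0.0014 | 0.0016 | **0.0013** |
| 60 | 1.8830 | 2.1407 | 1.7542 | **1.7508** | 0.0017 | **0.0014** | 0.0017 | **0.0014** |
| 70 | 1.9494 | 2.1932 | 1.8330 | **1.8198** | 0.0018 | **0.0015** | 0.0018 | **0.0015** |
| 80 | 2.1768 | 2.4507 | **2.0482** | 2.0577 | 0.0022 | 0.0020 | 0.0022 | **0.0019** |
| 90 | 2.5378 | 2.8384 | 2.4801 | **2.4585** | 0.0027 | 0.0026 | 0.0028 | **0.0025** |

Bold values denote the best-performing ablation variant for each missing % and dataset.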
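Table 8. RMSE comparison across datasets for different ablation variants.

(a) Ablation results: SoilTemp and Meteorological datasets

| Missing % | SoilTemp Ablation A | SoilTemp Ablation B | SoilTemp Ablation C | SoilTemp Baseline | Meteorological Ablation A | Meteorological Ablation B | Meteorological Ablation C | Meteorological Baseline |
|---|---|---|---|---|---|---|---|---|
| 10 | 0.3223 | 1.6967 | 0.1691 | **0.1366** | 1.3858 | 3.6050 | 1.5623 | **1.2350** |
| 20 | 0.3223 | 1.6968 | 0.1692 | **0.1369** | 1.3842 | 3.5984 | 1.5560 | **1.2264** |
| 30 | 0.3225 | 1.6969 | 0.1691 | **0.1368** | 1.3778 | 3.5937 | 1.5486 | **1.2159** |
| 40 | 0.3223 | 1.6970 | 0.1692 | **0.1370** | 1.3799 | 3.5889 | 1.5465 | **1.2091** |
| 50 | 0.3218 | 1.6974 | 0.1689 | **0.1374** | 1.4013 | 3.5933 | 1.5628 | **1.2219** |
| 60 | 0.3206 | 1.6982 | 0.1682 | **0.1387** | 1.5086 | 3.6223 | 1.6260 | **1.2992** |
| 70 | 0.3194 | 1.6972 | 0.1704 | **0.1417** | 1.6542 | 3.7123 | 1.7747 | **1.4390** |
| 80 | 0.3181 | 1.6983 | 0.1753 | **0.1525** | 2.0593 | 3.8459 | 2.0939 | **1.8602** |
| 90 | 0.3341 | 1.6941 | **0.2808** | 0.2839 | 2.5452 | 4.0851 | 2.5301 | **2.2737** |

(b) Ablation results: Air Quality NO2/O3 and O3 datasets

| Missing % | Air Quality NO2/O3 Ablation A | Air Quality NO2/O3 Ablation B | Air Quality NO2/O3 Ablation C | Air Quality NO2/O3 Baseline | Air Quality O3 Ablation A | Air Quality O3 Ablation B | Air Quality O3 Ablation C | Air Quality O3 Baseline |
|---|---|---|---|---|---|---|---|---|
| 10 | 2.3985 | 2.7895 | 2.2888 | **2.2337** | 0.0024 | **0.0021** | 0.0023 | **0.0021** |
| 20 | 2.3977 | 2.7614 | 2.2757 | **2.2233** | 0.0024 | **0.0021** | 0.0023 | **0.0021** |
| 30 | 2.3887 | 2.7454 | 2.2655 | **2.2214** | 0.0024 | **0.0021** | 0.0023 | **0.0021** |
| 40 | 2.4025 | 2.7404 | 2.2721 | **2.2317** | 0.0024 | **0.0021** | 0.0023 | **0.0021** |
| 50 | 2.4217 | 2.7683 | 2.2784 | **2.2627** | 0.0025 | **0.0022** | 0.0024 | **0.0022** |
| 60 | 2.4890 | 2.8358 | **2.3456** | **2.3456** | 0.0026 | **0.0023** | 0.0025 | **0.0023** |
| 70 | 2.5655 | 2.9053 | 2.4483 | **2.4350** | 0.0027 | **0.0025** | 0.0027 | **0.0025** |
| 80 | 2.8501 | 3.2400 | **2.7476** | 2.7507 | 0.0033 | 0.0032 | 0.0033 | **0.0031** |
| 90 | 3.3012 | 3.7539 | 3.3221 | **3.2772** | 0.0040 | 0.0040 | 0.0041 | **0.0039** |

Bold values denote the best-performing ablation variant for each missing % and dataset.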
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.


MDPI and ACS Style

Laeeq, A.; Li, J.; Adeel, U. A Hybrid LSTM–Attention Model for Multivariate Time Series Imputation: Evaluation on Environmental Datasets. Mach. Learn. Knowl. Extr. 2026, 8, 18. https://doi.org/10.3390/make8010018
