Pattern-Aware BiLSTM Framework for Imputation of Missing Data in Solar Photovoltaic Generation

Jang, Minseok; Joo, Sung-Kwan

doi:10.3390/en18174734


by Minseok Jang and Sung-Kwan Joo *

The School of Electrical Engineering, Korea University, Seoul 02841, Republic of Korea

* Author to whom correspondence should be addressed.

Energies 2025, 18(17), 4734; https://doi.org/10.3390/en18174734

Submission received: 26 July 2025 / Revised: 31 August 2025 / Accepted: 3 September 2025 / Published: 5 September 2025

(This article belongs to the Special Issue Renewable Energy Development in Distribution Networks: Optimization, Assessment and Design of Renewable Plants)


Abstract

Accurate data on solar photovoltaic (PV) generation is essential for predicting energy production and managing distributed energy resources (DERs) effectively. Such data also plays a crucial role in ensuring that DERs operate safely and economically within modern power distribution systems. Missing values, which may be caused by sensor faults, communication failures, or environmental disturbances, pose a significant challenge for distribution system operators (DSOs) performing state estimation, optimal dispatch, and voltage regulation. This paper proposes a Pattern-Aware Bidirectional Long Short-Term Memory (PA-BiLSTM) model for solar generation imputation to address this challenge. In contrast to conventional convolution-based approaches such as the Convolutional Autoencoder and U-Net, the proposed framework integrates a 1D convolutional module to capture local temporal patterns with a bidirectional recurrent architecture to model long-term dependencies. The model was evaluated in realistic block–random missing scenarios (1 h, 2 h, 3 h, and 4 h gaps) using 5 min resolution PV data from 50 sites across 11 regions in South Korea. The numerical results show that the PA-BiLSTM model consistently outperforms the baseline methods. For example, with a gap of one hour, it achieves an MAE of 0.0123 and an R² value of 0.98, and it reduces the average MSE by up to around 15% compared to the baseline models. Even under 4 h gaps, the model maintains robust accuracy (MAE = 0.070, R² = 0.66). These results indicate that accurate, pattern-aware imputation is an important enabling technology for DER-centric distribution system operations, supporting more reliable grid monitoring and control.

Keywords:

distributed energy resources (DERs); renewable energy; photovoltaic; solar PV; imputation; missing pattern; LSTM; BiLSTM

1. Introduction

The global solar power market has seen rapid expansion, with installed capacity growing from 40 GW in 2010 to over 2 TW in 2024, contributing 6.9% to global electricity production [1]. Solar PV has become central to the RE100 initiative, supporting corporate decarbonization, and in South Korea, recent policy efforts aim to establish RE100 industrial zones powered entirely by renewables [2,3]. Against this backdrop of rapid deployment and policy support, accurate solar generation forecasting and operational optimization have become indispensable.

Accurate imputation of missing solar generation data is critical, given the significant role of solar power in global renewable energy initiatives and the challenges associated with its inherent variability and uncertainty. Recent literature emphasizes that the rapid expansion of renewable energy introduces complexities such as increased system integration costs due to output variability, which makes robust data imputation necessary for maintaining grid stability and optimizing operational efficiency [4,5].

Furthermore, uncertainties arising from the variability of renewable energy generation, including forecast errors and the impact of weather conditions, demonstrate the need for reliable imputation methods that can accurately capture spatial and temporal dependencies [5]. Additionally, economic analysis emphasizes the importance of accurately managing and mitigating uncertainties in renewable generation, particularly in relation to the need for reliable data for informed decision-making in infrastructure planning and investment [6,7].

However, in practice, datasets from solar generation are frequently incomplete due to a variety of operational and technical issues. The prevalence of missing data poses a considerable challenge, as it has been shown to have a detrimental effect on the accuracy of analytical models, especially those employed in the field of power generation forecasting [8]. The problem is so widespread that some reports suggest it is common for PV production databases to have substantial portions of data missing, in some cases up to 40% [9]. Multiple factors can cause missing solar generation data. Hardware failures in pyranometers, temperature sensors and inverters are frequently the cause of such issues. Data transmission from the facility to the central database may experience interruptions, potentially leading to substantial gaps in recorded information. Faults within data-logging and storage systems may cause corruption or loss of information [8,10].

In recent studies, deep learning or machine learning methods have shown promising results for solar generation data or electrical data imputation. In [11], machine learning models, specifically Random Forest and Gradient Boosting, are applied to impute missing values in solar generation time series. This involves introducing synthetic gaps into real-world photovoltaic plant data and training both algorithms using environmental predictors such as irradiance, temperature, humidity, and wind speed.

Building on the GAN paradigm, SolarGAN introduces a multivariate imputation framework based on Generative Adversarial Networks to reconstruct missing solar irradiance data by modeling spatiotemporal correlations among multiple stations. The generator learns the joint distribution of observed and missing data using neighboring station values as conditional context, while the discriminator distinguishes between real and imputed sequences. A temporal convolutional layer is employed to capture local patterns, enhancing the model’s ability to impute structured gaps [12].

Furthermore, a conditional Wasserstein GAN with gradient penalty (WGAN-GP) is introduced for photovoltaic (PV) power imputation. The generator integrates both random noise and historical context to synthesize missing data, while the discriminator guides learning by enforcing temporal consistency. These features demonstrate strong spectral and statistical similarity to the original PV data, enhancing the applicability of this approach in forecasting and distributed energy resource operations [13].

Beyond GAN-based methods, autoencoder variants have also been proposed to handle missing or limited data. A denoising masked autoencoder (DMAE) has been proposed for imputing missing values in electric load data collected in constrained environments. The DMAE combines the strengths of denoising autoencoders, which handle short time series, and masked autoencoders, which manage high proportions of missing data, to learn robust load reconstruction patterns [14,15].

Complementarily, an improved Variational Autoencoder (VAE) has been designed for small-sample PV power augmentation. By enhancing latent space learning and employing structured sampling with noise injection, it generates realistic synthetic sequences that boost downstream solar forecasting performance [16]. Collectively, these approaches underscore the value of generative modeling in recovering data integrity and enriching datasets for reliable renewable energy analytics.

The Pattern-Aware BiLSTM (PA-BiLSTM) proposed in this paper has several key advantages. Firstly, by incorporating a 1D convolutional module to extract local temporal structures and a bidirectional LSTM to capture long-range dependencies, the proposed framework offers a comprehensive approach to both short-term patterns (e.g., morning ramp and midday peak) and global temporal information throughout the entire sequence. Secondly, PCA-based seasonal embeddings are employed to explicitly encode seasonal characteristics. This allows for a data-driven representation of seasonal variability, which has not been commonly considered in previous GAN- or AE-based models. Thirdly, instead of min–max scaling with the observed maximum, each time series is scaled by the generator’s rated maximum 5 min output, derived from its capacity (kW). This capacity-based factor remains stable over time, prevents normalized values from exceeding 1 under new weather conditions, and embeds a meaningful physical upper bound into the learning objective. Finally, empirical results under block–random missing scenarios (one-hour, two-hour, three-hour, and four-hour continuous gaps) confirm the robustness of the PA-BiLSTM, particularly for long consecutive gaps. This underscores its practical relevance for the operation of distribution systems involving distributed energy resources (DERs), where accurate, temporally coherent imputation is critical for reliable state estimation and dispatch.

The remainder of this paper is organized as follows. In Section 2, a review of existing time-series imputation methods for solar generation systems is presented. Section 3 presents the proposed PA-BiLSTM-based framework for imputing missing solar data. Section 4 presents the results of experiments conducted using multiple generated missing solar datasets, evaluating the effectiveness of the proposed method. Finally, Section 5 concludes the paper and outlines future research directions.

2. Related Work

Recent advances in deep learning have resulted in substantial improvements in the management of missing values in time-series data. A substantial number of studies have employed bidirectional recurrent architectures, generative models, and temporal decay mechanisms to capture temporal dependencies and missingness patterns.

In [17], Ma et al. introduced LSTM-BIT, a bi-directional LSTM framework that combines forward and backward sequence learning with transfer learning. The approach utilizes a pre-trained model from a source building and employs input transfer techniques to mitigate the effect of continuous missing blocks in target domains. While the model has been demonstrated to be effective for large-scale building energy datasets, it does not explicitly capture local temporal patterns prior to sequential modeling.

In [18], Bidirectional Recurrent Imputation (BRITS) is introduced. It is a framework for multivariate time-series data which treats missing values as learnable variables during optimization. It performs both forward and backward passes and enforces consistency between the two, allowing for dynamic imputation. However, BRITS does not exploit localized structures using convolutional mechanisms.

GRU-D is a gated recurrent architecture developed for multivariate clinical time series with missing values [19]. GRU-D incorporates decay mechanisms for both input variables and hidden states, leveraging masking and time intervals to model informative missingness. While effective in healthcare applications, the model is not inherently bi-directional and does not explicitly enhance local temporal features.

In [20], a novel Style Transfer GAN architecture is proposed, which integrates Bi-LSTM to generate synthetic photovoltaic (PV) time series across different weather domains. Although the primary focus of this method is on data simulation rather than imputation, it enhances the temporal realism in augmented datasets through the utilization of domain transfer.

In contrast, the proposed PA-BiLSTM augments input sequences with Conv1D-based local pattern extraction prior to the application of a bi-directional LSTM. This design facilitates the capture of both short-range and long-range temporal dependencies by the model, in a manner that is both efficient in terms of computation and effective in its application.

3. The Proposed PA-BiLSTM Method

This study presents a comprehensive framework for imputing missing values in solar photovoltaic (PV) generation data using deep learning models. The process comprises eight structured steps, encompassing data preparation, preprocessing, feature engineering, model training, and evaluation. The steps, summarized in Figure 1, are as follows.

Step 1. Raw data and weather parsing: Collect 5 min interval solar generation data from PV sites and hourly ASOS weather data from the Korea Meteorological Administration (KMA), then synchronize them by timestamp.

Step 2. Normalization of solar generation by capacity-based maximum generation: Calculate the 5 min maximum generation for each site based on its installed capacity, then normalize each site’s generation data by dividing by this maximum, scaling values to the [0, 1] range.

Step 3. Standardization of the weather data: Standardize meteorological variables (z-score scaling) to ensure zero mean and unit variance.

Step 4. Feature embedding: Embed region, time information (08:00~18:00), and seasonal trends using PCA to enrich spatiotemporal features.

Step 5. Seasonally balanced sampling: Split the dataset into training, validation, and test sets with equal seasonal distribution to ensure generalization.

Step 6. Missing pattern generation: Create diverse missing data patterns (block, random) to simulate realistic imputation scenarios.

Step 7. Imputation model training and comparison: Train and compare multiple deep learning models (e.g., Convolutional Autoencoder, U-Net) under consistent conditions.

Step 8. Model evaluation on the test set: Evaluate model performance using MSE on unseen test data with generated missingness.

3.1. Normalization of Solar Generation Data

Normalizing solar generation data is crucial for model training due to varying scales across different generators. While min–max scaling is common, using the observed maximum value from a dataset presents two key challenges. First, the observed maximum can be skewed by transient weather conditions, leading to an underestimation of the true maximum capacity. This can cause normalized values to exceed 1 during testing with different weather patterns. Second, the maximum value is not static and changes as new data is collected, requiring frequent re-normalization of the entire dataset, which is inefficient.

To address these issues, a robust normalization method based on maximum generation is applied. This value is derived from each generator’s capacity (kW) and represents the maximum possible energy output in a 5 min interval, ensuring a consistent and physically grounded scaling factor. The maximum generation is calculated as Equation (1).

\mathrm{MaxGeneration}_{5\,\mathrm{min}} = \mathrm{Capacity}_{\mathrm{kW}} \times \frac{5\ \mathrm{min}}{60\ \mathrm{min/h}}

(1)

Subsequently, the observed generation is normalized as in Equation (2).

\mathrm{SolarGeneration}_{\mathrm{normalized}, t} = \frac{\mathrm{SolarGeneration}_{\mathrm{observed}, t}}{\mathrm{MaxGeneration}_{5\,\mathrm{min}}}

(2)

This physics-based approach ensures that normalization is robust against anomalous weather conditions and remains scalable and consistent as new data is introduced.
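As a minimal sketch of Equations (1) and (2), the capacity-based scaling can be written in a few lines of Python. The function names and the 120 kW example site are hypothetical and serve only to illustrate the arithmetic; the observed values are assumed to be energy produced per 5 min interval.

```python
def max_generation_5min(capacity_kw):
    """Maximum energy (kWh) a generator can produce in one 5 min interval, Equation (1)."""
    return capacity_kw * 5.0 / 60.0


def normalize_generation(observed, capacity_kw):
    """Divide observed 5 min generation values by the capacity-based maximum, Equation (2)."""
    scale = max_generation_5min(capacity_kw)
    return [value / scale for value in observed]


# Hypothetical 120 kW site: at most 10 kWh can be produced in any 5 min interval.
normalized = normalize_generation([0.0, 2.5, 10.0], capacity_kw=120.0)
```

Because the scaling factor depends only on the rated capacity, it does not need to be recomputed when new measurements arrive.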

3.2. Feature Embedding

In order to effectively capture the spatiotemporal and environmental context of solar power generation, three types of feature embeddings were designed and incorporated: region, time, and season. These embeddings provide the model with structured information beyond raw measurements, thereby enhancing its capacity to learn from high-dimensional and correlated input features.

3.2.1. Region Embedding

In considering the heterogeneity of PV generators across regions, each site was mapped to its corresponding administrative region. A unique 8-dimensional dense vector was randomly assigned to each region, and generator–to–region mapping was used to attach the appropriate region embedding to every sample. This approach facilitates the implicit learning of regional effects by the model, including local irradiance patterns, atmospheric conditions, and infrastructure variations.

3.2.2. Time Embedding

To represent the cyclical nature of solar power generation over time, each 5 min interval was encoded using a sinusoidal embedding based on the hour and minute values [21]. The time embedding vector $e_{t} \in \mathbb{R}^{4}$ for each time step t is defined in Equation (3):

e_{t} = \left[ \sin\left(2 π \cdot \frac{h_{t}}{24}\right),\ \cos\left(2 π \cdot \frac{h_{t}}{24}\right),\ \sin\left(2 π \cdot \frac{m_{t}}{60}\right),\ \cos\left(2 π \cdot \frac{m_{t}}{60}\right) \right]

(3)

where $h_{t}$ and $m_{t}$ denote the hour and minute corresponding to time step t, respectively.

This 4-dimensional time embedding smoothly represents daily cycles and allows the model to learn time-dependent patterns in solar generation without the need for discrete or categorical encoding.
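The embedding in Equation (3) can be sketched as follows. The helper name `time_embedding` is illustrative, and the example assumes a daily window of 120 steps starting at 08:00, consistent with the 5 min resolution used in this study.

```python
import math


def time_embedding(hour, minute):
    """Return the 4-dimensional sinusoidal time embedding of Equation (3)."""
    return [
        math.sin(2 * math.pi * hour / 24),
        math.cos(2 * math.pi * hour / 24),
        math.sin(2 * math.pi * minute / 60),
        math.cos(2 * math.pi * minute / 60),
    ]


# Embeddings for 120 consecutive 5 min steps, assumed to start at 08:00.
day = [time_embedding(8 + step // 12, (step % 12) * 5) for step in range(120)]
```

Each component lies in [-1, 1], so the embedding needs no further scaling before being concatenated with the other features.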

3.2.3. PCA-Based Seasonal Embedding

Seasonal effects play a crucial role in shaping solar generation due to variations in the sun angle, daylight duration, and prevailing weather conditions [22]. In this study, generator-specific seasonal embeddings are constructed using a data-driven approach based on Principal Component Analysis (PCA) to quantitatively capture seasonal patterns. For each season (spring, summer, fall, and winter), a set of 20 features is extracted from the raw 5 min resolution generation data. These include:

(10 features) Hourly mean generation profiles for the 10 h period from 08:00 to 17:00.
Overall mean and standard deviation, as defined in Equations (4) and (5).
The maximum value shown in Equation (6) and the 25th and 75th percentiles shown in Equations (7) and (8) to describe the range and spread.
The skewness and kurtosis of the distribution, as shown in Equations (9) and (10).
The proportion of high-generation (>0.8) and low-generation (<0.1) intervals, calculated according to Equations (11) and (12).

μ_{G} = \frac{1}{N_{s}} \sum_{i = 1}^{N_{s}} G_{i}

(4)

where $μ_{G}$ is the mean generation, $N_{s}$ is the total number of data points in the season, and $G_{i}$ is the i-th solar generation value.

σ_{G} = \sqrt{\frac{1}{N_{s} - 1} \sum_{i = 1}^{N_{s}} {(G_{i} - μ_{G})}^{2}}

(5)

where $σ_{G}$ is the sample standard deviation of the solar generation.

G_{max} = \max_{i} (G_{i})

(6)

where $G_{max}$ is the maximum value in the set of generation data for the season.

Q_{1} (G_{i})

(7)

Q_{3} (G_{i})

(8)

where $Q_{1}(G_{i})$ and $Q_{3}(G_{i})$ are the first (25th percentile) and third (75th percentile) quartiles of the generation data, respectively.

\mathrm{Skewness}(G_{i})

(9)

\mathrm{Kurtosis}(G_{i})

(10)

where these functions calculate the skewness and kurtosis of the generation data distribution.

R_{> 0.8} = \frac{|\{i \mid G_{i} > 0.8\}|}{N_{s}}

(11)

where $R_{> 0.8}$ is the ratio of time intervals in which generation exceeds 0.8.

R_{< 0.1} = \frac{|\{i \mid G_{i} < 0.1\}|}{N_{s}}

(12)

where $R_{< 0.1}$ is the ratio of time intervals in which generation falls below 0.1.
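The distributional features in Equations (4) to (12) can be sketched with the Python standard library as shown below. The hourly mean profiles are not included. Several details are assumptions rather than the authors' stated choices: the quartiles use the "inclusive" interpolation method, and skewness and kurtosis are computed from population moments, with kurtosis not reduced by 3. The function name is hypothetical.

```python
import statistics


def seasonal_distribution_features(values):
    """Distributional statistics of normalized generation values for one generator-season."""
    n = len(values)
    mean = statistics.fmean(values)                       # Equation (4)
    std = statistics.stdev(values)                        # Equation (5), divisor N_s - 1
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")  # Equations (7), (8)

    # Assumption: moment-based skewness and non-excess kurtosis (Equations (9), (10)).
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skewness = m3 / m2 ** 1.5 if m2 > 0 else 0.0
    kurtosis = m4 / m2 ** 2 if m2 > 0 else 0.0

    return {
        "mean": mean,
        "std": std,
        "max": max(values),                               # Equation (6)
        "q1": q1,
        "q3": q3,
        "skewness": skewness,
        "kurtosis": kurtosis,
        "ratio_high": sum(v > 0.8 for v in values) / n,   # Equation (11)
        "ratio_low": sum(v < 0.1 for v in values) / n,    # Equation (12)
    }


# Toy example with five normalized values.
features = seasonal_distribution_features([0.0, 0.05, 0.5, 0.9, 1.0])
```

Other conventions (for example, bias-corrected skewness or excess kurtosis) would shift the raw feature values but, after standardization, are unlikely to change the qualitative structure of the embedding.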

Let $x_{g, s} \in \mathbb{R}^{20}$ denote the raw feature vector for generator g in season s. All feature vectors were standardized across all generators and seasons, yielding $x'_{g, s}$. Principal Component Analysis (PCA) was subsequently used to reduce the 20-dimensional vectors to 4 dimensions, as presented in Equation (13).

z_{g, s} = \mathrm{PCA}_{4}(x'_{g, s}) \in \mathbb{R}^{4}

(13)

The resulting vector $z_{g, s}$ serves as a seasonal embedding that captures dominant generation patterns specific to each generator–season pair. This process ensures that each embedding reflects both the temporal structure and statistical behavior of generation in its respective seasonal context.

The feature vectors thus combine distributional and temporal statistics. After standardization and PCA, the top principal components were selected to form a compact and informative seasonal embedding vector. This study used a 4-dimensional embedding to retain most of the variance and clearly represent key seasonal patterns.
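For concreteness, one standard way to write out the standardization and projection behind Equation (13) is sketched below. It assumes that PCA is fitted on the standardized feature vectors of all generator–season pairs. The symbols $\bar{x}_{j}$, $s_{j}$, $n$, $\Sigma$, and $W_{4}$ are introduced here for illustration only and are not part of the original formulation.

```latex
\begin{aligned}
x'_{g,s,j} &= \frac{x_{g,s,j} - \bar{x}_{j}}{s_{j}}, \qquad j = 1, \dots, 20, \\
\Sigma &= \frac{1}{n - 1} \sum_{(g,s)} x'_{g,s} \, {x'_{g,s}}^{\top} \in \mathbb{R}^{20 \times 20}, \\
z_{g,s} &= W_{4}^{\top} x'_{g,s}, \qquad W_{4} = [w_{1}, w_{2}, w_{3}, w_{4}] \in \mathbb{R}^{20 \times 4}
\end{aligned}
```

Here $\bar{x}_{j}$ and $s_{j}$ are the mean and standard deviation of feature j over all generator–season pairs, $n$ is the number of such pairs, and $w_{1}, \dots, w_{4}$ are the eigenvectors of $\Sigma$ associated with its four largest eigenvalues.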

In Figure 2b, Principal Component 1 (PC1) captures the overall magnitude of solar generation, distinguishing high-output seasons (e.g., summer) from low-output ones (e.g., winter), while Principal Component 2 (PC2) reflects differences in generation pattern shape, separating transitional seasons (spring, fall) from extremes (summer, winter).

3.3. Proposed Pattern-Aware Bidirectional Long Short-Term Memory (PA-BiLSTM)

This study utilizes a 1D convolution layer as a pattern-aware mechanism to recognize temporal structures within the input sequence $x_{t}$, where $x_{t}$ represents a set of features related to solar power generation. The purpose of this module is to transform the multivariate input time series into a set of local features as follows:

x_{t}^{conv} = σ (W_{conv} * x_{t} + b_{conv})

(14)

where $*$ denotes the 1D convolution operation, $W_{conv} \in \mathbb{R}^{C_{filter} \times C \times k}$ is the convolution kernel with window size $k$, and $σ$ represents a non-linear activation function such as the Rectified Linear Unit (ReLU). The output of the Conv1D layer is concatenated with the original input to form a composite representation in Equation (15).

X_{t}^{combined} = [x_{t}; x_{t}^{conv}] \in \mathbb{R}^{(C + C_{filter}) \times T}

(15)
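To make Equations (14) and (15) concrete, the following pure-Python sketch applies a multi-channel 1D convolution with ReLU and then stacks the result onto the original channels. It is an illustration under stated assumptions, not the implementation used in this study: zero "same" padding, an odd kernel width, and the cross-correlation convention common in deep learning frameworks are all assumed, and the function names and toy kernel are hypothetical.

```python
def conv1d_same_relu(x, weights, bias):
    """Apply Equation (14) to x.

    x:       C channels, each a list of T values.
    weights: C_filter filters, each holding C kernels of odd width k.
    bias:    C_filter bias terms.
    Zero padding keeps the output length equal to T (assumption).
    """
    num_channels, seq_len = len(x), len(x[0])
    kernel = len(weights[0][0])
    pad = kernel // 2
    output = []
    for f, filt in enumerate(weights):
        row = []
        for t in range(seq_len):
            acc = bias[f]
            for c in range(num_channels):
                for j in range(kernel):
                    idx = t + j - pad
                    if 0 <= idx < seq_len:   # positions outside the sequence count as zero
                        acc += filt[c][j] * x[c][idx]
            row.append(max(0.0, acc))        # ReLU activation
        output.append(row)
    return output


def concat_channels(x, x_conv):
    """Stack original and convolved channels, Equation (15)."""
    return x + x_conv


# Toy example: one input channel, one filter acting as a centered difference.
x = [[0.0, 0.2, 0.6, 0.8, 0.4]]
x_conv = conv1d_same_relu(x, weights=[[[-1.0, 0.0, 1.0]]], bias=[0.0])
combined = concat_channels(x, x_conv)   # shape (C + C_filter) x T = 2 x 5
```

In this toy case the filter responds to rising segments and is clipped to zero on falling ones, which is the kind of local ramp pattern the convolutional front-end is meant to expose to the recurrent layers.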

This enriched input is used to enhance temporal learning by retaining both the original contextual signals and the locally extracted patterns. In addition, a bidirectional long short-term memory (BiLSTM) network is employed to model long-term temporal dependencies from the combined input. The data is processed in both forward and reverse directions, as shown in Equations (16) and (17) [23].

h_{fwd, t} = \mathrm{LSTM}_{fwd}(X_{t}^{combined}, h_{fwd, t - 1})

(16)

h_{rev, t} = \mathrm{LSTM}_{rev}(X_{t}^{combined}, h_{rev, t + 1})

(17)

The hidden states from both directions are concatenated to form the unified representation as follows:

h_{bid, t} = [h_{fwd, t}; h_{rev, t}] \in \mathbb{R}^{2H}

(18)

The imputed solar generation output is computed from the bidirectional hidden state $h_{bid, t}$ using a two-layer feed-forward neural network with activations, as follows:

\hat{y}_{t} = σ_{2}(W_{2}\, σ_{1}(W_{1} h_{bid, t} + b_{1}) + b_{2})

(19)

where $W_{1} \in \mathbb{R}^{H \times 2H}$, $b_{1} \in \mathbb{R}^{H}$, $W_{2} \in \mathbb{R}^{1 \times H}$, and $b_{2} \in \mathbb{R}$. This network first applies a linear transformation followed by the activation $σ_{1}$ (e.g., ReLU) to capture non-linear relationships in the BiLSTM output. A second linear transformation is then applied, followed by the sigmoid activation function $σ_{2}$, which ensures that the predicted solar generation value $\hat{y}_{t}$ is bounded between 0 and 1 and is therefore suitable for normalized solar generation data.

Finally, the model is trained to minimize a hybrid weighted mean squared error (hybrid MSE) loss $L$ [24]. This loss emphasizes the accuracy of missing value imputation via a weighting parameter $α$ as follows:

L (\hat{y}, y, M) = (1 - α) \cdot \frac{\sum_{t} (1 - M_{t}) {(\hat{y_{t}} - y_{t})}^{2}}{\sum_{t} (1 - M_{t})} + α \cdot \frac{\sum_{t} M_{t} {(\hat{y_{t}} - y_{t})}^{2}}{\sum_{t} M_{t}}

(20)

where $M_{t}$ is a binary mask indicating missing entries (1 for missing, 0 for observed), and the hyperparameter $α = 0.7$ is selected empirically in this study. To the best of our knowledge, the PA-BiLSTM framework, incorporating both the model architecture and the embedding strategy, constitutes a unique contribution of this study. Its effectiveness is illustrated in Figure 3.
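The hybrid loss in Equation (20) can be sketched for a single sequence as follows. The function name is hypothetical, and returning zero for a term whose mask selects no time steps is an assumption added here to avoid division by zero.

```python
def hybrid_mse(y_hat, y, mask, alpha=0.7):
    """Hybrid weighted MSE of Equation (20) for one sequence.

    mask[t] is 1 for a missing (to be imputed) step and 0 for an observed step.
    """
    obs_sq = sum((1 - m) * (p - t) ** 2 for p, t, m in zip(y_hat, y, mask))
    miss_sq = sum(m * (p - t) ** 2 for p, t, m in zip(y_hat, y, mask))
    n_miss = sum(mask)
    n_obs = len(mask) - n_miss
    # Assumption: an empty group contributes zero instead of raising an error.
    obs_term = obs_sq / n_obs if n_obs else 0.0
    miss_term = miss_sq / n_miss if n_miss else 0.0
    return (1 - alpha) * obs_term + alpha * miss_term


# Toy example: both errors fall on masked steps, so only the alpha-weighted term is non-zero.
loss = hybrid_mse([0.5, 0.5, 0.0, 1.0], [0.5, 0.0, 0.0, 0.5], [0, 1, 0, 1])
```

With α = 0.7, errors on masked steps are weighted more heavily than errors on observed steps, while the observed-step term still encourages the network to reproduce the visible part of the profile.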

4. Numerical Results

4.1. Data Description

This study utilizes 5 min interval solar power generation data. The dataset covers the period from January to December 2021 and includes data from 50 distinct solar power generation sites. The solar generation units are distributed across 11 regions in South Korea: Gyeongsangnam-do, Chungcheongnam-do, Jeollanam-do, Gyeongsangbuk-do, Chungcheongbuk-do, Sejong City, Incheon, Daejeon, Jeollabuk-do, Gyeonggi-do, and Busan. Table 1 shows the number of solar power generators distributed across each region.

Hourly ASOS weather data for each region are sourced from the Korea Meteorological Administration (KMA). The weather data include humidity (%), temperature (°C), solar radiation (MJ/m²), cloud cover (0~10), and precipitation (mm). In this study, solar radiation corresponds to the global horizontal irradiance (GHI) measured on a horizontal plane.

4.2. Missing Pattern Simulation and Experimental Setup

In this study, time steps with zero solar power output were excluded from the training and evaluation process. The inclusion of zero outputs, which are often deterministic and non-informative, has the potential to introduce a bias into the learning process by overemphasizing trivial patterns. By focusing exclusively on active generation periods (e.g., from 08:00 to 18:00), the model is better able to learn meaningful temporal dependencies and impute missing values with higher accuracy under realistic operating conditions. Here, T = 120 is the number of 5 min intervals from 08:00 to 18:00 (10 h × 12 intervals per hour).

Let $X = \{x_{1}, x_{2}, \dots, x_{T}\} \in \mathbb{R}^{T}$ denote a solar power generation time series, where T represents the total number of time steps. In the presence of missing values, the observed data can be expressed as shown in Equation (21)

X^{obs} = X \odot (1 - M)

(21)

where $M = \{m_{1}, m_{2}, \dots, m_{T}\}$ is the missing indicator vector, with $m_{t} = 1$ indicating a missing value at time t and $m_{t} = 0$ indicating an observed value, as shown in Figure 4, and $\odot$ denotes the element-wise product. Multiple missing data scenarios, including block-wise patterns, are employed to simulate realistic solar generation data loss. The objective is to reconstruct the complete solar generation time series X using the incomplete observed data $X^{obs}$ and the missing indicator M.
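A block-missing mask and the masked observation of Equation (21) can be sketched as below. The exact sampling procedure used in this study is not specified here, so the sketch assumes one contiguous gap whose start position is drawn uniformly at random within a daily sequence of 120 steps. All names are illustrative.

```python
import random


def block_missing_mask(seq_len, gap_hours, rng, steps_per_hour=12):
    """Return a 0/1 mask with one contiguous missing block at a random position."""
    gap = gap_hours * steps_per_hour
    if gap > seq_len:
        raise ValueError("gap is longer than the sequence")
    start = rng.randrange(seq_len - gap + 1)
    return [1 if start <= t < start + gap else 0 for t in range(seq_len)]


def apply_mask(x, mask):
    """Zero out missing entries: X_obs = X * (1 - M), element-wise (Equation (21))."""
    return [value * (1 - m) for value, m in zip(x, mask)]


# Example: a 2 h gap (24 steps) in a daily sequence of T = 120 steps.
rng = random.Random(0)
mask = block_missing_mask(120, gap_hours=2, rng=rng)
x_obs = apply_mask([1.0] * 120, mask)
```

Passing an explicit `random.Random` instance keeps the gap positions reproducible across repeated experiments.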

In order to ensure robust model generalization across seasonal patterns, the dataset was split into training, validation, and test sets using a seasonally aware sampling strategy. Let $D = \{(X_{t}^{(i)}, Y_{t}^{(i)})\}_{i = 1}^{N}$ denote the full dataset, where $Y_{t}^{(i)} \in \mathbb{R}^{T}$ is the normalized solar generation sequence of sample i. Each sample is associated with a seasonal label $s_{i} \in \{\text{spring}, \text{summer}, \text{fall}, \text{winter}\}$, which is derived from solar-specific temporal rules. The dataset is first partitioned seasonally as follows:

D_{s} = \{(X_{t}^{(i)}, Y_{t}^{(i)}) \mid s_{i} = s\}, \quad \text{for } s \in \{\text{spring}, \text{summer}, \text{fall}, \text{winter}\}

(22)

For each seasonal subset $D_{s}$, stratified sampling is used to split the data into training (70%), validation (15%), and test (15%) blocks, as shown in Figure 5.
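The seasonally stratified split can be sketched as follows. This is a simplified illustration that treats each sample as an independent daily sequence and shuffles within each season; the block structure shown in Figure 5 is not reproduced, and the function name and toy labels are hypothetical.

```python
import random


def seasonal_split(labels, rng, train_frac=0.70, val_frac=0.15):
    """Split sample indices into train/val/test with the same ratio in every season."""
    by_season = {}
    for idx, season in enumerate(labels):
        by_season.setdefault(season, []).append(idx)

    split = {"train": [], "val": [], "test": []}
    for indices in by_season.values():
        rng.shuffle(indices)
        n_train = round(train_frac * len(indices))
        n_val = round(val_frac * len(indices))
        split["train"] += indices[:n_train]
        split["val"] += indices[n_train:n_train + n_val]
        split["test"] += indices[n_train + n_val:]   # remainder, about 15%
    return split


# Toy example: 20 samples per season.
labels = ["spring", "summer", "fall", "winter"] * 20
split = seasonal_split(labels, random.Random(42))
```

Splitting inside each seasonal subset keeps the seasonal mix of the training, validation, and test sets the same, so that test performance is not dominated by any single season.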

4.3. Evaluation Metrics

Imputation quality is evaluated objectively by computing error metrics on artificially masked (missing) segments, using a binary mask $M = \{M_{1}, M_{2}, \dots, M_{T}\}$ (1 for missing, 0 for observed). Let $\hat{y}_{t}$ denote the imputed generation value at time t, $\bar{y}$ the mean of the observed values, and $y_{t}$ the corresponding ground truth. The total number of missing points in a sequence is $N_{miss} = \sum_{t = 1}^{T} M_{t}$. The metrics are defined as follows:

{MSE}_{miss} = \frac{1}{N_{miss}} \sum_{t = 1}^{T} M_{t} {(\hat{y_{t}} - y_{t})}^{2}

(23)

{MAE}_{miss} = \frac{1}{N_{miss}} \sum_{t = 1}^{T} M_{t} |\hat{y_{t}} - y_{t}|

(24)

R_{miss}^{2} = 1 - \frac{\sum_{t = 1}^{T} M_{t} {(\hat{y_{t}} - y_{t})}^{2}}{\sum_{t = 1}^{T} M_{t} {(\bar{y} - y_{t})}^{2}}

(25)

These metrics directly quantify the reconstruction fidelity in the intentionally removed segments, thereby isolating the true imputation performance from portions that were never missing. In addition to MSE, which is used as the primary criterion for selecting the best-performing model across scenarios, two additional metrics—MAE and the coefficient of determination (R²)—are reported to further highlight the performance of the selected model.
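The masked metrics in Equations (23) to (25) can be sketched for a single sequence as shown below. Following the definition above, $\bar{y}$ is taken here as the mean of the non-missing ground-truth values, which is an interpretation of "observed values". The function name is hypothetical, and the sketch assumes at least one missing and one observed step and a non-zero denominator in Equation (25).

```python
def masked_metrics(y_hat, y, mask):
    """Return MSE, MAE, and R^2 computed only on masked (missing) time steps."""
    n_miss = sum(mask)
    observed = [t for t, m in zip(y, mask) if m == 0]
    if n_miss == 0 or not observed:
        raise ValueError("need at least one missing and one observed step")
    y_bar = sum(observed) / len(observed)   # assumption: mean of non-missing values

    sq_err = sum(m * (p - t) ** 2 for p, t, m in zip(y_hat, y, mask))
    abs_err = sum(m * abs(p - t) for p, t, m in zip(y_hat, y, mask))
    total = sum(m * (y_bar - t) ** 2 for t, m in zip(y, mask))
    return {
        "mse": sq_err / n_miss,       # Equation (23)
        "mae": abs_err / n_miss,      # Equation (24)
        "r2": 1.0 - sq_err / total,   # Equation (25)
    }


# Toy example: the two middle steps are masked and imputed with 0.5.
metrics = masked_metrics([0.2, 0.5, 0.5, 0.8], [0.2, 0.4, 0.6, 0.8], [0, 1, 1, 0])
```

In this toy case the imputation equals the observed mean, so the R² on the masked steps is approximately zero, which illustrates that R² measures improvement over a constant-mean fill.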

4.4. Results

This study set out to evaluate the effectiveness of the proposed Pattern-Aware BiLSTM (PA-BiLSTM) through rigorous comparison with a selection of five baseline approaches: (i) Linear Imputation (LI), (ii) Historical Imputation (HI), (iii) Convolutional Autoencoder (ConvAE), (iv) U-Net, and (v) vanilla BiLSTM. All models were trained and validated under identical data splits and missing-pattern generation procedures to ensure a fair comparison.

These baselines represent a range of imputation strategies, from traditional statistical methods (LI, HI) to deep learning models (ConvAE, U-Net, BiLSTM). Linear and historical imputation serve as simple yet widely used benchmarks in real-world scenarios, providing insight into the gains achieved by learning-based methods. ConvAE and U-Net represent convolutional neural network (CNN)-based architectures capable of learning spatial or local patterns, while the vanilla BiLSTM captures temporal dependencies using sequential modeling. The comparison of representative models demonstrates that integrating local pattern extraction via Conv1D with bidirectional temporal reasoning in PA-BiLSTM improves imputation performance across diverse missingness patterns.

Rather than simulating a full-day missing pattern, this study simulated block–random missing scenarios of 1–4 h. This design was based on two considerations. Firstly, short or isolated missing segments can generally be reconstructed effectively using conventional statistical methods (e.g., linear interpolation or historical averaging), so they do not provide sufficiently robust benchmarks for evaluating the benefits of advanced deep learning models. Secondly, the average effective daily generation time for solar PV is around four hours, roughly corresponding to the core midday production window when solar irradiance and grid impact are at their peak. Consequently, the maximum missing block duration is set to four hours in this study. In contrast, all-day missing patterns are rare in practice and were excluded from this study [25].

Block–random missing scenarios were generated by inserting continuous gaps of 1 h, 2 h, 3 h, and 4 h at random positions within each daily sequence (5 min resolution). Each scenario was applied to the test set independently, yielding four evaluation subsets. The models were assessed using the MSE, computed exclusively on the masked portions for which ground truth is available. To account for stochasticity, each experiment was repeated five times with a different random seed on each run.

In order to evaluate the effectiveness of the proposed Pattern-Aware BiLSTM (PA-BiLSTM), a comprehensive evaluation was conducted across five distinct simulation cases. Each case corresponds to a separate training run with randomized initial weights, while applying the same fixed test dataset. The objective of this approach is to isolate the variability introduced solely by training dynamics, as opposed to data selection. As a point of reference, the baseline models—Linear Interpolation (LI) and Historical Average (HA)—remain constant across all cases, as they are non-learning-based methods and are unaffected by stochastic training procedures. The results are summarized in Table 2.

In addition, Table 3 summarizes the imputation performance of BiLSTM, ConvAE, U-Net, and PA-BiLSTM across 1–4 h gap durations, as evaluated using MAE and R². Bold and italicized text indicates the best score for each gap duration. Several observations can be made. Firstly, PA-BiLSTM consistently achieves the lowest MAE and one of the highest R² values across all gap durations. At the one-hour gap, PA-BiLSTM records an MAE of 0.0123 with an R² value of 0.98, outperforming all baseline models. Even at longer gaps of 3 h and 4 h, the model maintains competitive accuracy, with MAEs of 0.0474 and 0.070 and R² values of 0.81 and 0.66, respectively, demonstrating its robustness under challenging conditions compared to the other models.

Across all models, MAE increases and R² decreases as the gap lengthens, as expected, reflecting the inherent difficulty of imputing long gaps. Nevertheless, PA-BiLSTM performs better than the baselines, consistently preserving both magnitude and temporal variability. These results indicate that PA-BiLSTM combines superior short-gap accuracy with robustness and generalization capability under extended missing scenarios.
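For concreteness, the two metrics reported in Table 3 can be computed on the missing positions as in the sketch below. It assumes the standard definitions of MAE and the coefficient of determination, and it assumes that R² is taken over the pooled missing points; the function names and the toy values are illustrative only.

```python
import numpy as np


def masked_mae(y_true, y_pred, mask):
    """Mean absolute error restricted to the missing positions."""
    return float(np.mean(np.abs(y_true[mask] - y_pred[mask])))


def masked_r2(y_true, y_pred, mask):
    """Coefficient of determination restricted to the missing positions."""
    yt, yp = y_true[mask], y_pred[mask]
    ss_res = np.sum((yt - yp) ** 2)          # residual sum of squares
    ss_tot = np.sum((yt - yt.mean()) ** 2)   # total sum of squares
    return float(1.0 - ss_res / ss_tot)


# toy example: three missing points, each imputed with a +0.05 offset
y_true = np.array([0.1, 0.2, 0.4, 0.6, 0.8, 0.7])
mask = np.array([False, False, True, True, True, False])
y_pred = y_true.copy()
y_pred[mask] += 0.05

mae = masked_mae(y_true, y_pred, mask)
r2 = masked_r2(y_true, y_pred, mask)
```

Because R² is normalized by the variance of the true values inside the gap, the same absolute error costs more R² when the gap covers a flat part of the profile.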

Turning to the run-level results in Table 2, the proposed PA-BiLSTM model outperforms both traditional and deep learning-based baselines, achieving the lowest mean squared error (MSE) in most cases; in a few individual runs the vanilla BiLSTM is marginally lower. Compared to the vanilla BiLSTM, the PA-BiLSTM demonstrates enhanced robustness, highlighting the advantage of incorporating localized temporal features through its convolutional front-end. While the BiLSTM model shows stable performance across the five repeated training runs, the PA-BiLSTM achieves a lower average error and, for short gaps in particular, a tighter performance distribution, underscoring its advantage in both accuracy and consistency.
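The comparison of average error and run-to-run spread can be reproduced directly from the 1 h rows of Table 2. The sketch below copies those values and computes the mean and sample standard deviation over the five cases; the helper `relative_reduction` is a hypothetical name introduced here for illustration.

```python
import numpy as np

# 1 h gap MSE values copied from Table 2 (Cases 1-5)
mse_1h = {
    "BiLSTM": [0.00743, 0.00786, 0.00746, 0.00765, 0.00750],
    "ConvAE": [0.00861, 0.00855, 0.00829, 0.00842, 0.00911],
    "U-Net": [0.00856, 0.00847, 0.00821, 0.00831, 0.00857],
    "PA-BiLSTM": [0.00720, 0.00723, 0.00744, 0.00742, 0.00725],
}

# mean and sample standard deviation over the five training runs
summary = {}
for model, runs in mse_1h.items():
    arr = np.asarray(runs)
    summary[model] = (float(arr.mean()), float(arr.std(ddof=1)))


def relative_reduction(baseline, proposed):
    """Fractional MSE reduction of the proposed model relative to a baseline."""
    return (baseline - proposed) / baseline


for model, (mean, std) in summary.items():
    print(f"{model:10s} mean={mean:.5f} std={std:.5f}")
```

For the 1 h gap, PA-BiLSTM has both a lower mean and a smaller spread than the vanilla BiLSTM, and its mean MSE is roughly 15% below that of ConvAE.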

Furthermore, CNN-based architectures such as ConvAE and U-Net show performance levels comparable to BiLSTM, albeit with slightly higher MSEs. These models benefit from strong local pattern extraction but often fail to capture long-range temporal dependencies, which limits their imputation accuracy in highly dynamic contexts.

Baseline methods such as Linear Interpolation (LI) and Historical Average (HA), although simple and deterministic, serve as useful reference anchors for evaluating model performance. In Table 2, HA yields the highest MSE at every gap duration, while LI is competitive for the 1 h gap but degrades steadily as the gap lengthens. This reinforces the need for data-driven, sequence-aware models when handling longer missing segments in energy time-series imputation tasks.
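The two non-learning baselines are simple enough to state in a few lines. The sketch below is an assumption about their form, since this section does not define them precisely: LI interpolates linearly between the nearest observed points, and HA fills each missing slot with the mean of the same time slot over a set of previous days. The array `history` and both function names are hypothetical.

```python
import numpy as np


def linear_interpolation(y_obs):
    """Fill NaNs by linear interpolation between the nearest observed points."""
    y = np.asarray(y_obs, dtype=float).copy()
    missing = np.isnan(y)
    idx = np.arange(y.size)
    y[missing] = np.interp(idx[missing], idx[~missing], y[~missing])
    return y


def historical_average(y_obs, history):
    """Fill NaNs with the mean of the same time slot over past days.

    history: array of shape (n_days, n_steps) holding previous daily profiles.
    """
    y = np.asarray(y_obs, dtype=float).copy()
    missing = np.isnan(y)
    slot_mean = np.nanmean(history, axis=0)  # per-slot average across days
    y[missing] = slot_mean[missing]
    return y


y_obs = np.array([0.0, 0.2, np.nan, np.nan, 0.8, 1.0])
history = np.array([
    [0.0, 0.1, 0.2, 0.5, 0.7, 0.9],
    [0.0, 0.3, 0.4, 0.9, 0.9, 1.0],
])

li = linear_interpolation(y_obs)
ha = historical_average(y_obs, history)
```

LI uses only the two points bordering the gap, so it cannot reproduce curvature or fluctuations inside a long block, whereas HA ignores the current day's level entirely.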

This quantitative evaluation is further supported by Figure 6 and Figure 7, which present representative examples of imputed solar generation profiles under gap durations ranging from 1 to 4 h. The x-axis represents the time of day, in hours, for a single day. Because the dataset comprises 50 solar plants, it is impractical to display all plants simultaneously; therefore, three plants were randomly sampled from the test dataset for each figure to illustrate typical reconstruction performance across the different models. Importantly, the plants depicted in Figure 6A,B and Figure 7A,B are distinct, even though they are uniformly anonymized with the labels “a-”, “b-”, and “c-” to protect sensitive plant information. This sampling and anonymization strategy ensures that the results remain representative of the broader dataset while safeguarding proprietary details. For both training and evaluation, the daytime interval between approximately 08:00 and 18:00 was considered, corresponding to the period when solar generation is active and most relevant for system operations.

From Figure 6 and Figure 7, several trends can be observed. As the gap length increases, baseline methods such as LI and HA exhibit progressively larger deviations from the ground truth, often producing oversimplified reconstructions that fail to capture intra-day variability. CNN-based models (ConvAE and U-Net) provide better local approximations but tend to generate excessive fluctuations that reduce overall smoothness and interpretability. The BiLSTM captures temporal dependencies more effectively, yet its output occasionally shows inconsistency across different patterns. In contrast, the proposed PA-BiLSTM consistently reconstructs both the magnitude and temporal variability of the missing segments, aligning more closely with the observed profiles across all scenarios. These visual results corroborate the quantitative findings, demonstrating that PA-BiLSTM achieves better accuracy and stability, particularly under challenging long-gap conditions.

5. Conclusions

Accurate short-term forecasting of photovoltaic (PV) generation is critical for distribution system operators (DSOs) as it is fundamental to operational planning and decision-making. In practice, missing or corrupted PV measurements reduce the quality of the data used to train forecasting models, thereby reducing their predictive accuracy and introducing uncertainty into scheduling and control strategies. Imputation techniques can be employed to preserve the continuity and integrity of historical datasets, enabling forecasting models to learn from more representative training inputs. This enhances prediction accuracy, strengthening the reliability of operational planning processes such as unit commitment, reserve allocation, and demand response coordination. Thus, the availability of high-quality imputed data leads to more resilient and economically efficient distribution system operations.

In this study, a robust Pattern-Aware BiLSTM (PA-BiLSTM) approach was proposed to address the issue of missing data imputation in solar power generation datasets. The proposed method incorporates several significant methodological contributions aimed at enhancing practical applicability and improving imputation accuracy.

Firstly, a PCA-based seasonal embedding technique was introduced. It eliminates the need for manual feature selection, captures the dominant modes of seasonal variation, and provides a scalable, data-driven means of updating embeddings as additional seasonal data are collected. By embedding seasonal characteristics into a continuous latent vector space, the proposed model generalized effectively across diverse seasonal conditions. Secondly, the proposed method implemented normalization based on maximum generation values in 5 min intervals, enhancing its applicability in industrial environments by ensuring that predictions remain consistent and interpretable relative to operational conditions. Thirdly, the imputation performance was validated using carefully designed simulation datasets that explicitly incorporate seasonal variability, which supports the robustness and reliability of the findings across realistic temporal scenarios. Lastly, unlike previous studies, the proposed PA-BiLSTM leveraged an LSTM-based architecture specifically tailored to capture inherent temporal dependencies and dynamics, resulting in notable improvements in prediction accuracy over existing imputation methods.
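A rough sketch of the first two contributions is given below. It is not the authors' pipeline: it assumes that the seasonal embedding is obtained by applying PCA to monthly mean daily profiles, that normalization divides by the maximum 5 min generation value, and it uses synthetic profiles whose amplitude varies over the year. The function names and the choice of two components are illustrative.

```python
import numpy as np


def normalize_by_max(power):
    """Scale generation values to [0, 1] by the maximum observed value."""
    power = np.asarray(power, dtype=float)
    peak = power.max()
    return power / peak if peak > 0 else np.zeros_like(power)


def pca_seasonal_embedding(monthly_profiles, n_components=2):
    """Project monthly mean daily profiles onto their leading principal axes.

    monthly_profiles: array of shape (12, n_steps)
    returns: array of shape (12, n_components), one embedding per month
    """
    centered = monthly_profiles - monthly_profiles.mean(axis=0, keepdims=True)
    # rows of vt are the principal directions of the centered data
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


# synthetic monthly profiles: bell-shaped day, amplitude peaking in early summer
n_steps = 120                                   # assumed 10 h window at 5 min
t = np.linspace(0.0, np.pi, n_steps)
months = np.arange(12)
amp = 0.7 + 0.3 * np.cos(2 * np.pi * (months - 5) / 12)
profiles = amp[:, None] * np.sin(t)[None, :]

normalized = normalize_by_max(profiles)
emb = pca_seasonal_embedding(normalized, n_components=2)
```

Unlike a one-hot month indicator, the resulting coordinates place months with similar generation profiles close together, and they can be recomputed as more seasonal data arrive.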

However, an important limitation was identified in the analysis. PA-BiLSTM demonstrated lower performance in situations with highly volatile generation patterns. The observed volatility reduced prediction stability, indicating that additional research is needed to enhance model robustness under conditions of high variability.

A second limitation concerns the missingness pattern. The evaluation was conducted under block–random missing scenarios of 1–4 h to emphasize the challenge of longer consecutive gaps. However, real-world PV datasets often exhibit irregular missingness due to sensor drift, inverter shutdowns, or communication dropouts. In such cases, feed-forward strategies or lightweight statistical interpolation can already achieve competitive performance, especially for isolated or short gaps. Nevertheless, the proposed PA-BiLSTM remains applicable to irregular settings by incorporating mask vectors or time-gap features, or through training with mixed missingness patterns. This suggests that while block–random evaluation highlights the model’s strength under challenging conditions, the framework can be naturally extended to more realistic irregular missing scenarios.
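One common way to construct such inputs, used in the recurrent imputation literature cited in the references (e.g., BRITS), is to pair each value with an observation mask and the elapsed time since the last observed sample. The sketch below illustrates that construction under the assumption of a fixed 5 min sampling step; it is not part of the evaluated model, and the function name is hypothetical.

```python
import numpy as np


def mask_and_time_gap(y_obs, step_minutes=5):
    """Build (value, mask, time-gap) features for a series containing NaNs.

    mask is 1 where a value is observed and 0 where it is missing;
    delta is the time in minutes since the most recent observed sample.
    """
    y_obs = np.asarray(y_obs, dtype=float)
    observed = ~np.isnan(y_obs)
    delta = np.zeros(y_obs.size)
    for t in range(1, y_obs.size):
        if observed[t - 1]:
            delta[t] = step_minutes                  # previous sample was observed
        else:
            delta[t] = delta[t - 1] + step_minutes   # gap keeps growing
    filled = np.where(observed, y_obs, 0.0)          # zero placeholder for missing
    return np.stack([filled, observed.astype(float), delta], axis=-1)


y_obs = np.array([0.1, np.nan, np.nan, 0.4, np.nan, 0.6])
features = mask_and_time_gap(y_obs)
```

The mask tells the network which inputs are placeholders, and the time-gap channel lets it discount stale information differently for isolated dropouts and long blocks.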

The findings also highlight the potential for generalization across different climatic regimes. While the training data were sourced from temperate South Korea, two design elements (PCA-based seasonal embeddings and normalization by peak capacity) help mitigate location-specific biases by encoding seasonal dynamics and scaling outputs to plant-level magnitudes. These features allow the model to reproduce typical diurnal patterns under regime shifts. Nevertheless, extreme environments such as tropical convective systems, arid dust-prone regions, or high-latitude winters may introduce distributional shifts in both power profiles and missingness patterns. For deployment in such settings, a two-step strategy is recommended: (i) zero-shot evaluation using the existing embedding and normalization, followed by (ii) lightweight fine-tuning of the final layers with limited local data while keeping the convolutional and recurrent backbones fixed. This approach balances robustness with practicality in scenarios where cross-climate data are limited.

Future research directions may include the integration of uncertainty quantification techniques or advanced architectures designed specifically to handle volatile temporal dynamics. Overall, this study demonstrates the significant potential of the proposed PA-BiLSTM model for enhancing the accuracy and practical applicability of solar power generation data imputation, while clearly outlining pathways for further methodological improvements.

Author Contributions

Conceptualization, M.J. and S.-K.J.; methodology, M.J.; software, M.J.; validation, S.-K.J. and M.J.; writing—original draft preparation, M.J.; writing—review and editing, S.-K.J.; supervision, S.-K.J. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Acknowledgments

This work was supported by the Korea Institute of Energy Technology Evaluation and Planning (KETEP) and the Ministry of Trade, Industry and Energy (MOTIE) of the Republic of Korea under Grants No. RS-2025-02315367 and No. RS-2025-02313547.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Ember. Global Electricity Review 2025. Available online: https://ember-energy.org/latest-insights/global-electricity-review-2025/ (accessed on 17 July 2025).
2. RE100. 2024 RE100: Annual Disclosure Report. Available online: https://www.there100.org/our-work/publications/2024-re100-annual-disclosure-report (accessed on 17 July 2025).
3. Yonhap News Agency. S. Korea Launches Task Force to Create RE100 Industrial Complex. Available online: https://en.yna.co.kr/view/AEN20250716002900320 (accessed on 17 July 2025).
4. Lee, D.; Kim, D.; Joo, S.-K. Interval-Stochastic Programming for Integrated Generation, Transmission, and Energy Storage System (ESS) Planning Considering Uncertainty in Renewable Energy Sources. IEEE Access 2025, 13, 30834–30844.
5. Kim, S.; Joo, S.-K. Transmission Pricing Incorporating the Impact of System Fault and Renewable Energy Uncertainty on the Transmission Margin. IEEE Access 2023, 11, 103779–103789.
6. Lee, D.; Joo, S.-K. Economic Analysis of Large-Scale Renewable Energy (RE) Source Investment Incorporating Power System Transmission Costs. Energies 2023, 16, 7407.
7. Shin, K.; Lee, J. Investment Decision for Long-Term Battery Energy Storage System Using Least Squares Monte Carlo. Energies 2024, 17, 2019.
8. Liu, W.; Ren, C.; Xu, Y. Missing-Data Tolerant Hybrid Learning Method for Solar Power Forecasting. IEEE Trans. Sustain. Energy 2022, 13, 1843–1852.
9. de-Paz-Centeno, I.; García-Ordaz, M.T.; García-Olalla, Ó.; Alaiz-Moretón, H. Imputation of missing measurements in PV production data within constrained environments. Expert Syst. Appl. 2023, 217, 119510.
10. Li, Q.; Xu, Y.; Chew, B.S.H.; Ding, H.; Zhao, G. An Integrated Missing-Data Tolerant Model for Probabilistic PV Power Generation Forecasting. IEEE Trans. Power Syst. 2022, 37, 4447–4459.
11. Costa, T.; Falcão, B.; Mohamed, M.A.; Annuk, A.; Marinho, M. Employing machine learning for advanced gap imputation in solar power generation databases. Sci. Rep. 2024, 14, 23801.
12. Zhang, W.; Luo, Y.; Zhang, Y.; Srinivasan, D. SolarGAN: Multivariate solar data imputation using generative adversarial network. IEEE Trans. Sustain. Energy 2020, 12, 743–746.
13. Liu, Z.; Xuan, L.; Gong, D.; Xie, X.; Liang, Z.; Zhou, D. A WGAN-GP Approach for Data Imputation in Photovoltaic Power Prediction. Energies 2025, 18, 1042.
14. Ryu, S.; Kim, M.; Kim, H. Denoising autoencoder-based missing value imputation for smart meters. IEEE Access 2020, 8, 40656–40666.
15. Jeong, J.; Ku, T.-Y.; Park, W.-K. Denoising Masked Autoencoder-Based Missing Imputation within Constrained Environments for Electric Load Data. Energies 2023, 16, 7933.
16. Zhang, Y.; Ma, T.; Li, T.; Sun, X.; Liu, Z. Small Sample Data Augmentation Method for Photovoltaic Power Generation Based on Improved Variational Auto-encoder. In Proceedings of the 2024 36th Chinese Control and Decision Conference (CCDC), Xi’an, China, 25–27 May 2024; pp. 1047–1053.
17. Ma, J.; Cheng, J.C.P.; Jiang, F.; Chen, W.; Wang, M.; Zhai, C. A Bi-directional Missing Data Imputation Scheme Based on LSTM and Transfer Learning for Building Energy Data. Energy Build. 2020, 211, 109792.
18. Cao, W.; Wang, D.; Li, J.; Zhou, H.; Li, L.; Li, Y. BRITS: Bidirectional Recurrent Imputation for Time Series. arXiv 2018, arXiv:1805.10572.
19. Che, Z.; Purushotham, S.; Cho, K.; Sontag, D. Recurrent Neural Networks for Multivariate Time Series with Missing Values. Sci. Rep. 2018, 8, 6085.
20. Fu, X.; Zhang, C.; Zhang, X.; Sun, H. A Novel GAN Architecture Reconstructed Using Bi-LSTM and Style Transfer for PV Temporal Dynamics Simulation. IEEE Trans. Sustain. Energy 2024, 15, 2826–2829.
21. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. arXiv 2017, arXiv:1706.03762.
22. Lee, J.; Lee, J.; Wi, Y.-M. Impact of Revised Time of Use Tariff on Variable Renewable Energy Curtailment on Jeju Island. Electronics 2021, 10, 135.
23. Schuster, M.; Paliwal, K.K. Bidirectional recurrent neural networks. IEEE Trans. Signal Process. 1997, 45, 2673–2681.
24. Yoon, J.; Jordon, J.; van der Schaar, M. GAIN: Missing Data Imputation using Generative Adversarial Nets. In Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; Volume 80, pp. 5689–5698.
25. Denhard, A.; Bandyopadhyay, S.; Habte, A.; Sengupta, M. Evaluation of Time-Series Gap-Filling Methods for Solar Irradiance Applications; No. NREL/TP-5D00-79987; National Renewable Energy Lab. (NREL): Golden, CO, USA, 2021.

Figure 1. Overview of the Pattern-Aware BiLSTM framework.

Figure 2. (a) Example of static (One-Hot) seasonal embeddings; (b) Example of PCA-based seasonal embeddings.

Figure 3. PA-BiLSTM architecture for missing values in solar generation data.

Figure 4. Masking mechanism for missing values in solar generation data.

Figure 5. Monthly based sampling and dataset partitioning strategy.

Figure 6. Examples of the actual solar generation and missing imputation results for dataset (A) Random block-1 h interval; (B) Random block-2 h interval.

Figure 7. Examples of the actual solar generation and missing imputation results for dataset (A) Random block-3 h interval; (B) Random block-4 h interval.

Table 1. Summary of the number of solar generators by region.

Region	Number of Solar Generators
Gyeonggi-do	1
Gyeongsangnam-do	9
Gyeongsangbuk-do	4
Daejeon	2
Busan	2
Sejong	3
Incheon	4
Jeollanam-do	15
Jeollabuk-do	4
Chungcheongnam-do	4
Chungcheongbuk-do	2

Table 2. Model-wise MSE (kWh²) comparison under solar generation imputation.

Hour	Model	Case 1	Case 2	Case 3	Case 4	Case 5
1	LI	0.00793	-	-	-	-
	HA	0.03370	-	-	-	-
	BiLSTM	0.00743	0.00786	0.00746	0.00765	0.00750
	ConvAE	0.00861	0.00855	0.00829	0.00842	0.00911
	Unet	0.00856	0.00847	0.00821	0.00831	0.00857
	PA-BiLSTM (proposed)	0.00720	0.00723	0.00744	0.00742	0.00725
2	LI	0.01172	-	-	-	-
	HA	0.03657	-	-	-	-
	BiLSTM	0.01001	0.01013	0.01013	0.00990	0.0102
	ConvAE	0.01515	0.01502	0.01482	0.01528	0.0152
	Unet	0.01151	0.01155	0.01118	0.01105	0.0116
	PA-BiLSTM (proposed)	0.01010	0.00984	0.00996	0.00977	0.0098
3	LI	0.01613	-	-	-	-
	HA	0.03808	-	-	-	-
	BiLSTM	0.01397	0.0136	0.01328	0.0136	0.0136
	ConvAE	0.02275	0.0232	0.02303	0.0230	0.0226
	Unet	0.01518	0.0054	0.01526	0.0155	0.0162
	PA-BiLSTM (proposed)	0.01313	0.0133	0.01334	0.0131	0.0131
4	LI	0.02147	-	-	-	-
	HA	0.03892	-	-	-	-
	BiLSTM	0.01605	0.0166	0.01591	0.0163	0.0162
	ConvAE	0.02600	0.0271	0.02684	0.0265	0.0255
	Unet	0.01952	0.0199	0.01954	0.0197	0.0196
	PA-BiLSTM (proposed)	0.01579	0.0154	0.01675	0.0152	0.0157

The bold and italicized text indicates the best score for each case.

Table 3. Model performance MAE (kWh) and R² under 1–4 h missing gaps.

Model	1 h MAE	1 h $R^{2}$	2 h MAE	2 h $R^{2}$	3 h MAE	3 h $R^{2}$	4 h MAE	4 h $R^{2}$
BiLSTM	0.0166	0.98	0.0341	0.89	0.068	0.61	0.102	0.28
ConvAE	0.0256	0.96	0.0421	0.88	0.059	0.77	0.076	0.66
Unet	0.0155	0.97	0.0361	0.86	0.058	0.71	0.080	0.55
PA-BiLSTM (proposed)	0.0123	0.98	0.0259	0.93	0.0474	0.81	0.070	0.66

The bold and italicized text indicates the best score for each case.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
