1. Introduction
In the context of increasingly severe global climate change and environmental pollution, the traditional energy structure dominated by fossil fuels must gradually transition to a low-carbon energy system centered on renewable energy sources [1]. As social electricity demand continues to grow and large-scale renewable energy sources are gradually integrated into the grid, the complexity of the power grid structure also increases [2], placing higher demands on grid stability, efficiency, and economic viability. However, annual losses from curtailed wind and solar power in the new energy sector still amount to nearly 10 billion kilowatt-hours [3]. Additionally, since thermal power plants provide most of the ancillary services, their own operational load rates are relatively low, and the costs of flexibility upgrades are substantial [4], resulting in industry-wide losses. Energy storage systems can provide various services to power grid operations, including peak shaving, frequency regulation, black start, and demand response support [5], making them an important means to enhance the flexibility, economic efficiency, and safety of traditional power systems. Energy storage can significantly enhance the utilization rate of renewable energy sources, such as wind and solar power, and support distributed generation and microgrids [6]; it is a crucial technology for promoting the transition from fossil fuels to renewable energy sources [7].
However, in actual operation [8, 9], energy storage systems face challenges such as the large number of battery cells, their complex structure, and prolonged operational periods [
10]. This necessitates frequent monitoring and assessment of battery status to ensure the stable and safe operation of energy storage power plants. In particular, the automated classification of cell- and module-level states—for example, distinguishing between regular operation and anomalous behaviors, identifying early-stage faults, categorizing degradation stages (SOH levels), and detecting imminent end-of-life indicators—directly supports critical operational functions. Timely and reliable state classification enables online condition monitoring and early fault detection to prevent safety incidents (e.g., thermal runaway), supports predictive maintenance and remaining useful life (RUL) estimation to reduce unplanned outages and maintenance costs, and provides inputs for dispatch and power-optimization strategies that increase renewable energy utilization and overall economic returns. Moreover, state-aware classification facilitates hierarchical control and fault isolation in PACK-level systems, allowing operators to selectively disengage or re-route affected modules and thereby maintain service continuity. Under prolonged standby conditions and non-standard operational conditions, a significant amount of redundant and ineffective data is generated. These data are also unable to accurately reflect the operational status of the energy storage system itself [
11], thereby increasing transmission and storage costs while impairing system monitoring, maintenance, and optimization. Targeted preprocessing and classification must therefore both reduce data volume and explicitly preserve diagnostically relevant signatures (e.g., transient features and SOH-related patterns) so that downstream decision-support algorithms receive high-fidelity inputs. Additionally, deployment-oriented metrics such as inference latency, memory footprint, and false-alarm rates should be reported to demonstrate practical applicability. Therefore, conducting data processing on energy storage system data and exploring methods to enhance data quality and reduce data volume hold significant importance [
12].
The primary purpose of battery data extraction is to study the state of charge (SOC), state of health (SOH), and remaining useful life (RUL) of lithium batteries [13, 14]. Currently, research on battery data preprocessing focuses on various data features that reflect battery characteristics. Reference [
15] proposes a neural network-based algorithm that uses actual time series data measured in vehicles to achieve continuous automated estimation of battery health status. This method simplifies the traditional modeling process and offers greater versatility and application flexibility. Reference [
16] proposes a battery modeling method based on an improved LSTM neural network, which uses unsupervised algorithms to extract typical recurring loads from in-vehicle time series data, achieving data compression and balanced sampling for battery modeling. This method maintains accuracy even with significantly compressed data volumes, making it suitable for in-vehicle battery environments with limited computational power. The study referenced in [
17] proposes a data compression method that converts monthly electric vehicle charging raw sequence data into feature maps, effectively reducing data volume while retaining key information, thereby reducing model training time and achieving a balance between feature representation and information preservation. Reference [
18] introduces multi-stage data compression processing in battery capacity prediction, using PCA to compress the dimensionality of battery aging features, then employing t-SNE to preserve local structure and map data to a visualizable space, combined with DBSCAN to remove noise from compressed data. This achieves effective dimensionality reduction and data cleaning for high-dimensional data, enhancing model training efficiency and prediction accuracy. These methods are highly applicable to vehicle power batteries but are not fully compatible with data processing for energy storage batteries. References [
19, 20, 21, 22] modify and optimize the Kalman filter algorithm to improve data preprocessing effectiveness, achieving good results with minimal computational effort. However, these methods are sensitive to initial conditions and low-quality data, and may fail when initial estimation errors increase. Reference [
22] (Peng et al., Applied Energy, 2019) develops an improved cubature Kalman filter (CKF) for more accurate SOC estimation under nonlinear battery dynamics. That work is primarily concerned with model-based state estimation for individual cells or packs and focuses on improving estimation accuracy under model nonlinearity. It does not address large-scale data management tasks such as time-series segmentation, diagnostic-aware compression, or the preservation of transient fault signatures in compressed archives. By contrast, the present work targets station-scale operational data pipelines: the ASW-UKF module is designed for online consistency restoration and robust handling of missing/corrupted samples in streaming data, and the TRF performs temporally aware, importance-weighted multi-level compression to retain diagnostically relevant events while reducing storage. Thus, although both lines of work use Kalman-filtering ideas for robustness in estimation, their goals and end-to-end functionality are different. Literature [
23] employs unsupervised learning autoencoders for battery data denoising, leveraging their robust feature extraction capabilities to achieve satisfactory results in data denoising. However, the output data may suffer from over-smoothing, loss of original battery data features, and high computational costs. Reference [
24] proposes a bidirectional LSTM encoder–decoder for SOC sequence estimation, leveraging sequence-learning to predict SOC trajectories. That approach is essentially a data-driven sequence modeling solution aimed at improving SOC prediction accuracy; it typically relies on labeled training data, offline model training, and relatively heavy inference cost. It likewise does not provide an integrated solution for online data cleaning, missing-data recovery, segmentation, or compression tuned to preserve transient diagnostic signatures. In contrast, our ASW-UKF-TRF pipeline combines an online, light-weight consistency-restoration module (ASW-UKF) with a compression module (TRF) that is explicitly designed to balance compression efficiency and diagnostic fidelity for large heterogeneous station datasets.
Recent research has further expanded in two directions that are directly relevant to our work. First, adaptive and online variants of Kalman filtering—including sliding-window and context-aware Unscented Kalman Filter (UKF) schemes—have been proposed to provide robust, low-latency state reconciliation under non-stationary field conditions, making them more suitable for continuous station operation than conventional offline filters [
25,
26,
27]. Second, there is growing interest in temporally aware segmentation and diagnostic-aware (importance-weighted) compression techniques that explicitly prioritize retention of transient fault signatures and diagnostically valuable segments while aggressively compressing low-information intervals; such approaches aim to balance downstream diagnostic performance with storage constraints [
28,
29]. Notably, a small but rising body of work has begun to explore integrated pipelines that jointly consider online consistency restoration, transient preservation, and compression—a combination rarely addressed by vehicle-focused studies and one that is crucial for energy-storage-scale deployments where data volume, heterogeneity, and the operational duty cycle differ markedly [
30]. In this manuscript, we therefore position ASW-UKF as the online consistency-restoration component and TRF as the temporally aware, importance-driven compression component, arguing that their joint design better meets the specific requirements of energy storage stations than approaches that treat preprocessing and compression as separate, offline steps. The algorithm comparison content is shown in
Table 1 and
Table 2.
Recent advances in preprocessing and compression techniques for lithium-ion battery data have been primarily driven by research on electric-vehicle (EV) power batteries, where the principal constraints are limited on-board computing power and strict latency requirements. Those studies have developed effective denoising, feature-extraction, and compact-representation methods (e.g., sequence models, autoencoders, and model-based filters) that are suitable for vehicle-scale deployments. However, grid-scale energy storage systems present fundamentally different challenges: station-level datasets are several orders of magnitude larger, contain heterogeneous behavior from thousands of cells or modules operating in parallel, and include long-term trends, intermittent faults, and environmental disturbances that do not commonly appear in EV datasets. Consequently, algorithms optimized for EV scenarios often fail to deliver the required combination of scalability, information preservation (especially of transient events), and operational interpretability that power-plant operators require for online monitoring, fault diagnosis, and asset management.
To address these gaps, we propose an integrated pipeline—Adaptive Sliding Window/Unscented Kalman Filter/Time-series Random Forest (ASW–UKF–TRF)—specifically tailored to the data and operational characteristics of energy storage power plants. The pipeline is designed with three linked objectives: (1) reliably remove redundant and low-value observations while preserving diagnostically relevant signatures; (2) reconstruct and smooth incomplete or noisy signals with model-based filtering to retain physical interpretability; and (3) provide a hierarchical, importance-weighted compression that supports different downstream consumers (real-time alarms, medium-term diagnostics, and archival analytics).
The first stage, ASW, performs lightweight, data-driven cleaning and deduplication using an adaptive windowing strategy. Rather than applying a fixed-length segmentation, ASW dynamically adjusts the window size in response to local signal stability (for example, using a per-sample voltage-change-rate statistic). Within each adaptive window, triplet-based equality checks (voltage, current, SOC—rounded to a configurable number of decimals) identify exact or near-duplicate samples for removal or collapse. This design preserves short transient events by skipping deduplication when sign reversals or abrupt rate changes are detected, while aggressively compressing during long, steady-state segments. ASW’s behavior is governed by a small set of interpretable hyperparameters (min/max window, change threshold, and rounding precision), which we tune empirically and report for reproducibility.
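The triplet-based equality check described above can be illustrated with a minimal sketch. It assumes that each sample is a (voltage, current, SOC) tuple and that consecutive samples with identical rounded triplets are collapsed to the first one; the function name `dedupe_window` and the `has_transient` flag are hypothetical and are not taken from the implementation used in this paper.

```python
def dedupe_window(samples, decimals=2, has_transient=False):
    """Return the indices of the samples kept inside one adaptive window.

    samples: list of (voltage, current, soc) tuples (assumed layout).
    decimals: rounding precision used for the equality check.
    has_transient: if True, deduplication is skipped so that short
        transient events are preserved (hypothetical flag).
    """
    if has_transient:
        return list(range(len(samples)))
    kept = []
    last_key = None
    for idx, (voltage, current, soc) in enumerate(samples):
        key = (round(voltage, decimals), round(current, decimals), round(soc, decimals))
        if key != last_key:  # keep only the first sample of each run of equal triplets
            kept.append(idx)
            last_key = key
    return kept


window = [(3.301, 10.0, 55.0), (3.302, 10.0, 55.0), (3.299, 10.0, 55.0), (3.35, 12.0, 55.1)]
print(dedupe_window(window, decimals=2))  # [0, 3]
```

A coarser rounding precision merges more samples, so `decimals` trades compression against fidelity in steady-state segments.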
The second stage, UKF, is a model-based smoothing and imputation module that operates on the ASW-pruned stream. We select the Unscented Kalman Filter because it propagates nonlinear state uncertainties with minimal linearization error and provides an explicit uncertainty estimate for each imputed value. This property facilitates downstream decision thresholds and confidence-aware compression. In practice, the UKF reduces high-frequency noise, fills intermittent missing samples produced by ASW, and preserves the physical coherence of derived quantities (e.g., SOC trajectories). UKF tuning focuses on process and measurement noise scales (Q, R) and unscented transform parameters (α, β, κ). We provide recommended ranges and a simple data-driven rule for selecting the change threshold and Q/R scaling in
Section 3.
The third stage, TRF, combines time-series feature extraction with a RandomForest classifier that is trained to categorize each processed sample (or sample window) into operationally meaningful classes (e.g., regular, transient event, degradation indicator, redundant). TRF differs from off-the-shelf static classifiers by: (a) using windowed temporal features (rolling statistics, short-term trend slopes, spectral-energy proxies) that capture both steady and transient behavior; (b) incorporating event-importance weighting in the loss function to prioritize rare but critical events; and (c) producing class-specific importance scores used to drive hierarchical compression (high-importance samples are retained at fine granularity; low-importance data are coarsely aggregated). This graded compression strategy yields substantial reductions in storage and transmission load while preserving the information that operators and diagnostic pipelines need. The complete algorithm flow is shown in
Figure 1.
Evaluation of the pipeline uses multiple metrics, including reconstruction error (RMAE/RMSE), coefficient of determination ($R^2$), classification F1 score for important event classes, and deployment-oriented indicators such as compressed data ratio, model parameter count, and inference latency (measured on a reference CPU). We also conduct ablation studies (e.g., ASW off, UKF replaced by EKF, and TRF replaced by standard RF) and sensitivity analyses for key hyperparameters (ASW thresholds, UKF Q/R, TRF window length, and minimum segment length). These analyses demonstrate that ASW–UKF–TRF achieves consistent improvements in accuracy retention and compression efficiency relative to representative baselines (unsupervised LSTM, MCC-EKF, autoencoder, FSW-UKF, ASW-EKF), while maintaining interpretable uncertainty estimates that aid operational decision-making.
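For reference, the reconstruction metrics named above can be computed as in the following sketch. It uses the standard textbook definitions of RMSE, MAE, and the coefficient of determination; the helper names are illustrative, and the sketch is not the evaluation code used for the reported results.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error between original and reconstructed values."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((a - b) ** 2 for a, b in zip(y_true, y_pred))
    ss_tot = sum((a - mean_true) ** 2 for a in y_true)
    return 1.0 - ss_res / ss_tot


original = [1.0, 2.0, 3.0, 4.0]
reconstructed = [1.0, 2.0, 3.0, 5.0]
print(rmse(original, reconstructed), mae(original, reconstructed), r2_score(original, reconstructed))
```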
Finally, we consider practical deployment aspects. The pipeline is implemented in Python 3.12.2 with a modest computational footprint (experimental runs reported on an Intel i5 machine with 8 GB RAM), and the modular design allows computationally intensive components (e.g., TRF training) to be offloaded to cloud or batch infrastructure. At the same time, lightweight ASW/UKF can operate near-real-time on edge controllers. Remaining limitations include the need for broader cross-site validation (different chemistries and usage patterns), further optimization for embedded deployment, and automated hyperparameter adaptation for heterogeneous fleets—directions that we discuss in the Conclusion and future work sections.
The remainder of this paper is organized as follows. Section 2 describes the methodology (ASW, UKF, and TRF, how they are combined, and the role of each), Section 3 presents the simulation experiments and results, and Section 4 discusses the findings and limitations.
2. Adaptive Sliding Window-Unscented Kalman Filter-Time Series Random Forest Algorithm
2.1. Adaptive Sliding Window (ASW)
To enhance the flexibility and fidelity of time series data cleaning, this paper proposes an adaptive sliding window method. This method dynamically adjusts the window length to adapt to the characteristics of data changes at different stages, thereby achieving more accurate data redundancy removal and detail retention.
ASW slides a variable-sized window over the time series data and adjusts the window based on the following principles:
Data change rate: Calculate the data change rate within the window. Use a small window in rapidly changing areas to capture details and a large window in stable and smooth areas to improve compression efficiency.
Key event detection: Identify key events such as charging transitions and abnormal points through rules (e.g., voltage exceeding thresholds, temperature anomalies), and force the use of the smallest window to retain details.
In this formula, $x_t$ and $x_{t-1}$ represent the measured values at times $t$ and $t-1$, respectively, while θ denotes the threshold for the data change rate. $V_{\max}$ and $T_{\max}$ correspond to the critical limits of voltage and temperature. The parameter $w_t$ indicates the current window size, $w_{\min}$ denotes the minimum window size, and $\gamma$ is a scaling factor that controls the expansion of the window.
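The window-adjustment principles can be sketched as follows. This is only one plausible rule consistent with the description: the window falls back to its minimum size when the change between consecutive measurements exceeds the threshold or a key event is detected, and otherwise grows by a scaling factor up to a maximum size. The function name, the argument names, and the exact growth rule are assumptions made for illustration.

```python
import math


def next_window(w, x_prev, x_curr, theta, w_min, w_max, gamma=1.5,
                voltage=None, v_max=None, temp=None, t_max=None):
    """Return the next window size (hypothetical update rule).

    w: current window size; theta: change-rate threshold;
    gamma: growth factor (> 1) applied in stable regions;
    v_max / t_max: critical voltage and temperature limits (optional).
    """
    key_event = (
        (v_max is not None and voltage is not None and voltage > v_max)
        or (t_max is not None and temp is not None and temp > t_max)
    )
    if key_event or abs(x_curr - x_prev) > theta:
        return w_min  # rapid change or key event: use the smallest window
    return min(w_max, math.ceil(gamma * w))  # stable region: expand, capped at w_max


print(next_window(4, 3.30, 3.30, theta=0.01, w_min=2, w_max=32))  # 6
print(next_window(4, 3.30, 3.36, theta=0.01, w_min=2, w_max=32))  # 2
```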
2.2. Unscented Kalman Filter (UKF)
Extended Kalman filtering is a commonly used method for estimating the state of nonlinear systems. It linearizes the system function using a first-order Taylor expansion and combines it with the standard Kalman filtering framework for prediction and updating. The state transition and observation models of EKF are as follows:
$$x_k = f(x_{k-1}) + w_{k-1}, \qquad z_k = h(x_k) + v_k$$
where $w_{k-1}$ and $v_k$ represent process noise and observation noise, respectively, $x_k$ represents the system state at time step $k$, and $z_k$ is the observation at time step $k$.
However, this method is prone to introducing linearization errors when the system is highly nonlinear or model uncertainty is substantial, which affects estimation accuracy and stability. To overcome this limitation, this paper adopts the more advanced Unscented Kalman Filter (UKF). The unscented transformation generates a set of representative Sigma points and propagates the mean and covariance directly in the original nonlinear space, avoiding explicit linearization and enabling more accurate capture of the statistical characteristics of nonlinear systems. The primary process of the unscented transformation is as follows:
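The following Sigma points are generated:
$$\chi_{k-1}^{(0)} = \hat{x}_{k-1}, \qquad \chi_{k-1}^{(i)} = \hat{x}_{k-1} + \left(\sqrt{(n+\lambda)P_{k-1}}\right)_i, \qquad \chi_{k-1}^{(i+n)} = \hat{x}_{k-1} - \left(\sqrt{(n+\lambda)P_{k-1}}\right)_i, \qquad i = 1, \dots, n$$
where Sigma point prediction is as follows:
$$\chi_{k|k-1}^{(i)} = f\left(\chi_{k-1}^{(i)}\right), \qquad i = 0, \dots, 2n$$
Predicting the mean and covariance:
$$\hat{x}_{k|k-1} = \sum_{i=0}^{2n} W_i^{(m)} \chi_{k|k-1}^{(i)}, \qquad P_{k|k-1} = \sum_{i=0}^{2n} W_i^{(c)} \left(\chi_{k|k-1}^{(i)} - \hat{x}_{k|k-1}\right)\left(\chi_{k|k-1}^{(i)} - \hat{x}_{k|k-1}\right)^{T} + Q$$
Predicted observation values:
$$\mathcal{Z}_{k|k-1}^{(i)} = h\left(\chi_{k|k-1}^{(i)}\right), \qquad \hat{z}_{k|k-1} = \sum_{i=0}^{2n} W_i^{(m)} \mathcal{Z}_{k|k-1}^{(i)}$$
Observed covariance and cross-covariance:
$$P_{zz} = \sum_{i=0}^{2n} W_i^{(c)} \left(\mathcal{Z}_{k|k-1}^{(i)} - \hat{z}_{k|k-1}\right)\left(\mathcal{Z}_{k|k-1}^{(i)} - \hat{z}_{k|k-1}\right)^{T} + R, \qquad P_{xz} = \sum_{i=0}^{2n} W_i^{(c)} \left(\chi_{k|k-1}^{(i)} - \hat{x}_{k|k-1}\right)\left(\mathcal{Z}_{k|k-1}^{(i)} - \hat{z}_{k|k-1}\right)^{T}$$
Kalman gain and update:
$$K_k = P_{xz} P_{zz}^{-1}, \qquad \hat{x}_k = \hat{x}_{k|k-1} + K_k\left(z_k - \hat{z}_{k|k-1}\right), \qquad P_k = P_{k|k-1} - K_k P_{zz} K_k^{T}$$
In the above equations, $\chi_{k-1}^{(i)}$ is the i-th sigma point at step $k-1$, generated from the estimate $\hat{x}_{k-1}$ and its covariance $P_{k-1}$. The parameters $n$ and $\lambda$ are the state dimension and scaling factor, respectively. The weights $W_i^{(m)}$ and $W_i^{(c)}$ correspond to the mean and covariance of each sigma point. $Q$ and $R$ represent the process noise covariance and observation noise covariance. $\hat{x}_{k|k-1}$ and $P_{k|k-1}$ denote the predicted state and its covariance, while $\hat{z}_{k|k-1}$ is the predicted observation. The variables $P_{zz}$ and $P_{xz}$ are the observation covariance and cross-covariance. Finally, $K_k$ is the Kalman gain used for updating the state estimate $\hat{x}_k$ and its covariance $P_k$.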
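To make the unscented transformation concrete, the following sketch implements it for a scalar state (n = 1) using the standard scaled sigma-point weights with parameters α, β, and κ. The function names and the default parameter values are assumptions chosen for illustration; a multi-dimensional implementation would replace the scalar square root with a matrix square root such as a Cholesky factor.

```python
import math


def sigma_points_1d(mean, var, alpha=1.0, beta=2.0, kappa=2.0):
    """Sigma points and weights for a scalar state (n = 1)."""
    n = 1
    lam = alpha ** 2 * (n + kappa) - n  # scaling factor lambda
    spread = math.sqrt((n + lam) * var)
    points = [mean, mean + spread, mean - spread]
    w_mean = [lam / (n + lam), 1.0 / (2 * (n + lam)), 1.0 / (2 * (n + lam))]
    w_cov = [w_mean[0] + (1 - alpha ** 2 + beta)] + w_mean[1:]
    return points, w_mean, w_cov


def unscented_transform(func, mean, var, **params):
    """Propagate a scalar mean and variance through func via sigma points."""
    points, w_mean, w_cov = sigma_points_1d(mean, var, **params)
    ys = [func(p) for p in points]
    y_mean = sum(w * y for w, y in zip(w_mean, ys))
    y_var = sum(w * (y - y_mean) ** 2 for w, y in zip(w_cov, ys))
    return y_mean, y_var


# For a linear map y = 2x + 1 the transform is exact: mean 2m + 1, variance 4P.
y_mean, y_var = unscented_transform(lambda x: 2 * x + 1, mean=3.3, var=0.04)
print(y_mean, y_var)
```

The linear test case is useful as a sanity check because the expected mean and variance are known in closed form.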
2.3. Time Series Random Forest (TRF)
Random Forest is an ensemble learning method composed of multiple decision trees (classification and regression trees) with relatively weak performance. It is trained through random selection of samples and features, and ultimately produces an ensemble output using majority voting (for classification) or averaging (for regression). Its structure is shown in
Figure 2. RF not only possesses strong nonlinear modeling capabilities and robustness, but can also be used to address issues with missing data. Each tree independently predicts missing samples, and the optimal estimate is ultimately determined through a voting process.
However, traditional RF assumes that samples are independent of each other and does not consider dynamic dependencies in time series, which limits its performance in time series modeling.
To address this issue, this paper introduces a time window mechanism based on RF. It incorporates historical information into the modeling process by constructing lagged features, thereby forming a time series random forest algorithm that enhances the model’s ability to characterize the system’s evolution characteristics. The model structure and training and prediction process of TRF are shown below:
Model structure:
Let the original time series be $\{x_t\}_{t=1}^{T}$ and the target variable be $y_t$. During the training phase, TRF constructs input samples in the following format:
$$X_t = [x_t, x_{t-1}, \dots, x_{t-p+1}] \rightarrow y_t$$
Here, $y_t$ represents the target output at time $t$, and $p$ is the time window length (lag order); the training set is generated using a sliding window method:
$$D = \{(X_t, y_t)\}_{t=p}^{T}$$
where $D$ represents the training set generated using a sliding window method.
Each sample contains the current and previous p − 1 observation information, which can be used to predict the current or future state.
The overall training and prediction process is as follows: TRF trains multiple decision trees on the above-constructed sample set, with each tree constructed based on different data subsets and feature subsets. The final prediction result $\hat{y}_t$ is the average output (regression) or majority vote (classification) of all trees:
$$\hat{y}_t = \frac{1}{N}\sum_{i=1}^{N} g_i(X_t) \ \text{(regression)}, \qquad \hat{y}_t = \operatorname{mode}\{g_i(X_t)\}_{i=1}^{N} \ \text{(classification)}$$
where $g_i$ is the i-th decision tree, and N is the total number of trees. Its structure is shown in
Figure 3.
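The sliding-window sample construction and the ensemble aggregation can be sketched in a few lines. The sketch only builds the lagged samples and combines tree outputs; it does not train the trees themselves, and the helper names are hypothetical.

```python
from collections import Counter


def make_lagged_samples(x, y, p):
    """Build (X, Y) with X[j] = [x_t, x_{t-1}, ..., x_{t-p+1}] and Y[j] = y_t."""
    X, Y = [], []
    for t in range(p - 1, len(x)):
        X.append([x[t - k] for k in range(p)])
        Y.append(y[t])
    return X, Y


def majority_vote(tree_outputs):
    """Classification: the most frequent label among the tree outputs."""
    return Counter(tree_outputs).most_common(1)[0][0]


def average(tree_outputs):
    """Regression: the mean of the tree outputs."""
    return sum(tree_outputs) / len(tree_outputs)


X, Y = make_lagged_samples([1, 2, 3, 4, 5], ["a", "b", "c", "d", "e"], p=3)
print(X)  # [[3, 2, 1], [4, 3, 2], [5, 4, 3]]
print(Y)  # ['c', 'd', 'e']
print(majority_vote(["charge", "standby", "charge"]))  # charge
```

With a lag order of p, the first p − 1 time steps do not produce a sample, so a series of length T yields T − p + 1 training samples.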
2.4. Adaptive Sliding Window-Unscented Kalman Filter Data Preprocessing (ASW-UKF)
By cleaning and deduplicating lithium battery time series data using the adaptive sliding window (ASW) method, noise (such as high-frequency random fluctuations and spike noise) can be effectively removed from the original data while retaining key trends. However, the compression process may result in data point loss or discontinuity, especially in areas with high rates of change or during critical events. Additionally, battery data may contain missing values due to sensor failures or communication interruptions. To address these issues and further smooth the data to enhance the accuracy of subsequent analyses, the Unscented Kalman Filter (UKF) is employed as a subsequent step to the ASW method to fill in missing data and achieve data smoothing. Together, these two methods complete the entire data preprocessing phase.
Among them, the data smoothing process is as follows:
1. State equation:
$$x_k = f(x_{k-1}, u_{k-1}) + w_{k-1}$$
In the equation, $x_k$ is the state at time k, f is the nonlinear state transition function, $u_{k-1}$ is the control input, and $w_{k-1}$ is the process noise.
2. Observation equation:
$$z_k = h(x_k) + v_k$$
In the equation, $z_k$ represents the compressed data points output by ASW, h represents the nonlinear observation function, and $v_k$ represents the observation noise.
3. Fill in missing values:
When $z_k$ is missing, UKF skips the observation update step and directly uses the state prediction value $\hat{x}_{k|k-1}$ as the estimate to fill in the missing points. The prediction value is based on the state transition function f and the state $\hat{x}_{k-1}$ at the previous moment. Finally, a continuous estimate sequence is generated to fill in the missing points in the ASW output.
$$\hat{x}_{k|k-1} = \sum_{i=0}^{2n} W_i^{(m)} \chi_{k|k-1}^{(i)}$$
Among them, $\chi_{k|k-1}^{(i)}$ is the Sigma point, $W_i^{(m)}$ is the weight, and n is the state dimension. If $z_k$ is missing, then $\hat{x}_k = \hat{x}_{k|k-1}$.
4. Smooth the data:
Generate a smooth state estimate $\hat{x}_k$ by fusing the predicted value $\hat{x}_{k|k-1}$ and the observed value $z_k$ using the Kalman gain.
$$\hat{x}_k = \hat{x}_{k|k-1} + K_k\left(z_k - \hat{z}_{k|k-1}\right)$$
In this case, $K_k$ is the Kalman gain, and $\hat{z}_{k|k-1}$ is the predicted observation value.
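The fill-and-smooth procedure can be illustrated with a scalar UKF sketch. Several simplifications are assumed here and are not taken from the paper: the state is one-dimensional, the default transition and observation functions are identities (a random-walk model), the same weights are used for mean and covariance, and sigma points are redrawn after the prediction step. Missing observations are encoded as `None`.

```python
import math


def ukf_fill_and_smooth(z_seq, x0, p0, q, r, f=lambda x: x, h=lambda x: x, lam=2.0):
    """Scalar UKF that skips the update step when an observation is missing."""
    n = 1
    weights = [lam / (n + lam), 1.0 / (2 * (n + lam)), 1.0 / (2 * (n + lam))]

    def sigma(mean, var):
        spread = math.sqrt((n + lam) * var)
        return [mean, mean + spread, mean - spread]

    x, p, estimates = x0, p0, []
    for z in z_seq:
        # Prediction: propagate sigma points through the state transition.
        chi = [f(c) for c in sigma(x, p)]
        x_pred = sum(w * c for w, c in zip(weights, chi))
        p_pred = sum(w * (c - x_pred) ** 2 for w, c in zip(weights, chi)) + q
        if z is None:
            # Missing observation: the prediction becomes the estimate.
            x, p = x_pred, p_pred
        else:
            chi = sigma(x_pred, p_pred)
            zs = [h(c) for c in chi]
            z_pred = sum(w * zi for w, zi in zip(weights, zs))
            p_zz = sum(w * (zi - z_pred) ** 2 for w, zi in zip(weights, zs)) + r
            p_xz = sum(w * (c - x_pred) * (zi - z_pred) for w, c, zi in zip(weights, chi, zs))
            gain = p_xz / p_zz
            x = x_pred + gain * (z - z_pred)
            p = p_pred - gain * p_zz * gain
        estimates.append(x)
    return estimates


filled = ukf_fill_and_smooth([3.30, None, 3.32], x0=3.30, p0=0.01, q=1e-4, r=1e-3)
print(filled)
```

In this example the second value is filled by the prediction alone, and the third estimate lies between the prediction and the new measurement, weighted by the Kalman gain.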
The preprocessing of lithium battery time series data is accomplished by combining an adaptive sliding window (ASW) with an unscented Kalman filter (UKF). The two techniques work together to achieve efficient compression, noise removal, missing value filling, and trend retention, making it suitable for battery health status estimation.
2.5. Adaptive Sliding Window-Unscented Kalman Filter-Time Series Random Forest (ASW-UKF-TRF) Algorithm
After undergoing adaptive sliding window (ASW) and unscented Kalman filter (UKF) preprocessing steps, lithium-ion battery time-series data (such as voltage, current, and temperature) have been efficiently compressed, missing values filled, and noise smoothed, generating high-quality continuous data. These data serve as the foundation for subsequent classification and intelligent decision-making tasks. Then, using a time-series random forest (TRF), classification is performed and feature importance is evaluated to provide decision support for the battery management system.
Figure 4 shows the ASW-UKF-TRF algorithm flowchart.
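The importance-driven compression step can be sketched as follows, assuming that each sample carries an importance score in [0, 1]: samples at or above a threshold are kept unchanged, while runs of low-importance samples are replaced by block means. The function name, the threshold, and the block size are hypothetical; the paper's actual grading scheme may differ.

```python
def compress_by_importance(values, importance, threshold=0.5, block=4):
    """Keep high-importance samples; aggregate low-importance runs into block means."""
    out, buf = [], []

    def flush():
        # Replace the buffered low-importance run by the mean of each block.
        for start in range(0, len(buf), block):
            chunk = buf[start:start + block]
            out.append(sum(chunk) / len(chunk))
        buf.clear()

    for value, score in zip(values, importance):
        if score >= threshold:
            flush()
            out.append(value)  # retained at full resolution
        else:
            buf.append(value)
    flush()
    return out


values = [1.0, 1.0, 1.0, 1.0, 5.0, 2.0, 2.0]
scores = [0.1, 0.1, 0.1, 0.1, 0.9, 0.2, 0.2]
print(compress_by_importance(values, scores))  # [1.0, 5.0, 2.0]
```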
3. Experimental Simulation and Results
The experimental data used in this paper are derived from the actual operational data of the Battery Management System (BMS) of a lithium iron phosphate battery, assembled by China Xiamen Haichen Energy Storage Co., Ltd. (Xiamen, China) and installed in a storage cabinet at an energy storage company in Hunan, China. The data were collected continuously over one month, with uninterrupted acquisition of multidimensional battery pack time-series data throughout the entire process. The sampling interval was 30 s, and a total of 80,628 raw battery operation data points were recorded during the entire collection period. Data were acquired using CAN-bus Professional tools with a CANalyst-11 analyzer (serial number 31F00033448) integrated into the station CAN network. During logging, communication irregularities occasionally produced blank frames or error frames, which manifest as missing, corrupted, or noisy records in the raw logs. The proposed ASW-UKF-TRF preprocessing pipeline performs targeted cleaning, imputation, and smoothing to remove such noise and recover usable samples for subsequent segmentation, classification, and compression. All collected data were used for full-scale training of the algorithm model to validate the end-to-end performance of the ASW-UKF-TRF method in a real-world operational environment. The code was run on a device equipped with an Intel(R) Core(TM) i5-9300H CPU at 2.40 GHz, a Windows 10 Professional 64-bit operating system, and 8.00 GB of RAM. The software environment is Python 3.12.2 with PyTorch 2.7.1. This framework implements data classification and compressed simulation model construction based on the ASW-UKF-TRF algorithm, providing corresponding measurement and testing environments for each module of the algorithm. The experimental test environment is shown in
Figure 5,
Figure 6 and
Figure 7.
First, the ASW-UKF algorithm was employed to perform data preprocessing, with a total of 495,686 data points marked and processed. After cleaning, denoising, filling, and smoothing, clean data were obtained. The original file size was 18,513,304 bytes, and the processed file size was 11,450,196 bytes, a reduction of about 38.15%. The fitting results between the processed data and the original data are shown in
Figure 8,
Figure 9 and
Figure 10.
A comparative experiment was conducted using the same data but different methods. Compared with the algorithms in [17, 20, 24] and with traditional algorithm combinations, the proposed method significantly improved the fitting degree and compression ratio under similar operating conditions. Additionally, a comparison of the computational costs among the different algorithms showed that our method achieves better efficiency without compromising performance. The comparison results are shown in
Table 3 and
Table 4.
Taking total voltage as an example, the preprocessed data from each method were exported as images for intuitive comparison of the results, as shown in
Figure 11. The ASW-UKF algorithm fits the original curve best compared to other algorithms and is more suitable for further processing.
After ASW-UKF data preprocessing, the data is further classified and compressed using the TRF algorithm. Based on key features such as the instantaneous change rate of voltage and the rolling standard deviation of current, the entire operation process is dynamically divided into four sequential segments: the load-discharge segment, the regular charging segment, the standby discharge segment, and the charge–discharge change segment. The processing results are shown in
Figure 12 and
Figure 13.
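The two segmentation features named above can be computed as in the following sketch. It assumes a fixed 30 s sampling interval, a trailing window for the rolling statistic, and the population standard deviation; the function names are illustrative only.

```python
import math


def voltage_change_rate(voltage, dt=30.0):
    """Instantaneous change rate (V/s) between consecutive samples; first value is 0."""
    return [0.0] + [(voltage[k] - voltage[k - 1]) / dt for k in range(1, len(voltage))]


def rolling_std(values, window):
    """Population standard deviation over a trailing window (shorter at the start)."""
    out = []
    for k in range(len(values)):
        chunk = values[max(0, k - window + 1): k + 1]
        mean = sum(chunk) / len(chunk)
        out.append(math.sqrt(sum((c - mean) ** 2 for c in chunk) / len(chunk)))
    return out


rates = voltage_change_rate([3.0, 3.3, 3.3])
spread = rolling_std([1.0, 1.0, 1.0, 3.0], window=2)
print(rates, spread)
```

Large change rates or a sudden rise in the rolling standard deviation would be natural cues for a boundary between operating segments.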
Compared with traditional random forest algorithms, time-series random forests offer better performance in processing time-series data. The classification results of RF are characterized by widespread state transitions, which are neither smooth nor stable. In contrast, TRF can significantly reduce meaningless short-term fluctuations, maintain long-term stable segments, and more closely reflect the actual operating status of energy storage power stations. A comparison of the classification performance of the two algorithms is shown in
Figure 14.
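One simple way to suppress short-lived label flips is a minimum-segment-length rule, sketched below. It assumes that any run of identical labels shorter than `min_len` is absorbed into the preceding run; this is an illustrative post-processing step, not necessarily the mechanism used inside TRF.

```python
def enforce_min_segment(labels, min_len):
    """Merge label runs shorter than min_len into the preceding run.

    The first run is always kept, even if it is short.
    """
    if not labels:
        return []
    # Run-length encode the label sequence as [label, count] pairs.
    runs = []
    for lab in labels:
        if runs and runs[-1][0] == lab:
            runs[-1][1] += 1
        else:
            runs.append([lab, 1])
    merged = [runs[0]]
    for lab, count in runs[1:]:
        if count < min_len or lab == merged[-1][0]:
            merged[-1][1] += count  # absorb short or same-label runs
        else:
            merged.append([lab, count])
    return [lab for lab, count in merged for _ in range(count)]


raw = ["charge"] * 5 + ["standby"] + ["charge"] * 4 + ["discharge"] * 5
smoothed = enforce_min_segment(raw, min_len=3)
print(smoothed)
```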
Figure 15, Figure 16 and Figure 17 illustrate the effects of the proposed ASW-UKF-TRF processing on representative station time series.
Figure 15 shows an overlay of original and compressed/reconstructed signals for a typical measurement channel; the close alignment indicates preservation of the primary waveform and key transient features.
Figure 16 presents the residual analysis, including a histogram and time series of reconstruction errors, along with summary statistics (mean error, RMSE, MAE, and maximum absolute error). This demonstrates that the residuals remain small and concentrated, except for rare transient peaks.
Figure 17 summarizes the classification-aware compression performance across the four data categories: per-class sampling reduction, per-class compression ratio (bytes retained/bytes original), and the number of preserved diagnostically relevant events per class (events detected before and after compression). Both sample-level and file-size reductions are reported: the dataset was reduced from 80,628 samples to 40,766 samples (a 49.44% reduction in sample count), and the stored file size decreased from 18,513,304 bytes to 7,143,949 bytes (a 61.41% reduction in bytes). In addition to these aggregate figures, quantitative metrics support the visual plots, including reconstruction statistics (RMSE, MAE, maximum error), classification accuracy (exceeding 95% on the evaluated station dataset), anomaly F1 score for preserved events, and per-class compression ratios. These metrics confirm that the compression achieves substantial storage savings while maintaining monitoring fidelity and preserving transient diagnostic signatures.
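The aggregate reductions follow directly from the reported counts and can be recomputed with a short helper, using the definition reduction = 1 − reduced/original. The function name is illustrative.

```python
def percent_reduction(original, reduced):
    """Percentage by which `reduced` is smaller than `original`."""
    return 100.0 * (1.0 - reduced / original)


sample_reduction = percent_reduction(80_628, 40_766)        # sample count
byte_reduction = percent_reduction(18_513_304, 7_143_949)   # stored file size
print(f"samples: {sample_reduction:.2f}%, bytes: {byte_reduction:.2f}%")
```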
To ensure reproducibility and methodological transparency, key hyperparameters were explicitly specified for each module of the proposed ASW-UKF-TRF framework. The sampling rate was fixed at once every 30 s, yielding a uniform data resolution that underpinned subsequent feature extraction and segmentation. To ensure consistent labeling, the classification targets were clearly defined in accordance with state-of-health thresholds. Within the adaptive sliding window (ASW) module, the minimum window size, maximum window size, and the change threshold governed the flexibility of dynamic segmentation. Additionally, rounding decimals was applied to stabilize equality checks during voltage, current, and SOC comparisons.
For state estimation with the unscented Kalman filter (UKF), the process noise scale (Q) and measurement noise scale (R) were calibrated as proportional factors to balance system dynamics and measurement fidelity. In the time-series random forest (TRF) classifier, the minimum segment length was introduced to suppress spurious fluctuations, while the maximum tree depth regulated model complexity. Collectively, these hyperparameters constitute a transparent configuration that facilitates both the evaluation of model performance and the reproducibility of results in independent studies. The hyperparameters are shown in
Table 5 4. Discussion
The ASW-UKF-TRF algorithm proposed in this paper achieves end-to-end processing and efficient compression of massive battery data from energy storage power stations in three stages: an adaptive sliding window (ASW) cleans noise and duplicate data, an unscented Kalman filter (UKF) fills in missing values and smooths the time series, and a time series random forest (TRF) classifies the data and compresses them according to importance. Simulation experiments based on real-world data from Hunan Province demonstrate that this method significantly reduces the total data volume while maintaining high accuracy (lowest RMAE and highest R²). It thereby provides lightweight yet high-quality time-series data support for applications such as online monitoring, fault diagnosis, and intelligent scheduling, demonstrating clear engineering application value.
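The role of the UKF stage in filling missing values can be illustrated with a deliberately small example. The Python sketch below implements a scalar unscented Kalman filter under strong simplifying assumptions that are not taken from the paper: a random-walk process model, an identity measurement model, missing samples marked as None, and noise levels set as proportional factors of the observed variance. The function name, the default scale values, and the sigma-point parameters are hypothetical.

```python
import math


def ukf_fill(series, q_scale=1e-3, r_scale=1e-2, alpha=1.0, beta=2.0, kappa=2.0):
    """Smooth a scalar series and fill gaps (None) with a 1-D unscented Kalman filter."""
    observed = [z for z in series if z is not None]
    if not observed:
        raise ValueError("series contains no observed samples")
    mean_obs = sum(observed) / len(observed)
    var_obs = sum((z - mean_obs) ** 2 for z in observed) / len(observed) or 1.0
    Q = q_scale * var_obs  # process noise, proportional factor (hypothetical value)
    R = r_scale * var_obs  # measurement noise, proportional factor (hypothetical value)

    n = 1  # state dimension
    lam = alpha ** 2 * (n + kappa) - n
    wm = [lam / (n + lam), 0.5 / (n + lam), 0.5 / (n + lam)]  # mean weights
    wc = [wm[0] + (1.0 - alpha ** 2 + beta), wm[1], wm[2]]    # covariance weights

    def f(x):  # assumed process model: random walk
        return x

    def h(x):  # assumed measurement model: direct observation
        return x

    def sigma_points(x, P):
        s = math.sqrt((n + lam) * P)
        return [x, x + s, x - s]

    x, P = observed[0], R
    filled = []
    for z in series:
        # Prediction step: propagate sigma points through the process model.
        pts = [f(p) for p in sigma_points(x, P)]
        x_pred = sum(w * p for w, p in zip(wm, pts))
        P_pred = sum(w * (p - x_pred) ** 2 for w, p in zip(wc, pts)) + Q
        if z is None:
            # Missing sample: keep the prediction, uncertainty grows by Q.
            x, P = x_pred, P_pred
        else:
            # Update step: unscented transform through the measurement model.
            pts = sigma_points(x_pred, P_pred)
            zs = [h(p) for p in pts]
            z_pred = sum(w * zi for w, zi in zip(wm, zs))
            S = sum(w * (zi - z_pred) ** 2 for w, zi in zip(wc, zs)) + R
            C = sum(w * (p - x_pred) * (zi - z_pred) for w, p, zi in zip(wc, pts, zs))
            K = C / S
            x = x_pred + K * (z - z_pred)
            P = P_pred - K * S * K
        filled.append(x)
    return filled


# Hypothetical voltage trace with two missing samples.
trace = [3.30, 3.31, None, None, 3.34, 3.35]
filled = ukf_fill(trace)
print([round(v, 4) for v in filled])
```

With a random-walk model the filter simply holds its last estimate across a gap while its variance grows, so the next valid measurement is weighted more heavily. A battery-specific process model (for example, one driven by current and SOC) would replace `f` and `h` in a realistic setting.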
Theoretically, this paper integrates the ASW, UKF, and TRF paradigms for the first time, proposing a new framework that combines adaptive sliding window cleaning, filtering-based interpolation and smoothing, and importance-based hierarchical compression. To address the sudden changes and missing values characteristic of battery data, a dedicated UKF filtering-interpolation strategy was designed to enhance data reconstruction accuracy. In addition, the interpretability of the TRF was used to assess the importance of different data categories, providing a quantifiable method for prioritizing time-series data management. These theoretical innovations offer new quantitative tools and methodologies for data processing and optimized scheduling in large-scale energy storage systems.
Although the method demonstrated notable improvements in accuracy and compression efficiency under the present test conditions, several limitations related to scale and battery ageing remain. First, computational cost and memory usage increase with the number of monitored variables, the length of the sliding windows, and the number of sigma points used in the UKF stage. For continuous station streams, this implies longer processing times and a larger memory footprint unless these are bounded by windowing or reduced-order models. Second, data from aged or faulty batteries tend to exhibit more frequent transients, nonstationary bias, and higher noise levels, which can degrade classification accuracy and compression fidelity if the importance metric or model parameters are not adapted. To make these considerations explicit, we propose the following evaluation and mitigation plan: (i) quantify runtime (throughput in samples/s) and peak memory as functions of data volume using representative workloads (e.g., datasets of ~10^5, ~10^6, and ~10^7 samples); (ii) evaluate algorithmic performance across SOH levels (e.g., nominal 100%, 90%, 80%, and 70%) and for injected fault cases, reporting classification accuracy, anomaly F1, compression ratio, and reconstruction error as primary metrics; (iii) explore mitigation strategies such as reduced-order state models or sparse sigma-point selection to cut per-update cost, adaptive thresholding and online parameter updates to track ageing-induced distribution shifts, and parallel/streaming implementations (batching or GPU acceleration for the TRF) to increase throughput. These steps will quantify how accuracy and compression scale with dataset size and battery degradation, and they will inform practical deployment choices (e.g., update frequency, model complexity, and hardware provisioning) for long-term station operation.
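Step (i) of this plan could be organized along the lines of the following Python sketch, which measures wall-clock throughput and peak traced memory for increasing workload sizes using only the standard library. The harness, the synthetic data generator, and the placeholder processing function are hypothetical stand-ins and do not represent the ASW-UKF-TRF pipeline itself; the small sizes are chosen only so that the example runs quickly.

```python
import time
import tracemalloc


def benchmark(process, make_data, sizes):
    """Measure runtime, throughput, and peak Python memory for each workload size."""
    results = []
    for n in sizes:
        data = make_data(n)  # generated before tracing, so inputs are not counted
        tracemalloc.start()
        t0 = time.perf_counter()
        process(data)
        elapsed = time.perf_counter() - t0
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        results.append({
            "samples": n,
            "seconds": elapsed,
            "throughput": n / elapsed if elapsed > 0 else float("inf"),
            "peak_bytes": peak,
        })
    return results


def make_synthetic_voltage(n):
    """Hypothetical workload: a slowly repeating voltage pattern."""
    return [3.30 + 0.001 * (i % 50) for i in range(n)]


def placeholder_pipeline(values):
    """Stand-in for the real processing chain: round and drop repeated values."""
    out = []
    for v in values:
        r = round(v, 2)
        if not out or out[-1] != r:
            out.append(r)
    return out


# Small sizes for a quick run; the plan in the text targets ~10^5 to ~10^7 samples.
results = benchmark(placeholder_pipeline, make_synthetic_voltage, [1_000, 10_000])
for row in results:
    print(f"{row['samples']:>8d} samples  "
          f"{row['throughput']:.0f} samples/s  peak {row['peak_bytes']} bytes")
```

Two caveats apply: `tracemalloc` tracks only allocations made through the Python allocator and adds measurable overhead, so throughput and peak memory are better measured in separate runs when absolute numbers matter.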