Article

Dictionary Learning-Based Data Pruning for System Identification

1 College of Civil Engineering, Tongji University, Shanghai 200092, China
2 Shanghai Qi Zhi Institute, Shanghai 200232, China
3 Baosight Software, Shanghai 201900, China
4 State Key Laboratory of Disaster Reduction in Civil Engineering, Tongji University, Shanghai 200092, China
* Authors to whom correspondence should be addressed.
These authors contributed equally to this work.
Appl. Sci. 2025, 15(17), 9368; https://doi.org/10.3390/app15179368
Submission received: 22 July 2025 / Revised: 20 August 2025 / Accepted: 21 August 2025 / Published: 26 August 2025
(This article belongs to the Section Mechanical Engineering)

Abstract

In system identification, augmenting time series data via time shifting and nonlinearisation can lead to both feature and sample redundancy. However, research has mainly focused on feature redundancy while largely ignoring sample redundancy. This paper proposes a novel data pruning method, called mini-batch FastCan, that reduces sample-wise redundancy based on dictionary learning. The time series data are represented by a small set of representative samples (atoms) obtained via dictionary learning, and informative samples are then selected according to their correlation with these atoms. The method is tested on two simulated datasets and two benchmark datasets. The R-squared value between the coefficients of models trained on the full datasets and the coefficients of models trained on the pruned datasets is adopted to evaluate the performance of data pruning methods. The proposed method is found to significantly outperform random pruning, yielding a higher median or mean and a lower variance of R-squared values.

1. Introduction

System identification refers to methods for identifying the mathematical description of a dynamic system from measured input–output data [1]. It can be used to forecast future values, assess the effects of input variations, design control schemes, etc. According to its objective, system identification can generally be divided into two types [2]. The first type focuses on approximation schemes that minimise prediction errors, such as fuzzy logic [3] and neural networks [4]. The second type focuses on elucidating the underlying rule that represents the system, such as spectral analysis [5], the Volterra series [6], and nonlinear autoregressive with eXogenous inputs (NARX) models [7]. NARX-based methods have gained increasing attention in various fields such as engineering, finance, biology, and the social sciences due to their flexibility in modelling a variety of systems while maintaining interpretability [2,8,9].
To avoid model overfitting and reduce computational complexity, feature selection methods based on orthogonalisation techniques and greedy search are widely applied in constructing NARX models [10,11,12]. The orthogonalisation-based method was first derived in [10] to efficiently decide which terms should be included in nonlinear autoregressive moving average with eXogenous inputs models. It was then further developed in [11] into orthogonal forward-regression estimators that identify parsimonious models of structure-unknown systems by modifying and augmenting orthogonal least squares methods. A more comprehensive review of this feature selection idea and its development for system identification can be found in [12].
Similar to feature selection, input sample selection is a crucial step in system identification that can improve identification performance and reduce storage and training costs [12,13,14]. While the optimal selection of input samples (often referred to as data pruning) is an active research topic in computer vision [15,16,17] and natural language processing [18,19], it has received far less attention in system identification, where it remains a challenging task. Data pruning plays a vital role in system identification for two reasons. First, time series data collected from continuous physical processes often exhibit strong temporal correlation, especially when sampled at high frequencies, resulting in redundant observations that hinder efficient model training. Second, the common use of time-shift operations to generate delayed signal versions further exacerbates data redundancy.
This paper proposes a data pruning method based on dictionary learning to select useful time series samples for constructing a concise mathematical model that represents a system accurately. Specifically, the reduced polynomial NARX method is applied to the identification of nonlinear dynamic systems as an example, although the proposed method can also be used to select informative samples for other types of system identification models. The canonical correlation-based fast feature selection method introduced in [20,21] is adopted to find the most important terms to include in the NARX model to achieve the required accuracy. Dictionary learning is then combined with the fast selection method based on canonical correlation (FastCan) to find the most useful time series samples for training these terms.
Dictionary learning, initially popular in the signal processing field, is used to learn a set of fundamental elements known as atoms, where a sparse linear combination of these atoms can generate the original signal [22]. It is currently widely studied for image processing, since images usually admit sparse representation [23,24]. For the proposed method, the k-means-based dictionary learning, introduced in Section 2.2, provides a dictionary of time terms, which is used as pseudo-labels in the data pruning process. An atom here is a basic, representative pattern within a set of time series samples.
In this paper, the proposed method, named mini-batch FastCan, is compared with the random pruning method. Random pruning commonly serves as a reliable benchmark in sample selection tasks and often yields better results than more advanced methods when 30% or less of the data is retained [25]. The reduced NARX trained with the full dataset is used as the baseline model. The coefficients learned with the pruned dataset are compared with the coefficients of the baseline model. The closer the coefficients are to the baseline model, the better the pruned dataset is. The schematic diagram of the research in this paper is illustrated in Figure 1 to provide a high-level overview of the proposed approach.
The next section introduces the details of the proposed data pruning method for system identification, followed by numerical case studies to demonstrate the necessity and advantages of the method. Subsequently, case studies on two public benchmark datasets [26,27] for nonlinear system identification are presented, with a discussion on the effect of hyperparameters in the proposed method.
Figure 1. The schematic diagram of this research. (1) Input Data: A time series y ( t ) is generated and augmented via time-shifting and nonlinearisation to create a redundant term library. Feature redundancy is represented by a correlation heatmap, while sample redundancy is visualised by a principal component analysis (PCA) plot. Feature selection is then applied to the term library to reduce feature-wise redundancy. (2) Methodology: The baseline model is trained on the full set of samples (Dataset A). The random pruned model is trained on a random subset of Dataset A (Dataset B). The mini-batch FastCan pruned model is trained on the samples selected based on the atoms learned from Dataset A (Dataset C). To compare the data pruning methods quantitatively, Dataset A is pruned repeatedly by random selection and mini-batch FastCan, and the models are then trained on the pruned datasets, respectively. (3) Output Results: The selected sample distributions obtained by both methods are visualised via a PCA plot. Pruning performance is evaluated by the R-squared score between the coefficients learned from the pruned samples and the coefficients of the baseline model. (The colour style used in this figure is from bg-mpl-stylesheets [28]).

2. Methodology

This section introduces the dictionary learning-based data pruning method, mini-batch FastCan, for a discrete system with finite orders.

2.1. Reduced Polynomial NARX

The reduced polynomial NARX model is a nonlinear dynamic model that represents the system output as a sparse subset of polynomial terms consisting of past outputs and inputs selected from the full NARX model structure. It reduces complexity by including only important or relevant terms rather than all possible combinations. Mathematically, a NARX model is used to simulate a system, which is formulated as
$$ y(k) = F\big[y(k-1), y(k-2), \ldots, y(k-n_y),\, u(k-1), u(k-2), \ldots, u(k-n_u)\big] + e(k), \tag{1} $$
where $y(k)$, $u(k)$, and $e(k)$ are the system output, input, and noise sequences, respectively; $n_y$ and $n_u$ are the maximum lags for the system output and input; and $F[\cdot]$ is a nonlinear function.
The power-form polynomial model is adopted to approximate the nonlinear mapping $F[\cdot]$, and Equation (1) is then given as
$$ y(k) = \theta_0 + \sum_{i_1=1}^{n} f_{i_1}\big(x_{i_1}(k)\big) + \sum_{i_1=1}^{n}\sum_{i_2=i_1}^{n} f_{i_1 i_2}\big(x_{i_1}(k), x_{i_2}(k)\big) + \cdots + \sum_{i_1=1}^{n}\cdots\sum_{i_\ell=i_{\ell-1}}^{n} f_{i_1 i_2 \cdots i_\ell}\big(x_{i_1}(k), x_{i_2}(k), \ldots, x_{i_\ell}(k)\big) + e(k), \tag{2} $$
where $\ell$ is the degree of polynomial nonlinearity, $n = n_y + n_u$, and
$$ f_{i_1 i_2 \cdots i_d}\big(x_{i_1}(k), x_{i_2}(k), \ldots, x_{i_d}(k)\big) = \theta_{i_1 i_2 \cdots i_d} \prod_{j=1}^{d} x_{i_j}(k), \quad 1 \le d \le \ell, $$
$$ x_{i_j}(k) = \begin{cases} y(k - i_j), & 1 \le i_j \le n_y \\ u(k - i_j + n_y), & n_y + 1 \le i_j \le n = n_y + n_u \end{cases} $$
where $\theta_{i_1 i_2 \cdots i_d}$ are model parameters and $\prod_{j=1}^{d} x_{i_j}(k)$ are model terms whose order is not higher than $\ell$.
The total number of model terms in the polynomial NARX model given in Equation (2) is $M = (n+\ell)!/(n!\,\ell!)$. It can be seen that the full NARX model can include a large number of terms, increasing the risk of overfitting. Because only a subset of these terms is typically important for capturing the underlying dynamic relationship [2], the canonical correlation-based fast feature selection is carried out here to find the $m$ significant model terms among the candidates $\prod_{j=1}^{d} x_{i_j}(k)$, with $y(k)$ as the target. Therefore, the reduced polynomial NARX model can be given as
$$ y(k) = \theta_0 + \sum_{j=1}^{m} f^{(j)}_{i_1 i_2 \cdots i_d}\big(x_{i_1}(k), x_{i_2}(k), \ldots, x_{i_d}(k)\big) + e(k), \quad 1 \le m \le M, $$
where $f^{(j)}_{i_1 i_2 \cdots i_d}$ denotes the $j$-th selected model term.
Refer to [20,21] for details on the specific feature selection steps.
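To make the construction of the term library concrete, the following Python sketch builds the lagged regressors of Equation (1) and expands them into the polynomial terms of Equation (2) with scikit-learn. This is a minimal sketch: the function name `make_term_library` and the default lags are illustrative assumptions, not the paper's released code.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

def make_term_library(y, u, n_y=4, n_u=4, degree=3):
    """Build a polynomial NARX term library and the target y(k)."""
    k0 = max(n_y, n_u)                                   # first index with all lags available
    lags = [np.roll(y, i) for i in range(1, n_y + 1)]    # y(k-1), ..., y(k-n_y)
    lags += [np.roll(u, i) for i in range(1, n_u + 1)]   # u(k-1), ..., u(k-n_u)
    X_lin = np.column_stack(lags)[k0:]                   # drop rows with wrapped lags
    # All monomials of the lagged signals up to the given degree (cf. Equation (2)).
    poly = PolynomialFeatures(degree=degree, include_bias=False)
    return poly.fit_transform(X_lin), np.asarray(y)[k0:]
```

The number of generated columns equals $M - 1 = (n+\ell)!/(n!\,\ell!) - 1$ (the intercept is handled separately), from which the feature selection step keeps only the $m$ significant terms.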

2.2. Data Pruning with Mini-Batch FastCan

For simplicity, the selected model terms are represented by the matrix X as follows:
$$ \mathbf{X} = \begin{bmatrix} x_{1,1} & \cdots & x_{1,N} \\ \vdots & \ddots & \vdots \\ x_{m,1} & \cdots & x_{m,N} \end{bmatrix}, $$
where $m$ is the number of features, i.e., the selected model terms, and $N$ is the number of time series samples given these selected features.
A dictionary $\mathbf{D} \in \mathbb{R}^{m \times q}$ for the matrix $\mathbf{X} \in \mathbb{R}^{m \times N}$ is obtained by mini-batch k-means clustering [29] and used as the target matrix for the subsequent data pruning step, where $q$ is the number of atoms in the dictionary and typically $q > m$ for an overcomplete dictionary. Each atom is a column vector of $\mathbf{D}$. Each data sample $\mathbf{x}_j \in \mathbb{R}^m$ (a column of $\mathbf{X}$) can be approximated as a sparse linear combination of the dictionary atoms:
$$ \mathbf{x}_j \approx \mathbf{D}\mathbf{a}_j, $$
where $\mathbf{a}_j \in \mathbb{R}^q$ is a sparse coefficient vector for the $j$-th sample, i.e., most of the entries of $\mathbf{a}_j$ are zero.
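A minimal sketch of this dictionary-learning step, assuming scikit-learn's MiniBatchKMeans as the mini-batch k-means implementation of [29]; the helper name `learn_dictionary` is illustrative. A sparse coder (e.g., orthogonal matching pursuit) could recover the coefficient vectors $\mathbf{a}_j$, but only the atoms are needed as pseudo-labels here.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def learn_dictionary(X, q, seed=0):
    """X: (m, N) sample matrix; returns the dictionary D of shape (m, q)."""
    km = MiniBatchKMeans(n_clusters=q, random_state=seed)
    km.fit(X.T)                    # scikit-learn clusters rows, so transpose
    return km.cluster_centers_.T   # atoms as columns: D in R^{m x q}
```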
The canonical correlation-based fast selection method [21] is performed again, with $\mathbf{D}$ as the target, to find $n$ significant samples from the sample matrix $\mathbf{X}$. For selection methods that evaluate the linear association between candidates, no additional information is gained once the number of selected samples $n$ exceeds the rank of the data matrix, which is at most $m$. Therefore, to avoid invalid sample selection when $n > m$, samples are selected in separate batches by the mini-batch FastCan method, whose pseudocode is given in Algorithm 1; the corresponding code is available on GitHub (https://github.com/MatthewSZhang/data-pruning-sysid (accessed on 21 July 2025; version 0.1.0), https://github.com/scikit-learn-contrib/fastcan (accessed on 21 July 2025; version 0.4.0)). Within each batch, the redundancy and interaction between samples are considered, while they are ignored between batches.
Algorithm 1: Mini-batch FastCan
Input: $\mathbf{X} \in \mathbb{R}^{m \times N}$; ▹ Sample matrix
    $q \in \mathbb{N}$; ▹ Number of atoms in the dictionary $\mathbf{D}$
    $p \in \mathbb{N}$; ▹ Batch size (optional)
    $n \in \mathbb{N}$; ▹ Number of samples to select
Output: $\mathbf{s} \in \mathbb{R}^{1 \times n}$; ▹ Selected indices
Step 1:
    Apply the k-means-based dictionary learning [29] to $\mathbf{X}$ with $q$ clusters;
    The resulting $q$ cluster centres form the columns of the dictionary $\mathbf{D} \in \mathbb{R}^{m \times q}$; ▹ Target matrix
Step 2:
    if $p$ is not specified or $p > \lceil n/q \rceil$ then set $p \leftarrow \lceil n/q \rceil$;
    if $p > m$ then set $p \leftarrow m$;
Step 3:
    Generate the batch matrix $\mathbf{B} \in \mathbb{R}^{q \times t}$, where $t = \lceil n/(q \cdot p) \rceil$, $\mathbf{B}[i,j] \le p$, and $\sum_{i=1}^{q} \sum_{j=1}^{t} \mathbf{B}[i,j] = n$;
Step 4:
    Initialise the candidate sample matrix $\mathbf{X}_c \leftarrow \mathbf{X}$ and the target matrix $\mathbf{D}_c \leftarrow \mathbf{D}$;
    for $i \leftarrow 1$ to $q$ do
        Let $\mathbf{d}_i \leftarrow \mathbf{D}_c[:, i]$;
        for $j \leftarrow 1$ to $t$ do
            Select $\mathbf{B}[i,j]$ samples from $\mathbf{X}_c$ by the canonical-correlation-based fast selection method [21], with $\mathbf{d}_i$ serving as the target vector;
            Append the indices of the selected samples to $\mathbf{s}$;
            Remove the selected samples from $\mathbf{X}_c$;
    return $\mathbf{s}$;
In Algorithm 1, if the batch size p is not specified, it is set to ⌈n/q⌉. If a given p exceeds ⌈n/q⌉, it is also reset to ⌈n/q⌉, since exceeding this threshold would result in some atoms being excluded from the selection process. Additionally, the algorithm allows the number of atoms q to be smaller than the number of features m, enabling a larger batch size to better capture sample redundancy.
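The sketch below follows the structure of Algorithm 1 but, as a simplification, scores candidates by their absolute Pearson correlation with the current atom instead of the canonical-correlation criterion of [21], and it does not orthogonalise against already-selected samples within a batch; the released fastcan package implements the actual criterion. It reuses `learn_dictionary` from the sketch above, and the function name is illustrative.

```python
import math
import numpy as np

def minibatch_fastcan(X, q, n, p=None, seed=0):
    """X: (m, N) sample matrix; returns indices of n selected samples."""
    m, N = X.shape
    D = learn_dictionary(X, q, seed)                 # Step 1: atoms as pseudo-labels
    p_max = math.ceil(n / q)                         # Step 2: cap the batch size
    p = min(p if p is not None else p_max, p_max, m)
    t = math.ceil(n / (q * p))                       # Step 3: batches per atom
    selected, remaining = [], list(range(N))
    for i in range(q):                               # Step 4: batch-wise selection
        d_i = D[:, i]
        for _ in range(t):
            if len(selected) >= n or not remaining:
                break
            batch = min(p, n - len(selected), len(remaining))
            cand = X[:, remaining]
            # Simplified score: |Pearson correlation| of each candidate with d_i.
            scores = np.abs(np.corrcoef(cand.T, d_i)[:-1, -1])
            # Pop in descending position order so earlier pops do not shift later ones.
            for b in sorted(np.argsort(scores)[::-1][:batch], reverse=True):
                selected.append(remaining.pop(int(b)))
    return np.array(selected)
```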

3. Numerical Case Studies

3.1. Visualisation of Sample Redundancy in System Identification

Here is an example that intuitively demonstrates the feature-wise and sample-wise redundancy arising in the system identification process. A time series sampled from $y(t) = \sin(2\pi t)$ over a duration of 1 s at a sample rate of 100 Hz is illustrated in Figure 2. After applying time shifting, the one-dimensional data $y(t)$ is transformed into 20-dimensional data, as shown in Table 1, with a time step of $\Delta t = 0.01$ s. There are 80 time series samples in total, as samples containing NaN values are excluded.
The feature-wise redundancy is illustrated in the correlation heatmap of Figure 3a, where the $i$-th feature corresponds to $y(t - i\Delta t)$ with $1 \le i \le 20$. The high Pearson correlation near the main diagonal indicates redundancy between neighbouring features, such as $y(t - i\Delta t)$ and $y(t - (i+1)\Delta t)$. This kind of redundancy can be mitigated via feature selection.
Figure 3b shows the redundancy within the time series samples using principal component analysis (PCA). After projecting the 20-dimensional data into two-dimensional space, it is found that the data forms a continuous trajectory over time. The proximity of adjacent samples in the lower-dimensional space given by PCA indicates that there is redundancy between these time series samples. This redundancy can be addressed by data pruning.
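The redundancy visualisation of this subsection can be reproduced along the following lines, assuming the stated sampling settings (100 Hz over 1 s, 20 lags); the snippet is a sketch, not the paper's plotting code.

```python
import numpy as np
from sklearn.decomposition import PCA

t = np.arange(0, 1, 0.01)                  # 100 Hz over 1 s
y = np.sin(2 * np.pi * t)
lagged = np.column_stack([np.roll(y, i) for i in range(1, 21)])[20:]  # 80 x 20 lag matrix

corr = np.corrcoef(lagged, rowvar=False)   # feature-wise redundancy (cf. Figure 3a)
pcs = PCA(n_components=2).fit_transform(lagged)  # sample trajectory (cf. Figure 3b)
print(corr[0, 1], pcs.shape)               # near-unity neighbour correlation; (80, 2)
```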

3.2. Data with Dual Stable Equilibria

The simulated time series data is generated by a non-autonomous nonlinear system given by
$$ \ddot{y} + \dot{y} - y + y^2 + y^3 = u, $$
where $u(t) = 0.1\cos(0.2\pi t)$. By setting $u(t) = 0$ and initialising $y$ and $\dot{y}$ with different values, a phase portrait of the nonlinear system is obtained, as shown in Figure 4.
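As an illustration, the simulated data can be generated along these lines, assuming the reconstructed form of the equation above; the time horizon, sampling rate, and initial condition are arbitrary choices for the sketch.

```python
import numpy as np
from scipy.integrate import solve_ivp

def rhs(t, z):
    # State z = [y, y_dot]; rearranged from y'' + y' - y + y^2 + y^3 = u
    # with u(t) = 0.1 * cos(0.2 * pi * t).
    y, ydot = z
    u = 0.1 * np.cos(0.2 * np.pi * t)
    return [ydot, u - ydot + y - y**2 - y**3]

t_eval = np.arange(0, 50, 0.01)                           # 100 Hz sampling (assumed)
sol = solve_ivp(rhs, (0, 50), [0.5, 0.0], t_eval=t_eval)  # illustrative initial state
y_series = sol.y[0]                                       # measured output
u_series = 0.1 * np.cos(0.2 * np.pi * t_eval)             # corresponding input
```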
This system exhibits two stable equilibria, and five measurements are simulated for each of the left and right equilibria; the dataset is therefore referred to as the Symmetrical Dual-Stable-Equilibria (SDSE) dataset. To comprehensively capture its dynamics, the data pruning algorithm should identify the two distinct kinds of samples attracted to the different stable spirals and select samples from both. Unlike in feature selection, random selection serves as a strong baseline for sample selection and often outperforms more sophisticated algorithms when retaining 30% or less of the data [25]. Therefore, the results of the proposed data pruning method are compared with those of random selection to demonstrate its advantages.
To quantitatively evaluate the selection performance of the mini-batch FastCan and random methods, a reduced polynomial NARX model with ten terms and an intercept is derived using the full dataset, and the model coefficients trained on the full training dataset are used as the baseline results. A more detailed description of the baseline NARX model and its prediction performance on the test dataset is provided in Appendix A. The NARX terms are then fixed, and the model coefficients are retrained using the pruned training datasets obtained via the mini-batch FastCan and random selection methods. As an example, 100 samples are selected, and the corresponding number of atoms for dictionary learning is set to 15 (the reason for this setting is discussed in Section 5). The coefficients trained on the two pruned datasets are compared with those trained on the full dataset using the R-squared value. The sample selection and training processes are repeated ten times, with the results presented in Figure 5a.
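The evaluation protocol can be summarised in code as follows; for brevity, the full term library stands in for the ten selected terms, and the helper names (`make_term_library`, `minibatch_fastcan`) and the simulated signals `y_series`/`u_series` follow the earlier sketches rather than the paper's released code.

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Fixed NARX terms and target built from the simulated signals.
terms, target = make_term_library(y_series, u_series)
baseline = LinearRegression().fit(terms, target)      # full-data coefficients

idx = minibatch_fastcan(terms.T, q=15, n=100)         # prune to 100 samples
pruned = LinearRegression().fit(terms[idx], target[idx])
print(r2_score(baseline.coef_, pruned.coef_))         # coefficient agreement
```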
As shown in Figure 5a, the mini-batch FastCan method gives results with a higher median and lower variance, indicating superior and more consistent performance compared to random selection. As illustrated in Figure 6a, the same pattern is observed when the number of selected samples varies between 50 and 150 with the number of atoms fixed at 15, except for the case of 60 samples, which is discussed in Section 5. Furthermore, for the mini-batch FastCan method, the variance of the model coefficients decreases as the number of selected samples increases; this relationship is not observed for random selection in Figure 6a.
To examine whether the sample distributions corresponding to the results shown in Figure 5a differ, the structure of the SDSE data is visualised using its first two principal components, and the selected samples are projected onto these principal directions, as shown in Figure 7a. The samples selected via mini-batch FastCan tend to cluster around the learned atoms, whereas the randomly selected samples are spread across the entire dataset according to the spatial distribution of the candidates. Compared to random selection, the proposed method places more samples at the left and right ends. Nevertheless, the overall difference between the two sample distributions is relatively small.
To further determine whether the sample distribution selected by the proposed method differs from that of the random method, 2 measurements at the left equilibrium and 98 at the right equilibrium are simulated to repeat the previous analysis. This dataset is referred to as the Asymmetrical Dual-Stable-Equilibria (ADSE) dataset. A total of 100 samples are selected, and the number of atoms is set to 20 (see Section 5 for details on this setting). The results are presented in Figure 5b, Figure 6b and Figure 7b. As shown in Figure 5b and Figure 6b, when the number of samples across different conditions is imbalanced, the mini-batch FastCan method has obvious advantages over random sample selection. Furthermore, as shown in Figure 7b, almost all samples selected by the random method are concentrated in the right area, i.e., they belong to the right equilibrium. In contrast, the proposed method selects samples from both equilibria, with those from the left equilibrium evenly distributed in the left area. These observations indicate that, unlike random selection, the mini-batch FastCan method is less influenced by imbalance in the candidate sample distribution and is more effective at capturing diverse system behaviours, which in turn helps explain its superior identification performance.

4. Case Studies on the Benchmark Datasets

Two benchmark datasets, collected from the Electro-Mechanical Positioning System (EMPS) [26] and the Wiener–Hammerstein System (WHS) [27], are adopted to evaluate the performance of the mini-batch FastCan method for data pruning in real-world system identification tasks.

4.1. Data from the Electro-Mechanical Positioning System

The EMPS is a standard drive configuration for the prismatic joints of robots or machine tools. The primary nonlinearity of the corresponding data is introduced by friction effects. The baseline NARX model with ten terms and an intercept is derived, and the model coefficients trained on the full training dataset are used as the baseline to quantitatively evaluate the selection performance of the mini-batch FastCan and random methods. See Appendix A for the performance of the baseline NARX on the test dataset.
The number of selected samples is 100 and the number of atoms for dictionary learning is set to 25 (see Section 5 for details on this setting). The coefficients trained on the two pruned datasets are compared with those from the full dataset using the R-squared value, with the NARX terms fixed. The sample selection and training processes are repeated ten times, with the results illustrated in Figure 8a. In addition, Figure 9a presents the results corresponding to two methods for different numbers of selected samples ranging from 20 to 120.
Consistent with the observations from the SDSE and ADSE datasets, the mini-batch FastCan method produces results with a higher median and lower variance in Figure 8a, indicating more stable and reliable performance. Additionally, as expected, the variance of the model coefficients also decreases as the number of selected samples increases for the mini-batch FastCan method in Figure 9a.
The structure of the data collected from the EMPS is visualised by its first two principal components, as shown in Figure 10a. The atoms determined by dictionary learning and the selected samples corresponding to the results given in Figure 8a are projected onto this reduced space. The atoms and samples selected by the mini-batch FastCan method are distributed across the central, left, and right regions, while the randomly selected samples are concentrated almost exclusively in the central part.

4.2. Data from the Wiener–Hammerstein System

The WHS is a well-known block-oriented structure consisting of a static nonlinearity between two linear time-invariant blocks. The same strategy is applied to quantitatively evaluate the sample selection performance of the mini-batch FastCan and random methods for the identification task of the WHS. The baseline NARX model with ten terms and an intercept is derived, and its performance on the test dataset is provided in Appendix A.
The number of atoms for dictionary learning is set to five (see Section 5 for details on this setting), and 100 samples are selected, with the R-squared results shown in Figure 8b. Then, the number of selected samples varies between 20 and 120 to demonstrate the variation of R-squared values with the selected sample size, as presented in Figure 9b. Figure 8b demonstrates that the results obtained by the mini-batch FastCan are better than those obtained by the random method. This phenomenon is also observed in Figure 9b, except at 40 samples. This exception will be analysed and further discussed in the next section.
The structure of the data collected from the WHS is visualised in Figure 10b using its first two principal components, together with the five atoms determined by dictionary learning and the selected samples. Unlike the observations from the ADSE and EMPS data, the samples obtained by the mini-batch FastCan method and the random selection method have similar distributions, with the selected samples mainly concentrated in the central part of the dataset. This is unexpected because the samples associated with the results in Figure 8b exhibit similar distributions yet provide varying levels of information for system identification. This observation suggests that sample selection strategies based solely on the distributional characteristics of the data space may be insufficient for identifying informative samples.

5. The Effect of Hyperparameters on the Mini-Batch FastCan Method

In the mini-batch FastCan method, sample selection results are influenced by two hyperparameters, namely the number of atoms in the dictionary and the batch size. To examine how these two hyperparameters affect the performance of the selected samples, system identification results against varying atom size and batch size for the four previously used datasets are given in Figure 11 and Figure 12. For all datasets, 100 samples are selected. For the results in Figure 11, the batch size is not given as an input parameter but is instead set to the ceiling of the ratio between the number of selected samples and the number of atoms (⌈n/q⌉), with the aim of minimising redundancy among the selected samples when all atoms are used in the sample selection process. Therefore, the batch size varies with the atom size in Figure 11.
As shown in Figure 11, the performance deteriorates when the atom size exceeds the size associated with the highest R-squared value. This may be attributed to the fact that, for a fixed number of selected samples, larger atom sizes lead to smaller batch sizes, thus increasing sample-wise redundancy. In practical engineering applications, the optimal atom size can be determined by defining a candidate range and selecting the value that yields the best performance via an exhaustive search.
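A sketch of such an exhaustive search, reusing the helpers from the earlier sketches (`minibatch_fastcan`, the term matrix, and the baseline coefficients); the candidate range and repetition count are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

def tune_atom_size(terms, target, baseline_coef,
                   candidates=(5, 10, 15, 20, 25, 30), n_select=100, repeats=10):
    """Pick the atom size q with the best median coefficient R-squared."""
    best_q, best_score = None, -np.inf
    for q in candidates:
        scores = []
        for seed in range(repeats):                  # repeated pruning runs per q
            idx = minibatch_fastcan(terms.T, q=q, n=n_select, seed=seed)
            coef = LinearRegression().fit(terms[idx], target[idx]).coef_
            scores.append(r2_score(baseline_coef, coef))
        med = float(np.median(scores))
        if med > best_score:
            best_q, best_score = q, med
    return best_q, best_score
```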
In Figure 12, when changing the batch size for each dataset, the number of atoms is fixed at the optimal atom size obtained from Figure 11, where the optimal atom sizes q for the SDSE, ADSE, EMPS, and WHS datasets are 15, 20, 25, and 5, respectively. According to Step 2 in Algorithm 1, the maximum batch sizes for the SDSE, ADSE, EMPS, and WHS datasets are 7, 5, 4, and 10, respectively. It can be seen that when the atom size is fixed at the optimal value, using a larger batch size to reduce redundancy generally improves the performance of the selected samples. Therefore, the atom sizes and batch sizes used in the previous four case studies are the optimal atom sizes obtained from Figure 11 and the corresponding maximum batch sizes. However, it is worth noting that in Figure 12 the relationship between the R-squared value and the batch size is not straightforward, especially for the EMPS and WHS datasets. A more sophisticated method could be developed in the future to determine appropriate hyperparameters, including the atom size and batch size, for the proposed method.
To investigate the causes of the previously noted exceptions, where random selection outperforms or matches mini-batch FastCan in Figure 6a and Figure 9b, Figure 13 presents the effect of atom size on model performance across different selected sample sizes for the mini-batch FastCan method; this analysis assesses whether suboptimal hyperparameter settings contribute to the discrepancies. For each sample size, the maximum R-squared value is annotated. It can be observed that when the mini-batch FastCan method performs close to or worse than the random method, the adopted atom size is not optimal for the given number of selected samples. For the SDSE dataset, when 60 samples are selected, the optimal atom size is 5 rather than 15. For the WHS dataset, when 40 samples are selected, the optimal atom size is 10 rather than 5. After tuning the atom size for the mini-batch FastCan method, the comparison of the sample selection performance of the two methods is illustrated in Figure 14. The results indicate that the mini-batch FastCan method provides better sample selection results when an appropriate atom size is applied.
In addition, as shown in Figure 13a,c,d, when the number of selected samples exceeds a certain threshold, the optimal atom size no longer changes with the number of selected samples. At the same time, Figure 6a and Figure 9a,b indicate that the performance of the selected samples also stabilises when the number of selected samples exceeds this threshold for each dataset. This consistent phenomenon suggests that when more samples cannot contribute additional information, the number of optimal atoms learned from the selected samples tends to stabilise. This finding supports, to a certain extent, the rationality and robustness of using dictionary learning to generate pseudo-labels. Moreover, the careful tuning of hyperparameters is particularly important when the number of selected samples is small.

6. Conclusions

In this paper, a dictionary learning-based data pruning method, called mini-batch FastCan, is introduced for selecting informative time series samples in the identification of discrete systems with finite orders. Two key characteristics of this method are worth mentioning. First, the k-means-based dictionary learning adopted in the mini-batch FastCan method is well-suited for generating pseudo-labels in the data pruning process. This approach mitigates the impact of imbalanced candidate sample distributions on sample selection. Second, selecting samples in separate batches allows for the selection of more samples than the number of necessary model terms while effectively accounting for redundancy within each batch. This batch-wise approach helps ensure that the selected samples are more diverse and informative, which is crucial for accurate system identification.
A key strength of the proposed method is its versatility, as the selection process is independent of the specific system identification technique used. However, its primary limitation lies in the absence of a systematic process for optimising hyperparameters, including atom size and batch size. The relationship between these parameters and model performance (R-squared) is complex and nonlinear, exhibiting dataset-dependent variability that makes their selection challenging. While these hyperparameters can currently be determined by engineering practices, future work should focus on developing theoretically grounded optimisation strategies to improve generalisability and methodological rigour.
The case studies on the synthetic and benchmark datasets show that the proposed method effectively reduces the sample size without significantly degrading identification performance. Compared to random selection, the proposed data pruning method generally yields better results, with a higher median or mean and lower variance of R-squared values. Additionally, it is found that careful selection of hyperparameters is crucial, particularly when the proportion of selected samples is small.

Author Contributions

Conceptualisation, T.W. and S.Z.; methodology, T.W. and S.Z.; software, T.W. and S.Z.; validation, T.W., S.Z. and M.S.; formal analysis, T.W. and S.Z.; investigation, T.W. and S.Z.; resources, L.S.; data curation, T.W. and S.Z.; writing—original draft preparation, T.W.; writing—review and editing, T.W., S.Z. and M.S.; visualisation, T.W.; supervision, M.S. and L.S.; project administration, L.S.; funding acquisition, L.S. All authors have read and agreed to the published version of the manuscript.

Funding

The authors would like to acknowledge the support for this study from the Shanghai Qi Zhi Institute Innovation Program (SQZ202310) and the National Natural Science Foundation of China (52378187).

Data Availability Statement

The original data presented in this study are openly available at https://github.com/MatthewSZhang/data-pruning-sysid (accessed on 21 July 2025; version 0.1.0). The benchmark datasets used in this study were obtained from publicly available resources at https://www.nonlinearbenchmark.org/ (accessed on 21 July 2025).

Conflicts of Interest

Author Sikai Zhang was employed by the company “Baosight Software, Shanghai, China”. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
NARX    Nonlinear autoregressive with eXogenous inputs
FastCan Fast selection method based on canonical correlation
PCA     Principal component analysis
SDSE    Symmetrical dual-stable equilibria
ADSE    Asymmetrical dual-stable equilibria
EMPS    Electro-mechanical positioning system
WHS     Wiener–Hammerstein system

Appendix A. Prediction Performance of Baseline NARX

For each of the four datasets, a reduced polynomial NARX model with ten terms and an intercept is derived using the full dataset as a baseline. To select relevant nonlinear terms, the maximum input and output lags are set to four for the SDSE system, ADSE system, and EMPS and to seven for the WHS. For all systems, the polynomial degree is fixed at three, and the number of selected model terms is limited to ten. These parameter settings are designed to provide a satisfactory trade-off between predictive accuracy and model simplicity but can be further optimised based on specific application requirements.
Figure A1. The performance of the baseline NARX model for SDSE test data with different initial conditions. (a) Test 1. (b) Test 2.
Figure A2. The performance of the baseline NARX model for ADSE test data with different initial conditions. (a) Test 1. (b) Test 2.
Figure A3. The performance of the baseline NARX model for two benchmark datasets. (a) EMPS dataset. (b) WHS dataset.

References

  1. Box, G.E.; Jenkins, G.M.; Reinsel, G.C.; Ljung, G.M. Time Series Analysis: Forecasting and Control, 5th ed.; John Wiley & Sons, Inc.: Hoboken, NJ, USA, 2015; Chapter 1; pp. 1–2.
  2. Billings, S.A. Nonlinear System Identification: NARMAX Methods in the Time, Frequency, and Spatio-Temporal Domains; John Wiley & Sons, Ltd.: West Sussex, UK, 2013; Chapter 1; pp. 9–10.
  3. Nelles, O. Nonlinear System Identification: From Classical Approaches to Neural Networks and Fuzzy Models; Springer: Berlin/Heidelberg, Germany, 2001.
  4. Miller, W.T.; Sutton, R.S.; Werbos, P.J. Neural Networks for Control; The MIT Press: London, UK, 1990.
  5. Jenkins, G.M.; Watts, D.G. Spectral Analysis and Its Applications; Holden-Day, Inc.: San Francisco, CA, USA, 1968.
  6. Schetzen, M. The Volterra and Wiener Theories of Nonlinear Systems; John Wiley & Sons, Inc.: New York, NY, USA, 1980.
  7. Chen, S.; Billings, S.A. Representations of non-linear systems: The NARMAX model. Int. J. Control 1989, 49, 1013–1032.
  8. Kukreja, S.L.; Galiana, H.L.; Kearney, R.E. NARMAX representation and identification of ankle dynamics. IEEE Trans. Biomed. Eng. 2003, 50, 70–81.
  9. Boynton, R.; Balikhin, M.; Wei, H.; Lang, Z. Applications of NARMAX in space weather. In Machine Learning Techniques for Space Weather; Elsevier: Amsterdam, The Netherlands, 2018; pp. 203–236.
  10. Korenberg, M.; Billings, S.A.; Liu, Y.; McIlroy, P. Orthogonal parameter estimation algorithm for non-linear stochastic systems. Int. J. Control 1988, 48, 193–210.
  11. Chen, S.; Billings, S.A.; Luo, W. Orthogonal least squares methods and their application to nonlinear system identification. Int. J. Control 1989, 50, 1873–1896.
  12. Hong, X.; Mitchell, R.J.; Chen, S.; Harris, C.J.; Li, K.; Irwin, G.W. Model selection approaches for non-linear system identification: A review. Int. J. Syst. Sci. 2008, 39, 925–946.
  13. Goodwin, G.C. Optimal input signals for nonlinear-system identification. Proc. Inst. Electr. Eng. 1971, 118, 922–926.
  14. Mehra, R. Optimal inputs for linear system identification. IEEE Trans. Autom. Control 1974, 19, 192–200.
  15. Raju, R.S.; Daruwalla, K.; Lipasti, M. Accelerating deep learning with dynamic data pruning. arXiv 2021, arXiv:2111.12621.
  16. Sorscher, B.; Geirhos, R.; Shekhar, S.; Ganguli, S.; Morcos, A.S. Beyond neural scaling laws: Beating power law scaling via data pruning. Adv. Neural Inf. Process. Syst. 2022, 35, 19523–19536.
  17. Yang, Z.; Yang, H.; Majumder, S.; Cardoso, J.; Gallego, G. Data pruning can do more: A comprehensive data pruning approach for object re-identification. arXiv 2024, arXiv:2412.10091.
  18. Marion, M.; Üstün, A.; Pozzobon, L.; Wang, A.; Fadaee, M.; Hooker, S. When less is more: Investigating data pruning for pretraining LLMs at scale. arXiv 2023, arXiv:2309.04564.
  19. Jin, R.; Xu, Q.; Wu, M.; Xu, Y.; Li, D.; Li, X.; Chen, Z. LLM-based knowledge pruning for time series data analytics on edge-computing devices. arXiv 2024, arXiv:2406.08765.
  20. Zhang, S.; Lang, Z.Q. Orthogonal least squares based fast feature selection for linear classification. Pattern Recognit. 2022, 123, 108419.
  21. Zhang, S.; Wang, T.; Worden, K.; Sun, L.; Cross, E.J. Canonical-correlation-based fast feature selection for structural health monitoring. Mech. Syst. Signal Process. 2025, 223, 111895.
  22. Aharon, M.; Elad, M.; Bruckstein, A. K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation. IEEE Trans. Signal Process. 2006, 54, 4311–4322.
  23. Mairal, J.; Bach, F.; Ponce, J. Task-driven dictionary learning. IEEE Trans. Pattern Anal. Mach. Intell. 2011, 34, 791–804.
  24. Vu, T.H.; Monga, V. Fast low-rank shared dictionary learning for image classification. IEEE Trans. Image Process. 2017, 26, 5160–5175.
  25. Ayed, F.; Hayou, S. Data pruning and neural scaling laws: Fundamental limitations of score-based algorithms. arXiv 2023, arXiv:2302.06960.
  26. Janot, A.; Gautier, M.; Brunot, M. Data set and reference models of EMPS. In Proceedings of the Nonlinear System Identification Benchmarks, Eindhoven, The Netherlands, 10–12 April 2019.
  27. Schoukens, J.; Ljung, L. Wiener–Hammerstein Benchmark; Technical Report; Linköping University Electronic Press: Linköping, Sweden, 2009.
  28. Billinge, S. bg-mpl-stylesheets. 2024. Available online: https://github.com/Billingegroup/bg-mpl-stylesheets (accessed on 21 July 2025).
  29. Sculley, D. Web-scale k-means clustering. In Proceedings of the 19th International Conference on World Wide Web, Raleigh, NC, USA, 26–30 April 2010; pp. 1177–1178.
Figure 2. Time series of y ( t ) = sin ( 2 π t ) at 100 Hz over 1 s.
Figure 3. Visualisation of redundancy in the data. (a) Feature-wise redundancy (correlation heatmap). (b) Sample-wise redundancy (PCA projection).
Figure 4. A phase portrait for a nonlinear system with dual stable equilibria. Trajectories from different initial conditions are shown converging to the two stable equilibria, guided by the underlying vector field.
Figure 5. Comparison of sample selection performance between mini-batch FastCan and random selection methods using the R-squared value on two datasets. (a) SDSE dataset. (b) ADSE dataset.
Figure 6. The effect of sample size on the performance of mini-batch FastCan and random selection methods for two datasets. SD stands for standard deviation. (a) SDSE dataset. (b) ADSE dataset.
Figure 7. Visualisation of the data structure of two datasets. (a) SDSE dataset. (b) ADSE dataset.
Figure 8. Comparison of sample selection performance between mini-batch FastCan and random selection methods using the R-squared value on two benchmark datasets. (a) EMPS dataset. (b) WHS dataset.
Figure 9. The effect of sample size on the performance of mini-batch FastCan and random selection methods for two benchmark datasets. SD stands for standard deviation. (a) EMPS dataset. (b) WHS dataset.
Figure 10. Visualisation of the data structure of two benchmark datasets. (a) EMPS dataset. (b) WHS dataset.
Figure 11. The effect of atom size on the performance of the mini-batch FastCan method. (a) SDSE dataset. (b) ADSE dataset. (c) EMPS dataset. (d) WHS dataset.
Figure 12. The effect of batch size on the performance of the mini-batch FastCan method under the optimal atom size. (a) SDSE dataset. (b) ADSE dataset. (c) EMPS dataset. (d) WHS dataset.
Figure 13. The effect of atom size on the performance of the mini-batch FastCan method under different sample numbers. (a) SDSE dataset. (b) ADSE dataset. (c) EMPS dataset. (d) WHS dataset.
Figure 14. Comparison of sample selection performance between the mini-batch FastCan method with tuned atom size and random selection using the R-squared value on two datasets. (a) SDSE dataset. (b) WHS dataset.
Table 1. Time-shifted data representation with Δt = 0.01 s. NaN indicates unavailable past values; dots denote the continuation of the same pattern.

t (s)   y(t − Δt)   y(t − 2Δt)   y(t − 3Δt)   …   y(t − 20Δt)
0       NaN         NaN          NaN          …   NaN
0.01    0.000       NaN          NaN          …   NaN
0.02    0.063       0.000        NaN          …   NaN
0.03    0.127       0.063        0.000        …   NaN
0.04    0.189       0.127        0.063        …   NaN
⋮       ⋮           ⋮            ⋮            ⋱   ⋮
0.99    −0.063      −0.127       −0.189       …   −0.955
