4.1. Overall Architecture
Figure 1 presents the overall methodological framework proposed in this study. To address the high computational overhead introduced by information gain calculations during Shapelet candidate selection in the original Shapelet discovery module, as well as the low training efficiency of the Transformer’s self-attention mechanism when processing large-scale time series data, we introduce two key structural enhancements to the ShapeFormer model. These improvements are designed to significantly boost both computational efficiency and temporal modeling performance.
During the Shapelet discovery phase, we design a coarse screening strategy based on Euclidean distance to rapidly eliminate candidate subsequences with weak discriminative capability prior to detailed evaluation. This strategy significantly reduces the frequency of distance computations and information gain evaluations, thereby substantially decreasing the time cost of Shapelet mining.
In the general representation learning module, we propose a novel Convolution-Inverted Attention (CIA) neural network module. This design replaces the original two-layer convolutional structure with a single-layer convolutional architecture, thereby enhancing computational efficiency while retaining strong local feature extraction capability. Moreover, by introducing an inverted attention mechanism that shifts the computation dimension of self-attention from the temporal axis to the variable axis, the model can effectively capture inter-variable dependencies. This approach substantially reduces training time while preserving the model’s discriminative performance. The following sections detail the specific modules of our method.
4.2. Coarse Screening in Shapelet Discovery
In the Shapelet Discovery module, we improve the Offline Shapelet Discovery (OSD) method. During the shapelet candidate extraction phase, we employ Perceptually Important Points (PIPs) to extract shapelets from the training set [38]. Specifically, we recursively search the time series $X$ for the next PIP with the maximum vertical distance from the line formed by two previously selected PIPs. When a new PIP is added to the PIP set, each window of three consecutive PIPs containing it yields a new shapelet candidate; thus, a new PIP may add up to three shapelets to the candidate set [19,39]. In this paper, we adopt the same strategy as ShapeFormer [19], setting the number of PIPs in proportion to the time series length $L$, which bounds the number of extracted shapelet candidates. Each shapelet simultaneously stores its numerical segment, start and end positions, and associated variable channel information, providing data support for subsequent segment screening.
Figure 2 shows an example of identifying the first 5 PIPs from the time series $X$ in the training dataset.
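To make the PIP-based extraction concrete, the following is a minimal sketch for a single univariate channel stored as a NumPy array; for simplicity, it derives candidates from the final PIP set rather than incrementally as each PIP arrives, and the function names and stopping criterion are illustrative choices.

```python
import numpy as np

def vertical_distance(series, idx, left, right):
    """Vertical distance of point idx from the straight line joining points left and right."""
    y_line = series[left] + (series[right] - series[left]) * (idx - left) / (right - left)
    return abs(series[idx] - y_line)

def extract_pips(series, n_pips):
    """Iteratively select the point with the maximum vertical distance to the line
    formed by its two neighbouring, already-selected PIPs."""
    pips = [0, len(series) - 1]                       # the two endpoints are always PIPs
    while len(pips) < n_pips:
        best_idx, best_dist = None, -1.0
        for left, right in zip(pips[:-1], pips[1:]):  # scan every gap between adjacent PIPs
            for idx in range(left + 1, right):
                d = vertical_distance(series, idx, left, right)
                if d > best_dist:
                    best_idx, best_dist = idx, d
        if best_idx is None:                          # no interior points remain
            break
        pips = sorted(pips + [best_idx])
    return pips

def candidates_from_pips(series, pips):
    """Every window of three consecutive PIPs yields one shapelet candidate."""
    return [series[pips[i]:pips[i + 2] + 1] for i in range(len(pips) - 2)]
```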
Although the PIP method effectively reduces the number of shapelet candidates, the computational burden of the subsequent screening remains significant due to the repeated PSD and information gain calculations. To address this issue, this paper proposes a coarse-grained screening mechanism based on Euclidean distance. This approach is grounded in two key considerations. First, the Euclidean distance is computationally cheap, making it suitable for rapid preliminary screening of large sets of shapelet candidates. Second, from the perspective of shape similarity, the Euclidean distance provides a reasonable first estimate of a candidate’s discriminative potential. Although the Euclidean distance is known to be sensitive to noise and amplitude scaling, its use in the coarse-grained screening phase is justified: this stage only needs to rapidly filter out clearly non-discriminative shapelets, so it merely shrinks the candidate pool rather than making the final selection. The subsequent fine-grained screening phase then applies the more refined information gain criterion, ensuring that only the most discriminative shapelets are ultimately selected. Therefore, the use of Euclidean distance in the coarse screening phase effectively improves the overall efficiency of the shapelet discovery process without significantly compromising classification accuracy.
By employing the coarse screening mechanism to eliminate less discriminative candidates before fine-grained screening, we significantly enhance the overall efficiency of the discovery process. For ease of presentation in the coarse-grained screening stage, we introduce $S^{t}_{m,k}$ and $S^{o}_{m,k}$ to denote shapelet candidates indexed by channel $m$ and candidate index $k$ in the target and other classes, respectively. This is only an indexing notation and does not change the shapelet definition in Section 3; each $S^{t}_{m,k}$ (or $S^{o}_{m,k}$) still corresponds to a numeric shapelet subsequence extracted from a single channel, together with its meta information (length and location). Consequently, all distance computations in this section are performed on the same numeric subsequences; the superscripts and subscripts are used solely for bookkeeping and for describing the coarse screening process succinctly. We categorize the shapelet candidates extracted from the training set $\mathcal{D}$ into two classes: $S^{t}_{m,k}$ represents shapelet candidates on channel $m$ within the target class, while $S^{o}_{m,k}$ denotes shapelet candidates on channel $m$ within the other classes, as illustrated in Figure 3. For $S^{t}_{m,k}$, $N_t$ indicates the number of samples in the target class, $m$ denotes the variable index, and $k$ represents the shapelet index. For $S^{o}_{m,k}$, $N_o$ denotes the number of samples in the other classes. Accordingly, $d_{\min}(S^{t}_{m,k}, X_i)$ simply denotes evaluating candidate $S^{t}_{m,k}$ on sample $X_i$ during screening.
Our coarse screening process is illustrated in Figure 4. For a given target-class candidate shapelet $S^{t}_{m,k}$, its average minimum Euclidean distances within the target class and across the other classes are defined as follows:

$$d_t\big(S^{t}_{m,k}\big) = \frac{1}{N_t}\sum_{i=1}^{N_t} d_{\min}\big(S^{t}_{m,k}, X^{t}_{i}\big), \qquad d_o\big(S^{t}_{m,k}\big) = \frac{1}{N_o}\sum_{j=1}^{N_o} d_{\min}\big(S^{t}_{m,k}, X^{o}_{j}\big),$$

where $N_t$ and $N_o$ are the numbers of samples from the target class and from the other classes, respectively, and $d_{\min}(\cdot,\cdot)$ is the minimum Euclidean distance (Equation (3)). Based on the average minimum distance $d_o$ of this shapelet across the other categories, we then define a discriminative metric $\Delta$ that measures the distance difference between categories and is used for filtering:

$$\Delta\big(S^{t}_{m,k}\big) = d_o\big(S^{t}_{m,k}\big) - d_t\big(S^{t}_{m,k}\big).$$

A larger $\Delta$ indicates that the candidate is more discriminative for separating the target class from the other classes. We rank candidates by $\Delta$ in descending order and discard the bottom $\alpha$ candidates, where $\alpha$ is an experimental hyperparameter for which we conduct hyperparameter sensitivity experiments in Section 5.3.
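For illustration, a minimal sketch of this coarse screening step is given below, assuming the notation above ($d_{\min}$ as the sliding minimum Euclidean distance and $\Delta$ as the gap between class-wise averages); the `keep_ratio` argument plays the role of the hyperparameter $\alpha$ expressed as a keep fraction and is purely an assumption of this sketch.

```python
import numpy as np

def min_euclidean_distance(shapelet, channel):
    """Minimum Euclidean distance between a shapelet and all equal-length
    subsequences of one channel of a sample (cf. Equation (3))."""
    l = len(shapelet)
    return min(np.linalg.norm(channel[b:b + l] - shapelet)
               for b in range(len(channel) - l + 1))

def coarse_screen(candidates, target_samples, other_samples, keep_ratio=0.5):
    """Rank candidates by the gap between their mean minimum distance to
    other-class samples and to target-class samples, then keep the top fraction.

    candidates: list of (shapelet, channel_index) pairs
    *_samples:  arrays of shape (n_samples, n_channels, series_length)
    """
    scores = []
    for shapelet, ch in candidates:
        d_target = np.mean([min_euclidean_distance(shapelet, x[ch]) for x in target_samples])
        d_other = np.mean([min_euclidean_distance(shapelet, x[ch]) for x in other_samples])
        scores.append(d_other - d_target)        # larger gap = more discriminative
    order = np.argsort(scores)[::-1]             # descending by the Delta metric
    n_keep = max(1, int(len(candidates) * keep_ratio))
    return [candidates[i] for i in order[:n_keep]]
```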
After the coarse screening process concludes, the retained shapelet candidates are designated as $\mathcal{S}'$ and enter the Fine Screening module. By calculating their Perceptual Subsequence Distance (PSD) with all instances in the training data $\mathcal{D}$, the optimal information gain is identified to evaluate their discriminative capability. The shapelet set $\mathcal{S}^{*}$ with the highest information gain is selected as the final choice and stored in the Shapelet pool. The PSD between a retained shapelet $s \in \mathcal{S}'$ and a time series $X$ is computed as

$$\mathrm{PSD}(s, X) = \min_{1 \le b \le L - l + 1} \mathrm{CID}\big(s,\; X^{m}_{b:b+l-1}\big),$$

here, $b$ denotes the sliding-window start index in $X$ (not the start index of the shapelet in its source series), $m$ and $l$ are the channel index and length of $s$, $X^{m}_{b:b+l-1}$ is the length-$l$ subsequence on channel $m$ starting at $b$, and $\mathrm{CID}(\cdot,\cdot)$ signifies the complexity-invariant distance. By introducing a correction factor related to the intrinsic pattern complexity of the sequence, this metric effectively enhances the robustness of the traditional Euclidean distance in measuring morphological similarity, and it has been demonstrated to improve the discriminative capability of shapelets in time series classification tasks [40].
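For reference, the sketch below computes the complexity-invariant distance in its commonly used form from [40] and the sliding-minimum PSD described above; the variable names and the small stabilizing constant are our own choices for this sketch.

```python
import numpy as np

def cid(a, b):
    """Complexity-invariant distance: Euclidean distance scaled by the ratio of
    the two sequences' complexity estimates."""
    ce_a = np.sqrt(np.sum(np.diff(a) ** 2)) + 1e-8   # complexity estimate of a
    ce_b = np.sqrt(np.sum(np.diff(b) ** 2)) + 1e-8   # complexity estimate of b
    return np.linalg.norm(a - b) * max(ce_a, ce_b) / min(ce_a, ce_b)

def psd(shapelet, sample, channel_index):
    """Minimum CID between the shapelet and every equal-length subsequence of
    the sample on the shapelet's channel."""
    channel = sample[channel_index]
    l = len(shapelet)
    return min(cid(shapelet, channel[b:b + l])
               for b in range(len(channel) - l + 1))
```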
4.3. Class-Specific Representation
To deeply mine discriminative features that are highly correlated with the categories in a time series, we introduce a class-specific representation module into our model. Based on the self-attention mechanism of Transformers, this module constructs high-level feature representations by modeling the difference relationships between shapelets and input sequences.
Each shapelet $s_g$ in the final shapelet set $\mathcal{S}^{*}$ records its length $l_g$, channel index $m_g$, and position (start and end indices $p^{\mathrm{s}}_g$ and $p^{\mathrm{e}}_g$) within the original sequence. For an input sequence $X$, we compute the distances between $s_g$ and all subsequences of $X$ on channel $m_g$, restricting the search range to a neighborhood centered at $p^{\mathrm{s}}_g$ with radius $w$. The subsequence with the shortest distance becomes the best-fit subsequence $\hat{x}_g$.
We linearly project both the shapelet $s_g$ and its best-fit subsequence $\hat{x}_g$ into the same embedding space, $e_g = f_{\mathrm{emb}}(s_g)$ and $\hat{e}_g = f_{\mathrm{emb}}(\hat{x}_g) \in \mathbb{R}^{d_{\mathrm{emb}}}$, yielding their difference features $z_g = e_g - \hat{e}_g$. Here, $f_{\mathrm{emb}}$ denotes the linear projection, while $d_{\mathrm{emb}}$ represents the embedding size of the difference features. Subsequently, the difference features $z_g$ are integrated with position embeddings to capture their sequential order. To better indicate the positional information of shapelets, both the position indices and the channel index of each shapelet are learned through a linear projection to obtain their embeddings:

$$\mathrm{pe}_g = \mathrm{PE}\big(p^{\mathrm{s}}_g,\; p^{\mathrm{e}}_g,\; \mathrm{onehot}(m_g)\big).$$

Here, $\mathrm{PE}(\cdot)$ is the position embedding function, which maps the start point, end point, and one-hot encoded variable into dense vectors via a learnable linear projection, thereby endowing the model with positional awareness.
We feed all tokens $h_g = z_g + \mathrm{pe}_g$, $g = 1, \dots, G$, into the multi-head attention (MHA) of the Transformer encoder, where $G$ denotes the number of elements in $\mathcal{S}^{*}$. Given the projections $Q = HW_Q$, $K = HW_K$, and $V = HW_V$ of the stacked tokens $H \in \mathbb{R}^{G \times d_{\mathrm{emb}}}$, we compute the attention weight from position $i$ to position $j$,

$$a_{ij} = \mathrm{Softmax}_j\!\left(\frac{q_i k_j^{\top}}{\sqrt{d_k}}\right),$$

where $d_k$ is the dimension of each query and key, ultimately yielding the output $O = AV$, where $A = [a_{ij}] \in \mathbb{R}^{G \times G}$.
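A compact PyTorch sketch of this class-specific representation step is shown below, assuming zero-padded inputs and position features of the form [start, end, one-hot channel]; the layer sizes and the single encoder layer are illustrative choices for this sketch.

```python
import torch
import torch.nn as nn

class ClassSpecificRepresentation(nn.Module):
    """Embed shapelet / best-fit-subsequence differences plus a learned position
    embedding, then model the resulting tokens with a Transformer encoder."""

    def __init__(self, max_len, n_channels, d_model=64, n_heads=4):
        super().__init__()
        self.value_proj = nn.Linear(max_len, d_model)       # shared linear projection f_emb
        self.pos_proj = nn.Linear(2 + n_channels, d_model)  # start, end, one-hot channel -> PE
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)

    def forward(self, shapelets, best_fits, positions):
        # shapelets, best_fits: (batch, G, max_len), zero-padded to a common length
        # positions:            (batch, G, 2 + n_channels) = [start, end, one-hot channel]
        diff = self.value_proj(shapelets) - self.value_proj(best_fits)  # difference features z_g
        tokens = diff + self.pos_proj(positions)                        # add position embeddings
        return self.encoder(tokens)                                     # (batch, G, d_model)
```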
Due to the category-specific nature of these features, attention scores between samples of the same category are significantly higher than those between samples of different categories, thereby enhancing the model’s ability to distinguish between categories. Simultaneously, leveraging the local discriminative properties of shapelets, the difference features can identify representative key subsequences across different time segments and variable dimensions within the time series. This enables the model to more effectively capture temporal dependencies and cross-variable correlations within the sequence.
4.4. Generic Representation
To enhance the effectiveness of modeling multivariate time series features, we propose a novel universal feature extraction module—CIA (Convolution-Inverted Attention)—whose overall structure is illustrated in
Figure 1a. The core concept of the CIA module is to achieve synergistic integration between local feature extraction and global variable correlation modeling. Traditional Transformers compute attention over the temporal dimension, which can capture long-term dependencies but incurs high computational overhead and tends to overlook inherent correlations between variables. Conversely, while convolutional operations efficiently extract local temporal patterns, their limited receptive field makes it difficult to model global dependencies.
Inspired by iTransformer [41], this module employs a dimension-conversion approach, treating variables as tokens and time points as features. This shifts the application dimension of self-attention from the temporal axis to the variable axis, as illustrated in Figure 5. This design enables the model to explicitly learn correlations between variables while leveraging one-dimensional convolutional layers to efficiently capture local morphological features in the temporal dimension. The CIA module achieves dual modeling of temporal and variable dependencies while maintaining computational efficiency, significantly enhancing the discriminative power and generalization capabilities of the general representation.
Unlike the original iTransformer, the CIA module incorporates a convolutional layer before the self-attention mechanism. The convolution gives the CIA module a stronger local receptive field, improving its ability to capture local temporal patterns, and it also helps reduce the computational cost, making the model more efficient when handling long time series. In contrast, iTransformer only inverts the attention dimension to model dependencies between time points and variables, without incorporating convolution, which limits its ability to efficiently extract local features. This design is particularly important for sensor time series, which often exhibit strong local fluctuations, short-term transient patterns, and noise-contaminated dynamics. By introducing a convolutional layer before inverted attention, the CIA module explicitly captures local temporal variations that are typically under-modeled by the purely attention-based iTransformer, while preserving its ability to model global inter-variable dependencies.
One-Dimensional Convolution for Local Feature Extraction: For the time series $X \in \mathbb{R}^{D \times L}$, we employ a convolutional module for local feature extraction. This convolutional block consists of a one-dimensional convolutional layer (Conv1D), batch normalization (BatchNorm), and a GELU activation function in sequence. The computational process is as follows:

$$U = \mathrm{GELU}\big(\mathrm{BatchNorm}\big(\mathrm{Conv1D}(X)\big)\big).$$

The kernel dimensions of the convolution are $D' \times D \times k$, where $k$ is the kernel size of the convolution filter. The resulting universal features are $U \in \mathbb{R}^{L \times D'}$, where $D'$ is the feature dimension of the convolved output, which controls the subsequent number of tokens.
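A minimal PyTorch sketch of this convolution block is given below, assuming odd kernel sizes with same-length padding and illustrative channel sizes.

```python
import torch.nn as nn

class LocalConvBlock(nn.Module):
    """Conv1D -> BatchNorm -> GELU block that maps D input variables to D' local
    feature channels while preserving the temporal length."""

    def __init__(self, d_in, d_out, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv1d(d_in, d_out, kernel_size, padding=kernel_size // 2)
        self.bn = nn.BatchNorm1d(d_out)
        self.act = nn.GELU()

    def forward(self, x):                           # x: (batch, D, L)
        return self.act(self.bn(self.conv(x)))      # (batch, D', L)
```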
Inverted Attention for Modeling Variable Dependencies: The overall structure is shown in Figure 6. After obtaining the features containing local information $U$, we transpose the dimensions to treat variables as tokens and time points as features: $H = U^{\top} + E_{\mathrm{pos}} \in \mathbb{R}^{D' \times L}$, where $E_{\mathrm{pos}}$ is the learnable position encoding. To convert the time series embeddings into variable token representations, we employ a multi-layer perceptron (MLP) to map each variable’s time series embedding to dimension $d_v$, which transforms each variable into a token [41,42]: $T = \mathrm{MLP}(H) \in \mathbb{R}^{D' \times d_v}$, where $d_v$ represents the mapping dimension. Consequently, we obtain $D'$ variable tokens. Subsequently, the feature $T$ is input into the multi-head attention mechanism to learn the correlations among variables. Through the linear projection matrices $W_Q$, $W_K$, and $W_V$, the queries, keys, and values ($Q = TW_Q$, $K = TW_K$, $V = TW_V$) are obtained; the $i$-th rows $q_i$ and $k_i$ serve as the query and key of the $i$-th variable token. For any pair of variable tokens $(i, j)$, their pre-Softmax score is

$$\mathrm{score}(i, j) = \frac{q_i k_j^{\top}}{\sqrt{d_k}},$$

where $d_k$ is the dimension of each query and key. The correlation between variable $i$ and variable $j$ in the projection space is measured by $q_i k_j^{\top}$, expressed in matrix form as

$$A' = \frac{QK^{\top}}{\sqrt{d_k}}.$$

Next, the $\mathrm{Softmax}$ function yields the weight coefficients $A = \mathrm{Softmax}(A')$. These weights are then applied to sum all values, resulting in the output

$$Z = AV.$$
After obtaining the variable representation $Z$ updated through the self-attention mechanism, the model further performs an independent nonlinear mapping on the features of each variable token via a feed-forward network (FFN) [6] to enhance its expressive capability. This process employs residual connections and Layer Normalization to maintain training stability:

$$Z' = \mathrm{LayerNorm}(T + Z), \qquad O = \mathrm{LayerNorm}\big(Z' + \mathrm{FFN}(Z')\big).$$

Here, the $\mathrm{FFN}$ consists of two fully connected layers with a nonlinear activation function in between, which performs a nonlinear feature transformation on each variable token. Since this module uses variable features as input tokens rather than a dedicated class token, we employ average pooling over the variable tokens to derive the final class token:

$$h_{\mathrm{gen}} = \mathrm{AvgPool}(O) = \frac{1}{D'}\sum_{i=1}^{D'} O_i.$$

Under this architecture, the self-attention weight matrix directly reflects the global correlations among variables, thereby enhancing model interpretability. The final output effectively integrates local temporal patterns with global variable dependencies.
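To make the CIA computation concrete, the following is a minimal PyTorch sketch of the inverted-attention block, covering variable tokenization, attention over the variable axis, the per-token FFN with residual connections and LayerNorm, and the final average pooling; the token dimension, head count, and MLP depth are illustrative choices.

```python
import torch
import torch.nn as nn

class InvertedAttentionBlock(nn.Module):
    """Each variable's full (convolved) time series becomes one token, so
    self-attention runs over the variable axis rather than the temporal axis."""

    def __init__(self, seq_len, d_token=128, n_heads=4):
        super().__init__()
        self.tokenize = nn.Sequential(               # MLP: length-L series embedding -> token
            nn.Linear(seq_len, d_token), nn.GELU(), nn.Linear(d_token, d_token))
        self.attn = nn.MultiheadAttention(d_token, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_token, 2 * d_token), nn.GELU(), nn.Linear(2 * d_token, d_token))
        self.norm1 = nn.LayerNorm(d_token)
        self.norm2 = nn.LayerNorm(d_token)

    def forward(self, u):
        # u: (batch, D', L); the Conv1D output already places variables on the token
        # axis, i.e. the transposed layout used for inverted attention.
        tokens = self.tokenize(u)                        # (batch, D', d_token): variables as tokens
        attn_out, attn_weights = self.attn(tokens, tokens, tokens)
        tokens = self.norm1(tokens + attn_out)           # residual connection + LayerNorm
        tokens = self.norm2(tokens + self.ffn(tokens))   # per-variable FFN + LayerNorm
        return tokens.mean(dim=1), attn_weights          # average pooling over variable tokens
```

The returned attention weights correspond to the variable-to-variable correlation matrix discussed above, which is what makes the attention map directly interpretable.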
4.6. Big-O Complexity Analysis
In this section, we provide an analysis of the computational complexity of the proposed modules (DFM, CIA, and the Transformer encoder) to support the claims of improved efficiency. The complexity of each module is evaluated using Big-$O$ notation, allowing for a clear understanding of the performance improvements over previous methods.
Table 2 illustrates the complexity analysis for each module in EffiShapeFormer.
The DFM module involves two stages: Coarse Screening and Fine Screening. The Coarse Screening stage uses Euclidean distance calculations, which have a complexity of $O(N \cdot L)$, where $L$ is the time series length and $N$ is the number of shapelet candidates. The Fine Screening stage incorporates the Perceptual Subsequence Distance (PSD), which involves $O(N \cdot L^2)$ operations due to the pairwise distance computations.
The CIA module modifies the Transformer self-attention mechanism by shifting the attention dimension from the temporal axis to the variable axis. The resulting self-attention complexity is $O(D^2 \cdot L)$, quadratic in the number of variables $D$ rather than in the sequence length $L$.
The original Transformer encoder’s self-attention mechanism, by contrast, has a complexity of $O(L^2 \cdot D)$, where $L$ is the sequence length and $D$ is the dimensionality of the input.
By integrating the DFM and the CIA module, our model achieves a significant reduction in computational complexity, especially in comparison to previous methods such as ShapeFormer. The overall complexity of the EffiShapeFormer framework is reduced from $O(L^2 \cdot D)$ to $O(D^2 \cdot L)$; since $D \ll L$ in typical multivariate time series, this demonstrates the efficiency improvements we have achieved.
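As an illustrative example (with values chosen purely for illustration), consider a multivariate series with $L = 1000$ time steps and $D = 10$ variables: temporal self-attention scales as $L^2 \cdot D = 10^7$, whereas variable-axis attention scales as $D^2 \cdot L = 10^5$, roughly two orders of magnitude fewer operations.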