Article

Expert-Transformer with Prototype-Aware Contrastive Learning for Semi-Supervised Time-Series Classification

Northeast Branch of State Grid Corporation of China, Shenyang 110180, China
*
Author to whom correspondence should be addressed.
Electronics 2026, 15(11), 2303; https://doi.org/10.3390/electronics15112303
Submission received: 28 April 2026 / Revised: 14 May 2026 / Accepted: 19 May 2026 / Published: 26 May 2026
(This article belongs to the Special Issue Advances in Data-Driven Artificial Intelligence, 2nd Edition)

Abstract

Semi-supervised time-series classification (TSC) faces challenges in handling intra-class variability and distribution shifts, which limit the effectiveness of standard contrastive learning methods. To address these limitations, we propose the Expert-Transformer with Prototype-Aware Contrastive Learning (ExT-PACL), a novel framework that integrates an uncertainty-guided Mixture-of-Experts (MoE) module within a Transformer encoder to dynamically capture diverse temporal patterns. An expert balancing strategy ensures all experts contribute meaningfully, preventing collapse and enhancing representation robustness. In addition, a prototype-aware contrastive learning loss guides both labeled and high-confidence unlabeled samples toward class prototypes, improving discriminative power and reducing reliance on large negative sample sets. Extensive experiments on multiple benchmark datasets demonstrate that ExT-PACL achieves superior generalization and state-of-the-art performance.

1. Introduction

Time-series classification (TSC) is a fundamental task in data mining and machine learning, with extensive applications in healthcare (e.g., ECG/EEG interpretation) [1], industrial systems (sensor anomaly detection) [2], finance (market regime identification) [3], and human activity recognition [4]. Advances in deep learning have carried over to TSC, yielding impressive performance [5,6,7]. Nevertheless, in many practical settings, obtaining labeled data is costly and labor-intensive, which significantly limits the feasibility of fully supervised approaches.
Semi-Supervised Learning (SSL) provides an effective strategy by combining a limited amount of labeled data with a large pool of unlabeled samples. Within SSL approaches, contrastive learning has gained considerable attention. Its main idea is to construct a consistent representation space in which similar (positive) sample pairs are drawn together, while dissimilar (negative) pairs are separated. Techniques such as Contrastive Predictive Coding (CPC) and SimCLR have been successfully extended to time-series data [8,9]. For example, Eldele et al. [10] introduced a robust baseline for TSC by employing a temporal contrastive loss combined with carefully designed data augmentations.
Semi-supervised contrastive learning for TSC, despite its effectiveness, remains constrained by the inability to handle intra-class diversity and distribution shifts. Standard models treat all temporal patterns uniformly, which limits their capacity to extract discriminative features across varying samples of the same class. This limitation motivates the integration of a Mixture-of-Experts (MoE) system: by assigning samples to specialized experts based on uncertainty, the model can dynamically capture diverse temporal characteristics, ensuring that rare or non-dominant patterns receive appropriate attention. Furthermore, to avoid imbalanced expert utilization and potential collapse, a balancing strategy is employed, guaranteeing that each expert contributes meaningfully to representation learning. In this way, the MoE framework becomes essential for semi-supervised contrastive learning, directly addressing the challenges of overfitting, non-uniform intra-class patterns, and insufficiently discriminative representations, thereby linking the model design to the fundamental limitations of existing approaches.
This paper presents a novel architecture, Expert-Transformer with Prototype-Aware Contrastive Learning (ExT-PACL), designed to address the significant challenges in semi-supervised time-series classification. Traditional models often struggle to generalize across diverse temporal patterns within a single class due to the lack of specialized mechanisms. To address this, we propose the integration of an uncertainty-guided Mixture-of-Experts (MoE) system within the Transformer encoder. Each expert in this framework is a neural network tailored to handle specific types of temporal data, such as periodic signals, transient spikes, or noisy segments. By incorporating uncertainty estimation, the gating mechanism dynamically routes each input token or sequence to the most relevant combination of experts, enabling the model to adaptively specialize on diverse patterns. This approach enhances the model’s ability to generalize across unseen temporal variations and builds domain-specific expertise internally, addressing one of the key challenges in TSC.
Furthermore, the introduction of an expert balancing strategy ensures that the MoE system remains robust, preventing expert collapse and ensuring that each expert contributes meaningfully to the learning process. This expert balancing strategy further improves the model’s adaptability to various temporal patterns, while maintaining a cohesive representation of the underlying data structure. Overall, our contributions lie in the synergistic combination of an uncertainty-guided MoE system, a prototype-aware contrastive loss, and an expert balancing strategy, providing a comprehensive solution that addresses the core challenges in semi-supervised time-series classification and sets a new direction for future research in this area.
Finally, to enhance the discriminative power of the learned representations and reduce reliance on large numbers of negative samples, we introduce a prototype-aware contrastive learning loss. In our semi-supervised framework, we iteratively compute class prototypes from the data feature representations and the classification outcomes. These prototypes serve as anchors in the contrastive learning process, where the loss function actively pulls both labeled and high-confidence unlabeled samples toward their respective class prototypes while repelling them from others. This mechanism injects essential categorical semantics into the learning process, enabling the model to distinguish between classes more effectively. Combined with the MoE system, this prototype-aware loss not only improves the model’s ability to generalize across diverse patterns but also aligns the learned features with the final classification task, thus overcoming the challenges of poor intra-class discrimination and overfitting.
Our main contributions are summarized as follows:
1.
We integrate an uncertainty-guided Mixture-of-Experts (MoE) with the Transformer encoder, enabling dynamic routing of input sequences to specialized experts and improving generalization across diverse temporal patterns.
2.
An expert balancing strategy, which prevents expert collapse and ensures meaningful contribution from all experts, is designed to enhance robustness and adaptability.
3.
We introduce a prototype-aware contrastive learning loss, which iteratively computes class prototypes and guides both labeled and high-confidence unlabeled samples, embedding categorical semantics to strengthen discriminative feature learning.
4.
We conduct extensive experiments on public benchmark datasets. The results demonstrate that ExT-PACL consistently outperforms state-of-the-art SSL methods, particularly in extreme low-label scenarios, while also being more computationally efficient than standard contrastive baselines.

2. Related Work

2.1. Semi-Supervised Time-Series Classification

Self-supervised learning seeks to mitigate the reliance on large labeled datasets by pre-training models on tasks designed for unlabeled data. The central idea is to learn a representation space where embeddings remain invariant under semantically meaningful data augmentations, thereby capturing essential temporal features. For instance, the SelfMatch framework employs a form of self-distillation, where a “weakly augmented” version of a time-series guides the learning process of a “strongly augmented” version [11]. This flow of knowledge from the more stable view to the perturbed one encourages the model to generate consistent embeddings, thus directly fostering invariance to these augmentations.
A key focus in semi-supervised time-series classification (TSC) is the development of innovative architectures that effectively utilize the structure of unlabeled data. Graph-based approaches have demonstrated considerable potential. For instance, the ITS2Graph framework redefines TSC as a node classification task within a graph, where nodes represent time-series samples derived from their latent features [12]. The framework further employs graph generative adversarial learning to generate samples for minority classes, effectively addressing the pervasive challenge of class imbalance—an issue common in semi-supervised settings with limited and potentially biased labeled data.
Additionally, ensemble methods are gaining traction for enhancing the reliability of pseudo-labeling, a critical aspect of many SSL algorithms. The CE-SFDA framework uses a classifier ensemble to produce more accurate pseudo-labels for the target domain, complemented by a memory-aware knowledge distillation technique that captures both global and local data structures [13]. This ensemble strategy reduces the propagation of errors from noisy pseudo-labels, which is a frequent issue in iterative SSL methods. Another emerging trend is the conversion of time-series data into other modalities to leverage the power of architectures from diverse fields. The TS-BCT framework, for example, learns discriminative time-series representations through temporal-specific augmentations and bidirectional consistency regularization [14]. It aims to establish a feature space with well-defined class boundaries, even in the absence of abundant labeled data. By incorporating pseudo-labels from both augmented views, TS-BCT creates class-relevant contrastive patterns, reinforcing bidirectional consistency and aiding the model in learning a more distinct and separable feature space.

2.2. Contrastive Representation Learning for Time-Series

A key application of contrastive learning in time-series classification (TSC) is the extraction of transferable representations from unlabeled data. The success of this approach hinges on designing effective data augmentation strategies and contrastive objectives that capture robust temporal features.
The TS-TCC framework introduces a novel contrastive learning method, applying both weak (e.g., jittering and scaling) and strong (e.g., permutation and jitter) augmentations to raw time-series data [15]. It incorporates a temporal contrastive module that performs an extra prediction task, encouraging the model to learn resilient temporal representations. This is further strengthened by a contextual contrastive module, which maximizes the similarity between contexts derived from different augmentations of the same sample, while minimizing the similarity between contexts from distinct samples.
The iBACon method addresses the challenge of predicting rare, high-difficulty events (e.g., sudden spikes in power demand) by introducing an Imbalance-Aware Contrastive Learning framework [16]. This approach quantifies sample difficulty using an Input–Output Difference (IOD) metric and groups samples with similar difficulty levels within the feature space. Additionally, DSDCLNet provides a solution for Multivariate Time-Series Classification, featuring a hierarchical dual-level design, which is particularly effective in semi-supervised settings where labeled data is limited [17]. The model’s innovation lies in its structured approach to learning representations at multiple granularities, enhancing its adaptability to varying data complexities.

3. Methodology

This section describes the ExT-PACL framework, whose schematic is shown in Figure 1. The input time-series is first transformed into a sequence of embeddings. This sequence is processed by a stack of MoE-Transformer encoder layers. The output contextualized embeddings are used for two purposes: (1) to compute a standard classification loss via a linear projector for labeled data, and (2) to be aggregated into the prototype predicted vectors. These vectors are then used to compute the proposed prototype-aware contrastive learning loss and the expert balancing loss, which supervise the learning of unlabeled data.

3.1. Problem Formulation

Let $X = X_L \cup X_U$ denote the entire dataset, where $X_L = \{(x_1, y_1), \ldots, (x_M, y_M)\}$ is the labeled dataset and $X_U = \{x_{M+1}, \ldots, x_N\}$ is the unlabeled dataset, with $M \ll N$. Each time sequence $x_i \in \mathbb{R}^{T \times C}$ has $T$ time steps and $C$ channels, and $y_i$ is the class label of the $i$-th labeled sequence. The goal of semi-supervised time-series classification (STSC) is to learn a mapping function $f_\theta : \mathbb{R}^{T \times C} \to \mathbb{R}^{K}$ by leveraging both $X_L$ and $X_U$, where $K$ is the number of classes.

3.2. MoE-Enhanced Expert-Transformer

In STSC, the model should learn comprehensive representations from unlabeled data to improve its generalization performance. Because the input data are usually time-dependent and non-stationary, we design a novel Mixture-of-Experts model that combines information from different experts.
First, we adopt a convolutional encoder $f_{enc}$ and a Transformer-based auto-regressive module $f_{ar}$ to extract features from the raw data. Specifically, an input series $x_i$ is projected into a sequence of vectors $h_i = f_{enc}(x_i)$, $h_i \in \mathbb{R}^{T \times D}$, where $T$ is the number of time steps and $D$ denotes the feature dimension. As in [10], the features from different time steps are summarized into a context vector $c_i \in \mathbb{R}^{D}$:
$$c_i = f_{ar}(h_i)$$
However, a standard Transformer-based auto-regressive module might struggle to handle the inherent variability and uncertainty of unlabeled data and result in poor generalization, since the model must infer patterns in the absence of explicit ground-truth annotations. By employing multiple “experts” (specialized sub-models), each adept at processing specific data types (e.g., temporal patterns), the system can internally develop domain expertise. Specifically, we reinterpret the expert routing variable $e$ as a latent random variable conditioned on the contextual representation $c_i$, where the router models a posterior distribution for learning the latent temporal dynamics of sample $x_i$:
$$q_\theta(e \mid c_i) = \mathrm{Softmax}(W_e c_i + b_e)$$
In this way, the expert assignment can be viewed as a categorical distribution, $e_i \sim \mathrm{Categorical}(q_\theta(e \mid c_i))$.
Hence, we propose to integrate an uncertainty-based expert system into semi-supervised classification. Specifically, the context vector of each expert is viewed as the input of the expert system:
$$c_i^e = f_{ar}^e(h_i), \quad e = 1, \ldots, E$$
In the experiment, the number of experts $E$ is set to 4. Then, the router score of a specific expert is formulated as follows:
$$s_i^e = W_e c_i^e + b_e$$
where $(W_e, b_e)$ are the learnable parameters of the $e$-th expert. The expert weight is then obtained from the router scores of all experts:
$$\pi_i^e = \frac{\exp(s_i^e)}{\sum_{j=1}^{E} \exp(s_i^j)}$$
The aggregated output obtained through the expert system is defined as
$$z_i = \sum_{j=1}^{E} \pi_i^j c_i^j$$
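As a rough illustration, the routing and aggregation steps above can be sketched in plain Python. The function names, the toy inputs, and the treatment of each $W_e$ as a weight vector that yields one scalar score per expert are assumptions made for this sketch; the expert context vectors are taken as given rather than produced by real Transformer modules.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [v / total for v in exps]


def route_and_aggregate(contexts, router_weights, router_biases):
    """Return (pi, z) for one sample.

    contexts:       E expert context vectors c_i^e, each of length D
    router_weights: E weight vectors W_e of length D (assumed shape)
    router_biases:  E scalar biases b_e
    """
    # Router score s_i^e = W_e . c_i^e + b_e, one scalar per expert.
    scores = [
        sum(w * c for w, c in zip(w_e, c_e)) + b_e
        for w_e, c_e, b_e in zip(router_weights, contexts, router_biases)
    ]
    # Expert weights pi_i^e: softmax over the E router scores.
    pi = softmax(scores)
    # Aggregated output z_i = sum_e pi_i^e * c_i^e.
    dim = len(contexts[0])
    z = [sum(p * c_e[d] for p, c_e in zip(pi, contexts)) for d in range(dim)]
    return pi, z


# Toy example with E = 4 experts and D = 2 features (hypothetical values).
contexts = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
router_weights = [[0.2, -0.1], [0.0, 0.3], [0.1, 0.1], [-0.2, 0.4]]
router_biases = [0.0, 0.0, 0.0, 0.0]
pi, z = route_and_aggregate(contexts, router_weights, router_biases)
```

The weights `pi` sum to one, so `z` is a convex combination of the expert context vectors.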
To alleviate the distribution shift problem caused by uncertain expert decisions, we propose to quantify the expert decision distribution through uncertainty estimation. First, we calculate the mean and the variance of the expert weights:
$$\mu_i = \frac{1}{E} \sum_{e=1}^{E} \pi_i^e, \qquad \sigma_i^e = \pi_i^e (1 - \pi_i^e)$$
Because $\pi_i^e$ is the softmax output and follows a multinomial distribution, the variance of each expert $\sigma_i^e$ takes the form $p(1 - p)$. Based on this definition, the covariance matrix is represented as
$$\Sigma_i = \mathrm{diag}\left((\sigma_i^1)^2, (\sigma_i^2)^2, \ldots, (\sigma_i^E)^2\right)$$
where the off-diagonal elements are 0. Correspondingly, the inverse covariance is formulated as a zero-padded diagonal matrix
$$\Sigma_i^{-1} = \mathrm{diag}\left(\frac{1}{(\sigma_i^1)^2}, \frac{1}{(\sigma_i^2)^2}, \ldots, \frac{1}{(\sigma_i^E)^2}\right)$$
This design allows the model to learn specialized experts and to make soft, uncertainty-aware decisions.
After that, we compute the final expert weight through the uncertainty-aware fusion mechanism. The core idea is to apply a gated correction weighted by the Mahalanobis distance, which reduces the weight of experts with high uncertainty and increases the weight of those with low uncertainty:
$$\tilde{\pi}_i^e = \frac{\pi_i^e \exp\left(\frac{1}{2} (\pi_i^e - \mu_i)^{T} \Sigma_i^{-1} (\pi_i^e - \mu_i)\right)}{\sum_{j=1}^{E} \pi_i^j \exp\left(\frac{1}{2} (\pi_i^j - \mu_i)^{T} \Sigma_i^{-1} (\pi_i^j - \mu_i)\right)}$$
Because $\Sigma_i^{-1}$ is a diagonal matrix, the Mahalanobis distance calculation can be simplified as
$$\tilde{\pi}_i^e = \frac{\pi_i^e \exp\left(\frac{(\pi_i^e - \mu_i)^2}{2 (\sigma_i^e)^2}\right)}{\sum_{j=1}^{E} \pi_i^j \exp\left(\frac{(\pi_i^j - \mu_i)^2}{2 (\sigma_i^j)^2}\right)}$$
Thus, the uncertainty-aware expert weight still satisfies $\sum_{j=1}^{E} \tilde{\pi}_i^j = 1$. Meanwhile, the greater the deviation of the initial expert weight $\pi_i^e$ from the mean, and the smaller its own variance (the lower the uncertainty), the higher its final weight will be.
In summary, instead of the original Transformer-based auto-regressive module, we propose to employ the MoE-Enhanced Transformer module to capture more generalized features:
$$z_i = \sum_{j=1}^{E} \tilde{\pi}_i^j c_i^j$$
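The uncertainty-aware correction of the expert weights can be sketched as follows. This is a minimal sketch rather than the authors' implementation: the function name and the `eps` guard are hypothetical, and the sign of the exponent is an assumption chosen to match the behavior stated in the text (a larger deviation from the mean and a smaller variance give a higher final weight). The normalization is done in log space so that very confident experts do not overflow.

```python
import math


def uncertainty_aware_weights(pi, eps=1e-12):
    """Re-weight softmax expert weights pi (length E) and renormalize."""
    num_experts = len(pi)
    mu = sum(pi) / num_experts                 # mean expert weight
    log_w = []
    for p in pi:
        sigma = p * (1.0 - p)                  # spread term of the form p(1 - p)
        # Simplified Mahalanobis term (p - mu)^2 / (2 sigma^2); sign assumed positive.
        d = (p - mu) ** 2 / (2.0 * sigma ** 2 + eps)
        log_w.append(math.log(p + eps) + d)
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(log_w)
    w = [math.exp(v - top) for v in log_w]
    total = sum(w)
    return [v / total for v in w]


# Hypothetical router output for E = 4 experts.
corrected = uncertainty_aware_weights([0.7, 0.1, 0.1, 0.1])
```

Under these assumptions a uniform input stays uniform, while a dominant, low-variance expert is sharpened further.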

3.3. Balanced Expert Training and Prototype-Guided Representation Learning

Recent developments in learning-based optimization (LBO) and physics-informed learning (PIL) highlight the integration of data-driven models with domain-specific knowledge for improved generalization and optimization performance [18,19]. Inspired by these studies, our approach leverages multiple specialized experts in a Transformer encoder and prototype-aware contrastive loss to adaptively capture diverse temporal patterns while embedding structural information inherent to the time-series data.
To encourage expert specialization while avoiding the situation in which the router, early in training, compresses all samples onto one or a few experts, the posterior expert distribution $q_\theta(e \mid x)$ should not collapse into a peaked, low-entropy distribution. Instead, all experts should be used to a comparable extent. A corresponding expert balancing loss is designed:
$$\mathcal{L}_{eb} = \mathbb{E}\left[\mathrm{KL}\left(q_\theta(e \mid x) \,\|\, p(e)\right)\right]$$
Thus, the posterior expert distribution $q_\theta(e \mid x)$ is constrained toward the prior expert distribution $p(e)$, which is a uniform distribution. This variational regularization term prevents posterior collapse and reduces the risk that a small subset of experts overfits noisy temporal patterns. It encourages all experts to capture meaningful latent dynamics rather than memorizing local noise correlations.
As a regularization term for the MoE model, the KL divergence for each sample is
$$\mathrm{KL}\left(q_\theta(e \mid x) \,\|\, p(e)\right) = \sum_{j} \pi_i^j \log\left(E\, \pi_i^j\right)$$
where the prior expert distribution is $p(e) = 1/E$ and $q_\theta(e \mid x_i) = \pi_i^e$.
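To make the balancing term concrete, the KL divergence to a uniform prior can be computed directly from the expert weights. The sketch below assumes a batch given as a list of weight vectors and averages over it to approximate the expectation; the function name is hypothetical.

```python
import math


def expert_balance_loss(batch_pi):
    """Mean over samples of KL(pi_i || uniform) = sum_j pi_i^j * log(E * pi_i^j)."""
    total = 0.0
    for pi in batch_pi:
        num_experts = len(pi)
        # Terms with zero weight contribute 0 (the limit of p * log p as p -> 0).
        total += sum(p * math.log(num_experts * p) for p in pi if p > 0.0)
    return total / len(batch_pi)


balanced = expert_balance_loss([[0.25, 0.25, 0.25, 0.25]])   # uniform routing
collapsed = expert_balance_loss([[1.0, 0.0, 0.0, 0.0]])      # one expert takes all
```

The loss is zero for uniform routing and reaches its maximum, $\log E$, when a single expert receives all the weight.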
Moreover, for enhancing the specialization of the experts, we design a self-learning prototype-aware contrastive loss to make the expert outputs for the same type of input be closer to each other, while the expert outputs for different types of input are as far apart as possible. Conventional contrastive learning typically defines positives and negatives based on data augmentations or heuristics. For example, two augmented samples of the same unlabeled sample are a positive pair. While this learns robust features, the resulting feature space is not necessarily organized according to the class boundaries of the downstream task. The model learns invariance to augmentations but not necessarily discrimination between categorical concepts. This can lead to sub-optimal decision boundaries when fine-tuning with limited labels.
In this paper, we propose a novel self-learning contrastive loss to strengthen the collaborative effect among different experts. For a feature vector $z_i$, we define its positive set based on its corresponding class prototype. The negative samples are feature vectors associated with other prototypes. In STSC, the prototype predicted vector of the $i$-th sample based on class prototypes is formulated as
$$r_i = \sum_{k=1}^{K} \frac{e^{-dist(z_i, \phi_k)}}{\sum_{j=1}^{K} e^{-dist(z_i, \phi_j)}} \varphi_k$$
where $dist$ denotes the Euclidean distance, $\phi_k$ is the prototype vector of the $k$-th class, and $\varphi_k$ is the corresponding indicator vector.
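The prototype predicted vector is a soft assignment over class prototypes, which a short sketch can illustrate. It assumes that closer prototypes receive larger weights (negative Euclidean distance in the exponent) and that the indicator vectors are given as plain lists; the function name and toy values are hypothetical.

```python
import math


def prototype_predicted_vector(z, prototypes, indicators):
    """Mix the K indicator vectors using distance-based soft assignment weights."""
    dists = [math.dist(z, phi) for phi in prototypes]
    # Shift by the smallest distance so the largest exponent is exp(0).
    nearest = min(dists)
    w = [math.exp(-(d - nearest)) for d in dists]
    total = sum(w)
    out_dim = len(indicators[0])
    return [
        sum(w[k] / total * indicators[k][c] for k in range(len(prototypes)))
        for c in range(out_dim)
    ]


# Two classes: the feature sits on the first prototype, far from the second.
r = prototype_predicted_vector(
    [0.0, 0.0],
    prototypes=[[0.0, 0.0], [10.0, 10.0]],
    indicators=[[1.0, 0.0], [0.0, 1.0]],
)
```

With one-hot indicator vectors, `r` reduces to the soft assignment probabilities themselves.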
In STSC, prototypes derived only from limited labeled data are often biased and fail to capture the intrinsic distribution of the unlabeled majority. Therefore, a self-learning mechanism is needed to iteratively refine prototype features by leveraging the structural information embedded in the features of unlabeled data. By progressively updating prototypes according to temporal features, the model can better align latent representations with the underlying class structures, thereby improving cluster compactness and inter-class separability, which ultimately enhances classification performance in STSC. This strategy takes the feature representations $z$ and the classification results $v$, with $v_i = f_{pre}(z_i)$, as inputs to compute $\phi$ and $\varphi$ iteratively. The class prototype $\phi_j$ is updated as
$$\phi_j = \phi_j - \eta \, \frac{\sum_{i=1}^{n_j} (\phi_j - z_i)}{1 + n_j}$$
where $\phi_j$ is the prototype vector of the $j$-th class, $n_j$ denotes the number of samples in the $j$-th class, and $\eta$ is the learning rate. At the beginning, $\phi_j$ is initialized to $1/K$. Similarly, $\varphi_j$, the indicator vector of the $j$-th class, is updated as follows:
$$\varphi_j = \frac{\varphi_j - \eta \, \frac{\sum_{i=1}^{n_j} (\varphi_j - v_i)}{1 + n_j}}{\sum_{k=1}^{K} \left(\varphi_k - \eta \, \frac{\sum_{i=1}^{n_j} (\varphi_k - v_i)}{1 + n_j}\right)}$$
The initial value of $\varphi_j$ is set to 1.
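A single prototype update step can be sketched as below, assuming the update subtracts a learning-rate-scaled sum of differences between the prototype and its member features, divided by one plus the member count. Only the prototype vector is handled here; the normalized indicator update is omitted, and the function name is hypothetical.

```python
def update_prototype(phi, members, lr):
    """One update of a class prototype phi toward its member features.

    phi:     current prototype vector (list of floats)
    members: feature vectors z_i currently assigned to this class
    lr:      learning rate eta
    """
    count = len(members)
    dim = len(phi)
    # Sum of (phi - z_i) over members, damped by 1 + n_j.
    step = [sum(phi[d] - z[d] for z in members) / (1 + count) for d in range(dim)]
    return [phi[d] - lr * step[d] for d in range(dim)]


# Toy example: the prototype moves toward its two member features.
new_phi = update_prototype([0.0, 0.0], [[2.0, 2.0], [4.0, 4.0]], lr=1.0)
```

The `1 + n_j` denominator keeps the step well defined even when a class has no assigned samples, in which case the prototype stays where it is.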
Since class prototypes are iteratively updated using both labeled and unlabeled samples, noisy pseudo-labels may introduce error propagation and cause semantic drift. To mitigate this issue, we adopt a reliability-aware prototype update mechanism. During the initial warm-up stage, class prototypes are initialized and updated only using labeled samples, which provides stable semantic anchors. After the warm-up stage, an unlabeled sample is allowed to participate in prototype updating only when its prediction confidence exceeds a predefined threshold and its predictions under weak and strong augmentations are consistent.
For an unlabeled sample $x_i$, we compute the posterior distributions of its weakly and strongly augmented views, $\varphi_i^w$ and $\varphi_i^s$. The sample is selected for prototype updating only if
$$\max_k \varphi_i^w(k) \ge \delta$$
where $\delta$ is the confidence threshold. The selected unlabeled samples contribute to the prototype update with confidence-based weights and are included in the $n_k$ samples of their predicted class, while labeled samples remain the primary semantic anchors. In addition, the prototype update is performed with a stop-gradient operation, so that unreliable pseudo-labels do not directly amplify gradient noise. This design reduces error propagation and stabilizes the semantic structure of the latent space.
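The selection rule for unlabeled samples combines the confidence threshold with the weak/strong consistency check described for the post-warm-up stage. A minimal sketch, assuming class posteriors are given as lists of probabilities and using a hypothetical function name:

```python
def select_reliable(weak_probs, strong_probs, delta):
    """Return (index, pseudo_label, confidence) for samples that pass both checks."""
    selected = []
    for i, (p_weak, p_strong) in enumerate(zip(weak_probs, strong_probs)):
        k_weak = max(range(len(p_weak)), key=p_weak.__getitem__)
        k_strong = max(range(len(p_strong)), key=p_strong.__getitem__)
        confident = p_weak[k_weak] >= delta    # weak-view confidence threshold
        consistent = k_weak == k_strong        # both views predict the same class
        if confident and consistent:
            # The confidence can later serve as the sample's update weight.
            selected.append((i, k_weak, p_weak[k_weak]))
    return selected


weak = [[0.9, 0.1], [0.6, 0.4], [0.95, 0.05]]
strong = [[0.8, 0.2], [0.7, 0.3], [0.3, 0.7]]
reliable = select_reliable(weak, strong, delta=0.8)
```

In this toy batch only the first sample is kept: the second is not confident enough, and the third is confident but inconsistent across views.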
Correspondingly, the prototype-aware contrastive learning loss of the $j$-th sample is defined as follows:
$$\hat{\ell}_j^{r} = -\log \frac{\exp\left(CS(r_j^1, r_j^2)/\tau_C\right)}{\sum_{k=1}^{K} \left[\exp\left(CS(r_j^1, r_k^1)/\tau_C\right) + \exp\left(CS(r_j^1, r_k^2)/\tau_C\right)\right]},$$
where $\tau_C$ is the temperature hyperparameter of contrastive learning and $CS$ represents the cosine similarity function.
All unlabeled data $X_U$ are trained with the prototype-aware contrastive learning loss. The complete loss function is as follows:
$$\mathcal{L}_{contr\_P} = \frac{1}{2(N - M)} \sum_{j=1}^{N - M} \hat{\ell}_j^{r}$$
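The prototype-aware contrastive loss has the shape of a temperature-scaled log-ratio over cosine similarities, which the following sketch illustrates. It assumes the two superscripts denote two views of each sample, that the denominator runs over every sample in the batch, and that the loss is the negative log of the ratio; the function names and the default temperature are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def prototype_contrastive_loss(view1, view2, tau=0.5):
    """Batch loss over prototype predicted vectors from two views."""
    n = len(view1)
    total = 0.0
    for j in range(n):
        # Positive pair: the two views of sample j.
        positive = math.exp(cosine(view1[j], view2[j]) / tau)
        # Denominator: similarities of the anchor to both views of every sample.
        denom = sum(
            math.exp(cosine(view1[j], view1[k]) / tau)
            + math.exp(cosine(view1[j], view2[k]) / tau)
            for k in range(n)
        )
        total += -math.log(positive / denom)
    return total / (2 * n)   # 1 / (2 * batch size) normalization


aligned = prototype_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = prototype_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

The loss is lower when the two views of each sample agree (`aligned`) than when they point to different classes (`swapped`).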

3.4. Training Losses

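Finally, the overall loss function is defined as follows:
$$\mathcal{L}_{total} = \mathcal{L}_{cls\_T} + \alpha_1 \mathcal{L}_{contr\_P} + \alpha_2 \mathcal{L}_{contr} + \alpha_3 \mathcal{L}_{eb}$$
where $\mathcal{L}_{cls\_T}$ and $\mathcal{L}_{contr}$ are the cross-entropy loss of labeled data and the sample-wise contrastive learning loss of unlabeled data, respectively, $\mathcal{L}_{contr\_P}$ is the prototype-aware contrastive loss, and $\mathcal{L}_{eb}$ is the expert balancing loss. $\alpha_1$, $\alpha_2$, and $\alpha_3$ are the weights of the corresponding losses. The cross-entropy loss is designed to learn class information from labeled data:
$$\mathcal{L}_{cls\_T} = \frac{1}{M} \sum_{i=1}^{M} H(y_i, v_i)$$
where $H$ is the cross-entropy.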
Then, we utilize the sample-wise contrastive learning loss to maximize the similarity between $a_i^w$ and $a_i^s$, the outputs of a contrastive head for two different augmented views of an unlabeled sample: $a_i^w = f_{ch}(z_i^w)$, $a_i^s = f_{ch}(z_i^s)$. The features $(z_i^w, z_i^s)$ are extracted from the weakly augmented sample $x_i^w$ and the strongly augmented sample $x_i^s$. The sample-wise contrastive loss of the $i$-th sample is given as follows:
$$\ell_i^w = -\log \frac{\exp\left(CS(a_i^w, a_i^s)/\tau_I\right)}{\sum_{j=1}^{N - M} \left[\exp\left(CS(a_i^w, a_j^w)/\tau_I\right) + \exp\left(CS(a_i^w, a_j^s)/\tau_I\right)\right]},$$
$$\ell_i^s = -\log \frac{\exp\left(CS(a_i^s, a_i^w)/\tau_I\right)}{\sum_{j=1}^{N - M} \left[\exp\left(CS(a_i^s, a_j^s)/\tau_I\right) + \exp\left(CS(a_i^s, a_j^w)/\tau_I\right)\right]},$$
where $\tau_I$ is the temperature hyperparameter of the sample-wise contrastive learning loss. The whole sample-wise contrastive learning loss is defined as follows:
$$\mathcal{L}_{contr} = \frac{1}{2N} \sum_{i=1}^{N} \left(\ell_i^w + \ell_i^s\right)$$
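The symmetric structure of the sample-wise loss, with one term anchored on the weak view and one on the strong view, can be sketched as follows. The sketch assumes negative-log ratios, takes the batch to be the list of samples passed in, and uses hypothetical function names and a hypothetical default temperature.

```python
import math


def _cos(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def _directed_loss(anchor_view, other_view, i, tau):
    """Loss for sample i with the anchor taken from anchor_view."""
    anchor = anchor_view[i]
    positive = math.exp(_cos(anchor, other_view[i]) / tau)
    denom = sum(
        math.exp(_cos(anchor, anchor_view[j]) / tau)
        + math.exp(_cos(anchor, other_view[j]) / tau)
        for j in range(len(anchor_view))
    )
    return -math.log(positive / denom)


def sample_wise_contrastive_loss(weak, strong, tau=0.2):
    """Average of the weak-anchored and strong-anchored losses over the batch."""
    n = len(weak)
    total = sum(
        _directed_loss(weak, strong, i, tau) + _directed_loss(strong, weak, i, tau)
        for i in range(n)
    )
    return total / (2 * n)


weak_feats = [[1.0, 0.0], [0.0, 1.0]]
strong_feats = [[0.9, 0.1], [0.1, 0.9]]
loss_matched = sample_wise_contrastive_loss(weak_feats, strong_feats)
loss_shuffled = sample_wise_contrastive_loss(weak_feats, strong_feats[::-1])
```

Swapping the roles of the two views leaves the value unchanged, while pairing each weak view with the wrong strong view increases it.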

4. Experiments

4.1. Datasets

We conduct experiments on four datasets: two datasets from the UCR Repository (Wafer and PhalangesOutlinesCorrect), the Epileptic Seizure Recognition dataset [20], and a power grid dataset derived from real monitoring data collected from State Grid servers between March 2025 and May 2025.

4.2. Comparison Baselines

The comparison baselines include three classical SSL methods (SSL-ECG [21], CPC [8], and SimCLR [9]) and six state-of-the-art semi-supervised and contrastive methods (Mean-Teacher [22], DivideMix [23], SemiTime [24], FixMatch [25], TS-TCC [15], and CA-TCC [10]).

4.3. Evaluation Metrics

Two evaluation metrics, accuracy and the macro-averaged F1-score (MF1-score), are used in the experiments.
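For reference, the two metrics can be computed as in the sketch below, assuming that MF1 denotes the macro-averaged F1-score, that is, the unweighted mean of per-class F1 values. The function names and the toy labels are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


y_true = [0, 0, 1, 1]
y_pred = [0, 1, 1, 1]
acc = accuracy(y_true, y_pred)
mf1 = macro_f1(y_true, y_pred)
```

Because every class counts equally, the macro average penalizes poor performance on minority classes more than accuracy does.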

4.4. Implementation Details

The overall framework of the proposed approach consists of three modules. First, an encoder is constructed to extract sequence features from raw data; it includes three convolutional blocks and a fully connected block. The sizes of the three convolutional blocks are set to [32, 64, 64]. Each convolutional block has a convolutional layer, a BatchNorm layer, a ReLU activation, and a MaxPooling layer, and the first convolutional block has an extra dropout layer. Subsequently, a MoE-Enhanced module is designed to develop domain expertise, which is composed of an auto-regressive layer, a router layer, and a feedforward layer. The auto-regressive layer is a Transformer network. The output dimension of the router layer equals the number of experts, which is set to 4 in our experiments. The feedforward layer is a fully connected layer of size 128. Finally, the learned category-specific feature representations are fed into a decoder to predict categories. The decoder is also a fully connected layer. We adopt the Adam optimizer with a learning rate of 0.0003 and a batch size of 8.

4.5. Experimental Results

Table 1 presents the semi-supervised classification results on three datasets, namely the Epileptic Seizure Recognition dataset and two UCR Repository datasets (Wafer and PhalangesOutlinesCorrect, abbreviated POC), and demonstrates the superior performance of the proposed method at 1% and 5% labeled data fractions. Among the baselines, CA-TCC is consistently the strongest, achieving the highest MF1-score at 1% labeled data and both the highest accuracy and MF1-score at 5% labeled data on the Epilepsy dataset. This highlights the effectiveness of contrastive learning approaches compared to other methods. Notably, the proposed ExT-PACL framework surpasses all baselines in overall performance, achieving the best results on most metrics, with the only exception being the MF1-score on the Epilepsy dataset. These results confirm that integrating the MoE architecture with prototype-aware contrastive learning enhances feature representation and improves classification outcomes in semi-supervised TSC scenarios.
As illustrated in Figure 2, we conduct classification generalization experiments on the Epilepsy, Wafer, and POC datasets, using various labeled data fractions (1%, 5%, 50%, 75%, and 100%). CA-TCC is used as the baseline, as it employs a similar contrastive learning architecture and builds a temporal auto-regressive module for classification. The proposed approach outperforms CA-TCC in accuracy at every labeled data fraction, with its curve lying consistently above that of the baseline. Taking the POC dataset as an example, the standard deviations of the baseline model across data proportions are [0.1, 0.06, 0.07, 0.09, 0.08, 0.1], while those of the proposed model are [0.06, 0.04, 0.03, 0.03, 0.1, 0.1]. The deviations of the proposed model are lower in the first four settings, which indicates its robustness.
Figure 3 presents the precision–recall curves of the proposed method compared to CA-TCC on the power grid dataset. The curve for ExT-PACL declines gradually with minimal fluctuations, indicating stable performance. Overall, these results demonstrate the strong generalization ability of the proposed approach in semi-supervised time-series classification tasks.

4.6. Ablation Study

An ablation study was conducted to evaluate the contributions of the proposed design, as illustrated in Figure 4. The baseline model implements contrastive learning following the CA-TCC framework. Next, the MoE-Enhanced Transformer Encoder module was incorporated to enable richer feature interactions, resulting in variant A1. Finally, the prototype-aware contrastive learning strategy was applied, forming variant A2. As shown in Figure 4, compared to the baseline, variant A1 achieves Top-1 accuracies of 64.78% and 66.8% and MF1-scores of 50.19% and 53.9% under 1% and 5% labeled data settings, respectively. Building upon the MoE-Enhanced Transformer Encoder, the full proposed method further improves Top-1 accuracy by 2.22% and 0.7%, and MF1-score by 1.65% and 1.72%, demonstrating the effectiveness of both the MoE integration and the prototype-aware contrastive learning strategy.
Then, we empirically evaluate the effect of varying the number of experts. The model based on softmax routing is defined as E-PACL and is used for comparison. As shown in Table 2, increasing the number of experts improves accuracy and MF1-score up to $E = 4$, beyond which performance gains saturate. Training time and the number of model parameters increase linearly with the number of experts. Therefore, $E = 4$ strikes a balance between model expressivity and computational efficiency.
As shown in Table 3, ExT-PACL contains slightly more parameters than CA-TCC; the increase in single forward-pass time and inference latency is small. The prototype update mechanism is only used during training and does not introduce extra overhead during inference. Consequently, the inference complexity of ExT-PACL is close to that of CA-TCC, while achieving consistently better classification performance and stronger generalization ability.

4.7. Sensitivity Analysis

Three hyperparameters ($\alpha_1$, $\alpha_2$, and $\alpha_3$) are analyzed: the prototype-aware contrastive loss weight, the sample-wise contrastive loss weight, and the expert balancing loss weight, respectively. The values of all weights are selected from [0.001, 0.01, 0.1, 1, 10, 100]. As shown in Figure 5, the hyperparameter $\alpha_1$ has the greatest impact on the classification performance.

5. Conclusions

Semi-supervised contrastive learning for time-series classification (TSC) often suffers from limited generalization across diverse temporal patterns and insufficient guidance from latent category information, leading to sub-optimal feature representations. To overcome these challenges, we propose the Expert-Transformer with Prototype-Aware Contrastive Learning (ExT-PACL), a novel framework that integrates an uncertainty-guided Mixture-of-Experts (MoE) module within a Transformer encoder to adaptively capture complex temporal dynamics. An expert balancing mechanism ensures equitable utilization of all experts, preventing collapse and enhancing robustness. Furthermore, a prototype-aware contrastive loss iteratively computes class prototypes to guide both labeled and high-confidence unlabeled samples, embedding categorical semantics and improving discriminative power. Extensive experiments on multiple benchmark datasets demonstrate that ExT-PACL achieves state-of-the-art performance, particularly in low-label settings, highlighting its effectiveness in addressing intra-class variability and distribution shifts in semi-supervised TSC.

Author Contributions

Conceptualization, review and editing, Z.H.; methodology and writing, F.P.; software, K.H.; formal analysis, D.X.; resources, T.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the State Grid Corporation of China (529926250007).

Institutional Review Board Statement

This article does not contain any studies with human participants or animals performed by any of the authors.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data are available on request from the authors.

Conflicts of Interest

All authors were employed by the State Grid Corporation of China. The authors declare no conflict of interest. This study received funding from the State Grid Corporation of China. The funder was not involved in the study design; the collection, analysis, or interpretation of data; the writing of this article; or the decision to submit it for publication.

References

  1. Yehuda, Y.; Freedman, D.; Radinsky, K. Self-supervised Classification of Clinical Multivariate Time Series using Time Series Dynamics. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Long Beach, CA, USA, 2023; KDD’23; ACM: New York, NY, USA, 2023; pp. 5416–5427. [Google Scholar] [CrossRef]
  2. Dzaferagic, M.; Marchetti, N.; Macaluso, I. Fault Detection and Classification in Industrial IoT in Case of Missing Sensor Data. IEEE Internet Things J. 2021, 9, 8892–8900. [Google Scholar] [CrossRef]
  3. Oliveira, M.; Costa, G. Quantitative portfolio optimization framework with market regimes classification, probabilistic time series forecasting, and hidden Markov models. Digit. Financ. 2025, 7, 553–603. [Google Scholar] [CrossRef]
  4. Müller, P.N.; Müller, A.J.; Achenbach, P.; Göbel, S. IMU-Based Fitness Activity Recognition Using CNNs for Time Series Classification. Sensors 2024, 24, 742. [Google Scholar] [CrossRef] [PubMed]
  5. Chen, W.; Shi, K. Multi-scale Attention Convolutional Neural Network for time series classification. Neural Netw. 2021, 136, 126–140. [Google Scholar] [CrossRef] [PubMed]
  6. Liu, H.; Yang, D.; Liu, X.; Chen, X.; Liang, Z.; Wang, H.; Cui, Y.; Gu, J. TodyNet: Temporal dynamic graph neural network for multivariate time series classification. Inf. Sci. 2024, 677, 120914. [Google Scholar] [CrossRef]
  7. Foumani, N.M.; Tan, C.W.; Webb, G.I.; Salehi, M. Improving position encoding of transformers for multivariate time series classification. Data Min. Knowl. Discov. 2024, 38, 22–48. [Google Scholar] [CrossRef]
  8. Oord, A.; Li, Y.; Vinyals, O. Representation Learning with Contrastive Predictive Coding. arXiv 2018, arXiv:1807.03748. [Google Scholar]
  9. Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G. A Simple Framework for Contrastive Learning of Visual Representations. In Proceedings of the 37th International Conference on Machine Learning, Online, 13–18 July 2020; Daumé, H., III, Singh, A., Eds.; PMLR (Proceedings of Machine Learning Research): Vienna, Austria, 2020; Volume 119, pp. 1597–1607. [Google Scholar]
  10. Eldele, E.; Ragab, M.; Chen, Z.; Wu, M.; Kwoh, C.K.; Li, X.; Guan, C. Self-Supervised Contrastive Representation Learning for Semi-Supervised Time-Series Classification. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 15604–15618. [Google Scholar] [CrossRef] [PubMed]
  11. Xing, H.; Xiao, Z.; Dawei, Z.; Luo, S.; Dai, P.; Li, K. SelfMatch: Robust semisupervised time-series classification with self-distillation. Int. J. Intell. Syst. 2022, 37, 8583–8610. [Google Scholar] [CrossRef]
  12. Liu, C.; Guan, D.; Yuan, W.; Koc, C.K. ITS2Graph: Graph-based generative adversarial learning for imbalanced time series classification. Neural Netw. 2025, 191, 107770. [Google Scholar] [CrossRef] [PubMed]
  13. Pei, E.; Zhao, W.; Hu, Z.; He, L.; Ning, H.; Chen, H. Classifier ensemble based source-free domain adaptation for time series classification. Knowl. Based Syst. 2025, 330, 114584. [Google Scholar] [CrossRef]
  14. Liu, H.; Zhang, F.; Huang, X.; Wang, R.; Xi, L. Bidirectional consistency with temporal-aware for semi-supervised time series classification. Neural Netw. 2024, 180, 106709. [Google Scholar] [CrossRef] [PubMed]
  15. Eldele, E.; Ragab, M.; Chen, Z.; Wu, M.; Kwoh, C.K.; Li, X.; Guan, C. Time-Series Representation Learning via Temporal and Contextual Contrasting. In Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence (IJCAI-21), Virtual, 19–27 August 2021. [Google Scholar]
  16. Zhang, J.; Dai, Q.; Ye, R. iBACon: imBalance-Aware Contrastive Learning for Time Series Forecasting. IEEE Trans. Knowl. Data Eng. 2025, 37, 5967–5982. [Google Scholar] [CrossRef]
  17. Liu, M.; Sheng, H.; Zhang, N.; Zhao, P.; Yi, Y.; Jiang, Y.; Dai, J. DSDCLNet: Dual-stream encoder and dual-level contrastive learning network for supervised multivariate time series classification Multivariate Time Series Classification Gated Recurrent Unit Multiscale Convolutional Neural Network Dual-Stream Encoder Contrastive Learning. Knowl. Based Syst. 2024, 292, 111638. [Google Scholar] [CrossRef]
  18. Cao, Y.; Fan, L.; Wang, W.; He, Y. Machine learning-driven acceleration of Mg-MOF-74 synthesis optimization for enhanced CO2 adsorption. Sep. Purif. Technol. 2026, 398, 138172. [Google Scholar] [CrossRef]
  19. Yan, H.; Ye, L.; Zhou, T.; Li, Z.H.; Ye, T.; Zhang, F.; Liu, C.L. Physics-informed neural network for predicting multi-row film cooling superposition using Fourier transform and attention mechanism. Phys. Fluids 2025, 37, 065174. [Google Scholar] [CrossRef]
  20. Andrzejak, R.; Lehnertz, K.; Mormann, F.; Rieke, C.; David, P.; Elger, C. Indications of nonlinear deterministic and finite-dimensional structures in time series of brain electrical activity: Dependence on recording region and brain state. Phys. Rev. E Stat. Nonlinear Soft Matter Phys. 2002, 64, 061907. [Google Scholar] [CrossRef] [PubMed]
  21. Sarkar, P.; Etemad, A. Self-Supervised ECG Representation Learning for Emotion Recognition. IEEE Trans. Affect. Comput. 2020, 13, 1541–1554. [Google Scholar] [CrossRef]
  22. Tarvainen, A.; Valpola, H. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In Proceedings of the Advances in Neural Information Processing Systems; Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Curran Associates, Inc.: Nice, France, 2017; Volume 30. [Google Scholar]
  23. Li, J.; Socher, R.; Hoi, S.C.H. DivideMix: Learning with Noisy Labels as Semi-supervised Learning. arXiv 2020, arXiv:2002.07394. [Google Scholar]
  24. Fan, H.; Zhang, F.; Wang, R.; Huang, X.; Li, Z. Semi-Supervised Time Series Classification by Temporal Relation Prediction. In Proceedings of the ICASSP 2021–2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); IEEE: Piscataway, NJ, USA, 2021; pp. 3545–3549. [Google Scholar] [CrossRef]
  25. Sohn, K.; Berthelot, D.; Carlini, N.; Zhang, Z.; Zhang, H.; Raffel, C.A.; Cubuk, E.D.; Kurakin, A.; Li, C.L. FixMatch: Simplifying Semi-Supervised Learning with Consistency and Confidence. In Proceedings of the Advances in Neural Information Processing Systems; Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M.F., Lin, H., Eds.; Curran Associates, Inc.: Nice, France, 2020; Volume 33, pp. 596–608. [Google Scholar]
Figure 1. The overall framework of the proposed approach.
Figure 2. The generalization experiment conducted on the three datasets at different data fractions. Shaded areas in the figure represent the standard deviation trends.
Figure 3. The precision–recall curves of our proposed approach compared with CA-TCC on the power grid dataset.
Figure 4. The ablation study for the proposed design.
Figure 5. The analysis of default hyperparameters.
Table 1. Performance comparison for semi-supervised time-series classification.

| Methods | Epilepsy Accuracy | Epilepsy MF1-Score | Wafer Accuracy | Wafer MF1-Score | POC Accuracy | POC MF1-Score |
|---|---|---|---|---|---|---|
| 1% of labeled data | | | | | | |
| Random Init | 70.3 ± 2.1 | 66.2 ± 2.6 | 90.6 ± 1.6 | 58.1 ± 2.1 | 61.4 ± 0.0 | 38.3 ± 0.0 |
| Supervised | 76.1 ± 0.7 | 74.8 ± 0.4 | 91.9 ± 1.3 | 67.6 ± 9.2 | 62.0 ± 0.8 | 40.0 ± 2.1 |
| SSL-ECG | 89.3 ± 0.4 | 86.0 ± 0.3 | 93.4 ± 0.5 | 76.1 ± 2.4 | 62.5 ± 1.8 | 41.2 ± 4.9 |
| CPC | 88.9 ± 1.1 | 85.8 ± 0.3 | 93.5 ± 0.4 | 78.4 ± 1.5 | 64.8 ± 1.0 | 48.2 ± 2.9 |
| SimCLR | 88.3 ± 1.5 | 84.0 ± 1.0 | 93.8 ± 0.2 | 78.5 ± 1.1 | 61.5 ± 0.1 | 3.8 ± 0.3 |
| TS-TCC | 91.2 ± 0.5 | 89.2 ± 0.2 | 93.2 ± 0.8 | 76.7 ± 4.6 | 63.8 ± 0.5 | 48.1 ± 0.9 |
| Mean-Teacher | 91.5 ± 0.3 | 90.6 ± 0.6 | 94.7 ± 0.2 | 84.7 ± 0.3 | 62.1 ± 0.3 | 40.8 ± 1.2 |
| DivideMix | 90.9 ± 0.7 | 89.4 ± 1.4 | 93.2 ± 0.5 | 82.0 ± 0.8 | 62.1 ± 0.6 | 40.7 ± 2.1 |
| SemiTime | 91.6 ± 0.3 | 90.8 ± 0.6 | 94.4 ± 0.6 | 84.4 ± 1.2 | 62.0 ± 0.5 | 40.4 ± 1.6 |
| FixMatch | 93.2 ± 0.2 | 92.2 ± 0.5 | 95.0 ± 0.4 | 84.8 ± 1.2 | 61.9 ± 0.5 | 40.0 ± 1.8 |
| CA-TCC | 92.0 ± 0.1 | 91.9 ± 0.1 | 95.1 ± 0.3 | 85.1 ± 0.6 | 63.4 ± 0.4 | 49.3 ± 0.7 |
| ExT-PACL | 93.8 ± 0.3 | 89.4 ± 0.5 | 95.1 ± 0.2 | 85.2 ± 0.7 | 65.6 ± 0.3 | 50.4 ± 0.5 |
| 5% of labeled data | | | | | | |
| Random Init | 75.5 ± 3.6 | 70.5 ± 3.3 | 91.2 ± 1.2 | 65.5 ± 8.2 | 61.6 ± 0.3 | 38.8 ± 1.0 |
| Supervised | 83.4 ± 0.7 | 80.4 ± 0.7 | 94.6 ± 0.3 | 83.9 ± 0.6 | 61.4 ± 0.0 | 38.3 ± 0.0 |
| SSL-ECG | 92.8 ± 0.2 | 89.0 ± 0.3 | 94.9 ± 0.3 | 84.5 ± 0.7 | 62.9 ± 0.3 | 43.3 ± 1.4 |
| CPC | 92.8 ± 0.3 | 90.2 ± 0.5 | 92.5 ± 0.4 | 79.4 ± 0.8 | 66.9 ± 2.6 | 44.3 ± 8.4 |
| SimCLR | 74.9 ± 1.5 | 89.2 ± 1.0 | 94.8 ± 0.2 | 83.3 ± 0.6 | 62.7 ± 1.1 | 42.4 ± 4.0 |
| TS-TCC | 93.1 ± 0.3 | 93.7 ± 0.6 | 93.2 ± 0.4 | 81.2 ± 0.7 | 62.6 ± 1.1 | 42.6 ± 3.0 |
| Mean-Teacher | 94.0 ± 0.4 | 93.6 ± 0.7 | 94.4 ± 0.7 | 83.8 ± 1.4 | 62.1 ± 0.6 | 41.2 ± 2.5 |
| DivideMix | 93.9 ± 0.6 | 93.4 ± 1.1 | 94.7 ± 0.6 | 84.6 ± 1.5 | 62.9 ± 1.3 | 45.9 ± 7.0 |
| SemiTime | 94.0 ± 0.5 | 93.0 ± 0.9 | 95.0 ± 0.4 | 84.7 ± 1.0 | 62.4 ± 0.5 | 41.8 ± 1.7 |
| FixMatch | 93.7 ± 1.4 | 92.4 ± 0.3 | 94.9 ± 0.6 | 84.4 ± 1.2 | 63.1 ± 1.4 | 43.6 ± 4.3 |
| CA-TCC | 94.5 ± 0.1 | 94.0 ± 0.1 | 95.8 ± 0.2 | 85.2 ± 0.6 | 66.4 ± 0.3 | 52.8 ± 0.3 |
| ExT-PACL | 94.4 ± 0.2 | 90.8 ± 0.4 | 98.1 ± 0.1 | 94.1 ± 0.0 | 67.1 ± 0.2 | 53.5 ± 0.3 |
Table 2. Effectiveness Evaluation of Various MoE Routing.

| Methods | Epilepsy Accuracy | Epilepsy MF1-Score | Power Accuracy | Power MF1-Score |
|---|---|---|---|---|
| 1% of labeled data | | | | |
| CA-TCC | 63.4 ± 0.4 | 49.3 ± 0.7 | 69.9 ± 0.4 | 67.9 ± 0.6 |
| E-PACL(2) | 63.2 ± 0.3 | 48.8 ± 0.5 | 70.4 ± 0.5 | 68.2 ± 0.4 |
| ExT-PACL(2) | 64.1 ± 0.2 | 50.1 ± 0.8 | 70.4 ± 0.6 | 68.8 ± 0.4 |
| E-PACL(4) | 64.1 ± 0.2 | 49.7 ± 0.5 | 70.2 ± 0.6 | 70.2 ± 0.2 |
| ExT-PACL(4) | 65.6 ± 0.3 | 50.4 ± 0.5 | 71.7 ± 0.2 | 69.4 ± 0.7 |
| E-PACL(8) | 63.2 ± 0.3 | 48.2 ± 0.8 | 70.1 ± 0.2 | 67.9 ± 0.2 |
| ExT-PACL(8) | 64.2 ± 0.3 | 49.8 ± 0.5 | 71.2 ± 0.3 | 68.6 ± 0.3 |
| 5% of labeled data | | | | |
| CA-TCC | 66.4 ± 0.3 | 52.8 ± 0.3 | 76.8 ± 0.6 | 74.1 ± 0.7 |
| E-PACL(2) | 66.5 ± 0.2 | 53.0 ± 0.2 | 77.5 ± 0.4 | 75.5 ± 0.2 |
| ExT-PACL(2) | 66.9 ± 0.6 | 52.9 ± 0.5 | 77.2 ± 0.6 | 76.4 ± 0.5 |
| E-PACL(4) | 66.3 ± 0.2 | 52.3 ± 0.6 | 77.8 ± 0.6 | 76.9 ± 0.4 |
| ExT-PACL(4) | 67.1 ± 0.2 | 53.5 ± 0.3 | 78.7 ± 0.3 | 78.2 ± 0.5 |
| E-PACL(8) | 65.7 ± 0.2 | 52.9 ± 0.4 | 75.8 ± 0.6 | 76.2 ± 0.3 |
| ExT-PACL(8) | 66.2 ± 0.4 | 53.4 ± 0.5 | 76.9 ± 0.2 | 77.8 ± 0.7 |
Table 3. Training and Inference Time Complexity Analysis.

| Datasets | Methods | Params | Params (One Forward) | Train (ms/it) | Infer (ms/it) |
|---|---|---|---|---|---|
| POC | CA-TCC | 453.58 K | 703.86 M | 86.91 | 9.97 |
| POC | ExT-PACL | 520.14 K | 704.53 M | 92.21 | 10.38 |
| Power | CA-TCC | 340.81 K | 356.72 M | 61.30 | 7.63 |
| Power | ExT-PACL | 407.37 K | 359.46 M | 77.03 | 8.52 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Huang, Z.; Peng, F.; Hou, K.; Xia, D.; An, T. Expert-Transformer with Prototype-Aware Contrastive Learning for Semi-Supervised Time-Series Classification. Electronics 2026, 15, 2303. https://doi.org/10.3390/electronics15112303

