Article

Enhanced Quantum-Inspired Deep Learning with Multi-Head Attention and Contrastive Learning for Text-Based Dialogue Sentiment Classification

1 Fujian Key Laboratory of Automotive Electronics and Electric Drive, Fujian University of Technology, Fuzhou 350118, China
2 Renewable Energy Technology Research Institute of Fujian University of Technology, Fujian University of Technology, Ningde 352101, China
* Author to whom correspondence should be addressed.
Big Data Cogn. Comput. 2026, 10(5), 161; https://doi.org/10.3390/bdcc10050161
Submission received: 13 March 2026 / Revised: 7 May 2026 / Accepted: 15 May 2026 / Published: 18 May 2026
(This article belongs to the Special Issue Advances in Natural Language Processing and Text Mining: 2nd Edition)

Abstract

This study introduces the Quantum-inspired Pretrained Feature Embedding (ImprovedQPFE) model, a framework for dialogue sentiment classification. ImprovedQPFE integrates phase-pretrained complex embeddings, a bidirectional complex-valued GRU, a quantum-inspired attention mechanism, and supervised contrastive learning within a Transformer-based architecture, aiming to enhance feature discriminability under class imbalance. We evaluate ImprovedQPFE on the RECCON-DD and RECCON-IEM benchmarks under a unified and reproducible protocol, including standardized preprocessing and fixed data splits. To ensure reproducibility, all experiments were conducted using a fixed random seed of 42. The reported results are based on this single fixed-seed setting rather than averages over multiple repeated runs. The empirical results show that ImprovedQPFE achieves competitive performance and outperforms the compared baselines under the adopted experimental protocol. On the RECCON-DD dataset, ImprovedQPFE improves Macro-F1 from 80.08% to 83.75% compared with a strong non-quantum Transformer-based baseline equipped with contrastive learning. It also improves Pos-F1 while maintaining high performance for negative classes. On RECCON-IEM, ImprovedQPFE attains a leading Macro-F1 of 95.39% among the compared methods. These findings, together with an ablation analysis, support the effectiveness of the proposed quantum-inspired representation paradigm and its architectural components. However, further statistical validation with multiple repeated runs, standard deviations, confidence intervals, and significance testing remains an important direction for future work.

1. Introduction

1.1. Research Background

Natural Language Processing (NLP), a core branch of artificial intelligence, aims to enable computers to understand, interpret, and generate human language. It has long been a key focus of research. This technology has been widely applied in various downstream tasks, such as text classification [1,2], word sense disambiguation (WSD) [3,4], knowledge graphs [5], and question-answering systems [6]. In recent years, the NLP field has seen significant progress, primarily driven by the widespread adoption and successful application of deep learning methods, including convolutional neural networks (CNN) [7,8] and recurrent neural networks (RNN) [9].
Since 2018, the emergence of models such as ELMo, GPT, and BERT has marked the dawn of a new era in NLP. These models learn deep representations of language by pre-training on vast amounts of unlabeled text and are adapted to specific tasks through fine-tuning [10]. For example, the BERT (Bidirectional Encoder Representations from Transformers) model generates deep bidirectional representations by jointly training on both left and right contexts [11].
Sentiment Classification: Sentiment analysis aims to assess and identify the emotion or sentiment conveyed by textual data [12]. Transformer-based models and their variants have demonstrated excellent performance in this task [13]. For example, a hybrid architecture combining BERT, BiLSTM, and CNN layers has been employed for sentiment classification of student reviews in MOOCs, enhancing classification performance by fusing BERT’s context-aware capability, BiLSTM’s ability to capture long- and short-term dependencies, and CNN’s local feature extraction [14]. Other research has integrated BERT with Support Vector Machines (SVM) for sentiment analysis, where the two methods complement each other: BERT provides powerful language understanding, while SVM excels at classification [15].
Word Sense Disambiguation (WSD): WSD is a crucial research topic in NLP, aimed at determining the correct meaning of a word within a specific context [16]. It has broad applications in areas such as text classification, machine translation, and information retrieval [17]. Early studies focused on context-sensitive and statistical methods [18], while modern approaches leverage deep learning models. For instance, Graph Convolutional Networks (GCN) have been applied to WSD tasks, improving disambiguation accuracy by extracting discriminative features—such as words, part-of-speech tags, and semantic categories—from the context of ambiguous words.
Advances in NLP have also benefited the biomedical field, where researchers have proposed various pre-trained language models trained on biomedical datasets (e.g., text, electronic health records, protein and DNA sequences) for a range of biomedical tasks [19]. Furthermore, NLP techniques are being applied to AI-assisted programming tasks such as code generation, code completion, and code translation. As a widely researched area in NLP, text classification has seen remarkable achievements with deep learning models, including large pre-trained models like BERT and DistilBERT.
The advent of Large Language Models (LLMs) represents a revolutionary breakthrough in artificial intelligence. With unprecedented training scales and model parameters, they have significantly advanced capabilities in language understanding, synthesis, and commonsense reasoning, achieving performance levels close to human proficiency [20]. Pre-trained language models like BERT, GPT, and ERNIE are representative examples of LLMs. The evolution of NLP has progressed from word embeddings such as Word2Vec (2013) and GloVe (2014), to the introduction of the Transformer and its self-attention mechanism in 2017, culminating in the development of large multimodal models like GPT-4 in 2023 and Gemini in 2024.
Despite these significant advancements, NLP still faces challenges. The first is the “black box” nature of neural network models [21]. The millions, or even billions, of parameters in these models are difficult to interpret, limiting their application in critical areas such as medical diagnosis [22] or financial decision-making [23]. Secondly, the inherent uncertainty and ambiguity of natural language are not fully embedded in the models, which is inconsistent with how humans perceive language and may result in inadequate modeling [24]. To address these challenges, the emergence of quantum-inspired models offers a new research direction for NLP. These models, which construct neural networks based on quantum theory [25], hold the potential to enhance model interpretability, thereby improving the understanding and handling of the complexities of natural language.

1.2. Development of Quantum-Inspired Models

In recent years, researchers have extensively investigated the construction of quantum-inspired models and their application in various NLP scenarios. In 2018, Zhang et al. [26] introduced an end-to-end quantum-inspired language model for question-answering tasks, representing sentences as density matrices and proposing a joint representation for question answering. Building on this, Li et al. [27] developed a complex-valued word-embedding neural network, which defines semantic units in Hilbert space and uses complex vectors to represent words. In 2019, Li et al. proposed a novel matching model termed the Complex-valued Network for Matching (CNM), achieving performance comparable to traditional CNN and RNN baselines. In the same year, Tamburini [28] introduced a quantum-like Word Sense Disambiguation (QWSD) model based on quantum probability theory, representing words and sentences in complex domains. In 2024, Shi et al. [29] proposed QPFE and QPFE-ERNIE, which enhance quantum-like models with gated recurrent units (GRU) and incorporate attention mechanisms and CNNs, yielding improved experimental results in text classification tasks.
These studies have undoubtedly made significant contributions to the application of quantum theory in NLP, advancing the development of quantum-inspired models in the field. However, these improvements have largely overlooked a critical issue: prior knowledge is indispensable for constructing high-performance quantum-inspired models. Even though models based on quantum probability theory have been theoretically demonstrated to be suitable for natural language modeling, ELMo, BERT, GPT, and other language models have achieved great success by learning textual knowledge through pre-training tasks on large corpora. Currently, few works integrate pre-trained textual feature embeddings into quantum-inspired models, which may limit the performance enhancement of such models.
To improve the performance of quantum-inspired models on dialogue sentiment classification, we adopt a self-embedding mechanism that incorporates semantic information and other features as part of our pre-training strategy, forming our enhanced quantum-inspired model. Because the model first acquires knowledge from the dataset during pre-training, subsequent training is accelerated. We validate our model using the publicly available RECCON dataset. Figure 1 illustrates the dialogue content and emotional feedback within the RECCON-DD dataset. The RECCON-IEM dataset is similar in structure, containing analogous dialogues and emotional annotations.
The results demonstrate that our proposed method and the ImprovedQPFE model can effectively leverage the advantages of quantum-inspired models. In general, the contributions can be summarized as follows:
1. Proposed a quantum-inspired emotion recognition model that integrates a self-embedding mechanism combining complex embeddings and phase pre-training
This paper designs and implements a quantum-inspired neural network architecture based on complex embeddings and phase information pre-training. By introducing dual-channel modeling in complex space (amplitude and phase) and combining emotion-label-driven phase pre-training, the proposed model effectively enhances the representation and differentiation of complex emotional semantics and causal relationships in text.
2. Integrated multi-layer Transformer and contrastive learning mechanisms to enhance feature modeling and discriminative capabilities
The backbone of the proposed model incorporates multi-layer Transformer blocks and multi-head self-attention, combined with BiGRU for deep contextual modeling. Moreover, the model introduces contrastive learning loss and a quantum measurement module, further improving the ability to distinguish features of different classes and emotional states, significantly enhancing recognition performance for both positive and negative samples. Unlike a simple combination of existing neural modules, the proposed QPFE framework introduces a unified quantum-inspired amplitude–phase representation paradigm for text-based dialogue sentiment classification. In this framework, the amplitude component encodes semantic intensity, while the phase component captures affective tendency and contextual relational information. Moreover, the phase pre-training strategy and multi-operator quantum measurement module are specifically designed to strengthen feature separability under class imbalance. The ablation results further isolate the contribution of the quantum-inspired components by comparing QPFE with a strong non-quantum Transformer-based counterpart equipped with BiGRU and contrastive learning.
3. Achieved excellent experimental results on public datasets, verifying the effectiveness and generalization ability of the method
Extensive experiments conducted on the publicly available emotion-cause pair extraction datasets RECCON-DD and RECCON-IEM show that the proposed model outperforms mainstream baseline methods on multiple evaluation metrics, including macro-F1, positive-class F1, and negative-class F1. The experiments also demonstrate that the model exhibits promising stability in addressing practical challenges such as class imbalance, complex contextual scenarios, and multi-granularity emotion analysis.

2. Related Work

2.1. Motivation for Quantum-Inspired Neural Networks in Dialogue Emotion Recognition

Traditional methods for Emotion Recognition in Conversation (ERC) have primarily relied on real-valued neural network architectures, such as RNNs and GNNs, which face significant limitations when dealing with the complexity and ambiguity of emotions. Emotional expressions in conversations often manifest as continuous, gradual states rather than clearly separated discrete categories. Consequently, existing representation methods based on real-valued spaces struggle to simultaneously capture explicit semantic information (such as word meaning) and implicit emotional nuances (such as tone and contextual dependencies). Therefore, it is foreseeable that more expressive representation methods are required to enhance the recognition of multi-level emotional states in sentiment recognition tasks.
Quantum-inspired complex-valued representations provide a unique perspective for addressing the above challenges. By embedding dialogue units (e.g., words, sentences, or utterances) into the complex domain, we can utilize the amplitude component to encode semantic intensity and the phase component to capture emotional tendencies and contextual dependencies. This dual-channel encoding mechanism is particularly suitable for dialogue scenarios involving complex semantic and emotional interactions. For instance, surface-level semantic content may convey explicit emotions, while underlying tone and contextual associations may imply another emotional state. Such representations, based on complex-valued spaces, not only capture the dynamic nonlinear changes in emotions but also enhance the model’s expressiveness and robustness in representing both explicit and implicit emotional features [30,31].
Although recent approaches such as DER-GCN [32] and COGMEN [33] have made progress in ERC tasks, particularly in modeling speaker dependencies and long-range contextual relationships, these methods remain confined to representation learning within real-valued embedding spaces, limiting their ability to fully capture the complexity of emotions. Quantum-inspired techniques extend the representational capacity of traditional architectures by introducing complex-valued computations, enabling emotion recognition models to learn emotional dynamics in higher-dimensional spaces, thereby overcoming the bottlenecks of conventional approaches in emotion modeling [34,35].

2.2. The RECCON-DD Dataset and Dialogue Emotion Recognition

The RECCON-DD dataset is constructed based on DailyDialog and is specifically designed for dialogue emotion recognition tasks. A key feature of this dataset is its provision of rich conversational contextual information, with each utterance annotated with corresponding emotion labels, enabling models to learn emotion recognition in realistic conversational scenarios. Unlike traditional single-sentence sentiment analysis, RECCON-DD requires models to understand the continuity and contextual dependencies in dialogue.
Emotion recognition on the RECCON-DD dataset poses several distinct challenges. First, there is the issue of emotion category imbalance, where certain emotion categories (such as happiness and sadness) appear more frequently in conversations, while others are relatively scarce. Second, context dependency is prominent, as the emotion of an utterance often relies on dialogue history and speaker state. Third, fine-grained emotion differentiation is required [36]; models need to accurately distinguish between similar yet distinct emotional states.
Traditional methods based on BERT and RoBERTa have achieved some success on RECCON-DD, but they primarily rely on pre-trained language representations and may lack a deep understanding of emotional dynamics and dialogue structure. Therefore, recent research has begun to explore more specialized architectures, such as those combining graph neural networks, memory networks, and attention mechanisms, to address the complexity of dialogue emotion recognition [37].

2.3. Complex-Valued Neural Networks and Quantum-Inspired Representation Learning

Complex-valued neural networks provide a significant extension to traditional real-valued networks by introducing computation in the complex domain, thereby enhancing the representational capacity of models. In dialogue emotion recognition tasks, the advantage of complex-valued representations lies primarily in the ability to simultaneously encode explicit and implicit semantic information. The real part is typically used to represent directly observable semantic features, while the imaginary part captures more abstract emotional and contextual relationships. Quantum-inspired embedding methods represent words in complex form, with the amplitude component encoding semantic intensity and the phase component encoding semantic direction or emotional tendency. This dual encoding mechanism allows the model to process multi-level semantic information within the same representation space. In dialogue scenarios, this representation is particularly effective because emotions in conversations often carry multiple meanings and exhibit gradual transitions.
When processing complex-valued inputs, traditional recurrent neural networks require corresponding extensions. The complex-valued GRU separately handles the real and imaginary parts for state updates while maintaining interaction between the two, enabling effective modeling of complex-valued sequences. This approach can capture richer sequential dynamic information while maintaining computational efficiency.
Positional encoding in complex-valued networks also requires special consideration. The traditional sinusoidal positional encoding can be extended to the complex domain by representing positional information in complex exponential form via Euler’s formula. This encoding not only preserves the relative relationships of positions but also provides the model with additional phase information to distinguish semantic features at different positions.
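The complex-exponential form of positional encoding described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `complex_positional_encoding` and the frequency schedule `base ** (-k / d)` are assumptions borrowed from the usual sinusoidal scheme.

```python
import cmath

def complex_positional_encoding(pos, d, base=10000.0):
    """Return d unit-modulus complex numbers exp(i * pos * w_k).

    The frequency schedule w_k = base ** (-k / d) is an assumed choice.
    """
    return [cmath.exp(1j * pos * base ** (-k / d)) for k in range(d)]

pe3 = complex_positional_encoding(3, 4)
pe5 = complex_positional_encoding(5, 4)

# Multiplying by the conjugate cancels the absolute position and leaves
# only the phase of the offset (5 - 3 = 2) in every dimension.
relative = [a * b.conjugate() for a, b in zip(pe5, pe3)]
offset2 = complex_positional_encoding(2, 4)
```

Each entry has modulus 1, so position is carried purely by phase, and the relative offset between two positions can be read off as a phase difference.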

2.4. Contrastive Learning and Multi-Task Optimization

Contrastive learning, as an unsupervised representation learning method, learns meaningful feature representations by maximizing the similarity of positive sample pairs and minimizing that of negative sample pairs. In the task of dialogue emotion recognition, the application of contrastive learning requires the incorporation of supervisory information from emotion labels, thus forming a supervised contrastive learning framework. A multi-task learning framework combines contrastive learning with the primary classification task, enhancing model performance through the joint optimization of two objective functions. The advantage of this approach lies in the ability of contrastive learning to provide better feature representations, while the classification task offers explicit supervisory signals. The weighted combination of loss functions requires careful tuning to ensure that the two tasks mutually reinforce rather than interfere with each other.
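A minimal sketch of a supervised contrastive term and its weighted combination with a classification loss is given below. It follows the common SupCon formulation and is not taken from the paper; the temperature value and the weight `lam` are hypothetical hyperparameters.

```python
import math

def supervised_contrastive_loss(features, labels, temperature=0.1):
    """SupCon-style loss over feature vectors given as lists of floats."""
    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        return [x / norm for x in v]

    z = [normalize(v) for v in features]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # an anchor with no same-class partner contributes nothing
        sims = {j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
                for j in range(n) if j != i}
        log_denom = math.log(sum(math.exp(s) for s in sims.values()))
        total += -sum(sims[j] - log_denom for j in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0

def joint_loss(ce_loss, con_loss, lam=0.5):
    """Weighted sum of the classification and contrastive objectives (lam is assumed)."""
    return ce_loss + lam * con_loss

feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
loss_aligned = supervised_contrastive_loss(feats, [0, 0, 1, 1])
loss_mixed = supervised_contrastive_loss(feats, [0, 1, 0, 1])
```

When the labels agree with the geometric clusters (`loss_aligned`), the loss is much smaller than when same-class samples lie far apart (`loss_mixed`), which is the pressure that pulls same-label features together during joint training.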
Data augmentation plays a crucial role in contrastive learning. For textual data, common augmentation methods include random masking, synonym replacement, and sentence reordering. In dialogue emotion recognition, maintaining the consistency of emotional semantics is a key challenge for data augmentation, necessitating specially designed augmentation strategies to avoid altering the original emotion labels [38].

2.5. Hybrid Architectures and Quantum-Inspired Transformers

The traditional Transformer architecture enables effective modeling of long-range dependencies through self-attention mechanisms, yet it still exhibits certain limitations when handling complex emotional dynamics and semantic entanglements. To address these issues, researchers have begun exploring hybrid approaches that integrate quantum-inspired concepts with Transformer architectures. The multi-head attention mechanism provides a natural framework for quantum-inspired extensions, as each attention head can be viewed as a distinct measurement operator applied to quantum states, allowing features to be observed and extracted from different perspectives. In dialogue emotion recognition, this multi-perspective feature extraction is particularly important, as emotional information may be embedded in utterances in various forms.
Recent research efforts have focused on integrating complex-valued representations into the Transformer architecture. Key challenges in this integration include: (1) how to process complex-valued operations while maintaining computational efficiency; (2) how to design attention mechanisms suitable for complex-valued inputs; and (3) how to perform effective positional encoding in the complex-valued space. Some studies have proposed sparse attention patterns to reduce the complexity of complex-valued computations while preserving model performance.
The application of layer normalization in complex-valued networks is also an important research direction. Traditional layer normalization needs to be adapted to the characteristics of complex-valued inputs, particularly by considering different normalization strategies for amplitude and phase information. Such improved normalization methods have been shown to significantly enhance training stability and convergence.

2.6. Enhanced Quantum-Inspired Architecture Design

To address the specific requirements of dialogue emotion recognition, the enhanced quantum-inspired model adopts a multi-level architectural design. The core idea of this architecture is to simultaneously process semantic content and emotional information through complex-valued representations, where the real part encodes explicit lexical semantics, and the imaginary part captures implicit emotional tones and contextual dependencies.
Phase pre-training represents a key innovation in this architecture. By pre-training a dedicated phase extractor, the model learns phase patterns associated with different emotions. This pre-training process uses emotion labels as supervisory signals, enabling the model to map different emotional states to distinct phase intervals. The resulting phase embeddings are subsequently used to initialize the phase parameters of the main model, providing a better starting point for training.
The multiple measurement mechanism simulates the process of quantum measurement, extracting real-valued features from complex-valued states via multiple distinct measurement operators. Each operator focuses on different feature dimensions, akin to the concept of multi-head attention. The measurement results are then weighted and combined through an attention mechanism, allowing the model to adaptively select the most relevant features. This design enhances the model’s sensitivity to different types of emotional expressions.
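As a rough illustration of the measurement idea, the sketch below assumes each operator is a rank-one projector onto a unit vector, so that the measured value is the squared modulus of an inner product, and combines the results with softmax weights. The projector form, the function names, and the use of plain logits for the attention weights are assumptions; this section does not specify the exact operator form.

```python
import math

def measure(state, operators):
    """Return |<m_k|psi>|^2 for each measurement vector m_k (assumed rank-one projectors)."""
    norm = math.sqrt(sum(abs(a) ** 2 for a in state))
    psi = [a / norm for a in state]
    probs = []
    for m in operators:
        inner = sum(mk.conjugate() * a for mk, a in zip(m, psi))
        probs.append(abs(inner) ** 2)
    return probs

def attention_combine(values, logits):
    """Softmax-weighted sum of the measurement results."""
    mx = max(logits)
    weights = [math.exp(l - mx) for l in logits]
    total = sum(weights)
    return sum(v * w / total for v, w in zip(values, weights))

s = 1.0 / math.sqrt(2.0)
state = [1 + 0j, 1j]                       # equal amplitudes, 90-degree relative phase
operators = [
    [1 + 0j, 0j],                          # reads the first component only
    [0j, 1 + 0j],                          # reads the second component only
    [s + 0j, s * 1j],                      # matches the relative phase of the state
    [s + 0j, -s * 1j],                     # opposite relative phase
]
probs = measure(state, operators)
combined = attention_combine(probs, [0.0, 0.0, 0.0, 0.0])
```

The first two operators see only the amplitudes and return the same value, while the last two differ solely in phase and return very different values. This shows how distinct operators can expose phase information that amplitude-only features miss.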
The BiGRU extends the traditional GRU architecture by separately processing the real and imaginary parts of complex-valued inputs, thereby preserving the evolution of quantum states. The bidirectional mechanism ensures that the model can simultaneously leverage both forward and backward contextual information, which is particularly important for understanding the trajectory of emotions in dialogues. The incorporation of residual connections and layer normalization further improves the training stability and convergence of deep complex-valued networks.

3. Methodology

3.1. Problem Formalization

Task: Text-Based Dialogue Sentiment Classification

Given a dialogue utterance sequence $X = \{x_1, \ldots, x_n\}$, the goal is to predict its sentiment class $y \in \{1, \ldots, C\}$. This study considers text input only; no visual or acoustic signals are used.

3.2. Overall Model Architecture

Figure 2 can be read from left to right as a four-stage pipeline: representation, context modeling, feature refinement, and classification. Tokens are first mapped to amplitude–phase complex embeddings, then contextualized by BiGRU and attention-based blocks, compressed by the measurement module, and finally optimized with classification and contrastive objectives.
  • Intuitively, the model first builds a phase-aware representation for each token, then uses recurrent and attention modules to determine how surrounding utterances modify that representation, and finally extracts a compact feature for sentiment prediction.
  • Specifically, the architecture contains: (1) a complex embedding layer; (2) a bidirectional recurrent encoder; (3) multi-head attention and stacked Transformer blocks; (4) a quantum measurement readout; and (5) a classification head trained with cross-entropy and supervised contrastive loss.

3.3. Enhanced Complex Embedding Layer

3.3.1. Complex Embedding Design

The embedding layer is the representation entry point of the model. Each token is mapped to a complex vector whose magnitude encodes semantic strength and whose phase encodes affective or relational tendency. This amplitude–phase split is the basic quantum-inspired representation used throughout the manuscript.
Amplitude Embedding: The amplitude component $r_i \in \mathbb{R}^{d}$ is obtained via a trainable real-valued embedding matrix $E_{\mathrm{amp}} \in \mathbb{R}^{|V| \times d}$, where $|V|$ denotes the vocabulary size. The amplitude embedding undergoes LayerNorm normalization to ensure numerical stability:
$$r_i = \mathrm{LayerNorm}\left(E_{\mathrm{amp}}[x_i]\right)$$
The amplitude embedding primarily encodes the semantic intensity information of words, analogous to the semantic representation in traditional word embeddings.
Phase Embedding: The phase component $\theta_i \in \mathbb{R}^{d}$ is obtained via an independent phase embedding matrix $E_{\mathrm{phase}} \in \mathbb{R}^{|V| \times d}$, and pre-trained phase parameters $\theta_{\mathrm{pretrained}}$ can be introduced:
$$\theta_i = E_{\mathrm{phase}}[x_i] \cdot \alpha_{\mathrm{scale}}$$
The phase embedding is initialized within the range $[-\pi, \pi]$, where $\alpha_{\mathrm{scale}}$ is a learnable phase scaling parameter. If pretrained phase parameters exist, these pretrained values are directly utilized. The phase information encodes implicit relational structures and semantic similarities between words.
Complex Representation: The final complex-valued embedding is formed by combining the amplitude and phase components via Euler’s formula:
$$e_i = r_i \odot e^{\mathrm{i}\theta_i} = r_i \odot \left(\cos\theta_i + \mathrm{i}\sin\theta_i\right)$$
In practice, the complex number is represented as a combination of its real and imaginary parts:
$$\mathrm{Re}(e_i) = r_i \odot \cos(\theta_i), \qquad \mathrm{Im}(e_i) = r_i \odot \sin(\theta_i)$$
where $\odot$ denotes element-wise multiplication and $\mathrm{i}$ is the imaginary unit. The final complex embedding has a shape of [batch_size, seq_len, d, 2], with the last dimension representing the real and imaginary parts, respectively.
Quantum State Representation: The complex embedding $e_i$ for each word can be viewed as a quantum state $|\psi_i\rangle$, with a probability amplitude of $r_i$ and a phase of $\theta_i$. This representation allows the model to leverage the principle of quantum superposition to represent multiple semantic states simultaneously.
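The amplitude and phase composition described in this subsection can be sketched in a few lines. The function name `complex_embedding` and the toy input values are hypothetical; in the model the amplitude and phase would come from the trainable embedding tables rather than from hand-written lists.

```python
import math

def complex_embedding(amplitude, phase, alpha_scale=1.0):
    """Combine per-dimension amplitude and phase into (real, imag) parts via Euler's formula."""
    real = [r * math.cos(alpha_scale * t) for r, t in zip(amplitude, phase)]
    imag = [r * math.sin(alpha_scale * t) for r, t in zip(amplitude, phase)]
    return real, imag

amplitude = [1.0, 2.0, 0.5]                 # toy values standing in for a normalized amplitude row
phase = [0.0, math.pi / 2, math.pi]         # toy phases within [-pi, pi]
real, imag = complex_embedding(amplitude, phase)

# The modulus of each complex entry recovers the absolute amplitude.
modulus = [math.hypot(a, b) for a, b in zip(real, imag)]
```

Stacking `real` and `imag` along a trailing axis of size 2 gives the per-token layout described above.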

3.3.2. Positional Encoding Integration

To enhance sequence modeling capability, sinusoidal positional encoding is incorporated into the amplitude embedding. For a position pos and a dimension i, the positional encoding is defined as:
$$PE_{(pos,\,2i)} = \sin\left(\frac{pos}{10000^{2i/d_{\mathrm{model}}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{\mathrm{model}}}}\right)$$
The positional encoding is directly added to the amplitude embedding:
$$r_i^{\mathrm{final}} = r_i + PE_{pos}$$
This design enables the model to perceive the positional information of words within the sequence while preserving the integrity of the complex-valued representation.
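For reference, the sinusoidal encoding and its addition to the amplitude can be written as below. This is a plain-Python sketch of the standard formula; the helper names and the toy amplitude vector are illustrative only.

```python
import math

def sinusoidal_pe(pos, d_model):
    """Standard sinusoidal positional encoding for a single position."""
    pe = [0.0] * d_model
    for even in range(0, d_model, 2):
        # `even` plays the role of 2i in the formula, so the exponent is 2i / d_model.
        angle = pos / (10000 ** (even / d_model))
        pe[even] = math.sin(angle)
        if even + 1 < d_model:
            pe[even + 1] = math.cos(angle)
    return pe

def add_position(amplitude, pos):
    """Add the positional encoding to an amplitude vector, element by element."""
    return [r + p for r, p in zip(amplitude, sinusoidal_pe(pos, len(amplitude)))]

pe0 = sinusoidal_pe(0, 4)
pe1 = sinusoidal_pe(1, 4)
r_final = add_position([0.2, 0.2, 0.2, 0.2], 0)
```

Only the amplitude channel is shifted; the phase channel is left untouched, which matches the statement that the encoding is added to the amplitude embedding.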

3.4. BiGRU Architecture

BiGRU Cell Design

BiGRU extends the traditional GRU to the complex domain, achieving the evolution of quantum states by separately processing the real and imaginary parts of complex numbers. For each time step $t$ in the input sequence, we process the complex-valued embedding $x_t \in \mathbb{C}^{d}$.
Complex Separation Processing: The complex input is first separated into its real and imaginary parts:
$$x_t^{\mathrm{real}} = \mathrm{Re}(x_t), \qquad x_t^{\mathrm{imag}} = \mathrm{Im}(x_t)$$
Processing by BiGRU: The real and imaginary parts are processed separately by bidirectional GRUs. For the forward GRU, the calculations for the update gate, reset gate, and candidate hidden state are as follows:
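$$\begin{aligned}
z_t^{\mathrm{real}} &= \sigma\left(W_z^{\mathrm{real}}\left[h_{t-1}^{\mathrm{real}}, x_t^{\mathrm{real}}\right] + b_z^{\mathrm{real}}\right) \\
r_t^{\mathrm{real}} &= \sigma\left(W_r^{\mathrm{real}}\left[h_{t-1}^{\mathrm{real}}, x_t^{\mathrm{real}}\right] + b_r^{\mathrm{real}}\right) \\
\tilde{h}_t^{\mathrm{real}} &= \tanh\left(W_h^{\mathrm{real}}\left[r_t^{\mathrm{real}} \odot h_{t-1}^{\mathrm{real}}, x_t^{\mathrm{real}}\right] + b_h^{\mathrm{real}}\right) \\
h_t^{\mathrm{real}} &= \left(1 - z_t^{\mathrm{real}}\right) \odot h_{t-1}^{\mathrm{real}} + z_t^{\mathrm{real}} \odot \tilde{h}_t^{\mathrm{real}}
\end{aligned}$$
The imaginary part undergoes the same calculation process using independent parameter matrices $W_z^{\mathrm{imag}}$, $W_r^{\mathrm{imag}}$, $W_h^{\mathrm{imag}}$. This separate processing allows the model to independently learn the temporal patterns of the real and imaginary parts.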
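The gate equations above can be sketched as a single update step applied to one channel at a time. The dictionary-based parameter layout and the helper names are assumptions made for illustration; the all-zero weights are used only to make the arithmetic easy to follow.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def gru_step(h_prev, x, p):
    """One GRU update for a single channel (real or imaginary part)."""
    hx = h_prev + x                                   # concatenation [h_{t-1}, x_t]
    z = [sigmoid(a + b) for a, b in zip(matvec(p["Wz"], hx), p["bz"])]
    r = [sigmoid(a + b) for a, b in zip(matvec(p["Wr"], hx), p["br"])]
    rhx = [ri * hi for ri, hi in zip(r, h_prev)] + x  # [r_t * h_{t-1}, x_t]
    h_tilde = [math.tanh(a + b) for a, b in zip(matvec(p["Wh"], rhx), p["bh"])]
    return [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h_prev, h_tilde)]

def zero_params(hidden, inp):
    """Hypothetical parameter set with all weights and biases at zero."""
    mat = lambda: [[0.0] * (hidden + inp) for _ in range(hidden)]
    vec = lambda: [0.0] * hidden
    return {"Wz": mat(), "bz": vec(), "Wr": mat(), "br": vec(), "Wh": mat(), "bh": vec()}

# Independent parameter sets for the two channels, as stated in the text.
h_real = gru_step([1.0, -2.0], [0.3], zero_params(2, 1))
h_imag = gru_step([0.0, 4.0], [0.7], zero_params(2, 1))
```

With zero weights the update gate equals 0.5 and the candidate state is 0, so each channel simply halves its previous hidden state. That makes it easy to check that the interpolation in the last equation is wired correctly.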
Residual Connection and Normalization: To enhance gradient flow and training stability, we introduce residual connections and layer normalization:
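$$\begin{aligned}
h_t^{\mathrm{real}} &= \mathrm{LayerNorm}\left(h_t^{\mathrm{real}} + W_{\mathrm{proj}}^{\mathrm{real}}\, x_t^{\mathrm{real}}\right) \\
h_t^{\mathrm{imag}} &= \mathrm{LayerNorm}\left(h_t^{\mathrm{imag}} + W_{\mathrm{proj}}^{\mathrm{imag}}\, x_t^{\mathrm{imag}}\right)
\end{aligned}$$
where $W_{\mathrm{proj}}$ is a projection matrix for dimension matching.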
Quantum State Reconstruction: The processed real and imaginary parts are recombined into a complex quantum state:
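$$h_t = h_t^{\mathrm{real}} + \mathrm{i}\, h_t^{\mathrm{imag}}$$
In implementation, we use a stacked representation:
$$h_t = \left[h_t^{\mathrm{real}}; h_t^{\mathrm{imag}}\right] \in \mathbb{R}^{d \times 2}$$
The BiGRU considers both forward and backward context information simultaneously. For each time step $t$, the forward GRU processes information from $t = 1$ to $t$, and the backward GRU processes information from $t = n$ to $t$:
$$\overrightarrow{h_t} = \mathrm{QuantumGRU}_{\mathrm{forward}}(x_1, \ldots, x_t), \qquad \overleftarrow{h_t} = \mathrm{QuantumGRU}_{\mathrm{backward}}(x_n, \ldots, x_t)$$
The final bidirectional hidden state is obtained by concatenation:
$$h_t = \left[\overrightarrow{h_t}; \overleftarrow{h_t}\right] \in \mathbb{C}^{2d}$$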
This bidirectional design enables the model to capture both forward and backward semantic dependencies simultaneously, which is crucial for understanding emotional causality in dialogues.
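The bidirectional pass can be sketched independently of the cell itself. In the snippet below, `running_sum` is a hypothetical stand-in for the recurrent cell, chosen so that the covered range of inputs is visible in the output; it is not part of the model.

```python
def run_bidirectional(xs, step, h0):
    """Run `step` left-to-right and right-to-left, then concatenate per time step."""
    forward, h = [], h0
    for x in xs:
        h = step(h, x)
        forward.append(h)

    backward, h = [None] * len(xs), h0
    for t in range(len(xs) - 1, -1, -1):
        h = step(h, xs[t])
        backward[t] = h

    # List concatenation plays the role of [h_forward; h_backward].
    return [f + b for f, b in zip(forward, backward)]

def running_sum(h, x):
    """Toy cell: the state accumulates every input seen so far."""
    return [h[0] + x]

states = run_bidirectional([1, 2, 3], running_sum, [0])
```

At each position the first entry summarizes inputs from the start up to that position and the second summarizes inputs from the end back to it, mirroring the forward and backward definitions above.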

3.5. Multi-Head Self-Attention Mechanism

Quantum State Attention Computation

The multi-head self-attention mechanism operates on quantum state sequences, capturing different types of semantic relationships by learning multiple attention subspaces in parallel.
Complex Feature Flattening: First, flatten the complex quantum state $h_t \in \mathbb{C}^{d}$ (shape [batch, seq_len, d, 2]) into a real-valued vector:
$$x_t^{\mathrm{flat}} = \left[\mathrm{Re}(h_t); \mathrm{Im}(h_t)\right] \in \mathbb{R}^{2d}$$
Query, Key, Value Generation: Generate Query (Q), Key (K), and Value (V) matrices from the flattened complex features via linear transformations:
$$Q = X_{\mathrm{flat}} W_Q, \qquad K = X_{\mathrm{flat}} W_K, \qquad V = X_{\mathrm{flat}} W_V$$
where $W_Q, W_K, W_V \in \mathbb{R}^{2d \times d_{\mathrm{model}}}$ are learnable weight matrices, and $d_{\mathrm{model}}$ is the model dimension.
Multi-Head Splitting: Split $Q, K, V$ into $h$ heads (in this paper, $h = 8$), with each head’s dimension being $d_k = d_{\mathrm{model}} / h$:
$$Q_i = Q W_i^{Q}, \qquad K_i = K W_i^{K}, \qquad V_i = V W_i^{V}$$
where $Q_i, K_i, V_i \in \mathbb{R}^{\mathrm{batch} \times \mathrm{seq\_len} \times d_k}$, and $W_i^{Q}, W_i^{K}, W_i^{V}$ are the projection matrices for the $i$-th head.
Attention Score Calculation: For each attention head i , calculate the attention score matrix:
$$\mathrm{scores}_i = \frac{Q_i K_i^{T}}{\sqrt{d_k}}$$
The scaling factor $\sqrt{d_k}$ prevents excessively large dot product values that could cause softmax gradient vanishing. If an attention mask $M$ exists (for handling padding positions), apply the mask:
$$\mathrm{scores}_i = \mathrm{scores}_i + M \cdot (-\infty)$$
Attention Weights and Output: Compute attention weights via the softmax function:
Attention i = softmax ( scores i )
Multi-Head Concatenation and Output Projection: Concatenate the outputs of all attention heads and obtain the final output via an output projection matrix W O :
$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)\, W^{O}$$
where $W^{O} \in \mathbb{R}^{d_{\text{model}} \times d_{\text{model}}}$. The final output is projected back to the complex representation form via a linear layer:
$$\text{Output} = \text{MultiHead}(Q, K, V)\, W_{\text{out}} \in \mathbb{R}^{\text{batch} \times \text{seq\_len} \times d \times 2}$$
Advantage of Quantum State Attention: Compared to standard attention, quantum state attention can leverage the phase information of complex representations to better capture implicit relationships between words. The phase difference $\Delta\theta = \theta_i - \theta_j$ encodes the semantic similarity between words $i$ and $j$, enabling the attention mechanism to more accurately identify emotional keywords and causal relationships.
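The attention steps above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `multi_head_attention` and its arguments are hypothetical, head splitting is done by reshaping the projected matrices (a common equivalent of per-head projections), the mask is applied with a large negative constant instead of $-\infty$, and the final projection back to shape [batch, seq_len, d, 2] is omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the maximum for numerical stability before exponentiating.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def multi_head_attention(h_complex, w_q, w_k, w_v, w_o, num_heads, pad_mask=None):
    """h_complex: [batch, seq_len, d, 2] (last axis = real, imaginary).
    pad_mask: [batch, seq_len], 1 for real tokens and 0 for padding (assumed layout)."""
    batch, seq_len, d, _ = h_complex.shape
    # Flatten complex features: [Re; Im] -> [batch, seq_len, 2d].
    x_flat = np.concatenate([h_complex[..., 0], h_complex[..., 1]], axis=-1)
    q, k, v = x_flat @ w_q, x_flat @ w_k, x_flat @ w_v  # [batch, seq_len, d_model]
    d_model = q.shape[-1]
    d_k = d_model // num_heads

    def split(t):
        # [batch, seq_len, d_model] -> [batch, heads, seq_len, d_k]
        return t.reshape(batch, seq_len, num_heads, d_k).transpose(0, 2, 1, 3)

    q, k, v = split(q), split(k), split(v)
    scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(d_k)  # [batch, heads, seq, seq]
    if pad_mask is not None:
        # Padded key positions get a very negative score, hence ~0 weight.
        scores = np.where(pad_mask[:, None, None, :] == 1, scores, -1e9)
    weights = softmax(scores, axis=-1)
    heads = weights @ v                                   # [batch, heads, seq, d_k]
    concat = heads.transpose(0, 2, 1, 3).reshape(batch, seq_len, d_model)
    return concat @ w_o, weights

rng = np.random.default_rng(42)
h = rng.normal(size=(2, 5, 4, 2))                 # d = 4, so 2d = 8
w_q, w_k, w_v = (rng.normal(size=(8, 16)) for _ in range(3))
w_o = rng.normal(size=(16, 16))
mask = np.array([[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]])
out, attn = multi_head_attention(h, w_q, w_k, w_v, w_o, num_heads=8, pad_mask=mask)
print(out.shape, attn.shape)
```

In this toy setting each attention row sums to one and the two padded key positions of the first sequence receive essentially zero weight.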

3.6. Quantum Transformer Block

3.6.1. Architecture Design

The Quantum Transformer block adapts the standard Transformer architecture to the complex domain. Each block contains two main sub-layers—a multi-head self-attention sub-layer and a position-wise feed-forward network sub-layer—each equipped with residual connections and layer normalization.
The position-wise feed-forward network (FFN) performs a nonlinear transformation on the quantum state. First, flatten the complex features:
$$x_1^{\text{flat}} = [\operatorname{Re}(x_1); \operatorname{Im}(x_1)]$$
Then apply a two-layer linear transformation with a GELU activation function:
$$\text{FFN}(x_1^{\text{flat}}) = W_2\, \text{GELU}(W_1 x_1^{\text{flat}} + b_1) + b_2$$
where $W_1 \in \mathbb{R}^{2d \times 4d}$, $W_2 \in \mathbb{R}^{4d \times 2d}$ are weight matrices, and $b_1, b_2$ are bias terms. The GELU activation function is defined as:
$$\text{GELU}(x) = x \cdot \Phi(x) = x \cdot \frac{1}{2}\left[1 + \operatorname{erf}\left(\frac{x}{\sqrt{2}}\right)\right]$$
Finally, the second residual connection and layer normalization are applied:
$$x_2 = \text{LayerNorm}\left(x_1^{\text{flat}} + \text{Dropout}(\text{FFN}(x_1^{\text{flat}}))\right)$$
Output x 2 is reshaped into its complex representation, serving as input to the next Transformer block.
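The feed-forward sub-layer can be illustrated with a short NumPy sketch. It is written under assumptions that are not taken from the paper: dropout is omitted (as at inference time), layer normalization has no learnable gain or bias, inputs are row vectors (so the product is written `x @ w1`), and the names `gelu`, `layer_norm`, and `ffn_sublayer` are hypothetical.

```python
import math
import numpy as np

def gelu(x):
    # Exact GELU: x * Phi(x), with Phi the standard normal CDF.
    erf = np.vectorize(math.erf)
    return x * 0.5 * (1.0 + erf(x / math.sqrt(2.0)))

def layer_norm(x, eps=1e-5):
    # Normalize over the feature axis (no learnable scale/shift in this sketch).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def ffn_sublayer(x_flat, w1, b1, w2, b2):
    """x_flat: [..., 2d]; w1: [2d, 4d]; w2: [4d, 2d]."""
    hidden = gelu(x_flat @ w1 + b1)          # expand to 4d and apply GELU
    ffn_out = hidden @ w2 + b2               # project back to 2d
    return layer_norm(x_flat + ffn_out)      # residual connection + LayerNorm

rng = np.random.default_rng(42)
d = 4
x = rng.normal(size=(3, 2 * d))
w1, b1 = rng.normal(size=(2 * d, 4 * d)), np.zeros(4 * d)
w2, b2 = rng.normal(size=(4 * d, 2 * d)), np.zeros(2 * d)
x2 = ffn_sublayer(x, w1, b1, w2, b2)
print(x2.shape)
```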

3.6.2. Residual Connection Adaptation

Residual connections for complex features require special handling. Since complex numbers consist of real and imaginary parts, residual connections are performed in the flattened real-valued space:
$$\text{Output} = \text{LayerNorm}\left(x^{\text{flat}} + \text{Sublayer}(x^{\text{flat}})\right)$$
where $x^{\text{flat}} = [\operatorname{Re}(x); \operatorname{Im}(x)]$. This design ensures (1) Gradient Flow: Residual connections provide a direct path for gradient propagation, alleviating the vanishing gradient problem in deep networks; (2) Feature Preservation: Allows the model to retain original quantum state information while learning incremental improvements; (3) Training Stability: Layer normalization ensures stable feature distribution and accelerates convergence. Multiple Transformer blocks (3 layers in this paper) are stacked, with the output of each layer serving as the input to the next:
$$x^{(l+1)} = \text{TransformerBlock}^{(l)}\left(x^{(l)}\right)$$
This deep architecture enables the model to extract and refine feature representations layer by layer, from low-level local patterns to high-level global semantic relationships.

3.7. Enhanced Quantum Measurement Mechanism

3.7.1. Multiple Measurement Operator Design

The quantum measurement mechanism maps quantum states to an observable classical feature space, following the Born rule in quantum mechanics. We use multiple measurement operators to capture feature information from different dimensions.
Quantum State Representation: The input quantum state $|\psi\rangle$ is represented by a complex sequence with shape [batch, seq_len, d, 2]:
$$|\psi\rangle = \sum_{j=1}^{d} (a_j + i\, b_j)\, |j\rangle$$
where $a_j = \operatorname{Re}(\psi_j)$ and $b_j = \operatorname{Im}(\psi_j)$ are the real and imaginary parts of the amplitude of the $j$-th basis state, respectively.
Probability Amplitude Calculation: According to quantum mechanics principles, the measurement probability for each basis state is the squared modulus of its probability amplitude:
$$|\psi_j|^2 = a_j^2 + b_j^2$$
representing the probability of measuring the $j$-th basis state.
Multiple Measurement Operators: We design $M$ different measurement operators (in this paper, $M = 3$), $M_1, M_2, \ldots, M_M$, each implemented via a linear transformation:
$$M_i : \mathbb{R}^{2d} \rightarrow \mathbb{R}^{d_{\text{measure}}}$$
Specifically, each measurement operator is defined as:
$$f_i(|\psi\rangle) = W_i^{\text{measure}} \cdot [\operatorname{Re}(|\psi\rangle); \operatorname{Im}(|\psi\rangle)] + b_i^{\text{measure}}$$
where $W_i^{\text{measure}} \in \mathbb{R}^{d_{\text{measure}} \times 2d}$ denotes the learnable weight matrix and $b_i^{\text{measure}}$ denotes the bias term.

3.7.2. Measurement Probability Interpretation

According to the Born rule, the observation probability of the measurement operator $M_i$ acting on quantum state $|\psi\rangle$ is:
$$P(m_i) = |\langle m_i | \psi \rangle|^2 = \operatorname{Tr}(M_i \rho)$$
where $\rho = |\psi\rangle\langle\psi|$ is the density matrix. In the implementation, we calculate it as follows:
Density Matrix Representation: For each position in the sequence, the density matrix is:
$$\rho_j = |\psi_j\rangle\langle\psi_j| = \begin{pmatrix} a_j^2 & a_j b_j \\ a_j b_j & b_j^2 \end{pmatrix}$$
Measurement Probability: The observation probability for measurement operator $M_i$ is calculated as:
$$P_i = \operatorname{Tr}(M_i \rho) = \sum_{j=1}^{d} \operatorname{Tr}(M_i \rho_j)$$
In practical implementation, we use a simplified calculation:
$$P_i = \text{softmax}\left(W_i^{\text{measure}} \cdot [a; b]\right)$$
where $a = [a_1, \ldots, a_d]$ and $b = [b_1, \ldots, b_d]$ are the real and imaginary part vectors, respectively.
Attention-Weighted Pooling: The probability distribution is used for sequence-level attention-weighted pooling:
$$\alpha_{\text{seq}} = \text{softmax}\left(\frac{1}{d} \sum_{j=1}^{d} |\psi_j|^2\right), \quad f_{\text{pooled}} = \sum_{t=1}^{\text{seq\_len}} \alpha_{\text{seq}}[t] \cdot f_{\text{measured}}[t]$$
This design enables the model to focus on positions in the sequence with larger probability amplitudes, which typically contain important semantic information.
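The measurement and pooling steps can be sketched as below for a single sequence. This is an illustrative sketch only: the function name `measure_and_pool` is hypothetical, and concatenating the outputs of the $M$ operators into one feature vector is an assumption, since the text does not state how they are combined.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def measure_and_pool(psi, w_measure, b_measure):
    """psi: [seq_len, d, 2]; w_measure: [M, d_measure, 2d]; b_measure: [M, d_measure]."""
    a, b = psi[..., 0], psi[..., 1]                   # real and imaginary parts
    flat = np.concatenate([a, b], axis=-1)            # [seq_len, 2d]
    # f_i = W_i [a; b] + b_i for each operator; outputs concatenated (assumption).
    measured = np.concatenate(
        [flat @ w.T + bias for w, bias in zip(w_measure, b_measure)], axis=-1
    )                                                 # [seq_len, M * d_measure]
    prob = a ** 2 + b ** 2                            # |psi_j|^2 per basis state
    alpha = softmax(prob.mean(axis=-1), axis=0)       # one weight per position
    pooled = (alpha[:, None] * measured).sum(axis=0)  # weighted sum over positions
    return pooled, alpha

rng = np.random.default_rng(42)
seq_len, d, d_measure, num_ops = 6, 4, 5, 3
psi = rng.normal(size=(seq_len, d, 2))
w = rng.normal(size=(num_ops, d_measure, 2 * d))
bias = np.zeros((num_ops, d_measure))
pooled, alpha = measure_and_pool(psi, w, bias)
print(pooled.shape, alpha.round(3))
```

Positions whose states have larger mean squared amplitude receive larger pooling weights, which is the behavior the pooling formula describes.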

3.8. Contrastive Learning Strategy

3.8.1. Contrastive Loss Function

Contrastive learning learns more discriminative feature representations by pulling samples of the same class closer and pushing samples of different classes apart. Given $N$ samples in a batch, we construct $2N$ samples (including original and augmented samples).
Similarity Calculation: The similarity between sample $i$ and $j$ is calculated via cosine similarity:
$$\text{sim}(z_i, z_j) = \tilde{z}_i^{T} \tilde{z}_j = \frac{z_i^{T} z_j}{\|z_i\|_2 \cdot \|z_j\|_2}$$
Define the similarity matrix $S \in \mathbb{R}^{2N \times 2N}$ as:
$$S_{ij} = \text{sim}(z_i, z_j)$$
Positive Sample Pair Mask: Construct a positive sample pair mask $M_{\text{pos}} \in \{0, 1\}^{2N \times 2N}$, where $M_{\text{pos}}[i, j] = 1$ indicates that samples $i$ and $j$ belong to the same class ($y_i = y_j$), otherwise 0. Diagonal elements (similarity of a sample with itself) are excluded:
$$M_{\text{pos}}[i, j] = \begin{cases} 1 & \text{if } y_i = y_j \text{ and } i \neq j \\ 0 & \text{otherwise} \end{cases}$$
Negative Sample Pair Mask: The negative sample pair mask is defined as:
$$M_{\text{neg}} = 1 - M_{\text{pos}} - I$$
where $I$ is the identity matrix (subtracted to exclude the diagonal).
Temperature Scaling: Use the temperature parameter $\tau = 0.07$ to scale the similarity, controlling the sharpness of contrastive learning:
$$S_{\text{scaled}} = \frac{S}{\tau}$$
A smaller temperature parameter makes the model more sensitive to similar samples, enhancing feature discriminability.
Contrastive Loss Calculation: For each sample $i$, the contrastive loss is defined as:
$$\mathcal{L}_{\text{contrastive}}^{(i)} = -\log \frac{\sum_{j=1}^{2N} M_{\text{pos}}[i, j] \cdot \exp(S_{\text{scaled}}[i, j])}{\sum_{k=1}^{2N} \left(M_{\text{pos}}[i, k] + M_{\text{neg}}[i, k]\right) \cdot \exp(S_{\text{scaled}}[i, k])}$$
The numerator represents the sum of exponential similarities of positive sample pairs, and the denominator represents the sum of exponential similarities of all sample pairs (both positive and negative). To avoid numerical instability, a small constant $\epsilon = 10^{-8}$ is added:
$$\mathcal{L}_{\text{contrastive}}^{(i)} = -\log \frac{\sum_{j} M_{\text{pos}}[i, j] \cdot \exp(S_{\text{scaled}}[i, j]) + \epsilon}{\sum_{k \neq i} \exp(S_{\text{scaled}}[i, k]) + \epsilon}$$
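As a worked illustration of the loss above, the following NumPy sketch computes the masks, the temperature-scaled similarities, and the per-sample loss. The function name `contrastive_loss` is hypothetical, and averaging the per-sample losses over the batch is an assumption, since the text defines only the per-sample term.

```python
import numpy as np

def contrastive_loss(z, labels, tau=0.07, eps=1e-8):
    """z: [n, dim] features (originals and augmented views stacked); labels: [n]."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # unit norm -> cosine similarity
    s_scaled = (z @ z.T) / tau                         # temperature-scaled similarities
    eye = np.eye(len(labels))
    # Positive mask: same label, diagonal removed.
    m_pos = (labels[:, None] == labels[None, :]).astype(float) - eye
    # Negative mask: everything that is neither a positive nor the diagonal.
    m_neg = 1.0 - m_pos - eye
    exp_s = np.exp(s_scaled)
    numerator = (m_pos * exp_s).sum(axis=1) + eps
    denominator = ((m_pos + m_neg) * exp_s).sum(axis=1) + eps
    per_sample = -np.log(numerator / denominator)
    return float(per_sample.mean())                    # batch mean (assumption)

z = np.array([[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]])
aligned = contrastive_loss(z, np.array([0, 0, 1, 1]))   # labels match the geometry
shuffled = contrastive_loss(z, np.array([0, 1, 0, 1]))  # labels contradict it
print(aligned < shuffled)
```

With features already clustered by class the loss is close to zero, while labels that contradict the geometry give a much larger value. A sample with no positive partner in the batch has a numerator equal to $\epsilon$ and therefore a very large loss, so such cases may need separate handling in practice.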

3.8.2. Feature Representation Learning

Through contrastive learning, the learned feature representations possess the following properties:
Intra-class Compactness: Samples of the same class cluster together in the feature space, reducing intra-class distance. For a positive sample pair $(z_i, z_j)$, where $y_i = y_j$, the contrastive loss encourages $\text{sim}(z_i, z_j) \rightarrow 1$.
Inter-class Separation: Samples of different classes are separated in the feature space, increasing inter-class distance. For a negative sample pair $(z_i, z_k)$, where $y_i \neq y_k$, the contrastive loss encourages $\text{sim}(z_i, z_k) \rightarrow 0$.
Enhanced Feature Discriminability: Through contrastive learning, the learned feature representations can better distinguish between different classes, especially in cases of class imbalance, helping to improve recognition performance for minority classes.
Implementation Details: During training, we use data augmentation techniques to construct positive sample pairs, such as: (1) Random word masking: Randomly mask words in the input sequence with a probability of 15%; (2) Synonym replacement: Replace some words with their synonyms; (3) Sentence reordering: For dialogue data, adjust the order of utterances. These augmentation techniques ensure the diversity of positive sample pairs, making contrastive learning more effective.

3.9. Training Strategy Optimization

Joint Loss Function

The total loss function is a weighted combination of classification loss (cross-entropy loss) and contrastive loss:
$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{CE}} + \lambda\, \mathcal{L}_{\text{contrastive}}$$
Cross-Entropy Loss: For multi-class tasks:
$$\mathcal{L}_{\text{CE}} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} y_{i,c} \log(\hat{y}_{i,c})$$
where $N$ is the batch size, $C$ is the number of classes, $y_{i,c} \in \{0, 1\}$ is the one-hot encoding of the true label, and $\hat{y}_{i,c}$ is the model’s predicted probability for class $c$.
Label Smoothing: To prevent overfitting, we use label smoothing. With smoothing coefficient $\alpha$, the smoothed label is:
$$\tilde{y}_{i,c} = (1 - \alpha) \cdot y_{i,c} + \frac{\alpha}{C}$$
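The joint objective with label smoothing can be sketched as follows. The smoothing coefficient 0.1 is the value reported in Section 4.2; the weight $\lambda$ is not stated in the text, so the value used in the example is purely illustrative, as are the function names.

```python
import numpy as np

def smoothed_cross_entropy(probs, labels, alpha=0.1):
    """probs: [n, C] predicted probabilities; labels: [n] integer class ids."""
    n, num_classes = probs.shape
    one_hot = np.eye(num_classes)[labels]
    # Smoothed target: (1 - alpha) * y + alpha / C
    smooth = (1.0 - alpha) * one_hot + alpha / num_classes
    return float(-(smooth * np.log(probs)).sum(axis=1).mean())

def total_loss(ce, contrastive, lam):
    # L_total = L_CE + lambda * L_contrastive
    return ce + lam * contrastive

probs = np.array([[0.1, 0.9], [0.8, 0.2]])
labels = np.array([1, 0])
ce = smoothed_cross_entropy(probs, labels, alpha=0.1)
print(total_loss(ce, contrastive=0.5, lam=0.1))  # lam = 0.1 is a hypothetical value
```

For the binary task ($C = 2$) and $\alpha = 0.1$, a one-hot target $[0, 1]$ becomes $[0.05, 0.95]$, so the model is never pushed toward fully saturated probabilities.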

4. Experiments

4.1. Datasets

This study involved validation experiments on two widely used dialogue emotion recognition datasets. The first is the RECCON-DD dataset, derived from the DailyDialog corpus, specifically designed for binary classification tasks in causal relation detection. We adopt stratified sampling to partition the dataset into training (60%), validation (20%), and test (20%) sets, ensuring balanced class distribution across subsets. The dataset contains 7 basic emotion categories—joy, surprise, anger, sadness, fear, disgust, and neutral—providing rich emotional context for the model.
The second dataset is RECCON-IEM, constructed from dialogue text slices of the RECCON-IEM multimodal sentiment corpus, focusing on the binary classification task of “emotion-triggering event” causal detection. Each sample consists of <emotion><SEP>dialogue utterance, with the labels field indicating the presence of explicit emotional causal clues. The emotion categories are from the original RECCON-IEM annotations, covering six major classes—angry, frustrated, excited, sad, happy, and neutral—providing cross-speaker, multi-context emotional context for the model.

4.2. Model Architecture and Training Configuration

To improve experimental reproducibility, all experiments were implemented using PyTorch 2.4.1 under a unified evaluation protocol. Unless otherwise specified, the random seed was fixed to 42 for Python random, NumPy 1.24.4, and PyTorch. Therefore, all results reported in the tables and figures are based on a single run using the fixed random seed of 42, rather than averages over multiple repeated runs. Since multiple random seeds were not used in this version, we do not claim statistical significance or full statistical robustness, and this point is clearly acknowledged as a limitation in the Conclusions section. In addition, cuDNN deterministic mode was enabled and cuDNN benchmark mode was disabled to reduce nondeterministic behavior during training. All compared models were trained and evaluated using the same training/validation/test splits, preprocessing pipeline, tokenizer, vocabulary, optimizer, learning rate schedule, early stopping strategy, and evaluation metrics.
For data preprocessing, all dialogue samples were loaded from CSV files containing the text and labels fields. The task was formulated as a binary classification problem according to the labels field. Emotion information was additionally extracted from the beginning of each text sequence and mapped into seven auxiliary emotion categories; namely, happiness, surprise, anger, sadness, fear, disgust, and no emotion. The original emotion prefix before <SEP> was removed during preprocessing. The dataset was split into training, validation, and test sets with a ratio of 60%, 20%, and 20%, respectively. Stratified sampling was adopted according to the class labels to preserve the label distribution across different subsets.
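A 60/20/20 stratified split of the kind described above can be sketched with the standard library. The paper does not state which tool was used, so this is only one possible realization; the function name `stratified_split` is hypothetical, and the seed of 42 mirrors the reported setting.

```python
import random
from collections import defaultdict

def stratified_split(labels, ratios=(0.6, 0.2, 0.2), seed=42):
    """Return (train, val, test) index lists that preserve the label distribution."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    train, val, test = [], [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)                       # shuffle within each class
        n_train = round(ratios[0] * len(idxs))
        n_val = round(ratios[1] * len(idxs))
        train += idxs[:n_train]
        val += idxs[n_train:n_train + n_val]
        test += idxs[n_train + n_val:]          # remainder goes to the test set
    return train, val, test

labels = [0] * 80 + [1] * 20                    # toy imbalanced binary labels
train, val, test = stratified_split(labels)
print(len(train), len(val), len(test))
```

Because the split is done per class, each subset of this toy example keeps the 80/20 class ratio of the full data.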
A lightweight word-level tokenizer was used in all experiments. The tokenizer first lowercased each input sentence and then performed whitespace-based tokenization. The vocabulary was constructed only from the training set to avoid data leakage. The maximum vocabulary size was set to 10,000, including three special tokens: <PAD>, <UNK>, and <SEP>. Each input sequence was truncated or padded to a maximum length of 256 tokens. The attention mask was generated according to the padded positions, where non-padding tokens were assigned 1 and padding tokens were assigned 0.
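The tokenizer described above can be sketched in a few lines of Python. The special-token indices, the frequency-based ranking of vocabulary entries, and the function names are assumptions for illustration; only the lowercasing, whitespace splitting, vocabulary cap, maximum length, and mask convention come from the text.

```python
from collections import Counter

PAD, UNK, SEP = "<PAD>", "<UNK>", "<SEP>"

def tokenize(text):
    # Lowercase, then split on whitespace.
    return text.lower().split()

def build_vocab(train_texts, max_size=10_000):
    """Build the vocabulary from training texts only, to avoid data leakage."""
    counts = Counter(tok for text in train_texts for tok in tokenize(text))
    vocab = {PAD: 0, UNK: 1, SEP: 2}            # special-token ids are assumed
    for tok, _ in counts.most_common():          # most frequent tokens first (assumed)
        if len(vocab) >= max_size:
            break
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab

def encode(text, vocab, max_len=256):
    """Return (token ids, attention mask), truncated or padded to max_len."""
    ids = [vocab.get(tok, vocab[UNK]) for tok in tokenize(text)][:max_len]
    n_pad = max_len - len(ids)
    mask = [1] * len(ids) + [0] * n_pad          # 1 = real token, 0 = padding
    return ids + [vocab[PAD]] * n_pad, mask

vocab = build_vocab(["i am happy", "i am sad"])
ids, mask = encode("I am HAPPY today", vocab, max_len=6)
print(ids, mask)  # [3, 4, 5, 1, 0, 0] [1, 1, 1, 1, 0, 0]
```

The unseen word "today" maps to the `<UNK>` id, and the two padded positions carry a mask value of 0.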
The proposed ImprovedQPFE model uses an embedding dimension of 256 and a hidden size of 256. The dropout rate was set to 0.3. Before training the final classification model, we pretrained a phase extractor on the auxiliary emotion labels. The pretraining stage used at most 2000 training samples, a maximum sequence length of 128, a batch size of 32, and was trained for 8 epochs. The phase extractor was optimized using AdamW with a learning rate of $1 \times 10^{-3}$ and a weight decay of 0.01. The learned phase embeddings were then used as pretrained phase information for the ImprovedQPFE model.
For the main classification training, all models were trained under the same optimization protocol. We used AdamW as the optimizer with a learning rate of $5 \times 10^{-5}$ and a weight decay of 0.01. The learning rate was adjusted by CosineAnnealingWarmRestarts with T_0 = 5, T_mult = 2, and eta_min = $1 \times 10^{-6}$. The training batch size was set to 32, while the test batch size was set to 16. The maximum number of training epochs was 30. Early stopping was applied with a patience of 8 epochs, and the model with the highest validation macro-F1 score was selected for final testing. The loss function was cross-entropy loss with label smoothing, where the smoothing coefficient was set to 0.1. Gradient clipping with a maximum norm of 1.0 was applied to stabilize training.
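To make the learning rate schedule concrete, the sketch below evaluates the cosine annealing with warm restarts formula, $\eta_t = \eta_{\min} + \tfrac{1}{2}(\eta_{\max} - \eta_{\min})(1 + \cos(\pi T_{\text{cur}} / T_i))$, in plain Python with the reported settings. It assumes the scheduler is stepped once per epoch, which the text does not state, and the helper name `cosine_warm_restart_lr` is hypothetical.

```python
import math

def cosine_warm_restart_lr(epoch, base_lr=5e-5, eta_min=1e-6, t_0=5, t_mult=2):
    """Learning rate at the start of `epoch` under cosine annealing with warm restarts."""
    t_i, t_cur = t_0, epoch
    # Walk through completed cycles; each cycle is t_mult times longer than the last.
    while t_cur >= t_i:
        t_cur -= t_i
        t_i *= t_mult
    return eta_min + 0.5 * (base_lr - eta_min) * (1.0 + math.cos(math.pi * t_cur / t_i))

schedule = [cosine_warm_restart_lr(e) for e in range(30)]
restarts = [e for e in range(30) if math.isclose(schedule[e], 5e-5)]
print(restarts)  # [0, 5, 15]
```

With T_0 = 5 and T_mult = 2, the cycles last 5, 10, and 20 epochs, so within the 30-epoch budget the learning rate returns to its initial value at epochs 5 and 15.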
To ensure fair comparison, all baseline models used the same training, validation, and test splits, the same tokenizer, the same vocabulary, and the same optimization settings. The compared models include Deep LSTM, Deep CNN, Enhanced Transformer, Hybrid CNN-LSTM, Attention Enhanced, Deep Residual, and Gated CNN. The Deep LSTM model consists of a three-layer bidirectional LSTM followed by multi-head attention. The Deep CNN model uses multiple convolutional kernels for multi-scale feature extraction. The Enhanced Transformer contains four Transformer encoder layers with eight attention heads. The Hybrid CNN-LSTM combines convolutional feature extraction with bidirectional LSTM sequence modeling. The Attention Enhanced model stacks recurrent encoding, multi-head attention, self-attention, and feed-forward layers. The Deep Residual model uses four residual feed-forward blocks with multi-head attention. The Gated CNN model employs gated convolutional units with kernel sizes of 3, 5, and 7.
The model performance was evaluated using Accuracy, Macro-F1, Macro-Precision, and Macro-Recall. Since the task is binary classification, we also reported the F1 scores of the positive and negative classes. Macro-F1 was used as the primary metric for model selection and comparison, as it provides a balanced evaluation under potential class imbalance.

4.3. Visualization Analysis

4.3.1. Attention Weight Visualization

Figure 3 illustrates the aggregated attention weight distribution of the multi-head attention mechanism when processing the emotional dialogue utterance, “I am so happy because I passed the exam.” The horizontal axis represents the input sequence tokens as Keys, and the vertical axis represents the output positions as Queries. The color intensity indicates the magnitude of the normalized attention weights, where lighter or warmer colors correspond to relatively higher attention allocation.
As shown in Figure 3, the emotional keyword “happy” receives relatively higher attention from several output positions, indicating that the model tends to focus more on this emotionally salient token during contextual representation learning. It should be noted that attention weights are normalized allocation coefficients rather than signed sentiment contributions. Therefore, a higher attention weight for “happy” does not mean that the word has a positive weight in the classification decision, but rather that it plays an important role in constructing the sentence representation.
In addition, tokens such as “because” and “passed” also receive certain attention in local contexts, suggesting that the model captures part of the semantic relationship between the emotion and its possible cause. Overall, the visualization shows that the multi-head attention mechanism can assign relatively greater attention to emotionally relevant and contextually important tokens, which helps the model to learn more informative emotional dialogue representations.

4.3.2. Feature Space Distribution Visualization

Figure 4 uses the t-SNE dimensionality reduction method to visualize the feature representations extracted by the ImprovedQPFE model from the real DailyDialog test set in a two-dimensional space. The test samples shown in the figure include seven emotion categories: neutral, joy, sadness, anger, fear, surprise, and disgust.
As shown in Figure 4, different emotion categories exhibit certain distribution differences in the feature space, but they do not form completely clear and independent cluster structures. Specifically, the joy category shows a relatively obvious clustering tendency, indicating that the model has learned more discriminative representations for this category. In contrast, other categories, such as neutral, anger, and disgust, are more widely distributed and overlap with samples from other emotion classes. Sadness, fear, and surprise also show partial concentration in some regions, but their boundaries are still not clearly separated.
These visualization results suggest that the proposed quantum-inspired model can learn useful emotion-related feature representations to some extent. However, the feature space still contains overlaps among several emotion categories, indicating that the inter-class separability remains limited. Therefore, although the model demonstrates a certain ability to distinguish emotional dialogue features, there is still room for further improvement in enhancing feature compactness and class separation.

4.4. Mathematical Definitions of Evaluation Metrics

4.4.1. Confusion Matrix Basics

In the binary sentiment detection task, we define four basic statistics based on the confusion matrix. For a sample set $D = \{(x_i, y_i)\}_{i=1}^{N}$, where $x_i$ represents the input dialogue and $y_i \in \{0, 1\}$ represents the true sentiment label (0 for negative, 1 for positive), the model prediction is $\hat{y}_i \in \{0, 1\}$. The four basic elements of the confusion matrix are defined as follows:
$$\begin{aligned} \text{TP} &= \sum_{i=1}^{N} \mathbb{I}[y_i = 1 \wedge \hat{y}_i = 1], & \text{TN} &= \sum_{i=1}^{N} \mathbb{I}[y_i = 0 \wedge \hat{y}_i = 0], \\ \text{FP} &= \sum_{i=1}^{N} \mathbb{I}[y_i = 0 \wedge \hat{y}_i = 1], & \text{FN} &= \sum_{i=1}^{N} \mathbb{I}[y_i = 1 \wedge \hat{y}_i = 0] \end{aligned}$$
The function $\mathbb{I}[\cdot]$ is an indicator function that returns 1 if the condition within the brackets is true, and 0 otherwise. TP denotes True Positives (samples with positive sentiment that the model correctly predicts), TN denotes True Negatives (samples with negative sentiment that the model correctly predicts), FP denotes False Positives (negative sentiment samples incorrectly predicted as positive by the model), and FN denotes False Negatives (positive sentiment samples incorrectly predicted as negative by the model).

4.4.2. Accuracy

Accuracy is the most basic metric for evaluating the overall performance of a classification model, defined as the proportion of correctly predicted samples to the total number of samples:
$$\text{Accuracy} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{TN} + \text{FP} + \text{FN}} = \frac{\text{TP} + \text{TN}}{N}$$
where N is the total sample size. This metric measures the model’s prediction accuracy across the entire dataset, with a value range of [0, 1]; values closer to 1 indicate better model performance.

4.4.3. Recall

Recall measures the model’s ability to identify actual positive samples, defined as the proportion of actual positive samples that are correctly predicted:
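$$\text{Recall}_c = \frac{\text{TP}_c}{\text{TP}_c + \text{FN}_c}$$
For binary sentiment detection, the recalls for the positive and negative classes are:
$$\text{Recall}_{\text{pos}} = \frac{\text{TP}}{\text{TP} + \text{FN}}, \quad \text{Recall}_{\text{neg}} = \frac{\text{TN}}{\text{TN} + \text{FP}}$$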

4.4.4. F1 Score

The F1 score is the harmonic mean of precision and recall, used to comprehensively evaluate the model’s performance on a specific category. Precision is the proportion of samples predicted as class $c$ that truly belong to it, $\text{Precision}_c = \text{TP}_c / (\text{TP}_c + \text{FP}_c)$:
$$\text{F1}_c = 2 \cdot \frac{\text{Precision}_c \times \text{Recall}_c}{\text{Precision}_c + \text{Recall}_c}$$
For binary sentiment detection, the F1 scores of the positive and negative classes are:
$$\text{F1}_{\text{pos}} = 2 \cdot \frac{\text{Precision}_{\text{pos}} \times \text{Recall}_{\text{pos}}}{\text{Precision}_{\text{pos}} + \text{Recall}_{\text{pos}}}, \quad \text{F1}_{\text{neg}} = 2 \cdot \frac{\text{Precision}_{\text{neg}} \times \text{Recall}_{\text{neg}}}{\text{Precision}_{\text{neg}} + \text{Recall}_{\text{neg}}}$$

4.4.5. Macro-F1

The macro-averaged F1 score is the arithmetic mean of the F1 scores for each individual class, assigning equal weight to all classes:
$$\text{Macro-F1} = \frac{1}{C} \sum_{c=1}^{C} \text{F1}_c$$
For a binary classification task:
$$\text{Macro-F1} = \frac{\text{F1}_{\text{pos}} + \text{F1}_{\text{neg}}}{2}$$
The core characteristic of the macro-averaged F1 score is that it gives equal importance to each class, independent of their sample sizes. This makes the metric more sensitive to the performance on minority classes and effectively reflects the model’s balanced performance on datasets with class imbalance.
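The metric definitions in this section can be checked with a small pure-Python sketch. The function name `binary_metrics` and the toy labels are hypothetical; returning 0 when a denominator is empty is a common convention and an assumption here.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, per-class F1, and Macro-F1 from the confusion matrix counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def f1(tp_c, fp_c, fn_c):
        precision = tp_c / (tp_c + fp_c) if tp_c + fp_c else 0.0
        recall = tp_c / (tp_c + fn_c) if tp_c + fn_c else 0.0
        if precision + recall == 0:
            return 0.0
        return 2 * precision * recall / (precision + recall)

    f1_pos = f1(tp, fp, fn)
    # For the negative class the roles swap: TN are its hits, FN its false alarms.
    f1_neg = f1(tn, fn, fp)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "f1_pos": f1_pos,
        "f1_neg": f1_neg,
        "macro_f1": (f1_pos + f1_neg) / 2,
    }

y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
print(binary_metrics(y_true, y_pred))
```

In this toy case accuracy is 0.75, the positive-class F1 is 2/3, the negative-class F1 is 0.8, and Macro-F1 is their mean. The same averaging applied to the per-class scores reported in Section 4.5 gives (76.27 + 91.22) / 2 = 83.745, which matches the reported Macro-F1 of 83.75% after rounding.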

4.5. Experimental Results

Table 1 summarizes the main experimental results. It can be observed that the method proposed in this paper achieves competitive performance across all key metrics. On the RECCON-DD dataset, our model attains a Macro-F1 score of 83.75%, outperforming all listed baseline methods. Notably, it demonstrates a particular advantage in recognizing positive emotions (Pos. F1: 76.27%), which is a challenging aspect in sentiment analysis, while also maintaining a high and competitive negative emotion recognition rate (Neg. F1: 91.22%).
Figure 5 visually compares the performance of the proposed model against various baseline methods on the RECCON-DD dataset. The bar chart illustrates that (1) our method achieves the highest Macro-F1 score (83.75%), demonstrating an overall performance advantage; (2) it shows the best Pos. F1 score (76.27%), indicating a superior capability in identifying positive emotions, which are often more subtle and context-dependent; (3) it maintains a near-optimal Neg. F1 score (91.22%), comparable to the best-performing baselines in negative emotion recognition. This balanced performance across all three key metrics supports the effectiveness of our integrated quantum-inspired architecture, multi-head attention mechanism, and contrastive learning strategy. The performance gains can be attributed to the synergistic design: complex embeddings enhance feature representation, BiGRU captures sequential dependencies, multi-head attention models contextual relationships, and contrastive learning improves feature discriminability.

Cross-Dataset Validation: Experimental Results on RECCON-IEM

To verify the generalization ability and robustness of the model, we conducted additional experimental validation on the RECCON-IEM dataset. The RECCON-IEM dataset contains rich emotional expressions in dramatic dialogue scenarios. Compared to the DailyDialog dataset, it features more complex emotional dynamics and more diverse forms of expression.
Figure 6 details the specific numerical values of various performance metrics of the ImprovedQPFE model on the RECCON-IEM dataset. The figure includes multi-dimensional evaluation metrics such as accuracy, precision, recall, F1 score, and classification performance for different emotion categories. From the figure, it can be seen that (1) the model also achieves excellent performance on the RECCON-IEM dataset, with all metrics reaching high levels; (2) there are certain differences in the recognition performance of different emotion categories, which is related to the sample size and complexity of emotional expression for each category; (3) the model can maintain stable performance when dealing with complex emotional scenarios, demonstrating good robustness; (4) compared with the results on the RECCON-DD dataset, the performance on the RECCON-IEM dataset shows slight differences, but the overall trend is consistent, supporting the model’s cross-dataset adaptation capability. These results indicate that the ImprovedQPFE model not only performs excellently on a single dataset but also has good cross-dataset generalization ability, enabling it to adapt to different types of dialogue emotion recognition tasks.
Figure 7 compares the performance of different models on the RECCON-IEM dataset by reorganizing the results according to evaluation metrics. Instead of grouping the results by model, the figure presents F1-score and accuracy in separate panels, allowing a more direct comparison of model performance under each metric. As shown in the figure, ImprovedQPFE achieves the best overall performance, with an F1-score of 0.9540 and an accuracy of 96.22%, outperforming DeepTransformer, ResidualCNN, and HybridCNNRNN. ResidualCNN also demonstrates strong performance, achieving an F1-score of 0.9477 and an accuracy of 95.75%, while HybridCNNRNN obtains slightly lower but still competitive results. In comparison, DeepTransformer shows the weakest performance among the four models, with an F1-score of 0.8620 and an accuracy of 88.63%. These results indicate that the proposed ImprovedQPFE model provides more effective feature representation and classification capability for dialogue emotion recognition. The metric-oriented organization of the figure further improves clarity and makes it easier to identify the performance differences among models for each evaluation criterion.

4.6. Ablation Study

To rigorously deconstruct the proposed ImprovedQPFE model, validate the overall advantage of the quantum-inspired paradigm, and verify the necessity of each component within its hybrid architecture, we conducted a systematic ablation study on the RECCON-DD dataset. The results are detailed in Table 2.

4.6.1. Validation of the Net Advantage of the Quantum-Inspired Paradigm

The central comparison in this study is between the full ImprovedQPFE model and the strong non-quantum baseline (Variant A1). This baseline is itself a strong competitor, integrating a standard Transformer encoder, BiGRU, and contrastive learning. Replacing the entire quantum-inspired representation stack with standard real-valued components leads to a drop of 3.67 percentage points in Macro-F1 (from 83.75% to 80.08%). Under the single-seed protocol adopted here, this gap suggests that the quantum-inspired paradigm, which represents and processes information in a complex amplitude-phase space, provides a performance advantage that the real-valued baseline does not recover. This advantage may originate from the paradigm’s capability to model the superposition, ambiguity, and contextual interdependence of emotional states in dialogue.

4.6.2. Necessity of the Hybrid BiGRU-Transformer Architecture

Removing the BiGRU module (A2) results in the second-largest performance degradation (ΔMacro-F1: −2.34 pp), highlighting its indispensable complementary role alongside the Transformer. While the Transformer’s self-attention mechanism excels at capturing global dependencies and discourse structure at arbitrary distances, the BiGRU specializes in modeling local, fine-grained sequential patterns and the immediate contextual flow between dialogue turns. For the task of dialogue emotion recognition, where emotional shifts are often tightly dependent on the rhetorical relations of adjacent utterances, this local sequential modeling capability is crucial. The significant performance drop of Variant A2 confirms that a pure Transformer architecture, despite its power, is suboptimal for comprehensively capturing the hierarchical nature of emotional dynamics that exists at both local and global levels. Therefore, the hybrid design of BiGRU and Transformer is a necessary, not redundant, choice for achieving optimal performance.

4.6.3. Contribution Analysis of Other Components

Ablation of other components further reveals their hierarchical value.
Quantum Attention provides refined focusing (A3): Its removal causes a 1.89 pp drop in Macro-F1, supporting its ability to guide the model’s attention to the most relevant emotional cues within the complex feature space.
Contrastive Learning enhances feature discriminability (A4): Its absence leads to a 1.23 pp decline, indicating that this strategy sharpens the model’s classification boundaries by explicitly separating the feature representations of samples from different classes.
Phase Pre-training ensures training stability (A5): Its removal causes a 0.83 pp drop, indicating that providing a good initialization for the phase parameters facilitates more robust and faster convergence of the model.
Quantum Measurement enables efficient feature extraction (A6): Although its removal causes the smallest individual drop (0.47 pp), this module is necessary for efficiently compressing high-dimensional complex quantum states into discriminative features suitable for classification.

5. Conclusions

To address the challenge of complex semantic understanding in conversational sentiment analysis and to enhance the performance of quantum-inspired models for sentiment classification, this study proposes the Quantum-inspired Pretrained Feature Embedding (ImprovedQPFE) model. The ImprovedQPFE model systematically integrates core concepts from quantum computing into a neural network architecture. It employs a complex-valued embedding layer to represent tokens as quantum states, utilizes a bidirectional complex-valued Gated Recurrent Unit (BiGRU) to model contextual sequential evolution, incorporates a quantum attention mechanism to capture long-range semantic dependencies, and applies a multi-operator quantum measurement to extract observable probabilistic features for final classification.
Building upon the ImprovedQPFE architecture, we designed a comprehensive framework that synergistically combines quantum-inspired feature learning with a supervised contrastive learning mechanism, conducting extensive validation on dialogue emotion recognition tasks. The model was thoroughly trained and evaluated on the RECCON-DD and RECCON-IEM datasets using the PyTorch framework on an NVIDIA GPU platform. The experimental results show that the ImprovedQPFE model achieves better performance than the compared baselines across multiple key metrics under the adopted single-seed experimental protocol. Specifically, on the RECCON-DD dataset, ImprovedQPFE achieves the best Macro-F1 score among the compared baselines, reaching 83.75%, with improved performance in identifying positive sentiments (Pos. F1: 76.27%) while maintaining high negative sentiment recognition performance (Neg. F1: 91.22%). On the RECCON-IEM dataset, ImprovedQPFE attains a Macro-F1 score of 95.39%, with accuracy and recall exceeding 96% under the same fixed-seed experimental setting. These results suggest the potential effectiveness of the proposed model, while further repeated experiments are needed to verify its statistical robustness.
The ablation study (Table 2) examines the contribution of each core component. Replacing all quantum-inspired components, including the complex-valued embeddings, with real-valued counterparts (variant A1) causes the largest drop in Macro-F1 (3.67 points), underscoring the foundational role of quantum state representation in semantic modeling. Removing the BiGRU module lowers Macro-F1 by 2.34 points, which points to the importance of sequential context modeling; removing the quantum attention mechanism lowers it by 1.89 points, consistent with its role in capturing long-range dependencies within dialogues; and removing the contrastive learning strategy lowers it by 1.23 points, reflecting its contribution to feature discriminability. Together, these components form a cohesive framework.
Although the proposed ImprovedQPFE model achieves promising results on dialogue emotion recognition and emotion–cause analysis tasks, several limitations should be acknowledged. First, the novelty of the model lies mainly in integrating quantum-inspired complex-valued feature representation, multi-head attention, and contrastive learning within a unified text-based framework, rather than in proposing a fundamentally new quantum computing paradigm. The quantum-inspired components enhance feature representation through phase- and amplitude-related transformations, and their specific contribution has been examined through ablation analysis. Second, although the experiments were conducted under a unified benchmarking protocol with fixed data splits, standardized preprocessing, ablation studies, and cross-dataset evaluation, the reported results are based on a single fixed random seed of 42 rather than on multiple repeated runs. The current results should therefore be interpreted as performance under a controlled single-seed setting, and we make no claims of statistical significance or full statistical robustness. Future work will strengthen the empirical validation by running experiments with multiple random seeds and reporting means, standard deviations, confidence intervals, and statistical significance tests. It will also extend the proposed framework to larger-scale pretrained language models and to multimodal dialogue emotion recognition scenarios involving acoustic and visual signals, so as to better evaluate its generalization ability in more complex affective computing settings.
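The multi-seed reporting described here could be aggregated along the following lines. This is a minimal sketch that assumes a Student-t confidence interval over per-seed Macro-F1 scores; the function name `summarize` is hypothetical, and the score list is made-up placeholder data used only to exercise the code, not results from this study.

```python
import math
import statistics


def summarize(scores, t_crit):
    """Return mean, sample standard deviation, and a t-based confidence interval."""
    mean = statistics.mean(scores)
    std = statistics.stdev(scores)  # sample standard deviation (n - 1 denominator)
    half_width = t_crit * std / math.sqrt(len(scores))
    return mean, std, (mean - half_width, mean + half_width)


# PLACEHOLDER values for five hypothetical seeds; these are NOT reported results.
placeholder_scores = [80.0, 82.0, 81.0, 83.0, 84.0]

# Two-sided 95% Student-t critical value for n - 1 = 4 degrees of freedom.
T_CRIT_DF4 = 2.776

mean, std, ci = summarize(placeholder_scores, T_CRIT_DF4)
print(f"mean={mean:.2f}  std={std:.2f}  95% CI=({ci[0]:.2f}, {ci[1]:.2f})")
```

The critical value depends on the number of seeds, so it has to be looked up again (or computed with a statistics library) whenever the number of runs changes.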
This research presents a novel neural network modeling paradigm inspired by quantum theory. Under the adopted single-seed protocol, the proposed ImprovedQPFE model not only achieves strong performance on dialogue understanding tasks but also offers good interpretability and verifiable component contributions. It provides a new perspective and a practical pathway for enhancing deep learning models with principles drawn from quantum computing.

Author Contributions

Conceptualization, F.Z., L.Z., F.G. and X.W. (Xunhuang Wang); Methodology, F.Z., L.Z., F.G. and X.W. (Xunhuang Wang); Software, L.Z. and F.G.; Validation, L.Z. and F.G.; Formal analysis, F.Z., L.Z., F.G., X.W. (Xunhuang Wang), J.W., T.F., H.J. and X.W. (Xueming Wu); Investigation, L.Z. and F.G.; Resources, F.Z., L.Z. and F.G.; Data curation, L.Z. and F.G.; Writing—original draft, L.Z. and X.W. (Xunhuang Wang); Writing—review and editing, L.Z., J.W., T.F., H.J. and X.W. (Xueming Wu); Visualization, F.Z., L.Z., F.G., X.W. (Xunhuang Wang), J.W., T.F., H.J. and X.W. (Xueming Wu); Supervision, F.Z., L.Z., F.G. and X.W. (Xunhuang Wang); Project administration, F.Z., L.Z. and F.G.; Funding acquisition, F.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Scientific Research Projects in Fujian University of Technology, grant number GY-Z240043, and by the Renewable Energy Technology Research Institute of Fujian University of Technology, grant number PT4300101.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original data presented in the study are openly available in the RECCON repository at https://github.com/declare-lab/RECCON (accessed on 12 May 2026). The specific datasets include RECCON-DD and RECCON-IEM.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Jim, J.R.; Talukder, M.A.R.; Malakar, P.; Kabir, M.M.; Nur, K.; Mridha, M.F. Recent Advancements and Challenges of NLP-Based Sentiment Analysis: A State-of-the-Art Review. Nat. Lang. Process. J. 2024, 6, 100059. [Google Scholar] [CrossRef]
  2. Tang, H.; Kamei, S.; Morimoto, Y. Data Augmentation Methods for Enhancing Robustness in Text Classification Tasks. Algorithms 2023, 16, 59. [Google Scholar] [CrossRef]
  3. Zhang, C.-X.; Liu, R.; Gao, X.-Y.; Yu, B. Graph Convolutional Network for Word Sense Disambiguation. Discret. Dyn. Nat. Soc. 2021, 2021, 2822126. [Google Scholar] [CrossRef]
  4. Wang, T.; Zhong, J.; Chen, J.; Hu, Q. Composite Kernels for Automatic Word Sense Disambiguation. J. Comput. Theor. Nanosci. 2015, 12, 619–623. [Google Scholar] [CrossRef]
  5. Pan, S.; Luo, L.; Wang, Y.; Chen, C.; Wang, J.; Wu, X. Unifying Large Language Models and Knowledge Graphs: A Roadmap. IEEE Trans. Knowl. Data Eng. 2024, 36, 3580–3599. [Google Scholar] [CrossRef]
  6. Yilmaz, S.; Toklu, S. A Deep Learning Analysis on Question Classification Task Using Word2vec Representations. Neural Comput. Appl. 2020, 32, 2909–2928. [Google Scholar] [CrossRef]
  7. Yang, C.; Zhang, Y. Public Emotions and Visual Perception of the East Coast Park in Singapore: A Deep Learning Method Using Social Media Data. Urban For. Urban Green. 2024, 94, 128285. [Google Scholar] [CrossRef]
  8. Farhangian, F.; Cruz, R.M.O.; Cavalcanti, G.D.C. Fake News Detection: Taxonomy and Comparative Study. Inf. Fusion 2024, 103, 102140. [Google Scholar] [CrossRef]
  9. Kanahuati-Ceballos, M.; Valdivia, L.J. Detection of depressive comments on social media using RNN, LSTM, and random forest: Comparison and optimization. Soc. Netw. Anal. Min. 2024, 14, 44. [Google Scholar] [CrossRef]
  10. Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186. [Google Scholar] [CrossRef]
  11. Shahid, R.; Wali, A.; Bashir, M. Next Word Prediction for Urdu Language Using Deep Learning Models. Comput. Speech Lang. 2024, 87, 101635. [Google Scholar] [CrossRef]
  12. Punetha, N.; Jain, G. Game Theory and MCDM-Based Unsupervised Sentiment Analysis of Restaurant Reviews. Appl. Intell. 2023, 53, 20152–20173. [Google Scholar] [CrossRef]
  13. Bashiri, H.; Naderi, H. Comprehensive Review and Comparative Analysis of Transformer Models in Sentiment Analysis. Knowl. Inf. Syst. 2024, 66, 7305–7361. [Google Scholar] [CrossRef]
  14. Baqach, A.; Battou, A. A New Sentiment Analysis Model to Classify Students’ Reviews on MOOCs. Educ. Inf. Technol. 2024, 29, 16813–16840. [Google Scholar] [CrossRef]
  15. Govers, J.; Feldman, P.; Dant, A.; Patros, P. Down the Rabbit Hole: Detecting Online Extremism, Radicalisation, and Politicised Hate Speech. ACM Comput. Surv. 2023, 55, 1–35. [Google Scholar] [CrossRef] [PubMed]
  16. Basili, R.; Rocca, M.D.; Pazienza, M.T. Contextual Word Sense Tuning and Disambiguation. Appl. Artif. Intell. 1997, 11, 235–262. [Google Scholar] [CrossRef]
  17. Shaukat, S.; Asad, M.; Akram, A. Developing an Urdu Lemmatizer Using a Dictionary-Based Lookup Approach. Appl. Sci. 2023, 13, 5103. [Google Scholar] [CrossRef]
  18. HaCohen-Kerner, Y.; Kass, A.; Peretz, A. HAADS: A Hebrew Aramaic Abbreviation Disambiguation System. J. Am. Soc. Inf. Sci. 2010, 61, 1923–1932. [Google Scholar] [CrossRef]
  19. Wong, M.-F.; Guo, S.; Hang, C.-N.; Ho, S.-W.; Tan, C.-W. Natural Language Generation and Understanding of Big Code for AI-Assisted Programming: A Review. Entropy 2023, 25, 888. [Google Scholar] [CrossRef] [PubMed]
  20. Chen, J.; Liu, Z.; Huang, X.; Wu, C.; Liu, Q.; Jiang, G.; Pu, Y.; Lei, Y.; Chen, X.; Wang, X.; et al. When Large Language Models Meet Personalization: Perspectives of Challenges and Opportunities. World Wide Web 2024, 27, 42. [Google Scholar] [CrossRef]
  21. Sarker, I.H. LLM Potentiality and Awareness: A Position Paper from the Perspective of Trustworthy and Responsible AI Modeling. Discov. Artif. Intell. 2024, 4, 40. [Google Scholar] [CrossRef]
  22. Wu, S.; Roberts, K.; Datta, S.; Du, J.; Ji, Z.; Si, Y.; Soni, S.; Wang, Q.; Wei, Q.; Xiang, Y.; et al. Deep Learning in Clinical Natural Language Processing: A Methodical Review. J. Am. Med. Inform. Assoc. 2019, 27, 457–470. [Google Scholar] [CrossRef]
  23. Lin, W.; Liao, L.-C. Lexicon-Based Prompt for Financial Dimensional Sentiment Analysis. Expert. Syst. Appl. 2024, 244, 122936. [Google Scholar] [CrossRef]
  24. Jain, G.; Lobiyal, D.K. Word Sense Disambiguation Using Cooperative Game Theory and Fuzzy Hindi WordNet Based on ConceptNet. ACM Trans. Asian Low-Resour. Lang. Inf. Process. 2022, 21, 1–25. [Google Scholar] [CrossRef]
  25. Ni, P.; Li, Y.; Li, G.; Chang, V. Natural Language Understanding Approaches Based on Joint Task of Intent Detection and Slot Filling for IoT Voice Interaction. Neural Comput. Appl. 2020, 32, 16149–16166. [Google Scholar] [CrossRef]
  26. Zhang, P.; Gao, H.; Zhang, J.; Song, D. Quantum-Inspired Neural Language Representation, Matching and Understanding. Found. Trends® Inf. Retr. 2023, 16, 318–509. [Google Scholar] [CrossRef]
  27. Zhang, P.; Hui, W.; Wang, B.; Zhao, D.; Song, D.; Lioma, C.; Simonsen, J.G. Complex-Valued Neural Network-Based Quantum Language Models. ACM Trans. Inf. Syst. 2022, 40, 1–31. [Google Scholar] [CrossRef]
  28. Liu, Y.; Li, Q.; Wang, B.; Zhang, Y.; Song, D. A Survey of Quantum-Cognitively Inspired Sentiment Analysis Models. arXiv 2023, arXiv:2306.03608. [Google Scholar] [CrossRef]
  29. Shi, J.; Chen, T.; Lai, W.; Zhang, S.; Li, X. Pretrained Quantum-Inspired Deep Neural Network for Natural Language Processing. IEEE Trans. Cybern. 2024, 54, 5973–5985. [Google Scholar] [CrossRef]
  30. Lai, W.; Shi, J.; Chang, Y. Quantum-Inspired Fully Complex-Valued Neutral Network for Sentiment Analysis. Axioms 2023, 12, 308. [Google Scholar] [CrossRef]
  31. Yan, P.; Li, L.; Zeng, D. Quantum Probability-Inspired Graph Attention Network for Modeling Complex Text Interaction. Knowl.-Based Syst. 2021, 234, 107557. [Google Scholar] [CrossRef]
  32. Ai, W.; Shou, Y.; Meng, T.; Yin, N.; Li, K. DER-GCN: Dialogue and Event Relation-Aware Graph Convolutional Neural Network for Multimodal Dialogue Emotion Recognition. arXiv 2023, arXiv:2312.10579. [Google Scholar] [CrossRef]
  33. Joshi, A.; Bhat, A.; Jain, A.; Singh, A.V.; Modi, A. COGMEN: COntextualized GNN Based Multimodal Emotion recognitioN. arXiv 2022, arXiv:2205.02455. [Google Scholar] [CrossRef]
  34. Singh, J.; Bhangu, K.S.; Alkhanifer, A.; AlZubi, A.A.; Ali, F. Quantum Neural Networks for Multimodal Sentiment, Emotion, and Sarcasm Analysis. Alex. Eng. J. 2025, 124, 170–187. [Google Scholar] [CrossRef]
  35. Tiwari, P.; Zhang, L.; Qu, Z.; Muhammad, G. Quantum Fuzzy Neural Network for Multimodal Sentiment and Sarcasm Detection. Inf. Fusion 2024, 103, 102085. [Google Scholar] [CrossRef]
  36. Arnett, C.; Jones, E.; Yamshchikov, I.P.; Langlais, P.-C. Toxicity of the Commons: Curating Open-Source Pre-Training Data. arXiv 2024, arXiv:2410.22587. [Google Scholar] [CrossRef]
  37. Buehler, M.J. PRefLexOR: Preference-Based Recursive Language Modeling for Exploratory Optimization of Reasoning and Agentic Thinking. arXiv 2024, arXiv:2410.04715. [Google Scholar] [CrossRef]
  38. Li, X.; Gao, M.; Zhang, Z.; Yue, C.; Hu, H. Selection of LLM Fine-Tuning Data Based on Orthogonal Rules. arXiv 2024, arXiv:2410.04715. [Google Scholar] [CrossRef]
Figure 1. RECCON-DD dialogue.
Figure 2. Overall Architecture of the Quantum-inspired Pretrained Feature Embedding (ImprovedQPFE) Model.
Figure 3. Multi-Head Attention Weight Visualization.
Figure 4. t-SNE Feature Space Visualization: Emotion Class Clustering (Real Data).
Figure 5. Visual Comparison of Performance with Baseline Methods.
Figure 6. Detailed Performance Comparison on the RECCON-IEM Dataset.
Figure 7. Performance Comparison Organized by Evaluation Metric on the RECCON-IEM Dataset.
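Table 1. Performance comparison of ImprovedQPFE and baseline models on the RECCON-DD (DD) and RECCON-IEM (IEM) datasets. All values are percentages; a dash marks a metric not reported for that model.

| Group | Model | DD Pos-F1 | DD Neg-F1 | DD Macro-F1 | IEM Pos-F1 | IEM Neg-F1 | IEM Macro-F1 | IEM Recall | IEM Accuracy |
|---|---|---|---|---|---|---|---|---|---|
| 1 | DeepTransformer | - | - | - | 80.41 | 91.99 | 86.20 | 87.09 | 88.63 |
| 1 | ResidualCNN | - | - | - | 92.50 | 97.03 | 94.77 | 95.22 | 95.74 |
| 1 | HybridCNNRNN | - | - | - | 89.65 | 95.82 | 92.73 | 93.53 | 94.04 |
| 2 | Deep LSTM | 66.87 | 88.99 | 77.93 | - | - | - | - | - |
| 2 | Deep CNN | 38.88 | 85.90 | 62.39 | - | - | - | - | - |
| 2 | Enhanced Transformer | 61.61 | 88.05 | 74.83 | - | - | - | - | - |
| 2 | Attention Enhanced | 75.03 | 90.95 | 82.99 | - | - | - | - | - |
| 2 | Deep Residual | 45.36 | 83.85 | 64.61 | - | - | - | - | - |
| 2 | Gated CNN | 58.42 | 86.10 | 72.26 | - | - | - | - | - |
| 3 | Ours | 76.27 | 91.22 | 83.75 | 93.45 | 97.34 | 95.39 | 96.36 | 96.21 |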
Table 2. Ablation study results of ImprovedQPFE on the RECCON-DD dataset. All scores are percentages; ΔMacro-F1 is the change relative to the full model.

| Model Configuration | Macro-F1 | Pos. F1 | Neg. F1 | ΔMacro-F1 | Core Purpose and Isolated Effect |
|---|---|---|---|---|---|
| ImprovedQPFE (Full) | 83.75 | 76.27 | 91.22 | - | - |
| A1. w/o Quantum Components, w/ Strong Transformer | 80.08 | 71.45 | 88.71 | −3.67 | Net gain of the quantum paradigm vs. strong baseline |
| A2. w/o BiGRU | 81.41 | 73.56 | 89.26 | −2.34 | To verify the necessity of local sequential modeling |
| A3. w/o Quantum Attention | 81.86 | 74.18 | 89.54 | −1.89 | To assess the role of global semantic focusing |
| A4. w/o Contrastive Learning | 82.52 | 75.33 | 89.71 | −1.23 | To validate feature discriminability enhancement |
| A5. w/o Phase Pre-training | 82.92 | 75.85 | 89.99 | −0.83 | To evaluate the stability of parameter initialization |
| A6. w/o Quantum Measurement | 83.28 | 76.05 | 90.51 | −0.47 | To assess the role of the feature readout mechanism |
Note: A1 (w/o Quantum Components, w/Strong Transformer) is a critical constructed baseline. Within an advanced architecture that retains the BiGRU, Transformer blocks, and contrastive learning framework, it completely replaces all quantum-inspired components with their standard, high-performance real-valued counterparts (e.g., real-valued embeddings, standard multi-head attention, random initialization, linear projection). This variant is designed to directly quantify the pure contribution of the quantum representation paradigm against a powerful modern non-quantum baseline.
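The ΔMacro-F1 column can be recomputed directly from the Macro-F1 scores in Table 2. The sketch below does this and ranks the variants by the size of the drop; the dictionary keys are shortened labels chosen for illustration.

```python
# Macro-F1 (%) of the full model and of each ablated variant, as listed in Table 2.
FULL_MACRO_F1 = 83.75
ablation_macro_f1 = {
    "A1 w/o quantum components": 80.08,
    "A2 w/o BiGRU": 81.41,
    "A3 w/o quantum attention": 81.86,
    "A4 w/o contrastive learning": 82.52,
    "A5 w/o phase pre-training": 82.92,
    "A6 w/o quantum measurement": 83.28,
}

# Delta = variant score minus full-model score, rounded to two decimals.
deltas = {
    name: round(score - FULL_MACRO_F1, 2)
    for name, score in ablation_macro_f1.items()
}

# Sort from the largest drop to the smallest.
ranked = sorted(deltas.items(), key=lambda item: item[1])
for name, delta in ranked:
    print(f"{name}: {delta:+.2f}")
```

The recomputed differences match the tabulated ΔMacro-F1 values, and A1 shows the largest drop.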
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Zou, F.; Zou, L.; Guo, F.; Wang, X.; Weng, J.; Fang, T.; Jiang, H.; Wu, X. Enhanced Quantum-Inspired Deep Learning with Multi-Head Attention and Contrastive Learning for Text-Based Dialogue Sentiment Classification. Big Data Cogn. Comput. 2026, 10, 161. https://doi.org/10.3390/bdcc10050161
