2.2. Quantum-Inspired (Classical Approximation) Amplitude Encoding Layer
The proposed encoding layer maps each BERT token embedding into a quantum-inspired feature representation using only classical computations. The encoder is a classical approximation, not a real quantum algorithm: it is implemented entirely on classical computers using Dense layers, ReLU activation functions, and cosine and sine operations, so no quantum hardware is required. Algorithm 1 describes how this approximation is carried out by a classical neural network.
For a BERT embedding vector $x_t \in \mathbb{R}^{768}$, $t = 1, \dots, T$, where $t$ is the token index, $T$ is the total sequence length, and 768 is BERT's embedding dimension, we first apply L2 normalization:
$$\hat{x}_t = \frac{x_t}{\sqrt{\lVert x_t \rVert_2^2}}$$
where $\hat{x}_t$ is the normalized embedding and $\lVert x_t \rVert_2^2 = \sum_{i=1}^{768} x_{t,i}^2$ is the squared L2 norm. This step mirrors the requirement that quantum states satisfy $\langle \psi | \psi \rangle = 1$. It is needed because the later stages interpret squared vector components as probabilities. A rotation preserves whatever length its input has, so without normalization the arbitrary scale of the embeddings would carry through the rotations, breaking the quantum-inspired (classical approximation) analogy and distorting the probability calculations downstream. Next, we extract amplitude and phase features using two Dense (fully connected) layers, compressing 768 dimensions to 16:
$$a_t = \mathrm{Dense}_{\mathrm{amp}}(\hat{x}_t) \in \mathbb{R}^{16}, \qquad \phi_t = \mathrm{Dense}_{\mathrm{phase}}(\hat{x}_t) \in \mathbb{R}^{16}$$
Here, $a_t$ represents amplitude features (similar to the magnitude of quantum amplitudes) and $\phi_t$ represents phase features (similar to quantum phase angles). The compression from 768 to 16 is required to form the qubit-like 2D vectors. Separating amplitude and phase lets us control magnitude and direction independently in the quantum-inspired (classical approximation) representation, mimicking the amplitude-phase structure of real quantum states.
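The normalization and the two 768-to-16 projections described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the names `l2_normalize`, `dense`, `W_amp`, and `W_phase` are hypothetical, the weights are random stand-ins for learned parameters, the `eps` guard is an assumption, and activation functions are omitted because this passage does not state which one follows each layer.

```python
import math
import random

def l2_normalize(x, eps=1e-12):
    """Scale x to unit L2 norm; eps (an assumption) guards against a zero vector."""
    norm = math.sqrt(sum(v * v for v in x))
    return [v / max(norm, eps) for v in x]

def dense(x, weights, bias):
    """Plain affine map y = W x + b, standing in for a Dense layer."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

random.seed(0)
D_IN, D_OUT = 768, 16

# Toy embedding in place of a real BERT token vector.
x_t = [random.gauss(0.0, 1.0) for _ in range(D_IN)]
x_hat = l2_normalize(x_t)

# Hypothetical, randomly initialised weights; in the model these are learned.
W_amp = [[random.gauss(0.0, 0.05) for _ in range(D_IN)] for _ in range(D_OUT)]
W_phase = [[random.gauss(0.0, 0.05) for _ in range(D_IN)] for _ in range(D_OUT)]
zero_bias = [0.0] * D_OUT

amplitude = dense(x_hat, W_amp, zero_bias)   # 16 amplitude features
phase = dense(x_hat, W_phase, zero_bias)     # 16 phase features
```

After normalization the squared components of `x_hat` sum to one, and each projection returns 16 values that can later be grouped into 8 two-dimensional vectors.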
We reshape these 16 values into $16/2 = 8$ qubits, where each qubit is a 2D vector $q_j \in \mathbb{R}^2$ that mimics a quantum state, with $j = 1, \dots, 8$ indexing the qubits. We then create a weighted sum of the amplitude features $a_t$ and the phase features $\phi_t$, with the phase scaled by 0.3:
$$v_t = \mathrm{norm}(a_t + 0.3\,\phi_t)$$
where $v_t$ is the normalized weighted combination and 0.3 is a fixed weighting coefficient. The phase is weighted less (0.3) because amplitude carries the primary semantic information from the BERT embeddings, while phase provides secondary directional context. We then initialize each qubit (a 2D vector) from $v_t$ and $\mathrm{roll}(v_t, 1)$, where $\mathrm{roll}(\cdot, 1)$ performs a circular shift of the values by one position and $\mathrm{norm}(\cdot)$ normalizes the amplitude of the combined vector. This initialization introduces local correlation between adjacent features before any variational transformation (similar to superposition initialization in quantum circuits), ensuring that each 2D qubit vector $q_j$ captures neighborhood context from the beginning. After this step, within each variational layer $l$ and for each qubit $j$, we apply single-qubit rotations using classical cos/sin operations:
$$q_j \leftarrow R(\theta)\, q_j, \qquad R(\theta) = \begin{pmatrix} c & -s \\ s & c \end{pmatrix}, \qquad c = \cos(\theta/2), \quad s = \sin(\theta/2)$$
where $\theta$ is a learnable rotation parameter (one for each of the 8 qubits and each rotation type), and $c$ and $s$ are the cosine and sine of the half-rotation angle. The rotation matrix $R(\theta)$ is the real part of the Pauli rotation matrix from quantum mechanics (removing the imaginary unit but preserving the rotation structure [26]). We use cos/sin instead of other nonlinearities (e.g., tanh) because they preserve geometric rotations [27]. They also preserve the norm ($\lVert R(\theta)\, q_j \rVert_2 = \lVert q_j \rVert_2$), so the length of each qubit vector is unchanged by the transformation while semantic patterns are extracted. Finally, they provide periodic nonlinearity, which is useful for capturing repeating linguistic patterns in misinformation text [28,29].
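The norm-preservation property of the cos/sin rotation can be checked numerically. The sketch below assumes the rotation is the standard real 2x2 rotation built from the half angle; the function names `rotate` and `norm` are illustrative and the angles are arbitrary rather than learned.

```python
import math

def rotate(q, theta):
    """Apply the real 2x2 rotation built from the half angle theta / 2."""
    c, s = math.cos(theta / 2.0), math.sin(theta / 2.0)
    return [c * q[0] - s * q[1], s * q[0] + c * q[1]]

def norm(q):
    """Euclidean length of a 2D vector."""
    return math.hypot(q[0], q[1])

q = [0.6, 0.8]                       # unit-norm 2D "qubit" vector
q_rot = rotate(q, math.pi / 3)       # arbitrary angle standing in for a learned theta

# Two successive rotations equal one rotation by the summed angle.
q_twice = rotate(rotate(q, 0.4), 1.1)
q_once = rotate(q, 1.5)
```

Here `norm(q_rot)` stays at 1 up to floating-point error, and `q_twice` matches `q_once`, which is why stacking such rotations across variational layers cannot inflate or shrink a qubit vector.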
After the single-qubit rotations, we apply CNOT-like gates that simulate inter-qubit correlation through probability-based mixing (this is not real quantum entanglement). The mixing is governed by a control probability $p_j$, computed from the squared first component of the control qubit $q_j$, and it mixes the components of the target qubit $q_{j+1}$ according to that probability. This operation classically approximates the CNOT gate's effect of conditionally flipping a target qubit based on the control qubit, enabling cross-qubit feature interaction that captures long-range dependencies in the text representations [26]. We additionally apply a circular CNOT between the first qubit $q_1$ and the last qubit $q_8$ to introduce global correlation across all 8 feature groups, preventing information isolation at the boundaries.
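One way to picture probability-based mixing is as a convex blend between a target vector and its component-swapped ("flipped") version. The sketch below is an assumption made for illustration only: the exact mixing rule, the clipping of the probability, the sequential update order, and the direction of the circular link are not given in the text, and the names `cnot_like` and `entangle_ring` are hypothetical.

```python
def cnot_like(control, target):
    """Blend target with its swapped version using p = control[0] ** 2.

    Assumed rule: p = 0 leaves the target unchanged, p = 1 fully swaps it.
    """
    p = min(max(control[0] ** 2, 0.0), 1.0)   # clip so p is a valid probability
    flipped = [target[1], target[0]]
    return [(1.0 - p) * t + p * f for t, f in zip(target, flipped)]

def entangle_ring(qubits):
    """Apply the mixing to adjacent pairs, then close the ring."""
    out = [list(q) for q in qubits]           # copy so the input is not mutated
    n = len(out)
    for j in range(n - 1):
        # Sequential update: qubit j (already mixed) controls qubit j + 1.
        out[j + 1] = cnot_like(out[j], out[j + 1])
    # Circular link (assumed direction): the last qubit controls the first.
    out[0] = cnot_like(out[n - 1], out[0])
    return out

qubits = [[0.6, 0.8] for _ in range(8)]
mixed = entangle_ring(qubits)
```

With a control of `[0.0, 1.0]` the target passes through untouched, while a control of `[1.0, 0.0]` swaps its two components, which is the classical analogue of a conditional flip.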
We next compute a classical probability for each of the 8 qubits, $p_j = q_{j,1}^2 + \epsilon$, where $q_{j,1}^2$ is the squared first component of qubit $q_j$ (similar to Born's rule in quantum measurement, where $P(0) = |\langle 0 | \psi \rangle|^2$) and $\epsilon$ is a small numerical constant that avoids $\log(0)$ in the entropy calculation. We then normalize these probabilities across all qubits, $\tilde{p}_j = p_j / \sum_{k=1}^{8} p_k$, ensuring $\sum_{j=1}^{8} \tilde{p}_j = 1$ so that they form a valid probability distribution. Using this distribution, we calculate the Shannon entropy, $H = -\sum_{j=1}^{8} \tilde{p}_j \log \tilde{p}_j$, which measures the uncertainty (spread) of information across the 8 qubits: high entropy means the features are distributed evenly, while low entropy means the representation is concentrated in a few qubits. The entropy signal $H$ is then used to compute adaptive importance weights for each qubit via $w = \mathrm{sigmoid}(\mathrm{Linear\_ent}(H))$, where Linear_ent is a Dense layer that maps the 2D entropy vector $H$ to qubit-level importance scores (one weight per qubit per batch), and the sigmoid maps each score into the range (0, 1). This entropy-driven weighting allows the model to focus automatically on the most informative qubits (those indicative of misinformation) for each input sample rather than treating all qubits equally. We then stack all qubit vectors into $U \in \mathbb{R}^{B \times 8 \times 2}$ ($B$ is the batch size, with 8 qubits of 2 dimensions each) and apply the entropy weighting via element-wise multiplication, $U_w = U \odot w$, which yields importance-scaled qubit features. Finally, we concatenate the raw (unweighted) representation $\mathrm{flat}(U)$ and the entropy-weighted representation $\mathrm{flat}(U_w)$ into $z = [\mathrm{flat}(U) \,;\, \mathrm{flat}(U_w)]$, where flat reshapes (B, 8, 2) to (B, 16), so the concatenation gives 16 + 16 = 32 dimensions. This concatenation preserves both the original feature structure ($\mathrm{flat}(U)$) and the entropy-modulated view ($\mathrm{flat}(U_w)$), giving richer information for the final projection. The 32-dimensional vector $z$ is then projected back to 768 dimensions using a Dense layer:
$$y = \mathrm{LayerNorm}(\mathrm{ReLU}(\mathrm{Linear\_out}(z)))$$
where Linear_out maps $\mathbb{R}^{32} \to \mathbb{R}^{768}$, $\mathrm{ReLU}(x) = \max(0, x)$ introduces nonlinearity and sparsity, and LayerNorm stabilizes training by normalizing activations across the 768 dimensions. The final output $y$ is a 768-dimensional, classically approximated quantum-inspired feature embedding for each input, ready to be passed to the temporal transformer encoder layer. Following [26,28,30], Algorithm 1 as a whole approximates a variational quantum circuit using classical operations, avoiding quantum hardware while retaining the benefits of rotation-based feature transformation.
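The probability, entropy, weighting, and concatenation steps can be traced on a single sample. This is a minimal sketch under stated assumptions: the entropy layer is reduced to one hypothetical scale and bias per qubit (`ent_scale`, `ent_bias`), the value of `EPS` is assumed, batching is dropped, and the final 32-to-768 projection is left out.

```python
import math

EPS = 1e-8   # assumed value for the small constant that avoids log(0)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def entropy_weighted_features(qubits, ent_scale, ent_bias):
    """Return (probs, entropy, weights, 32-dim feature list) for one sample."""
    raw = [q[0] ** 2 + EPS for q in qubits]          # Born-like probability per qubit
    total = sum(raw)
    probs = [p / total for p in raw]                 # normalised to sum to 1
    entropy = -sum(p * math.log(p) for p in probs)   # Shannon entropy (natural log)

    # Stand-in for the entropy Dense layer: one scalar weight per qubit.
    weights = [sigmoid(a * entropy + b) for a, b in zip(ent_scale, ent_bias)]
    weighted = [[w * v for v in q] for w, q in zip(weights, qubits)]

    flat_raw = [v for q in qubits for v in q]        # (8, 2) -> 16
    flat_weighted = [v for q in weighted for v in q]
    return probs, entropy, weights, flat_raw + flat_weighted   # 16 + 16 = 32

qubits = [[0.6, 0.8] for _ in range(8)]              # identical qubits: uniform case
probs, entropy, weights, features = entropy_weighted_features(
    qubits, ent_scale=[0.5] * 8, ent_bias=[0.0] * 8
)
```

Because all eight qubits are identical in this example, the distribution is uniform and the entropy reaches its maximum of log 8 (about 2.08 nats). A sample whose probability mass sits on one qubit would give an entropy near zero.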
Algorithm 1: Quantum-inspired (Classical Approximation) Encoding Layer [algorithm listing image]
2.3. Dual-Attention Temporal Transformer Encoder
Algorithm 2 describes how the dual-attention temporal transformer computes the final, temporally rich hidden representations. The encoder operates on the quantum-enhanced sequence $X$ generated by the quantum-inspired (classical approximation) amplitude-encoding layer. It processes this sequence through $L$ stacked transformer blocks, each combining standard multi-head self-attention with a new temporal attention bias. The query, key, and value projections follow the standard multi-head self-attention formulation of the transformer architecture (Vaswani et al. [31]). In each block, the input sequence is converted into query, key, and value projections:
$$Q = X W_Q, \qquad K = X W_K, \qquad V = X W_V$$
where $W_Q$, $W_K$, and $W_V$ are learnable projection matrices. The scaled dot-product attention mechanism is adopted from [31] and augmented with a learnable temporal bias:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}} + B\right) V$$
where $d_k$ is the key dimension and $B$ is a learnable temporal bias matrix [32] whose entries scale the temporal distance between token pairs by the parameter β and the maximum temporal span. This encourages the model to weight context according to temporal distance. Temporal ordering is included through a time-aware positional encoding adopted from Kim and Lee [32]: the usual sinusoidal positional encoding of Vaswani et al. [31] is extended with a learnable temporal component. In this encoding, $PE$ denotes the positional encoding, $pos$ is the token position in the sequence, $d_{model}$ is the embedding dimension, γ is a learnable scalar that controls the strength of the temporal component, $k$ is the dimension index, and $t$ is the temporal index of the token. The dual attention within each head (a combination of [31,32]) then concatenates the standard and temporally biased attention outputs along the head dimension, and the result is linearly projected to obtain the final block output. In this projection, $W_O$ is the learnable matrix that combines the concatenated dual-attention outputs and projects them back into the model-space representation before layer normalization. Through the $L$ stacked layers, the encoder converts the quantum-enhanced sequence $X$ into the final hidden representation, which is then sent to the next fusion layer.
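To make the effect of an additive temporal bias concrete, the following sketch computes scaled dot-product attention with a bias term added to the scores before the softmax. The specific bias form used here (a negative penalty proportional to the time gap, with a fixed `beta`) is an assumption for illustration; in the model the bias is learnable, and the function names are hypothetical.

```python
import math

def softmax(row):
    m = max(row)                                   # subtract max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def temporal_bias(timestamps, beta, max_span):
    """Assumed bias: penalise token pairs by their scaled time distance."""
    return [[-beta * abs(ti - tj) / max_span for tj in timestamps] for ti in timestamps]

def biased_attention(Q, K, V, bias):
    """softmax(Q K^T / sqrt(d_k) + bias) V, written out for small lists."""
    d_k = len(K[0])
    outputs, weights = [], []
    for i, q in enumerate(Q):
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) + bias[i][j]
            for j, k in enumerate(K)
        ]
        w = softmax(scores)
        weights.append(w)
        outputs.append([sum(w[j] * V[j][c] for j in range(len(V))) for c in range(len(V[0]))])
    return outputs, weights

timestamps = [0.0, 1.0, 10.0]
Q = K = [[0.0, 0.0] for _ in range(3)]   # identical queries and keys: only the bias matters
V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

_, weights = biased_attention(Q, K, V, temporal_bias(timestamps, beta=1.0, max_span=10.0))
_, uniform = biased_attention(Q, K, V, temporal_bias(timestamps, beta=0.0, max_span=10.0))
```

With `beta=1.0` the first token attends more to the token one time unit away than to the one ten units away, whereas `beta=0.0` recovers uniform weights because the content scores are identical in this toy setup.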
Algorithm 2: Dual-Attention Temporal Transformer Encoder [algorithm listing image]
2.4. Propagation Graph Attention Module
The propagation graph attention module is a separate, standalone module for modeling propagation graphs. Algorithm 3 describes how the propagation graphs are built. The module models how misinformation spreads through social networks, capturing patterns that are invisible to text-only analysis. For datasets that include explicit social network information, such as FakeNewsNet and PHEME, which contain Twitter follow relationships, we build the propagation graph $G = (V, E)$ directly from these connections. We use retweet, reply, and share edges that occur within 4 h temporal windows [2,33,34] to form the graph.
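A simple way to read the 4 h window is as a filter on the delay between a parent post and the interaction it triggers. The sketch below builds an adjacency list under that reading; the record layout `(src_user, dst_user, edge_type, parent_ts, event_ts)`, the function name, and the window interpretation are all assumptions, and this is not a reproduction of Algorithm 3.

```python
from collections import defaultdict

WINDOW_SECONDS = 4 * 3600                      # the 4 h temporal window
EDGE_TYPES = {"retweet", "reply", "share"}     # edge types named in the text

def build_propagation_graph(interactions):
    """Keep retweet/reply/share edges whose delay falls inside the window.

    interactions: iterable of (src_user, dst_user, edge_type, parent_ts, event_ts),
    with timestamps in seconds (a hypothetical record layout).
    """
    nodes = set()
    adjacency = defaultdict(list)
    for src, dst, edge_type, parent_ts, event_ts in interactions:
        delay = event_ts - parent_ts
        if edge_type in EDGE_TYPES and 0 <= delay <= WINDOW_SECONDS:
            nodes.update((src, dst))
            adjacency[src].append((dst, edge_type, delay))
    return nodes, dict(adjacency)

interactions = [
    ("u1", "u2", "retweet", 0, 3600),          # 1 h delay: kept
    ("u2", "u3", "reply", 3600, 21600),        # 5 h delay: outside the window
    ("u1", "u4", "like", 0, 60),               # edge type not used for the graph
]
nodes, adjacency = build_propagation_graph(interactions)
```

Storing the delay on each edge keeps the timing information available for a later attention step that conditions on temporal distance.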
Algorithm 3: Propagation Graph Construction [algorithm listing image]
Algorithm 4 describes how propagation graph attention, neighborhood node aggregation, and global graph representations are computed from the heterogeneous propagation graphs $G$ produced by Algorithm 3. In this algorithm, $\mathbb{R}$ denotes tensors of real numbers whose shapes vary with the representation being computed. Using Algorithm 4, we model the spatio-temporal evolution of claims through social networks with a graph attention layer that operates on the previously generated heterogeneous propagation graphs. We compute attention weights for neighboring users:
$$\alpha_{ij} = \frac{\exp\!\left(a^\top [h_i \,\Vert\, h_j \,\Vert\, \tau_{ij}]\right)}{\sum_{k \in \mathcal{N}(i)} \exp\!\left(a^\top [h_i \,\Vert\, h_k \,\Vert\, \tau_{ik}]\right)}$$
where $h_i$ and $h_j$ are hidden representations, $\tau_{ij}$ encodes the temporal delay, and $a$ is a learnable attention vector. $\tau_{ij}$ is computed as a learnable linear projection of the min-max normalized timestamp difference $\Delta t_{ij}$, with output dimension $d_\tau$. The node update is defined as:
$$h_i' = \sigma\!\left(\sum_{j \in \mathcal{N}(i)} \alpha_{ij}\, h_j\right)$$
where $\mathcal{N}(i)$ is the neighborhood of node $i$ and σ is an activation function. This layer captures how different user groups spread information and identifies characteristic propagation patterns of misinformation versus real news (e.g., rapid spread versus gradual organic dissemination).
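The attention-and-aggregate step can be illustrated on a three-node toy graph. The sketch assumes the score for an edge is a dot product between a learnable vector and the concatenation of both hidden states with the temporal feature, normalized by a softmax over the neighborhood, and it uses ReLU as the activation. These choices, the hand-picked vector `a`, and the function names are assumptions rather than details taken from Algorithm 4.

```python
import math

def attention_weights(h, neighbors, tau, a, i):
    """Softmax over neighbours j of the score a . [h_i || h_j || tau_ij]."""
    scores = []
    for j in neighbors[i]:
        feat = h[i] + h[j] + tau[(i, j)]           # list concatenation
        scores.append(sum(w * v for w, v in zip(a, feat)))
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def update_node(h, neighbors, tau, a, i):
    """Attention-weighted sum of neighbour states followed by ReLU (assumed sigma)."""
    alpha = attention_weights(h, neighbors, tau, a, i)
    dim = len(h[i])
    agg = [sum(al * h[j][c] for al, j in zip(alpha, neighbors[i])) for c in range(dim)]
    return [max(0.0, v) for v in agg]

h = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}  # toy hidden states
neighbors = {0: [1, 2]}
tau = {(0, 1): [0.1], (0, 2): [0.9]}               # normalised delays as 1D features
a = [0.0, 0.0, 0.0, 0.0, -1.0]                     # hand-picked: penalise long delays only

alpha = attention_weights(h, neighbors, tau, a, 0)
h0_new = update_node(h, neighbors, tau, a, 0)
```

In this example the neighbor with the shorter delay receives the larger weight, so its state dominates the updated representation of node 0.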
Algorithm 4: Propagation Graph Attention Computation [algorithm listing image]
2.6. Datasets
Table 2 summarizes the two benchmark datasets used to evaluate the proposed QuST-TF model. First, FakeNewsNet [33] includes about 23,200 news articles: 1056 from PolitiFact (432 fake, 624 real) and 22,140 from GossipCop (5323 fake, 16,817 real). It also provides tweet graphs, user interaction networks, and social engagement data. To avoid temporal leakage, we split FakeNewsNet [33] by time: articles before 2018 were used for training, 2019 for validation, and 2020 onward for testing. Because of the roughly 1:3 fake-to-real class imbalance, we created a balanced test set of 2000 articles, equally split between fake and real news. Next, PHEME [28] consists of 6425 rumor threads (2402 misinformation and 4023 non-misinformation) from nine real-world events. These threads comprise 105,354 tweets, with a veracity label for each thread (true, false, or unverified). For PHEME [28], we used leave-one-event-out evaluation to avoid cross-event leakage. This resulted in a test set of about 1359 threads (708 misinformation, 651 non-misinformation). For both datasets, user IDs were anonymized, URLs were removed from content features, and publication timestamps were excluded from model inputs to prevent indirect leakage.
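Event-level splitting is what prevents threads from the same event appearing on both sides of a split. The sketch below shows a generic leave-one-event-out generator; the thread record layout (a dictionary with an `event` key) and the event names are hypothetical and do not reflect the actual PHEME file format.

```python
def leave_one_event_out(threads):
    """Yield (held_out_event, train, test) with no event shared across the split."""
    events = sorted({t["event"] for t in threads})
    for held_out in events:
        train = [t for t in threads if t["event"] != held_out]
        test = [t for t in threads if t["event"] == held_out]
        yield held_out, train, test

# Toy threads with made-up event names.
threads = [
    {"id": 1, "event": "event_a", "label": "misinformation"},
    {"id": 2, "event": "event_a", "label": "non-misinformation"},
    {"id": 3, "event": "event_b", "label": "misinformation"},
    {"id": 4, "event": "event_c", "label": "non-misinformation"},
]
folds = list(leave_one_event_out(threads))
```

Each fold holds out every thread of one event, so vocabulary or user overlap that is specific to an event cannot leak from training into testing.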