Article

Deep One-Directional Neural Semantic Siamese Network for High-Accuracy Fact Verification

by Muchammad Naseer, Jauzak Hussaini Windiatmaja, Muhamad Asvial and Riri Fitri Sari *
Department of Electrical Engineering, Faculty of Engineering, Universitas Indonesia, Depok 16424, Indonesia
* Author to whom correspondence should be addressed.
Big Data Cogn. Comput. 2025, 9(7), 172; https://doi.org/10.3390/bdcc9070172
Submission received: 8 May 2025 / Revised: 20 June 2025 / Accepted: 25 June 2025 / Published: 30 June 2025
(This article belongs to the Special Issue Machine Learning and AI Technology for Sustainable Development)

Abstract

Fake news has eroded trust in credible news sources, driving the need for tools to verify the accuracy of circulating information. Fact verification addresses this issue by classifying claims as Supports (S), Refutes (R), or Not Enough Info (NEI) based on evidence. Neural Semantic Matching Networks (NSMN) is an algorithm designed for this purpose, but its reliance on BiLSTM has shown limitations, particularly overfitting. This study aims to enhance NSMN for fact verification through a structured framework comprising encoding, alignment, matching, and output layers. The proposed approach employed Siamese MaLSTM in the matching layer and introduced the Manhattan Fact Relatedness Score (MFRS) in the output layer, culminating in a novel algorithm called Deep One-Directional Neural Semantic Siamese Network (DOD–NSSN). Performance evaluation compared DOD–NSSN with NSMN and transformer-based algorithms (BERT, RoBERTa, XLM, XL-Net). Results demonstrated that DOD–NSSN achieved 91.86% accuracy and consistently outperformed other models, achieving over 95% accuracy across diverse topics, including sports, government, politics, health, and industry. The findings highlight the DOD–NSSN model’s capability to generalize effectively across various domains, providing a robust tool for automated fact verification.

1. Introduction

Fact verification has evolved into a crucial mechanism that goes beyond generating binary decision labels, as is done in fact-checking methods [1] and hoax detection [2]. Fact verification categorizes information based on evidence that either supports or refutes a claim, or indicates that the available information is insufficient. The Fact Extraction and VERification (FEVER) Dataset [3] was designed specifically for this task and provides the labels SUPPORTS, REFUTES, and Not Enough Information (NEI) [4]. The Neural Semantic Matching Networks (NSMN) algorithm [5] was developed to leverage this dataset. NSMN employs a Bidirectional Long Short-Term Memory (BiLSTM) architecture [6], a neural network model with strong contextual understanding capabilities. However, NSMN shows limitations in fact verification, achieving an accuracy below 70%; accuracy in the 60–70% range corresponds to the lower threshold of human trust in such models [7].
In the FEVER Dataset, fact verification is performed by comparing claim sentences with evidence to determine their similarity. Existing models, such as NSMN and transformer-based architectures like BERT and RoBERTa, often face overfitting due to architectural complexity and limited generalization across diverse domains. NSMN, in particular, performs suboptimally on real-world datasets like FEVER, largely because of its dependence on BiLSTM, which introduces high computational cost and limited semantic discrimination. This likely contributes to overfitting, where models achieve high training accuracy but degrade during testing [8]. BiLSTM processes sequences bidirectionally to capture context and generate classification probabilities. While transformers improve contextual representation, they rely on feature-based classification and lack explicit semantic similarity modeling for fact-evidence alignment. These limitations expose a gap: the need for lightweight, directionally optimized models that ensure semantic relevance and interpretability for robust, domain-independent fact verification. Addressing this gap calls for a novel architecture that combines unidirectional semantic encoding with principled relatedness scoring to enhance generalization and reduce overfitting.
Another LSTM-based algorithm, the Siamese Manhattan Long Short-Term Memory (Siamese MaLSTM), can be utilized to measure the semantic similarity between two sentences. Siamese MaLSTM compares two text inputs from the initial stage to produce a closeness score based on the Manhattan distance. The hypothesis of this study is that Siamese MaLSTM can effectively verify facts by measuring text similarity. Additionally, Siamese MaLSTM involves only one-directional training, making the model simpler and reducing the risk of overfitting. Structurally, Siamese MaLSTM has the potential to measure the similarity between claim sentences and evidence sentences more accurately when combined with the FEVER Dataset, without requiring modifications to the dataset structure to achieve classification.
To address the challenges of fact verification, this research developed a new model called the Deep One-Directional Neural Semantic Siamese Network (DOD–NSSN), featuring a layered architecture consisting of an encoding layer, an alignment layer, a matching layer, and a modified output layer. In the encoding layer, Google News Word2Vec was used to encode the text, while in the alignment layer, the Split and Zero Padding technique was implemented. In the matching layer, the Siamese MaLSTM algorithm was applied to measure the closeness between claims and evidence. Additionally, this research introduced a new metric, the Manhattan Fact Relatedness Score (MFRS), which was integrated into the output layer to assess fact classification more accurately.
To evaluate the performance of the DOD–NSSN algorithm, a comparative analysis was conducted with transformer algorithms, including BERT, RoBERTa, XLM, and XL-Net, which are also used for Natural Language Processing (NLP) [9,10,11] and fact-checking or verification tasks. These algorithms were selected based on literature studies showing that they achieve an average accuracy of over 70% in text classification. Transformer algorithms were compared with DOD–NSSN because both the transformers and NSMN share a similar way of understanding text, performing classification through pattern recognition during training. DOD–NSSN, however, took a different approach from the aforementioned algorithms. While the other algorithms began with feature extraction followed by classification modeling, DOD–NSSN performed regression between claims and evidence before classification. The results of this regression were then used to measure the Manhattan distance, with the closeness score being employed to generate the classification. This study serves as a benchmark for comparing the performance of DOD–NSSN with transformer algorithms and NSMN, which are specifically designed for fact verification.
In summary, based on the literature findings and the formulated hypotheses, the main contributions of this research include:
  • The development of the DOD–NSSN fact verification model with a modified architecture across four main layers: the encoding layer, alignment layer, matching layer, and output layer. The modifications include the introduction of Siamese MaLSTM to process two text inputs (claim and evidence) from two LSTM models at the early stage of sentence processing, enabling classification based on Manhattan distance to achieve high accuracy.
  • The introduction of MFRS as an evaluative metric designed to enhance precision in fact verification classification.
The remainder of this paper is organized as follows: Section 2 discusses related work on fact verification. Section 3 explains DOD–NSSN, describing the components of the newly proposed fact verification model and its implementation. Section 4 presents the findings of this research, including the performance of the proposed model and other models in the context of fact verification, with concluding remarks in Section 5.

2. Related Works

The FEVER Dataset is a primary reference dataset for fact verification [12,13], utilizing three labels as the output of fact assessment: SUPPORTS (the evidence supports the claim), REFUTES (the evidence refutes the claim), and NOT ENOUGH INFO (there is not enough evidence to support or refute the claim). The initial construction of the FEVER Dataset was based on text from Wikipedia and involved extracting claims from lengthy encyclopedic texts and generating factual statements. Although other fact verification datasets exist, such as HOVER [14], which employs semantic matching techniques, and VITAMIN C [15], which was constructed by combining data from Wikipedia and web searches, a study [16] compared these datasets by assessing the ability of models to detect information based on evidence. The FEVER, HOVER, and VITAMIN C datasets were compared using transformer models such as BERT and RoBERTa [16]. The results consistently showed that, across all models, the FEVER Dataset achieved higher accuracy than the other datasets, exceeding 85% accuracy with BERT and RoBERTa for fact verification.
Word embedding is a technique used in natural language processing that plays a crucial role in fact verification. Word embedding is the process of converting words or phrases into vectors in a multidimensional space that reflects their semantic relationships and contextual information [17]. By representing words as vectors, WordNet-based Word2Vec captures contextual and semantic relationships that contribute to fact verification tasks. The integration of WordNet into Word2Vec enhances the algorithm’s capacity to understand and model semantic relationships by incorporating a vast vocabulary network, including synonyms, hyponyms, and hypernyms [18,19]. Although WordNet Word2Vec is well-suited to traditional text-based articles, it requires adjustments for the unique characteristics of social media content, particularly for adapting to real-time information presented in news formats. Social media platforms deliver constantly evolving and rapidly updated information, enabling users to share news updates and various types of content in real time. In this context, Google News Word2Vec becomes highly relevant, as it is trained on a broader corpus than WordNet Word2Vec, with a specific focus on news articles [20].
The process of presenting facts and manually searching for supporting or refuting evidence is time-consuming [21,22,23,24]. Neural Semantic Matching Networks (NSMN) is a machine learning method specifically designed to handle fact verification tasks. NSMN consists of four layers: the encoding layer, alignment layer, matching layer, and output layer. NSMN plays an important role in semantic matching between two sequences of text [5]. Several studies have utilized NSMN for fact verification purposes [25]; however, based on findings from the literature review, the accuracy of the NSMN model does not appear to outperform other fact verification algorithms. The proposed model follows the layer sequence of the NSMN model.
NSMN incorporates Bidirectional Long Short-Term Memory (BiLSTM) into the network layers to process input text in both directions [26]. To manage long-term dependencies and handle larger inputs, memory cells have been integrated into the recurrent network. Another LSTM model, the Siamese Manhattan Long Short-Term Memory (Siamese MaLSTM), has been relatively underutilized and less explored than other models, despite studies favoring BiLSTM for text classification tasks [27]. Research reported in [28], which aimed to identify duplicate questions in text, demonstrated that applying Siamese MaLSTM resulted in a well-performing model, achieving an accuracy of 95%. This finding shows the potential of applying Siamese MaLSTM to text classification. This study applies Siamese MaLSTM to fact verification by modifying the NSMN architecture, supported by a literature review of both models. This research focuses on developing a high-accuracy fact verification model using the indicators in Table 1.
Table 1 outlines the key aspects of the innovations proposed in this research. First, in the context of syntactic and textual features, this research reviews whether other studies use fact verification-specific datasets and apply word embedding to enrich text representation. Only studies [5,30] utilized a combination of fact verification datasets with word embedding. This research implemented these advantages in DOD–NSSN to capture broader linguistic nuances in the text, thereby enhancing the model’s understanding of input text during the training process. Furthermore, in terms of modeling, this research examined whether each model supports further model development and integration. All studies developed models except for study [30], which focused on model application. DOD–NSSN is designed to accommodate model development, ensuring performance improvement through updates and algorithm adjustments for fact verification. This research also reviewed verification features, including evidence retrieval to determine whether information supports, refutes, or lacks sufficient evidence related to a given claim. All evaluated models, except for [29,31], supported this feature. DOD–NSSN not only supports this feature but also contributes to achieving an accuracy of over 70%, which provides reliability in delivering verification that surpasses the human trust threshold; this threshold was not met in studies [5,29,31]. Furthermore, this research conducted a domain-specific review of sentence similarity using a similarity vector in the text, which measures the similarity score between claim sentences and evidence sentences. Only studies [28,31] did not use similarity scores, relying instead on probability values to produce classifications. Finally, this research examined various syntactic and textual features, including word embeddings, which significantly strengthen the foundation of this study. To address the limitations in accuracy, this study draws inspiration from recent findings on semantic associations [32,33], which open up opportunities for developing more effective semantic matching strategies in the context of fact verification.
NSMN has also been applied to fact verification in several other studies [25,34,35]; however, its reported accuracy does not appear to outperform other fact verification algorithms. Although BiLSTM remains the favored recurrent architecture for text classification tasks [27,36], the Siamese MaLSTM has been comparatively underexplored. Research reported in [37], which sought to automatically detect duplicate questions within a sizeable corpus of textual data, demonstrated that the Siamese MaLSTM architecture produced a robust and reliable model, achieving 95% accuracy on the evaluation set. These encouraging results underscore the considerable potential and versatility of Siamese MaLSTM, not only for duplicate-question detection but also for a broader range of text classification tasks in natural language processing pipelines.
In summary, we introduce DOD–NSSN, a novel fact verification model that employs a one-directional Siamese MaLSTM architecture and leverages Manhattan distance-based relatedness scoring to measure the semantic similarity between claims and evidence. This design overcomes the overfitting issues in NSMN that arise from BiLSTM’s bidirectional complexity and high computational overhead. We also develop the Manhattan Fact Relatedness Score (MFRS), a new evaluative metric that directly measures semantic closeness rather than relying on traditional probability-based classification. This approach addresses the interpretability challenges of transformer models such as BERT and RoBERTa, which lack explicit similarity measurement and depend on feature-based classification.

3. DOD–NSSN

This section discusses the architecture of the proposed model, DOD–NSSN. The process begins with preprocessing and feature construction, followed by the extraction stage. An overview of the DOD–NSSN development carried out in this study is also provided, and the pseudocode of the DOD–NSSN algorithm is outlined in Algorithm 1. The implementation of Siamese MaLSTM in this research places the claim on the left side of the sentence pair and the evidence on the right side. DOD–NSSN takes two inputs: the claim text (claim_text) and the evidence text (evidence_texts). The output of the algorithm consists of a score (score) and a prediction (pred). The algorithm iterates through each piece of evidence text in the list of evidence texts (evidence_texts). The claim text and evidence text are encoded using Google News Word2Vec to generate vector representations of the text.
Algorithm 1: Pseudocode of DOD–NSSN.
1.  Input: input1 (Cl), input2 (Ev)
2.  Output: score (y), pred (C)
3.  Initialize: score ← NULL, fact_relatedness_score ← 1
4.  for each Cl in Ev do:
5.    claim_text_encoded ($\overline{Cl}$) ← encode(Cl)
6.    evidence_text_encoded ($\overline{Ev}$) ← encode(Ev)
7.    alignment_matrix (Zp) ← align($\overline{Cl}$, $\overline{Ev}$)
8.    matching_matrix (MaLSTM) ← match($\overline{Cl}_{zp}$, $\overline{Ev}_{zp}$)
9.    fact_relatedness_score ← manhattan_distance(matching_matrix)
10.   y ← dot_product(fact_relatedness_score, $\overline{Cl}$)
11.   if y >= 0.57 and y <= 1.0:
12.     C ← “SUPPORTS”
13.   else if y >= 0.0 and y <= 0.4499:
14.     C ← “REFUTES”
15.   else:
16.     C ← “NEI”
17. return C
Next, an alignment process is performed between the encoded claim text and the encoded evidence text to produce a matrix. The matching operation compares matrix elements to calculate a similarity score, which is computed using the Manhattan distance method (manhattan_distance). This similarity score serves as a measure of the relevance of the evidence text to the claim. Subsequently, a dot product multiplication is performed between the fact-relatedness score (fact_relatedness_score) and the vector representation of the claim text (claim_text_encoded) to generate a score value (score). Predictions are made as SUPPORTS or REFUTES when the score falls within the corresponding ranges; if the score does not fall within either range, the result is predicted as Not Enough Information (NEI). An illustration of the model is presented in Figure 1.
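As a concrete illustration of this flow, the minimal Python sketch below mirrors the matching and output steps for a single already-encoded, already-padded claim/evidence pair. The exp(−L1) squashing used here is the similarity form of the original MaLSTM formulation and is an assumption; the released DOD–NSSN implementation may compute the MFRS differently, and all names and values in the sketch are illustrative rather than the authors’ code.

```python
import numpy as np

def dod_nssn_predict(claim_zp: np.ndarray, evidence_zp: np.ndarray):
    """Toy matching + output flow for one encoded, zero-padded claim/evidence pair."""
    # Matching layer: Manhattan (L1) distance between the two padded matrices,
    # squashed into (0, 1] as in the original MaLSTM similarity (an assumption here).
    y = float(np.exp(-np.abs(claim_zp - evidence_zp).sum()))
    # Output layer: map the relatedness score onto the FEVER labels (Section 3.3.4 thresholds).
    if 0.57 <= y <= 1.0:
        label = "SUPPORTS"
    elif y <= 0.4399:
        label = "REFUTES"
    else:
        label = "NEI"
    return y, label

# Toy usage with two random 20-token x 300-dimensional padded matrices.
rng = np.random.default_rng(0)
claim_mat, evidence_mat = rng.normal(scale=0.001, size=(2, 20, 300))
print(dod_nssn_predict(claim_mat, evidence_mat))
```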

3.1. DOD–NSSN Preprocessing

To achieve optimal classification results, the model undergoes several stages before processing the text. The FEVER Dataset has limitations that impact the development of the DOD–NSSN model. First, the dataset focuses on claims sourced from Wikipedia articles, which may introduce bias toward subjects more frequently represented in the encyclopedia. This constraint can limit the model’s ability to generalize when handling claims from various sources and subjects. To address this limitation, we conducted additional evaluations to ensure that DOD–NSSN can adapt well to various topics and contexts outside the scope of the FEVER dataset, thereby demonstrating the model’s reliability for application in fact verification systems. Before DOD–NSSN processing begins, sentence alignment is performed to ensure phrase consistency. Additionally, the implementation of the Manhattan Fact Relatedness Score (MFRS) is a novel approach that generalizes patterns from existing data to calculate the distance between claims and evidence. Figure 2 illustrates the preprocessing steps.
By addressing these dataset limitations with appropriate strategies, the DOD–NSSN model is established as a reliable tool for real-world fact verification. The initial stage of language processing in the textual context includes tokenization, stopword removal, and vectorization procedures. Specifically, this research describes the preprocessing of the FEVER Dataset in more detail before it is implemented into the model (a short Python sketch follows the list):
  • The dataset is processed and consists of two categories: sentence 1 (as the claim) and sentence 2 (as the evidence).
  • Tokenizing sentences splits the text of each sentence into individual tokens or words. For example, the sentence “Coffee stunts growth in children.” becomes [“Coffee”, “stunts”, “growth”, “in”, “children”, “.”] after tokenization.
  • After tokenization, the next step is to remove stopwords from the text. Stopwords are common words that do not provide much information about the context or meaning of the sentence. For example, the result of stopword removal using the previous sentence becomes [“Coffee”, “stunts”, “growth”, “children”, “.”].
  • The next process is converting uppercase letters in the text to lowercase during preprocessing to avoid discrepancies in tokenizing the same words with different cases. For example, “Coffee” and “coffee” are considered the same word after converting the text to lowercase.
  • The next step is removing non-alphanumeric characters such as punctuation and special symbols, which are often deleted from the text to maintain consistency. For example, the result of removing non-alphanumeric characters from the previous sentence becomes [“coffee”, “stunts”, “growth”, “children”].
  • Finally, replace short forms with their full forms to ensure consistency and ease of text processing. For example, “don’t” can be replaced with “do not” for text consistency within the dataset.
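The following minimal sketch shows these preprocessing steps in Python, assuming NLTK for tokenization and stopword lists; the contraction map is a small illustrative subset, not the full dictionary used in the paper.

```python
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)
nltk.download("punkt_tab", quiet=True)   # required by newer NLTK releases
nltk.download("stopwords", quiet=True)

CONTRACTIONS = {"don't": "do not", "isn't": "is not", "can't": "cannot"}  # illustrative subset
STOPWORDS = set(stopwords.words("english"))

def preprocess(sentence):
    """Expand contractions, tokenize, drop stopwords, lowercase, strip non-alphanumerics."""
    for short, full in CONTRACTIONS.items():
        sentence = sentence.replace(short, full)
    tokens = word_tokenize(sentence)
    tokens = [t for t in tokens if t.lower() not in STOPWORDS]  # stopword removal
    tokens = [t.lower() for t in tokens]                        # case folding
    return [t for t in tokens if t.isalnum()]                   # drop punctuation/symbols

print(preprocess("Coffee stunts growth in children."))
# ['coffee', 'stunts', 'growth', 'children']
```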

3.2. DOD–NSSN Feature Extraction

Feature extraction plays a crucial role in the analysis of claims and evidence in the fact verification process. The feature extraction method used in this study is Google News Word2Vec, which provides word vectors that capture the semantic relationships of the input text and is specifically designed for news-related tasks. This study leverages these fact verification features with the FEVER Dataset, with the aim of improving accuracy. Additionally, Google News Word2Vec offers vectors derived from a corpus of about 3 billion words, effectively representing each news sentence in the FEVER dataset. In the fact verification task, this study allocated 70% of the data to the model training phase (further divided into 90% for training data and 10% for validation data), while the remaining 30% was allocated to the model testing phase, specifically for claim verification. The performance on the training and testing data was measured using accuracy, precision, recall, and F1-Score metrics.
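The sketch below illustrates how the pretrained Google News vectors could be loaded and used to encode a tokenized sentence, assuming the gensim library; the file path is a placeholder and encode_tokens is an illustrative helper, not the released implementation.

```python
import numpy as np
from gensim.models import KeyedVectors

# Pretrained 300-dimensional Google News embeddings; the file path is a placeholder.
w2v = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

def encode_tokens(tokens):
    """Stack each in-vocabulary token's 300-d Word2Vec vector into an (n x 300) matrix."""
    vectors = [w2v[t] for t in tokens if t in w2v]
    return np.vstack(vectors) if vectors else np.zeros((0, 300))

claim_matrix = encode_tokens(["coffee", "stunts", "growth", "children"])
print(claim_matrix.shape)   # (number of in-vocabulary tokens, 300)
```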

3.3. DOD–NSSN Layers

To develop the DOD–NSSN algorithm, a thorough examination of the fundamental structure was conducted to identify the necessary components. DOD–NSSN combines elements from the NSMN architecture and the Siamese MaLSTM architecture for fact verification. The integration of Siamese MaLSTM with certain NSMN architectures introduces a new approach to predicting and confirming factual claims during the verification process. The NSMN architecture underwent several stages of transformation into DOD–NSSN. It consists of four layers: the encoding layer, alignment layer, matching layer, and output layer. The difference in architecture is illustrated in Figure 3.

3.3.1. Encoding Layer

The encoding layer transforms the text input data of the Claim (Cl) and Evidence (Ev) into real-valued representations ($\mathbb{R}$) of dimension $d_0$, where $n$ is the number of tokens in Cl and $m$ is the number of tokens in Ev. In the encoding layer stage, these two inputs are converted into numerical representations or vectors. DOD–NSSN utilizes Google News Word2Vec to encode the text with a 300-dimensional architecture. The FEVER Dataset that forms Cl serves as the source of sentences on the left side for fact verification, encompassing claims that need to be validated, for example, a Cl containing the sentence “Alexandra Daddario is a ballerina.”. On the right side is the evidence, also taken from the FEVER Dataset to form Ev, for example, an Ev containing the sentence “Alexandra Daddario is a teacher at the high school”. The input sequence is processed in one direction by tokenizing (H) and creating new vectors in the form of encoded input text using Google News Word2Vec. The resulting encodings are $\overline{Cl}$ and $\overline{Ev}$, two real-valued vectors of dimension $d_1$, which is a subset of $d_0$. The encoded input is $\overline{Cl}$, represented as [0.12, 0.34, …, 0.78] with $d_1$ = 300 and $n$ = 5, and $\overline{Ev}$, represented as [0.11, 0.28, …, 0.67] with $d_1$ = 300 and $m$ = 9. The two input sequences are summarized in Equations (1) and (2).
$\overline{Cl} = H_{\text{left}}(Cl) \in \mathbb{R}^{d_1 \times n}$,  (1)
$\overline{Ev} = H_{\text{right}}(Ev) \in \mathbb{R}^{d_1 \times m}$,  (2)

3.3.2. Alignment Layer

In the alignment layer, DOD–NSSN employs the Split and Zero Padding method. The choice of Split and Zero Padding is based on its effectiveness in handling large datasets. This approach helps reduce the risk of excessive parameters and the potential for overfitting. The next stage involves the alignment layer, where techniques such as text Split (St) and text Zero Padding (Zp) are applied to reduce overfitting in sentences.
$St = \{\text{left}: \overline{Cl},\ \text{right}: \overline{Ev}\}$,  (3)
$Zp(St, pz, tz, L_z) = \begin{cases} [\,0_{L_z - |St|},\ St\,] & \text{if zeros are added at the beginning (pre-padding)} \\ [\,St,\ 0_{L_z - |St|}\,] & \text{if zeros are added at the end (post-padding)} \end{cases}$  (4)
In Equation (3), the split text process (St) divides the dataset into two parts: claim ($\overline{Cl}$) and evidence ($\overline{Ev}$), stored in the ‘left’ and ‘right’ keys. The padding function $Zp(St, pz, tz, L_z)$ in Equation (4) describes the process of adding zeros to the St sequence until it reaches the target text length ($L_z$). If the padding region (pz) contains predicted text and the truncating region (tz) contains post text, zeros are added at the beginning of the St sequence until it reaches length $L_z$. Conversely, if pz contains post text and tz contains predicted text, zeros are added at the end of the St sequence. In this way, all sequences in the dataset are standardized to the same length, facilitating further processing in the layers of the model. Each claim undergoes the split and zero padding process, resulting in new matrices, $\overline{Cl}_{zp}$ for the claim and $\overline{Ev}_{zp}$ for the evidence. For example, the encoded result of $\overline{Cl}$ [0.12, 0.34, …, 0.78] in row 1 and the encoded result of $\overline{Ev}$ [0.11, 0.28, …, 0.67] in row 1 pass through the alignment layer, with the matrix size in row 1 of the alignment layer being 1 × 20.
Furthermore, $\overline{Cl}$, originally 1 × 5, is adjusted in the alignment layer to 1 × 20 by adding zeros at the beginning because its length is less than 20. This results in [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.12, 0.34, …, 0.78] to match the layer size. The adjustment of $\overline{Ev}$ in the alignment layer is performed similarly, changing it from 1 × 9 to 1 × 20 by adding zeros at the beginning because its length is less than 20, resulting in [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.11, 0.28, …, 0.67]. If $\overline{Cl}$ or $\overline{Ev}$ has more than 20 tokens, truncation is applied at the alignment layer stage.
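The following Python sketch reproduces the split and zero padding of the running example on 1-D rows for readability (in the model each token is a 300-dimensional vector); the helper name is illustrative and the middle row values are placeholders echoing the elided entries of the example.

```python
import numpy as np

MAX_LEN = 20  # alignment-layer length used in the running example

def zero_pad(seq, max_len=MAX_LEN, padding="pre"):
    """Pad a 1-D sequence with zeros to max_len (pre or post), truncating longer inputs."""
    seq = np.asarray(seq, dtype=float)[:max_len]
    pad = np.zeros(max_len - len(seq))
    return np.concatenate([pad, seq]) if padding == "pre" else np.concatenate([seq, pad])

claim_row = [0.12, 0.34, 0.50, 0.60, 0.78]                                # 1 x 5 (middle values illustrative)
evidence_row = [0.11, 0.28, 0.30, 0.35, 0.40, 0.45, 0.50, 0.60, 0.67]     # 1 x 9 (middle values illustrative)

print(zero_pad(claim_row))     # 15 leading zeros followed by the five claim values
print(zero_pad(evidence_row))  # 11 leading zeros followed by the nine evidence values
```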

3.3.3. Matching Layer

The matching layer in NSMN uses BiLSTM, while DOD–NSSN employs Siamese MaLSTM. In DOD–NSSN, the process of matching the two input vectors establishes linguistic references for news fact classification using Siamese MaLSTM. The resulting Manhattan distance (MaLSTM) serves as the fundamental reference for the algorithm to produce the fact verification output. The matching layer is responsible for calculating MaLSTM before producing the final prediction. The calculation of the matching matrix is explained in Equation (5).
$\text{MaLSTM} = \left|\, \overline{Cl}_{zp} - \overline{Ev}_{zp} \,\right| \in \mathbb{R}^{d_2 \times n}$  (5)
The values of $\overline{Cl}_{zp}$ and $\overline{Ev}_{zp}$ are used as inputs in the formula to calculate the Manhattan distance. The Manhattan distance involves summing the absolute weights to balance the two inputs used in the model. To calculate the matching layer from the results of split and zero padding, the first step is to compute the absolute difference between the claim and evidence for each token. For example, if the tokens of $\overline{Cl}_{zp}$ and $\overline{Ev}_{zp}$ are compared one by one, such as |0.12 − 0.11|, the result is 0.01. All tokens in $\overline{Cl}_{zp}$ and $\overline{Ev}_{zp}$ are then summed based on their respective positions. By utilizing the Siamese MaLSTM architecture, the model is capable of identifying semantic relationships, including paraphrasing and contextually equivalent expressions, which are often difficult to distinguish using traditional classification methods. For example, the word “buy” in a claim sentence and “purchase” in an evidence sentence are semantically equivalent despite appearing in different forms. Siamese MaLSTM enables the model to recognize this level of semantic closeness based on an analysis of meaning and to produce accurate similarity assessments between the claim and the evidence without requiring bidirectional positional analysis of words within the sentence.
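A compact Keras sketch of a Siamese MaLSTM matching layer is shown below, assuming TensorFlow/Keras. The hidden size of 50 and the exp(−L1) similarity are assumptions taken from the original MaLSTM formulation, not necessarily the exact configuration used in DOD–NSSN; weight sharing between the two branches is what makes the network Siamese.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

MAX_LEN, EMB_DIM = 20, 300   # alignment-layer length and Word2Vec dimensionality

# One shared (Siamese) LSTM encodes both sentences, so the two branches share weights.
shared_lstm = layers.LSTM(50)   # hidden size of 50 is an assumption

claim_in = layers.Input(shape=(MAX_LEN, EMB_DIM), name="claim")
evidence_in = layers.Input(shape=(MAX_LEN, EMB_DIM), name="evidence")

claim_h = shared_lstm(claim_in)
evidence_h = shared_lstm(evidence_in)

# MaLSTM similarity: exp(-L1 distance) between the final hidden states, bounded in (0, 1].
malstm_score = layers.Lambda(
    lambda t: tf.exp(-tf.reduce_sum(tf.abs(t[0] - t[1]), axis=1, keepdims=True)),
    name="manhattan_similarity",
)([claim_h, evidence_h])

model = Model(inputs=[claim_in, evidence_in], outputs=malstm_score)
model.compile(optimizer="adam", loss="mse")   # regression on the relatedness score
model.summary()
```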

3.3.4. Output Layer

After obtaining the Siamese Manhattan distance, the final prediction is made in the output layer (y), resulting in a class prediction (C) for fact verification. In the FEVER Dataset, determining the verification label involves a series of steps to ensure accuracy and consistency in claim classification. This process takes into account unique challenges, especially since each claim must be evaluated against directly related evidence. One of the challenges is ensuring that the relevant claim and evidence are well-connected, as there is variation in the structure and content of the claims and evidence. To address this challenge, the system relies on the Manhattan Fact Relatedness Score (MFRS) generated from the Siamese matching layer. MFRS is a new variable introduced in this research. It is derived from the matching layer performed by Siamese MaLSTM and provides an objective assessment of the level of relationship between the claim and the evidence, aiding in determining their correspondence. Additionally, Figure 4 illustrates the classification ranges determined based on the MFRS results.
A predefined score range, from 0 to 1, offers a clear framework for claim classification. If the association score falls within a certain range, the classification of ‘SUPPORTS’, ‘REFUTES’, and ‘NEI’ can be consistently assigned, providing clarity in interpreting fact verification results. The challenge in this study involved handling ambiguous or unclear claim sentences, as well as the complexity of presenting evidence that may not directly support or refute the claim. To address this, the model designed in this research is equipped to handle cases of ambiguity by assigning the ‘NEI’ classification when there is not enough information available to support or refute the claim. To obtain the MFRS, it is necessary to run the encoding, alignment, and matching layer processes. The output from the matching layer, which is the absolute weight, produces y, and this weight forms a vector R . The resulting score is then classified into a prediction score, which is used to generate C. The proximity thresholds of 0.4399 and 0.5699 were chosen based on the analysis of the data distribution from the Manhattan distance output scores. Sentence pairs with Manhattan scores below 0.44 consistently exhibited contradictory relationships (REFUTES), while scores above 0.5699 indicated strong supportive relationships (SUPPORTS). Furthermore, the data analysis revealed a range reflecting informational ambiguity, where the semantic proximity between the claim and evidence was insufficient to either support or refute the claim. Therefore, this ambiguous class was designated as NEI.
The class distribution analysis was conducted by reviewing each Manhattan distance score between claims and evidence, grouping them, and identifying score ranges to observe the population of class labels. The results of this grouping showed that for the score range 0–0.4399, 90% of the data was labeled as REFUTES, while 10% was labeled as SUPPORTS. The score range of 0.44–0.5699, representing class ambiguity, included 164 data points, approximately 2.34% of the 7000 total samples. Finally, for the score range of 0.57–1, 100% of the data was labeled as SUPPORTS. This class distribution based on the Manhattan distance served as the basis for determining the score ranges of the MFRS (Manhattan Fact Relatedness Score). If the fact-relatedness score falls within the range of 0 to 0.4399, the class is classified as REFUTES, indicating a substantial distance between the claim and the provided evidence.
If the fact-relatedness score falls within the range of 0.4399 to 0.5699, the classification is set as NEI. This classification is applied when the result is ambiguous, making the model unable to definitively determine whether the claim SUPPORTS or REFUTES the evidence. If the fact-relatedness score falls within the range of 0.57 to 1, the classification is set as ‘SUPPORTS’. This classification indicates a strong correspondence between the claim and the evidence, leading to the conclusion that the claim is accurate. The final prediction provided by the DOD–NSSN algorithm corresponds to the fact verification label, which can be S (SUPPORTS), R (REFUTES), or NEI. These labels are determined based on the closest Manhattan distance in each class, represented by vectors of dimension d3. The output layer is summarized in Equations (6) and (7).
$y = \text{MaLSTM} \in \mathbb{R}^{d_3 \times n}$,  (6)
$C = \begin{cases} \text{REFUTES} & \text{if } 0 \le y \le 0.4399 \\ \text{NEI} & \text{if } 0.44 \le y \le 0.5699 \\ \text{SUPPORTS} & \text{if } 0.57 \le y \le 1 \end{cases}$  (7)
MFRS was selected as an evaluation metric for its use of Manhattan distance, effectively reflecting semantic relatedness between claims and evidence and providing accurate factual assessments. For example, from the result of the absolute value summation performed using the MaLSTM formula, a value of 0.62 is obtained and stored in the variable y. This value can be categorized as the SUPPORTS class based on the predefined value range. In the output layer, the Manhattan Fact Relatedness Score (MFRS) does not rely on the positional sequence of words within a sentence or traditional probabilistic approaches. Instead, it calculates the semantic closeness between the claim and evidence sentences using the Manhattan distance. This distance score is then used to classify the labels SUPPORTS, REFUTES, or NEI based on empirically defined threshold values. As a result, the model predicts labels based on semantic similarity analysis performed by the Siamese MaLSTM and produces classification outcomes that are more stable, interpretable, and capable of handling semantic ambiguity compared to conventional softmax-based models.
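As a direct transcription of Equation (7), the threshold mapping can be expressed as a small Python function; the worked value of 0.62 from the example above falls into the SUPPORTS range.

```python
def mfrs_label(score: float) -> str:
    """Map a Manhattan Fact Relatedness Score in [0, 1] onto a FEVER label (Equation (7))."""
    if 0.57 <= score <= 1.0:
        return "SUPPORTS"
    if score <= 0.4399:
        return "REFUTES"
    return "NEI"   # ambiguous region between the two thresholds

for s in (0.62, 0.30, 0.50):
    print(s, "->", mfrs_label(s))   # SUPPORTS, REFUTES, NEI
```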

3.4. DOD–NSSN Implementation Environment

For the implementation of the DOD–NSSN algorithm, we used Jupyter Notebooks, which are designed to support machine learning research [38], with Python 3 as the runtime environment. The specific hardware and software configurations used for the implementation are detailed in Table 2.

3.5. Dataset

The Google News Word2Vec feature serves as a hidden layer within a linear neural network operating on a textual corpus [39,40]. The dataset used in this study is the Fact Extraction and VERification (FEVER) dataset, which was originally constructed by extracting claims from Wikipedia articles and associating them with human-labeled evidence. Each entry in the dataset includes a claim, one or more evidence sentences, and a label (SUPPORTS, REFUTES, or NEI). The dataset used consists of 10,000 total data samples, of which 7000 (70%) were used for training and validation (split into 90% training and 10% validation), and 3000 (30%) for testing. This research utilizes Google News Word2Vec as the dataset dictionary with a specific focus on news data. Google News Word2Vec consists of 3-billion-word vectors and includes a vocabulary of 3 million English words. The input values for Siamese MaLSTM consist of the content of the evidence and the content of the claim, which were then processed to obtain a hypothesis and the Manhattan distance. The structure of the dataset used in this study is detailed in Table 3.
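The 70/30 split (with the 70% portion split again 90/10 for validation) described above could be produced as in the following sketch, assuming the claim/evidence pairs have been flattened into a CSV file; the file name and column names are placeholders rather than the actual Table 3 schema.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Flattened claim/evidence pairs; the file name and column names are placeholders.
df = pd.read_csv("fever_pairs.csv")   # e.g., columns: claim, evidence, label

# 70% for training + validation, 30% for testing.
trainval, test = train_test_split(df, test_size=0.30, random_state=42, stratify=df["label"])
# The 70% portion is split again into 90% training and 10% validation data.
train, val = train_test_split(trainval, test_size=0.10, random_state=42, stratify=trainval["label"])

print(len(train), len(val), len(test))   # roughly 6300 / 700 / 3000 for 10,000 samples
```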

3.6. Performance Evaluation Metrics

The model evaluation in this study utilizes various metrics, including accuracy, precision, recall, and F1-Score. Accuracy measures the closeness between the predicted values and the actual values, while the F1-Score calculates the balanced average between precision and recall. This study defines the True Positive (TP) variable as data that displays a positive value and is correctly predicted. The True Negative (TN) variable refers to negative data that is accurately predicted, while the False Positive (FP) variable represents positive data that is incorrectly predicted as negative by the system. Additionally, False Negative (FN) occurs when the system predicts positive data as negative. Equations (8)–(11) are used to calculate accuracy, precision, recall, and the F1-Score.
$\text{Accuracy} = \dfrac{TP + TN}{TP + FP + TN + FN}$,  (8)
$\text{Precision} = \dfrac{TP}{TP + FP}$,  (9)
$\text{Recall} = \dfrac{TP}{TP + FN}$,  (10)
$\text{F1-Score} = \dfrac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$,  (11)
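Equations (8)–(11) can be computed with scikit-learn as sketched below; macro-averaging over the three FEVER labels is an assumption, since the averaging mode is not stated in the text, and the label lists are toy values.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = ["SUPPORTS", "REFUTES", "NEI", "SUPPORTS", "REFUTES", "SUPPORTS"]
y_pred = ["SUPPORTS", "REFUTES", "SUPPORTS", "SUPPORTS", "NEI", "SUPPORTS"]

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0)

print(f"Accuracy {accuracy:.4f}  Precision {precision:.4f}  Recall {recall:.4f}  F1 {f1:.4f}")
```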

3.7. Dataset Testing Against DOD–NSSN

The testing of DOD–NSSN demonstrates the model’s ability to classify fact verification labels. The curves show a continuous improvement in accuracy, F1-Score, recall, and precision with each iteration, indicating a steady enhancement in the classification output generated by DOD–NSSN. Figure 5 illustrates the trends of precision, recall, and F1-Score achieved by the DOD–NSSN algorithm.
The observed trend indicates that the algorithm has strong capabilities in utilizing factual information, as evidenced by the increasing levels of accuracy, precision, and recall. Additionally, these results highlight the effectiveness of DOD–NSSN in improving the fact verification process, demonstrating its potential to play a significant role in various fields that require careful and reliable information classification. Observations of the DOD–NSSN model’s iteration process show that increasing the number of epochs reduces the model’s loss values. These observations indicate that the training and validation process of the DOD–NSSN model provides significant advantages on both the training and testing sets. This behavior can be attributed to the distinctive architecture of the DOD–NSSN model, which integrates neural semantic encoding with a Siamese neural network to enhance its effectiveness in fact verification tasks. The results depicted in Figure 6 show that the DOD–NSSN model consistently maintains low loss values.

4. Results and Discussion

4.1. Results

The findings of this research show that DOD–NSSN achieved the highest scores across various metrics, including accuracy, precision, recall, and F1-Score, compared to NSMN. Specifically, DOD–NSSN demonstrated a significant accuracy improvement, surpassing NSMN by a margin of 22.43%. Additionally, DOD–NSSN outperformed NSMN by a precision margin of 31.02%, a recall margin of 12.48%, and an F1-Score margin of 19.66%. These findings highlight the greater effectiveness of DOD–NSSN in accurately identifying fact verification data compared to NSMN. The pre-training process was influenced by the modification of the BiLSTM architecture into Siamese MaLSTM, contributing to the algorithm’s enhanced performance. Specifically, the NSMN model produced the lowest accuracy score of 69.43% compared to other algorithms. Accuracy reflected the proportion of correctly classified fact verifications, and the high accuracy of DOD–NSSN demonstrated its overall strong performance. Figure 7 shows the performance measurements of the DOD–NSSN model compared to other models.
The F1-Score provided a comprehensive assessment by accounting for False Negatives (FN) and False Positives (FP), ensuring a balanced evaluation. After analyzing the F1-Score metrics across the various models, each model exhibited its own strengths and weaknesses. Starting with NSMN, it achieved an F1-Score of 74.33%, reflecting a balance between precision at 57.85% and recall at 86.79%, but it still lagged behind other models in overall performance, with an accuracy of 69.43%. Additionally, NSMN displayed lower precision, indicating a tendency to generate more False Positives in fact verification tasks. DOD–NSSN excelled across all evaluation metrics, outperforming transformer models such as BERT, RoBERTa, XLM, and XL-Net. It achieved the highest F1-Score of 93.99%, reflecting an excellent balance between precision and recall, which are critical indicators for fact verification tasks where both false positives and false negatives can have significant consequences. In comparison, RoBERTa, the best-performing transformer model, achieved a lower F1-Score of 82.99%.
DOD–NSSN’s high precision of 88.87% suggests that when the model identifies a statement as a fact, it is highly likely to be accurate, thereby reducing the risk of spreading false information. Furthermore, its recall of 99.27% minimizes the likelihood of missing true facts, which is crucial in high-stakes environments such as news reporting or fact verification. These results show that DOD–NSSN maintains strong overall performance without sacrificing one metric for another, making it the most reliable model in this context. XL-Net slightly outperformed NSMN with an F1-Score of 74.90%, achieving a precision of 72.71% and a recall of 92.71%, resulting in an accuracy of 76.32%. BERT showed significant results with an F1-Score of 81.87%, obtaining a precision of 78.80%, a recall of 85.64%, and an accuracy of 78.68%. Another model, XLM, achieved an F1-Score of 82.50%, with a precision of 78.36% and a recall of 86%, though its accuracy was slightly lower than BERT, with XLM having an accuracy of 73.19%. RoBERTa was the best performing transformer model, reaching an F1-Score of 82.99%, with a precision of 81.30% and a recall of 84.98%, resulting in an accuracy of 76.93%. However, DOD–NSSN consistently outperformed all these models across all metrics, proving its robustness in fact verification tasks. Table 4 presents a summary of model data regarding accuracy, F1-Score, recall, and precision, demonstrating that DOD–NSSN outperforms NSMN and transformer algorithms across all evaluation metrics.
The DOD–NSSN model achieved the best F1-Score compared to the other models, with a value of 93.99%, a precision of 88.87%, and a recall of 99.27%. These results demonstrate the excellent capability of DOD–NSSN to maintain a strong balance between precision and recall, leading to the highest accuracy of 91.86%. The varying levels of difficulty in fact verification tasks faced by each model may affect their ability to achieve a high F1-Score. Overall, the DOD–NSSN model excels in balancing False Negatives (FN) and False Positives (FP), as evidenced by its superior F1-Score performance. These findings underscore the DOD–NSSN model’s proficiency in executing fact verification tasks by effectively classifying and recognizing facts. Additionally, transformer algorithms such as BERT, RoBERTa, XLM, and XL-Net showed strong performance, consistently achieving scores above the 70% threshold across all evaluation metrics. These results highlight their potential as viable alternatives for fact verification tasks.
The accuracy of the transformer algorithm is superior compared to NSMN because the transformer uses positional encoding to provide information about the order of tokens in the text. This positional encoding allows the transformer to understand more complex meanings, even when the relationships between words in a sentence are not adjacent. BiLSTM processes sequences of text inputs in both forward and backward directions. Although this approach can enhance contextual understanding, the results of this study indicate that the added complexity does not significantly improve NSMN’s performance in addressing the challenges posed by the FEVER Dataset. Table 5 presents the fact results processed by the DOD–NSSN model.
Specifically, NSMN struggles to accurately capture the semantic similarity between claims and supporting or refuting evidence, leading to lower accuracy in fact-verification tasks. However, what sets DOD–NSSN apart is its innovative use of the Siamese MaLSTM architecture within the language representation framework. This approach has demonstrated effectiveness and precision in classifying textual data, culminating in an excellent accuracy score of 91.86%, the highest among all the algorithms tested. The observed variation in performance across models emphasizes the importance of selecting the most appropriate approach for specific tasks.
In this study, DOD–NSSN emerged as the model with the highest accuracy score, making it an optimal solution for fact verification. To analyze prediction outcomes in detecting misclassification of text, this research also calculated recall and precision metrics. Moreover, based on the margin of error obtained through a 95% confidence interval, the DOD–NSSN model demonstrates a higher level of stability compared to other models, as shown in Table 6.
Table 6 presents a comparison of the margin of error for the DOD–NSSN model against several other models, including NSMN, XLM, XL-NET, RoBERTa, and BERT, using a 95% confidence interval. This analysis provides insights into the model’s ability to accurately retrieve relevant information (recall) and its capability to minimize false positives (precision), offering a comprehensive assessment of its performance in fact verification tasks. The comparison focuses on four key evaluation metrics: accuracy, precision, recall, and F1-score. In terms of accuracy, DOD–NSSN demonstrates the lowest margin of error at 0.0006 percent, indicating a higher level of stability compared to other models that exhibit greater variations in accuracy. For precision, DOD–NSSN achieves a margin of error of 0.0008 percent, which is relatively lower than XL-NET, a model with a larger margin of error. In the recall metric, DOD–NSSN outperforms the other models with the lowest margin of error at 0.0002 percent, highlighting its ability to maintain consistency in correctly identifying factual data. Similarly, the F1-score for DOD–NSSN reflects a lower margin of error at 0.0006 percent, which is significantly smaller than models like XL-NET and RoBERTa. Overall, these findings suggest that DOD–NSSN is not only more consistent but also exhibits greater stability in its predictive performance. The reduced margin of error highlights the model’s reliability, making it a superior choice for fact verification tasks, where maintaining consistency is crucial.
Furthermore, to validate the performance claims of the proposed model, statistical significance testing was conducted using independent sample t-tests across four evaluation metrics, i.e., accuracy, precision, recall, and F1-score. The results indicated that all calculated p-values were below the significance threshold of 0.05, with most comparisons yielding p < 0.001. These findings suggest that the observed performance differences are statistically significant. For instance, the p-value for the F1-score comparison between DOD–NSSN and all baseline models was recorded at 0.0000. This result demonstrates that the superior performance achieved by DOD–NSSN is not due to random variation, but rather reflects a meaningful and statistically significant improvement over existing models. Therefore, these outcomes provide strong evidence that DOD–NSSN is a robust model for fact verification tasks, supported by both empirical and statistical validation. The results of the t-statistic and p-value tests are presented in Table 7.
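A minimal sketch of such an independent-samples t-test with SciPy is shown below; the per-run F1 scores are illustrative placeholders, not the values underlying Table 7.

```python
from scipy import stats

# Illustrative per-run F1 scores for two models (placeholders, not the reported values).
dod_nssn_f1 = [0.939, 0.941, 0.938, 0.942, 0.940]
nsmn_f1 = [0.744, 0.741, 0.746, 0.743, 0.745]

t_stat, p_value = stats.ttest_ind(dod_nssn_f1, nsmn_f1)
print(f"t = {t_stat:.3f}, p = {p_value:.6f}")   # p < 0.05 indicates a significant difference
```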

4.2. Discussion

In the previous experiment, a dataset containing 10,000 observations was used to compare the DOD–NSSN and NSMN algorithms. In this study, the dataset was split with 70% of the data used for training (further divided into 90% for training data and 10% for validation data) and 30% used for testing. The findings indicate that DOD–NSSN outperforms the NSMN algorithm and the transformers in terms of accuracy, precision, recall, and F1-Score when trained and tested with equivalent amounts of data. DOD–NSSN requires only 1.74 s per epoch for training, highlighting its efficiency in handling a large number of fact verification tasks. The integration of Siamese MaLSTM into the language representation framework sets DOD–NSSN apart from previous models like NSMN. This architecture addresses the overfitting issues commonly found in the BiLSTM-based model used in NSMN, as highlighted in previous research [5]. Additionally, the use of the Manhattan Fact Relatedness Score (MFRS) in the output layer improves the model’s accuracy in classifying claims. Compared to transformer models, DOD–NSSN demonstrates superior generalization by maintaining high performance across various topics. For instance, in the Healthcare domain, where data complexity often reduces model accuracy, DOD–NSSN achieves strong results with an accuracy of 96.19% and an F1-Score of 82.50%. These findings highlight the model’s robustness in handling datasets that are typically challenging for research.
The performance gain of DOD–NSSN is attributed to two key innovations in its architecture: (1) the use of the Siamese MaLSTM model in the matching layer, which enables more effective semantic comparison between claim and evidence, and (2) the introduction of the Manhattan Fact Relatedness Score (MFRS) in the output layer, which transforms the classification problem into a similarity scoring task. This combination allows the model to avoid the overfitting issues commonly observed in the BiLSTM-based NSMN and to generalize better across domains. Instead of directly classifying the claim-evidence pair into discrete labels via Softmax, DOD–NSSN first computes a semantic similarity score using the Manhattan distance of the encoded vectors. This score is then mapped to the labels (SUPPORTS, REFUTES, NEI) based on empirically defined thresholds. This approach ensures better interpretability and robustness, especially in ambiguous or borderline cases.
The implementation used by DOD–NSSN involved leveraging Siamese MaLSTM, which utilized a Siamese network to perform textual analysis to verify claims and substantiate specific facts. This analysis resulted in label predictions related to the claims and evidence discussed in the text. Both input texts use identical sub-models to extract their respective features. A claim is labeled as SUPPORTS if the second entry includes identification and paraphrasing. If the second text presents a correlation but lacks identical details, it is labeled as REFUTES. If both texts have no connection, the claim is classified as NEI. The performance of the DOD–NSSN model, when compared to benchmark algorithms, represents a significant advancement in automated fact verification technology. With its proven capabilities, the DOD–NSSN model can effectively identify and verify factual information with much higher accuracy, precision, recall, and F1-Score. This performance improvement not only enhances the reliability and trustworthiness of fact verification systems but also has a positive impact on the fight against the spread of false information. Moreover, in the midst of the widespread dissemination of misinformation in today’s digital era, the challenges faced by society are becoming increasingly complex. The impact goes beyond public discourse, influencing decision-making processes and even the integrity of democratic institutions. The findings of this research address these challenges by providing a more effective and sophisticated tool for identifying and refuting false information. The prediction results of DOD–NSSN on these various topics can be seen in Table 8. These topics were selected based on news data available in the FEVER Dataset. Topics were initially determined manually to ensure accurate labeling, allowing the model to precisely measure performance based on the specified topics. The five topics with the largest data volume were then chosen: Sports, Government, Political, Health, and Industry.
The DOD–NSSN model was subsequently used to train and test the data for each topic, with performance measured using accuracy, precision, recall, and F1-Score indicators. The practical implications of the findings from this research are significant for various stakeholders involved in the dissemination of information, including social media platforms, news portals, and academic researchers. Implementing more accurate and reliable fact verification models such as DOD–NSSN has the potential to greatly reduce the spread of misinformation while simultaneously enhancing the credibility of information shared online. To assess the performance of DOD–NSSN across different news topics, tests were also conducted related to sports news, government, political news, health, and industry.
The DOD–NSSN performance results shown in Table 8 demonstrate excellent generalization capabilities across various topics. The model’s accuracy was consistently high, ranging from 91% to 98%, indicating that DOD–NSSN is highly accurate in classifying topics. The model’s precision was also exceptionally high, with a maximum value of 98.60%, showing that the model makes almost no errors in predicting claims supported by evidence. Additionally, the model’s recall achieved high results as well, with a maximum value of 98.30%, indicating that DOD–NSSN successfully identifies nearly all claims correlated with evidence. The F1-Score, which combines precision and recall, also scored high across all topics, reflecting balanced and robust performance in handling various types of data. With these results, it can be concluded that DOD–NSSN has excellent generalization capabilities and is effective across a wide range of topics while maintaining high accuracy, precision, recall, and F1-Score.

5. Conclusions

This research aimed to introduce a new fact verification model with a layered architecture, consisting of an encoding layer, an alignment layer, a matching layer, and an output layer. This layered architecture has been successfully implemented in the proposed model for fact verification. Another objective of this research was to develop a new fact verification model with high accuracy. The proposed method is the Deep One-Directional Neural Semantic Siamese Network (DOD–NSSN). As hypothesized, the model achieved the highest accuracy performance at 91.86%. Additionally, other indicators also scored the highest compared to other algorithms, with an F1-Score of 93.99%, a precision of 88.87%, and a recall of 99.27%. The other findings indicate that the use of Siamese MaLSTM in the language representation process can function as an alternative model architecture, yielding excellent model accuracy on various news topics, consistently above 90%.
The DOD–NSSN model is ready for scalable deployment using adequate computing technology and infrastructure, such as cloud computing and parallel data processing. By leveraging these technologies, the model can be scaled up and run in a distributed manner, enabling the processing of large amounts of data while avoiding the overfitting observed in the NSMN algorithm. The DOD–NSSN model can be adapted into an evidence-based generative Artificial Intelligence (AI) model for fact verification, utilizing the model packages available in the repository, and this approach can be effectively implemented in evidence-based AI systems. As the demand for large-scale fact verification increases, the generative AI capabilities of the DOD–NSSN model are well-suited to producing high-quality information and mitigating misinformation in online environments. The DOD–NSSN model demonstrates significant scalability compared to previous methods such as NSMN: it requires only 1.74 s per epoch for training, highlighting its efficiency in handling a large number of fact verification tasks. This short training time, combined with faster prediction times, is a result of the optimized DOD–NSSN architecture, which includes neural semantic encoding and the use of Siamese neural networks. These enhancements not only improve speed but also reduce the risk of overfitting, making DOD–NSSN more suitable for managing large-scale fact verification processes in real-world applications, outperforming previous models like NSMN.
For future research, it is recommended to extend the DOD–NSSN model with techniques such as transfer learning and fine-tuning on larger and more diverse datasets to improve its performance in more complex real-world scenarios. Further work could integrate DOD–NSSN with hoax detection systems or other AI-based applications to broaden its range of use. Testing the model on more varied computing infrastructures, such as edge computing or Internet of Things (IoT) devices, is also suggested to ensure that it remains effective and efficient in different technological environments. Finally, combining DOD–NSSN with ensemble techniques could be explored to improve the accuracy and robustness of the fact verification system.

Author Contributions

Conceptualization, M.N., M.A. and R.F.S.; methodology, M.N., M.A. and R.F.S.; software, M.N. and J.H.W.; validation, M.A. and R.F.S.; formal analysis, M.N. and J.H.W.; investigation, M.A. and R.F.S.; resources, M.N. and J.H.W.; data curation, M.N. and J.H.W.; writing—original draft preparation, M.N. and J.H.W.; writing—review and editing, M.N., M.A. and R.F.S.; visualization, M.N.; supervision, M.A. and R.F.S.; project administration, M.N. and R.F.S.; funding acquisition, M.N. and R.F.S. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the Ministry of Education, Culture, Research, and Technology (Kemendikbudristek) of Indonesia under the Penelitian Kerjasama research scheme with grant number NKB-1156/UN2.RST/HKP.05.00/2023. The work of Muchammad Naseer was supported in part by the Lembaga Pengelola Dana Pendidikan (LPDP), Ministry of Finance, Indonesia, under Contract 20200421301149.

Data Availability Statement

Version 1 of the DOD–NSSN model is available on GitHub, allowing access and use across various platforms, at https://github.com/naseercolab/dod-nssn/ (accessed on 25 December 2024) for future research applications.

Acknowledgments

The authors sincerely thank Universitas Teknologi Bandung for its support.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
DOD–NSSN: Deep One-Directional Neural Semantic Siamese Network
NSMN: Neural Semantic Matching Networks
Siamese MaLSTM: Siamese Manhattan Long Short-Term Memory
BiLSTM: Bidirectional Long Short-Term Memory
XL-Net: Generalized Autoregressive Pretraining for Language Understanding
BERT: Bidirectional Encoder Representations from Transformers
XLM: Cross-lingual Language Model Pretraining
RoBERTa: Robustly Optimized BERT Approach

References

  1. Giachanou, A.; Ghanem, B.; Ríssola, E.A.; Rosso, P.; Crestani, F.; Oberski, D. The Impact of Psycholinguistic Patterns in Discriminating between Fake News Spreaders and Fact Checkers. Data Knowl. Eng. 2022, 138, 101960. [Google Scholar] [CrossRef]
  2. Nayoga, B.P.; Adipradana, R.; Suryadi, R.; Suhartono, D. Hoax Analyzer for Indonesian News Using Deep Learning Models. Procedia Comput. Sci. 2021, 179, 704–712. [Google Scholar] [CrossRef]
  3. Thorne, J.; Vlachos, A.; Christodoulopoulos, C.; Mittal, A. FEVER: A Large-Scale Dataset for Fact Extraction and VERification. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, LA, USA, 1–6 June 2018; pp. 809–819. [Google Scholar] [CrossRef]
  4. Thorne, J.; Vlachos, A.; Cocarascu, O.; Christodoulopoulos, C.; Mittal, A. The Fact Extraction and VERification (FEVER) Shared Task. In Proceedings of the First Workshop on Fact Extraction and VERification (FEVER), Brussels, Belgium, 1 November 2018. [Google Scholar] [CrossRef]
  5. Nie, Y.; Chen, H.; Bansal, M. Combining Fact Extraction and Verification with Neural Semantic Matching Networks. Proc. AAAI Conf. Artif. Intell. 2019, 33, 6859–6866. [Google Scholar] [CrossRef]
  6. White, R. Evidence and Truth. Philos. Stud. 2023, 180, 1049–1057. [Google Scholar] [CrossRef]
  7. Yin, M.; Vaughan, J.W.; Wallach, H. Understanding the Effect of Accuracy on Trust in Machine Learning Models. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, Glasgow, Scotland, UK, 4–9 May 2019. [Google Scholar] [CrossRef]
  8. Ma, Z.; Li, J.; Song, Y.; Wu, X.; Chen, C. Network Intrusion Detection Method Based on FCWGAN and BiLSTM. Comput. Intell. Neurosci. 2022, 2022, 6591140. [Google Scholar] [CrossRef] [PubMed]
  9. Lakatos, R.; Pollner, P.; Hajdu, A.; Joó, T. Investigating the Performance of Retrieval-Augmented Generation and Domain-Specific Fine-Tuning for the Development of AI-Driven Knowledge-Based Systems. Mach. Learn. Knowl. Extr. 2025, 7, 15. [Google Scholar] [CrossRef]
  10. Ion, R.; Păiș, V.; Mititelu, V.B.; Irimia, E.; Mitrofan, M.; Badea, V.; Tufiș, D. Unsupervised Word Sense Disambiguation Using Transformer’s Attention Mechanism. Mach. Learn. Knowl. Extr. 2025, 7, 10. [Google Scholar] [CrossRef]
  11. Zengeya, T.; Fonou Dombeu, J.V.; Gwetu, M. A Centrality-Weighted Bidirectional Encoder Representation from Transformers Model for Enhanced Sequence Labeling in Key Phrase Extraction from Scientific Texts. Big Data Cogn. Comput. 2024, 8, 182. [Google Scholar] [CrossRef]
  12. Li, Q.; Zhou, W. Connecting the Dots Between Fact Verification and Fake News Detection. In Proceedings of the 28th International Conference on Computational Linguistics, Barcelona, Spain, 8–13 December 2020; International Committee on Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 1820–1825. [Google Scholar]
  13. Wang, Y.; Xia, C.; Si, C.; Yao, B.; Wang, T. Robust Reasoning Over Heterogeneous Textual Information for Fact Verification. IEEE Access 2020, 8, 157140–157150. [Google Scholar] [CrossRef]
  14. Jiang, Y.; Bordia, S.; Zhong, Z.; Dognin, C.; Singh, M.; Bansal, M. HOVER: A Dataset for Many-Hop Fact Extraction and Claim Verification. In Findings of the Association for Computational Linguistics: EMNLP 2020, Online, 16–20 November 2020; pp. 3441–3460. [Google Scholar] [CrossRef]
  15. Schuster, T.; Fisch, A.; Barzilay, R. Get Your Vitamin C! Robust Fact Verification with Contrastive Evidence. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, 6–11 June 2021; pp. 624–643. [Google Scholar] [CrossRef]
  16. Atanasova, P.; Simonsen, J.G.; Lioma, C.; Augenstein, I. Fact Checking with Insufficient Evidence. Trans. Assoc. Comput. Linguist. 2022, 10, 746–763. [Google Scholar] [CrossRef]
  17. Gohourou, D.; Kuwabara, K. Knowledge Graph Extraction of Business Interactions from News Text for Business Networking Analysis. Mach. Learn. Knowl. Extr. 2024, 6, 126–142. [Google Scholar] [CrossRef]
  18. Chen, Z.; He, Z.; Liu, X.; Bian, J. Evaluating Semantic Relations in Neural Word Embeddings with Biomedical and General Domain Knowledge Bases. BMC Med. Inf. Decis. Mak. 2018, 18, 65. [Google Scholar] [CrossRef]
  19. Mitrov, G.; Stanoev, B.; Gievska, S.; Mirceva, G.; Zdravevski, E. Combining Semantic Matching, Word Embeddings, Transformers, and LLMs for Enhanced Document Ranking: Application in Systematic Reviews. Big Data Cogn. Comput. 2024, 8, 110. [Google Scholar] [CrossRef]
  20. Ellis, L. B2B Marketing News: New Influencer Marketing & Social Reports, LinkedIn’s B2B Engagement Study, & Google My Business Gets Messaging UK, 2021. Available online: https://www.toprankmarketing.com/blog/b2b-marketing-news-021921/ (accessed on 24 June 2025).
  21. Pan, L.; Chen, W.; Xiong, W.; Kan, M.-Y.; Wang, W.Y. Zero-Shot Fact Verification by Claim Generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), Online, 1–6 August 2021; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 476–483. [Google Scholar]
  22. Chen, Z.; Hui, S.C.; Zhuang, F.; Liao, L.; Jia, M.; Li, J.; Huang, H. A Syntactic Evidence Network Model for Fact Verification. Neural Netw. 2024, 178, 106424. [Google Scholar] [CrossRef]
  23. Liu, R.; Zhang, Y.; Yang, B.; Shi, Q.; Tian, L. Robust and Resource-Efficient Table-Based Fact Verification through Multi-Aspect Adversarial Contrastive Learning. Inf. Process. Manag. 2024, 61, 103853. [Google Scholar] [CrossRef]
  24. Chen, C.; Chen, W.; Zheng, J.; Luo, A.; Cai, F.; Zhang, Y. Input-Oriented Demonstration Learning for Hybrid Evidence Fact Verification. Expert. Syst. Appl. 2024, 246, 123191. [Google Scholar] [CrossRef]
  25. Chen, J.; Zhang, R.; Guo, J.; Fan, Y.; Cheng, X. GERE: Generative Evidence Retrieval for Fact Verification; Association for Computing Machinery: New York, NY, USA, 2022; Volume 1, ISBN 9781450387323. [Google Scholar]
  26. Chen, Q.; Ling, Z.; Jiang, H.; Zhu, X.; Wei, S.; Inkpen, D. Enhanced LSTM for Natural Language Inference. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, BC, Canada, 30 July–4 August 2017; Association for Computational Linguistics: Stroudsburg, PA, USA, 2017; pp. 1657–1668. [Google Scholar]
  27. Kowsher, M.; Tahabilder, A.; Islam Sanjid, M.Z.; Prottasha, N.J.; Uddin, M.S.; Hossain, M.A.; Kader Jilani, M.A. LSTM-ANN & BiLSTM-ANN: Hybrid Deep Learning Models for Enhanced Classification Accuracy. Procedia Comput. Sci. 2021, 193, 131–140. [Google Scholar] [CrossRef]
  28. Zhu, B.; Zhang, X.; Gu, M.; Deng, Y. Knowledge Enhanced Fact Checking and Verification. IEEE/ACM Trans. Audio Speech Lang. Process. 2021, 29, 3132–3143. [Google Scholar] [CrossRef]
  29. Gong, H.; Wang, C.; Huang, X. Double Graph Attention Network Reasoning Method Based on Filtering and Program-Like Evidence for Table-Based Fact Verification. IEEE Access 2023, 11, 86859–86871. [Google Scholar] [CrossRef]
  30. Chang, Y.-C.; Kruengkrai, C.; Yamagishi, J. XFEVER: Exploring Fact Verification across Languages. In Proceedings of the 35th Conference on Computational Linguistics and Speech Processing (ROCLING 2023), Taipei City, Taiwan, 20–21 October 2023. [Google Scholar]
  31. Zhong, W.; Xu, J.; Tang, D.; Xu, Z.; Duan, N.; Zhou, M.; Wang, J.; Yin, J. Reasoning Over Semantic-Level Graph for Fact Checking. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020. [Google Scholar]
  32. Stella, M.; Kenett, Y.N. Knowledge Modelling and Learning through Cognitive Networks. Big Data Cogn. Comput. 2022, 6, 53. [Google Scholar] [CrossRef]
  33. Belaroussi, R. Subjective Assessment of a Built Environment by ChatGPT, Gemini and Grok: Comparison with Architecture, Engineering and Construction Expert Perception. Big Data Cogn. Comput. 2025, 9, 100. [Google Scholar] [CrossRef]
  34. Schuster, T.; Shah, D.; Yeo, Y.J.S.; Roberto Filizzola Ortiz, D.; Santus, E.; Barzilay, R. Towards Debiasing Fact Verification Models. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 3417–3423. [Google Scholar]
  35. Thorne, J.; Vlachos, A.; Christodoulopoulos, C.; Mittal, A. Evaluating Adversarial Attacks against Multiple Fact Verification Systems. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 2944–2953. [Google Scholar]
  36. Siami-Namini, S.; Tavakoli, N.; Namin, A.S. The Performance of LSTM and BiLSTM in Forecasting Time Series. In Proceedings of the 2019 IEEE International Conference on Big Data (Big Data), Los Angeles, CA, USA, 9–12 December 2019; Volume 2019, pp. 3285–3292. [Google Scholar] [CrossRef]
  37. Imtiaz, Z.; Umer, M.; Ahmad, M.; Ullah, S.; Choi, G.S.; Mehmood, A. Duplicate Questions Pair Detection Using Siamese MaLSTM. IEEE Access 2020, 8, 21932–21942. [Google Scholar] [CrossRef]
  38. Carneiro, T.; Da Nobrega, R.V.M.; Nepomuceno, T.; Bian, G.B.; De Albuquerque, V.H.C.; Filho, P.P.R. Performance Analysis of Google Colaboratory as a Tool for Accelerating Deep Learning Applications. IEEE Access 2018, 6, 61677–61685. [Google Scholar] [CrossRef]
  39. Di Gennaro, G.; Buonanno, A.; Palmieri, F.A.N. Considerations about Learning Word2Vec. J. Supercomput. 2021, 77, 12320–12335. [Google Scholar] [CrossRef]
  40. Lombardo, G.; Tomaiuolo, M.; Mordonini, M.; Codeluppi, G.; Poggi, A. Mobility in Unsupervised Word Embeddings for Knowledge Extraction—The Scholars’ Trajectories across Research Topics. Future Internet 2022, 14, 25. [Google Scholar] [CrossRef]
Figure 1. DOD–NSSN Layer Architecture: Encoding, Alignment, Matching, and Output Layer.
Figure 2. Preprocessing and Feature Extraction in DOD–NSSN for Model Training and Testing.
Figure 3. Layer Architecture Comparison Between DOD–NSSN and NSMN Algorithms.
Figure 4. Relatedness Score of NSMN algorithm and Manhattan Fact Relatedness Score of DOD–NSSN algorithm.
Figure 5. Performance evaluation of DOD–NSSN on Accuracy, Precision, Recall, and F1-Score.
Figure 6. Loss value of training and validation across epochs.
Figure 7. A Comparison of DOD–NSSN with Other Algorithms Shows that DOD–NSSN Outperformed NSMN and Transformer-based Algorithms Across All Performance Indicators.
Table 1. Comparison of Fact Verification Models Based on Key Indicators.
No | Type of Feature | Characteristic | NSMN [5] | BERT [29] | RoBERTa [28] | XLM [30] | XL-Net [31] | DOD–NSSN (Proposed)
1 | Syntactic/textual | Fact verification dataset | Y | Y | Y | Y | N | Y
  |                   | Word embedding | Y | N | N | Y | Y | Y
2 | Modelling | Development model | Y | Y | Y | N | Y | Y
3 | Verification features | Supports | Y | N | Y | Y | N | Y
  |                       | Refutes | Y | Y | Y | Y | N | Y
  |                       | Not enough info | Y | N | Y | Y | N | Y
  |                       | Evidence | Y | Y | Y | Y | Y | Y
4 | Human confidence value for the model | Accuracy > 70% | N | N | Y | Y | N | Y
5 | Domain-specific fact verification words | Similarity vector | Y | Y | N | Y | N | Y
6 | Content similarity feature | Relatedness score | Y | N | N | Y | Y | Y
Table 2. Implementation Environment of Proposed Model.
Specific Hardware and Models | Configuration
Specific Hardware
CPU model name | Tesla T4
RAM | 16 GB
GPU | Amazon EC2 G4
GPU Memory | 16 GB
Models Evaluation Parameter
Epoch | 50
Train/Validation Ratio | 7:3
Optimizer | Adam
Evaluation | Accuracy, Precision, Recall, F1-Score
Loss Evaluation Metric | mean_squared_error
Runtime Environment
Development Platform | Google Colaboratory
Programming Language | Python 3.10
Libraries & Versions | TensorFlow 2.12.0, NumPy 1.23.5, Gensim 4.3.1, Pandas 1.5.3, Scikit-learn 1.2.2, Matplotlib 3.7.x
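As a hedged, self-contained sketch, the snippet below shows how the evaluation parameters listed in Table 2 (Adam optimizer, mean_squared_error loss, 50 epochs, and a 7:3 train/validation split) could be wired together in TensorFlow/Keras. The two-input stand-in model and the random arrays are placeholders, not the DOD–NSSN network or the FEVER-derived features used in this study.

import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split

SEQ_LEN = 64  # illustrative feature length

# Minimal two-input stand-in model (placeholder for the actual DOD–NSSN network).
a = tf.keras.Input(shape=(SEQ_LEN,), name="claim_features")
b = tf.keras.Input(shape=(SEQ_LEN,), name="evidence_features")
score = tf.keras.layers.Dense(1, activation="sigmoid")(tf.keras.layers.concatenate([a, b]))
model = tf.keras.Model([a, b], score)
model.compile(optimizer="adam", loss="mean_squared_error", metrics=["accuracy"])

# Dummy data standing in for preprocessed claim/evidence features and labels.
claims = np.random.rand(1000, SEQ_LEN).astype("float32")
evidence = np.random.rand(1000, SEQ_LEN).astype("float32")
labels = np.random.randint(0, 2, size=(1000, 1)).astype("float32")

# 7:3 train/validation split, as in Table 2.
c_tr, c_va, e_tr, e_va, y_tr, y_va = train_test_split(
    claims, evidence, labels, test_size=0.3, random_state=42
)

model.fit([c_tr, e_tr], y_tr, validation_data=([c_va, e_va], y_va), epochs=50, batch_size=64)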
Table 3. Detailed Description of Dataset Attributes.
Attribute | Description
id | Document ID
input1 | Contents of claim
input2 | Contents of evidence
actual label | Actual label of claim
Table 4. Model Performance Results.
Model | Accuracy | Precision | Recall | F1-Score
NSMN | 69.43% | 57.85% | 86.79% | 74.33%
XL-Net | 76.32% | 72.71% | 92.71% | 74.90%
BERT | 78.68% | 78.80% | 85.64% | 81.87%
XLM | 73.19% | 78.36% | 86.00% | 82.50%
RoBERTa | 76.93% | 81.30% | 84.98% | 82.99%
DOD–NSSN | 91.86% | 88.87% | 99.27% | 93.99%
Table 5. Fact Verification Results.
No | Claim | Evidence | Actual Label | Predicted Label | Predicted Status
1 | Dinosaurs lived around 100 million years ago. | Dinosaurs as well as most life was extinct about 65 million years ago | SUPPORTS | SUPPORTS | TRUE
2 | Saltwater taffy candy imported in Australia | Saltwater taffy candy imported in Japan | REFUTES | SUPPORTS | FALSE
3 | China gets war reparation funds from Japan after World War II | - | NEI | REFUTES | FALSE
4 | Mortal Kombat X (2015) on PC have a local multiplayer | Mortal Kombat X (2015) does have local multiplayer on PC. | SUPPORTS | SUPPORTS | TRUE
5 | Mao Zedong killed over 50 million people during his reign | - | NEI | NEI | TRUE
6 | Fresher chicken eggs have darker yellow yolks | As eggs age, the yolk may become flatter and less round due to the thinning of the albumen, the egg white | REFUTES | NEI | FALSE
7 | RBI governer Urjit Patel is Brother-in-law of Mukesh Ambani | RBI Governor Urjit Patel Mukesh Ambani’s brother-in-law is a hoax | REFUTES | REFUTES | TRUE
8 | CMI students have access to every mathematician’s house in the world 24/7 | - | NEI | NEI | TRUE
9 | Amrita University hosting ACM ICPC World Finals 2019 | In 2019, Amrita University in India was selected to host the world finals of the competition. | SUPPORTS | SUPPORTS | TRUE
10 | 4 ppm tds safe for drinking water | 15 ppm tds safe for drinking water most likely to give you mercy | REFUTES | REFUTES | TRUE
Table 6. Margin of Error Comparison Using a 95% Confidence Interval.
Model | Accuracy | Precision | Recall | F1-Score
NSMN | 0.0011% | 0.0001% | 0.0008% | 0.0001%
XL-Net | 0.0010% | 0.0011% | 0.0006% | 0.0011%
BERT | 0.0010% | 0.0010% | 0.0008% | 0.0009%
XLM | 0.0011% | 0.0010% | 0.0008% | 0.0009%
RoBERTa | 0.0010% | 0.0009% | 0.0009% | 0.0009%
DOD–NSSN | 0.0006% | 0.0008% | 0.0002% | 0.0006%
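As a rough guide to how a margin of error such as those reported in Table 6 can be obtained, the sketch below applies the normal-approximation half-width of a 95% confidence interval to a proportion-style metric. The sample size is a hypothetical assumption; the exact procedure behind Table 6 is not spelled out here, so this is illustrative only.

import math

def margin_of_error(p, n, z=1.96):
    """Half-width of the 95% confidence interval for a proportion p estimated from n samples."""
    return z * math.sqrt(p * (1.0 - p) / n)

accuracy = 0.9186     # DOD–NSSN accuracy from Table 4
n_samples = 10000     # hypothetical evaluation-set size (assumption)
print(f"±{100 * margin_of_error(accuracy, n_samples):.4f} percentage points")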
Table 7. T-Statistic Comparison Between DOD–NSSN and Baseline Models.
Model | Accuracy | Precision | Recall | F1-Score
t-Test
NSMN | 8.2026 | 9.8631 | 14.7618 | 8.2719
XL-Net | 5.6829 | 5.1382 | 7.7594 | 8.0320
BERT | 5.5454 | 4.2369 | 5.7347 | 5.0994
XLM | 6.8276 | 3.3417 | 15.6962 | 4.8343
RoBERTa | 6.2817 | 3.1850 | 6.0124 | 4.6282
p-Value
NSMN | 0.0000 | 0.0000 | 0.0000 | 0.0000
XL-Net | 0.0000 | 0.0000 | 0.0000 | 0.0000
BERT | 0.0000 | 0.0000 | 0.0000 | 0.0000
XLM | 0.0000 | 0.0008 | 0.0000 | 0.0000
RoBERTa | 0.0000 | 0.0015 | 0.0000 | 0.0000
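The sketch below illustrates one way a t-statistic and p-value comparing DOD–NSSN with a baseline, as in Table 7, could be computed from repeated evaluation runs using SciPy. The per-run accuracy lists are hypothetical placeholders, and Welch's t-test is an assumed choice; the values in Table 7 may have been produced with a different procedure.

from scipy import stats

# Hypothetical per-run accuracies (e.g., per fold or per random seed).
dod_nssn_acc = [0.9180, 0.9190, 0.9185, 0.9192, 0.9183]
nsmn_acc     = [0.6930, 0.6950, 0.6940, 0.6955, 0.6938]

# Welch's t-test (does not assume equal variances across the two models).
t_stat, p_value = stats.ttest_ind(dod_nssn_acc, nsmn_acc, equal_var=False)
print(f"t = {t_stat:.4f}, p = {p_value:.4g}")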
Table 8. DOD–NSSN Evaluation Metrics Across Various Topics.
Topic | Accuracy | Precision | Recall | F1-Score
Sports | 97.75% | 96.76% | 95.60% | 96.95%
Government | 97.13% | 97.24% | 92.71% | 96.12%
Political | 97.49% | 95.95% | 95.11% | 96.67%
Health | 82.50% | 98.36% | 96.00% | 96.19%
Industry | 98.37% | 98.60% | 98.30% | 97.79%
