Semantic Textual Similarity with Constituent Parsing Heterogeneous Graph Attention Networks

Wu, Hao; Huang, Degen; Lin, Xiaohui

doi:10.3390/sym17040486

by Hao Wu *, Degen Huang and Xiaohui Lin

School of Computer Science and Technology, Dalian University of Technology, Dalian 116000, China

* Author to whom correspondence should be addressed.

Symmetry 2025, 17(4), 486; https://doi.org/10.3390/sym17040486

Submission received: 11 February 2025 / Revised: 11 March 2025 / Accepted: 21 March 2025 / Published: 24 March 2025

(This article belongs to the Section Computer)

Abstract

Semantic Textual Similarity (STS) serves as a metric for evaluating the semantic symmetry between texts, playing a pivotal role in various natural language processing (NLP) tasks. To facilitate the accurate measurement of semantic symmetry, high-quality text representation is essential. This paper studies how to utilize constituent parsing for text representation in STS. Unlike most existing syntax models, we propose a heterogeneous graph attention network that integrates constituent parsing (HGAT_CP). The heterogeneous graph contains meaningfully connected sentence, verb phrase (VP), noun phrase (NP), phrase, and word nodes, which are derived from the constituent parsing tree. This graph is fed to a graph attention network for context propagation among relevant nodes, which effectively captures the relations of inter-sentence components. In addition, we leverage the relationships between VPs and NPs across sentence pairs for data augmentation; this variant is denoted as HGAT_CP(NP, VP). We extensively evaluate our method on three datasets, and experimental results demonstrate that HGAT_CP(NP, VP) achieves significant improvements on the majority of the datasets. Notably, on the SICK dataset, HGAT_CP(NP, VP) achieved improvements of 0.39 and 1.84 compared to SimCSE-RoBERTa_large and SimCSE-RoBERTa_base, respectively.

Keywords:

semantic textual similarity; constituent parsing; heterogeneous graph attention networks; natural language processing

1. Introduction

Semantic Textual Similarity (STS) quantifies the degree of semantic symmetry between two texts. STS serves as an essential preliminary step in various natural language processing (NLP) tasks, including intelligent question answering, machine translation (MT), and automatic summarization. In intelligent question answering systems, candidate answers are ranked based on their semantic similarity to the given question [1,2]. In MT, sentence similarity is employed to determine if two sentences convey the same meaning [3,4,5].

The semantic symmetry of text is manifested not only in the lexical symmetry between texts but also in the semantic symmetry arising from the grammatical combination of words. Lexicon-based symmetry approaches [6,7,8,9] compute the symmetry between the character or word sequences of the text being compared. In contrast, semantic-based symmetry methods lack the capability to offer intuitive comparisons directly. Therefore, it is necessary to transform textual data into comparable representations to quantify text symmetry effectively. In this study, we employ vector-based representations of text and utilize text vectors to measure the degree of semantic symmetry. Consequently, the quality of text representation plays a critical role in ensuring accurate and reliable symmetry analysis. Current text representation approaches are mostly based on two models: the sequence model and the structured model. Regarding the studies on the sequence model, Pagliardini et al. [10] and Le et al. [11] proposed Sent2vec and Doc2Vec to calculate sentence similarity. These methods leverage pre-trained word and sentence embeddings directly for similarity tasks without fine-tuning or training additional neural network models.

He et al. [12] proposed an elaborate convolutional network (ConvNet) variant that infers sentence similarity through multi-level feature extraction and diverse pooling mechanisms. Mueller et al. [13] proposed Siamese recurrent neural networks, which use two identical long short-term memory (LSTM) networks and utilize the Manhattan distance as the similarity metric between the two subnetworks. Tharindu et al. [14] extended the Siamese recurrent neural networks and used thesaurus-based augmentation [15] to add 10,022 additional training examples. The Siamese recurrent neural networks achieved excellent results. Bidirectional encoder representations from transformers (BERT) [16] is a pre-trained transformer network [17] that has set new state-of-the-art results for various NLP tasks. BERT and RoBERTa [18] have established new state-of-the-art benchmarks in sentence pair regression tasks, such as those encountered in the STS task. Reimers et al. [19] introduced Sentence-BERT (SBERT), an adaptation of the pre-trained BERT model, which employs Siamese and triplet network architectures to generate semantic sentence embeddings. Based on the BERT model, Gao et al. [20] proposed SimCSE, a straightforward yet effective contrastive sentence embedding framework that generates high-quality sentence embeddings. Many contrastive learning approaches [21], including SimCSE, typically consist of two key components: data augmentation techniques and an instance-level contrastive loss function. However, the inherent discreteness of textual data poses significant challenges in formulating universal guidelines for effective text augmentation. Zhang et al. [22] addressed the challenge by introducing a neighborhood-guided virtual augmentation strategy to facilitate contrastive learning.

Given the widely acknowledged importance of structural information for NLP [23,24], there has been growing interest in enhancing text representation through the incorporation of syntactic information.

In order to use sentence structure information, Richard et al. [25] proposed to use a recurrent neural network to construct a text representation. Their method applied a pre-specified parse tree to the recurrent neural network (RNN) to construct a structured representation. Tai et al. [26] proposed a generalization of the standard LSTM architecture to tree-structured models and demonstrated its superiority for sentence representation compared to sequential models.

Many studies have improved pre-trained language models (LMs) by integrating syntax-driven attention mechanisms into the transformer architecture [27,28,29]. These approaches leverage the additional components to generate syntax-aware representations, which are then combined with the original outputs from the standard transformer to produce a final syntax-enhanced representation. Additionally, some research has focused on incorporating syntax-related objectives during the pre-training phase, including tasks like syntax head prediction [30] and dependency distance prediction [28]. For instance, Zhang et al. [22] introduced the syntax-guided contrastive language model (SynCLM), which employs syntactic information to construct contrastive positive and negative samples. These samples are utilized to enhance the pre-trained LM’s ability to acquire deeper syntactic knowledge through contrastive learning.

The structural information of a sentence generally refers to its parsing tree, and the tree structure of a parsing tree is easy to embed into a graph. Therefore, many researchers have started using graph neural networks to fuse parsing trees [31]. Ahmad et al. [32] adapted vanilla transformers by incorporating graph attention transformer encoders (GATEs) to encode syntactic structures derived from dependency trees. Devianti et al. [33] proposed a dependency-aware GAT-based model, specifically two-attention relational GATs (TAGATs). Zhang et al. [34] introduced Semantic Anchor Graph Neural Networks (SEAN-GNNs), a novel framework designed for fine-grained emotion classification tasks.

In this paper, we introduce a heterogeneous graph neural network designed for the Semantic Textual Similarity (STS) task. The graph structure is constructed using constituent parsing, while a graph attention mechanism facilitates information propagation based on the derived constituent parsing tree. Unlike most previous works, our proposed model is specifically designed for the STS task within the graph formulated from constituent parsing, as it specifically models sentence-level, phrase-level, and word-level information. In addition, based on the relationships between sentence nodes, we augment the dataset by establishing corresponding links between noun phrases (NPs) and verb phrases (VPs) at the first layer of the syntax tree, leveraging the syntactic characteristics of the sentences. In our experiments, we devised multiple mechanisms for message propagation among diverse node types to investigate which transmission methods result in superior inter-sentence representation learning.

This study makes three key contributions:

This paper utilizes constituent parsing to construct heterogeneous graphs and employs a graph attention mechanism to propagate information.
This paper designs different message propagation paths.
This paper augments the dataset by establishing corresponding links between noun phrases and verb phrases within the syntax tree.

2. Related Work

2.1. The Application of Graph Attention Network in Field of NLP

Since its proposal, the graph attention network (GAT) has found widespread application in various domains of natural language processing (NLP), yielding promising results. Xu et al. [35] introduced the Double-Branch Multi-Attention Graph Neural Network (MA-GNN), which is designed to preserve both global and local structural information for inductive Knowledge Graph Completion (KGC). Liang et al. [31], on the other hand, presented BiSyn-GAT+, which exploits syntactic information derived from sentence constituent trees. This approach models sentiment-aware contexts for individual aspects and sentiment relations among different aspects, ultimately enhancing the learning process. Zhang et al. [36] proposed dual-graph attention networks (DualGATs) to improve the accuracy of dialogue event relation extraction (DERC) by considering the complementarity of discourse structure and speaker-aware context concurrently.

To tackle the challenge of representing diverse node and edge types during composition, Wang et al. [37] explored the use of heterogeneous graphs and conducted extensive studies on heterogeneous graph attention networks (HGATs). Building on this, Chen et al. [38] developed an HGAT to improve Emotion Recognition in Conversations (ERC) by leveraging dialogue data within a graph attention framework, enabling effective context propagation across relevant nodes to capture conversational dynamics. For short-text classification in Heterogeneous Information Networks (HINs), Yang et al. [39] proposed an HGAT model incorporating a dual-level attention mechanism, combining node-level and type-level attention. Additionally, You et al. [40] introduced HeterTls, a joint learning-based heterogeneous graph attention network for Timeline Summarization (TLS), which unifies date selection and event detection to enhance extraction precision while reducing redundant information. In addition, Chen et al. [41] proposed a Community-Aware Heterogeneous Graph Contrastive Learning (CACL) framework for social media bot detection, which models social networks as heterogeneous graphs with multiple node and edge types to capture complex interactions. Ye et al. [42] introduced a Multi-source Data Fusion-based Heterogeneous Graph Attention Network (MHGAT), which integrates diverse company descriptions and heterogeneous business relationships to concurrently model market commonality and resource similarity, facilitating a more robust and precise competitor identification process. These works collectively highlight the significant potential of heterogeneous graph neural networks in advancing NLP applications. In this paper, we hypothesize that sentences, phrases, and words can be regarded as nodes of distinct types. Consequently, when transmitting information within the GAT framework, we adopt transmission methods tailored to their respective types.

2.2. Fusion Syntax Model

Syntax serves as a vital prior knowledge for neural network models in NLP. Numerous innovative methods have been developed in this direction, including Tree-LSTM [26], PECNN [43], SDP-LSTM [44], Supervised Treebank Conversion [45], PRPN [46], and ON-LSTM [47].

Recent works also integrate syntactic knowledge into transformers and BERT. Sachan et al. [48] examined widely adopted strategies for integrating dependency structures into pre-trained language models. Also using dependency parsing, Li et al. [27] proposed syntax-aware local attention (SLA), which applies dependency parsing to the input text and integrates it with BERT. In addition, for constituency parsing, Syntax-BERT [29] features a comprehensive self-attention taxonomy and decomposes the self-attention network into multiple subnetworks based on the tree structure. The above research indicates that using syntactic structures in general domains can improve the performance of BERT and transformers. Some researchers combine domain-specific information to construct sentence frameworks and add them to the corresponding models. Bao et al. [49] proposed an innovative opinion tree parser that captures and analyzes sentiment elements through a structured opinion tree framework. This framework is built on a generalized and carefully crafted context-free opinion grammar, which systematically organizes sentiment elements into a unified and coherent opinion tree representation.

Specifically, Tang et al. [50] proposed three syntax-enhanced models all built upon the specialized BioBERT architecture. These models are Chunking-Enhanced-BioBERT, Constituency-Tree-BioBERT—which integrates constituency parsing details—and Multi-Task-Syntactic (MTS)-BioBERT. In the MTS-BioBERT model, syntactic knowledge is indirectly incorporated by adding syntax-related tasks to the training objectives.

Multiple studies demonstrate that gold syntax trees can significantly enhance the performance of semantic representation. Our method incorporates syntax trees into heterogeneous graphs, facilitating more efficient and precise information transfer between diverse nodes.

3. Methodology

3.1. Graph Attention Network

The graph attention network (GAT) is a deep learning framework tailored for graph-structured data, allowing it to efficiently model node relationships and derive their representations. For a graph $G = (V, E)$ with node embeddings $h = \{\vec{h}_{1}, \vec{h}_{2}, \dots, \vec{h}_{N}\}$, where $\vec{h}_{i} \in \mathbb{R}^{F}$, a GAT layer transforms these embeddings into $h' = \{\vec{h}'_{1}, \vec{h}'_{2}, \dots, \vec{h}'_{N}\}$ through a self-attention mechanism. During the attention computation, each node attends only to its neighbors (masked attention), which produces a set of attention coefficients $\alpha_{ij}$:

$$\alpha_{ij} = \frac{\exp(\mathrm{LeakyReLU}(\tilde{a}^{T} [W \vec{h}_{i} \,\|\, W \vec{h}_{j}]))}{\sum_{k \in N_{i}} \exp(\mathrm{LeakyReLU}(\tilde{a}^{T} [W \vec{h}_{i} \,\|\, W \vec{h}_{k}]))} \tag{1}$$

where $W \in \mathbb{R}^{F' \times F}$ represents a weight matrix that maps embeddings to an alternative feature space, $\tilde{a} \in \mathbb{R}^{2F'}$ denotes the weight vector of a feedforward neural network used to calculate the attention coefficients, and $N_{i}$ is the neighborhood of node $i$. The symbol $\|$ indicates the concatenation of two vectors. These computed attention weights are then used to refine the original node features, producing enriched representations as defined by Equation (2):

$$\vec{h}'_{i} = \Big\Vert_{k=1}^{K} \sigma\Big(\sum_{j \in N_{i}} \alpha_{ij}^{k} W^{k} \vec{h}_{j}\Big) \tag{2}$$

where $K$ is the number of attention heads and $\sigma$ is a nonlinear activation function.
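As an illustration of Equations (1) and (2), the following is a minimal single-head sketch in plain Python. It is not the authors' implementation: the helper names (`gat_attention`, `gat_update`), the toy embeddings, the LeakyReLU slope of 0.2, and the choice of `tanh` for $\sigma$ are all assumptions made for the example.

```python
import math

def leaky_relu(x, slope=0.2):
    # Slope 0.2 is an assumed value; the text does not specify it.
    return x if x > 0 else slope * x

def matvec(W, h):
    # W is a list of rows with shape F' x F; h has length F.
    return [sum(w * x for w, x in zip(row, h)) for row in W]

def gat_attention(h, neighbors, W, a, i):
    """Attention coefficients alpha_ij over the neighbors of node i (Eq. 1)."""
    Wh_i = matvec(W, h[i])
    scores = []
    for j in neighbors[i]:
        Wh_j = matvec(W, h[j])
        # Wh_i + Wh_j is list concatenation, i.e. [W h_i || W h_j].
        scores.append(leaky_relu(sum(ak * x for ak, x in zip(a, Wh_i + Wh_j))))
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return {j: e / z for j, e in zip(neighbors[i], exps)}

def gat_update(h, neighbors, W, a, i):
    """Single-head version of Eq. 2, with sigma assumed to be tanh."""
    alpha = gat_attention(h, neighbors, W, a, i)
    out = [0.0] * len(W)
    for j, weight in alpha.items():
        Wh_j = matvec(W, h[j])
        out = [o + weight * x for o, x in zip(out, Wh_j)]
    return [math.tanh(o) for o in out]

# Toy graph: node 0 attends to itself and to nodes 1 and 2.
h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
neighbors = {0: [0, 1, 2]}
W = [[1.0, 0.0], [0.0, 1.0]]
a = [0.5, -0.5, 1.0, 0.0]

alpha = gat_attention(h, neighbors, W, a, 0)
h0_new = gat_update(h, neighbors, W, a, 0)
```

A multi-head layer would run `gat_update` once per head with separate `W` and `a` and concatenate the resulting vectors.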

3.2. Heterogeneous Graph Neural Network (HGNN)

A substantial body of work on graph neural networks has focused on homogeneous graphs, where all nodes and edges are assumed to be of the same type. However, given the inherent complexity and diversity of real-world systems, the attributes of entities and their relationships can vary significantly. Consequently, homogeneous graphs often fall short of accurately representing these complexities. In contrast, heterogeneous graphs, which incorporate multiple types of nodes and edges, provide a more nuanced and accurate framework for the mathematical modeling of real-world phenomena.

A heterogeneous graph is characterized by its topology $G = (V, E)$, along with a node type mapping $\phi : V \to A$ and an edge type mapping $\psi : E \to R$. Specifically, a graph is considered heterogeneous if the number of node types $|A| > 1$ or the number of edge types $|R| > 1$.

To adapt neural networks for heterogeneous graphs, advanced mechanisms for information extraction and message propagation must be carefully designed. One such approach involves the use of meta-paths, which are flexible structures that capture diverse semantic relationships within heterogeneous graphs.

A meta-path is a sequence defined over the node- and edge-type sets $\{A, R\}$, expressed as $A_{1} \xrightarrow{R_{1}} A_{2} \xrightarrow{R_{2}} \dots \xrightarrow{R_{l}} A_{l+1}$. It represents a composite relation $R = R_{1} \circ R_{2} \circ \dots \circ R_{l}$ connecting entities $A_{1}$ and $A_{l+1}$.
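The definitions above can be made concrete with a small sketch. The node identifiers, type names, and helper functions (`is_heterogeneous`, `follows_meta_path`) below are hypothetical and chosen only for illustration; they mirror the type mappings and the meta-path definition rather than any code from the paper.

```python
# Node type mapping (phi) and typed edges (psi) for a toy graph.
node_type = {"s1": "sentence", "p1": "phrase", "w1": "word", "w2": "word"}
edges = [
    ("w1", "p1", "word-phrase"),
    ("w2", "p1", "word-phrase"),
    ("p1", "s1", "phrase-sentence"),
]

def is_heterogeneous(node_type, edges):
    """True if there is more than one node type or more than one edge type."""
    node_types = set(node_type.values())
    edge_types = {rel for _, _, rel in edges}
    return len(node_types) > 1 or len(edge_types) > 1

def follows_meta_path(path, type_seq, rel_seq, node_type, edges):
    """Check that a node sequence instantiates A_1 -R_1-> A_2 ... -R_l-> A_{l+1}."""
    if len(path) != len(type_seq) or len(rel_seq) != len(path) - 1:
        return False
    if any(node_type[v] != t for v, t in zip(path, type_seq)):
        return False
    edge_type = {(u, v): rel for u, v, rel in edges}
    return all(edge_type.get((u, v)) == rel
               for u, v, rel in zip(path, path[1:], rel_seq))

ok = follows_meta_path(
    ["w1", "p1", "s1"],
    ["word", "phrase", "sentence"],
    ["word-phrase", "phrase-sentence"],
    node_type,
    edges,
)
```

Here `ok` is true because the path from a word through a phrase to the sentence matches both the node-type sequence and the edge-type sequence.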

In this study, we first construct a heterogeneous graph that incorporates hierarchical and functional language elements from the dataset. We then utilize heterogeneous graph attention mechanisms guided by task-specific meta-paths to enhance performance.

3.3. Model Overview

In this work, a constituent parsing tree is modeled as a heterogeneous graph to capture both inter-phrase and inter-sentence relations. We initially employ BERT to encode information at the sentence, phrase, and word levels. The resulting sentence encodings, word embeddings, and phrase encodings are then logically integrated to construct the heterogeneous graph, which is described in detail later in this section. Additionally, this graph facilitates information propagation between adjacent nodes via a graph attention layer [51]. The specific flowchart is illustrated in Figure 1.

3.3.1. Graph Construction

In the proposed heterogeneous graph, nodes are categorized into five types: sentence nodes, phrase nodes, phrase_NP nodes, phrase_VP nodes, and word nodes. Each type of node stores information about the corresponding sentence components. Edge connections are established between nodes based on the structure of the syntax tree, which determines the connectivity between them. Utilizing these connections, we employ graph attention networks (GATs) to facilitate information transmission across the nodes. Below, we provide a detailed description of each component.

Sentence Nodes. Sentence nodes in a parsing tree are utilized to capture sentence-level information, which encompasses both the sentence itself and its corresponding encoding. A sentence node is usually the root node of a syntactic tree, and its difference from a phrase node is that it is a complete sentence, while a phrase node is usually a component of a sentence. The initialization of this sentence encoding is derived from BERT.

Phrase Node, Phrase_NP Node, Phrase_VP Node. Phrase nodes are used to represent the phrase-level information in a parsing tree. All nodes of the parsing tree other than the root node and the leaf nodes are called phrase nodes. The noun phrases and verb phrases connected directly to the root node in the syntax tree are referred to as the phrase_NP node and phrase_VP node, respectively. Because the number of layers varies across parsing trees, we uniformly selected the first two layers of each parsing tree as phrase nodes. We used BERT to initialize the phrase nodes.

Word Nodes. Word nodes correspond to the vocabulary present in the sentence and its phrases. Each word node is linked to the sentence that includes it, as well as to all associated phrases in the parsing tree. The initial states of these word nodes are set using BERT embeddings.

Edge Construction. The graph includes seven types of edges: sentence-sentence, sentence-word, sentence-phrase, phrase-phrase, phrase-word, phrase_NP-phrase_NP, and phrase_VP-phrase_VP. The constructed graph is undirected, so an edge between nodes A and B, denoted as A-B, indicates a bidirectional connection; the direction of information propagation, however, is set explicitly by us. Each edge is given its type based on the relationship between its vertices. For the edges between sentences and words, as well as between phrases and words, we used POS tags as their type. POS tags represent not only the local information of words in the sentence or phrase but also their relationships. We trained the initial vectors of the POS tags using word2vec. For the edges between sentences and phrases, as well as the edges between phrases, we used phrase tags as their type. The functions of phrase tags and POS tags are similar, but the phrase structure is more complex. The initial vectors of the phrase tags are also trained using word2vec. The phrase_NP-phrase_NP and phrase_VP-phrase_VP edges differ from the other edge types in that they connect phrases of the same type across the two sentences of a pair, providing a more detailed representation for training phrases.
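The construction described in this subsection can be sketched as follows. This is a simplified illustration, not the released code: the nested-tuple tree format, the node identifiers, and the function `build_graph` are assumptions, and words under phrases deeper than the second layer are attached to the last phrase node that is kept, which is one possible reading of the layer limit.

```python
def is_preterminal(tree):
    # A preterminal looks like ("NN", "dog"): a POS tag over a single word.
    return len(tree) == 2 and isinstance(tree[1], str)

def leaves(tree):
    """Yield (pos_tag, word) pairs of a nested-tuple constituency tree."""
    if is_preterminal(tree):
        yield tree[0], tree[1]
    else:
        for child in tree[1:]:
            yield from leaves(child)

def build_graph(tree, sent_id="s", max_phrase_depth=2):
    nodes = {sent_id: "sentence"}   # node id -> node type
    words = {}                      # word node id -> surface form
    edges = []                      # (source, target, edge kind, tag)
    counter = {"p": 0, "w": 0}

    def add_word(pos, word, parent_id):
        wid = f"{sent_id}/w{counter['w']}"
        counter["w"] += 1
        nodes[wid] = "word"
        words[wid] = word
        edges.append((wid, sent_id, "sentence-word", pos))   # typed by POS tag
        if parent_id != sent_id:
            edges.append((wid, parent_id, "phrase-word", pos))

    def visit(sub, parent_id, depth):
        if is_preterminal(sub):
            add_word(sub[0], sub[1], parent_id)
            return
        if depth > max_phrase_depth:
            # Deeper phrases are dropped; their words hang off the kept parent.
            for pos, word in leaves(sub):
                add_word(pos, word, parent_id)
            return
        label = sub[0]
        pid = f"{sent_id}/p{counter['p']}"
        counter["p"] += 1
        first_layer = depth == 1
        nodes[pid] = f"phrase_{label}" if first_layer and label in ("NP", "VP") else "phrase"
        kind = "sentence-phrase" if first_layer else "phrase-phrase"
        edges.append((pid, parent_id, kind, label))           # typed by phrase tag
        for child in sub[1:]:
            visit(child, pid, depth + 1)

    for child in tree[1:]:
        visit(child, sent_id, 1)
    return nodes, edges, words

tree = ("S",
        ("NP", ("DT", "A"), ("NN", "dog")),
        ("VP", ("VBZ", "chases"), ("NP", ("DT", "a"), ("NN", "ball"))))
nodes, edges, words = build_graph(tree)
```

For this toy sentence the graph contains one sentence node, one phrase_NP node, one phrase_VP node, one ordinary phrase node (the object NP on the second layer), and five word nodes. In the full model each node would carry a BERT-derived vector and each edge tag a word2vec-trained embedding.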

3.3.2. Graph Attention Layer

We employ the graph attention mechanism [51] to aggregate word-level and phrase-level information into sentence nodes. Consider a node $i$ and a node $j$ in its neighborhood. The graph attention mechanism below describes how node $i$ aggregates information from its neighbors to update its own representation:

$$F(h_{i}, h_{j}) = \mathrm{LeakyReLU}(\vec{a}^{T} [W_{i} h_{i}; W_{j} h_{j}; E_{ij}]) \tag{3}$$

$$\alpha_{ij} = \mathrm{softmax}(F(h_{i}, h_{j})) = \frac{\exp(F(h_{i}, h_{j}))}{\sum_{k} \exp(F(h_{i}, h_{k}))} \tag{4}$$

$$h'_{i} = \Big\Vert_{k=1}^{K} \sigma\Big(\sum_{j} \alpha_{ij}^{k} W_{q}^{k} h_{j}\Big) \tag{5}$$

where $h_{i}$ and $h_{j}$ denote the representations of nodes $i$ and $j$, respectively; $W_{i}$, $W_{j}$, and $W_{q}$ are trainable weight matrices, while $\vec{a}^{T}$ is a trainable weight vector; $E_{ij}$ represents the edge weight matrix mapped to a multi-dimensional embedding space; $\alpha_{ij}$ denotes the attention weight between nodes $i$ and $j$; $\sigma$ is an activation function; and $\|$ indicates the concatenation operation. The sum in Equation (4) runs over the neighbors $k$ of node $i$, whereas $k$ in Equation (5) indexes the $K$ attention heads.
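To show how the edge term $E_{ij}$ enters the score, here is a small plain-Python sketch of Equations (3) and (4). The function name `edge_aware_attention`, the one-dimensional edge embeddings, and all numeric values are assumptions for illustration; in the paper the edge embeddings come from word2vec-trained tag vectors.

```python
import math

def leaky_relu(x, slope=0.2):
    # Slope 0.2 is an assumed value.
    return x if x > 0 else slope * x

def matvec(W, h):
    return [sum(w * x for w, x in zip(row, h)) for row in W]

def edge_aware_attention(h_i, nbrs, W_i, W_j, a, edge_emb):
    """Attention weights over neighbors, following Eqs. (3)-(4).

    nbrs is a list of (h_j, edge_tag) pairs; edge_emb maps a tag to E_ij.
    """
    query = matvec(W_i, h_i)
    scores = []
    for h_j, tag in nbrs:
        # List concatenation implements [W_i h_i; W_j h_j; E_ij].
        z = query + matvec(W_j, h_j) + edge_emb[tag]
        scores.append(leaky_relu(sum(ak * zk for ak, zk in zip(a, z))))
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

identity = [[1.0, 0.0], [0.0, 1.0]]
edge_emb = {"NN": [1.0], "DT": [0.0]}   # toy 1-d embeddings for two POS-typed edges
a = [0.0, 0.0, 0.0, 0.0, 1.0]           # scores depend only on the edge embedding here

# Two neighbors with identical features but different edge types.
weights = edge_aware_attention(
    [1.0, 0.0],
    [([0.0, 1.0], "NN"), ([0.0, 1.0], "DT")],
    identity, identity, a, edge_emb,
)
```

Because the two neighbors share the same features, any difference between the two weights comes from the edge-type embedding alone, which is the effect the extra term in Equation (3) is meant to provide.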

3.3.3. Message Propagation

While graph attention mechanisms excel at aggregating features from neighboring nodes, designing an efficient information transmission framework remains essential. We therefore designed the information transmission paths based on the structure of the sentence. To enhance the richness of node features, we iteratively updated all fundamental nodes multiple times. To reduce the complexity of the syntax tree, we limited it to four layers: a root node, two layers of phrase nodes, and one layer of word nodes.

In the syntax tree, the sentence node is directly connected only to the first-layer phrases, so passing information along the tree alone greatly weakens the role of words in the sentence. We therefore also connect the sentence node directly to the leaf (word) nodes of the syntax tree and transmit information between them. For phrases, the first-layer phrases are directly connected to the sentence and to the second-layer phrases, and information is transmitted between them. Finally, the second-layer phrases and the words are directly connected and exchange information. The paths can be denoted as $V_{w} \to V_{s}$, $V_{w} \to V_{p} \to V_{p} \to V_{s}$, $V_{NP} \to V_{NP}$, or $V_{VP} \to V_{VP}$, where $V_{w}$, $V_{p}$, $V_{s}$, $V_{NP}$, and $V_{VP}$ refer to word nodes, phrase nodes, sentence nodes, phrase_NP nodes, and phrase_VP nodes, respectively.

Unlike the other paths, the edge category for $V_{NP} \to V_{NP}$ and $V_{VP} \to V_{VP}$ is assigned based on the category of the sentence pair. In the experiment, two primary phrase-pair constructions were developed for sentence pairs within the Stanford Natural Language Inference (SNLI) dataset. Specifically, when the sentence pair category is “entailment”, we consider the category of the first-layer NP pair and VP pair in the two parsing trees to be “entailment”. When the sentence pair is “contradiction”, we consider the category of the first-layer VP pair in the two parsing trees to be “contradiction”. When the sentence pair is “neutral”, we skip it.

According to the different transmission methods, we set up three models for experimentation. The first includes only the $V_{w} \to V_{s}$ path and is denoted as HGAT_W-S. The second includes the two paths $V_{w} \to V_{s}$ and $V_{w} \to V_{p} \to V_{p} \to V_{s}$ and is denoted as HGAT_CP. The third includes all paths and is denoted as HGAT_CP(NP, VP).
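The labeling rule for the cross-sentence NP and VP links, and the path sets of the three model variants, can be summarized in a short sketch. The function `cross_sentence_phrase_links`, the dictionary `MODEL_PATHS`, and the node identifiers are hypothetical names introduced here; only the rule itself is taken from the text.

```python
# Propagation paths used by each model variant, as listed in the text.
MODEL_PATHS = {
    "HGAT_W-S": ["w->s"],
    "HGAT_CP": ["w->s", "w->p->p->s"],
    "HGAT_CP(NP, VP)": ["w->s", "w->p->p->s", "NP->NP", "VP->VP"],
}

def cross_sentence_phrase_links(label, phrases_a, phrases_b):
    """Create typed links between first-layer NP/VP nodes of a sentence pair.

    phrases_a / phrases_b map "NP" and "VP" to the first-layer phrase node id
    of each sentence (a key is absent if the sentence has no such phrase).
    """
    if label == "entailment":
        kinds = ("NP", "VP")      # both phrase pairs inherit "entailment"
    elif label == "contradiction":
        kinds = ("VP",)           # only the VP pair inherits "contradiction"
    else:
        return []                 # "neutral" pairs are skipped
    links = []
    for kind in kinds:
        if kind in phrases_a and kind in phrases_b:
            edge_kind = f"phrase_{kind}-phrase_{kind}"
            links.append((phrases_a[kind], phrases_b[kind], edge_kind, label))
    return links

a_nodes = {"NP": "a/p0", "VP": "a/p1"}
b_nodes = {"NP": "b/p0", "VP": "b/p1"}
entail_links = cross_sentence_phrase_links("entailment", a_nodes, b_nodes)
contra_links = cross_sentence_phrase_links("contradiction", a_nodes, b_nodes)
neutral_links = cross_sentence_phrase_links("neutral", a_nodes, b_nodes)
```

Under this rule an entailment pair contributes two additional typed phrase pairs, a contradiction pair contributes one, and a neutral pair contributes none, which is how the NP and VP links act as data augmentation.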

3.3.4. Similarity Calculation Model

For a given sentence pair, we predict a similarity score as a real number within the interval $[1, K]$, where $K > 1$ is an integer. The sequence $\{1, 2, \dots, K\}$ defines an ordinal similarity scale, where higher values correspond to stronger similarity. Real-valued scores are permitted to reflect ground truth ratings derived from the averaged assessments of multiple human annotators.

Utilizing these sentence representations, we estimate the similarity score $\hat{y}$ by employing a neural network that considers both the distance and the angle between the vector pair ($h_{L}$, $h_{R}$):

$$h_{\times} = h_{L} \odot h_{R} \tag{6}$$

$$h_{+} = |h_{L} - h_{R}| \tag{7}$$

$$h_{s} = \sigma(W^{(\times)} h_{\times} + W^{(+)} h_{+} + b^{(h)}) \tag{8}$$

$$\hat{p}_{\theta} = \mathrm{softmax}(W^{(p)} h_{s} + b^{(p)}) \tag{9}$$

$$\hat{y} = r^{T} \hat{p}_{\theta} \tag{10}$$

where $r^{T} = [1, 2, \dots, K]$, $\odot$ denotes element-wise multiplication, and the absolute value operation is applied element-wise. The inclusion of both measures $h_{\times}$ and $h_{+}$ is empirically supported, as their combined use has been shown to achieve better results than employing either one individually. The multiplicative measure $h_{\times}$ is the element-wise product of the two input representations. The subtractive measure $h_{+}$ quantifies the discrepancy between the input representations. $\hat{p}_{\theta}$ denotes the predicted probability distribution over the $K$ similarity levels, while $\hat{y}$ represents the predicted similarity score itself.

The cost function is the regularized KL-divergence [52] between $p$ and $\hat{p}_{\theta}$:

$$J(\theta) = \frac{1}{m} \sum_{k=1}^{m} \mathrm{KL}(p^{(k)} \,\|\, \hat{p}_{\theta}^{(k)}) + \frac{\lambda}{2} \|\theta\|_{2}^{2} \tag{11}$$

where $m$ denotes the number of training sentence pairs, the superscript $k$ represents the $k$-th sentence pair, and $\lambda$ is the regularization coefficient. It is important to note that we employed the identical KL-divergence loss function and the same sparse target distribution technique as described by [26].
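The sparse target distribution is not spelled out in this section. The sketch below follows the definition in the Tree-LSTM work cited as [26], where a gold score $y$ puts mass $y - \lfloor y \rfloor$ on level $\lfloor y \rfloor + 1$ and the remaining mass on level $\lfloor y \rfloor$. The function names and example values are assumptions made for illustration, and the regularization term of Equation (11) is left out.

```python
import math

def pair_features(h_left, h_right):
    """Element-wise product and absolute difference (Eqs. 6 and 7)."""
    h_mul = [a * b for a, b in zip(h_left, h_right)]
    h_sub = [abs(a - b) for a, b in zip(h_left, h_right)]
    return h_mul, h_sub

def sparse_target(y, K):
    """Target distribution p over levels 1..K whose expectation equals y."""
    p = [0.0] * K
    low = math.floor(y)
    if low == y:
        p[low - 1] = 1.0            # integer score: all mass on one level
    else:
        p[low - 1] = low - y + 1    # mass on floor(y)
        p[low] = y - low            # mass on floor(y) + 1
    return p

def expected_score(p):
    """Eq. 10: y_hat = r^T p with r = [1, 2, ..., K]."""
    return sum((level + 1) * pk for level, pk in enumerate(p))

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q); terms with p_k = 0 contribute nothing."""
    return sum(pk * math.log(pk / max(qk, eps)) for pk, qk in zip(p, q) if pk > 0)

target = sparse_target(3.6, 5)       # mass split between levels 3 and 4
uniform = [0.2] * 5
loss_vs_uniform = kl_divergence(target, uniform)
```

The design keeps the regression target recoverable from the distribution: `expected_score(target)` returns the original gold score up to floating-point error, so minimizing the KL term drives $\hat{y}$ toward $y$.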

4. Experiments

4.1. Tasks and Datasets

4.1.1. Sentences Involving Compositional Knowledge (SICK) Dataset

This dataset originates from the 2014 SemEval competition [53] and comprises 9927 sentence pairs, including 4500 pairs in the training set, 500 pairs in the validation set, and 4927 pairs in the test set. Each sentence pair has been manually scored for semantic relatedness, with scores ranging from 1 to 5, where higher scores indicate a stronger degree of semantic similarity.

4.1.2. STS Benchmark Dataset

This dataset consists of 8628 sentence pairs from the STS Benchmark, which was utilized for STS tasks and curated within the context of SemEval from 2012 to 2016. The dataset encompasses text from diverse sources, including image captions, news headlines, and user forums. Specifically, it is divided into a training set of 5749 sentence pairs, a validation set of 1500 sentence pairs, and a test set of 1379 sentence pairs. The Microsoft Video Paraphrase Dataset (MSRVID) was collected by Agirre et al. [54] for the 2012 SemEval competition and comprises 1500 pairs of short video descriptions that were subsequently annotated. Among them, 1000 sentence pairs are used as the training corpus, 250 sentence pairs as the test corpus, and 250 sentence pairs as the validation corpus.

4.1.3. The Stanford Natural Language Inference (SNLI) Dataset

The Stanford Natural Language Inference (SNLI) dataset comprises 570,000 English sentence pairs, each manually annotated for balanced classification into entailment, contradiction, and neutral categories. Specifically, the SNLI dataset contains 550,000 training pairs, 10,000 validation pairs, and 10,000 test pairs.

4.2. Experimental Details and Hyperparameters

In our experimental setup, we optimized key hyperparameters, including the batch size, learning rate, and BERT hidden size, by evaluating model performance on the validation set. The specific configurations of these parameters are detailed in Table 1. The initialization vectors for Word, Phrase, and Sentence in BERT originally have a dimension of 768. To optimize storage efficiency, we standardized the dimensionality reduction process, reducing the vectors to 300 dimensions uniformly. For additional training details, including required data formats, package dependencies, hardware requirements, and important notes, please refer to our repository at https://github.com/WHnihao/HGAT-STS/ (accessed on 17 March 2025). All relevant information is comprehensively documented to facilitate smooth implementation and experimentation.

During the model training process, we initially employed the SNLI corpus for preliminary training, utilizing it as the foundational dataset to develop the model. Subsequently, the model was transferred to the SICK and STS benchmark datasets for further training. The SNLI dataset includes syntactic analysis for each sentence, whereas the SICK and STS benchmark datasets necessitate the use of NLTK tools for syntactic parsing. Given the extensive size of the SNLI dataset and the substantial number of phrases it encompasses, we imposed constraints on the syntax tree to extract only the first three layers of phrases, thereby effectively reducing the phrase count. Owing to the considerable number of phrase nodes, the training process demanded significant memory resources, approximately 100 GB, and each epoch required around 23 h to complete, which is notably longer than for other models.

4.3. Results on SICK Dataset

The results obtained from our experiments are summarized in Table 2. The evaluation metrics utilized by our model include Pearson’s correlation coefficient, Spearman’s rank correlation coefficient, and the Mean Squared Error (MSE). Our experimental setup comprised two configurations: one that excludes the use of syntactic structure and another that incorporates syntactic structure in the construction of sentence representations. To quantify the specific improvements in the results, we employed Spearman’s rank correlation coefficient for analysis. In the configuration devoid of syntactic structure, we observed that the SimCSE-RoBERTa large model yielded substantial improvements. Both SimCSE and BERT leveraged extensive corpora for pre-training; however, BERT’s performance on the SICK dataset was not satisfactory. SimCSE significantly enhanced the test results by incorporating a contrastive learning model on top of BERT’s pre-trained architecture, achieving an improvement of 9.04 over SBERT.
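For reference, the three evaluation metrics named above can be computed as in the following plain-Python sketch. The gold and predicted scores are made-up numbers, not results from the paper, and the helper names are chosen for this example; Spearman's coefficient is computed as Pearson's coefficient on average ranks.

```python
import math

def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)

def average_ranks(values):
    """1-based ranks; tied values receive the mean of their positions."""
    order = sorted(range(len(values)), key=lambda idx: values[idx])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        rank = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = rank
        start = end + 1
    return ranks

def spearman(x, y):
    return pearson(average_ranks(x), average_ranks(y))

def mse(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)

gold = [1.0, 2.5, 3.0, 4.5, 5.0]   # hypothetical human ratings
pred = [1.2, 2.0, 3.5, 4.0, 4.9]   # hypothetical model outputs

r = pearson(gold, pred)
rho = spearman(gold, pred)
err = mse(gold, pred)
```

Because the toy predictions preserve the ordering of the gold scores, the Spearman coefficient is 1 even though the Pearson coefficient and the MSE still register the numeric deviations.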

Among models that integrate syntactic information, Dep.LSTM, while not achieving the optimal performance, showed a 1.38 improvement over the BiLSTM baseline. Notably, our method achieved the best results when syntactic structures were incorporated.

To further investigate the impact of syntactic structures, we compared three message passing variants: HGAT_W-S, HGAT_CP, and HGAT_CP(NP, VP). HGAT_CP improved on HGAT_W-S by 0.92, which indicates that incorporating syntactic structures benefits the results. HGAT_CP(NP, VP) in turn clearly outperformed the other models. Incorporating NPs and VPs in the message passing process is akin to refining the sentences and enlarging the training corpus.
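The differences quoted in this section can be checked directly against the Spearman × 100 column of Table 2. The short sketch below copies those values into a dictionary and recomputes each gap. The dictionary and the `gain` helper are illustrative names, not part of the original work.

```python
# Spearman x 100 values copied from Table 2 (SICK test set).
spearman_x100 = {
    "BiLSTM": 79.45,
    "SBERT": 72.91,
    "SimCSE-RoBERTa base": 80.50,
    "SimCSE-RoBERTa large": 81.95,
    "Dep. LSTM": 80.83,
    "HGAT_W-S": 79.51,
    "HGAT_CP": 80.43,
    "HGAT_CP(NP, VP)": 82.34,
}


def gain(model: str, baseline: str) -> float:
    """Difference in Spearman x 100 between a model and a baseline, rounded to 2 decimals."""
    return round(spearman_x100[model] - spearman_x100[baseline], 2)


comparisons = [
    ("SimCSE-RoBERTa large", "SBERT"),
    ("Dep. LSTM", "BiLSTM"),
    ("HGAT_CP", "HGAT_W-S"),
    ("HGAT_CP(NP, VP)", "SimCSE-RoBERTa large"),
    ("HGAT_CP(NP, VP)", "SimCSE-RoBERTa base"),
]
for model, baseline in comparisons:
    print(f"{model} vs {baseline}: {gain(model, baseline):+.2f}")
```

The last two comparisons correspond to the 0.39 and 1.84 improvements stated in the abstract.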

4.4. Results on STS Benchmark Dataset

The results on the STS benchmark dataset are presented in Table 3. Compared with SimCSE-RoBERTa, which is trained on a large-scale corpus, the performance obtained using only the training set is lower. Our analysis points to two main causes. First, the STS corpus covers a diverse range of sentence types, with considerable variability among them. Second, the number of sentences of each type is small, whereas SimCSE-RoBERTa benefits from extensive corpus training and thus acquires richer information. Although our results fell short of those of pre-trained models such as SimCSE-RoBERTa, SBERT, and SRoBERTa, they significantly outperformed the other models trained exclusively on the training corpus.

4.5. Results on SNLI

We mainly compared against parsing-tree-based methods. As a representative of models without syntactic structure, we chose the classic BERT model. Table 4 shows that both BERT models surpassed our results (90.9 and 91.0 versus 90.5 for HGAT_CP(NP, VP)). Compared with BERT, models based on parsing trees are more expensive to compute and are not well suited to training on very large corpora, which makes further gains difficult. On the other hand, whereas BERT is trained as a black box, models based on the parsing tree expose more clearly which information is used for training and how the data are processed along the way.

4.6. Results Analysis

To further explore the influence of syntactic trees, we split sentences into short and long groups and analyzed each group separately, comparing the results before and after syntactic trees were applied. Sentences with fewer than 6 words were treated as short, and those with more than 13 words as long. Table 5 shows that, on both the SICK and STS benchmark datasets, long sentences benefited more from syntactic trees. Compared with HGAT_W-S, HGAT_CP(NP, VP) improved short sentences by 1.81 and 1.10 on the STS benchmark and SICK datasets, respectively, far less than the 3.67 and 2.35 gains on long sentences. These findings indicate that understanding the semantics of long sentences depends on a carefully constructed syntactic structure.
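The length split can be sketched as follows. The thresholds (fewer than 6 words, more than 13 words) come from the text. Everything else is an assumption made for illustration: words are counted by whitespace splitting, a sentence pair is assigned to the bucket of its longer sentence, and the names `length_bucket` and `group_pairs` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Pair = Tuple[str, str, float]  # (sentence 1, sentence 2, gold similarity score)


def length_bucket(sentence: str) -> str:
    """Classify one sentence by word count: <6 is short, >13 is long, the rest is other."""
    n_words = len(sentence.split())  # assumption: whitespace tokenisation
    if n_words < 6:
        return "short"
    if n_words > 13:
        return "long"
    return "other"


def group_pairs(pairs: List[Pair]) -> Dict[str, List[Pair]]:
    """Group sentence pairs by the bucket of the longer sentence (an assumed convention)."""
    groups: Dict[str, List[Pair]] = defaultdict(list)
    for s1, s2, score in pairs:
        longer = s1 if len(s1.split()) >= len(s2.split()) else s2
        groups[length_bucket(longer)].append((s1, s2, score))
    return dict(groups)


toy_pairs = [
    ("A man plays guitar", "A person plays music", 4.2),
    ("A man in a red shirt is playing an acoustic guitar on a small wooden stage",
     "Someone plays guitar", 3.6),
]
for bucket, items in group_pairs(toy_pairs).items():
    print(bucket, len(items))
```

Each bucket can then be scored separately with the same correlation metric used for the full test set.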

5. Conclusions

This paper presents a heterogeneous graph attention network that integrates constituent parsing (HGAT_CP). By building on the structure of parsing trees, HGAT_CP makes full use of parsing information to pass information within and between sentences. Based on the characteristics of the inference corpus, we used the internal phrase structure to expand the corpus, allowing phrases to play the role of sentences, which improves the representation inside the sentence and the extraction of external features. Compared with parsing-based methods, our method yielded significant improvements on several datasets, especially on the SICK corpus, where HGAT_CP(NP, VP) improved on HGAT_W-S by 1.42 in Pearson correlation (× 100). However, on the SNLI and STS benchmark datasets, our results were marginally lower than those of pre-trained models trained on large-scale corpora. This shows that our approach is competitive in certain settings while leaving room for refinement to close the gap with state-of-the-art pre-trained models. Because models based on syntactic structures are limited by the quality and scale of parsing trees, it is difficult for them to surpass pre-trained models in the short term. Nevertheless, whereas the training of pre-trained models offers little intuitive insight, training based on the parsing tree makes the processing steps more transparent.

The current model still presents several critical issues that need to be addressed, primarily in two key areas. First, the extensive number of phrase nodes results in significant memory consumption. To mitigate this, we intend to implement a phrase node pruning strategy in subsequent work to streamline the model architecture and reduce memory usage. Second, while this study emphasizes information transmission from word and phrase nodes to sentence nodes, it does not account for the reverse information transmission from sentence nodes back to word or phrase nodes. Addressing this bidirectional information transmission will be a pivotal and challenging focus for future research.

Author Contributions

Conceptualization, H.W. and D.H.; methodology, H.W.; software, H.W.; validation, H.W. and D.H.; formal analysis, H.W.; investigation, H.W.; resources, D.H.; data curation, H.W.; writing (original draft preparation), H.W.; writing (review and editing), H.W. and X.L.; visualization, H.W.; supervision, D.H.; project administration, D.H.; funding acquisition, D.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Key R&D Plans of Yunnan Province, grant number 202203AA080004, and the National Key R&D Plan, grant number 2020AAA0108004.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The SICK, STS Benchmark, and SNLI datasets are all publicly available datasets, which can be found at the following URLs: https://opendatalab.com/OpenDataLab/SICK/ (accessed on 20 September 2019); http://ixa2.si.ehu.eus/stswiki/ (accessed on 17 September 2019); https://nlp.stanford.edu/projects/snli/ (accessed on 18 March 2022).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

STS	Semantic Textual Similarity
NLP	Natural Language Processing
MT	Machine Translation
LSTM	Long Short-Term Memory
HGAT-CP	Heterogeneous Graph Attention Network that Integrates Constituent Parsing
BERT	Bidirectional Encoder Representations from Transformers
GAT	Graph Attention Network
NP	Noun Phrase
VP	Verb Phrase
HGAT	Heterogeneous Graph Attention Network
SICK	Sentences Involving Compositional Knowledge
SNLI	Stanford Natural Language Inference
MSRVID	Microsoft Video Paraphrase Dataset

References

1. Wang, M.; Smith, N.A.; Mitamura, T. What is the Jeopardy Model? A Quasi-Synchronous Grammar for QA. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, Prague, Czech Republic, 28–30 June 2007; pp. 22–32.
2. Yang, Y.; Yih, W.T.; Meek, C. WikiQA: A Challenge Dataset for Open-Domain Question Answering. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, 17–21 September 2015; pp. 2013–2018.
3. Santos, J.; Alves, A.; Gonçalo Oliveira, H. Leveraging on Semantic Textual Similarity for Developing a Portuguese Dialogue System. In Proceedings of the International Conference on Computational Processing of the Portuguese Language, Evora, Portugal, 2–4 March 2020; pp. 131–142.
4. Yin, W.; Schütze, H. Convolutional Neural Network for Paraphrase Identification. In Proceedings of the Human Language Technologies: The 2015 Annual Conference of the North American Chapter of the ACL, Denver, CO, USA, 31 May–5 June 2015; pp. 901–911.
5. He, H.; Wieting, J.; Gimpel, K.; Rao, J.; Lin, J. UMD-TTIC-UW at SemEval-2016 Task 1: Attention-Based Multi-Perspective Convolutional Neural Networks for Textual Similarity Measurement. In Proceedings of the SemEval-2016, San Diego, CA, USA, 16–17 June 2016; pp. 662–667.
6. Richardson, R.; Smeaton, A.F. Using WordNet in a Knowledge-Based Approach to Information Retrieval; Dublin City University, School of Computer Applications: Dublin, Ireland, 1995.
7. Niwattanakul, S.; Singthongchai, J.; Naenudorn, E.; Wanapu, S. Using of Jaccard Coefficient for Keywords Similarity. In Proceedings of the International Multiconference of Engineers and Computer Scientists, Hong Kong, China, 13–15 March 2013; Volume 1, pp. 380–384.
8. Opitz, J.; Daza, A.; Frank, A. Weisfeiler-Leman in the BAMBOO: Novel AMR Graph Metrics and a Benchmark for AMR Graph Similarity. Trans. Assoc. Comput. Linguist. 2021, 9, 1425–1441.
9. Wang, H.; Yu, D. Going Beyond Sentence Embeddings: A Token-Level Matching Algorithm for Calculating Semantic Textual Similarity. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, Toronto, ON, Canada, 9–14 July 2023; Volume 2: Short Papers, pp. 563–570.
10. Pagliardini, M.; Gupta, P.; Jaggi, M. Unsupervised Learning of Sentence Embeddings using Compositional N-gram Features. In Proceedings of the NAACL-HLT 2018, New Orleans, LA, USA, 1–6 June 2018; pp. 528–540.
11. Le, Q.; Mikolov, T. Distributed Representations of Sentences and Documents. In Proceedings of the 31st International Conference on Machine Learning (ICML'14), Beijing, China, 21–26 June 2014; Volume 32, pp. 1188–1196.
12. He, H.; Gimpel, K.; Lin, J. Multi-Perspective Sentence Similarity Modeling with Convolutional Neural Network. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, 17–21 September 2015; pp. 1576–1586.
13. Mueller, J.; Thyagarajan, A. Siamese Recurrent Architectures for Learning Sentence Similarity. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI'16), Phoenix, AZ, USA, 12–17 February 2016; pp. 2786–2792.
14. Ranasinghe, T.; Orǎsan, C.; Mitkov, R. Semantic Textual Similarity with Siamese Neural Networks. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019), Varna, Bulgaria, 2–4 September 2019; pp. 1004–1011.
15. Miller, G.A. WordNet: A Lexical Database for English. Commun. ACM 1995, 38, 39–41.
16. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the NAACL-HLT 2019, Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186.
17. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is All you Need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008.
18. Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv 2019, arXiv:1907.11692.
19. Reimers, N.; Gurevych, I. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of the EMNLP 2019, Hong Kong, China, 3–7 November 2019; pp. 3982–3992.
20. Gao, T.; Yao, X.; Chen, D. SimCSE: Simple Contrastive Learning of Sentence Embeddings. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Punta Cana, Dominican Republic, 7–11 November 2021; pp. 6894–6910.
21. Chuang, Y.S.; Dangovski, R.; Luo, H.; Zhang, Y.; Chang, S.; Soljačić, M.; Li, S.W.; Yih, W.T.; Kim, Y.; Glass, J. DiffCSE: Difference-Based Contrastive Learning for Sentence Embeddings. In Proceedings of the NAACL 2022: Human Language Technologies, Seattle, WA, USA, 10–15 July 2022; pp. 4207–4218.
22. Zhang, D.; Xiao, W.; Zhu, H.; Ma, X.; Arnold, A.O. Virtual Augmentation Supported Contrastive Learning of Sentence Representations. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2022, Dublin, Ireland, 22–27 May 2022; pp. 2430–2440.
23. Nguyen, X.P.; Joty, S.; Hoi, S.C.; Socher, R. Tree-Structured Attention with Hierarchical Accumulation. arXiv 2020, arXiv:2002.08046.
24. Zhang, Z.; Wu, Y.; Zhou, J.; Duan, S.; Zhao, H.; Wang, R. SG-Net: Syntax-Guided Machine Reading Comprehension. Proc. AAAI Conf. Artif. Intell. 2020, 34, 9636–9643.
25. Socher, R.; Perelygin, A.; Wu, J.; Chuang, J.; Manning, C.D.; Ng, A.Y.; Potts, C. Recursive Deep Models for Semantic Compositionality over a Sentiment Treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, WA, USA, 18–21 October 2013; pp. 1631–1642.
26. Tai, K.S.; Socher, R.; Manning, C.D. Improved Semantic Representations from Tree-Structured Long Short-Term Memory Networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics, Beijing, China, 26–31 July 2015; pp. 1556–1566.
27. Li, Z.; Zhou, Q.; Li, C.; Xu, K.; Cao, Y. Improving BERT with Syntax-aware Local Attention. In Proceedings of the Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, Online, 1–6 August 2021; pp. 645–653.
28. Xu, Z.; Guo, D.; Tang, D.; Su, Q.; Shou, L.; Gong, M.; Zhong, W.; Quan, X.; Duan, N.; Jiang, D. Syntax-Enhanced Pre-trained Model. In Proceedings of the ACL 2021, Online, 1–6 August 2021; pp. 5412–5422.
29. Bai, J.; Wang, Y.; Chen, Y.; Yang, Y.; Bai, J.; Yu, J.; Tong, Y. Syntax-BERT: Improving Pre-trained Transformers with Syntax Trees. arXiv 2021, arXiv:2103.04350.
30. Wang, R.; Tang, D.; Duan, N.; Wei, Z.; Huang, X.; Cao, G.; Jiang, D.; Zhou, M. K-Adapter: Infusing Knowledge into Pre-trained Models with Adapters. In Proceedings of the Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, Online, 1–6 August 2021; pp. 1405–1418.
31. Liang, S.; Wei, W.; Mao, X.L.; Wang, F.; He, Z. BiSyn-GAT+: Bi-Syntax Aware Graph Attention Network for Aspect-based Sentiment Analysis. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2022, Dublin, Ireland, 22–27 May 2022; pp. 1835–1848.
32. Ahmad, W.U.; Peng, N.; Chang, K.W. GATE: Graph Attention Transformer Encoder for Crosslingual Relation and Event Extraction. Proc. AAAI Conf. Artif. Intell. 2021, 35, 12462–12470.
33. Devianti, R.; Miyao, Y. Transferability of Syntax-Aware Graph Neural Networks in Zero-Shot Cross-Lingual Semantic Role Labeling. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2024, Miami, FL, USA, 12–16 November 2024; pp. 20–42.
34. Zhang, P.; Chen, J.; Shen, J.; Zhai, Z.; Li, P.; Zhang, J.; Zhang, K. Message Passing on Semantic-Anchor-Graphs for Fine-grained Emotion Representation Learning and Classification. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Miami, FL, USA, 12–16 November 2024; pp. 2771–2783.
35. Xu, H.; Bao, J.; Liu, W. Double-Branch Multi-Attention based Graph Neural Network for Knowledge Graph Completion. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Toronto, ON, Canada, 9–14 July 2023; pp. 15257–15271.
36. Zhang, D.; Chen, F.; Chen, X. DualGATs: Dual Graph Attention Networks for Emotion Recognition in Conversations. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Toronto, ON, Canada, 9–14 July 2023; pp. 7395–7408.
37. Wang, X.; Ji, H.; Shi, C.; Wang, B.; Ye, Y.; Cui, P.; Yu, P.S. Heterogeneous Graph Attention Network. In Proceedings of the World Wide Web Conference, San Francisco, CA, USA, 13–17 May 2019; pp. 2022–2032.
38. Chen, H.; Hong, P.; Han, W.; Majumder, N.; Poria, S. Dialogue Relation Extraction with Document-Level Heterogeneous Graph Attention Networks. Cogn. Comput. 2023, 15, 793–802.
39. Linmei, H.; Yang, T.; Shi, C.; Ji, H.; Li, X. HGAT: Heterogeneous Graph Attention Networks for Semi-Supervised Short Text Classification. ACM Trans. Inf. Syst. (TOIS) 2021, 39, 4821–4830.
40. You, J.; Li, D.; Kamigaito, H.; Funakoshi, K.; Okumura, M. Joint Learning-based Heterogeneous Graph Attention Network for Timeline Summarization. J. Nat. Lang. Process. 2023, 30, 184–214.
41. Chen, S.; Feng, S.; Liang, S.; Zong, C.C.; Li, J.; Li, P. CACL: Community-Aware Heterogeneous Graph Contrastive Learning for Social Media Bot Detection. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2024, Bangkok, Thailand, 11–16 August 2024; Volume 30, pp. 10349–10360.
42. Ye, X.; Sun, Y.; Liu, D.; Li, T. A Multisource Data Fusion-based Heterogeneous Graph Attention Network for Competitor Prediction. ACM Trans. Knowl. Discov. Data 2024, 18, 1–20.
43. Yang, Y.; Tong, Y.; Ma, S.; Deng, Z.H. A Position Encoding Convolutional Neural Network based on Dependency Tree for Relation Classification. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, TX, USA, 1–5 November 2016; pp. 65–74.
44. Xu, Y.; Mou, L.; Li, G.; Chen, Y.; Peng, H.; Jin, Z. Classifying Relations via Long Short Term Memory Networks Along Shortest Dependency Paths. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, 17–21 September 2015; pp. 1785–1794.
45. Jiang, X.; Li, Z.; Zhang, B.; Zhang, M.; Li, S.; Si, L. Supervised Treebank Conversion: Data and Approaches. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Long Papers), Melbourne, Australia, 15–20 July 2018; pp. 2706–2716.
46. Shen, Y.; Lin, Z.; Huang, C.W.; Courville, A. Neural Language Modeling by Jointly Learning Syntax and Lexicon. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018.
47. Shen, Y.; Tan, S.; Sordoni, A.; Courville, A. Ordered Neurons: Integrating Tree Structures into Recurrent Neural Networks. In Proceedings of the International Conference on Learning Representations, New Orleans, LA, USA, 6–9 May 2019.
48. Sachan, D.S.; Zhang, Y.; Qi, P.; Hamilton, W. Do Syntax Trees Help Pre-trained Transformers Extract Information? In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics, Online, 19–23 April 2021; pp. 2647–2661.
49. Bao, X.; Jiang, X.; Wang, Z.; Zhang, Y.; Zhou, G. Opinion Tree Parsing for Aspect-based Sentiment Analysis. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2023, Toronto, ON, Canada, 9–14 July 2023; pp. 7971–7984.
50. Tang, A.; Deleger, L.; Bossy, R.; Zweigenbaum, P.; Nédellec, C. Do Syntactic Trees Enhance Bidirectional Encoder Representations from Transformers (BERT) Models for Chemical–Drug Relation Extraction? Database 2022, 2022, baac070.
51. Velickovic, P.; Cucurull, G.; Casanova, A.; Romero, A.; Lio, P.; Bengio, Y. Graph Attention Networks. arXiv 2017, arXiv:1710.10903.
52. Kullback, S.; Leibler, R.A. On Information and Sufficiency. Ann. Math. Stat. 1951, 22, 79–86.
53. Marelli, M.; Bentivogli, L.; Baroni, M.; Bernardi, R.; Menini, S.; Zamparelli, R. SemEval-2014 Task 1: Evaluation of Compositional Distributional Semantic Models on Full Sentences Through Semantic Relatedness and Textual Entailment. In Proceedings of the 8th International Workshop on Semantic Evaluation, Dublin, Ireland, 23–24 August 2014.
54. Agirre, E.; Cer, D.; Diab, M.; Gonzalez-Agirre, A. SemEval-2012 Task 6: A Pilot on Semantic Textual Similarity. In Proceedings of the First Joint Conference on Lexical and Computational Semantics, Montreal, QC, Canada, 7–8 June 2012; pp. 385–393.
55. Socher, R.; Karpathy, A.; Le, Q.V.; Manning, C.D.; Ng, A.Y. Grounded Compositional Semantics for Finding and Describing Images with Sentences. Trans. Assoc. Comput. Linguist. 2014, 2, 207–218.
56. Shao, Y. HCTI at SemEval-2017 Task 1: Use Convolutional Neural Network to Evaluate Semantic Textual Similarity. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), Vancouver, BC, Canada, 3–4 August 2017; pp. 130–133.
57. Yang, Y.; Yuan, S.; Cer, D.; Kong, S.Y.; Constant, N.; Pilar, P.; Ge, H.; Sung, Y.H.; Strope, B.; Kurzweil, R. Learning Semantic Textual Similarity from Conversations. In Proceedings of the 3rd Workshop on Representation Learning for NLP, Melbourne, Australia, 20 July 2018; pp. 164–174.
58. Maillard, J.; Clark, S.; Yogatama, D. Jointly Learning Sentence Embeddings and Syntax with Unsupervised Tree-LSTMs. Nat. Lang. Eng. 2019, 25, 433–449.
59. Choi, J.; Yoo, K.M.; Lee, S.G. Learning to Compose Task-Specific Tree Structures. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence: Thirtieth Innovative Applications of Artificial Intelligence Conference and Eighth Symposium on Educational Advances in Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 7.

Figure 1. Illustration of the overall process.

Table 1. Parameter settings.

Parameter	Value
Word Embedding Dimension	300
Phrase Embedding Dimension	300
Sentence Embedding Dimension	300
Multi-head Attention Number	10
Learning Rate	0.0003
Batch Size	32
Optimizer	Adam
Learning Rate Decay	0.9
Dropout	0.5

Table 2. Test set results on SICK. ∖ indicates that the model was not tested using this evaluation criterion. The asterisk (*) indicates the best result.

Methods	Pearson × 100	Spearman × 100	MSE
Without syntactic structure
Mean vectors	76.24	70.31	0.4321
LSTM	84.32	78.27	0.2866
BiLSTM	85.68	79.45	0.2756
BERT-Sent2vec [11]	81.43	77.35	0.2886
CNN [12]	86.86	80.47	0.2606
BERT [16]	∖	42.63	∖
SRoBERTa [19]	∖	74.46	∖
SBERT [19]	∖	72.91	∖
SimCSE-RoBERTa base [20]	∖	80.50	∖
SimCSE-RoBERTa large [20]	∖	81.95	∖
With syntactic structure
DT-RNN [55]	79.23	73.19	0.3822
SDT-RNN [55]	79.00	73.04	0.3848
Const.LSTM [26]	85.82	79.66	0.2734
Dep. LSTM [26]	86.76	80.83	0.2532
HGAT_W-S	86.31	79.51	0.2712
HGAT_CP	86.92	80.43	0.2582
HGAT_CP(NP, VP)	87.73 *	82.34 *	0.2343 *

Table 3. Test set results on the STS benchmark. The asterisk (*) indicates the best result.

Models	Pearson × 100
Without syntactic structure
Mean vectors	62.31
LSTM	72.36
BiLSTM	73.82
HCTI [56]	78.40
Reddit tuned [57]	78.10
BERT [16]	84.30
SRoBERTa [19]	84.92
SBERT [19]	84.67
SimCSE-RoBERTa base [20]	85.83
SimCSE-RoBERTa large [20]	86.70 *
With syntactic structure
Dependency Tree-LSTM	71.20
Constituency Tree-LSTM	71.90
HGAT_W-S	79.31
HGAT_CP	80.23
HGAT_CP(NP, VP)	81.35

Table 4. Test set results on the SNLI. The asterisk (*) indicates the best result.

Models	Accuracy × 100
100D Latent Syntax Tree-LSTM [58]	80.5
100D CYK Tree-LSTM [58]	81.6
600D Gumbel Tree-LSTM [59]	86.0
300D Gumbel Tree-LSTM [59]	85.6
BERT_base [16]	90.9
BERT_large [16]	91.0 *
HGAT_W-S	86.3
HGAT_CP	89.6
HGAT_CP(NP, VP)	90.5

Table 5. Comparison of the results of sentences of different lengths. The asterisk (*) indicates the best result.

Method	SICK, Length > 13 (Pearson × 100)	SICK, Length < 6 (Pearson × 100)	STS Benchmark, Length > 13 (Pearson × 100)	STS Benchmark, Length < 6 (Pearson × 100)
Mean vectors	62.75	82.37	48.26	72.37
LSTM	75.74	87.82	60.74	85.47
BiLSTM	77.38	87.96	62.13	86.18
HGAT_W-S	77.72	88.36	61.94	86.81
HGAT_CP	79.53	89.33	64.87	88.13
HGAT_CP(NP, VP)	80.07 *	89.46 *	65.61 *	88.62 *

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
