Modeling the Paraphrase Detection Task over a Heterogeneous Graph Network with Data Augmentation

Paraphrase detection is a Natural-Language Processing (NLP) task that aims at automatically identifying whether two sentences convey the same meaning (even with different words). For the Portuguese language, most of the existing work models this task as a machine-learning solution, extracting features and training a classifier. In this paper, following a different line, we explore a graph structure representation and model the paraphrase identification task over a heterogeneous network. We also adopt a back-translation strategy for data augmentation to balance the dataset we use. Our approach, although simple, outperforms the best results reported for the paraphrase detection task in Portuguese, showing that graph structures may better capture the semantic relatedness among sentences.


Introduction
Paraphrase detection is a Natural-Language Processing (NLP) task that aims to automatically identify whether two sentences convey the same meaning. Bhagat and Hovy [1] define paraphrase as sentences or phrases that convey the same meaning using different wording. Moreover, these sentences represent alternative surface forms in the same language, expressing the same semantic content of the original forms [2].
Formally, a paraphrase may be modeled as a mutual (or bidirectional) entailment between a text T and a hypothesis H, in the form T → H and H → T, meaning that T entails H and H entails T. For example, given a text T and a hypothesis H below, one may claim that they are paraphrases of each other, since T → H and H → T.
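This bidirectional definition can be sketched in code. The `entails` predicate below is a hypothetical placeholder (a real system would use a trained entailment model); naive token containment stands in for it only to make the definition executable:

```python
def entails(text: str, hypothesis: str) -> bool:
    # Placeholder entailment check: a real system would use a trained
    # entailment model; naive token containment is used here only to
    # make the bidirectional definition executable.
    return set(hypothesis.lower().split()) <= set(text.lower().split())

def is_paraphrase(t: str, h: str) -> bool:
    # T and H are paraphrases iff T -> H and H -> T (mutual entailment).
    return entails(t, h) and entails(h, t)
```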
To try to overcome such barriers, researchers developed and used the ASSIN corpus [9], which is focused on the textual entailment recognition task, but includes paraphrase examples. Formally, entailment recognition is the task of deciding whether the meaning of one text may be inferred from another [9].
The existing works that aim to detect paraphrase sentences in Portuguese [3,10] model this task as a machine-learning solution, building feature-value tables and training and testing classifiers. For a new sentence pair, features are computed and fed into the classifier to predict whether the two sentences are paraphrases of each other. These approaches may suffer from two drawbacks. First, the features may not capture well the semantics of the sentence pairs, producing unsatisfactory results. Second, the authors apply sampling techniques to mitigate the imbalance of the ASSIN corpus, aiming to get more balanced data and improve the results of their models. These under- or over-sampling techniques have shortcomings of their own: over-sampling the minority class can lead to model overfitting, since it introduces duplicate instances from a pool of instances that is already small [11], while under-sampling the majority class can leave out instances that capture important differences between the two classes [12]. Strategies that rely on synthetic data, in turn, are often criticized for the quality of the generated data.
To fill these gaps and explore other approaches for paraphrase detection, in this paper, inspired by Sousa et al. [13], we model the paraphrase detection task over a heterogeneous network. In this network, nodes represent tokens and sentence pairs, and edges link the two node types. Networks/graphs have proven to be a powerful data structure that may capture well the relationships among the objects of interest [14].
Based on the network, we extract features and train a classifier to predict whether two sentences are paraphrases of each other. To evaluate our method, we use the ASSIN corpus. However, instead of applying a sampling technique to balance it, we adopt a back-translation strategy [15] for data augmentation. This strategy keeps the original sentence pairs from the ASSIN corpus and adds real sentences from another corpus, translated with good quality. Our proposed method outperforms the best reported results in both F-score and accuracy. Furthermore, the back-translation strategy helps to produce better models.
The remainder of this paper is organized as follows. Section 2 briefly presents the related work. In Section 3, we describe the corpora we use. Section 4 details our methodology to balance the ASSIN corpus and to model the problem. Section 5 presents the conducted experiments and the obtained results. Finally, in Section 6, we conclude the paper and give directions for future work.

Related Work
As pointed out by Anchiêta and Pardo [3], few approaches strictly tackle the paraphrase detection task for the Portuguese language. Most of the research is on entailment identification, which, according to Souza and Sanches [10], is different from paraphrase detection. Thus, following Souza and Sanches [10], we focus on the paraphrase detection task.
Consoli et al. [16] analyzed the capabilities of the coreference resolution tool CORP [17] for identification of paraphrases. The authors used CORP to identify noun phrases that may help to detect paraphrases between sentence pairs. They evaluated their method on 116 sentence pairs from the ASSIN corpus, achieving 0.53 F-score.
Rocha and Cardoso [18] modeled the task as a supervised machine-learning problem. However, they handled it as a multi-class task, classifying sentence pairs into entailment, none, or paraphrase. They employed a set of lexical, syntactic, and semantic features to represent the sentences as numerical values and fed these features into several machine-learning algorithms. They evaluated their method on the training set of the ASSIN corpus, using both the European and Brazilian Portuguese partitions, and obtained 0.52 F-score with an SVM classifier.
Souza and Sanches [10] also dealt with the problem using a supervised machine-learning strategy; however, their objective was to explore sentence embeddings for this task. They used a pre-trained FastText model [19] and the following features: the average of the word vectors, the Smooth Inverse Frequency (SIF) value [20], and a weighted aggregation based on Inverse Document Frequency (IDF). With these features, their method reached 0.33 F-score using an SVM classifier on balanced data of the ASSIN corpus for the European and Brazilian Portuguese partitions.
Cordeiro et al. [21] developed a metric named SUMO-METRIC for semantic relatedness between two sentences based on the overlapping of lexical units. Although the authors evaluated their metric on a corpus for the English language, the metric is language-independent.
Anchiêta and Pardo [3] explored the potentialities of four semantic features to identify paraphrase sentences. They computed the similarity of two encoded sentences as a graph using a semantic parser [22] and a semantic metric [23], the value of Smooth Inverse Frequency (SIF) [20], the cosine distance between two embedded sentences, and the value of the Word Mover's Distance (WMD) [24] between two embedded sentences. From these features, they trained an SVM classifier and obtained 0.80 F-score on the balanced ASSIN corpus.
For the English language, according to Mohamed and Oussalah [25], most of the research falls into three high-level categories, namely: corpus-based, knowledge-based, and hybrid methods. Here, in order to have a panoramic view of the achieved contributions for this language and to allow (indirect) comparisons to the Portuguese state of the art, we briefly present the best results in the literature.
Mohamed and Oussalah [25] adopted a hybrid method addressing the problem of evaluating sentence-to-sentence semantic similarity when the sentences contain a set of named entities. The Microsoft Research Paraphrase (MSRP) corpus [28] is a widely used corpus for the paraphrase detection task in English. Its sentence pairs are annotated in a binary fashion, i.e., as paraphrases or non-paraphrases. The corpus has 5801 sentence pairs, of which 3900 are paraphrases and 1901 are non-paraphrases, as shown in Table 3. As we can see, the MSRP corpus is unbalanced, with far fewer non-paraphrase pairs. In what follows, we detail the developed methods for paraphrase identification and our strategy to mitigate the imbalance of the ASSIN corpus.

Balancing the ASSIN Corpus
To formulate the paraphrase identification task as a binary classification problem, we modified the ASSIN corpus, since it has three labels. We joined the entailment and none labels of the corpus into one unique label named "non-paraphrase", which is our negative class, as presented in Table 4. As we can see, the new configuration of the corpus is unbalanced, with far more non-paraphrase pairs. Aiming to balance the corpus, we used the MSRP corpus. For that, we adopted a back-translation strategy [15] to translate the sentences of the MSRP corpus from English to Portuguese, as illustrated in Figure 1. According to this figure, we translated the original sentences from the MSRP corpus to Portuguese using the machine translation model provided by the Google Translate API (https://cloud.google.com/translate/). We adopted this model due to the good results achieved by other researchers [34,35] when using it. Next, we translated the Portuguese sentences back to English, because there are no Portuguese reference sentences to evaluate the quality of the translations. This way, we may measure the quality of the translations by comparing the original sentences (reference sentences) with the back-translated sentences (hypothesis sentences). To compute the quality of the translations, we calculated the harmonic mean between the ROUGE [36] (we used the F-score of ROUGE-L) and BLEU [37] metrics, as in Equation (1).
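The harmonic-mean quality score of Equation (1) can be sketched as below; computing the underlying ROUGE-L F-score and BLEU values themselves (e.g., with dedicated libraries) is assumed to happen elsewhere:

```python
def translation_quality(rouge_l_f: float, bleu: float) -> float:
    # Harmonic mean between the ROUGE-L F-score and the BLEU score of a
    # back-translated sentence against its reference (Equation (1)).
    if rouge_l_f + bleu == 0:
        return 0.0
    return 2 * rouge_l_f * bleu / (rouge_l_f + bleu)
```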
We achieved 0.844 mean value (with a 0.08 standard deviation) when comparing the reference sentences to the hypothesis sentences. Taking into account the state-of-the-art results in machine translation [38], which achieves 35 points on the BLEU score, we may consider that the translated sentences from English to Portuguese have a good quality due to reached results both in the mean and the standard deviation. Thus, we used the translated sentences of the MSRP corpus to make the ASSIN corpus less unbalanced. Table 5 presents the ASSIN corpus plus the translated sentences of the MSRP corpus.
To create this less unbalanced corpus, we took the 2753 paraphrastic sentence pairs from the training set of the MSRP corpus and put 2000 into the training set and 753 into the development set of the ASSIN corpus. As a consequence, the training set has 2295 paraphrastic pairs and the development set has 823. Moreover, we also took the 1147 paraphrastic sentence pairs from the testing set of the MSRP corpus and put them into the testing set of the ASSIN corpus, which thus has 1386 paraphrastic pairs. After these procedures, we obtained a corpus in Portuguese with proportions similar to those of the MSRP corpus.
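As a sanity check, the split sizes above follow from adding the MSRP paraphrastic pairs to the original ASSIN paraphrase counts per split; the original counts below (295/70/239) are inferred from the totals reported in the text:

```python
# Original ASSIN paraphrase counts per split (inferred from the reported
# totals) plus the MSRP paraphrastic pairs added to each split.
assin_paraphrases = {"train": 295, "dev": 70, "test": 239}
msrp_added = {"train": 2000, "dev": 753, "test": 1147}

augmented = {split: assin_paraphrases[split] + msrp_added[split]
             for split in assin_paraphrases}
# augmented == {"train": 2295, "dev": 823, "test": 1386}
```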

Modeling the Paraphrase Identification Task
To extract features, we first model the paraphrase identification task over a heterogeneous graph network. This network contains abundant information, with structural relations (edges) among multi-typed nodes as well as unstructured content associated with each node [39,40]. It has been used to automate feature engineering, facilitating machine-learning tasks and providing good results for tasks such as helpfulness prediction, text classification, and scientific impact measurement, among others [13,41-43].
We borrowed the formulation of a heterogeneous network from Chang et al. [42] and adapted it to our purpose. Our network may be viewed as a graph G = (V, E), where V = {v_1, ..., v_n} is a set of vertices and E is a set of edges; an edge e_{i,j}, with i, j ∈ {1, ..., n}, belongs to E if and only if an undirected and unweighted link exists between nodes i and j. Moreover, the graph G is associated with an object type mapping function f_v : V → O, where O represents the object sets, and each node v_i ∈ V belongs to one particular object type.

In our network, we defined two node types and two constraints. The node types are tokens and sentence pairs, and the constraints are: (i) there is no link among token nodes; and (ii) there is no link among sentence pair nodes. Thus, we link only token nodes with sentence pair nodes. We present the general scheme and an overview of the network in Figures 2 and 3, respectively. As one can see, the edges are undirected and unweighted, and a sentence pair node may share several token nodes whenever a token occurs in the sentence pair, i.e., the edges between token nodes and sentence pair nodes are based on word occurrence in sentence pairs.

To extract the features regarding the network object classes, we applied a regularization method. Regularization is a kind of transductive classification that aims to find a set of labels minimizing a cost function while satisfying two conditions: (i) the labels need to be consistent with the set of labels manually annotated; and (ii) the labels need to be consistent with the network topology, considering that nearest neighbors tend to have the same labels [14].
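The network construction can be sketched as follows, using plain Python dictionaries as the adjacency structure (whitespace tokenization is a simplification for illustration):

```python
from collections import defaultdict

def build_network(sentence_pairs):
    # Heterogeneous network: token nodes and sentence-pair nodes, with
    # undirected, unweighted edges linking a pair node to every token
    # occurring in it. No token-token or pair-pair edges are created.
    edges = defaultdict(set)
    for pair_id, (s1, s2) in enumerate(sentence_pairs):
        pair_node = ("pair", pair_id)
        for token in set((s1 + " " + s2).lower().split()):
            token_node = ("token", token)
            edges[pair_node].add(token_node)
            edges[token_node].add(pair_node)
    return edges
```

A token shared by several sentence pairs becomes a hub connecting them, which is precisely what the regularization step later exploits.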

We tested three regularization methods: Gaussian Fields and Harmonic Function (GFHF) [44], Learning with Local and Global Consistency (LLGC) [45], and GnetMine [14]. The regularization algorithms require some nodes to be pre-labeled with their specific classes. Thus, one of the differences between these methods is whether they modify the pre-labeled nodes. For example, the GFHF method does not modify the pre-labeled nodes, whereas the LLGC and GnetMine methods do. Besides, the GnetMine method works only on heterogeneous networks, while the GFHF and LLGC methods work on both heterogeneous and homogeneous networks.
As a result, the regularization methods produce values corresponding to coordinates for each object in the network, and these values may be used to feed a supervised machine-learning algorithm to learn and predict labels [46]. Table 6 presents an example of the output of a regularization method, where id is the identifier of the object, the values refer to the coordinates of each object in the network, and label 1 indicates a paraphrase while label 0 indicates a non-paraphrase.
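The overall behavior of such a regularizer can be illustrated with a minimal GFHF-style sketch: pre-labeled nodes are clamped, and every other node repeatedly takes the mean score of its neighbors. The actual methods optimize equivalent cost functions in matrix form; this loop is an illustration only:

```python
def propagate_labels(edges, seed_labels, iters=100):
    # edges: dict mapping each node to its neighbors;
    # seed_labels: pre-labeled nodes (1.0 = paraphrase, 0.0 = non-paraphrase).
    scores = {node: seed_labels.get(node, 0.5) for node in edges}
    for _ in range(iters):
        for node in edges:
            if node in seed_labels:
                continue  # GFHF never alters pre-labeled nodes
            neighbors = edges[node]
            if neighbors:
                scores[node] = sum(scores[v] for v in neighbors) / len(neighbors)
    return scores
```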

Formulating the Paraphrase Identification Task
As mentioned before, we formulated the paraphrase identification task as a binary classification problem. To train the machine-learning algorithms, our method receives as input the features extracted by a regularization method, in the form of pairs (x^(i), y^(i)), where x^(i) is the feature vector of the i-th sentence pair and y^(i) is its label. In summary, the aim is to learn a classifier c that, given unseen sentence pairs, i.e., a set of sentences S, classifies whether they are paraphrases, as in Equation (2).
We tested four machine-learning algorithms, namely: Support Vector Machine (SVM), Naïve Bayes (NB), Decision Tree (DT), and Neural Network (NN). In what follows, we detail our experiments and the obtained results.

Experiments and Results
To evaluate our approach, we used the balanced ASSIN corpus with the translated sentences of the MSRP corpus, as depicted in Table 5. Moreover, as commented before, we tested some classifiers from the Scikit-Learn library [47], namely Support Vector Machine (SVM), Naïve Bayes (NB), Decision Tree (DT), and Neural Network (NN), and we evaluated three regularization methods. Recall that the regularization methods require some nodes to be pre-labeled, so we varied the percentage of pre-labeled nodes from 5% to 50%. The regularizers randomly pre-labeled the nodes; supposing that the percentage of pre-labeled nodes is equal to 5%, it means that 0.25% of each class is randomly pre-labeled.
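This comparison step can be sketched with Scikit-Learn as below. The feature vectors are synthetic stand-ins for the coordinates a regularizer would produce, and the MLP configuration follows the two-hidden-layer, 20-neuron setup reported for the best model:

```python
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier

# Synthetic stand-ins for regularizer coordinates (illustration only).
X_train = [[0.1, 0.9], [0.2, 0.8], [0.9, 0.1], [0.8, 0.2]]
y_train = [1, 1, 0, 0]  # 1 = paraphrase, 0 = non-paraphrase

classifiers = {
    "SVM": SVC(),
    "NB": GaussianNB(),
    "DT": DecisionTreeClassifier(),
    # Two hidden layers with 20 neurons each, as in the best reported MLP.
    "NN": MLPClassifier(hidden_layer_sizes=(20, 20), max_iter=2000),
}
predictions = {name: clf.fit(X_train, y_train).predict([[0.15, 0.85]])[0]
               for name, clf in classifiers.items()}
```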
We achieved the best result with the LLGC regularizer, the NN classifier (We used a Multi-Layer Perceptron (MLP) with 2 hidden layers and 20 neurons in each hidden layer.), and 30% of the pre-labeled nodes on the balanced ASSIN corpus, as depicted in Table 7. It is important to highlight that only the training set is pre-labeled. The regularizer does not have access to labels of the testing set.
As we can see, from 30% of pre-labeled nodes onwards, both F-score and accuracy remain constant. We believe that the LLGC regularizer achieved the best results due to two properties. First, it allows the pre-labeled nodes to be altered, which helps to correct labeling errors, improving the classification. Second, the algorithm decreases the excessive influence of high-degree objects when defining the class information of their neighborhoods, allowing nodes to receive a label different from that of their neighbors.

We compared our best result with the works of Anchiêta and Pardo [3] and Souza and Sanches [10], since they also deal with the paraphrase detection task for Portuguese. Furthermore, we also compared our method with another graph-based method [48]. This method is a Graph Convolutional Network that contains word nodes and document nodes; the number of nodes in the graph is the number of documents plus the number of unique words in the corpus. The edges among the nodes are based on word occurrence in documents and word co-occurrence in the whole corpus. Moreover, the weight of the edge between a document node and a word node is the Term Frequency-Inverse Document Frequency (TF-IDF), and the weight between two word nodes is the Pointwise Mutual Information (PMI) [49] value. Equation (3) summarizes the approach to weight an edge between nodes.
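The edge-weighting scheme of Equation (3) can be sketched as follows; the argument names are illustrative, and in practice the co-occurrence counts come from sliding windows over the corpus:

```python
import math

def pmi(count_ij, count_i, count_j, total_windows):
    # Pointwise Mutual Information between words i and j, estimated from
    # sliding-window co-occurrence counts.
    p_ij = count_ij / total_windows
    p_i = count_i / total_windows
    p_j = count_j / total_windows
    return math.log(p_ij / (p_i * p_j))

def tf_idf(tf, n_docs, df):
    # TF-IDF weight for a document-word edge.
    return tf * math.log(n_docs / df)

def edge_weight(kind, **kw):
    # Equation (3): PMI for word-word edges (kept only when positive),
    # TF-IDF for document-word edges, 1 for self-loops, 0 otherwise.
    if kind == "word-word":
        w = pmi(**kw)
        return w if w > 0 else 0.0
    if kind == "doc-word":
        return tf_idf(**kw)
    if kind == "self-loop":
        return 1.0
    return 0.0
```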
In Table 8, we present the results of the comparison between the models; as we can see, our strategy outperformed the other methods, achieving better results with only 30% of the data pre-labeled. It is important to say that we trained and evaluated these models on the balanced ASSIN corpus. We further assessed whether the models trained on the balanced ASSIN corpus improve their results when evaluated on the ASSIN corpus without balancing, i.e., we are interested in checking whether the data-augmentation strategy used to balance the ASSIN corpus contributes to improving the results when the models are tested on the original ASSIN corpus. The results of this investigation are shown in Table 9.
One can see that all the models improved their results when trained on the ASSIN corpus with data augmentation, showing that the back-translation strategy is feasible for producing better models for the paraphrase detection task. For this experiment, the method of Anchiêta and Pardo [3] reached the best results, on average. Also, the graph-based methods performed poorly, having difficulty correctly predicting a label with very few instances: the ASSIN corpus has 239 paraphrase instances and 3761 non-paraphrase instances in the test set, as depicted in Table 4. To alleviate the difficulty of graph-based models in predicting a label with very few instances, one may use boosting strategies, such as RUSBoost [50]. Investigating other approaches to tackle this subtlety remains for future work. The balanced corpus and our graph-based model are available at https://github.com/RogerFig/graph-paraphrase.

Final Remarks
In this paper, we presented a graph-based approach to model the paraphrase detection task. We defined a heterogeneous network with two semantic node types: sentence pairs and tokens. We created undirected and unweighted links between the token and sentence pair node types. Beyond the method, we detailed a data-augmentation strategy using back-translation to balance the dataset. Our approach outperformed the best results reported for paraphrase detection in Portuguese on the balanced ASSIN corpus, both in accuracy and F-score.
The defined network is flexible and may be adapted to include other node types, such as embeddings of the sentences or tokens. In addition, because of this flexibility, other network topologies may be explored, such as creating weighted links between sentence pair nodes. Furthermore, heterogeneous networks may be applied to a broad range of other NLP tasks, such as summarization, dependency parsing, sentiment classification, and automatic essay scoring, among others. Another interesting future work is to verify which paraphrase types the systems detect. For that, one could follow the work of Kovatchev et al. [51] to annotate the paraphrase types that occur in the corpus. With this additional annotation layer, it may be possible to perform a qualitative evaluation, helping to explain which paraphrase types the models identify.