A novel hybrid methodology of measuring sentence similarity

The problem of measuring sentence similarity is an essential issue in natural language processing (NLP), and measuring it accurately is necessary for many downstream tasks. Deep learning methods show state-of-the-art performance in many NLP fields and are widely used in sentence similarity measurement. However, considering the structure of a sentence, and the structure of the words that make it up, is also important. In this study, we propose a methodology that combines a deep learning method with a method that considers lexical relationships. Using the Pearson and Spearman correlation coefficients as evaluation metrics, the proposed method outperforms current approaches on KorSTS, a standard Korean benchmark dataset, with a maximum improvement of 65% over using the deep learning methodology alone. Experiments show that our proposed method generally performs better than models that use deep learning only.


Introduction
As natural language data such as social network posts and news articles pour out, natural language processing has received tremendous attention [1]. Measuring the similarity between natural language sentences is an important problem within natural language processing [2]. For example, many applications such as chat-bot systems, plagiarism checking systems, and automatic classification systems depend on sentence similarity. Accurately measuring the similarity between two sentences is therefore a crucial task.
Research on measuring sentence similarity has been conducted from various perspectives [2][3][4][5][6]. There are many ways to measure sentence similarity, including deep learning approaches and sentence structure approaches.
In a study on sentence similarity measurement using deep learning, Mueller et al. [6] proposed a method that extracts features containing the entire information of a sentence via long short-term memory (LSTM). Heo et al. [7] proposed a method that measures sentence similarity using a combination of global features, the entire information of sentences extracted via bidirectional LSTM (Bi-LSTM), and local features, the detailed information of sentences extracted through capsule networks.
In the natural language processing field, it is necessary to focus not only on deep learning models but also on the structure of the sentence and the lexical relationships within the sentence. Miller et al. [4] proposed a method that measures similarity between two sentences based on the relationships between the words within them, using WordNet, a lexical knowledge base. In contrast, Wang et al. [5] proposed a method that decomposes and recomposes the words of the sentences.
In this paper, we aim to improve the performance of methods for measuring similarity between Korean sentences by combining a deep learning methodology with a methodology that considers lexical relationships. We use several deep learning models, such as convolutional neural networks (CNN), recurrent neural networks (RNN), and bidirectional encoder representations from transformers (BERT), to measure sentence similarity. We also apply cosine similarity to embedding vectors obtained from the language representation model. Finally, we calculate the final sentence similarity by combining the similarity value calculated by the deep learning model with the value obtained from cosine similarity. Experiments show that our proposed method performs better than methods using a deep learning model alone. This paper is structured as follows. Section 2 reviews related work. Section 3 explains the proposed approach and details its main components. Section 4 describes the experiments, and Section 5 concludes.

Related Work
Many approaches have been proposed to address the problem of measuring the similarity between sentences [9]. Research on this problem has been conducted for a long time and from various perspectives. The main families of approaches calculate similarity using sentence structure, lexical relationships, or deep learning.
The method of measuring the similarity between two sentences using the structure of sentences has been widely used from the early days of natural language processing to the present era of deep learning. Since many researchers have studied it for a long time, many ideas based on sentence structure have been proposed. Lee et al. [10] proposed a similarity measure based on sentence structure grammar. Lee et al. [11] proposed a similarity measure based on part-of-speech tags. Ferreira et al. [2] proposed a similarity measure based on word order and sentence structure. Li et al. [3] measured the similarity of two sentences using statistics on sentence structure.
Measuring similarity between two sentences by considering lexical relationships is another family of similarity measures. Miller et al. [4] proposed a method for measuring similarity between two sentences using WordNet, a lexical knowledge base. Wang et al. [5] presented a method that calculates similarity from the overlapping and differing word components of the two sentences. Abdalgader et al. [12] proposed a method that measures sentence similarity using word sense disambiguation and synonym expansion.
Deep learning has recently developed significantly with advances in hardware and the opening of the Big Data era [13]. Sentence similarity studies using deep learning have shown good performance with various neural networks such as LSTM, gated recurrent units (GRU), CNN, and BERT [6-8, 14, 15].
Mueller et al. [6] used LSTM, which performs well on sequential data. They evaluated sentence similarity by applying the last hidden states extracted via LSTM to the Manhattan distance. Pontes et al. [14] combined CNN and LSTM: they extract combined information from adjacent words through CNN and apply the last hidden states extracted via LSTM to the Manhattan distance. Li et al. [15] used Group CNN (G-CNN), which extracts representative local features, together with bidirectional GRU (Bi-GRU), which performs well on sequential data, and applied the last hidden states extracted via Bi-GRU to the Manhattan distance. Heo et al. [7] sequentially used Bi-LSTM, self-attention reflecting contextual information, and capsule networks with a CNN structure, then combined the last hidden states extracted via Bi-LSTM with the local features extracted via the capsule networks. Devlin et al. [8] evaluated the similarity of two sentences using BERT, a language representation model that shows excellent performance in various natural language processing fields.
Unlike previous approaches, we propose a novel method that combines deep learning with a method that considers lexical relationships to measure the similarity between two sentences.

Sentence Similarity
Measuring the similarity between sentences accurately is an important task [2]. To measure the similarity between sentences, we combine a deep learning methodology with a method that considers lexical relationships.

Similarity based on Deep Learning Model
The models used in this study are LSTM, GRU, CNN, G-CNN, capsule networks, and BERT. LSTM and GRU belong to the RNN family, and G-CNN and capsule networks are neural networks built on CNN. The outputs of CNN and G-CNN are used as inputs to the RNN models. The sentence similarity using RNNs or capsule networks is calculated by applying the Manhattan distance, as in Fig. 1 (a), and the sentence similarity using BERT is calculated through a special token, as in Fig. 1 (b).

Sentence similarity using Word Embedding
The RNN is a neural network that shows good performance when processing sequential data [16]. When calculating the representation of each time step in text processing, how much of the context information up to that point is reflected is determined through learning. However, in the case of RNN, if the sequence length increases, gradient vanishing or gradient exploding problems may occur [17]. To solve this, LSTM and GRU were devised [17,18]. When calculating the similarity of two sentences, each sentence is input into an RNN-family model to obtain the last hidden states (h_s1, h_s2) of each sentence [6-7, 14-15]. The last hidden state includes the entire sentence information. The Manhattan-distance similarity used in the RNN-family models is as follows:

Sim_D = exp(−‖h_s1 − h_s2‖_1) (1)
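As a concrete illustration, this Manhattan-distance similarity can be sketched in a few lines of NumPy. The exp(−d) mapping follows the form popularized by Mueller et al. [6]; the function name and inputs (stand-ins for the two last hidden states) are illustrative, not the authors' exact code.

```python
import numpy as np

def manhattan_similarity(h1, h2):
    """Similarity from the Manhattan (L1) distance between two
    last-hidden-state vectors, mapped into (0, 1] via exp(-d)."""
    d = np.sum(np.abs(np.asarray(h1, float) - np.asarray(h2, float)))
    return float(np.exp(-d))
```

Identical hidden states give a similarity of exactly 1; any distance pushes the score toward 0 while keeping it positive, which matches the 0-1 similarity range used later in equation (10).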
The CNN shows good performance in image and text processing [7]. In text processing, CNN extracts local features, which combine the information of adjacent words, by sliding a filter over groups of words:

c_i = f(W · x_{i:i+h−1} + b) (2)

In equation (2), i refers to the word index and f(·) refers to the activation function. W refers to a learnable weight of the CNN with size ℝ^{h×d}, x_i refers to a word embedding vector with size ℝ^d, and b refers to a bias vector. Through equation (2), a feature map with size ℝ^{n−h+1} is generated, where n refers to the number of words. Finally, feature maps with size ℝ^{(n−h+1)×k} are generated by the k filters. The G-CNN uses three CNNs with different kernel sizes in parallel and obtains representative semantic information [15]. G-CNN integrates the feature maps extracted from the three CNNs into one feature and then creates the most representative feature map by applying max pooling:

g = maxpool([F_1; F_2; F_3]) (3)
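Equation (2) can be sketched as a plain NumPy loop over word windows; `conv_feature_maps` and the tanh activation are illustrative choices under the shapes stated above, not the authors' exact implementation.

```python
import numpy as np

def conv_feature_maps(X, W, b):
    """1-D convolution over a sentence matrix X (n words x d dims).
    W holds k filters of kernel size h (shape k x h x d); the output
    is a feature map of shape (n - h + 1) x k, as in equation (2)."""
    n, d = X.shape
    k, h, _ = W.shape
    out = np.zeros((n - h + 1, k))
    for i in range(n - h + 1):
        window = X[i:i + h]  # h x d slice of adjacent word vectors
        # contract each filter against the window, add bias, activate
        out[i] = np.tanh(np.tensordot(W, window, axes=2) + b)
    return out
```

With n = 5 words, kernel size h = 2, and k = 2 filters, the output has shape (4, 2), matching the ℝ^{(n−h+1)×k} size given above.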
The capsule networks use two convolutional layers, Conv1 and Conv2, and are used to extract subdivided information in the field of sentence similarity [7]. Conv1 has a typical CNN form, and Conv2 receives the feature map of Conv1 as its input. After that, to extract subdivided information, a kernel size corresponding to the overall size of the input is used, and a feature map with size ℝ^{C×k} is generated by dividing the feature map into C dimensions. The Manhattan-distance similarity used with the capsule networks has the same form as equation (1), applied to the capsule features of the two sentences:

Sim_D = exp(−‖v_s1 − v_s2‖_1) (4)

Sentence similarity using BERT Embedding
BERT is a language representation model made by stacking several transformer encoder blocks [8]. The learning process of BERT is divided into a pre-training process and a fine-tuning process. In the pre-training process, after randomly masking word tokens in sentences of a large corpus, the BERT model learns by predicting the masked word tokens. Fine-tuning is a process of training a pre-trained BERT model once more on labeled data. We train BERT on the sentence similarity task in the fine-tuning process. BERT is divided into BERT-base and BERT-large models according to model size; in this work, we use the BERT-base model, which consists of 12 layers of transformer encoder blocks. There are several special tokens in the BERT model. The [CLS] token is placed at the beginning as a token indicating the start of the input. The [SEP] token is a token that separates sentences. In the case of the sentence similarity task, the input is fed to the BERT model in the form "[CLS] Sentence1 [SEP] Sentence2 [SEP]". The similarity between sentences is calculated by inputting the output vector of the [CLS] token extracted via the BERT model into a dense layer, as shown in equation (5).

Sim_D = σ(W h_[CLS] + b) (5)

The h_[CLS] is the vector representation of the [CLS] token extracted through the BERT model and has a size of ℝ^{768}. W refers to learnable weights with a size of ℝ^{1×768}, and b refers to a bias vector.
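Equation (5) amounts to a one-layer logistic head on the [CLS] vector; a minimal NumPy sketch with the weight shapes stated above (the function name is illustrative):

```python
import numpy as np

def cls_similarity(h_cls, W, b):
    """Equation (5): pass the [CLS] vector (768-dim for BERT-base)
    through a dense layer and a sigmoid to get a 0-1 similarity."""
    z = (W @ h_cls + b).item()  # W: 1 x 768, h_cls: 768, b: 1
    return 1.0 / (1.0 + np.exp(-z))
```

The sigmoid keeps the score in the 0-1 range used by the normalized KorSTS labels, so the dense layer can be trained with a standard regression loss.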

Similarity based on lexical relationship
To measure structural similarity, the proposed approach calculates the word-to-word similarity between sentences. Using a language representation model to measure the similarity between words improves similarity measurement between sentences [19].
In the word embedding model, to capture the lexical relationships that carry semantic information, we use the embedding vectors of words (w_i). The word embedding used in this study is Word2Vec, which learns by maximizing the dot products between the target word vector and the vectors of the neighboring words surrounding it [20,21].
In BERT, to capture the lexical relationships that carry contextual information, we use the hidden states of the 12 transformer blocks that compose the BERT-base model. To calculate the word-to-word similarity, we exclude the hidden states of the [CLS] and [SEP] tokens. The embedding of word w_i is calculated by averaging its hidden states over the blocks, using the following equation:

e_i = (1/T) Σ_{t=1}^{T} h_i^{(t)} (6)
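Under this notation, equation (6) is a per-token average over the layer dimension. A small sketch follows; the T × n × d array layout is an assumption about how the hidden states are stacked.

```python
import numpy as np

def word_embeddings_from_layers(hidden_states):
    """Equation (6): average each token's hidden state over the T
    transformer blocks (T = 12 for BERT-base). hidden_states is a
    T x n x d array with the [CLS]/[SEP] positions already dropped."""
    return np.mean(np.asarray(hidden_states), axis=0)  # n x d
```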
In equation (6), i refers to the index of the word and T refers to the number of transformer blocks. The word-based similarity is calculated using the following equation:

Sim(S_1 → S_2) = (1/n) Σ_{i=1}^{n} s(w_i, S_2) (7)

In equation (7), S_1 and S_2 refer to sentence 1 and sentence 2, respectively, n is the number of words in S_1, and s(w_i, S_2) is the similarity between word w_i in S_1 and sentence S_2. The similarity between a word w_i and a sentence S is measured by selecting the maximum similarity between w_i and every word in S, according to the following equation:

s(w_i, S) = max_j s(w_i, w_j) (8)

In equation (8), s(w_i, w_j) is the similarity between word w_i and word w_j, where w_j refers to the j-th word appearing in the other sentence S. The similarity between words is measured using cosine similarity between word vectors. When comparing two sentences S_1 and S_2, our method calculates the similarity of the words belonging to S_1 against the words belonging to S_2, as in the above formulas, and vice versa. In other words, we also calculate the words in S_2 against the words in S_1 and then take the arithmetic mean of the two directions, as in equation (9):

Sim_L = (1/2) (Sim(S_1 → S_2) + Sim(S_2 → S_1)) (9)
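The bidirectional max-cosine scheme of equations (7)-(9) can be sketched directly; each sentence is assumed to be a list of word embedding vectors, and the function names are illustrative.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two word vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def directed_sim(S1, S2):
    """Equations (7)-(8): for each word vector in S1, take its best
    cosine match in S2, then average over S1's words."""
    return float(np.mean([max(cosine(w, w2) for w2 in S2) for w in S1]))

def lexical_similarity(S1, S2):
    """Equation (9): arithmetic mean of both directions."""
    return 0.5 * (directed_sim(S1, S2) + directed_sim(S2, S1))
```

Averaging both directions makes the score symmetric, which the one-directional equation (7) alone is not.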

Novel hybrid sentence similarity
In this study, we combine the deep learning methodology and the method that considers lexical relationships using the equation below:

Sim = α Sim_D + (1 − α) Sim_L (10)

In equation (10), α is a weight that adjusts how much to focus on the deep learning methodology versus the method that considers lexical relationships, and it is determined experimentally. Through equation (10), the sentence similarity value has a range of 0-1.
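Equation (10) is a simple convex combination; a one-line sketch, where the default α below is only a placeholder since the paper tunes it experimentally:

```python
def hybrid_similarity(sim_d, sim_l, alpha=0.5):
    """Equation (10): blend the deep-learning score sim_d with the
    lexical score sim_l; both lie in [0, 1], so the result does too."""
    return alpha * sim_d + (1 - alpha) * sim_l
```

Setting α = 1 recovers the deep-learning-only baseline, which makes the ablation in the experiments straightforward.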

Experiment
The BERT used in this study is KoBERT, which is trained on Korean texts. Word2Vec is trained using a Korean raw corpus processed with Kkma, a Korean morpheme analyzer. Some studies have shown that splitting words into morphemes tends to perform well for Korean [22]. The word vectors extracted through Word2Vec have an embedding size of ℝ^{768}, the same as the BERT embedding size.

Datasets
In this study, KorSTS [23], consisting of 8,628 sentence pairs, is used as the experiment data. The training, development, and test sets consist of 5,749, 1,500, and 1,379 sentence pairs, respectively. The similarity score of each sentence pair ranges from 0 to 5, as shown in Table 1. In this study, the similarity score is normalized to the range 0 to 1 using minimum-maximum scaling.
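Since the gold labels' bounds are fixed at 0 and 5, the minimum-maximum scaling reduces to a single division; a one-line sketch:

```python
def minmax_scale(score, lo=0.0, hi=5.0):
    """Normalize a KorSTS gold score from the 0-5 range to 0-1."""
    return (score - lo) / (hi - lo)
```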

Result
We compare five high-performing models with the proposed model, using the Pearson and Spearman correlation coefficients as evaluation metrics. The performance shown in Table 2 is the average over five runs of each experiment. As shown in Table 2, both the Pearson and Spearman correlation coefficients are higher when considering both deep learning and lexical relationships than when using deep learning alone. Our method yields a significant performance improvement at only a minor increase in calculation cost compared to models using deep learning only. Given that we achieved performance improvements in all five models, our method generally increases a model's ability to understand the semantic similarity of sentences. In particular, on the model of [14], our approach resulted in about a 65% performance improvement.
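For reference, both evaluation metrics can be computed without external statistics packages; Spearman is Pearson applied to rank-transformed scores (the double-argsort rank trick below assumes no tied scores).

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation between predicted and gold similarity scores."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))

def spearman(x, y):
    """Spearman correlation: Pearson on the ranks of the scores."""
    rank = lambda a: np.argsort(np.argsort(a)).astype(float)
    return pearson(rank(x), rank(y))
```

In practice `scipy.stats.pearsonr` and `scipy.stats.spearmanr` handle ties and also report p-values, but the definitions above are what the reported numbers measure.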

Conclusion
This study measures the similarity between Korean sentences by combining a deep learning methodology and a method that considers lexical relationships. For the deep learning methodology, we use five neural networks based on CNN, RNN, and BERT. For the method that considers lexical relationships, we use cosine similarity between embedding vectors extracted through a word representation model. Finally, we calculate the final sentence similarity by combining the output values of the two methods. As a result, our combined method shows good performance compared to using only a deep learning model.
The method considering lexical relationships used in this study is one of several linguistic methods for measuring sentence similarity. As can be seen from the experimental results, good performance can be achieved by combining a deep learning method with a linguistic method. Therefore, in future studies, we will further improve sentence similarity performance by using various linguistic methods, such as information on word order and parts of speech.