Global and Local Information Adjustment for Semantic Similarity Evaluation

Abstract: Semantic similarity evaluation is used in various fields, such as question answering and plagiarism detection, and many studies have addressed this problem. Previous studies that use neural networks to evaluate semantic similarity have measured similarity using only the global information of sentence pairs. However, since a sentence carries not a single meaning but a variety of meanings, relying only on global information can limit performance. Therefore, in this study, we propose a model that uses global information and local information simultaneously to evaluate the semantic similarity of sentence pairs. The proposed model can adjust, through a weight parameter, whether to focus more on global information or on local information. The experiments show that the proposed model achieves higher accuracy than existing models that use only global information.


Introduction
Semantic similarity evaluation is used in various fields such as machine translation, information retrieval, question answering, and plagiarism detection [1][2][3][4]. Semantic similarity is measured for two texts regardless of their length, the location of the corresponding words, and their contexts. Evaluating semantic similarity manually costs a great deal of time and money. To address this, past studies have used bilingual evaluation understudy (BLEU) [5] or the metric for evaluation of translation with explicit ordering (METEOR) [6]. However, both are vocabulary-based similarity measures, and it is difficult for them to capture expressions that are similar rather than identical. Recent studies [2][3][4] have shown good performance using artificial neural networks, such as convolutional neural networks (CNN), long short-term memory (LSTM), and gated recurrent units (GRU), to evaluate the semantic similarity of sentence pairs. These methods [2][3][4] have used the output of the last hidden state, which represents the whole sentence, to evaluate similarity. However, if sentence similarity is judged using only information representing the entire sentence, it is difficult to properly reflect the effect of local semantic similarity [7,8]. When a text is composed of multiple sentences, it is very important to estimate the semantic similarity between the individual sentence pairs. Consider the example below.
Sentence 1: I'm a 19-year-old. How can I improve my skills or what should I do to become an entrepreneur in the next few years?
Sentence 2: I am a 19 years old guy How can I become a billionaire in the next 10 years?
The two sentences above contain similar expressions overall, but their second sentences have distinctly different meanings. Therefore, in this study, we propose a model that uses not only global features, i.e., information on the entire sentence, but also local features, i.e., local information within the sentence.

The contributions of this study are as follows:

• To evaluate the semantic similarity of sentence pairs, we propose a model that simultaneously uses global features, i.e., information on the entire sentence, and local features, i.e., localized sentence information. The proposed model can adjust whether to focus more on global information or on local information, and its accuracy is higher than that of existing models that use only global information.

• We investigate the effect of dynamic routing on similarity evaluation. Since semantically corresponding phrases can appear at relatively free positions within sentences, we found that dynamic routing hinders correct evaluation. In addition, experiments were conducted on both English and Korean datasets to demonstrate the language independence of the proposed model.
This paper is organized as follows. Section 2 briefly describes the models of former studies and our proposed model for evaluating sentence similarity. Section 3 describes the proposed model for global and local feature extraction in detail. Section 4 describes the data used in this study, the hyperparameters used in the experiments, and the experimental results. Section 5 discusses the results. Finally, Section 6 summarizes this study and proposes future work.

Related Works
Sentence similarity evaluation is used in various fields, and recent work utilizing deep learning has shown good performance [2][3][4]. Ref. [2] evaluates similarity by applying the Siamese network structure to LSTM [13], a family of recurrent neural networks (RNN) that performs well on sequential data. The Siamese network is a structure in which two inputs are entered simultaneously into a single neural network [2]. Words constituting sentences are represented as vectors through word embedding. The two sentences composed of word vectors are entered into the LSTM, and learning proceeds. The similarity is evaluated using the Manhattan distance, $\exp(-\|h_l^{(L)} - h_r^{(L)}\|_1)$, where $h_l^{(L)}$ and $h_r^{(L)}$ are the output vectors of the last hidden states representing each sentence. Ref. [3] evaluates similarity by applying the Siamese network structure to a CNN and an LSTM. Ref. [3] converts the two sentences into two vectors. Then, local features are extracted through the CNN, which captures information on adjacent words. The extracted local features are fed into an LSTM, and the LSTM is trained. The similarity between the two sentences is calculated by applying the Manhattan distance to the LSTM output vectors.
Ref. [4] evaluates similarity by applying the Siamese network structure to a group CNN and a bidirectional GRU (Bi-GRU), a family of RNNs. Ref. [4] converts the two sentences into two vectors. Then, multi-local features are extracted through the group CNN, and the most representative local features are obtained by applying max-pooling to the multi-local features. These extracted local features are concatenated with the word vectors and fed into the Bi-GRU. The similarity is determined by applying the Manhattan distance to the Bi-GRU output vectors.

Materials and Methods
To consider whole-sentence information, our proposed model applies the Siamese network structure to a Bi-LSTM and uses the last hidden state as a global feature. Additionally, to use localized sentence information representing contextual information, self-attention is applied, and the features extracted by the capsule network are used as local features. Finally, sentence similarity is evaluated by applying the Manhattan distance to the global and local features.

Word Embedding
In the field of natural language processing, words constituting text are expressed as vectors through word embedding [2][3][4]. Word embedding refers to expressing a word's meaning as dense vectors so that computers can understand human words [17,18]. This stems from the assumption in distributional semantics that words appearing in similar contexts have similar meanings. Based on this, words with similar meanings are expressed in similar vectors [19].
In this study, Word2Vec [17,18], one of the word-embedding techniques, is used to express words as vectors. Word2Vec is a method to vectorize words using the target word and its surrounding words in a predetermined window size. Word2Vec updates learning weights by maximizing the dot product of the target word and its surrounding words. In this way, words are expressed as vectors representing their meanings.
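As an illustration, the following is a minimal sketch of training such embeddings with the gensim library; the tiny corpus, window size, and skip-gram setting are illustrative assumptions, not the paper's exact configuration.

```python
# A minimal Word2Vec training sketch with gensim; the corpus, window size,
# and skip-gram choice are illustrative assumptions.
from gensim.models import Word2Vec

sentences = [
    ["how", "can", "i", "become", "an", "entrepreneur"],
    ["how", "can", "i", "become", "a", "billionaire"],
]

model = Word2Vec(
    sentences,
    vector_size=300,  # the paper uses 300-dimensional vectors
    window=5,         # context window around the target word (assumed value)
    min_count=1,
    sg=1,             # skip-gram: predict surrounding words from the target
)

vec = model.wv["entrepreneur"]  # a 300-dimensional dense vector
```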

Bidirectional Long Short-Term Memory
RNN is a neural network that performs well on sequential data, but its learning ability suffers from gradient vanishing, in which the gradient disappears during backpropagation as the timestep, i.e., the position in the input sequence, increases [7]. This problem is called long-term dependency. To overcome this shortcoming of RNN, LSTM was proposed. LSTM adds a cell state ($c_t$) that can hold long-term memory to the hidden state ($h_t$) of RNN and thereby solves long-term dependency [7]. To obtain $h_t$ and $c_t$, LSTM uses an input gate ($i_t$), a forget gate ($f_t$), an output gate ($o_t$), and a cell candidate ($g_t$), all of which analyze the previous hidden state $h_{t-1}$ and the current input $x_t$. $f_t$ determines how much information of $c_{t-1}$ is preserved through the Hadamard product of $f_t$ and $c_{t-1}$. $i_t$ determines how much information of $x_t$ is reflected through the Hadamard product of $i_t$ and $g_t$. $o_t$ determines the current hidden state $h_t$ through the Hadamard product of $o_t$ and $\tanh(c_t)$. The equations of LSTM are as follows:

$$
\begin{aligned}
i_t &= \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i) \\
f_t &= \sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f) \\
o_t &= \sigma(W_{xo} x_t + W_{ho} h_{t-1} + b_o) \\
g_t &= \tanh(W_{xg} x_t + W_{hg} h_{t-1} + b_g) \\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
\tag{1}
$$

In Equation (1), the sigmoid function in $i_t$, $f_t$, and $o_t$ adjusts how much of the corresponding information is reflected. $W_{xi}$, $W_{xf}$, $W_{xo}$, and $W_{xg}$ are the learning weight matrices connected to $x_t$; $W_{hi}$, $W_{hf}$, $W_{ho}$, and $W_{hg}$ are those connected to $h_{t-1}$; and $b_i$, $b_f$, $b_o$, and $b_g$ are the biases of the corresponding layers. $c_t$ is updated by adding the Hadamard product of $f_t$ and $c_{t-1}$ to the Hadamard product of $i_t$ and $g_t$. Finally, $h_t$ is obtained as the Hadamard product of $\tanh(c_t)$ and $o_t$.
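For concreteness, the following numpy sketch implements one timestep of Equation (1); stacking the four gate weight matrices into one is an implementation convenience, not the paper's notation.

```python
# A minimal numpy sketch of one LSTM timestep implementing Equation (1);
# shapes and the stacked-weight layout are illustrative assumptions.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W_x, W_h, b):
    # W_x stacks W_xi, W_xf, W_xo, W_xg; W_h stacks W_hi, W_hf, W_ho, W_hg.
    z = x_t @ W_x + h_prev @ W_h + b           # shape (4*U,)
    U = h_prev.shape[-1]
    i_t = sigmoid(z[:U])                        # input gate
    f_t = sigmoid(z[U:2 * U])                   # forget gate
    o_t = sigmoid(z[2 * U:3 * U])               # output gate
    g_t = np.tanh(z[3 * U:])                    # cell candidate
    c_t = f_t * c_prev + i_t * g_t              # Hadamard products, Equation (1)
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t
```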
Bi-LSTM has a structure with a forward LSTM that processes the sequence from $x_1$ to $x_L$ and a backward LSTM that processes it from $x_L$ to $x_1$. The hidden states produced by Bi-LSTM are defined as follows:

$$
h_i = \left[\overrightarrow{h_i} \, ; \, \overleftarrow{h_i}\right] \tag{2}
$$

$$
H = [h_1, h_2, \ldots, h_L] \tag{3}
$$

Equation (2) is the concatenation of the hidden state of the forward LSTM and the hidden state of the backward LSTM at timestep $i$, where $L$ refers to the length of the sequence. In Equation (3), $H$ is the concatenation of the hidden states of all timesteps. This $H$ is then input to the attention mechanism.
The global feature that contains the information of the whole sentence is defined as follows:

$$
h^{(L)} = \left[\overrightarrow{h_L} \, ; \, \overleftarrow{h_1}\right] \tag{4}
$$

Equation (4) concatenates the final hidden states of the forward and backward LSTMs, i.e., the last hidden state of the Bi-LSTM, which is used as the global feature.
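A minimal PyTorch sketch of Equations (2)-(4), extracting $H$ and the global feature from a bidirectional LSTM, is shown below; all dimensions are illustrative.

```python
# A PyTorch sketch of extracting H (Equation (3)) and the global feature
# (Equation (4)) from a Bi-LSTM; dimensions are illustrative assumptions.
import torch
import torch.nn as nn

L, E, U = 12, 300, 50                      # sentence length, embedding dim, units
bilstm = nn.LSTM(E, U, bidirectional=True, batch_first=True)

x = torch.randn(1, L, E)                   # one embedded sentence
H, (h_n, c_n) = bilstm(x)                  # H: (1, L, 2U), Equation (3)

# Global feature: concatenation of the final forward and backward
# hidden states, i.e., the last hidden state of the Bi-LSTM (Equation (4)).
global_feature = torch.cat([h_n[0], h_n[1]], dim=-1)   # shape (1, 2U)
```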

Attention Mechanism
The attention mechanism [8,[14][15][16]] is a method of correlating words within sentences. In this study, self-attention over a single sentence is used, and $H$, the hidden states of the Bi-LSTM, is used as the input. Self-attention is defined as follows:

$$
u_{ij} = \tanh\left(W_i h_i + W_j h_j + b_i\right) \tag{5}
$$

$$
e_{ij} = W_a u_{ij} \tag{6}
$$

$$
a_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{L} \exp(e_{ik})} \tag{7}
$$

$$
c_i = \sum_{j=1}^{L} a_{ij} h_j \tag{8}
$$

In Equation (5), $h_i$ is the hidden state of the current timestep, and $h_j$ is the hidden state of any timestep, including the current one. $W_i$ and $W_j$ are the learning weights of the corresponding timesteps $i$ and $j$, and $b_i$ is the bias vector. In Equation (6), $W_a$ is a learning weight that calculates the importance of each word in terms of the current word, and $e_{ij}$ is a scalar value representing the importance of $h_j$ in terms of $h_i$. The importance of words is normalized to a probability value by Equation (7). Then, $c_i$, containing context information, is extracted by Equation (8): by multiplying the hidden state of each word by its importance probability and summing the results, the final context vector of a given word is obtained. $C$ is a matrix in $\mathbb{R}^{L \times U}$, where $L$ is the length of the sentence and $U$ is the number of units of the Bi-LSTM.
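The following PyTorch sketch implements the additive self-attention of Equations (5)-(8) as reconstructed above; the parameter names and the single-sentence (unbatched) shapes are assumptions.

```python
# A sketch of the additive self-attention of Equations (5)-(8) over the
# Bi-LSTM states H; parametrization and shapes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.W_i = nn.Linear(dim, dim, bias=True)   # weights for h_i (bias b_i)
        self.W_j = nn.Linear(dim, dim, bias=False)  # weights for h_j
        self.w_a = nn.Linear(dim, 1, bias=False)    # scoring weight W_a, Equation (6)

    def forward(self, H):                  # H: (L, U)
        hi = self.W_i(H).unsqueeze(1)      # (L, 1, U)
        hj = self.W_j(H).unsqueeze(0)      # (1, L, U)
        e = self.w_a(torch.tanh(hi + hj)).squeeze(-1)  # (L, L), Equations (5)-(6)
        a = F.softmax(e, dim=-1)           # importance probabilities, Equation (7)
        C = a @ H                          # context vectors c_i, Equation (8)
        return C                           # (L, U)
```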

Capsule Network
In this study, a capsule network using CNNs is used to extract local features [10][11][12][13]. The capsule network consists of two CNNs, Conv1 and Conv2. The matrix $C$ extracted by self-attention is used as the input of the capsule network. Conv1 proceeds as follows:

$$
p_i^k = f\left(w^k \cdot c_{i:i+h-1} + b^k\right) \tag{9}
$$

In Equation (9), $f(\cdot)$ refers to the activation function, $i$ refers to the index of the word, and $h$ refers to the kernel size. $w^k$ refers to the learning weight of the $k$-th convolution filter, a matrix in $\mathbb{R}^{h \times U}$, and $b^k$ refers to the bias. Conv1 is a typical CNN, extracting the combined information of $h$ adjacent words; through this, the features of Conv1 are created. Finally, Conv1 produces a matrix in $\mathbb{R}^{(L-h+1) \times N}$, where $L$ is the length of the sentence and $N$ is the number of filters, which increases the learning capacity.
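A short PyTorch sketch of Conv1 as described by Equation (9) follows; the kernel size, filter count, and ReLU activation are assumed values.

```python
# A sketch of Conv1 (Equation (9)): a 1-D convolution over the attention
# output C with kernel size h and N filters; all values are assumptions.
import torch
import torch.nn as nn

L, U, h, N = 12, 100, 3, 32
conv1 = nn.Conv1d(in_channels=U, out_channels=N, kernel_size=h)

C = torch.randn(1, L, U)                           # attention output, R^{L x U}
features = torch.relu(conv1(C.transpose(1, 2)))    # (1, N, L-h+1)
features = features.transpose(1, 2)                # R^{(L-h+1) x N}
```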
Conv2 uses the features created by Conv1 as its input. Conv2 is processed by Equation (9), like Conv1, but it uses a filter size spanning the entire input so as to subdivide the representation of the whole sentence [13,20]. The features of Conv2 extracted in this way form a matrix in $\mathbb{R}^{1 \times N}$, where $N$ is the number of filters. After that, PrimaryCaps with dimension $\mathbb{R}^{D \times \frac{N}{D}}$ are created by reshaping the Conv2 features. Here, PrimaryCaps are capsules that subdivide the whole-sentence information into $D$ pieces. To normalize the size of the vectors, PrimaryCaps use squash, a nonlinear function [10][11][12][13]. Squash is defined as follows:

$$
v_j = \frac{\|P_j\|^2}{1 + \|P_j\|^2} \, \frac{P_j}{\|P_j\|} \tag{10}
$$

In Equation (10), $P_j$ denotes one capsule. In this study, the PrimaryCaps with squash applied are used as the local features.
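The sketch below illustrates Conv2, the reshape into PrimaryCaps, and the squash of Equation (10); the input length and $N$ are illustrative, while $D = 8$ matches the hyperparameter reported in Section 4.

```python
# A sketch of Conv2, the reshape into PrimaryCaps, and the squash of
# Equation (10); the input length (10) and N are assumptions, D = 8 follows
# the hyperparameters in Section 4.
import torch
import torch.nn as nn

N, D = 32, 8
conv2 = nn.Conv1d(in_channels=N, out_channels=N, kernel_size=10)  # kernel spans the whole input

def squash(p, eps=1e-8):
    # Equation (10): scales each capsule vector to a length in [0, 1).
    norm2 = (p ** 2).sum(dim=-1, keepdim=True)
    return (norm2 / (1.0 + norm2)) * p / torch.sqrt(norm2 + eps)

x = torch.randn(1, N, 10)                  # Conv1 features, channels-first
caps = conv2(x).view(1, D, N // D)         # PrimaryCaps: R^{D x (N/D)}
primary_caps = squash(caps)                # local features
```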

Similarity Measure
Global features extracted from the Bi-LSTM and local features extracted from the capsule network are generated for each of the two sentences. In this study, the similarity between the two sentences is evaluated by applying the Manhattan distance to the global features and to the local features. The Manhattan-distance similarity is close to 1 if the two vectors are similar and close to 0 if they are dissimilar [2][3][4]. The similarities of the global and local features are defined as follows:

$$
Sim_{global} = \exp\left(-\left\|h_l^{(L)} - h_r^{(L)}\right\|_1\right) \tag{11}
$$

$$
Sim_{local} = \frac{1}{D} \sum_{j=1}^{D} \exp\left(-\left\|P_{l,j} - P_{r,j}\right\|_1\right) \tag{12}
$$

In Equations (11) and (12), the subscripts $l$ and $r$ denote the two sentences. In Equation (11), $h^{(L)}$ is the global feature, the last hidden state of the Bi-LSTM. In Equation (12), $P$ denotes the PrimaryCaps, and $D$ denotes the number of dimensions of the PrimaryCaps.
The final similarity value is calculated with the weight $\alpha$ as follows:

$$
Similarity = \alpha \, Sim_{global} + (1 - \alpha) \, Sim_{local} \tag{13}
$$

In Equation (13), $\alpha$ is a weight that adjusts whether to focus on the global features or on the local features. The value of $\alpha$ is determined experimentally. Finally, a predicted similarity of 0.5 or higher is judged as similar, and anything lower as not similar.
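Putting Equations (11)-(13) together, the following sketch computes the final similarity; the averaging over the $D$ capsules in the local term follows the reconstruction of Equation (12) above and should be treated as an assumption.

```python
# A sketch of the similarity computation of Equations (11)-(13); the
# capsule-averaged local term is a reconstruction, i.e., an assumption.
import torch

def manhattan_sim(a, b):
    # exp(-||a - b||_1): close to 1 for similar vectors, 0 for dissimilar.
    return torch.exp(-(a - b).abs().sum())

def similarity(g_l, g_r, P_l, P_r, alpha=0.8):
    sim_global = manhattan_sim(g_l, g_r)                       # Equation (11)
    D = P_l.size(0)
    sim_local = torch.stack(                                   # Equation (12)
        [manhattan_sim(P_l[d], P_r[d]) for d in range(D)]
    ).mean()
    return alpha * sim_global + (1 - alpha) * sim_local        # Equation (13)

# A pair is predicted "similar" when the returned value is >= 0.5.
```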

Experiments
In the experiments, the accuracy of the model proposed in this study is compared with that of previous models. Experiments are conducted in both English and Korean to show that our model improves accuracy regardless of language. To directly check the effect of dynamic routing, we built models with and without dynamic routing.

Dataset
This experiment requires both a corpus for learning word representations and a corpus for learning the similarity of given sentence pairs. To learn the representations of English words, we use the Google News corpus [21]. For Korean, we use the raw Korean sentence corpus produced and distributed by Kookmin University in Korea [22]. Individual words are embedded by Word2Vec. These corpora have shown good results in existing sentence similarity evaluation research [2,3,23]. English and Korean words are represented as 300-dimensional vectors, so a sentence of length $L$ is represented as a matrix in $\mathbb{R}^{L \times 300}$. The next two subsections describe the two datasets used for training the similarity evaluation models in English and Korean, respectively.
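As a small illustration, a tokenized sentence can be turned into this $\mathbb{R}^{L \times 300}$ matrix by looking up each word in a trained Word2Vec model; the zero-vector fallback for out-of-vocabulary words is an assumption.

```python
# A sketch of building the R^{L x 300} sentence matrix from a trained gensim
# Word2Vec model `model`; the OOV zero-vector fallback is an assumption.
import numpy as np

def embed_sentence(tokens, model, dim=300):
    rows = [model.wv[t] if t in model.wv else np.zeros(dim) for t in tokens]
    return np.stack(rows)                  # shape (L, 300)
```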

English Dataset
The English dataset used in this study is Quora Question Pairs [24]. Preprocessing removes stopwords and all special characters from the sentences and converts uppercase letters to lowercase. In addition, since sentences can become very short after preprocessing, we collected only sentences with more than five words. Table 1 shows examples of the English sentence pairs and labels collected for the experiment.
In the labels of Table 1, 0 means that the sentence pair does not have a similar meaning, and 1 means that the pair is similar in meaning. There are 50,000 data items for label 0 and 50,000 for label 1, so a total of 100,000 items are used in this study. The data are then divided into training, validation, and test sets at a ratio of 8:1:1.

Korean Dataset

The Korean dataset used in this study is collected from a translation of Quora Question Pairs [24] by Google Translate [25], Naver question pairs [26], the Exobrain Korean paraphrase corpus [27], and German translation pairs developed by Hankuk University of Foreign Studies [28]. To refine the Korean dataset, the Naver Korean spell checker [29] is used. The experiment is conducted after dividing words into morphemes, the smallest meaningful units of words, using Kkma, a Korean morphological analyzer [30]. Table 2 shows examples of the Korean sentence pairs and labels collected for the experiment.

Table 2. Example of Korean sentence pairs and labels (English translations shown).

Sentence 1: It is a day when you have to be patient and wait for the time.
Sentence 2: It's a day that seems to be going well.
Label: 0

Sentence 1: What is the cheapest way to produce teeth whitening effect?
Sentence 2: What is an inexpensive and efficient way to whiten teeth?
Label: 1
The sentence pairs in Table 2 are English translations of the Korean pairs, and the label of a sentence pair is 1 if the meanings are similar and 0 if they are not. There are 5500 data items for label 0 and 5500 for label 1, so a total of 11,000 items are used in this study. The data are then divided into training, validation, and test sets at a ratio of 8:1:1.
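A sketch of the preprocessing and the 8:1:1 split described in this section is given below; the regular-expression cleanup, the stopword list, and the random seed are assumptions.

```python
# A sketch of the preprocessing and 8:1:1 split described in this section;
# the regex cleanup, stopword list, and random seed are assumptions.
import random
import re

def preprocess(sentence, stopwords):
    # Lowercase, strip special characters, and drop stopwords.
    tokens = re.sub(r"[^a-z0-9\s]", " ", sentence.lower()).split()
    return [t for t in tokens if t not in stopwords]

def long_enough(s1_tokens, s2_tokens):
    # Keep only pairs whose sentences have more than five words.
    return len(s1_tokens) > 5 and len(s2_tokens) > 5

def split_8_1_1(pairs, seed=42):
    # Shuffle, then split into training/validation/test sets at 8:1:1.
    random.Random(seed).shuffle(pairs)
    n = len(pairs)
    return (pairs[: int(0.8 * n)],
            pairs[int(0.8 * n): int(0.9 * n)],
            pairs[int(0.9 * n):])
```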

Hyperparameters
Table 3 shows the hyperparameters for each neural network used in the experiment. The number of dimensions used in the PrimaryCaps is 8.

Figure 2a,b show the performance comparison according to α of Equation (13) in Section 3.5. Figure 2a shows the accuracy on the English dataset, and Figure 2b shows the accuracy on the Korean dataset. In the case of English, the higher the proportion of the global features, the higher the accuracy, peaking at a global-feature weight of 80%. For Korean, accuracy also increases with the weight of the global features; however, it peaks when the weight ratio of the two features is around 5:5. This is because the proportion of local features is relatively higher for Korean, which has a freer word order than English.

Table 4 shows the accuracy of various models, including the previously proposed semantic similarity evaluation models. Acc is the average accuracy over 10 runs using the Manhattan distance. English Acc is the accuracy on the English dataset, and Korean Acc is the accuracy on the Korean dataset.
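The α sweep behind Figure 2 can be expressed as a simple grid search; the `accuracy` helper below is hypothetical, standing in for training and evaluating the model at a fixed weight.

```python
# A grid search over the alpha weight of Equation (13); `accuracy` is a
# hypothetical helper that evaluates the model at a fixed alpha.
def sweep_alpha(data, accuracy):
    results = {a / 10: accuracy(a / 10, data) for a in range(11)}  # 0.0 ... 1.0
    best = max(results, key=results.get)  # e.g., 0.8 for English, 0.5 for Korean
    return best, results
```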

Discussion
Existing studies [2][3][4] have tried to estimate the similarity of sentences mainly by using global features. By also utilizing local features, we were able to improve accuracy by up to 8.32%p (Korean). In addition, it was shown that an accuracy improvement of up to 3.03%p can be achieved simply by changing the LSTM to a Bi-LSTM (compare No. 1 in Table 4). A capsule network is designed to solve the problems of CNN's pooling method, yet CNN's pooling shows higher accuracy on both the English and Korean datasets (see No. 5 and No. 7 in Table 4). Furthermore, the capsule network model without dynamic routing, No. 6, has higher accuracy than the CNN-based models, No. 5 and No. 7. From this, it can be confirmed that dynamic routing, which considers the spatial relationships of local features, is inappropriate for the task of semantic similarity estimation. Unlike images, in which the relative positions of pixel chunks are fixed to some extent, the positions of words or phrases in a sentence are relatively free.
As shown by No. 8 to No. 11, the RNN-related models, the capsule network model with self-attention applied to Bi-LSTM achieves the best accuracy. Comparing No. 4, No. 9, and No. 11, the accuracy improvement from the capsule network and self-attention is greater in Korean than in English. The semantics of a Korean sentence are concentrated in more local areas than in English, so in order to grasp the entire meaning of a Korean sentence, it is important to accurately understand the local meanings.
No. 12 is the proposed model, in which α of Equation (13) is 0.8 for English and 0.5 for Korean. It can be seen that the α of the Korean dataset, which weights the global features, is lower than the α of the English dataset. Since a larger α means a higher proportion of global features, we can infer that Korean depends more on local features, as it has a freer word order than English.

Conclusions
In this study, the semantic similarity of sentence pairs is evaluated using global and local sentence information. The proposed model consists of a Bi-LSTM, self-attention, and a capsule network, and all components except self-attention use the Siamese network structure. The model extracts global features through the Bi-LSTM and local features through the capsule network. The extracted global and local features are used to evaluate the semantic similarity of sentence pairs using the Manhattan distance. This allows not only the global and local features to be considered together, but also the adjustment, through α, of which information to focus on more closely. Comparing existing models, models using only global features, models using only local features, and the model using both global and local features simultaneously, the model using both feature types achieves the highest accuracy.
However, this study has a limitation. Even though α plays a very important role in determining the weight of global and local features, we have not been able to come up with a universal methodology for obtaining α. We will find a way to integrate this into the entire network to obtain the optimal α through learning. We will also apply another language model instead of Word2Vec to obtain more sophisticated sentence pair representations, and update the current network structure after analyzing the strengths and weaknesses to further improve the accuracy of the model.