Estimation of Cross-Lingual News Similarities Using Text-Mining Methods

In this research, we propose two machine-learning-based estimation algorithms for extracting cross-lingual news pairs from financial news articles. Every second, innumerable text data, including all kinds of news, reports, messages, reviews, comments, and tweets, are generated on the Internet, and these are written not only in English but also in other languages such as Chinese, Japanese, and French. By taking advantage of the multilingual text resources provided by Thomson Reuters News, we developed two estimation algorithms for extracting cross-lingual news pairs from multilingual text resources. In our first method, we propose a novel structure that uses word information and a machine learning method effectively in this task. Simultaneously, we developed a bidirectional Long Short-Term Memory (LSTM) based method to calculate cross-lingual semantic text similarity for both long text and short text. Thus, when an important news article is published, users can read similar news articles written in their native language using our method.


Introduction
Text similarity, as its name suggests, refers to how similar a given text query is to others. We normally consider texts mainly in terms of their semantic characteristics, that is, how close (i.e., similar) their meanings are. Here, the text could be at the character level, word level, sentence level, paragraph level, or even the document level. In this paper, we mainly discuss text in the form of sentences (i.e., short text) and documents (i.e., long text).
The objective of this research can be summarized in three key points. The fundamental objective is to develop algorithms that estimate the semantic similarity of two given pieces of text written in different languages, applicable to both long text and short text, by taking advantage of the vast untapped repository of text resources from Thomson Reuters economics news reports. Secondly, as a practical application and a verification of our model, we aim to develop a cross-lingual recommendation system and test benchmark, which provides several of the most related pieces of Japanese or English text (for example, 10 results) when given an English (or Japanese) article. Thirdly, we excavate cross-lingual resources from the enormous database of Thomson Reuters News and build an effective cross-lingual system by taking advantage of this undeveloped treasure.

Related Work and Theories
Regardless of the length of the text, most state-of-the-art methods have recently been built on word embedding methods, and thus we discuss these in detail in a separate section. To solve semantic text similarity problems, one of the most typical and inspiring methods is the Siamese LSTM structure, which serves as both a basis and a competitive baseline for this research.

Embedding Techniques for Words and Documents
The word embedding technique, also known as distributed word representation, is one of the most basic concepts and applications prevalent nowadays. Word embedding can be further extended to documents. Embedding techniques capture both semantic and syntactic information and convert them into meaningful feature vectors, which help to train accurate models for natural language processing (NLP) tasks (Tang et al. 2014).
Word embedding can be implemented for both monolingual and multilingual tasks. Successful work exists on monolingual word embedding, such as the continuous bag-of-words and skip-gram models (Mikolov et al. 2013); monolingual document embedding, such as doc2vec (Le and Mikolov 2014); cross-lingual word embedding (Zou et al. 2013); and cross-lingual document embedding models, such as Bilingual Bag-of-Words without Word Alignments (BilBOWA) (Gouws et al. 2015). Through an embedding model, each word, phrase, or document is converted into a fixed-length vector representation, and the similarity between two words, phrases, or documents can be derived by calculating the cosine distance of their vector representations. When solving the text similarity problem, methods differ distinctly for text data of different lengths (Le and Mikolov 2014). With respect to the length of the text, a textual similarity task can be further categorized into two sub-tasks. Prevalent methods for cross-lingual document (i.e., long text) similarity can be categorized into four groups (Rupnik et al. 2016): dictionary-based approaches (Kudo et al. 2004), probabilistic topic model based approaches (Taghva et al. 2005), matrix factorization based approaches (Lo et al. 2014), and monolingual approaches.

Text Similarities Using Siamese LSTM
Neural network-based Siamese recurrent architectures have recently proved to be one of the most effective ways of learning semantic text similarity at the sentence level. Mueller and Thyagarajan (2016) implement a Siamese recurrent structure called Manhattan LSTM (MaLSTM), which is used in practice to estimate the relatedness (i.e., similarity) of any two given sentences in English. This structure uses Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber 1997) and has state-of-the-art performance on both the semantic relatedness scoring task and the entailment classification task using the SICK database, one of the NLP challenges provided by SemEval (Agirre et al. 2016). This model can identify how similar two sentences are by trying to "understand" their true meaning at a deeper level, as with the sentence pair "He is smart" and "A truly wise man" that the figure demonstrates. The two sentences share no common words and differ in length, but they are indeed highly relevant to each other in terms of their implications, which a human cannot recognize without further consideration and logical analysis, suggesting the difficulty of this challenge.
In our work, we developed a new recurrent structure inspired by MaLSTM by modifying the Siamese (i.e., symmetric) LSTM modules into "unbalanced" ones and adding a fully connected neural network layer after the output of the LSTM modules, which makes the structure more flexible and effective for the text similarity task.

Methods for Extracting Cross-Lingual News Pairs
In this section, we introduce the fundamental methods applied in our research. Three aspects are elaborated: the natural language processing foundations we rely on, such as word embedding and TF-IDF, and the two learning methods we apply. One of these is the classical machine learning method SVM (Support Vector Machine). The other is the neural network method LSTM (Long Short-Term Memory).

Distribution Representation
The most traditional and naive way to consider words as features is to treat words as discrete symbols or numbers. This results in a discrete representation of each word and hinders the establishment of relations among these features. In contrast, vector space models consider (embedded) words in a continuous vector space, in which words with similar meanings are separated by small distances. There are two main categories of continuous word embedding: count-based models (such as latent semantic analysis) and predictive models (such as neural probabilistic language models). The count-based models focus on the co-occurrence of the considered word and its neighboring words, whereas the predictive models predict a word based on its neighbors using embedding vectors (Baroni et al. 2014). In this research, we implement a predictive model known as word2vec, which is based on the skip-gram or continuous bag-of-words model (Mikolov et al. 2013).
We train each word from the training text sequence $w_1, w_2, w_3, \ldots, w_T$ to maximize the objective function

$$\frac{1}{T} \sum_{t=1}^{T} \sum_{-c \le j \le c,\ j \ne 0} \log p(w_{t+j} \mid w_t), \qquad (1)$$

wherein $c$ is the so-called "window size," which determines how much context information is considered for each of the training words. More specifically, we define $p(w_{t+j} \mid w_t)$ using a softmax function:

$$p(w_O \mid w_I) = \frac{\exp\left(v_{w_O}^{\top} v_{w_I}\right)}{\sum_{w=1}^{W} \exp\left(v_{w}^{\top} v_{w_I}\right)}, \qquad (2)$$

wherein $W$ is the size of the vocabulary (i.e., the number of distinct words to be considered), and $v$ is the vector representation of either the word $w$, the input word $w_I$, or the output word $w_O$. However, the calculation of Equation (2) is impractical because the computational cost of calculating the gradient of $\log p(w_{t+j} \mid w_t)$ is proportional to $W$, which consists of as many as $10^5$ to $10^7$ terms. In practical terms, to train the model (i.e., optimize the cost function) in a more computationally efficient manner, we use Noise Contrastive Estimation for approximation during training, as described in Mikolov et al. (2013).
Finally, vector representations with a fixed dimension (e.g., 200) can be extracted from the trained model. These word vectors have some outstanding attributes. Because we train our model for each word using its neighboring words, and words with similar meanings tend to have similar contexts, we can calculate the similarity among words using the cosine distance.
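As a minimal sketch of the similarity computation described above, the following pure-Python snippet ranks words by cosine similarity. The three-dimensional vectors and the helper names (`cosine_similarity`, `most_similar`, `toy_vectors`) are made up for illustration and are not taken from the trained models.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen in this sketch for all-zero vectors
    return dot / (norm_u * norm_v)

# Toy 3-dimensional "word vectors" (hypothetical values, not trained embeddings).
toy_vectors = {
    "bank": [0.9, 0.1, 0.0],
    "lender": [0.8, 0.2, 0.1],
    "rain": [0.0, 0.1, 0.9],
}

def most_similar(query, vectors, topn=2):
    """Rank all other words by cosine similarity to the query word."""
    scored = [
        (word, cosine_similarity(vectors[query], vec))
        for word, vec in vectors.items()
        if word != query
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]
```

With real embeddings, the same ranking would be computed over 200-dimensional vectors rather than these toy ones.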

Term Frequency-Inverse Document Frequency (TF-IDF)
TF-IDF is one of the classical word weighting models for text representation. It is widely used in the natural language processing domain, where it is commonly applied to weight words or document features, such as in a one-hot bag-of-words representation. The term frequency (TF) is the number of times a considered word occurs in a specific document, while the document frequency (DF) is the number of documents in the corpus that include the word. The inverse document frequency (IDF) term for a specific word $w$ can be expressed as

$$\mathrm{idf}(w) = \log \frac{N}{\mathrm{df}(w)},$$

wherein $N$ is the total number of documents in the corpus and $\mathrm{df}(w)$ is the document frequency of $w$. Combining these two concepts, the TF-IDF weight is the product of the TF and the IDF. This scheme loses semantic information for words; thus, on its own it usually cannot achieve satisfactory performance. However, it measures the weight and importance of each word inside a document and across other documents according to a reasonable definition.
In this study, we apply TF-IDF to weight words during document embedding.
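The weighting can be illustrated with a short sketch. It assumes the plain $\log(N/\mathrm{df})$ form of the IDF without smoothing; the function names and the toy corpus are hypothetical.

```python
import math
from collections import Counter

def inverse_document_frequency(documents):
    """idf(w) = log(N / df(w)) for every word seen in the tokenized corpus."""
    n_docs = len(documents)
    document_frequency = Counter()
    for doc in documents:
        document_frequency.update(set(doc))  # count each word once per document
    return {w: math.log(n_docs / df) for w, df in document_frequency.items()}

def tf_idf(document, idf):
    """TF-IDF weight of each word in one document: raw count times idf."""
    term_frequency = Counter(document)
    return {w: count * idf.get(w, 0.0) for w, count in term_frequency.items()}

# Toy corpus of three tokenized "documents" (made-up data).
corpus = [["yen", "falls", "yen"], ["stocks", "rise"], ["yen", "rise"]]
idf = inverse_document_frequency(corpus)
weights = tf_idf(corpus[0], idf)
```

In this toy corpus, "falls" receives a larger weight than "yen" in the first document even though "yen" occurs twice, because "yen" appears in two of the three documents.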

TF-IDF Weighting for Word Vectors
Although there are several ways to form vector representations for documents (i.e., document embedding), we have experimentally discovered that the most effective strategy is to use the TF-IDF weighted sum of the word vectors present in each document as features. First, we calculate two TF-IDF weighting models, namely TF-IDF$_{jp}$ and TF-IDF$_{en}$, for each word from the Japanese training documents and the English training documents, respectively. Second, for each Japanese document, the weighted-sum document representation can be derived as

$$J_i = \sum_{m=1}^{N_i} t_{i,m}\, w_{i,m},$$

wherein $N_i$ refers to the number of words in this Japanese document (i.e., Japanese document $i$), and $t_{i,m}$ stands for the Japanese TF-IDF weight of the $m$-th word in document $i$. The final term $w_{i,m}$ is the word vector of the $m$-th word in document $i$, that is, the vector representation of this considered word. We apply the same weighting scheme to the English documents. The vector representation of English document $i$ can be expressed as

$$E_i = \sum_{m=1}^{N_i} t_{i,m}\, w_{i,m},$$

wherein all the variables are defined as in the Japanese case, except that the texts are in English.
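A minimal sketch of the weighted sum follows. It assumes the weights and word vectors are available as plain dictionaries and that out-of-vocabulary words are skipped; both choices, and the name `document_vector`, are assumptions of this example rather than details given in the text.

```python
def document_vector(words, tfidf_weights, word_vectors, dim):
    """TF-IDF weighted sum of the word vectors of one tokenized document."""
    doc = [0.0] * dim
    for word in words:
        vec = word_vectors.get(word)
        if vec is None:
            continue  # skip out-of-vocabulary words (assumption of this sketch)
        weight = tfidf_weights.get(word, 0.0)
        for k in range(dim):
            doc[k] += weight * vec[k]
    return doc

# Toy inputs (hypothetical values): two-dimensional vectors and fixed weights.
toy_weights = {"yen": 0.5, "falls": 2.0}
toy_word_vectors = {"yen": [1.0, 0.0], "falls": [0.0, 1.0]}
toy_doc = document_vector(["yen", "falls", "yen"], toy_weights, toy_word_vectors, dim=2)
```

The same function would be applied to English documents with the English weights and vectors.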

Feature Engineering
The selection of features is possibly the most significant and trickiest step, in particular for classical machine learning algorithms such as SVM. This is called "feature engineering" because the choice of features can greatly affect the results. Fortunately, as one of the most exciting results of this research, we discover that satisfactory results can be generated using the joint cross-lingual document vector based on TF-IDF weighted word2vec as the training feature for the SVM model. Although both SVM and TF-IDF weighted word vectors are common in the text mining domain, to the best of our knowledge, this is the first time that the effectiveness of using joint cross-lingual text feature vectors as input for SVM on the cross-lingual text similarity problem has been demonstrated.
More specifically, for the vector representation of a Japanese document $J_i$ and an English document $E_j$, the joint features are defined by

$$f_{i,j} = J_i \oplus E_j,$$

where $\oplus$ denotes vector concatenation. Via feature engineering, we prepare our training dataset $S$, which contains a subset $S_1$ of instances for which the similarity scores are all equal to 1:

$$S_1 = \{ f_{1,1}, f_{2,2}, \ldots, f_{N,N} \},$$

and another subset $S_0$ of instances for which the similarity scores are all equal to 0:

$$S_0 = \{ f_{1,o}, f_{2,o}, \ldots, f_{N,o} \},$$

wherein $N$ is the total number of cross-lingual training pairs with a similarity of 1 (i.e., similar pairs), and $o$ is an arbitrary number between 1 and $N$ that differs from the first index (e.g., not equal to 1 for $f_{1,o}$), such that $f_{1,o}$ is a dissimilar pair with a similarity of 0 (i.e., the pair is totally unrelated). Moreover, note that the two subsets are of equal size, $|S_1| = |S_0| = N$. Hence, our final training data $S$ is

$$S = S_1 \cup S_0. \qquad (9)$$
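The construction of the training set can be sketched as follows. The sketch assumes that the joint feature is a plain concatenation of the two document vectors and that each dissimilar pair is drawn uniformly at random with a fixed seed; the function names are hypothetical.

```python
import random

def joint_feature(j_vec, e_vec):
    """Concatenate a Japanese and an English document vector."""
    return list(j_vec) + list(e_vec)

def build_training_set(j_vecs, e_vecs, seed=0):
    """Return (feature, label) pairs: N parallel pairs and N random mismatches."""
    assert len(j_vecs) == len(e_vecs) and len(j_vecs) > 1
    rng = random.Random(seed)
    n = len(j_vecs)
    # S_1: each Japanese document paired with its parallel English document.
    positives = [(joint_feature(j_vecs[i], e_vecs[i]), 1) for i in range(n)]
    # S_0: each Japanese document paired with a random non-parallel English one.
    negatives = []
    for i in range(n):
        o = rng.choice([k for k in range(n) if k != i])
        negatives.append((joint_feature(j_vecs[i], e_vecs[o]), 0))
    return positives + negatives
```

Drawing one mismatch per Japanese document keeps the two label classes the same size.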

The SVM-Based Method
SVM is one of the most popular methods for solving both classification and regression tasks. It was originally proposed in the 1990s and has gradually proved to be effective in many fields, including natural language processing (NLP) and pattern recognition (Burges 1998; Malakasiotis and Androutsopoulos 2007; Béchara et al. 2015). TF-IDF and SVM are useful for tasks in the field of natural language processing. Therefore, we employ TF-IDF and SVM in our method as core technologies. Additionally, we propose a novel structure that uses TF-IDF and SVM effectively for this task. An overview of the structure is illustrated in Figure 1.
The system mainly contains three processing models. As our training dataset S only contains data with label 0 or 1, the classification training objective of the SVM is very similar to classification using Triplet Loss, which has proved to be quite effective in embedding and classification tasks (Schroff et al. 2015). The training procedure includes the following steps:

1. Use the cross-lingual training data in the form of pre-trained word vectors as input, which is discussed in detail in Section 3.1.
2. Weight the word vectors for each of the language models using TF-IDF, as introduced in Subsections 3.2 and 3.3.
3. Train the proposed model using SVM with Platt's probability estimation on the connected cross-lingual document features, each of which is the naive join of the two weighted word sum vectors in English and Japanese. This is explained in Section 3.4.
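For readers unfamiliar with Platt's probability estimation mentioned in the last step, the idea is to map the raw SVM decision value $f$ to a probability through a fitted sigmoid, $P(y = 1 \mid f) = 1 / (1 + \exp(A f + B))$. The sketch below only evaluates that sigmoid; the values of `a` and `b` are hypothetical, whereas in practice they are fitted on held-out decision values.

```python
import math

def platt_probability(decision_value, a, b):
    """Platt scaling: map an SVM decision value to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(a * decision_value + b))

# Hypothetical fitted parameters; A is typically negative, so larger decision
# values map to higher probabilities of the positive (similar) class.
A_EXAMPLE, B_EXAMPLE = -2.0, 0.0
similar_probability = platt_probability(1.0, A_EXAMPLE, B_EXAMPLE)
```

The resulting probability can then serve as the similarity score used for ranking candidate documents.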

A Bidirectional LSTM Based Method
We implement two independent bi-directional LSTM recurrent neural network modules, one for the English input and one for the Japanese input; an overview of this structure is shown in Figure 2. We use the cross-lingual training data in the form of pre-trained word vectors as input and feed the word vectors sequentially to the LSTM modules. This is discussed in detail in Section 3.1. Furthermore, as a limitation of our LSTM modules, the input data must have a uniform length, denoted as "maxlen". The part of a sequence beyond maxlen is discarded, while input sequences shorter than maxlen are padded at the tail with a predefined value (i.e., a word) such as "null", so that all input data have the same length. The two bi-LSTM modules are responsible for the English sequence and the Japanese sequence, respectively. They generate four hidden layer outputs, which we concatenate into a joint feature. Details are elaborated in Section 3.6.1. The joint feature is further fed into a densely connected neural network of depth 1, resulting in a one-dimensional output y ∈ [0, 1] as the final similarity score of the two cross-lingual inputs, by means of regression. In general, the LSTM-based model pays more attention to the order information of the input sequence, which might significantly determine the real meaning of a sentence written in natural language.
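The length normalization described above can be sketched in a few lines. The function name `pad_or_truncate` is hypothetical; the padding token "null" follows the example given in the text.

```python
def pad_or_truncate(tokens, maxlen, pad_token="null"):
    """Cut a token sequence to maxlen, or pad it at the tail up to maxlen."""
    clipped = list(tokens[:maxlen])
    return clipped + [pad_token] * (maxlen - len(clipped))
```

For example, with maxlen set to 4, a two-word title gains two "null" tokens at the tail, while a six-word title loses its last two words.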

The Bi-LSTM Layer
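In this research, we take advantage of bi-LSTM (bi-directional long short-term memory) to enhance ordinary RNN performance by considering both forward and backward information and to solve the problem of long-term dependencies. The update rules of the LSTM for each sequential input $x_1, x_2, \ldots, x_t, \ldots, x_T$ can be expressed in the standard form:

$$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)$$
$$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)$$
$$\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c)$$
$$c_t = i_t \odot \tilde{c}_t + f_t \odot c_{t-1}$$
$$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)$$
$$h_t = o_t \odot \tanh(c_t)$$

where $h_{t-1}$ is the hidden layer value of the previous state, $\odot$ denotes element-wise multiplication, and the sigmoid and tanh functions in the above equations are used as activation functions:

$$\sigma(x) = \frac{1}{1 + e^{-x}}, \qquad \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}.$$

The weights (i.e., parameters) we need to train include the weight matrices $W_i, W_f, W_c, W_o$ and $U_i, U_f, U_c, U_o$, together with the corresponding bias vectors. As a result, we obtain four feature vectors derived from the hidden layer values of the four LSTM modules, keeping all necessary information regarding the cross-lingual inputs. We then merge these four features by concatenating them directly:

$$h_{i,j} = \left[ h^{(a)}_i ;\ h^{(b)}_i ;\ h^{(c)}_j ;\ h^{(d)}_j \right],$$

where $i$ and $j$ refer to the document numbers of the input text for Japanese and English, respectively, the superscripts (a) to (d) index the four LSTM modules, and each vector $h$ on the right-hand side refers to the final state (i.e., the value) of the hidden layer of the corresponding LSTM module after feeding the last (or the first, if backwards) word.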
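The following pure-Python sketch illustrates how four final hidden states could be produced and concatenated. It assumes the standard LSTM cell with input weights W, recurrent weights U, and biases b per gate; the parameter layout, the helper names, and the toy constant weights are assumptions of this example, not the trained model.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def matvec(matrix, vector):
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM update; params maps gate name -> (W, U, b)."""
    def gate(name, activation):
        W, U, b = params[name]
        pre = [wx + uh + bias
               for wx, uh, bias in zip(matvec(W, x), matvec(U, h_prev), b)]
        return [activation(p) for p in pre]
    i = gate("i", sigmoid)    # input gate
    f = gate("f", sigmoid)    # forget gate
    o = gate("o", sigmoid)    # output gate
    g = gate("c", math.tanh)  # candidate cell state
    c = [it * gt + ft * cp for it, gt, ft, cp in zip(i, g, f, c_prev)]
    h = [ot * math.tanh(ct) for ot, ct in zip(o, c)]
    return h, c

def final_hidden_state(sequence, params, hidden_size, backward=False):
    """Run one LSTM over the sequence and keep only the last hidden state."""
    h = [0.0] * hidden_size
    c = [0.0] * hidden_size
    for x in (reversed(sequence) if backward else sequence):
        h, c = lstm_step(x, h, c, params)
    return h

def make_params(input_size, hidden_size, value):
    """Toy parameters: every weight equals `value`, biases are zero."""
    W = [[value] * input_size for _ in range(hidden_size)]
    U = [[value] * hidden_size for _ in range(hidden_size)]
    b = [0.0] * hidden_size
    return {name: (W, U, b) for name in ("i", "f", "c", "o")}

def joint_feature(japanese_seq, english_seq, modules, hidden_size):
    """Concatenate the final states of the four independent LSTM modules."""
    h_a = final_hidden_state(japanese_seq, modules["a"], hidden_size)
    h_b = final_hidden_state(japanese_seq, modules["b"], hidden_size, backward=True)
    h_c = final_hidden_state(english_seq, modules["c"], hidden_size)
    h_d = final_hidden_state(english_seq, modules["d"], hidden_size, backward=True)
    return h_a + h_b + h_c + h_d
```

Because each of the four modules has its own parameter set, nothing is shared between the Japanese and English sides, in contrast to a Siamese arrangement.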

Dense Layer
We use the most basic component, a fully connected (dense) neural network layer, as the top layer. The function of this layer can be expressed as:

$$y = f(w^{\top} x + b).$$

Here, $x$ is the joint feature fed into the layer, the function $f$ is known as the "activation" function, $b$ is the one-dimensional bias of the neural network, and $w$ is the weight (i.e., the parameters to be trained) of the neural network. In this project, we mainly apply the softplus function (Nair and Hinton 2010) as the activation function in the dense layer:

$$f(z) = \log(1 + e^{z}).$$

As for the optimization, although we are handling a classification problem, we find experimentally that, instead of the ordinary cross-entropy cost, the model performs better if we use the quadratic cost (i.e., mean squared error) as the cost function, which can be described as:

$$J = \frac{1}{N} \sum_{v=1}^{N} \left( y_{\mathrm{true},v} - y_{\mathrm{pred},v} \right)^2,$$

where $N$ is the total number of training data, while $y_{\mathrm{true},v}$ and $y_{\mathrm{pred},v}$ refer to the true similarity and the predicted similarity, respectively. In practice, stochastic gradient descent (SGD) is implemented by means of the back-propagation scheme. After computing the outputs and errors based on the cost function $J$, which is usually equal to the negative log of the maximum likelihood function, we update the parameters by the gradient descent method, expressed as:

$$w \leftarrow w - \varepsilon \frac{\partial J}{\partial w},$$

where $\varepsilon$ is known as the "learning rate", defining the update speed of the parameters $w$. However, the training process might fail due to either improper initialization of the weights or an improper learning rate. In practice, based on the results of the experiments, the best performance is achieved by applying the Adam optimizer (Kingma and Ba 2014) to perform the parameter updates.
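The dense layer and its update rule can be sketched for a single training example as follows. The sketch uses plain gradient descent on the squared error rather than Adam, and all names and toy values are hypothetical.

```python
import math

def softplus(z):
    return math.log1p(math.exp(z))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, b, x):
    """Dense layer with softplus activation: y = softplus(w . x + b)."""
    return softplus(sum(wi * xi for wi, xi in zip(w, x)) + b)

def mean_squared_error(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def sgd_step(w, b, x, y_true, learning_rate):
    """One gradient descent update on the squared error of one example."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    y_pred = softplus(z)
    # d/dz (y_pred - y_true)^2 = 2 (y_pred - y_true) * softplus'(z),
    # and the derivative of softplus is the sigmoid.
    grad_z = 2.0 * (y_pred - y_true) * sigmoid(z)
    new_w = [wi - learning_rate * grad_z * xi for wi, xi in zip(w, x)]
    new_b = b - learning_rate * grad_z
    return new_w, new_b

# Toy example (made-up numbers): one joint feature with target similarity 1.
x_example, target = [1.0, 2.0], 1.0
w0, b0 = [0.0, 0.0], 0.0
loss_before = mean_squared_error([target], [predict(w0, b0, x_example)])
w1, b1 = sgd_step(w0, b0, x_example, target, learning_rate=0.1)
loss_after = mean_squared_error([target], [predict(w1, b1, x_example)])
```

On this toy example, a single update with a learning rate of 0.1 moves the prediction toward the target, so `loss_after` is smaller than `loss_before`.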

Evaluation Methods
We mainly use two categories of evaluation: the rank-based TOP-N benchmark, and traditional classification criteria such as precision, recall, and the F1-value. As the application of this project aims to suggest several cross-lingual (for instance, English) alternative news stories to users when the user provides a Japanese article as a query, we make the system pick the 1, 5, and 10 most similar English alternatives during the evaluation process. Figure 3 illustrates the relationship and evaluation procedure for ranks and the TOP-N index. For a given Japanese text (i.e., the query) $J_x$, we calculate the similarity score between $J_x$ and all English texts of the test dataset $(E_1, E_2, \ldots, E_x, \ldots, E_M)$ to derive a list of scores $L_x = (S_{x,1}, S_{x,2}, \ldots, S_{x,x}, \ldots, S_{x,M})$, where $M$ is the total number of English documents considered, and $E_x$ is the true similar article with a similarity score of 1. We then sort this list in descending order and find the rank (i.e., position, index) of the score $S_{x,x}$ inside the sorted list, denoted $R_x$, the rank for the query document $J_x$. Repeating this process for $N$ Japanese articles $(J_1, J_2, \ldots, J_N)$ results in a list of ranks $R = (R_1, R_2, \ldots, R_N)$ for the collection of queries. Finally, we take the number of query documents whose rank is equal to or smaller than the cutoff as TOP-N (here the N in TOP-N denotes the rank cutoff, not the number of queries). In other words, TOP-1 refers to the number of query documents with a rank equal to 1, and TOP-5 refers to the number of query documents with a rank equal to or smaller than 5.
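The evaluation procedure can be sketched as follows. The sketch assumes that the score matrix is indexed so that the true English counterpart of query x sits at column x, and that ties are resolved in favor of the true article; the function names and the toy scores are hypothetical.

```python
def rank_of_true_article(scores, true_index):
    """1-based rank of the true article when scores are sorted in descending order."""
    true_score = scores[true_index]
    return 1 + sum(1 for s in scores if s > true_score)

def top_n(ranks, n):
    """Number of queries whose true article is ranked within the first n results."""
    return sum(1 for r in ranks if r <= n)

# Toy similarity scores (made-up): row x holds S_{x,1..M} for query J_x.
score_matrix = [
    [0.9, 0.1, 0.3],
    [0.8, 0.2, 0.5],
    [0.4, 0.6, 0.7],
]
ranks = [rank_of_true_article(row, x) for x, row in enumerate(score_matrix)]
```

Here the first and third queries rank their true article first, while the second query ranks it third, so TOP-1 is 2 and TOP-5 is 3.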

Baseline: Siamese LSTM with Google-Translation
Siamese LSTM is one of the deep learning-based models with state-of-the-art performance on semantic text similarity problems. In this research, we make this model a baseline by extending it from a monolingual domain to a cross-lingual domain with the help of the Google Translation service. We first translate all Japanese text into English for both the test and training data using the Google Translate service 1. Then we implement the Siamese LSTM model as described in the original paper (Mueller and Thyagarajan 2016) with the help of the open source code on GitHub 2. To illustrate this baseline for a pair of cross-lingual inputs: we first translate the Japanese input into an English sentence using the Google Translation service. We can then treat the cross-lingual task as a monolingual one and apply the Siamese LSTM model for training as a baseline.

Datasets and Pre-Processing
Thomson Reuters News 3 is a worldwide news agency providing news in multiple languages. Most of the reports are originally written in English and then translated and edited into other languages, including Chinese and Japanese. These multilingual texts are expected to be highly valuable resources for tasks related to multilingual natural language processing. In this research, we use 60,000 economics-related news articles from Thomson Reuters News published in 2014. For the preprocessing of text, we convert raw data to normalized data, which can then be used to train word2vec models for English and Japanese text, respectively. We train the Japanese word2vec model and the English word2vec model separately using news articles with the same contents in 2014. In our experiment, we use the Continuous Bag of Words (CBOW) model with a fixed word-embedding dimension of 200. Other parameters are set to the default values used in the Gensim package 4.

1 Google Translation Web API could be accessed from https://github.com/aditya1503/Siamese-LSTM.
2 The open source code for Siamese LSTM can be accessed from https://github.com/aditya1503/Siamese-LSTM.
As discussed in Section 3.1, word2vec builds relationships among words based on their original context. Given a query word, we can find several of the most similar words by calculating their cosine similarity. Tables 1 and 2 show examples of the most similar words for a query word in Japanese and in English, respectively. All these results suggest the effectiveness of the word2vec algorithms and the success of the training process.

Regarding Short Text, we first pick 4000 pairs of parallel Japanese-English news titles from the database, covering January to February 2014, all of which are labeled with a similarity score of 1. To provide balanced training data, we also generate 4000 pairs of non-parallel Japanese-English news titles by random combination. To simplify our model and experiments, we assume that the similarity of a randomly combined Japanese text and English text is 0. Then, for Long Text, similar to the data preparation for the short text experiments, we prepare 4000 parallel (i.e., similarity = 1) Japanese-English news articles and 4000 non-parallel (i.e., similarity = 0) ones as training data through random combination.

4 For more details on the configuration of the word2vec model, see the documentation of the Word2Vec class at https://radimrehurek.com/gensim/models/word2vec.html.

Test Data
For Short Text, in order to evaluate our model more comprehensively, we prepared two independent test datasets. TEST-1S contains 1000 pairs of parallel Japanese-English news titles, selected and split from the same period as the training data, from January 2014 to mid-February 2014. Similarly, TEST-2S contains title pairs with time stamps of December 2014. For Long Text, as in the short text evaluation, we prepared two independent test datasets. For training data, we prepared a dataset similar to that of the short text experiments. TEST-1L and TEST-2L each contain 1000 pairs of parallel Japanese-English long news articles.

First, both of our proposed models, the LSTM-based model and the SVM-based model, outperform the baseline in terms of all three TOP-N criteria. In terms of TOP-N, the LSTM-based model obtains around twice the performance of the baseline (511 vs. 243) on the short-text test data, and also around twice the performance of the baseline (685 vs. 302) on the long-text test data, suggesting the effectiveness and efficiency of our proposed models. Furthermore, the SVM-based method outperforms the LSTM-based method in terms of the TOP-N criteria by around 50% in the case of long text. In contrast, the LSTM-based model has a TOP-10 score around 10% higher than that of the SVM-based model on both short-text test datasets. The dominant performance of the SVM-based model on the long-text test data is also maintained in terms of TOP-1 and TOP-5, with twice the score of the LSTM-based model for TOP-5 and three times the score for TOP-1. On the other hand, although the LSTM-based model still performs better than the SVM-based model with respect to TOP-5, as for TOP-5 the LSTM-based model is no longer in the lead. We discuss these results and propose possible hypotheses and explanations in Section 5. The number of successful recommendations from our bi-LSTM-based model is twice that of the baseline.

Comparison of the Baseline and the LSTM-Based Model
The performance of the LSTM-based model is twice that of the baseline, even though both are based on LSTM structures. The differences from the baseline, which are also the innovations of the proposed method, include the use of bi-LSTM, independent LSTM modules, and a fully connected neural network as the final layer.
First, the baseline method is able to calculate the similarity of two sentences regardless of whether the two inputs arrange their words differently or use different words to refer to the same meaning, which demonstrates the effectiveness of its encoding (i.e., embedding) ability for the input text. However, the baseline model has the "Siamese LSTM structure", which means that the two LSTM instances always share the same parameters during training. This might be effective in the monolingual case, but it is not good enough in the cross-lingual case. Thus, the LSTM instances used in our proposed model each hold their own independent parameters. In addition, the bi-directional structure helps to encode the features of each input text more comprehensively. Finally, instead of using cosine similarity as the final layer as in the baseline method, we use a fully connected neural network as the output, allowing the output layer to adjust (i.e., train) its parameters so as to learn precise patterns from the features generated by the LSTMs. We believe these three modifications improve the final results of our LSTM-based model.

Comparison of the LSTM-Based Model and SVM-Based Model
The experiments above leave us with an interesting question: why do the LSTM-based model and the SVM-based model perform differently depending on the length of the text on which we train and test? We address this question from two aspects.

From the Point of View of the SVM-Based Model
The SVM-based method uses TF-IDF weighting, a classical and effective NLP method for comprehensively extracting the most important and representative features of each document. It can therefore accurately identify the most significant features, namely a few key words, from a very long and complex article containing hundreds of words, in both Japanese and English, and then feed them into the SVM classifier to obtain the similarity estimate. However, due to the nature of the TF-IDF algorithm, the shorter each document is, the less information TF-IDF can extract. This is because, if a document contains only a few words, every word could be either unique or common with respect to other documents, resulting in the failure of TF-IDF. This might be the reason why the SVM-based model performs well on long datasets but poorly on shorter datasets.

From the Point of View of the LSTM-Based Model
On the other hand, the LSTM is good at understanding sentences by grasping the order information of the words, since in any natural language, not only the words themselves but also their order, to some extent, define the true meaning of a sentence. Especially for short text, a slight change in word order can alter the meaning of a sentence significantly, and thus the LSTM-based model outperformed the SVM-based model by around 10% on the short datasets. However, the LSTM is not good at extracting the key idea of longer documents: although the LSTM solves the problem of memorizing long text (i.e., it addresses the problems of vanishing and exploding gradients), it cannot tell the importance of each word as TF-IDF does. That might be the reason why it fails to perform effectively on long text.

Conclusions
We developed a bi-LSTM-based model to calculate cross-lingual similarities given a pair of English and Japanese articles. Without using a translation module or a dictionary to translate from one language to another, our model achieves outstanding performance on short text. Furthermore, we modified and implemented a popular Siamese LSTM model as the baseline and found that both of our models outperform the baseline. For practical testing, we defined the concepts of "TOP-N" and "ranks" to test the overall performance of the model, with visualized results. We also conducted a comparative study based on the experimental results, which shows that the bi-LSTM-based model obtains better performance on short text data such as news titles and alert messages, which are on average shorter than 20 words, in contrast to normal news articles with more than 200 words on average. As the results show, both models obtained satisfactory performance, with over half of the 1000 test documents ranked within the top 10 (i.e., TOP-10). For a high-performance cross-lingual news similarity system, we expect that optimal performance could be achieved by combining both models into a complete system.
In the bi-LSTM layer, the trainable parameters include the weight matrices and the bias vectors $b_i$, $b_f$, $b_c$, $b_o$. A more thorough exposition of the LSTM model and its variants is provided by Graves (2012) and Greff et al. (2017). In this layer, we use the cross-lingual training data in the form of pre-trained word vectors as input, which is discussed in detail in Section 3.1. There are four LSTM modules, constructing two bi-LSTM structures, where we only consider the final output (i.e., the final value of the hidden layer) of each LSTM module. LSTM-a reads Japanese text in the forward direction, and the value of its hidden layer is denoted as $h^{(a)}_i$, where $i$ is the $i$-th input of the sequence, while LSTM-b reads backwards, denoted as $h^{(b)}_i$. Symmetrically, LSTM-c and LSTM-d are used to read English text, denoted as $h^{(c)}_j$ and $h^{(d)}_j$.

Figure 3. Illustration of the evaluation procedure using ranks and the TOP-N index.

Ranks and TOP-N
Table 3 also summarizes and compares our two proposed models, the LSTM-based model and the SVM-based model, in terms of the TOP-N benchmark for the LONG text scenario and the SHORT text scenario.

Table 1. Example of similarity relationship for Japanese words (translated).

Table 2. Example of similarity relationship for English words.

Table 3. Summary of results in terms of the TOP-N benchmark.