The N-Grams Based Text Similarity Detection Approach Using Self-Organizing Maps and Similarity Measures

Abstract: In the paper the word-level n-grams based approach is proposed to find similarity between texts. The approach is a combination of two separate and independent techniques: self-organizing map (SOM) and text similarity measures. SOM's uniqueness is that the obtained results of data clustering, as well as dimensionality reduction, are presented in a visual form. Four measures have been evaluated: cosine, dice, extended Jaccard's, and overlap. First of all, texts have to be converted to numerical expression. For that purpose, the text has been split into word-level n-grams and after that, the bag of n-grams has been created. The n-grams' frequencies are calculated and the frequency matrix of the dataset is formed. Various filters are used to create a bag of n-grams: stemming algorithms, number and punctuation removers, stop words, etc. All experimental investigation has been made using a corpus of plagiarized short answers dataset.


Introduction
Nowadays text mining is used in different practice areas [1], the most common being information extraction and retrieval, text classification and clustering, natural language processing, concept extraction, and web mining. Text analysis helps to solve problems such as plagiarism detection, creating effective anti-spam filters [2], and finding duplicates in a large number of documents or on the Internet [3]. Some methods focus on extracting keywords from scientific papers, which helps to find the main aim of a paper automatically [4]. In an education system, plagiarism detection is a sensitive issue [5]. Plagiarism is a common problem because students keep trying to cheat and present writings which they have not created. Usually the main technique to detect similarity between texts is to extract the bag of words from the whole text dataset. Then the frequency matrix is created; in other words, texts are converted to numerical expressions. In such a technique, the results depend on the filters selected when the bag of words is created. Therefore, it is important to select the right filters to get accurate results. Using this technique, we can analyze whole texts or split them into parts: sentences, paragraphs, pages, or n-grams. Depending on the task to be solved, n-grams can be formed at the character level or the word level [2,6]. Similarity results can be evaluated using different methods, for example, statistical methods, estimation of numerical values, or various clustering methods, such as k-means, Bayesian methods, artificial neural networks, etc. [7].
In this paper an approach is proposed to find similarity between texts by integrating not only a numerical estimation but also text clustering and visualization. The text similarity detection is based on splitting texts into word-level n-grams and evaluating them using a self-organizing map (SOM) and four numerical measures. Text analysis using a bag of words is not effective because it is difficult to detect how similar two texts are just by analyzing the frequency of separate words. Two different people writing a text individually can use similar or almost the same words. The bag of words analysis will show that both texts are similar, but the probability of an accidental match is higher than when a word-level bag of n-grams is analyzed [8]. In this paper, four specific measures have been used to evaluate text similarity: cosine, dice, extended Jaccard's, and overlap coefficient. Many different measures are evaluated in the literature [9], but these four are commonly used in various fields [10,11]. The other part of the approach is based on detecting text similarity with a SOM. The advantage of this method compared with other clustering methods is that we get a visual representation of all texts in a dataset, of the clusters, and of the similarities. It helps to make decisions much quicker than analyzing numerical estimations. The main problem of the SOM is that it has no measure that defines how similar the texts in the same cell of the SOM are. For this reason, it is effective to combine analysis of the texts using a SOM with numerical similarity measures. To get accurate results, we extract word-level n-grams of different lengths from the texts and analyze them. This allows us to find the same phrases in different texts. In such a way, instead of a bag of words, we have a bag of n-grams, which characterizes all texts. The experimental investigation was made using a corpus of plagiarized short answers dataset.
Appl. Sci. 2019, 9, 1870

Proposed Approach to Evaluate Text Similarity
As was mentioned earlier, there are various methods to find similarity between texts, but in most of them similarity is evaluated by numerical measures and is based on a bag of words. Instead, we propose an n-grams based approach whose results are estimated in two ways: visual and numerical. The scheme of the proposed approach is presented in Figure 1. The approach consists of three main parts: text preprocessing, visualization and clustering, and numerical estimation. The main aim of text preprocessing is to find the numerical expression of the texts (the frequency matrix), which is used for visualization, clustering, and numerical estimation. The detailed description of text preprocessing is presented in Section 2.2. After the frequency matrix is created, it is given to the SOM, where the dataset is clustered and visualized. This allows text similarity to be detected in a visual form. In parallel, the four similarity measures are calculated (numerical estimation). The combination of these two separate techniques allows a deeper analysis of text similarity. The SOM helps to see the similarity of the whole text dataset in one map, and the numerical estimation justifies and specifies the results quantitatively.

Preparation of Frequency Matrix
To analyze texts, it is necessary to convert textual information to numerical expression; the so-called frequency matrix needs to be created. There are many different tools to create it [12][13][14], but the main steps are usually the same (Figure 2). At first, a text dataset has to be parsed: all textual information is extracted from the original source and meta information is not included (pictures, tables, frames, schemes, and other unnecessary information are rejected). After parsing is done, tokenization has to be performed. Tokenization is a process of breaking a stream of text up into words, phrases, symbols, sentences, or other meaningful elements called tokens. The list of tokens becomes the input for further processing, such as text mining. Afterwards, different filters can be chosen. All texts contain some information that does not characterize the text or is simply not important in the text analysis. Therefore, the aim of the selected filters is to reject unimportant information from the text dataset, such as numbers, punctuation, stop words, etc. The most popular filters and their descriptions are presented in Table 1. Some text mining systems also provide specific filters that reject just links, keywords, or other unusual information. Filter selection is the most important part of text conversion because it has the biggest influence on the results. So, it is important to choose the right options to get accurate results [15]. Otherwise, useful information can be rejected and the results can be inappropriate.
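The tokenization and filtering steps described above can be sketched as follows. This is only a minimal illustration, not the tool chain used in the paper: the function names `tokenize` and `apply_filters`, the tiny `STOP_WORDS` list, and the regular expression (ASCII letters and digits only) are all assumptions made for the example.

```python
import re

# Hypothetical, deliberately tiny stop-word list; real tools ship much longer ones.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "on", "to", "is", "are"}


def tokenize(text):
    """Break a text into word tokens; punctuation is dropped, digits are kept.

    Assumption: only ASCII letters and digits form tokens, so "don't"
    would be split into "don" and "t".
    """
    return re.findall(r"[A-Za-z0-9]+", text)


def apply_filters(tokens, lowercase=True, drop_numbers=True, drop_stop_words=True):
    """Apply the selected filters to a token list and return the surviving tokens."""
    result = []
    for token in tokens:
        if lowercase:
            token = token.lower()
        if drop_numbers and token.isdigit():
            continue  # number filter
        if drop_stop_words and token.lower() in STOP_WORDS:
            continue  # stop-word filter
        result.append(token)
    return result


tokens = apply_filters(tokenize("The 3 methods of Data Mining, and text mining!"))
print(tokens)  # ['methods', 'data', 'mining', 'text', 'mining']
```

Switching individual filters off (for example `drop_numbers=False`) changes which tokens survive, which is one way to see how strongly the filter choice affects the later results.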

Stemming algorithm
The stemming algorithm separates the stem from the word [16]. For example, we have four words 'accepted', 'acceptation', 'acceptance', and 'acceptably'. The stem of these words is 'accept', so only this stem is analyzed and the other word forms are ignored.
According to the selected filters, a so-called bag of words is created. The bag of words is a list of terms from the texts, excluding the words that do not satisfy the conditions defined by the selected filters. Suppose we have a text dataset $D = \{D_1, D_2, \ldots, D_N\}$. According to the frequency of the words in the texts, a so-called frequency matrix is created:

$$X = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1m} \\ x_{21} & x_{22} & \cdots & x_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ x_{N1} & x_{N2} & \cdots & x_{Nm} \end{pmatrix} \qquad (1)$$

Here $x_{pl}$ is the frequency of the $l$th word in the $p$th text, $p = 1, \ldots, N$, $l = 1, \ldots, m$. $N$ is the number of analyzed texts, and $m$ is the number of words in the bag of words. In the simplest case, the frequency value is the number of times the word appears in the text. A row of matrix (1) is a vector corresponding to a text. The vectors $X_1, X_2, \ldots, X_N$, $X_p = (x_{p1}, x_{p2}, \ldots, x_{pm})$, $p = 1, \ldots, N$, can be used for text analysis using various methods.
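As an illustration of how such a frequency matrix can be built, the sketch below counts term occurrences for already filtered token lists. The function name `build_frequency_matrix` and the two toy texts are assumptions for the example; the column order (first appearance of each term) is also just one possible convention.

```python
def build_frequency_matrix(token_lists):
    """Return (vocabulary, matrix), where matrix[p][l] counts term l in text p."""
    vocabulary = []  # the bag of words, in order of first appearance
    index = {}       # term -> column number
    for tokens in token_lists:
        for term in tokens:
            if term not in index:
                index[term] = len(vocabulary)
                vocabulary.append(term)

    # One row per text, one column per term of the bag of words.
    matrix = [[0] * len(vocabulary) for _ in token_lists]
    for p, tokens in enumerate(token_lists):
        for term in tokens:
            matrix[p][index[term]] += 1
    return vocabulary, matrix


# Two hypothetical, already filtered texts.
texts = [
    ["data", "mining", "methods"],
    ["text", "mining", "text"],
]
vocabulary, X = build_frequency_matrix(texts)
print(vocabulary)  # ['data', 'mining', 'methods', 'text']
print(X)           # [[1, 1, 1, 0], [0, 1, 0, 2]]
```

Each row of `X` plays the role of one vector $X_p$; the same function works unchanged when the "terms" are n-grams instead of single words.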
Sometimes it is not enough to analyze just the words extracted from texts, especially when similarity has to be found. In that case, the analysis of n-grams can be used [17]. An n-gram is a contiguous sequence of n items from a given sequence. An item can be a word, a letter, a phoneme, etc. In our research, we have used a word as the item. In this way, we have a bag of n-grams, where each text is characterized by unique n-grams (a few words from the texts). The n-grams analysis allows sequences of a few words to be compared between texts, so the obtained similarity results are more accurate. The main steps of n-grams usage are the same as presented in the scheme in Figure 2. We further suggest adding sorting (Figure 1), which helps to avoid the problem that the same words can be written in a different order in different texts. Suppose we have two n-grams 'data mining methods' and 'methods of data mining'. After the filtering (common words are rejected) and the sorting (ascending) steps are completed, we get the same n-gram: 'data methods mining'. As the final result, we get the frequency matrix (1), where each $x_{pl}$ is equal to the frequency of the $l$th n-gram in the $p$th text. The proposed approach to find similarity between texts can be used to detect plagiarism (see Figure 1).
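The sorting step can be illustrated with a short sketch. It assumes that filtering is applied before the n-grams are extracted and that the words inside each n-gram are sorted in ascending alphabetical order; the function name `sorted_ngrams` and the one-word stop list are hypothetical.

```python
def sorted_ngrams(tokens, n):
    """Slide a window of n tokens over the text and sort the words inside each window."""
    return [" ".join(sorted(tokens[i:i + n])) for i in range(len(tokens) - n + 1)]


STOP_WORDS = {"of"}  # hypothetical minimal stop-word filter for this example

first = [w for w in "data mining methods".split() if w not in STOP_WORDS]
second = [w for w in "methods of data mining".split() if w not in STOP_WORDS]

print(sorted_ngrams(first, 3))   # ['data methods mining']
print(sorted_ngrams(second, 3))  # ['data methods mining']
```

Both phrases map to the same n-gram, so they contribute to the same column of the frequency matrix. A text with fewer than n tokens simply yields no n-grams.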

Self-Organizing Maps
There are many different clustering methods which can be used in text analysis [18][19][20]: artificial neural networks (ANN), k-means, agglomerative hierarchical clustering, etc. The SOM is one of the most popular artificial neural network models, proposed by Professor T. Kohonen [21]. New extensions and modifications are developed constantly. SOMs can be used to cluster, classify, and visualize data. The main advantage of this method is that it shows results in visual form [22]. SOMs can be applied to many different tasks and are useful in text mining, too [23,24]. The main aim of the SOM is to preserve the topology of multidimensional data when they are transformed into a lower dimensional space (usually two-dimensional). The SOM is a set of nodes, connected to one another via a rectangular or hexagonal topology. The rectangular topology of the SOM is presented in Figure 3.
... The set of weights forms a vector $M_{ij}$, $i = 1, \ldots, k_x$, $j = 1, \ldots, k_y$, that is usually called a neuron or codebook vector, where $k_x$ is the number of rows and $k_y$ is the number of columns of the SOM. The texts of the analyzed dataset are presented to the SOM as the rows of matrix (1). The learning process of the SOM algorithm starts from the initialization of the components of the vectors (neurons) $M_{ij}$. They can be initialized at random (usually these values are random numbers from the interval (0, 1)) or by the principal components. At each learning step, an input vector $X_p$ is passed to the SOM. The vector $X_p$ is compared to all neurons $M_{ij}$. Usually, the Euclidean distance between this input vector $X_p$ and each neuron $M_{ij}$ is calculated. The vector (neuron) $M_w$ with the minimal Euclidean distance to $X_p$ is designated as the neuron winner (the best matching unit). The components of all neurons are adapted according to the learning rule:

$$M_{ij}(t+1) = M_{ij}(t) + h^{w}_{ij}(t)\,\bigl(X_p - M_{ij}(t)\bigr)$$

Here $t$ is the number of the learning step, $h^{w}_{ij}$ is a neighboring function, and $w$ is the pair of indices of the neuron winner of vector $X_p$.
The learning is repeated until the maximum number of learning steps $T$ is reached.
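The learning process described above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the SOM implementation used in the experiments: the neighboring function is taken to be a Gaussian on the grid multiplied by a linearly decreasing learning rate, one randomly chosen input vector is presented per learning step, and the names `train_som` and `best_matching_unit` are hypothetical.

```python
import math
import random


def best_matching_unit(grid, x):
    """Indices (i, j) of the neuron with the minimal Euclidean distance to x."""
    best, best_dist = (0, 0), float("inf")
    for i, row in enumerate(grid):
        for j, m in enumerate(row):
            dist = math.dist(m, x)
            if dist < best_dist:
                best, best_dist = (i, j), dist
    return best


def train_som(data, rows, cols, steps, seed=0):
    """Train a rectangular SOM; grid[i][j] is the codebook vector M_ij."""
    rng = random.Random(seed)
    dim = len(data[0])
    # Random initialization of every neuron component.
    grid = [[[rng.random() for _ in range(dim)] for _ in range(cols)]
            for _ in range(rows)]
    for t in range(steps):
        x = data[rng.randrange(len(data))]  # assumption: one random input per step
        wi, wj = best_matching_unit(grid, x)
        # Assumed schedules: learning rate and radius shrink linearly with t.
        frac = 1.0 - t / steps
        alpha = 0.5 * frac
        sigma = max(rows, cols) / 2.0 * frac + 0.5
        for i in range(rows):
            for j in range(cols):
                grid_dist2 = (i - wi) ** 2 + (j - wj) ** 2
                h = alpha * math.exp(-grid_dist2 / (2.0 * sigma ** 2))
                m = grid[i][j]
                # Learning rule: M_ij(t+1) = M_ij(t) + h * (X_p - M_ij(t))
                for k in range(dim):
                    m[k] += h * (x[k] - m[k])
    return grid


# Three hypothetical frequency vectors; the first two are close to each other.
data = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]
som = train_som(data, rows=3, cols=3, steps=500)
for p, x in enumerate(data):
    print(p, best_matching_unit(som, x))
```

After training, texts whose vectors share a winning cell, or land in neighboring cells, are the candidates for being similar. In practice the frequency vectors are often normalized first, which is a design choice outside this sketch.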

Measures for Text Similarity Detection
To evaluate the similarity between texts, it is necessary to use mathematical expressions which evaluate the similarity and express it as one single numeric value [25,26]. The most widely known and used text similarity measures are cosine, dice, the extended Jaccard's, and the overlap coefficient, given here in their usual vector forms:

$$\mathrm{cosine}(X_p, X_q) = \frac{\sum_{l=1}^{m} x_{pl} x_{ql}}{\sqrt{\sum_{l=1}^{m} x_{pl}^2}\,\sqrt{\sum_{l=1}^{m} x_{ql}^2}} \qquad (2)$$

$$\mathrm{dice}(X_p, X_q) = \frac{2 \sum_{l=1}^{m} x_{pl} x_{ql}}{\sum_{l=1}^{m} x_{pl}^2 + \sum_{l=1}^{m} x_{ql}^2} \qquad (3)$$

$$\mathrm{Jaccard}(X_p, X_q) = \frac{\sum_{l=1}^{m} x_{pl} x_{ql}}{\sum_{l=1}^{m} x_{pl}^2 + \sum_{l=1}^{m} x_{ql}^2 - \sum_{l=1}^{m} x_{pl} x_{ql}} \qquad (4)$$

$$\mathrm{overlap}(X_p, X_q) = \frac{\sum_{l=1}^{m} x_{pl} x_{ql}}{\min\left(\sum_{l=1}^{m} x_{pl}^2,\ \sum_{l=1}^{m} x_{ql}^2\right)} \qquad (5)$$

Here $X_p$ and $X_q$ are the rows of matrix (1) that correspond to the texts $D_p$ and $D_q$. To show how these four measures are calculated, a simple example is presented. Let us say we have four texts $D = \{D_1, D_2, D_3, D_4\}$ with a few words in each of them (Table 2). Let us say we do not use any filters, so the bag of words contains all terms from the texts: text, message, computer, science, data, mining, and, methods. According to the frequency of each term, the frequency matrix is obtained (Table 3).
After the frequency matrix is obtained, we can calculate the similarity measures. The measures calculated for the example in Table 2 are presented in Table 4. As we can see, the results of the cosine and dice measures are almost the same. The values of the extended Jaccard's measure are lower than the others. The overlap measure shows the highest values, and as there is no difference between overlap$(D_1, D_3)$ and overlap$(D_1, D_4)$, these texts are equally similar to $D_1$. All measures can be used equally to find similarity between texts, so it is hard to say which one is the most accurate, and a deeper investigation has to be made.
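The behaviour described above can be reproduced with a short sketch, under the assumption that the four measures take their usual vector-space forms (stated in the comments). The two frequency vectors are hypothetical and are not the values of Table 3.

```python
import math


def _dot(x, y):
    """Sum of element-wise products of two equally long frequency vectors."""
    return sum(a * b for a, b in zip(x, y))


def cosine(x, y):
    # x.y / (|x| * |y|)
    return _dot(x, y) / (math.sqrt(_dot(x, x)) * math.sqrt(_dot(y, y)))


def dice(x, y):
    # 2 * x.y / (|x|^2 + |y|^2)
    return 2.0 * _dot(x, y) / (_dot(x, x) + _dot(y, y))


def extended_jaccard(x, y):
    # x.y / (|x|^2 + |y|^2 - x.y)
    xy = _dot(x, y)
    return xy / (_dot(x, x) + _dot(y, y) - xy)


def overlap(x, y):
    # x.y / min(|x|^2, |y|^2)
    return _dot(x, y) / min(_dot(x, x), _dot(y, y))


# Hypothetical frequency vectors; all-zero vectors would cause a division by zero.
x = [1, 1, 0, 0]
y = [1, 1, 1, 1]
print(round(cosine(x, y), 3))   # 0.707
print(round(dice(x, y), 3))     # 0.667
print(extended_jaccard(x, y))   # 0.5
print(overlap(x, y))            # 1.0
```

For this pair, the extended Jaccard's value is the lowest and the overlap value the highest, which matches the ordering reported for the worked example.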

Dataset
A corpus of plagiarized short answers [27] has been used for the experimental investigation. This dataset is also suitable for finding similarity between texts. The corpus consists of one hundred texts: 95 answers provided by the 19 participants and 5 original Wikipedia source articles. The questions for students are given below (... Table 5). The average length of a text in the corpus is 208 words and 113 unique tokens. The description of each revision level is given:

• Near copy (cut): participants were asked to answer the question by simply copying the text from the relevant Wikipedia article.
• Light revision (light): participants were asked to base their answer on the text found in the Wikipedia article and were, once again, given no instructions about which parts of the article to copy.
• Heavy revision (heavy): participants were once again asked to base their answer on the relevant Wikipedia article but were instructed to rephrase the text to generate the answer with the same meaning as the source text, but expressed using different words and structure.
• Non-plagiarism (non): participants were provided with learning materials in the form of either lecture notes or sections from textbooks that could be used to answer the relevant question.

Steps of the Experiment
To find the similarity between the texts of the analyzed dataset, the experimental investigation was made in three steps. At the first step, the way to create a bag of n-grams was analyzed. The primary research shows that for this dataset, n-grams can contain at most five words; otherwise some data is lost because the texts are short. In addition, to create the bag of n-grams, all filters given in Table 1 were included. The focus was on n-grams of three to five words; in total, fifteen variants were analyzed. The size of the bag of n-grams is given in Figure 4.
At the second step, four similarity measures (Table 6) were calculated between all twenty texts to detect which texts are similar, to compare the results with the given categorical descriptions (Table 5), and to decide which measure gives better results. At the last step, the same dataset was presented with a SOM. In the SOM, we can see the similarity of all twenty texts at once and, according to the obtained results, decide how similar the texts are to each other.
Table 6. Texts similarity results sorted as Near copy (cut), Light revision (light), Heavy revision (heavy), Non-plagiarism (non) (results are given in percent).

Experimental Results
Deeper analysis showed that for this dataset there was no big difference between n-grams of three, four, or five words, so the final experimental results are presented for the bag of n-grams created using three words. In Table 6, we can see all calculated measures, which represent the similarity between text $D_{20}$ (original text) and the other texts in the dataset. The variable $Q_n$, where $n = 1, \ldots, 5$, is the question number from the original dataset [28]. All values in Table 6 are in percent, so the lowest percent means the worst result (texts are not similar) and the highest the best (texts are similar). The highest percent has been marked in bold.
In the $Q_1$ analysis, the highest percent was obtained for all measures when $D_{20}$ was compared to the texts $D_5$ and $D_{17}$. According to Table 5, the texts most similar to the original are $D_4$, $D_{11}$, $D_{15}$, $D_{17}$. All measures reach the highest percent when the original text is compared to $D_{17}$, which proves that this text is a near copy. The text $D_5$ is marked as a light revision, but all measures showed that it is mostly copied text. As we can see, the other near copy texts ($D_4$, $D_{11}$, $D_{15}$) were detected as near copies (the highest percent) only when the overlap measure (5) was used ($D_4$ = 80%, $D_{11}$ = 66%, $D_{15}$ = 113%). The value of $D_{15}$ is higher than 100% because the original text and $D_{15}$ fully overlap and some n-grams occur more times in the original text. In this case, it means that these two texts are totally similar. If we look at the text similarity results of the $Q_2$ answers, we can see that the highest percent is obtained with texts $D_1$ and $D_{19}$ using the overlap measure. According to Table 5, the near copies are $D_1$, $D_8$, and $D_{18}$. So only the overlap measure confirms the similarity of one near copy ($D_1$ = 100%) and one light revision ($D_{19}$ = 81%). None of the measures detected the similarity of text $D_{18}$; all of them gave 0%. Deeper analysis showed that this was a mistake in the dataset description: comparing $D_{18}$ with the original text confirmed that it cannot be marked as a near copy because the text is totally different.
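A value above 100% can be illustrated with a small sketch. It assumes that the overlap measure is computed on frequency vectors as the dot product divided by the smaller of the two squared norms; the two vectors and the function name `overlap_percent` are hypothetical.

```python
def overlap_percent(x, y):
    """Overlap measure in percent (assumed form: x.y / min(|x|^2, |y|^2))."""
    dot = sum(a * b for a, b in zip(x, y))
    return 100.0 * dot / min(sum(a * a for a in x), sum(b * b for b in y))


answer = [1, 1, 1]    # every n-gram of the answer occurs once
original = [2, 1, 1]  # the original contains all of them, the first one twice

print(round(overlap_percent(answer, original), 1))  # 133.3
```

The answer is fully contained in the original, and the repeated n-gram raises the numerator above the answer's own squared norm, which pushes the value over 100%.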
The text similarity results of the answers to questions $Q_1$ and $Q_2$ are presented using the SOM (Figure 5) [24]. The color scale from white to black in the cells represents the values of the U-matrix [22]. A lighter color means that the distance between the data is short, and a darker color the opposite. The pie charts represent the texts of the dataset. If dataset items are very similar to each other, they fall into the same cell (one pie chart divided into pieces). As we can see in Figure 5a, the texts $D_5$, $D_{17}$, $D_{20}$ fall into the same cell, which means that the first two texts are similar to the original text $D_{20}$, as was earlier shown by the calculated similarity measures. The other texts also form groups or fall into the same cell. For example, according to Table 5, the texts $D_2$, $D_6$, $D_{10}$, $D_{16}$ are non-copies, so in the SOM they fall into the same cell. Using the SOM, we can easily identify which texts are similar to each other. On the right side of Figure 5b, the near copy and light revision texts $D_1$, $D_4$, $D_5$, $D_8$, $D_{12}$, $D_{19}$ are located near the original text $D_{20}$. It also confirms that these texts are the most similar to the original text.
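The idea behind the U-matrix shading can be sketched as follows. This is a simplified variant that stores, for every cell, the average Euclidean distance to its directly adjacent cells on a rectangular map; the full U-matrix of SOM tools also contains separate entries between neighboring cells. The function name `u_matrix` and the toy map are assumptions.

```python
import math


def u_matrix(grid):
    """Average Euclidean distance from each neuron to its 4-connected neighbours."""
    rows, cols = len(grid), len(grid[0])
    result = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            dists = []
            for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                ni, nj = i + di, j + dj
                if 0 <= ni < rows and 0 <= nj < cols:
                    dists.append(math.dist(grid[i][j], grid[ni][nj]))
            # A 1x1 map has no neighbours; keep 0.0 in that case.
            result[i][j] = sum(dists) / len(dists) if dists else 0.0
    return result


# Hypothetical 1x3 map: the first two neurons are close, the third is far away.
grid = [[[0.0, 0.0], [0.0, 1.0], [0.0, 5.0]]]
print(u_matrix(grid))  # [[1.0, 2.5, 4.0]]
```

Small values would be drawn in light colors (neighboring neurons are close) and large values in dark colors, which marks the borders between clusters.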
The results of the $Q_3$ answers showed that, according to all measures, the highest percent was obtained when the original text was compared to the texts $D_2$ (near copy) and $D_9$ (light revision). As for the previous question, the highest percent was obtained when the overlap measure was used ($D_2$ = 92%, $D_9$ = 79%). As we can see, in the top right corner of the SOM (Figure 6a) all these three texts are located in the same cell. The other texts also form clusters which correspond to the categories described in Table 5. The results of the $Q_4$ answers in the SOM (Figure 6b) show that the texts $D_{13}$ (near copy) and $D_{17}$ (light revision) are similar to the original text $D_{20}$, but the other near copy texts $D_3$, $D_6$, $D_9$ are located and grouped far away from the cell of the original text. So in this case, the SOM recognizes the similarity only partly. The similarity measures also confirmed the similarity of only three texts (Table 6), where the overlap measure gave the highest values: $D_3$ = 105%, $D_{13}$ = 99%, $D_{19}$ = 96%. In this case the text $D_3$ fully overlapped the original text, so it is obviously plagiarism.
The text similarity results of the answers to the last question $Q_5$ were almost all confirmed by the calculated measures. According to Table 5, the near copy texts are $D_5$, $D_7$, $D_{10}$, $D_{14}$, and $D_{16}$. All four measures showed that four of the five texts are similar to the original text. As with the previous results, the highest percent was obtained using the overlap measure: $D_5$ = 100%, $D_{10}$ = 97%, $D_{14}$ = 100%, and $D_{16}$ = 97%. Only the overlap measure confirmed the similarity of the text $D_5$; with the other measures the value is small. In the bottom left corner of the SOM (Figure 7), the original text $D_{20}$ is located in the same cell as the text $D_{16}$. The other near copy texts are scattered over the whole map, so in this case it is hard to confirm similarity using the SOM alone.

Conclusions
In this paper an approach was proposed to detect similarity between texts. The approach is based on splitting texts into n-grams and evaluating them using a SOM and similarity measures. The detection of similar texts was made in three steps: (1) conversion of the text dataset to numerical expression using n-grams; (2) calculation of similarity measures; (3) visualization of the text dataset using a SOM and representation of similarity on it. At the first step, the main focus was to create a bag of n-grams of the whole dataset. Different numbers of words in n-grams were analyzed. In addition, different filters were applied: number and punctuation removal, word frequency, uppercase transform, stemming algorithm, etc. The analysis showed that the filters and the size of n-grams influence the final results. For this dataset, an n-gram size of three was selected for the experimental investigation. At the second step, the four similarity measures were calculated: cosine, dice, extended Jaccard's, and overlap. The final results showed that the highest percent of similarity was obtained using the overlap measure. The values of the other three measures were always similar to each other and smaller. The SOM helps to see the summarized similarity results of all texts quickly in visual form. It is very easy to understand which texts are similar to each other and which are not. For the analyzed dataset, the SOM helped to detect similarity, and the formed clusters correlated with the given categorical description of the dataset.
The experimental investigation showed that the most accurate similarity measure is overlap because this measure detected more near copy texts and gave the highest percent. Sometimes it even showed a full overlap of texts, which can be defined as plagiarism. The SOM helps to summarize the similarity of the full dataset in visual form, but it is hard to confirm how similar the texts are to each other. The investigations showed that the SOM is more useful as an additional tool to decide which texts could be similar, so that a deeper investigation can then be applied. The usage of n-grams and the creation of a bag of n-grams proved to be an effective way to find similarity between texts. Deeper analysis has to be made to detect how the filters, the size of n-grams, and other aspects of the conversion of texts to numerical expression affect the final results for datasets of much longer texts. So it is purposeful to analyze them in more detail in the future. The proposed approach allowed similarity between texts to be found, and evaluating the results by combining the SOM and numerical estimations helped to make a deep analysis.