
The N-Grams Based Text Similarity Detection Approach Using Self-Organizing Maps and Similarity Measures

Pavel Stefanovič, Olga Kurasova and Rokas Štrimaitis

1 Faculty of Fundamental Science, Vilnius Gediminas Technical University, Saulėtekio al. 11, LT-10223 Vilnius, Lithuania
2 Institute of Data Science and Digital Technologies, Vilnius University, Akademijos str. 4, LT-08663 Vilnius, Lithuania
* Author to whom correspondence should be addressed.
Appl. Sci. 2019, 9(9), 1870; https://doi.org/10.3390/app9091870
Submission received: 2 April 2019 / Revised: 24 April 2019 / Accepted: 30 April 2019 / Published: 7 May 2019
(This article belongs to the Special Issue Advances in Deep Learning)

Abstract

In this paper, a word-level n-grams based approach is proposed to find similarity between texts. The approach is a combination of two separate and independent techniques: the self-organizing map (SOM) and text similarity measures. The uniqueness of the SOM is that the obtained results of data clustering, as well as of dimensionality reduction, are presented in a visual form. Four measures have been evaluated: cosine, dice, extended Jaccard's, and overlap. First of all, the texts have to be converted to a numerical expression. For that purpose, each text is split into word-level n-grams, and then a bag of n-grams is created. The n-gram frequencies are calculated and the frequency matrix of the dataset is formed. Various filters are used to create the bag of n-grams: stemming algorithms, number and punctuation removers, stop words, etc. All experimental investigation has been made using a corpus of plagiarized short answers.


1. Introduction

Nowadays, text mining can be used in many practical areas [1], the most common being information extraction and retrieval, text classification and clustering, natural language processing, concept extraction, and web mining. Text analysis helps to solve problems such as plagiarism detection, creating effective anti-spam filters [2], and finding duplicates in a large number of documents or on the Internet [3]. Some methods focus on extracting keywords from scientific papers, which helps to find the main aim of a paper automatically [4]. In the education system, plagiarism detection is a sensitive issue [5]. Plagiarism is a common problem because students keep trying to cheat and present writings which they have not created. Usually, the main technique to detect similarity between texts is to extract the bag of words from the whole text dataset. Then the frequency matrix is created; in other words, the texts are converted to numerical expressions. With this technique, the results depend on the filters selected when the bag of words is created, so it is important to select the right filters to get accurate results. The texts can be analyzed whole or split into parts: sentences, paragraphs, pages, or n-grams. Depending on the task being solved, n-grams can be formed at the character level or the word level [2,6]. Similarity results can be evaluated using different methods, for example, statistical estimation of numerical values, or various clustering methods, such as k-means, Bayesian methods, artificial neural networks, etc. [7].
In this paper, an approach is proposed to find similarity between texts by integrating not only a numerical estimation but also text clustering and visualization. The text similarity detection is based on splitting texts into word-level n-grams and evaluating them using a self-organizing map (SOM) and four numerical measures. Text analysis using a bag of words is not effective because it is difficult to detect how similar two texts are by analyzing only the frequencies of separate words. Two people writing texts independently can use a similar or almost identical set of words; a bag-of-words analysis will then show that both texts are similar, whereas the probability of such an accidental match is much lower when a word-level bag of n-grams is analyzed [8]. In this paper, four specific measures have been used to evaluate text similarity: cosine, dice, extended Jaccard's, and the overlap coefficient. There are significantly different evaluations in the literature [9], but these four measures are commonly used in various fields [10,11]. The other part of the approach is based on detecting text similarity with a SOM. The advantage of this method compared with other clustering methods is that we get a visual representation of the whole text dataset, its clusters, and the similarities within it, which helps to make decisions much quicker than analyzing numerical estimates. The main problem of the SOM is that it has no measure defining how similar the texts falling into the same cell of the map are. For this reason, it is effective to combine analysis of the texts using the SOM with numerical similarity measures. To get accurate results, we extract word-level n-grams of different lengths from the texts and analyze them, which allows us to find the same phrases in different texts. In such a way, instead of a bag of words, we have a bag of n-grams which characterizes all texts. The experimental investigation was made using a corpus of plagiarized short answers.

2. Text Similarity Detection

2.1. Proposed Approach to Evaluate Text Similarity

As mentioned earlier, there are various methods to find similarity between texts, but in almost all of them similarity is evaluated by numerical measures and is based on the usage of a bag of words. Instead of this, we propose an n-grams based approach in which the results are estimated in two ways: visually and numerically. The scheme of the proposed approach is presented in Figure 1. The approach consists of three main parts: text preprocessing, visualization and clustering, and numerical estimation. The main aim of text preprocessing is to find the numerical expression of the texts (the frequency matrix), which is then used for visualization, clustering, and numerical estimation. A detailed description of text preprocessing is presented in Section 2.2. After the frequency matrix is created, it is given to the SOM, where the dataset is clustered and visualized; this allows detecting text similarity in a visual form. In parallel, the four similarity measures are calculated (numerical estimation). The combination of these two separate techniques allows performing a deeper text similarity analysis: the SOM shows the similarity of the whole text dataset in one map, and the numerical estimation justifies and specifies the results quantitatively.

2.2. Preparation of Frequency Matrix

To analyze texts, it is necessary to convert the textual information to a numerical expression; the so-called frequency matrix needs to be created. There are many different tools for this [12,13,14], but the main steps are usually the same (Figure 2).
First, the text dataset has to be parsed: all textual information is extracted from the original source, and meta-information (pictures, tables, frames, schemes, and other unnecessary elements) is rejected. After parsing, tokenization is performed. Tokenization is the process of breaking a stream of text up into words, phrases, symbols, sentences, or other meaningful elements called tokens. The list of tokens becomes the input for further processing, such as text mining. Afterwards, different filters can be chosen. All texts contain some information that does not characterize them or is simply not important for the analysis, so the aim of the selected filters is to reject unimportant information, such as numbers, punctuation, stop words, etc. The most popular filters and their descriptions are presented in Table 1. Some text mining systems also provide specific filters which reject only links, keywords, or other non-ordinary information. Filter selection is the most important part of text conversion because it has the biggest influence on the results; it is therefore important to choose the right options to get accurate results [15], otherwise useful information can be rejected and the results can be inappropriate.
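As an illustration, the following minimal Python sketch applies several of the filters of Table 1 to a raw string. It is only a sketch under simple assumptions: the stop word list and the helper name preprocess are our own illustrative choices, and a real system would use a larger, domain-adapted stop word list plus a stemmer such as Porter's [16].

```python
import re
import string

# A tiny illustrative stop word list; production systems use larger,
# domain-adapted lists (see the stop words filter in Table 1).
STOP_WORDS = {"a", "an", "and", "is", "of", "that", "the", "there", "when", "where"}

def preprocess(text, min_chars=2):
    """Tokenize a text and apply the filters described in Table 1."""
    text = text.lower()                                   # case converter
    text = text.translate(str.maketrans("", "", string.punctuation))  # punctuation filter
    tokens = text.split()                                 # tokenization
    tokens = [t for t in tokens if not re.fullmatch(r"\d+", t)]  # number filter
    tokens = [t for t in tokens if len(t) >= min_chars]   # N chars filter
    return [t for t in tokens if t not in STOP_WORDS]     # stop words filter

print(preprocess("Methods of text data mining (2 of them are well known)."))
# ['methods', 'text', 'data', 'mining', 'them', 'are', 'well', 'known']
```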
According to the selected filters, a so-called bag of words is created: the list of terms from the texts, excluding the words that do not satisfy the conditions defined by the selected filters. Suppose we have a text dataset $D = \{D_1, D_2, \dots, D_N\}$. According to the frequencies of the words in the texts, the so-called frequency matrix is created:

$$\begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1m} \\ x_{21} & x_{22} & \cdots & x_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ x_{N1} & x_{N2} & \cdots & x_{Nm} \end{pmatrix} \qquad (1)$$

Here $x_{pl}$ is the frequency of the $l$th word in the $p$th text, $p = 1, \dots, N$, $l = 1, \dots, m$; $N$ is the number of analyzed texts, and $m$ is the number of words in the bag of words. In the simplest case, the frequency value is simply the number of times the word appears in the text. Each row of matrix (1) is a vector corresponding to one text; the vectors $X_1, X_2, \dots, X_N$, where $X_p = (x_{p1}, x_{p2}, \dots, x_{pm})$, $p = 1, \dots, N$, can then be used for text analysis by various methods.
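A minimal sketch of this step is given below, assuming the token lists have already been produced by filters such as those above; frequency_matrix is an illustrative helper name, not a tool used in the paper.

```python
from collections import Counter

def frequency_matrix(token_lists):
    """Build matrix (1): row p holds the frequencies x_pl of the l-th
    bag-of-words term in the p-th text."""
    vocab = sorted({t for tokens in token_lists for t in tokens})
    counts = [Counter(tokens) for tokens in token_lists]
    return vocab, [[c[term] for term in vocab] for c in counts]

# The four example texts of Table 2, tokenized with no filters applied.
docs = [["text", "message"],
        ["computer", "science"],
        ["data", "mining", "and", "text", "mining"],
        ["methods", "of", "text", "data", "mining"]]
vocab, X = frequency_matrix(docs)
print(vocab)  # the bag of words, in alphabetical order
print(X[2])   # the frequency vector of D3 (cf. Table 3)
```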
Sometimes it is not enough to analyze just the words extracted from texts, especially when similarity has to be found; an analysis of n-grams can be used instead [17]. An n-gram is a contiguous sequence of n items from a given sequence, where an item can be a word, a letter, a phoneme, etc. In our research, we have used the word as the item. In this way, we get a bag of n-grams, where each text is characterized by its unique n-grams (short sequences of a few words from the text). The n-grams analysis allows comparing a few words at a time, so the obtained similarity results are more accurate. The main steps of n-grams usage are the same as in the scheme in Figure 2. We further suggest adding a sorting step (Figure 1), which helps to avoid the problem of the same words being written in a different order in different texts. Suppose we have the two n-grams 'data mining methods' and 'methods of data mining'. After filtering (common words are rejected) and sorting (ascending), both become the same n-gram: 'data methods mining', as sketched below. In the final result, we get the frequency matrix (1), where each $x_{pl}$ is the frequency of the $l$th n-gram in the $p$th text. The proposed approach to finding similarity between texts can be used to detect plagiarism (see Figure 1).
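A minimal sketch of the n-gram extraction with the sorting step follows; sorted_ngrams is an illustrative name, and the stop word filtering is assumed to have been done beforehand (e.g., by the preprocess sketch above).

```python
def sorted_ngrams(tokens, n):
    """Extract word-level n-grams, sorting the words inside each n-gram
    so that word order does not matter."""
    return [" ".join(sorted(tokens[i:i + n]))
            for i in range(len(tokens) - n + 1)]

# 'data mining methods' and 'methods of data mining' (after the stop word
# 'of' is filtered out) yield the same sorted 3-gram.
print(sorted_ngrams(["data", "mining", "methods"], 3))  # ['data methods mining']
print(sorted_ngrams(["methods", "data", "mining"], 3))  # ['data methods mining']
```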

2.3. Self-Organizing Maps

There are many different clustering methods which can be used in text analysis [18,19,20]: artificial neural networks (ANNs), k-means, agglomerative hierarchical clustering, etc. The SOM is one of the most popular artificial neural network models, proposed by Professor T. Kohonen [21], and new extensions and modifications of it are developed constantly. SOMs can be used to cluster, classify, and visualize data; the main advantage of the method is that it shows the results in a visual form [22]. The SOM can be applied to many different tasks, including text mining [23,24]. Its main aim is to preserve the topology of multidimensional data when the data are transformed into a lower-dimensional (usually two-dimensional) space. The SOM is a set of nodes connected to one another via a rectangular or hexagonal topology. The rectangular topology of the SOM is presented in Figure 3.
Each node has an associated weight vector $M_{ij}$, $i = 1, \dots, k_x$, $j = 1, \dots, k_y$, usually called a neuron or codebook vector, where $k_x$ is the number of rows and $k_y$ is the number of columns of the SOM. The texts of the analyzed dataset are passed to the SOM as the matrix (1). The learning process of the SOM algorithm starts from the initialization of the components of the vectors (neurons) $M_{ij}$: they can be initialized at random (usually with values from the interval (0, 1)) or by the principal components. At each learning step, an input vector $X_p$ is passed to the SOM and compared to all neurons $M_{ij}$; usually, the Euclidean distance between the input vector $X_p$ and each neuron $M_{ij}$ is calculated. The neuron $M_w$ with the minimal Euclidean distance to $X_p$ is designated the winner (the best matching unit). All the neurons' components are then adapted according to the learning rule

$$M_{ij}(t+1) = M_{ij}(t) + h_{ij}^{w}\big(X_p - M_{ij}(t)\big)$$

Here $t$ is the number of the learning step, $h_{ij}^{w}$ is the neighboring function, and $w$ is the pair of indices of the neuron winner for the vector $X_p$. The learning is repeated until the maximum number of learning steps $T$ is reached.
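A compact sketch of this training loop is given below. It assumes random initialization in (0, 1), a Gaussian neighboring function, and a linearly decaying learning rate and neighborhood width; these are common choices, not ones prescribed by the paper.

```python
import numpy as np

def train_som(X, kx, ky, T, lr0=0.5, sigma0=2.0, seed=0):
    """Minimal SOM sketch: rectangular kx-by-ky grid; the neighboring
    function h also carries the decaying learning rate."""
    rng = np.random.default_rng(seed)
    M = rng.random((kx, ky, X.shape[1]))                # neurons M_ij in (0, 1)
    grid = np.dstack(np.meshgrid(np.arange(kx), np.arange(ky), indexing="ij"))
    for t in range(T):
        lr = lr0 * (1.0 - t / T)                        # decaying learning rate
        sigma = sigma0 * (1.0 - t / T) + 0.5            # shrinking neighborhood
        Xp = X[rng.integers(len(X))]                    # input vector X_p
        d = np.linalg.norm(M - Xp, axis=2)              # distances to all neurons
        w = np.unravel_index(np.argmin(d), d.shape)     # neuron winner (best match)
        g = np.linalg.norm(grid - np.array(w), axis=2)  # grid distance to winner
        h = lr * np.exp(-g**2 / (2 * sigma**2))         # neighboring function h_ijw
        M += h[..., None] * (Xp - M)                    # the learning rule above
    return M
```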

2.4. Measures for Text Similarity Detection

To evaluate the similarity between texts, it is necessary to use mathematical expressions which reduce the comparison of two texts to a single numerical value [25,26]. The most widely known and used text similarity measures are cosine, dice, the extended Jaccard's, and the overlap coefficient:

$$\cos(D_1, D_2) = \frac{D_1 \cdot D_2}{|D_1|\,|D_2|} \qquad (2)$$

$$\operatorname{dice}(D_1, D_2) = \frac{2\,D_1 \cdot D_2}{|D_1|^2 + |D_2|^2} \qquad (3)$$

$$\operatorname{jaccard}(D_1, D_2) = \frac{D_1 \cdot D_2}{|D_1|^2 + |D_2|^2 - D_1 \cdot D_2} \qquad (4)$$

$$\operatorname{overlap}(D_1, D_2) = \frac{D_1 \cdot D_2}{\min\big(|D_1|^2, |D_2|^2\big)} \qquad (5)$$

Here $|D_p| = \sqrt{x_{p1}^2 + x_{p2}^2 + x_{p3}^2 + \cdots + x_{pm}^2}$ is the norm of the frequency vector of the text $D_p$, and $D_1 \cdot D_2 = x_{11}x_{21} + x_{12}x_{22} + \cdots + x_{1m}x_{2m}$ is the dot product. To show how these four measures are calculated, a simple example is presented. Let us say we have four texts $D = \{D_1, D_2, D_3, D_4\}$, each with a few words inside (Table 2).
Let us say we do not use any filters, so the bag of words contains all the terms from the texts: text, message, computer, science, data, mining, and, methods, of. According to the frequency of each term, the frequency matrix is obtained (Table 3).
After the frequency matrix is obtained, we can calculate the similarity measures. The results of the measures calculated for the example texts of Table 2 are presented in Table 4.
As we can see, the results of the cosine and dice measures are almost the same, while the values of extended Jaccard's are lower than the others. The overlap measure shows the highest values, and since there is no difference between overlap(D1, D3) and overlap(D1, D4), these pairs of texts are equally similar from the point of view of this measure. All the measures can be used equally to find similarity between texts, so it is hard to say which one is the most accurate, and a deeper investigation has to be made.
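The sketch below computes measures (2)-(5) directly from frequency vectors and reproduces the D1-D3 entries of Table 4; the function names are illustrative.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def dice(a, b):
    return 2 * dot(a, b) / (dot(a, a) + dot(b, b))        # dot(a, a) = |a|^2

def ext_jaccard(a, b):
    return dot(a, b) / (dot(a, a) + dot(b, b) - dot(a, b))

def overlap(a, b):
    return dot(a, b) / min(dot(a, a), dot(b, b))

# Frequency vectors of D1 and D3 over the terms text, message, computer,
# science, data, mining, and, methods, of (rows of Table 3).
D1 = [1, 1, 0, 0, 0, 0, 0, 0, 0]
D3 = [1, 0, 0, 0, 1, 2, 1, 0, 0]
for f in (cosine, dice, ext_jaccard, overlap):
    print(f.__name__, f"{100 * f(D1, D3):.1f}%")
# cosine 26.7%, dice 22.2%, ext_jaccard 12.5%, overlap 50.0%
# (cf. the values 26, 22, 13, and 50 reported in Table 4)
```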

3. Experimental Investigation

3.1. Dataset

A corpus of plagiarized short answers [27] has been used for the experimental investigation; this dataset is well suited to finding similarity between texts. The corpus consists of one hundred texts: 95 answers provided by 19 participants and 5 original Wikipedia source articles. The questions given to the students are listed below:
  • Q1—‘What is inheritance in object oriented programming?’
  • Q2—‘Explain the PageRank algorithm that is used by the Google search engine’
  • Q3—‘Explain the vector space model for Information Retrieval’
  • Q4—‘Explain Bayes Theorem from probability theory’
  • Q5—‘What is dynamic programming?’
In total, the corpus contains 19 examples of each of the near copy, light revision, and heavy revision categories, and 38 non-plagiarized examples written independently of the Wikipedia source (Table 5). The average text in the corpus is 208 words long and contains 113 unique tokens. The description of each revision level is given below:
  • Near copy (cut)—participants were asked to answer the question by simply copying the text from the relevant Wikipedia article.
  • Light revision (light)—participants were asked to base their answer on the text found in the Wikipedia article and were, once again, given no instructions about which parts of the article to copy.
  • Heavy revision (heavy)—participants were once again asked to base their answer on the relevant Wikipedia article but were instructed to rephrase the text to generate the answer with the same meaning as the source text, but expressed using different words and structure.
  • Non-plagiarism (non)—participants were provided with learning materials in the form of either lecture notes or sections from textbooks that could be used to answer the relevant question.

3.2. Steps of the Experiment

To find the similarity within the analyzed dataset, the experimental investigation was made in three steps. In the first step, the way the bag of n-grams is created was analyzed. Preliminary research showed that for this dataset the n-grams can contain at most five words; otherwise, because of the short texts, some data are lost. To create the bag of n-grams, all the filters given in Table 1 were applied. The focus was on n-grams of three to five words, so in total fifteen variants were analyzed. The sizes of the obtained bags of n-grams are given in Figure 4; the whole preprocessing chain is sketched below.
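Reusing the illustrative helpers sketched in Section 2 (preprocess, sorted_ngrams, and frequency_matrix), this first step can be summarized roughly as follows; the two stand-in answers are hypothetical, as the real corpus has twenty texts per question.

```python
def bag_of_ngrams_matrix(raw_texts, n_values=(3, 4, 5)):
    """Step 1: filter each text, extract sorted word-level n-grams of the
    chosen sizes, and build the frequency matrix (1) over the corpus."""
    ngram_lists = []
    for text in raw_texts:
        tokens = preprocess(text)
        ngram_lists.append([g for n in n_values
                            for g in sorted_ngrams(tokens, n)])
    return frequency_matrix(ngram_lists)

corpus = ["Inheritance allows a class to reuse the code of another class.",
          "In object oriented programming a class can inherit code."]
vocab, X = bag_of_ngrams_matrix(corpus)
print(len(vocab))  # size of the bag of n-grams (cf. Figure 4)
```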
In the second step, the four similarity measures (Table 6) were calculated between all twenty texts of each question to detect which texts are similar, to compare the results with the given categorical descriptions (Table 5), and to decide which measure gives better results. In the last step, the same dataset was presented with the SOM. On the SOM, we can see the similarity of all twenty texts at once and, according to the obtained results, decide how similar the texts are to each other.

3.3. Experimental Results

Deeper analysis showed that for this dataset there was no big difference between using three-, four-, or five-word n-grams, so the final experimental results are presented for the bag of n-grams created using three words. Table 6 gives all the calculated measures, which represent the similarity between the text D20 (the original text) and the other texts in the dataset. The variable Qn, where n = 1, ..., 5, is the question number from the original dataset [28]. All values in Table 6 are percentages, so the lowest values mean the worst result (the texts are not similar) and the highest values the best (the texts are similar). The highest values have been marked in bold.
In the Q1 analysis, the highest values for all the measures were obtained when D20 was compared to the texts D5 and D17. According to Table 5, the texts most similar to the original are D4, D11, D15, and D17. All the measures reached their highest values when the original text was compared to D17, which confirms that this text is a near copy. The text D5 is marked as a light revision, but all the measures showed that it is mostly copied text. The other near copy texts (D4, D11, D15) were detected (with the highest values) only when the overlap measure (5) was used (D4 = 80%, D11 = 66%, D15 = 113%). The value for D15 is higher than 100% because the original text and D15 fully overlap and some n-grams of the original text are even repeated several times in D15; in this case, the two texts are totally similar. Looking at the text similarity results for the Q2 answers, the highest values were obtained for the texts D1 and D19 using the overlap measure. According to Table 5, the near copies are D1, D8, and D18, so the overlap measure alone could confirm just one near copy (D1 = 100%) and one light revision (D19 = 81%). None of the measures detected any similarity for the text D18; all of them gave 0%. Deeper analysis showed that there is probably a mistake in the dataset description: comparing the D18 text with the original confirmed that the two texts are totally different, so D18 cannot be marked as a near copy.
The text similarity results for the answers to questions Q1 and Q2 are presented using the SOM (Figure 5) [24]. The color scale of the cells, from white to black, shows the values of the U-matrix [22]: a lighter color means that the distance between the data items is short, and a darker color the opposite. The pie charts represent the texts of the dataset; if dataset items are very similar to each other, they fall into the same cell (one pie chart divided into pieces). As we can see in Figure 5a, the texts D5, D17, and D20 fall into the same cell, which means that D5 and D17 are similar to the original text D20, as was shown earlier by the calculated similarity measures. The other texts also form groups or fall into shared cells; for example, according to Table 5, the texts D2, D6, D10, and D16 are non-copies, and on the SOM they fall into the same cell. Using the SOM, we can easily identify which texts are similar to each other. On the right side of Figure 5b, the near copy and light revision texts D1, D4, D5, D8, D12, and D19 are located near the original text D20, which also confirms that these texts are the most similar to the original.
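For reference, a minimal sketch of one common way to compute such a U-matrix from the trained map M of the earlier SOM sketch is given below; the paper itself relies on the visualization described in [22].

```python
import numpy as np

def u_matrix(M):
    """Mean Euclidean distance from each neuron to its grid neighbors;
    small values (light cells) mark groups of similar texts."""
    kx, ky, _ = M.shape
    U = np.zeros((kx, ky))
    for i in range(kx):
        for j in range(ky):
            nbrs = [M[a, b]
                    for a, b in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1))
                    if 0 <= a < kx and 0 <= b < ky]
            U[i, j] = np.mean([np.linalg.norm(M[i, j] - nb) for nb in nbrs])
    return U
```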
The results for the Q3 answers showed that the highest values of all the measures were obtained when the original text was compared to the texts D2 (near copy) and D9 (light revision). As for the previous questions, the highest values were given by the overlap measure (D2 = 92%, D9 = 79%). As we can see in the top right corner of the SOM (Figure 6a), all three of these texts are located in the same cell. The other texts also form clusters which correspond to the categories described in Table 5. The results for the Q4 answers on the SOM (Figure 6b) show that the texts D13 (near copy) and D17 (light revision) are similar to the original text D20, but the other near copy texts, D3, D6, and D9, are located and grouped far away from the original text's cell, so in this case the SOM recognized the similarity only partly. The similarity measures also confirmed the similarity of only three texts (Table 6), with the overlap measure giving the highest values: D3 = 105%, D13 = 99%, D19 = 96%. In this case, the text D3 fully overlapped the original text, so it is obviously plagiarism.
The text similarity results for the last question, Q5, were almost all confirmed by the calculated measures. According to Table 5, the near copy texts are D5, D7, D10, D14, and D16, and all four measures showed that four of these five texts are similar to the original. As with the previous results, the highest values were obtained using the overlap measure: D5 = 100%, D10 = 97%, D14 = 100%, and D16 = 97%. The similarity of the text D5 was confirmed by the overlap measure alone; the values of the other measures were small. In the bottom left corner of the SOM (Figure 7), the original text D20 is located in the same cell as the text D16, but the other near copy texts are scattered over the whole map, so in this case it is hard to confirm similarity using the SOM alone.

4. Conclusions

In this paper, an approach was proposed to detect similarity between texts, based on splitting the texts into n-grams and evaluating them using a SOM and similarity measures. The detection of similar texts was made in three steps: (1) conversion of the text dataset to a numerical expression using n-grams; (2) calculation of the similarity measures; (3) visualization of the text dataset using the SOM and representation of the similarity on it. In the first step, the main focus was on creating the bag of n-grams for the whole dataset. Various numbers of words in the n-grams were analyzed, and different filters were applied: number and punctuation removal, word frequency, uppercase transformation, a stemming algorithm, etc. The analysis showed that the filters and the size of the n-grams influence the final results; for this dataset, an n-gram size of three was selected for the experimental investigation. In the second step, the four similarity measures were calculated: cosine, dice, extended Jaccard's, and overlap. The final results showed that the highest similarity percentages were obtained using the overlap measure, while the other three measures gave values that were always similar to each other and smaller. The usage of the SOM showed that it helps to see the summarized similarity results of all the texts in a visual form quickly: it is very easy to understand which texts are similar to each other and which are not. In the case of the analyzed dataset, the SOM helped to detect similarity, and the formed clusters correlated with the given categorical description of the dataset.
The experimental investigation showed that the most accurate similarity measure is overlap, because this measure detected the most near copy texts and gave the highest values; sometimes it even showed a full overlap of texts, which can be defined as plagiarism. The SOM helps to summarize the similarity of the full dataset in a visual form, but it is hard to confirm from it exactly how similar the texts are to each other. The investigation showed that the SOM is more useful as an additional tool for deciding which texts could be similar, after which a deeper investigation can be applied. The usage of n-grams and the creation of a bag of n-grams proved to be an effective way to find similarity between texts. A deeper analysis still has to be made of how the filters, the size of the n-grams, and the other aspects of the conversion of texts to a numerical expression affect the final results for much longer text datasets, so it is purposeful to analyze them in more detail in the future. The proposed approach allowed finding similarity between texts, and evaluating the results by combining the SOM with the numerical estimations enabled a deep analysis.

Author Contributions

Designing and performing measurements, P.S., O.K. and R.S.; Data analysis, P.S., O.K. and R.S.; Scientific discussions, P.S., O.K. and R.S.; Writing the article, P.S., O.K. and R.S.

Funding

This research received no external funding.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Miner, G.; Elder, J.; Fast, A.; Hill, T.; Nisbet, R.; Delen, D. Practical Text Mining and Statistical Analysis for Non-Structured Text Data Applications; Elsevier Inc.: Orlando, FL, USA, 2012.
  2. Kanaris, I.; Kanaris, K.; Houvardas, I.; Stamatatos, E. Words versus character n-grams for anti-spam filtering. Int. J. Artif. Intell. Tools 2007, 16, 1047–1067.
  3. Mohammadi, H.; Khasteh, S.H. A Fast Text Similarity Measure for Large Document Collections using Multi-reference Cosine and Genetic Algorithm. arXiv 2018, arXiv:1810.03102.
  4. Camacho, J.E.P.; Ledeneva, Y.; García-Hernandez, R.A. Comparison of Automatic Keyphrase Extraction Systems in Scientific Papers. Res. Comput. Sci. 2016, 115, 181–191.
  5. Kundu, R.; Karthik, K. Contextual plagiarism detection using latent semantic analysis. Int. Res. J. Adv. Eng. Sci. 2017, 2, 214–217.
  6. Lopez-Gazpio, I.; Maritxalar, M.; Lapata, M.; Agirre, E. Word n-gram attention models for sentence similarity and inference. Expert Syst. Appl. 2019, 132, 1–11.
  7. Bao, J.; Lyon, C.; Lane, P.C.R.; Ji, W.; Malcolm, J. Comparing Different Text Similarity Methods; Technical Report 461; University of Hertfordshire: Hatfield, UK, 2007.
  8. Arora, S.; Khodak, M.; Saunshi, N.; Vodrahalli, K. A Compressed Sensing View of Unsupervised Text Embeddings, Bag-of-n-Grams, and LSTMs. In Proceedings of ICLR 2018, Vancouver, BC, Canada, 30 April–3 May 2018.
  9. Gali, N.; Mariescu-Istodor, R.; Hostettler, D.; Fränti, P. Framework for syntactic string similarity measures. Expert Syst. Appl. 2019, 129, 169–185.
  10. Nguyen, H.V.; Bai, L. Cosine Similarity Metric Learning for Face Verification. In Proceedings of Computer Vision (ACCV 2010), Queenstown, New Zealand, 8–12 November 2010; Kimmel, R., Klette, R., Sugimoto, A., Eds.; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2011; Volume 6493.
  11. Niwattanakul, S.; Singthongchai, J.; Naenudorn, E.; Wanapu, S. Using of Jaccard coefficient for keywords similarity. In Proceedings of the International MultiConference of Engineers and Computer Scientists, Hong Kong, China, 13–15 March 2013.
  12. Zeimpekis, D.; Gallopoulos, E. TMG: A Matlab Toolbox for Generating Term-Document Matrices from Text Collections; Technical Report HPCLAB-SCG 1/01-05; University of Patras: Patras, Greece, 2005.
  13. Berthold, M.R.; Cebron, N.; Dill, F.; Gabriel, T.R.; Kötter, T.; Meinl, T.; Ohl, P.; Sieb, C.; Thiel, K.; Wiswedel, B. Data Analysis, Machine Learning and Applications. In Studies in Classification, Data Analysis, and Knowledge Organization; Springer: Berlin, Germany, 2017.
  14. Ritthoff, O.; Klinkenberg, R.; Fisher, S.; Mierswa, I.; Felske, S. YALE: Yet Another Learning Environment; Technical Report 763; University of Dortmund: Dortmund, Germany, 2001; pp. 84–92.
  15. Stefanovič, P.; Kurasova, O. Creation of Text Document Matrices and Visualization by Self-Organizing Map. Inf. Technol. Control 2014, 43, 37–46.
  16. Porter, M.F. An algorithm for suffix stripping. Program 1980, 14, 130–137.
  17. Li, B.; Liu, T.; Zhao, Z.; Wang, P.; Du, X. Neural Bag-of-Ngrams. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; pp. 3067–3074.
  18. Balabantaray, R.C.; Sarma, C.; Jha, M. Document Clustering using K-Means and K-Medoids. arXiv 2015, arXiv:1502.07938.
  19. Shah, N.; Mahajan, S. Document Clustering: A Detailed Review. Int. J. Appl. Inf. Syst. 2012, 4, 30–38.
  20. Aggarwal, C.C.; Zhai, C. A Survey of Text Clustering Algorithms. In Mining Text Data; Springer: Boston, MA, USA, 2012; pp. 77–128.
  21. Kohonen, T. Self-Organizing Maps, 3rd ed.; Springer Series in Information Sciences; Springer: Berlin, Germany, 2001.
  22. Stefanovič, P.; Kurasova, O. Visual analysis of self-organizing maps. Nonlinear Anal. Model. Control 2011, 16, 488–504.
  23. Liu, Y.C.; Liu, M.; Wang, X.L. Application of Self-Organizing Maps in Text Clustering: A Review. In Applications of Self-Organizing Maps; Johnsson, M., Ed.; InTech: London, UK, 2012.
  24. Sharma, A.; Dey, S. Using Self-Organizing Maps for Sentiment Analysis. arXiv 2013, arXiv:1309.3946.
  25. Metzler, D.; Dumais, S.T.; Meek, C. Similarity Measures for Short Segments of Text. In Proceedings of the European Conference on Information Retrieval 2007, Rome, Italy, 2–5 April 2007; pp. 16–27.
  26. Rokach, L.; Maimon, O. Clustering Methods. In The Data Mining and Knowledge Discovery Handbook; Springer: Boston, MA, USA, 2005; pp. 321–352.
  27. Clough, P.; Stevenson, M. Developing a Corpus of Plagiarized Short Answers. In Language Resources and Evaluation; Special Issue on Plagiarism and Authorship Analysis; Springer: Berlin, Germany, 2011.
  28. Demšar, J.; Curk, T.; Erjavec, A. Orange: Data Mining Toolbox in Python. J. Mach. Learn. Res. 2013, 14, 2349–2353.
Figure 1. The scheme of the proposed approach; SOM: self-organizing maps.
Figure 2. The process of creation of frequency matrix.
Figure 3. Two-dimensional self-organizing map (SOM) (rectangular topology).
Figure 4. The size of the obtained bag of n-grams.
Figure 5. 9 × 9 self-organizing maps: (a) first question (Q1); (b) second question (Q2).
Figure 6. 9 × 9 self-organizing maps: (a) third question (Q3); (b) fourth question (Q4).
Figure 7. 9 × 9 self-organizing map: fifth question (Q5).
Table 1. Descriptions of filters.

Diacritics filter: Removes all diacritical marks, i.e., signs attached to a character, usually to indicate a distinct sound or special pronunciation. Examples of terms containing diacritical marks are naïve, jäger, réclame, etc. When texts in specific languages are analyzed, the diacritical marks cannot be rejected, because removing them can change the meaning of a word, for example, törn/torn (Swedish) or sääri/saari (Finnish).

Number filter: Filters out all terms that consist of digits, including the decimal separators ',' or '.' and the possible signs '+' or '-'.

N chars filter: Filters out all terms with fewer than the specified number N of characters.

Case converter: Converts all words to lower or upper case.

Punctuation filter: Removes all punctuation characters from terms.

Stop words filter: Removes all words contained in the specified stop word list. The stop word list is often composed of words such as 'there', 'where', 'that', 'when', etc., which are not important for text analysis. The list can also depend on the domain of the texts: for example, when analyzing scientific papers, words such as 'describe', 'present', 'new', 'propose', 'method', etc. do not characterize the papers either, and it is not purposeful to include them in the text dictionary. A stop word list can be adapted for any language.

Stemming algorithm: Separates the stem from the word [16]. For example, the four words 'accepted', 'acceptation', 'acceptance', and 'acceptably' share the stem 'accept', so only this stem is analyzed and the original word forms are ignored.
Table 2. Text dataset.

Text    Text inside
D1      text message
D2      computer science
D3      data mining and text mining
D4      methods of text data mining
Table 3. Frequency matrix.

      text  message  computer  science  data  mining  and  methods  of
D1     1       1        0         0       0      0      0      0      0
D2     0       0        1         1       0      0      0      0      0
D3     1       0        0         0       1      2      1      0      0
D4     1       0        0         0       1      1      0      1      1
Table 4. Results of similarity measures (in percent).

Cosine measure
      D1    D2    D3    D4
D1   100     0    26    31
D2     0   100     0     0
D3    26     0   100    67
D4    31     0    67   100

Extended Jaccard's measure
      D1    D2    D3    D4
D1   100     0    13    17
D2     0   100     0     0
D3    13     0   100    50
D4    17     0    50   100

Dice measure
      D1    D2    D3    D4
D1   100     0    22    29
D2     0   100     0     0
D3    22     0   100    67
D4    29     0    67   100

Overlap measure
      D1    D2    D3    D4
D1   100     0    50    50
D2     0   100     0     0
D3    50     0   100    80
D4    50     0    80   100
Table 5. The level of plagiarism of the texts.

Text ID   Q1        Q2        Q3        Q4        Q5
D1        non       cut       light     heavy     non
D2        non       non       cut       light     heavy
D3        heavy     non       non       cut       light
D4        cut       light     heavy     non       non
D5        light     heavy     non       non       cut
D6        non       heavy     light     cut       non
D7        non       non       heavy     light     cut
D8        light     cut       non       non       heavy
D9        non       heavy     light     cut       non
D10       non       non       heavy     light     cut
D11       cut       non       non       heavy     light
D12       heavy     light     cut       non       non
D13       non       heavy     light     cut       non
D14       non       non       heavy     light     cut
D15       cut       non       non       heavy     light
D16       non       non       heavy     light     cut
D17       cut       non       non       heavy     light
D18       light     cut       non       non       heavy
D19       heavy     light     cut       non       non
D20       Original  Original  Original  Original  Original
Table 6. Text similarity results, sorted as near copy (cut), light revision (light), heavy revision (heavy), and non-plagiarism (non); results are given in percent.
Q1 (similarity to D20). Cut: D4, D11, D15, D17; light: D5, D8, D18; heavy: D3, D12, D19; non: D1, D2, D6, D7, D9, D10, D13, D14, D16.
Cosine: 615155969419217530100500400
Dice: 595044969419206530100400300
Extended Jaccard's: 423328928810113360000200200
Overlap: 80661139997263010540200800500
Q2 (similarity to D20). Cut: D1, D8, D18; light: D4, D12, D19; heavy: D5, D6, D9, D13; non: D2, D3, D7, D10, D11, D14, D15, D16, D17.
Cosine: 6428016295425987010701000
Dice: 5827011235018876010601000
Extended Jaccard's: 411506133310443000300000
Overlap: 100390425881561412120101101000
Q3 (similarity to D20). Cut: D2, D12, D19; light: D1, D6, D9, D13; heavy: D4, D7, D10, D14, D16; non: D3, D5, D8, D11, D15, D17, D18.
Cosine: 790245118774636321417453366425
Dice: 780225117774536321417442366325
Extended Jaccard's: 6401234106329221989281133213
Overlap: 920365420795739331517513486625
Q4 (similarity to D20). Cut: D3, D6, D9, D13; light: D2, D7, D10, D14, D16; heavy: D1, D11, D15, D17; non: D4, D5, D8, D12, D18, D19.
Cosine: 6036369824135918431427693000010
Dice: 5236369821135817421327493000010
Extended Jaccard's: 3522229712741927715287000000
Overlap: 105393999391766255622281696000010
Q5 (similarity to D20). Cut: D5, D7, D10, D14, D16; light: D3, D11, D15, D17; heavy: D2, D8, D18, D13; non: D1, D4, D6, D9, D12, D19.
Cosine: 42367750823592162383311000011
Dice: 30357540812891258363281000011
Extended Jaccard's: 1721592569165740221160000000
Overlap: 10046971009767136593546501000011
