# Cross-Language Plagiarism Detection System Using Latent Semantic Analysis and Learning Vector Quantization


## Abstract


## 1. Introduction

## 2. Literature Review

#### 2.1. Latent Semantic Analysis (LSA)

#### 2.2. Learning Vector Quantization (LVQ)

_{j} matches the class target; when it does not match, (3) is applied for weight modification.

## 3. Methodology of the Automatic Cross-Language Plagiarism Detection System

_{T}; for the reference document, the result of SVD is a singular matrix S_{Ref}. After both singular matrices are generated, feature extraction is conducted; as the last stage, LVQ is used as the classifier. A more detailed process is explained below.

#### 3.1. Pre-Processing

#### 3.2. Word Level Translation for the Test Paragraph

#### 3.3. Extracting Keywords from Reference Paragraph

#### 3.4. LSA Process

_{T} vector for the test document and the S_{Ref} vector for the reference document. The S_{T} and S_{Ref} vectors are then passed to the feature extraction process, which turns them into feature vectors that a classifier can handle.

#### 3.5. Feature Extraction

#### 3.6. Classification Using LVQ

**Algorithm 1.** Learning Vector Quantization Algorithm

Training Phase:

1. Initialize weights w_{ij}
2. Define: epoch_max, α_min
3. While ((epoch < epoch_max) AND (α > α_min))
4.     For each vector (x_{i}) dimension k = 1:K
5.         For each j dimensional output neuron,
6.             Calculate Euclidean distance between input x_{i} and w_{ij} based on (1)
7.         end
8.         Determine index j that gives the minimum distance D_{j}
9.         If (x_{i} = y_{i})
10.             Update weight according to (2)
11.         else
12.             Update weight according to (3)
13.         end
14.         Modify the learning rate based on (5)
15.     end
16. end

Testing Phase:

17. For each vector data (x_{i}) dimension k = 1:K
18.     For each j dimensional output neuron,
19.         Calculate Euclidean distance between input x_{i} and w_{ij} based on (1)
20.     end
21.     Determine index j that gives the minimum distance D_{j}
22.     Output: class of data x_{i} is j
23. end
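The two phases of Algorithm 1 can be sketched in Python as below. This is a hedged illustration only: equations (1), (2), (3) and (5) are not reproduced in this section, so the sketch assumes the standard LVQ1 rules (move the winning prototype toward the input when the classes match, away from it otherwise) and a multiplicative learning-rate decay applied once per epoch. The names `train_lvq`, `classify_lvq`, `nearest_prototype` and the `decay` parameter are hypothetical.

```python
import numpy as np

def nearest_prototype(weights, x):
    """Index of the prototype with the smallest Euclidean distance to x."""
    distances = np.linalg.norm(weights - x, axis=1)
    return int(np.argmin(distances))

def train_lvq(data, labels, weights, weight_labels,
              alpha=0.1, decay=0.5, alpha_min=1e-4, epoch_max=100):
    """Train prototypes with the standard LVQ1 rule (assumed form)."""
    data = np.asarray(data, dtype=float)
    weights = np.array(weights, dtype=float)  # copy, so the input is untouched
    epoch = 0
    while epoch < epoch_max and alpha > alpha_min:
        for x, y in zip(data, labels):
            j = nearest_prototype(weights, x)
            if weight_labels[j] == y:
                # Winner has the right class: pull it toward the input.
                weights[j] += alpha * (x - weights[j])
            else:
                # Winner has the wrong class: push it away from the input.
                weights[j] -= alpha * (x - weights[j])
        alpha *= decay  # assumed decay schedule, applied per epoch
        epoch += 1
    return weights

def classify_lvq(weights, weight_labels, x):
    """Testing phase: return the class of the nearest prototype."""
    x = np.asarray(x, dtype=float)
    return weight_labels[nearest_prototype(weights, x)]

# Hypothetical two-class toy data.
data = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
labels = [0, 0, 1, 1]
initial_weights = [[1.0, 1.0], [4.0, 4.0]]
weight_labels = [0, 1]

trained = train_lvq(data, labels, initial_weights, weight_labels)
print(classify_lvq(trained, weight_labels, [0.0, 0.5]))
print(classify_lvq(trained, weight_labels, [5.0, 5.5]))
```

One prototype per class is used here for brevity; the number of output neurons in the actual system is not stated in this section.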

## 4. Experiments and Results

#### 4.1. Comparison of Various Learning Rate Parameters for LVQ

^{−1}, 10^{−2} and 10^{−3}. Thus, there are three variations in the experiment: (1) α = 0.1 and n = 0.5, (2) α = 0.01 and n = 0.05 and (3) α = 0.001 and n = 0.005. The result of this experiment is presented in Table 2 and the corresponding graph in Figure 6.

#### 4.2. Comparison of the Feature Types of the LSA

#### 4.3. Comparison of the Term–Document Matrix Definition

#### 4.4. Comparison of Frequency and Binary Occurrence in the Term–Document Matrix Definition

## 5. Discussion

## 6. Conclusions

## Acknowledgments

## Author Contributions

## Conflicts of Interest


**Figure 2.** Features for classification: (**a**) the Frobenius norm; (**b**) angle α_{1} between vectors after slicing the longer vector; and (**c**) angle α_{2} between vectors after padding ‘0’ over the shorter vector.
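The three features named in the Figure 2 caption can be illustrated with a short sketch. It rests on several assumptions: the inputs are one-dimensional vectors of possibly different lengths, the Frobenius norm is shown on a single vector because this section does not state which vector or matrix it is taken over, and an all-zero vector is treated as orthogonal (90 degrees), which mirrors the rows of Table 1 that pair a norm of 0 with an angle of 90. All function names are hypothetical.

```python
import numpy as np

def frobenius_norm(vector):
    """Frobenius norm; for a 1-D vector this equals the Euclidean norm."""
    return float(np.linalg.norm(vector))

def angle_degrees(u, v):
    """Angle between two equal-length vectors, in degrees."""
    norm_product = np.linalg.norm(u) * np.linalg.norm(v)
    if norm_product == 0.0:
        return 90.0  # assumption: a zero vector counts as orthogonal
    cosine = np.clip(np.dot(u, v) / norm_product, -1.0, 1.0)
    return float(np.degrees(np.arccos(cosine)))

def angle_slice(u, v):
    """Angle after truncating the longer vector to the shorter length."""
    n = min(len(u), len(v))
    return angle_degrees(u[:n], v[:n])

def angle_pad(u, v):
    """Angle after zero-padding the shorter vector to the longer length."""
    n = max(len(u), len(v))
    u_padded = np.pad(u, (0, n - len(u)))
    v_padded = np.pad(v, (0, n - len(v)))
    return angle_degrees(u_padded, v_padded)

# Hypothetical vectors of different lengths.
u = np.array([1.0, 1.0])
v = np.array([1.0, 0.0, 1.0])
print(frobenius_norm(v), angle_slice(u, v), angle_pad(u, v))
```

For these toy vectors the sliced angle is about 45 degrees and the padded angle about 60 degrees, which shows that the two variants can differ whenever the longer vector has non-zero trailing entries.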

**Figure 6.** Effect of varying the alpha and n values on the information retrieval measures obtained when evaluating the plagiarism detection system with the learning vector quantization algorithm.

**Figure 9.** LSA processing results (number of true positive occurrences) of the binary method and the frequency method using the normal distribution.

**Table 1.** Example of the feature values extracted after the LSA process for test documents #3001 and #3002.

| test_doc | ref_doc | Fnorm | α_Slice | α_Pad | test_doc | ref_doc | Fnorm | α_Slice | α_Pad |
|---|---|---|---|---|---|---|---|---|---|
| 3001 * | 3001 | 60.6977 | 5.95616 | 5.95616 | 3002 | 3001 | 53.8028 | 21.0574 | 21.0574 |
| 3001 | 3002 | 35.3553 | 47.6318 | 47.6318 | 3002 * | 3002 | 55.9017 | 24.8395 | 24.8395 |
| 3001 | 3003 | 0 | 90 | 90 | 3002 | 3003 | 0 | 90 | 90 |
| 3001 | 3004 | 0 | 90 | 90 | 3002 | 3004 | 32.1634 | 35.0656 | 35.0656 |
| 3001 | 3005 | 28.4747 | 7.26953 | 7.26953 | 3002 | 3005 | 0 | 90 | 90 |
| 3001 | 3006 | 0 | 90 | 90 | 3002 | 3006 | 0 | 90 | 90 |
| 3001 | 3007 | 0 | 90 | 90 | 3002 | 3007 | 0 | 90 | 90 |
| 3001 | 3008 | 0 | 90 | 90 | 3002 | 3008 | 0 | 90 | 90 |
| 3001 | 3009 | 14.2857 | 58.706 | 58.706 | 3002 | 3009 | 0 | 90 | 90 |

**Table 2.** Results of varying the alpha and n values in the evaluation of the plagiarism detection system using the learning vector quantization algorithm.

| Multiplication Factor | Precision | Recall | F-Measure |
|---|---|---|---|
| 10^{−1} | 0.49833 | 0.76804 | 0.60446 |
| 10^{−2} | 0.45482 | 0.77835 | 0.57414 |
| 10^{−3} | 0.26725 | 0.93814 | 0.416 |

**Table 3.** Results of various features using the Frobenius norm (f), Cos α with slice (s), and Cos α with pad (p).

| Feature * | Precision | Recall | F-Measure |
|---|---|---|---|
| Frobenius norm (f) | 0.381253 | 0.654639 | 0.465479 |
| Cos α with slice (s) | 0.280162 | 0.331615 | 0.123266 |
| Cos α with pad (p) | 0.158213 | 0.32646 | 0.142965 |
| fsp | 0.4068 | 0.828179 | 0.531536 |
| fs | 0.406962 | 0.797251 | 0.526073 |
| fp | 0.408638 | 0.792096 | 0.528116 |
| sp | 0.172015 | 0.333333 | 0.141452 |

| Size | 5 | 10 | 15 | 20 | 25 |
|---|---|---|---|---|---|
| True Neg | 3691 | 3775 | 3753 | 3774 | 3801 |
| True Pos | 10 | 88 | 3 | 1 | 4 |
| Total True (TN + TP) | 3701 | 3863 | 3756 | 3775 | 3805 |
| False Neg | 184 | 106 | 191 | 193 | 190 |
| False Pos | 165 | 81 | 103 | 82 | 55 |
| Total False (FN + FP) | 349 | 187 | 294 | 275 | 245 |
| Precision | 0.057143 | 0.52071 | 0.028302 | 0.012048 | 0.067797 |
| Recall | 0.051546 | 0.453608 | 0.015464 | 0.005155 | 0.020619 |
| F-Measure | 0.054201 | 0.484848 | 0.02 | 0.00722 | 0.031621 |
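The precision, recall and F-measure rows follow from the confusion counts in the same column. A minimal sketch using the standard definitions is given below; the function name `precision_recall_f` is hypothetical, and the example counts are taken from the size-10 column of the table above.

```python
def precision_recall_f(true_pos, false_pos, false_neg):
    """Return (precision, recall, F-measure) from confusion counts.

    Uses the standard definitions: precision = TP / (TP + FP),
    recall = TP / (TP + FN), F = 2PR / (P + R). A zero denominator
    yields 0.0 rather than raising an error.
    """
    predicted_pos = true_pos + false_pos
    actual_pos = true_pos + false_neg
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    recall = true_pos / actual_pos if actual_pos else 0.0
    denominator = precision + recall
    f_measure = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f_measure

# Size-10 column: True Pos = 88, False Pos = 81, False Neg = 106.
precision, recall, f_measure = precision_recall_f(88, 81, 106)
print(round(precision, 6), round(recall, 6), round(f_measure, 6))
```

Rounded to six decimals these come out as 0.52071, 0.453608 and 0.484848, matching the tabulated values for that column.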

| Size | 5 | 10 | 15 | 20 | 25 |
|---|---|---|---|---|---|
| True Neg | 3792 | 3797 | 3801 | 3808 | 3810 |
| True Pos | 103 | 96 | 58 | 36 | 25 |
| Total True (TN + TP) | 3895 | 3893 | 3859 | 3844 | 3835 |
| False Neg | 91 | 97 | 136 | 158 | 169 |
| False Pos | 64 | 60 | 55 | 48 | 46 |
| Total False (FN + FP) | 155 | 157 | 191 | 206 | 215 |
| Precision | 0.616766 | 0.615385 | 0.513274 | 0.428571 | 0.352113 |
| Recall | 0.530928 | 0.497409 | 0.298969 | 0.185567 | 0.128866 |
| F-Measure | 0.570637 | 0.550143 | 0.37785 | 0.258993 | 0.188679 |

© 2017 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

**Citation:** Ratna, A.A.P.; Purnamasari, P.D.; Adhi, B.A.; Ekadiyanto, F.A.; Salman, M.; Mardiyah, M.; Winata, D.J. Cross-Language Plagiarism Detection System Using Latent Semantic Analysis and Learning Vector Quantization. Algorithms 2017, 10, 69. https://doi.org/10.3390/a10020069
