Maintaining Academic Integrity in Programming: Locality-Sensitive Hashing and Recommendations
Abstract
1. Introduction
- RQ1: Is locality-sensitive hashing more efficient than common similarity algorithms used in automated similarity detectors?
- RQ2: How large is the effectiveness trade-off introduced by locality-sensitive hashing?
- RQ3: What are the impacts of overlapping adjacent token substrings, the number of clusters, and the number of stages in locality-sensitive hashing for similarity detection? (These three parameters are illustrated in the sketch following this list.)
- RQ4: What are the recommendations for employing automated similarity detectors for the first time?
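The three parameters named in RQ3 correspond to standard locality-sensitive hashing machinery: overlapping adjacent token substrings (token n-grams) feed the index, each MinHash signature is split into a number of stages (bands), and each stage hashes its rows into a fixed number of clusters (buckets). The minimal Python sketch below illustrates that pipeline under these assumptions; it is not SSTRANGE's actual implementation, and names such as `token_ngrams` and `lsh_candidate_pairs` are invented for illustration.

```python
# Minimal MinHash + LSH sketch (illustrative only; not SSTRANGE's implementation).
import hashlib
from collections import defaultdict
from itertools import combinations

def token_ngrams(tokens, n=4, overlapping=True):
    """Adjacent token substrings; overlapping (step 1) or non-overlapping (step n)."""
    step = 1 if overlapping else n
    return {" ".join(tokens[i:i + n]) for i in range(0, max(len(tokens) - n + 1, 1), step)}

def minhash_signature(grams, num_hashes=64):
    """One minimum per salted hash function; equal positions approximate Jaccard similarity."""
    def h(seed, gram):
        return int(hashlib.md5(f"{seed}:{gram}".encode()).hexdigest(), 16)
    return [min(h(seed, g) for g in grams) for seed in range(num_hashes)]

def lsh_candidate_pairs(signatures, stages=4, clusters=512):
    """Split each signature into 'stages' bands; programs whose band falls into the
    same cluster (bucket) in any stage become candidate pairs for full comparison."""
    rows = len(next(iter(signatures.values()))) // stages
    candidates = set()
    for s in range(stages):
        buckets = defaultdict(list)
        for name, sig in signatures.items():
            band = tuple(sig[s * rows:(s + 1) * rows])
            buckets[hash(band) % clusters].append(name)
        for group in buckets.values():
            candidates.update(tuple(sorted(p)) for p in combinations(group, 2))
    return candidates

# Token streams are assumed to be already lexed with identifiers/literals abstracted,
# so superficially different copies produce the same tokens.
programs = {
    "alice.java": "TYPE ID = NUM ; ID = ID + NUM ; RETURN ID ;".split(),
    "bob.java":   "TYPE ID = NUM ; ID = ID + NUM ; RETURN ID ;".split(),
    "carol.java": "TYPE ID = STR ; CALL ( ID ) ; RETURN NUM ;".split(),
}
sigs = {name: minhash_signature(token_ngrams(toks)) for name, toks in programs.items()}
print(lsh_candidate_pairs(sigs))  # likely output: {('alice.java', 'bob.java')}
```

Programs that share a bucket in at least one stage become candidate pairs, and only those pairs need a full similarity measurement; this is what makes the approach faster than all-pairs comparison.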
2. Literature Review
3. Our Search for Suitable Similarity Detectors
4. SSTRANGE: Scalable STRANGE
5. Methodology for Addressing the Research Questions
5.1. Addressing RQ1, RQ2, and RQ3: SSTRANGE’s Evaluation
5.2. Addressing RQ4: Recommendations
6. Results and Discussion
6.1. The Most Effective Minimum Matching Length for the Similarity Measurements
6.2. Efficiency of Locality-Sensitive Hashing (RQ1)
6.3. Effectiveness of Locality-Sensitive Hashing (RQ2)
6.4. Impact of Overlapping Adjacent Token Substrings in Building Indexes (RQ3)
6.5. Impact of Number of Clusters in Locality-Sensitive Hashing (RQ3)
6.6. Impact of Number of Stages in Locality-Sensitive Hashing (RQ3)
6.7. Experience and Recommendations on Employing Similarity Detectors (RQ4)
6.7.1. Experience from the First Semester of 2020
6.7.2. Experience from the Second Semester of 2020
6.7.3. Experience from the First Semester of 2021
6.7.4. Experience from the Second Semester of 2021
6.7.5. Experience from the First Semester of 2022
6.8. Recommendations
- When choosing a similarity detector, consider the assessment design. An effective similarity detector is not always the one that nullifies the most code variations. Some variations need to be kept to prevent coincidental similarities from being reported, especially in assessments with small tasks, trivial solutions, and/or detailed instructions [10,44]. Many of our assessments, for example, expect similar program flow across solutions; common similarity detectors, such as JPlag [13], might therefore report most submissions as similar because they capture program-flow similarity.
- Instructors need a general understanding of how the employed similarity detector works. Some instructors are uncertain why two programs are considered similar despite looking superficially different, or are confused when a number of copied programs are not reported by the tool. Instructors should read the tool's documentation or be briefed by other instructors who have used the tool before.
- If possible, assessments should leave space for students' creativity and/or for students to use their own case studies. Detecting plagiarism and collusion then becomes easier, as reported similarities are less likely to be a result of coincidence [58].
- Strong evidence can be obtained by observing irregular patterns of similarity. These include similarities in comments, strange white space, unique identifier names, the original author's student information, and unexpected program flow. Many students engage in plagiarism and collusion because they are not fluent in programming [59]; they lack the skill to disguise copied programs that carry such unique characteristics.
- Template and common code should not become evidence for raising suspicion, as they are expected to appear across submissions. For each similar code segment, it is important to check whether it is part of template code: whether it is legitimately copiable (from lecture slides or assessment documents), compulsorily used for compilation, suggested by instructors, or the most obvious implementation for the task [11]. Several similarity detectors offer automatic removal of such code, but for efficiency reasons the result is not always accurate. JPlag [13], MOSS [5], and CSTRANGE [49] are three examples of similarity detectors that offer template code removal; the last two also cater for common code removal.
- Evidence should be based on manually written code. Some assessments (e.g., those developing applications) encourage the use of automatically generated code, and searching for evidence in such code is labor intensive. It is preferable to focus on manually written parts such as the main program or major methods.
- To keep the number of similarity reports reasonable, several short programs in one assessment can be merged before comparison (a minimal merging sketch follows this list). More programs tend to result in more similarity reports, and the number needs to stay manageable given that each report will be manually investigated.
- Assessments expecting the same solutions even at a superficial level (e.g., those following a detailed tutorial) should be excluded from similarity detection. Searching for strong evidence in such assessments is labor intensive, and instructors are expected not to allocate a large portion of marks to them.
- Investigation can be more comprehensive when several similarity measurements are considered. Each similarity measurement has its own strengths and weaknesses, and having more than one can lead to a more objective investigation. Sherlock [8] and CSTRANGE [49] are two examples of tools that provide multiple measurements in their similarity reports.
- For assessments expecting many large submissions, the employed similarity detector should be efficient, preferably one that works in linear time. Some instructors do not have much time to run the detector and wait for the similarity reports, and the issue is exacerbated for detectors with non-linear time complexity, whose processing time grows much faster.
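As noted in the merging recommendation above, a small script can concatenate each student's short programs before they are fed to a detector. The sketch below is illustrative only: it assumes one folder per student containing Java sources, and the directory names and the function `merge_submissions` are hypothetical. The merged files are only meant for similarity measurement, not for compilation.

```python
# Illustrative sketch: merge each student's short programs into one file so the
# detector produces one report per student pair instead of one per program pair.
from pathlib import Path

def merge_submissions(assessment_dir: str, merged_dir: str) -> None:
    src, dst = Path(assessment_dir), Path(merged_dir)
    dst.mkdir(parents=True, exist_ok=True)
    for student_dir in sorted(p for p in src.iterdir() if p.is_dir()):
        # Concatenate all of the student's Java sources in a stable order.
        parts = [f.read_text(encoding="utf-8", errors="ignore")
                 for f in sorted(student_dir.glob("*.java"))]
        (dst / f"{student_dir.name}.java").write_text("\n\n".join(parts), encoding="utf-8")

# merge_submissions("week03_submissions", "week03_merged")  # hypothetical paths
```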
7. Limitations
- Although the SOCO dataset is commonly used for evaluation, SSTRANGE was evaluated on only a single dataset. Replicating the evaluation with different datasets could enrich the findings.
- SSTRANGE's evaluation considers only the f-score for effectiveness and processing time for efficiency (the f-score is sketched after this list). While these metrics are relatively common, utilizing other performance metrics might be useful for further observation.
- While some recommendations for instructors interested in employing similarity detectors appear to be aligned mostly with justifying the need to improve SSTRANGE's predecessors, we believe they are broadly applicable to similarity detectors in general. Our 13 course offerings have various assessment designs, and they are reasonably representative of programming courses.
- The recommendations for instructors interested in employing similarity detectors are based on instructors' experience in a single institution in a particular country, and some of them might need stronger evidence. More comprehensive and better-supported findings might be obtained by reporting such experience from more institutions, especially those located in different countries.
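For reference, the f-score mentioned above is assumed here to be the standard harmonic mean of precision and recall computed over reported similar pairs; the snippet below is a generic illustration, not SSTRANGE's evaluation code.

```python
# Assumed metric definition: f-score over pairs of submissions flagged as similar.
def f_score(reported_pairs: set, ground_truth_pairs: set) -> float:
    true_positives = len(reported_pairs & ground_truth_pairs)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(reported_pairs)
    recall = true_positives / len(ground_truth_pairs)
    return 2 * precision * recall / (precision + recall)

# Example: 9 of 10 reported pairs are genuine, out of 12 known copied pairs.
reported = {("s1", "s2")} | {(f"a{i}", f"b{i}") for i in range(9)}
truth = {(f"a{i}", f"b{i}") for i in range(12)}
print(round(f_score(reported, truth), 2))  # 0.82
```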
8. Conclusions and Future Work
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
1. Fraser, R. Collaboration, collusion and plagiarism in computer science coursework. Inform. Educ. 2014, 13, 179–195.
2. Lancaster, T. Academic integrity for computer science instructors. In Higher Education Computer Science; Springer Nature: Cham, Switzerland, 2018; pp. 59–71.
3. Simon; Sheard, J.; Morgan, M.; Petersen, A.; Settle, A.; Sinclair, J. Informing students about academic integrity in programming. In Proceedings of the 20th Australasian Computing Education Conference, Brisbane, QLD, Australia, 30 January–2 February 2018; pp. 113–122.
4. Kustanto, C.; Liem, I. Automatic source code plagiarism detection. In Proceedings of the 10th ACIS International Conference on Software Engineering, Artificial Intelligences, Networking and Parallel/Distributed Computing, Daegu, Republic of Korea, 27–29 May 2009; pp. 481–486.
5. Schleimer, S.; Wilkerson, D.S.; Aiken, A. Winnowing: Local algorithms for document fingerprinting. In Proceedings of the International Conference on Management of Data, San Diego, CA, USA, 9–12 June 2003; pp. 76–85.
6. Novak, M.; Joy, M.; Kermek, D. Source-code similarity detection and detection tools used in academia: A systematic review. ACM Trans. Comput. Educ. 2019, 19, 27:1–27:37.
7. Karnalim, O.; Simon; Chivers, W. Similarity detection techniques for academic source code plagiarism and collusion: A review. In Proceedings of the IEEE International Conference on Engineering, Technology and Education, Yogyakarta, Indonesia, 10–13 December 2019.
8. Joy, M.; Luck, M. Plagiarism in programming assignments. IEEE Trans. Educ. 1999, 42, 129–133.
9. Ahadi, A.; Mathieson, L. A comparison of three popular source code similarity tools for detecting student plagiarism. In Proceedings of the 21st Australasian Computing Education Conference, Sydney, NSW, Australia, 29–31 January 2019; pp. 112–117.
10. Karnalim, O.; Simon; Ayub, M.; Kurniawati, G.; Nathasya, R.A.; Wijanto, M.C. Work-in-progress: Syntactic code similarity detection in strongly directed assessments. In Proceedings of the IEEE Global Engineering Education Conference, Vienna, Austria, 21–23 April 2021; pp. 1162–1166.
11. Simon; Karnalim, O.; Sheard, J.; Dema, I.; Karkare, A.; Leinonen, J.; Liut, M.; McCauley, R. Choosing code segments to exclude from code similarity detection. In Proceedings of the ACM Working Group Reports on Innovation and Technology in Computer Science Education, Trondheim, Norway, 15–19 June 2020; pp. 1–19.
12. Chi, L.; Zhu, X. Hashing techniques: A survey and taxonomy. ACM Comput. Surv. 2018, 50, 11.
13. Prechelt, L.; Malpohl, G.; Philippsen, M. Finding plagiarisms among a set of programs with JPlag. J. Univers. Comput. Sci. 2002, 8, 1016–1038.
14. Sulistiani, L.; Karnalim, O. ES-Plag: Efficient and sensitive source code plagiarism detection tool for academic environment. Comput. Appl. Eng. Educ. 2019, 27, 166–182.
15. Karnalim, O.; Simon. Explanation in code similarity investigation. IEEE Access 2021, 9, 59935–59948.
16. Ferdows, J.; Khatun, S.; Mokarrama, M.J.; Arefin, M.S. A framework for checking plagiarized contents in programs submitted at online judging systems. In Proceedings of the International Conference on Image Processing and Capsule Networks, Bangkok, Thailand, 6–7 May 2020; pp. 546–560.
17. Durić, Z.; Gašević, D. A source code similarity system for plagiarism detection. Comput. J. 2013, 56, 70–86.
18. Ahtiainen, A.; Surakka, S.; Rahikainen, M. Plaggie: GNU-licensed source code plagiarism detection engine for Java exercises. In Proceedings of the Sixth Baltic Sea Conference on Computing Education Research, Koli Calling, Uppsala, Sweden, 1 February 2006; pp. 141–142.
19. Acampora, G.; Cosma, G. A fuzzy-based approach to programming language independent source-code plagiarism detection. In Proceedings of the IEEE International Conference on Fuzzy Systems, Istanbul, Turkey, 2–5 August 2015; pp. 1–8.
20. Domin, C.; Pohl, H.; Krause, M. Improving plagiarism detection in coding assignments by dynamic removal of common ground. In Proceedings of the CHI Conference Extended Abstracts on Human Factors in Computing Systems, San Jose, CA, USA, 7–12 May 2016; pp. 1173–1179.
21. Ji, J.H.; Woo, G.; Cho, H.G. A plagiarism detection technique for Java program using bytecode analysis. In Proceedings of the Third International Conference on Convergence and Hybrid Information Technology, Busan, Republic of Korea, 11–13 November 2008; pp. 1092–1098.
22. Jiang, Y.; Xu, C. Needle: Detecting code plagiarism on student submissions. In Proceedings of the ACM Turing Celebration Conference, Shanghai, China, 19–20 May 2018; pp. 27–32.
23. Fu, D.; Xu, Y.; Yu, H.; Yang, B. WASTK: A weighted abstract syntax tree kernel method for source code plagiarism detection. Sci. Program. 2017, 2017, 1–8.
24. Nichols, L.; Dewey, K.; Emre, M.; Chen, S.; Hardekopf, B. Syntax-based improvements to plagiarism detectors and their evaluations. In Proceedings of the 24th ACM Conference on Innovation and Technology in Computer Science Education, Aberdeen, UK, 15–17 July 2019; pp. 555–561.
25. Kikuchi, H.; Goto, T.; Wakatsuki, M.; Nishino, T. A source code plagiarism detecting method using alignment with abstract syntax tree elements. In Proceedings of the 15th IEEE/ACIS International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing, Las Vegas, NV, USA, 30 June–2 July 2014; pp. 1–6.
26. Cheers, H.; Lin, Y.; Smith, S.P. Academic source code plagiarism detection by measuring program behavioral similarity. IEEE Access 2021, 9, 50391–50412.
27. Song, H.J.; Park, S.B.; Park, S.Y. Computation of program source code similarity by composition of parse tree and call graph. Math. Probl. Eng. 2015, 2015, 1–12.
28. Ottenstein, K.J. An algorithmic approach to the detection and prevention of plagiarism. ACM SIGCSE Bull. 1976, 8, 30–41.
29. Donaldson, J.L.; Lancaster, A.-M.; Sposato, P.H. A plagiarism detection system. In Proceedings of the 12th SIGCSE Technical Symposium on Computer Science Education, St. Louis, MO, USA, 26–27 February 1981; Volume 13, pp. 21–25.
30. Al-Khanjari, Z.A.; Fiaidhi, J.A.; Al-Hinai, R.A.; Kutti, N.S. PlagDetect: A Java programming plagiarism detection tool. ACM Inroads 2010, 1, 66–71.
31. Allyson, F.B.; Danilo, M.L.; José, S.M.; Giovanni, B.C. Sherlock N-Overlap: Invasive normalization and overlap coefficient for the similarity analysis between source code. IEEE Trans. Comput. 2019, 68, 740–751.
32. Inoue, U.; Wada, S. Detecting plagiarisms in elementary programming courses. In Proceedings of the Ninth International Conference on Fuzzy Systems and Knowledge Discovery, Chongqing, China, 29–31 May 2012; pp. 2308–2312.
33. Ullah, F.; Wang, J.; Farhan, M.; Habib, M.; Khalid, S. Software plagiarism detection in multiprogramming languages using machine learning approach. Concurr. Comput. Pract. Exp. 2018, 33, e5000.
34. Yasaswi, J.; Purini, S.; Jawahar, C.V. Plagiarism detection in programming assignments using deep features. In Proceedings of the Fourth IAPR Asian Conference on Pattern Recognition, Nanjing, China, 26–29 November 2017; pp. 652–657.
35. Elenbogen, B.S.; Seliya, N. Detecting outsourced student programming assignments. J. Comput. Sci. Coll. 2008, 23, 50–57.
36. Jadalla, A.; Elnagar, A. PDE4Java: Plagiarism detection engine for Java source code: A clustering approach. Int. J. Bus. Intell. Data Min. 2008, 3, 121.
37. Moussiades, L.; Vakali, A. PDetect: A clustering approach for detecting plagiarism in source code datasets. Comput. J. 2005, 48, 651–661.
38. Flores, E.; Barrón-Cedeño, A.; Moreno, L.; Rosso, P. Uncovering source code reuse in large-scale academic environments. Comput. Appl. Eng. Educ. 2015, 23, 383–390.
39. Foltýnek, T.; Všianský, R.; Meuschke, N.; Dlabolová, D.; Gipp, B. Cross-language source code plagiarism detection using explicit semantic analysis and scored greedy string tiling. In Proceedings of the ACM/IEEE Joint Conference on Digital Libraries in 2020, Virtual Event, China, 1–5 August 2020; pp. 523–524.
40. Ullah, F.; Jabbar, S.; Mostarda, L. An intelligent decision support system for software plagiarism detection in academia. Int. J. Intell. Syst. 2021, 36, 2730–2752.
41. Cosma, G.; Joy, M. An approach to source-code plagiarism detection and investigation using latent semantic analysis. IEEE Trans. Comput. 2012, 61, 379–394.
42. Ganguly, D.; Jones, G.J.F.; Ramírez-de-la Cruz, A.; Ramírez-de-la Rosa, G.; Villatoro-Tello, E. Retrieving and classifying instances of source code plagiarism. Inf. Retr. J. 2018, 21, 1–23.
43. Mozgovoy, M.; Karakovskiy, S.; Klyuev, V. Fast and reliable plagiarism detection system. In Proceedings of the 37th IEEE Annual Frontiers in Education Conference, Milwaukee, WI, USA, 10–13 October 2007; pp. 11–14.
44. Mann, S.; Frew, Z. Similarity and originality in code: Plagiarism and normal variation in student assignments. In Proceedings of the Eighth Australasian Computing Education Conference, Hobart, Australia, 16–19 January 2006; pp. 143–150.
45. Bowyer, K.W.; Hall, L.O. Experience using “MOSS” to detect cheating on programming assignments. In Proceedings of the 29th Annual Frontiers in Education Conference, San Juan, Puerto Rico, 10–13 November 1999.
46. Pawelczak, D. Benefits and drawbacks of source code plagiarism detection in engineering education. In Proceedings of the IEEE Global Engineering Education Conference, Santa Cruz de Tenerife, Spain, 17–20 April 2018; pp. 1048–1056.
47. Le Nguyen, T.T.; Carbone, A.; Sheard, J.; Schuhmacher, M. Integrating source code plagiarism into a virtual learning environment: Benefits for students and staff. In Proceedings of the 15th Australasian Computing Education Conference, Adelaide, Australia, 29 January–1 February 2013; pp. 155–164.
48. Karnalim, O. Detecting source code plagiarism on introductory programming course assignments using a bytecode approach. In Proceedings of the 10th International Conference on Information & Communication Technology and Systems, Lisbon, Portugal, 6–9 September 2016; pp. 63–68.
49. Karnalim, O.; Simon; Chivers, W. Layered similarity detection for programming plagiarism and collusion on weekly assessments. Comput. Appl. Eng. Educ. 2022, 30, 1739–1752.
50. Parr, T. The Definitive ANTLR 4 Reference; The Pragmatic Programmers: Raleigh, NC, USA, 2013.
51. Sraka, D.; Kaučič, B. Source code plagiarism. In Proceedings of the 31st International Conference on Information Technology Interfaces, Dubrovnik, Croatia, 22–25 June 2009; pp. 461–466.
52. Croft, W.B.; Metzler, D.; Strohman, T. Search Engines: Information Retrieval in Practice; Pearson Education: London, UK, 2010.
53. Broder, A. On the resemblance and containment of documents. In Proceedings of the Compression and Complexity of SEQUENCES 1997, Salerno, Italy, 11–13 June 1997; pp. 21–29.
54. Ji, J.; Li, J.; Yan, S.; Zhang, B.; Tian, Q. Super-Bit locality-sensitive hashing. In Proceedings of the 25th International Conference on Neural Information Processing Systems, Lake Tahoe, NV, USA, 3–6 December 2012; Curran Associates Inc.: Red Hook, NY, USA, 2012; pp. 108–116.
55. Wise, M.J. YAP3: Improved detection of similarities in computer program and other texts. In Proceedings of the 27th SIGCSE Technical Symposium on Computer Science Education, Philadelphia, PA, USA, 15–17 February 1996; pp. 130–134.
56. Arwin, C.; Tahaghoghi, S.M.M. Plagiarism detection across programming languages. In Proceedings of the Sixth Australasian Computer Science Conference, Reading, UK, 28–31 May 2006; pp. 277–286.
57. Ragkhitwetsagul, C.; Krinke, J.; Clark, D. A comparison of code similarity analysers. Empir. Softw. Eng. 2018, 23, 2464–2519.
58. Bradley, S. Creative assessment in programming: Diversity and divergence. In Proceedings of the Fourth Conference on Computing Education Practice, Durham, UK, 9 January 2020; pp. 13:1–13:4.
59. Vogts, D. Plagiarising of source code by novice programmers: A “cry for help”? In Proceedings of the Research Conference of the South African Institute of Computer Scientists and Information Technologists, Vanderbijlpark, Emfuleni, South Africa, 12–14 October 2009; pp. 141–149.
The 13 course offerings in which similarity detectors were employed:

Semester | Course | Major | Enrollment Year | Students
---|---|---|---|---
2020 1st sem | Basic data structure | IT | 1st year | 46
2020 1st sem | Adv. data structure | IT | 2nd year | 9
2020 2nd sem | Introductory prog. | IS | 1st year | 33
2020 2nd sem | Adv. data structure | IT | 2nd year | 43
2021 1st sem | Object oriented prog. | IS | 1st year | 37
2021 1st sem | Introductory prog. | IT | 1st year | 45
2021 2nd sem | Introductory prog. | IS | 1st year | 35
2021 2nd sem | Business application prog. | IS | 3rd year | 19
2021 2nd sem | Data structure | IT | 2nd year | 33
2021 2nd sem | Adv. object oriented prog. | IT | 3rd year | 32
2022 1st sem | Object oriented prog. | IS | 1st year | 15
2022 1st sem | Introductory prog. | IT | 1st year | 55
2022 1st sem | Machine intelligence | IT | 3rd year | 33
Processing time (in seconds) of each similarity measurement across minimum matching lengths (MML):

MML | MinHash | Super-Bit | Jaccard | Cosine | RKRGST
---|---|---|---|---|---
10 | 6 | 4 | 12 | 11 | 1324
20 | 5 | 5 | 13 | 12 | 1441
30 | 6 | 6 | 13 | 12 | 1354
40 | 7 | 6 | 14 | 13 | 1400
50 | 7 | 7 | 15 | 13 | 1339
60 | 7 | 7 | 15 | 14 | 1328
70 | 8 | 7 | 15 | 14 | 1337
80 | 8 | 8 | 16 | 15 | 1373
90 | 8 | 8 | 17 | 17 | 1384
100 | 11 | 10 | 17 | 16 | 1355
110 | 11 | 10 | 22 | 17 | 1444
120 | 10 | 11 | 20 | 18 | 1390
130 | 15 | 14 | 17 | 18 | 1448
140 | 15 | 15 | 19 | 19 | 1315
150 | 16 | 13 | 22 | 20 | 1253
160 | 13 | 13 | 19 | 21 | 1238
170 | 14 | 14 | 19 | 22 | 1200
180 | 15 | 14 | 20 | 25 | 1179
190 | 15 | 15 | 20 | 26 | 1170
200 | 16 | 16 | 21 | 21 | 1141
Each similarity measurement at its most effective minimum matching length (MML), with the corresponding f-score and processing time:

Metric | MinHash | Super-Bit | Jaccard | Cosine | RKRGST
---|---|---|---|---|---
MML | 20 | 60 | 30 | 130 | 20
F-score (%) | 70 | 81 | 93 | 84 | 94
Processing time (s) | 5 | 7 | 13 | 18 | 1441