Abstract
Source code plagiarism is a common occurrence in undergraduate computer science education. Many source code plagiarism detection tools have been proposed to identify such cases. A source code plagiarism detection tool evaluates pairs of assignment submissions to detect indications of plagiarism. However, a plagiarising student will commonly apply plagiarism-hiding modifications to source code in an attempt to evade detection. Prior work has suggested that currently available source code plagiarism detection tools are not robust to pervasive plagiarism-hiding modifications. In this article, 11 source code plagiarism detection tools are evaluated for robustness against plagiarism-hiding modifications. The tools are evaluated with data sets of simulated undergraduate plagiarism, constructed with source code modifications representative of those applied by undergraduate students. The results indicate that currently available source code plagiarism detection tools are not robust against modifications that apply fine-grained transformations to the source code structure. Of the evaluated tools, JPlag and Plaggie demonstrate the greatest robustness to different types of plagiarism-hiding modifications. However, the results also indicate that graph-based tools, specifically those that compare programs as program dependence graphs, show potentially greater robustness to pervasive plagiarism-hiding modifications.
Data Availability
Data are available on request due to privacy or other restrictions.
Code Availability
Available under MIT license (noted in manuscript where applicable)
Notes
https://github.com/javaparser/javaparser, last accessed May 1 2020.
http://commons.apache.org/proper/commons-text, last accessed May 1 2020.
https://github.com/DatabaseGroup/apted, last accessed May 1 2020.
https://www.eclipse.org/jdt/, last accessed June 30 2020.
https://github.com/google/google-java-format, last accessed June 30 2020.
https://www.eclipse.org, last accessed January 15 2021.
https://www.jetbrains.com/idea/, last accessed January 15 2021.
https://kotlinlang.org, last accessed May 1 2020.
Funding
This work was supported by an Australian Government Research Training Program Scholarship at the University of Newcastle, Australia.
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable
Conflict of Interests
Not applicable
Additional information
Communicated by: Gabriele Bavota
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Cheers, H., Lin, Y. & Smith, S.P. Evaluating the robustness of source code plagiarism detection tools to pervasive plagiarism-hiding modifications. Empir Software Eng 26, 83 (2021). https://doi.org/10.1007/s10664-021-09990-4
Accepted:
Published:
DOI: https://doi.org/10.1007/s10664-021-09990-4