
Evaluating the robustness of source code plagiarism detection tools to pervasive plagiarism-hiding modifications

Empirical Software Engineering

Abstract

Source code plagiarism is a common occurrence in undergraduate computer science education. To identify such cases, many source code plagiarism detection tools have been proposed. A source code plagiarism detection tool evaluates pairs of assignment submissions to detect indications of plagiarism. However, a plagiarising student will commonly apply plagiarism-hiding modifications to source code in an attempt to evade detection, and prior work has implied that currently available source code plagiarism detection tools are not robust to the application of pervasive plagiarism-hiding modifications. In this article, 11 source code plagiarism detection tools are evaluated for robustness against plagiarism-hiding modifications. The tools are evaluated with data sets of simulated undergraduate plagiarism, constructed with source code modifications representative of those applied by undergraduate students. The results indicate that currently available source code plagiarism detection tools are not robust against modifications which apply fine-grained transformations to the source code structure. Of the evaluated tools, JPlag and Plaggie demonstrate the greatest robustness to different types of plagiarism-hiding modifications. However, the results also indicate that graph-based tools, specifically those that compare programs as program dependence graphs, show potentially greater robustness to pervasive plagiarism-hiding modifications.

Data Availability

Data are available on request due to privacy or other restrictions.

Code Availability

Available under MIT license (noted in manuscript where applicable)

Notes

  1. https://github.com/javaparser/javaparser, last accessed May 1 2020.

  2. http://commons.apache.org/proper/commons-text, last accessed May 1 2020.

  3. https://github.com/DatabaseGroup/apted, last accessed May 1 2020.

  4. https://www.eclipse.org/jdt/, last accessed June 30 2020.

  5. https://github.com/google/google-java-format, last accessed June 30 2020.

  6. https://www.eclipse.org, last accessed January 15 2021.

  7. https://www.jetbrains.com/idea/, last accessed January 15 2021.

  8. https://kotlinlang.org, last accessed May 1 2020.

Funding

This work was supported by an Australian Government Research Training Program Scholarship at the University of Newcastle, Australia.

Author information

Corresponding author

Correspondence to Hayden Cheers.

Ethics declarations

Ethics approval and consent to participate

Not applicable

Conflict of Interests

Not applicable

Additional information

Communicated by: Gabriele Bavota

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Cheers, H., Lin, Y. & Smith, S.P. Evaluating the robustness of source code plagiarism detection tools to pervasive plagiarism-hiding modifications. Empir Software Eng 26, 83 (2021). https://doi.org/10.1007/s10664-021-09990-4

  • DOI: https://doi.org/10.1007/s10664-021-09990-4
