GPLAG: detection of software plagiarism by program dependence graph analysis

Authors:
Chao Liu, University of Illinois-UC, Urbana, IL
Chen Chen, University of Illinois-UC, Urbana, IL
Jiawei Han, University of Illinois-UC, Urbana, IL
Philip S. Yu, IBM T. J. Watson Research Center, Hawthorne, NY

KDD '06: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining
August 2006
Pages 872–881
https://doi.org/10.1145/1150402.1150522

Published: 20 August 2006

ABSTRACT

The flourishing of open source projects has also made software plagiarism more convenient. A company, if less self-disciplined, may be tempted to plagiarize open source projects for its own products. Although current plagiarism detection tools appear sufficient for academic use, they fall short against serious plagiarists. For example, disguises such as statement reordering and code insertion can effectively confuse these tools. In this paper, we develop a new plagiarism detection tool, called GPLAG, which detects plagiarism by mining program dependence graphs (PDGs). A PDG is a graphical representation of the data and control dependencies within a procedure. Because PDGs are nearly invariant during plagiarism, GPLAG is more effective than state-of-the-art tools for plagiarism detection. To make GPLAG scalable to large programs, a statistical lossy filter is proposed to prune the plagiarism search space. An experimental study shows that GPLAG is both effective and efficient: it detects plagiarism that easily slips past existing tools, and it usually takes a few seconds to find (simulated) plagiarism in programs with thousands of lines of code.
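
To make the invariance claim concrete, the following is a minimal sketch and not the GPLAG implementation. It assumes straight-line code only, so it models data dependencies and omits control dependencies, and it uses a brute-force subgraph matching as a stand-in for whatever matching procedure the paper actually uses. The statement encoding `(kind, target, uses)` and the names `build_dependence_graph` and `is_subgraph_isomorphic` are hypothetical, introduced only for illustration. It shows that a copy disguised by renaming, statement reordering, and code insertion still contains the original's dependence graph.

```python
from itertools import permutations


def build_dependence_graph(stmts):
    """Build a data-dependence graph for straight-line code.

    Each statement is (kind, target, uses): the statement kind, the
    variable it defines, and the variables it reads. Nodes are statement
    indices labeled by kind; an edge (u, v) means statement v reads a
    variable whose most recent definition is statement u.
    """
    kinds = [kind for kind, _, _ in stmts]
    edges = set()
    last_def = {}  # variable name -> index of its latest definition
    for idx, (_, target, uses) in enumerate(stmts):
        for var in uses:
            if var in last_def:
                edges.add((last_def[var], idx))
        last_def[target] = idx
    return kinds, edges


def is_subgraph_isomorphic(small, large):
    """Brute-force check (tiny graphs only): is there an injective mapping
    of small's nodes into large's nodes that preserves node kinds and maps
    every edge of small onto an edge of large?"""
    s_kinds, s_edges = small
    l_kinds, l_edges = large
    for image in permutations(range(len(l_kinds)), len(s_kinds)):
        if any(s_kinds[i] != l_kinds[image[i]] for i in range(len(s_kinds))):
            continue
        if all((image[u], image[v]) in l_edges for u, v in s_edges):
            return True
    return False


# Original: a = read(); b = read(); c = a + b; d = c * 2
original = [
    ("call", "a", []),
    ("call", "b", []),
    ("add", "c", ["a", "b"]),
    ("mul", "d", ["c"]),
]

# Disguised copy: variables renamed, the two reads swapped,
# and an unrelated statement (tmp = 0) inserted.
disguised = [
    ("call", "y", []),
    ("assign", "tmp", []),
    ("call", "x", []),
    ("add", "s", ["x", "y"]),
    ("mul", "r", ["s"]),
]

# Unrelated code: a = read(); c = a + a; d = c * 2
unrelated = [
    ("call", "a", []),
    ("add", "c", ["a", "a"]),
    ("mul", "d", ["c"]),
]

g_original = build_dependence_graph(original)
print(is_subgraph_isomorphic(g_original, build_dependence_graph(disguised)))  # True
print(is_subgraph_isomorphic(g_original, build_dependence_graph(unrelated)))  # False
```

Because node identity depends only on statement kind and dependence edges, none of the three disguises changes the match. The brute-force search is exponential in graph size, which is one way to see why a filter that prunes the search space matters for large programs.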

Index Terms

Information systems → Information systems applications → Data mining

Published in
KDD '06: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining
August 2006, 986 pages
ISBN: 1595933395
DOI: 10.1145/1150402
Conference Chair: Tina Eliassi-Rad (LLNL)
General Chair: Lyle Ungar (University of Pennsylvania)
Program Chairs: Mark Craven (University of Wisconsin), Dimitrios Gunopulos (University of California, Riverside)
Copyright © 2006 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
graph mining
program dependence graph
software plagiarism detection
Acceptance Rates
Overall Acceptance Rate: 1,133 of 8,635 submissions, 13%