Semantic Sequence Kin: A Method of Document Copy Detection

Bao, Jun-Peng; Shen, Jun-Yi; Liu, Xiao-Dong; Liu, Hai-Yan; Zhang, Xiao-Di

doi:10.1007/978-3-540-24775-3_63

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3056)

Included in the following conference series:

Pacific-Asia Conference on Knowledge Discovery and Data Mining

Abstract

String matching and the global word frequency model are the two basic models of Document Copy Detection, but both are unsatisfactory in some respects. The String Kernel (SK) and the Word Sequence Kernel (WSK) can map string pairs directly into a new feature space in which the data is linearly separable. This idea inspired the Semantic Sequence Kin (SSK), which we apply to document copy detection. SK and WSK take into account only the gap between the first word/term and the last word/term, which makes them poorly suited to plagiarism detection. SSK considers the position of each common word, so it can detect plagiarism at a fine granularity. SSK is based on semantic density, which is in effect local word frequency information. We believe these measures greatly reduce the noise introduced by rewording. We test SSK on a small corpus containing several common copy types. The results show that SSK is excellent at detecting non-rewording plagiarism and remains valid even when documents are reworded to some extent.
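The contrast drawn in the abstract can be illustrated with a small sketch. It is not the SSK formula from the paper, which this page does not reproduce. The function names `span_length`, `adjacent_gaps`, and `local_frequency`, the window parameter, and the sample data are all assumptions made for illustration. The sketch shows that a measure based only on the first-to-last span cannot tell two match patterns apart, while per-word positions can, and that a local word frequency can differ from the global count.

```python
def span_length(positions):
    """Length of the span from the first to the last matched word.

    Illustrates the kind of first-to-last gap that the abstract attributes
    to SK and WSK (hypothetical helper, not the kernels themselves).
    """
    return positions[-1] - positions[0] + 1


def adjacent_gaps(positions):
    """Distance between each pair of consecutive matched words.

    Illustrates using every common word's position (hypothetical helper).
    """
    return [b - a for a, b in zip(positions, positions[1:])]


def local_frequency(tokens, index, window):
    """Count occurrences of tokens[index] within `window` words on each side.

    A stand-in for "local word frequency"; the paper's definition of
    semantic density may differ (assumption).
    """
    lo = max(0, index - window)
    hi = min(len(tokens), index + window + 1)
    return tokens[lo:hi].count(tokens[index])


# Positions of three common words in two hypothetical documents.
doc1_positions = [0, 1, 5]
doc2_positions = [0, 4, 5]

# The first-to-last span is identical for both documents...
same_span = span_length(doc1_positions) == span_length(doc2_positions)

# ...but the per-word gaps differ, so the match patterns are distinguishable.
gaps1 = adjacent_gaps(doc1_positions)
gaps2 = adjacent_gaps(doc2_positions)

tokens = "the cat sat on the mat the end".split()
global_count = tokens.count("the")
local_count = local_frequency(tokens, index=4, window=2)
```

In this toy data both documents have the same span, yet their adjacent gaps differ, which is the kind of position information a span-only measure discards.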

Author information

Authors and Affiliations

Department of Computer Science and Engineering, Xi’an Jiaotong University, 710049, Xi’an, People’s Republic of China
Jun-Peng Bao, Jun-Yi Shen, Xiao-Dong Liu, Hai-Yan Liu & Xiao-Di Zhang

Editor information

Editors and Affiliations

School of Engineering and Information Technology, Deakin University, VIC 3125, Australia
Honghua Dai
University of Illinois at Urbana-Champaign, 61801, Urbana, IL, USA
Ramakrishnan Srikant
Faculty of Engineering and Information Technology, Centre for Quantum Computation and Intelligent Systems, and Australian ACS National Committee for Artificial Intelligence, University of Technology, Sydney, Australia
Chengqi Zhang

About this paper

Cite this paper

Bao, JP., Shen, JY., Liu, XD., Liu, HY., Zhang, XD. (2004). Semantic Sequence Kin: A Method of Document Copy Detection. In: Dai, H., Srikant, R., Zhang, C. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2004. Lecture Notes in Computer Science, vol 3056. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24775-3_63

DOI: https://doi.org/10.1007/978-3-540-24775-3_63
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-22064-0
Online ISBN: 978-3-540-24775-3
eBook Packages: Springer Book Archive
