Abstract
Authorship attribution assigns works of contentious authorship to their rightful owners, solving cases of theft, plagiarism and authorship disputes in academia and industry. In this paper we investigate the application of information retrieval techniques to the attribution of authorship of C source code. In particular, we explore novel methods for converting C code into documents suitable for retrieval systems, experimenting with 1,597 student programming assignments. We investigate several possible program derivations, partition attribution results by original program length to measure effectiveness on modest and lengthy programs separately, and evaluate three different methods for interpreting document rankings as authorship attribution. The best of our methods achieves an average of 76.78% classification accuracy for a one-in-ten classification problem, which is competitive against six existing baselines. The techniques that we present can be the basis of practical software to support source code authorship investigations.
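The abstract does not spell out how C code is converted into a retrievable document or how a ranking becomes an attribution, so the following is only a minimal sketch under assumed choices, not the authors' method. It assumes that identifiers and numbers are collapsed into placeholder tokens, that overlapping token trigrams serve as index terms, that a simple TF-IDF-style overlap score stands in for a retrieval system's similarity measure, and that the author of the top-ranked document is taken as the answer. The names `to_document`, `rank`, and `attribute`, the keyword subset, and the scoring formula are all hypothetical.

```python
import re
from collections import Counter
from math import log

# Hypothetical tokenizer: identifiers, numbers, a few multi-character
# operators, then single-character operators and punctuation.
TOKEN_RE = re.compile(
    r"[A-Za-z_]\w*|\d+|\+\+|--|[-+*/%=!<>]=|&&|\|\||[-+*/%=<>!&|^~?:;,.(){}\[\]]"
)

# Assumed subset of C keywords kept verbatim (not the paper's feature list).
KEYWORDS = {
    "if", "else", "for", "while", "do", "switch", "case", "break",
    "continue", "return", "int", "char", "float", "double", "void",
    "struct", "typedef", "sizeof", "static", "const",
}


def to_document(c_source, n=3):
    """Convert C source into a list of token n-gram index terms."""
    tokens = []
    for tok in TOKEN_RE.findall(c_source):
        if tok in KEYWORDS:
            tokens.append(tok)
        elif tok[0].isalpha() or tok[0] == "_":
            tokens.append("ID")   # collapse all identifiers
        elif tok[0].isdigit():
            tokens.append("NUM")  # collapse all numeric literals
        else:
            tokens.append(tok)    # operators and punctuation as-is
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def rank(query_doc, collection):
    """Rank (author, document) pairs by a simple TF-IDF-style overlap score."""
    total = len(collection)
    doc_freq = Counter()
    for _, doc in collection:
        doc_freq.update(set(doc))
    scored = []
    for author, doc in collection:
        tf = Counter(doc)
        score = sum(
            tf[term] * log(1 + total / doc_freq[term])
            for term in set(query_doc)
            if term in tf
        )
        scored.append((score, author))
    # Highest score first; ties keep collection order (sort is stable).
    scored.sort(key=lambda pair: -pair[0])
    return scored


def attribute(ranking):
    """Assumed interpretation: credit the author of the top-ranked document."""
    return ranking[0][1]
```

A query program is converted with `to_document`, ranked against the indexed programs of known authors with `rank`, and attributed with `attribute`. Other interpretations of the ranking, such as aggregating scores per author over several retrieved documents, would replace only the last function.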
Copyright information
© 2009 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Burrows, S., Uitdenbogerd, A.L., Turpin, A. (2009). Application of Information Retrieval Techniques for Source Code Authorship Attribution. In: Zhou, X., Yokota, H., Deng, K., Liu, Q. (eds) Database Systems for Advanced Applications. DASFAA 2009. Lecture Notes in Computer Science, vol 5463. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-00887-0_61
DOI: https://doi.org/10.1007/978-3-642-00887-0_61
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-00886-3
Online ISBN: 978-3-642-00887-0
eBook Packages: Computer Science, Computer Science (R0)