ABSTRACT
We have developed a technique to characterize software developers' styles using a set of source code metrics. This style fingerprint can be used to identify the likely author of a piece of code from a pool of candidates. Author identification has applications in criminal justice, corporate litigation, and plagiarism detection. Furthermore, we can identify candidate developers who share similar styles, making our technique useful for software maintenance as well. Our method involves measuring the differences in histogram distributions for code metrics. Identifying a combination of metrics that is effective in distinguishing developer styles is key to the utility of the technique. Our case study involves 18 metrics, and the time required to search the problem space exhaustively prevented us from adding further metrics. Using a genetic algorithm to perform the search, we were able to find good metric combinations in hours as opposed to weeks. The genetic algorithm has enabled us to begin adding new metrics to our catalog of available metrics. This paper documents the results of our experiments in author identification for software forensics and outlines future directions of research to improve the utility of our method.
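The two ideas in the abstract, comparing metric histograms and searching for a good metric combination with a genetic algorithm, can be illustrated with a minimal sketch. The abstract does not specify the distance measure, the genetic operators, or the fitness function, so everything below is an assumption made for illustration: the sum of absolute bin differences as the distance, truncation selection with one-point crossover and bit-flip mutation, and all function and parameter names (`normalized_histogram`, `histogram_distance`, `closest_author`, `evolve_metric_subset`) are hypothetical rather than taken from the paper.

```python
import random
from collections import Counter


def normalized_histogram(values, bins):
    """Relative frequency of each bin value in `values` (bins are fixed labels)."""
    counts = Counter(values)
    total = len(values)
    return [counts.get(b, 0) / total if total else 0.0 for b in bins]


def histogram_distance(h1, h2):
    """Sum of absolute bin differences (an assumed measure, not the paper's)."""
    return sum(abs(a - b) for a, b in zip(h1, h2))


def closest_author(unknown, profiles, metric_subset):
    """Return the author whose histograms are nearest over the chosen metrics.

    `unknown` maps metric name -> histogram; `profiles` maps
    author -> {metric name -> histogram}.
    """
    def total_distance(author):
        return sum(
            histogram_distance(unknown[m], profiles[author][m])
            for m in metric_subset
        )
    return min(profiles, key=total_distance)


def evolve_metric_subset(n_metrics, fitness, generations=30, pop_size=20, seed=0):
    """Tiny genetic algorithm over bit masks; mask[i] == 1 keeps metric i.

    `fitness` scores a mask (higher is better), e.g. the identification
    accuracy on held-out code. Requires n_metrics >= 2 and pop_size >= 4.
    """
    rng = random.Random(seed)
    population = [
        [rng.randint(0, 1) for _ in range(n_metrics)] for _ in range(pop_size)
    ]
    for _ in range(generations):
        # Truncation selection: the better half survives unchanged.
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_metrics)  # one-point crossover
            child = a[:cut] + b[cut:]
            if rng.random() < 0.1:             # occasional bit-flip mutation
                child[rng.randrange(n_metrics)] ^= 1
            children.append(child)
        population = parents + children
    return max(population, key=fitness)
```

With 18 metrics there are 2^18 = 262,144 possible subsets, each of which would need a full identification run to score; a search of this kind evaluates only `pop_size * generations` candidates or fewer, which is the sort of saving the abstract describes.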
Index Terms
- Using code metric histograms and genetic algorithms to perform author identification for software forensics