DOI: 10.1145/1276958.1277364
Article

Using code metric histograms and genetic algorithms to perform author identification for software forensics

Published: 07 July 2007

ABSTRACT

We have developed a technique to characterize software developers' styles using a set of source code metrics. This style fingerprint can be used to identify the likely author of a piece of code from a pool of candidates. Author identification has applications in criminal justice, corporate litigation, and plagiarism detection. Furthermore, we can identify candidate developers who share similar styles, making our technique useful for software maintenance as well. Our method involves measuring the differences in histogram distributions for code metrics. Identifying a combination of metrics that is effective in distinguishing developer styles is key to the utility of the technique. Our case study involves 18 metrics, and the time required to search the problem space exhaustively prevented us from adding further metrics. Using a genetic algorithm to perform the search, we were able to find good metric combinations in hours as opposed to weeks. The genetic algorithm has enabled us to begin adding new metrics to our catalog of available metrics. This paper documents the results of our experiments in author identification for software forensics and outlines future directions of research to improve the utility of our method.
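
The two ideas in the abstract, comparing per-metric histograms and searching over metric combinations with a genetic algorithm, can be illustrated with a minimal sketch. The abstract does not specify the distance measure, the fitness function, or the genetic operators, so everything below is an assumption made for illustration: the sum-of-absolute-differences distance, accuracy on labelled samples as fitness, truncation selection, one-point crossover, and all function and parameter names are hypothetical and are not taken from the paper.

```python
import random
from collections import Counter


def histogram(values):
    """Normalized histogram: each observed metric value mapped to its relative frequency."""
    counts = Counter(values)
    total = sum(counts.values())
    return {v: c / total for v, c in counts.items()}


def distance(h1, h2):
    """Assumed distance: sum of absolute bin differences (0 = identical, 2 = disjoint)."""
    return sum(abs(h1.get(k, 0.0) - h2.get(k, 0.0)) for k in set(h1) | set(h2))


def identify(sample, profiles, metrics, mask):
    """Return the candidate author whose histograms are closest on the selected metrics."""
    def total(author):
        return sum(distance(sample[m], profiles[author][m])
                   for m, used in zip(metrics, mask) if used)
    return min(profiles, key=total)


def fitness(mask, metrics, profiles, labelled):
    """Assumed fitness: fraction of labelled samples attributed to the right author."""
    if not any(mask):
        return 0.0  # an empty metric set cannot discriminate anyone
    hits = sum(identify(s, profiles, metrics, mask) == a for a, s in labelled)
    return hits / len(labelled)


def evolve(metrics, profiles, labelled, pop_size=20, generations=30, p_mut=0.1, seed=0):
    """Toy genetic search over metric subsets encoded as boolean masks (needs >= 2 metrics)."""
    rng = random.Random(seed)
    n = len(metrics)
    pop = [[rng.random() < 0.5 for _ in range(n)] for _ in range(pop_size)]

    def score(mask):
        return fitness(mask, metrics, profiles, labelled)

    for _ in range(generations):
        pop.sort(key=score, reverse=True)
        parents = pop[:pop_size // 2]              # truncation selection, keeps the best
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n)              # one-point crossover
            child = a[:cut] + b[cut:]
            child = [(not g) if rng.random() < p_mut else g for g in child]  # bit-flip mutation
            children.append(child)
        pop = parents + children
    return max(pop, key=score)
```

With 18 metrics, a boolean mask has 2^18 possible values if every subset is allowed, and each one requires re-scoring all labelled samples, which is the kind of cost the abstract describes for exhaustive search. A call such as `evolve(metrics, profiles, labelled)` evaluates only `pop_size` masks per generation, where `profiles` maps each author to a per-metric histogram and `labelled` is a list of `(author, sample_histograms)` pairs.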


    • Published in

      GECCO '07: Proceedings of the 9th annual conference on Genetic and evolutionary computation
      July 2007
      2313 pages
      ISBN: 9781595936974
      DOI: 10.1145/1276958

      Copyright © 2007 ACM


      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Acceptance Rates

      GECCO '07 Paper Acceptance Rate: 266 of 577 submissions, 46%
      Overall Acceptance Rate: 1,669 of 4,410 submissions, 38%
