Article

Using code metric histograms and genetic algorithms to perform author identification for software forensics

Authors:
Robert Charles Lange, Drexel University, Philadelphia, PA
Spiros Mancoridis, Drexel University, Philadelphia, PA

GECCO '07: Proceedings of the 9th annual conference on Genetic and evolutionary computation, July 2007, Pages 2082–2089, https://doi.org/10.1145/1276958.1277364

Published: 07 July 2007

ABSTRACT

We have developed a technique to characterize software developers' styles using a set of source code metrics. This style fingerprint can be used to identify the likely author of a piece of code from a pool of candidates. Author identification has applications in criminal justice, corporate litigation, and plagiarism detection. Furthermore, we can identify candidate developers who share similar styles, making our technique useful for software maintenance as well. Our method involves measuring the differences in histogram distributions for code metrics. Identifying a combination of metrics that is effective in distinguishing developer styles is key to the utility of the technique. Our case study involves 18 metrics, and the time required to search the problem space exhaustively prevented us from adding more. Using a genetic algorithm to perform the search, we were able to find good metric combinations in hours as opposed to weeks. The genetic algorithm has enabled us to begin adding new metrics to our catalog of available metrics. This paper documents the results of our experiments in author identification for software forensics and outlines future directions of research to improve the utility of our method.
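
As an illustration of the idea in the abstract, the following is a minimal sketch, not the authors' implementation. The abstract does not state which histogram distance, fitness function, or genetic operators were used, so everything here is an assumption: the distance is a sum of absolute bin differences, fitness is the fraction of labeled samples attributed correctly, and the search uses truncation selection, one-point crossover, and bit-flip mutation over a bitmask of metrics. All function names are hypothetical, and only the Python standard library is used. If a metric combination is taken to be any non-empty subset of the 18 metrics, there are 2^18 − 1 = 262,143 candidates, and each one requires re-running the classification, which is the cost the genetic search is meant to avoid.

```python
import random
from collections import Counter


def histogram(values):
    """Turn raw observations of one metric into a relative-frequency histogram."""
    counts = Counter(values)
    total = sum(counts.values())
    return {bin_: n / total for bin_, n in counts.items()}


def histogram_distance(h1, h2):
    """Sum of absolute bin differences (an assumed distance, not the paper's)."""
    bins = set(h1) | set(h2)
    return sum(abs(h1.get(b, 0.0) - h2.get(b, 0.0)) for b in bins)


def identify(unknown, profiles, metrics):
    """Return the candidate author whose histograms are closest to `unknown`.

    `unknown` maps metric name -> histogram; `profiles` maps author -> such a dict.
    Ties are broken alphabetically by author name.
    """
    def total_distance(author):
        return sum(histogram_distance(unknown[m], profiles[author][m]) for m in metrics)
    return min(sorted(profiles), key=total_distance)


def fitness(mask, metric_names, samples, profiles):
    """Fraction of (author, sample) pairs identified correctly with the masked metrics."""
    metrics = [m for m, bit in zip(metric_names, mask) if bit]
    if not metrics:
        return 0.0  # an empty metric set cannot identify anyone
    hits = sum(identify(sample, profiles, metrics) == author for author, sample in samples)
    return hits / len(samples)


def evolve(metric_names, samples, profiles, pop_size=20, generations=30,
           mutation_rate=0.1, seed=0):
    """Search for a good metric bitmask with a small elitist genetic algorithm."""
    rng = random.Random(seed)
    n = len(metric_names)  # assumes at least two metrics

    def score(mask):
        return fitness(mask, metric_names, samples, profiles)

    population = [tuple(rng.randint(0, 1) for _ in range(n)) for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=score, reverse=True)
        parents = ranked[: pop_size // 2]  # truncation selection; parents survive
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n)  # one-point crossover
            child = a[:cut] + b[cut:]
            # Flip each bit independently with probability `mutation_rate`.
            child = tuple(bit ^ (rng.random() < mutation_rate) for bit in child)
            children.append(child)
        population = parents + children
    return max(population, key=score)
```

In use, `profiles` would hold one histogram per metric for each candidate author, built from code of known authorship, and `samples` would hold held-out files with known authors. The mask returned by `evolve` then names the metrics to use when calling `identify` on code of unknown origin.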

Index Terms

1. Computing methodologies
  1. Machine learning

Published in
GECCO '07: Proceedings of the 9th annual conference on Genetic and evolutionary computation
July 2007, 2313 pages
ISBN: 9781595936974
DOI: 10.1145/1276958
General Chair: Hod Lipson, Cornell University, USA
Copyright © 2007 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
author identification
genetic algorithm applications
pygene
software forensics
software metrics
Qualifiers
- Article
Conference

Acceptance Rates
GECCO '07 Paper Acceptance Rate: 266 of 577 submissions, 46%
Overall Acceptance Rate: 1,669 of 4,410 submissions, 38%