Writeprints: A stylometric approach to identity-level identification and similarity detection in cyberspace

Authors:
Ahmed Abbasi
The University of Arizona, Tucson, AZ

Hsinchun Chen
The University of Arizona, Tucson, AZ

ACM Transactions on Information Systems, Volume 26, Issue 2, Article No. 7, pp. 1–29
https://doi.org/10.1145/1344411.1344413

Published: 08 April 2008

Abstract

One of the problems often associated with online anonymity is that it hinders social accountability, as substantiated by the high levels of cybercrime. Although identity cues are scarce in cyberspace, individuals often leave behind textual identity traces. In this study we proposed the use of stylometric analysis techniques to help identify individuals based on writing style. We incorporated a rich set of stylistic features, including lexical, syntactic, structural, content-specific, and idiosyncratic attributes. We also developed the Writeprints technique for identification and similarity detection of anonymous identities. Writeprints is a Karhunen-Loeve transforms-based technique that uses a sliding window and pattern disruption algorithm with individual author-level feature sets. The Writeprints technique and extended feature set were evaluated on a testbed encompassing four online datasets spanning different domains: email, instant messaging, feedback comments, and program code. Writeprints outperformed benchmark techniques, including SVM, Ensemble SVM, PCA, and standard Karhunen-Loeve transforms, on the identification and similarity detection tasks with accuracy as high as 94% when differentiating between 100 authors. The extended feature set also significantly outperformed a baseline set of features commonly used in previous research. Furthermore, individual-author-level feature sets generally outperformed use of a single group of attributes.
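
As a rough illustration of the sliding-window and Karhunen-Loeve ideas named in the abstract, the sketch below slides a word window over a text, computes three simple lexical features per window, and projects the windows onto the leading principal component. It is not the authors' implementation: the function names (`window_features`, `top_component`, `project`), the three features, the window and step sizes, and the use of power iteration are all assumptions made for this example, and the pattern disruption step and author-level feature sets are omitted entirely.

```python
import math
import string


def window_features(text, window=20, step=10):
    """Slide a fixed-size word window over the text; one feature vector per window."""
    words = text.split()
    vectors = []
    for start in range(0, max(len(words) - window, 0) + 1, step):
        chunk = words[start:start + window]
        if not chunk:
            break  # empty input: nothing to measure
        chars = "".join(chunk)
        vectors.append([
            sum(len(w) for w in chunk) / len(chunk),                   # mean word length
            sum(c in string.punctuation for c in chars) / len(chars),  # punctuation ratio
            sum(c.isupper() for c in chars) / len(chars),              # uppercase ratio
        ])
    return vectors


def top_component(vectors, iterations=200):
    """Return (mean, leading eigenvector of the covariance matrix) via power iteration."""
    n, d = len(vectors), len(vectors[0])
    mean = [sum(v[j] for v in vectors) / n for j in range(d)]
    centered = [[v[j] - mean[j] for j in range(d)] for v in vectors]
    cov = [[sum(r[i] * r[j] for r in centered) / n for j in range(d)]
           for i in range(d)]
    vec = [1.0 / math.sqrt(d)] * d  # unit-length starting guess
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:
            break  # all windows identical: no variance to explain
        vec = [x / norm for x in nxt]
    return mean, vec


def project(vectors, mean, vec):
    """Project each centered window vector onto the component (one score per window)."""
    return [sum((v[j] - mean[j]) * vec[j] for j in range(len(vec))) for v in vectors]


if __name__ == "__main__":
    sample = "a " * 20 + "HELLO, " * 20  # toy text whose style shifts halfway through
    feats = window_features(sample)
    mean, vec = top_component(feats)
    print(project(feats, mean, vec))
```

In this toy input the first and last windows land on opposite sides of the component, which is the kind of per-window pattern a writeprint-style comparison would work with. A real system would use a far richer feature set and retain several components rather than one.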

Index Terms

1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
      1. Language resources
  2. Machine learning
    1. Learning paradigms
      1. Unsupervised learning
        Cluster analysis

Published in

ACM Transactions on Information Systems Volume 26, Issue 2
March 2008
214 pages
ISSN:1046-8188
EISSN:1558-2868
DOI:10.1145/1344411

Copyright © 2008 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 8 April 2008
- Revised: 1 May 2007
- Accepted: 1 May 2007
- Received: 1 November 2006
Author Tags
Stylometry
discourse
online text
style classification
text mining
Qualifiers
- research-article
- Research
- Refereed