Writeprints: A stylometric approach to identity-level identification and similarity detection in cyberspace

Published: 08 April 2008

Abstract

One of the problems often associated with online anonymity is that it hinders social accountability, as substantiated by the high levels of cybercrime. Although identity cues are scarce in cyberspace, individuals often leave behind textual identity traces. In this study we proposed the use of stylometric analysis techniques to help identify individuals based on writing style. We incorporated a rich set of stylistic features, including lexical, syntactic, structural, content-specific, and idiosyncratic attributes. We also developed the Writeprints technique for identification and similarity detection of anonymous identities. Writeprints is a Karhunen-Loeve transforms-based technique that uses a sliding window and pattern disruption algorithm with individual author-level feature sets. The Writeprints technique and extended feature set were evaluated on a testbed encompassing four online datasets spanning different domains: email, instant messaging, feedback comments, and program code. Writeprints outperformed benchmark techniques, including SVM, Ensemble SVM, PCA, and standard Karhunen-Loeve transforms, on the identification and similarity detection tasks with accuracy as high as 94% when differentiating between 100 authors. The extended feature set also significantly outperformed a baseline set of features commonly used in previous research. Furthermore, individual-author-level feature sets generally outperformed use of a single group of attributes.
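The abstract names the ingredients of Writeprints but not its details, so the following is only a minimal sketch of two of them: sliding-window feature extraction and a Karhunen-Loeve (principal component) projection of the resulting vectors. It is not the authors' implementation. The function names, the window size and step, and the tiny feature list (average word length, punctuation rate, a few function-word rates) are all assumptions made for illustration, and the pattern disruption step is omitted because the abstract does not describe it.

```python
import math
import string

# Illustrative function-word list; an assumption, NOT the paper's feature set.
FUNCTION_WORDS = ("the", "of", "and", "to", "a", "in")


def window_features(text, size=40, step=20):
    """Slide a word window over the text; return one feature vector per window."""
    words = text.lower().split()
    if not words:
        return []
    vectors = []
    last_start = max(len(words) - size, 0)
    for start in range(0, last_start + 1, step):
        raw = words[start:start + size]
        tokens = [w.strip(string.punctuation) for w in raw]
        n = len(raw)
        avg_len = sum(len(t) for t in tokens) / n
        punct_rate = sum(ch in string.punctuation for w in raw for ch in w) / n
        fw_rates = [tokens.count(f) / n for f in FUNCTION_WORDS]
        vectors.append([avg_len, punct_rate] + fw_rates)
    return vectors


def mean_and_covariance(rows):
    """Return the feature means and the sample covariance matrix of the rows."""
    n, d = len(rows), len(rows[0])
    if n < 2:
        raise ValueError("need at least two windows to estimate covariance")
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - mean[j] for j in range(d)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    return mean, cov


def top_eigenvectors(cov, k, iters=200):
    """Approximate the k leading eigenvectors by power iteration with deflation."""
    d = len(cov)
    mat = [row[:] for row in cov]
    basis = []
    for _ in range(k):
        v = [1.0 / math.sqrt(d)] * d
        for _ in range(iters):
            w = [sum(mat[i][j] * v[j] for j in range(d)) for i in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:  # no variance left in the deflated matrix
                break
            v = [x / norm for x in w]
        lam = sum(v[i] * sum(mat[i][j] * v[j] for j in range(d)) for i in range(d))
        basis.append(v)
        # Deflate: remove the component just found before searching for the next.
        mat = [[mat[i][j] - lam * v[i] * v[j] for j in range(d)] for i in range(d)]
    return basis


def kl_transform(rows, k=2):
    """Project each window's feature vector onto the k leading eigenvectors."""
    mean, cov = mean_and_covariance(rows)
    basis = top_eigenvectors(cov, k)
    projected = [[sum((r[j] - mean[j]) * b[j] for j in range(len(mean)))
                  for b in basis] for r in rows]
    return basis, projected


if __name__ == "__main__":
    sample = " ".join(["the", "quick", "fox,"] * 34)
    _, points = kl_transform(window_features(sample), k=2)
    for point in points:
        print(point)
```

Each window becomes one point in the reduced space, so a text yields a trail of points rather than a single vector. In practice a library eigen-solver would replace the hand-written power iteration, which is used here only to keep the sketch dependency-free.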

      • Published in

        ACM Transactions on Information Systems  Volume 26, Issue 2
        March 2008
        214 pages
        ISSN: 1046-8188
        EISSN: 1558-2868
        DOI: 10.1145/1344411

        Copyright © 2008 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 8 April 2008
        • Revised: 1 May 2007
        • Accepted: 1 May 2007
        • Received: 1 November 2006
        Qualifiers

        • research-article
        • Research
        • Refereed
