Abstract
One of the problems often associated with online anonymity is that it hinders social accountability, as substantiated by the high levels of cybercrime. Although identity cues are scarce in cyberspace, individuals often leave behind textual identity traces. In this study we proposed the use of stylometric analysis techniques to help identify individuals based on writing style. We incorporated a rich set of stylistic features, including lexical, syntactic, structural, content-specific, and idiosyncratic attributes. We also developed the Writeprints technique for identification and similarity detection of anonymous identities. Writeprints is a Karhunen-Loeve transform-based technique that uses a sliding window and a pattern disruption algorithm with individual author-level feature sets. The Writeprints technique and extended feature set were evaluated on a testbed encompassing four online datasets spanning different domains: email, instant messaging, feedback comments, and program code. Writeprints outperformed benchmark techniques, including SVM, Ensemble SVM, PCA, and standard Karhunen-Loeve transforms, on the identification and similarity detection tasks, with accuracy as high as 94% when differentiating between 100 authors. The extended feature set also significantly outperformed a baseline set of features commonly used in previous research. Furthermore, individual-author-level feature sets generally outperformed the use of a single group of attributes.
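The abstract names a sliding window combined with a Karhunen-Loeve transform. As a rough illustration of that general idea only, the sketch below slides a window over a text, turns each window into a small feature vector, and projects the vectors onto the leading eigenvector of their covariance matrix. Everything specific here is an assumption made for illustration and is not taken from the paper: the toy character-frequency features in `FEATURE_CHARS`, the window size and step, the function names, and the use of power iteration to find the eigenvector. The pattern disruption step and the author-level feature selection are not shown.

```python
import math
from collections import Counter

# Hypothetical toy feature set (an assumption, not the paper's features):
# relative frequencies of a few common characters.
FEATURE_CHARS = "etaoin ,."


def sliding_windows(text, size, step):
    """Return overlapping windows of `size` characters, one every `step` characters."""
    last_start = max(len(text) - size, 0)
    return [text[i:i + size] for i in range(0, last_start + 1, step)]


def features(window):
    """Map a window to relative frequencies of the characters in FEATURE_CHARS."""
    counts = Counter(window.lower())
    n = max(len(window), 1)
    return [counts[c] / n for c in FEATURE_CHARS]


def covariance(rows):
    """Return the column means and the sample covariance matrix of `rows`."""
    n, d = len(rows), len(rows[0])
    if n < 2:
        raise ValueError("need at least two windows to estimate covariance")
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - mean[j] for j in range(d)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    return mean, cov


def top_eigenvector(matrix, iterations=200):
    """Approximate the leading eigenvector of a symmetric matrix by power iteration."""
    d = len(matrix)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # all windows identical: no variance to explain
            break
        v = [x / norm for x in w]
    return v


def project(rows, mean, basis):
    """Project mean-centered rows onto a single basis vector."""
    return [sum((r[j] - mean[j]) * basis[j] for j in range(len(basis)))
            for r in rows]


if __name__ == "__main__":
    sample = "It was late, and the forum was quiet. Nobody replied to the post. " * 5
    windows = sliding_windows(sample, size=60, step=30)
    rows = [features(w) for w in windows]
    mean, cov = covariance(rows)
    basis = top_eigenvector(cov)
    for score in project(rows, mean, basis):
        print(round(score, 4))
```

Each printed value is one window's coordinate along the first principal direction. A fuller version would keep several eigenvectors and a far richer feature set; a library eigen-solver would normally replace the hand-written power iteration.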
Index Terms
- Writeprints: A stylometric approach to identity-level identification and similarity detection in cyberspace