Effective and Scalable Authorship Attribution Using Function Words

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3689)

Abstract

Techniques for identifying the author of an unattributed document can be applied to problems in information analysis and in academic scholarship. A range of methods has been proposed in the research literature, using a variety of features and machine learning approaches, but these methods have been tested on very different data, so their results cannot be compared. It is not even clear whether the differences in performance are due to feature selection or to other variables. In this paper we examine the use of a large, publicly available collection of newswire articles as a benchmark for comparing authorship attribution methods. To demonstrate the value of having a benchmark, we experimentally compare several recent feature-based techniques for authorship attribution and test how well these methods perform as the volume of data is increased. We show that the benchmark clearly distinguishes between different approaches, and that the scalability of the best methods based on function-word features is acceptable, with only a moderate decline as the difficulty of the problem is increased.
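To make the idea of function-word features concrete, the following is a minimal sketch and not the method evaluated in the paper. It assumes a tiny, hypothetical function-word list (`FUNCTION_WORDS`), represents each document by the relative frequency of those words, and attributes an unknown document to the author of the nearest labelled document under Euclidean distance. The abstract does not specify the paper's word list, features, or classifiers, so every name and choice below is illustrative only.

```python
from collections import Counter
import math
import re

# Hypothetical, deliberately tiny function-word list (an assumption for
# illustration; the abstract does not give the list used in the paper).
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "it", "but", "on"]


def function_word_vector(text):
    """Return the relative frequency of each function word in `text`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty input
    return [counts[word] / total for word in FUNCTION_WORDS]


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def attribute(unknown_text, labelled_texts):
    """Return the author of the labelled document closest to `unknown_text`.

    `labelled_texts` is a list of (author, text) pairs. Nearest-neighbour
    matching is an assumed stand-in for whatever classifier is actually used.
    """
    target = function_word_vector(unknown_text)
    best_author, _ = min(
        ((author, euclidean(target, function_word_vector(text)))
         for author, text in labelled_texts),
        key=lambda pair: pair[1],
    )
    return best_author
```

For example, given `training = [("A", "the cat sat on the mat and the dog sat on the rug"), ("B", "it is said that it was a time of change but it passed")]`, calling `attribute("the bird sat on the fence and the fox sat on the wall", training)` selects author `A`, because the two texts use the listed function words at the same rates. Content words such as "cat" or "bird" are ignored, which is the point of restricting the features to function words.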

Copyright information

© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Zhao, Y., Zobel, J. (2005). Effective and Scalable Authorship Attribution Using Function Words. In: Lee, G.G., Yamada, A., Meng, H., Myaeng, S.H. (eds) Information Retrieval Technology. AIRS 2005. Lecture Notes in Computer Science, vol 3689. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11562382_14

  • DOI: https://doi.org/10.1007/11562382_14

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-29186-2

  • Online ISBN: 978-3-540-32001-2

  • eBook Packages: Computer Science, Computer Science (R0)
