Abstract
Plagiarism is regarded as highly unethical in the academic world. Text alignment is currently the preferred technique for estimating how similar a document is to existing written works. Because it depends on comparison with other documents, it becomes increasingly tedious and time-consuming as the number of online and offline documents grows. This paper therefore studies the use of stylometric features present in a document to verify its authorship. Two machine learning algorithms, k-NN and SMO, were used to predict the authenticity of the writings. A computer program that computes 446 features was implemented. The corpus consisted of ten PhD theses split into segments of 1000, 5000 and 10000 words, for a total of 520 documents. Our results show that stylometry-based authorship attribution achieved an accuracy above 90 %, except for 7-NN with 1000-word segments. We also show how authorship attribution can be used to identify potential cases of plagiarism in formal writing.
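The pipeline summarised in the abstract (segment the text, compute stylometric features, classify with k-NN) can be illustrated with a minimal sketch. This is not the authors' program: the 446 features are not listed here, so the handful of features below, the `FUNCTION_WORDS` list and all function names are assumptions chosen for illustration, and SMO is left out.

```python
import math
import re
from collections import Counter

# Hypothetical, tiny function-word list; the paper's 446 features are not given here.
FUNCTION_WORDS = ("the", "of", "and", "to", "in", "that", "is", "it")


def split_into_segments(text, segment_size):
    """Split text into consecutive segments of segment_size words.

    A trailing partial segment is dropped (an assumption of this sketch).
    """
    words = text.split()
    return [
        " ".join(words[i:i + segment_size])
        for i in range(0, len(words) - segment_size + 1, segment_size)
    ]


def stylometric_features(text):
    """Return a small, illustrative stylometric feature vector."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_words = max(len(words), 1)
    counts = Counter(words)
    features = [
        sum(len(w) for w in words) / n_words,  # average word length
        n_words / max(len(sentences), 1),      # average sentence length in words
        len(counts) / n_words,                 # type-token ratio
    ]
    # Relative frequency of each function word.
    features += [counts[w] / n_words for w in FUNCTION_WORDS]
    return features


def knn_predict(train_vectors, train_labels, query, k=3):
    """Majority vote among the k training vectors nearest to query (Euclidean)."""
    distances = sorted(
        (math.dist(vec, query), label)
        for vec, label in zip(train_vectors, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]
```

In use, each thesis would be segmented with `split_into_segments`, each segment mapped to a vector with `stylometric_features`, and an unseen segment attributed to an author with `knn_predict`. The features here are left unscaled for brevity; with features on very different ranges, normalising them before computing distances would usually be advisable.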
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Ramnial, H., Panchoo, S., Pudaruth, S. (2016). Authorship Attribution Using Stylometry and Machine Learning Techniques. In: Berretti, S., Thampi, S., Srivastava, P. (eds) Intelligent Systems Technologies and Applications. Advances in Intelligent Systems and Computing, vol 384. Springer, Cham. https://doi.org/10.1007/978-3-319-23036-8_10
DOI: https://doi.org/10.1007/978-3-319-23036-8_10
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-23035-1
Online ISBN: 978-3-319-23036-8
eBook Packages: Engineering (R0)