Does size matter? Authorship attribution, small samples, big problem

Eder, Maciej

doi:10.1093/llc/fqt066

Abstract

This study seeks the minimal text sample size for authorship attribution that yields stable results independent of random noise. Several controlled tests for different sample lengths, languages, and genres are discussed and compared. Depending on the corpus used, the minimal sample length varied from 2,500 words (Latin prose) to about 5,000 words (in most cases, including English, German, Polish, and Hungarian novels). A further observation concerns the method of sampling: contrary to common sense, randomly excerpted ‘bags of words’ proved much more effective than the classical solution, i.e. using original sequences of words (‘passages’) of the desired size. Although the tests were performed using the Delta method (Burrows, J.F. (2002). ‘Delta’: a measure of stylistic difference and a guide to likely authorship. Literary and Linguistic Computing, 17(3): 267–87) applied to the most frequent words, additional experiments were conducted for support vector machines and k-NN applied to most frequent words, character 3-grams, character 4-grams, and part-of-speech-tag 3-grams. Despite significant differences in overall attribution success rates between particular methods and/or style markers, the minimal amount of textual data needed for reliable authorship attribution turned out to be method-independent.
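The two sampling strategies contrasted in the abstract, and a Delta-style distance over most frequent words, might be sketched as follows. This is a minimal illustration rather than the study's actual procedure: the function names are hypothetical, the bag of words is assumed to be drawn without replacement, and the population standard deviation is an assumed choice that the abstract does not specify.

```python
import random
from collections import Counter
from statistics import mean


def passage_sample(tokens, size, rng):
    """'Passage': `size` consecutive tokens starting at a random offset."""
    start = rng.randrange(len(tokens) - size + 1)
    return tokens[start:start + size]


def bag_of_words_sample(tokens, size, rng):
    """'Bag of words': `size` tokens drawn at random from the whole text.

    Assumption: drawn without replacement; word order is discarded.
    """
    return rng.sample(tokens, size)


def most_frequent_words(corpus_tokens, n):
    """The n most frequent words across the pooled corpus tokens."""
    return [word for word, _ in Counter(corpus_tokens).most_common(n)]


def relative_freqs(tokens, vocab):
    """Relative frequency of each vocabulary word in one sample."""
    counts = Counter(tokens)
    return [counts[word] / len(tokens) for word in vocab]


def delta(freqs_a, freqs_b, means, stdevs):
    """Mean absolute difference between the z-scores of two samples.

    `means` and `stdevs` are per-word corpus statistics; words whose
    standard deviation is zero are skipped to avoid division by zero.
    """
    diffs = [
        abs((a - m) / s - (b - m) / s)
        for a, b, m, s in zip(freqs_a, freqs_b, means, stdevs)
        if s > 0
    ]
    return mean(diffs)
```

In an experiment of the kind the abstract describes, one would draw samples of increasing size with each strategy, compute the distance from every sample to each candidate author's profile, and record how often the nearest profile belongs to the true author.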
