poster

Detection of text quality flaws as a one-class classification problem

Authors:
Maik Anderka, Bauhaus-Universität Weimar, Weimar, Germany
Benno Stein, Bauhaus-Universität Weimar, Weimar, Germany
Nedim Lipka, Bauhaus-Universität Weimar, Weimar, Germany

CIKM '11: Proceedings of the 20th ACM international conference on Information and knowledge management, October 2011, Pages 2313–2316
https://doi.org/10.1145/2063576.2063954

Published: 24 October 2011

ABSTRACT

For Web applications based on user-generated content, detecting text quality flaws is a key concern. Our research contributes to automatic quality flaw detection. In particular, we propose to cast the detection of text quality flaws as a one-class classification problem: we are given only positive examples (texts containing a particular quality flaw) and must decide whether an unseen text suffers from this flaw. We argue that common binary or multiclass classification approaches are ineffective here, and we underpin our approach with a real-world application: we employ a dedicated one-class learning approach to determine whether a given Wikipedia article suffers from certain quality flaws. Since acquiring sensible test data in the Wikipedia setting is quite intricate, we analyze the effects of biased sample selection. In addition, we illustrate classifier effectiveness as a function of the flaw distribution in order to cope with the unknown (real-world) flaw-specific class imbalances. Altogether, given test data with little noise, four of ten important quality flaws in Wikipedia can be detected with a precision close to 1.
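To make the one-class setting concrete, the following is a minimal sketch, not the dedicated one-class learning approach used in the paper. It assumes each text has already been mapped to a numeric feature vector (the features, the class name `CentroidOneClass`, and the function `precision_at_prior` are hypothetical). The classifier is fitted on positive (flawed) examples only, and the helper shows how precision changes with the assumed share of flawed texts for fixed true-positive and false-positive rates.

```python
import math
from typing import List, Sequence


class CentroidOneClass:
    """Toy one-class classifier (illustrative assumption, not the paper's method).

    It learns the centroid of the positive examples and accepts an unseen
    vector if it lies within a radius that covers a chosen fraction of the
    training examples.
    """

    def __init__(self, quantile: float = 0.9) -> None:
        self.quantile = quantile
        self.centroid: List[float] = []
        self.radius = 0.0

    def _distance(self, x: Sequence[float]) -> float:
        # Euclidean distance to the learned centroid.
        return math.dist(x, self.centroid)

    def fit(self, positives: Sequence[Sequence[float]]) -> "CentroidOneClass":
        n = len(positives)
        dims = len(positives[0])
        self.centroid = [sum(x[d] for x in positives) / n for d in range(dims)]
        distances = sorted(self._distance(x) for x in positives)
        # Radius = distance of the training example at the chosen quantile.
        index = min(n - 1, max(0, math.ceil(self.quantile * n) - 1))
        self.radius = distances[index]
        return self

    def predict(self, x: Sequence[float]) -> bool:
        # True means "looks like the positive (flawed) class".
        return self._distance(x) <= self.radius


def precision_at_prior(tpr: float, fpr: float, prior: float) -> float:
    """Precision implied by Bayes' rule for a given share of flawed texts."""
    denominator = tpr * prior + fpr * (1.0 - prior)
    return tpr * prior / denominator if denominator > 0 else 0.0


if __name__ == "__main__":
    # Hypothetical 2-dimensional feature vectors of flawed texts.
    flawed = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.0]]
    model = CentroidOneClass(quantile=1.0).fit(flawed)
    print(model.predict([1.0, 1.0]), model.predict([5.0, 5.0]))
    for prior in (0.5, 0.1, 0.01):
        print(prior, round(precision_at_prior(0.8, 0.1, prior), 3))
```

With the true-positive and false-positive rates held fixed, precision drops as the flaw becomes rarer, which is why reporting effectiveness over a range of flaw distributions is informative when the real-world class imbalance is unknown.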

Index Terms

Detection of text quality flaws as a one-class classification problem
1. Human-centered computing
  1. Collaborative and social computing
    1. Collaborative and social computing design and evaluation methods
2. Information systems
  1. Information retrieval

Published in
CIKM '11: Proceedings of the 20th ACM international conference on Information and knowledge management
October 2011
2712 pages
ISBN: 9781450307178
DOI: 10.1145/2063576
Editors: Bettina Berendt; Arjen de Vries; Wenfei Fan; Craig Macdonald (University of Glasgow, UK); Iadh Ounis (University of Glasgow, UK); Ian Ruthven (University of Strathclyde, UK)
Copyright © 2011 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from ACM.
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
information quality
one-class classification
text quality flaw detection
wikipedia
Qualifiers
- poster
Conference

Acceptance Rates
Overall Acceptance Rate: 1,861 of 8,427 submissions, 22%