ABSTRACT
For Web applications that are based on user-generated content, the detection of text quality flaws is a key concern. Our research contributes to automatic quality flaw detection. In particular, we propose to cast the detection of text quality flaws as a one-class classification problem: we are given only positive examples (i.e., texts containing a particular quality flaw) and must decide whether or not an unseen text suffers from this flaw. We argue that common binary or multiclass classification approaches are ineffective here, and we underpin our approach with a real-world application: we employ a dedicated one-class learning approach to determine whether a given Wikipedia article suffers from certain quality flaws. Since the acquisition of sensible test data is quite intricate in the Wikipedia setting, we analyze the effects of a biased sample selection. In addition, we illustrate the classifier effectiveness as a function of the flaw distribution in order to cope with the unknown (real-world) flaw-specific class imbalances. Altogether, provided test data with little noise, four of ten important quality flaws in Wikipedia can be detected with a precision close to 1.
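The one-class setting described above can be illustrated with a minimal sketch. It is not the dedicated one-class learner used in the paper: the class `CentroidOneClass`, its `quantile` parameter, and the toy two-dimensional feature vectors are all hypothetical and chosen only for illustration. The model is fitted on positive (flawed) examples alone and accepts an unseen example if it lies within a learned radius of their centroid. The helper `precision_at_prior` uses the standard identity that relates precision to the true positive rate, the false positive rate, and the assumed share of flawed texts, which is one way to look at effectiveness as a function of the flaw distribution.

```python
import math


class CentroidOneClass:
    """Toy one-class model (hypothetical): trained on positive examples only."""

    def __init__(self, quantile=0.9):
        # Fraction of training distances the acceptance radius should cover.
        self.quantile = quantile
        self.center = None
        self.radius = None

    def fit(self, positives):
        n = len(positives)
        dim = len(positives[0])
        # Centroid of the positive (flawed) examples.
        self.center = [sum(x[j] for x in positives) / n for j in range(dim)]
        # Radius = distance at the chosen quantile of the training distances.
        dists = sorted(math.dist(x, self.center) for x in positives)
        self.radius = dists[min(n - 1, int(self.quantile * n))]
        return self

    def predict(self, x):
        # True means "looks like the flawed class"; False means "outlier".
        return math.dist(x, self.center) <= self.radius


def precision_at_prior(tpr, fpr, prior):
    """Precision implied by TPR and FPR when a fraction `prior` of texts is flawed."""
    denom = tpr * prior + fpr * (1.0 - prior)
    return tpr * prior / denom if denom else 0.0


# Hypothetical feature vectors of articles tagged with one particular flaw.
flawed = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.2], [0.9, 0.8]]
model = CentroidOneClass(quantile=0.9).fit(flawed)

near = model.predict([1.05, 0.95])  # close to the training examples
far = model.predict([5.0, 5.0])     # far away from the training examples

# The same classifier yields different precision under different flaw ratios.
balanced = precision_at_prior(tpr=0.9, fpr=0.1, prior=0.5)
imbalanced = precision_at_prior(tpr=0.9, fpr=0.1, prior=0.1)
```

With a fixed true positive rate and false positive rate, the implied precision drops as flawed texts become rarer, which is why the unknown real-world class imbalance matters when reporting precision.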
Index Terms
- Detection of text quality flaws as a one-class classification problem