DOI: 10.1145/2063576.2063954
poster

Detection of text quality flaws as a one-class classification problem

Published: 24 October 2011

ABSTRACT

For Web applications based on user-generated content, the detection of text quality flaws is a key concern. Our research contributes to automatic quality flaw detection. In particular, we propose to cast the detection of text quality flaws as a one-class classification problem: we are given only positive examples (i.e., texts containing a particular quality flaw) and must decide whether or not an unseen text suffers from this flaw. We argue that common binary or multiclass classification approaches are ineffective here, and we underpin our approach with a real-world application: we employ a dedicated one-class learning approach to determine whether a given Wikipedia article suffers from certain quality flaws. Since the acquisition of sensible test data is quite intricate in the Wikipedia setting, we analyze the effects of a biased sample selection. In addition, we illustrate the classifier effectiveness as a function of the flaw distribution in order to cope with the unknown (real-world) flaw-specific class imbalances. Altogether, provided that the test data contain little noise, four out of ten important quality flaws in Wikipedia can be detected with a precision close to 1.
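
The one-class setting described in the abstract can be illustrated with a minimal sketch. It is not the dedicated one-class learner used in the paper, which the abstract does not specify; a simple centroid-and-radius model stands in for it. The two features (reference count, article length in hundreds of words), the sample values, and all function names are assumptions made purely for illustration. The second part shows why precision depends on the flaw distribution: for fixed true-positive and false-positive rates, precision drops as the flaw becomes rarer.

```python
import math
from statistics import mean, pstdev


def _distance(x, mu, sigma):
    """Euclidean distance to the centroid after per-feature standardization."""
    return math.sqrt(sum(((xi - m) / s) ** 2 for xi, m, s in zip(x, mu, sigma)))


def fit_one_class(positives, quantile=0.9):
    """Fit a centroid model from positive (flawed) examples only.

    Assumption: a stand-in for a real one-class learner, not the paper's method.
    """
    dims = len(positives[0])
    mu = [mean(x[d] for x in positives) for d in range(dims)]
    # Replace a zero standard deviation by 1.0 to avoid division by zero.
    sigma = [pstdev([x[d] for x in positives]) or 1.0 for d in range(dims)]
    dists = sorted(_distance(x, mu, sigma) for x in positives)
    # The radius is the distance at the chosen quantile of the training set.
    idx = min(len(dists) - 1, int(quantile * len(dists)))
    return {"mu": mu, "sigma": sigma, "radius": dists[idx]}


def has_flaw(model, x):
    """Predict the flaw if x lies within the learned radius of the centroid."""
    return _distance(x, model["mu"], model["sigma"]) <= model["radius"]


def precision_at_prior(tpr, fpr, prior):
    """Precision implied by fixed TPR/FPR when a fraction `prior` is flawed."""
    tp = tpr * prior
    fp = fpr * (1.0 - prior)
    return tp / (tp + fp) if tp + fp > 0 else 0.0


# Hypothetical training set: (reference count, length in hundreds of words)
# for articles tagged with an illustrative "too few references" flaw.
positives = [(0, 3.0), (1, 4.5), (0, 2.0), (1, 5.0), (0, 3.5), (0, 4.0)]
model = fit_one_class(positives)

print(has_flaw(model, (0, 3.2)))   # resembles the flawed training articles
print(has_flaw(model, (25, 4.0)))  # many references, far from the centroid

# Same classifier quality, different flaw distributions.
for prior in (0.5, 0.2, 0.05):
    print(prior, round(precision_at_prior(0.9, 0.1, prior), 3))
```

No negative examples are needed to fit the model, which is the defining property of the one-class formulation. The trade-off is that the acceptance threshold must be chosen from the positives alone, here via the hypothetical `quantile` parameter.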


Published in

CIKM '11: Proceedings of the 20th ACM international conference on Information and knowledge management
October 2011
2712 pages
ISBN: 9781450307178
DOI: 10.1145/2063576

Copyright © 2011 ACM

Publisher: Association for Computing Machinery, New York, NY, United States


Acceptance Rates

Overall Acceptance Rate: 1,861 of 8,427 submissions, 22%
