ABSTRACT
We report on the construction of the PAN Wikipedia vandalism corpus, PAN-WVC-10, using Amazon's Mechanical Turk. The corpus comprises 32,452 edits on 28,468 Wikipedia articles, among which 2,391 vandalism edits have been identified. A total of 753 human annotators cast 193,022 votes on the edits, so that each edit was reviewed by at least 3 annotators; the achieved level of agreement was then analyzed in order to label each edit as "regular" or "vandalism." The corpus is available free of charge.
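The agreement-based labeling described above can be illustrated with a minimal sketch. The abstract states only that each edit received at least 3 votes and that the level of agreement determined the label; the two-thirds threshold, the function name `label_edit`, and its parameters are assumptions made for illustration, not the procedure confirmed by the authors.

```python
from collections import Counter


def label_edit(votes, min_votes=3, agree_num=2, agree_den=3):
    """Return the agreed label for one edit, or None if undecided.

    votes      -- list of annotator votes, each "regular" or "vandalism"
    min_votes  -- minimum number of votes required (3, per the abstract)
    agree_num / agree_den -- required agreement ratio (2/3 is an assumption)
    """
    if len(votes) < min_votes:
        return None  # not enough reviews yet

    # Find the most frequent label and how many annotators chose it.
    label, count = Counter(votes).most_common(1)[0]

    # Integer comparison avoids floating-point issues at the threshold.
    if count * agree_den >= agree_num * len(votes):
        return label
    return None  # agreement too low; more votes would be needed


if __name__ == "__main__":
    print(label_edit(["vandalism", "vandalism", "regular"]))
    print(label_edit(["regular", "vandalism", "regular", "vandalism"]))
```

In this sketch an edit with insufficient agreement stays unlabeled (`None`), which leaves room for collecting further votes before a final decision is made.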
Index Terms
- Crowdsourcing a wikipedia vandalism corpus