poster

Crowdsourcing a wikipedia vandalism corpus

Author:
Martin Potthast

Bauhaus-Universität Weimar, Weimar, Germany


SIGIR '10: Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval, July 2010, Pages 789–790
https://doi.org/10.1145/1835449.1835617

Published: 19 July 2010

ABSTRACT

We report on the construction of the PAN Wikipedia vandalism corpus, PAN-WVC-10, using Amazon's Mechanical Turk. The corpus comprises 32452 edits on 28468 Wikipedia articles, among which 2391 vandalism edits have been identified. 753 human annotators cast a total of 193022 votes on the edits, so that each edit was reviewed by at least 3 annotators; the achieved level of agreement was then analyzed in order to label each edit as "regular" or "vandalism." The corpus is available free of charge.
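To make the labeling step concrete, the following is a minimal sketch of how annotator votes could be aggregated into one label per edit. The abstract does not state the agreement rule, so the two-thirds threshold, the function name `label_edit`, and the choice to return `None` for undecided edits are all assumptions made for illustration and not the procedure used for PAN-WVC-10. The constants at the top only restate the figures from the abstract and derive simple ratios from them.

```python
from collections import Counter
from fractions import Fraction

# Figures reported in the abstract.
NUM_EDITS = 32452
NUM_VANDALISM = 2391
NUM_VOTES = 193022

# Derived ratios (plain arithmetic on the reported figures).
VANDALISM_SHARE = NUM_VANDALISM / NUM_EDITS   # roughly 7.4% of edits
VOTES_PER_EDIT = NUM_VOTES / NUM_EDITS        # roughly 5.9 votes per edit


def label_edit(votes, min_votes=3, min_agreement=Fraction(2, 3)):
    """Aggregate annotator votes for one edit.

    votes: list of strings, each "regular" or "vandalism".
    min_votes: minimum number of reviews (3, as stated in the abstract).
    min_agreement: ASSUMED threshold; the abstract does not specify one.

    Returns the majority label if agreement is high enough, else None,
    meaning the edit would need further review.
    """
    if len(votes) < min_votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    if Fraction(count, len(votes)) >= min_agreement:
        return label
    return None


if __name__ == "__main__":
    print(label_edit(["vandalism", "vandalism", "regular"]))
    print(label_edit(["regular", "vandalism", "regular", "vandalism"]))
    print(f"{VANDALISM_SHARE:.1%} vandalism, {VOTES_PER_EDIT:.2f} votes per edit")
```

Under this assumed rule, an edit with a split vote stays unlabeled until more annotators review it, which is one plausible reason the average number of votes per edit exceeds the minimum of 3.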

Index Terms

1. Information systems
  1. Information retrieval
    1. Evaluation of retrieval results

Published in
SIGIR '10: Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval
July 2010
944 pages
ISBN: 9781450301534
DOI: 10.1145/1835449
General Chairs:
Fabio Crestani, University of Lugano, CH
Stéphane Marchand-Maillet, University of Geneva, CH
Program Chairs:
Hsin-Hsi Chen, National Taiwan University, TW
Efthimis N. Efthimiadis, University of Washington, USA
Jacques Savoy, University of Neuchatel, CH
Copyright © 2010 Copyright is held by the owner/author(s)
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
corpus
evaluation
vandalism detection
wikipedia
Acceptance Rates
SIGIR '10 Paper Acceptance Rate: 87 of 520 submissions, 17%
Overall Acceptance Rate: 792 of 3,983 submissions, 20%