Article

Redundant documents and search effectiveness

Authors:
Yaniv Bernstein, RMIT University, Melbourne, Australia
Justin Zobel, RMIT University, Melbourne, Australia

CIKM '05: Proceedings of the 14th ACM international conference on Information and knowledge management, October 2005, Pages 736–743
https://doi.org/10.1145/1099554.1099733

Published: 31 October 2005

ABSTRACT

The web contains a great many documents that are content-equivalent, that is, informationally redundant with respect to each other. The presence of such mutually redundant documents in search results can degrade the user search experience. Previous attempts to address this issue, most notably the TREC novelty track, were characterized by difficulties with accuracy and evaluation. In this paper we explore syntactic techniques, particularly document fingerprinting, for detecting content equivalence. Using these techniques on the TREC GOV1 and GOV2 corpora revealed a high degree of redundancy; a user study confirmed that our metrics were accurately identifying content equivalence. We show, moreover, that content-equivalent documents have a significant effect on the search experience: we found that 16.6% of all relevant documents in runs submitted to the TREC 2004 terabyte track were redundant.
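
To make the idea of document fingerprinting concrete, the following is a minimal, generic sketch and not the method evaluated in the paper. It assumes word trigrams as chunks, MD5 as the chunk hash, an optional "keep hashes divisible by `modulus`" selection rule, and Jaccard overlap as the similarity score; the function names and parameters are hypothetical and chosen only for illustration.

```python
import hashlib
import re


def chunks(text, size=3):
    """Return the set of overlapping word n-grams (chunks) in text."""
    # Lowercase and strip punctuation so trivial formatting changes are ignored.
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < size:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + size]) for i in range(len(words) - size + 1)}


def fingerprint(text, size=3, modulus=1):
    """Hash every chunk and keep the hashes divisible by modulus.

    modulus=1 keeps every hash; larger values keep roughly 1/modulus of them,
    which shrinks the fingerprint at the cost of some accuracy.
    """
    selected = set()
    for chunk in chunks(text, size):
        digest = hashlib.md5(chunk.encode("utf-8")).digest()
        value = int.from_bytes(digest[:8], "big")
        if value % modulus == 0:
            selected.add(value)
    return selected


def resemblance(a, b):
    """Jaccard overlap of two fingerprints, in the range [0, 1]."""
    union = a | b
    if not union:
        return 0.0  # no evidence either way, so do not report a match
    return len(a & b) / len(union)


doc_a = "The quick brown fox jumps over the lazy dog"
doc_b = "the quick brown fox jumps over the lazy dog!"
doc_c = "An entirely different sentence about search engines"

print(resemblance(fingerprint(doc_a), fingerprint(doc_b)))
print(resemblance(fingerprint(doc_a), fingerprint(doc_c)))
```

In a sketch like this, two documents would be treated as candidates for content equivalence when their resemblance exceeds some threshold; the threshold, chunk size, and selection rule are all tuning choices that this example does not take from the paper.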

Index Terms

Redundant documents and search effectiveness
1. Information systems
  1. Information retrieval
    1. Retrieval models and ranking
    2. Retrieval tasks and goals
      1. Clustering and classification
  2. Information systems applications
    1. Data mining
      1. Clustering

Published in
CIKM '05: Proceedings of the 14th ACM international conference on Information and knowledge management
October 2005
854 pages
ISBN:1595931406
DOI:10.1145/1099554
General Chair:
Otthein Herzog, University of Bremen, Germany
Program Chairs:
Hans-Jörg Schek, University for Health Sciences, Medical Informatics and Technology, Austria
Norbert Fuhr, University of Duisburg-Essen, Germany
Abdur Chowdhury, America Online, USA
Wilfried Teiken, IBM T.J. Watson Research Center, USA
Copyright © 2005 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
duplicate detection
novelty
search effectiveness
Qualifiers
- Article
Acceptance Rates
CIKM '05 paper acceptance rate: 77 of 425 submissions (18%)
Overall acceptance rate: 1,861 of 8,427 submissions (22%)