DOI: 10.1145/1099554.1099733
Article

Redundant documents and search effectiveness

Published: 31 October 2005

ABSTRACT

The web contains a great many documents that are content-equivalent, that is, informationally redundant with respect to each other. The presence of such mutually redundant documents in search results can degrade the user search experience. Previous attempts to address this issue, most notably the TREC novelty track, were characterized by difficulties with accuracy and evaluation. In this paper we explore syntactic techniques, particularly document fingerprinting, for detecting content equivalence. Applying these techniques to the TREC GOV1 and GOV2 corpora revealed a high degree of redundancy; a user study confirmed that our metrics accurately identified content equivalence. We show, moreover, that content-equivalent documents have a significant effect on the search experience: 16.6% of all relevant documents in runs submitted to the TREC 2004 terabyte track were redundant.
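
As a rough illustration of what syntactic fingerprinting can look like, the sketch below hashes overlapping word n-grams and compares two documents by the overlap of their hash sets. It is not the method evaluated in the paper: the chunk length, the use of a truncated MD5 digest as a convenient stable hash, the Jaccard-style score, the 0.9 threshold, and the helper names `chunks`, `fingerprint`, `resemblance`, and `drop_redundant` are all assumptions made for this example.

```python
import hashlib
import re


def chunks(text, n=3):
    """Split text into overlapping word n-grams after simple normalisation."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < n:
        # Very short text: treat the whole text as a single chunk.
        return [" ".join(words)] if words else []
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def fingerprint(text, n=3):
    """Return the set of hashed chunks (the document's fingerprint)."""
    # MD5 is used only as a stable hash here, not for security.
    return {hashlib.md5(c.encode("utf-8")).hexdigest()[:16]
            for c in chunks(text, n)}


def resemblance(a, b, n=3):
    """Jaccard overlap of two fingerprints, between 0.0 and 1.0."""
    fa, fb = fingerprint(a, n), fingerprint(b, n)
    if not fa and not fb:
        return 1.0
    return len(fa & fb) / len(fa | fb)


def drop_redundant(ranked, threshold=0.9, n=3):
    """Keep a result only if it is not a near-copy of one ranked above it."""
    kept = []
    for doc in ranked:
        if all(resemblance(doc, k, n) < threshold for k in kept):
            kept.append(doc)
    return kept


if __name__ == "__main__":
    results = [
        "The quick brown fox jumps.",
        "the quick  brown fox JUMPS",
        "An unrelated page about retrieval evaluation",
    ]
    for doc in drop_redundant(results):
        print(doc)
```

Filtering a ranked list this way removes documents that differ from a higher-ranked result only in case, spacing, or punctuation. A production system would likely need a more selective chunking strategy and an index over fingerprints to avoid comparing every pair of documents.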

Published in

CIKM '05: Proceedings of the 14th ACM international conference on Information and knowledge management
October 2005
854 pages
ISBN: 1595931406
DOI: 10.1145/1099554

Copyright © 2005 ACM

Publisher

Association for Computing Machinery, New York, NY, United States

Acceptance Rates

CIKM '05 paper acceptance rate: 77 of 425 submissions, 18%
Overall acceptance rate: 1,861 of 8,427 submissions, 22%
