ABSTRACT
The web contains a great many documents that are content-equivalent, that is, informationally redundant with respect to each other. The presence of such mutually redundant documents in search results can degrade the user search experience. Previous attempts to address this issue, most notably the TREC novelty track, were characterized by difficulties with accuracy and evaluation. In this paper we explore syntactic techniques, particularly document fingerprinting, for detecting content equivalence. Using these techniques on the TREC GOV1 and GOV2 corpora revealed a high degree of redundancy; a user study confirmed that our metrics were accurately identifying content equivalence. We show, moreover, that content-equivalent documents have a significant effect on the search experience: we found that 16.6% of all relevant documents in runs submitted to the TREC 2004 terabyte track were redundant.
Index Terms
- Redundant documents and search effectiveness