Abstract
In distributed information retrieval, answers from separate collections are combined into a single result set. However, the collections may overlap. The fact that the collections are distributed means that it is not in general feasible to prune duplicate and near-duplicate documents at index time. In this paper we introduce and analyze the grainy hash vector, a compact document representation that can be used to efficiently prune duplicate and near-duplicate documents from result lists. We demonstrate that, for a modest bandwidth and computational cost, many near-duplicates can be accurately removed from result lists produced by a cooperative distributed information retrieval system.
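The abstract does not say how the grainy hash vector is built, so the following is only a generic sketch of the idea of pruning merged result lists with a compact per-document feature. It assumes a min-hash-style construction in which a few small hashes are packed into one integer. The function names, the 8 slots of 4 bits, the match threshold, and the choice of BLAKE2b are all illustrative assumptions, not details from the paper.

```python
import hashlib


def _seeded_hash(token: str, seed: int) -> int:
    """64-bit hash of a token under a given seed (illustrative choice of hash)."""
    digest = hashlib.blake2b(
        token.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def compact_vector(text: str, num_hashes: int = 8, bits: int = 4) -> int:
    """Pack num_hashes small min-hash values into one integer.

    Assumption: each slot holds the low `bits` bits of the minimum seeded
    hash over the document's distinct lowercase tokens.
    """
    tokens = set(text.lower().split())
    if not tokens:
        return 0
    mask = (1 << bits) - 1
    vector = 0
    for i in range(num_hashes):
        min_hash = min(_seeded_hash(t, i) for t in tokens)
        vector |= (min_hash & mask) << (i * bits)
    return vector


def matching_slots(a: int, b: int, num_hashes: int = 8, bits: int = 4) -> int:
    """Count the slots in which two packed vectors agree exactly."""
    mask = (1 << bits) - 1
    diff = a ^ b
    return sum(1 for i in range(num_hashes) if (diff >> (i * bits)) & mask == 0)


def prune(results, threshold: int = 7, num_hashes: int = 8, bits: int = 4):
    """Keep a result only if it matches no earlier kept result.

    `results` is a ranked list of (doc_id, vector) pairs merged from several
    collections; two results are treated as near-duplicates when at least
    `threshold` slots agree (a hypothetical cut-off).
    """
    kept = []
    for doc_id, vector in results:
        if all(
            matching_slots(vector, kept_vec, num_hashes, bits) < threshold
            for _, kept_vec in kept
        ):
            kept.append((doc_id, vector))
    return kept
```

In a cooperative setting, each collection could return such a vector alongside every result, so the merging broker compares small integers rather than fetching and comparing full documents. The slot width trades bandwidth against the rate of accidental matches.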
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Bernstein, Y., Shokouhi, M., Zobel, J. (2006). Compact Features for Detection of Near-Duplicates in Distributed Retrieval. In: Crestani, F., Ferragina, P., Sanderson, M. (eds) String Processing and Information Retrieval. SPIRE 2006. Lecture Notes in Computer Science, vol 4209. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11880561_10
DOI: https://doi.org/10.1007/11880561_10
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-45774-9
Online ISBN: 978-3-540-45775-6
eBook Packages: Computer Science, Computer Science (R0)