Compact Features for Detection of Near-Duplicates in Distributed Retrieval

Bernstein, Yaniv; Shokouhi, Milad; Zobel, Justin

doi:10.1007/11880561_10

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 4209)

Abstract

In distributed information retrieval, answers from separate collections are combined into a single result set. However, the collections may overlap. The fact that the collections are distributed means that it is not in general feasible to prune duplicate and near-duplicate documents at index time. In this paper we introduce and analyze the grainy hash vector, a compact document representation that can be used to efficiently prune duplicate and near-duplicate documents from result lists. We demonstrate that, for a modest bandwidth and computational cost, many near-duplicates can be accurately removed from result lists produced by a cooperative distributed information retrieval system.
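The abstract does not say how a grainy hash vector is built, so the following is only a rough sketch of one plausible reading: several short ("grainy") min-hash values are packed into a single machine word, and two documents are compared by counting the slots in which their vectors agree. Every name and parameter here (`shingles`, `grainy_vector`, `matching_slots`, the slot count `n`, the slot width `w`, the shingle length `k`, and the choice of BLAKE2 as the hash) is an assumption made for illustration and is not taken from the paper.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of overlapping k-word chunks in the text (assumed chunking)."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _hash(chunk, seed):
    """64-bit hash of a chunk; the seed selects one of n hash functions."""
    data = f"{seed}:{chunk}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def grainy_vector(text, n=8, w=4, k=3):
    """Pack n min-hash values, each truncated to w bits, into one integer."""
    chunks = shingles(text, k)
    mask = (1 << w) - 1
    vec = 0
    for i in range(n):
        smallest = min(_hash(c, i) for c in chunks)  # min-hash for function i
        vec |= (smallest & mask) << (i * w)          # keep only the low w bits
    return vec


def matching_slots(a, b, n=8, w=4):
    """Count the w-bit slots in which two packed vectors are identical."""
    diff = a ^ b
    mask = (1 << w) - 1
    return sum(1 for i in range(n) if (diff >> (i * w)) & mask == 0)


doc_a = "answers from separate collections are combined into a single result set"
doc_b = "answers from separate collections are merged into a single result set"
va, vb = grainy_vector(doc_a), grainy_vector(doc_b)
print(matching_slots(va, va), matching_slots(va, vb))
```

With `n=8` and `w=4` each document costs 32 bits in a result list, which is the kind of modest bandwidth the abstract refers to. Because each slot is only `w` bits wide, unrelated documents will agree in some slots by chance, so a real system would need a match threshold chosen with that in mind.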

Author information

Authors and Affiliations

School of Computer Science and Information Technology, RMIT University, Melbourne, Australia
Yaniv Bernstein, Milad Shokouhi & Justin Zobel

Editor information

Editors and Affiliations

Department of Computer and Information Science, University of Strathclyde, Scotland
Fabio Crestani
Dipartimento di Informatica, University of Pisa, Largo B. Pontecorvo 3, 56127, Pisa, Italy
Paolo Ferragina
Department of Information Studies, University of Sheffield, Sheffield, UK
Mark Sanderson

Cite this paper

Bernstein, Y., Shokouhi, M., Zobel, J. (2006). Compact Features for Detection of Near-Duplicates in Distributed Retrieval. In: Crestani, F., Ferragina, P., Sanderson, M. (eds) String Processing and Information Retrieval. SPIRE 2006. Lecture Notes in Computer Science, vol 4209. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11880561_10

DOI: https://doi.org/10.1007/11880561_10
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-45774-9
Online ISBN: 978-3-540-45775-6
eBook Packages: Computer Science, Computer Science (R0)