DOI: 10.1145/1367497.1367516
research-article

Efficient similarity joins for near duplicate detection

Published: 21 April 2008

ABSTRACT

With the increasing amount of data and the need to integrate data from multiple sources, finding near duplicate records efficiently is a challenging problem. In this paper, we focus on efficient algorithms that find all pairs of records whose similarity is above a given threshold. Several existing algorithms rely on the prefix filtering principle to avoid computing similarity values for all possible pairs of records. We propose new filtering techniques that exploit ordering information; they are integrated into the existing methods, drastically reduce the number of candidate pairs, and hence improve efficiency. Experimental results show that our proposed algorithms achieve speed-ups of 2.6x to 5x over previous algorithms on several real datasets and provide alternative solutions to the near duplicate Web page detection problem.
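
To make the prefix filtering principle mentioned above concrete, the following is a minimal sketch of a self-join built on the standard prefix filter. It is not the paper's algorithm: the abstract does not name a similarity function, so Jaccard similarity over token sets is assumed here, and the paper's new ordering-based filters are not reproduced. The names `jaccard` and `prefix_filter_join` are hypothetical, chosen for illustration only.

```python
import math
from collections import Counter


def jaccard(a, b):
    """Jaccard similarity of two token collections (assumed measure)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def prefix_filter_join(records, threshold):
    """Return {(i, j): sim} for all i < j with Jaccard(records[i], records[j]) >= threshold.

    Illustrative sketch of the standard prefix filter only.
    """
    # Fix one global token order: rarest tokens first, ties broken by token value.
    freq = Counter(tok for rec in records for tok in set(rec))
    ordered = [sorted(set(rec), key=lambda tok: (freq[tok], tok)) for rec in records]

    index = {}    # token -> ids of earlier records whose prefix contains the token
    results = {}
    for rid, tokens in enumerate(ordered):
        if not tokens:
            continue  # empty records are skipped in this sketch
        # A pair with Jaccard >= t shares at least ceil(t * |x|) tokens, so the two
        # prefixes of length |x| - ceil(t * |x|) + 1 must share at least one token.
        # The small epsilon guards against floating-point round-up in the product.
        min_overlap = math.ceil(threshold * len(tokens) - 1e-9)
        prefix = tokens[: len(tokens) - min_overlap + 1]

        # Candidate generation: only earlier records that share a prefix token.
        candidates = set()
        for tok in prefix:
            candidates.update(index.get(tok, ()))

        # Verification: compute the exact similarity for the surviving candidates.
        for other in candidates:
            sim = jaccard(tokens, ordered[other])
            if sim >= threshold:
                results[(other, rid)] = sim

        # Index this record's prefix so that later records can find it.
        for tok in prefix:
            index.setdefault(tok, []).append(rid)
    return results
```

Calling `prefix_filter_join(records, 0.6)` on a list of token lists returns the qualifying pairs keyed by record position. Only records that share a prefix token are ever compared, which is what lets this family of algorithms avoid computing similarity for all possible pairs.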


          Published in

          WWW '08: Proceedings of the 17th international conference on World Wide Web
          April 2008, 1326 pages
          ISBN: 9781605580852
          DOI: 10.1145/1367497

          Copyright © 2008 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

          Publisher

          Association for Computing Machinery

          New York, NY, United States


          Acceptance Rates

          Overall acceptance rate: 1,899 of 8,196 submissions, 23%
