research-article

Efficient similarity joins for near duplicate detection

Authors:
Chuan Xiao (University of New South Wales, Kensington, NSW, Australia)
Wei Wang (University of New South Wales, Kensington, NSW, Australia)
Xuemin Lin (University of New South Wales, Kensington, NSW, Australia)
Jeffrey Xu Yu (Chinese University of Hong Kong, Hong Kong, China)

WWW '08: Proceedings of the 17th international conference on World Wide Web, April 2008, Pages 131–140, https://doi.org/10.1145/1367497.1367516

Published: 21 April 2008

ABSTRACT

With the increasing amount of data and the need to integrate data from multiple data sources, a challenging issue is to find near duplicate records efficiently. In this paper, we focus on efficient algorithms to find pairs of records whose similarity is above a given threshold. Several existing algorithms rely on the prefix filtering principle to avoid computing similarity values for all possible pairs of records. We propose new filtering techniques that exploit the ordering information; they are integrated into the existing methods and drastically reduce the candidate sizes, and hence improve the efficiency. Experimental results show that our proposed algorithms can achieve speed-ups of up to 2.6x to 5x over previous algorithms on several real datasets and provide alternative solutions to the near duplicate Web page detection problem.
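The prefix filtering principle mentioned in the abstract can be illustrated with a minimal Python sketch. This is not the authors' algorithm and does not include the new ordering-based filters the paper proposes; it only shows the baseline idea, assuming records are lists of string tokens, similarity is Jaccard, and tokens are ordered globally by increasing document frequency. The function names `jaccard` and `prefix_filter_join`, the sample records, and the threshold 0.7 are all hypothetical choices made for illustration.

```python
import math
from collections import Counter, defaultdict


def jaccard(a, b):
    """Jaccard similarity of two token collections."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a or b) else 1.0


def prefix_filter_join(records, t):
    """Return sorted (i, j, sim) triples with i < j and Jaccard >= t.

    Assumes 0 < t <= 1. Only pairs that share a token in their prefixes
    are verified, instead of comparing all possible pairs.
    """
    # Global token order: rare tokens first, ties broken alphabetically.
    df = Counter(tok for r in records for tok in set(r))
    canon = [sorted(set(r), key=lambda tok: (df[tok], tok)) for r in records]

    index = defaultdict(list)  # token -> ids of records with it in their prefix
    results = []
    for i, x in enumerate(canon):
        if not x:
            continue  # empty records are skipped in this sketch
        # Prefix length |x| - ceil(t * |x|) + 1; epsilon guards float error.
        p = len(x) - math.ceil(t * len(x) - 1e-9) + 1
        prefix = x[:p]

        # Candidates: earlier records sharing at least one prefix token.
        candidates = set()
        for tok in prefix:
            candidates.update(index[tok])

        # Verification: compute the exact similarity for candidates only.
        for j in candidates:
            sim = jaccard(x, canon[j])
            if sim >= t:
                results.append((j, i, sim))

        for tok in prefix:
            index[tok].append(i)
    return sorted(results)


records = [
    "efficient similarity joins for near duplicate detection".split(),
    "efficient similarity join for near duplicate detection".split(),
    "scaling up all pairs similarity search".split(),
]
pairs = prefix_filter_join(records, 0.7)
print(pairs)  # [(0, 1, 0.75)]
```

In this example the third record is never compared against the first two, because its prefix shares no token with theirs. The sketch works because two records whose Jaccard similarity reaches the threshold must overlap in more tokens than either record's suffix can hold, so at least one shared token falls in both prefixes.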

Index Terms

Efficient similarity joins for near duplicate detection
1. Information systems
  1. Information retrieval
    1. Information retrieval query processing
    2. Retrieval tasks and goals
      1. Clustering and classification
  2. Information systems applications
    1. Data mining
      1. Clustering

Published in
WWW '08: Proceedings of the 17th international conference on World Wide Web
April 2008
1326 pages
ISBN: 9781605580852
DOI: 10.1145/1367497
General Chairs:
Jinpeng Huai (Beihang University, China)
Robin Chen (AT&T Labs, USA)
Hsiao-Wuen Hon (Microsoft Research Asia, China)
Yunhao Liu (HK University of Science and Technology, Hong Kong)
Program Chairs:
Wei-Ying Ma (Microsoft Research Asia, China)
Andrew Tomkins (Yahoo! Research, USA)
Xiaodong Zhang (The Ohio State University, USA)
Copyright © 2008 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
near duplicate detection
similarity join