ABSTRACT
With the increasing amount of data and the need to integrate data from multiple data sources, a challenging issue is to find near duplicate records efficiently. In this paper, we focus on efficient algorithms to find pairs of records such that their similarities are above a given threshold. Several existing algorithms rely on the prefix filtering principle to avoid computing similarity values for all possible pairs of records. We propose new filtering techniques by exploiting the ordering information; they are integrated into the existing methods and drastically reduce the candidate sizes and hence improve the efficiency. Experimental results show that our proposed algorithms can achieve up to 2.6x - 5x speed-up over previous algorithms on several real datasets and provide alternative solutions to the near duplicate Web page detection problem.
- A. Arasu, V. Ganti, and R. Kaushik. Efficient exact set-similarity joins. In VLDB, 2006. Google ScholarDigital Library
- R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addison Wesley, 1st edition edition, May 1999. Google ScholarDigital Library
- R. J. Bayardo, Y. Ma, and R. Srikant. Scaling up all pairs similarity search. In WWW, 2007. Google ScholarDigital Library
- M. Bilenko, R. J. Mooney, W. W. Cohen, P. Ravikumar, and S. E. Fienberg. Adaptive name matching in information integration. IEEE Intelligent Sys., 18(5):16--23, 2003. Google ScholarDigital Library
- A. Z. Broder. On the resemblance and containment of documents. In SEQS, 1997. Google ScholarDigital Library
- A. Z. Broder, S. C. Glassman, M. S. Manasse, and G. Zweig. Syntactic clustering of the web. Computer Networks, 29(8-13):1157--1166, 1997. Google ScholarDigital Library
- M. Charikar. Similarity estimation techniques from rounding algorithms. In STOC, 2002. Google ScholarDigital Library
- S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive operator for similarity joins in data cleaning. In ICDE, 2006. Google ScholarDigital Library
- J. Cho, N. Shivakumar, and H. Garcia-Molina. Finding replicated web collections. In SIGMOD, 2000. Google ScholarDigital Library
- A. Chowdhury, O. Frieder, D. A. Grossman, and M. C. McCabe. Collection statistics for fast duplicate document detection. ACM Trans. Inf. Syst., 20(2):171--191, 2002. Google ScholarDigital Library
- J. G. Conrad, X. S. Guo, and C. P. Schriber. Online duplicate document detection: signature reliability in a dynamic retrieval environment. In CIKM, 2003. Google ScholarDigital Library
- A. K. Elmagarmid, P. G. Ipeirotis, and V. S. Verykios. Duplicate record detection: A survey. TKDE, 19(1):1--16, 2007. Google ScholarDigital Library
- R. Fagin, R. Kumar, and D. Sivakumar. Efficient similarity search and classification via rank aggregation. In SIGMOD, 2003. Google ScholarDigital Library
- D. Fetterly, M. Manasse, and M. Najork. On the evolution of clusters of near-duplicate web pages. In LA-WEB, 2003. Google ScholarDigital Library
- D. Gibson, R. Kumar, and A. Tomkins. Discovering large dense subgraphs in massive graphs. In VLDB, 2005. Google ScholarDigital Library
- A. Gionis, P. Indyk, and R. Motwani. Similarity search in high dimensions via hashing. In VLDB, 1999. Google ScholarDigital Library
- L. Gravano, P. G. Ipeirotis, H. V. Jagadish, N. Koudas, S. Muthukrishnan, and D. Srivastava. Approximate string joins in a database (almost) for free. In VLDB, 2001. Google ScholarDigital Library
- M. R. Henzinger. Finding near-duplicate web pages: a large-scale evaluation of algorithms. In SIGIR, 2006. Google ScholarDigital Library
- M. A. Hernández and S. J. Stolfo. Real-world data is dirty: Data cleansing and the merge/purge problem. Data Mining and Knowledge Discovery, 2(1):9--37, 1998. Google ScholarDigital Library
- T. C. Hoad and J. Zobel. Methods for identifying versioned and plagiarized documents. JASIST, 54(3):203--215, 2003. Google ScholarDigital Library
- P. Indyk and R. Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In STOC, 1998. Google ScholarDigital Library
- R. C. Russell. Index, U.S. patent 1,261,167, April 1918.Google Scholar
- S. Sarawagi and A. Bhamidipaty. Interactive deduplication using active learning. In KDD, 2002. Google ScholarDigital Library
- S. Sarawagi and A. Kirpal. Efficient set joins on similarity predicates. In SIGMOD, 2004. Google ScholarDigital Library
- E. Spertus, M. Sahami, and O. Buyukkokten. Evaluating similarity measures: a large-scale study in the orkut social network. In KDD, 2005. Google ScholarDigital Library
- E. Ukkonen. On approximate string matching. In FCT, 1983. Google ScholarDigital Library
- W. E. Winkler. The state of record linkage and current research problems. Technical report, U.S. Bureau of the Census, 1999.Google Scholar
Index Terms
- Efficient similarity joins for near duplicate detection
Recommendations
Efficient similarity joins for near-duplicate detection
With the increasing amount of data and the need to integrate data from multiple data sources, one of the challenging issues is to identify near-duplicate records efficiently. In this article, we focus on efficient algorithms to find a pair of records ...
Near duplicate detection in an academic digital library
DocEng '13: Proceedings of the 2013 ACM symposium on Document engineeringThe detection and potential removal of duplicates is desirable for a number of reasons, such as to reduce the need for unnecessary storage and computation, and to provide users with uncluttered search results. This paper describes an investigation into ...
Efficient duplicate record detection based on similarity estimation
WAIM'10: Proceedings of the 11th international conference on Web-age information managementIn information integration systems, duplicate records bring problems in data processing and analysis. To represent the similarity between two records from different data sources with different schema, the optimal bipartite graph matching is adopted on ...
Comments