research-article

Duplicate detection in web shops using LSH to reduce the number of computations

Authors:
Iris van Dam, Erasmus University Rotterdam, DR Rotterdam, The Netherlands
Gerhard van Ginkel, Erasmus University Rotterdam, DR Rotterdam, The Netherlands
Wim Kuipers, Erasmus University Rotterdam, DR Rotterdam, The Netherlands
Nikki Nijenhuis, Erasmus University Rotterdam, DR Rotterdam, The Netherlands
Damir Vandic, Erasmus University Rotterdam, DR Rotterdam, The Netherlands
Flavius Frasincar, Erasmus University Rotterdam, DR Rotterdam, The Netherlands

SAC '16: Proceedings of the 31st Annual ACM Symposium on Applied Computing, April 2016, Pages 772–779, https://doi.org/10.1145/2851613.2851861

Published: 04 April 2016

ABSTRACT

The number of online shops is growing daily, and many Web shops focus on the same product types, such as consumer electronics. Because Web shops use different product representations, it is hard to compare products across shops. Duplicate detection methods aim to solve this problem by identifying the same products in different Web shops. In this paper, we focus on reducing the computation time of a state-of-the-art duplicate detection algorithm. First, we construct uniform vector representations for the products. We use these vectors as input for a Locality Sensitive Hashing (LSH) algorithm, which pre-selects potential duplicates. Finally, duplicate products are found by applying the Multi-component Similarity Method (MSM) to the pre-selected pairs. Compared to the original MSM, the number of needed computations can be reduced by 95% with only a minor decrease of 9% in the F₁-measure.
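The pre-selection idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the vector representation, the hash family, or the banding parameters, so this sketch assumes each product is a set of tokens, uses min-hashing with MD5-derived hash values, and splits each signature into bands. The names `minhash_signature`, `candidate_pairs`, and `jaccard`, as well as the sample products, are hypothetical. Plain Jaccard similarity stands in for the final comparison step; MSM itself is not reproduced here.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def _hash(seed, token):
    """Deterministic 64-bit hash of a token under a given seed (assumed hash family)."""
    digest = hashlib.md5(f"{seed}:{token}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(tokens, num_hashes):
    """Min-hash signature: for each seed, the smallest hash over the token set."""
    return [min(_hash(seed, tok) for tok in tokens) for seed in range(num_hashes)]


def candidate_pairs(products, num_hashes=24, bands=12):
    """Return product pairs that share a bucket in at least one band."""
    rows = num_hashes // bands
    buckets = defaultdict(set)
    for pid, tokens in products.items():
        sig = minhash_signature(tokens, num_hashes)
        for b in range(bands):
            # Products whose signatures agree on every row of a band collide.
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets[key].add(pid)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def jaccard(a, b):
    """Stand-in similarity for the expensive final comparison (not MSM)."""
    return len(a & b) / len(a | b)


# Hypothetical product descriptions, already reduced to token sets.
products = {
    "shopA-1": {"samsung", "40inch", "led", "1080p", "un40"},
    "shopB-7": {"samsung", "40inch", "led", "1080p", "un40"},
    "shopC-3": {"philips", "toaster", "2slice", "white"},
}

pairs = candidate_pairs(products)
all_pairs = len(products) * (len(products) - 1) // 2
duplicates = {p for p in pairs if jaccard(products[p[0]], products[p[1]]) >= 0.8}
print(f"compared {len(pairs)} of {all_pairs} pairs; duplicates: {sorted(duplicates)}")
```

Only the pairs returned by `candidate_pairs` reach the expensive similarity step, which is where the saving in computations comes from. More bands with fewer rows each admit more candidates (fewer missed duplicates, more comparisons), and the reverse holds for fewer bands.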

Index Terms

1. Information systems
  1. Data management systems
    1. Information integration
      1. Deduplication
      2. Entity resolution
  2. Information retrieval
    1. Retrieval tasks and goals
      1. Clustering and classification

Published in
SAC '16: Proceedings of the 31st Annual ACM Symposium on Applied Computing
April 2016
2360 pages
ISBN: 9781450337397
DOI: 10.1145/2851613
Conference Chair:
Sascha Ossowski
University Rey Juan Carlos, Spain
Copyright © 2016 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
duplicate detection
locality-sensitive hashing
web shop products
Acceptance Rates
SAC '16 Paper Acceptance Rate: 252 of 1,047 submissions, 24%
Overall Acceptance Rate: 1,650 of 6,669 submissions, 25%