ABSTRACT
The number of online shops is growing daily, and many Web shops focus on the same product types, such as consumer electronics. Because Web shops use different product representations, it is hard to compare products across different Web shops. Duplicate detection methods aim to solve this problem by identifying the same products in different Web shops. In this paper, we focus on reducing the computation time of a state-of-the-art duplicate detection algorithm. First, we construct uniform vector representations for the products. We use these vectors as input for a Locality Sensitive Hashing (LSH) algorithm, which pre-selects potential duplicates. Finally, duplicate products are found by applying the Multi-component Similarity Method (MSM) to the pre-selected candidates. Compared to the original MSM, the number of needed computations can be reduced by 95% with only a minor decrease of 9% in the F1-measure.
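The pre-selection step can be illustrated with a small sketch. The abstract does not say which LSH family or parameters are used, so everything below is an assumption made for illustration: products are represented as sets of tokens, signatures are built by min-hashing, and candidates are found by banding. The function names `minhash_signature` and `lsh_candidate_pairs`, the seeded CRC32 hash, and the defaults `num_hashes=24` and `bands=12` are hypothetical and not taken from the paper.

```python
import zlib
from collections import defaultdict
from itertools import combinations


def minhash_signature(tokens, num_hashes):
    """Return one min-hash value per seeded hash function for a token set."""
    # Assumption: a seeded CRC32 stands in for a family of hash functions.
    return [
        min(zlib.crc32(f"{seed}:{tok}".encode()) for tok in tokens)
        for seed in range(num_hashes)
    ]


def lsh_candidate_pairs(products, num_hashes=24, bands=12):
    """Pre-select candidate duplicate pairs with banded min-hash LSH.

    products: dict mapping a product id to a non-empty set of tokens.
    Two products become a candidate pair if their signatures agree
    on every row of at least one band.
    """
    if num_hashes % bands != 0:
        raise ValueError("num_hashes must be divisible by bands")
    rows = num_hashes // bands
    buckets = defaultdict(set)
    for pid, tokens in products.items():
        sig = minhash_signature(tokens, num_hashes)
        for b in range(bands):
            # Products sharing a band slice land in the same bucket.
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets[key].add(pid)
    candidates = set()
    for ids in buckets.values():
        for a, b in combinations(sorted(ids), 2):
            candidates.add((a, b))
    return candidates


# Hypothetical product descriptions from three shops.
products = {
    "shopA:1": {"samsung", "40inch", "1080p", "led"},
    "shopB:7": {"samsung", "40inch", "1080p", "led"},
    "shopC:3": {"philips", "32inch", "720p"},
}
pairs = lsh_candidate_pairs(products)
```

Products with identical token sets always share every bucket, so `("shopA:1", "shopB:7")` is guaranteed to be in `pairs`. In a pipeline like the one described, only the pairs returned here would be passed to the more expensive similarity method, which is where the reduction in computations would come from. Using more bands with fewer rows each makes the filter more permissive, while fewer bands with more rows each prunes more aggressively at the risk of missing true duplicates.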
Index Terms
- Duplicate detection in web shops using LSH to reduce the number of computations