DOI: 10.1145/2851613.2851861
research-article

Duplicate detection in web shops using LSH to reduce the number of computations

Published: 04 April 2016

ABSTRACT

The number of online shops is growing daily, and many Web shops focus on the same product types, such as consumer electronics. Since Web shops use different product representations, it is hard to compare products across different Web shops. Duplicate detection methods aim to solve this problem by identifying the same products in different Web shops. In this paper, we focus on reducing the computation time of a state-of-the-art duplicate detection algorithm. First, we construct uniform vector representations for the products. We use these vectors as input for a Locality Sensitive Hashing (LSH) algorithm, which pre-selects potential duplicates. Finally, duplicate products are found by applying the Multi-component Similarity Method (MSM). Compared to the original MSM, the number of needed computations can be reduced by 95% with only a minor decrease of 9% in the F1-measure.
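The pre-selection step described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how the product vectors are built or which LSH family is used, so everything below is an assumption made for illustration: products are represented by the set of tokens in their titles (a binary vector over a vocabulary), signatures are computed with MinHash, and candidates are found with the common banding scheme. The names `tokens`, `build_vocabulary`, `minhash_signature`, `lsh_candidate_pairs`, and `preselect`, as well as the parameters `bands`, `rows`, and `seed`, are hypothetical. MSM itself is not implemented here; it would be applied only to the returned candidate pairs.

```python
import random
import re
from collections import defaultdict
from itertools import combinations

PRIME = 2_147_483_647  # Mersenne prime for the (assumed) hash family h(x) = (a*x + b) mod p


def tokens(title):
    """Lower-cased alphanumeric tokens of a product title (assumed feature set)."""
    return set(re.findall(r"[a-z0-9]+", title.lower()))


def build_vocabulary(token_sets):
    """Map every distinct token to a column index of the binary product vector."""
    return {tok: i for i, tok in enumerate(sorted(set().union(*token_sets)))}


def minhash_signature(indices, hash_params):
    """One min-hash value per (a, b) pair, taken over the non-zero vector positions.

    Assumes `indices` is non-empty, i.e. every title has at least one token.
    """
    return [min((a * i + b) % PRIME for i in indices) for a, b in hash_params]


def lsh_candidate_pairs(signatures, bands, rows):
    """Return index pairs whose signatures agree on every row of at least one band."""
    candidates = set()
    for band in range(bands):
        buckets = defaultdict(list)
        for idx, sig in enumerate(signatures):
            key = tuple(sig[band * rows:(band + 1) * rows])
            buckets[key].append(idx)
        for members in buckets.values():
            # Products sharing a bucket become candidate duplicates.
            candidates.update(combinations(members, 2))
    return candidates


def preselect(titles, bands=20, rows=2, seed=0):
    """Pre-select potential duplicate pairs; a detailed method such as MSM would score them."""
    rng = random.Random(seed)
    hash_params = [(rng.randrange(1, PRIME), rng.randrange(PRIME))
                   for _ in range(bands * rows)]
    token_sets = [tokens(t) for t in titles]
    vocab = build_vocabulary(token_sets)
    signatures = [minhash_signature([vocab[t] for t in ts], hash_params)
                  for ts in token_sets]
    return lsh_candidate_pairs(signatures, bands, rows)
```

For `n` products, an exhaustive comparison needs `n * (n - 1) / 2` similarity computations, whereas with pre-selection only `len(preselect(titles))` pairs are passed on. Increasing `rows` or decreasing `bands` makes the filter stricter, which trades fewer computations against a higher chance of missing true duplicates; this mirrors the trade-off between computation count and F1-measure reported in the abstract.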


        • Published in

          SAC '16: Proceedings of the 31st Annual ACM Symposium on Applied Computing
          April 2016
          2360 pages
          ISBN:9781450337397
          DOI:10.1145/2851613

          Copyright © 2016 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Acceptance Rates

          SAC '16 Paper Acceptance Rate: 252 of 1,047 submissions, 24%
          Overall Acceptance Rate: 1,650 of 6,669 submissions, 25%
