Skip to main content

Secure Similar Document Detection with Simhash

  • Conference paper
  • First Online:
Secure Data Management (SDM 2013)

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 8425))

Included in the following conference series:

Abstract

Similar document detection is a well-studied problem with important application domains, such as plagiarism detection, document archiving, and patent/copyright protection. Recently, the research focus has shifted towards the privacy-preserving version of the problem, in which two parties want to identify similar documents within their respective datasets. These methods apply to scenarios such as patent protection or intelligence collaboration, where the contents of the documents at both parties should be kept secret. Nevertheless, existing protocols on secure similar document detection suffer from high computational and/or communication costs, which renders them impractical for large datasets. In this work, we introduce a solution based on simhash document fingerprints, which essentially reduce the problem to a secure XOR computation between two bit vectors. Our experimental results demonstrate that the proposed method improves the computational and communication costs by at least one order of magnitude compared to the current state-of-the-art protocol. Moreover, it achieves a high level of precision and recall.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 39.99
Price excludes VAT (USA)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 54.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Notes

  1. 1.

    We follow the simhash definition of Charikar [2].

  2. 2.

    http://gmplib.org

  3. 3.

    Note that the EsPRESSo protocols are also implemented on top of group \(G\).

  4. 4.

    http://en.wikipedia.org

References

  1. Blundo, C., De Cristofaro, E., Gasti, P.: EsPRESSo: efficient privacy-preserving evaluation of sample set similarity. In: Di Pietro, R., Herranz, J., Damiani, E., State, R. (eds.) DPM 2012 and SETOP 2012. LNCS, vol. 7731, pp. 89–103. Springer, Heidelberg (2013)

    Google Scholar 

  2. Charikar, M.: Similarity estimation techniques from rounding algorithms. In: STOC, pp. 380–388 (2002)

    Google Scholar 

  3. Cramer, R., Gennaro, R., Schoenmakers, B.: A secure and optimally efficient multi-authority election scheme. Eur. Trans. Telecommun. 8(5), 481–490 (1997)

    Article  Google Scholar 

  4. Cristofaro, E.D., Gasti, P., Tsudik, G.: Fast and private computation of set intersection cardinality. IACR Cryptology ePrint Archive 2011, 141 (2011)

    Google Scholar 

  5. ElGamal, T.: A public-key cryptosystem and a signature scheme based on discrete logarithms. IEEE Trans. Inf. Theory 31(4), 469–472 (1985)

    Article  MATH  MathSciNet  Google Scholar 

  6. Huang, L., Wang, L., Li, X.: Achieving both high precision and high recall in near-duplicate detection. In: CIKM, pp. 63–72 (2008)

    Google Scholar 

  7. Jiang, W., Murugesan, M., Clifton, C., Si, L.: Similar document detection with limited information disclosure. In: ICDE, pp. 735–743 (2008)

    Google Scholar 

  8. Jiang, W., Samanthula, B.K.: N-gram based secure similar document detection. In: Li, Y. (ed.) DBSec. LNCS, vol. 6818, pp. 239–246. Springer, Heidelberg (2011)

    Google Scholar 

  9. Lindell, Y., Pinkas, B.: Secure multiparty computation for privacy-preserving data mining. J. Priv. Confidentiality 1(1), 59–98 (2009)

    Google Scholar 

  10. Manber, U.: Finding similar files in a large file system. In: USENIX Winter, pp. 1–10 (1994)

    Google Scholar 

  11. Manku, G.S., Jain, A., Sarma, A.D.: Detecting near-duplicates for web crawling. In: WWW, pp. 141–150 (2007)

    Google Scholar 

  12. Murugesan, M., Jiang, W., Clifton, C., Si, L., Vaidya, J.: Efficient privacy-preserving similar document detection. VLDB J. 19(4), 457–475 (2010)

    Article  Google Scholar 

  13. Naor, M., Pinkas, B.: Computationally secure oblivious transfer. J. Cryptology 18(1), 1–35 (2005)

    Article  MATH  MathSciNet  Google Scholar 

  14. Yao, A.C.C.: How to generate and exchange secrets. In: FOCS, pp. 162–167 (1986)

    Google Scholar 

Download references

Acknowledgments

This research has been funded by the NSF CAREER Award IIS-0845262.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Spiridon Bakiras .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Buyrukbilen, S., Bakiras, S. (2014). Secure Similar Document Detection with Simhash. In: Jonker, W., Petković, M. (eds) Secure Data Management. SDM 2013. Lecture Notes in Computer Science(), vol 8425. Springer, Cham. https://doi.org/10.1007/978-3-319-06811-4_12

Download citation

  • DOI: https://doi.org/10.1007/978-3-319-06811-4_12

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-06810-7

  • Online ISBN: 978-3-319-06811-4

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics