Secure Similar Document Detection with Simhash

Buyrukbilen, Sahin; Bakiras, Spiridon

doi:10.1007/978-3-319-06811-4_12

Sahin Buyrukbilen¹⁷ &
Spiridon Bakiras¹⁸

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 8425))

Included in the following conference series:

Workshop on Secure Data Management

1732 Accesses
10 Citations
3 Altmetric

Abstract

Similar document detection is a well-studied problem with important application domains, such as plagiarism detection, document archiving, and patent/copyright protection. Recently, the research focus has shifted towards the privacy-preserving version of the problem, in which two parties want to identify similar documents within their respective datasets. These methods apply to scenarios such as patent protection or intelligence collaboration, where the contents of the documents at both parties should be kept secret. Nevertheless, existing protocols on secure similar document detection suffer from high computational and/or communication costs, which renders them impractical for large datasets. In this work, we introduce a solution based on simhash document fingerprints, which essentially reduce the problem to a secure XOR computation between two bit vectors. Our experimental results demonstrate that the proposed method improves the computational and communication costs by at least one order of magnitude compared to the current state-of-the-art protocol. Moreover, it achieves a high level of precision and recall.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 39.99; Price excludes VAT (USA)

Softcover Book: USD 54.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Notes

1.
We follow the simhash definition of Charikar [2].
2.
http://gmplib.org
3.
Note that the EsPRESSo protocols are also implemented on top of group \(G\).
4.
http://en.wikipedia.org

References

Blundo, C., De Cristofaro, E., Gasti, P.: EsPRESSo: efficient privacy-preserving evaluation of sample set similarity. In: Di Pietro, R., Herranz, J., Damiani, E., State, R. (eds.) DPM 2012 and SETOP 2012. LNCS, vol. 7731, pp. 89–103. Springer, Heidelberg (2013)
Google Scholar
Charikar, M.: Similarity estimation techniques from rounding algorithms. In: STOC, pp. 380–388 (2002)
Google Scholar
Cramer, R., Gennaro, R., Schoenmakers, B.: A secure and optimally efficient multi-authority election scheme. Eur. Trans. Telecommun. 8(5), 481–490 (1997)
Article Google Scholar
Cristofaro, E.D., Gasti, P., Tsudik, G.: Fast and private computation of set intersection cardinality. IACR Cryptology ePrint Archive 2011, 141 (2011)
Google Scholar
ElGamal, T.: A public-key cryptosystem and a signature scheme based on discrete logarithms. IEEE Trans. Inf. Theory 31(4), 469–472 (1985)
Article MATH MathSciNet Google Scholar
Huang, L., Wang, L., Li, X.: Achieving both high precision and high recall in near-duplicate detection. In: CIKM, pp. 63–72 (2008)
Google Scholar
Jiang, W., Murugesan, M., Clifton, C., Si, L.: Similar document detection with limited information disclosure. In: ICDE, pp. 735–743 (2008)
Google Scholar
Jiang, W., Samanthula, B.K.: N-gram based secure similar document detection. In: Li, Y. (ed.) DBSec. LNCS, vol. 6818, pp. 239–246. Springer, Heidelberg (2011)
Google Scholar
Lindell, Y., Pinkas, B.: Secure multiparty computation for privacy-preserving data mining. J. Priv. Confidentiality 1(1), 59–98 (2009)
Google Scholar
Manber, U.: Finding similar files in a large file system. In: USENIX Winter, pp. 1–10 (1994)
Google Scholar
Manku, G.S., Jain, A., Sarma, A.D.: Detecting near-duplicates for web crawling. In: WWW, pp. 141–150 (2007)
Google Scholar
Murugesan, M., Jiang, W., Clifton, C., Si, L., Vaidya, J.: Efficient privacy-preserving similar document detection. VLDB J. 19(4), 457–475 (2010)
Article Google Scholar
Naor, M., Pinkas, B.: Computationally secure oblivious transfer. J. Cryptology 18(1), 1–35 (2005)
Article MATH MathSciNet Google Scholar
Yao, A.C.C.: How to generate and exchange secrets. In: FOCS, pp. 162–167 (1986)
Google Scholar

Download references

Acknowledgments

This research has been funded by the NSF CAREER Award IIS-0845262.

Author information

Authors and Affiliations

The Graduate Center, City University of New York, New York, USA
Sahin Buyrukbilen
John Jay College, City University of New York, New York, USA
Spiridon Bakiras

Authors

Sahin Buyrukbilen
View author publications
You can also search for this author in PubMed Google Scholar
Spiridon Bakiras
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Spiridon Bakiras .

Editor information

Editors and Affiliations

EIT ICT Labs/University of Twente, Enschede, The Netherlands
Willem Jonker
Philips Research/Eindhoven University of Technology, Eindhoven, The Netherlands
Milan Petković

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Buyrukbilen, S., Bakiras, S. (2014). Secure Similar Document Detection with Simhash. In: Jonker, W., Petković, M. (eds) Secure Data Management. SDM 2013. Lecture Notes in Computer Science(), vol 8425. Springer, Cham. https://doi.org/10.1007/978-3-319-06811-4_12

Download citation

DOI: https://doi.org/10.1007/978-3-319-06811-4_12
Published: 15 May 2014
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-06810-7
Online ISBN: 978-3-319-06811-4
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics