1 Introduction

In the setting of secure two-party computation, two parties with private inputs wish to jointly compute some function of their inputs while preserving certain security properties like privacy, correctness and more. The standard definition [7, 12, 21, 34] formalizes security by comparing the execution of such a protocol to an “ideal execution” where a trusted third party computes the function for the parties. Specifically, in the ideal world the parties just send their inputs over perfectly secure communication lines to a trusted party, who then computes the function honestly and sends the output to the designated party. Then, a real protocol is said to be secure if no adversary can do more harm in a real protocol execution than in an ideal one (where by definition no harm can be done). This way of defining security is very appealing and has many important advantages; for example, protocols proven secure in this way remain secure under sequential modular composition [12]. We call this definition simulation-based security because protocols are proven secure by simulating a real execution while running in the ideal model.

Secure two-party computation has been extensively studied, and it has been demonstrated that any polynomial-time two-party computation can be generically compiled into a secure function evaluation protocol with polynomial complexity [22, 23, 46]. These results apply in various settings, considering semi-honest and malicious adversaries. In the semi-honest setting corrupted parties follow the protocol instructions (but still try to gain additional private information), whereas malicious players follow an arbitrary strategy. However, more often than not, the resulting protocols are inefficient for practical uses, in part because they are general and so do not utilize any specific properties of the problem at hand. Hence attention has been given to constructing efficient protocols for specific functions. This approach has proved quite successful for the semi-honest setting, while the malicious setting typically remained impractical (a notable exception is [4]).

In this paper we consider the following fundamental search problem: Alice holds a text \(t\in\{0,1\}^{n}\) of length n and Bob is given a pattern (i.e., a search word) \(p\in\{0,1\}^{m}\) of length m, where the sizes of t and p are mutually known. The goal is for Bob to only learn all the locations in the text that match the pattern, while Alice learns nothing about the pattern. This problem has been widely studied for decades due to its potential applications for text retrieval, music retrieval, computational biology, data mining, network security, and many more. The best-known application in the context of privacy is in comparing two DNA strings; the following example is taken from [20]. Consider the case of a hospital holding a DNA database of all the participants in a research study, and a researcher wanting to determine the frequency of the occurrence of a specific gene. This is a classical pattern matching application, which is, however, complicated by privacy considerations. The hospital may be forbidden from releasing the DNA records to a third party. Likewise, the researcher may not want to reveal what specific gene he is working on, nor trust the hospital to perform the search correctly. The importance of this application is further illustrated in [5].

In an insecure setting this problem can be solved with linear time complexity. Nevertheless, most of the existing solutions do not attempt to achieve any level of security; see [2, 9, 10, 31, 35, 36] for just a few examples. In this work, we focus our attention on the secure computation of the basic pattern matching problem and several important variants of it. This paper is an extended version of [26]. The primary new contribution of this version is the design of special-purpose zero-knowledge proofs that make it possible to reduce the communication complexity of our protocols for pattern matching with wildcards and approximate pattern matching discussed below.

1.1 Our Contribution

We present secure solutions for the following problems in the plain model under the DDH hardness assumption with simulation-based security in the presence of malicious adversaries. The security proofs of our protocols can be easily extended to the UC framework [13] as well. Our constructions achieve efficiency that is a significant improvement on the current state of the art; see a concrete analysis below. Throughout this paper we measure computation by the number of exponentiations and the number of group multiplications (by default we mean the former), and communication by the number of group elements exchanged within the protocol. In more detail:

  • Secure Pattern Matching. We develop an efficient, constant-round protocol for this problem that requires O(n+m) exponentiations and bandwidth of O(n+m) group elements. Our protocol lays the foundations for other important variants of pattern matching which are described next.

  • Secure Pattern Matching with Wildcards. This problem is a known variant of the classic problem where Bob (who holds the pattern) introduces a new “don’t care” character to its alphabet, denoted by ⋆ (or a wildcard). The goal is for Bob to learn all the locations in the text that match the pattern, where ⋆ matches any character in the text. This problem has been widely looked at by researchers with the aim of generalizing the basic searching model to searching with errors. This variant is known as pattern matching with don’t cares and can be solved in an insecure setting in O(n+m) time [28]. In this paper, we develop a protocol that computes this functionality with O(n+m) communication and O(nm) computation costs. The core idea of our solution is to proceed as in the above solution with two exceptions: Bob must supply the wildcard positions in encrypted form, and the substrings of Alice’s text must be modified to ensure that they will match the pattern at those positions. Ensuring correct behavior requires further modification of the protocol; see Sect. 4 for the complete description of the protocol.

  • Secure Approximate Pattern Matching. In this problem the goal is for Bob to find the text locations where the Hamming distance of each text substring and the pattern is less than some threshold τ≤m. This problem is an extension of pattern matching with don’t cares due to the fact that Bob is able to learn all the matches within some error bound instead of learning the matches for specified error locations. An important application of this problem is secure face recognition [38]. The best algorithm for solving this problem in an insecure setting is the solution by Amir et al. [3], which runs in \(O(n \sqrt{\tau\log\tau})\) time. We design a protocol for this problem with O() communication and O(nm) computation costs. The main idea behind our construction is to have the parties securely compute the (encrypted) Hamming distance for each text position. See Sect. 5 for further details.

  • Secure Pattern Matching with hidden pattern/text length. Finally, we consider two variants with an additional security requirement of hiding the input lengths using padding of a special character. For public upper bounds on the lengths, M≥m and N≥n, the solutions for these problems require O(n+M) communication and exponentiations, and O(N+m) communication and exponentiations, respectively.

    Note that in the semi-honest setting the length of the pattern can remain hidden by letting Bob run all the computations locally and then engage with Alice in a comparison phase. Nevertheless, this task is particularly challenging in the malicious setting due to correctness issues, and it is not clear how to efficiently enhance the security of the semi-honest protocol without leaking anything about m from the communication. An efficient analogous solution for hiding the text length is not known, not even for the semi-honest setting. Therefore, using padding is currently the best alternative that makes it possible to obtain some level of privacy regarding the pattern/text lengths, even if the padding must be large enough to hide these lengths.

    We point to a recent work by Chase and Visconti [16] that studies the feasibility of size-hiding (database) commitments and proposes a construction based on universal arguments. Although this construction is viewed as purely theoretical and precludes practical implementations, it illustrates the difficulty in designing cryptographic primitives that hide the input length.

1.2 Overview of Our Approach

Our approach for computing private pattern matching follows by having the parties jointly (and securely) transform their inputs from binary representation into elements of \(\mathbb {Z}_{q}\), which they can later compare. More explicitly, the parties break their inputs into bits and encrypt each bit separately. Next, they map every m consecutive encryptions of bits into a single encryption. That is, for every m encrypted bits \(a_{1},\ldots,a_{m}\) the parties compute the encryption of \(\sum_{i=1}^{m} 2^{i-1}a_{i}\), relying on the additive homomorphism of the encryption scheme. Importantly, the parties exploit the fact that every two consecutive substrings \(\bar{t}_{i},\bar{t}_{i+1}\) of the text (starting in positions i and i+1, respectively) overlap in m−1 positions. Therefore, the encoding of \(\bar{t}_{i+1}\) can be obtained from the encoding of \(\bar{t}_{i}\) by subtracting the bit \(t_{i}\) from the latter, dividing the result by 2 and finally adding \(2^{m-1} t_{i+m}\). This reduces the problem to comparing two elements of \(\mathbb{Z} _{2^{m}}\) (embedded into \(\mathbb{Z} _{q}\)). Thus upon computing the encoding, the parties complete the protocol by comparing the encoding of p against every encoding of a substring of length m in the text, so that only Bob learns whether there is a match or not.
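The encoding and its sliding update can be illustrated on plaintext values. The following is a minimal sketch that works on cleartext bits rather than on ciphertexts; the function names, the 0-indexed positions and the toy modulus are assumptions made for illustration only and are not part of the protocol description.

```python
def encode(bits):
    """Encode bits a_1..a_m as sum_i 2^(i-1) * a_i (first bit is least significant)."""
    return sum(b << i for i, b in enumerate(bits))


def sliding_encodings(text, m, q):
    """Yield the encoding mod q of every length-m substring of `text`.

    Assumes q is an odd prime with q > 2^m, so distinct substrings
    never collide after the embedding into Z_q.
    """
    if m > len(text):
        return
    inv2 = pow(2, -1, q)  # "dividing by 2" in Z_q is multiplying by 2^{-1}
    enc = encode(text[:m]) % q
    yield enc
    for i in range(len(text) - m):
        # Subtract the bit t_i, halve, then add 2^(m-1) * t_{i+m}.
        enc = ((enc - text[i]) * inv2 + (1 << (m - 1)) * text[i + m]) % q
        yield enc


def match_positions(text, pattern, q):
    """Return the 0-indexed positions where `pattern` occurs in `text`."""
    target = encode(pattern) % q
    encodings = sliding_encodings(text, len(pattern), q)
    return [j for j, e in enumerate(encodings) if e == target]
```

In the protocol itself each of these arithmetic steps would be carried out homomorphically on encrypted bits; the sketch only checks that the update rule reproduces the encoding of the next substring.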

1.3 Prior Work

Secure Pattern Matching

To the best of our knowledge, the first work that considered pattern matching in the context of secure computation was [43], which solves pattern matching using a secure version of oblivious automata evaluation for implementing the KMP algorithm [31] in the semi-honest setting. The KMP algorithm is a well known technique that requires O(n) time complexity and searches for occurrences of the pattern within the text by employing the observation that when a mismatch occurs, the pattern embodies sufficient information to determine where the next match could begin. The overall costs of [43] are O(nm) exponentiations and bandwidth. Several followup improvements have been suggested based on this work, e.g., [19, 33]. These works reduce the round complexity and the number of exponentiations but still maintain security in the semi-honest setting; we provide a comparison with these works below.

This problem was also studied by Hazay and Lindell in [24] which used a different approach of oblivious pseudorandom function (PRF) evaluation. Their protocol achieves only a weaker notion of security called one-sided simulatability that does not guarantee full simulation for both corruption cases. A more recent construction that achieves full simulation in the malicious setting was developed by Gennaro et al. [20]. This work implements the KMP algorithm in the malicious setting using O(m) rounds and O(nm) exponentiations and bandwidth.

Finally, a recent paper by Katz and Malka [30] presents a secure solution for a generalized pattern matching problem of text processing. Here the party that holds the pattern has some additional information y, and its goal is to learn a function of the text and y with respect to the text locations where the pattern matches. Katz and Malka show how to use Yao’s garbled circuit approach to obtain a protocol where the size of the garbled circuit is linear in the number of occurrences of p in t (rather than linear in its length n). Their costs are dominated by the size of the circuit times the number of occurrences u (as u circuits are being transferred). They therefore need to assume some common knowledge of a threshold on the number of occurrences.

Variants of Pattern Matching

To the best of our knowledge, the first work that addresses a variant of secure pattern matching is the work by Jarrous and Pinkas [29], which solves the Hamming distance problem for two equal length strings against semi-honest adversaries (which is relevant in the context of approximate pattern matching). The costs of their protocol are inflated by a statistical parameter s for running a subprotocol of the oblivious polynomial evaluation functionality. This implies O(nm) exponentiations and group elements.

Another work by Vergnaud [45] studies the problems of approximate pattern matching and pattern matching with wildcards in the presence of malicious adversaries by taking a different approach of Fast Fourier Transform (FFT). The paper implements the well known technique by Fischer and Paterson [18] in a distributed setting using convolutions and FFT, where the inputs are viewed as coefficients of two polynomials for which their product is computed using FFT (for each text alignment). The paper presents protocols that exhibit O(n) communication and O(nlogm) computational costs in the semi-honest and malicious settings (but does not provide a complete proof in the malicious setting).

Finally, a very recent paper by Baron et al. [6] studies the problem of pattern matching with wildcards in the more general setting of a non-binary alphabet, implementing a different algorithm based on a linear algebra formulation and additively homomorphic encryption. Their protocol requires O(m+n) communication complexity and O(nm) computational complexity.

1.4 Efficiency

Secure Pattern Matching

We measure the efficiency of our protocol by comparisons against generic secure two-party protocols, as well as protocols designed for this specific task. The most common technique for designing secure protocols in the two-party setting is Yao’s garbling technique for Boolean circuits [46]. The current best known circuit that computes the pattern matching functionality requires O(nm) gates, since the circuit compares the pattern against every text location. (As noted by [20], a circuit that implements the KMP algorithm requires O(nmlogm) gates). It is an open problem whether better circuits can be constructed.

In the semi-honest setting, Yao’s technique induces a protocol that uses O(nm) symmetric key operations and O(m) exponentiations that can be made independent of the input length (where the latter is obtained by employing the ideas of extended oblivious transfer (OT) [27], but also requires an additional assumption on the hash function). The works by [19, 33] present specific protocols that require O(nm) symmetric key operations (due to the automaton size) and O(n) exponentiations, which can also be made independent of n using extended OT.

On the other hand, our protocol for the semi-honest setting requires 8n exponentiations and n group multiplications, where at first Alice forwards Bob the encryptions of her encoding for each substring of length m, Bob then computes the difference with his encoding and finally the parties rerandomize the outcome and decrypt it. A summary of these comparisons is presented in Table 1.

Table 1. Comparisons with semi-honest constructions.

In the malicious setting, the state of the art generic implementation is a recent protocol by Lindell and Pinkas [32] that relies on the garbling technique of Yao. Due to enforcing correct behavior the overhead of their protocol is inflated by a statistical parameter s=132. Therefore, the constants of such a protocol when realizing pattern matching are relatively high and dominated by 5.66sm+39m+10s+6 exponentiations and 6.5snm symmetric key operations. Moreover the communication complexity is at least 7sm+22n+7s+5 group elements and 4snm symmetric ciphertexts. For large databases this bandwidth as well as the number of symmetric operations introduce huge overheads. The only work that proves full simulation in the malicious setting was developed by Gennaro et al. [20]. This protocol runs in O(m) rounds and requires O(nm) exponentiations and bandwidth due to rerandomization of the automaton for each iteration. Thus, even asymptotically their protocol achieves worse overhead than our protocol (which also leads to higher constants since the [20] protocol uses zero-knowledge proofs in each step). On the other hand, our protocol induces 38n+6m exponentiations and nm+19n+10 group multiplications. The main advantage of our protocol is regarding the communication complexity. A summary of these comparisons is presented in Table 2.

Table 2. Comparisons with malicious constructions.

Variants of Pattern Matching

Generic protocols achieve the same overhead as in the case of computing the standard pattern matching problem since circuit size is O(nm) gates. Moreover, the protocols by Vergnaud [45] compute approximate pattern matching and pattern matching with don’t cares with better computational overhead than our protocols. The solution for the former problem introduces O(n(logm+τ)) computation (in comparison to O(nm) exponentiations in our protocol). The solution for the latter problem introduces O(nlogm) computational overhead (in comparison to O(nm) exponentiations in our protocol). Finally, the work of [6] studies pattern matching with wildcards in the malicious setting and achieves similar costs to our protocols but for larger alphabets.

1.5 A Roadmap

We first present the underlying primitives in Sect. 2. The following sections then contain our protocols. The basic protocol is presented in Sect. 3. This is then extended, first with wildcards in the pattern (Sect. 4) followed by approximate matching (Sect. 5). Finally, the paper concludes with the protocols which hide the pattern and text lengths (Sects. 6 and 7).

2 Preliminaries and Tools

Throughout the paper, we denote the security parameter by κ. A probabilistic machine is said to run in polynomial-time (ppt) if it runs in time that is polynomial in the security parameter κ and its input. A function μ(⋅) is negligible in κ (or simply negligible) if for every polynomial p(⋅) there exists a value K such that \(\mu(\kappa )<\frac{1}{p(\kappa)}\) for all κ>K; i.e., \(\mu(\kappa)=\kappa^{-\omega(1)}\). Let \(X= \{X(\kappa,a) \}_{\kappa \in{\mathbb{N}},a\in \{0, 1 \} ^{*}}\) and \(Y= \{Y(\kappa, a) \}_{\kappa\in\mathbb {N},a\in \{0, 1 \}^{*}}\) be distribution ensembles. We say that X and Y are computationally indistinguishable, denoted \(X\stackrel{\mathrm{c}}{\equiv} Y\), if for every non-uniform polynomial-time distinguisher D there exists a negligible μ(⋅) such that for every \(\kappa\in \mathbb{N}\) and \(a\in\{0,1\}^{*}\),

$$\bigl|\Pr\bigl[D\bigl(X(\kappa,a)\bigr)=1\bigr]-\Pr\bigl[D\bigl(Y(\kappa,a)\bigr)=1 \bigr] \bigr|<\mu (\kappa). $$

2.1 Hardness Assumptions

Our constructions rely on the following hardness assumption.

Definition 1

(DDH)

We say that the decisional Diffie–Hellman (DDH) problem is hard relative to the group \(\mathbb{G}_{q}\) if for every ppt algorithm \(\mathcal{A}\) there exists a negligible function negl such that

$$\bigl|\Pr \bigl[\mathcal{A}\bigl({{\mathbb {G}}_q},q,g,g^x,g^y,g^z \bigr)=1 \bigr]-\Pr \bigl[\mathcal{A}\bigl( \mathbb{G} _q ,q,g,g^x,g^y,g^{xy} \bigr)=1 \bigr] \bigr| \leq{\sf negl}(\kappa), $$

where \(\mathbb{G}_{q}\) has order q and the probabilities are taken over the choices of g generating \(\mathbb{G}_{q}\) and \(x,y,z\in\mathbb{Z}_{q}\).

2.2 Σ-Protocols

Definition 2

(Σ-protocol)

A protocol π is a Σ-protocol for relation R if it is a 3-round public-coin protocol and the following requirements hold:

  • Completeness: If P and V follow the protocol on input x and private input w to P where (x,w)∈R, then V always accepts.

  • Special soundness: There exists a polynomial-time algorithm A that given any x and any pair of accepting transcripts (a,e,z),(a,e′,z′) on input x, where e≠e′, outputs w such that (x,w)∈R.

  • Special honest-verifier zero knowledge: There exists a ppt algorithm M such that

    $$\bigl\{\bigl\langle P(x,w),V(x,e)\bigr\rangle \bigr\}_{(x,w)\in R,e\in \{0, 1 \}^*} \equiv \bigl\{ M(x,e) \bigr\}_{x\in L_R,e\in \{0, 1 \}^*}, $$

    where \(L_{R}\) is the language of relation R, M(x,e) denotes the output of M upon input x and e, and 〈P(x,w),V(x,e)〉 denotes the output transcript of an execution between P and V, where P has input (x,w), V has input x, and V’s random tape (determining its query) equals e.
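The three requirements can be made concrete with a Schnorr-style protocol for the discrete logarithm relation. The sketch below is an illustration only: the toy group (the order-1019 subgroup of \(\mathbb{Z}_{2039}^{*}\) generated by 4), the function names and the way the challenge is passed in are assumptions, and the parameters are far too small to offer any security.

```python
import secrets

# Toy parameters (an assumption for illustration): P = 2Q + 1 with P, Q prime,
# and G = 4 generates the subgroup of order Q in Z_P^*.
P, Q, G = 2039, 1019, 4


def commit():
    """Prover's first message a = g^k for a fresh random k."""
    k = secrets.randbelow(Q)
    return k, pow(G, k, P)


def respond(w, k, e):
    """Prover's answer z = k + e*w mod q to the challenge e."""
    return (k + e * w) % Q


def verify(h, a, e, z):
    """Verifier accepts iff g^z = a * h^e."""
    return pow(G, z, P) == a * pow(h, e, P) % P


def extract(t1, t2):
    """Special soundness: recover w from two accepting transcripts
    that share the first message but have different challenges."""
    (a1, e1, z1), (a2, e2, z2) = t1, t2
    assert a1 == a2 and (e1 - e2) % Q != 0
    return (z1 - z2) * pow((e1 - e2) % Q, -1, Q) % Q


def simulate(h, e):
    """Special HVZK: build an accepting transcript for a given challenge
    without the witness, by picking z first and solving for a."""
    z = secrets.randbelow(Q)
    a = pow(G, z, P) * pow(h, -e, P) % P
    return a, e, z
```

Completeness corresponds to `verify` accepting an honest run, special soundness to `extract`, and special honest-verifier zero knowledge to `simulate`, whose output has the same distribution as an honest transcript for the fixed challenge.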

2.3 Public Key Encryption Schemes

We begin by specifying the definitions of public key encryption, semantic security and homomorphic encryption.

Definition 3

(PKE)

We say that Π=(G,E,D) is a public-key encryption scheme if G,E,D are polynomial-time algorithms specified as follows:

  • G, given a security parameter κ (in unary), outputs keys (pk,sk), where pk is a public key and sk is a secret key. We denote this by \((pk,sk)\leftarrow G(1^{\kappa})\).

  • E, given the public key pk and a plaintext message m, outputs a ciphertext c encrypting m. We denote this by \(c\leftarrow E_{pk}(m)\); and when emphasizing the randomness r used for encryption, we denote this by \(c\leftarrow E_{pk}(m;r)\).

  • D, given the public key pk, secret key sk and a ciphertext c, outputs a plaintext message m s.t. there exists randomness r for which \(c=E_{pk}(m;r)\) (or ⊥ if no such message exists). We denote this by \(m\leftarrow D_{pk,sk}(c)\).

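For a public key encryption scheme Π=(G,E,D) and a non-uniform adversary \(\mathcal{A}=(\mathcal{A}_{1},\mathcal{A}_{2})\), we consider the following semantic security game:

  1. \((pk,sk)\leftarrow G(1^{\kappa})\).

  2. \((m_{0},m_{1},\mathit{history})\leftarrow\mathcal{A}_{1}(pk)\), where \(|m_{0}|=|m_{1}|\).

  3. \(c\leftarrow E_{pk}(m_{b})\), where \(b\in_{R}\{0,1\}\).

  4. \(b'\leftarrow\mathcal{A}_{2}(c,\mathit{history})\).

  5. \(\mathcal{A}\) wins if b′=b.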

Denote by \({\mathsf {Adv}}_{\varPi,\mathcal{A}}(\kappa)\) the probability that \(\mathcal{A}\) wins the semantic security game.

Definition 4

(Semantic security)

A public key encryption scheme Π=(G,E,D) is semantically secure, if for every non-uniform adversary \(\mathcal{A}=(\mathcal{A}_{1},\mathcal{A}_{2})\) there exists a negligible function negl such that

$${\mathsf {Adv}}_{\varPi,\mathcal{A}}(\kappa) \leq\frac{1}{2} +\mathsf{negl}(\kappa). $$

An important tool that we exploit in our construction is homomorphic encryption over an additive group as defined below.

Definition 5

(Homomorphic PKE)

A public key encryption scheme (G,E,D) is additively homomorphic if for all κ and all (pk,sk) output by \(G(1^{\kappa})\), it is possible to define groups \(\mathbb{M}\), \(\mathbb{C}\) such that

  • The plaintext space is \(\mathbb{M}\), and all ciphertexts output by \(E_{pk}\) are elements of \(\mathbb{C}\).

  • For any \(m_{1}, m_{2} \in{\mathbb{M}}\) and \(c_{1}, c_{2} \in{\mathbb{C}}\) with \(m_{1}=D_{sk}(c_{1})\) and \(m_{2}=D_{sk}(c_{2})\), we have

    $$\{pk,c_1,c_1 \cdot c_2\} \equiv \bigl\{pk,E_{pk}(m_1),E_{pk}(m_1 + m_2)\bigr\} $$

    where the group operations are carried out in \(\mathbb{C}\) and \(\mathbb{M}\), respectively, and the encryptions of \(m_{1}\) and \(m_{1}+m_{2}\) use independent randomness.

Any additive homomorphic scheme supports the multiplication of a ciphertext by a scalar by computing multiple additions.

2.4 The ElGamal PKE

At the core of our proposed protocols lies the additively homomorphic variation of ElGamal PKE [17]. Essentially, we use the framework of Brandt [11] with minor variations. Formally, ElGamal PKE is a semantically secure public key encryption scheme assuming the hardness of the decisional Diffie–Hellman (DDH) problem. We describe the plain scheme here; the distributed version is presented below. Let \(\mathbb{G}_{q}\) be a group of prime order q in which DDH is hard (we assume that multiplication and testing group membership can be performed efficiently). Then the public key is a tuple \(pk=\langle \mathbb{G}_{q} ,q,g,h\rangle\) and the corresponding secret key is sk=s, s.t. \(g^{s}=h\). Encryption is performed by choosing \(r{\in _{R}}{{\mathbb{Z}}_{q}}\) and computing \(E_{pk}(m;r)=\langle g^{r},h^{r}g^{m}\rangle\). Decryption of a ciphertext C=〈α,β〉 is performed by computing \(g^{m}=\beta\cdot\alpha^{-s}\) and then finding m by running an exhaustive search. This variant of encrypting in the exponent suffices for our purposes as we do not require full decryption, but just the ability to distinguish between m=0 and m≠0. Note that this variant of ElGamal meets Definition 5 for \(\mathbb{M}= \mathbb{Z}_{q}\) and \(\mathbb{C}= \mathbb{G}_{q}^{2}\). We present the computation of the parties with respect to the ciphertext space componentwise. Namely, we write \(C^{r}\) to denote \(\langle\alpha^{r},\beta^{r}\rangle\) and C/C′ for 〈α/α′,β/β′〉, for ciphertexts C=〈α,β〉 and C′=〈α′,β′〉, and \(r\in \mathbb{Z}_{q}\).
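The scheme described above can be sketched in a few lines. This is a toy illustration under stated assumptions: the group parameters (the order-1019 subgroup of \(\mathbb{Z}_{2039}^{*}\), generated by 4) and all function names are hypothetical, and the group is far too small for real use.

```python
import secrets

# Toy group (an assumption for illustration): G has prime order Q in Z_P^*.
P, Q, G = 2039, 1019, 4


def keygen():
    """Return pk = (g, h) with h = g^s, and sk = s."""
    s = secrets.randbelow(Q)
    return (G, pow(G, s, P)), s


def encrypt(pk, m, r=None):
    """E_pk(m; r) = <g^r, h^r * g^m>, i.e. the message sits in the exponent."""
    g, h = pk
    r = secrets.randbelow(Q) if r is None else r
    return pow(g, r, P), pow(h, r, P) * pow(g, m, P) % P


def add(c1, c2):
    """Componentwise product of ciphertexts adds the plaintexts."""
    return c1[0] * c2[0] % P, c1[1] * c2[1] % P


def scalar(c, k):
    """Componentwise exponentiation C^k multiplies the plaintext by k."""
    return pow(c[0], k, P), pow(c[1], k, P)


def decrypt_exponent(sk, c):
    """Recover g^m = beta * alpha^(-s)."""
    alpha, beta = c
    return beta * pow(alpha, -sk, P) % P


def is_zero(sk, c):
    """Distinguish m = 0 from m != 0 without any search."""
    return decrypt_exponent(sk, c) == 1


def decrypt_small(sk, c, bound):
    """Exhaustive search for m in [0, bound)."""
    gm = decrypt_exponent(sk, c)
    for m in range(bound):
        if pow(G, m, P) == gm:
            return m
    raise ValueError("plaintext not in range")
```

For example, `add(encrypt(pk, 3), scalar(encrypt(pk, 5), 2))` decrypts to 13, while the zero test alone is enough to compare two encrypted encodings by checking whether their difference encrypts 0.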

2.4.1 Distributed ElGamal PKE

In a distributed scheme, the parties hold shares of the secret key so that the combined key remains a secret. In order to decrypt, each party uses its share to generate an intermediate value, and these values are eventually combined into the decrypted plaintext. Note that a public key and an additive sharing of the corresponding secret key are easily generated [40]. Namely, the parties first agree on \(\mathbb{G}_{q}\) and g. Then, each party \(P_{i}\) picks \(s_{i} \in_{R} \mathbb{Z}_{q}\) and sends \(h_{i} = g^{s_{i}}\) to the other. Finally, the parties compute \(h=h_{1} h_{2}\) and set \(pk=\langle \mathbb{G}_{q},q,g,h\rangle\). Clearly, the secret key \(s=s_{1}+s_{2}\) associated with this public key is shared amongst the parties. In order to ensure correct behavior, the parties must prove knowledge of their \(s_{i}\) by running on \((g,h_{i})\) the zero-knowledge proof \(\pi_{\mathrm{DL}}\), specified in Sect. 2.5. We denote this key generation protocol by \(\pi_{\mathrm{KeyGen}}\), which is associated with the functionality \(\mathcal{F}_{\mathrm{KeyGen}} (1^{\kappa}, 1^{\kappa})=((pk, sk_{1}), (pk, sk_{2}))\).

To decrypt a ciphertext C=〈α,β〉, each party \(P_{i}\) raises α to the power of its share, sends the outcome \(\alpha_{i}\) to the other party and then proves this was done correctly using \(\pi_{\mathrm{DL}}\). Both parties then output \(\beta/(\alpha_{1} \alpha_{2})\). We denote this protocol by \(\pi_{\mathrm{Dec}}\). This protocol allows a variation where only one party obtains the decrypted result. Another variation of \(\pi_{\mathrm{Dec}}\) allows a party, say \(P_{1}\), to learn whether a ciphertext C=〈α,β〉 encrypts \(g^{0}\) or not, but nothing more. This can be carried out as follows. \(P_{2}\) first raises C to a random non-zero power, rerandomizes the result, and sends it to \(P_{1}\). The parties then execute \(\pi_{\mathrm{NZ}}\), defined below, to let \(P_{1}\) verify \(P_{2}\)’s behavior. They then decrypt the final ciphertext towards \(P_{1}\), who concludes that m=0 iff the masked plaintext was 0. Simulation is trivial given access to \(\mathcal{F}_{{\mathrm{ZK}}}^{\mathcal{R}_{\mathrm{NZ}}}\). We denote this protocol by \(\pi_{\mathrm{Dec0}}\) and the associated ideal functionality by \(\mathcal{F}_{\mathrm{Dec0}}\).
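The message flow of key generation, joint decryption and the zero test can be sketched as follows. This is a simplified illustration: both parties are simulated in one process, all zero-knowledge proofs are omitted, and the toy group and function names are assumptions rather than part of the protocols.

```python
import secrets

# Toy group (an assumption for illustration): G has prime order Q in Z_P^*.
P, Q, G = 2039, 1019, 4


def share_keygen():
    """Each party picks s_i and publishes h_i = g^{s_i}; the key is h = h_1 * h_2."""
    s1, s2 = secrets.randbelow(Q), secrets.randbelow(Q)
    h = pow(G, s1, P) * pow(G, s2, P) % P
    return h, s1, s2


def encrypt(h, m):
    """Additively homomorphic ElGamal under the joint public key."""
    r = secrets.randbelow(Q)
    return pow(G, r, P), pow(h, r, P) * pow(G, m, P) % P


def partial(alpha, s_i):
    """A party's decryption share alpha_i = alpha^{s_i}."""
    return pow(alpha, s_i, P)


def combine(beta, a1, a2):
    """Output beta / (alpha_1 * alpha_2), which equals g^m."""
    return beta * pow(a1 * a2, -1, P) % P


def mask(h, c):
    """P2's step in the zero test: raise C to a random non-zero R and rerandomize."""
    R = 1 + secrets.randbelow(Q - 1)
    r = secrets.randbelow(Q)
    alpha, beta = c
    return (pow(alpha, R, P) * pow(G, r, P) % P,
            pow(beta, R, P) * pow(h, r, P) % P)


def dec0(h, s1, s2, c):
    """P1 learns only whether the plaintext is 0 (proofs omitted in this sketch)."""
    alpha, beta = mask(h, c)
    return combine(beta, partial(alpha, s1), partial(alpha, s2)) == 1
```

Because q is prime and R is non-zero, the masked plaintext mR is 0 exactly when m is 0, and is otherwise uniformly distributed over the non-zero elements, which is why the masked decryption reveals nothing beyond the zero test.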

2.5 Zero-Knowledge Proofs for \(\mathbb{G}_{q}\) and ElGamal PKE

To prevent malicious behavior, the parties must demonstrate that they are well-behaved. To achieve this, our protocols utilize zero-knowledge proofs of knowledge. All our proofs are Σ-protocols which show knowledge of a witness that some statement is true (i.e., that it belongs to some relation \(\mathcal{R}\)). A generic efficient technique that transforms any Σ-protocol into a zero-knowledge proof (of knowledge) can be found in [25]. This transformation requires five (six) additional exponentiations.

2.5.1 Zero-Knowledge Proofs with Constant Overhead

  1. \(\pi_{\mathrm{DL}}\), for demonstrating the knowledge of a solution x to a discrete logarithm problem [41].

    $$\mathcal{R}_{\mathrm{DL}}= \bigl\{\bigl((\mathbb{G}_q,q,g,h),x\bigr)\mid h=g^x \bigr\}. $$
  2. \(\pi_{\mathrm{EqDL}}\), for demonstrating equality of two discrete logarithms [15].

    $$\mathcal{R}_{\mathrm{EqDL}}= \bigl\{\bigl(( {{ \mathbb{G}}_q},q,g_1, g_2,h_1,h_2),x \bigr)\mid h_1=g_1^x \wedge h_2=g_2^x \bigr\}. $$

    Phrased differently, \(\pi_{\mathrm{EqDL}}\) demonstrates that a quadruple forms a Diffie–Hellman tuple or, equivalently, that an ElGamal ciphertext is an encryption of 0, where \(g_{1},g_{2}\) are part of the public key and \(\langle h_{1},h_{2}\rangle\) is the computed ciphertext; see Sect. 2.4 for the complete details of ElGamal.

  3. \(\pi_{\mathrm{isBit}}\), for demonstrating that a ciphertext C=〈α,β〉 is either an encryption of 0 or 1. This can be obtained directly from \(\pi_{\mathrm{EqDL}}\) using the compound proof of Cramer et al. [14].

    $$\mathcal{R}_{\mathrm{isBit}}= \bigl\{\bigl(( {{ \mathbb{G}}_q},q,g,h,\alpha, \beta),(b,r)\bigr)\mid { \langle\alpha, \beta \rangle} = \bigl\langle g^r, h^r\cdot g^b \bigr\rangle \wedge b\in \{0, 1 \} \bigr\}. $$
  4. \(\pi_{\mathrm{Mult}}\), for demonstrating that a ciphertext encrypts the product of two encrypted plaintexts [1]. Namely, given a ciphertext C the prover proves the knowledge of a plaintext f and randomness \(r_{f},r_{\pi}\) such that \(C_{f}=E_{pk}(f;r_{f})\) and \(C_{\pi}=C^{f}\cdot E_{pk}(0;r_{\pi})\), where exponentiation is computed componentwise.

  5. \(\pi_{\mathrm{NZ}}\), for demonstrating that a ciphertext C′ can be computed from C=〈α,β〉 by raising C (componentwise) to a non-zero exponent and rerandomizing it, i.e. \(C'=C^{R}\cdot E_{pk}(0;r)=\langle\alpha',\beta'\rangle\).

    $$\mathcal{R}_{\mathrm{NZ}}= \bigl\{\bigl(\bigl(g,h, \alpha, \beta, \alpha', \beta'\bigr),(R, r)\bigr)\ \mathrm{s.t.}\ { \bigl\langle\alpha', \beta' \bigr\rangle} = { \bigl\langle\alpha^Rg^r, \beta^Rh^r \bigr\rangle} \wedge R\neq0 \bigr\}. $$

    The challenging part when constructing a proof for this relation is to show that R≠0. To do this, the prover picks \(R' \in _{R} \mathbb{Z} _{q}^{*} \), supplies the verifier with additional ciphertexts, \(C_{R}=E_{pk}(R;r_{R})\), \(C_{R'}=E_{pk}(R';r_{R'})\) and \(C_{\pi}=E_{pk}(RR';r_{\pi})\), and executes \(\pi_{{\scriptscriptstyle\mathrm{Mult}}}\) twice: once on \((C,C_{R},C')\) and once on \((C_{R},C_{R'},C_{\pi})\). The prover then sends RR′ to the verifier and demonstrates it is the plaintext of \(C_{\pi}\) using \(\pi_{\mathrm{EqDL}}\). Finally, the verifier checks that RR′ is non-zero.

    The executions of \(\pi_{\mathrm{Mult}}\) demonstrate that C′ has been obtained from C through exponentiation and that the plaintext of \(C_{\pi}\) depends on R. Running \(\pi_{\mathrm{EqDL}}\) and performing the final check ensure that RR′≠0, which implies that R≠0. Hence the protocol demonstrates that C′ has been obtained correctly. Further, since the verifier receives only ciphertexts along with RR′, which is uniformly distributed in \(\mathbb{Z} _{q}^{*} \), \(\pi_{\mathrm{NZ}}\) is zero-knowledge.
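The remark in item 2 that equality of discrete logarithms also shows that an ElGamal ciphertext encrypts 0 can be illustrated with a two-generator variant of the discrete-log Σ-protocol. The sketch below is an assumption-laden illustration: the toy group, the function names and the explicit challenge argument are hypothetical, and it shows only the three-move core, not the full zero-knowledge transformation.

```python
import secrets

# Toy group (an assumption for illustration): G has prime order Q in Z_P^*.
P, Q, G = 2039, 1019, 4


def prove_commit(g1, g2):
    """First message: (g1^k, g2^k) for one fresh random k."""
    k = secrets.randbelow(Q)
    return k, (pow(g1, k, P), pow(g2, k, P))


def prove_respond(x, k, e):
    """Response z = k + e*x mod q, where x is the common exponent."""
    return (k + e * x) % Q


def verify_eqdl(g1, g2, h1, h2, a, e, z):
    """Accept iff g1^z = a1 * h1^e and g2^z = a2 * h2^e."""
    a1, a2 = a
    return (pow(g1, z, P) == a1 * pow(h1, e, P) % P
            and pow(g2, z, P) == a2 * pow(h2, e, P) % P)
```

With a public key (g,h) and a ciphertext 〈α,β〉=〈g^r,h^r〉, the statement is (g,h,α,β) and the witness is the encryption randomness r. If β instead contained a factor g^m with m≠0, the second check would fail for every non-zero challenge.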

2.5.2 Additional Zero-Knowledge Proofs

  1. \(\pi_{\mathrm{Perm}}\), for demonstrating that a set of ciphertexts \(\{C_{i}\}_{i}\) is a random permutation and rerandomization of another set, \({ \{{C}_{i}' \}}_{i}\). A number of potential proofs exist in the literature; the most recent solution by Bayer and Groth [8] obtains sublinear communication, whereas the prover’s work is quasilinear. Other works, such as [44], require linear communication/computation complexity.

    $$\mathcal{R}_{\mathrm{Perm}}= \bigl\{\bigl(\bigl(g,h, \{{C}_i \}_i, \bigl\{ {C}_i' \bigr\}_i\bigr),\bigl(\pi, \{{r}_i \}_i\bigr)\bigr)\ \mathrm{s.t.}\ \bigl\langle \alpha_{i}', \beta_{i}' \bigr \rangle = \bigl\langle\alpha_{\pi(i)} g^{r_i}, \beta_{\pi(i)} h^{r_i} \bigr\rangle \bigr\}. $$
  2. \(\pi_{\star\mbox{-}\mathrm{proof}}\), for demonstrating the correctness with respect to the following relation, defined in two phases. Looking ahead, this proof is used within Protocol \({\mathcal{F}_{\mathrm{PM}\mbox{-}\star}}\) for secure pattern matching with wildcards. Specifically, let \({{\mathbb {G}}_{q}}\) be a group of prime order and let \(\mathbb{G}_{q}^{2}\) be the ciphertext domain for the associated, additively homomorphic ElGamal encryption scheme with encryption function \(E_{pk}(\cdot;\cdot)\). Let \(T_{1},\ldots,T_{n}\in{{{\mathbb {G}}_{q}}^{2}}\) be a collection of encryptions. Then, for j∈{1,…,n−m+1} define first a function \(\phi_{j}: ({{\mathbb {Z}}_{q}}^{m}\times{{\mathbb{Z}}_{q}} )\mapsto{{{\mathbb{G}}_{q}}^{2}}\) by

    $$\phi_j \bigl( \{w_i\}_{i=1}^{m} , r_j \bigr) = \Biggl(\prod_{i=1}^m(T_{i+j})^{w_i\cdot2^{i-1}} \Biggr)\cdot {{E_{pk}} (0;r_j )}. $$

    That is, the output is the rerandomization of an encryption of a substring of the text (which is held by Alice), starting at position j with wildcard positions replaced by 0. Next, define a function \({\phi_{T_{1},\ldots,T_{n}}}: ({{\mathbb{Z}}_{q}}^{m}\times {{\mathbb{Z}}_{q}}^{m} \times{ {\mathbb{Z}}_{q}}^{n-m+1} )\mapsto ({{{\mathbb{G}}_{q}}^{2}} )^{m} \times ({{{\mathbb {G}}_{q}}^{2}} )^{n-m+1}\) as follows:

    (1)

    I.e., \({\phi_{T_{1},\ldots,T_{n}}}\) consists of m encryptions of values w i with randomness \(r_{w_{i}}\) as well as nm+1 rerandomized encryptions, each computed from m pairs, (w i ,T i+j ) as defined by ϕ j . Therefore, the set ciphertexts encrypting the text and {w i } i is the statement and the set of plaintexts and randomness is the witness. A detailed protocol as well as a complete proof can be found in Appendix A. Our proof introduces communication complexity O(n+m) which is linear in the inputs lengths, and computation cost O(nm).

  3. 3.

    \(\pi_{H\mbox{-}\mathrm{proof}}\), for demonstrating correctness with respect to the following relation, also defined in two phases. Looking ahead, this proof is used within Protocol \(\pi_{\mathrm{APM}}\) for approximate pattern matching. The goal is for Alice to verify that the Hamming distances have been correctly computed, i.e., that Bob correctly performed his part of the computation between the substrings of the text, t, and his pattern, p. For j=1,…,n−m+1 let

    $$ {{H_{T_1,\ldots,T_n}}^{(j)}}: { \mathbb{Z}}_q^m\times{{\mathbb {Z}}_q}\mapsto{{{\mathbb{G}}_q}^2} $$

    be defined as

    $$ {{H_{T_1,\ldots,T_n}}^{(j)}} \bigl( { \{p_i\}_{i=1}^{m}}, r_j \bigr) = \Biggl(\prod_{i=1}^m (T_{j+i-1} )^{-2p_i}\cdot {{E_{pk}} (1;0 )}^{p_i} \Biggr)\cdot {{E_{pk}} (0;r_j )}. $$

    Now define

    $$ {H_{T_1,\ldots,T_n}}: {{\mathbb {Z}}_q}^m\times{{ \mathbb{Z}}_q}^{n-m+1}\times {{ \mathbb{Z}}_q}^{m} \mapsto \bigl({{ {\mathbb{G}}_q}^2} \bigr)^{n-m+1}\times \bigl({{ {\mathbb{G}}_q}^2} \bigr)^m $$

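    as

    $$ {H_{T_1,\ldots,T_n}} \bigl( \{p_i\}_{i=1}^{m}, \{r_j\}_{j=1}^{n-m+1}, \{r_{p_i}\}_{i=1}^{m} \bigr) = \Bigl( \bigl\{ {{H_{T_1,\ldots,T_n}}^{(j)}} \bigl( \{p_i\}_{i=1}^{m}, r_j \bigr) \bigr\}_{j=1}^{n-m+1}, \bigl\{ E_{pk} (p_i; r_{p_i} ) \bigr\}_{i=1}^{m} \Bigr). $$

    A detailed protocol as well as a complete proof can be found in Appendix B. Our proof has communication complexity O(n+m), which is linear in the input lengths, and computation cost O(nm).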

3 The Basic, Linear Solution

In this section we present our solution for the classic pattern matching problem. Initially, Alice holds an n-bit string t, while Bob holds an m-bit pattern p and the parties wish to compute the functionality \(\mathcal{F}_{{\mathrm{PM}}} \) defined by

$$\bigl((p,n),(t,m)\bigr)\mapsto \left\{ \begin{array}{l@{\quad }l} \bigl(\{j \mid\bar{t}_j=p\}_{j=1}^{n-m+1},\lambda\bigr) & \mathrm{if}\ |p|= m\ \mathrm{and}\ |t|=n , \\[3pt] (\lambda,\lambda) & \mathrm{otherwise.} \end{array} \right. $$

where λ is an empty string and \(\bar{t}_{j}\) is the substring of length m that begins at the jth position in t. This problem has been widely studied for decades due to its potential applications and can be solved in linear time complexity [10, 31] when no level of security is required. We examine a secure version for this problem where Alice, who holds the text, does not gain any information about the pattern from the protocol execution, whereas Bob, who holds the pattern, does not learn anything but the matched text locations. In our setting, the parties share no information (except for the input length) though it is assumed that they are connected by an authenticated communication channel and that the inputs are over a binary alphabet. Extending this to larger alphabets is discussed below. Our protocol exhibits overall linear communication and computation costs, and achieves full simulation in the presence of malicious adversaries. More specifically, the parties compute O(n+m) exponentiations and exchange O(n+m) group elements.

Here and below, we have the parties jointly (and securely) transform their inputs from binary representation into elements of \(\mathbb{Z}_{q}\), while exploiting the fact that every two consecutive substrings of the text are closely related. We assume that \(m<\log_{2} q\); larger pattern lengths can be accommodated by encoding the pattern and the substrings of the text into multiple values (see Sect. 3.1 for further details). Informally, both parties break their inputs into bits and encrypt each bit separately. Next, the parties map every m consecutive encryptions of bits into a single encryption of the m-bit value whose binary representation is assembled from these m bits. Thus, the problem is reduced to comparing two elements of \({\mathbb {Z}}_{2^{m}}\) (embedded into \({\mathbb{Z}}_{q}\)). The crux of our protocol is to efficiently compute this mapping.

We are now ready to give a detailed description of our construction.

Protocol π PM

  • Inputs: The input of Alice is a binary string t of length n and an integer m, whereas the input of Bob is a binary string p of length m and an integer n. The parties share a security parameter \(1^{\kappa}\) as well.

  • The protocol:

    1. 1.

      Alice and Bob run protocol \(\pi_{\mathrm{KeyGen}}(1^{\kappa},1^{\kappa})\) to generate a public key \(pk=\langle {{\mathbb{G}}_{q}} ,q,g,h\rangle\) and the respective shares \(s_{A}\) and \(s_{B}\) of the secret key sk.

    2. 2.

      Bob sends encryptions \(P_{i} = E_{pk}(p_{i};r_{p_{i}})\), i=1,…,m, of his m-bit pattern p, to Alice. Further, for each encryption the parties run the zero-knowledge proof of knowledge π isBit, allowing Alice to verify that the plaintext of P i is a bit known to Bob, i.e. that he has provided a bit-string of length m. Both parties then compute an encryption of Bob’s pattern,

      $$ P \gets\prod_{i=1}^m P_i^{2^{i-1}}, $$
      (2)

      using the homomorphic property of ElGamal PKE.

    3. 3.

      Alice sends encryptions, \(T_{j} = E_{pk}(t_{j};r_{t_{j}})\) j=1,…,n, of the bits t j of her n-bit text, t, to Bob. Further, for each encryption the parties run π isBit, allowing Bob to verify that the plaintext of T j is a bit known to Alice, i.e. that she has indeed provided the encryption of a bit-string of length n that she knows.

    4. 4.

      Let \(\bar{t}_{j}\) be the m-bit substring of Alice’s text t starting at position j, for j=1,…,n−m+1. For each such string both parties compute an encryption of that string,

      $$ \bar{T}_j \gets\prod _{i=j}^{j+m-1} T_i^{2^{i-j}}. $$
      (3)
    5. 5.

      For every \(\bar{T}_{j}\), j=1,…,n−m+1, both parties compute

      $$ \Delta_j \gets\bar{T}_j\cdot P^{-1}. $$
      (4)
    6. 6.

      For every \(\Delta_{j}\), j=1,…,n−m+1, Alice and Bob run \(\pi_{\mathrm{Dec0}}\), which reveals to Bob whether the plaintext \(\delta_{j}\) is zero. Bob outputs j if this is the case.

Correctness of π PM

Before turning to our proof, we explain the intuition and demonstrate that protocol π PM correctly determines which substrings of the text t match the pattern p. Recall that the value P that is computed in Eq. (2) (Step 2) is an encryption of Bob’s pattern, \(p = \sum_{i=1}^{m} 2^{i-1}p_{i}\). This follows from the homomorphic property of ElGamal PKE,

$$ P = \prod_{i=1}^m P_i^{2^{i-1}} = E_{pk} \Biggl(\sum_{i=1}^m 2^{i-1}p_i; \sum_{i=1}^m 2^{i-1} r_{p_i} \Biggr). $$
(5)

Note that P is obtained deterministically from the P i , hence both Alice and Bob hold the same fixed encryption. Similarly, in Eq. (3) computed in Step 4, the parties compute encryptions of the substrings of length m of Alice’s text,

$$\bar{t}_j = \sum_{i=j}^{j+m-1}2^{i-j}t_i , $$

see the complexity paragraph for a detailed discussion of the efficiency of this step. As with P, the parties hold the same, fixed encryptions (with randomness \(r_{\bar{t}_{j}} = \sum_{i=j}^{j+m-1} 2^{i-j} r_{t_{i}} \)). The encryption \(\Delta_{j}\) computed by Eq. (4) is an encryption of \(\delta_{j} = \bar{t}_{j} - p\), i.e., the difference (in \(\mathbb{Z}_{q}\)) between the substring of the text starting at position j and the pattern.

At this point, it simply remains for Bob to securely determine which of the Δ j are encryptions of zero, as

$$ \delta_j = 0\quad \Longleftrightarrow\quad \bar{t}_j = p. $$
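
The correctness argument can be checked on a toy instance. The following Python sketch is illustrative only and rests on several assumptions that are not part of the protocol: a tiny group (p = 2039, q = 1019, g = 4, chosen here for readability and far too small to be secure), a single locally generated key instead of the shared key produced by π KeyGen, and a direct decrypt-to-zero test standing in for π Dec0. No zero-knowledge proofs are modelled, and the helper names (enc, c_mul, c_pow, is_zero, toy_pm) are hypothetical.

```python
import secrets

# Toy parameters (assumptions for illustration; far too small to be secure).
P_MOD = 2039   # safe prime, P_MOD = 2*Q + 1
Q = 1019       # prime order of the subgroup generated by G
G = 4          # quadratic residue mod P_MOD, hence of order Q


def enc(h, msg, r=None):
    """Additively homomorphic ElGamal: <g^r, h^r * g^msg>."""
    r = secrets.randbelow(Q) if r is None else r
    return (pow(G, r, P_MOD), pow(h, r, P_MOD) * pow(G, msg % Q, P_MOD) % P_MOD)


def c_mul(c1, c2):
    """Component-wise product; adds the plaintexts."""
    return (c1[0] * c2[0] % P_MOD, c1[1] * c2[1] % P_MOD)


def c_pow(c, e):
    """Component-wise power; multiplies the plaintext by e (mod Q)."""
    e %= Q
    return (pow(c[0], e, P_MOD), pow(c[1], e, P_MOD))


def is_zero(sk, c):
    """Stand-in for pi_Dec0: checks whether the plaintext is 0 mod Q."""
    return c[1] * pow(c[0], Q - sk, P_MOD) % P_MOD == 1


def toy_pm(t_bits, p_bits):
    n, m = len(t_bits), len(p_bits)
    assert 2 ** m < Q, "pattern must fit into Z_q (m < log2 q)"
    sk = secrets.randbelow(Q - 1) + 1       # single key; the protocol shares it
    h = pow(G, sk, P_MOD)
    P_i = [enc(h, b) for b in p_bits]       # Step 2: bitwise encryptions of p
    T_i = [enc(h, b) for b in t_bits]       # Step 3: bitwise encryptions of t
    P = (1, 1)
    for i, c in enumerate(P_i):             # Eq. (2), with 0-based i
        P = c_mul(P, c_pow(c, 2 ** i))
    matches = []
    for j in range(n - m + 1):              # 0-based window start
        T_bar = (1, 1)
        for i in range(m):                  # Eq. (3)
            T_bar = c_mul(T_bar, c_pow(T_i[j + i], 2 ** i))
        delta = c_mul(T_bar, c_pow(P, -1))  # Eq. (4)
        if is_zero(sk, delta):
            matches.append(j + 1)           # report 1-based positions
    return matches
```

For instance, `toy_pm([1, 0, 1, 1, 0, 1, 1], [1, 0, 1])` reports positions 1 and 4, the two windows that equal the pattern.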

Security of π PM

We are now ready to prove the following theorem:

Theorem 6

(Main)

Assume that the DDH assumption holds in \({{\mathbb{G}}_{q}}\), then π PM securely computes \(\mathcal{F}_{\mathrm{PM}}\) in the presence of malicious adversaries.

Proof

We separately prove security in the case that Alice is corrupted and the case that Bob is corrupted. Our proof is in a hybrid model where a trusted party computes the ideal functionalities \(\mathcal {F}_{\mathrm{KeyGen}}\), \({\mathcal{F}_{\mathrm{Dec0}}}\) and \({{\mathcal{F}_{{\mathrm{ZK}}}^{\mathcal {R}_{{\mathrm{isBit}}}}}}\).

Alice is Corrupted

Recalling that Alice does not receive any output from the execution, we only need to prove that privacy is preserved and that Bob’s output cannot be affected (except with negligible probability). Formally, let \(\mathcal{A}\) denote an adversary controlling Alice; we construct a simulator \(\mathcal{S}\) as follows:

  1. 1.

    \(\mathcal{S}\) is given a text t of length n, an integer m and \(\mathcal{A} \)’s auxiliary input and invokes \(\mathcal{A}\) on these values.

  2. 2.

    \(\mathcal{S}\) emulates the trusted party for \(\pi _{\scriptscriptstyle\mathrm{KeyGen}}\) as follows. It first chooses two random elements \(s_{A},s_{B}\in \mathbb{Z} _{q} \) and hands \(\mathcal{A}\) its share s A and the public key \(\langle \mathbb{G} _{q} ,q,g,h=g^{s_{A}+ s_{B}}\rangle\).

  3. 3.

    Next, \(\mathcal{S}\) sends m encryptions of 0 and emulates \(\mathcal{F}_{{\mathrm{ZK}}}^{\mathcal {R}_{{\mathrm{isBit}}}}\) by sending 1.

  4. 4.

    \(\mathcal{S}\) receives from \(\mathcal{A}\) n encryptions and the witness for the trusted party for π isBit. If the conditions for which the functionality outputs 1 are not met, \(\mathcal{S}\) aborts by sending ⊥ to the trusted party for \(\mathcal{F}_{\mathrm{PM}}\) and outputs whatever \(\mathcal{A}\) outputs.

  5. 5.

    Otherwise, \(\mathcal{S}\) defines t according to the witness for π isBit and records it.

  6. 6.

    \(\mathcal{S}\) and \(\mathcal{A}\) compute P, \(\{\bar{T}_{j}\}_{j}\) and {Δ j } j as in the hybrid execution. Then, \(\mathcal{S}\) emulates \({\mathcal{F}_{\mathrm{Dec0}}}\) accepting if the ideal functionality would accept as well.

  7. 7.

    If at any point \(\mathcal{A}\) sends an invalid message, \(\mathcal{S}\) aborts, sending ⊥ to the trusted party for \(\mathcal{F}_{\mathrm{PM}}\). Otherwise, it sends (t,m) to the trusted party and outputs whatever \(\mathcal{A}\) does.

Clearly, \(\mathcal{S}\) runs in probabilistic polynomial time. We now prove that the joint output distribution is computationally indistinguishable in both executions. To see that \(\mathcal{A}\)’s view is computationally indistinguishable, note first that the only difference between the executions is with respect to the encryptions that assemble p, i.e., the encryptions of the bits of the pattern (as \(\mathcal{S}\) sends encryptions of zero).

We prove that \(\mathcal{A}\) cannot distinguish the simulated and hybrid views via a reduction to the semantic security of ElGamal (cf. Definition 4). More formally, assume there exists a distinguisher D for these executions; we construct a distinguisher \(D_{E}\) as follows. Upon receiving a public key pk and auxiliary input p, \(D_{E}\) engages in an execution of \(\pi_{\mathrm{KeyGen}}\) with \(\mathcal{A}\) and sends it \((s_{A},pk)\) where \(s_{A}{\in _{R}}{{\mathbb{Z}}_{q}}\). \(D_{E}\) continues emulating the role of Bob as \(\mathcal{S}\) does, except for Step 2 of the protocol, where it needs to send the encryptions of \(p_{1},\ldots,p_{m}\). In this step \(D_{E}\) outputs two sets of plaintexts: (i) \(p_{1},\ldots,p_{m}\) and (ii) 0,…,0. We denote by \(\tilde{P}_{1},\ldots,\tilde{P}_{m}\) the set of encryptions it receives back. \(D_{E}\) hands \(\mathcal{A}\) this set and completes the run as \(\mathcal{S}\) does. Finally, it invokes D on \(\mathcal{A}\)’s output and outputs whatever D outputs. Note that at no point in the reduction does \(D_{E}\) need to use the actual plaintexts that correspond to the challenge ciphertexts. Moreover, if \(D_{E}\) is given the encryptions of p, then the adversary’s view is distributed as in the hybrid execution. Similarly, if it receives encryptions of zeros, then the adversary’s view is as in the simulation with \(\mathcal{S}\).

It remains to show that the honest Bob outputs the same set of indices with overwhelming probability in both executions. This follows directly from the correctness argument above. In particular, assuming that Alice indeed completes the execution honestly (which is indeed the case due to the zero-knowledge proofs), the protocol correctly computes the matching text locations. This concludes the case that Alice is corrupted.

Bob is Corrupted

Let \(\mathcal{A}\) denote an adversary controlling Bob. In this case we need to prove that Bob does not learn anything but the matching text locations. We similarly construct a simulator \(\mathcal{S}\) as follows:

  1. 1.

    \(\mathcal{S}\) is given a pattern p of length m, an integer n and \(\mathcal{A}\)’s auxiliary input and invokes \(\mathcal{A}\) on these values.

  2. 2.

    \(\mathcal{S}\) emulates the trusted party for \(\pi_{\mathrm{KeyGen}}\) as follows. It first chooses two random elements \(s_{A},s_{B}\in \mathbb{Z}_{q}\) and hands \(\mathcal{A}\) its share \(s_{B}\) and the public key \(\langle {\mathbb{G}}_{q} ,q,g,h=g^{s_{A}+ s_{B}}\rangle\).

  3. 3.

    \(\mathcal{S}\) receives from \(\mathcal{A}\) m encryptions and \(\mathcal{A}\)’s input for the trusted party for \({{\mathcal{F}_{{\mathrm {ZK}}}^{\mathcal{R}_{{\mathrm{isBit}}}}}}\). If the conditions for which the functionality outputs 1 are not met, \(\mathcal{S}\) aborts by sending ⊥ to the trusted party for \(\mathcal{F}_{\mathrm{PM}}\) and outputs whatever \(\mathcal{A}\) outputs.

  4. 4.

    Otherwise, \(\mathcal{S}\) defines the pattern p according to the witness for \(\pi_{{\scriptscriptstyle\mathrm{isBit}}} \) and sends it to the trusted party for \(\mathcal{F}_{\mathrm{PM}}\). Let Z be the set of returned indices.

  5. 5.

    Next, \(\mathcal{S}\) sends n fresh encryptions of 0 and emulates \({{\mathcal{F}_{{\mathrm {ZK}}}^{\mathcal{R}_{{\mathrm{isBit}}}}}}\) by sending 1.

  6. 6.

    Finally, \(\mathcal{S}\) and \(\mathcal {A}\) compute P, \(\{\bar{T}_{j}\}_{j}\) and \(\{\Delta_{j}\}_{j}\) as in the hybrid execution. Then \(\mathcal{S}\) emulates \({\mathcal {F}_{\mathrm{Dec0}}}\) by sending an output as specified by Z rather than by the encrypted “result”, \(\{\Delta_{j}\}_{j}\). Namely, \(\mathcal{S}\) “decrypts” \(\Delta_{j}\) into zero if and only if \(j\in Z\).

  7. 7.

    If at any point \(\mathcal{A}\) sends an invalid message, \(\mathcal{S}\) aborts, sending ⊥ to the trusted party for \(\mathcal{F}_{\mathrm{PM}}\). Otherwise, it outputs whatever \(\mathcal{A}\) does.

It is immediate that \(\mathcal{S}\) runs in probabilistic polynomial time. We prove next that the adversary’s views are computationally indistinguishable via a reduction to the semantic security of ElGamal. Recall that the key difference between the executions is that the encryptions of Alice’s text are replaced by encryptions of 0’s, which implies that the result given to \(\mathcal{A}\) in Step 6 of the simulation may not match the actual plaintexts.

Formally, assume there exists a distinguisher D for the simulated and hybrid protocol views. We may then construct a distinguisher \(D_{E}\) breaking the semantic security of ElGamal PKE as follows. Upon receiving a public key pk and auxiliary input t, \(D_{E}\) emulates \(\mathcal{F}_{\mathrm{KeyGen}}\) by sending \((s_{B},pk)\) to \(\mathcal{A}\) where \(s_{B}{\in _{R}}{{\mathbb{Z}}_{q}}\). Note that this perfectly matches \(\mathcal{A}\)’s view in both protocol and simulation. \(D_{E}\) continues emulating the role of Alice as \(\mathcal{S}\) does except for Step 5 of the simulation. Instead of simulating Alice’s input, \(D_{E}\) outputs two sets of plaintexts: (i) \((t_{1},\ldots,t_{n})\) and (ii) (0,…,0). We denote by \(\tilde{T}_{1},\ldots,\tilde{T}_{n}\) the set of encryptions it receives back; \(D_{E}\) hands \(\mathcal{A}\) this set and completes the simulated run. Finally, \(D_{E}\) invokes D on \(\mathcal{A}\)’s output and outputs whatever D outputs.

If D successfully distinguishes between a simulated view and a view of the hybrid protocol, then D E distinguishes between encryptions of the t i ’s and encryptions of 0’s. For case (i), i.e., if D E received encryptions of t 1,…,t n , \(\mathcal{A}\)’s view is identical to the view when executing the hybrid protocol, since except for the interaction with \(\mathcal {F}_{\mathrm{KeyGen}}\), \({{\mathcal{F}_{{\mathrm{ZK}}}^{\mathcal {R}_{{\mathrm{isBit}}}}}}\), and \({\mathcal{F}_{\mathrm{Dec0}}}\), Alice’s only action is to send her encrypted input. For case (ii), D E  sends n encryptions of 0 to \(\mathcal{A}\), hence in this case D E ’s behavior exactly matches that of \(\mathcal{S}\).  □

Complexity of π PM

The round complexity is constant as the key generation process and the zero-knowledge proofs run in constant rounds. Further, the number of group elements exchanged is bounded by O(n+m) as there are n−m+1 substrings of length m and each zero-knowledge proof requires a constant number of exponentiations. Regarding computational complexity, it is clear that except for Step 4 at most O(m+n) exponentiations are required. Note first that Eq. (3) can be implemented using the square-and-multiply technique. Namely, for every j=1,…,n−m+1, \(\bar {T}_{j}\) is computed as \((\cdots((T_{j+m-1})^{2}\cdot T_{j+m-2})^{2}\cdot T_{j+m-3}\cdots)^{2}\cdot T_{j}\). This requires O(m) multiplications for each text location, which amounts to O(nm) multiplications in total for the entire text. The number of multiplications can be reduced to O(n), at the expense of increasing the number of exponentiations by a constant factor. That is, in addition to sending an encryption of 0 or 1 for each text location, Alice sends an encryption of 0 or \(2^{m}\), respectively, and proves consistency. This makes it possible to complete the transformation from binary representation in constant time per text location. We comment that, from a practical point of view, it may be much more efficient to compute O(m) multiplications for each location than to prove this consistency (even though the latter only requires a constant number of additional exponentiations). Finally, note that our protocol utilizes ElGamal encryption, which can be implemented over an elliptic curve group. This may reduce the modulus size dramatically, as only 160 bits are then typically needed for the size of the key. This also means that the length of the pattern must be bounded by 160 bits. For applications that require longer patterns we propose a different approach; see Sect. 3.1.
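
The two ways of computing the window encodings can be compared on plaintexts. The sketch below is one possible reading of the remark about the extra encryptions of 0 or \(2^{m}\), not necessarily the exact construction intended: it uses the identity \(\bar{t}_{j+1} = (\bar{t}_{j} - t_{j} + 2^{m}t_{j+m})\cdot 2^{-1} \bmod q\). Under the encryption, a doubling corresponds to squaring a ciphertext, an addition to a ciphertext multiplication, and the factor \(2^{-1}\) to one exponentiation. The modulus Q and the function names are assumptions chosen for illustration.

```python
Q = 2**61 - 1  # a prime standing in for the group order q (assumption)


def windows_horner(t_bits, m):
    """t_bar_j via Horner's rule: O(m) doublings and additions per window."""
    out = []
    for j in range(len(t_bits) - m + 1):
        acc = 0
        for i in reversed(range(m)):          # from t_{j+m-1} down to t_j
            acc = (2 * acc + t_bits[j + i]) % Q
        out.append(acc)
    return out


def windows_sliding(t_bits, m):
    """Same values with O(1) work per window after the first one."""
    inv2 = (Q + 1) // 2                       # 2^{-1} mod Q, since Q is odd
    # Plaintexts of the additional encryptions of 0 or 2^m sent by Alice.
    scaled = [(2 ** m) * b % Q for b in t_bits]
    out = windows_horner(t_bits[:m], m)       # first window only
    for j in range(len(t_bits) - m):
        nxt = (out[-1] - t_bits[j] + scaled[j + m]) * inv2 % Q
        out.append(nxt)
    return out


print(windows_horner([1, 0, 1, 1, 0, 1, 1], 3))   # [5, 6, 3, 5, 6]
print(windows_sliding([1, 0, 1, 1, 0, 1, 1], 3))  # [5, 6, 3, 5, 6]
```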

3.1 Variations

The following variations can be handled similarly to the classic problem of pattern matching.

Non-binary Alphabets

Alphabets of larger size, s, can be handled by encoding the characters as elements of \(\mathbb {Z} _{s}\) and using s-ary rather than binary notation for the \(\bar{T}_{j}\) and P. Proving in ZK that an encryption contains a valid character is straightforward, e.g. it can be provided in binary (which of course requires O(log s) encryptions).
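
As a plaintext-level illustration of the s-ary encoding, the following sketch encodes windows and pattern in base s and compares them. The four-letter DNA mapping and the function names are hypothetical, and the encryption layer is omitted entirely.

```python
def encode_sary(chars, s):
    """Encode a string over Z_s as sum_i s^(i-1) * c_i (least significant first)."""
    assert all(0 <= c < s for c in chars)
    return sum(c * s ** i for i, c in enumerate(chars))


def matches_sary(text, pattern, s):
    """Plaintext view of the comparison: window encoding minus pattern encoding."""
    m = len(pattern)
    target = encode_sary(pattern, s)
    return [j + 1 for j in range(len(text) - m + 1)
            if encode_sary(text[j:j + m], s) - target == 0]


DNA = {"A": 0, "C": 1, "G": 2, "T": 3}   # hypothetical alphabet with s = 4
text = [DNA[c] for c in "ACGTACG"]
pattern = [DNA[c] for c in "ACG"]
print(matches_sary(text, pattern, 4))    # [1, 5]
```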

Long Patterns

When the pattern length m (or the alphabet size s) is large, requiring \(q>s^{m}\) may not be acceptable. This can be avoided by encoding the pattern p and substrings \(\bar{t}_{j}\) into multiple \({\mathbb{Z}}_{q}\) values, \(\{p^{(i)}\}_{i},\{\bar{t}^{(i)}_{j}\}_{i}\) for \(i\in\{1,\ldots,\rho\}\), where \(\rho=\lceil\log_{2} s^{m}/\log_{2} q\rceil\) is the number of blocks of length log q that are required to “cover” \(\log s^{m}\). Having computed encryptions \(\{\Delta_{i}\}_{i}\) of the differences \(\{\delta_{i} = p^{(i)}- \bar{t}^{(i)}_{j}\}_{i}\), Alice raises each encryption to a random, non-zero exponent \(r_{i}\), rerandomizes them and sends them to Bob (proving that everything was done correctly). The parties then execute \(\pi_{\mathrm{Dec0}}\) on the product of these encryptions and Bob reports a match if a 0 is found. Note that the plaintext of this product is \(\sum_{i} r_{i}\delta_{i}\). Thus, if the pattern matches, all \(\delta_{i}=0\), implying that this is an encryption of 0. If one or more \(\delta_{i}\neq 0\), then the probability of this being an encryption of 0 is negligible. The overhead of this approach is dominated by repeating the basic linear solution ρ times for each text location, since the parties now compare ρ blocks each time rather than just one. Hence, communication and computation complexities are multiplied by ρ.
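
The block-wise test can be sketched on plaintexts as follows. The modulus Q, the toy block length and the helper names are assumptions made for illustration; the actual protocol performs the same arithmetic under the encryption and decrypts only the final product.

```python
import secrets

Q = 2**61 - 1        # a prime standing in for the group order q (assumption)
BLOCK_BITS = 8       # toy block length; must satisfy 2**BLOCK_BITS < Q


def to_blocks(bits):
    """Split a bit string into Z_q values, BLOCK_BITS bits per value."""
    blocks = []
    for start in range(0, len(bits), BLOCK_BITS):
        chunk = bits[start:start + BLOCK_BITS]
        blocks.append(sum(b << k for k, b in enumerate(chunk)))
    return blocks


def masked_difference(p_bits, window_bits):
    """Plaintext of the product ciphertext: sum_i r_i * delta_i mod q."""
    deltas = [(p - t) % Q
              for p, t in zip(to_blocks(p_bits), to_blocks(window_bits))]
    rs = [secrets.randbelow(Q - 1) + 1 for _ in deltas]   # random non-zero r_i
    return sum(r * d for r, d in zip(rs, deltas)) % Q


pattern = [1, 0] * 10                 # 20 bits, i.e. three blocks
window = list(pattern)
print(masked_difference(pattern, window) == 0)   # True: every delta_i is 0
```

When exactly one block differs, the result is a product of two non-zero elements of \(\mathbb{Z}_{q}\) and hence never 0; when several blocks differ, it is 0 only with probability roughly 1/q over the choice of the \(r_{i}\).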

Hiding Matched Locations

It may be required that Bob only learns the number of matches and not the actual locations of the hits. One example is determining how frequently some gene occurs rather than where it occurs in some DNA sequence. This is easily achieved by simply having Alice pick a uniformly random permutation and permute (and rerandomize) the Δ j of Eq. (4). The encryptions are sent to Bob, and π Perm is executed, allowing him to verify Alice’s behavior. Finally, π Dec0 is run and Bob outputs the number of encryptions of 0 received. Correctness is immediate: An encryption of 0 still signals that a match occurred. However, due to the random permutation that Alice applies, the locations are shuffled, implying that Bob does not learn the actual matches.
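
A toy illustration of this variant is given below. It is a sketch under explicit assumptions: the group parameters are tiny and insecure, a single key replaces the shared key, `is_zero` stands in for π Dec0, and the proof π Perm is not modelled. All names are hypothetical.

```python
import secrets

# Toy parameters (assumptions; far too small to be secure).
P_MOD, Q, G = 2039, 1019, 4   # safe prime, subgroup order, generator


def enc(h, msg, r=None):
    """Additively homomorphic ElGamal: <g^r, h^r * g^msg>."""
    r = secrets.randbelow(Q) if r is None else r
    return (pow(G, r, P_MOD), pow(h, r, P_MOD) * pow(G, msg % Q, P_MOD) % P_MOD)


def rerandomize(h, c):
    """Multiply by a fresh encryption of 0; the plaintext is unchanged."""
    z = enc(h, 0)
    return (c[0] * z[0] % P_MOD, c[1] * z[1] % P_MOD)


def is_zero(sk, c):
    """Stand-in for pi_Dec0: checks whether the plaintext is 0 mod Q."""
    return c[1] * pow(c[0], Q - sk, P_MOD) % P_MOD == 1


def count_matches(sk, h, deltas):
    """Alice permutes and rerandomizes; Bob only learns how many are 0."""
    shuffled = [rerandomize(h, c) for c in deltas]
    secrets.SystemRandom().shuffle(shuffled)
    return sum(is_zero(sk, c) for c in shuffled)


sk = 123
h = pow(G, sk, P_MOD)
deltas = [enc(h, d) for d in [0, 5, 0, 7, 3]]   # two matching positions
print(count_matches(sk, h, deltas))             # 2
```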

4 Secure Pattern Matching with Wildcards

The first variant of the classical pattern matching problem allows Bob to place wildcards, denoted by ⋆, in his pattern; these should match both 0 and 1. More formally, the parties wish to compute the functionality \({\mathcal{F}_{{\mathrm{PM-{\star}}}}}\) defined by

$$\bigl((p,n),(t,m)\bigr)\mapsto \left\{ \begin{array} {l @{\quad }l} \bigl(\{j \mid\bar{t}_j\stackrel{\star}{\equiv}p\} _{j=1}^{n-m+1},\lambda\bigr) & \mathrm{if}\ |p|= m\ \mathrm{and}\ |t|= n , \\[6pt] (\lambda,\lambda) & \mbox{otherwise} , \\ \end{array} \right. $$

where \(\bar{t}_{j}\) is the substring of length m that begins at the jth position of t and \(\stackrel{\star}{\equiv}\) is defined as “equal except with respect to ⋆-positions.” This problem has been widely looked at by researchers with the aim to generalize the basic searching model to searching with errors. This variant is known as pattern matching with don’t cares and can be solved in O(n+m) time [28]. The secure version of this problem guarantees that Alice will not be able to trace the locations of the don’t cares in addition to the security requirement introduced for the basic problem.

The core idea of the solution is to proceed as in the standard one with two exceptions: Bob must supply the wildcard positions in encrypted form, and the substrings of Alice’s text must be modified to ensure that they will match (i.e., equal) the pattern at those positions. Achieving correctness and ensuring correct behavior requires substantial modification of the protocol. Intuitively, for every m-bit substring \(\bar{t}_{j}\) of t, Bob replaces Alice’s value by 0 at the wildcard positions resulting in a string \(\bar{t}_{j}'\), see Step 6 below. Similarly, a pattern p′ is obtained from p by replacing the wildcards by 0. Clearly this ensures that the bits of \(\bar{t}_{j}'\) and p′ are equal at all wildcard positions. Thus, \(\bar{t}_{j}' = p'\) precisely when \(\bar{t}_{j}\) equals p at all non-wildcard positions.

Protocol \(\pi_{\mathrm {PM\mbox{-}{\star}}}\)

  • Inputs: The input of Alice is a binary string t of length n and an integer m, whereas the input of Bob is a string p over the alphabet {0,1,⋆} of length m and an integer n. The parties share a security parameter \(1^{\kappa}\) as well.

  • The protocol:

    1. 1.

      Alice and Bob run protocol \(\pi _{\scriptscriptstyle\mathrm{KeyGen}} (1^{\kappa},1^{\kappa})\) to generate a public key \(pk=\langle \mathbb{G} _{q} ,q,g,h\rangle\), and the respective shares s A and s B of the secret key sk.

    2. 2.

      For each position i=1,…,m, Bob first replaces ⋆ by 0

      $$ p'_i \gets \left\{ \begin{array}{l@{\quad }l} 1& \text{if}\ p_i=1 , \\ 0& \text{otherwise}. \end{array} \right. $$

      He then sends encryptions \(P'_{i} = E_{pk}(p'_{i};r_{p'_{i}})\) for i=1,…,m to Alice, and for each one they execute π isBit. Finally, both parties compute an encryption of Bob’s “pattern” in binary,

      $$ P' \gets\prod_{i=1}^m P_i^{\prime\,2^{i-1}}. $$
    3. 3.

      For each position i=1,…,m of Bob’s pattern, he computes a bit denoting the occurrences of a ⋆,

      $$ w_i \gets \left\{ \begin{array}{l@{\quad }l} 0& \text{if}\ p_i=\star, \\ 1& \text{otherwise}. \end{array} \right. $$

      He then encrypts these and sends the result to Alice,

      $$W_i\gets {E_{pk}} (w_i; r_{w_i} ), $$

      and the two run π isBit for each one.

    4. 4.

      For each i=1,…,m, Bob and Alice run π isBit on \(W_{i}/P'_{i}\). This demonstrates to Alice that if \(p'_{i}\) is set, then so is w i , i.e. that only 0’s occur at wildcard positions.

    5. 5.

      Alice supplies her input as in Step 3 of Protocol π PM in Sect. 3. She sends encryptions, \(T_{j} = E_{pk}(t_{j};r_{t_{j}})\) j=1,…,n, of the bits of t to Bob. Then the parties run π isBit for each of the encryptions.

    6. 6.

      For every m-bit substring of t starting at position j=1,…,n−m+1, Bob computes an encryption

      $$\bar{T}_j'\gets \Biggl(\prod _{i=1}^m \bigl( (T_{j+i-1} )^{w_i} \bigr)^{2^{i-1}} \Biggr)\cdot{ {E_{pk}} (0;r_{j} )}. $$

      He sends these to Alice, and they run π ⋆-proof on the tuple consisting of the encryptions of Alice’s input and Bob’s w i , as well as the \(\bar{T}_{j}'\). This allows Alice to verify that Bob correctly computed encryptions of her substrings with her input replaced by 0 at Bob’s wildcard positions.

    7. 7.

      The protocol concludes as Protocol \(\pi_{\mathrm{PM}}\) does. Namely, for each of the \(\bar{T}_{j}'\) where j=1,…,n−m+1, the parties compute

      $$\Delta_j \gets\bar{T}_j' \cdot P^{\prime\,-1} $$

      and run \(\pi_{\mathrm{Dec0}}\). This reveals to Bob which of the plaintexts \(\delta_{j}\) are 0. For each \(\delta_{j}=0\) he concludes that the pattern matched and outputs j.
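
The arithmetic carried out under the encryption can be followed on plaintexts. The sketch below is illustrative only: it omits the encryption, the rerandomization and all proofs, writes the wildcard as `*`, and uses hypothetical function names.

```python
WILDCARD = "*"


def bob_encoding(pattern):
    """Steps 2 and 3: p'_i (wildcards replaced by 0) and the mask bits w_i."""
    p_prime = [1 if c == "1" else 0 for c in pattern]
    w = [0 if c == WILDCARD else 1 for c in pattern]
    # Step 4 checks that w_i - p'_i is a bit, i.e. p'_i = 1 implies w_i = 1.
    assert all(wi - pi in (0, 1) for wi, pi in zip(w, p_prime))
    return p_prime, w


def wildcard_matches(text, pattern):
    t = [int(c) for c in text]
    p_prime, w = bob_encoding(pattern)
    m = len(pattern)
    target = sum(pi << i for i, pi in enumerate(p_prime))        # plaintext of P'
    hits = []
    for j in range(len(t) - m + 1):
        # Step 6: plaintext of T'_j, with text bits zeroed at wildcard positions.
        masked = sum((w[i] * t[j + i]) << i for i in range(m))
        if masked - target == 0:                                 # Step 7: delta_j = 0
            hits.append(j + 1)                                   # 1-based position
    return hits


print(wildcard_matches("1011011", "1*1"))   # [1, 4]
print(wildcard_matches("1011011", "*11"))   # [2, 5]
```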

To see that the protocol does not introduce new opportunities for malicious behavior, first note that Alice’s specification is essentially as in the basic protocol π PM. Regarding Bob, the proofs of correct behavior limit him to supplying an input that an honest Bob could have supplied as well. Bob’s input, \(p_{i}'\) for i=1,…,m, is first shown to be a bit string, Step 2. The invocations of π isBit of Step 3 then ensure that so is the “wildcard string”. Finally, in Step 4 it is verified that for each wildcard p i of p, \(p_{i}' = 0\). In other words, there is a valid input where the honest Bob would send encryptions of the values that the malicious Bob can use. The only remaining option for a malicious Bob is in Step 6, however, the invocations of π ⋆-proof ensure his correct behavior. Formal simulation is analogous to that in Sect. 3. We state the following theorem:

Theorem 7

(Wildcards)

Assume that the DDH assumption holds in \(\mathbb{G} _{q} \), then π PM-⋆ securely computes \({\mathcal{F}_{{\mathrm{PM}\mbox{-}{\star}}}}\) in the presence of malicious adversaries.

Regarding complexity, clearly the most costly part of the protocol is Step 6, which requires Bob to send Θ(n+m) encryptions to Alice, as well as an invocation of \(\pi_{\star\mbox{-}\mathrm{proof}}\). Hence, communication complexity is O(n+m) and round complexity remains constant, while the invocation of the proof increases computation to O(nm) multiplications and exponentiations. We remark that dropping the ZK-proofs results in a passively secure variant requiring only O(n+m) exponentiations, since the computation of \(\bar{T}_{j}'\) in Step 6 can be implemented similarly to square-and-multiply.

5 Secure Approximate Matching

The second variation considered is approximate pattern matching: Alice holds an n-bit string t, while Bob holds an m-bit pattern p. The parties wish to determine approximate matches, i.e., substrings with Hamming distance less than some threshold τ≤m. This is captured by the functionality \(\mathcal{F}_{{\mathrm{APM}}}\) defined by

$$\bigl((p,n, \tau),\bigl(t,m, \tau'\bigr)\bigr)\mapsto \left\{ \begin{array} {l@{\quad } l} (\{j \mid \delta_H (\bar{t}_j,p ) < \tau\} _{j=1}^{n-m+1},\lambda) & \mbox{if $|p|= m\geq\tau= \tau'$} \\ & \mbox{and $|t|=n$,} \\ (\lambda,\lambda) & \mbox{otherwise,} \\ \end{array} \right. $$

where δ H denotes Hamming distance and \(\bar{t}_{j}\) is the substring of length m that begins at the jth position in t. We assume that the parties share some threshold \(\tau\in \mathbb{N} \). Note that this problem is an extension of pattern matching with don’t cares problem introduced in Sect. 4. Bob is able to learn all the matches within some error bound instead of learning the matches for specified error locations.

Two of the most important applications of approximate pattern matching are spell checking and matching DNA sequences. The most recent algorithm for solving this problem without considering privacy is by Amir et al. [3], who introduced a solution running in time \(O(n\sqrt{\tau\log\tau})\). Our solution achieves O(nm) computation and O(nτ) communication complexity.

The main idea behind the construction is to have the parties securely supply their inputs in binary as above. Then, to determine the matches, the parties first compute the (encrypted) Hamming distance h j for each position j, using the homomorphic properties of ElGamal PKE (Steps 5 and 6). They then check whether h j =k for each k<τ. To avoid leaking information, these results are permuted before the final decryption.

Protocol π APM

  • Inputs: The input of Alice is a binary string t of length n, an integer m and a threshold τ′, whereas the input of Bob is a binary string p of length m, an integer n and a threshold τ. The parties share a security parameter \(1^{\kappa}\) as well.

  • The protocol:

    1. 1.

      Alice and Bob run protocol \(\pi _{\scriptscriptstyle\mathrm{KeyGen}} (1^{\kappa},1^{\kappa})\) to generate a public key \(pk=\langle \mathbb{G} _{q} ,q,g,h\rangle\), and the respective shares s A and s B of the secret key sk.

    2. 2.

      Alice sends Bob τ′ and the parties continue if τ=τ′.

    3. 3.

      As in the basic solution, Bob first sends encryptions \(P_{i} = E_{pk}(p_{i};r_{p_{i}})\) i=1,…,m, of the bits of his m-bit pattern, p, to Alice. They then run π isBit for each one.

    4. 4.

      Alice similarly provides encryptions, \(T_{j} = E_{pk}(t_{j};r_{t_{j}})\) j=1,…,n of her input as in π PM; for each one the parties execute π isBit.

    5. 5.

      For every m-bit substring of t starting at position j=1,…,n−m+1, Bob computes an encryption

      $$ H_j'\gets\prod _{i=1}^m (T_{j+i-1} )^{-2p_i}\cdot E_{pk} (1;0 ) ^{p_i} $$
      (6)

      and rerandomizes it. He then sends all these to Alice and demonstrates that they have been correctly computed by executing \(\pi_{H\mbox{-}\mathrm{proof}}\) on the encryptions \(P_{i}\) of the \(p_{i}\) and the \(H_{j}'\).

    6. 6.

      For every m-bit substring \(\bar{t}_{j}\) of t starting at position j=1,…,n−m+1, both parties locally compute encryptions of the Hamming distance between \(\bar{t}_{j}\) and p,

      $$ H_{j}\gets H_j'\cdot \Biggl(\prod_{i=1}^mT_{j+i-1}\Biggr). $$
      (7)
    7. 7.

      For every k=0,…,τ−1 (i.e., for every Hamming distance which would be considered a match) and for every substring of length m starting at j=1,…,n−m+1, both parties compute

      $$ \Delta_{j,k}\gets H_j\cdot \bigl\langle1, g^{-k} \bigr\rangle. $$
      (8)
    8. 8.

      For every j=1,…,n−m+1, Alice picks a uniformly random permutation \(\pi_{j}: \mathbb{Z}_{\tau}\to \mathbb{Z}_{\tau}\) and applies \(\pi_{j}\) to the set \(\{\Delta_{j,k}\}_{k}\),

      $$\bigl(\Delta_{j,0}',\ldots,\Delta_{j,\tau-1}' \bigr) \gets \pi_j (\Delta_{j,0},\ldots, \Delta_{j,\tau-1} ), $$

      rerandomizes all encryptions,

      $$\Delta_{j,k}''\gets\Delta_{j,k}' \cdot {E_{pk}} \bigl(0;r_{j,k}' \bigr) $$

      for j=1,…,n−m+1 and k=0,…,τ−1, and sends the \(\Delta _{j,k}''\) to Bob. For every permutation, j=1,…,n−m+1, the parties execute \(\pi_{\mathrm{Perm}}\) on \(( (\Delta_{j,0},\ldots,\Delta_{j,\tau-1} ), (\Delta_{j,0}'',\ldots,\Delta_{j,\tau-1}'' ) )\), allowing Bob to verify that the plaintexts of the \(\Delta_{j,k}''\) correspond to those of the \(\Delta_{j,k}\) for all (fixed) j.

    9. 9.

      Finally, Alice and Bob execute \(\pi_{\mathrm{Dec0}}\) on each \(\Delta_{j,k}''\) for j=1,…,n−m+1 and k=0,…,τ−1. This reveals to Bob which plaintexts \(\delta_{j,k}''\) are 0. He then outputs j iff this is the case for one of \(\delta_{j,0}'', \ldots, \delta_{j,\tau-1}''\).

Correctness follows from the intuition: The plaintexts of the H j from Eq. (7) are the desired Hamming distances. It is straightforward to verify that if the \(H_{j}'\) have been correctly computed, the p i are bits, and the T j are encryptions of bits, then the encryption

$$H_j'\cdot \Biggl(\prod_{i=1}^mT_{j+i-1} \Biggr) = \prod_{i=1}^m (T_{j+i-1} )^{1-2p_i}\cdot E_{pk} (1;0 ) ^{p_i} $$

contains the Hamming distance between the string \(p\in\{0,1\}^{m}\) and the encrypted substring of length m starting at position j. The expression

$$(T_{j+i-1} )^{1-2p_i}\cdot E_{pk} (1;0 ) ^{p_i} $$

simply negates the encrypted bit, T j+i−1, if p i is set, i.e., computes an encryption of \(t_{j+i-1}\oplus p_{i}\). Further, as multiplying ciphertexts computes the encrypted sum of the plaintexts and m<q, the overall result is the number of differing bits, in other words the Hamming distance.

Each threshold test is performed using τ tests of equality, one for each possible value k<τ, where each test simply subtracts the associated k from H j under the encryption, Eq. (8), at which point the parties may mask and decrypt towards Bob. Note that the standard masking combined with the permutation of Step 8 ensures that for every potential match, Bob either receives τ uniformly random encryptions of random, non-zero values, or τ−1 such encryptions and a single encryption of zero. Both are easily simulated, hence we state the following theorem:

Theorem 8

(Approximate)

Assume that the DDH assumption holds in \({{\mathbb{G}}_{q}}\), then π APM securely computes \({\mathcal{F}_{{\mathrm{APM}}}}\) in the presence of malicious adversaries.

Regarding complexity, the most expensive steps are those associated with computing the Hamming distances, Steps 5 and 6, and the permutations and decryptions needed to compare the Hamming distances to τ, Steps 8 and 9. The former requires O(m+n) communication, but O(nm) multiplications and exponentiations. The latter requires O(nτ) communication, multiplications and exponentiations. As τ≤m this implies O(nτ) communication and O(mn) computation overall. Round complexity is constant as in the previous solutions. We remark that dropping the ZK-proofs results in a more efficient, passively secure variant, since the computational complexity of Steps 5 and 6 is reduced to O(nm) multiplications and O(n+m) exponentiations.
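The homomorphic evaluation of Eqs. (7) and (8) can be illustrated with a small Python sketch. It is a toy model under simplifying assumptions, not the protocol itself: a single party holds the secret key (the protocol shares it between Alice and Bob), the zero-knowledge proofs and the permutation of Step 8 are omitted, and the group parameters as well as all function names are hypothetical and far too small for real use.

```python
import random

# Toy parameters (assumption, insecure): P = 2*Q + 1 is a safe prime and
# G generates the subgroup of prime order Q.
P, Q, G = 1019, 509, 4

def keygen():
    sk = random.randrange(1, Q)
    return sk, pow(G, sk, P)

def enc(h, m, r=None):
    """E_pk(m; r) = <g^r, h^r * g^m> (additively homomorphic ElGamal)."""
    if r is None:
        r = random.randrange(Q)
    return (pow(G, r, P), pow(h, r, P) * pow(G, m % Q, P) % P)

def mul(c1, c2):
    """Multiplying ciphertexts adds the plaintexts."""
    return (c1[0] * c2[0] % P, c1[1] * c2[1] % P)

def cpow(c, e):
    """Raising a ciphertext to e multiplies the plaintext by e."""
    e %= Q
    return (pow(c[0], e, P), pow(c[1], e, P))

def decrypts_to_zero(sk, c):
    """Stand-in for pi_Dec0: mask with a random non-zero exponent, test g^m == 1."""
    a, b = cpow(c, random.randrange(1, Q))
    return b * pow(a, Q - sk, P) % P == 1

def approx_matches(t, p, tau):
    """All 1-based j whose m-bit substring has Hamming distance < tau from p."""
    sk, h = keygen()
    T = [enc(h, bit) for bit in t]
    m, out = len(p), []
    for j in range(len(t) - m + 1):
        H = enc(h, 0)  # a fresh encryption of 0 rerandomizes the product
        for i, p_i in enumerate(p):
            # (T_{j+i-1})^{1-2p_i} * E(1;0)^{p_i} encrypts t_{j+i-1} XOR p_i
            H = mul(H, cpow(T[j + i], 1 - 2 * p_i))
            H = mul(H, cpow(enc(h, 1, 0), p_i))
        # Eq. (8): Delta_{j,k} = H_j * <1, g^{-k}> for k = 0, ..., tau-1
        deltas = [mul(H, enc(h, -k, 0)) for k in range(tau)]
        if any(decrypts_to_zero(sk, d) for d in deltas):
            out.append(j + 1)
    return out
```

For instance, `approx_matches([0, 1, 1, 0, 1, 0, 0, 1], [1, 1, 0], 2)` reports every position whose 3-bit window differs from the pattern in at most one bit. The sketch relies on m<q, as in the correctness argument above, so that the encrypted sum cannot wrap around.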

5.1 A Variation—Using Paillier Encryption

The approximate pattern matching protocol is our most costly construction in terms of communication, as O(nτ) elements are exchanged between the parties. This is due to implementing the comparison between Hamming distance and threshold using τ equality tests. We now propose an alternative to the above scheme, and note that it requires \(o(n\sqrt{\tau\log\tau})\) communication, i.e., it exchanges fewer elements than any “naive”, secure implementation based on [3] would.

Our protocol could equally well be constructed using Paillier encryption, [39]. The drawbacks include a significantly less efficient key generation as well as larger ciphertexts due to basing security on factoring rather than discrete logarithms. However, comparison (greater-than) becomes much more efficient requiring communication complexity of \(O(\operatorname {loglog}{\tau} \cdot(\operatorname{logloglog}{\tau} + k))\) where k is a security or correctness parameter, [42]. This implies an overall communication complexity of \(O(n\cdot\operatorname{loglog}{\tau} ( \operatorname {logloglog}{\tau} + k))\).Footnote 1 We remark that in practice, it may be preferable to avoid statistical security/correctness; with present knowledge this requires O(logτ) elements to be exchanged, e.g., by adapting the protocol of Nishide and Ohta, [37]. Despite an overall worse asymptotic behavior of O(nlogτ), avoiding the factor of k improves efficiency for “small” τ.

6 Hiding the Pattern Length

Here Alice is not required to know the length m of Bob’s pattern, only an upper bound M≥m. Moreover, she will not learn any information about m. More formally, the parties wish to compute the functionality \({\mathcal{F}_{\mathrm{PM}\mbox{-}\mathrm{hpl}}}\) defined by

$$\bigl((p,n),(t,M)\bigr)\mapsto \left\{ \begin{array}{l@{\quad }l} (\{j \mid\bar{t}_j=p\}_{j=1}^{n-m+1},\lambda) & \mbox{if $|p|\leq M$ and $|t|=n$,} \\ (\lambda,\lambda) & \mbox{otherwise,} \\ \end{array} \right. $$

where \(\bar{t}_{j}\) is the substring of length m that begins at the jth position in t. A protocol π PM-hpl that realizes \(\mathcal{F}_{\mathrm{PM}\mbox{-}\mathrm{hpl}}\) can be obtained through minor alterations of π PM-⋆. The main idea is to have Bob construct a pattern p′ of length M by padding p with M−m wildcards. Though not completely correct, intuitively, executing π PM-⋆ on input ((p′,n),(t,M)) provides the desired result, as the wildcards ensure that the irrelevant postfixes of the \(\bar{t}_{j}\) are “ignored.” There are two reasons why this does not suffice. Firstly, the wildcards of \(\pi_{{\mathrm{PM}\mbox{-}\star}}\) mean “match any character”; however, matches must also be found when the wildcards occur after the end of the text (where there are no characters). Secondly, a malicious Bob must not have full control over wildcard usage, i.e., he must not be able to place wildcards arbitrarily; they must occur only at the end of p′. To eliminate these issues, Alice’s text must be extended, while Bob must demonstrate that his wildcards are correctly placed. In detail, our construction is the following.

Protocol π PM-hpl

  • Inputs: The input of Alice is a binary string t of length n and an integer M, whereas the input of Bob is a string p over the alphabet {0,1} of length m≤M and an integer n. The parties share a security parameter \(1^{\kappa}\) as well.

  • The protocol:

    1.

      Alice and Bob run protocol \(\pi_{\mathrm{KeyGen}}(1^{\kappa},1^{\kappa})\) to generate a public key \(pk=\langle {{\mathbb{G}}_{q}} ,q,g,h\rangle\), and the respective shares s A and s B of the secret key sk.

    2.

      Bob constructs a pattern p′ of length M by padding p with M−m zeros. He then sends encryptions \(P'_{i} = E_{pk}(p'_{i};r_{p'_{i}})\) for i=1,…,M to Alice, and for each one they execute π isBit. Finally, both parties compute an encryption of Bob’s “pattern” in ternary,

      $$ P' \gets\prod_{i=1}^M P_i^{\prime\,3^{i-1}}. $$
    3.

      For each position i=1,…,M of p′, Bob computes a bit denoting if this position is padding

      $$ w_i \gets \left\{ \begin{array}{l@{\quad }l} 0&\text{if\ }i>m, \\ 1&\text{otherwise}. \end{array} \right.$$

      He encrypts these and sends the result to Alice,

      $$W_i\gets{{E_{pk}} (w_i; {r_{w_i}} )}, $$

      and the two run π isBit for each one.

    4.

      For each i=1,…,M, Bob and Alice run π isBit on \(W_{i}/P'_{i}\). This demonstrates to Alice that if \(p'_{i}\) is set, then so is w i , i.e., that if Bob claims some position is padding (w i =0) then the associated \(p'_{i}\) is also 0.

    5.

      For each i=1,…,M−1, Bob and Alice run π isBit on W i /W i+1. This demonstrates to Alice that a 1 never follows a 0 in the w i , i.e., that w 1,…,w M is monotonically non-increasing. Hence zeros (signifying padding) occur at the end.

    6.

      Alice supplies her input as in Step 3 of Protocol π PM in Sect. 3. She sends encryptions, \(T_{j} = E_{pk}(t_{j};r_{t_{j}})\), j=1,…,n, of the bits of t to Bob. Then the parties run π isBit for each of the encryptions.

    7.

      Alice and Bob pad Alice’s encrypted text with M−1 default encryptions of 2, \(T_{j}=\langle1,g^{2}\rangle\) for j∈{n+1,n+2,…,n+M−1}.

    8.

      For every M-bit substring of the padded t starting at position j=1,…,n, Bob computes an encryption

      $$\bar{T}_j'\gets \Biggl(\prod _{i=1}^M \bigl( (T_{j+i-1} )^{w_i} \bigr)^{3^{i-1}} \Biggr)\cdot E_{pk} (0;r_{j} ). $$

      He sends these to Alice, and they run π ⋆-proof on the tuple consisting of the encryptions of Alice’s padded input and Bob’s w i , as well as the \(\bar{T}_{j}'\). This allows Alice to verify that Bob correctly computed encryptions of her substrings in ternary with her input replaced by 0 at Bob’s padding positions.Footnote 2

    9.

      The protocol concludes as above: For each of the \(\bar{T}_{j}'\) where j=1,…,n, the parties compute

      $$\Delta_j \gets\bar{T}_j' \cdot{P}^{\prime\,-1}, $$

      and run π Dec0. This reveals to Bob which of the plaintexts δ j are 0. For each δ j =0 he concludes that the pattern matched and outputs j.

Correctness is straightforward: Alice pads her text with the character 2, which will match Bob’s padding but not his binary pattern. This explains the need for ternary representation rather than binary representation in Steps 2 and 8. Specifically, any character, including 2, will match the padding characters of p′ since it is replaced by 0 in the computation of the encrypted substring \(\bar{T}_{j}'\), in Step 8. Thus, if Bob behaves honestly and supplies a correct input, then the matches are correctly output.

Moreover, due to the use of zero-knowledge proofs, malicious parties cannot deviate, i.e., they are forced to behave as an honest party would. In particular, Alice verifies that Bob’s “padding vector” w 1,…,w M , is not malformed in Steps 4 and 5. All padding of p′ is 0 and padding is added only at the end of p, such that the 1 character never follows the 0 character in the padding portion. Finally, Bob cannot use non-binary inputs due to the execution of π isBit. Hence a malicious Bob is reduced to supplying an input that an honest Bob could supply implying that the correct matches are found. The security argument for a malicious Alice follows similarly.
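The plaintext arithmetic behind Steps 2, 3, 7, 8 and 9 can be sketched in Python as follows. This is only an illustration of what the ciphertexts would contain: encryption, the zero-knowledge proofs and the masking in π Dec0 are left out, plain integers replace arithmetic modulo q (which presumes that 3 to the power M is smaller than q), and the function name `hpl_matches` is hypothetical.

```python
def hpl_matches(t, p, M):
    """Plaintext view of pi_PM-hpl: 1-based positions where p occurs in t."""
    n, m = len(t), len(p)
    assert m <= M
    p_pad = p + [0] * (M - m)        # Step 2: pad the pattern with zeros
    w = [1] * m + [0] * (M - m)      # Step 3: w_i = 0 marks padding
    t_pad = t + [2] * (M - 1)        # Step 7: pad the text with the symbol 2
    P = sum(p_pad[i] * 3**i for i in range(M))   # ternary encoding of p'
    out = []
    for j in range(n):               # Step 8: positions j = 1, ..., n
        # text symbols at Bob's padding positions are replaced by 0
        T_bar = sum(w[i] * t_pad[j + i] * 3**i for i in range(M))
        if T_bar - P == 0:           # Step 9: delta_j = 0 signals a match
            out.append(j + 1)
    return out
```

A window that runs past the end of t picks up a symbol 2 at a position where w i =1, so its ternary encoding cannot equal that of the binary pattern. This is why no spurious matches appear for j>n−m+1.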

The communication complexity of π PM-hpl is O(n+M), whereas the computation is O(nM) multiplications due to the computation of Step 8. The analysis is analogous to the one for \(\pi_{{\mathrm{PM}\mbox{-}\star}}\); the main differences are Bob’s demonstration that the padding occurs at the end, Step 5, and the extension of Alice’s text to one of length n+M−1, Step 7, which clearly is linear in n+M. We conclude with the following theorem,

Theorem 9

(Pattern length hiding)

Assume that the DDH assumption holds in \({{\mathbb{G}}_{q}}\), then π PM-hpl securely computes \({\mathcal {F}_{{\mathrm{PM}\mbox{-}\mathrm{hpl}}}}\) in the presence of malicious adversaries.

Adding a Lower Bound on m

Allowing Bob to input arbitrary patterns of length at most M may not be acceptable. In particular using a single-bit pattern in \({\pi_{{\mathrm{PM}\mbox{-}\mathrm{hpl}}}}\) reveals all of Alice’s text, and if an honest Bob is allowed this action, then so is a malicious one. This “attack” can be prevented by adding a lower bound, μ, on Bob’s pattern length. This can be enforced by setting W 1,…,W μ to default encryptions of 1, 〈1,g〉, in the above protocol.
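The conditions that Alice ends up verifying about Bob’s input can be summarized on plaintexts. The Python sketch below states them as simple predicates; in the protocol they are established by π isBit on ciphertext quotients and by fixing the first μ ciphertexts W i, not by inspecting plaintexts. The function name `padding_vector_ok` and the parameter `mu` are hypothetical.

```python
def padding_vector_ok(p_pad, w, mu=0):
    """Plaintext counterpart of the checks on Bob's padded pattern p' and vector w."""
    # Steps 2 and 3: every p'_i and w_i is a bit
    bits = all(x in (0, 1) for x in list(p_pad) + list(w))
    # Step 4: w_i - p'_i is a bit, so p'_i = 0 wherever w_i = 0 (padding)
    pad_is_zero = all(wi - pi in (0, 1) for wi, pi in zip(w, p_pad))
    # Step 5: w_i - w_{i+1} is a bit, so a 1 never follows a 0
    non_increasing = all(a - b in (0, 1) for a, b in zip(w, w[1:]))
    # Lower bound: the first mu positions are never padding
    long_enough = all(wi == 1 for wi in w[:mu])
    return bits and pad_is_zero and non_increasing and long_enough
```

With `mu=2`, for example, a single-bit pattern such as `p_pad=[1, 0, 0, 0]` with `w=[1, 0, 0, 0]` is rejected, while it is accepted when no lower bound is imposed.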

7 Hiding the Text Length

The final variant does not require Bob to know the actual text length n, only an upper bound N≥n. Moreover, he learns no information about n other than what can be inferred from the output. This property is desirable in applications where it is crucial to hide the size of the database as it gives away sensitive information. More formally, the parties wish to compute the functionality \(\mathcal{F}_{{\mathrm{PM}\mbox{-}\mathrm{htl}}} \),

$$\bigl((p,N),(t,m)\bigr)\mapsto \left\{ \begin{array} {l @{\quad }l} (\{j \mid\bar{t}_j=p\}_{j=1}^{n-m+1},\lambda) & \mbox{if $|p|= m$ and $|t|\leq N$,} \\ (\lambda,\lambda) & \mbox{otherwise,} \\ \end{array} \right. $$

where \(\bar{t}_{j}\) is the substring of length m that begins at the jth position in t.

The core idea of the solution is to extend the alphabet with an additional character and have Alice pad her text with N−n occurrences of this character. Overall, the protocol is similar to π PM; moreover, Alice is forced to behave honestly using a similar construction to the one ensuring Bob’s honesty in \(\pi_{{\mathrm{PM}\mbox{-}\mathrm{hpl}}}\) above. The whole construction is as follows:

Protocol π PM-htl

  • Inputs: The input of Alice is a binary string t of length n and an integer m, whereas the input of Bob is a binary string p of length m and an integer N. The parties share a security parameter \(1^{\kappa}\) as well.

  • The protocol:

    1.

      Alice and Bob run protocol \(\pi_{\mathrm{KeyGen}}(1^{\kappa},1^{\kappa})\) to generate a public key \(pk=\langle {{\mathbb{G}}_{q}} ,q,g,h\rangle\), and the respective shares s A and s B of the secret key sk.

    2.

      As in the basic solution, Bob sends encryptions \(P_{i} = E_{pk}(p_{i};r_{p_{i}})\), i=1,…,m, of his m-bit pattern, p, to Alice. Further, for each encryption the parties execute π isBit, allowing Alice to verify that Bob has provided a bit-string of length m. Both parties then compute an encryption of Bob’s pattern,

      $$ P \gets\prod_{i=1}^m P_i^{3^{i-1}}. $$

      Note that contrary to the basic solution, the binary pattern is encoded in ternary to allow an additional symbol, 2.

    3.

      Initially Alice pads her text with 1’s; we denote the padded text t′. She then sends encryptions, \(T'_{j} = E_{pk}(t'_{j};r_{t'_{j}})\), j=1,…,N, of the bits of this N-bit input, to Bob. Further, for each of the N encryptions, the parties execute π isBit, allowing Bob to verify that Alice has indeed provided the encryption of a known N-bit string.

    4.

      Then, for j=1,…,N Alice computes

      $$ d_j \gets \left\{ \begin{array}{l@{\quad }l} 1&\text{if}\ j>n , \\ 0&\text{otherwise.} \end{array} \right. $$

      These bits represent Alice’s padding, and encryptions of them, \(D_{j} = E_{pk}({d}_{j};r_{{d}_{j}})\), j=1,…,N, are then sent to Bob. Alice then proves that they indeed contain bits by running π isBit, and she further demonstrates that d 1,…,d N is monotonically non-decreasing. Similarly to Bob’s proof in Step 5 of \(\pi_{{\mathrm{PM}\mbox{-}\mathrm{hpl}}}\), running π isBit on D j+1/D j demonstrates that all padding occurs at the end of t′.

    5.

      Next, Alice and Bob run π isBit on \(T'_{j}/D_{j}\) for j=1,…,N. This demonstrates to Bob that whenever d j is set, then so is \(t'_{j}\), hence Alice’s padding contains only 1’s.

    6.

      For every m-bit substring of the padded text t′, starting at position j=1,…,N−m+1, both parties compute an encryption of that string with any padding replaced by 2’s:

      $$ \bar{T}_j' \gets\prod _{i=j}^{j+m-1} \bigl(T'_i \cdot D_i \bigr)^{3^{i-j}}. $$
    7.

      As in Step 5 of π PM, for every \(\bar{T}_{j}'\), j=1,…,N−m+1, the parties compute

      $$ \Delta_j \gets\bar{T}_j' \cdot P^{-1}. $$
    8.

      For every j=1,…,N−m+1 Alice and Bob run π Dec0 on Δ j ; Bob outputs j iff δ j =0.

Correctness of π PM-htl is easily verified. The honest Alice sets the N−n rightmost d j and \(t'_{j}\) to 1. Therefore, the \(\bar{T}_{j}'\) computed in Step 6 consists of an m-character substring of t′ in ternary, where any 1’s from padding have been replaced by 1+1=2. Bob’s pattern is similarly computed in ternary, implying that Δ j contains 0 iff the pattern matches.
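This correctness argument can be replayed on plaintexts with a short Python sketch. As before it only mirrors the values inside the ciphertexts, under the assumptions that encryption and all proofs are omitted and that integers stand in for arithmetic modulo q; the function name `htl_matches` is hypothetical.

```python
def htl_matches(t, p, N):
    """Plaintext view of pi_PM-htl: 1-based positions where p occurs in t."""
    n, m = len(t), len(p)
    assert n <= N
    t_pad = t + [1] * (N - n)        # Step 3: Alice pads her text with 1's
    d = [0] * n + [1] * (N - n)      # Step 4: d_j = 1 marks padding
    P = sum(p[i] * 3**i for i in range(m))       # Step 2: pattern in ternary
    out = []
    for j in range(N - m + 1):       # Step 6: windows j = 1, ..., N-m+1
        # t'_i + d_i turns every padding symbol into 1 + 1 = 2
        T_bar = sum((t_pad[j + i] + d[j + i]) * 3**i for i in range(m))
        if T_bar - P == 0:           # Steps 7 and 8: delta_j = 0 signals a match
            out.append(j + 1)
    return out
```

Every window that overlaps the padding contains at least one symbol 2 and therefore never equals the ternary encoding of a binary pattern, so the output depends on t alone and not on N.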

Regarding security, Bob’s behavior is essentially the same as in π PM; hence the proof of security is analogous. Regarding Alice, note that even if she is malicious, she is forced to provide a well-formed text and denotation of padding due to the zero-knowledge proofs of knowledge. In Step 4 she demonstrates that the d=d 1,…,d N consists of a string of 0’s followed by a string of 1’s. (This is equivalent to saying that all padding occurs at the end.) Then in Step 5 she demonstrates that she indeed padded t with 1’s. In other words, an honest Alice could have supplied the same input. Formally, simulating the view is analogous to the basic case.

Complexity is similar to the basic protocol and only O(N+m) encryptions change hands, hence only this many zero-knowledge proofs of knowledge are needed as well. Analogously to the computation of the \(\bar{T}_{j}\) in π PM, computing \(\bar{T}_{j}'\) in Step 6 naïvely requires O(Nm) multiplications. Again, it is possible to reduce this to linear at the cost of increasing the number of exponentiations by a constant factor. Thus, both communication and computation complexities are linear while the required number of rounds is constant.
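The reduction to a linear number of multiplications is not spelled out here. One plausible way to obtain it, which is an assumption of this sketch and not necessarily the authors’ method, is a sliding-window update: writing C i for \(T'_{i}\cdot D_{i}\), consecutive windows satisfy \(\bar{T}_{j+1}' = (\bar{T}_{j}'/C_{j})^{3^{-1}}\cdot C_{j+m}^{3^{m-1}}\), where the inverse of 3 is taken modulo the group order q. The Python sketch below checks the corresponding recurrence on plaintexts modulo a prime q; the function name and parameters are hypothetical, and the input is assumed to contain at least m symbols.

```python
def window_encodings(c, m, q):
    """Ternary encodings of all length-m windows of c, modulo a prime q > 3.

    Uses x_{j+1} = (x_j - c_j) * 3^{-1} + c_{j+m} * 3^{m-1} (mod q), i.e. a
    constant number of operations per window after the first one.
    """
    inv3 = pow(3, q - 2, q)          # 3^{-1} mod q via Fermat's little theorem
    top = pow(3, m - 1, q)           # weight of the symbol entering the window
    x = sum(c[i] * pow(3, i, q) for i in range(m)) % q
    out = [x]
    for j in range(len(c) - m):
        x = ((x - c[j]) * inv3 + c[j + m] * top) % q
        out.append(x)
    return out
```

Under encryption each update would cost one division, one multiplication and two exponentiations per window, which is consistent with the constant-factor increase in exponentiations mentioned above.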

Theorem 10

(Text length hiding)

Assume that the DDH assumption holds in \({{\mathbb{G}}_{q}}\), then π PM-htl securely computes \({\mathcal{F}_{{\mathrm{PM}\mbox{-}\mathrm{htl}}}}\) in the presence of malicious adversaries.