Private Set Intersection Based on Lightweight Oblivious Key-Value Storage Structure

Abstract: The application of Private Set Intersection (PSI) protocols is essential for smart homes. Oblivious Key-Value Stores (OKVS) can be used to design efficient PSI protocols, and constructing an OKVS with a cuckoo hashing graph is a common approach. Increasing the number of hash functions reduces the likelihood that collisions form cycles. However, existing OKVS construction schemes incur a high time overhead, and such an OKVS applied to PSI protocols also incurs a high communication overhead. In this paper, we propose a method called 3-Hash Garbled Cuckoo Graph (3H-GCG) for constructing cuckoo hash graphs. Specifically, this method handles hash collisions between different keys more efficiently than existing methods, and it can also be used to construct an OKVS structure with less storage space. Based on 3H-GCG, we design a PSI protocol using the Vector Oblivious Linear Evaluation (VOLE) and OKVS paradigm, which achieves semi-honest security and malicious security. Extensive experiments demonstrate the effectiveness of our method. When the set size is $2^{18}$–$2^{20}$, our PSI protocol is less computationally intensive than other existing protocols. The experiments also show an increase in the ratio of raw to constructed data of about 7.5%. In the semi-honest security setting, our protocol achieves the fastest runtime at a set size of $2^{18}$. In the malicious security setting, our protocol achieves about a 10% improvement in communication compared with other existing protocols.


Introduction
With the advent of the Big Data era, user data are generated, collected, and consumed in different locations [1]. More potential value can be obtained by integrating and analyzing data scattered in various places. However, the integration of data may lead to leakage of private information and compromise the privacy of users [2]. The application of Private Set Intersection (PSI) protocols has effectively improved this situation [3][4][5][6][7][8][9]. Specifically, the adoption of Oblivious Key-Value Stores (OKVS) has gained significant traction in constructing PSI protocols. This paper aims to tackle the existing challenges related to the high computational complexity and storage demands of OKVS construction. We introduce a novel construction method for OKVS based on a cuckoo hash graph. This approach addresses the high computational complexity and high communication overhead that the OKVS construction process introduces into PSI protocol applications. The method has the potential to enhance privacy and efficiency, particularly in scenarios like federated learning [10]. Moreover, we are committed to reducing the computational and communication complexity of the protocol itself.
By executing the PSI protocol, participants obtain the intersection of their sets without disclosing the non-intersection elements they own. Generic

• We propose a new OKVS method, 3H-GCG, whose decoding algorithm consists of two arithmetic formulas, depends less on auxiliary positions, and can be easily adapted to the PSI protocol.

• We leverage the newly proposed 3H-GCG to design a two-party PSI protocol. Our approach incorporates the OKVS and VOLE paradigm [15,16], providing resilience against malicious adversaries. We offer security proofs for our two-party PSI protocol in the semi-honest and malicious security settings, respectively.

• We implemented the proposed 3H-GCG and two-party PSI protocols and compared them with existing PSI protocols. When the set size is $2^{18}$–$2^{20}$, our PSI protocol is less computationally intensive than other existing protocols. The experiments show that for our OKVS, 3H-GCG, the ratio of raw data to constructed data is improved by about 7.5% compared to other existing OKVSs. In the semi-honest security setting, our PSI protocol achieves the fastest runtime at a set size of $2^{18}$. In the malicious security setting, our PSI protocol achieves about a 10% improvement in communication compared to existing protocols.
The remainder of this paper is organized as follows. In Section 2, we present the notation, the security model, the preliminaries, and the related work on PSI. In Section 3, we introduce our OKVS, 3H-GCG, and perform a parametric analysis. In Section 4, we propose the OPRF protocol based on 3H-GCG and further construct the PSI protocol, which can defend against malicious adversaries. In Section 5, we give the correctness analysis and security proof of the protocol. In Section 6, we present the details of our PSI implementation and the performance comparison with existing PSI protocols. In Section 7, we conclude the paper.

Preliminaries and Background
This section provides background information. In Section 2.1, we describe the notation used in this paper. Section 2.2 covers key definitions and properties of OKVS. In Section 2.3, we outline the principles of OKVS implementation using cuckoo hashing. Section 2.4 introduces standard model definitions for semi-honest and malicious security. Lastly, Section 2.5 discusses related work in PSI research.

Notation
We use $X$ to denote the sender's set and $Y$ to denote the receiver's set. $[x, y]$ denotes the set $\{x, x + 1, \ldots, y\}$, and $[a]$ denotes the set $[1, a]$. $p$ denotes the row vector $(p_1, p_2, \ldots, p_n)$. $\langle \cdot, \cdot \rangle$ denotes the inner product operation on vectors; i.e., $\langle a, b \rangle$ denotes the inner product of $a$ and $b$. Assignment is written $:=$, and we use $=$ to denote the statement that two values are equal.

OKVS Structure Definition and Property
Definition 1. A key-value store (KVS) is characterized by a set of keys $K$, a set of values $V$, and a set of hash functions $H$. It comprises two algorithms:

• $\mathrm{Encode}_H$: The input is a set of key-value pairs $(k_i, v_i)$ and the output is an object $S$. In rare cases, an error indicator $\perp$ may be output instead.

• $\mathrm{Decode}_H$: The input is an object $S$ and a key $k$, and the output is a value $v$.

A KVS is considered correct if, for all subsets $A \subseteq K \times V$ with distinct keys, the following holds:

$$(k, v) \in A \ \text{and} \ \perp \neq S \leftarrow \mathrm{Encode}_H(A) \implies \mathrm{Decode}_H(S, k) = v.$$

Whether $\mathrm{Encode}_H$ outputs $\perp$ is determined by the functions $H$ and the keys $k_i$, and is not influenced by the values $v_i$. If the data are encoded as a polynomial, $\mathrm{Encode}_H$ always succeeds. It is possible to invoke $\mathrm{Decode}_H(S, k)$ on any key $k$. The goal is to make it impossible to determine whether $k$ was used to generate $S$ or not. This is further explained in the following definition [18].

Definition 2. A KVS is considered an oblivious KVS (OKVS) if, for all distinct sets of keys $\{k_1^0, k_2^0, \ldots, k_n^0\}$ and $\{k_1^1, k_2^1, \ldots, k_n^1\}$ for which $\mathrm{Encode}_H$ does not output $\perp$, the encodings $\mathrm{Encode}_H(\{(k_i^0, v_i)\}_{i \in [n]})$ and $\mathrm{Encode}_H(\{(k_i^1, v_i)\}_{i \in [n]})$, with each $v_i$ sampled uniformly from $V$, are computationally indistinguishable. In other words, if an OKVS encodes random values, it is infeasible to distinguish an OKVS encoding of the keys of $K^0$ from an OKVS encoding of the keys of $K^1$ for any two sets of keys $K^0$ and $K^1$. In fact, if the values encoded in the OKVS are random, then the two distributions are perfectly indistinguishable.

Security of OKVS:
An OKVS is composed of an encoding and a decoding algorithm. Encode takes as input a set of key-value pairs $(k_i, v_i)$ and returns a data structure $S$. Decode takes as input such a data structure $S$ and a key $k$, and outputs a result. Decode can be called on any key, but if it is called on one of the keys $k_i$ used to generate $S$, then the result is the corresponding $v_i$. The most fundamental property of an OKVS is that when the $v_i$ are random, $S$ hides the $k_i$, which reduces the probability that the original keys are leaked and thereby increases security.

OKVS Construction Based on Cuckoo Hash Graph
This section outlines the construction of the cuckoo hash graph proposed in Section 3.2 of [17]. The implementation details of the original encoding and decoding are described below.

$\mathrm{Encode}_H((x_1, y_1), \ldots, (x_n, y_n))$: Given $n$ items $(x_i, y_i)$, where $x_i \in \{0, 1\}^*$ and $y_i \in \{0, 1\}^l$, let $M$ be the $n \times m$ matrix whose $i$th row is $v(x_i)$. A data structure (matrix) $D = (d_1, \ldots, d_m)^T \in \{0, 1\}^{m \times l}$ is solved for such that $M \times D = (y_1, \ldots, y_n)^T$. In other words, the following linear system of equations (over the field of order $2^l$) is satisfied:

$$\begin{pmatrix} v(x_1) \\ \vdots \\ v(x_n) \end{pmatrix} \times (d_1, \ldots, d_m)^T = (y_1, \ldots, y_n)^T.$$

When the $v(x_i)$ are linearly independent, a solution to this system of equations must exist.

$\mathrm{Decode}_H(D, x)$: Given a data structure $D \in (\{0, 1\}^l)^m$ and a key $x \in \{0, 1\}^*$, the corresponding value is retrieved as follows:

$$\mathrm{Decode}_H(D, x) = \langle v(x), D \rangle = \bigoplus_{j \,:\, v(x)_j = 1} d_j.$$

In other words, decoding a key $x$ in $D$ is equivalent to computing the XOR of specific positions in $D$. The choice of positions is determined by $v(x)$ and depends only on $x$, not on the data structure $D$.

For the construction of the data structure $D$, the paper [17] focuses on the analysis of instantiation with cuckoo hash graphs. The vertices of a cuckoo hash graph are denoted $1, \ldots, m$, corresponding to the positions of the elements in the data structure $D$. The edges of the cuckoo hash graph are undirected and correspond to the $x_i$ values to be inserted. With $h_1$ and $h_2$ denoting the two hash functions of the cuckoo hashing scheme, the correspondence between the $x_i$ values and the edges is $x_i \mapsto \{h_1(x_i), h_2(x_i)\}$. It follows that the cuckoo hash graph may contain self-loops and cycles.
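The linear-system view of Encode and Decode can be illustrated with a small sketch. The Python code below is not the construction of [17]: it assumes a hypothetical row function `row` that derives a dense $m$-bit vector $v(x)$ from SHAKE-256 instead of from cuckoo hash positions, solves $M \times D = (y_1, \ldots, y_n)^T$ by Gaussian elimination over GF(2), and decodes by XOR-ing the selected positions of $D$. The names `row`, `xor_selected`, `encode`, and `decode` are illustrative assumptions.

```python
import hashlib
import secrets


def row(x: bytes, m: int) -> int:
    """Hypothetical v(x): an m-bit row stored as an int bitmask (bit j = column j)."""
    digest = hashlib.shake_256(x).digest((m + 7) // 8)
    return int.from_bytes(digest, "big") & ((1 << m) - 1)


def xor_selected(mask: int, D: list) -> int:
    """Inner product of mask and D over GF(2): XOR of D[j] for every set bit j."""
    acc, j = 0, 0
    while mask:
        if mask & 1:
            acc ^= D[j]
        mask >>= 1
        j += 1
    return acc


def encode(pairs, m: int, l: int):
    """Solve M x D = y over GF(2); return D, or None if the system is inconsistent."""
    pivots = {}  # leading column -> (reduced row, reduced value)
    for x, y in pairs:
        r, v = row(x, m), y
        while r:
            p = r.bit_length() - 1      # leading column of the current row
            if p not in pivots:
                pivots[p] = (r, v)
                break
            pr, pv = pivots[p]
            r, v = r ^ pr, v ^ pv       # eliminate the leading column
        else:
            if v != 0:                  # row reduced to zero but value did not
                return None
    D = [secrets.randbits(l) for _ in range(m)]  # free columns stay random
    for p in sorted(pivots):            # ascending, so lower columns are final
        r, v = pivots[p]
        D[p] = v ^ xor_selected(r ^ (1 << p), D)
    return D


def decode(D: list, x: bytes) -> int:
    """Decode(D, x): XOR of the positions of D selected by v(x)."""
    return xor_selected(row(x, len(D)), D)
```

For example, `encode([(b"a", 5), (b"b", 9)], m=48, l=16)` returns a list `D` for which `decode(D, b"a")` recovers 5. Dense random rows make linear dependence unlikely when $m$ is comfortably larger than $n$; the cuckoo-graph instantiation instead uses sparse rows so that encoding is faster than general Gaussian elimination.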

Security Model
In the security proof of the proposed method, we follow the standard security definitions for secure two-party computation. The proof follows the ideal/real paradigm. In the ideal world, the participants of the PSI protocol truthfully provide the data they are supposed to provide, and strictly following the requirements of the protocol leads to the true intersection result. Specifically, the security proof is divided into a semi-honest security proof and a malicious security proof, with the following requirements.

Semi-honest security model: Protocol $\Pi$ is secure in the semi-honest model if there exist probabilistic polynomial-time simulators $S_1$ and $S_2$ such that, for all inputs $X$ and $Y$, the output of $S_1$ is computationally indistinguishable from $\mathrm{view}^{\Pi}_1(X, Y)$, and the output of $S_2$ together with $f(X, Y)$ is computationally indistinguishable from $\mathrm{view}^{\Pi}_2(X, Y)$ together with $\mathrm{out}^{\Pi}(X, Y)$. Here, $\mathrm{view}^{\Pi}_1(X, Y)$ denotes the view of $P_1$ in protocol $\Pi$, $\mathrm{view}^{\Pi}_2(X, Y)$ denotes the view of $P_2$ in protocol $\Pi$, $\mathrm{out}^{\Pi}(X, Y)$ denotes the output of $P_2$ in protocol $\Pi$, and $f(X, Y)$ denotes the intersection result obtained by $P_2$ in the ideal world.

Malicious security model: Protocol $\Pi$ is secure in the malicious model if, for malicious participants $P_1$ and $P_2$ in the real model that can arbitrarily deviate from the protocol, there exist probabilistic polynomial-time simulators $S_1$ and $S_2$ in the ideal model such that, for all inputs $X$ and $Y$,

$$\mathrm{Real}^{\Pi}_{1}(X, Y) \overset{c}{\approx} \mathrm{Ideal}^{F}_{S_1}(X, Y), \qquad \mathrm{Real}^{\Pi}_{2}(X, Y) \overset{c}{\approx} \mathrm{Ideal}^{F}_{S_2}(X, Y).$$

Here, $\mathrm{Real}^{\Pi}_{1}(X, Y)$ denotes the view of $P_1$ when $P_1$ is a malicious participant in the real model, and $\mathrm{Ideal}^{F}_{S_1}(X, Y)$ denotes the view of $P_1$ when running protocol $\Pi$ in the ideal model. Likewise, $\mathrm{Real}^{\Pi}_{2}(X, Y)$ denotes the view of $P_2$ when $P_2$ is a malicious participant in the real model, and $\mathrm{Ideal}^{F}_{S_2}(X, Y)$ denotes the view of $P_2$ when running protocol $\Pi$ in the ideal model.

Related Work
The idea of polynomial encoding (PE) in secure multi-party computation and PSI can be traced back to the work of Manulis et al. [19]. They proposed the concept of index-hiding message encoding (IHME) to solve the privacy-preserving group discovery problem with linear computational and communication complexity, and gave a perfectly secure construction of IHME using PE. Kolesnikov et al. [20] implemented the OPRF protocol using PE, but PE requires a time complexity of $O(n^2)$, which is expensive for large $n$. Subsequently, Pinkas et al. [21] proposed a polynomial-based PPRF scheme based on the construction of [20] to amortize the cost by batching multiple OPRF calls together. Kolesnikov et al. [22] constructed the reverse private membership test (RPMT) protocol using polynomials and implemented a private set union (PSU) protocol using RPMT. Pinkas et al. [13] designed polynomial slicing and streaming using PE and used it to achieve a low-communication PSI. As this analysis shows, existing PE techniques suffer from high computational complexity in encoding and decoding.
To solve this problem, researchers proposed the concept of OKVS. The leading OKVS implementations are currently based on random matrices or cuckoo hashing. The development of OKVS can be traced back to Pinkas et al. [17], who proposed PaXoS and used it to build a fast, maliciously secure two-party PSI protocol. However, due to the random nature of the results produced by this scheme, there is a high risk of compromising the data of the participants. Therefore, Rindal et al. [15] designed a variant of PaXoS, XoPaXoS, which effectively solved the data leakage problem. PaXoS is much more computationally efficient than PE, but is still not general enough.
Based on the above gap, Garimella et al. [18] introduced the concept of OKVS using PaXoS as a starting point.According to the concept description, PE belongs to the most basic construct of OKVS, while PaXoS belongs to a specific, binary type of OKVS.A new OKVS structure, 3H-GCT, was designed along with the introduction of the OKVS concept [18].3H-GCT expands the number of hash functions of the cuckoo hash map from two to three on the basis of PaXoS, achieving an expansion from the particular to the general.However, the Gaussian elimination method applied by 3H-GCT in the encoding process still places a high demand on computational performance, and in this paper we endeavour to demonstrate protocol solutions with low communication and computation volumes.

Proposed OKVS
This section explains the encoding and decoding mechanisms of our OKVS. In Section 3.1, we present the encoding and decoding implementation details, and in Section 3.2, we discuss the parameter selection scheme along with the results for the encoding and decoding mechanisms.

OKVS Construction Based on Cuckoo Hash Graph
This section begins with the introduction of our proposed OKVS algorithm, which we have named 3H-GCG. The algorithm is shown in Figure 1; it stores $n$ key-value pairs in a cuckoo hash graph.

Parameters:
• The algorithm is parameterized with the hash functions $h_1$, $h_2$, and $h_3$, each with range $[m]$.
• In addition, the algorithm uses the functions $li(\cdot)$ and $ri(\cdot)$, where $li(x)$ outputs a bit-vector of length $m$ that is zero at all entries except entries $h_1(x)$, $h_2(x)$, and $h_3(x)$. The function $ri(x)$ outputs a random bit-vector of length $\kappa + 0.35\log(n)$ that is zero at all entries except for two random entries.

Encode (final step): Set any empty position in $l$ (and $r$) to a random value from $F_m$ (and $F_r$), and output $S = (l, r)$.

Decode:
1. Compute $li(k_i)$ and $ri(k_i)$, fetching parameters $S = (l, r)$.
2. Return $\langle (li(k_i), ri(k_i)), (l, r) \rangle$.

In this algorithm, the number of hash functions of the cuckoo hash graph is set to 3. The three hash functions are $h_1$, $h_2$, and $h_3$, and their mapping range is $[m]$. The nodes of the cuckoo hash graph are denoted $N_i$, where $i$ also ranges over $[m]$, in keeping with the mapping range of the hash functions. Further, two mapping functions $li(\cdot)$ and $ri(\cdot)$ are defined, and we place some restrictions on their outputs so that they better serve our OKVS structure. Specifically, the output of $li(\cdot)$ is a bit string of length $m$ in which all but three positions are zero, where the three positions of $li(x)$ are determined by $h_1(x)$, $h_2(x)$, and $h_3(x)$, respectively. The mapping function $ri(\cdot)$, on the other hand, outputs a bit string of length $\kappa + 0.35\log(n)$ in which only two random positions are 1 and all other positions are zero.
The encoding algorithm of 3H-GCG initializes two empty vectors $l$ and $r$ and an empty queue $Q$, as detailed in steps 1 and 2 of the encoding algorithm in Figure 1.

Step 3 performs classification. Since the mapping range $[m]$ of the hash functions is relatively small, the keys of different key-value pairs inevitably collide in some of their hash values. With the help of the queue $Q$, we divide the key-value pairs to be encoded into two categories: the key-value pairs for which all three hashes collide are placed in $Q$, and the rest are kept outside $Q$.

Step 4 secretly stores the key-value pairs from the queue $Q$ into $l$ and $r$. For $k_i$ in $Q$, if $\exists j \in [3]$ such that $N_{h_j(k_i)}$ is empty, then keep one empty position, fill the other positions with random values if they are empty, and finally set the value of the retained empty position according to the non-empty positions, such that $\langle li(k_i), l \rangle = v_i$. Otherwise, compute $ri(k_i)$ and use the random positions determined by $ri(k_i)$ to assist in storing $v_i$, such that $\langle (li(k_i), ri(k_i)), (l, r) \rangle = v_i$.

Step 5 secretly stores the key-value pairs outside the queue $Q$ into $l$ and $r$. For $k_i$ not in $Q$, if $\exists j \in [3]$ such that $N_{h_j(k_i)}$ is not empty, then set the value of the empty node based on the nodes already set, such that $\langle li(k_i), l \rangle = v_i$. Otherwise, take two empty positions and fill them with random values, and set the remaining position according to the positions already set, such that $\langle (li(k_i), ri(k_i)), (l, r) \rangle = v_i$.

Note in particular that during the execution of steps 4 and 5, the initially empty vectors $l$ and $r$ are continually updated as key-value pairs are stored. Once all the key-value pairs are stored, the encoding process is completed by filling the remaining positions with random values in step 6 to obtain $S := (l, r)$.
Unlike existing algorithms, the decoding algorithm of 3H-GCG consists of two algorithms, $\mathrm{Decode}_1$ and $\mathrm{Decode}_2$. A user who has no additional information when decoding needs to compute both. Of the two results obtained, at most one matches the expected result.

Our OKVS builds on the ideas of the cuckoo hash graph and the hypergraph construction. In classical cuckoo hashing, when an element cannot be stored, previously stored elements are evicted and reinserted in turn until every element has a free position. Decoding under that approach is about as efficient as under our proposed 3H-GCG, but repeatedly checking and re-storing elements during encoding increases the computational overhead. In 3H-GCG, therefore, the position of a stored element never changes: if a storage conflict is encountered, it is resolved directly by using the auxiliary positions. This considerably improves encoding efficiency.

In addition, this encoding and decoding approach is intended for special scenarios in which at least one of the participants has an expectation of the decoded result. If neither of the two decoded values matches the expected value, the queried key is not stored in this OKVS. If one of the two decoded values matches the expected value, the queried key is stored in this OKVS. Based on these applicable scenarios, we find that this OKVS encoding method can be applied to the PSI protocol. The specific way in which the PSI protocol is constructed is described in detail in Section 4.
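The no-eviction idea and the two-way decoding can be sketched in a few lines of Python. This is a simplified illustration and not the algorithm of Figure 1: it processes keys in input order and omits the queue-based classification of step 3, it assumes that $\mathrm{Decode}_1$ reads only $l$ while $\mathrm{Decode}_2$ also reads $r$, and it derives the hash positions from SHA-256. All function names, the value length `BITS`, and the generous toy sizes used in the usage note are assumptions made for the example.

```python
import hashlib
import secrets

BITS = 64  # assumed bit length of stored values


def _h(tag: str, key: bytes, mod: int) -> int:
    """Hypothetical hash into range(mod), derived from SHA-256."""
    digest = hashlib.sha256(tag.encode() + b"|" + key).digest()
    return int.from_bytes(digest, "big") % mod


def li(key: bytes, m: int) -> set:
    """Positions h1(key), h2(key), h3(key): the nonzero entries of li(key)."""
    return {_h(f"h{j}", key, m) for j in (1, 2, 3)}


def ri(key: bytes, r: int) -> set:
    """Two distinct auxiliary positions: the nonzero entries of ri(key)."""
    a = _h("r1", key, r)
    return {a, (a + 1 + _h("r2", key, r - 1)) % r}


def _xor(vec, positions) -> int:
    acc = 0
    for p in positions:
        acc ^= vec[p]
    return acc


def _store(vec, positions: set, target: int) -> bool:
    """Fill empty slots so that the XOR over `positions` equals `target`."""
    free = [p for p in positions if vec[p] is None]
    if not free:
        return False                      # conflict: every position is taken
    for p in free[1:]:
        vec[p] = secrets.randbits(BITS)   # random filler
    vec[free[0]] = target ^ _xor(vec, positions - {free[0]})
    return True


def encode(pairs, m: int, r: int):
    """Greedy encoding: stored positions are never moved (no eviction)."""
    l, aux = [None] * m, [None] * r
    for key, value in pairs:
        lp = li(key, m)
        if _store(l, lp, value):          # inner product of li(key) and l is value
            continue
        # All three positions are taken: resolve with the auxiliary positions.
        if not _store(aux, ri(key, r), value ^ _xor(l, lp)):
            return None                   # auxiliary positions exhausted
    fill = lambda vec: [secrets.randbits(BITS) if x is None else x for x in vec]
    return fill(l), fill(aux)


def decode1(S, key: bytes) -> int:
    l, _ = S
    return _xor(l, li(key, len(l)))


def decode2(S, key: bytes) -> int:
    l, aux = S
    return decode1(S, key) ^ _xor(aux, ri(key, len(aux)))


def contains(S, key: bytes, expected: int) -> bool:
    """Membership check: the expected value must match one of the two decodings."""
    return expected in (decode1(S, key), decode2(S, key))
```

With, say, 20 keys, `m=200`, and `r=64`, `encode` succeeds with high probability and `contains(S, k, v)` holds for every stored pair, while an absent key matches only with negligible probability. These toy sizes are deliberately roomier than the $m = 1.2n$ of Section 3.2, because this greedy variant lacks the classification step that the paper relies on.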

Parameter Analysis
In this work, the parameters to be analyzed are the output length $l_1$ of $H_1$, the output length $l_2$ of $H_2$, the number $m$ of cuckoo hash graph vertices, and the number $r$ of auxiliary positions. Of these, $l_1$ and $l_2$ are introduced when designing the PSI protocol in the next section, but we discuss their selection here first.

Choices of $l_1$, $l_2$: Since $H_1$ and $H_2$ control the collision probability of the PSI protocol, $l_1$ and $l_2$ can be set to $l_1 = l_2 = \lambda + \log(n_1 n_2)$ in the semi-honest security setting, where $n_1$ and $n_2$ denote the number of elements of the sender and receiver sets, respectively. In the malicious security setting, $l_1$ and $l_2$ are instead set in terms of the number of queries that the sender and receiver can make to the random oracles $H_1$ and $H_2$, respectively.

Choices of $m$, $r$: In the 3H-GCG mechanism, $S = (l, r)$, where $|l| = m$ and $|r| = r$. To ensure the success of encoding, we take $m = 1.2n$ and $r = \kappa + 0.35\log(n)$. In this scheme, encoding fails when the number of auxiliary positions is less than $\kappa + 0.35\log(n)$ at the time of encoding. Refer to [18] for specific calculations.
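As a quick numerical illustration of these choices, the following Python sketch evaluates the semi-honest formulas. It assumes base-2 logarithms, rounding up to whole bits and positions, and the default values $\lambda = 40$ and $\kappa = 128$; none of these is stated in the text above, and the function names are hypothetical.

```python
import math


def hash_output_bits(n1: int, n2: int, lam: int = 40) -> int:
    """l1 = l2 = lambda + log2(n1 * n2), rounded up (semi-honest setting)."""
    return lam + math.ceil(math.log2(n1 * n2))


def okvs_dimensions(n: int, kappa: int = 128) -> tuple:
    """Return (m, r) with m = ceil(1.2 n) and r = kappa + ceil(0.35 log2 n)."""
    m = -(-12 * n // 10)  # ceil(1.2 * n) in exact integer arithmetic
    # round() guards against floating-point noise before taking the ceiling
    r = kappa + math.ceil(round(0.35 * math.log2(n), 9))
    return m, r
```

Under these assumptions, two sets of $2^{20}$ elements give `hash_output_bits(2**20, 2**20) == 80` and `okvs_dimensions(2**20) == (1258292, 135)`.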

Private Set Intersection
In this section, we outline the PSI protocol scheme we have developed. Section 4.1 introduces the VOLE-based OPRF protocol, while Section 4.2 demonstrates how the OPRF protocol is used to design the VOLE-based PSI protocol.

OPRF-Based Vector-OLE
We begin by introducing a random VOLE functionality that accounts for either participant being flagged as malicious, so that the whole scheme is resistant to malicious adversaries; the functionality is illustrated in Figure 2.

Parameters:
• Two parties, a sender and a receiver.
• A finite field $F$.
• The size of the output vectors: $m$.

Functionality:
- No input from either sender or receiver.
- If the receiver is malicious, wait for it to send $C, A \in F^m$. Then sample $\Delta \leftarrow F$ and compute $B := C - \Delta A$.
- If the sender is malicious, wait for it to send $B \in F^m$ and $\Delta \in F$. Then sample $A \leftarrow F^m$ and compute $C := B + \Delta A$.
- If neither party is malicious, sample $A, B \leftarrow F^m$ and $\Delta \leftarrow F$, then compute $C := B + \Delta A$.
- The functionality sends $\Delta$ and $B$ to the sender, and sends $C = \Delta A + B$ and $A$ to the receiver.

The VOLE scheme involves both participants, operates over the finite field $F$, and fixes the dimension of all vectors to $m$. No input is required from either participant. If no malicious adversary is detected, the vectors $A$ and $B$ and the scalar $\Delta$ are sampled at random by the VOLE functionality, which then computes $C := B + \Delta A$. If a malicious adversary is present during the execution of Figure 2, there are two cases. If the receiver is malicious, the receiver generates its own $A$ and $C$ and sends them to the VOLE functionality; the functionality samples a random value $\Delta$, computes $B := C - \Delta A$, and returns $B$ and $\Delta$ to the sender.
If the sender is malicious, the sender generates the scalar $\Delta$ and the vector $B$ itself and sends both to the VOLE functionality. The functionality then samples $A$ at random, computes $C := B + \Delta A$, and sends $A$ and $C$ to the receiver.
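The three cases of the functionality can be mirrored in a minimal Python sketch. It models only the ideal functionality, not a real VOLE protocol, and it assumes that $F$ is the prime field modulo $2^{61} - 1$; the function names and the way a corrupted party's chosen values are passed in are illustrative assumptions.

```python
import secrets

P = 2**61 - 1  # assumed prime modulus for the field F


def rand_vec(m: int) -> list:
    return [secrets.randbelow(P) for _ in range(m)]


def vole(m: int, corrupt_receiver=None, corrupt_sender=None):
    """Ideal VOLE: returns ((delta, B), (C, A)) with C = B + delta * A.

    corrupt_receiver: optional (C, A) chosen by a malicious receiver.
    corrupt_sender:   optional (B, delta) chosen by a malicious sender.
    """
    if corrupt_receiver is not None:
        C, A = corrupt_receiver
        delta = secrets.randbelow(P)
        B = [(c - delta * a) % P for c, a in zip(C, A)]
    elif corrupt_sender is not None:
        B, delta = corrupt_sender
        A = rand_vec(m)
        C = [(b + delta * a) % P for b, a in zip(B, A)]
    else:
        A, B, delta = rand_vec(m), rand_vec(m), secrets.randbelow(P)
        C = [(b + delta * a) % P for b, a in zip(B, A)]
    return (delta, B), (C, A)  # sender's output, receiver's output
```

In every branch the outputs satisfy $C = B + \Delta A$ componentwise; the branches differ only in which party fixes its own share and which values the functionality samples.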
Based on the above random reverse VOLE functionality, which resists a malicious adversary, an OPRF protocol that resists a malicious adversary can be constructed. The OPRF execution process is shown in Figure 3.

Parameters:
• There exist two parties, a sender with set $X \subset F$ of size $n$, and a receiver with set $Y \subset F$ of size $n$.
• Computational and statistical security parameters $\kappa$ and $\lambda$.

1. Sender samples $w_s \leftarrow F$ and sends $s = H_3(w_s)$ to receiver.
2. Receiver computes $D$ with $Y$ and sends the randomly generated $w_r$ to sender. Note that the data structure of $D$ is consistent with the output data structure of 3H-GCG.
3. The parties invoke the VOLE functionality, and neither party has any input. The receiver obtains $C$ and $A'$, while the sender obtains $B$ and $\Delta$, so that $C = B + \Delta A'$. Note that $C$, $A'$, and $B$ are all OKVS structures of 3H-GCG, like $D$.
4. Receiver computes $A := A' + D$ and sends $A$ to the sender.
5. Sender sends $w_s$ to receiver, who aborts if $H_3(w_s) \neq s$. Both parties then compute $w = w_s + w_r$.
6. Sender computes $P = \Delta A + B$ and then outputs the sets $M_1$ and $M_2$ obtained by applying $\mathrm{Decode}_1$ and $\mathrm{Decode}_2$ of 3H-GCG to the elements of $X$.

The receiver receives $C$ and $A'$, and the sender receives $B$ and the scalar $\Delta$, where $C$, $A'$, and $B$, like $D$, are OKVS structures output by the 3H-GCG encoding algorithm. The receiver assigns $A' + D$ to $A$, which is then sent to the sender. The sender sends the $w_s$ generated in the first step to the receiver, who checks $w_s$ against the $s$ previously received. If $s$ is not equal to $H_3(w_s)$, the protocol aborts; if the two values are equal, both parties can calculate $w = w_s + w_r$. Finally, the sender computes $P = \Delta A + B$ and uses $P$ to compute the two decodings of 3H-GCG for the elements in the set $X$, and the results are collected into the sets $M_1$ and $M_2$, respectively. There are two sets because 3H-GCG has two decoding algorithms. The security of this OPRF protocol is demonstrated in the next subsection.

PSI Protocol Description
We can adapt the OPRF protocol discussed in the previous section to create a PSI protocol in the same security setting. This adaptation is illustrated in Figure 4. Notably, the PSI protocol in Figure 4 differs from the OPRF protocol in Figure 3 in just two key aspects.

Parameters:
• There exist two parties, a sender with set $X \subset F$ of size $n$, and a receiver with set $Y \subset F$ of size $n$.
• Computational and statistical security parameters $\kappa$ and $\lambda$.
• Random oracles $H_1$, $H_2$, and $H_3$.

1. Sender samples $w_s \leftarrow F$ and sends $s = H_3(w_s)$ to receiver.
2. Receiver computes $D = \mathrm{Encode}(\{(y, H_1(y)) \mid y \in Y\}) = (L_D, R_D)$ using the Encode algorithm of 3H-GCG and sends the randomly generated $w_r$ to sender.
3. The parties invoke the VOLE functionality, and neither party has any input. The receiver obtains $C$ and $A'$, while the sender obtains $B$ and $\Delta$, so that $C = B + \Delta A'$. Note that $C$, $A'$, and $B$ are all OKVS structures of 3H-GCG, the same as $D$.
4. Receiver computes $A := A' + D$ and sends $A$ to the sender.
5. Sender sends $w_s$ to receiver, who aborts if $H_3(w_s) \neq s$. Both parties then compute $w = w_s + w_r$.
6. Sender computes $P = \Delta A + B$ and then computes the sets $M_1$ and $M_2$ from the two 3H-GCG decodings of $P$ on the elements of $X$; the receiver uses them to output the intersection.

The first difference is that step 2 of the protocol specifies the calculation of $D$, which is derived by applying the encoding algorithm of 3H-GCG to the receiver's set. The second difference is the addition of the receiver's intersection output at the end of the protocol. There

Theoretical Analysis
This section analyzes the PSI protocol scheme of Section 4. In Section 5.1, we analyze the correctness of our PSI protocol. In Section 5.2, we analyze the security of our PSI protocol.

Correctness Analysis
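In the following, we give a correctness analysis of the PSI protocol in Section 4.2. For the intersection part $x = y$, we only need to prove that $\mathrm{Decode}(C, y) = \mathrm{Decode}(P, x) - \Delta H_1(x)$. Since $D$ encodes the pairs $(y, H_1(y))$ and decoding is linear, the derivation is as follows:

$$\begin{aligned} \mathrm{Decode}(P, x) - \Delta H_1(x) &= \mathrm{Decode}(\Delta A + B, x) - \Delta H_1(x) \\ &= \mathrm{Decode}(\Delta (A' + D) + B, x) - \Delta H_1(x) \\ &= \mathrm{Decode}(\Delta A' + B, x) + \Delta\, \mathrm{Decode}(D, x) - \Delta H_1(x) \\ &= \mathrm{Decode}(C, x) + \Delta H_1(x) - \Delta H_1(x) \\ &= \mathrm{Decode}(C, y). \end{aligned}$$

The Decode algorithm in this derivation stands for either of the decoding algorithms $\mathrm{Decode}_1$ and $\mathrm{Decode}_2$ of 3H-GCG in Section 3.1.
For non-intersection elements put into the $\mathrm{Decode}_1$ and $\mathrm{Decode}_2$ algorithms, the decoded value is a random value, and the receiver cannot match this random value with its own elements. Therefore, our proposed PSI protocol is correct.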
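The correctness identity can be checked numerically with a toy instance. The Python sketch below swaps in a polynomial OKVS (Lagrange interpolation over an assumed prime field modulo $2^{61} - 1$) in place of 3H-GCG, because polynomial decoding is linear and short to write; it uses a single Decode and omits the $H_2$ hashing and the value $w$. The helper names and the SHA-256-based `H1` are assumptions for illustration only.

```python
import hashlib
import secrets

PRIME = 2**61 - 1  # assumed prime modulus for the field F


def H1(x: int) -> int:
    """Hypothetical stand-in for the random oracle H1."""
    digest = hashlib.sha256(b"H1" + x.to_bytes(16, "big")).digest()
    return int.from_bytes(digest, "big") % PRIME


def encode(pairs) -> list:
    """Polynomial OKVS: coefficients (low to high) interpolating the pairs."""
    coeffs = [0] * len(pairs)
    for i, (xi, yi) in enumerate(pairs):
        basis, denom = [1], 1
        for j, (xj, _) in enumerate(pairs):
            if j != i:
                # multiply the basis polynomial by (z - xj)
                basis = [(a - xj * b) % PRIME
                         for a, b in zip([0] + basis, basis + [0])]
                denom = denom * (xi - xj) % PRIME
        scale = yi * pow(denom, -1, PRIME) % PRIME
        for k, b in enumerate(basis):
            coeffs[k] = (coeffs[k] + scale * b) % PRIME
    return coeffs


def decode(S: list, k: int) -> int:
    """Evaluate the polynomial at k (Horner's rule); linear in S."""
    acc = 0
    for c in reversed(S):
        acc = (acc * k + c) % PRIME
    return acc


def psi_demo(X: list, Y: list) -> list:
    D = encode([(y, H1(y)) for y in Y])               # receiver's encoding
    m = len(D)
    delta = secrets.randbelow(PRIME)                  # honest VOLE sample
    A1 = [secrets.randbelow(PRIME) for _ in range(m)]  # A'
    B = [secrets.randbelow(PRIME) for _ in range(m)]
    C = [(b + delta * a) % PRIME for a, b in zip(A1, B)]
    A = [(a + d) % PRIME for a, d in zip(A1, D)]      # A := A' + D
    P_vec = [(delta * a + b) % PRIME for a, b in zip(A, B)]
    # x is in the intersection iff Decode(P, x) - delta * H1(x) == Decode(C, x)
    return [x for x in X
            if (decode(P_vec, x) - delta * H1(x)) % PRIME == decode(C, x)]
```

Calling `psi_demo([5, 13, 21], [3, 5, 8, 13])` returns the common elements 5 and 13; for the non-member 21, the two sides agree only with negligible probability, which matches the argument for non-intersection elements above.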

Security Analysis
Our PSI protocol is resistant to brute force decryption attacks, statistical attacks, and man-in-the-middle attacks due to the masking of encoded data in the protocol. Since it is stated in the literature [18] that a protocol secure in a malicious environment is not necessarily secure in a semi-honest environment, our security analysis is divided into two main parts. The first part demonstrates that the OPRF protocol proposed in Section 4.1 is resistant to malicious adversaries. The second part proves that the PSI protocol we proposed in Section 4.2 is resistant to attacks by malicious adversaries. The security proof satisfies the semi-honest security model and the malicious security model introduced in Section 2.4.
Theorem 1. The PSI protocol proposed in Figure 4 is semi-honest secure in the VOLE-hybrid model, where $H_1$ and $H_2$ are random oracles.
Proof. We prove Theorem 1 by the following two lemmas.
Lemma 1. The PSI protocol proposed in Figure 4 resists a semi-honest sender in the VOLE-hybrid model, where $H_1$ and $H_2$ are random oracles.
Proof. We construct the simulator $S_1$ as follows. Given the receiver's set $Y$, $S_1$ runs the honest sender's protocol to generate its view, with the following difference: in place of the VOLE functionality, $S_1$ runs the VOLE simulator, which generates a sender-side view.
Finally, the simulator $S_1$ outputs the sender's view. We prove that this output is computationally indistinguishable from the sender's view in the real protocol by the following hybrid argument:
Hybrid 0: The sender's view and the receiver's output in the real protocol.
Hybrid 1: Same as Hybrid 0, except that instead of running the VOLE functionality, $S_1$ runs the VOLE simulator, which generates the structures $A'$, $B$, $C$, and the scalar $\Delta$, satisfying the equation $C = B + \Delta A'$. The VOLE simulator sends $B$ and $\Delta$ to the sender. This hybrid is identically distributed to Hybrid 0.
Hybrid 2: Same as Hybrid 1, except that $S_1$ selects $w_s$ based on specific conditions. This hybrid is statistically indistinguishable from Hybrid 1.
Hybrid 3: Same as Hybrid 2, except that the protocol aborts when there exist distinct $x_1, x_2 \in X \cup Y$ such that $H_1(x_1) = H_1(x_2)$. This termination probability is negligible because $H_1$ is collision resistant.
Hybrid 4: Same as Hybrid 3, except that the protocol aborts if there exist $x_1, x_2 \in X \cup Y$ with $x_1 \neq x_2$ such that one of Equations (7)–(10) is satisfied. The protocol terminates immediately when any of these conditions occurs. The termination probability is negligible because $H_2$ is collision resistant.
Hybrid 5: Same as Hybrid 4, except that the output of $P_2$ is replaced with $f(X, Y)$. The final output changes if and only if there exist $x \in X$ and $y \in Y$ with $x \neq y$ but $H_2(x, \mathrm{Decode}_1(P, x) - \Delta H_1(x) + w) = H_2(y, \mathrm{Decode}_1(C, y) + w)$. Since $H_2$ is collision resistant and $l_2$ is chosen large enough, the probability that the output changes is negligible.
Hybrid 6: Same as Hybrid 5, except that the protocol will no longer abort.The indistinguishability between Hybrid 5 and Hybrid 6 is based on the collision resistance of H 1 and H 2 and the conversion of the VOLE.Lemma 2. The PSI protocol proposed in Figure 4 can resist the semi-honest receiver under the model where H 1 and H 2 are random prediction machines and VOLE variants.
Proof.We construct the simulator S 2 as follows.Given the set of senders X, S 2 runs the honest receiver's protocol to generate its perspective, but with the following differences: S 2 randomly generates 3H-GCG coding structure D. For each element x ∈ I, x is decrypted according to the decoding method of 3H-GCG (Decode 1 , Decode 2 ) to obtain the two values M 1 (x) and M 2 (x).S 2 runs the VOLE simulator to generate the vectors C, A , B, and the scalar ∆, and sends vectors C and A to the receiver.We prove view Π 2 (X, Y) c ≈ S 2 (1 n , X, n 2 ) by the following mixed arguments: Hybrid 0: Receiver perspective under the real protocol.Hybrid 1: Same as Hybrid 0, except that the protocol aborts when there exists Hybrid 2: Same as Hybrid 1, except that when there exists x 1 , x 2 ∈ X ∪ R, R is the set consisting of random values, and x 1 = x 2 such that one of Equations ( 7)-( 10) is satisfied, then the protocol aborts .
Hybrid 3: Same as Hybrid 2, except that instead of running the VOLE mechanism, $S_2$ runs the VOLE simulator, which generates the vectors $A$, $B$, $C$ and the scalar $\Delta$ satisfying $C = B + \Delta A$. The VOLE simulator sends $A$ and $C$ to the receiver. This hybrid is identical to Hybrid 2.
Hybrid 4: Same as Hybrid 3, except that $S_2$ takes the intersection set $I$, generates $n_1 - |I|$ additional random values, and applies the 3H-GCG decoding algorithms $\mathrm{Decode}_1$ and $\mathrm{Decode}_2$. This yields two sets of decoded values, $M_1$ and $M_2$, which $S_2$ sends to the receiver. The randomly generated values themselves remain hidden from the receiver. Hybrid 4 is statistically indistinguishable from Hybrid 3, so the receiver cannot tell the two hybrids apart from the information available to it.
Hybrid 5: Same as Hybrid 4, except that the protocol no longer aborts. Hybrid 4 and Hybrid 5 are indistinguishable due to the collision resistance of $H_1$ and $H_2$ and the convertibility of the VOLE.
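The correlation $C = B + \Delta A$ that the VOLE simulator produces in Hybrid 3 can be illustrated with a small sketch. It is not the VOLE construction used in the paper: a trusted dealer samples everything locally, and the prime field modulus `P` and the function name `sample_vole` are assumptions made only for this example. Following the proof, the receiver is handed $(A, C)$ while $(B, \Delta)$ stays on the other side.

```python
import secrets

P = 2**61 - 1  # assumed prime field modulus (a Mersenne prime), illustration only


def sample_vole(n: int):
    """Dealer-style sampling of a length-n VOLE correlation C = B + delta * A."""
    delta = secrets.randbelow(P)                  # scalar known to one party only
    a = [secrets.randbelow(P) for _ in range(n)]  # uniform vector A
    b = [secrets.randbelow(P) for _ in range(n)]  # uniform mask vector B
    c = [(bi + delta * ai) % P for ai, bi in zip(a, b)]
    return (a, c), (b, delta)                     # (receiver share, sender share)


if __name__ == "__main__":
    (a, c), (b, delta) = sample_vole(8)
    # Every coordinate satisfies the VOLE relation.
    print(all(ci == (bi + delta * ai) % P for ai, bi, ci in zip(a, b, c)))
```

Because each $B_i$ is uniform and independent, each $C_i$ on its own is uniformly distributed, which is why handing the receiver simulated vectors in place of real ones does not change its view in this toy setting.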
Theorem 2. The PSI protocol proposed in Figure 4 achieves malicious security in the model where $H_1$ and $H_2$ are random oracles and the VOLE variant is available.
Proof. First assume that the sender is malicious and construct a simulator M. The interaction between the malicious sender and the simulator M is as follows:
1. Before the parties start executing the PSI protocol, the simulator M, in place of the OPRF model, waits for the messages $y$ sent by the malicious sender; all the $y$ sent form the set $Y$.
2. After the malicious sender sends $M_1$ and $M_2$ to the simulator M, M computes $Y_1 := \{\, y \in Y \wedge y' \notin Y \ \text{s.t.}\ y \neq y' \wedge F(y) = F(y') \,\}$ and then executes the PSI protocol with $Y_1$.
Since the simulator has ruled out the collision case $F(y) = F(y')$, the above simulation is correct and indistinguishable. In the remaining case where $F(y) = F(y')$, the probability that some element $x$ in the sender's set $X$ satisfies $F(y') = F(x)$ is $2^{-\lambda}$, so the simulation is indistinguishable overall. We prove this using the following hybrid argument:
Hybrid 0: Same as the original protocol, except that the simulator M, instead of the VOLE, interacts with the malicious sender.
Hybrid 1: Similar to Hybrid 0, except that A is generated by sampling from a uniform distribution instead of summing over A and D. From the perspective of the malicious sender, A is also uniformly distributed, so Hybrid 1 is indistinguishable from Hybrid 0.
Hybrid 2: Similar to Hybrid 1, except that the simulator M no longer calls the Encode algorithm, so the protocol does not terminate even if the Encode algorithm fails to encode. Since the cuckoo hash graph OKVS generated by 3H-GCG has not been queried and is therefore also randomly sampled, the probability of protocol termination is at most $2^{-\lambda}$. So Hybrid 2 is statistically indistinguishable from Hybrid 1. Note that no input from the receiver is used any longer in this hybrid.
Hybrid 3: Similar to Hybrid 2, except that the protocol terminates if $H_2(x, \mathrm{Decode}(P, x) - \Delta H_1(x) + w)$ has been queried before. Otherwise, the simulator M invokes the decoding step of the OPRF protocol to respond to the query. Since the queries $H_2(x, \mathrm{Decode}(P, x) - \Delta H_1(x) + w)$ are all similarly distributed and the malicious sender has negligible probability of querying $H_2(x, \mathrm{Decode}(P, x) - \Delta H_1(x) + w)$, Hybrid 3 is indistinguishable from Hybrid 2.
Now assume that the recipient is malicious and construct a simulator M. The interaction between the malicious recipient and the simulator M is as follows:
1. The simulator M replaces the OPRF model. When the malicious recipient sends a query message to M, the simulator records the set $X$ of messages sent by the malicious recipient and returns the response to each query.
2. The simulator executes the PSI protocol with the collected $X$ and its own $Y$ as input to obtain the intersection set $Z$.
3. The simulator feeds the elements of $Z$ and the consistent values of the non-intersecting elements of $Y$ into the OPRF; the computed values are assembled into $Y'$, and $Y'$ is sent to the malicious recipient.
Consistent values for the non-intersecting elements of $Y$ are added in the third step of the simulation, and this is the only difference from the real protocol. However, since such consistent values occur with probability $2^{-\lambda}$, this change is indistinguishable. We prove this using the following hybrid argument:
Hybrid 0: Same as the original protocol, except that the simulator M, instead of the VOLE, interacts with the malicious sender.
Hybrid 1: Similar to Hybrid 0, except that A is generated by sampling from a uniform distribution instead of summing over A and D. From the perspective of the malicious sender, A is also uniformly distributed, so Hybrid 1 is indistinguishable from Hybrid 0.
Hybrid 2: Similar to Hybrid 1, except that the simulator M no longer calls the Encode algorithm, so the protocol does not terminate even if the Encode algorithm fails to encode. Since the cuckoo hash graph OKVS generated by 3H-GCG has not been queried and is therefore also randomly sampled, the probability of protocol termination is at most $2^{-\lambda}$. So Hybrid 2 is statistically indistinguishable from Hybrid 1. Note that no input from the receiver is used any longer in this hybrid.
Hybrid 3: Similar to Hybrid 2, except that the protocol terminates if $H_2(x, \mathrm{Decode}(P, x) - \Delta H_1(x) + w)$ has been queried before. Otherwise, the simulator M invokes the decoding step of the OPRF protocol to respond to the query. Since the queries $H_2(x, \mathrm{Decode}(P, x) - \Delta H_1(x) + w)$ are all similarly distributed and the malicious sender has negligible probability of querying $H_2(x, \mathrm{Decode}(P, x) - \Delta H_1(x) + w)$, Hybrid 3 is indistinguishable from Hybrid 2.
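The three simulator steps for the malicious recipient described above can be sketched as follows. This is only an illustration of the bookkeeping, not the protocol itself: the OPRF is replaced by HMAC-SHA256 under a key held by the simulator, the placeholders for non-intersecting elements are simply fresh random strings, and the class and method names are hypothetical.

```python
import hashlib
import hmac
import secrets


class RecipientSimulator:
    """Toy stand-in for the simulator M that faces a malicious recipient."""

    def __init__(self, y_set):
        self.key = secrets.token_bytes(32)  # assumed OPRF key (HMAC stand-in)
        self.y_set = set(y_set)
        self.queried = set()                # the set X recorded in step 1

    def _oprf(self, item: bytes) -> bytes:
        return hmac.new(self.key, item, hashlib.sha256).digest()

    def query(self, x: bytes) -> bytes:
        """Step 1: answer an OPRF query and record the queried element."""
        self.queried.add(x)
        return self._oprf(x)

    def respond(self):
        """Steps 2 and 3: compute Z, then build the set sent to the recipient."""
        z = self.queried & self.y_set
        y_prime = {self._oprf(item) for item in z}
        # Non-intersecting elements are replaced by random placeholders.
        y_prime |= {secrets.token_bytes(32) for _ in range(len(self.y_set) - len(z))}
        return z, y_prime


if __name__ == "__main__":
    sim = RecipientSimulator([b"alice", b"bob", b"carol"])
    outputs = {x: sim.query(x) for x in (b"bob", b"dave")}
    z, y_prime = sim.respond()
    print(sorted(x for x, fx in outputs.items() if fx in y_prime))
```

In this sketch the recipient learns a match only for elements it actually queried and that lie in the intersection; a random placeholder coincides with one of its OPRF outputs only with negligible probability.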

Theoretical Efficacy Analysis and Experimental Comparison
This section analyzes the 3H-GCG designed in Section 3 and the PSI protocol designed in Section 4, both theoretically and through experimental results.
The experiments were conducted on a host machine (LAPTOP-7CRPMU9N) with an AMD Ryzen 7 5800H processor clocked at 3.20 GHz, running a 64-bit operating system on an x64-based processor. The experiments themselves ran in a virtual machine with Ubuntu 22.04 and 8 GB of RAM. All experiments, including our own work and the replications of existing work, are implemented in C/C++. The VOLE mechanism is an extended version of that of Schoppmann et al. [23].

OKVS Part
Figure 6 illustrates the bit sizes of the encoded structures of the individual OKVS constructions. It compares the data in the semi-honest setting and in the malicious setting; the horizontal axis is the base-2 logarithm of the participant's set size, and the vertical axis is the number of bits of the OKVS structure. The independent variable in both line graphs is the number of elements in the set, which grows exponentially. The random matrix is the method of reference [24], and 3H-GCT is the method of reference [5]. The size of our 3H-GCG construction scales more favorably with set size than that of the random matrix method, and it remains competitive with 3H-GCT.
In Table 2, we list the existing constructions that meet the OKVS definition and analyze, for each, the type, the ratio of the original data to the constructed data, the encoding overhead, and the batch decoding overhead. These analyses assume an OKVS construction with failure probability $2^{-\lambda}$. We developed 3H-GCG by extending the principles of 3H-GCT and introducing specific constraints during its construction. As a result, we observed an approximately 7.5% improvement in the ratio between raw and constructed data, while the encoding and decoding overheads stayed at the same order of magnitude. Although the overall ratio is slightly lower than that of the PaXoS structure, 3H-GCG handles both binary and linear data types, which makes it more versatile than PaXoS, which is primarily designed for binary data types.
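To make the notions of encoding, decoding, and the ratio of original to constructed data more concrete, the following sketch implements a generic three-hash linear OKVS over XOR. It is not the 3H-GCG construction: it omits the constraints introduced by 3H-GCG and solves the linear system by plain Gaussian elimination over GF(2). The hash derivation, the 64-bit value width, and all function names are assumptions of this example.

```python
import hashlib
import secrets

VALUE_BITS = 64  # assumed value width for this sketch


def positions(key: bytes, m: int, seed: int) -> list[int]:
    """Map a key to one slot in each of three equal-size sub-tables."""
    third = m // 3
    slots = []
    for i in range(3):
        digest = hashlib.sha256(bytes([i]) + seed.to_bytes(8, "big") + key).digest()
        slots.append(i * third + int.from_bytes(digest[:8], "big") % third)
    return slots


def encode(pairs: dict[bytes, int], m: int, seed: int) -> list[int]:
    """Find a table with table[h1(k)] ^ table[h2(k)] ^ table[h3(k)] == v for all pairs."""
    if m % 3 != 0:
        raise ValueError("m must be a multiple of 3")
    pivots: dict[int, tuple[int, int]] = {}  # leading slot -> (row bitmask, value)
    for key, value in pairs.items():
        mask = 0
        for p in positions(key, m, seed):
            mask ^= 1 << p
        while mask:  # reduce the row against the pivots found so far
            lead = mask.bit_length() - 1
            if lead not in pivots:
                pivots[lead] = (mask, value)
                break
            pivot_mask, pivot_value = pivots[lead]
            mask ^= pivot_mask
            value ^= pivot_value
        else:  # the row reduced to zero: the keys are linearly dependent
            raise ValueError("encoding failed; retry with a new seed")
    table = [0] * m
    for slot in range(m):  # ascending order, so lower slots are fixed first
        if slot not in pivots:
            table[slot] = secrets.randbits(VALUE_BITS)  # free slot: random filler
            continue
        mask, value = pivots[slot]
        rest = mask ^ (1 << slot)
        while rest:
            low = rest.bit_length() - 1
            value ^= table[low]
            rest ^= 1 << low
        table[slot] = value
    return table


def decode(table: list[int], key: bytes, seed: int) -> int:
    a, b, c = positions(key, len(table), seed)
    return table[a] ^ table[b] ^ table[c]
```

In this sketch the ratio of original to constructed data is simply the number of pairs divided by the table length `m`. A caller would retry `encode` with a fresh `seed` whenever it raises, mirroring the failure probability that an OKVS construction has to keep small.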

PSI Part
We compare the communication of the PSI protocols in [3-7,19,25,26] with that of our protocol in a theoretical analysis. These analyses cover both the semi-honest and the malicious security settings. The results are presented in Table 3, which shows that the communication overhead of our protocol is lower than that of the other protocols when the set size is $2^{16}$. Some protocols have additional parameters, which are approximated by $\kappa$ and $\lambda$. Note that the coefficients shown in the table often vary (non-linearly) as a function of $n$, $\kappa$, and $\lambda$. The second column contains overheads for fixed, representative values $\lambda = 40$ and $\kappa = 128$, while the last three columns also fix the size of the set. More importantly, our protocol builds on its semi-honest security to also resist malicious adversaries, which makes it competitive in terms of security.
We reproduced existing PSI protocols and compared them experimentally with our protocol; the results are shown in Table 4. All protocols run in a LAN environment with less than 1 Gbps bandwidth and sub-millisecond latency. The experimental data are divided into two parts, runtime and communication, measured in milliseconds (ms) and megabytes (MB), respectively. We classify the protocols into two categories, semi-honest and malicious, and compare them separately. Because we limit the size of the encoded data, our scheme keeps the communication overhead low. Table 4 shows that in the semi-honest setting, our protocol competes strongly with the work of Kolesnikov et al. [25] in terms of runtime. In communication it does not surpass Pinkas et al. [13], but it is not far behind. In the malicious setting, our protocol has a significant runtime advantage over other existing protocols and is the fastest at a set size of $2^{18}$. Its communication is also the lowest among the compared protocols, an improvement of approximately 10%. Overall, our PSI protocol has lower communication and computational overhead than the PSI protocol proposed in [27], which supports the relevance of our work.

Conclusions
In this paper, we proposed a new OKVS scheme, 3H-GCG, which improves on existing schemes and is more suitable for scenarios where decoding results are already expected. We used 3H-GCG to construct an OPRF protocol and combined it with the VOLE approach to design a PSI protocol that can withstand malicious adversaries, and we gave detailed security proofs for this PSI protocol. Finally, we compared the designed OKVS scheme and PSI protocol with other existing schemes. When the set size is $2^{18}$ to $2^{20}$, our PSI protocol is less computationally intensive than other existing protocols. The results show that for our 3H-GCG, the ratio of raw data to constructed data is improved by about 7.5% over other existing OKVSs. Under the semi-honest security setting, our PSI protocol achieves the fastest runtime with a set size of $2^{18}$. Under the malicious security setting, our PSI protocol has about 10% improvement in communication compared to other existing protocols.
With the increasing number of real-life privacy protection scenarios, such as smart homes, smart cities, remote healthcare, smart transportation, and smart education, the need for privacy protection is becoming more stringent. PSI protocols may also find application in the field of Computer-Aided Vehicle (CAV) traffic behavior research [28], further expanding their utility. Specifically, the results of current research on PSI protocols in terms of reducing communication and computational overhead are sufficient for practical needs. As the various applications become smarter, there are higher requirements for privacy computing protocols, such as security, scalability, and adaptability. For example, there is a greater need for privacy computing protocols that are resistant to malicious adversaries, and a greater need for

Figure 1. The 3-Hash Garbled Cuckoo Graph construction approach, fitting $n$ key-value pairs $(k_i, v_i)$ to a data structure $S$.

2
The two values obtained could not be matched to the key-value pair set elements, and the participant could not know the kvalue represented by the two values.

Figure 6. Schematic illustration of the size of each OKVS coding structure.

Table 2. Different OKVS constructions and their properties with an error probability of $2^{-\lambda}$.

Table 3. Comparison of the theoretical communication overheads of various PSI protocols.

Table 4. Comparison of experimental data; the best results in the semi-honest and malicious settings are marked in bold.