Learned-Index-Based Semantic Keyword Query on Blockchain

Yao, Zhongming; Xin, Junchang; Hao, Kun; Wang, Zhiqiong; Zhu, Wancheng

doi:10.3390/math11092055


by Zhongming Yao ¹, Junchang Xin ¹,²,*, Kun Hao ³,⁴, Zhiqiong Wang ³ and Wancheng Zhu ⁵

¹ School of Computer Science and Engineering, Northeastern University, Shenyang 110819, China

² Key Laboratory of Big Data Management and Analytics (Liaoning Province), Shenyang 110819, China

³ College of Medicine and Biological Information Engineering, Northeastern University, Shenyang 110819, China

⁴ Neusoft Corporation (Research Center of Liaoning Promotion for Blockchain Engineering Technology), Shenyang 110819, China

⁵ Center for Rock Instability and Seismicity Research, School of Resources and Civil Engineering, Northeastern University, Shenyang 110819, China

* Author to whom correspondence should be addressed.

Mathematics 2023, 11(9), 2055; https://doi.org/10.3390/math11092055

Submission received: 30 March 2023 / Revised: 22 April 2023 / Accepted: 23 April 2023 / Published: 26 April 2023

(This article belongs to the Special Issue Mathematical Modeling for Parallel and Distributed Processing)


Abstract:

Blockchain has become increasingly popular for data management in recent years. However, the existing blockchain systems lack efficient semantic queries, particularly keyword queries. To address this issue, we propose a learned-index-based semantic keyword query architecture on blockchain. First, our architecture records data semantics information to support semantic keyword queries. Second, we establish the lookup table index for semantic information among blocks and the block-level recursive model index for blocks to improve the query efficiency. We store the lookup table in the extended block headers to maintain the result’s completeness, and we store recursive model indexes off chain to optimize the maintenance efficiency. Third, we propose a verifiable query algorithm based on our proposed architecture to maintain the result’s correctness. Finally, the experimental results show that combining the lookup table and the learned index effectively improves the query efficiency on blockchain.

Keywords:

blockchain; learned index; keyword query; semantic query

MSC:

68-00; 68M14; 68P15

1. Introduction

In recent years, blockchain has attracted significant and sustained interest. It is a novel application that combines distributed data storage, point-to-point transmission, a consensus mechanism, and an encryption algorithm. The decentralization of blockchain makes it non-tamperable and traceable while effectively addressing the issues of high cost, low efficiency, and unsafe data storage associated with centralization. As a result, blockchain has found widespread use in finance, healthcare, and education.

There are many widely used blockchain systems (e.g., Bitcoin (https://bitcoin.org/en/, accessed on 10 February 2023) and Ethereum (https://www.ethereum.org/, accessed on 10 February 2023)). However, these systems do not support efficient semantics-empowered queries, particularly keyword queries, because they do not store the semantic information. To provide efficient queries, several studies [1,2,3] extract the semantic data without modifying the original structure and store them on a local device. However, this approach does not ensure client security. Several studies [4,5] build both the semantic information and the query models directly on blockchain, which leads to performance degradation. Zhou E et al. [6] store original data in a database off-chain while keeping semantic information and index information on blockchain for security purposes and faster query processing. However, this method is more suitable for scenarios where original data are stored off-chain than for data that are already on blockchain.

To address these limitations, we propose an architecture that records both the semantic information and the original data on blockchain. Then, we build a lightweight lookup table existence index on blockchain, and a detailed and efficient query index off-chain. However, this architecture still presents the following two challenges:

Designing an index that adapts to the blockchain structure: There are numerous outstanding indexes available in traditional databases. Nevertheless, traditional database indexes cannot be directly applied due to the constant updates of block numbers and unchanging data within blocks on blockchain. Therefore, designing and selecting appropriate indexes for blockchain presents a significant challenge.
Ensuring the completeness and correctness of query results after using indexes: The purpose of constructing indexes is to avoid completely traversing data when performing data queries. In this case, ensuring there are no omissions in query results is a challenge. In addition, we build some indexes off-chain to ensure the performance and scalability of the blockchain. Therefore, when using those off-chain indexes, ensuring the correctness of the data query results is also a challenge that needs to be addressed.

We propose a learned-index-based semantic keyword query architecture to tackle the challenges. For keyword queries, we record the semantic information of data. Regarding semantic information, we construct a lookup table existence index among blocks. To ensure the efficiency of query results without disrupting system operation, we store the lookup table index on blockchain, with each block only recording the updated parts. Additionally, we enhance query efficiency by building a learned index for each block. We design a verifiable query algorithm based on the architecture to guarantee the correctness of query results. The contributions of our paper are summarized below.

We propose a learned-index-based semantic keyword query architecture. In the architecture, we add header extension space within the blocks to record the semantic information in the data storage procedure so that our architecture provides semantics-empowered keyword queries.
We propose a double-layer index structure. Specifically, an inter-block lookup table existence index is established for semantic information to quickly locate the blocks that contain the query results. A block-level recursive model index is constructed for each block to promptly search the query results. To maintain system efficiency, only the updated part is stored in each block. The lookup table index is stored in the extended block header while the learned index is stored off-chain.
We propose a verifiable query algorithm based on our proposed architecture. In this algorithm, we use the double-layer index structure for keyword queries to realize efficient data queries, and we use the Merkle tree structure in blockchain to verify query results, so that errors introduced by the learned index, which is constructed and stored off-chain, do not lead to incorrect query results.
Experiments show that the lookup table index has advantages in construction time, query time, and storage space, and the learned index also has benefits in terms of query time and storage space. The experiment of deploying our architecture on a blockchain system shows that our architecture can effectively improve query efficiency. On the Wiki-CS dataset, the query speed is improved by more than 15 times. On the Ethereum dataset, the query speed is improved by about 100 times.

Section 2 of this paper introduces the related work of blockchain data queries and the learned index. Section 3 defines the query problem in the blockchain query scenario. Section 4 proposes the learned-index-based semantic keyword query architecture and introduces the indexes. Section 5 proposes the verifiable query algorithm based on our proposed architecture. Section 6 evaluates the above query methods and analyzes the experimental results. Section 7 summarizes the paper.

2. Related Work

2.1. Blockchain Data Query Processing

Since Satoshi Nakamoto introduced the concept of Bitcoin in the Bitcoin white paper [7] in 2008, blockchain has come into the limelight. A blockchain is a distributed ledger in which multiple untrusted nodes work together to maintain the same global state [8]. From the database perspective, the blockchain can be considered a block-chain structured database, where data are packaged into blocks and connected using a chain structure. A block in blockchain consists of a block header and a block body, as shown in Figure 1, for the most representative Bitcoin system. The block header generally stores fields such as the timestamp, the previous block hash, and the Merkle tree root. The block body stores detailed data about the whole block.

In recent years, there have been many advances in blockchain data queries. Zhang C et al. [9,10] proposed a data authentication structure called GEM²-tree for range queries under the on–off chain model to reduce the overhead incurred by smart contracts, and also offered an optimized GEM²-tree structure that reduces the computational overhead without sacrificing much query performance. Subsequently, the team proposed the suppressed Merkle inverted index and the Chameleon inverted index for keyword queries. They combined those indexes with Bloom filters to guarantee data query and verification while reducing the usage overhead of smart contracts. The team’s research goal is to minimize the average maintenance cost on blockchain without a significant loss in query performance; that is, it focuses on reducing the maintenance cost. Xu C et al. [11,12] proposed vChain, a processing framework for verifiable queries that allows lightweight users to verify query results obtained from untrusted parties, and developed two new indexes that aggregate data within and among blocks to ensure efficient verification of query results. An inverted prefix tree was used to speed up subscription queries. Generally, this team’s research focuses more on query result verification and verification efficiency. Zhu Y et al. [13] proposed and implemented a blockchain database called SEBDB to add relational data semantics to the blockchain platform, implemented an SQL-like language to support easy application development, and developed block-level, table-level, and hierarchical multi-level indexes to ensure the efficiency of data queries. Li Y et al. [2] developed EtherQL, an efficient query layer for the Ethereum blockchain system, to support efficient analysis queries and provide flexible interfaces for a series of queries such as range queries and top-k queries. This work is more oriented toward queries on the Ethereum platform. Ruan P et al. [14] proposed a fine-grained, secure and efficient blockchain provenance system, LineageChain. The system enabled new blockchain applications by exposing provenance information to smart contracts through a simple interface and provided a new skip list index to support efficient provenance query processing. This work is more focused on data querying in terms of data provenance. Jia D et al. [15,16] proposed the ElasticChain architecture to improve the scalability of storage while ensuring the security of data. They proposed a B-M tree-based blockchain storage structure to ensure the efficiency of local queries within a block. This architecture belongs to the study of scalability on blockchain, and the proposed index structure is also suitable for data queries in the architecture.

In addition to the above studies, there are many studies on the practical applications of data queries on blockchain. For example, in healthcare, Lv Y et al. [17] proposed a data query optimization model that provides efficient medical record queries by building an index tree to locate the block where the patient’s medical record is stored. In education, Xu Y et al. [5] proposed an education certificate blockchain supporting low latency and high throughput and built an MPT-chain structure to improve the data query speed. In the Internet of Things, Ren Y et al. [18] introduced the dual-combination Bloom filter method to convert the computational power of bitcoin mining into the computational power of query. They built a blockchain-based data query model that combines the data stream with the blockchain timestamp, aiming to optimize streaming data queries in the IoT scenario. In privacy protection, Linoy S et al. [19] proposed a privacy protection system that provides efficient and scalable blockchain query processing through Hadoop and synchronized Ethereum clients. Beyond the above, other types of data management, such as trajectory data and scientific research data [20,21,22,23], may also be combined with blockchain in the future.

2.2. Learned Index

In recent years, techniques from the field of artificial intelligence have been widely used in data-driven problems. Database researchers have been inspired to integrate data query processing and optimization techniques with machine learning and deep learning techniques to achieve better performance [24,25]. As early as 2000, Sakurai Y et al. [26] proposed the first learned index, the A-tree, which uses a dynamic programming partitioning algorithm to segment the data and stores the slope of the segment, the start key, and a pointer to the table page in each leaf node. When the query reaches a leaf node, the slope and the distance to the start key are used to calculate the approximate position of the key, and the data are then retrieved through a local search. The A-tree improves on the performance of the B-tree only to a limited extent, because it can accelerate only the lookup at the leaf nodes of the B-tree. Later, with the continuous improvement of machine learning techniques, Kraska T et al. [27,28] proposed the concept of learned indexes, stating that all existing index structures can be replaced with other types of models. The key idea is to use machine learning models to learn the ordering or structure of query keywords and to use these models to predict the location or existence of records. However, such indexes are only suitable for static read-only data queries. Although Ding J et al. [29] proposed a learned index that supports data updates, the index’s performance is not good enough when a large number of data updates occur.

3. Problem Definition

3.1. Data Query Definition

According to the literature [11], the data stored in the blockchain can be modeled as a number of data blocks consisting of a set of objects $\{o_{1}, o_{2}, o_{3}, \dots, o_{n}\}$. Each piece of data in each block in the blockchain can be regarded as an object $o$. Each object $o$ can be regarded as a set of attributes, so the data unit of the blockchain can be defined as:

Definition 1 (blockchain data unit, $o$). Each piece of data stored on a blockchain is the data unit of the blockchain. The blockchain data unit can be defined as $o = \{w_{1}, w_{2}, \dots\}$, where $w_{1}$, $w_{2}$ are the attributes of each entry in the data unit $o$.

Definition 2 (Query result data unit, $W_{on}$). Assume that data unit $o = \{w_{1}, w_{2}, w_{3}, \dots\}$, where $w_{1}$ is used as query keyword $k$, so each piece of data stored on blockchain can be formalized as $o = \{k, w_{2}, w_{3}, \dots\}$. Then, when the user initiates a query $q = k = w_{1}$, each piece of data in the query result returned on blockchain can be defined as $W_{on} = \{w_{2}, w_{3}, \dots\}$.

Definition 3 (data query result, $R$). If the user initiates a query $q = k$, the returned query result is $R = \{k, R_{on}\}$, where $R_{on}$ is the concatenation of each query result data unit $W_{on}$ in the set of all query results on blockchain $\{W_{on}\}$.

Example 1. The above definitions are illustrated by the data scenario in Figure 2. As shown in the figure, we need to query all the treatment records of user “Carl”, so the query uses the keyword $k = \text{“Carl”}$, and the query result on blockchain is $\{W_{on}\} = \{W_{on}^{1}, W_{on}^{2}\} = \{\{00001, \text{Edwin}, \text{information}_{1}\}, \{00001, \text{Bob}, \text{information}_{3}\}\}$. Then, in this query scenario, the query result is $R = \{\text{Carl}, \{00001, \text{Edwin}, \text{information}_{1}\}, \{00001, \text{Bob}, \text{information}_{3}\}\}$ for the keyword $k = \text{“Carl”}$.
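Definitions 1 to 3 can be illustrated with a minimal Python sketch. It assumes that each data unit is a tuple whose first attribute is the keyword; the function name `query`, the tuple layout, and the non-matching "Dave" record are hypothetical and used only for illustration, while the two "Carl" records reuse the values of Example 1.

```python
from typing import List, Tuple

# A data unit o = {w1, w2, ...}; by assumption, w1 is the query keyword k.
DataUnit = Tuple[str, ...]


def query(units: List[DataUnit], k: str) -> Tuple[str, List[Tuple[str, ...]]]:
    """Return R = {k, R_on}: the keyword plus the W_on of every matching unit."""
    # W_on drops the keyword and keeps the remaining attributes {w2, w3, ...}.
    r_on = [o[1:] for o in units if o[0] == k]
    return (k, r_on)


# Hypothetical on-chain records; the "Dave" row is filler that must not match.
chain_data = [
    ("Carl", "00001", "Edwin", "information_1"),
    ("Dave", "00002", "Alice", "information_2"),
    ("Carl", "00001", "Bob", "information_3"),
]
result = query(chain_data, "Carl")
```

This sketch scans every record, which corresponds to the unindexed baseline that the indexes in Section 4 are meant to avoid.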

3.2. Architecture Goals

This architecture aims to provide efficient and accurate keyword queries for blockchain systems. Therefore, the completeness and correctness of query results are the basic needs the architecture must ensure.

Result completeness. The query results obtained by users need to be guaranteed to be complete.
Result correctness. The query results obtained by users need to be guaranteed to be correct.

In addition, to make the query process efficient, we build indexes for the query, so the query time and the storage overhead required by the indexes are also issues to which this architecture needs to pay attention.

Query time. The time consumption of the data query needs to be small.
Storage overhead. The storage overhead of the index needs to be small.

4. Architecture

4.1. Architecture Overview

Our proposed learned-index-based semantic keyword query architecture on blockchain is shown in Figure 3; there are three types of nodes in the architecture, as follows.

Full nodes. Full nodes store all blockchain data, including the block header and body in blocks. Full nodes can participate in the consensus process to operate and maintain the blockchain system. Full nodes involved in system operation and maintenance are also called miner nodes in some blockchains.
Light node. Light nodes store only the block headers of blocks. A light node acts as the user’s client in the blockchain system. Light nodes cannot participate in the operation and maintenance of the system, but they can query data from full nodes and verify the correctness of the query results through the information in the block header.
Index management node. The index management node is responsible for building and storing the learned index. The learned index is stored off chain, so the index management node does not have to participate in the work on blockchain. Full nodes can use the index stored by the index management nodes in the data query procedure.

In this architecture, when full nodes participate in data storage on blockchain, they record the semantic information of keywords and establish a lookup table index for them. An extended block header, called the header extension, is added to each block, and the lookup table index is stored in it. The off-chain index management node constructs a learned index for each block and keeps it off-chain after construction. Light nodes store block header information and can choose to keep the header extensions according to their needs. When light nodes want to query data, they initiate queries to full nodes. Light nodes that have not synchronized the header extensions must carry out all query work through full nodes. Full nodes obtain query results through the lookup table index and the learned index, and return them to light nodes together with the query verification fields. Light nodes can use the verification fields to selectively verify query results. Light nodes that store the header extensions can complete part of the work by themselves.

The lookup table index and the learned index are combined into the double-layer index. The lookup tables are built among blocks to locate blocks quickly, and recursive model indexes are used to speed up data queries within blocks. To ensure the completeness of the query results, we store the lookup table on blockchain so that it cannot be tampered with or destroyed. The lookup table records all the blocks corresponding to the query results, which ensures that there are no omissions in the query results. The learned indexes are stored off-chain, so their security cannot be guaranteed. We do not verify the correctness of the index itself, but directly verify the query results. Verifying the query results not only ensures the correctness of results obtained through the indexes but also avoids the influence of malicious nodes.

4.2. Double-Layer Query Index

4.2.1. Inter-Block Lookup Table Index

We establish an inter-block lookup table index among blocks for the semantic information. This existence index is established to quickly locate the blocks with query results corresponding to the keyword $k$. As shown in Figure 4, the inter-block lookup table index consists of a sequential linked list for storing attribute values and several linked lists with elements of the sequential list as head nodes. The index uses linked lists instead of arrays to map each query keyword to its block numbers, which performs better when the data are sparse. In addition, when the lookup table is established, indexed attribute values are inserted in sorted order of their values, further improving the query efficiency.

The inter-block lookup table index is defined as follows:

Definition 4 (Attribute value sequence linked list, $SL$). If the values of the attributes used as query keywords in the data unit are $k_{1}$, $k_{2}$, $k_{3}$, etc., the attribute value sequence linked list can be defined as $SL = \{k_{1}, k_{2}, k_{3}, \dots\}$.

Definition 5 (Block number linked list, $L$). If the block numbers in which the attribute value $k_{i}$ is stored are $b_{k_{i}}^{1}$, $b_{k_{i}}^{2}$, $b_{k_{i}}^{3}$, etc., the block number linked list corresponding to the attribute value $k_{i}$ can be defined as $L_{k_{i}} = \{b_{k_{i}}^{1}, b_{k_{i}}^{2}, b_{k_{i}}^{3}, \dots\}$.

Definition 6 (Lookup table, $LT$). According to the definitions of the attribute value sequence list and the block number list, the lookup table can be defined as $LT = \{SL, L_{k_{1}}, L_{k_{2}}, L_{k_{3}}, \dots\} = \{\{k_{1}, k_{2}, k_{3}, \dots\}, \{b_{k_{1}}^{1}, b_{k_{1}}^{2}, b_{k_{1}}^{3}, \dots\}, \{b_{k_{2}}^{1}, b_{k_{2}}^{2}, b_{k_{2}}^{3}, \dots\}, \{b_{k_{3}}^{1}, b_{k_{3}}^{2}, b_{k_{3}}^{3}, \dots\}, \dots\}$.

The algorithm flowchart of the above inter-block lookup table index construction is shown in Figure 5. The specific steps for building and updating the inter-block lookup table index are as follows:

(1) When a new block is synchronized to the blockchain, it is first determined whether the lookup table $LT$ has been constructed. If it has not been constructed, the attribute value sequence linked list $SL$ is added to the lookup table $LT$.

(2) If the lookup table $LT$ already exists, the inter-block lookup table $LT$ is updated for each piece of data in the block. For each piece of data, the indexed attribute value is searched for in the attribute value sequence linked list $SL$. If the value is found, step (3) is carried out. Otherwise, step (4) is carried out.

(3) If the indexed attribute of the data is found in the attribute value sequence linked list $SL$, the block number linked list $L$ whose head node is that attribute value is located. After the last node of the list $L$, a new node is inserted to record the block number of the data.

(4) If the indexed attribute of the data is not found in the attribute value sequence linked list $SL$, the indexed attribute value of the data is new. The new attribute value is inserted at its corresponding position, in order, in the attribute value sequence linked list $SL$. Then, a new block number linked list $L$ is established with the attribute value as the head node, and a subsequent node is inserted after the head node to record the block number of the data.
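Steps (1) to (4) can be sketched in Python as follows. This is a simplified illustration, not the authors' implementation: Python lists and the standard `bisect` module stand in for the linked lists, the class and method names (`LookupTable`, `add_block`, `blocks_for`) are hypothetical, and skipping a block number that is already at the tail of a list is an assumption made to keep each list free of duplicates.

```python
import bisect
from typing import Dict, List


class LookupTable:
    """LT = {SL, L_k1, L_k2, ...}: sorted keywords plus one block list per keyword."""

    def __init__(self) -> None:
        # Step (1): an empty LT starts with an empty attribute value sequence SL.
        self.sl: List[str] = []
        self.lists: Dict[str, List[int]] = {}  # block number list L_k per keyword k

    def add_block(self, block_no: int, keywords: List[str]) -> None:
        """Step (2): update LT with the keyword of every piece of data in a block."""
        for k in keywords:
            pos = bisect.bisect_left(self.sl, k)
            if pos < len(self.sl) and self.sl[pos] == k:
                # Step (3): known keyword, record the block at the tail of L_k.
                if self.lists[k][-1] != block_no:  # assumption: no duplicates
                    self.lists[k].append(block_no)
            else:
                # Step (4): new keyword, insert into SL in order and start L_k.
                self.sl.insert(pos, k)
                self.lists[k] = [block_no]

    def blocks_for(self, k: str) -> List[int]:
        """Existence lookup: every block number that contains keyword k."""
        return self.lists.get(k, [])


lt = LookupTable()
lt.add_block(1, ["Carl", "Bob"])
lt.add_block(2, ["Alice", "Carl", "Carl"])
```

After these two calls, `lt.sl` is kept in sorted order and `lt.blocks_for("Carl")` lists both blocks, so a query never has to open a block that does not contain the keyword.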

The lookup table records the blocks in the whole blockchain corresponding to keywords, and it must be updated every time a new block is added. Therefore, as the number of blocks increases, the lookup table grows larger and larger. Storing the whole lookup table in each block then causes two problems. First, the storage occupied by the index becomes larger and larger. Second, the lookup table stored in each block is redundant. As shown in Figure 6, we optimize it by recording only incremental modifications to the index structure in each block. The new block stores only the changed part of the lookup table and uses a pointer to the corresponding part of the lookup table that is located in the previous block.
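One possible reading of this incremental scheme is sketched below in Python. The layout is an assumption rather than the structure of Figure 6: each header extension is modeled as a dictionary that maps a keyword appearing in the block to the number of the previous block containing the same keyword. All names (`HeaderExtension`, `append_block`, `find_latest`, `blocks_for`) are hypothetical, and block numbers start at 0 for convenience.

```python
from typing import Dict, List, Optional

# One header extension per block: keyword -> number of the previous block that
# contains the same keyword (None if this block is its first occurrence).
HeaderExtension = Dict[str, Optional[int]]


def find_latest(extensions: List[HeaderExtension], k: str) -> Optional[int]:
    """Scan header extensions from the chain tip backwards for the newest block with k."""
    for block_no in range(len(extensions) - 1, -1, -1):
        if k in extensions[block_no]:
            return block_no
    return None


def append_block(extensions: List[HeaderExtension], keywords: List[str]) -> None:
    """Store only the delta for the new block instead of a full lookup table copy."""
    delta: HeaderExtension = {k: find_latest(extensions, k) for k in set(keywords)}
    extensions.append(delta)


def blocks_for(extensions: List[HeaderExtension], k: str) -> List[int]:
    """Follow the back-pointers to collect every block containing k, oldest first."""
    found: List[int] = []
    cur = find_latest(extensions, k)
    while cur is not None:
        found.append(cur)
        cur = extensions[cur][k]
    return found[::-1]


exts: List[HeaderExtension] = []
append_block(exts, ["Carl"])
append_block(exts, ["Bob"])
append_block(exts, ["Carl", "Bob"])
```

Under this layout the storage added by each block is proportional to the number of distinct keywords in that block, not to the size of the whole lookup table.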

4.2.2. Intra-Block Recursive Model Index

The block with query results can be quickly located using keywords through the inter-block lookup table index constructed in Section 4.2.1. This section establishes a learned recursive model index for each block. The purpose of establishing this index is to improve the efficiency of the query data within blocks.

This section applies a recursive model index to each block to build a block-level recursive model index, as shown in Figure 7. The recursive model index can significantly reduce storage space while improving query efficiency. Its drawback is that it does not support data updates, or supports only a small number of them. However, once the blocks in blockchain are successfully verified, they are never modified, and all data updates occur in subsequent blocks, so building a recursive model index for each block is very appropriate.

The relevant definitions of the block-level recursive model index are as follows:

Definition 7 (Block-level Recursive Model Index, $BRMI$). If the $i$-th layer of the model is $S_{i}$ and the $j$-th sub-model of this layer is $M_{i}^{j}$, the block-level recursive model index composed of several sub-models can be defined as $BRMI = \{S_{1}\{M_{1}^{1}\}, S_{2}\{M_{2}^{1}, \dots\}, \dots, S_{i}\{M_{i}^{1}, \dots, M_{i}^{j}, \dots\}, \dots\}$.

To ensure that query results can be obtained in blocks using the block-level recursive model index, if the corresponding data are not found at the predicted position, the two sides of the predicted position are searched until the corresponding query data are obtained. Therefore, there will be a certain level of error when using the block-level recursive model to index the query, and the definition of the error is as follows:

Definition 8 (data error of $BRMI$, $e$). If a data query using the index $BRMI$ does not find a result at the predicted position, but finds it $e$ positions away from the predicted position, the error of the data is defined as $e$.

A large error value means that more records must be traversed when querying with the $BRMI$ index. Therefore, to ensure the query’s efficiency, an error threshold is preset for each sub-model in the $BRMI$ index. When the error in a sub-model exceeds the threshold, a B-tree index is used instead of the sub-model. This threshold is called the maximum permissible error and is defined as follows:

Definition 9 ($BRMI$ sub-model maximum permissible error, $smpe$). If the query result error $e$ of every piece of data in the index may not exceed $smpe$, then $smpe$ is defined as the maximum permissible error of the index.

The algorithm flowchart of the above block-level recursive model index construction is shown in Figure 8. The specific steps of the block-level recursive model index construction are as follows:

(1) The configuration of the index is initialized before constructing the index, including the number of layers of the recursive model index, the number of sub-models in each layer, the initialization parameters of sub-models and the maximum permissible error $smpe$.

(2) After the new block is successfully uploaded, the sub-model of the first layer is used to fit the keyword data, and the data are divided into a plurality of subsets according to the relevant result, and the number of subsets corresponds to the number of models corresponding to the second layer during initialization. After the division, the sub-models of the second layer are used to fit the respective subsets. The above steps are repeated until all the sub-models are fitted.

(3) The error of each sub-model is calculated. If the error exceeds the initial maximum permissible error, the B-tree index is established there to replace the current sub-model.
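The construction steps can be sketched as a two-layer recursive model index in plain Python. Several choices here are assumptions made for illustration: keys are numeric, every sub-model is a least-squares line, binary search over the sorted keys stands in for the B-tree fallback, and the names `fit_line`, `BlockRMI`, `n_sub`, and the default threshold are hypothetical.

```python
import bisect
from typing import List, Optional, Tuple


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Least-squares line y = a*x + b; a constant model for degenerate input."""
    n = len(xs)
    if n == 0:
        return 0.0, 0.0
    mx, my = sum(xs) / n, sum(ys) / n
    var = sum((x - mx) ** 2 for x in xs)
    if var == 0:
        return 0.0, my
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / var
    return a, my - a * mx


class BlockRMI:
    def __init__(self, keys: List[float], n_sub: int = 4, smpe: int = 8) -> None:
        # Step (1): configuration (two layers, n_sub sub-models, threshold smpe).
        self.keys = sorted(keys)
        self.n_sub = n_sub
        n = len(self.keys)
        # Step (2): the layer-1 model maps a key to a layer-2 sub-model number.
        self.root = fit_line(self.keys, [p * n_sub / max(n, 1) for p in range(n)])
        buckets: List[List[int]] = [[] for _ in range(n_sub)]
        for p, k in enumerate(self.keys):
            buckets[self._route(k)].append(p)
        # Step (3): fit each sub-model; None marks "error above smpe, use fallback".
        self.subs: List[Optional[Tuple[float, float, int]]] = []
        for b in buckets:
            a, c = fit_line([self.keys[p] for p in b], [float(p) for p in b])
            err = max((abs(round(a * self.keys[p] + c) - p) for p in b), default=0)
            self.subs.append((a, c, err) if err <= smpe else None)

    def _route(self, key: float) -> int:
        a, c = self.root
        return min(self.n_sub - 1, max(0, int(a * key + c)))

    def lookup(self, key: float) -> Optional[int]:
        """Return the position of key in the sorted block data, or None if absent."""
        n = len(self.keys)
        sub = self.subs[self._route(key)]
        if sub is None:
            lo, hi = 0, n  # fallback: ordered search over the whole block
        else:
            a, c, err = sub
            guess = round(a * key + c)
            # Search only the window of recorded error around the predicted position.
            lo = min(max(0, guess - err), n)
            hi = max(lo, min(n, guess + err + 1))
        i = bisect.bisect_left(self.keys, key, lo, hi)
        return i if i < hi and self.keys[i] == key else None


rmi = BlockRMI(list(range(0, 1000, 3)))
position = rmi.lookup(300)
```

Only the line parameters and the recorded error of each sub-model are kept, which reflects why the storage overhead of a learned index can be small compared with a full tree structure.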

In terms of storage overhead, traditional indexes, such as B-tree, need to store the whole index structure. In contrast, learned indexes only need to keep sub-model parameters, so the storage overhead of learned indexes is relatively small.

Although the learned index performs well in terms of query speed and index space, it is slow to build. Therefore, the construction process is carried out off-chain so as not to affect the overall running speed of the blockchain. Because learned indexes are stored off-chain, their correctness cannot be guaranteed. One option would be to verify the index itself through process verification. We abandon this idea and instead choose result verification, which directly verifies the query results (Section 5). This approach not only ensures that the results obtained through the index are correct, but also eliminates the influence of other malicious elements in the query process. For example, not all full nodes are honest.

5. Verifiable Query Algorithm

Full nodes use the lookup table index and the learned index in the data query process. The lookup table index is stored on blockchain, and the correctness of this index can be guaranteed so that no data query results are missed, i.e., the completeness of the query is guaranteed. However, the learned index is stored off-chain, and the correctness of the results obtained using this index cannot be guaranteed. Therefore, we propose using the result verification method to verify the query results. Since we do not modify the verification tree in the block, the result verification method we use is the traditional method in blockchain. To verify a result, the client recomputes the Merkle root: it hashes the leaf node corresponding to the query result, concatenates that hash with the hash of its sibling node and hashes the concatenation to obtain the parent node, then concatenates the parent node with the uncle node and hashes again, repeating until the root hash value is generated. The client compares the computed hash value with the root hash stored in the block header to determine whether the data have been tampered with. This verification process is an optional step performed locally after the user receives the query result and does not affect the overall operational performance of the blockchain.
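The result verification described above can be sketched in Python with the standard `hashlib` module. The sketch makes simplifying assumptions: a single SHA-256 is used per node (Bitcoin applies SHA-256 twice), the last node is duplicated on levels with an odd number of nodes, and the function names `merkle_root_and_proof` and `verify` are hypothetical.

```python
import hashlib
from typing import List, Tuple

Proof = List[Tuple[bytes, bool]]  # (sibling hash, True if the sibling is on the left)


def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root_and_proof(leaves: List[bytes], index: int) -> Tuple[bytes, Proof]:
    """Full-node side: build the tree bottom-up and collect the path for leaves[index]."""
    level = [h(x) for x in leaves]
    proof: Proof = []
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # assumption: duplicate the last node on odd levels
        sib = index ^ 1  # the sibling shares the same parent
        proof.append((level[sib], sib < index))
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        index //= 2
    return level[0], proof


def verify(record: bytes, proof: Proof, root: bytes) -> bool:
    """Light-node side: recompute the root from the record and its sibling path."""
    cur = h(record)
    for sibling, is_left in proof:
        cur = h(sibling + cur) if is_left else h(cur + sibling)
    return cur == root


records = [b"a", b"b", b"c", b"d", b"e"]
root, proof = merkle_root_and_proof(records, 2)
ok = verify(b"c", proof, root)
```

In this sketch the root plays the role of the Merkle root stored in the block header, and the proof plays the role of the verification fields returned with the query result.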

In the data query process, a user (light node) sends a query. Then, according to the keyword, the full nodes use the inter-block lookup table index to determine the blocks that contain the corresponding attribute values and perform the data query in the determined blocks through the $BRMI$ index. Next, the full nodes calculate the verification fields. Finally, the query results with verification fields are returned to the user.

The algorithm flowchart of the data query is shown in Figure 9. The specific steps of blockchain data query are as follows:

(1) The user initiates a query $q = k$ and sends the keyword to the full nodes.

(2) The full nodes determine the block set where the query result corresponds to the keyword $k$ by using the inter-block lookup table index. With the help of the existence index, the query result set $\{W_{on}\}$ corresponding to keyword $k$ is determined using the block-level recursive model index in the block set.

(3) The union of the query results $\{W_{on}\}$ is taken on blockchain; the duplicate elements can be removed to obtain $R_{on}$, and the keyword $k$ can be combined to obtain the final query result $R$.

(4) The full nodes calculate the verification fields V.

(5) The full nodes return the query result R with verification fields V to the user.
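The five steps can be tied together in a compact Python sketch. It is a simplification under stated assumptions: the lookup table is a plain dictionary, binary search over records sorted by keyword stands in for the $BRMI$ index, and the verification fields $V$ are reduced to (block number, leaf position) pairs from which a Merkle path would be derived. The name `run_query` and the sample data are hypothetical.

```python
import bisect
from typing import Dict, List, Tuple

Record = Tuple[str, ...]  # (k, w2, w3, ...)


def run_query(
    k: str,
    lookup_table: Dict[str, List[int]],
    blocks: Dict[int, List[Record]],
) -> Tuple[Tuple[str, List[Tuple[str, ...]]], List[Tuple[int, int]]]:
    w_on: List[Tuple[str, ...]] = []
    v: List[Tuple[int, int]] = []
    # Step (2): the lookup table gives the block set; search inside each block.
    for block_no in lookup_table.get(k, []):
        records = blocks[block_no]  # assumed sorted by keyword
        keys = [r[0] for r in records]
        i = bisect.bisect_left(keys, k)  # stand-in for the BRMI prediction
        while i < len(records) and records[i][0] == k:
            w_on.append(records[i][1:])
            v.append((block_no, i))  # step (4): simplified verification field
            i += 1
    # Step (3): remove duplicates (keeping order) and attach the keyword.
    r_on = list(dict.fromkeys(w_on))
    # Step (5): return R together with V.
    return (k, r_on), v


blocks = {
    1: [("Bob", "00002", "Alice", "information_2"), ("Carl", "00001", "Edwin", "information_1")],
    2: [("Carl", "00001", "Bob", "information_3")],
}
lookup_table = {"Bob": [1], "Carl": [1, 2]}
R, V = run_query("Carl", lookup_table, blocks)
```

A keyword that is absent from the lookup table returns an empty result without touching any block, which is the role of the existence index.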

Algorithm analysis: Assume that the blockchain data to be queried consist of n blocks and that each block holds m data items on average, so the total amount of data is $n \cdot m$. The traditional blockchain query method traverses every data item in all n blocks, so its query time complexity is $O(n \cdot m)$. The proposed method queries through the inter-block lookup table and the block-level recursive model index. The first step uses the inter-block lookup table index to determine the blocks where the query result is located. Because the list in the inter-block lookup table is ordered, the time complexity is $O(K)$ in the worst case, where K is the number of keywords. The second step uses the block-level recursive model index, whose worst-case query performance is that of a B-tree: when the data distribution cannot be learned, all models are automatically replaced by B-trees to form a complete B-tree, and the search time complexity of an s-fork B-tree is $O(\log_{s} m)$. The worst-case query time complexity of the proposed method is therefore $O(K \cdot \log_{s} m)$. The double-layer index thus significantly improves query efficiency.
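As a rough illustration of the gap between the two bounds, the following worked example uses the Ethereum ablation configuration (n = 300 blocks, m = 1000 data items per block). The values K = 100 and s = 10 are assumptions chosen only to make the arithmetic concrete; they are not reported in the experiments.

```latex
\begin{align*}
n \cdot m &= 300 \cdot 1000 = 300{,}000 \\
\log_{s} m &= \log_{10} 1000 = 3 \\
K \cdot \log_{s} m &= 100 \cdot 3 = 300
\end{align*}
```

Under these assumed values the worst-case term of the proposed method is smaller by a factor of 1000. More generally, the advantage holds whenever $K \cdot \log_{s} m < n \cdot m$.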

6. Experimental Analysis

6.1. Experimental Setup

6.1.1. Datasets

In our experiment, we used the following two datasets:

Wiki-CS dataset (https://github.com/pmernyei/wiki-cs-dataset/raw/master/dataset, accessed on 1 August 2022). This dataset consists of 11,701 data items with 300 attribute columns. We selected the first column as the keyword attribute. In addition, to facilitate building the bitmap index for the comparative experiment, the values of the first column were summarized into 35 categories.
Ethereum dataset (https://www.ethereum.org/, accessed on 1 August 2022). The first 300,000 pieces of data in the Ethereum dataset were selected as the experimental data. We recombined the data to build blocks according to the block sizes set in each experiment, and selected the attribute $from$ (the account that initiates the transaction) as the keyword. The attribute $from$ is a username encoded as a 40-digit hexadecimal string. To facilitate the establishment of the bitmap index as a comparative experiment, the attribute value was 500,000.

6.1.2. Experimental Settings

The hardware environment of the experiment was an Intel(R) Xeon(R) Silver 4110 CPU @ 2.10 GHz and an NVIDIA Corporation TU102 [GeForce RTX 2080 Ti] GPU. The blockchain system we used was the BlockChainDemo system (https://github.com/zestaken/BlockChainDemo, accessed on 10 August 2022). The network bandwidth of the physical machine was 100 Mbps. The research content of this paper does not involve network delay. Moreover, all experiments were carried out in the same network environment, so the delay can be regarded as constant and its influence can be ignored. In addition, this paper does not involve the consensus mechanism, so we used the system’s default consensus mechanism.

6.2. Index Evaluation

We evaluated the performance of the lookup table and the learned index separately, and then used an ablation experiment to evaluate their combination. In recent research, the main intra-block indexes were the Bloom filter [18] and B-tree indexes [6,11,13,16]. Specifically, ref. [18] adopted the Bloom filter index, which is an existence index. However, it can only avoid traversing part of the indexed data by determining whether they exist, and it is not applicable to large amounts of data. Refs. [6,11,13,16] adopted the B-tree or B-data index with a verification function, which is currently the better approach for intra-block data query indexes. Therefore, we chose the B-tree as the comparison baseline for the intra-block index. In recent research, the inter-block indexes were mainly skip list indexes [11] and bitmap indexes [13]. Specifically, skip list indexes perform poorly on sparse data, whereas bitmap indexes perform relatively well. Therefore, we chose the bitmap as the comparison baseline for the inter-block index.

6.2.1. Lookup Table Index Evaluation

The lookup table experiment included three evaluations: construction time, query time and storage space. In the Wiki-CS dataset, each block stores 100 pieces of data, and the number of blocks increases from 10 to 100. In the Ethereum dataset, each block stores 1000 pieces of data, and the number of blocks increases from 75 to 300.

As Figures 10, 11 and 12 show, the curves of the three indicators (construction time, query time and storage space) are basically linear. The linearity is more obvious in the larger Ethereum dataset, which shows that the index is scalable. The lookup table index is superior to the bitmap index in construction time, query time and storage space. The bitmap index performs poorly because it must reserve space in advance every time it is built and is traversed whenever the index is updated. The lookup table index does not need to reserve space for subsequent content. Moreover, the list of keywords stored in the lookup table is ordered, so it does not have to be traversed when used. Figure 12 shows that more space must be reserved when building the bitmap in the Ethereum dataset, which explains the difference between the two datasets in the storage space experiment. This shows that the lookup table index performs better with sparse data.
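The difference between the two inter-block indexes can be illustrated with a small sketch. It is not the layout used in the experiments: the class `SortedLookupTable`, the helper `bitmap_cells` and the assumption that a bitmap reserves one cell per keyword and per reserved block are all hypothetical simplifications.

```python
from bisect import bisect_left


class SortedLookupTable:
    """Ordered keyword list with one growing block-id list per keyword."""

    def __init__(self):
        self.keywords = []     # kept sorted, so lookups use binary search
        self.block_lists = []  # block_lists[i] belongs to keywords[i]

    def add(self, keyword, block_id):
        i = bisect_left(self.keywords, keyword)
        if i == len(self.keywords) or self.keywords[i] != keyword:
            self.keywords.insert(i, keyword)
            self.block_lists.insert(i, [])
        if not self.block_lists[i] or self.block_lists[i][-1] != block_id:
            self.block_lists[i].append(block_id)

    def blocks_for(self, keyword):
        i = bisect_left(self.keywords, keyword)
        if i < len(self.keywords) and self.keywords[i] == keyword:
            return self.block_lists[i]
        return []

    def stored_entries(self):
        return sum(len(ids) for ids in self.block_lists)


def bitmap_cells(num_keywords, reserved_blocks):
    """Cells a bitmap would hold if space is reserved for every block up front."""
    return num_keywords * reserved_blocks


table = SortedLookupTable()
table.add("b", 0)
table.add("a", 1)
table.add("b", 2)
```

With sparse data most bitmap cells are zero, whereas the lookup table stores only the keyword and block pairs that actually occur, which is consistent with the storage gap reported for the Ethereum dataset.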

In addition, we carried out experiments on the optimized lookup table. Because the optimized method and the lookup table share the same construction and query processes, the construction and query times of the optimized lookup table are basically the same as those of the lookup table. The storage space results are shown in Table 1. The lookup table index stores the index completely in each block, occupying little space when the system stores data. As the number of blocks increases, the index grows continuously, and the storage space it occupies in each block gradually increases. Moreover, storing the whole index in each block leads to redundancy. The optimized method, in contrast, stores only the part changed by each update, so it occupies less storage space and there is no redundancy among the indexes stored in different blocks.

6.2.2. Learned Index Evaluation

The learned index experiment also includes three evaluations: construction time, query time and storage space. In the Wiki-CS dataset, the number of blocks is 100, the amount of data in each block ranges from 100 to 1000, and the recursive index model has two layers: the first layer has 1 sub-model and the second layer has 3 sub-models. In the Ethereum dataset, the number of blocks is 100, the amount of data in each block ranges from 1000 to 3000, and the recursive index model has two layers: the first layer has 1 sub-model and the second layer has 20 sub-models. The maximum permissible errors in the first and second sub-models are 1 and 4.
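A two-layer recursive model index of the kind configured above can be sketched as follows. This is a simplified illustration under several assumptions: keys are distinct numbers (keywords would first need a numeric encoding), every sub-model is a least-squares line, and each second-layer model records its own maximum error instead of being replaced by a B-tree when a permissible error is exceeded. The names `fit_line` and `TwoLayerRMI` are hypothetical.

```python
from bisect import bisect_left


def fit_line(xs, ys):
    """Least-squares line y = a*x + b; constant model when all xs are equal."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var = sum((x - mean_x) ** 2 for x in xs)
    if var == 0:
        return 0.0, mean_y
    a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / var
    return a, mean_y - a * mean_x


class TwoLayerRMI:
    def __init__(self, keys, num_leaf_models):
        self.keys = keys  # sorted, distinct numeric keys of one block
        self.n = len(keys)
        self.m = num_leaf_models
        self.root = fit_line(keys, list(range(self.n)))  # layer 1: one model
        buckets = [[] for _ in range(self.m)]
        for pos, key in enumerate(keys):
            buckets[self._route(key)].append(pos)
        self.leaves = []  # layer 2: (slope, intercept, max error) per model
        for bucket in buckets:
            if not bucket:
                self.leaves.append((0.0, 0.0, 0))
                continue
            a, b = fit_line([keys[p] for p in bucket], bucket)
            err = max(abs(round(a * keys[p] + b) - p) for p in bucket)
            self.leaves.append((a, b, err))

    def _route(self, key):
        a, b = self.root
        leaf = int((a * key + b) * self.m / self.n)
        return min(max(leaf, 0), self.m - 1)

    def lookup(self, key):
        """Return the position of key, or -1 if the key is absent."""
        a, b, err = self.leaves[self._route(key)]
        guess = round(a * key + b)
        lo = min(max(guess - err, 0), self.n)
        hi = max(min(guess + err + 1, self.n), lo)
        i = bisect_left(self.keys, key, lo, hi)  # search only the error window
        return i if i < self.n and self.keys[i] == key else -1


keys = [i * i for i in range(200)]
rmi = TwoLayerRMI(keys, num_leaf_models=3)
```

Only the model parameters and error bounds are stored per block, and a lookup searches a window of at most `2 * err + 1` positions around the predicted position.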

As shown in Figure 13, the construction time of the learned index is higher than that of the B-tree because the learned index needs to be trained on the data. However, the learned index is built per block and remains unchanged after construction, so full nodes can keep using it. The index is stored off-chain, so its construction does not affect the performance of the blockchain system. In addition, with the development of machine learning technology and equipment suitable for deep learning, the training time of the learned index is expected to decrease gradually. As shown in Figure 14, the query time of the index is short, which reflects the advantage of the learned index. As shown in Figure 15, the storage overhead of the BRMI index is much smaller than that of the B-tree index. This is because the B-tree index stores the whole tree structure, which incurs a large overhead, whereas the BRMI index only needs to store the model parameters, so its storage overhead is minimal.

6.2.3. Ablation Experiment

We combine the intra-block and inter-block indexes in the ablation experiment to show the improvement that each component brings to query efficiency. The inter-block index uses the bitmap index (denoted by “BP”) and the lookup table index (denoted by “LT”). The intra-block index uses the B-tree index and the BRMI index. In the Wiki-CS dataset, the number of blocks is 100 and there are 100 pieces of data in each block. In the Ethereum dataset, the number of blocks is 300 and there are 1000 pieces of data in each block.

In Figure 16, the combination of the lookup table and BRMI performs well in both datasets. Compared with the combination of bitmap and B-tree, the combination of the lookup table and BRMI improves query performance by about 15 times and 100 times in the Wiki-CS and Ethereum datasets, respectively. Figure 16 also shows that the lookup table index and the BRMI index each improve the query rate. In addition, when the intra-block and inter-block indexes are combined, the curve trend is the same as that of each individual index, indicating that the lookup table and BRMI do not conflict when combined.

6.3. Architecture Cost

To evaluate the architecture’s impact on the blockchain system, we conducted experiments on the blockchain system without and with the deployed architecture. The number of blocks and the amount of data in each block are consistent with the ablation experiment. The experimental results are shown in Figure 17. The architecture has little influence on the performance of the system, for two reasons. First, the learned index is maintained off-chain, so it does not affect the performance of the blockchain system. Second, the lookup table index on the blockchain is a lightweight index that is very small compared with the data contained in blocks, so the time needed to store it is short. In addition, constructing and updating lookup tables takes only a short time, so the architecture has only a slight effect on the system’s performance.

6.4. Verification Cost

We evaluated the verification cost in two ways: the verification field size and the verification time of the query results. The performance is shown in Table 2. The table shows that the verification field size and time are mainly determined by the number of layers of the Merkle tree. The verification field is the certificate with which the user can verify the query result if it is questioned after being received. The user chooses whether to store the verification field. The verification process is likewise executed by the user after receiving the query results, so it does not affect the overall operating performance of the blockchain system.
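The dependence of the verification field size on the number of Merkle layers can be made concrete with a small calculation. The model below is an inference from the Space (B) column of Table 2, not a formula stated in the text: it assumes one 32-byte hash (for example, a SHA-256 digest) per Merkle layer and a tree with ⌈log2(items)⌉ + 1 layers. The function names are hypothetical.

```python
HASH_BYTES = 32  # assumption: one 32-byte digest per Merkle layer


def merkle_layers(num_items):
    """Layers of a binary Merkle tree over num_items leaves (leaf layer included)."""
    return (num_items - 1).bit_length() + 1


def verification_field_bytes(num_items):
    return merkle_layers(num_items) * HASH_BYTES


sizes = {n: verification_field_bytes(n) for n in (100, 1000, 3000)}
```

Under these assumptions the model reproduces the Space (B) values listed in Table 2, for example 256 B for 100 data items and 416 B for 3000 data items, so the field grows only logarithmically with the block size.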

7. Summary

To support efficient semantic keyword queries on blockchain, we propose a learned-index-based semantic keyword query architecture on blockchain. In this architecture, we record semantic information during data storage to support semantic keyword queries. To improve the query efficiency while keeping the storage overhead of the index low, we designed a double-layer index structure: an inter-block lookup table index and an intra-block recursive model index. To guarantee the completeness of the query results, we store the lookup table in the block header extension on blockchain. To ensure the correctness of the query results, we proposed a verifiable query algorithm for the above architecture. The theoretical analysis and the experimental results show that our framework can effectively improve the efficiency of keyword queries while ensuring the completeness and correctness of query results. Compared with the baseline combination of bitmap and B-tree, our architecture improved query performance by about 15 and 100 times in the Wiki-CS and Ethereum datasets, respectively. In the future, we will continue to improve the performance of our system and plan to conduct more research on data queries on blockchain.

Author Contributions

Conceptualization, Z.Y., J.X., K.H., Z.W. and W.Z.; methodology, Z.Y. and J.X.; software, Z.Y.; validation, Z.Y., J.X. and K.H.; formal analysis, Z.Y., J.X. and K.H.; investigation, Z.Y., J.X., K.H., Z.W. and W.Z.; resources, Z.Y., J.X., K.H., Z.W. and W.Z.; data curation, Z.Y. and K.H.; writing—original draft preparation, Z.Y., J.X. and K.H.; writing—review and editing, Z.Y., J.X. and K.H.; visualization, Z.Y. and K.H.; supervision, Z.Y., J.X., Z.W. and W.Z.; project administration, Z.Y., J.X. and Z.W.; funding acquisition, Z.Y., J.X., K.H., Z.W. and W.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research is supported by the National Key R&D Program of China (No. 2022YFB4500800), the National Natural Science Foundation of China (No. 62072089), the Fundamental Research Funds for the Central Universities of China (Nos. N2116016, N2104001, N2019007), and the Open Program of Neusoft Corporation, China (No. NCBETOP2102).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. El-Hindi, M.; Binnig, C.; Arasu, A.; Kossmann, D.; Ramamurthy, R. BlockchainDB—A Shared Database on Blockchains. Proc. VLDB Endow. 2019, 12, 1597–1609.
2. Li, Y.; Zheng, K.; Yan, Y.; Liu, Q.; Zhou, X. EtherQL: A Query Layer for Blockchain System. In Proceedings of the Database Systems for Advanced Applications (DASFAA 2017), PT II, Suzhou, China, 27–30 March 2017; Candan, S., Chen, L., Pedersen, T., Chang, L., Hua, W., Eds.; Soochow University: Suzhou, China, 2017; Volume 10178, pp. 556–567.
3. McConaghy, T.; Marques, R.; Müller, A.; De Jonghe, D.; McConaghy, T.; McMullen, G.; Henderson, R.; Bellemare, S.; Granzotto, A. Bigchaindb: A Scalable Blockchain Database; white paper; BigChainDB: Berlin, Germany, 2016.
4. Riegger, C.; Vincon, T.; Petrov, I. Efficient Data and Indexing Structure for Blockchains in Enterprise Systems. In Proceedings of the 20th International Conference on Information Integration and Web-Based Applications & Services, Assoc Comp Machinery, Hanoi, Vietnam, 4–6 December 2014; pp. 173–182.
5. Xu, Y.; Zhao, S.; Kong, L.; Zheng, Y.; Zhang, S.; Li, Q. ECBC: A High Performance Educational Certificate Blockchain with Efficient Query. In Theoretical Aspects of Computing–ICTAC 2017: 14th International Colloquium, Hanoi, Vietnam, 23–27 October 2017; Natl Fdn Sci & Technol Dev Vietnam; HUMAX VINA Co.: Hanoi, Vietnam, 2017; Volume 10580, pp. 288–304.
6. Zhou, E.; Hong, Z.; Xiao, Y.; Zhao, D.; Pei, Q.; Guo, S.; Akerkar, R. MSTDB: A Hybrid Storage-Empowered Scalable Semantic Blockchain Database. IEEE Trans. Knowl. Data Eng. 2022, 1–17.
7. Nakamoto, S. Bitcoin: A peer-to-peer electronic cash system. Decent. Bus. Rev. 2008, 21260.
8. Dinh, T.T.A.; Liu, R.; Zhang, M.; Chen, G.; Ooi, B.C.; Wang, J. Untangling Blockchain: A Data Processing View of Blockchain Systems. IEEE Trans. Knowl. Data Eng. 2018, 30, 1366–1385.
9. Zhang, C.; Xu, C.; Xu, J.; Tang, Y.; Choi, B. GEM(2)-Tree: A Gas-Efficient Structure for Authenticated Range Queries in Blockchain. In Proceedings of the 2019 IEEE 35th International Conference on Data Engineering (ICDE 2019), Macao, China, 8–11 April 2019; pp. 842–853.
10. Zhang, C.; Xu, C.; Wang, H.; Xu, J.; Choi, B. Authenticated Keyword Search in Scalable Hybrid-Storage Blockchains. In Proceedings of the 2021 IEEE 37th International Conference on Data Engineering (ICDE 2021), Chania, Greece, 19–22 April 2021; pp. 996–1007.
11. Xu, C.; Zhang, C.; Xu, J. vChain: Enabling Verifiable Boolean Range Queries over Blockchain Databases. In Proceedings of the Sigmod’19: 2019 International Conference on Management of Data; Assoc Comp Machinery; ACM SIGMOD: New York, NY, USA, 2019; pp. 141–158.
12. Wang, H.; Xu, C.; Zhang, C.; Xu, J. vChain: A Blockchain System Ensuring Query Integrity. In Proceedings of the Sigmod’20: 2020 ACM SIGMOD International Conference on Management of Data; Assoc Comp Machinery; ACM SIGMOD: New York, NY, USA, 2020; pp. 2693–2696.
13. Zhu, Y.; Zhang, Z.; Jin, C.; Zhou, A.; Yan, Y. SEBDB: Semantics Empowered BlockChain DataBase. In Proceedings of the 2019 IEEE 35th International Conference on Data Engineering (ICDE 2019), Macao, China, 8–11 April 2019; pp. 1820–1831.
14. Ruan, P.; Dinh, T.T.A.; Lin, Q.; Zhang, M.; Chen, G.; Ooi, B.C. LineageChain: A fine-grained, secure and efficient data provenance system for blockchains. VLDB J. 2021, 30, 3–24.
15. Jia, D.; Xin, J.; Wang, Z.; Guo, W.; Wang, G. ElasticChain: Support Very Large Blockchain by Reducing Data Redundancy. In Proceedings of the Web and Big Data (APWEB-WAIM 2018), PT II, Macau, China, 23–25 July 2018; Cai, Y., Ishikawa, Y., Xu, J., Eds.; Volume 10988, pp. 440–454.
16. Jia, D.; Xin, J.; Wang, Z.; Guo, W.; Wang, G. Efficient Query Model for Storage Capacity Scalable Blockchain System. J. Softw. 2019, 30, 2655–2670.
17. Lv, Y.; Liu, W.; Zhong, J.; Zhang, C.; Wang, K.; Wang, Z. An optimization model of electronic medical record query processing on blockchain. In Proceedings of the 2021 2nd International Conference on Artificial Intelligence and Information Systems (ICAIIS’21), Chongqing, China, 28–30 May 2021.
18. Ren, Y.; Zhu, F.; Sharma, P.K.; Wang, T.; Wang, J.; Alfarraj, O.; Tolba, A. Data Query Mechanism Based on Hash Computing Power of Blockchain in Internet of Things. Sensors 2020, 20, 207.
19. Linoy, S.; Mandikhani, H.; Ray, S.; Lu, R.; Stakhanova, N.; Ghorbani, A. Scalable Privacy-Preserving Query Processing Over Ethereum Blockchain. In Proceedings of the 2019 IEEE International Conference on Blockchain (BLOCKCHAIN 2019), Atlanta, GA, USA, 14–17 July 2019; pp. 398–404.
20. Li, T.; Huang, R.; Chen, L.; Jensen, C.S.; Pedersen, T.B. Compression of Uncertain Trajectories in Road Networks. Proc. VLDB Endow. 2020, 13, 1050–1063.
21. Li, T.; Chen, L.; Jensen, C.S.; Pedersen, T.B. TRACE: Real-time Compression of Streaming Trajectories in Road Networks. Proc. VLDB Endow. 2021, 14, 1175–1187.
22. Li, T.; Chen, L.; Jensen, C.S.; Pedersen, T.B.; Gao, Y.; Hu, J. Evolutionary Clustering of Moving Objects. In Proceedings of the 2022 IEEE 38th International Conference on Data Engineering (ICDE 2022), Kuala Lumpur, Malaysia, 9–12 May 2022; pp. 2399–2411.
23. Fernando, D.; Kulshrestha, S.; Herath, J.D.; Mahadik, N.; Ma, Y.; Bai, C.; Yang, P.; Yan, G.; Lu, S. SciBlock: A Blockchain-Based Tamper-Proof Non-Repudiable Storage for Scientific Workflow Provenance. In Proceedings of the 2019 IEEE 5th International Conference on Collaboration and Internet Computing (CIC 2019), Los Angeles, CA, USA, 12–14 December 2019; pp. 81–90.
24. Song, Y.; Gu, Y.; Li, F.; Yu, G. Survey on AI Powered New Techniques for Query Processing and Optimization. J. Front. Comput. Sci. Technol. 2020, 14, 1081–1103.
25. Song, Y.; Gu, Y.; Li, T.; Qi, J.; Liu, Z.; Jensen, C.S.; Yu, G. CHGNN: A Semi-Supervised Contrastive Hypergraph Learning Network. arXiv 2023, arXiv:2303.06213.
26. Sakurai, Y.; Yoshikawa, M.; Uemura, S.; Kojima, H. The A-tree: An Index Structure for High-Dimensional Spaces Using Relative Approximation. In Proceedings of the 26th International Conference on Very Large Data Bases, Cairo, Egypt, 10–14 September 2000; Morgan Kaufmann: Burlington, MA, USA, 2000; pp. 516–526.
27. Kraska, T.; Beutel, A.; Chi, E.H.; Dean, J.; Polyzotis, N. The Case for Learned Index Structures. In Proceedings of the SIGMOD’18: 2018 International Conference on Management of Data, Houston, TX, USA, 10–15 June 2018; Das, G., Jermaine, C., Bernstein, P., Eldawy, A., Eds.; pp. 489–504.
28. Kraska, T.; Alizadeh, M.; Beutel, A.; Chi, E.H.; Kristo, A.; Leclerc, G.; Madden, S.; Mao, H.; Nathan, V. SageDB: A Learned Database System. In Proceedings of the 9th Biennial Conference on Innovative Data Systems Research, CIDR, Asilomar, CA, USA, 13–16 January 2019.
29. Ding, J.; Minhas, U.F.; Yu, J.; Wang, C.; Do, J.; Li, Y.; Zhang, H.; Chandramouli, B.; Gehrke, J.; Kossmann, D.; et al. ALEX: An Updatable Adaptive Learned Index. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, Portland, OR, USA, 14–19 June 2020; pp. 969–984.

Figure 1. Blockchain structure.

Figure 2. An example of query scenario.

Figure 3. The architecture.

Figure 4. The lookup table index.

Figure 5. Lookup table construction and update algorithm flowchart.

Figure 6. The optimized lookup table index.

Figure 7. The BRMI index.

Figure 8. Block-level recursive model index building algorithm flowchart.

Figure 9. Blockchain data query algorithm flowchart.

Figure 10. The Construction Time of Inter-block Index.

Figure 11. The Query Time of Inter-block Index.

Figure 12. The Storage Space of Inter-block Index.

Figure 13. The Construction Time of BRMI Index.

Figure 14. The Query Time of BRMI-block Index.

Figure 15. The Storage Space of BRMI Index.

Figure 16. The Ablation Experiment.

Figure 17. The architecture deployment experiment.

Table 1. The performance of lookup table.

Dataset	Block nth	LT Space	Optimized LT Space
Wiki-CS	20	2052	104
	40	4052	98
	60	6096	95
	80	8152	103
	100	10,140	102
Ethereum	100	138,832	1222
	150	198,448	1248
	200	256,380	1125
	250	315,016	1216
	300	373,792	1118

Table 2. The verifiable query cost.

Dataset	Data Items	Merkle Layers	Space (B)	Times (ms)
Wiki-CS	100	8	256	9.9997
	200	9	288	11.7108
	300	10	320	12.8352
	400	10	320	12.1373
	500	10	320	11.6045
	600	11	352	12.7743
	700	11	352	11.9689
	800	11	352	12.9789
	900	11	352	12.7546
	1000	11	352	12.863
Ethereum	1000	11	352	12.2373
	1200	12	384	12.6396
	1400	12	384	12.3958
	1600	12	384	12.6824
	1800	12	384	12.3308
	2000	13	384	12.7679
	2200	13	416	13.1738
	2400	13	416	13.9893
	2600	13	416	13.7485
	2800	13	416	14.0407
	3000	13	416	13.6439


© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
