Reinforcement Learning in Blockchain-Enabled IIoT Networks: A Survey of Recent Advances and Open Challenges

Abstract: Blockchain is emerging as a promising candidate for the uberization of Internet services. It is a decentralized, secure, and auditable solution for exchanging and authenticating information via transactions, without the need for a trusted third party. Therefore, blockchain technology has recently been integrated with industrial Internet-of-things (IIoT) networks to help realize the fourth industrial revolution, Industry 4.0. Though blockchain-enabled IIoT networks may have the potential to support the services and demands of next-generation networks, the gap analysis presented in this work highlights some of the areas that need improvement. Based on these observations, the article then promotes the utility of reinforcement learning (RL) techniques to address some of the major issues of blockchain-enabled IIoT networks such as block time minimization and transaction throughput enhancement. This is followed by a comprehensive case study where a Q-learning technique is used for minimizing the occurrence of forking events by reducing the transmission delays for a miner. Extensive simulations have been performed and results have been obtained for the average transmission delay, which relates to the forking events. The obtained results demonstrate that the Q-learning approach outperforms the greedy policy while having a reasonable level of complexity. To further develop blockchain-enabled IIoT networks, some future research directions are also documented. While this article highlights the applications of RL techniques in blockchain-enabled IIoT networks, the provided insights and results could pave the way for rapid adoption of blockchain technology.


Introduction
Blockchain is steadily gaining traction due to its applications in banking, supply chain management, and cybersecurity [1][2][3]. Besides simplifying business processes and reducing errors in verification, its integration with the Internet-of-things (IoT) has recently been discussed extensively in the research community. As one of the key enablers of Industry 4.0 and an emerging offshoot of IoT, Industrial IoT (IIoT) networks are making their way into various commercial and social sectors such as retailing, manufacturing, logistics, pervasive monitoring, security surveillance, healthcare, and home automation [4][5][6][7]. Moreover, with the recent developments in wireless communications and sensor network technologies, an increasing number of devices are being introduced in the IIoT space, where raw data are locally captured and processed to support decision-based processes. These devices can communicate and interact with each other as well as share and process information independent of human intervention [8]. Therefore, they must be made secure to preserve data integrity as well as ensure resource availability and computing reliability.
Blockchain technology introduces a new design paradigm for next-generation transaction-based applications that, together with a distributed and shared public/private ledger and a collective consensus mechanism, builds trust, transparency, and accountability in a system. A summary of these processes, illustrating a transaction from A to B, is given in Figure 1. Due to its transparent nature, blockchain technology has been highlighted by academia and industry alike as a potential solution for efficiently managing massive IIoT networks [5,6,9]. Moreover, decentralized IIoT devices and trustless network architectures are expected to play a key role in the advancement of IIoT networks, where data will be processed locally at the site of generation and not in a centralized manner. Blockchain will also facilitate device connectivity and make data storage trustless through devices and sensors that can operate without relying on a central authority. A blockchain can also offer IIoT devices a secure infrastructure that is robust against a single point of failure [10,11]. Because decentralized networks have multiple entry points, they contribute to the resilience and fault tolerance of the network. In addition to this, IIoT infrastructures can become more accessible by using distributed ledger technologies. A comparison of blockchain with different ledger technologies is given in Table 1. Although there is a need to perform a detailed cost analysis of private blockchain networks, certain public blockchain platforms (e.g., IOTA) have zero transaction fees. Thus, the associated costs for operating centralized IIoT systems can also be reduced significantly by decentralizing the IIoT networks using blockchain.
Since blockchains do not rely on intermediaries for their operation, they can automate services through code and by distributing control in the network via consensus algorithms. In such networks, trust between different IIoT devices is established through distributed consensus protocols, thereby removing the need for a trusted centralized service provider. This insight is also very important from the perspective of self-sustainability in IIoT networks [12]. The notion of "distributed autonomous corporations" can be implemented on decentralized IIoT systems that can operate independently according to a pre-defined logic. Moreover, smart contracts (a set of encoded logic that can be used to create agreements when a certain set of conditions are met) can also be used for verifying a function or data and automating a procedure. This capability can be useful for many IIoT applications. For instance, a user/device can authorize a certain payment when a set of conditions indicates that the delivery of a product/service has been completed. This way, interactions between user-user, user-device, or device-device can be handled transparently.
Furthermore, the systematic and, in particular, automatic processing of information or data with help from computers is the core task of computer science. For such processing, computer programs are developed, which are traditionally engineered by people. Each program can have one or a set of problems to address. The nature of these problems can range from simple to extremely complex and time-consuming or even computationally infeasible to solve. This can directly affect the processing of the input information and the creation of a program. Thus, addressing these problems and meeting optimal program design requirements remain key concerns in computer science. One of the most debated solutions to address these concerns is so-called artificial intelligence (AI), which has the potential to define better program design principles [13]. The basic goal of AI is to solve problems that are not easy for humans, for which a certain form of intelligence seems necessary. For example, the human decision-making process can be supported by expert systems. Rule-based systems are very common in this environment. The manual modeling of the facts and rules required for these systems as a knowledge base is very complex, expensive, and often also prone to errors. This difficulty is widely known as the knowledge acquisition bottleneck. Can the knowledge not be derived automatically from experiences that are available in the form of data? The initial development of machine learning as a part of AI can be interpreted as the answer to this problem. In the meantime, machine learning can even be seen as a key technology for solving AI tasks and is also used in numerous application areas that are not directly assigned to AI.

Characteristics and Overview of Reinforcement Learning Techniques
Reinforcement learning (RL) is a sub-branch of AI or, more generally, of computer science that deals with having computers solve problems without explicitly programming them. It is a sequential learning process where agents automatically adjust their policies by observing the rewards that result from their actions. More specifically, RL techniques can be based on the Markov decision process (MDP), which consists of an environment and a set of agents [14,15]. Once the agent takes some action, the environment transitions into a new state and the agent receives a reward in response. Due to these state transitions, the dynamics of the environment can be modeled as a sequential decision-making process. Interaction with the surrounding world can be considered a foundation of human learning. Machine learning through reinforcement follows this basic learning paradigm. At its center, an agent tries to achieve a long-term goal by practicing meaningful actions. Feedback on these actions is then used to gradually improve the agent's output. This approach received special attention in 2015 and 2016 when the best human players of the board game Go went against AlphaGo. Compared to chess, Go offers significantly more combinations and it is much more difficult to assess which player is currently in the lead. Moreover, this learning approach has also recently seen many novel applications in a wide range of domains.
In contrast to supervised and semi-supervised approaches, RL techniques generally do not need labeled data or prior information regarding the environment. This characteristic of RL makes it suitable for blockchain technology. RL techniques can be broadly categorized into two types, i.e., model-based RL and model-free RL [16]. A brief description of these two types is given below.

Model-Based RL
The model-based approaches assume that the agent has access to a model of the environment. The model of an environment can be a function that predicts the transition probabilities and the state function. Under these circumstances, the agent can have a better understanding of the environment as it can plan and think ahead about several possible choices. This approach generally improves the efficiency of the agent and helps in planning the policy. A good example of this approach is AlphaZero [17].
In recent years, a large number of methods have been proposed for model-based RL. One of the emerging techniques is model-predictive control for selecting the actions of the agent [18]. In this way, the agent formulates an optimal plan (i.e., actions taken over a long period of time) after observing the state of the environment. The learning agent prepares a new plan after each new interaction with the environment. Another popular approach for model-based RL is data augmentation [19]. It uses a learning algorithm to train the agent and either uses fictitious experiences or augments the real experience with a fictitious one. Another approach, called embedded planning, makes use of a subroutine which acts as side information. This provides the agent with the capability to choose a particular plan and ignore the other plans that do not provide an optimal policy.

Model-Free RL
Although model-based RL has many advantages, it is difficult to train. Furthermore, it is often difficult to find the ground-truth model for training the agent. The bias in the model could also be exploited by agents, thereby causing them to perform sub-optimally in a practical setting. In these conditions, the best approach is to adopt a model-free RL approach to learn different aspects of the environment. This generally leads to low-bias training and is considered more feasible when multiple factors are affecting the reward. A brief discussion of these models is provided as follows:
• Q-learning: Q-learning is the most studied and well-known RL technique and has been used for solving different sorts of problems. In Q-learning, an agent takes action based on the Q-values in the Q-table [20]. There are four major components of a Q-learning model, i.e., a state set, an action set, a reward, and transition probabilities. For each state, the agent executes some action under its pre-defined policy. Subsequently, the agent adapts the policy such that it selects the action with the maximum Q-value. After each sequence, the agent revises the Q-table for a more accurate estimation of the Q-values and updates its policy. Thus, in due course of time and after many steps, the policy converges to the optimal policy of the agent.

• Multi-armed Bandit Learning: The agent in the multi-armed bandit approach selects the action without having the state information of the environment. Based on the action performed at each time step, the agent receives a reward, which it seeks to maximize over the next iterations [14]. Since the agent is unaware of the environment and the associated reward, there always exists a tradeoff between exploitation and exploration. For this reason, multi-armed bandit learning, although lower in complexity, is not efficient for highly dynamic environments.

• Actor-critic Learning: Actor-critic learning divides the agent into two roles, i.e., actor and critic [20]. The action selection policies over the action space are represented by the actor. In contrast, the critic is the observer which estimates the expected reward received in the future by passing through the same state. Based on the observed reward, the state values are updated by the critic. Here, the critic can be considered as a trainer which trains the actor to improve its stability and select the optimal action in each state. Because actions are drawn from the probability density function given by the actor, the size of the action space does not increase the complexity of the action selection. Therefore, it is well suited for continuous and large action spaces.

• Miscellaneous: Other techniques include policy optimization, soft actor-critic, and deep policy optimization techniques. These techniques are mostly variants of the above-mentioned RL techniques that use entropy regularization and stochastic policies to stabilize RL. Sometimes the conventional RL techniques may not be feasible for large-scale networks. In other words, the dimensionality of state, action, and reward could render the problem difficult to solve. Thus, for high dimensional optimization, deep neural networks have been proposed to work along with conventional RL techniques to improve the network. These deep neural networks can be used to estimate the Q-values during each iteration, an approach also known as the deep Q-network [20]. The agent can select the maximum estimated value and store the experience again to train the neural network for estimation. Nevertheless, it is much more complex and computationally exhaustive than conventional RL techniques.
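
As a minimal sketch of the tabular Q-learning procedure described in the list above, the following Python example trains an agent with an epsilon-greedy policy and the standard temporal-difference update of the Q-table. The five-state corridor environment (`chain_step`), the hyperparameter values, and the random tie-breaking rule are assumptions chosen purely for illustration; they are not taken from the case study presented later in this article.

```python
import random


def q_learning(n_states, n_actions, step, episodes=500, alpha=0.1,
               gamma=0.9, epsilon=0.1, max_steps=100, seed=0):
    """Tabular Q-learning with an epsilon-greedy behavior policy."""
    rng = random.Random(seed)
    q = [[0.0] * n_actions for _ in range(n_states)]  # the Q-table
    for _ in range(episodes):
        state = 0  # every episode starts in state 0 (assumption)
        for _ in range(max_steps):
            if rng.random() < epsilon:
                # Exploration: pick a random action.
                action = rng.randrange(n_actions)
            else:
                # Exploitation: pick an action with maximum Q-value,
                # breaking ties at random.
                best = max(q[state])
                action = rng.choice(
                    [a for a in range(n_actions) if q[state][a] == best])
            next_state, reward, done = step(state, action)
            # Temporal-difference target: reward plus discounted best
            # Q-value of the next state (zero if the episode has ended).
            target = reward + (0.0 if done else gamma * max(q[next_state]))
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
            if done:
                break
    return q


def chain_step(state, action):
    """Hypothetical 5-state corridor: action 1 moves right, action 0 moves
    left; reaching the right end (state 4) gives reward 1 and ends the episode."""
    next_state = min(state + 1, 4) if action == 1 else max(state - 1, 0)
    done = next_state == 4
    return next_state, (1.0 if done else 0.0), done


q_table = q_learning(n_states=5, n_actions=2, step=chain_step)
# Greedy policy read from the learned Q-table for the non-terminal states.
greedy_policy = [max(range(2), key=lambda a: q_table[s][a]) for s in range(4)]
print(greedy_policy)
```

In this toy setting the learned greedy policy moves right in every non-terminal state. A deep Q-network would replace the table `q` with a neural network estimator while keeping the same target computation.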

Motivation
There exist many issues and caveats in different avenues of blockchain-enabled IIoT networks that remain unresolved. These issues restrict the integration of blockchain with IIoT as they have different system limiting factors or trade-offs under different circumstances. This trade-off is similar in nature to the CAP (Consistency, Availability, tolerance to network Partitions) theorem of distributed systems, which states: "A robust and distributed system can only simultaneously provide two out of its three properties" [21]. Similarly, the integration of blockchain can also be considered as a trade-off challenge. Therefore, this article first presents a review of current literature to highlight the challenges associated with blockchain-IIoT networks and then discusses how they can be addressed.

Related Surveys and Our Contributions
To address the concerns discussed above, this article formulates a gap analysis of recent literature, which is used to reflect and highlight some of the overlooked aspects of blockchain-enabled IIoT networks. Unlike RL surveys published in other articles, we present our gap analysis through an abstract representation of a blockchain-IIoT network. We believe that such a representation has not yet been discussed in the literature. Subsequently, similar to other surveys, our article advocates the need for employing RL techniques to optimize the performance of these networks. In particular, it discusses some potential applications of RL in blockchain-enabled IIoT networks ranging from minimizing block time to improving transaction throughput. On the other hand, unlike the surveys of recent literature, our article also provides a case study where a Q-learning technique is used to minimize forking events. The simulation results of the RL technique demonstrate improvements when compared with a conventional greedy policy. Finally, some future research directions are provided for researchers working in academia and industry. Thus, the main contributions made in this paper, unlike other surveys, are the following:
1. Identifying the integration challenges of blockchain with IIoT and formulating a problem statement for a case study.
2. A thorough review of the current paradigms in blockchain and their associated consensus algorithms.
3. A review of reinforcement learning, its characteristics, and how they can help address the integration problems of blockchain and IIoT.
4. An abstract representation of a blockchain-IIoT network with a three-layer network architecture, i.e., Physical Layer, Network Layer, and Application Layer.
5. A concise review of how RL can be used to power different avenues in blockchain-enabled IIoT networks.
6. A case study to demonstrate the feasibility and improvements introduced by using RL in a blockchain-IIoT network.

Organization
The remainder of the paper is organized as follows. In Section 2, preliminaries of blockchain-based IIoT networks are provided. In Section 3, the gap analysis of the existing literature is outlined. This is followed by Section 4, which highlights the applications of RL in blockchain-based IIoT networks. Next, a case study on the minimization of forking is presented in Section 5. Finally, Section 6 provides conclusions and potential future research directions.

Blockchain-enabled IIoT Networks
Industry 4.0 represents the fourth industrial revolution that will facilitate IIoT with adaptive and autonomous systems that can self-heal and self-learn. IIoT aims to promote multi-disciplinary businesses and industries by realizing intelligent industrialization [22][23][24][25]. It enables smarter industrial processes by incorporating AI with big data technologies for exploiting the massively produced and communicated data. The recent surge in data volumes generated by IIoT environments and then sent to centralized servers presents some security concerns like a single point of failure, data integrity, and in particular, scalability. Decentralized IIoT network architectures are expected to address these issues and play a key role in the advancement of IIoT, where data will be locally processed at the site of generation and not in a centralized manner. Over the years, blockchain technology has moved from the phase of inception to rapid research and development, as shown in Figure 2. Thus, it is high time to explore the applications of this technology in IIoT networks to solve the security problems and provide decentralized IIoT solutions via transparent, immutable, and distributed system design principles. Moreover, decentralizing IIoT can provide the following benefits:

Improved Security
A blockchain can offer IIoT devices a security infrastructure that is robust against a single point of failure. A special feature of blockchain technology is the decentralized structure of the network. In a centralized network, transactions between network participants are always made with the help of a central instance (node). This intermediary can control all movements within the network. The individual members communicate with each other via this central point and not directly with each other.
A decentralized structure dispenses with such an instance so that direct communication with one another is possible. Such a network cannot be controlled from the outside. The users of the network can be distributed worldwide, but still have a uniform, synchronized database. The most common term for this is the "peer-to-peer network". Decentralized networks have multiple entry points that provide resilience and fault tolerance to networks. Moreover, public key infrastructure makes attacking a blockchain incredibly difficult because any data originating from anywhere other than the origin, i.e., genesis block, will be null and useless in the network.

Data Integrity
A blockchain uses cryptography for referencing its blocks, i.e., it adds a hash value of previous blocks in its succeeding blocks. Despite the low residual risk of hacking, blockchain technology ensures a high level of security, as the data is distributed locally, accessible to all users, and encrypted. It offers a high degree of reliability since the data is saved redundantly on all full nodes. By dispensing with intermediaries such as banks, it is possible to carry out the transactions directly with one another. This allows faster processing and, particularly in regions with a less developed legal system, enables contracts or transfers to be executed correctly and securely. This way, a block and its data can be verified completely by checking the reference hash in a block. This property of blockchain preserves data integrity and makes it extremely difficult to tamper with data.
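
The hash-referencing idea described above can be illustrated with a short Python sketch. It assumes a simplified block layout (an index, the hash of the previous block, and a list of transaction strings) and SHA-256 over a JSON serialization; the helper names `block_hash`, `append_block`, and `is_valid` are hypothetical and do not correspond to any particular blockchain platform.

```python
import hashlib
import json


def block_hash(block):
    """SHA-256 digest of a block's canonical JSON serialization."""
    payload = json.dumps(block, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def append_block(chain, transactions):
    """Append a block that stores the hash of its predecessor."""
    prev_hash = block_hash(chain[-1]) if chain else "0" * 64
    chain.append({"index": len(chain),
                  "prev_hash": prev_hash,
                  "transactions": transactions})


def is_valid(chain):
    """Check every stored reference hash against the recomputed hash."""
    return all(chain[i]["prev_hash"] == block_hash(chain[i - 1])
               for i in range(1, len(chain)))


chain = []
append_block(chain, ["genesis"])
append_block(chain, ["A pays B 5"])
append_block(chain, ["B pays C 2"])
valid_before = is_valid(chain)

# Tampering with an earlier block changes its hash, so the reference
# stored in the following block no longer matches.
chain[1]["transactions"] = ["A pays B 500"]
valid_after = is_valid(chain)
print(valid_before, valid_after)
```

In this sketch, an attacker who alters one block would also have to recompute the reference hash in every block that follows it, which is what makes tampering detectable.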

Cost Effectiveness
IIoT infrastructures can become more affordable when security vulnerabilities in their systems are removed by decentralizing the networks and storing their data in distributed ledgers. Mutual trust in the parties involved in a transaction is not necessary with blockchain networks; rather, the technology behind it ensures secure transaction processing [26]. Furthermore, transparency in a blockchain network is extremely important. In this way, the entire transaction history is clearly shown and users can view it at any time. Blockchain networks work independently and autonomously, which is why external influences do not affect the network. In traditional IIoT systems, the service providers usually have a monopoly on their operation and the cost of supporting devices [27]. By using distributed ledger technologies, service providers and their monopolies can potentially be completely removed, thereby making IIoT more accessible. Moreover, the associated costs for operating centralized IIoT systems can also be eliminated.

Trustless
A blockchain does not rely on intermediaries for its operation and can therefore automate services through code and by distributing control in the network [28]. For the participants in a blockchain network, the private key is generated in the form of a random number and the public key is derived from it. The user's address is generated from the public key as an alphanumeric value. This is also called "pay to public key hash". The public key of a user can be recognized at any time by the other network participants. In contrast, the private key is secret and is used for the decryption and signature of the transactions. If two users want to execute a transaction within the blockchain network, the sender encrypts it with the recipient's public key. Decryption, i.e., making it readable again, is only possible if the recipient decrypts the transaction with his private key. To prove that the transaction came from the sender and not from an unauthorized third party, the sender signs it with his private key. The recipient can now use the sender's public key to ensure that the transaction originated from the sender [29]. To increase the security of individual transactions so that, for example, the key cannot be used to conclude other transactions of a participant, network participants can use a new key pair for each transaction. Trust between users and devices in an IIoT system can then be established by using distributed ledgers with smart contracts, which will eliminate the need to place trust in centralized service providers.

Autonomy
The notion of "distributed autonomous corporations" can be implemented on a decentralized IIoT system that can operate independently according to some predefined logic. There is no central body that authenticates the participants, which is why all users have the same legitimation [30]. The nodes of a blockchain network store and manage the entire transaction history of a blockchain in an unchangeable form. Over time, however, this transaction history takes up an enormous amount of storage capacity [31]. In contrast, full nodes store the entire blockchain database. This can potentially remove intermediaries and central authorities to facilitate automation in IIoT systems.

Smart Contracts
These are computer programs, i.e., coded logic, that can be used to create agreements that are executed when a certain condition or a set of conditions are met [32]. Practical examples of such blockchain networks are Ethereum and Bitcoin. Ethereum is a smart contract platform that offers a cryptocurrency called "Ether". There are a variety of different blockchain applications that are not limited to cryptocurrency since users can implement their own functions [33]. Smart contracts can also be used for verifying a function as well as data, and to automate a process. This can be very useful for many applications in IIoT. For instance, a user/device can authorize payment when a set of conditions indicates that the delivery of a product/service has been completed. This way, interactions between user-user, device-device, or user-device can be handled transparently.
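
To make the conditional-payment example concrete, the following Python sketch mimics the logic of such a contract off-chain. The class `EscrowContract`, its fields, and the two conditions (`delivered` and `inspected`) are hypothetical names introduced here for illustration only; a real smart contract would be written in a contract language of the chosen platform and executed by the blockchain network.

```python
from dataclasses import dataclass, field


@dataclass
class EscrowContract:
    """Toy model of a contract that releases payment once all conditions hold."""
    buyer: str
    seller: str
    amount: float
    conditions: dict = field(
        default_factory=lambda: {"delivered": False, "inspected": False})
    paid: bool = False
    log: list = field(default_factory=list)

    def report(self, condition, value=True):
        """Record a condition reported by a user or device, then try to settle."""
        if condition not in self.conditions:
            raise KeyError(f"unknown condition: {condition}")
        self.conditions[condition] = value
        self.log.append((condition, value))
        self._settle()

    def _settle(self):
        # Payment is authorized exactly once, when every condition is met.
        if not self.paid and all(self.conditions.values()):
            self.paid = True
            self.log.append(("payment", self.amount))


contract = EscrowContract(buyer="device-A", seller="device-B", amount=10.0)
contract.report("delivered")
paid_after_delivery = contract.paid
contract.report("inspected")
paid_after_inspection = contract.paid
print(paid_after_delivery, paid_after_inspection)
```

The append-only `log` stands in for the transparent transaction history that a blockchain would provide to all participants.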
A blockchain is an online, digital ledger that is globally distributed [34]. The nature of this ledger can be public, private, or semi-private. Note that public ledgers are permissionless while private ones are permissioned. It defines a distributed system design paradigm that uses cryptography, a public key infrastructure (PKI), and economic modeling. This integration of primitives is applied together to a peer-to-peer (p2p) network as well as a shared consensus algorithm. The consensus is used to achieve synchronization among the distributed ledger and it typically operates on a huge number of nodes and/or devices [35]. Moreover, on a blockchain, any kind of information and anything of value, such as digital assets, deeds, identities, and even votes, can be securely stored, moved, and managed. A brief description of the different components of blockchain is provided in Figure 3. Its properties of decentralization, immutability, transparency, and fault-tolerance render it suitable for decentralized IIoT environments.
Figure 3. Brief description of the different components of a blockchain:
• Block: Blocks in a blockchain can be considered as pages in the ledger. Each block consists of many components and has a head and a body.
• Distribution: Blockchain is a distributed ledger. However, not all distributed ledgers are blockchains. It is an open, distributed ledger that can record transactions between two parties efficiently and in a verifiable and permanent way.
• Ledger: A ledger is similar to a database. However, in the case of blockchain, the ledger is consensually shared and synchronized in a distributed manner.
• Confirmation: A confirmation in blockchain refers to the fact that a transaction has been processed by the nodes in the network and is unlikely to be reversed by a single entity.
• Proof of work: It refers to the original consensus algorithm in the network. It is used to produce new blocks in the chain.
• Miner: Miners demonstrate original proof of work to create a new block and get rewarded for doing so. Thus, miners are used to validate the transactions in the blockchain network. A miner gets some reward after mining a block successfully. The reward can be in the form of cryptocurrency which halves every four years.
It can be inferred that the main constituent of a blockchain is its 'blocks'. A block is a set of transactions, i.e., it collates tuples of transactions together concerning a specific period. At any time $t$ in a blockchain network, the users generate hundreds of transactions. These transactions are initially unconfirmed and need to be verified so that they become eligible to be written in the blockchain. This requires the confirmation of miners, which add unconfirmed transactions together and form a block. Note that miners are blockchain entities that are responsible for generating new blocks. The first block in a blockchain is called the 'genesis' block and all the blocks added after it are called the 'successor' blocks. The successors are added to the genesis block in chronological order, i.e., the genesis block $b_g$ is hashed and stored in the second block $b_{g+1}$. The hash of $b_{g+1}$ is stored in $b_{g+2}$ and so on. This way, each block has a hash pointer to the previous block, and changing the content of one block requires changing all of the blocks that succeed it. This forms the basis of a blockchain and can be formulated as:

Blockchain Preliminaries
Moreover, as mentioned before, a block contains a set of transactions. It can also contain application relevant data as well as certain meta-data such as the block header of the previous block. Note that a block header is a common term used for representing the hash of the previous block. Thus, we can formulate a block in the following way: where $B$ represents a blockchain, $b_g$ the chronologically appended blocks in it, and $tx(\cdot)$ represents the set of transactions included in a block. From Equation (2), we can see that a block can contain $n$ instances of transactions, data, and other information depending upon the nature of the application it is designed for.
A transaction set $tx(\cdot)$ is a set of instructions that changes the ownership of digital assets from one user to another. Note that a digital asset can be a virtually valued currency, a token, or simply a deed. The ownership of these assets is changed via the PKI framework offered by a blockchain. Mathematically, we can formulate a transaction in the following way: where $t_{in}$ and $t_{out}$ are the input and output vectors of a transaction that represent the sender and the recipient, respectively. $t_{inNum}$ and $t_{outNum}$ are the number of transactions relative to a timestamp $t$, nonce represents the challenge for miners to mine the transaction and validate it, whereas the data field is an extra field that can store transaction relevant information. Furthermore, in addition to storing block headers in metadata, a blockchain also stores the proof of a block in it. The proof, which is generated by the miners, is the output of a unique and deterministic function. Typical blockchain applications use the proof-of-work (PoW) consensus algorithm which can be generally formulated as: where the probability of finding a proof $p$ for a target $t$ is $pr(p \le t)$. When a miner finds the target value, it broadcasts its proof in the network. The other miners then verify this proof and validate the block. Once the block is validated and accepted by the majority of the network, it is finally added to the blockchain. Moreover, the time taken by a miner to find proof for a block can be given as: where $t_p$ is the time required to find a proof for a block and hashRate represents the number of hashes a miner can generate per second. Generally, hundreds of thousands of hashes are required per second to mine a block in a feasible time frame. Therefore, PoW based systems arguably require high computational resources and complexity.
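
As a rough illustration of the PoW procedure described above, the following Python sketch searches for a nonce whose SHA-256 digest falls below a target, lets other parties verify the proof with a single hash, and estimates the expected mining time from the hash rate. The encoding of the block data, the parameter `difficulty_bits`, the target of $2^{256 - \text{difficulty\_bits}}$, and the use of a strict inequality are assumptions of this sketch rather than the formulation used in this article or by any specific blockchain.

```python
import hashlib


def _digest_value(block_data, nonce):
    """Interpret SHA-256(block_data + nonce) as a 256-bit integer."""
    digest = hashlib.sha256(f"{block_data}{nonce}".encode("utf-8")).digest()
    return int.from_bytes(digest, "big"), digest.hex()


def mine(block_data, difficulty_bits, max_nonce=2**32):
    """Try nonces in order until the digest is below the target."""
    target = 2 ** (256 - difficulty_bits)
    for nonce in range(max_nonce):
        value, hex_digest = _digest_value(block_data, nonce)
        if value < target:
            return nonce, hex_digest
    raise RuntimeError("no proof found within max_nonce attempts")


def verify(block_data, nonce, difficulty_bits):
    """Other miners can check a broadcast proof with a single hash."""
    value, _ = _digest_value(block_data, nonce)
    return value < 2 ** (256 - difficulty_bits)


def expected_mining_time(difficulty_bits, hash_rate):
    """Expected seconds to find a proof: each hash succeeds with probability
    2**-difficulty_bits, so about 2**difficulty_bits hashes are needed."""
    return 2 ** difficulty_bits / hash_rate


nonce, proof = mine("block-42", difficulty_bits=12)
print(nonce, proof, verify("block-42", nonce, 12))
print(expected_mining_time(difficulty_bits=20, hash_rate=2**20))
```

The asymmetry is the point of the design: finding the nonce takes many hash evaluations on average, whereas verifying it takes one, and raising `difficulty_bits` by one doubles the expected mining time for a fixed hash rate.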
We can now deduce that blockchain technology introduces a whole new design paradigm for next-generation applications that, together with a distributed and shared public/private ledger and a collective consensus mechanism, builds trust, transparency, and accountability in a system. Thus, many different avenues have already been identified to benefit from its adoption such as financial institutions, smart power grids and cities, supply chain management frameworks, and cyber-physical systems [36][37][38]. Besides simplifying business processes and providing transparency in system operations, its integration with IIoT has recently been widely discussed in the research community.
As one of the key enablers of Industry 4.0 and an emerging offshoot of IoT, IIoT networks are making their way into various commercial and social sectors such as retailing, manufacturing, logistics, pervasive monitoring, security surveillance, healthcare, transportation systems, and home automation [34,39,40,41]. Moreover, with the recent developments in wireless communications and sensor network technologies, an increasing number of devices are being introduced in the IIoT space, where raw data are locally captured and processed to support decision-based processes. These devices can communicate and interact with each other as well as share and process information independent of human intervention. Therefore, they must be made secure to ensure data integrity as well as resource availability and computing reliability [42].

Recent Studies and Gap Analysis
This section discusses some of the recent works on blockchain-based IIoT networks while also providing a gap analysis. There is little consensus on the number of layers in a blockchain. Generally, different arrangements of six layers (i.e., Data layer, Network layer, Consensus layer, Incentive layer, Contract layer, and Application layer) can be found in the blockchain literature [43]. However, due to the lack of overlap between the blockchain and IIoT layered architectures, it is not straightforward to deduce the optimal number of layers. Therefore, we resort to the three layers [44] (i.e., Physical (Perception), Network, and Application) that have common functionality in IIoT and blockchain technology and that may incorporate the operations of other sub-layers.

Physical Layer
For the physical layer, blockchain helps to facilitate key management in IIoT devices through their gateways that can act as the agent of a cluster of IIoT devices [4]. In this regard, the authors in [45,46] address the issue of digital provenance in IoT based environments and vehicular networks, and propose a hardware primitive based framework that uses physical unclonable functions (PUFs), blockchain, and smart contracts. A PUF can be defined as a system that maps a set of challenges to a set of responses based on the physical microstructure of a device. In [47], Jiang et al. analyzed the wireless power transmission aspect of blockchain networks. The authors in [48] present a chip-level blockchain-based solution for the IIoT networks. They propose the integration of IoT with blockchain that uses a physical address inside a semiconductor chip mounted into an IIoT device. Furthermore, the authors in [49] present a new dimension to PUFs and introduce optical PUFs as a physical trust root for blockchain-based applications. They stress that optical PUFs (o-PUFs) can be successfully used as random number generators to generate bit strings. They explain how the bit strings can be generated by using a coherent light source that can be varied to create different PUF challenge-response pairs. The generated bit-strings can then be used as symmetric keys for asymmetric encryption.

Network Layer
For the network layer, there are studies that focus on the merger of blockchain with other technologies like software-defined networks and edge computing [50,51]. The authors in [52] present four major blockchain trends. Note that we differentiate among the blockchain technology trends concerning their consensus protocols and not their applications. Blockchain 1.0 (e.g., Bitcoin, Ethereum) represents the typical blockchain data structure that starts with a genesis block and adds succeeding blocks to it in a chronological manner. These are the blockchain paradigms whose consensus protocols are proof-of-work. One example of exploiting these protocols for system design is [53], in which the authors propose a blockchain-based payment scheme for remote villages with intermittent Internet connectivity. Blockchain 1.0, however, has shown weak scalability and favored mining, thereby highlighting the need for a new, more effective protocol. A more advanced version is therefore introduced as Blockchain 2.0 (e.g., Ethereum 2.0), which uses smart contracts and proof-of-stake protocols to protect the network [54]. It has also resulted in a significant reduction of the cost of verification and allows a transparent contract definition to prevent fraud and hacking. Blockchain 3.0 is another advancement over the previous version and uses DAGs in which there are no blocks but sites, i.e., transactions committed by IIoT devices. Each device represents a transaction and the connections (directed edges) between transactions represent their validation. The authors of [55] exploit Blockchain 1.0, 2.0, and 3.0 to propose deterministic cross-blockchain token transfers (DeXTT), a cross-blockchain transfer protocol that can be used to transfer tokens, i.e., digital assets, from one blockchain to another. The most recent version of the blockchain is Blockchain 4.0, which is built to increase the degree of trust and privacy in Industry 4.0. Blockchain 4.0 is an extension of Blockchain 3.0 to make it feasible for real-life business scenarios. The consensus protocols for this trend include virtual voting, blockchain consortium (federated blockchains), and gossip-based protocol variants. IIoT data collection, approval of workflows, and asset and supply chain management are some of the key business processes that could be enabled by Blockchain 4.0.

Application Layer
In blockchain-enabled IIoT networks, the application layer deals with the interfacing and controlling issues [56]. This layer can also be used to develop programmable currency, smart contracts, and REST APIs. For instance, the authors of [57] studied the reserve price of advertisement inventory in blockchain networks and analyzed its impact on the online market. Subsequently, they formulated an auction model for multi-channel sales of the advertisement inventory. Qian et al. in [58] highlighted the utility and importance of the application layer in the healthcare domain, similar to the work of [59]. They argued that there is a need for a secure application layer architecture for protecting user privacy and storing encrypted data in the cloud or edge. Another use case of blockchain is mitigating IIoT device based distributed denial of service (DDoS) attacks [2]. The authors used Ethereum with smart contracts that verify IIoT devices and enforce a bandwidth limit on them beyond which they cannot operate.

Applications of RL in Blockchain-Enabled IIoT Networks
RL techniques are fundamentally different from other forms of learning, in which the training data, with or without labels, are available as examples or observations at the beginning of the learning process. That is why RL is also called learning through interaction. In supervised learning, knowing the correct target values for the training data makes it possible to determine exactly whether a decision is right or wrong (classification) or how far a prediction is from the correct value (regression). In contrast, the agent in RL does not usually learn whether a decision is the best one. Since its wealth of experience only grows with its interactions, the agent often knows only for a part of the possible actions whether they have led to a greater or smaller reward in the past. To solve this problem, previously unknown (not yet evaluated) actions are typically performed on a trial-and-error basis to widen the range of actions covered by its strategy.
In view of the above, RL requires the least prior specification of these learning forms, so it is ideal for scenarios in which the provision of training data is difficult and an agent has to devise a strategy for a series of decisions. This interactive learning makes the RL techniques suitable for blockchain-enabled IIoT networks. This section builds on this thesis and provides details of how different RL techniques can be applied to IIoT-blockchain networks to improve their performance. A summary of these applications is provided in Figure 4.

Minimizing Forking Events
In general, forking is an event in the blockchain that takes place when a blockchain diverges into two potential chains. Since the miners in the blockchain need to use common consensus algorithms to maintain the history of the blockchain, a forking event indicates a scenario where the miners clash and have a conflict of consensus agreement. This may result in creating forks that are both short and long. From the perspective of Blockchain 1.0, forking may be caused when the nodes that have completed the proof-of-work do not convey the results to other computing nodes. To avoid forking events, the transmission delay at the miner can be reduced with the help of RL. The agents can be trained for an IIoT environment to develop an optimal policy for minimizing such delays.

Improving Energy Efficiency
Energy is an important aspect of providing communication infrastructure to IIoT-blockchain networks [3,60]. For this reason, the importance of energy-efficient communication techniques cannot be overstated. The energy-constrained miner can become the point of failure in case the energy resources are not utilized efficiently. Besides this, the forking events may also result in re-computation of the proof-of-work at the miner side, thereby reducing the energy budget. These situations may entail an extra cost in terms of consuming unnecessary energy of devices [61]. To avoid this, RL techniques can be very helpful when applied correctly. We anticipate that these types of networks can be optimized using actor-critic learning. Given the continuous and dynamic nature of the energy consumption of blockchain-enabled IIoT devices, the versatile actor-critic learning model is a natural candidate.

Time to Finality Minimization
The term finality refers to the confirmation that, once committed to the blockchain, the well-formed blocks will not be tampered with. This is important since the blockchain users need to ensure that the transactions cannot be reversed or arbitrarily changed once a transaction goes through [31]. Although the consensus protocols have been designed to reach finality in less time, their impact on IIoT devices is not clear yet. Therefore, it is important to reduce the probabilistic time to finality for proof-of-work protocols. In this regard, it is important to avoid selfish mining by devices. An RL system can be used to detect selfish miners that consume the resources unnecessarily.
For instance, the Q-learning technique can be used to minimize the probabilistic time to finality of proof-of-work for blockchain-enabled IIoT networks. The agents can update themselves by learning from the received rewards and probabilistically improving the time to finality [62].

Enhancing Transaction Throughput
Transaction throughput refers to the number of transactions per second [63]. The current transaction throughput in IIoT-blockchain networks suffers from scalability problems. One convenient solution is simply to increase the number of transactions per block. Another is to increase the frequency with which the blocks are added into the network. However, recording more transactions could have an adverse effect on the decentralization of blockchain-enabled IIoT networks. It may also increase the mining time since a miner needs to check the validity of all signatures on the transactions before mining a block. Because of such intricacies, the optimal block size and frequency of adding new blocks are highly application- and resource-dependent, while the RL techniques can be used to efficiently regulate the network dynamics [64,65]. Moreover, the optimal policies and the tradeoffs between the transaction throughput and decentralization can also be identified with the help of RL techniques [66]. The agent can also provide the specific set of actions needed to increase the throughput while not compromising the decentralization of IIoT-blockchain networks.
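The two levers mentioned above can be made concrete with a small arithmetic sketch. The function name and the block sizes and intervals below are hypothetical values chosen only for illustration; they are not measurements from any blockchain-enabled IIoT network.

```python
def transaction_throughput(txs_per_block: int, block_interval_s: float) -> float:
    """Transactions per second for a given block size and block interval."""
    if block_interval_s <= 0:
        raise ValueError("block interval must be positive")
    return txs_per_block / block_interval_s

# Hypothetical baseline: 2000 transactions per block, one block every 600 s.
baseline = transaction_throughput(2000, 600.0)
# Lever 1: double the number of transactions per block.
bigger_blocks = transaction_throughput(4000, 600.0)
# Lever 2: add blocks twice as often.
faster_blocks = transaction_throughput(2000, 300.0)
```

Both levers double the nominal throughput in this toy calculation, but the sketch ignores the side effects discussed in the text, such as longer signature validation for larger blocks and the impact on decentralization.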

Improving Link Security
Another important aspect of IIoT-blockchain networks is linked to security [67]. This becomes critical when miners are sharing sensitive information or exchanging acknowledgment messages. Due to the broadcast nature of messages exchanged between IIoT-blockchain devices, it is important to secure the links through physical layer security (PLS) techniques [8]. PLS exploits the randomness of a wireless channel to confuse the eavesdroppers. On other occasions, artificial noise can also be added by a friendly jammer to protect the communication channel. RL can be used extensively for the provisioning of link security for IIoT-blockchain networks. Multi-armed bandit techniques can be used to identify the nearby eavesdroppers in the blockchain network. Deep Q-learning can be applied to introduce the artificial noise in the network, without damaging the quality of the legitimate link. Due to the dynamicity of these learning techniques, they can be applied easily in many indoor and outdoor link security cases [68].

Average Block Time Reduction
Block time, or block interval, can be defined as the total amount of time it takes to mine a block. Due to the high variability in the time for mining a block, the average block time is preferred in large-scale networks. The average block time of a blockchain-enabled IIoT network is directly related to the complexity of the proof-of-work algorithm. It may be noted that some existing platforms (e.g., Ethereum) dynamically change the complexity of the blocks. However, the RL techniques can be used for optimizing the long-term utility of a blockchain network instead of relying on instantaneous gains. For instance, a multi-armed bandit learning network can be used to identify the complexity of the algorithms based on the scale and other characteristics of the network. Thus, in pursuit of developing a long-term optimal policy, if the average block time is less than the expected block time, then the level of difficulty can be increased. In contrast, if the expected time is less than the average block time, then the level of difficulty can be reduced for mining. Thus, RL techniques can help optimize the performance based on different characteristics of the end-to-end blockchain network [69].
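The threshold rule described above can be sketched in a few lines of Python. This is only the rule itself, not a learned policy: the function name, the fixed multiplicative step, and the lower bound on the difficulty are assumptions made for illustration, and an RL agent (for example, a multi-armed bandit) would instead learn how strongly to adjust.

```python
def adjust_difficulty(difficulty: float,
                      avg_block_time: float,
                      expected_block_time: float,
                      step: float = 0.05,
                      min_difficulty: float = 1.0) -> float:
    """Raise the difficulty when blocks arrive too fast, lower it when too slow."""
    if avg_block_time < expected_block_time:
        # Blocks are mined faster than expected: make mining harder.
        return difficulty * (1.0 + step)
    if avg_block_time > expected_block_time:
        # Blocks are mined slower than expected: make mining easier.
        return max(min_difficulty, difficulty * (1.0 - step))
    return difficulty  # on target: leave the difficulty unchanged

# Hypothetical example: expected block time of 10 s.
harder = adjust_difficulty(100.0, avg_block_time=8.0, expected_block_time=10.0)
easier = adjust_difficulty(100.0, avg_block_time=12.0, expected_block_time=10.0)
```

A learning-based variant could treat a small set of candidate step sizes as bandit arms and reward the arm that keeps the average block time closest to the expected value.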

Case Study: Minimization of Forking in Blockchain
In this section, we present a case study for optimizing IIoT-blockchain networks using RL techniques. The results provided here can act as fundamental building blocks for future research in RL and blockchain-enabled IIoT networks.

Problem Formulation
We consider a blockchain-enabled IIoT network consisting of multiple miners and a single communication point, as shown in Figure 5. The communication point is the static infrastructure associated with the miners in the network. The miners are mobile units with computational capabilities for gathering transaction data. The ledger is considered to be located at the communication point, where transaction records are stored as blocks. Before a transaction record is stored as a block at the communication point, the block needs to be validated to confirm the originality of the transactions. As per Blockchain 1.0, the communication point can delegate this proof-of-work (mining) computation for validation to the wireless miners. Once a miner completes the proof-of-work, it sends an acknowledgment (ACK) message to the communication point. The communication point then propagates the ACK message to the other IIoT devices. In principle, the order in which the ACK messages are received by the communication point must be identical to the order of completion of the task. A forking event can occur in case the ACK first sent by a miner arrives later than other ACKs due to transmission delay [70]. This results in creating branches, and the recovery from the forking event increases the overall latency of the network [71]. Note that we consider that the K miners and J transaction nodes in the network are equipped with single antennas and communicate with the communication point over a quasi-static flat-fading channel. Additionally, the channel coefficients remain the same during each time slot and vary from one time slot to another. The transmission time of the message is the ratio of the size of the message to the overall data rate, i.e., $\Gamma / R$, where $\Gamma$ is the size of the message. Here, R is the data rate, which is determined by $p_k$ and $h_k$, the transmit power and channel coefficient of the main link, by $p_j$ and $h_j$, the transmit power and channel coefficient of the interference link, and by $N_0$, the variance of the additive white Gaussian noise (AWGN). Since the size of the message is generally fixed, we aim to maximize the data rate to reduce the transmission delay, thereby reducing the forking events.
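The symbols defined above suggest a signal-to-interference-plus-noise formulation of the data rate. As a sketch only, assuming a unit-bandwidth Shannon rate, channel gains given by the squared magnitudes of the coefficients, and interference summed over the J transaction nodes (assumptions that the text does not confirm), the rate and the resulting transmission time, denoted here by the hypothetical symbol T, could be written as:

$$
R = \log_2\!\left(1 + \frac{p_k\,|h_k|^2}{\sum_{j=1}^{J} p_j\,|h_j|^2 + N_0}\right),
\qquad
T = \frac{\Gamma}{R}.
$$

Under this form, raising $p_k$ increases R and shortens T for a fixed message size $\Gamma$, while stronger interference or noise has the opposite effect.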

Algorithm Design
To solve the problem $\max_{p_1, p_2, \ldots, p_K} R$ subject to $0 \le p_k \le P_m$, we employ a Q-learning technique that controls the power of the miner. In the following, we define three key elements of the Q-learning model, i.e., states, rewards, and actions.

States
In a decision epoch, a state s ∈ S is defined by the energy of the miner and the channel gains. Based on this state, the miner decides between saving energy and transferring the message.

Actions
We consider that the miner adaptively switches between energy-saving and message transferring modes in a state. Thus, an action a ∈ Ã can be represented as either 0 or 1, where 0 represents the energy-saving mode and 1 denotes the message transferring mode. This is mathematically given as Ã = {0, 1}.

Reward
Note that the forking event is directly dependent on the successfully transferred information and indirectly depends on the energy of the miner. Thus, the agent receives a reward $R(s'|s, a)$ only when operating in the message transferring mode. In the energy-saving mode, the agent receives no reward, but the energy saved can be used in later time slots to support data transmission. The reward is therefore a function of the action a performed by the agent: when a = 0, the agent operates in energy-saving mode and does not receive any reward. During each epoch, the miner chooses an action based on the state-action values and updates the Q-table. The table is initialized by setting the Q-value to zero for all the actions and states. The iteration starts by picking a random state and updating the Q-table using the ε-greedy method. This iterative method aims to maximize the Q-value, and the agent receives an immediate reward if the information is transferred successfully. The Q-value is updated according to the standard Q-learning rule, $Q(s, a) \leftarrow Q(s, a) + \alpha \left[ R(s'|s, a) + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$, where α represents the learning rate and γ is the discount factor. The agent makes better decisions over a longer period and the Q-table starts to stabilize. In the end, the mode selection policy is obtained, which is based on the best set of actions for different states.
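The procedure described above can be sketched as a small tabular Q-learning loop in Python. Everything specific in this sketch is an assumption made for illustration: the toy environment (five discrete energy levels, three channel levels drawn at random each slot, one energy unit gained per saving slot and spent per transfer), the reward values, the hyperparameters, and the decaying exploration rate. It is not the simulation setup used in this work.

```python
import random

E_MAX = 4                   # assumed number of energy units a miner can store
RATES = [0.5, 1.5, 3.0]     # assumed reward (data rate) for bad/medium/good channel
ALPHA, GAMMA = 0.05, 0.9    # learning rate and discount factor (assumed values)
STATES = [(e, c) for e in range(E_MAX + 1) for c in range(len(RATES))]

def step(state, action, rng):
    """Toy environment: action 0 saves energy, action 1 transfers the message."""
    energy, channel = state
    reward = 0.0
    if action == 1 and energy > 0:
        reward = RATES[channel]            # reward only for a successful transfer
        energy -= 1
    elif action == 0:
        energy = min(E_MAX, energy + 1)    # saved energy is usable in later slots
    return (energy, rng.randrange(len(RATES))), reward

def best_action(q, state):
    """Action with the highest Q-value (ties resolve to energy saving)."""
    return max((0, 1), key=lambda a: q[state][a])

def train(episodes=3000, horizon=50, seed=0):
    rng = random.Random(seed)
    q = {s: [0.0, 0.0] for s in STATES}          # Q-table initialised to zero
    for ep in range(episodes):
        eps = max(0.1, 1.0 - ep / episodes)      # decaying epsilon-greedy exploration
        state = rng.choice(STATES)               # each episode starts in a random state
        for _ in range(horizon):
            if rng.random() < eps:
                action = rng.randrange(2)        # explore
            else:
                action = best_action(q, state)   # exploit
            nxt, reward = step(state, action, rng)
            target = reward + GAMMA * max(q[nxt])
            q[state][action] += ALPHA * (target - q[state][action])
            state = nxt
    return q

def average_reward(choose, slots=20000, seed=1):
    """Average per-slot reward of a mode selection rule in the toy environment."""
    rng = random.Random(seed)
    state, total = (0, 0), 0.0
    for _ in range(slots):
        state, reward = step(state, choose(state), rng)
        total += reward
    return total / slots

q_table = train()
policy = {s: best_action(q_table, s) for s in STATES}
q_avg = average_reward(lambda s: policy[s])
# Greedy baseline: transfer whenever there is enough energy, otherwise save.
greedy_avg = average_reward(lambda s: 1 if s[0] > 0 else 0)
```

In this toy setting the learned policy tends to hold back energy for better channel conditions, whereas the greedy baseline spends it as soon as it is available; the absolute numbers depend entirely on the assumed environment.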

Results and Discussion
We now evaluate the performance of the proposed Q-learning scheme for reducing the transmission delay of miners and the occurrence of forking events in the blockchain. We performed extensive simulations and created a backscatter communication environment for the agent interaction. The learning agent interacts with the environment and takes an action that maximizes the immediate reward R. After performing an action, the agent observes the next state and again interacts with the simulation environment. Over time and after a specific learning period, the performance starts to converge and the Q-table stabilizes. For the sake of fair comparison, we compare our Q-learning based scheme with a greedy policy. In the case of the greedy policy, the miner chooses the message transfer mode if it has sufficient energy to transmit the signal. Otherwise, it stays passive in the energy-saving mode.
Figure 6a shows the average transmission delay achieved by using the proposed Q-learning algorithm. Each point in the figure represents the rolling average of the last $10^3$ time slots. For each curve, we have illustrated the impact of different ACK message sizes. A smaller message size reduces the transmission delay. Furthermore, an increase in the number of iterations further reduces the transmission delay. This indicates that the agent makes better decisions over time and becomes fairly stable after many interactions with the environment. In Figure 6b, we compare the Q-learning approach with the benchmark greedy policy. It can be seen that the Q-learning approach significantly outperforms the greedy policy. Since the greedy approach only focuses on maximizing the current reward, it performs poorly in terms of average transmission delay. In contrast, the Q-learning approach not only tries to maximize the current reward but also considers future rewards and, thus, optimizes the actions.
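The rolling average used for each plotted point can be computed incrementally, as in the following Python sketch. The function name is hypothetical, and averaging over fewer samples during the first slots (before a full window is available) is an assumption, since the text does not say how the warm-up period was handled.

```python
from collections import deque

def rolling_average(values, window=1000):
    """Rolling mean over the last `window` samples (shorter during warm-up)."""
    buf = deque(maxlen=window)
    total = 0.0
    out = []
    for v in values:
        if len(buf) == window:
            total -= buf[0]      # the oldest sample is about to be evicted
        buf.append(v)
        total += v
        out.append(total / len(buf))
    return out

# Small example with a window of 2 instead of 10^3.
smoothed = rolling_average([1, 2, 3, 4], window=2)
```

Applied to the per-slot transmission delays, such smoothing suppresses slot-to-slot fluctuations so that the learning trend becomes visible.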

Conclusions and Future Work
Blockchain-enabled IIoT networks are emerging as a new norm in the global development and adoption of blockchain technology. However, the optimization of these networks requires more than just conventional model-driven methods. Thus, the RL techniques are a viable solution to address the challenges that IIoT-blockchain networks face, such as minimizing forking events and improving the transaction throughput. This article has highlighted some of the recent and concrete studies on blockchain-enabled IIoT networks and detailed how RL techniques can be applied to solve the associated issues comprehensively. Subsequently, a case study has been presented where the occurrence of the forking event is minimized using the Q-learning technique. The results demonstrate the improvements offered by the proposed technique over the greedy method.
As a promising candidate for optimizing the performance of blockchain-enabled IIoT networks, the RL techniques are helping to realize intelligence in such networks. However, there are several challenges and open research questions that may drive future research efforts in this domain. Some of the potential research directions include, but are not limited to, the following:

Energy Constrained IIoT Devices in Blockchain
One of the major issues in applying the RL techniques to blockchain-enabled IIoT devices is the energy-constrained nature of the IIoT devices. Although this arguably makes the performance of the RL techniques no less significant, running them drains the energy of low-powered devices very quickly. This issue can become more prominent in the case of distributed deep RL techniques. In these conditions, energy-efficient management is required at the physical layer of networks [72].

Independence of RL Techniques and Blockchain
Typically, the RL and blockchain models operate independently; therefore, the blockchain entities are not able to communicate with the RL processes. This independence can potentially reduce the performance of the blockchain-enabled IIoT network, to the point of jeopardizing the entire infrastructure. Thus, there is a pressing need for cross-domain research efforts for integrating RL processes within blockchain infrastructure to the level of individual bits and instructions. This embedding is expected to provide improved performance due to higher collective gains for the network.

The Scalability Paradox
Scalability has been a major area of research in blockchain networks. In recent years, the issue has received much attention for realizing massive and large-scale blockchain-enabled IIoT networks for smart cities. Although scalability is an inherent problem of blockchains, it becomes more complex with the introduction of RL techniques. The training for RL techniques needs to be performed extensively and at regular intervals. The deep RL techniques require a more complex architecture of the network for distributed intelligence. The issue worsens when the blockchain-enabled IIoT networks are hyper-mobile, where some devices enter and leave the network recurrently. Thus, there is a need for focused research efforts to make progress in this domain.

Selection of Appropriate RL Techniques
Another important issue to address is the selection of appropriate RL techniques for optimizing different aspects of the network. Since one size does not fit all, the performance of RL techniques may differ dramatically if applied to the wrong set of problems. As detailed in the above sections, the deep Q-learning method is complex and may be suitable for problems with a large state space. On the other hand, less complex techniques like multi-armed bandit learning may be best for addressing small local issues in blockchain-enabled IIoT networks. Due to the high diversity of RL techniques, future research efforts need to focus on developing a compendium of RL techniques for a specific set of problems of blockchain-enabled IIoT networks.

Financial Forecasting
Some of the recent literature in the blockchain domain has stressed that, in many cases, RL and deep RL yield better financial forecasting results than supervised learning techniques. Consider, for instance, stock price prediction: it is well known that historical data cannot reflect the dynamics of the current market, which contributes to poor prediction of future price changes. It is expected that better forecasts can be made by adopting RL techniques. Extending this argument to blockchain-enabled IIoT networks, where hundreds or thousands of agents operate autonomously, accurate prediction of the associated costs is indispensable for better management.

Smart Agents
Smart agents present a very interesting research direction, where agents with smart capabilities are designed in a way that they help regulate the blockchain-IIoT network and detect abnormal or malicious behavior patterns with a high probability. Note that the former is especially important for blockchains that are private or consortium in nature, while the latter is required for public blockchains. The design and employment of such agents will not only help regulate blockchains, but it will also help them in achieving self-healing attributes.

Anonymous Data Sharing
With the recent and ongoing developments in the IIoT space, the issues of privacy are gaining worldwide attention. Blockchain coupled with RL techniques can help enable anonymous data sharing between two users or devices in a blockchain-IIoT network. Thus, multiple-layer structures of blockchain can be designed together with data fusion, which will allow sophisticated authorization of data for different users and enable more complex networks to be designed.

Figure 1. An illustration of exchanging digital currency from A to B in a blockchain.

Figure 2. Timeline of blockchain technology over the years since its inception.

Figure 3. A brief illustration of blockchain primitives including ledger, miner and block.

Figure 4. Applications of RL techniques for blockchain-enabled IIoT networks. The Q-learning technique appears to be more suitable for improving transaction throughput and minimizing forking events. Moreover, energy efficiency and link security can be improved using actor-critic learning and deep Q-learning techniques, respectively. Finally, time to finality and the block time are expected to be reduced with the help of Q-learning and multi-armed bandit learning, respectively.

Figure 5. An illustration of the system model. The considered network consists of multiple miners and a single communication point. A forking event takes place when the ACK messages arrive out of order at the communication point.

Figure 6. Average transmission delay against (a) number of iterations (b) transmit power.

Table 1. Comparison of distributed ledger technologies.