Article

Deep Reinforcement Learning Agent for Negotiation in Multi-Agent Cooperative Distributed Predictive Control

by Oscar Aponte-Rengifo *,†, Pastora Vega and Mario Francisco
Department of Computer Science and Automatics, Faculty of Sciences, University of Salamanca, Plaza de la Merced, s/n, 37008 Salamanca, Spain
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
Appl. Sci. 2023, 13(4), 2432; https://doi.org/10.3390/app13042432
Submission received: 31 December 2022 / Revised: 30 January 2023 / Accepted: 2 February 2023 / Published: 14 February 2023
(This article belongs to the Special Issue Advances in Intelligent Control and Engineering Applications)

Abstract
This paper proposes a novel solution that uses deep neural networks trained with reinforcement learning as a valid option for negotiation among distributed hierarchical controller agents. The proposed method is implemented in the upper layer of a hierarchical control architecture whose lower levels consist of distributed control based on local models and negotiation processes with fuzzy logic. The advantage of the proposal is that it does not require the use of models in the negotiation, and it facilitates the minimization of any dynamic behavior index and the specification of constraints. Specifically, it uses a reinforcement learning policy gradient algorithm to achieve a consensus among the agents. The algorithm is successfully applied to a level-control system composed of eight interconnected tanks, which is quite difficult to control due to its non-linear nature and the high interaction among its subsystems.

1. Introduction

Model predictive control (MPC) is a successful technique with increasing industrial applications [1]. Nevertheless, the computational requirements of a centralized MPC are, in some cases, infeasible from a practical point of view due to the large-scale nature of actual engineering infrastructure. Therefore, distributed model predictive control (DMPC) appears as a possible solution to these problems, since a complex problem can always be seen as a set of coupled simpler problems with a clear structure that represents the global one [2]. In addition, DMPC allows the exploitation of this structure to formulate control laws based on local information and some communication between the agents to achieve global performance and stability. In this case, negotiation among agents is a typical approach. The critical task of solving the negotiation problem in multi-agent DMPC systems lies in designing distributed control protocols based on local information from each agent and its neighbors.
Consequently, the main concern of this negotiation problem is achieving global decisions based on local information. Therefore, strategies that allow agents to change their local decisions online, according to changes in the environment or the behavior of other agents, in order to reach a joint agreement on a global interest are a priority concern. Negotiation to achieve consensus is focused on addressing this problem. An interesting approach is to use deep neural networks trained with machine learning methods as negotiation managers, which offers many advantages and motivates the development of many algorithms for decision making and control, studying all the properties of stability, convergence, and feasibility.
Within artificial intelligence, machine learning techniques have proven to be a powerful tool when using knowledge extracted from data for the supervision and control of complex processes [3,4,5,6,7]. Technological advances and data availability have boosted this methodology. Deep learning is another field that uses the large amount of data provided by intelligent sensors. This learning is based on neural networks that capture complex behavior, such as in [8], where deep learning-based models have been developed as soft sensors to forecast the key features of wastewater treatment plants (WWTPs).
Among them, one approach is reinforcement learning (RL) [9], in which an RL agent learns by interacting with the environment in order to determine, based on a policy, what action to take given an environment state, aiming at the maximization of the expected cumulative reward. Ultimately, the RL agent learns the policy to follow when the environment is in a certain state, adapting to variations. In other words, the main objective of reinforcement learning is to train an RL agent to complete a task in an uncertain environment, which it gets to know through the states that the environment assumes as a consequence of the actions [10,11].
Valuable surveys of RL algorithms and their applications can be found in [9,12], where several online model-free value-function-based reinforcement learning algorithms that use the expected cumulative reward criterion (Q-learning, Sarsa, and actor–critic methods) are described. Further information can also be found in [13,14,15]. In addition, several policy search and policy gradient algorithms have been proposed in [16,17].
Many traditional reinforcement learning algorithms have been designed to solve problems in the discrete domain. However, real-world problems are often continuous, making the learning process to select a good or even an optimal policy quite complex. Two RL algorithms that address continuous problems are the deep deterministic policy gradient (DDPG), which uses a deterministic policy, and the policy gradient (PG), which assumes a stochastic policy. Despite the significant advances over the last few years, many issues still need to be addressed to improve the ability of reinforcement learning methods to tackle complex and continuous domains. Furthermore, classical RL algorithms require a large amount of data to obtain suitable policies. Therefore, applying RL to complex problems is not straightforward, and the combination of reinforcement learning algorithms with function approximation is currently an active field of research. A detailed description of RL methodologies in continuous state and action spaces is given in [18].
Recently, research works have been published in the literature showing that RL based on neural network schemes, combined with other techniques, can be used successfully in the control and monitoring of continuous processes. In [11,19,20,21], strategies for wind turbine pitch control using RL, lookup tables, and neural networks are presented. Specifically, in [11], direct control of the pitch angle of the wind turbine blades is considered, and some hybrid control configurations are proposed; neuro-estimators improving the controllers, together with the application of some of these techniques to a terrestrial turbine model, are shown. Other control solutions with RL algorithms and without the use of models can be found in [22], among others. In cooperative control, advances have also used RL. Thus, in [7], a cooperative wind farm control with deep reinforcement learning and knowledge-assisted learning is proposed with excellent results. In addition, in [23], a novel proposal is made for high-level control: RL based on artificial neural networks is used for high-level agent negotiation within a distributed model predictive control (DMPC) scheme implemented in the lower layers to control a level system composed of eight interconnected tanks. Negotiation agent approaches based on reinforcement learning have advantages over other approaches, such as genetic algorithms, which require multiple tests before arriving at the best strategy [24], or heuristic approaches, in which the negotiation agent does not go through a learning process [25].
Negotiation approaches based on reinforcement learning often employ the Q-learning algorithm [26], based on the value function, which predicts the reward that an action will have given a state. This is the case in [27], in which a negotiation agent is proposed based on a Q-learning algorithm that computes the value of one or more variables shared between two MPC agents. On the other hand, there are approaches based on policies, the topic addressed in this work, which directly predict the action itself [28]. One of these algorithms is the policy gradient (PG) [17], which directly parameterizes a policy by following the direction of the ascending gradient [16]. In addition, PG employs a deep neural network as a policy approximation function and works in sizeable continuous state and action spaces.
Local MPCs of a DMPC scheme make decisions based on local objectives. For this reason, consensus among local decisions is necessary to achieve global objectives and improve global performance indexes. In particular, in this paper, a negotiation agent based on RL manages the negotiation processes among local MPC agents to improve a global behavior index of a plant composed of eight interconnected tanks. To this end, the PG-DMPC algorithm, built around a deep neural network trained by the policy gradient (PG) algorithm [29], is proposed and implemented in the upper-level control layer of the control architecture. The decision to implement the PG algorithm as a negotiation agent is based on the following advantages: it does not require knowledge of the model, it can adapt to handle the uncertainty of the environment, its convergence is assured, and it has few, easily understood tuning parameters.
The remainder of this paper is organized as follows: Section 2 states the problem. Section 3 presents the RL method, the PG-DMPC algorithm, and the DMPC of the low-level control layer. Section 4 describes the case study, the negotiation framework, the training of the negotiation agent, and the performance indexes. Section 5 presents the results. Finally, Section 6 provides conclusions and directions for future work.

Notation

$\mathbb{N}_0^+$ and $\mathbb{R}^+$ are, respectively, the sets of non-negative integers and positive real numbers. $\mathbb{R}^n$ refers to the $n$-dimensional Euclidean space. The scalar product of vectors $a, b \in \mathbb{R}^n$ is denoted as $a^\top b$ or $a \cdot b$. Given sets $X, Y \subseteq \mathbb{R}^n$, the Cartesian product is $X \times Y \triangleq \{ (x, y) : x \in X, y \in Y \}$. If $\{X_i\}_{i \in \mathcal{N}}$ is a family of sets indexed by $\mathcal{N}$, then the Cartesian product is $\times_{i \in \mathcal{N}} X_i \triangleq X_1 \times \cdots \times X_N = \{ (x_1, \ldots, x_N) : x_1 \in X_1, \ldots, x_N \in X_N \}$. Moreover, the Minkowski sum is $X \oplus Y \triangleq \{ x + y : x \in X, y \in Y \}$. The set subtraction operation is symbolized by $\setminus$. The image of a set $X \subseteq \mathbb{R}^n$ under a linear mapping $A : \mathbb{R}^n \to \mathbb{R}^m$ is $A X \triangleq \{ A x : x \in X \}$.

2. Problem Statement

In implementing DMPC methodologies, local decisions must follow a global control objective to maintain the performance and stability of the system. In this paper, we propose reinforcement learning (RL) in the upper-level negotiation layer of a DMPC control system to search for a consensus that yields global decisions. This requires an upper-level negotiator agent to provide a final solution for each lower-level DMPC agent when several candidate control actions are available.
A PG-DMPC (policy gradient DMPC) algorithm is proposed to carry out the negotiation. A deep neural network trained by the policy gradient algorithm [29] takes as inputs the local decisions of a multi-agent FL-DMPC (fuzzy logic DMPC) [30]. It then provides negotiation coefficients, applied to reach a consensus on the global decisions from the pairwise local decisions, so as to achieve the global goal and the stability requirements of the FL-DMPC multi-agent control within the following control architecture (Figure 1).
At each time instant t, a local decision is the control sequence defined as follows:
$$U(t) = \left[ u(t), u(t+1), \ldots, u(t+H_p-1) \right],$$
where $u(t)$ is the control action at instant $t$, and $H_p$ is the prediction horizon of the local MPC controllers in the lower layer. MPC agent $i$, with $i \in \{1, 2, \ldots, N\}$ of $N$ agents, has local decisions $U_i = \{ U_i^{f_1}(t), U_i^{f_2}(t), \ldots, U_i^{f_{M_i}}(t) \}$ resulting from $M_i$ pairwise negotiations in the FL-DMPC lower layer. To achieve a global decision $U_i^f$, the following agreement is proposed:
$$U_i^f(t) = \sum_{m=1}^{M_i} U_i^{f_m}(t) \cdot \alpha_i^m,$$
where $\alpha_i^m$ are the outputs of the deep neural network, acting as negotiation coefficients that weight the local decisions $U_i^{f_m}$ in order to achieve the control objective.

3. Methodology

Given a problem to be solved by the RL method, the environment is modelled as a Markov decision process with state space $S$, action space $A$, state transition function $P(s_{t+1} \mid s_t, a_t)$, with $s \in S$, $a \in A$, and $t$ the time step, and a reward function $r(s_t, a_t)$. The training of an agent consists of the interaction of the agent with the environment: the agent sends an action $a_t$ to the environment, which returns a state $s_{t+1}$ and a reward $r_{t+1}$ that qualifies $a_t$ according to the environment objectives. Consequently, each training episode generates an experience composed of the sequence $\{ s_t, a_t, r_{t+1}, s_{t+1}, \ldots, s_{T-1}, a_{T-1}, s_T, r_T \}$, with $t \in \{0, 1, \ldots, T\}$, where $T$ is the time horizon.
The PG algorithm selects $a_t$ based on a stochastic policy $\pi_\theta(a_t \mid s_t)$ that maps states to action probabilities, $\pi : S \to P(A)$. For this task, the algorithm uses a deep neural network as an approximation function of the stochastic policy $\pi_\theta(a_t, s_t)$. The optimization of the policy parameters $\theta$ is performed through gradient ascent. This optimization aims to maximize, for each state visited in the training episode, for $t = 1, 2, \ldots, T-1$, the discounted future reward,
$$G_t = \sum_{t=0}^{T} \gamma^t r_t,$$
where $\gamma$ is a discount factor, and $r_t$ is the reward received at time step $t$. Therefore, the objective function is
$$G_t \cdot \nabla_\theta \log \pi_\theta(a_t \mid s_t).$$
The score function $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$ allows the optimization to be performed without requiring the dynamic model of the system, using the gradient of the logarithm of $\pi_\theta(a_t \mid s_t)$, which expresses the probability that the agent selects $a_t$ given $s_t$.
The estimate of the discounted future reward $G_t$ made by the Monte Carlo method has high variance, leading to unstable learning updates, slow convergence, and thus slow learning of the optimal policy. The baseline method subtracts from $G_t$ a value given by a baseline function $b(s; \varphi)$ to address the variance of the estimate,
$$\hat{A}_t = G_t - b_\varphi(s),$$
where $\hat{A}_t$ is called the advantage function. Consequently, the objective function becomes
$$\hat{A}_t \cdot \nabla_\theta \log \pi_\theta(a_t \mid s_t).$$
The exploration policy used by PG is based on a categorical distribution, where $\pi_\theta(a \mid s) = P(a \mid s; \theta)$ is a discrete probability distribution that describes the probability $\pi(s, a; \theta)$ that the policy takes action $a$ from a set of $k$ actions, with $i \in \{1, \ldots, k\}$, given a state $s$, with the probability of each action listed separately and the sum of the probabilities of all actions equal to one, $\sum_{i=1}^{k} p_i = 1$.
Figure 2 shows the policy gradient scheme for training the negotiating agent. The optimization of the deep neural network is based on the input data collected during training under a categorical-distribution exploration policy. Model knowledge is not necessary; only the local decisions (as the state) and the control objectives (as the reward, a consequence of the action at the previous time step) are employed to reach an agreement among the local decisions and obtain the global decisions of the agents.
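To make the update rule concrete, the following minimal sketch (Python with NumPy, not the MATLAB implementation used in this work) illustrates a REINFORCE-style policy gradient update with an optional baseline: the discounted returns are computed from one episode, the baseline value is subtracted to form the advantage, and the parameters are moved along the score-function gradient. All names (`pg_update`, `grad_log_pi`, etc.) are illustrative assumptions.

```python
import numpy as np

def discounted_returns(rewards, gamma):
    """Monte Carlo estimate of G_t for every step of one episode."""
    G = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        G[t] = running
    return G

def pg_update(theta, episode, gamma, lr, baseline=None):
    """One policy gradient (REINFORCE) update with an optional baseline.

    episode  : list of (state, reward, grad_log_pi) tuples, where grad_log_pi is the
               score function grad_theta log pi(a_t | s_t) already evaluated by the
               policy network (assumed precomputed and shaped like theta).
    baseline : callable state -> scalar b(s); if None, plain REINFORCE is used.
    """
    rewards = [r for (_, r, _) in episode]
    G = discounted_returns(rewards, gamma)
    grad = np.zeros_like(theta)
    for t, (s, _, grad_log_pi) in enumerate(episode):
        advantage = G[t] - (baseline(s) if baseline is not None else 0.0)
        grad += advantage * grad_log_pi        # A_t * grad_theta log pi(a_t | s_t)
    return theta + lr * grad                   # gradient ascent step
```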

3.1. PG-DMPC Algorithm

This section details the PG-DMPC algorithm implemented as a negotiating agent required for the upper layer of a distributed MPC control system. The aim of the upper layer PG-DMPC algorithm is to achieve a consensus among all candidate control sequences for each agent to obtain the final control action implemented.
In order to reach consensus, the deep neural network receives the available control sequences of each agent as the state $s_t$,
$$s_t = \left[ U_1(t), U_2(t), \ldots, U_N(t) \right],$$
where $U_i = \{ U_i^{f_1}(t), U_i^{f_2}(t), \ldots, U_i^{f_{M_i}}(t) \}$, $N$ is the number of agents, $i \in \{1, 2, \ldots, N\}$, and $M_i$ is the total number of pairwise fuzzy negotiations performed by agent $i$ in the lower layer. The total number of control sequences received by the PG-DMPC upper level is defined as $M = \sum_{i=1}^{N} M_i$. The states $s_t \in S$, where $S \subseteq \mathbb{R}^{N \cdot M \cdot H_p}$ is a compact subspace.
The action $a_t \in A$, where $A \subseteq \mathbb{R}^{N \cdot M}$ is a compact set; the actions $a_t$ are coefficients weighting the control sequences to compute a suitable final control sequence $U_i^f$ to be applied in the plant for agent $i$, this procedure being the way to achieve consensus among agents in the upper layer. Particularly, the action vector is defined as $a_t = [\phi_1, \ldots, \phi_N]$, where $\phi_i = [\alpha_i^1, \alpha_i^2, \ldots, \alpha_i^{M_i-1}]$, $i \in \{1, \ldots, N\}$. Note that for agent $i$, the number of coefficients provided is $M_i - 1$ because the last one is obtained as the complement to 1 in order to reduce the computational load, provided that $\sum_{m=1}^{M_i-1} \alpha_i^m < 1$ due to normalization when training the deep neural network,
$$\alpha_i^{M_i} = 1 - \sum_{m=1}^{M_i-1} \alpha_i^m.$$
Finally, the reward provided by the environment at each time step is defined in Section 4.2.1 because it depends on the particular case study and its global control objective.
Considering the states and actions defined above, the negotiation developed in this paper is described in Algorithm 1:
Algorithm 1 Proposed PG-DMPC negotiation algorithm for multiple agents
At each time step t for each agent i:
1.
The states $s_t$ are taken from the environment (the low-level control layer) by the RL upper layer (the deep neural network), defined by the non-linear function $\zeta$.
2.
The deep neural network takes the states as inputs in order to provide the actions $a_t$, which are the negotiation coefficients $\alpha_i^m$, for $i \in \{1, \ldots, N\}$ and $m \in \{1, \ldots, M_i - 1\}$, as outputs,
$$a_t = \zeta(s_t).$$
The deep neural network $\zeta$ outputs satisfy $\sum_{m=1}^{M_i-1} \alpha_i^m \leq 1$ because this constraint is imposed during training for normalization.
3.
The final control sequence $U_i^f$ is obtained considering the actions provided by $\zeta$: $U_i^f(t) = \sum_{m=1}^{M_i} U_i^{f_m}(t) \cdot \alpha_i^m$.
4.
The control sequences of all agents are aggregated into $U_N^f = [U_1^f, U_2^f, \ldots, U_N^f]$. The global cost $J_N^f(x_N(t), U_N^f(t))$ is computed and compared with the global cost from the previous time instant; the aggregated sequence is applied only if the cost decreases. Otherwise, the control sequences from the previous time step that satisfy the stability check, $U_N^s(t+1)$, are applied. Hence, the global cost function is
$$J_N^f\left( x_N(t), U_N^f(t) \right) = \sum_{k=1}^{H_p-1} \left( \left\| x_N(t+k) - x_{rN}(t+k) \right\|_{Q_N}^2 + \left\| u_N(t+k) - u_{rN}(t+k) \right\|_{R_N}^2 \right) + \left\| x_N(t+H_p) - x_{rN}(t+H_p) \right\|_{P_N}^2,$$
with the weighting matrices $Q_N = (Q_i)_{i \in \mathcal{N}}$ and $R_N = (R_i)_{i \in \mathcal{N}}$, the terminal cost matrix $P_N = (P_i)_{i \in \mathcal{N}}$, and $x_{rN}$ and $u_{rN}$ the global state and input references, calculated by a procedure to remove offset based on [30].
The proposed PG-DMPC negotiation algorithm is performed in the upper-level control layer of the control architecture (Figure 1), where the low-level control layer is based on fuzzy-logic DMPC and is detailed in the next section.
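For illustration, Steps 1–3 of Algorithm 1 can be sketched as follows (a minimal Python/NumPy sketch, assuming the trained deep neural network is available as a callable `policy`; names are illustrative and this is not the MATLAB implementation used in the paper). The stability check of Step 4 is omitted.

```python
import numpy as np

def negotiate(policy, U_candidates):
    """One PG-DMPC negotiation step (Steps 1-3 of Algorithm 1).

    policy       : trained network mapping the stacked state s_t to the concatenated
                   coefficients [alpha_i^1, ..., alpha_i^(M_i - 1)] of every agent.
    U_candidates : list over agents; U_candidates[i] is an (M_i, H_p) array holding the
                   candidate control sequences U_i^{f_m}(t) from the FL-DMPC layer.
    Returns the final control sequences U_i^f(t) for all agents.
    """
    # Step 1: the state collects all candidate sequences from the low-level layer.
    s_t = np.concatenate([U.ravel() for U in U_candidates])

    # Step 2: the network provides M_i - 1 coefficients per agent.
    a_t = np.asarray(policy(s_t))

    U_final, k = [], 0
    for U in U_candidates:
        M_i = U.shape[0]
        alpha = a_t[k:k + M_i - 1]
        k += M_i - 1
        # The last coefficient is the complement to 1 (normalization constraint).
        alpha = np.append(alpha, 1.0 - alpha.sum())
        # Step 3: weighted consensus U_i^f(t) = sum_m alpha_i^m * U_i^{f_m}(t).
        U_final.append(alpha @ U)
    return U_final
```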

3.2. Distributed MPC

The low-level control layer is a DMPC in which each agent $i \in \{1, 2, 3, 4\}$ negotiates with its neighbours in a pairwise manner. Local models are defined as:
$$x_i(t+1) = A_i x_i(t) + B_{ii} u_i(t) + B_{d_i} d_i(t) + w_i(t),$$
where $t \in \mathbb{N}_0^+$ denotes the time instant; $x_i \in \mathbb{R}^{q_i}$, $u_i \in \mathbb{R}^{r_i}$, and $d_i \in \mathbb{R}^{d_i}$ are, respectively, the state, input, and disturbance vectors of each subsystem $i \in \mathcal{N}$, constrained in the convex sets containing the origin in their interior, $X_i = \{ x_i \in \mathbb{R}^{q_i} : A_{x,i} x_i \leq b_{x,i} \}$ and $U_i = \{ u_i \in \mathbb{R}^{r_i} : A_{u,i} u_i \leq b_{u,i} \}$, respectively; and $A_i \in \mathbb{R}^{q_i \times q_i}$, $B_{ii} \in \mathbb{R}^{q_i \times r_i}$, and $B_{d_i}$ are matrices of proper dimensions, which can be found in [30].
The vector $w_i \in \mathbb{R}^{q_i}$ represents the coupling with other subsystems $j$ belonging to the set of neighbors $\mathcal{N}_i = \{ j \in \mathcal{N} \setminus \{i\} : B_{ij} \neq 0 \}$, i.e.,
$$w_i(t) = \sum_{j \in \mathcal{N}_i} B_{ij} u_j(t),$$
where $u_j \in \mathbb{R}^{r_j}$ is the input vector of subsystem $j \in \mathcal{N}_i$, and matrix $B_{ij} \in \mathbb{R}^{q_i \times r_j}$ models the input coupling between subsystems $i$ and $j$. Moreover, $w_i$ is bounded in a convex set $W_i \triangleq \bigoplus_{j \in \mathcal{N}_i} B_{ij} U_j$ due to the system constraints.
The linear discrete-time state-space model of the global system is
$$\tilde{x}_N(t+1) = A_N \tilde{x}_N(t) + B_N \tilde{u}_N(t) + B_{\tilde{d}_N} \tilde{d}_N(t),$$
where $\tilde{x}_N$ and $\tilde{u}_N$ are the state and input vectors, respectively, $\tilde{d}_N$ is the disturbance vector, and $A_N$, $B_N$, and $B_{\tilde{d}_N}$ are the corresponding matrices of the global system.
The low-level control layer performs the FL-DMPC algorithm for multiple agents. Specifically, fuzzy-based negotiations are made in pairs, considering the couplings with neighboring subsystems, which are assumed to hold their current trajectories. To this end, a shifted sequence of agent $i$ is used, which is defined by appending $K_i x_i(t+N_p)$ to the sequence chosen at the previous time step, $U_i(t-1)$:
$$U_i^s(t) = \begin{bmatrix} u_i(t+1 \mid t-1) \\ u_i(t+2 \mid t-1) \\ \vdots \\ u_i(t+N_p-1 \mid t-1) \\ K_i x_i(t+N_p \mid t-1) \end{bmatrix} = \begin{bmatrix} u_i^s(t) \\ u_i^s(t+1) \\ \vdots \\ u_i^s(t+N_p-2) \\ u_i^s(t+N_p-1) \end{bmatrix}.$$
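As a small illustration, the shifted sequence above can be assembled as in the following Python/NumPy sketch (names are illustrative; the terminal gain $K_i$ and the predicted terminal state are assumed to be available):

```python
import numpy as np

def shifted_sequence(U_prev, x_terminal, K_i):
    """Build U_i^s(t): drop the first input of U_i(t-1) and append K_i * x_i(t+N_p | t-1).

    U_prev     : (N_p, r_i) array, input sequence chosen at the previous time step
    x_terminal : (q_i,) predicted terminal state x_i(t + N_p | t - 1)
    K_i        : (r_i, q_i) local terminal control gain
    """
    terminal_input = (K_i @ x_terminal).reshape(1, -1)
    return np.vstack([U_prev[1:], terminal_input])
```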
To make the paper self-contained, a brief description of the FL-DMPC algorithm [30] is presented here (Algorithm 2):
Algorithm 2 Multi-agent FL-DMPC algorithm
At each time step t for each agent i
1.
Firstly, agent $i$ measures its local state $\tilde{x}_i(t)$ and disturbance $\tilde{d}_i(t)$.
2.
Agent $i$ calculates its shifted trajectory $U_i^s(t)$ and sends it to its neighbors.
3.
Agent $i$ minimizes its cost function considering that neighbor $j \in \mathcal{N}_i$ applies its shifted trajectory $U_j^s(t)$. It is assumed that the rest of the neighboring subsystems $l \in \mathcal{N}_i \setminus \{j\}$ follow their current control trajectories $U_l^s(t)$. Specifically, agent $i$ solves
$$U_i^*(t) = \arg\min_{U_i(t)} J_i\left( x_i(t), U_i(t), U_j^s(t), U_l^s(t) \right),$$
subject to
$$x_i(t+k+1) = A_i x_i(t+k) + B_{ii} u_i(t+k) + B_{ij} u_j(t+k) + B_{d_i} d_i(t+k) + \sum_{l \in \mathcal{N}_i \setminus \{j\}} B_{il} u_l(t+k),$$
and the constraints
$$\begin{aligned}
& x_i(t) = \tilde{x}_i(t), \quad i \in \mathcal{N}, \\
& x_i(t+k) \in X_i, \quad k = 0, \ldots, N_p - 1, \\
& x_i(t+N_p) \in \Omega_i, \\
& u_i(t+k) \in U_i, \quad k = 0, \ldots, N_p - 1, \\
& u_j(t+k) = u_j^s(t+k), \quad k = 0, \ldots, N_p - 1, \\
& u_l(t+k) = u_l^s(t+k), \quad k = 0, \ldots, N_p - 1, \\
& d_i(t+1) = d_i(t), \quad d_i(0) = \tilde{d}_i(t),
\end{aligned}$$
where the set $\Omega_i$ is imposed as the terminal state constraint of agent $i$. Details regarding the calculation of $\Omega_i$ are given in [30].
4.
Agent $i$ optimizes its cost $J_i(\cdot)$ again, keeping the optimal input sequence $U_i^*(t)$ obtained previously, in order to find the input sequence it wishes its neighbor to apply, $U_j^{w_i}(t)$. Here, it is also assumed that subsystems $l$ follow their current trajectories. To this end, agent $i$ solves
$$U_j^{w_i}(t) = \arg\min_{U_j(t)} J_i\left( x_i(t), U_i^*(t), U_j(t), U_l^s(t) \right),$$
subject to
$$x_i(t+k+1) = A_i x_i(t+k) + B_{ii} u_i(t+k) + B_{ij} u_j(t+k) + B_{d_i} d_i(t+k) + \sum_{l \in \mathcal{N}_i \setminus \{j\}} B_{il} u_l(t+k),$$
and the constraints
$$\begin{aligned}
& x_i(t) = \tilde{x}_i(t), \quad i \in \mathcal{N}, \\
& x_i(t+k) \in X_i, \quad k = 0, \ldots, N_p - 1, \\
& x_i(t+N_p) \in \Omega_i, \\
& u_i(t+k) = u_i^*(t+k), \quad k = 0, \ldots, N_p - 1, \\
& u_j(t+k) \in U_j, \quad j \in \mathcal{N}_i, \quad k = 0, \ldots, N_p - 1, \\
& u_l(t+k) = u_l^s(t+k), \quad k = 0, \ldots, N_p - 1, \\
& d_i(t+1) = d_i(t), \quad d_i(0) = \tilde{d}_i(t),
\end{aligned}$$
5.
Agent $i$ sends $U_j^{w_i}(t)$ to agent $j$ and receives $U_i^{w_j}(t)$.
6.
For each agent $i \in \mathcal{N}$, the triple of possible inputs is $\{ U_i^s(t), U_i^{w_j}(t), U_i^*(t) \}$. Since the wished control sequence $U_i^{w_j}$ is computed by neighbor $j$ without considering the state constraints of agent $i$, it is necessary to check whether the state constraints of agent $i$ are satisfied after applying $U_i^{w_j}$; otherwise, it is excluded from the fuzzification process. Afterwards, fuzzy negotiation is applied to compute the final sequence $U_i^{f_m}$. Similarly, $U_j^{f_m}$ is computed.
7.
A resulting pairwise fuzzy negotiation sequence $U_{ij}^{f_m}(t) = \left[ U_i^{f_m}(t), U_j^{f_m}(t), U_l^s(t) \right]$ is defined based on $U_i^{f_m}$ and $U_j^{f_m}$, assuming that the rest of the subsystems $l \in \mathcal{N} \setminus \{i, j\}$ follow their predefined trajectories.
8.
Agent $i$ sends its cost for the fuzzy and stabilizing control inputs to its neighbours and vice versa. Let us define $U_{ij}^s = \left[ U_i^s(t), U_j^s(t), U_l^s(t) \right]$; if the condition
$$\sum_{l \in \mathcal{M}_i \cup \mathcal{M}_j \cup \{i, j\}} J_l\left( x_l(t), U_{ij}^{f_m} \right) \leq \sum_{l \in \mathcal{M}_i \cup \mathcal{M}_j \cup \{i, j\}} J_l\left( x_l(t), U_{ij}^s \right)$$
holds, then stability is guaranteed, and thus $U_i^{f_m}(t)$ is sent to the upper control layer. Otherwise, $U_i^s(t)$ is sent.
Hence, the cost function $J_i(\cdot)$ is
$$\begin{aligned} J_i\left( x_i(t), U_i(t), U_j(t), U_l(t) \right) = {} & \sum_{k=1}^{H_p-1} \Big( \left\| x_i(t+k) - x_{ri}(t+k) \right\|_{Q_i}^2 + \left\| u_i(t+k) - u_{ri}(t+k) \right\|_{R_i}^2 \\ & + \left\| u_j(t+k) - u_{rj}(t+k) \right\|_{R_i}^2 + \left\| u_l(t+k) - u_{rl}(t+k) \right\|_{R_i}^2 \Big) + \left\| x_i(t+H_p) - x_{ri}(t+H_p) \right\|_{P_i}^2, \end{aligned}$$
with $Q_i$ being a positive semi-definite matrix, $R_i$ and $P_i$ being positive-definite matrices, and $x_{ri}$ and $u_{ri}$ being the state and input references calculated by a procedure to remove offset based on [30].
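To illustrate the kind of optimization solved in Step 3 of Algorithm 2, the following simplified Python sketch uses the CVXPY modeling library (an assumption for illustration; this is not the authors' MATLAB-based implementation). It formulates agent $i$'s problem for a single neighbor $j$ applying a known trajectory, with box input bounds standing in for the polyhedral sets, constant references, and the terminal set $\Omega_i$ and terminal cost omitted for brevity. All names are illustrative.

```python
import cvxpy as cp
import numpy as np

def local_mpc_step(A_i, B_ii, B_ij, Bd_i, x0, d0, U_j, Q_i, R_i, Np,
                   x_ref, u_ref, u_min, u_max):
    """Simplified sketch of agent i's optimization (Step 3 of Algorithm 2).

    U_j is the neighbor's known (shifted) trajectory, shape (Np, r_j); the local
    disturbance d0 is held constant over the horizon, as in Algorithm 2.
    """
    r_i = B_ii.shape[1]
    U_i = cp.Variable((Np, r_i))
    x = x0
    cost = 0
    constraints = []
    for k in range(Np):
        # Local prediction model with the neighbor's coupling input.
        x = A_i @ x + B_ii @ U_i[k] + B_ij @ U_j[k] + Bd_i @ d0
        cost += cp.quad_form(x - x_ref, Q_i) + cp.quad_form(U_i[k] - u_ref, R_i)
        constraints += [U_i[k] >= u_min, U_i[k] <= u_max]   # stands in for the set U_i
    cp.Problem(cp.Minimize(cost), constraints).solve()
    return U_i.value
```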

4. Case Study

This section describes the coupled eight-tank plant based on the quadruple tank process in which the proposed PG-DMPC algorithm was implemented together with the previously detailed low-level control layer.

4.1. Plant Description and Control Objective

The eight-coupled-tanks plant comprises eight interconnected tanks (Figure 3): four upper tanks (3, 4, 7, and 8) that discharge into the lower ones (1, 2, 5, and 6), which, in turn, discharge into sink tanks. The plant is controlled by four pumps whose flows are divided through six manually operated three-way valves $\gamma_v$, with $v \in \{1, 2, \ldots, 6\}$.
The system is divided into four subsystems, $s \in \{1, 2, 3, 4\}$: Tanks 1 and 3 are part of Subsystem 1; Tanks 2 and 4 form Subsystem 2; Tanks 5 and 7 belong to Subsystem 3; and the rest of the tanks form Subsystem 4. The level of tank $n$ is denoted $h_n$, with $n \in \{1, 2, \ldots, 8\}$, the controlled variables being the levels $h_1$, $h_2$, $h_5$, and $h_6$, renamed $h_s^c$ for each agent as $h_1^c = h_1$, $h_2^c = h_2$, $h_3^c = h_5$, and $h_4^c = h_6$. The manipulated variable $q_s$ is the flow given by the pump of each Subsystem $s$. The tank level operating point $h_n^0$, for $n \in \{1, 2, \ldots, 8\}$, is established as $h_1^0 = 0.10$, $h_2^0 = 0.15$, $h_3^0 = 0.07$, $h_4^0 = 0.03$, $h_5^0 = 0.10$, $h_6^0 = 0.15$, $h_7^0 = 0.025$, and $h_8^0 = 0.10$ (units: meters). In addition, the operating points of the pump flows $q_s^0$ are $q_1^0 = 0.142$, $q_2^0 = 0.421$, $q_3^0 = 0.421$, and $q_4^0 = 0.140$ (units: m³/h).
The aggregated state vector and input vector are defined as
$$\tilde{x}_N = \left[ h_1^c(t) - h_1^0,\; h_2^c(t) - h_2^0,\; h_3(t) - h_3^0,\; h_4(t) - h_4^0,\; h_3^c(t) - h_5^0,\; h_4^c(t) - h_6^0,\; h_7(t) - h_7^0,\; h_8(t) - h_8^0 \right],$$
$$\tilde{u}_N = \left[ q_1(t) - q_1^0,\; \ldots,\; q_4(t) - q_4^0 \right].$$
The disturbance vector is defined as $\tilde{d}_N = \left[ d_1(t) - d_1^0, \ldots, d_4(t) - d_4^0 \right]$.
The control objective is to solve a tracking problem to reach the reference $\tilde{T}_N$,
$$\tilde{T}_N = \left[ h_1^T(t) - h_1^0,\; h_2^T(t) - h_2^0,\; 0,\; 0,\; h_3^T(t) - h_5^0,\; h_4^T(t) - h_6^0,\; 0,\; 0 \right],$$
where $h_s^T$ is the target level of the controlled variable $h_s^c$.
For the upper tank levels, the objective is to keep the operating point despite the disturbance $\tilde{d}_N$ affecting the system. Additionally, the state and input vectors are constrained by
$$-h_n^0 < \tilde{x}_n(t) \leq 0.08, \qquad -q_i^0 < \tilde{u}_i(t) \leq 0.04,$$
with $n \in \{1, 2, \ldots, 8\}$ and $i \in \{1, 2, 3, 4\}$.
The linear state space models of the global system and subsystems are detailed in [30].

4.2. Negotiation Framework

In this case study, there are four agents. Subsystems 1 and 4 have only one neighbour each (Subsystems 2 and 3, respectively), because coupling only exists with them. On the other hand, Subsystems 2 and 3 have two neighbours each. Then, the low-level FL-DMPC control layer provides agent $i$ with the following control sequences: $U_1 = \{ U_1^{f_1}(t) \}$, $U_2 = \{ U_2^{f_1}(t), U_2^{f_2}(t) \}$, $U_3 = \{ U_3^{f_1}(t), U_3^{f_2}(t) \}$, and $U_4 = \{ U_4^{f_1}(t) \}$, obtained by performing pairwise negotiation among agents. In particular, $M_1 = 1$, $M_2 = 2$, $M_3 = 2$, and $M_4 = 1$ in Equation (1).
Due to the particular configuration of the eight-coupled-tanks system, negotiation is not required in the upper layer for Subsystems 1 and 4, and therefore $U_1^f(t) = U_1^{f_1}$ and $U_4^f(t) = U_4^{f_1}$. Negotiation is only needed for Agents 2 and 3, and the state is defined as:
$$s_t = \left[ U_2(t), U_3(t) \right].$$
The actions $a_t$ provided by the deep neural network are:
$$a_t = \left[ \phi_2, \phi_3 \right] = \zeta(s_t; \theta),$$
where $\phi_2 = \alpha_2^1$ and $\phi_3 = \alpha_3^1$. Note that only one weighting coefficient is obtained for each subsystem because the other ones are directly obtained from:
$$\alpha_2^2 = 1 - \alpha_2^1; \qquad \alpha_3^2 = 1 - \alpha_3^1.$$
Then,
$$U_2^f(t) = U_2^{f_1}(t) \cdot \alpha_2^1 + U_2^{f_2}(t) \cdot \alpha_2^2,$$
$$U_3^f(t) = U_3^{f_1}(t) \cdot \alpha_3^1 + U_3^{f_2}(t) \cdot \alpha_3^2.$$
Therefore, considering that the final sequence for each agent is $U_i^f = \left[ u_i^f(t), u_i^f(t+1), \ldots, u_i^f(t+H_p-1) \right]$, with $H_p = 5$ the prediction horizon of each local MPC controller, the manipulated variable $q_s$ to be applied to the system is selected as the first value of the sequence, as is usual in an MPC framework:
$$q_1 = u_1^f(1), \quad q_2 = u_2^f(1), \quad q_3 = u_3^f(1), \quad q_4 = u_4^f(1).$$
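A minimal sketch of how the upper layer wires the four agents of this case study is given below (Python/NumPy; `policy` stands for the trained network returning $\alpha_2^1$ and $\alpha_3^1$, and all names are illustrative assumptions):

```python
import numpy as np

def upper_layer_step(policy, U1, U2, U3, U4):
    """Upper-layer negotiation for the eight-tank case study.

    U1, U4 : (1, Hp) arrays -- a single candidate each, so no negotiation is needed
    U2, U3 : (2, Hp) arrays -- two candidates each, weighted by alpha_2^1 and alpha_3^1
    Returns the final sequences and the pump flows applied to the plant.
    """
    s_t = np.concatenate([U2.ravel(), U3.ravel()])   # state: candidates of Agents 2 and 3
    alpha21, alpha31 = policy(s_t)                   # actions: one coefficient per agent

    U1f, U4f = U1[0], U4[0]                          # Agents 1 and 4 pass through
    U2f = alpha21 * U2[0] + (1.0 - alpha21) * U2[1]  # weighted consensus of candidates
    U3f = alpha31 * U3[0] + (1.0 - alpha31) * U3[1]

    # Receding-horizon implementation: apply the first element of each final sequence.
    q = np.array([U1f[0], U2f[0], U3f[0], U4f[0]])
    return (U1f, U2f, U3f, U4f), q
```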
Figure 4 displays the PG-DMPC algorithm implemented within the control architecture of the case study, where the PG-DMPC receives local decisions from the FL-DMPC layer and provides global decisions according to the control objective.

4.2.1. Reward

The PG algorithm optimizes the weights $\theta$ of the deep neural network $\zeta(s_t; \theta)$, used as an approximation function of the stochastic policy $\pi_\theta(s_t, a_t)$, in order to maximize the discounted future reward $G_t$ (Equation (2)). The reward function is critical for the proper working of the RL, and it is defined heuristically as:
$$r_t(s_t, a_t) = \begin{cases} z_t(a_t), & \text{if } \alpha_2^1 \text{ or } \alpha_3^1 \notin [0, 1], \\ 2000, & \text{if } e_s(t) < e_s^b \ \forall s \text{ and } \alpha_2^1, \alpha_3^1 \in [0, 1], \\ -1000, & \text{otherwise, if } \alpha_2^1 \text{ and } \alpha_3^1 \in [0, 1], \end{cases}$$
where $e_s(t) = \left( h_s^T(t) - h_s^c(t) \right)^2$ is the squared tracking error, and $e_s^b$ is the error threshold for each subsystem. Note that a positive reward corresponds to a desired situation, and negative rewards act as penalties for the RL agent.
The maximum reward is given when the tracking error does not exceed the threshold value for any subsystem and the weighting coefficients provided by the deep neural network belong to $[0, 1]$. A small penalty is given when any error exceeds the threshold but still $\alpha_2^1, \alpha_3^1 \in [0, 1]$. Finally, a large penalty defined by $z_t$ is applied if $\alpha_2^1$ or $\alpha_3^1 \notin [0, 1]$, in order to obtain normalized values for the coefficients:
$$z_t(a_t) = -3000 - \left( \alpha_2^1 + \alpha_3^1 \right) \times 1000.$$
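The reward logic can be summarized in the following Python sketch; note that the sign and form of the large penalty $z_t$ are reconstructed under assumption from the description above, and all names are illustrative:

```python
import numpy as np

def reward(h_target, h_controlled, alpha21, alpha31, e_threshold):
    """Sketch of the heuristic reward of Section 4.2.1."""
    in_range = 0.0 <= alpha21 <= 1.0 and 0.0 <= alpha31 <= 1.0
    if not in_range:
        # Large penalty z_t(a_t) to keep the coefficients normalized (signs assumed).
        return -3000.0 - (alpha21 + alpha31) * 1000.0
    e = (np.asarray(h_target) - np.asarray(h_controlled)) ** 2  # squared tracking error e_s
    if np.all(e < e_threshold):
        return 2000.0      # every subsystem is within its error threshold
    return -1000.0         # small penalty: some subsystem exceeds its threshold
```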

4.2.2. Discrete Action Framework

Discretization is the process of transferring continuous functions, models, variables, and equations to discrete counterparts. In our case, the actions of the negotiating agent are discrete values, which in Equations (20) and (21) give a discrete weight to the continuous sequences for consensus. More specifically, the discretization is carried out as $\phi_i = 0.05 \cdot x$, with $x \in \{1, 2, \ldots, 20\}$. A discrete action framework is employed to obtain a shorter training time, simpler calculations, a lower computational cost, and faster convergence. According to the discrete stochastic policy, each output of the deep neural network is the probability of assigning a combination of discrete values of $\phi_2$ and $\phi_3$ given a specific state at the inputs.
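Under these assumptions, the discrete action set and the categorical selection can be sketched as follows (Python/NumPy; the 20 × 20 enumeration of coefficient pairs and the function names are illustrative):

```python
import itertools
import numpy as np

# Discrete values of each coefficient: phi_i = 0.05 * x, x = 1, ..., 20.
PHI_VALUES = 0.05 * np.arange(1, 21)

# Each network output is the probability of one (phi_2, phi_3) combination,
# i.e. 20 x 20 = 400 discrete actions for this case study.
ACTIONS = list(itertools.product(PHI_VALUES, PHI_VALUES))

def select_action(probabilities, greedy=False):
    """Pick a (phi_2, phi_3) pair from the categorical distribution.

    probabilities : vector of length len(ACTIONS) produced by the policy network.
    greedy        : maximum-likelihood selection, as used during validation.
    """
    if greedy:
        idx = int(np.argmax(probabilities))
    else:
        idx = int(np.random.choice(len(ACTIONS), p=probabilities))
    return ACTIONS[idx]
```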

4.3. Training

This section analyzes the influence of the baseline method by comparing two configurations in terms of convergence speed and stability during the search for the optimal policy. Table 1 shows the two PG configurations compared and trained.
The policy and baseline networks employed as approximation functions are set up with a fixed learning rate of 0.01 using the Adam optimizer. The policy deep neural network consists of three fully connected hidden layers, whereas the baseline network uses one fully connected hidden layer.
Training and results are obtained using the MATLAB Reinforcement Learning Toolbox. The two configurations were trained for 1000 episodes with a time horizon $T = 120$ and an environment simulation sample time $T_s = 5$ s. Consequently, each $T_s$ is an RL training time step. Initial randomization of the levels, taken from the interval $[h_n^0 - 0.01, h_n^0 + 0.01]$, is required and performed. As a remark, although the agents were trained with the exploratory policy based on a categorical distribution, their policies during validation follow greedy selection, which picks the action with maximum likelihood.
Figure 5 displays the training with and without the baseline method under the categorical-distribution exploration policy. Both cases take a fairly similar time to approach the high-reward zone; nevertheless, the baseline case provides more stability and a somewhat faster approach to the high-reward zone.

4.4. Performance Indexes

In order to evaluate the proposal, the following performance indexes were established.
MPC global cost function,
$$P_e = \sum_{t=0}^{T} J_N(\cdot),$$
where $J_N$ is detailed in Equation (7).
Sum of the integrals of squared errors of the controlled levels,
$$\mathrm{ISE} = \sum_s \mathrm{ISE}_s,$$
with
$$\mathrm{ISE}_s = \int_0^T \left( h_s^c(t) - h_s^T(t) \right)^2 dt.$$
The sum of the squared differences between the controlled levels $h_s^c$ and their targets $h_s^T$ is
$$e(t) = \sum_s \left( h_s^c(t) - h_s^T(t) \right)^2.$$
The sum of the pumping energies $PE_s$ for each pump $s$, computed as the average pumping energy over the prediction horizon, which is proportional to the water flows provided by the pumps:
$$PE(t) = \sum_s PE_s(t),$$
with
$$PE_s(t) = \frac{0.04}{3600 \, H_p} \sum_{k=1}^{H_p} q_s(t+k),$$
and $s \in \{1, 2, 3, 4\}$ for all performance indexes.
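As an illustration, the ISE, $e$, and pumping energy indexes can be computed from simulation data as in the sketch below (Python/NumPy; the discrete-time approximation of the integral and all names are assumptions, and $P_e$ is omitted since it requires evaluating $J_N$):

```python
import numpy as np

def performance_indexes(h_c, h_T, q_seq, Ts=5.0, Hp=5):
    """Sketch of the ISE, e, and PE indexes of Section 4.4.

    h_c   : (T, 4) controlled levels h_s^c over the simulation
    h_T   : (T, 4) reference levels h_s^T
    q_seq : (T, 4, Hp) predicted pump flows q_s(t+1), ..., q_s(t+Hp) at each step
    """
    err2 = (h_c - h_T) ** 2
    ISE = float(np.sum(err2) * Ts)          # rectangle-rule approximation of the integral
    e = float(np.sum(err2))                 # accumulated squared level differences
    PE = float(np.sum(0.04 / (3600.0 * Hp) * q_seq.sum(axis=2)))  # total pumping energy
    return ISE, e, PE
```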

5. Results

Results of the proposed PG-DMPC are presented in this section. The objective of the PG-DMPC algorithm is to provide consensus among local decisions from the FL-DMPC algorithm to achieve the control objectives. The PG-DMPC proposal is compared with the fuzzy DMPC (FL) [30], the DMPC using a cooperative game [31], and the centralized MPC; the FL and cooperative game approaches were also implemented in the upper layer. Furthermore, the influence of the baseline method is evaluated. Three validation cases (Table 2) were employed to demonstrate the proper performance of the proposal under different initial state vectors, reference vectors, and disturbances. The sampling time used in the simulations is $T_s = 5$ s.
Case 1: The reference vector is the same as the one used for training, and the state vector of the operating point was considered as the initial state vector, $\tilde{x}_N = \tilde{x}_N^0$.
Case 2: The reference vector and the initial state vector are the same as those used in Case 1, but in this case the disturbances $\tilde{d}_N = [0.02; 0.02; 0.02; 0.02]$ were included.
Case 3: The reference vector changes at $t > 200$ from the operating point, as the initial reference vector, to $\tilde{T}_N^2$ with $h_1^T = h_1^0 + 0.04$, $h_2^T = h_2^0 + 0.03$, $h_3^T = h_5^0 + 0.04$, and $h_4^T = h_6^0 + 0.03$. The initial state vector was taken from $[h_n^0 - 0.03, h_n^0 + 0.03]$. Note that this initial state vector can fall outside the interval used for the initial state vectors during training.
Table 3, Table 4 and Table 5 show the performance of the proposal and the techniques considered in Cases 1, 2, and 3, respectively. In the same order as the validation cases, Figure 6, Figure 7 and Figure 8 portray the evolution of the state vectors until the reference vector is reached by Conf. 1 and Conf. 2.
Table 3, Table 4 and Table 5 show that the centralized control has the best results because it has full plant information available for prediction. For validation Case 1, the Coop. game offers better results in $P_e$, ISE, and $e$. Both configurations of the PG-DMPC show an advantage in $P_e$ over the FL, while their ISE and $e$ values are very close to those of FL (Table 3).
In validation Case 2, the Coop. game shows a better $P_e$ than the PG-DMPC and FL techniques; in this case, the worst $P_e$ is given by PG-DMPC. On the other hand, the ISE and $e$ values of these three techniques are very close to each other. Although the measured disturbance was added, an acceptable response was achieved by PG-DMPC (Table 4).
Finally, in Case 3, the Coop. game and FL show better $P_e$ values, while the worst ISE and $e$ are given by FL. The two PG-DMPC configurations perform similarly in the four indices. The PG-DMPC agent has an acceptable response and is comparable to the other techniques even though it was not trained for reference tracking (Table 5).
In addition, Table 6 shows compliance with the stability evaluation according to Step 4 of Algorithm 1 (see Section 3.1). For each configuration and validation case, the table shows the percentage of total time in which the cost $J_N$ decreases without using the backup sequence $U^s$. Both configurations show similar stability, with Case 2 under Conf. 2 standing out.

6. Conclusions

In this paper, the PG-DMPC algorithm was implemented as a negotiating agent in a distributed MPC control system. The proposal generates negotiation coefficients to achieve final control sequences according to the control objectives of the DMPC agents. The key to the proposed PG-DMPC algorithm is the deep neural network trained by the RL method. The results obtained are satisfactory, achieving a successful consensus among the negotiated sequences in the evaluation cases. The proposal shows results similar to the other techniques with which it was compared, even though no prior knowledge of the negotiation problem was required for its training. Therefore, we demonstrated that PG-DMPC is a powerful technique for negotiation problems in multi-agent DMPC control.
The results of the two PG-DMPC configurations are similar because both optimal policies converge to the maximum sum of rewards during training. As an advantage, it is confirmed that the baseline configuration presents a faster convergence to the zone of high reward values; therefore, a stopping criterion linked to the speed of arrival at the high-reward zone would bring significant benefits in computational cost and training time compared with the other configuration. Despite using discrete values, the proposed algorithm shows acceptable tracking of the controlled levels. In addition, the reward can incorporate other criteria in the future to adapt to any control objective. In fact, PG-DMPC could be implemented as a single control layer of the present architecture without knowledge of the local models of the system, which would otherwise be necessary for a centralized control involving higher computational costs.

Author Contributions

Conceptualization, P.V. and M.F.; formal analysis, O.A.-R.; funding acquisition, P.V. and M.F.; investigation, O.A.-R.; methodology, O.A.-R. and P.V.; project administration, P.V. and M.F.; resources, P.V. and M.F.; software, O.A.-R. and M.F.; supervision, P.V. and M.F.; validation, O.A.-R.; visualization, O.A.-R.; writing—original draft, O.A.-R.; writing—review and editing, P.V. and M.F. All authors have read and agreed to the published version of the manuscript.

Funding

This work has been supported by projects PID2019-105434RB-C31 and TED2021-129201B-I00 of the Spanish Government and Samuel Solórzano Foundation Project FS/11-2021.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Qin, S.; Badgwell, T.A. A survey of industrial model predictive control technology. Control. Eng. Pract. 2003, 11, 733–764. [Google Scholar] [CrossRef]
  2. Christofides, P.D.; Scattolini, R.; de la Peña, D.M.; Liu, J. Distributed model predictive control: A tutorial review and future research directions. Comput. Chem. Eng. 2013, 51, 21–41. [Google Scholar] [CrossRef]
  3. Zamarreno, J.M.; Vega, P. Neural predictive control. Application to a highly non-linear system. Eng. Appl. Artif. Intell. 1999, 12, 149–158. [Google Scholar] [CrossRef]
  4. Huang, J.-Q.; Lewis, F. Neural-network predictive control for nonlinear dynamic systems with time-delay. IEEE Trans. Neural Netw. 2003, 14, 377–389. [Google Scholar] [CrossRef] [PubMed]
  5. Fernandez-Gauna, B.; Osa, J.L.; Graña, M. Experiments of conditioned reinforcement learning in continuous space control tasks. Neurocomputing 2018, 271, 38–47. [Google Scholar] [CrossRef]
  6. Sierra, J.E.; Santos, M. Modelling engineering systems using analytical and neural techniques: Hybridization. Neurocomputing 2018, 271, 70–83. [Google Scholar] [CrossRef]
  7. Zhao, H.; Zhao, J.; Qiu, J.; Liang, G.; Dong, Z.Y. Cooperative Wind Farm Control With Deep Reinforcement Learning and Knowledge-Assisted Learning. IEEE Trans. Ind. Inform. 2020, 16, 6912. [Google Scholar] [CrossRef]
  8. Cheng, T.; Harrou, F.; Kadri, F.; Sun, Y.; Leiknes, T. Forecasting of Wastewater Treatment Plant Key Features Using Deep Learning-Based Models: A Case Study. IEEE Access 2020, 8, 184475–184485. [Google Scholar] [CrossRef]
  9. Sutton, R.S.; Barto, A.G. Reinforcement Learning, Second Edition: An Introduction; MIT Press: Cambridge, MA, USA, 2018. [Google Scholar]
  10. Neftci, E.O.; Averbeck, B.B. Reinforcement learning in artificial and biological systems. Nat. Mach. Intell. 2019, 1, 133–143. [Google Scholar] [CrossRef]
  11. Sierra-García, J.; Santos, M. Lookup Table and Neural Network Hybrid Strategy for Wind Turbine Pitch Control. Sustainability 2021, 13, 3235. [Google Scholar] [CrossRef]
  12. Kaelbling, L.P.; Littman, M.L.; Moore, A.W. Reinforcement Learning: A survey. J. Artif. Intell. Res. 1996, 4, 237–285. [Google Scholar] [CrossRef]
  13. Watkins, C.J.C.H. Learning from Delayed Rewards. Ph.D. Thesis, King’s College, Cambridge, UK, 1989. [Google Scholar]
  14. Rummery, G.A.; Niranjan, M. Online Q-Learning Using Connectionist Systems; Citeseer: Princeton, NJ, USA, 1994; Volume 37. [Google Scholar]
  15. Sutton, L.K.R. Model-based reinforcement learning with an approximate, learned model. In Proceedings of the Ninth Yale Workshop on Adaptive and Learning Systems, New Haven, CT, USA, 10–12 June 1996; Volume 1996, pp. 101–105. [Google Scholar]
  16. Baxter, J.; Bartlett, P.L. Infinite-horizon policy-gradient estimation. J. Artif. Intell. Res. 2001, 15, 319–350. [Google Scholar] [CrossRef]
  17. Sutton, R.S.; McAllester, D.; Singh, S.; Mansour, Y. Policy Gradient Methods for Reinforcement Learning with Function Approximation. In Proceedings of the Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 1999; Volume 12. [Google Scholar]
  18. Hasselt, H.V. Reinforcement learning in continuous state and action spaces. In Reinforcement Learning; Springer: Berlin/Heidelberg, Germany, 2012; pp. 207–251. [Google Scholar]
  19. Sierra-García, J.E.; Santos, M. Performance Analysis of a Wind Turbine Pitch Neurocontroller with Unsupervised Learning. Complexity 2020, 2020, e4681767. [Google Scholar] [CrossRef]
  20. Sierra-García, J.E.; Santos, M. Improving Wind Turbine Pitch Control by Effective Wind Neuro-Estimators. IEEE Access 2021, 9, 10413–10425. [Google Scholar] [CrossRef]
  21. Sierra-Garcia, J.E.; Santos, M. Deep learning and fuzzy logic to implement a hybrid wind turbine pitch control. Neural Comput. Appl. 2022, 34, 10503–10517. [Google Scholar] [CrossRef]
  22. Recht, B. A Tour of Reinforcement Learning: The View from Continuous Control. Annu. Rev. Control. Robot. Auton. Syst. 2019, 2, 253–279. [Google Scholar] [CrossRef]
  23. Aponte, O.; Vega, P.; Francisco, M. Avances en Informática y Automática. Master’s Thesis, University of Salamanca, Salamanca, Spain, 2022. [Google Scholar]
  24. Oliver, J.R. A Machine-Learning Approach to Automated Negotiation and Prospects for Electronic Commerce. J. Manag. Inf. Syst. 1996, 13, 83. [Google Scholar] [CrossRef]
  25. Nguyen, T.D.; Jennings, N.R. Coordinating multiple concurrent negotiations. In Proceedings of the 3rd International Conference on Autonomous Agents and Multi-Agent Systems, New York, NY, USA, 19–23 July 2004; pp. 1064–1071. [Google Scholar]
  26. Bakker, J.; Hammond, A.; Bloembergen, D.; Baarslag, T. RLBOA: A Modular Reinforcement Learning Framework for Autonomous Negotiating Agents. In Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems, Montreal, QC, Canada, 13–17 May 2019; p. 9. [Google Scholar]
  27. Javalera, V.; Morcego, B.; Puig, V. Negotiation and Learning in distributed MPC of Large Scale Systems. In Proceedings of the 2010 American Control Conference, Baltimore, MD, USA, 30 June–2 July 2010; pp. 3168–3173. [Google Scholar] [CrossRef]
  28. Kakade, S.M. A Natural Policy Gradient. In Proceedings of the Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2001; Volume 14. [Google Scholar]
  29. Williams, R.J. Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning; Kluwer Academic Publishers: Amsterdam, The Netherlands, 1992; p. 28. [Google Scholar]
  30. Masero, E.; Francisco, M.; Maestre, J.M.; Revollar, S.; Vega, P. Hierarchical distributed model predictive control based on fuzzy negotiation. Expert Syst. Appl. 2021, 176, 114836. [Google Scholar] [CrossRef]
  31. Maestre, J.M.; Muñoz De La Peña, D.; Camacho, E.F. Distributed model predictive control based on a cooperative game. Optim. Control. Appl. Methods 2010, 32, 153–176. [Google Scholar] [CrossRef]
Figure 1. PG-DMPC algorithm proposed in the upper-level PG negotiation control layer within the control architecture.
Figure 2. PG optimization scheme for the required negotiation agent in the upper-level control layer in the distributed MPC-based control system.
Figure 3. Schematic diagram of the eight-coupled tanks plant with the proposed subsystems.
Figure 4. PG-DMPC negotiation algorithm implemented.
Figure 5. Gradient estimation variance approach by exploration policy based on categorical distribution.
Figure 6. Validation Case 1: Evolution of the levels of the controlled variables: (a) Conf. 1 performance; (b) Conf. 2 performance.
Figure 7. Validation Case 2: Evolution of the levels of the controlled variables involving disturbance in the state vector; (a) Conf. 1 performance; (b) Conf. 2 performance.
Figure 8. Validation Case 3: Evolution of the levels of the controlled variables involving steps: (a) Conf. 1 performance; (b) Conf. 2 performance.
Table 1. Policy gradient configurations implemented.

| Name | Gradient Estimation | Exploration Policy |
|---|---|---|
| Conf. 1 | Non-baseline | Categorical distribution |
| Conf. 2 | Baseline | Categorical distribution |
Table 2. Validation cases.

|  | Case 1 | Case 2 | Case 3 |
|---|---|---|---|
| Control problem | regulation | regulation | tracking |
| Disturbance | none | from t > 245 to t < 305 | none |
| Reference | steady state | steady state | step from one steady state to another |
Table 3. Case 1: performance of the configurations in terms of the performance indices.

| Index | Conf. 1 | Conf. 2 | FL-DMPC | Coop. Game | Centralized MPC |
|---|---|---|---|---|---|
| P_e | 0.03594 | 0.03591 | 0.03605 | 0.03040 | 0.01958 |
| ISE | 0.04272 | 0.04297 | 0.04285 | 0.04130 | 0.03012 |
| e | 0.00934 | 0.00939 | 0.00937 | 0.00906 | 0.00682 |
| PE × 100 | 0.77505 | 0.77501 | 0.77503 | 0.77505 | 0.77463 |
Table 4. Case 2: performance of the configurations in terms of the performance indices.

| Index | Conf. 1 | Conf. 2 | FL | Coop. Game | Centralized MPC |
|---|---|---|---|---|---|
| P_e | 0.03745 | 0.03778 | 0.03733 | 0.03132 | 0.02029 |
| ISE | 0.04349 | 0.04369 | 0.04351 | 0.04227 | 0.03080 |
| e | 0.00947 | 0.00953 | 0.00950 | 0.00925 | 0.00696 |
| PE × 100 | 0.77038 | 0.77040 | 0.77028 | 0.77016 | 0.76993 |
Table 5. Case 3: performance of the configurations in terms of the performance indices.

| Index | Conf. 1 | Conf. 2 | FL | Coop. Game | Centralized MPC |
|---|---|---|---|---|---|
| P_e | 0.26797 | 0.26716 | 0.27060 | 0.22242 | 0.12143 |
| ISE | 0.33429 | 0.33409 | 0.33465 | 0.33341 | 0.20224 |
| e | 0.06815 | 0.06811 | 0.06823 | 0.06798 | 0.04175 |
| PE × 100 | 1.59417 | 1.59418 | 1.59410 | 1.59450 | 1.59313 |
Table 6. Compliance with stability requirement: % of total time.

| Validation Case | Conf. 1 | Conf. 2 |
|---|---|---|
| Case 1 | 84.2% | 83.33% |
| Case 2 | 80.84% | 82.5% |
| Case 3 | 70.41% | 70.41% |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

