Article

Research on Solving Nonlinear Problem of Ball and Beam System by Introducing Detail-Reward Function

1 School of Software Engineering, Dalian University of Foreign Languages, Dalian 116044, China
2 School of Mechanical Engineering, Dalian Jiaotong University, Dalian 116028, China
3 School of Control Science and Engineering, Dalian University of Technology, Dalian 116024, China
* Author to whom correspondence should be addressed.
Symmetry 2022, 14(9), 1883; https://doi.org/10.3390/sym14091883
Submission received: 9 August 2022 / Revised: 25 August 2022 / Accepted: 5 September 2022 / Published: 8 September 2022
(This article belongs to the Section Engineering and Materials)

Abstract: As a complex nonlinear system, the first-order incremental relationship between the state variables of the beam and ball system (BABS) is asymmetric in the definition domain of the variables, and the characteristics of the system do not satisfy the superposition theorem. Studying the balance control of the BABS can help to better grasp the relevant characteristics of the nonlinear system. In this paper, the deep reinforcement learning method is used to study the BABS based on a visual sensor. First, the detail-reward function is designed by observing the control details of the system, and the rationality of the function is proved based on the Q-function; secondly, considering and comparing the applicability of image processing methods in ball coordinate location, an intelligent location algorithm is proposed, and the location effects of the algorithms are compared and analyzed; then, combining nonlinear theory and LQR theory, a reinforcement learning policy model is proposed to linearize near the equilibrium point, which significantly improves the control effect. Finally, experiments are designed to verify the effectiveness of the above methods in the control system. The experimental results show that the design scheme can be effectively applied to the control system of the BABS. It is verified that introducing the detail-reward mechanism into a deep reinforcement learning algorithm can significantly reduce the complexity of the nonlinear control system and the iterative algorithm, and effectively solve nonlinear control problems.

1. Introduction

Control fields provide disciplines and approaches for designing engineering systems that maintain desirable performance by automatically adapting to environmental changes, and the evolution of control theory is related to analyzing controller design methods, as well as research advances in technology and their real-time implementation [1,2]. There are many control benchmark problems related to engineering systems, such as inverted pendulum systems [3], beam and ball systems [4], hovercraft systems [5], and ball and plate systems [6]. In control education, the beam and ball system (BABS) is used as standard and important laboratory equipment to validate and design different control methods. As the beam is a key component of the BABS, studying its properties is very important. So far, many scholars have made great progress in the research of the force and structure of the beam [7,8,9,10,11,12], which is very helpful for studying the materials, viscosity and friction, and force analysis of the BABS. In the BABS, the beam is mounted on the output shaft of the motor, and the position of the ball on the beam is controlled by applying a control signal to change the angle of the beam. The control structure of the BABS is widely used in practical applications, for example, horizontally stabilized aircraft [13] and balancing robots carrying goods [14,15]. Studying the control problems of the BABS is therefore of great significance for the application of such engineering systems.
The BABS is a nonlinear, strongly coupled, underactuated, and unstable dynamic system, the core of which is to study the position balance of the ball [16]. To the best of our knowledge, designing a complete control system mainly includes the design of the control model and the acquisition of the system state variables [17]. In the BABS, the core research questions can be summarized as follows: ① according to the control objective, how to establish a control model of the system so that the ball can be balanced at any position on the beam; and ② how to obtain the state variables of the system (the ball position information). At present, the design methods for the BABS control model mainly include traditional control methods and intelligent control methods [18]. The position information of the ball needs to be acquired by sensors, which are mainly divided into contact sensors and non-contact sensors [19,20,21,22]. Next, we introduce the common control methods of the BABS, the research status of sensor technology, and the existing problems.
Some scholars use the traditional proportional-integral-derivative (PID) controller or an improved PID controller as the design method for the control model of the BABS. This method takes the deviation between the real-time position of the BABS and the target position as the input of the control model, and realizes the control objective by adjusting the proportional, integral, and derivative parameters [22,23]. The tuning of the PID parameters depends on the experience of the researchers. When the mechanical parameters of the system change, the PID parameters need to be re-tuned, which limits the flexibility of the control. Other scholars simplify the dynamic model of the BABS, convert the nonlinear output of the model into a linear output within an allowable error range, and use the linear quadratic regulator (LQR) as the design method of the BABS control model [24,25,26,27]. When the dynamic model is complex, especially when the system is subject to various external disturbances, the output of the linearized model has great uncertainty, which requires the designer to repeatedly calculate and test the model.
With the rapid development of computer technology and control theory, intelligent control algorithms have been applied to control systems. Some scholars use fuzzy control (FC) [28,29,30] and sliding mode control (SMC) [31,32,33,34] to study the BABS, and have achieved good control results. In the control model design process of the BABS, the FC design method includes the following steps: ① Select the input of the control model, fuzzify it, and construct a membership function. ② Build the experience-based rule base. ③ Realize fuzzy reasoning. ④ Convert the control variables obtained by reasoning into the control output. The SMC method involves the following steps: ① Design the sliding surface so that the state trajectory of the BABS is asymptotically stable after switching to the sliding mode. ② Design the sliding mode control law, so that the state trajectory of the system can maintain the motion on the sliding surface within a limited time. However, FC and SMC have the following limitations: the establishment of fuzzy rules and membership functions in FC depends on the experience of the designers. When the BABS becomes more complicated due to external disturbances, the design of the controller lacks a systematic nature, and the control accuracy and stability need to be improved [35]. In SMC, when the state trajectory reaches the sliding mode surface and the switching range of the control law is large, the trajectory cannot strictly slide along the sliding mode surface to the equilibrium point but is distributed on both sides of the surface; in this situation the BABS is prone to a chattering (buffeting) phenomenon, which has a large negative impact on the stability and security of the system [36].
More intelligent control methods can be used to solve the nonlinear control problems of the BABS. Adaptive dynamic programming (ADP) is an iterative solution algorithm based on the idea of reinforcement learning (RL); it is combined with the actor–critic framework and uses dynamic programming solution methods, such as policy iteration and value iteration, to approximate the performance index function and the optimal control law in nonlinear systems [37,38,39,40]. Based on the dynamic model of the BABS, the quadratic performance index function is adopted, and various types of function approximation structures are designed to obtain approximate solutions of the performance index function and the optimal control law [41,42,43]. However, the ADP method does not pay attention to the control details, and there is no clear description of the correlation between the performance index function and the system state. In addition, no definite design method is given for the parameters of the performance index function, and these parameters not only affect the performance index function but also determine the result of the optimal control law.
RL is a machine learning method inspired by animal learning psychology [44]; it uses a trial-and-error mechanism to interact with the environment and obtains the optimal policy by maximizing the cumulative reward under the reward stimulus given by the environment. The essence of RL can be understood as a process of goal-oriented automated learning and decision making [45]. Some studies take RL as the control method, design a reward function for the BABS, and obtain the maximum reward value through continuous interaction with the dynamic environment, so as to obtain the optimal control law [46,47,48,49]. The design of the reward function in the above literature differs from the performance index function in ADP. However, systematic methods for designing the reward function are lacking: most reward functions have no clear relationship with the control details of the system, the rewards are relatively sparse, and no proof of their rationality is given. In addition, when a control policy trained in the dynamic environment is used on the real platform, the algorithms need to be adjusted and improved. We proposed a detail-reward mechanism (DRM) in [50], which constructs the reward function by considering the control details and combines it with related techniques in deep reinforcement learning (DRL) to solve the problems of ADP and RL in nonlinear system control. In summary, research on the control problem of the BABS increasingly favors DRL methods, and introducing the DRM may make it possible to satisfy the control requirements at the level of control details.
There are three ways to obtain the ball position information in the BABS: based on touch sensors (touch screens) [21,22], based on F/T sensors [51], and based on vision systems [19,20]. When a contact sensor (touch screen) is used to measure the position of the ball, the resistance or capacitance of the touch screen changes during the movement of the ball, from which the current position information is obtained. However, the contact between the ball and the touch screen easily causes damage to the outer film of the sensor, resulting in poor measurement accuracy. Moreover, the touch screen itself has a strong viscous effect on the ball, increasing the complexity of the system. When the F/T sensor is used to measure the position and speed of the ball, the ball easily moves in the vertical direction when subjected to uneven forces, which makes the measured values discontinuous. As a non-contact sensor, the vision sensor collects dynamic images of the ball and the beam, analyzes the pixel characteristics, obtains the pixel coordinate information of the ball in the image, and maps the pixel coordinates to actual physical coordinates based on geometric relationships [19,20]. Considering measurement safety and continuity, and because the mechanical and physical properties of the system are preserved, the vision sensor has obvious advantages in collecting state information.
The methods of collecting the position information of the object using the vision sensor can be divided into two categories: object location detection based on deep learning [52,53,54] and object location detection based on image processing methods [55,56,57,58]. The former uses a deep network to predict the object location information, and its prediction accuracy is high. However, in some control systems with high detection timeliness, it is necessary to compress the deep network model to reduce the time from the image input to the position output, otherwise, it is easy to cause control lag. The latter can guarantee the timeliness of detection, but when the acceleration of the object changes rapidly or the motion characteristics are not obvious, the detection accuracy will be greatly reduced. Therefore, an important problem to be solved in the BABS is not only to ensure the timeliness of image acquisition and processing by visual sensors, but also to be able to intelligently identify targets with high accuracy.
This paper focuses on the establishment of the control model and the acquisition of ball position information in the BABS. The main contributions are summarized as follows:
(1)
Aiming at the limitation of the sparse reward function in the BABS, a detail-reward function is designed according to the control details of the system, and the Q-function is used to mathematically prove the rationality of the detail-reward function. Finally, the feasibility of the detail-reward function is verified by experiments.
(2)
A visual location method is proposed. When the surface features of the ball are not obvious due to environmental influences, this method achieves higher detection accuracy than conventional image processing methods.
(3)
Aiming at the limitations of the policy model trained by deep reinforcement learning in the BABS, a method of model linearization near the equilibrium point is proposed by combining nonlinear control theory and LQR theory to improve the control stability.
The remainder of this paper is organized as follows: Section 2 describes the dynamic model of the BABS. In Section 3, the structure of the BABS is designed, which mainly involves the design of the control model and the design of the visual location method. Section 4 analyzes and discusses the experimental results of the BABS. Finally, Section 5 presents the main conclusions and future research work.

2. Dynamic Model of BABS

Figure 1 is the schematic diagram of the BABS. Figure 1a shows the composition of the BABS. It consists of a beam, a steel ball, a vision sensor, electromechanical components, and a mechanical transmission. Driven by the electromechanical device, the beam rotates around the rotation center in the Z-O-X plane within a limited angle range. The ball rolls freely on the beam under the action of gravity. There is no slipping between the ball and the beam, and the ball does not jump away from the beam. The final control objective of the system is that the ball can stay at any preset point on the beam while the beam is horizontal in the balanced state, that is, the beam angle is zero. Figure 1b is the schematic diagram of the symbols of the BABS.
In Figure 1b, the ball mass is denoted as $m$, the radius of the ball is $r$, the beam length is $l_r$, the moment of inertia is $J$, and the beam rotation angle is $\alpha$; the ball is affected by the gravity $g$ and the resistance $f$ related to the ball velocity during rolling. The dynamic model is established according to the motion characteristics of the BABS, and the clockwise rotation of the beam is defined as the positive direction. When the BABS satisfies the holonomic constraints, according to the definition of generalized force, we have
$\sum_{i=1}^{n} F_i \cdot \delta r_i = \sum_{k=1}^{N} Q_k \cdot \delta q_k$ (1)
where $F_i$ is the active force on the virtual displacement $\delta r_i$, $q = (q_1, q_2, \dots)$ are the generalized coordinates, and $Q_k$ is the generalized force on $q_k$. According to the physical constraints of the BABS, let $q = \chi$. Because the gravity component $mg\sin\alpha$ in the X direction is a potential force, i.e., the potential energy of the system changes under the action of this force, and because $f$ is non-potential, the generalized force induced by the non-potential force is
$Q_\chi = \dfrac{\delta W_F}{\delta\chi} = \dfrac{-f\cdot\delta\chi}{\delta\chi} = -f$ (2)
and the kinetic energy of the system is
$T = \dfrac{1}{2}mv^2 + \dfrac{1}{2}J\omega^2 = \dfrac{1}{2}m\dot\chi^2 + \dfrac{1}{2}\dfrac{J\dot\chi^2}{r^2}$. (3)
If the system potential energy is zero when the beam is in a horizontal state, i.e., α = 0 , and the ball starts at the equilibrium point O , the system potential energy is
$V = -mg\chi\sin\alpha$ (4)
By introducing Equation (4) into the Lagrangian function, we can get
$L = T - V = \dfrac{1}{2}m\dot\chi^2 + \dfrac{1}{2}\dfrac{J\dot\chi^2}{r^2} + mg\chi\sin\alpha$. (5)
Substituting Equation (5) into the Lagrangian equation
$\dfrac{d}{dt}\dfrac{\partial L}{\partial\dot\chi} - \dfrac{\partial L}{\partial\chi} = -f$. (6)
Equation (6) can be rewritten into the following formulas
$\dfrac{d}{dt}\dfrac{\partial L}{\partial\dot\chi} = \dfrac{d}{dt}\left[\left(m + \dfrac{J}{r^2}\right)\dot\chi\right] = \left(m + \dfrac{J}{r^2}\right)\ddot\chi, \qquad \dfrac{\partial L}{\partial\chi} = mg\sin\alpha, \qquad f = k_f\dot\chi,$ (7)
where $k_f$ is the viscous friction coefficient, and Equation (7) can be converted into
$\left(m + \dfrac{J}{r^2}\right)\ddot\chi - mg\sin\alpha = -k_f\dot\chi$. (8)
Then the dynamic differential equation of the BABS can be obtained as follows:
$\ddot\chi + \dfrac{k_f r^2}{mr^2 + J}\dot\chi = \dfrac{mgr^2}{mr^2 + J}\sin\alpha$. (9)
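To make Equation (9) concrete, the following minimal Python sketch integrates the ball dynamics with a forward-Euler scheme. The parameter values, the time step, and the constant beam command are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

# Illustrative parameter values (assumed for this sketch, not taken from the paper).
m, r, J, k_f, g = 0.05, 0.01, 2e-6, 0.1, 9.81
dt = 0.01  # integration step [s]

def ball_acceleration(chi_dot, alpha):
    """Right-hand side of Equation (9): returns the ball acceleration."""
    denom = m * r**2 + J
    return (m * g * r**2 / denom) * np.sin(alpha) - (k_f * r**2 / denom) * chi_dot

def step(chi, chi_dot, alpha, alpha_dot, alpha_ddot):
    """One forward-Euler step of the coupled ball/beam kinematics."""
    chi_ddot = ball_acceleration(chi_dot, alpha)
    chi       += chi_dot * dt
    chi_dot   += chi_ddot * dt
    alpha     += alpha_dot * dt
    alpha_dot += alpha_ddot * dt   # the beam's angular acceleration is the control input
    return chi, chi_dot, alpha, alpha_dot

# Example: roll the ball from rest with the beam held at a small fixed angle.
state = (0.05, 0.0, 0.05, 0.0)
for _ in range(100):
    state = step(*state, alpha_ddot=0.0)
print(state)
```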

3. Model Design of the BABS

The research content of this section mainly includes the following two parts: the design of the BABS control model (Section 3.1) and the detection of the ball position information (Section 3.2).

3.1. Control Model Design

The design of the control model for the BABS is mainly divided into the following parts: ① According to the control characteristics of the BABS, a reinforcement learning model (Section 3.1.1) is established. ② According to the control details of the BABS, a detail-reward function (Section 3.1.2) is constructed. ③ According to the high-dimensional state characteristics of the BABS, a deep reinforcement learning algorithm (Section 3.1.3) is designed.

3.1.1. Reinforcement Learning Model Design

In reinforcement learning, when agents interact with the environment, they need to find the mapping relationship between states and actions, such as the probability that a state corresponds to an action or some different actions [45]. The state can be regarded as the input of the system, while the action is the output, and the mapping process from state to action is called a policy, which is represented by π , i.e.,
$a = \pi(s)$ (10)
The control model of the BABS realizes the mapping from the input of the system model to the output of the control law. In order to study the reinforcement learning model of the BABS, we treat the system state feedback as the state in reinforcement learning, the control law as the action, and the control model as the policy. In this way, in the above formula, $a$ is the action vector of the BABS and $s$ is the state vector. The control problem of the BABS is a standard Markov decision process (MDP): the next state $s_{t+1}$ that the system can reach depends only on the current state $s_t$ and the current action $a_t$, i.e., if the input is determined, then the output is also determined. When using reinforcement learning to solve the BABS control problem with MDP characteristics, we need to build the five elements of reinforcement learning: environment, state, agent, action, and reward. At the state $s_t$ described by the environment variables, the system takes action $a_t$, obtains the reward $r_t$ according to the reward function, updates the next state to $s_{t+1}$, and then continues to perform the following series of actions to obtain the subsequent reward values. Figure 2 shows the reinforcement learning model of the BABS.
Environment: Considering the limitations and difficulties of adopting reinforcement learning in the real platform [59], we establish the dynamic environment of the BABS according to Equation (9), so that we can train and update policy in the dynamic environment, and take the final policy model (PM) as the control model of the BABS. In the process of building the dynamic environment, the mechanical parameters in the dynamic model are obtained by measuring the real platform. Considering the measurement error, the floating parameter method is adopted when introducing the measurement parameters into the dynamic environment, i.e., the parameters of the controlled object are set to float randomly within a range. When the PM generated by learning in the dynamic environment has poor control effect on the real platform, the parameters are changed to train and update the policy again to reduce the difference between the dynamic environment and the real platform. Considering the uncertain interference factors (such as air resistance, mechanical clearance, etc.) during the operation of the real platform, we simulate the above disturbances by adding noise when building the dynamic environment.
State: We define the state in the reinforcement learning model as $s = [\chi, \dot\chi, \alpha, \dot\alpha]^T$, where $\chi$ denotes the distance of the ball from the centre of rotation, $\dot\chi$ denotes the velocity of the ball, $\alpha$ denotes the angle of rotation of the beam around the centre of rotation, and $\dot\alpha$ denotes the angular velocity of the beam. The state quantities in the dynamic environment are obtained by means of the dynamic model, and on the real platform by means of a vision sensor and angle encoders.
Action: Although various forms of control quantities can be used as the output action of the control model, e.g., the motor output current, the angle of the beam, etc., we take the angular acceleration $\ddot\alpha$ as the output action of the control model, and it obeys Equation (10).
Reward: In the BABS, the agent receives a reward that reflects the rationality of the control action it takes. Obtaining a higher reward is the objective of model optimization, and the reward is designed by means of the detail-reward mechanism. The construction of the detail-reward function and the proof of its rationality are described in Section 3.1.2 of this paper.
A noteworthy feature of the BABS is its nonlinearity. According to Equation (9), which is nonlinear in itself, we cannot find time-varying matrices $A(t)$ and $B(t)$, independent of the state vector $s$ and the control input $\ddot\alpha$, respectively, such that $\dot s = A(t)s + B(t)\ddot\alpha$ holds. This makes it very complicated to obtain an analytical solution of the system using mathematical methods.
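The five elements above can be collected into a minimal Gym-style dynamic environment. The sketch below is an illustration only: the nominal mechanical parameters, the floating-parameter range, the noise level, and the termination test are assumptions (the position and angle bounds follow the training ranges given in Section 4.1), and the reward function is passed in as a hook so that the detail-reward of Section 3.1.2 can be plugged in.

```python
import numpy as np

class BABSEnv:
    """Minimal dynamic environment for the BABS, following Equation (9).
    Parameter values, the noise magnitude, and the reward hook are illustrative
    assumptions for this sketch, not the authors' exact settings."""

    def __init__(self, dt=0.01, param_float=0.05, noise_std=1e-3):
        self.dt = dt
        self.param_float = param_float   # relative range for the floating-parameter method
        self.noise_std = noise_std       # stands in for unmodeled disturbances
        self.reset()

    def reset(self):
        # Nominal mechanical parameters (measured on the real platform in the paper;
        # the numbers here are placeholders) floated randomly within +/- param_float.
        f = lambda x: x * (1 + np.random.uniform(-self.param_float, self.param_float))
        self.m, self.r, self.J, self.k_f, self.g = f(0.05), f(0.01), f(2e-6), f(0.1), 9.81
        # State s = [chi, chi_dot, alpha, alpha_dot]^T
        self.s = np.array([np.random.uniform(-0.16, 0.16), 0.0, 0.0, 0.0])
        return self.s.copy()

    def step(self, alpha_ddot, reward_fn):
        alpha_ddot = float(alpha_ddot)   # accept scalars or 1-element arrays
        chi, chi_dot, alpha, alpha_dot = self.s
        denom = self.m * self.r**2 + self.J
        chi_ddot = (self.m * self.g * self.r**2 / denom) * np.sin(alpha) \
                   - (self.k_f * self.r**2 / denom) * chi_dot
        # Forward-Euler update of the state, plus small noise for disturbances.
        self.s = self.s + self.dt * np.array([chi_dot, chi_ddot, alpha_dot, alpha_ddot])
        self.s += np.random.normal(0.0, self.noise_std, size=4)
        reward = reward_fn(self.s, alpha_ddot)
        done = abs(self.s[0]) > 0.16 or abs(self.s[2]) > 0.209
        return self.s.copy(), reward, done
```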
In reinforcement learning, $G_t$ is used to represent the accumulation of all reward values from time $t$ [59], which is expressed as follows:
$G_t = r_{t+1} + \gamma r_{t+2} + \cdots = \sum_{k=0}^{\infty}\gamma^k r_{t+k+1}$ (11)
where $\gamma$ is the discount rate, $0 \le \gamma \le 1$, and $r_{t+1}$ denotes the reward value obtained by the agent by taking action $a_t$ through the policy at state $s_t$. The objective of the reinforcement learning model design for the BABS is to make the expectation of $G_t$ higher and higher through continuous iterative computation, i.e., to maximize the value function of each state. We define the value function after executing action $a$ at a state $s$ in the interaction as $Q^\pi(s,a)$, and the relationship between $Q^\pi(s,a)$ and $G_t$ can be expressed as
$Q^\pi(s,a) = \mathbb{E}\left[G_t \mid s, a\right] = \mathbb{E}\left[r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots \mid s, a\right]$. (12)
In the above equation, $a \in A$ and $\pi$ is the learning policy, which can generate actions according to the current state and obtain the value function under the current action. According to the Bellman optimality equation [59], the policy corresponding to the optimal value function is the optimal policy, i.e.,
$Q^*(s,a) = \max_{\pi} Q^\pi(s,a)$ (13)
$\pi^*(s) = \arg\max_{a\in A} Q^*(s,a)$. (14)
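As a small illustration of Equations (11) and (14), the sketch below computes a discounted return and extracts a greedy action from an arbitrary Q-estimate. The quadratic Q-estimate and the discretized action grid are assumptions used only for this example; they are not part of the method in this paper.

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """Equation (11): cumulative discounted reward over a list of future rewards."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

def greedy_action(q_fn, state, candidate_actions):
    """Equation (14): pick the action with the largest estimated Q-value.
    `q_fn` is any approximation of Q*(s, a); the candidate grid is an assumption
    used only because the BABS action space is continuous."""
    values = [q_fn(state, a) for a in candidate_actions]
    return candidate_actions[int(np.argmax(values))]

# Toy usage with a made-up quadratic Q estimate.
q_hat = lambda s, a: -(s[0] + 0.1 * a) ** 2
s = np.array([0.05, 0.0, 0.0, 0.0])
print(discounted_return([0.0, 0.0, 1.0]))              # 0.99**2
print(greedy_action(q_hat, s, np.linspace(-1, 1, 21))) # -0.5
```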
We try to use Equations (10), (13), and (14) to solve the nonlinear control problem of the BABS. The solution process of the control model is the solution process of the optimal policy in the reinforcement learning model. The solution method of the optimal value function and optimal policy is introduced in Section 3.1.3 of this paper.

3.1.2. Detail-Reward Function

The objective of reinforcement learning is to maximize the reward gain and to obtain the optimal policy. From Equations (11) and (12), this means maximizing the Q-function, i.e., maximizing the expectation of $G_t$; the optimal Q-function [59] can be given as
$Q^*(s,a) = \mathbb{E}_{s'\sim P_{sa}(\cdot)}\left[R(s,a,s') + \gamma \max_{a'\in A} Q^*(s',a')\right]$. (15)
The design of the reward function is an important part of reinforcement learning, which is crucial for guiding an agent to learn a good policy. Logically confusing rewards cannot guide an agent to learn effectively. The ideal reward function would output an appropriate reward value for each state–action pair $(s,a)$, i.e., the output of each action would have an exact reward value corresponding to it, and if each action were the output under the optimal policy, each action would receive the maximum reward value. In fact, it is very difficult, even impossible, to design such a reward function.
In nonlinear control systems, the equilibrium point is the trend and target of the state fluctuations. When reinforcement learning is used to solve nonlinear control problems, the equilibrium point of the system is usually designed as a sparse reward point, which means that the design of the reward function takes the equilibrium point as the design object, i.e., when the state of the agent reaches or is near the equilibrium point, the agent receives a reward in some way [60]. As a type of reward function, a sparse reward tells the agent the final goal to achieve, rather than telling the agent how to achieve the goal or relying on detailed instructional information. From the perspective of reinforcement learning theory, the design of a sparse reward function is feasible: taking the equilibrium point as the reward point can maximize the cumulative reward value in the learning process [59]. However, in most learning processes the state space is huge, the reward design is too sparse and unclear, and it is difficult for the agent to obtain positive rewards. The agent is prone to the strange phenomenon of "circling" in some states, resulting in slow learning or even an inability to learn, and ultimately leading to learning failure [61]. In addition, the stability of the PM obtained after learning needs to be improved in the control process.
In the reinforcement learning process of nonlinear systems, because the state is often multiple continuous variables, the probability of the agent obtaining the equilibrium state in the exploratory learning process is very low, and the sparse reward cannot guide the agent’s learning. At this time, it is necessary to carry out some technical processing on the reward mechanism. When solving specific nonlinear control system problems, there are often some “local” empirical methods, that is, the use of some empirical control methods for some specific state space. These methods are often understood as the “optimal control method” in this space. We can use these empirical methods to make some technical modifications to the sparse reward in order to enrich the output of the reward function [62,63]. For the convenience of later algorithm processing, we give the following assumption, theorem, and lemma.
Assumption. 
Suppose there exists a controllable nonlinear system, $S$ is its state set, a sparse reward $R(s_0) = 1$ is constructed at the equilibrium point $s_0$, and $R(s,a,s') \in \{0,1\}$. $S' \subseteq S$ is an empirical control state set of the system, and there is an optimal empirical control method $f(S', A)$ over the state set $S'$, i.e., for any specific state $s_t \in S'$ and action $a_t \in A$, there is an output state $s_{t+1} = f(s_t, a_t)$, and the state $s_{t+1}$ approaches the equilibrium point $s_0$ with maximum probability.
Under the action of the state transition probability $P_{sa}(\cdot)$, the state shifts from $s_t$ to $s_{t+1}$. In the process of state transition, it cannot be guaranteed that every next state $s_{t+1}$ is closest to $s_0$, but the optimal policy guarantees the maximum mathematical expectation. For an accurate dynamic model of a nonlinear control system, after determining the state–action pair $(s_t, a_t)$, the next state $s_{t+1}$ output by the model is determined. However, in the real system, due to the influence of random disturbances and sensor accuracy, the next state is uncertain, and there may be some deviation. For the convenience and generality of the following mathematical description, we assume that under the optimal empirical control method $f(s_t,a_t)$, $s_{t+1}$ is the state closest to the equilibrium point $s_0$.
We described the detail-reward mechanism (DRM) in [50]. The form of the DRM is
$R(x_k, u_k) = \prod_{i=0}^{n} e_i(x_k, u_k)$. (16)
We call $e_i(\cdot)$ an evaluation function, and the evaluation function sequence is $e_1(x_k,u_k), e_2(x_k,u_k), \dots, e_n(x_k,u_k)$, where each $e_i(x_k,u_k)$ is assumed to be continuously differentiable and $n$ is the number of evaluation functions, corresponding to evaluations from different observation perspectives.
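A minimal sketch of Equation (16) in Python is given below; the two toy evaluation functions and their thresholds are made-up placeholders, not the evaluation functions designed later in this section.

```python
from typing import Callable, Sequence
import numpy as np

def detail_reward(evaluations: Sequence[Callable], x: np.ndarray, u: float) -> float:
    """Equation (16): the detail-reward is the product of the evaluation
    functions e_i(x_k, u_k), each scoring the same state-action pair from a
    different observation perspective."""
    reward = 1.0
    for e in evaluations:
        reward *= e(x, u)
    return reward

# Toy usage with two made-up evaluation functions in [0, 1].
e0 = lambda x, u: 1.0 if abs(x[0]) <= 0.16 else 0.0
e1 = lambda x, u: 1.0 - min(abs(x[0]) / 0.16, 1.0)
print(detail_reward([e0, e1], np.array([0.04, 0.0, 0.0, 0.0]), 0.0))
```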
Based on the DRM, the following theorem is given.
Theorem. 
Assuming that there exists an optimal empirical control method $f(S', A)$ on the state set $S' \subseteq S$, the evaluation function $e_0(s_t,a_t) = R(s_t,a_t,s_{t+1})$ is constructed as follows:
$R(s,a) = \begin{cases} R(s,a,s'), & s \in S' \text{ and } R(s,a,s') \neq 0 \\ 1, & s \in S' \text{ and } R(s,a,s') = 0 \\ R(s,a,s'), & \text{otherwise} \end{cases}$ (17)
where $R(s_t,a_t,s_{t+1}) \in \{0,1\}$ is the sparse reward near the equilibrium point, which appears in the Q-function (15). Construct $e_1(s_t,a_t) = K(s_t,a_t,s_{t+1})$, where $K(s_t,a_t,s_{t+1}) \in [0,1]$ is the evaluation function based on the optimal empirical control method $f(s_t,a_t)$. Thus the detail-reward function is defined by
$E(s_t,a_t,s_{t+1}) = \prod_{i=0}^{1} e_i(s_t,a_t) = K(s_t,a_t,s_{t+1}) \cdot R(s_t,a_t,s_{t+1})$. (18)
The optimal policy is unchanged when the detail-reward function $E(s_t,a_t,s_{t+1})$ is substituted for the sparse reward function $R(s_t,a_t,s_{t+1})$.
Proof of Theorem.
The Q-function of the sparse reward $R(s,a,s') \in \{0,1\}$ under the state $s_t \in S'$ is
$Q^*(s,a) = \mathbb{E}_{s'}\left[R(s,a,s') + \gamma \max_{a'\in A} Q^*(s',a')\right]$. (19)
The Q-function using the DRM is
$Q_{S'}^*(s_t,a_t) = \mathbb{E}_{s_{t+1}}\left[E(s_t,a_t,s_{t+1}) + \gamma \max_{a_{t+1}\in A} Q^*(s_{t+1},a_{t+1})\right]$ (20)
Note that only the detail-reward function $E(\cdot)$ is used in step $t$, and the sparse reward function $R(\cdot)$ is used in step $t+1$ and subsequent steps. Suppose the optimal empirical control is
$s_{t+1} = f(s_t, a_t)$. (21)
This means that the system transitions to the next state $s_{t+1}$ after executing action $a_t$ at state $s_t$, and the distance between the state $s_{t+1}$ and the equilibrium point $s_0$ is expressed as
$L(s_{t+1}) = \| s_{t+1} - s_0 \|$. (22)
According to the definition of optimal empirical control, the optimal action is
$a_t^* = \arg\min_{f(s_t,a_t)\in S,\ a_t\in A} L\big(f(s_t,a_t)\big)$. (23)
Then the evaluation function with respect to the optimal empirical control method takes, in this case, the value
$K\big(s_t, a_t^*, f(s_t,a_t^*)\big) = 1$. (24)
In other words, the evaluation function gives the maximum reward value of 1 when the distance between the state $s_{t+1}$ and the equilibrium point $s_0$ is minimal. Thus the optimal policy is as follows:
$\pi_{S'}^*(s_t) = \arg\max_{a_t\in A} Q_{S'}^*(s_t,a_t) = \arg\max_{a_t\in A} \mathbb{E}_{s_{t+1}}\left[K(s_t,a_t,s_{t+1})\cdot R(s_t,a_t,s_{t+1}) + \gamma \max_{a_{t+1}\in A} Q^*(s_{t+1},a_{t+1})\right]$ (25)
It remains to prove that when $s \in S'$ and $R(s,a,s') \neq 0$, we have
$\arg\max_{a_t\in A} \mathbb{E}_{s_{t+1}}\left[1 \cdot R(s_t,a_t,s_{t+1}) + \gamma \max_{a_{t+1}\in A} Q^*(s_{t+1},a_{t+1})\right] = \arg\max_{a_t\in A} Q^*(s_t,a_t) = \pi^*(s_t)$. (26)
This shows that the detail-reward function and the sparse reward function have the same effect on the optimal policy. Then, the proof is complete. □
In fact, if the optimal empirical control method continues to be used in step $t+1$ and subsequent steps, the optimal policy is still applicable, i.e., the Q-function and the optimal policy satisfy
$Q_{S'}^*(s_t,a_t) = \mathbb{E}_{s_{t+1}}\left[E(s_t,a_t,s_{t+1}) + \gamma \max_{a_{t+1}\in A} Q_{S'}^*(s_{t+1},a_{t+1})\right]$ (27)
$\pi^*(s_t) = \arg\max_{a_t\in A} Q_{S'}^*(s_t,a_t)$. (28)
Lemma. 
Assume that there exist optimal empirical control methods $f_1(S_1,A), f_2(S_2,A), \dots, f_n(S_n,A)$ on the state sets $S_1, S_2, \dots, S_n \subseteq S$, respectively, and construct the evaluation function $e_0(s_t,a_t) = R(s_t,a_t,s_{t+1})$, where
$R(s,a) = \begin{cases} R(s,a,s'), & s \in S_1 \cup S_2 \cup \cdots \cup S_n \text{ and } R(s,a,s') \neq 0 \\ 1, & s \in S_1 \cup S_2 \cup \cdots \cup S_n \text{ and } R(s,a,s') = 0 \\ R(s,a,s'), & \text{otherwise} \end{cases}$ (29)
$R(s_t,a_t,s_{t+1}) \in \{0,1\}$ is the sparse reward near the equilibrium point, which appears in the Q-function (15). Let $e_1(s_t,a_t) = K_1(s_t,a_t,s_{t+1})$, $e_2(s_t,a_t) = K_2(s_t,a_t,s_{t+1})$, ..., $e_n(s_t,a_t) = K_n(s_t,a_t,s_{t+1})$, where $K_i(s_t,a_t,s_{t+1}) \in [0,1]$ are the evaluation functions based on their respective optimal empirical control methods $f_i(s_t,a_t)$. Moreover, the detail-reward function is
$E(s_t,a_t,s_{t+1}) = \prod_{i=0}^{n} e_i(s_t,a_t) = \prod_{i=1}^{n} K_i(s_t,a_t,s_{t+1}) \cdot R(s_t,a_t,s_{t+1})$. (30)
The optimal policy is unchanged when the detail-reward function $E(s_t,a_t,s_{t+1})$ is substituted for the sparse reward function $R(s_t,a_t,s_{t+1})$.
Proof of Lemma.
Referring to the proof of the theorem and combining it with Equation (27), we rewrite Equation (27) as
$Q_{S_1}^*(s_t,a_t) = \mathbb{E}_{s_{t+1}}\left[E_1(s_t,a_t,s_{t+1}) + \gamma \max_{a_{t+1}\in A} Q_{S_1}^*(s_{t+1},a_{t+1})\right]$. (31)
Then we get
$E_2(s_t,a_t,s_{t+1}) = e_i(s_t,a_t) \cdot K_1(s_t,a_t,s_{t+1}) \cdot R(s_t,a_t,s_{t+1}) = K_i(s_t,a_t,s_{t+1}) \cdot K_1(s_t,a_t,s_{t+1}) \cdot R(s_t,a_t,s_{t+1}) = K_i(s_t,a_t,s_{t+1}) \cdot E_1(s_t,a_t,s_{t+1})$, (32)
and the Q-function is
$Q_{S_2}^*(s_t,a_t) = \mathbb{E}_{s_{t+1}}\left[E_2(s_t,a_t,s_{t+1}) + \gamma \max_{a_{t+1}\in A} Q_{S_1}^*(s_{t+1},a_{t+1})\right]$. (33)
Based on the optimal empirical control, we get the optimal action
$a_t^* = \arg\min_{f_i(s_t,a_t)\in S,\ a_t\in A} L\big(f_i(s_t,a_t)\big)$. (34)
Then the evaluation function with respect to the optimal empirical control method takes, in this case, the value
$K_i\big(s_t, a_t^*, f_i(s_t,a_t^*)\big) = 1$, (35)
thus, we get
$\pi_{S_2}^*(s_t) = \arg\max_{a_t\in A} Q_{S_2}^*(s_t,a_t)$. (36)
It remains to prove that when $s \in S_1 \cap S_i$ and $R(s,a,s') \neq 0$, we have
$\arg\max_{a_t\in A} \mathbb{E}_{s_{t+1}}\left[1 \cdot E_1(s_t,a_t,s_{t+1}) + \gamma \max_{a_{t+1}\in A} Q_{S_1}^*(s_{t+1},a_{t+1})\right] = \arg\max_{a_t\in A} Q_{S_1}^*(s_t,a_t) = \pi^*(s_t)$. (37)
According to Equation (36), it follows by mathematical induction that
$\pi_{S_n}^*(s_t) = \arg\max_{a_t\in A} Q_{S_{n-1}}^*(s_t,a_t) = \pi^*(s_t)$. (38)
Then, the proof is complete. □
This completes the proof of the rationality of the detail-reward function. By constructing evaluation functions to replace the sparse reward, we now design the evaluation functions for the BABS from different observation perspectives.
(1)
The evaluation function $e_0(\cdot)$
The control objective of the BABS is to control the action of the beam so that the ball moves over it and the final system state converges to a point of equilibrium. In Section 3.1.1, the system state $s = [\chi, \dot\chi, \alpha, \dot\alpha]^T$ is defined, and the control task is to drive the beam angle $\alpha$ so that the ball position $\chi$ converges to the equilibrium position. We denote $\alpha_{\max}$ as the maximum deviation of the beam from the equilibrium point, $\chi_{\max}$ as the maximum deviation of the ball from the equilibrium point, $\dot\chi_{\max}$ as the maximum velocity of the ball, and $\dot\alpha_{\max}$ as the maximum angular velocity of the beam. The evaluation function $e_0(\cdot)$ is designed according to the variation range of the state vector. Figure 3 is the design schematic diagram of the evaluation function.
According to the distribution of the ball position and beam angle in Figure 3, we define the state space of the system in the control process as follows
$\Omega_x = \left\{\, -\dfrac{l_r}{2} \le \chi_t \le \dfrac{l_r}{2},\ \ -\dot\chi_{\max} \le \dot\chi_t \le \dot\chi_{\max},\ \ -\alpha_{\max} \le \alpha_t \le \alpha_{\max},\ \ -\dot\alpha_{\max} \le \dot\alpha_t \le \dot\alpha_{\max} \,\right\}$ (39)
In the control process of the BABS, assuming the equilibrium point state is $s_0 = [\chi_0, \dot\chi_0, \alpha_0, \dot\alpha_0]^T$, we design a sparse reward $R(\cdot)$ near $s_0$ as follows:
$\Phi = \left\{\, \chi_0 - 0.02 \le \chi_t \le \chi_0 + 0.02,\ \ \dot\chi_0 - 0.02 \le \dot\chi_t \le \dot\chi_0 + 0.02,\ \ \alpha_0 - 0.02 \le \alpha_t \le \alpha_0 + 0.02,\ \ \dot\alpha_0 - 0.02 \le \dot\alpha_t \le \dot\alpha_0 + 0.02 \,\right\}$ (40)
$R(s_t,a_t) = \begin{cases} 1, & s_t \in \Phi \\ 0, & s_t \notin \Phi \end{cases}$ (41)
The evaluation function $e_0(\cdot)$ is given by the following expression:
$e_0(s_t,a_t) = \begin{cases} 1, & s_t \in \Omega_x \\ 0, & s_t \notin \Omega_x \end{cases}$ (42)
From Equations (39) and (40), it follows that $\Phi \subset \Omega_x$. In fact, the design of $e_0(\cdot)$ includes the sparse reward term $R(\cdot)$; the difference between $e_0(\cdot)$ and $R(\cdot)$ is that the sparse reward is given only when the state reaches the neighborhood of the equilibrium point, which makes training the model more difficult, whereas with $e_0(\cdot)$ a predetermined reward value is given as long as the system state lies within the admissible state space. The purpose of $e_0(\cdot)$ is to provide a mathematical multiplier for the final reward function without violating the original sparse reward.
(2)
The evaluation function $e_1(\cdot)$
During the rotation of the beam, in order to make the ball move toward the predetermined equilibrium position as much as possible and to reduce the amplitude of the swing angle of the beam, we design the evaluation function $e_1(\cdot)$ from the perspective of the position of the ball. Figure 4 is the design schematic diagram of this evaluation function; it shows that the position of ball 1 is close to the equilibrium point, while the position of ball 0 is far from it. In terms of the temporal effects of control, the closer the ball is to the equilibrium point, the greater the choice of actions; conversely, the closer it is to the boundary position, the less control space the model has and the more likely control is to fail.
In order to evaluate the position of the ball in the X direction, the evaluation function $e_1(\cdot)$ can be given as
$e_1(s_t,a_t) = 1 - \mathrm{sgn}(\chi_t)\,\dfrac{2\chi_t}{l_r}$ (43)
where $\chi_t$ denotes the position of the ball in the X direction at time $t$, and the evaluation function $e_1(s_t,a_t) \in [0,1]$ is continuously differentiable. The closer the ball is to the equilibrium point, the higher the reward value obtained. The design of $e_1(\cdot)$ reasonably restricts the output of actions by the PM, so that the ball position continuously approaches the equilibrium point.
(3)
The evaluation function $e_2(\cdot)$
Figure 5 is the design schematic diagram of the evaluation function $e_2(\cdot)$. The ball moves from position $\chi_0$ to $\chi_3$, passing through the equilibrium point $\chi_2$ in the middle. If the ball does not change its direction at $\chi_3$ and continues to accelerate, this state will be detrimental to the stability of the system.
In the algorithm, we should encourage correct actions output by the PM and punish incorrect actions at the same time. When designing the evaluation function, we introduce a punishment mechanism: as shown in Equation (44), a punishment coefficient $p_k$ is introduced into $e_2(\cdot)$ to punish the behavior of the ball deviating from the equilibrium point while continuing to accelerate, and to encourage the model to produce actions that quickly suppress the acceleration of the ball after it deviates from the equilibrium point during learning.
$e_2(s_t,a_t) = 1 - \mathrm{sgn}(\ddot\chi_t)\cdot p_k \cdot \ddot\chi_t$, (44)
where $\ddot\chi_t$ denotes the acceleration of the ball at time $t$, and the evaluation function $e_2(s_t,a_t) \in [0,1]$ is continuously differentiable. The smaller the acceleration of the ball near the equilibrium point, the higher the reward value obtained.
The above designs the evaluation functions from three aspects: the state space of the system, the position distribution of the ball, and the acceleration control of the ball. The DRM then yields the detail-reward function as follows:
$R(s_t,a_t) = \prod_{i=0}^{2} e_i(s_t,a_t)$ (45)
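For concreteness, the sketch below implements Equations (42)–(45) in Python. The beam length, the velocity and angular-velocity limits, and the punishment coefficient $p_k$ are illustrative assumptions rather than the authors' exact values, and the ball acceleration is passed in explicitly because it is derived from the dynamics rather than stored in the state vector.

```python
import numpy as np

# Illustrative bounds; l_r, the velocity/angle limits, and p_k are assumptions
# for this sketch, not the values used in the paper.
L_R = 0.4                                      # beam length [m]
CHI_DOT_MAX, ALPHA_MAX, ALPHA_DOT_MAX = 0.5, 0.209, 2.0
P_K = 0.1                                      # punishment coefficient in e2

def e0(s):
    """Equation (42): 1 inside the admissible state space Omega_x, else 0."""
    chi, chi_dot, alpha, alpha_dot = s
    inside = (abs(chi) <= L_R / 2 and abs(chi_dot) <= CHI_DOT_MAX
              and abs(alpha) <= ALPHA_MAX and abs(alpha_dot) <= ALPHA_DOT_MAX)
    return 1.0 if inside else 0.0

def e1(s):
    """Equation (43): larger reward the closer the ball is to the equilibrium point."""
    chi = s[0]
    return 1.0 - np.sign(chi) * 2.0 * chi / L_R

def e2(s, chi_ddot):
    """Equation (44): penalize large ball acceleration via the coefficient p_k."""
    return 1.0 - np.sign(chi_ddot) * P_K * chi_ddot

def detail_reward(s, chi_ddot):
    """Equation (45): product of the three evaluation functions."""
    return e0(s) * e1(s) * e2(s, chi_ddot)

print(detail_reward(np.array([0.04, 0.1, 0.02, 0.1]), chi_ddot=0.3))
```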

3.1.3. Deep Deterministic Policy Gradient (DDPG)

Considering that the BABS has the characteristics of a high-dimensional state, we use DRL as the control method, and use the feature expression ability of the deep network and the decision-making ability of reinforcement learning to realize the control of the BABS. In DRL, DDPG as a deterministic policy algorithm based on an actor-critic framework is widely used in many systems with continuous output characteristics [64]. Experience replay, target network, and exploration in DDPG can effectively solve the problems of strong data correlation and lack of stability in RL [65].
The DDPG algorithm mainly includes the following parts: ① The networks include an actor and a critic, each composed of a training network and a target network. The actor training network is defined as $\mu(s|w^\mu)$, the actor target network as $\mu'(s|w^{\mu'})$, the critic training network as $Q(s,a|w^Q)$, and the critic target network as $Q'(s,a|w^{Q'})$. ② An experience replay buffer $D$ with capacity $N$ is introduced to store the data from the interaction between the agent and the environment. The target network parameters are updated by soft update to ensure the stability of the training networks. ③ Random noise $\varsigma$ is added to the output of the actor training network to ensure that the agent has a certain exploration ability when selecting actions [66].
At $s_t$, the agent obtains the execution action $a_t$ through the network $\mu$. After noise is added to the action, the agent interacts with the environment, generating an immediate reward $r_{t+1}$, and the state is updated to $s_{t+1}$. The agent stores the tuple $(s_t, a_t, r_{t+1}, s_{t+1})$ in the experience replay buffer. Once a predetermined amount of replay data has been accumulated, the $Q$ network is trained by sampling data from the buffer. Since $Q$ evaluates the actions made by $\mu$, the deterministic policy gradient method is used to optimize the network parameters of $\mu$, and the target network parameters are updated by soft update. Agent training repeats the above process until the networks converge.
The target action value function in the DDPG algorithm can be expressed as
$y_i = r_{i+1} + \gamma\, Q'\big(s_{i+1}, \mu'(s_{i+1}|w^{\mu'}) \,\big|\, w^{Q'}\big)$. (46)
The loss function used to update the critic training network is as follows:
$L(w) = \dfrac{1}{N}\sum_i \big(y_i - Q(s_i, a_i|w^Q)\big)^2$. (47)
Using the deterministic sampling policy gradient to update the actor training network parameters, the expression is
$\nabla_{w^\mu} J \approx \dfrac{1}{N}\sum_i \nabla_a Q(s,a|w^Q)\big|_{s=s_i,\,a=\mu(s_i)}\, \nabla_{w^\mu}\mu(s|w^\mu)\big|_{s=s_i}$. (48)
We define the actor training network in DDPG as PM. When the actor network converges, PM is the optimal policy model. Figure 6 shows the actor network in DDPG. The critic network and actor network have the same structure, and the difference is that the input parameters of the two networks are different. In addition, the actor network requires the tanh function [67] to constrain the output values, whereas the critic network does not. Table 1 shows the network parameters of DDPG algorithm model. Algorithm 1 shows the process of the DDPG algorithm.
Algorithm 1: DDPG algorithm
Randomly initialize critic $Q(s,a|w^Q)$ and actor $\mu(s|w^\mu)$ with weights $w^Q$ and $w^\mu$
Initialize target networks $Q'$ and $\mu'$ with weights $w^{Q'} \leftarrow w^Q$, $w^{\mu'} \leftarrow w^\mu$
Initialize replay buffer $D$
1.  For episode = 1, M do
2.    Initialize a random process $\mathcal{N}$ for action exploration
3.    Receive initial observation state $s = [\chi, \dot\chi, \alpha, \dot\alpha]$
4.    For t = 1, T do
5.      Select action $a_t = \mu(s_t|w^\mu) + \mathcal{N}_t$ according to the current policy and exploration noise
6.      Execute action $a_t$ and observe reward $r_{t+1}$ and new state $s_{t+1}$
7.      Store transition $(s_t, a_t, r_{t+1}, s_{t+1})$ in $D$
8.      Sample a random minibatch of $N$ transitions $(s_i, a_i, r_{i+1}, s_{i+1})$ from $D$
9.      Set $y_i = r_{i+1} + \gamma Q'(s_{i+1}, \mu'(s_{i+1}|w^{\mu'})|w^{Q'})$
10.     Update the critic by minimizing the minibatch loss: $L = \frac{1}{N}\sum_i \big(y_i - Q(s_i,a_i|w^Q)\big)^2$
11.     Update the actor policy using the sampled policy gradient:
        $\nabla_{w^\mu} J \approx \frac{1}{N}\sum_i \nabla_a Q(s,a|w^Q)\big|_{s=s_i,\,a=\mu(s_i)}\, \nabla_{w^\mu}\mu(s|w^\mu)\big|_{s=s_i}$
12.     Update the target networks:
        $w^{Q'} \leftarrow \tau w^Q + (1-\tau)w^{Q'}$
        $w^{\mu'} \leftarrow \tau w^\mu + (1-\tau)w^{\mu'}$
13.    end for
14.  end for
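As a rough companion to Algorithm 1, the following PyTorch sketch shows one actor–critic update step (steps 9–12). The network sizes, the optimizers, and the soft-update rate tau are assumptions for illustration and do not reproduce the parameters in Table 1.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Actor(nn.Module):
    """Maps the 4-dimensional BABS state to one bounded action (tanh output).
    Hidden sizes are assumptions for this sketch, not the values in Table 1."""
    def __init__(self, s_dim=4, a_dim=1, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(s_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, a_dim), nn.Tanh())
    def forward(self, s):
        return self.net(s)

class Critic(nn.Module):
    """Estimates Q(s, a) from the concatenated state and action."""
    def __init__(self, s_dim=4, a_dim=1, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(s_dim + a_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))
    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))

def soft_update(target, source, tau=0.005):
    """Step 12 of Algorithm 1: w' <- tau*w + (1 - tau)*w'."""
    for tp, sp in zip(target.parameters(), source.parameters()):
        tp.data.copy_(tau * sp.data + (1.0 - tau) * tp.data)

def ddpg_update(batch, actor, critic, actor_t, critic_t, opt_a, opt_c, gamma=0.99):
    """One training step corresponding to steps 9-12 of Algorithm 1."""
    s, a, r, s_next = batch                      # minibatch tensors; r shaped (batch, 1)
    with torch.no_grad():                        # target value y_i (Equation (46))
        y = r + gamma * critic_t(s_next, actor_t(s_next))
    critic_loss = F.mse_loss(critic(s, a), y)    # Equation (47)
    opt_c.zero_grad(); critic_loss.backward(); opt_c.step()

    # Ascend the sampled policy gradient (Equation (48)) by minimizing -Q.
    actor_loss = -critic(s, actor(s)).mean()
    opt_a.zero_grad(); actor_loss.backward(); opt_a.step()

    soft_update(critic_t, critic); soft_update(actor_t, actor)
```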

3.2. Design of Visual Location Method

Obtaining the position $\chi_t$ of the ball at time $t$ on the real platform requires the following two steps: first, the image collected by the vision sensor is processed, the pixel coordinates of the ball centroid are located, and the pixel distance $l_t$ from the ball centroid to the round marker on the beam is calculated (the position of the round marker is shown in Figure 7); then, the mathematical model of the coordinate transformation is used to convert it to the physical position $\chi_t$ on the real platform.
In order to improve the location accuracy of the ball centroid, we paint the surface of the ball black, the whole beam is silver, and black round markers (defined as P1 and P2) are embedded on both sides of the beam. Before calculating the pixel coordinates of the ball centroid, the image needs to be preprocessed. The preprocessing steps are as follows: First, ROI processing is performed on the image, taking the visual range swept by the rotating beam under the vision sensor as the target area, which reduces the image processing time and improves the system control efficiency. Secondly, the ROI-processed image is converted from the RGB color space to the HSV color space. Finally, the HSV values of the pixels on the surface of the ball are taken as the threshold, and the image is binarized; the binarization formula is as follows:
$P(u,v) = \begin{cases} 255, & \text{if } H(u,v) \notin [H_{\min}, H_{\max}] \text{ or } S(u,v) \notin [S_{\min}, S_{\max}] \text{ or } V(u,v) \notin [V_{\min}, V_{\max}] \\ 0, & \text{else} \end{cases}$ (49)
where $P(u,v)$ denotes the pixel value of the pixel point, $(u,v)$ represents the coordinates of the pixel point in the image, and $H(u,v)$, $S(u,v)$, and $V(u,v)$ denote the H, S, and V values of the pixel point, respectively. Through the above formula, the HSV image can be converted into a binary image. Figure 7 shows the preprocessing process.
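A minimal OpenCV sketch of this preprocessing chain is shown below. The ROI bounds and the HSV thresholds are placeholders chosen for illustration; the inRange mask is inverted so that, as in Equation (49), pixels matching the ball's HSV range become 0 and all other pixels become 255.

```python
import cv2
import numpy as np

# ROI and HSV thresholds are assumptions for this sketch; in practice the
# thresholds come from the HSV values of the black-painted ball surface.
ROI = (slice(200, 320), slice(0, 640))          # rows, cols covering the beam's sweep
LOWER = np.array([0, 0, 0])                     # H_min, S_min, V_min
UPPER = np.array([180, 255, 60])                # H_max, S_max, V_max

def preprocess(frame_bgr):
    """ROI crop -> HSV conversion -> binarization per Equation (49):
    pixels inside the ball's HSV range become 0, everything else 255."""
    roi = frame_bgr[ROI]
    hsv = cv2.cvtColor(roi, cv2.COLOR_BGR2HSV)
    in_range = cv2.inRange(hsv, LOWER, UPPER)   # 255 where the pixel matches the ball
    return cv2.bitwise_not(in_range)            # invert so the ball region is 0

# Usage: binary = preprocess(cv2.imread("frame.png"))
```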
We try to use conventional image processing methods [57,58] to extract the outline of the ball in the binary image, and draw the minimum bounding rectangle [68] outside the outline of the ball. Considering the ratio between the coordinate value of the pixel and the actual physical value, the pixel coordinates of the bounding rectangle center point can be used as the pixel coordinates of the ball centroid. Figure 8 shows the effect diagram of obtaining the pixel coordinates of the centroid of the ball by using the conventional image processing method.
In Figure 8, there is a white area in the outline of the ball. The main reason is that the surface of the ball is disturbed by external light during its movement, which makes it easy to form an uneven “reflection area” on the surface, so that the HSV value of the pixels in the reflection area is not within the threshold range. After the binarization of the surface of the ball with Equation (49), the pixel value in the reflection area is 255, and the pixel value of the rest is 0. It can be seen from Figure 8a that the area of the “reflection area” on the surface of the ball is small, and the “reflection area” is mainly concentrated in the middle part. The above method can detect the approximate outline of the surface of the ball, but the pixel location deviation of the centroid is large. In Figure 8b, the surface of the ball is seriously disturbed by external light, and the “reflection area” is mainly concentrated on the edge of the ball surface, and the centroid location error is large, so the traditional contour detection has great limitations.
To solve the above problems, we propose an intelligent location method for the accurate location of the centroid coordinates in the image. Algorithm constraints and conditions: ① The ball does not move in the Z direction. ② During the rolling process of the ball along the X direction, the centroid of the ball, center P1, and center P2, are collinear.
This method is inspired by the way farmers plant crops, following the objective pattern of "planting–harvest–statistics". We establish the "rules of the game" through the algorithm, simulate the sequence "planting crops -> harvesting crops -> counting crops", and use this algorithmic method to accurately locate the ball centroid coordinates in the binary image. We first introduce the "rules of the game", then describe the relationship between the feature attributes in the game and the binary image information, and finally give the specific method for locating the pixel coordinates of the ball.
(1)
Planting crops: on a predetermined square plot of land, the land is neatly divided into small plots belonging to each farmer in the row and column directions, and each farmer plants crops on his own plot. The quantitative value of the crops is denoted by the variable $b$, which satisfies $b \in [0, B]$, where $B$ is a positive integer. When planting crops, the quantitative value of the crops is uniformly set to $B$.
(2)
Harvest crops: A farmer who receives a harvest command harvests crops according to the following rules: first harvest the crop on his own plot, i.e., set $b = 0$; then harvest crops from adjacent plots, taking one unit of crop from every adjacent plot with $b > 0$. Farmers harvest crops from left to right and from top to bottom.
(3)
Counting the crop: After all harvesting instructions are completed, the farmer’s representative counts the crop. Figure 9 shows the statistical method.
In the figure, the whole rectangular area represents the predetermined land, in which the red circle indicates the farmer representative, the blue square area indicates the plot where the farmer is located, the orange box area is the window in which the farmer representative counts the crops, $R$ denotes the side length of the statistical window, and the red arrow indicates the movement direction of the farmer representative during counting. The statistical process follows these rules: the farmer representative moves along the red arrow and counts the number of crops in the window at the current position, and the count ends when he reaches the land boundary. Then, the farmer representative compares the statistical results to determine the window with the smallest number of crops.
The mapping relationship between the feature attributes in the "rules of the game" and the binary image information is as follows: the "predetermined square land" refers to the binarized image to be processed, and each pixel in the image represents the private plot belonging to the farmer at that position. In fact, in terms of the mathematical description, the "farmer" corresponds to the assignment actions of "planting crops" and "harvesting crops". When the pixel value is "0", the farmer at that position receives the instruction of "harvesting crops". The "farmer representative" corresponds to the algorithmic action of "counting the number of crops", which counts the sum of the crop quantities $b$ over the pixel positions in an $R \times R$ window, where $R$ is the side length of the statistical window.
Combined with the “rules of the game” and the above mapping relationship, we can locate the pixel coordinates of the ball centroid in the binary image. For the convenience of description, we use matrix A to represent the binary image
$A = \begin{bmatrix} 0_{(0,0)}|b_{0,0} & 0_{(0,1)}|b_{0,1} & \cdots & 255_{(0,n)}|b_{0,n} \\ 0_{(1,0)}|b_{1,0} & 255_{(1,1)}|b_{1,1} & \cdots & 0_{(1,n)}|b_{1,n} \\ \vdots & \vdots & \ddots & \vdots \\ 255_{(m,0)}|b_{m,0} & 0_{(m,1)}|b_{m,1} & \cdots & 255_{(m,n)}|b_{m,n} \end{bmatrix}$. (50)
There are $(m+1) \times (n+1)$ elements in matrix $A$, corresponding to the information of the $(m+1) \times (n+1)$ pixels in the binary image. The element "$0_{(0,0)}|b_{0,0}$" in the matrix means that the value $b$ is assigned at pixel point $(0,0)$, where "0" represents the pixel value of the pixel, "$(0,0)$" represents the pixel coordinates, and "$|$" denotes the assignment at a pixel position. The "planting crops" rule assigns the constant value $B$ to the variable $b$ at every pixel position in the binary image.
"Harvest crops" means assigning new values to the variables $b$ of each pixel position whose pixel value is 0 and of its adjacent positions in the binary image. The specific operations are as follows:
Suppose the pixel value at position $(i,j)$ is 0, with $i \in [0, m]$ and $j \in [0, n]$, and the adjacent pixel matrix is
$G_{i,j} = \begin{bmatrix} 0_{(i-1,j-1)}|b_{i-1,j-1} & 255_{(i-1,j)}|b_{i-1,j} & 0_{(i-1,j+1)}|b_{i-1,j+1} \\ 255_{(i,j-1)}|b_{i,j-1} & 0_{(i,j)}|b_{i,j} & 0_{(i,j+1)}|b_{i,j+1} \\ 0_{(i+1,j-1)}|b_{i+1,j-1} & 0_{(i+1,j)}|b_{i+1,j} & 255_{(i+1,j+1)}|b_{i+1,j+1} \end{bmatrix}$. (51)
After operating on the variable $b$ at each pixel position in matrix $G_{i,j}$, $G_{i,j}$ becomes $G'_{i,j}$, which can be written as follows:
$G'_{i,j} = \begin{bmatrix} 0_{(i-1,j-1)}|(b_{i-1,j-1}-1) & 255_{(i-1,j)}|(b_{i-1,j}-1) & 0_{(i-1,j+1)}|(b_{i-1,j+1}-1) \\ 255_{(i,j-1)}|(b_{i,j-1}-1) & 0_{(i,j)}|0 & 0_{(i,j+1)}|(b_{i,j+1}-1) \\ 0_{(i+1,j-1)}|(b_{i+1,j-1}-1) & 0_{(i+1,j)}|(b_{i+1,j}-1) & 255_{(i+1,j+1)}|(b_{i+1,j+1}-1) \end{bmatrix}$. (52)
After the $b$-variables of the pixel positions in this matrix have been operated on, the next pixel $(i, j+1)$ and the $b$-variables of its adjacent pixel positions are processed, following the "harvesting crops" direction. After the $b$-variable at every pixel position in $A$ has been processed, $A$ becomes $A'$ (the number of elements does not change); for example, $A'$ can be written as follows:
$A' = \begin{bmatrix} 0_{(0,0)}|0 & 0_{(0,1)}|0 & \cdots & 255_{(0,n)}|0 \\ 0_{(1,0)}|0 & 255_{(1,1)}|0 & \cdots & 0_{(1,n)}|0 \\ \vdots & \vdots & \ddots & \vdots \\ 255_{(m,0)}|b_{m,0} & 0_{(m,1)}|0 & \cdots & 255_{(m,n)}|B \end{bmatrix}$. (53)
"Counting crops" means dividing $A'$ into $n + 2 - R$ windows along the moving direction of the ball in the binary image and summing the $b$-values of all pixel positions in each window. Assuming that the virtual centroid of the ball is at position $(m_2, j)$ in $A'$, with $j \in \left[\frac{R}{2},\, n + 2 - \frac{R}{2}\right]$, the window associated with the $b$-variable can be expressed as follows
$Q_{(m_2,j)} = \begin{bmatrix} b_{m_2-\frac{R}{2},\,j-\frac{R}{2}} & b_{m_2-\frac{R}{2},\,j-\frac{R}{2}+1} & \cdots & b_{m_2-\frac{R}{2},\,j} & \cdots & b_{m_2-\frac{R}{2},\,j+\frac{R}{2}} \\ b_{m_2-\frac{R}{2}+1,\,j-\frac{R}{2}} & b_{m_2-\frac{R}{2}+1,\,j-\frac{R}{2}+1} & \cdots & b_{m_2-\frac{R}{2}+1,\,j} & \cdots & b_{m_2-\frac{R}{2}+1,\,j+\frac{R}{2}} \\ \vdots & \vdots & & \vdots & & \vdots \\ b_{m_2,\,j-\frac{R}{2}} & b_{m_2,\,j-\frac{R}{2}+1} & \cdots & b_{m_2,\,j} & \cdots & b_{m_2,\,j+\frac{R}{2}} \\ \vdots & \vdots & & \vdots & & \vdots \\ b_{m_2+\frac{R}{2},\,j-\frac{R}{2}} & b_{m_2+\frac{R}{2},\,j-\frac{R}{2}+1} & \cdots & b_{m_2+\frac{R}{2},\,j} & \cdots & b_{m_2+\frac{R}{2},\,j+\frac{R}{2}} \end{bmatrix}$ (54)
where $b_{m_2, j}$ represents the $b$-value at position $(m_2, j)$, and the sum of all $b$-values in $Q_{(m_2,j)}$ is
$T_{(m_2,j)} = \sum_{k_2=0}^{R}\sum_{k_1=0}^{R} b_{m_2-\frac{R}{2}+k_1,\ j-\frac{R}{2}+k_2}$. (55)
In this way, the sums of the $b$-values in all $n + 2 - R$ windows are counted. Then, the window with the smallest sum is found, i.e.,
$\min\left\{ T_{(m_2,\frac{R}{2})},\ T_{(m_2,\frac{R}{2}+1)},\ \dots,\ T_{(m_2,\,n+2-\frac{R}{2})} \right\}$. (56)
The center coordinates $(m_2, j)$ of this minimum window are the pixel coordinates of the centroid of the ball.
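The whole "planting–harvest–statistics" procedure can be written compactly in NumPy, as sketched below. The function assumes the ball centre moves along a single known image row (`center_row`, the line through markers P1 and P2), which is how the method is used here; the default values of B and R follow the settings reported later in this section.

```python
import numpy as np

def locate_ball_centroid(binary, center_row, B=3, R=45):
    """Locate the ball-centroid column in a binary image (ball pixels are 0,
    background is 255) using the planting/harvest/statistics rules."""
    m, n = binary.shape
    # "Planting crops": every pixel starts with a crop quantity of B.
    b = np.full((m, n), B, dtype=int)

    # "Harvesting crops": every zero-valued pixel clears its own plot and takes
    # one unit from each 8-neighbour whose quantity is still positive.
    for i, j in np.argwhere(binary == 0):
        b[i, j] = 0
        for di in (-1, 0, 1):
            for dj in (-1, 0, 1):
                if di == 0 and dj == 0:
                    continue
                ni, nj = i + di, j + dj
                if 0 <= ni < m and 0 <= nj < n and b[ni, nj] > 0:
                    b[ni, nj] -= 1

    # "Counting crops": slide an R x R window centred on center_row along the
    # columns and keep the window whose crop total is smallest.
    half = R // 2
    r0, r1 = max(center_row - half, 0), min(center_row + half + 1, m)
    sums = [b[r0:r1, j - half:j + half + 1].sum() for j in range(half, n - half)]
    best_col = half + int(np.argmin(sums))
    return center_row, best_col
```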
In order to verify the practicability of the proposed method, we collect multiple images under different circumstances, and use the proposed method to locate the pixel coordinates of the ball centroid (the value of constant B in “Planting crops” is set to 3, and the value of R in “Counting crops” is set to 45). Figure 10 shows the location effect under different light influence conditions.
Figure 10a shows the location experiment of the ball without the interference of external light. The intersection point of the green line and the center line of the beam is the centroid determination point of the ball. The white area within the outline of the ball is caused by the wear on its surface. Since the HSV value of the pixel point in the worn part is not within the threshold, the pixel values of the worn part as a whole will become 255 after binarization. The red circle in the figure indicates the worn part of the beam. The pixel HSV value of this part is within the threshold range, therefore, after binarization, the pixel values of the worn part will become 0 as a whole.
Figure 10b shows that the ball is disturbed by external light, and the “reflection area” is mainly concentrated in the middle part of its surface. This situation is common in the control process of the BABS. Most of the black areas in the outline of the ball can be obtained by conventional image processing methods, but in the calculation process, the location deviation is often large and unstable.
Figure 10c shows that the ball is disturbed by external light, and the “reflection area” is mainly concentrated on the edge of the ball surface. In this case, it is difficult to obtain the complete contour of the ball, and there is a serious deviation in coordinate location.
Figure 10 shows the location results of the intelligent algorithm given above. In all three cases, accurate location is achieved regardless of the interference of external light. However, it should be pointed out that under extremely strong light interference, the contour of the ball cannot be distinguished in extreme cases, and this algorithm will still be affected; this situation may need to be solved by using professional cameras or adding filters. In addition, the practicability of this algorithm is not degraded by surface wear of the ball or of the beam. Table 2 shows the pixel coordinates of the ball centroid in the above three cases.
It can be seen from Table 2 that the maximum deviation between the location coordinates and the actual coordinates is less than three pixels. The above method benefits from a “game rule”, which does not require complex numerical operations on image information, and not only ensures the location accuracy, but also improves the image processing efficiency.
After the pixel coordinates of the ball are obtained, a spatial geometry transformation is required to map the image position to the physical position. Assuming that the pixel coordinates of the ball centroid obtained by this algorithm are $(m_2, j)$ and the pixel coordinates of p1 are $(m_2, k)$, the pixel distance $l$ from p1 to the ball centroid can be calculated. The spatial geometry transformation method in the literature then yields the physical position χ of the ball.
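The paper defers this step to a spatial geometry transformation from the literature. Purely as an illustration of the idea, and under the simplifying assumption of a head-on camera view in which one pixel corresponds to a fixed physical length, the mapping could be sketched as follows; the reference point p1, the beam length in pixels, and all names are our assumptions.

```python
def pixel_to_physical(j_ball, j_p1, beam_length_px, beam_length_m=0.4):
    """Map the pixel distance l between reference point p1 and the ball
    centroid to a physical position chi along the beam (in meters).

    Assumes a head-on view so the pixel-to-meter scale is constant; the
    real system uses the spatial geometry transformation cited in the text.
    beam_length_m = 0.4 m follows the beam length l_r in Table 5.
    """
    l_px = abs(j_ball - j_p1)                        # pixel distance l
    meters_per_pixel = beam_length_m / beam_length_px
    return l_px * meters_per_pixel                   # physical position chi
```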

4. Experiments and Results

This section describes and analyzes the experimental results of the BABS in the dynamic environment and on the real platform, and compares them with several control methods. The experiments are divided into three parts: the experimental effects of different reward functions in the dynamic environment (Section 4.1), the control effect on the real platform (Section 4.2), and the experimental results of different control methods on the real platform (Section 4.3). The preset equilibrium position in Section 4.1, Section 4.2 and Section 4.3 is $(0, 0, 0, 0)^T$, i.e., the control objective is to balance the ball at the 0 point with the beam angle equal to 0.
Next, the basic environment, hardware, and error requirements of the experiments are described. The program is written in Python 3.7 and runs on a Dell G5 5500 PC with Windows 10 and an i7-10750H processor. The real platform is driven by a Tamagawa TS4607N2190E200 servo motor. Different models and parameters are used in the control algorithms. Since the focus of this paper is on deep reinforcement learning and algorithm design, the motor drive details are not elaborated. According to the industry standard for the BABS, the required control accuracy of the ball is 7 mm, the stability accuracy is 4 mm, and the repetition accuracy is 4 mm. In the balanced state, the swing amplitude of the beam must not exceed 0.087 rad.

4.1. Simulation Experiment

In the dynamic environment of the BABS, the deep reinforcement learning parameters are shown in Table 3, the initial states of the BABS in the training and testing phases are shown in Table 4, and the physical parameters of the BABS are shown in Table 5.
We first describe the agent training rules for the following experiments. Each complete control process is called an episode, and in each episode the agent is allowed to learn for at most 500 time steps. The admissible range of the ball position is [−0.16, 0.16], and the admissible range of the beam angle is [−0.209, 0.209]. During training, as long as the ball position and beam angle remain within these ranges in the dynamic environment, the agent continues to learn up to the maximum number of time steps in the episode; at each step, the step counter increases by 1 and the instant reward is added to the cumulative reward. Otherwise, the learning round is terminated. The total steps and cumulative reward of each episode are recorded.
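As a concrete illustration of these rules, the following sketch shows the episode and step bookkeeping, assuming a Gym-style environment whose step() returns the next state (χ, χ̇, α, α̇), the instant reward, and a termination flag; the env and agent interfaces and the state ordering are placeholders rather than the paper's implementation.

```python
MAX_STEPS = 500
BALL_RANGE = (-0.16, 0.16)      # admissible ball position (m)
ANGLE_RANGE = (-0.209, 0.209)   # admissible beam angle (rad)

def run_episode(env, agent, train=True):
    """One episode: learn until the state leaves the admissible ranges
    or MAX_STEPS admissible actions have been taken."""
    state = env.reset()
    steps, total_reward = 0, 0.0
    for _ in range(MAX_STEPS):
        action = agent.act(state)
        next_state, reward, done, _ = env.step(action)
        if train:
            agent.learn(state, action, reward, next_state, done)
        steps += 1                      # step counter increases by 1
        total_reward += reward          # add the instant reward to the sum
        chi, _, alpha, _ = next_state   # assumed state ordering
        in_range = (BALL_RANGE[0] <= chi <= BALL_RANGE[1]
                    and ANGLE_RANGE[0] <= alpha <= ANGLE_RANGE[1])
        if done or not in_range:        # exit the learning round
            break
        state = next_state
    return steps, total_reward          # recorded for every episode
```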
In order to verify the feasibility of the detail-reward function, we take the sparse reward function in Equation (41) as the comparison object and analyze the experimental effects of the two reward functions. Figure 11 compares the experimental effects of the two reward functions in the training and testing phases. Figure 11a shows the number of admissible control actions output by the BABS agent in each episode during training. Figure 11b,c show the results of the testing phase: Figure 11b shows the ball position at each step, and Figure 11c shows the beam angle at each step. In the BABS, both the ball position and the beam angle are required to be controlled near zero (χ = 0, α = 0).
It can be seen from Figure 11a that the network converges faster under the detail-reward function: after about 120 episodes, the number of learning steps reaches 500. In part A of the figure, the actions output by the PM keep the state variables near the equilibrium position for 500 consecutive steps, indicating good model stability. Under the sparse reward function, the number of learning steps remains below 100, the model converges slowly, and training takes longer. We test the PMs produced by the detail-reward function and the sparse reward function over 200 steps. Figure 11b,c show that under the sparse reward function the steady-state error of the ball position is large and does not converge to the 0 point; the overshoot of the beam angle is large, the beam keeps vibrating near the 0-point equilibrium position, the adjustment time is long, and the stability is weak. In this case, the system is likely to diverge when disturbed. In contrast, the PM trained with the detail-reward function drives the ball position and beam angle to converge quickly to near 0, and the stability of the model is better.
Comparing the simulation results of the two reward functions leads to the following conclusions. With the same reinforcement learning method, training a PM under the sparse reward function is difficult, the model converges more slowly, and the control effect of the trained PM needs improvement. The detail-reward function guides the agent to learn useful experience quickly, so the model converges faster in the training stage; in the testing stage, the PM trained with the detail-reward function keeps the system stable with small overshoot. Detail-reward functions are therefore feasible for solving nonlinear control problems with deep reinforcement learning.
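The exact detail-reward function and the sparse reward of Equation (41) are defined earlier in the paper; purely to make the qualitative difference concrete, the two reward shapes could be sketched as follows, where the tolerances and weighting coefficients are illustrative placeholders, not the paper's values.

```python
def sparse_reward(chi, alpha, tol_chi=0.004, tol_alpha=0.087):
    """Sparse signal: reward only when both state variables are already
    inside the target band, so the agent gets little guidance elsewhere."""
    return 1.0 if abs(chi) <= tol_chi and abs(alpha) <= tol_alpha else 0.0

def detail_style_reward(chi, chi_dot, alpha, alpha_dot):
    """Detail-style signal: every control detail (position, angle and their
    rates) contributes a graded term, so informative feedback is available
    far from the equilibrium point as well (coefficients are placeholders)."""
    return -(1.0 * chi ** 2 + 0.1 * chi_dot ** 2
             + 1.0 * alpha ** 2 + 0.1 * alpha_dot ** 2)
```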

4.2. Real Platform Control Experiment

We transplant the PM trained in the dynamic environment to the real platform as the control model of the system. Under the action of the control law, we record the state data of the system for 6 s. Figure 12 shows the test results on the real platform.
In the figure, the maximum deviations of the beam angle and the ball position are 0.043 rad and 0.02 m, respectively. Two problems arise in the control process: ① the instability of the beam acceleration produces a rectangular (square-wave-like) oscillation of the beam angle curve at the equilibrium point; ② the ball drifts slightly near the equilibrium point. To maintain a better control effect on the real platform, the algorithm should be improved further, which is expected to alleviate these two problems.
The reasons for these problems may be: ① there are physical differences between the dynamic environment and the real platform, so it is difficult to fully reproduce the physical states of the dynamic environment on the real platform; ② the real platform is subject to uncertain disturbances, such as mechanical deviation, control error introduced by the servo electronics, air resistance, etc., which are difficult to account for without bias when designing the dynamic environment. In addition, in deep reinforcement learning the PM is a deep neural network, and its output only approaches the target value asymptotically. Therefore, taking the trained PM directly as the control model of the real system has certain limitations.
Combining LQR theory and nonlinear control theory [69], in engineering practice a linearized control law can not only reduce the complexity of the system, but also simplify the calculation of the model and improve the stability of the system. We regard the PM trained in the dynamic environment as an “approximator” of an ideal nonlinear control function: it may not be accurate in every output, but it can ensure convergence of the control in the state space of the system. We therefore expect to obtain a satisfactory linearized control law from this “approximator”. According to LQR theory, it is assumed that there is an accurate linearization function
$$u(t) = A x_1(t) + B x_2(t) + C x_3(t) + D x_4(t)$$
where $A$, $B$, $C$, and $D$ are constant coefficients, and this function should be the linearized trend function of the “approximator” near the equilibrium point. Therefore, we can obtain a large number of output values near the equilibrium point through the “approximator”, find a linear trend line passing through the equilibrium point, and use this trend line to replace the linearization function of the LQR. In this way, we process the PM as follows:
(1) Observe the linear characteristics of the output values of the PM, produced by training in the dynamic environment, near the equilibrium point on the real platform.
(2) In the case of satisfactory linearity, obtain the best linear parameters $A$, $B$, $C$, and $D$ by linear fitting.
If the PM is close to the ideal model, the output points should show good linearity. On this premise, the discrete output points of the PM are fitted linearly, assuming the fitted function is
$$u(t) = A x_1(t) + B x_2(t) + C x_3(t) + D x_4(t) + E.$$
First, the value of $E$ in the fitted function must not deviate far from zero, because a large offset is inconsistent with the control law near the equilibrium point. Then, when $E$ is close to 0, the zero-crossing linearization method is used so that the control output at the equilibrium point is zero, while still respecting the output law of the linearized control model near the equilibrium point. The linearized function becomes
$$u(t) = A x_1(t) + B x_2(t) + C x_3(t) + D x_4(t).$$
After the constant term $E$ is eliminated, the linearized function remains close to the output of the model and also avoids the oscillation caused by the discontinuous output values of the PM near the equilibrium point.
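A minimal sketch of steps (1) and (2) together with the zero-crossing fit might look like the following, assuming the trained PM can be queried as a function policy(state) returning the scalar control u; the sampling range, the linearity threshold, and all names are our assumptions.

```python
import numpy as np

def linearize_policy(policy, n_samples=5000, scale=(0.02, 0.02, 0.02, 0.02)):
    """Fit u = A*x1 + B*x2 + C*x3 + D*x4 (+ E) to PM outputs sampled near
    the equilibrium point, then refit through the origin so that E = 0."""
    rng = np.random.default_rng(0)
    X = rng.uniform(-1.0, 1.0, size=(n_samples, 4)) * np.asarray(scale)
    u = np.array([policy(x) for x in X])            # query the "approximator"

    # Steps (1)-(2): fit with an intercept and check that E stays near zero.
    X_aug = np.hstack([X, np.ones((n_samples, 1))])
    A, B, C, D, E = np.linalg.lstsq(X_aug, u, rcond=None)[0]
    if abs(E) > 1e-2:                               # illustrative threshold
        raise ValueError("poor linearity: intercept E is far from zero")

    # Zero-crossing linearization: refit without the constant term.
    gains = np.linalg.lstsq(X, u, rcond=None)[0]    # [A, B, C, D]
    return gains
```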
We call this model processing method policy model linearization (PML). Figure 13 shows the experimental effect comparison of PM and PML under the real platform.
From the experimental results in Figure 13a,b, it can be seen that with PML as the control model, the maximum error of the beam angle from the equilibrium point is 0.018 rad, a reduction of 0.025 rad compared with PM; the maximum error of the ball position is 0.016 m, a reduction of 0.004 m compared with PM. Comparing the experimental effects of PM and PML leads to the following conclusions: ① with PML as the control model, the output control law is smoother and more effective, the overshoot during the control process is significantly reduced, and the control of the beam and the ball is more accurate and reasonable, without the influence of command pulsation; ② near the equilibrium point, the vibration of the beam decreases significantly until it approaches 0; ③ although narrow-amplitude vibration still occurs during the adjustment stage, the time to reach equilibrium is significantly shorter; ④ overall, the control effect of PML is significantly better than that of PM.

4.3. Experimental Comparison of Different Control Methods

Anti-disturbance ability is an indispensable property of an effective control system, so we next test the anti-disturbance ability of the models. Figure 14 shows the state curves of the BABS under different control laws after the ball is disturbed. Three common control methods are compared with PML in the experiment: PID, FC, and LQR. Figure 14a shows the beam angle curves over 16 s, Figure 14b the ball position curves, Figure 14c the beam angular velocity curves, and Figure 14d the ball velocity curves.
From Figure 14a,b it can be seen that when the PID and FC methods are used in the BABS, the angle and position curves of the beam and the ball oscillate severely in the initial stage. As time increases, the beam angle and ball position oscillate near the equilibrium point; after 13 s, the ball position gradually converges around the equilibrium point, but the beam still oscillates significantly, so the system is less stable. As a control method based on the system dynamics model, LQR has a shorter adjustment time, and the oscillation amplitude of the beam near the equilibrium point is significantly smaller than with PID and FC; after 8 s, the ball position and beam angle approach 0. The regulation ability of PML is better than that of PID and FC. Compared with LQR, PML has a longer regulation time: it takes 11 s for the system to reach the equilibrium state, while the overshoot and steady-state error remain reasonable. Figure 14c,d compare the beam angular velocity and ball velocity curves under the four methods; PML adjusts quickly and rapidly reaches a stable state.
Comparing the experimental results of these methods leads to the following conclusions. FC and PID have some control effect on the BABS, and the control quality may be improved by modifying the model parameters and structures, but this depends on the designer's experience. The control effect of LQR is good, but it relies on an accurate dynamic model, and the designer must tune the Q and R parameters of the performance index function based on experience. The DRM is a better option: combined with the rich algorithms in deep reinforcement learning, the detail-reward function is designed from the perspective of the system's control details to accomplish the control task, and the designer's empirical control methods or personalized designs can also be integrated into the algorithms.
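For reference, the LQR gains used in such a comparison are typically obtained by solving the continuous-time algebraic Riccati equation; a standard sketch is shown below, where the linearized system matrices A and B and the designer-tuned weights Q and R must be supplied (the specific matrices used in the experiments above are not reproduced here).

```python
import numpy as np
from scipy.linalg import solve_continuous_are

def lqr_gain(A, B, Q, R):
    """Solve the continuous-time Riccati equation and return the
    state-feedback gain K for u = -K x; Q and R are the designer-tuned
    weights of the performance index discussed above."""
    P = solve_continuous_are(A, B, Q, R)
    return np.linalg.solve(R, B.T @ P)
```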

5. Conclusions and Future Work

In this paper, a detail-reward function in deep reinforcement learning is designed for the first time to solve the control problems of the BABS. It is proven via the Q-function that the detail-reward function can obtain the best control policy, and the experimental results show that it can complete the control task, which demonstrates its rationality and feasibility. In addition, an intelligent vision-based location method is designed to capture the ball position; compared with conventional image processing methods, this algorithm locates the ball in the BABS accurately, with an error of less than three pixels. To improve the control ability of the system, a linearization method near the equilibrium point is applied to the policy model generated by deep reinforcement learning. Experiments show that this method improves the control stability and anti-interference ability of the system. The above methods provide good technical reference value for solving the control problems of nonlinear systems.
As for future work, the following directions are possible: ① apply the design idea of the detail-reward mechanism to more nonlinear control systems, such as multi-agent systems and robot motion control systems; ② introduce the detail-reward function into more deep reinforcement learning algorithms for verification, such as A3C, TD3, and PPO; ③ introduce the BABS structure into a V-REP environment, realize end-to-end learning from input image to output action with the help of virtual vision sensors, and study compressing the control model to improve control efficiency; ④ since the control effect of a model transplanted from the dynamic environment to the real platform is not ideal, due to the differences between the two environments, study how to reduce these differences so as to improve the control quality on the real platform.

Author Contributions

Conceptualization, X.L. and S.Y.; methodology, X.L. and S.Y.; formal analysis, X.L. and S.Y.; investigation, X.L. and S.Y.; resources, X.L., S.Y., Z.C. and Y.Z.; writing—original draft preparation, X.L. and S.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Joint Development Research Institute of Intelligent Motion Control Technology of the Liaoning Provincial Department of Education and the National Key R & D Program of China (Grant No. 2017YFB1300700).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Acknowledgments

The authors thank the editor, the associate editor and reviewers for their helpful suggestions.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Murray, R.M. Future directions in control, dynamics, and systems: Overview, grand challenges, and new courses. Eur. J. Control 2003, 9, 144–158. Available online: http://www.cds.caltech.edu/~murray/papers/2003l_mur03-ejc.html (accessed on 14 January 2003). [CrossRef]
  2. Bars, R.; Colaneri, P.; de Souza, C.E.; Dugard, L.; Allgöwer, F.; Kleimenov, A.; Scherer, C. Theory, algorithms and technology in the design of control systems. Annu. Rev. Control 2006, 30, 19–30. [Google Scholar] [CrossRef]
  3. Boubaker, O. The inverted pendulum: A fundamental benchmark in control theory and robotics. In Proceedings of the International Conference on Education and e-Learning Innovations, Sousse, Tunisia, 1–3 July 2012; IEEE: Piscataway, NJ, USA, 2012; pp. 1–6. [Google Scholar] [CrossRef]
  4. Andreev, F.; Auckly, D.; Gosavi, S.; Kapitanski, L.; Kelkar, A.; White, W. Matching, linear systems, and the ball and beam. Automatica 2002, 38, 2147–2152. [Google Scholar] [CrossRef]
  5. Aranda, J.; Chaos, D.; Dormido-Canto, S.; Muñoz, R.; Manuel Díaz, J. Benchmark control problems for a non-linear underactuated hovercraft: A simulation laboratory for control testing. IFAC Proc. Vol. 2006, 39, 463–468. [Google Scholar] [CrossRef]
  6. Hauser, J.; Sastry, S.; Kokotovic, P. Nonlinear control via approximate input-output linearization: The ball and beam example. IEEE Trans. Autom. Control 1992, 37, 392–398. [Google Scholar] [CrossRef]
  7. Nguyen, T.D.; Minh, P.V.; Phan, H.C.; Ta, D.T.; Nguyen, T.D. Bending of symmetric sandwich FGM beams with shear connectors. Math. Probl. Eng. 2021, 2021, 7596300. [Google Scholar] [CrossRef]
  8. Tran, T.T.; Nguyen, N.H.; Do, T.V.; Minh, P.V.; Nguyen, D.D. Bending and thermal buckling of unsymmetric functionally graded sandwich beams in high-temperature environment based on a new third-order shear deformation theory. J. Sandw. Struct. Mater. 2021, 23, 906–930. [Google Scholar] [CrossRef]
  9. Nam, V.H.; Vinh, P.V.; Chinh, N.V.; Do, T.V.; Hong, T.T. A new beam model for simulation of the mechanical behaviour of variable thickness functionally graded material beams based on modified first order shear deformation theory. Materials 2019, 12, 404. [Google Scholar] [CrossRef]
  10. Nguyen, H.N.; Hong, T.T.; Vinh, P.V.; Do, T.V. An efficient beam element based on Quasi-3D theory for static bending analysis of functionally graded beams. Materials 2019, 12, 2198. [Google Scholar] [CrossRef]
  11. Tho, N.C.; Nguyen, T.T.; To, D.T.; Minh, P.V.; Hoa, L.K. Modelling of the flexoelectric effect on rotating nanobeams with geometrical imperfection. J. Brazil. Soc. Mech. Sci. Eng. 2021, 43, 510. [Google Scholar] [CrossRef]
  12. Tho, N.C.; Ta, N.T.; Thom, D.V. New numerical results from simulations of beams and space frame systems with a tuned mass damper. Materials 2019, 12, 1329. [Google Scholar] [CrossRef]
  13. Mahmoodabadi, M.J.; Danesh, N. Gravitational search algorithm-based fuzzy control for a nonlinear ball and beam system. J. Control Decis. 2018, 5, 229–240. [Google Scholar] [CrossRef]
  14. Yu, W.; Ortiz, F. Stability analysis of PD regulation for ball and beam system. In Proceedings of the 2005 IEEE Conference on Control Applications, Toronto, ON, Canada, 28–31 August 2005; pp. 517–522. [Google Scholar] [CrossRef]
  15. Sira-Ramirez, H. On the control of the "ball and beam" system: A trajectory planning approach. In Proceedings of the 39th IEEE Conference on Decision and Control, Sydney, NSW, Australia, 12–15 December 2000; Volume 4, pp. 4042–4047. [Google Scholar] [CrossRef]
  16. Almutairi, N.B.; Zribi, M. On the sliding mode control of a ball on a beam system. Nonlinear Dyn. 2010, 59, 221–238. [Google Scholar] [CrossRef]
  17. Friedland, B. Control System Design: An Introduction to State-Space Methods; Courier Corporation: Chicago, IL, USA, 2012. [Google Scholar]
  18. Danilo, M.O.; Gil-González, W.; Ramírez-Vanegas, C. Discrete-inverse optimal control applied to the ball and beam dynamical system: A passivity-based control approach. Symmetry 2020, 12, 1359. [Google Scholar] [CrossRef]
  19. Ho, M.T.; Rizal, Y.; Chu, L.M. Visual servoing tracking control of a ball and plate system: Design, implementation and experimental validation. Int. J. Adv. Robot. Syst. 2013, 10, 287. [Google Scholar] [CrossRef]
  20. Moreno-Armendariz, M.A.; Pérez-Olvera, C.A.; Rodríguez, F.O.; Rubio, E. Indirect hierarchical FCMAC control for the ball and plate system. Neurocomputing 2010, 73, 2454–2463. [Google Scholar] [CrossRef]
  21. Yuan, D.; Zhang, Z. Modelling and control scheme of the ball–plate trajectory-tracking pneumatic system with a touch screen and a rotary cylinder. IET Control Theory Appl. 2010, 4, 573–589. [Google Scholar] [CrossRef]
  22. Mehedi, I.M.; Al-Saggaf, U.M.; Mansouri, R.; Bettayeb, M. Two degrees of freedom fractional controller design: Application to the ball and beam system. Measurement 2019, 135, 13–22. [Google Scholar] [CrossRef]
  23. Meenakshipriya, B.; Kalpana, K. Modelling and control of ball and beam system using coefficient diagram method (CDM) based PID controller. IFAC Proc. 2014, 47, 620–626. [Google Scholar] [CrossRef]
  24. Márton, L.; Hodel, A.S.; Lantos, B.; Hung, J.Y. Underactuated robot control: Comparing LQR, subspace stabilization, and combined error metric approaches. IEEE Trans. Ind. Electron. 2008, 55, 3724–3730. [Google Scholar] [CrossRef]
  25. Keshmiri, M.; Jahromi, A.F.; Mohebbi, A.; Amoozgar, M.H.; Xie, W.F. Modeling and control of ball and beam system using model based and non-model based control approaches. Int. J. Smart Sens. Intell. Syst. 2017, 5, 14–35. [Google Scholar] [CrossRef]
  26. Choudhary, M.K.; Kumar, G.N. ESO based LQR controller for ball and beam system. IFAC-Pap. 2016, 49, 607–610. [Google Scholar] [CrossRef]
  27. da Silveira Castro, R.; Flores, J.V.; Salton, A.T. A comparative analysis of repetitive and resonant controllers to a servo-vision ball and plate system. IFAC Proc. 2014, 47, 1120–1125. [Google Scholar] [CrossRef]
  28. Chang, Y.H.; Chan, W.S.; Chang, C.W.; Tao, C.W. Adaptive fuzzy dynamic surface control for ball and beam system. Int. J. Fuzzy Syst. 2011, 13, 1–7. [Google Scholar]
  29. Chien, T.L.; Chen, C.C.; Tsai, M.C.; Chen, Y.C. Control of AMIRA’s ball and beam system via improved fuzzy feedback linearization approach. Appl. Math. Model. 2010, 34, 3791–3804. [Google Scholar] [CrossRef]
  30. Castillo, O.; Lizárraga, E.; Soria, J.; Melin, P.; Valdez, F. New approach using ant colony optimization with ant set partition for fuzzy control design applied to the ball and beam system. Inf. Sci. 2015, 294, 203–215. [Google Scholar] [CrossRef]
  31. Chang, Y.H.; Chang, C.W.; Tao, C.W.; Lin, H.W.; Taur, J.S. Fuzzy sliding-mode control for ball and beam system with fuzzy ant colony optimization. Expert Syst. Appl. 2012, 39, 3624–3633. [Google Scholar] [CrossRef]
  32. Hammadih, M.L.; Hosani, K.A.; Boiko, I. Interpolating sliding mode observer for a ball and beam system. Int. J. Control 2016, 89, 1879–1889. [Google Scholar] [CrossRef]
  33. Hung, L.C.; Chung, H.Y. Decoupled control using neural network-based sliding-mode controller for nonlinear systems. Expert Syst. Appl. 2007, 32, 1168–1182. [Google Scholar] [CrossRef]
  34. Das, A.; Roy, P. Improved performance of cascaded fractional-order SMC over cascaded SMC for position control of a ball and plate system. IETE J. Res. 2017, 63, 238–247. [Google Scholar] [CrossRef]
  35. Singh, R.; Bhushan, B. Real-time control of ball balancer using neural integrated fuzzy controller. Artif. Intell. Rev. 2020, 53, 351–368. [Google Scholar] [CrossRef]
  36. Zhang, L.; Chen, W.; Wang, J.; Zhang, J. Adaptive robust slide mode trajectory tracking controller for lower extremity rehabilitation exoskeleton. In Proceedings of the 2018 13th IEEE Conference on Industrial Electronics and Applications (ICIEA), Wuhan, China, 31 May–2 June 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 992–997. [Google Scholar] [CrossRef]
  37. Wang, F.Y.; Zhang, H.; Liu, D. Adaptive dynamic programming: An introduction. IEEE Comput. Intell. Mag. 2009, 4, 39–47. [Google Scholar] [CrossRef]
  38. Mu, C.; Zhang, Y.; Gao, Z.; Sun, C. ADP-based robust tracking control for a class of nonlinear systems with unmatched uncertainties. IEEE Trans. Syst. Man Cybern. Syst. 2019, 50, 4056–4067. [Google Scholar] [CrossRef]
  39. Dong, H.; Zhao, X.; Luo, B. Optimal tracking control for uncertain nonlinear systems with prescribed performance via critic-only adp. IEEE Trans. Syst. Man Cybern. Syst. 2020, 52, 561–573. [Google Scholar] [CrossRef]
  40. Song, R.; Zhu, L. Optimal fixed-point tracking control for discrete-time nonlinear systems via ADP. IEEE/CAA J. Autom. Sin. 2019, 6, 657–666. [Google Scholar] [CrossRef]
  41. Ni, Z.; He, H.; Zhao, D.; Xu, X.; Prokhorov, D.V. GrDHP: A general utility function representation for dual heuristic dynamic programming. IEEE Trans. Neural Netw. Learn. Syst. 2014, 26, 614–627. [Google Scholar] [CrossRef]
  42. Song, R.; Wei, Q.; Sun, Q. Nearly finite-horizon optimal control for a class of nonaffine time-delay nonlinear systems based on adaptive dynamic programming. Neurocomputing 2015, 156, 166–175. [Google Scholar] [CrossRef]
  43. Burghardt, A.; Szuster, M. Neuro-dynamic programming in control of the ball and beam system. In Solid State Phenomena; Trans Tech Publications Ltd.: Wollerau, Switzerland, 2014; Volume 210, pp. 206–214. [Google Scholar] [CrossRef]
  44. Thorndike, E.L. Animal intelligence: An experimental study of the associative processes in animals. Psychol. Rev. Monogr. Suppl. 1898, 2, 149. [Google Scholar] [CrossRef]
  45. Watkins, C.J.C.H.; Dayan, P. Q-learning. Mach. Learn. 1992, 8, 279–292. [Google Scholar] [CrossRef]
  46. Jin, Z.; Liu, A.; Zhang, W.A.; Yu, L.; Su, C.Y. A Learning Based Hierarchical Control Framework for Human-Robot Collaboration. IEEE Trans. Autom. Sci. Eng. 2022, 1–12. [Google Scholar] [CrossRef]
  47. Zhong, X.; Ni, Z.; He, H. Gr-GDHP: A new architecture for globalized dual heuristic dynamic programming. IEEE Trans. Cybern. 2016, 47, 3318–3330. [Google Scholar] [CrossRef]
  48. Ni, Z.; He, H.; Zhong, X.; Prokhorov, D.V. Model-free dual heuristic dynamic programming. IEEE Trans. Neural Netw. Learn. Syst. 2015, 26, 1834–1839. [Google Scholar] [CrossRef]
  49. Ganesh, A.; Sundareswari, M.B.; Panda, M.; Mozhi, G.T.; Dhanalakshmi, K. Reinforcement learning control of servo actuated centrally pivoted ball on a beam. In Proceedings of the 2020 IEEE 15th International Conference on Industrial and Information Systems (ICIIS), Rupnagar, India, 26–28 November 2020; pp. 103–108. [Google Scholar] [CrossRef]
  50. Yao, S.; Liu, X.; Zhang, Y.; Cui, Z. An approach to solving optimal control problems of nonlinear systems by introducing detail-reward mechanism in deep reinforcement learning. Math. Biosci. Eng. 2022, 19, 9258–9290. [Google Scholar] [CrossRef]
  51. Ryu, K.; Oh, Y. Balance control of ball-beam system using redundant manipulator. In Proceedings of the 2011 IEEE International Conference on Mechatronics, Istanbul, Turkey, 13–15 April 2011; pp. 403–408. [Google Scholar] [CrossRef]
  52. Liu, T.; Wang, N.; Zhang, L.; Ai, S.M.; Du, H.W. A novel visual measurement method for three-dimensional trajectory of underwater moving objects based on deep learning. IEEE Access 2020, 8, 186376–186392. [Google Scholar] [CrossRef]
  53. Supreeth, H.S.G.; Patil, C.M. Moving object detection and tracking using deep learning neural network and correlation filter. In Proceedings of the 2018 Second International Conference on Inventive Communication and Computational Technologies (ICICCT), Coimbatore, India, 20–21 April 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 1775–1780. [Google Scholar] [CrossRef]
  54. Pathak, A.R.; Pandey, M.; Rautaray, S. Application of deep learning for object detection. Procedia Comput. Sci. 2018, 132, 1706–1717. [Google Scholar] [CrossRef]
  55. Mukherjee, M.; Potdar, Y.U.; Potdar, A.U. Object tracking using edge detection. In Proceedings of the International Conference and Workshop on Emerging Trends in Technology, Maharashtra, India, 26–27 February 2010; pp. 686–689. [Google Scholar] [CrossRef]
  56. Qul’am, H.M.; Dewi, T.; Risma, P.; Oktarina, Y.; Permatasari, D. Edge detection for online image processing of a vision guide pick and place robot. In Proceedings of the 2019 International Conference on Electrical Engineering and Computer Science (ICECOS), Batam, Indonesia, 2–3 October 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 102–106. [Google Scholar] [CrossRef]
  57. Wang, B.; Zhang, K.; Shi, L.; Zhong, H.H. An edge detection algorithm of moving object based on background modeling and active contour model. In Advanced Materials Research; Trans Tech Publications Ltd.: Wollerau, Switzerland, 2013; Volume 765, pp. 2393–2398. [Google Scholar] [CrossRef]
  58. Suzuki, S. Topological structural analysis of digitized binary images by border following. Comput. Vis. Graph. Image Process. 1985, 30, 32–46. [Google Scholar] [CrossRef]
  59. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction; MIT Press: Cambridge, MA, USA, 1998; Volume 1. [Google Scholar]
  60. Vecerik, M.; Hester, T.; Scholz, J.; Wang, F.; Pietauin, O.; Piot, B.; Heess, N.; Rothörl, T.; Lampe, T.; Riedmiller, M. Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv 2017, arXiv:1707.08817. [Google Scholar] [CrossRef]
  61. Ng, A.Y.; Harada, D.; Russell, S. Policy invariance under reward transformations: Theory and application to reward shaping. Icml 1999, 99, 278–287. [Google Scholar]
  62. Zhu, Y.; Zhao, D.; He, H. Integration of fuzzy controller with adaptive dynamic programming. In Proceedings of the 10th World Congress on Intelligent Control and Automation, Beijing, China, 6–8 July 2012; IEEE: Piscataway, NJ, USA, 2012; pp. 310–315. [Google Scholar] [CrossRef]
  63. Wang, Z.Y.; Dai, Y.P.; Li, Y.W.; Yao, Y. A kind of utility function in adaptive dynamic programming for inverted pendulum control. In Proceedings of the 2010 International Conference on Machine Learning and Cybernetics, Qingdao, China, 11–14 July 2010; IEEE: Piscataway, NJ, USA, 2010; Volume 3, pp. 1538–1543. [Google Scholar] [CrossRef]
  64. Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D. Continuous control with deep reinforcement learning. arXiv 2015, arXiv:1509.02971. [Google Scholar] [CrossRef]
  65. Satheeshbabu, S.; Uppalapati, N.K.; Fu, T.; Krishnan, G. Continuous control of a soft continuum arm using deep reinforcement learning. In Proceedings of the 2020 3rd IEEE International Conference on Soft Robotics (RoboSoft), New Haven, CT, USA, 15 May–15 July 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 497–503. [Google Scholar] [CrossRef]
  66. Ma, Y.; Zhu, W.; Benton, M.G.; Romagnoli, J. Continuous control of a polymerization system with deep reinforcement learning. J. Process Control 2019, 75, 40–47. [Google Scholar] [CrossRef]
  67. Nwankpa, C.; Ijomah, W.; Gachagan, A.; Marshall, S. Activation functions: Comparison of trends in practice and research for deep learning. arXiv 2018, arXiv:1811.03378. [Google Scholar] [CrossRef]
  68. Chaudhuri, D.; Samal, A. A simple method for fitting of bounding rectangle to closed regions. Pattern Recognit. 2007, 40, 1981–1989. [Google Scholar] [CrossRef]
  69. Vukic, Z. Nonlinear Control Systems; CRC Press: Boca Raton, FL, USA, 2003. [Google Scholar]
Figure 1. Ball and beam system.
Figure 2. Reinforcement learning model.
Figure 3. The design schematic diagram of the evaluation function $e_0(\cdot)$.
Figure 4. The design schematic diagram of the evaluation function $e_1(\cdot)$.
Figure 5. The design schematic diagram of the evaluation function $e_2(\cdot)$.
Figure 6. Actor network structure diagram.
Figure 7. Image preprocessing process.
Figure 8. Location of the ball centroid.
Figure 9. Schematic diagram of the statistical method.
Figure 10. Ball location under different conditions.
Figure 11. Comparison of experimental effects under different reward functions.
Figure 12. The control effect under the real platform.
Figure 13. Comparison of control effects before and after model optimization.
Figure 14. Comparison of experimental results under different methods.
Table 1. DDPG network parameters.
Network | Network Structure | Learning Rate ($\eta$) | Loss Function | Activation Function | Optimization Function | Batch Size
Actor | (4,128), (128,128), (128,1) | $1 \times 10^{-3}$ | $Q(s, a \mid w^Q)$ | ReLU | Adam | 64
Critic | (5,128), (128,128), (128,1) | $1 \times 10^{-2}$ | MSE | ReLU | Adam | 64
Table 2. Pixel coordinates of ball centroid.
Actual pixel coordinates $(U, V)$ | (387,21) | (316,23) | (316,23) | (222,22) | (301,22) | (301,22)
Location pixel coordinates $(U_1, V_1)$ | (387,24) | (318,23) | (317,24) | (222,24) | (299,24) | (303,24)
Table 3. Deep reinforcement learning parameters.
Episodes | Steps | Discount Factor ($\gamma$) | Epsilon ($\tau$) | Memory
500 | 500 | 0.99 | $1 \times 10^{-2}$ | 10,000
Table 4. Initial state parameters of the BABS.
Experimental Phases | Ball Position (m) | Ball Velocity (m/s) | Beam Angle (rad) | Beam Angular Velocity (rad/s)
Training phase | (−0.16, 0.16) | (−0.08, 0.08) | (−0.209, 0.209) | (−0.05, 0.05)
Testing phase | 0.11 | 0.2 | 0.005 | −0.177
(In the training phase of the BABS, the initial state of the ball and beam is a random number within the corresponding parameter range.)
Table 5. Physical parameters of the BABS.
Ball Mass $m$ | Ball Radius $r$ | Beam Length $l_r$ | Moment of Inertia $J$ | Gravity $g$
0.113 kg | 0.015 m | 0.4 m | 0.00001017 kg·m² | 9.8 m/s²
