Adaptive Neural-PID Visual Servoing Tracking Control via Extreme Learning Machine

Vision-guided robots are widely deployed in modern industry, but accurately tracking moving objects in real time remains a challenge. In this paper, a hybrid adaptive control scheme combining an Extreme Learning Machine (ELM) with proportional-integral-derivative (PID) control is proposed for dynamic visual tracking by a manipulator. The scheme extracts line features on the image plane using a laser-camera system and determines an optimal control input to guide the robot so that the image features are aligned with their desired positions. The observation and state-space equations are first determined by analyzing the motion features of the camera and the object. The system is then represented as an autoregressive moving average model with exogenous input (ARMAX) and a valid estimation model. The adaptive predictor estimates online the relevant 3D parameters between the camera and the object, which are subsequently used to calculate the system sensitivity of the neural network. The ELM-PID controller is designed for adaptive adjustment of the control parameters, and the scheme was validated on a physical robot platform. The experimental results showed that the proposed method outperformed pure P and PID controllers in visual tracking control.


Introduction
Robotic vision has important commercial and domestic applications, such as assembly and welding, fruit picking, and household services. Most robots, however, follow a set program to complete repetitive tasks. When discrepancies occur in the target or the robot, the robot tends to be unable to adjust to its environment in time, largely because it lacks an adequate perception capability [1]. Visual servoing enables dexterous control of robots through continuous visual perception and has drawn consistent attention.
Since 1996, Hutchinson's three classic surveys [2][3][4] have provided a systematic understanding of visual servoing. According to the representation of the control signals, visual servoing can be categorized as position-based (PBVS), image-based (IBVS) or hybrid (HVS). In particular, IBVS has attracted widespread interest for its simple structure and insensitivity to calibration accuracy. Common IBVS control methods are adaptive [5], sliding mode [6], fuzzy [7] and learning-based [8]. Saleem et al. [9] proposed an adaptive fuzzy-tuned proportional derivative (AFT-PD) control scheme to improve the visual tracking control of a wheeled mobile robot. Yang et al. [10] used radial basis function (RBF) neural networks to estimate the dynamic parameters of the robot and compensate for the robot's torque, improving the tracking performance of the controller.
Most IBVS studies have been carried out under the assumption that the target is stationary, so visual tracking in dynamic scenes has rarely been considered. Some researchers have estimated the image Jacobian matrix of IBVS by developing adaptive algorithms.
Chang [11] and Gongye [12] used Kalman filtering to provide the unknown depth information of the image Jacobian matrix in real time and verified its practicality on a manipulator. Music [13] used particle filtering and the Broyden algorithm to estimate the Jacobian matrix, and performed simulation experiments to compare the trajectory tracking performance in static and dynamic scenes. The results showed that the computation time of the Broyden algorithm was much less than that of particle filtering, but its performance was vulnerable to noise interference.
In recent work, learning-based IBVS control has been used to optimize the parameters or to construct nonlinear estimators [14,15]. RBF neural networks combined with PI controllers were used to improve the trajectory tracking accuracy of a dual robotic arm [16]. From a different perspective, an RBF network was used as the inverse kinematics solver of the robotic arm, obtained by training on samples of the forward kinematics; the results showed that the method effectively improved the positioning accuracy of visual servoing [17]. To address the low stability of visual servoing, Shi [18] proposed using Q-learning to adjust the control parameters of visual servoing adaptively, and further optimized the parameters of Q-learning using a fuzzy-based method to improve the convergence and stability of the control system. Overall, the above works demonstrated the effectiveness of their methods only in static scenarios; the dynamic convergence and responsiveness have not been adequately analyzed.
Image feature extraction is also an essential aspect that affects the reliability of the control system. In visual servoing, the system needs to extract valid features from the image sequence and drive the robot motion while minimizing the error between the current and the reference image. Therefore, the representation of the image Jacobian matrix, i.e., the velocity relationship between the target in image space and in Cartesian space, is an essential part of vision tracking. Point features [19], optical flow [20] and image moments [21] have been extensively studied in visual tracking. However, the reliability problem of feature matching in complex scenes has not been well addressed. To ameliorate this, direct visual servoing (DVS) has been proposed in recent years [22][23][24].
Instead of the common local image feature extraction techniques, DVS uses the whole image to construct a mapping to the robot's joint pose, which is obtained by acquiring a large amount of data (usually several thousand samples) and feeding it into a neural network. Although DVS enhances the robustness of feature extraction and significantly reduces the time spent on image processing, acquiring the samples and training the network are time-consuming. Additionally, the network needs to be retrained when the environment changes, which greatly diminishes its practical feasibility.
As mentioned above, previous research indicates that the primary limitations on the dynamic tracking performance of robotic visual servoing are the estimation speed and accuracy of the image Jacobian matrix and the design of the controller [25]. Our work in this paper therefore focuses on these bottlenecks. In the hardware system, a laser-camera vision system acquires the image information. The system consists of a camera rigidly mounted on the robot end-effector and a laser stripe generator. The camera acquires image information, and the laser stripe generator projects visible light onto the table. The inertial principal axis of the target is recognized and used as the desired image feature, while the visible straight line on the plane is used as the current image feature. Hence, in this paper the visual tracking problem is transformed into an error minimization problem for line features, whose strengths have been demonstrated in welding robots and bricklaying robots [26,27].
This research is concerned with implementing line feature visual servoing and was conducted within the framework of a robotic bricklaying project. The main partner was a construction company in Guangxi Province, and the primary concern was to automate the robot's masonry work, particularly precise brick placement, since relatively large operating errors may adversely affect the strength of the building. Cutting-edge masonry robots still require strict calibration of the hand-eye relationship and the working environment, so their operational effectiveness may deteriorate if unexpected deviations of the robot's position occur. Thus, it is necessary to use sensing technology with closed-loop control, such as visual servoing. Since manual masonry typically uses laser lines for alignment, and bricks are composed of three intersecting orthogonal planes from which point features may not be reliably extracted, we focused our research on enabling visual servoing based on line features.
For the design of the controller, a hybrid ELM-PID control scheme was proposed for real-time visual tracking by a six-degree-of-freedom robot. We derived the observation and state-space equations by analyzing the motion of the camera and the object. The system was then represented as a time-varying multi-input/multi-output autoregressive moving average model. Subsequently, an ELM model was proposed to predict the relevant Jacobian information, using the measured data of the previous moment as input to predict the result at the next moment. These predicted parameters were then used as the online input of the gradient descent algorithm. To perform faster adaptive tuning of the controller parameters, a three-layer BP neural network was used. The aim of this study was to perform an algorithm-level validation of robotic autonomous masonry and to further assess the positioning accuracy and control speed in static and dynamic environments.
This paper is organized as follows: Section 2 presents visual tracking modeling. In Section 3, we introduce the adaptive estimation of visual tracking. Section 4 analyzes the ELM-PID controller implementation. Section 5 shows the experimental design, and Section 6 analyzes the experimental results. Finally, conclusions are drawn in Section 7.

Camera Model and Feature Motion
Visual tracking is implemented by a laser-camera system in which a camera mounted on the end-effector acquires image information and guides the robot's movement, as shown in Figure 1. The vision-tracking task in this paper was to align the laser stripe line with the principal axis of inertia of the target image. The laser stripe is visible light projected onto a plane by a laser sensor. The principal axis of inertia can be obtained by simple image processing. We define the camera's focal point as the origin of the image plane. Assuming a perspective projection with unit focal length, a point $P = (X, Y, Z)$ in the camera frame projects onto a point $p = (x, y)$ in image coordinates such that
$$x = \frac{X}{Z}, \qquad y = \frac{Y}{Z}. \tag{1}$$
Taking the temporal derivative of the projection in Equation (1), we obtain
$$\dot{x} = \frac{\dot{X}Z - X\dot{Z}}{Z^2}, \qquad \dot{y} = \frac{\dot{Y}Z - Y\dot{Z}}{Z^2}. \tag{2}$$
Consider a camera moving with a body velocity $v = (T, \Omega)$ in the world frame and observing a world point with camera-relative coordinates $P = (X, Y, Z)$. Assume that the camera moves with a translational velocity $T = (T_X, T_Y, T_Z)$ and a rotational velocity $\Omega = (\Omega_X, \Omega_Y, \Omega_Z)$. The velocity of the point relative to the camera frame is
$$\dot{P} = -\Omega \times P - T, \tag{3}$$
which we can write in scalar form as
$$\dot{X} = Y\Omega_Z - Z\Omega_Y - T_X, \qquad \dot{Y} = Z\Omega_X - X\Omega_Z - T_Y, \qquad \dot{Z} = X\Omega_Y - Y\Omega_X - T_Z. \tag{4}$$
Combining Equations (2) and (4) and grouping terms, we obtain
$$\begin{pmatrix} \dot{x} \\ \dot{y} \end{pmatrix} = \begin{pmatrix} -\frac{1}{Z} & 0 & \frac{x}{Z} & xy & -(1 + x^2) & y \\ 0 & -\frac{1}{Z} & \frac{y}{Z} & 1 + y^2 & -xy & -x \end{pmatrix} \begin{pmatrix} T_X \\ T_Y \\ T_Z \\ \Omega_X \\ \Omega_Y \\ \Omega_Z \end{pmatrix}. \tag{5}$$
Equation (5) defines the image Jacobian of the point features, which describes the velocity transformation between the spatial motion of the camera and the image feature points.
Consider the visible light line as the laser beam's projection on the image plane. The line on the image is described by two parameters, $\rho$ and $\theta$, where $\rho$ is the signed perpendicular distance from the origin to the line, and $\theta$ is the angle of the line with respect to the x-axis. The coordinates of the points on the line are denoted as $x$ and $y$.
The velocity relationship of the line feature can then be given, in which the parameters $a_i$, $b_i$, $c_i$ and $d_i$ describe the normal vectors of the polyhedral plane.
To convert from two-dimensional image coordinates to three-dimensional spatial coordinates, the system shown in Figure 1 must be calibrated accurately. In the process, the values of the disparity angle $\alpha$, the tilt angle $\beta$ and the offset $Y_o$ are determined.
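As an illustration of the point-feature relationship referred to as Equation (5), the following Python sketch builds a 2×6 image Jacobian for a normalized image point and compares it with a finite-difference approximation of the point motion. It assumes the standard textbook form for unit focal length with the velocity ordered as (T_X, T_Y, T_Z, Ω_X, Ω_Y, Ω_Z); the function names and all numerical values are illustrative and are not taken from the paper.

```python
import numpy as np


def point_image_jacobian(x, y, Z):
    """2x6 image Jacobian of a normalized image point (assumed unit focal length)."""
    return np.array([
        [-1.0 / Z, 0.0, x / Z, x * y, -(1.0 + x * x), y],
        [0.0, -1.0 / Z, y / Z, 1.0 + y * y, -x * y, -x],
    ])


def finite_difference_flow(P, velocity, dt=1e-6):
    """Approximate the image velocity by moving the 3D point for a short time dt."""
    P = np.asarray(P, dtype=float)
    velocity = np.asarray(velocity, dtype=float)
    T, Omega = velocity[:3], velocity[3:]
    P_dot = -np.cross(Omega, P) - T          # point velocity in the camera frame
    P_next = P + dt * P_dot
    return (P_next[:2] / P_next[2] - P[:2] / P[2]) / dt


P = np.array([0.2, -0.1, 1.5])                        # hypothetical 3D point (m)
v = np.array([0.10, -0.05, 0.20, 0.03, -0.02, 0.04])  # hypothetical camera velocity
x, y = P[:2] / P[2]
analytic_flow = point_image_jacobian(x, y, P[2]) @ v
numeric_flow = finite_difference_flow(P, v)
print(analytic_flow, numeric_flow)
```

The two printed vectors should agree closely, which is a quick sanity check on the signs of the Jacobian entries before the same idea is carried over to line features.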

Figure 1. Configuration of the laser-camera system with a camera and a laser stripe generator. Labels shown in the figure: Xc, Yc, Zc, θc, θo, Xo, Yo, Zo, α, β, light plane, stripe plane, laser stripe.

Observation Equation
Considering that a large part of the research on lines in visual servoing assumes constant depth information (e.g., automated welding, lane keeping for autonomous driving), we simplified the problem with a set of assumptions. Suppose the object is moving in plane $P_A$ and the camera and laser stripe sensor are moving in plane $P_B$. The two planes are parallel to each other, and the distance between them ($Z_o$) is constant. The z-axis of the camera is perpendicular to these two planes. Consider a coordinate point $p$, the intersection of a straight line and its perpendicular through the origin. In the image space, the laser stripe line is to be aligned with the minimum-inertia principal axis of the object. The observation equations can be given as Equation (11), where $\omega$ is the tangent angle of the line; $C = I_3$; and $w$ is an observation noise vector. The measurement vector is acquired in a manner slightly different from the method used to track individual feature points.
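The measurement described above, namely the point p where the line meets its perpendicular through the origin together with the tangent angle ω, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the line is given by two image points, the angle is folded into (-π/2, π/2] so that it does not depend on the order of the points, and the function name `line_measurement` is hypothetical.

```python
import math


def line_measurement(p1, p2):
    """Return (x, y, omega): foot of the perpendicular from the origin and line angle."""
    (x1, y1), (x2, y2) = p1, p2
    dx, dy = x2 - x1, y2 - y1
    norm_sq = dx * dx + dy * dy
    if norm_sq == 0.0:
        raise ValueError("the two points must be distinct")
    # Foot of the perpendicular: p1 + t * d, with t chosen so that it is orthogonal to d.
    t = -(x1 * dx + y1 * dy) / norm_sq
    foot_x, foot_y = x1 + t * dx, y1 + t * dy
    # Direction angle of the line, folded into (-pi/2, pi/2].
    omega = math.atan2(dy, dx)
    if omega > math.pi / 2:
        omega -= math.pi
    elif omega <= -math.pi / 2:
        omega += math.pi
    return foot_x, foot_y, omega


# Hypothetical line through (0, 2) and (2, 0) in image coordinates.
measurement = line_measurement((0.0, 2.0), (2.0, 0.0))
print(measurement)
```

For the example line, the closest point to the origin is (1, 1) and the folded tangent angle is -π/4, regardless of which endpoint is passed first.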

State-Space Equation
Define each time $t = kT_s$, $k = 1, 2, 3, \ldots$, where $T_s$ denotes the sampling period, so that the observation Equation (11) can be written in discrete time, relating the output vector $y(k)$ to the state variables $x(k)$ and the noise $w(k)$. The output vector $y(k)$ can be obtained directly as the output of the vision system. The state variables $x(k)$ can be obtained from it if the white noise terms $w(k)$, which arise from uncertainties, are ignored.
To construct the state equation describing the system dynamics, the velocities of the camera were defined as $(T_{cX}, T_{cY}, T_{cZ})$ and $(\Omega_{cX}, \Omega_{cY}, \Omega_{cZ})$, and the velocities of the object feature were denoted as $(T_{oX}, T_{oY}, T_{oZ})$ and $(\Omega_{oX}, \Omega_{oY}, \Omega_{oZ})$. Combining Equations (4)-(7) and (13) yields an expression in which $U_c$, $V_c$ and $\Theta_c$ are the components of the tracking motion of the camera, and $U_o$, $V_o$ and $\Theta_o$ are the components of the tracking motion of the object. The optical flow of the point $p$ at moment $k$ is denoted $(U_k, V_k, \Theta_k)$. From Equations (14)-(22), the state-space representation can be given, in which the vector $(\ldots, \Omega_{oZ}(k))^T \in \mathbb{R}^3$ is the exogenous disturbance vector, or state-noise vector, and $x(k)$, $y(k)$ and $\omega(k)$ are now the tracking errors. The representation involves the matrices $A$, $B$ and $D$. When the depth information changes (e.g., robot grasping), Equations (13)-(29) must be modified accordingly. These changes do not affect any of the inference results after Section 2, so the proposed method remains valid outside the assumed constraints.

Adaptive Estimation Model
Dynamic parameters (i.e., the relative poses between the camera and the object) are necessary for visual servoing. When they are known, the Kalman filter method can be used to control the system. However, in a practical scene, the dynamics of the camera and the object may be unknown, and adaptive control techniques can be used as an alternative. Equation (27) can be expressed in terms of the last $n$ states and control inputs. Suppose the characteristic polynomial of $A$ is $f(\lambda) = \lambda^n + a_1\lambda^{n-1} + \cdots + a_n$, with $n = 3$. By applying the Cayley-Hamilton theorem and following the ARMAX model, the system can be written as
$$A(z^{-1})\,y(k) = z^{-d}\,B(z^{-1})\,u(k) + C(z^{-1})\,\eta(k), \tag{33}$$
where
$$A(z^{-1}) = 1 + a_1 z^{-1} + a_2 z^{-2} + \cdots + a_n z^{-n}, \tag{34}$$
$n = 3$ is the dimension of the vector; $y(k)$ and $u(k)$ are the $n$-dimensional output and input, respectively; $\eta(k)$ is the $n$-dimensional noise term that combines the state noise $\tau(k)$ and the observation noise $w(k)$; and $z^{-1}$ is the unit delay operator. Since the image processing requires a finite time, $d$ is defined as a delay factor with a value of 1. $A(z^{-1})$ is a scalar polynomial in $z^{-1}$, while $B(z^{-1}) = [B_{ij}(z^{-1})]$ and $C(z^{-1}) = [C_{ij}(z^{-1})]$ are matrices whose entries are the scalar polynomials $B_{ij}(z^{-1})$ and $C_{ij}(z^{-1})$, respectively. Moreover, $\eta(k)$ is a sequence of independent, identically distributed Gaussian variables with zero mean and variance $\delta^2$.
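The role of the Cayley-Hamilton theorem in this step can be checked numerically. The sketch below, which uses a hypothetical 3×3 state matrix rather than the paper's matrix A, computes the characteristic coefficients a_1, a_2, a_3, verifies that the matrix satisfies its own characteristic polynomial, and shows that the noise-free, input-free state sequence therefore obeys the same autoregressive recursion that appears in the scalar polynomial A(z^-1).

```python
import numpy as np


def characteristic_coefficients(A):
    """Return [1, a_1, ..., a_n] of det(lambda*I - A)."""
    # np.poly may return a complex dtype; the polynomial of a real matrix is real.
    return np.real(np.poly(A))


def cayley_hamilton_residual(A):
    """Evaluate A^n + a_1 A^(n-1) + ... + a_n I, which should vanish."""
    coeffs = characteristic_coefficients(A)
    n = A.shape[0]
    return sum(c * np.linalg.matrix_power(A, n - i) for i, c in enumerate(coeffs))


def free_response_residual(A, x0):
    """Check x(k+n) + a_1 x(k+n-1) + ... + a_n x(k) = 0 for x(k+1) = A x(k)."""
    coeffs = characteristic_coefficients(A)
    n = A.shape[0]
    states = [np.asarray(x0, dtype=float)]
    for _ in range(n):
        states.append(A @ states[-1])
    return sum(c * states[n - i] for i, c in enumerate(coeffs))


A = np.array([[0.9, 0.1, 0.0],
              [0.0, 0.8, 0.2],
              [0.05, 0.0, 0.7]])   # hypothetical 3x3 state matrix
matrix_residual = np.abs(cayley_hamilton_residual(A)).max()
sequence_residual = np.abs(free_response_residual(A, [1.0, -2.0, 0.5])).max()
print(matrix_residual, sequence_residual)
```

Both residuals are at the level of floating-point round-off, which is why the same coefficients a_i can serve as the autoregressive part of the input-output model.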

Estimation Schemes
The unknown parameters of the system can be estimated by the least squares method. The predictor gives the optimal prediction based on the historical measurements. Equation (33) can be expressed in the form of an optimal predictor ($d = 1$), where $F(z^{-1})$ and $G(z^{-1})$ are obtained by using the division algorithm such that $C(z^{-1}) = F(z^{-1})A(z^{-1}) + z^{-d}G(z^{-1})$. The term $y_{opt}(k + d \mid k)$ denotes the optimal output prediction, and the term $y(k)$ denotes the output of the adaptive predictor.
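The division step behind F and G can be illustrated for scalar polynomials. The following sketch is a simplification: it treats A and C as scalar polynomials in z^-1 (the paper's C has polynomial-matrix entries), stores coefficients in ascending powers of z^-1, and uses hypothetical coefficient values. The function name `predictor_polynomials` is an assumption for illustration.

```python
def predictor_polynomials(a, c, d):
    """Split C = F*A + z^-d * G for scalar polynomials in z^-1.

    a, c: coefficient lists in ascending powers of z^-1 (a[0] must be non-zero).
    Returns (f, g), where f has d coefficients.
    """
    n = max(len(a), len(c)) + d
    a_pad = list(a) + [0.0] * (n - len(a))
    c_pad = list(c) + [0.0] * (n - len(c))

    # Long division: the first d quotient coefficients form F.
    f = []
    for j in range(d):
        acc = c_pad[j] - sum(a_pad[i] * f[j - i] for i in range(1, j + 1))
        f.append(acc / a_pad[0])

    # Product F*A, then the remainder C - F*A, whose first d terms are zero.
    fa = [0.0] * n
    for i, fi in enumerate(f):
        for j, aj in enumerate(a):
            fa[i + j] += fi * aj
    remainder = [c_pad[k] - fa[k] for k in range(n)]

    g = remainder[d:]
    while len(g) > 1 and abs(g[-1]) < 1e-12:   # drop trailing zeros
        g.pop()
    return f, g


# Hypothetical polynomials: A = 1 - 1.5 z^-1 + 0.7 z^-2, C = 1 + 0.5 z^-1, delay d = 1.
f_coeffs, g_coeffs = predictor_polynomials([1.0, -1.5, 0.7], [1.0, 0.5], d=1)
print(f_coeffs, g_coeffs)
```

With a one-step delay and monic polynomials, F reduces to 1 and G collects the differences between the coefficients of C and A, which is why the d = 1 predictor is comparatively simple.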
If the polynomials $c_{opt}(z^{-1})$ have some of their zeros on the unit circle, a predictor with sub-optimal performance can be designed, which gives $y^{*i}(k + 1)$. Because the estimated parameters are time-varying, the well-known least squares method is used to estimate the unknown parameter values $\hat{\Phi}^i$ on-line, where the superscript $i$ refers to the object model; the caret indicates the estimated value; $\mu^i \in (0, 1]$ is a forgetting factor used to discount past data exponentially; and $\Gamma^i(k)$ is a covariance matrix.
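A generic recursive least squares update with a forgetting factor, of the kind described above, can be sketched as follows. This is the standard textbook recursion and not necessarily the exact update used in the paper; the class name `ForgettingRLS`, the initial covariance and the synthetic data are assumptions for illustration.

```python
import numpy as np


class ForgettingRLS:
    """Recursive least squares with an exponential forgetting factor mu in (0, 1]."""

    def __init__(self, n_params, forgetting=0.98, initial_covariance=1e3):
        if not 0.0 < forgetting <= 1.0:
            raise ValueError("forgetting factor must lie in (0, 1]")
        self.theta = np.zeros(n_params)                      # parameter estimate
        self.gamma = initial_covariance * np.eye(n_params)   # covariance matrix
        self.mu = forgetting

    def update(self, phi, y):
        """Update the estimate from regressor phi and scalar measurement y."""
        phi = np.asarray(phi, dtype=float)
        gain = self.gamma @ phi / (self.mu + phi @ self.gamma @ phi)
        self.theta = self.theta + gain * (y - phi @ self.theta)
        self.gamma = (self.gamma - np.outer(gain, phi) @ self.gamma) / self.mu
        return self.theta


# Synthetic, noise-free example with hypothetical true parameters.
rng = np.random.default_rng(0)
true_theta = np.array([2.0, -3.0])
estimator = ForgettingRLS(n_params=2)
for _ in range(50):
    phi = rng.normal(size=2)
    estimator.update(phi, phi @ true_theta)
print(estimator.theta)
```

A forgetting factor below 1 weights recent samples more heavily, which lets the estimate follow slowly varying parameters at the cost of a noisier estimate.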

Extreme Learning Machine
ELM is a single-hidden-layer feedforward neural network proposed by Huang [28]. It differs significantly from traditional feedforward neural networks, which use a gradient descent algorithm in the training phase. The weights and biases between the input and hidden layers of the network are set randomly and are not adjusted afterwards. Instead of being adjusted iteratively, the weights $\beta$ between the hidden and output layers are determined by solving a generalized inverse matrix. Compared with traditional back-propagation learning algorithms, ELM has better training speed and generalization performance [29,30]. For $N$ arbitrary samples, the output $O_j$ of the network with $L$ hidden-layer nodes can be expressed as
$$O_j = \sum_{i=1}^{L} \beta_i\, g(w_i \cdot X_j + b_i), \qquad j = 1, \ldots, N,$$
where $g(\cdot)$ is the sigmoid activation function; $w_i = [w_{i1}, w_{i2}, \ldots, w_{in}]$ and $b_i$ are the weight and the bias between the input and hidden layers; and $\beta = [\beta_1, \beta_2, \ldots, \beta_L]^T$ is the weight between the hidden and output layers. The goal of learning in this network is to minimize the output error, which can be written in matrix form as
$$H\beta = T,$$
where $H = [g(w_i \cdot X_j + b_i)]$ is the output matrix of the hidden layer; $\beta$ is the weight of the output layer; and $T$ is the desired output matrix of the network. The solution can be obtained as
$$\hat{\beta} = H^{\dagger} T,$$
where $H^{\dagger}$ is the Moore-Penrose generalized inverse of $H$.
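The training procedure described above (random, fixed hidden-layer parameters and output weights obtained from a generalized inverse) can be sketched with NumPy. This is a minimal illustration, not the authors' implementation: the class name, the uniform initialization range, the random seed and the synthetic data are all assumptions, while the 6-18-3 layer sizes follow the structure stated for the controller in this paper.

```python
import numpy as np


class ExtremeLearningMachine:
    """Single-hidden-layer network with random hidden weights and pinv-trained output."""

    def __init__(self, n_inputs, n_hidden, seed=0):
        rng = np.random.default_rng(seed)
        # Hidden-layer weights and biases are drawn once and never adjusted.
        self.W = rng.uniform(-1.0, 1.0, size=(n_inputs, n_hidden))
        self.b = rng.uniform(-1.0, 1.0, size=n_hidden)
        self.beta = None

    def _hidden(self, X):
        """Hidden-layer output matrix H with a sigmoid activation."""
        return 1.0 / (1.0 + np.exp(-(X @ self.W + self.b)))

    def fit(self, X, T):
        """Solve H beta = T in the least-squares sense via the Moore-Penrose inverse."""
        H = self._hidden(X)
        self.beta = np.linalg.pinv(H) @ T
        return self

    def predict(self, X):
        return self._hidden(X) @ self.beta


# Synthetic data standing in for recorded (input, output) pairs.
rng = np.random.default_rng(1)
X = rng.uniform(-1.0, 1.0, size=(200, 6))
T = np.column_stack([np.sin(X[:, 0]), X[:, 1] * X[:, 2], X[:, 3] + 0.5 * X[:, 4]])
elm = ExtremeLearningMachine(n_inputs=6, n_hidden=18).fit(X, T)
rmse = np.sqrt(np.mean((elm.predict(X) - T) ** 2))
print(rmse)
```

Because training is a single linear solve, refitting on new data is cheap compared with iterative back-propagation, which is the property the controller relies on.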

ELM-PID Visual Tracking Controller
In the ELM-PID scheme, the ELM network is used to predict the system output value, which is the image Jacobian information. The ELM network has six input nodes, eighteen hidden nodes and three output nodes. The input vector is $[u(k), y(k)] \in \mathbb{R}^6$, and the output vector is $y^*(k) = [x(k), y(k), \omega(k)]^T \in \mathbb{R}^3$. The block diagram of the ELM-PID controller is shown in Figure 2. In this scheme, $r(k)$ and $e(k)$ are the desired feature position and the system error signal, respectively; $u(k)$ is the PID output signal; and $y(k)$ and $y^*(k)$ are the current and predicted feature positions, respectively. The tracking process of the visual servo includes the training and self-adjusting phases.
In the training phase, the initial parameters $K_P$, $K_I$ and $K_D$ are set to 10, 1 and 0.01, respectively. The robot controller is essentially an image Jacobian matrix, which describes the mapping between the motion velocity of the features in image space and the robot joint angles. It guides the manipulator movement according to the input signal $u$ and outputs the corresponding feature position. The sampling period is set to 50 ms. The collected input values $u$ and output values $y$ are used as the ELM's training set, which is normalized before training.
In the self-adjusting phase, the incremental PID algorithm is employed, and the digital PID controller is expressed in discrete time. In the control law, $u$ is the output of the PID controller and $k$ is the iteration step. A cost function $E$ is defined for the controller, and during operation, back-propagation was used to adjust the weights of the network and minimize $E$. In the self-tuning steps, $\eta$ is the learning rate, and $\partial y / \partial u$ is the Jacobian parameter of the visual servoing system. Assuming that the sampling period is small enough, the variation between two adjacent sampling points can be considered linear. Thus, we can obtain an approximation of the inverse Jacobian matrix in terms of $\partial y / \partial u$, expressed through the network parameters, where $l$ is the number of nodes in the hidden layer; $\beta$ is the output weight; $b$ is the bias of the hidden layer; and $\omega_{im}$ is the weight in vector $\omega_i$ corresponding to the input $u$. In the calculation of $K_P$, $K_I$ and $K_D$ for the incremental PID, $\alpha \in (0, 1)$ is the momentum coefficient.
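The self-adjusting phase can be illustrated with a short Python sketch. It assumes the common incremental PID form, a quadratic cost E = e²/2, gradient-descent gain updates with momentum, and the analytic derivative of a sigmoid ELM output with respect to one input as the sensitivity ∂y/∂u. These are standard choices made here for illustration and may differ in detail from the paper's equations; the class and function names are hypothetical, and the controller is reduced to a single input and output for brevity.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def elm_output(W, b, beta, x):
    """Output of a sigmoid ELM: W is (n_inputs, L), b is (L,), beta is (L, n_outputs)."""
    return sigmoid(x @ W + b) @ beta


def elm_sensitivity(W, b, beta, x, input_index):
    """Derivative of every ELM output with respect to input number input_index."""
    h = sigmoid(x @ W + b)
    return (h * (1.0 - h) * W[input_index]) @ beta


class AdaptiveIncrementalPID:
    """Incremental PID whose gains follow gradient descent on E = 0.5 * e^2."""

    def __init__(self, kp, ki, kd, learning_rate=0.0, momentum=0.0):
        self.gains = [kp, ki, kd]
        self.learning_rate = learning_rate
        self.momentum = momentum
        self.prev_delta = [0.0, 0.0, 0.0]
        self.e1 = 0.0   # e(k-1)
        self.e2 = 0.0   # e(k-2)
        self.u = 0.0

    def step(self, error, sensitivity=0.0):
        # Proportional, integral and derivative terms of the incremental form.
        terms = [error - self.e1, error, error - 2.0 * self.e1 + self.e2]
        self.u += sum(g * t for g, t in zip(self.gains, terms))
        # Gain update for the next step: eta * e * dy/du * term + alpha * previous change.
        for i, term in enumerate(terms):
            delta = (self.learning_rate * error * sensitivity * term
                     + self.momentum * self.prev_delta[i])
            self.gains[i] += delta
            self.prev_delta[i] = delta
        self.e2, self.e1 = self.e1, error
        return self.u


# Fixed gains equal to the initial values quoted for the training phase.
pid = AdaptiveIncrementalPID(kp=10.0, ki=1.0, kd=0.01)
first_u = pid.step(1.0)
second_u = pid.step(0.5)
print(first_u, second_u)
```

Setting `learning_rate` to zero recovers a plain incremental PID, so the same class can serve as the fixed-gain baseline and as the self-tuning controller once a sensitivity estimate is supplied at each step.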

Experiments
As previously mentioned, this research pursued accurate and rapid localization and disturbance rejection in line feature visual servoing. In this section, we focus on the experimental validation of the control scheme, especially its positioning accuracy and dynamic performance. The proposed method was validated and evaluated on a six-degree-of-freedom manipulator arm. The controller code was implemented in the Robot Operating System (ROS), and the connection between ROS and the robot was based on TCP/IP. In the tests, the tracking accuracy, the convergence speed of the feature error, the velocity components and the trajectory of the end-effector were compared among the methods.
In the experiment, a black line 230 mm long and 1 mm wide was printed on paper to simulate the line features of the target. The goal of visual tracking is to control the movement of the robotic arm so that the visible red line emitted by the laser generator on the robot arm is aligned with the black line on the working plane. Figure 3 shows the Universal Robots UR3 robot used in this experiment. An Intel RealSense D415 RGB-D camera was used to provide RGB and depth information about the scene. Both sensors were mounted on the end-effector of the UR3 robot. Owing to the simplified representation of the target, the image lines can be identified in a straightforward way by applying the Hough transform. The block diagram of the control loop of the system is shown in Figure 2. During alignment, the target and end-effector planes are kept parallel at a distance of $Z_o = 680$ mm. The z-axis of the camera coordinate system was perpendicular to these two planes. The control system input was the desired position, consisting of the translation along the x-axis and y-axis of the camera coordinate system and the rotation about the z-axis. The unknown 3D parameters $(\theta, \rho, \lambda_\theta, \lambda_\rho)$ between the camera and the object were estimated by using the adaptive optimal predictor in Equations (31)-(33). The sampling period of the experiments was set to 50 ms.
In the first experiment, the ELM-PID controller was implemented to verify its positioning capability for a static target. Line segments printed on paper represented the line features of the target, and the projection of the laser beam on a plane was used as the current line feature, as shown in Figure 4. The robot was initialized with a specific pose, and its end-effector was 40 cm above the table. The target was placed in a random pose but always within the camera's field of view.
In the second experiment, the performance of the ELM-PID controller in dynamic visual tracking was evaluated, again using the black-lined paper to simulate the line feature of the target. The paper was attached to a hand grip, and random translational motions of the target were produced by pulling the hand grip manually, as shown in Figure 5. The target was always within the camera's field of view during the motion. The visual servoing control performance of the proposed method was analyzed in terms of the pixel error of the image features, the convergence speed of the feature error and the motion trajectories.

Results
The static test of the first experiment showed that all three methods guided the robot to the target and achieved the positioning of line features. Figure 6 shows the convergence curves of the error between the current and desired features in the image plane. As shown in Figure 6a, the error convergence curve of the P controller had significant fluctuations, and convergence was completed at the 49th sample. Figure 6b shows that the convergence curve of the PID controller was smoother than that of the P controller and converged at the 29th sample. The ELM-PID controller had the best performance in terms of both the convergence curve and the convergence rate: its curve was smooth, and it completed the visual servoing task at only the 14th sample. Figure 7 compares the velocities of the robot's end-effector under the three controllers. It can be seen in Figure 7a,b that the P and PID controllers exhibited noticeable peaks in the velocity profile, which implied that the robot motion may lack smoothness. Accordingly, the motion trajectory of the robot arm showed some jitter, as seen in Figure 8a,b. For the ELM-PID controller, the initial translational velocities along the x-, y- and z-axes were −0.19, 0.19 and 0.40 m/s, respectively, and the corresponding rotational velocities were 0.08, 0.12 and −0.13 rad/s, which were around 155% and 70% larger than those of the P and PID controllers, respectively, in overall initial velocity. In Figure 8c, it can be seen that the ELM-PID algorithm had a larger initial motion velocity, faster convergence and a smoother velocity profile of the end-effector. The Cartesian space motion trajectory of the ELM-PID algorithm was also much closer to a straight line, as shown in Figure 8c. The dynamic tests of the second experiment showed that the ELM-PID controller had superior visual tracking performance for moving objects.
Figure 9 illustrates the convergence curves of the feature error in the image plane between the current and reference features. Since the target was in constant motion, the feature error convergence was degraded compared with the first experiment. As shown in Figure 9a, the feature error curve of the P controller had noticeable jitter and failed to converge. Figure 9b shows that the convergence curve of the PID controller was more stable than that of the P controller, but the convergence took longer to complete. The ELM-PID controller had a near-ideal curve and converged faster, completing the visual servoing task at the 18th sample step. Figure 10 compares the end-effector velocities corresponding to the three vision controllers. It can be seen in Figure 10a,b that prominent jagged peaks appeared in the velocity profiles of the P and PID controllers, which suggested that the robot motion had poor smoothness. The ELM-PID algorithm improved the stability of the velocity components of the end-effector, as shown in Figure 10c. The ELM-PID controller still had a larger initial velocity and steeper velocity convergence curves in the dynamic test. Its initial translational velocities along the x-, y- and z-axes were −0.19, 0.19 and 0.40 m/s, respectively, and the corresponding rotational velocities were 0.08, 0.12 and −0.13 rad/s, exceeding those of the P and PID controllers by around 156% and 71%, respectively, in overall initial velocity. Figure 11 shows that the Cartesian space motion trajectory of the ELM-PID algorithm was closer to a straight line, and its path was distinctly shorter than those of the other two controllers.
In essence, the P controller had substantial limitations and poor following performance in the dynamic vision tracking tasks. Although the pure PID controller accomplished dynamic visual servoing, its tracking efficiency was unsatisfactory, possibly because the PID parameters were not in their optimal range. The ELM-PID controller exhibited the best error accuracy and convergence speed, steadier velocity and faster velocity convergence in both the static and dynamic vision tracking tests. The end-effector's trajectory profile further indicated that it produced fewer redundant motions.

Conclusions
This paper proposed a hybrid ELM-PID control approach for real-time robotic visual tracking of a moving target. The system consists of a visual feedback controller with self-adjusting capabilities. The image processing and camera control are performed in parallel online. The approach uses an adaptive predictor to estimate the 3D-related parameters between the camera and the moving object. The processing of the camera image requires a vision sampling period of 50 ms, and the motion of the object is assumed to be smooth. Since the dynamics of the object are unknown, the velocity and position of the object determined from successive images are predicted using an ARMAX model. The desired trajectory points are generated online based on the relative position and velocity of the camera and the object; the online planner determines the desired trajectory point (sub-goal) one sampling period ahead. The ELM-PID controller is then constructed for visual tracking of static and moving targets. Compared with the visual servoing methods based on the P and PID controllers, the ELM-PID controller demonstrated the best visual tracking capability in both the static and dynamic tracking tests. In this method, online estimation of the image Jacobian matrix was performed with an extreme learning machine, which overcame the long training time of traditional neural networks and solved the singularity problem of the Jacobian matrix. The gradient descent algorithm was also used to adjust the PID controller parameters adaptively, improving the controller's dynamic performance. The principal contributions of this work are as follows:
1. We developed a mathematical expression for the visual tracking of a line feature. This model was based on measurements of the vector of discrete displacements obtained by a laser-camera system;
2. We presented a self-tuning hybrid ELM-PID control scheme. By combining ELM-PID control with adaptive prediction, the scheme was efficient for a nonlinear system and objects in motion;
3. The constructed laser-camera system achieved low-cost, real-time vision tracking. Comprehensive experiments validated the suitability of the method for vision tracking with combined vision and control.
In future work, we will attempt to port the algorithm to industrial robots and to further tune and optimize it in industrial scenes.