Eye-in-Hand Robotic Arm Gripping System Based on Machine Learning and State Delay Optimization

This research focused on using RGB-D images and modifying an existing machine learning network architecture to predict locations for successful object grasps and to optimize the control system for state delays. A five-finger gripper designed to mimic the human palm was tested to demonstrate that it can perform more delicate missions than many two- or three-finger grippers. Experiments were conducted using a 6-DOF robot arm with the five-finger and two-finger grippers to perform at least 100 actual machine grasps, and the results were compared with those of other studies. Additionally, we investigated state time delays and proposed a control method for a robot manipulator. Many studies on time-delay systems have been conducted, but most focus on input and output delays. One reason for this emphasis is that input and output delays are the most commonly occurring delays in physical or electronic systems. An additional reason is that state delays increase the complexity of the overall control system. Finally, it was demonstrated that our network can perform as well as a deep network architecture with little training data, while omitting steps such as posture evaluation, and that, combined with the hardware advantages of the five-finger gripper, it can produce an automated system with a gripping success rate of over 90%. This paper is an extended study of the original conference paper.


Introduction
The COVID-19 pandemic has increased the demand for automated production. To avoid a reduction in production capacity due to insufficient manpower, robotic arms with automatic control systems must perform more complex and more varied tasks.
Nowadays, robot arms for pick-and-place and assembly tasks have matured [1][2][3][4][5][6][7][8]. Many factory robot arms now use two- or three-finger grippers, and much research has been conducted in this direction [9][10][11][12]. Some studies use algorithms for control [13] and others use visual images and machine learning to allow robotic arms to perform tasks [14,15]. Other studies convert images into point clouds that use depth distances for better task execution [16][17][18][19][20][21][22][23][24]. For example, the study in [25] combines object recognition by Mobile-DasNet and point cloud analysis to generate the coordinates of the arm endpoints for an apple-picking task.
However, if the factory is to become more automated, the robot arm needs to be able to perform more complex movements. On this premise, the original two- and three-finger grippers may not be able to complete more delicate or complex movements, so many researchers have begun to develop and use other types of grippers to overcome these shortcomings, including five-finger grippers. Although five-finger grippers have a larger gripping range, higher flexibility and fault tolerance, their complex structure means that many control systems that use two- or three-finger grippers to derive the gripping attitude may not be compatible with five fingers.
There have been some studies using five-finger grippers, but most of them studied hardware or other aspects [26][27][28][29]. Application-oriented studies have either presented their results in a virtual space [30,31] or used algorithms for gripping pose prediction [32,33].
This paper is an extended study of the original conference paper [1]. We extend the original conference paper with an analysis of the effects of time delay on the robot manipulator. We investigate state time delays and propose a control method for a robot manipulator. Many studies on time-delay systems have been conducted, but most focus on input and output delays. One reason for this emphasis is that input and output delays are the most commonly occurring delays in physical or electronic systems. An additional reason is that state delays increase the complexity of the overall control system. In this paper, we use the qbSoftHand five-finger gripper in combination with an RGB-D visual recognition system and modify the proposed machine learning network by omitting the pose evaluation part so that the arm can automatically move to the target location and accomplish the task of grasping the target object and moving it to the placement area. The state delay of the control system is also optimized to enable the system to operate efficiently. Experiments were conducted with objects that are considered difficult to grasp with a two- or three-finger gripper to demonstrate the superiority of the grasping method and the five-finger gripper.

Control System Architecture
The hardware for this study is a UR5 six-degree-of-freedom arm, a qbSoftHand five-finger gripper, a HIWIN XEG-32 gripper and a RealSense D435 camera. The system architecture is shown in Figure 1. Object recognition is achieved using Yolo [34,35]. The point cloud for the object is generated after sampling and adjustment. The data are used as input data for an AI model that predicts the gripping posture for the arm and sends this information to the arm to perform the gripping action.

The pose that is predicted by the network is sent to the UR5 control system via the socket library of the network cable. The current joint angle, the tool center point (TCP) and the analog and digital signals for the arm are also obtained in this way.
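The socket exchange described above can be sketched in Python. This is a minimal illustration rather than the authors' code: the robot address is hypothetical, and we assume the UR secondary client interface (TCP port 30002), which accepts URScript strings such as movej.

```python
import socket

UR5_IP = "192.168.1.10"   # hypothetical robot address
PORT = 30002              # UR secondary interface accepts URScript strings

def format_movej(joints, a=1.4, v=1.05):
    """Format a URScript movej command from six joint angles (rad)."""
    q = ", ".join(f"{j:.4f}" for j in joints)
    return f"movej([{q}], a={a}, v={v})\n"

cmd = format_movej([0.0, -1.57, 1.57, -1.57, -1.57, 0.0])
# Sending is commented out so the sketch runs offline:
# with socket.create_connection((UR5_IP, PORT)) as s:
#     s.sendall(cmd.encode("ascii"))
```

Reading the joint angles and TCP back requires parsing the robot's binary state stream, which is omitted here.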
A successful grasping action is achieved if the arm moves to the target location, picks up the object and moves it to the target position without dropping the object.

Convolutional Neural Network (CNN) Architecture
The architecture of the CNN is shown in Figure 2. This section describes the hidden layer (convolutional layer). The convolutional layer is the core of the CNN [36]. When the image is input into the convolution layer, it performs convolutional operations using the convolutional kernel. The formula is:

x_j^l = f( Σ_{i ∈ P_j} x_i^{l−1} · k_{ij}^l + b_j^l )

where f is a tanh function, P_j is a local receptive field, x_i^{l−1} is the value of the layer l−1 feature on the ith window, (i, j) is the position in layer l, and k_{ij}^l and b_j^l are the respective weights of the convolution kernel and the offset of the feature. More details can be found in [37].
The input data for this study are not images, but numerical values, so the convolution is a one-dimensional convolution. Differences between this and a two-dimensional convolution are described in Section 3.3.
The AI network for this study consists of three one-dimensional convolutional layers, a pooling layer and a multi-layer perceptron [25]. The point cloud data are input and the position at which the object is grasped is predicted. This information is transmitted to the UR5 arm control system. The network architecture using the CNN is shown in Figure 3.
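As a rough sketch of the 1D convolution these layers perform (not the authors' implementation; the kernel size and channel count below are illustrative assumptions), using the tanh activation and the N × 3 point cloud input described above:

```python
import numpy as np

def conv1d(x, kernels, bias, activation=np.tanh):
    """Valid 1D convolution: x is (length, in_ch), kernels is (k, in_ch, out_ch)."""
    k, in_ch, out_ch = kernels.shape
    n = x.shape[0] - k + 1
    out = np.empty((n, out_ch))
    for i in range(n):
        window = x[i:i + k]  # local receptive field P_j
        out[i] = np.tensordot(window, kernels, axes=([0, 1], [0, 1])) + bias
    return activation(out)

rng = np.random.default_rng(0)
points = rng.normal(size=(3000, 3))      # N x 3 point cloud input
w = rng.normal(size=(5, 3, 16)) * 0.1    # kernel size 5, 16 output channels (assumed)
b = np.zeros(16)
features = conv1d(points, w, b)          # shape (2996, 16)
```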




Min-Pnet Architecture
Min-Pnet is a network designed according to the architecture of PointNet [37] (see Figure 4). The PointNet work notes that many studies transform the input data into regular 3D voxel grids or collections of multi-angle images, generating large amounts of unnecessary data and destroying the natural invariance of the original data. This study uses point clouds directly to avoid these problems and to make learning easier. To protect the network from point cloud disorder, PointNet uses its "feature extraction layer" to convert disordered point cloud data into 1024-dimensional features for subsequent tasks.
The feature extraction layer transforms the input points into features by aligning the input data to a canonical space, using T-net to predict an affine transformation matrix. The partial and global feature extraction of the point cloud is performed without affecting the correlation and invariance of the points.
In this study, Min-Pnet (see Figure 5) uses one feature extraction layer instead of two to reduce the training time and to avoid overfitting. The original 2D conv part was changed to 1D conv, because our input data size differs from the original setting of PointNet, which causes problems in the matrix multiplication part of the network.
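A small numerical illustration of why max pooling over points makes the feature extraction order-invariant; the 8-dimensional per-point features here stand in for PointNet's 1024-dimensional ones:

```python
import numpy as np

rng = np.random.default_rng(1)
point_features = rng.normal(size=(100, 8))   # per-point features (100 points)

# Max over the point axis is a symmetric function: the order of points is irrelevant.
global_feature = point_features.max(axis=0)

shuffled = point_features[rng.permutation(100)]
assert np.allclose(global_feature, shuffled.max(axis=0))
```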

Input Point Cloud Data
To ensure that the point clouds are reliable and not easily affected by the external environment, a PCL [38] chopbox and the Yolo Bounding Box are used to remove most of the unwanted point clouds (walls and desktops; see Figure 6).

To increase the amount of training data and to ensure that the model is robust to the slight errors caused by the hardware, Gaussian noise is added (Figure 7). To ensure that the input data are the same size, the point clouds for these objects are sampled and processed. The processed input data are a 3000 × 3 (N × 3 in Figure 8) matrix of the XYZ coordinates of the points (Figure 7). These data correspond to grasping point data, including the angle of each joint and a coordinate system based on UR5. These data are the input for the AI network and the prediction. A one-dimensional convolution layer (1D conv) is used for convolution and feature extraction from 1D data.
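The preprocessing above (box cropping, fixed-size sampling to 3000 × 3, Gaussian jitter) might be sketched as follows. The box limits and noise scale are illustrative assumptions, and the PCL chopbox is approximated here with a numpy mask:

```python
import numpy as np

def crop_box(points, lo, hi):
    """Keep points inside an axis-aligned box (rough PCL CropBox analogue)."""
    mask = np.all((points >= lo) & (points <= hi), axis=1)
    return points[mask]

def sample_fixed(points, n=3000, rng=None):
    """Up/down-sample the cloud to exactly n points (the N x 3 network input)."""
    if rng is None:
        rng = np.random.default_rng()
    idx = rng.choice(len(points), size=n, replace=len(points) < n)
    return points[idx]

def add_gaussian_noise(points, sigma=0.002, rng=None):
    """Jitter points for data augmentation (sigma in metres, assumed value)."""
    if rng is None:
        rng = np.random.default_rng()
    return points + rng.normal(scale=sigma, size=points.shape)

rng = np.random.default_rng(0)
raw = rng.uniform(-1, 1, size=(5000, 3))                       # stand-in raw cloud
cloud = sample_fixed(crop_box(raw, [-0.5] * 3, [0.5] * 3), rng=rng)
noisy = add_gaussian_noise(cloud, rng=rng)                     # augmented copy
```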
Most AI frameworks that are used for image-based object recognition or grasp prediction use a two-dimensional convolution layer (2D conv). The differences between these systems are shown in Figure 9.
In the CNN network, the 1D conv directly extracts features from the input point cloud; after processing with the maximum pooling layer (Max Pooling), the result is input to a multi-layer perceptron (MLP) composed of several fully connected layers to obtain the final predicted grasp position. This study uses the position of the object in space to predict the grasping position of the gripper, so a 1D conv is used to extract the main features. After the point cloud data are sampled, they are input to the network, and the model outputs the predicted grasp posture coordinates or joint angles.
1D convs were initially used for natural language processing and data analysis, and they can also be used to analyze point clouds in this data format. They are less computationally intensive and require a shorter training time than a 2D conv. A 2D conv cannot be used here because the point cloud data are not continuous (Figure 9), so the extracted features could not be applied, which would affect the prediction results.

Pooling Layer
The pooling layer reduces the dimensions of features and filters redundant features to reduce the computational burden and increase the generalization of the network. The pooling layer uses maximum pooling (Max Pooling; see Figure 10) and mean pooling (Average Pooling). The pooling process is expressed as:

x_j^l = pool(x_i^{l−1}) + b_j^l

where x_i^{l−1} is the value of the ith window in the layer l−1 input feature, b_j^l is the offset for the jth window in layer l and pool represents the sampling function.
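A minimal sketch of non-overlapping 1D max pooling (the window size of 2 is illustrative):

```python
import numpy as np

def max_pool1d(x, size=2):
    """Non-overlapping 1D max pooling over the length axis of a (length, ch) array."""
    n = (x.shape[0] // size) * size           # drop any trailing remainder
    return x[:n].reshape(-1, size, x.shape[1]).max(axis=1)

feat = np.array([[1., 5.], [3., 2.], [4., 0.], [2., 7.]])
pooled = max_pool1d(feat)
# pooled -> [[3., 5.], [4., 7.]]
```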



Multi-Layer Perceptron (MLP)
The multi-layer perceptron uses several fully connected layers. Each neuron in a fully connected layer is connected to every neuron in the previous layer. The formula is:

X^l = f(u^l), with u^l = W^l X^{l−1} + b^l

where f(u^l) is the activation function, W^l is the weight matrix from layer l−1 to layer l, b^l is the offset for layer l, and X^{l−1} is the output feature of layer l−1.
The last fully connected layer outputs six parameters that are used to predict joint angles or TCP coordinates. To avoid overfitting and prediction errors in the AI model, a dropout layer is inserted before the final output layer.
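The fully connected layers can be sketched as below; the layer sizes are illustrative assumptions, with six outputs matching the predicted joint angles or TCP parameters:

```python
import numpy as np

def dense(x, W, b, activation=np.tanh):
    """Fully connected layer: X^l = f(W^l X^{l-1} + b^l)."""
    return activation(W @ x + b)

rng = np.random.default_rng(0)
feat = rng.normal(size=64)                              # pooled feature vector
W1, b1 = rng.normal(size=(32, 64)) * 0.1, np.zeros(32)  # hidden layer (assumed size)
W2, b2 = rng.normal(size=(6, 32)) * 0.1, np.zeros(6)    # output layer: 6 parameters

hidden = dense(feat, W1, b1)
output = dense(hidden, W2, b2, activation=lambda u: u)  # linear output head
```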

State Delays Using Digital Redesign
We present the principles of optimal digital redesign in this section. The transformation of the original time-delay system into a delay-free system follows [39]. The optimal quadratic state feedback control law is used to minimise the following performance cost function:

J = ∫_0^∞ [ x^T(t) Q_c x(t) + u_c^T(t) R_c u_c(t) ] dt

where Q_c ≥ 0 and R_c > 0. Optimising the controller yields the state feedback law

u_c(t) = −K_c x(t), with K_c = R_c^{−1} B^T P.

The entire closed loop system can then be expressed as

ẋ(t) = (A − B K_c) x(t).
For m = p, we obtain the optimal gain. Here, P is the solution to the Riccati equation given below:

A^T P + P A − P B R_c^{−1} B^T P + Q_c = 0.
The linear quadratic regulator (LQR) (Equation (6)) design characteristics make the resulting closed loop system stable. If the system state is unmeasurable, an observer must be designed to estimate the system state. The linear observable continuous system (see Figure 11) is described by the equation below:

dx̂(t)/dt = A x̂(t) + B u(t) + L_c [ y(t) − C x̂(t) ]

where x̂(t) is the estimated state and L_c is the gain of the observer. We apply digital redesign to the analogue controller (Equation (7)) to obtain a more practical digital controller, a discrete-time state feedback law of the form u_d(kT) = −K_d x̂(kT) + E_d r(kT). For the linear model of the sampling system, the digital tracker based on the observer and the observer are shown in Figure 12.
Figure 11. Linear-quadratic analogue tracker based on an observer.
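The LQR gain and Riccati solution described in this section can be computed numerically. The sketch below uses a double-integrator plant as a stand-in (an illustrative assumption, not the identified manipulator model) and SciPy's continuous-time algebraic Riccati solver:

```python
import numpy as np
from scipy.linalg import solve_continuous_are

# Double-integrator stand-in for the plant (illustrative only)
A = np.array([[0., 1.], [0., 0.]])
B = np.array([[0.], [1.]])
Qc = np.eye(2)           # Qc >= 0
Rc = np.array([[1.]])    # Rc > 0

P = solve_continuous_are(A, B, Qc, Rc)   # Riccati solution P
Kc = np.linalg.solve(Rc, B.T @ P)        # optimal gain Kc = Rc^{-1} B^T P

# The closed loop A - B*Kc should be Hurwitz (all eigenvalues in the left half-plane)
closed_loop_eigs = np.linalg.eigvals(A - B @ Kc)
```

For this plant the optimal gain is Kc = [1, sqrt(3)], and both closed-loop eigenvalues have negative real parts, illustrating the stability property stated above.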

Results
This study uses bottles, bowls and sports balls for the experiments (see Figure 13). The training parameters for the AI network are a batch size of 32 and 10,000 epochs, the loss function is the mean squared error (MSE), and the selected optimizer is Adam.

MSE = (1/n) Σ_{i=1}^{n} (y_i − ŷ_i)²

where y_i − ŷ_i is the error and n is the number of samples in the dataset. This study uses Python 3.7.7 and a Windows 10 system environment for model training and the grasping test. In order to verify that the proposed control method [40] can be applied in subsequent real-world tests and to collect data more easily, a simulator was used to design a model of a nonlinear MIMO robot manipulator, as shown in Figure 14.
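For reference, the MSE loss can be computed directly (the sample values below are illustrative):

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: (1/n) * sum_i (y_i - y_hat_i)^2."""
    err = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(err ** 2))

# e.g. predicted vs. labelled joint angles (illustrative values)
loss = mse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])  # -> 4/3
```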

The Test of the Robot Manipulator
The dynamic equation of the two-link robot system is given below:

M(q) q̈ + C(q, q̇) q̇ + G(q) = Γ

where q = [q₁ q₂]^T, q₁ and q₂ are angular positions, M(q) is the moment of inertia matrix, C(q, q̇) includes the Coriolis and centripetal forces, G(q) is the gravitational force, and Γ is the applied torque vector. Here, we use the short-hand notations s_i = sin(q_i) and c_i = cos(q_i). The nominal parameters of the system are as follows: the link masses are m₁ = 5 kg and m₂ = 2.5 kg, the lengths are l₁ = l₂ = 0.5 m, and the gravitational acceleration is g_r = 9.81 m s⁻². This equation can then be rewritten in the following state-space form.
Let x and f(x) represent the state of the system and a nonlinear function of the state x, respectively. The inverse of the matrix M is written as

M⁻¹ = [p₁₁ p₁₂; p₂₁ p₂₂]

such that

g(x(t)) = [0 p₁₁ 0 p₂₁; 0 p₁₂ 0 p₂₂]^T.

The dynamic equation of the two-link robot system can, thus, be reformulated as follows:

ẋ(t) = f(x(t)) + g(x(t)) u(t), y(t) = C x(t)
where C = [1 0 0 0; 0 0 1 0]. First, OKID is applied to convert the nonlinear system (Equation (23)) to an equivalent linear system. The system (Equation (23)) is injected with white noise u(t) = [u₁(t) u₂(t)]^T with a zero mean and covariance diag[cov(u₁(t)), cov(u₂(t))] = [0.2, 0.2] at the sampling time T = 0.01 s. Figure 15 shows that the error between the output of the identified equivalent linear system and the original nonlinear system (Equation (23)) can be controlled to within 10⁻⁶ to 10⁻⁵. We then consider two different reference inputs r(t) = [r₁ r₂]^T for the identified equivalent linear system at the sampling time T = 0.01 s, with an input delay τᵢ = 0.5 × T and an output delay τₒ = 0.3 × T, for the optimal DR method presented in Section 4. To test whether the designed control can effectively suppress a state delay, we gradually increase the state delay τₛ from 0 × T for two different reference inputs, i.e., Types 1 and 2. Figures 16-23 show the ability of the control to suppress the Types 1 and 2 state delays. Table 1 summarises the results in Figures 16-23, showing that the robot manipulator suppresses the state delay for approximately 2.6 s. Table 1. The ability of the robot manipulator to suppress state delays.
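Under a point-mass link assumption (our reading of the model; the paper's exact inertia terms may differ), the dynamics matrices can be evaluated numerically with the stated nominal parameters:

```python
import numpy as np

m1, m2 = 5.0, 2.5   # link masses (kg), from the text
l1, l2 = 0.5, 0.5   # link lengths (m)
gr = 9.81           # gravitational acceleration (m/s^2)

def dynamics(q, qd):
    """M(q), C(q, qd)*qd and G(q) for a point-mass two-link arm (assumed form)."""
    c2, s2 = np.cos(q[1]), np.sin(q[1])
    M = np.array([
        [(m1 + m2) * l1**2 + m2 * l2**2 + 2 * m2 * l1 * l2 * c2,
         m2 * l2**2 + m2 * l1 * l2 * c2],
        [m2 * l2**2 + m2 * l1 * l2 * c2, m2 * l2**2],
    ])
    Cqd = np.array([
        -m2 * l1 * l2 * s2 * (2 * qd[0] * qd[1] + qd[1]**2),
        m2 * l1 * l2 * s2 * qd[0]**2,
    ])
    G = np.array([
        (m1 + m2) * gr * l1 * np.cos(q[0]) + m2 * gr * l2 * np.cos(q[0] + q[1]),
        m2 * gr * l2 * np.cos(q[0] + q[1]),
    ])
    return M, Cqd, G

# Acceleration at rest for an applied torque of [1, 0] N*m
M, Cqd, G = dynamics(np.array([0.0, 0.0]), np.array([0.0, 0.0]))
qdd = np.linalg.solve(M, np.array([1.0, 0.0]) - Cqd - G)
```

M(q) is symmetric and positive definite, as required for a valid inertia matrix, which is easy to check numerically.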

AI Training Results
The next section shows the training results of the neural network. There are 325 training data points for the three subjects in the training set (bottle: 180; bowl and ball: 50). Noisy point cloud data are also used for training, giving a total of 595 training samples. There are 45 data points in the test set (bottle: 25; bowl and ball: 10). During the training period, the predicted output (joint angles or TCP) was standardized. The standardization range (Table 2) is set according to the working range of the arm.

Three models were trained for these three types of objects. The output for these models in Table 3 is the UR5 arm joint angle that qbSoftHand uses to grasp the object. These angles are input into the UR5 control system and the arm is moved using the MoveJ command.

Real Machine Grasping
For the grasping test, the camera observes objects in the work space at an angle of about 45 degrees to the desktop (Figure 24).
Figure 24. The UR5 arm observes the position of the object.

Five-Finger and Two-Finger Claw Gripping Experiment
The experiment first tested the grip comparison between the two-finger and five-finger grippers (Figure 25). A network was trained using oriented bounding box (OBB) framework data [41] and a point cloud model using a two-finger gripper as a control group. Simultaneously, the results were compared with those of a similar study [25]. The results are displayed in Table 4. According to these test results, the five-finger gripper achieved a success rate of more than 80% in gripping various objects. The performance of the two-finger gripper (Figure 26), although it had a success rate of more than half, was much worse than that of the five-finger gripper. In fact, in most of the failed grips with the two-finger gripper, the predicted position was very close to the object, but the object could not be grasped because the gripping area of the gripper was not wide enough and the force applied at the contact point was insufficient. This demonstrates that even five-finger grippers that are only capable of an open-grip action show a significant advantage over the two-finger gripper at the hardware level. Reference [25] used a three-finger gripper and achieved at least an 80% success rate in both indoor and outdoor tests, as in this study. Considering the larger number of samples they collected (570 in total), the similar success rate achieved in our study may be attributed to the adaptability of the five-finger gripper, in addition to the different experimental environment (cluttered vs. single-object).

Comparison of CNN and Min-Pnet Networks
This section compares the CNN network, the Min-Pnet network designed with reference to the PointNet concept, and the network using OBB. The advantages of the five-finger gripper at the hardware level were demonstrated in the previous section, showing that it can effectively compensate for the shortcomings of the neural networks. In order to compare the differences between the network architectures more clearly, a two-finger gripper was used for the hardware in this comparison. The results are shown in Table 5.
Compared to the CNN network, the results of Min-Pnet were significantly better in all aspects. Although the difference in the results on the bottles was not large (5%), the bowl and sports ball performed significantly better (24% and 20%), which may be due to the hardware limitation mentioned in the previous section. For the OBB, except for the bowls, which have a larger successful gripping area, the results were not good. In the OBB experiment, grips on objects that were too close or too far away failed, probably because, at those viewing angles and distances, the OBB frame points did not fully represent the size and position of the object.
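The limitation observed for the OBB can be illustrated with a minimal sketch of how an oriented bounding box is typically derived from a point cloud (here via PCA over the covariance of the points; the exact method of [41] may differ). The box records only the centre, axes, and extents, so it cannot encode where the mass of an asymmetric object actually sits.

```python
import numpy as np

def pca_obb(points):
    """Oriented bounding box of a 3-D point cloud via PCA.

    Returns the box centre, the principal axes (rows of a 3x3 rotation),
    and the half-extents along each axis.
    """
    centre = points.mean(axis=0)
    centred = points - centre
    # Principal axes = right singular vectors of the centred cloud.
    _, _, axes = np.linalg.svd(centred, full_matrices=False)
    proj = centred @ axes.T                     # coordinates in the box frame
    half = (proj.max(axis=0) - proj.min(axis=0)) / 2
    box_centre = centre + ((proj.max(axis=0) + proj.min(axis=0)) / 2) @ axes
    return box_centre, axes, half

# An anisotropic cloud: the OBB captures overall extent, but the eight frame
# points say nothing about surface shape or asymmetry inside the box.
rng = np.random.default_rng(0)
cloud = rng.normal(size=(500, 3)) * np.array([3.0, 1.0, 0.2])
box_centre, axes, half = pca_obb(cloud)
```

A flat or one-sided object and a symmetric one can produce nearly identical boxes, which is consistent with the failures seen when the frame points did not represent the object's true pose.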

Out-of-Training Set Object Gripping
Then, we tested whether our network could produce a corresponding grip on unknown objects (see Figure 27) similar to the training set. The objects tested were an alcohol spray bottle (considered as a bottle), a sock rolled into a ball, and a small doll (considered as a sports ball). The networks included the CNN, Min-Pnet, and the OBB, and the hardware was the same as in the previous section. The results are shown in Table 6. Although the number of experiments is small, the success rate of Min-Pnet shows that the feature-transformation architecture can indeed analyze point clouds more effectively; it demonstrated its adaptability by still predicting effective grip positions when faced with objects that are similar but not identical to the training data. It also exposes the drawback of the OBB: its difficulty in representing asymmetric object features.
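The adaptability attributed to the feature-transformation architecture follows the PointNet idea that Min-Pnet references: a shared per-point MLP followed by a symmetric max-pool, so the global feature depends only on the set of points, not their order. The sketch below uses hypothetical random weights (not the trained Min-Pnet parameters) purely to demonstrate this invariance.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical shared-MLP weights: 3 -> 16 -> 64 features per point.
W1, b1 = rng.normal(size=(3, 16)), np.zeros(16)
W2, b2 = rng.normal(size=(16, 64)), np.zeros(64)

def global_feature(points):
    """PointNet-style encoder: shared per-point MLP + symmetric max-pool."""
    h = np.maximum(points @ W1 + b1, 0)   # shared layer 1 (ReLU)
    h = np.maximum(h @ W2 + b2, 0)        # shared layer 2 (ReLU)
    return h.max(axis=0)                  # max-pool over points: order-invariant

cloud = rng.normal(size=(128, 3))
f1 = global_feature(cloud)
f2 = global_feature(cloud[::-1])          # same points, reversed order
# f1 and f2 are identical: the encoder sees a set, not a sequence.
```

Because the global feature summarizes per-point responses rather than a fixed frame, nearby but non-identical shapes map to nearby features, which is one plausible reason Min-Pnet generalized to the out-of-training-set objects where the OBB did not.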

Conclusions
The simulation results in Section 5.1 demonstrate effective optimal digital re-design control of a robot manipulator with a nonlinear delay. In particular, the proposed method can effectively control a state delay within the tolerable scope.
In terms of gripper hardware, although the qbSoftHand used in this study cannot grasp objects more flexibly due to hardware limitations, it can still grasp many objects that are difficult to grasp with the two-finger gripper, such as smooth bowls, balls, and bottles. The comparison with the two-finger gripper and with the three-finger gripper of [25] demonstrates the advantage of the five-finger gripper at the hardware level.
In the case of objects outside the training set, Min-Pnet achieved a success rate of more than 70% despite the small number of experiments, and after analysis, it was found that most of the gripping failures were due to hardware limitations. Testing the limits of the network also proved that the features and poses of objects cannot be fully expressed using only the OBB framework.
Due to hardware and time constraints, it was not possible in this paper to conduct more complex experiments, such as identifying and grasping multiple objects in a cluttered environment or testing the grasping of more objects. However, our research has partially demonstrated the advantages of the five-finger gripper and provides a simple and effective grasping system that automates the task of grasping objects.
There are many future research directions, such as controlling the arm through wireless communication, performing the original data collection in a virtual environment, or using 5G combined with AR for arm training.