An Efficient Motion Adjustment Method for a Dual-Arm Transfer Robot Based on a Two-Level Neural Network and a Greedy Algorithm

Abstract: As the manipulation object of a patient transfer robot is a human, who can be considered a complex and time-varying system, motion adjustment of a patient transfer robot is inevitable and essential for ensuring patient safety and comfort. This paper proposes a motion adjustment method based on a two-level deep neural network (DNN) and a greedy algorithm. First, a dataset containing information about human posture and contact forces is collected experimentally. Then, the DNN used to estimate the contact force is established and trained on the collected dataset. Furthermore, the adjustment is conducted by comparing, via a greedy algorithm, the estimated contact force of the next state with the real contact force of the current state. To assess the validity of the method, we first employed the DNN to estimate contact force and achieved an accuracy of 84% and an estimation time of 30 ms on an affordable processing unit. We then applied the greedy algorithm to a dual-arm transfer robot and found that the motion adjustment could reduce the contact force and efficiently improve human comfort. These results validate the effectiveness of our proposal and provide a new approach to adjusting the posture of a care receiver to improve their comfort by reducing the contact force between the human and the robot.


Introduction
Nowadays, nursing care for disabled people has become very urgent in a rapidly aging society. Transferring a disabled person to or from a bed, a wheelchair, or a toilet is a daily task that places a heavy burden on caregivers [1]. To alleviate this burden on families, the development of transfer nursing robots has attracted wide attention in recent years [2-5]. In general, a transfer nursing robot's operation has six steps, namely, posture recognition, the lifting-up process, motion adjustment, the moving process, the putting-down operation, and homing, as shown in Figure 1. It should be noted that motion adjustment is crucial, as human comfort can be improved by reducing the human burden (internal and external forces) through it [6,7].
Comfort is a metric for measuring the performance of a dual-arm transfer robot [8]. It is a physical or psychological sense of the care receiver and can be evaluated by the internal and external forces on the human body [9]. Scholars have conducted research on adjusting the posture of a care receiver during transfer to reduce these forces. Mukai et al. [10] proposed a tactile-based motion adjustment method to reduce the contact force between a robot and a care receiver. They generated the motion trajectories of the robot using tactile information-based interpolation between preset trajectories for tall and short persons [10,11]. However, because the proposed motion adjustment was conducted by changing the horizontal distance between the two arms of the robot, it could not reduce the contact force efficiently. Furthermore, the posture of the care receiver, a key factor affecting patient comfort, was not considered. Hasegawa et al. [12] and Ding et al. [13] proposed a new motion-adjusting method by developing a mechanical model and a human comfort evaluation. First, the internal and external forces were estimated using the developed mechanical model. Then, a comfort evaluation function was generated by weighting the obtained forces through a questionnaire method. Finally, the robot's motion was adjusted by optimizing the human comfort level through the comfort evaluation function. However, owing to the different sensitivities of human comfort to each force and the complexity of the human body structure, the accuracy of the developed mechanical model and the comfort evaluation function was low. Therefore, this method can be used neither to optimize the internal and external forces on the human body nor to adjust the robot trajectory. To tackle these problems, Delp et al. [14] and de Zee et al. [15] developed musculoskeletal models to simulate the human body and the contacted objects for estimating the physical interaction between humans and objects (machines or robots).
Based on the musculoskeletal model, many efficient controls were achieved to assist human rehabilitation or training. Ding et al. [16] developed a musculoskeletal model specialized for a transfer robot to estimate the muscle force, and they extended this model with a dynamic model to consider the interaction between humans and robots, which can be used to estimate the contact force between them. A comfortable holding posture is estimated by minimizing the total activity of all muscles. However, processors with high specifications were required to achieve satisfactorily high speed [17]; this would increase the expense and consequently limit the practicality of a patient transfer robot.
Electronics 2024, 13, x FOR PEER REVIEW
To reduce the load on the care receiver and improve their comfort during the transfer motion, this paper proposes a method for adjusting the robot trajectory based on the predicted contact force during the transfer motion. The basic idea is that the contact force is related to the posture of the care receiver, and this relation can be modeled using machine learning. First, the contact forces between the human body and the robot arm, each described by four dimensions (2D position, magnitude, and direction), and the human posture, expressed by the human joint positions, were collected as a dataset. Then, a two-level deep neural network (DNN) is constructed and trained on the collected data samples to predict the contact force. The trajectory adjustment is performed by comparing the contact force obtained from the DNN with that measured by the tactile sensor on the robot's arm through a greedy algorithm. Finally, we practically applied the motion adjustment method to a dual-arm nursing care robot to adjust the posture of the care receiver and evaluate its effectiveness in improving transfer comfort.

Two-Level DNN for Contact Force Estimation
A DNN is a mathematical regression model framework [18] that can establish linear and nonlinear relationships between input and output. It is widely used in practical engineering research. Its strong nonlinear mapping ability and flexible network structure [19,20] can avoid the limited nonlinear expressiveness of mechanical model methods [21]. Therefore, this study uses a two-level DNN to construct a mathematical model expressing the relationship between the contact force and the lifting state.
As illustrated in Figure 2, first, the lifting state is encoded as five human joint positions (A, B, C, D, E), two lifting point positions (G, H), and the human weight distribution (W_1, W_2, W_3, W_4). Then, features are extracted from the encoded lifting state by the first-level subnetwork, and the second-level subnetwork selects related features to compute the contact force. The local feature extraction of the convolution layers can guarantee sufficient information about the lifting state [22]; meanwhile, the global feature fusion of the self-attention mechanism can improve the nonlinear expressiveness of the proposed mathematical model [23,24]. This ensures the high accuracy of the patient transfer robot's operation. Thus, it is meaningful to examine the applicability of a DNN.


Encoder Design
The input of the first-level subnetwork is the lifting state, including the posture and weight of the care receiver and the positions of the lifting points. Because these multiple modalities are difficult to combine mathematically, inspired by the results presented in [25], this study used an encoder to transform the lifting state into a set of two-dimensional (2D) heatmaps and a set of 2D vector fields [26]. In this way, a unified data form for the first-level network input is generated. The diagram of the encoder is presented in Figure 3.



Heatmaps 1. The human joint confidence maps S = (S_1, S_2, ..., S_5) are generated as follows:

S(u_h, v_h, i) = g((u_h, v_h) − p_i^j; A)

Here, to satisfy the requirements of a patient transfer robot [9], five human joints are selected to express a care receiver's posture.
where i denotes the index of a human joint and i ∈ {1, 2, ..., 5}; (u_h, v_h) is the position of a pixel in the confidence map; p_i^j ∈ R^(1×2) is the position of joint i; S(u_h, v_h, i) indicates the pixel value at position (u_h, v_h); g(•) is the Gaussian function; and A is a parameter of the Gaussian function.
Heatmaps 2. The lifting point confidence maps S′ = (S′_1, S′_2) are generated as [27]

S′(u_h, v_h, i) = g((u_h, v_h) − h_i^j; A′)

where i denotes the lifting point index and i ∈ {1, 2}; (u_h, v_h) is the position of a pixel in the confidence map; h_i^j ∈ R^(1×2) is the position of holding point i, which is obtained by the robot locating system; S′(u_h, v_h, i) is the pixel value at position (u_h, v_h); and A′ is a parameter of the Gaussian function.
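Both sets of confidence maps can be rendered in the same way. The sketch below, in NumPy, assumes an unnormalised Gaussian with width `sigma` standing in for g(•; A) (the paper does not fix the exact parameterisation), and uses made-up joint coordinates:

```python
import numpy as np

def joint_heatmap(joint_xy, size=56, sigma=2.0):
    """Render one 2D Gaussian confidence map centred on a joint position.
    The exact Gaussian parameterisation is an assumption; the paper only
    states that g is a Gaussian with parameter A."""
    u, v = np.meshgrid(np.arange(size), np.arange(size), indexing="xy")
    d2 = (u - joint_xy[0]) ** 2 + (v - joint_xy[1]) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))

# Five joint maps S (A..E); the two lifting-point maps S' use the same encoder.
joints = [(10, 12), (20, 18), (30, 30), (38, 44), (45, 50)]  # hypothetical positions
S = np.stack([joint_heatmap(p) for p in joints], axis=-1)
print(S.shape)       # (56, 56, 5)
print(S[12, 10, 0])  # peak value 1.0 at the first joint position
```

Each map peaks at the encoded joint (or holding point) and decays smoothly around it, giving the CNN a spatially localised representation of the lifting state.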
Vector fields. The part affinity field (PAF) method, which can transform a line into a matrix, was used to transform each human limb with gravity into a 2D vector field. The 2D vector field of body part i is generated by the following:

w_i(P) = v_i if P lies on body part i, and w_i(P) = 0 otherwise

where w_i(P) is the vector value at position P in w_i, and v_i = (p_i′ − p_i)/‖p_i′ − p_i‖ is the unit vector in the direction of body part i, where p_i and p_i′ denote the endpoints of body part i; in this work, they indicate the joint positions.
The set of points on the limb consists of the points within a distance threshold of the line segment, i.e., the points P that satisfy the following conditions:

0 ≤ v_i · (P − p_i) ≤ l_i and |v_i⊥ · (P − p_i)| ≤ σ_i

where l_i is the length of body part i; P ∈ R^(2×1) is a two-dimensional (2D) position in w_i; v_i⊥ is a vector perpendicular to v_i; and σ_i is the width of body part i. In this work, the value of σ_i is defined empirically by the following:

σ_i = k · m_i/l_i + t

where m_i and l_i are the weight and length of body part i, respectively, and k and t are parameters of the equation. The obtained vector fields are expressed as L = (L_1, L_2, ..., L_4).
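The membership test above can be rasterised directly. This NumPy sketch applies the standard PAF conditions to fill one limb field; the limb endpoints and width are illustrative values, not the paper's:

```python
import numpy as np

def limb_field(p, p_prime, sigma, size=56):
    """Rasterise one part-affinity field: every pixel satisfying
    0 <= v.(P - p) <= l and |v_perp.(P - p)| <= sigma gets the limb's
    unit direction vector v; all other pixels are zero."""
    p, p_prime = np.asarray(p, float), np.asarray(p_prime, float)
    l = np.linalg.norm(p_prime - p)
    v = (p_prime - p) / l                      # unit vector along the limb
    v_perp = np.array([-v[1], v[0]])           # perpendicular unit vector
    field = np.zeros((size, size, 2))
    u, vv = np.meshgrid(np.arange(size), np.arange(size), indexing="xy")
    d = np.stack([u - p[0], vv - p[1]], axis=-1)  # P - p at each pixel
    along = d @ v                                  # v . (P - p)
    across = d @ v_perp                            # v_perp . (P - p)
    mask = (along >= 0) & (along <= l) & (np.abs(across) <= sigma)
    field[mask] = v
    return field

L1 = limb_field((10, 10), (10, 30), sigma=3.0)  # a vertical limb
print(L1[20, 10])  # a pixel on the limb: unit vector (0, 1)
print(L1[20, 20])  # a pixel off the limb: (0, 0)
```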

First-Level Subnetwork
In the first subnetwork, a CNN [27] is used to extract local features of the lifting-up state. Previous studies have proven the effectiveness of VGG16 on many image recognition datasets [28-30]. In particular, using convolution kernels of size 3 × 3 not only enhances the ability to extract local features but also increases the CNN depth; using pooling layers reduces the number of network parameters and prevents over-fitting; and using ReLU (rectified linear unit) activation layers enhances the nonlinear expressiveness of the network [31]. These advantages may contribute to achieving high-accuracy, real-time contact force estimation. However, the input of a conventional VGG16 network must be a color image with a fixed size of 224 × 224 × 3 [27], which, in most cases, does not match the sizes of the obtained confidence maps S and S′ and the vector field L. To address this problem, the filter size of the VGG16's first layer is modified so that it generates feature maps matching the second layer of the VGG16. The modified network is referred to as the M-VGG16 network.
The structure of the M-VGG16 network is presented in Figure 4. First, the transformed confidence maps S and S′ and vector fields L are concatenated to generate the input maps F_1 ∈ R^(56×56×11) of the first-level subnetwork. Then, F_1 is analyzed by the revised first convolutional layer C:64-1, whose weights are initialized to 1 to match the input of C:64-2. The following max-pooling layer, P:2, decreases the size of the feature map. Next, the obtained feature maps are analyzed by C:128-3, P:2, C:128-3, and P:2 sequentially to obtain the final feature map F, which is then used as the input of the next-level subnetwork. In the proposed network design, all convolutional layers except the first are initialized from the corresponding pre-trained convolutional layers of the VGG16.
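The concatenation step can be checked with shapes alone. In this sketch, collapsing each of the four 2D limb fields to a single channel (here, its magnitude) is our assumption; it is one way to make the stated channel count 5 + 2 + 4 = 11 work out:

```python
import numpy as np

# Shapes as stated in the paper: five joint maps, two lifting-point maps,
# and four limb fields, concatenated into a 56 x 56 x 11 input F_1.
S = np.random.rand(56, 56, 5)        # joint confidence maps
Sp = np.random.rand(56, 56, 2)       # lifting-point confidence maps
L = np.random.rand(56, 56, 4, 2)     # four limb fields of 2D vectors
L_mag = np.linalg.norm(L, axis=-1)   # collapse to one channel each (assumption)

F1 = np.concatenate([S, Sp, L_mag], axis=-1)
print(F1.shape)  # (56, 56, 11)
```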

Second-Level Subnetwork
In the second-level subnetwork, a transformer-based backbone is used to extract the global features of the obtained feature maps and generate the contact force.
The structure of the second-level subnetwork is presented in Figure 5. Its input, the feature map F obtained from the first subnetwork, is processed by a transpose function and a max layer to adjust the input size from 7 × 7 × 256 to 7 × 256 so that it matches the transformer input. The size is adjusted as follows:

F′ = F.transpose(0, 2, 1).max(−1)

where transpose(0, 2, 1) swaps the second and third dimensions of the feature map, and max(−1) extracts the maximum value along the last dimension; namely, the feature map is reduced from three dimensions (3D) to two dimensions (2D).
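The size adjustment above maps directly onto NumPy array operations, as this minimal sketch shows:

```python
import numpy as np

# F' = F.transpose(0, 2, 1).max(-1): swap the last two axes of the
# 7 x 7 x 256 feature map, then take the maximum over the trailing axis,
# yielding a 7 x 256 matrix that fits the transformer input.
F = np.random.rand(7, 7, 256)
F_prime = F.transpose(0, 2, 1).max(-1)
print(F_prime.shape)  # (7, 256)
```

Each entry F′[i, c] is the maximum of channel c over the second spatial axis at row i, so the 3D map is reduced to a 2D token sequence of length 7.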

Next, a multi-head attention module is used to analyze the spatial relationships within the obtained features F′. As shown in Figure 6, F′ is split into eight heads that are multiplied with the weight matrices W_Q_i, W_K_i, and W_V_i, i = 1, 2, ..., 8, to obtain the matrices Q_i, K_i, and V_i, which indicate the query, key, and value of each head, respectively [32].
The similarity S_i of Q_i and K_i is calculated by the following:

S_i = Q_i K_i^T / √d_Ki

where K_i^T is the transpose of K_i; d_Ki is the dimension of K_i; and i ∈ {1, 2, ..., 8} is the index of a head. Further, the attention of a head, Z_i, is calculated by weighted matching as follows:

Z_i = softmax(Q_i K_i^T / √d_Ki) V_i

where softmax(•) normalizes data into values between zero and one, which are used as the weights of V_i in the weighted matching. Afterward, the attention Z_i from each head is concatenated to generate Z = (Z_1, Z_2, ..., Z_8). After the multi-head attention module, two shortcut connection layers (the blue arrows in Figure 5) and two normalization layers (the green frames in Figure 5) are used to overcome degradation and accelerate the convergence of the neural network [33]. To increase the nonlinear expressiveness of the network, a fully connected feed-forward network, consisting of two linear transformation layers with an activation layer between them, is inserted after the first normalization layer. Moreover, six iterations are conducted over all network layers (the blue block in Figure 5). Finally, a linear layer changes the output size to two; namely, the output includes the contact force values on the human back and thigh.
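The per-head computation can be sketched in NumPy as follows; the random matrices stand in for the learned weights W_Q, W_K, and W_V, and the head dimension 32 (8 × 32 = 256) is an assumption:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(F_prime, n_heads=8, d_head=32, seed=0):
    """Eight-head attention as in the paper: per head i,
    Z_i = softmax(Q_i K_i^T / sqrt(d_Ki)) V_i, then concatenate."""
    rng = np.random.default_rng(seed)
    d_model = F_prime.shape[-1]  # 256
    heads = []
    for _ in range(n_heads):
        W_Q = rng.standard_normal((d_model, d_head))
        W_K = rng.standard_normal((d_model, d_head))
        W_V = rng.standard_normal((d_model, d_head))
        Q, K, V = F_prime @ W_Q, F_prime @ W_K, F_prime @ W_V
        S = Q @ K.T / np.sqrt(d_head)  # similarity S_i
        Z = softmax(S) @ V             # weighted matching Z_i
        heads.append(Z)
    return np.concatenate(heads, axis=-1)  # Z = (Z_1, ..., Z_8)

Z = multi_head_attention(np.random.rand(7, 256))
print(Z.shape)  # (7, 256)
```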
It is worth mentioning that the loss function of the DNN is defined as a norm of the distance between the predicted and actual forces, expressed as follows:

Loss = ‖(F_B, F_T) − (F_B^g, F_T^g)‖

where F_B and F_T are the predicted contact force values on the human back and thigh, respectively, and F_B^g and F_T^g are the corresponding actual contact force values.

Greedy Algorithm-Based Motion Adjustment
A greedy algorithm makes a greedy (locally optimal) choice at each step so that the objective function is optimized [34]. In this study, a greedy algorithm-based motion adjustment method is proposed to improve patient comfort. As the relative positions of the holding points cannot be changed during the lifting-up operation, the lifting state is adjusted by changing the angle between the thigh and the horizontal line (θ_1) and the angle between the upper body and the horizontal line (θ_2). The lifting state adjustment steps are presented in Figure 7.
Figure 7. Flowchart for lifting state adjustment. A represents a virtual action set; a is an action in the action set A; s is the current lifting state, and s_ is the next lifting state, which is generated from s and a through the real human-machine system; S_ is the set of next virtual lifting states. The virtual human-machine system generates the next virtual lifting state set from s and A, while the real human-machine system generates the real next lifting state from s and a. DNN is the proposed neural network, and Min() is a minimization function that chooses the minimum value from an array. F is the virtual contact force set, while f is the real contact force. The dashed arrow denotes updating.
As shown in Figure 7, at the beginning of the method, an action set A is applied to the current lifting state to generate the next virtual lifting states via a virtual human-machine system. The action set A = (a_1, a_2, ..., a_9) has nine actions, as presented in Table 1; each action is defined as a pair of angle changes (σ_1, σ_2), with each change being 0, 5, or −5 degrees, added to θ_1 and θ_2, respectively. The nine pairs are (0, 0), (0, 5), (0, −5), (5, 0), (5, 5), (5, −5), (−5, 0), (−5, 5), and (−5, −5). The next lifting state set S_ = (s_1_, s_2_, ..., s_9_) comprises nine candidate virtual lifting states obtained from the corresponding nine actions. The specific steps of the virtual lifting state generation are as follows: (1) Delineate the human skeleton by connecting the head and the midpoints of the shoulders (L&R), hips (L&R), knees (L&R), and ankles (L&R), as shown in Figure 8.
(2) Apply an action a_i = (σ_1, σ_2) to the current lifting state by rotating line 1-2-3 and lifting point 6 by σ_1 degrees around point 3, and rotating line 3-4-5 and lifting point 7 by σ_2 degrees around point 3.
(3) Record the new lifting state obtained after action a_i as the next virtual lifting state, denoted by s_i_.
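The steps above reduce to 2D rotations about point 3. In this sketch the point coordinates are hypothetical stand-ins for the skeleton of Figure 8:

```python
import numpy as np

def rotate_about(point, center, deg):
    """Rotate a 2D point by `deg` degrees around `center`."""
    th = np.deg2rad(deg)
    R = np.array([[np.cos(th), -np.sin(th)], [np.sin(th), np.cos(th)]])
    return (np.asarray(point, float) - center) @ R.T + center

def next_virtual_state(state, action):
    """Apply a = (sigma1, sigma2): rotate points 1-2 and lifting point 6
    by sigma1 around point 3, and points 4-5 and lifting point 7 by
    sigma2 around point 3. `state` maps point labels to 2D positions."""
    s1, s2 = action
    c = np.asarray(state[3], float)
    new = dict(state)
    for k in (1, 2, 6):
        new[k] = rotate_about(state[k], c, s1)
    for k in (4, 5, 7):
        new[k] = rotate_about(state[k], c, s2)
    return new

# Hypothetical skeleton: upper body along y, thigh along x, hip at point 3.
state = {1: (0.0, 1.0), 2: (0.0, 0.5), 3: (0.0, 0.0),
         4: (0.5, 0.0), 5: (1.0, 0.0), 6: (0.0, 0.8), 7: (0.7, 0.0)}
s_next = next_virtual_state(state, (0, 5))  # rotate only the thigh side by 5 degrees
```

Because every segment is rotated rigidly about point 3, limb lengths, and hence the relative positions of the holding points, are preserved.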
The next virtual lifting state set S_ is used as the input of the proposed DNN to obtain the contact force set F = (f_1, f_2, ..., f_9). Further, the minimum contact force f_m is obtained by the following:

f_m = min(F)

where min(•) is a minimum function that finds the minimum value of a list.
Next, fm is compared with the contact force f obtained by the tactile sensor on the robot's arm at the current lifting state. If fm ≥ f, the adjustment process terminates; otherwise, the corresponding action am is performed on the real robot system to update its current lifting state and contact force, as indicated by the dashed red arrow in Figure 7; the system then iterates from the beginning until the adjustment terminates.
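The greedy loop of Figure 7 can be sketched as follows. The callables `virtual_step`, `dnn_force`, `read_tactile`, and `execute` are hypothetical stand-ins for the virtual human-machine system, the trained DNN, the tactile sensor, and the real robot, respectively; they are not part of the authors' code.

```python
# The nine actions of Table 1: all (sigma1, sigma2) pairs over {0, 5, -5}.
ACTIONS = [(s1, s2) for s1 in (0, 5, -5) for s2 in (0, 5, -5)]

def adjust(state, virtual_step, dnn_force, read_tactile, execute,
           max_iters=20):
    """Greedy lifting-state adjustment sketched from Figure 7."""
    for _ in range(max_iters):
        f = read_tactile(state)                      # real contact force
        # Generate the nine virtual next states and their estimated forces.
        candidates = [virtual_step(state, a) for a in ACTIONS]
        forces = [dnn_force(c) for c in candidates]  # virtual force set F
        m = min(range(len(forces)), key=forces.__getitem__)
        if forces[m] >= f:                           # no improvement: stop
            break
        state = execute(state, ACTIONS[m])           # apply a_m on the robot
    return state
```

With toy stand-ins (a scalar state whose "contact force" is its distance from 10), the loop converges to the minimum-force state in one greedy step.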


Experimental Results
To assess the validity, we first collected a new dataset for training the developed DNN. We then employed the DNN to estimate contact forces and compared its accuracy and speed with those of two methods commonly used in contact-force estimation. Finally, the DNN was applied to a dual-arm nursing-care robot for motion adjustment to verify its effectiveness.

Validation of the Developed DNN in Nursing Environment
(1) Dataset collection: The data collection experiment was conducted on a dual-arm robot platform.As shown in Figure 9, the robot comprises a head, chassis, body, and robotic arm.The robot's arms are segmented into upper arms and forearms, linked by elbow joints.The complete robotic arm is attached to the body via a shoulder joint.The joints linking the body and chassis include the lumbar and hip joints.The robot stands at a height of 1350 mm with a body thickness of around 1000 mm.The distance between the shoulder joints is approximately 688 mm, the maximum arm diameter is 100 mm, and the total mass is 150 kg.
The experiment recruited 50 subjects (20 females and 30 males). The ages of the subjects range from 22 to 65 years, with an average of 34 years. The subjects' heights and weights also differ: heights range from 1.50 m to 1.85 m, and weights from 47 kg to 72 kg. As shown in Figure 9, a weight scale was used to weigh the subjects. A total of 10 blocks, evenly distributed on the back and legs of each subject, were selected as candidate lifting points, denoted by (B1, B2, ..., B5) and (T1, T2, ..., T5), and nine markers were attached to the human joints to mark their positions, as shown in Figure 10. A multiple-calibrated motion capture device with a frame rate of 10 Hz was used to record the motion of the joint markers. Two tactile sensors covering the robot's arms were used to record the contact forces between the human and the robot. In the data collection process, a subject was lifted and then adjusted by the robot from the initial state to the final state along a preset trajectory, as shown in Figure 11. The dataset consists of 500,000 samples, each of which includes the positions of the human joints and lifting points, the human weight, and the contact force. Data augmentation techniques [35] were applied, and the data were split into two sets: an evaluation set consisting of data from five subjects (two females and three males) and a training set consisting of data from 45 subjects (18 females and 27 males). The new dataset was named the contact force for patient transfer robot (CFPR) dataset. A few typical examples of the data samples are depicted in Figure 12.
(2) Contact force estimation: The experiment was conducted in the PyTorch environment. The training and testing processes were run on a PC with an Intel Core i7-6700HQ CPU, 8 GB RAM, and a 4 GB NVIDIA GeForce GTX 950M GPU. The proposed method was compared with two methods commonly used in lifting-force estimation on the CFPR dataset.
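The subject-level split described above (45 training subjects, 5 evaluation subjects, with no subject in both sets) can be sketched as follows. The sample field names are illustrative assumptions, not the authors' schema.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class CFPRSample:
    """One CFPR record: joint/lifting-point positions, weight, and the
    measured contact force. Field names are illustrative only."""
    subject_id: int
    joints: List[Tuple[float, float]]          # nine joint marker positions
    lifting_points: List[Tuple[float, float]]  # selected lifting points
    weight_kg: float
    contact_force_n: float

def split_by_subject(samples, eval_ids):
    """Split so that no subject appears in both sets, matching the
    paper's 45/5 subject-level train/evaluation split."""
    train = [s for s in samples if s.subject_id not in eval_ids]
    evaluation = [s for s in samples if s.subject_id in eval_ids]
    return train, evaluation
```

Splitting by subject rather than by sample prevents the DNN from being evaluated on poses of people it has already seen during training.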
(3) Evaluation metric: To evaluate the proposed model's performance, the average accuracy (AA) was used as an evaluation metric, and its calculation method was as follows.First, the 10 N rule was used to evaluate the correctness of estimation.In particular, an estimation was regarded as correct when the error between the predicted value and the true value was less than 10 N. In addition, the ratio of the number of correct estimations to the total number of estimations (CE-to-TE ratio) was used to assess the accuracy of contact force estimation.Finally, the average CE-to-TE value was regarded as average accuracy.
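The average accuracy under the 10 N rule described above can be sketched as:

```python
def average_accuracy(pred, true, tol=10.0):
    """Average accuracy (AA) under the 10 N rule: an estimate counts as
    correct when |prediction - ground truth| < tol newtons; AA is the
    ratio of correct estimations to total estimations."""
    correct = sum(abs(p - t) < tol for p, t in zip(pred, true))
    return correct / len(true)
```

For example, predictions [100, 120, 95] N against ground truth [105, 135, 100] N give errors of 5, 15, and 5 N, so two of three estimations are correct.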
The results of the average accuracy and speed of the total contact force estimation are presented in Table 2. The method of Mukai et al. [12] had the fastest speed, but its accuracy was the lowest among all methods. This is because it uses a mechanical lifting-up model in which the human is regarded as a two-link object [13], so the posture of the lower body cannot be fully captured.

Figure 1 .
Figure 1. Operational steps of a patient transfer robot.


Figure 2 .
Figure 2. Structure of the developed network.

Figure 3 .
Figure 3. Function of the encoder.



Heatmap 1: The Gaussian function, which can transform a 2D point into a 2D matrix, was used to convert the human joint positions into a set of confidence maps S = (S1, S2, ...,
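A minimal sketch of such a Gaussian confidence map follows; the map size and the Gaussian width are assumptions for illustration, not values from the paper.

```python
import numpy as np

def gaussian_heatmap(joint_xy, size=64, sigma=2.0):
    """Convert a 2D joint position into a confidence map by placing a
    Gaussian peak (value 1.0) at the joint location. The map size and
    sigma are illustrative assumptions."""
    xs = np.arange(size)
    # Separable 1D Gaussians along x and y, combined by outer product.
    gx = np.exp(-((xs - joint_xy[0]) ** 2) / (2 * sigma ** 2))
    gy = np.exp(-((xs - joint_xy[1]) ** 2) / (2 * sigma ** 2))
    return np.outer(gy, gx)  # indexed as (row, col) = (y, x)
```

Each joint produces one such map, and stacking the maps over all joints yields the confidence map set S.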

Figure 4 .
Figure 4. Structure of the first-level subnetwork. S is heatmap 1, L is the vector field, S' is heatmap 2, C:X-Y represents a convolutional layer, which includes X convolutional kernels with the size of Y × Y, and P:N is a max-pooling layer, where N is the stride of the filter.

Figure 5 .
Figure 5. Structure of the second-level subnetwork. F, F', Z, and Z' are feature maps; N is the number of iterations.


The input is multiplied by the weight matrices W_i^Q, W_i^K, and W_i^V, i = 1, 2, ..., 8, to obtain matrices Q_i, K_i, and V_i that represent the corresponding query, key, and value, respectively [32].
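The per-head query/key/value projections and the subsequent scaled dot-product attention can be sketched as follows. This is a generic illustration of the standard multi-head attention computation [32], not the authors' implementation; head count and dimensions are assumptions.

```python
import numpy as np

def multi_head_qkv(X, WQ, WK, WV):
    """Per-head projections: the input X of shape (n, d) is multiplied
    by each head's weight matrices W_i^Q, W_i^K, W_i^V (lists of
    (d, d_k) arrays) to obtain the queries Q_i, keys K_i, and values
    V_i for every head i."""
    return ([X @ w for w in WQ],
            [X @ w for w in WK],
            [X @ w for w in WV])

def attention(Q, K, V):
    """Scaled dot-product attention for one head:
    softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V
```

The eight heads are computed independently and their outputs are then concatenated and projected, as in the standard multi-head attention module of Figure 6.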

Figure 6 .
Figure 6. Structure of the multi-head attention module.


Figure 7 .
Figure 7. Flowchart for lifting state adjustment. A represents a virtual action set; a is an action in the action set A; s is the current lifting state, and s_ is the next lifting state, which is generated from s and a through the real human-machine system; S_ is the next virtual lifting state set. The virtual human-machine system generates the next virtual lifting state set based on s and A, while the real human-machine system generates the real next lifting state based on s and a. DNN is the proposed neural network, and Min() is a minimization function that chooses the minimum value from an array. F is the virtual contact force set, while f is the real contact force. The dashed arrow indicates an update.

Figure 8 .
Figure 8. Sequences of generating the virtual lifting state.


Figure 9 .
Figure 9. Platform for data collection.


Figure 11 .
Figure 11. Lifting state adjustment for data collection.

Figure 12 .
Figure 12. Typical examples of the CFPR dataset. The first column shows the positions of the human joints and the lifting points, the second column indicates the weights of the subjects, and the last column depicts the contact forces on the thigh and back of the subjects.