A Novel Gesture Recognition System for Intelligent Interaction with a Nursing-Care Assistant Robot

The expansion of nursing-care assistant robots in smart infrastructure has provided more applications for homecare services, which has raised new demands for smart and natural interaction between humans and robots. This article proposes an innovative hand motion trajectory (HMT) gesture recognition system based on background velocity features. A new wearable wrist-worn camera prototype for collecting gesture video was designed, and a new method for the segmentation of continuous gestures is presented. Meanwhile, a nursing-care assistant robot prototype was designed for assisting the elderly, which is capable of carrying the elderly with omnidirectional motion and grabbing a specified object at home. In order to evaluate the performance of the gesture recognition system, 10 special gestures were defined as the move commands for interaction with the robot, and 1000 HMT gesture samples were obtained from five subjects for leave-one-subject-out (LOSO) cross-validation classification, with an average recognition accuracy of up to 97.34%. Moreover, the performance and practicability of the proposed system were further demonstrated by controlling the omnidirectional movement of the nursing-care assistant robot using the predefined gesture commands.


Introduction
The evolution of the Internet of Things (IoT) network has made intelligent devices more available, which offers more possibilities to facilitate people's lives [1,2]. In aging societies, one focus of the smart infrastructure field is the assistance of the elderly and the disabled by using advanced IoT devices. There is therefore a strong demand for robots to tackle problems that result from the aging population, such as the lack of caregivers for nursing and accompanying the elderly, and this demand is promoting the development of nursing-care assistant robots [3]. Providing more natural and intelligent interaction modes [4,5] with nursing-care assistant robots [6] is one of the frontiers of smart infrastructure development [7]. Hand gesture recognition offers an appropriate way [8] to obtain people's intentions for the control of smart devices, and some progress has been made in previous studies. Elderly people can use smart IoT devices to express their intentions by making corresponding gestures, so as to control various smart devices remotely [9] at home [10]. Therefore, natural human-robot interaction based on hand gestures [11] will become a popular research topic in the near future.
Hand gestures are the most common means of nonverbal communication [12]. Generally speaking, gestures are divided into two types: static gestures and dynamic gestures. The former mainly concerns the fingers' flex angles and poses [13,14], while the latter pays more attention to the hand motion trajectory (HMT) [15]. In previous studies, the sensors used for these two types of gesture recognition fall into two categories: image-based sensors [16] and non-image-based sensors [17]. Most previous studies of static hand gesture recognition have used non-image-based sensors (integrated in wearable gloves and bands [14]), while the studies of HMT gesture recognition are based on fixed image-based sensors (such as using the depth information provided by Kinect [18]). Inertial sensors are commonly applied in the field of non-image-based HMT gesture recognition. Xu et al. and Xie et al. used an accelerometer for HMT gesture recognition, with mean recognition accuracies of 95.6% and 98.9%, respectively [19,20]. However, their methods were based on the acceleration feature, which is susceptible to the sensor's posture. Besides, acceleration is not as intuitive as velocity or displacement when representing a trajectory gesture, which might further limit the system's performance on more diverse and complex gestures.
In recent years, significant efforts have been devoted to developing image-based sensors for HMT gesture recognition. Plouffe et al. used the Kinect sensor to recognize static and dynamic hand gestures in real time, and achieved an average accuracy of 92.4% [21]. Tang et al. proposed an approach for continuous hand trajectory recognition based on the depth data collected by the Kinect2 sensor [22]. In addition, Zhang et al. proposed a novel system for dynamic continuous hand gesture recognition based on a frequency-modulated continuous wave radar sensor [23], which achieved a high recognition rate of 96%. However, the above methods for dynamic gesture recognition have to rely on position-fixed sensors, which limits the spatial flexibility of gesture actions and does not make for a suitable human-robot interaction [24] mode for the elderly. There is also a study on gesture recognition without spatial position restrictions: Kim et al. recovered the full three-dimensional (3D) pose of the user's hand using a wrist-worn sensor [25]. However, the work of Kim et al. is only effective for static finger gesture recognition and cannot achieve HMT gesture recognition.
In this article, we propose a novel HMT gesture recognition system based on a wearable wrist-worn camera, and apply it to the intelligent interaction with a nursing-care assistant robot, as shown in Figure 1. To the best of our knowledge, this is the first study of HMT gesture recognition using a wearable wrist-worn camera based on background velocity analysis, which has no workspace restrictions. In addition, we propose a reliable method to detect the start/end point of effective HMT gestures for continuous gesture segmentation, which is achieved by detecting the fist motion and the hand motion velocity. Furthermore, we construct an algorithm framework that is composed of hand region segmentation, background velocity calculation, continuous gesture segmentation, and gesture type classification. We also designed the prototype of an HMT gesture recognition system and carried out experimental verification and results analysis. To further demonstrate the practicability of the proposed system, we designed a prototype of a nursing-care assistant robot for aged care at home, and defined 10 special gestures to interact with the nursing-care assistant robot.

The remaining part of this article is organized as follows: the architecture of the wearable wrist-worn camera and the new nursing-care assistant robot are described in Section 2. The algorithm framework for the gesture recognition and the gesture principle designed for the navigation of the nursing-care assistant robot are presented in Section 3. In Section 4, the description of the experimental process and the evaluation of the proposed gesture recognition system are conducted, and the application of the interaction with the nursing-care assistant robot is carried out. Finally, Section 5 gives the discussion and conclusion of our work.

System Architecture
The architecture of the HMT gesture recognition system consists of three parts: data acquisition, data processing, and a natural human-robot interface for smart infrastructure. During data acquisition, the subject puts the wearable wrist-worn camera on their right wrist and performs gestures. The camera records the original video data of the background, which reflects the HMT. The data are then transmitted to the host computer through Wi-Fi and processed. During data processing, a series of algorithms are used to recognize the hand region, calculate the velocity of the background by matching Speeded-Up Robust Features (SURF) feature points, and segment the continuous gesture. The velocity data of the effective gesture are thereby obtained, and a classification algorithm is then used to recognize the target gesture. The gesture recognition results correspond to the predefined control commands, so that various smart IoT devices in smart homes can be remotely controlled by gesture. The highlight application of this study is the interaction between human and robot. A nursing-care assistant robot is designed for the assistance of elderly people at home, and the completed prototype supports two working modes, the man-in-seat interaction mode and the remote interaction mode, based on the proposed HMT gesture recognition system.

The Wearable Camera Architecture
For the HMT gesture recognition in this study, we designed a new wearable wrist-worn camera. The hardware structure is shown in Figure 2. We first used SolidWorks to design the lightweight and foldable structure, as shown in Figure 2a. The device is designed to be lightweight, unobtrusive, and portable, as a wearable device for the daily use of the elderly [26]. The base, shell, and camera hood are all manufactured by 3D printing of nylon material (the printed shell is 1 mm thick), which has the advantage of being lightweight (the final prototype weighs 114 g). The foldable structure of the camera partially endows the device with a compact structure. The base and an elastic fabric wristband are bonded by melt adhesive (we use a dispensing gun to heat the melt adhesive and apply the base to the fabric). The foldable structure also guarantees the unobtrusiveness of the device for users. A drawer-type structure is adopted between the shell and the base, which can be easily dismantled and installed and satisfies the user's portability requirements.
The selected image-based sensor is the Raspberry Pi Camera Module, which is a CMOS-type (Complementary Metal Oxide Semiconductor) 175-degree wide-angle camera that is compatible with the Raspberry Pi and has a resolution of five million pixels (2592 × 1944 pixels). As the control unit, the Raspberry Pi Zero W integrates a 1-GHz single-core central processing unit (CPU) and 512 MB of RAM, with additional support for 802.11 b/g/n wireless LAN connectivity. The module is suitable for prototype development and the verification of smart infrastructure under Internet of Things technology due to its small size (65 mm × 30 mm × 5 mm) and wireless transmission compatibility. In addition, compared with other controller modules such as Arduino, it has a higher clock frequency, which is more suitable for fast image processing and acquisition. In order to keep the integrated design small, the camera module and the Raspberry Pi are connected by a flexible flat cable (FFC). According to the power supply requirement and the size limitation of the integrated design, two rechargeable lithium batteries with a rated voltage of 3.7 V are connected in parallel, with a total capacity of 2000 mAh. Meanwhile, in order to meet the power supply demand of the Raspberry Pi and the camera module, a boost converter is used to obtain a 5 V output. The integrated implementation of the wearable wrist-worn camera described above and the prototype are shown in Figure 2b.
In the process of collecting the original video data, the Raspberry Pi first runs a Python script on the Raspbian system (a system based on Debian GNU/Linux for Raspberry Pi hardware development) to establish a TCP (Transmission Control Protocol) server, and then connects with the TCP client on the computer. After the connection succeeds, the Raspberry Pi collects the video data from the camera module and transmits it to the computer at a resolution of 320 × 240 and a frame rate of 12 frames per second (FPS). These parameters, chosen by experimental optimization, reduce packet loss and satisfy the data processing requirements of the subsequent algorithms. The algorithmic details of data processing are introduced in Section 3.
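The server side of this transfer can be sketched as follows. This is a minimal illustration rather than the script used in the study: the length-prefixed framing, the port number, and the names `pack_frame`, `unpack_frames`, and `serve_frames` are all assumptions, since the text only states that the Raspberry Pi acts as the TCP server and streams frames to a client on the computer.

```python
import socket
import struct

# Assumed framing: each encoded frame is preceded by its 4-byte big-endian length.
HEADER = struct.Struct(">I")


def pack_frame(payload: bytes) -> bytes:
    """Prefix one encoded video frame with its length."""
    return HEADER.pack(len(payload)) + payload


def unpack_frames(buffer: bytes):
    """Split a receive buffer into complete frames.

    Returns (frames, leftover), where leftover holds the bytes of a frame
    that has not fully arrived yet.
    """
    frames = []
    offset = 0
    while len(buffer) - offset >= HEADER.size:
        (length,) = HEADER.unpack_from(buffer, offset)
        end = offset + HEADER.size + length
        if end > len(buffer):
            break  # incomplete frame, wait for more data
        frames.append(buffer[offset + HEADER.size:end])
        offset = end
    return frames, buffer[offset:]


def serve_frames(frame_source, host="0.0.0.0", port=8000):
    """Accept one client and send every frame yielded by frame_source.

    The port is a hypothetical choice; frame_source is any iterable of bytes
    (for example, encoded 320 x 240 frames captured at 12 FPS).
    """
    with socket.create_server((host, port)) as server:
        conn, _addr = server.accept()
        with conn:
            for payload in frame_source:
                conn.sendall(pack_frame(payload))
```

On the computer side, the client would append every received chunk to a buffer and call `unpack_frames` on it, keeping the leftover bytes for the next read. Explicit framing is one way to cope with the fact that TCP delivers a byte stream without frame boundaries.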
The exact size of the wearable camera can be found in Figure 2a. The camera module is mounted in the camera hood, which can rotate 0-150 degrees relative to the shell through a rotating shaft. Irregular grooves are designed at the bottom of the base so that the melt adhesive spreads uniformly. The angle θ is set to 80 degrees in the working state of this study, and the hood can be easily folded in the non-working state. The test results showed that the working current is 0.21 A in the video transmission state, and 0.11 A in the boot state without data transmission. According to the battery capacity and actual use tests, the device can work continuously for more than four hours, which fully meets the requirements of use in household conditions [27]. Furthermore, the system power consumption can be further reduced by monitoring the motion velocity and triggering the sleep mode when it falls below a threshold.
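As a rough plausibility check of the four-hour figure, the runtime can be estimated from the stated capacity and working current. The text does not say whether the 0.21 A was measured at the battery or at the 5 V output of the boost converter, so the sketch below computes both cases; the 85% converter efficiency is an assumed value, not one reported in the study.

```python
CAPACITY_AH = 2.0          # 2000 mAh total capacity (from the text)
STREAMING_CURRENT_A = 0.21  # working current while transmitting video (from the text)


def runtime_hours(capacity_ah, battery_current_a):
    """Ideal runtime, ignoring ageing and the usable depth of discharge."""
    return capacity_ah / battery_current_a


def battery_side_current(load_current_a, v_out=5.0, v_batt=3.7, efficiency=0.85):
    """Battery current if the load current is drawn at the boost converter output.

    efficiency is an assumed converter efficiency.
    """
    return load_current_a * v_out / (v_batt * efficiency)


# Case 1: the 0.21 A is the current drawn directly from the battery.
hours_if_battery_side = runtime_hours(CAPACITY_AH, STREAMING_CURRENT_A)

# Case 2: the 0.21 A is drawn at the 5 V output.
hours_if_output_side = runtime_hours(
    CAPACITY_AH, battery_side_current(STREAMING_CURRENT_A)
)

print(round(hours_if_battery_side, 1), round(hours_if_output_side, 1))
```

Under either reading the idealized estimate stays above four hours, which is consistent with the reported continuous working time once real-world losses are allowed for.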

The Nursing-Care Assistant Robot
With the advent of an aging society, many robots, such as mental commitment robots dedicated to mental healing [28] and smart wheelchairs [29], have been proposed to help on-site caregivers. Mukai et al. developed an assistant robot, RIBA, to lift a human in its arms [30]. The above nursing robots are mainly oriented toward hospitals and clinics. In this study, we designed an integrated nursing-care assistant robot for aged care at home that can not only carry people like a wheelchair, but also grasp a target object. As shown in Figure 3, the mechanical structure of the nursing robot consists of four parts: the omnidirectional mobile chassis, the lift adjusting mechanism, the dual manipulator above the chassis, and the seat part at the fore. The YuMi collaborative robot produced by Asea Brown Boveri Ltd. (ABB) was chosen as the dual manipulator. The frames of the other structures were made of aluminum profiles. All of the mechanical structures of this cooperative robot were designed in SolidWorks. Meanwhile, the corresponding structural stability was checked to ensure reliability and safety in the household environment. Owing to the length limit of the article, the details of the check are not presented here; they are provided in the corresponding supplementary materials, Check S1. The dimensions of the robot are shown in Figure 3, where the maximum longitudinal lift (550 mm) is indicated.
The user can sit on the seat part in front of the nursing cooperative robot, and the dual manipulator performs the corresponding nursing actions behind the user, such as helping the user get up from the seat, fetching the target object, and so on. In detail, the robot can move its dual arms to a suitable position to provide supporting points, like the arms of a chair, for the elderly to get up from the seat. An electric lifting adjusting mechanism is designed between the dual manipulator and the mobile chassis. The height of the dual manipulator relative to the user can be adjusted to adapt to users with different body shapes and to ensure enough space for the different movements of the manipulator. Similarly, the sliding rail mechanism between the seat part and the mobile chassis can adjust the relative distance between the user and the manipulator, which ensures the user's comfort and a wide applicability to different people.
As shown in Figure 3, the communication protocol of the nursing-care assistant robot mainly concerns the communication between the omnidirectional mobile chassis and the dual manipulator. In this design, the STM32F405RGT6 provided by STMicroelectronics is selected as the microcontroller unit (MCU) in the main control board. The motion control of the dual manipulator is based on the IRC5 controller of ABB. Each manipulator consists of a mechanical arm and a gripper, and communicates independently with the main control board via Ethernet. The omnidirectional mobile chassis is driven by four servo motors, which are controlled by the corresponding servo motor controller. Controller Area Network (CAN-bus) communication is used between the mobile chassis servo controller and the MCU. The nursing-care assistant robot that we propose aims to assist and nurse the elderly at home. Two working modes are designed to better meet these nursing needs: the man-in-seat interaction mode and the remote interaction mode. The man-in-seat interaction mode refers to near-field control. When the user sits on the nursing-care assistant robot's seat part, this mode assists the elderly in moving to the designated destination and in taking necessary objects such as medicines. The remote interaction mode refers to the condition in which the user interacts with the robot remotely, and it aims to assist the elderly in picking up distant objects.

Algorithm for HMT Gesture Recognition
After obtaining the original video data, MATLAB is used to decode the data for the subsequent processing in the computer. In order to realize continuous HMT gesture recognition, the main idea of our scheme is as follows. Firstly, the hand in the middle of the captured video footage is almost static, and the feature that can reflect the trajectory of hand motion is the background variation around the hand. We select the velocity of the background as the characteristic parameter, and indirectly reflect the actual hand motion trajectory based on the motion velocity of the background in the video. Secondly, in order to distinguish the effective gestures from ineffective gestures in the process of continuous hand motion, we use an innovative method to segment gestures by detecting the motion of fist bobbing and the changes of hand motion velocity. The motion of fist bobbing is defined as flexing and then extending the wrist. Since the size of the fist varies from user to user, the angle of wrist flex varies from person to person. What the user needs to ensure is that the fist disappears from the video frame when the wrist is flexed. This is shown more clearly by the demonstration video in the supplementary materials, Video S1. After obtaining the effective gesture data, we classify the gestures by cross-validation, and obtain the gesture recognition results. The algorithm flow diagram is shown in Figure 4.
In order to implement the above algorithm framework, we divide the continuous HMT gesture recognition algorithm into the following four main steps: (1) hand region segmentation, which obtains the height feature of the hand; (2) background velocity calculation, which is obtained by background feature point matching; (3) continuous gesture segmentation, which is achieved by start-end signal detection; and (4) data normalization and cross-validation classification, which yields the recognition accuracy used to evaluate the performance of the system. The algorithm is discussed in detail in the subsections of this section.


Hand Region Segmentation
In the process of continuous gesture recognition, the motion of fist bobbing is judged by the height change of the hand region. In addition, the foreground area of the hand should be removed, and only the background part should be retained for subsequent processing when calculating the velocity of the background feature. Therefore, the recognition and segmentation of the hand region is an important part of the algorithm framework. In this subsection, we describe the algorithms for hand region segmentation and hand height calculation.
In order to improve the efficiency of the algorithm and the recognition speed of the whole system, we reduced the original video frame from 240 × 320 pixels to 72 × 96 pixels before the hand region segmentation. Since the camera module used in this study is a 175-degree wide-angle camera, the original image has a wide-angle distortion effect, as shown in Figure 5a. Generally, the wide-angle distortion can be corrected by a corresponding correction algorithm [31]. However, the hand region is an important parameter in continuous gesture segmentation and background velocity calculation in this study. We therefore did not use a general wide-angle distortion correction algorithm; instead, we cropped away the seriously distorted image corners, which cuts the image down to 54 × 73 pixels, as shown in Figure 5b. This method suits this study well: the serious wide-angle distortion is eliminated conveniently, and the characteristics of the hand region in the middle of the image are highlighted. At the same time, in order to improve the performance of the subsequent processing algorithm, the red-green-blue (RGB) image is transformed into the L*a*b* color space. The transformation result is shown in Figure 5c. Then, we use the simple linear iterative clustering (SLIC) [32] algorithm, which generates superpixels to further divide the pixel meshes and reduce the computation cost of hand region segmentation. The result of the SLIC is shown in Figure 5d.
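The size reduction and cropping steps can be sketched as follows. This is a minimal, dependency-free illustration, not the MATLAB code used in the study: nearest-neighbour resampling and a centered crop window are assumptions, since the text only gives the frame sizes before and after each step, and the function names are hypothetical.

```python
def downsample_nearest(frame, out_h, out_w):
    """Nearest-neighbour resize of a frame stored as rows of pixels."""
    in_h, in_w = len(frame), len(frame[0])
    return [
        [frame[(r * in_h) // out_h][(c * in_w) // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


def center_crop(frame, out_h, out_w):
    """Keep the central out_h x out_w window, discarding the distorted border."""
    top = (len(frame) - out_h) // 2
    left = (len(frame[0]) - out_w) // 2
    return [row[left:left + out_w] for row in frame[top:top + out_h]]


def preprocess(frame):
    """240 x 320 frame -> 72 x 96 -> 54 x 73, the sizes given in the text."""
    small = downsample_nearest(frame, 72, 96)
    return center_crop(small, 54, 73)


# Synthetic frame whose pixel value records its own (row, column) position.
frame = [[(r, c) for c in range(320)] for r in range(240)]
cropped = preprocess(frame)
print(len(cropped), len(cropped[0]))  # 54 73
```

The color space conversion and superpixel steps are omitted here; in a Python setting, functions such as `skimage.color.rgb2lab` and `skimage.segmentation.slic` from scikit-image would be natural candidates for them.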
The lazy snapping algorithm [33], which is an interactive algorithm for image segmentation, is used to segment the hand region. The foreground and background are segmented based on the seed pixels specified by the user. In the original video images, the distinction between the foreground and background is obvious: the middle part of the image is the hand region, while the rest is the environmental background. Therefore, the video image sequence is segmented by giving the initial foreground and background seed pixels. As shown in Figure 5e, the green area is the seed pixels of the foreground Sfore(0), and the blue area is the seed pixels of the background Sback(0).
As shown in Figure 5f, the hand region segmentation result Rhand(i) of the current frame can be obtained by giving the Sfore(i−1) and Sback(i−1) of the previous frame.The foreground seed pixels Sfore(i) for the next gesture segmentation are obtained by combining Sfore(0) and the erosion of the segmentation result of the hand region of the current frame Rhand(i).The background seed pixels In order to implement the above algorithm framework, we divide the continuous HMT gesture recognition algorithm framework into the following four main steps: (1) hand region segmentation, which can obtain the height feature of the hand; (2) background velocity calculation, which can be obtained by the background feature point matching; (3) continuous gesture segmentation, which is achieved by the start-end signal detection; and (4) data normalization and cross-validation classification, which can obtain the recognition accuracy to evaluate the performance of the system.The algorithm is discussed in detail in the subsections of this section.

Hand Region Segmentation
In the process of continuous gesture recognition, the motion of fist bobbing should be judged by the height change of the hand region.In addition, the foreground area of the hand should be removed, and only the background part should be reserved for subsequent processing when calculating the velocity of the background feature.Therefore, the recognition and segmentation of the hand region is the important part of the algorithm framework.In this subsection, we will describe the algorithms for hand region segmentation and hand height calculation.
In order to improve the efficiency of the algorithm and the recognition velocity of the whole system, we reduced the pixels of the original video frame to 72 × 96 pixels from 240 × 320 pixels before the hand region segmentation.Since the camera module used in this study is a 175-degree wide-angle camera, the original image has a wide-angle distortion effect, as shown in Figure 5a.Generally, the wide-angle distortion can be corrected by the corresponding correction algorithm [31].However, the hand region is an important parameter in continuous gesture segmentation and background velocity calculation in this study.Particularly, we didn't use the general wide-angle distortion correction algorithm; instead, we used the pixel reduction method to remove the serious distortion image corner, which results in the pixels being cut down to 54 × 73 pixels, as shown in Figure 5b.This method is very suitable for this study.The serious wide-angle distortion can be eliminated conveniently, and the characteristics of the hand region in the middle of the image are highlighted.At the same time, in order to improve the performance of the subsequent processing algorithm, the red-green-blue (RGB) image is transformed into a L*a*b* color space.The transformation result is shown in Figure 5c.Then, we use the simple linear iterative cluster (SLIC) [32] algorithm, which can generate superpixels to further divide the pixel meshes and reduce the computation cost of hand region segmentation.The result of the SLIC is shown in Figure 5d.

Background Velocity Calculation
In this study, background velocity is the dominant feature reflecting the trajectory of the hand.How to obtain the background velocity in the video image is a problem to be solved after the foreground of the hand region and the background of the environment have been segmented.The acquisition of velocity depends on the selection of reference points.In this study, the Speeded-Up Robust Features (SURF) [34] keypoints are chosen to be the reference points, and the background velocity is obtained by the average displacement of the matched SURF keypoints between two adjacent image frames.
This part of the specific algorithm flow is introduced below.Firstly, the SURF keypoints of two adjacent image frames are extracted, and the keypoints in the foreground are removed according to the segmentation results of the hand region, which results that only the keypoints in the background are retained.As shown in Figure 6a,b, the green plus signs and the red dots are the keypoints extracted from the two adjacent image frames, respectively.Moreover, the corresponding matching distance is calculated by using the method Lowe et al. proposed in [35].The keypoints obtained from the two adjacent image frames are matched according to the minimum distance, and the matching keypoints whose distance exceeds the threshold Lowe et al. proposed in [35] will be removed.
The lazy snapping algorithm [33], an interactive algorithm for image segmentation, is used to segment the hand region. The foreground and background are segmented based on seed pixels specified by the user. In the original video images, the distinction between the foreground and background is clear: the middle part of the image is the hand region, while the rest is the environmental background. Therefore, the video image sequence is segmented by giving the initial foreground and background seed pixels. As shown in Figure 5e, the green area marks the foreground seed pixels S_fore(0), and the blue area marks the background seed pixels S_back(0).
As shown in Figure 5f, the hand region segmentation result R_hand(i) of the current frame is obtained from the seed pixels S_fore(i−1) and S_back(i−1) of the previous frame. The foreground seed pixels S_fore(i) for the next segmentation are obtained by combining S_fore(0) with the erosion of the current hand region R_hand(i). The background seed pixels S_back(i) for the next segmentation are obtained by subtracting the dilation (expansion) of R_hand(i) from S_back(0). In this way, segmentation results of the hand region are obtained for the whole video sequence. The segmentation result of each frame serves the background velocity calculation, and the hand height H(i) of the current frame is calculated from the highest pixel of the hand region for use in the subsequent continuous gesture segmentation algorithm.
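The seed propagation described above can be sketched in a few lines. This is a minimal illustration, not the authors' MATLAB implementation: pixel sets stand in for image masks, a 4-neighbour structuring element is assumed for the erosion and dilation, and the helper names (`erode`, `dilate`, `update_seeds`, `hand_height`) are hypothetical. Measuring H(i) upward from the bottom image row is also an assumption.

```python
def neighbors4(pixel):
    """Return the four edge-adjacent neighbours of a (row, col) pixel."""
    r, c = pixel
    return [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]


def erode(region):
    """Keep only pixels whose four neighbours all belong to the region."""
    return {p for p in region if all(q in region for q in neighbors4(p))}


def dilate(region):
    """Grow the region by one pixel in the four axis directions."""
    grown = set(region)
    for p in region:
        grown.update(neighbors4(p))
    return grown


def update_seeds(s_fore0, s_back0, r_hand):
    """Seeds for the next frame from the initial seeds and the current hand region."""
    s_fore = s_fore0 | erode(r_hand)   # S_fore(i) = S_fore(0) combined with eroded R_hand(i)
    s_back = s_back0 - dilate(r_hand)  # S_back(i) = S_back(0) minus dilated R_hand(i)
    return s_fore, s_back


def hand_height(r_hand, image_rows):
    """Hand height H(i): distance of the highest hand pixel from the bottom row (assumed)."""
    top_row = min(r for r, _ in r_hand)
    return image_rows - top_row
```

Eroding the foreground and dilating the hand region before subtracting it from the background leaves a band of unlabeled pixels around the hand boundary, so small frame-to-frame motion of the hand does not put a seed on the wrong side.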

Background Velocity Calculation
In this study, background velocity is the dominant feature reflecting the trajectory of the hand. Once the foreground (the hand region) and the background (the environment) have been segmented, the remaining problem is how to obtain the background velocity from the video images. The acquisition of velocity depends on the selection of reference points. Here, Speeded-Up Robust Features (SURF) [34] keypoints are chosen as the reference points, and the background velocity is obtained from the average displacement of the matched SURF keypoints between two adjacent image frames.
The specific algorithm flow is as follows. First, the SURF keypoints of two adjacent image frames are extracted, and the keypoints in the foreground are removed according to the segmentation results of the hand region, so that only the keypoints in the background are retained. As shown in Figure 6a,b, the green plus signs and the red dots are the keypoints extracted from the two adjacent image frames, respectively. Next, the matching distance is calculated using the method proposed by Lowe et al. [35]. The keypoints from the two adjacent image frames are matched according to the minimum distance, and matches whose distance exceeds the threshold proposed in [35] are removed.
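The filtering and matching steps can be illustrated with a small sketch. It assumes that keypoints are already available as (position, descriptor) pairs, that descriptors are compared by Euclidean distance, and that an absolute distance threshold `max_dist` is applied. The function name, the data layout, and the threshold value are hypothetical; the exact distance measure and threshold of [35] are not reproduced here.

```python
import math


def match_background_keypoints(kps_prev, kps_curr, hand_prev, hand_curr, max_dist):
    """Match background keypoints of two adjacent frames by minimum descriptor distance.

    kps_*  : list of ((row, col), descriptor) pairs
    hand_* : set of (row, col) pixels of the segmented hand region
    Returns a list of (position_prev, position_curr) pairs.
    """
    def in_background(kp, hand):
        row, col = kp[0]
        return (round(row), round(col)) not in hand

    # Discard keypoints that fall inside the hand region (foreground).
    bg_prev = [kp for kp in kps_prev if in_background(kp, hand_prev)]
    bg_curr = [kp for kp in kps_curr if in_background(kp, hand_curr)]

    matches = []
    for pos_a, desc_a in bg_prev:
        best = None
        for pos_b, desc_b in bg_curr:
            d = math.dist(desc_a, desc_b)
            if best is None or d < best[0]:
                best = (d, pos_b)
        # Keep the nearest neighbour only if it is close enough (assumed absolute threshold).
        if best is not None and best[0] <= max_dist:
            matches.append((pos_a, best[1]))
    return matches
```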

As shown in Figure 6c, after obtaining the matching keypoints between two adjacent image frames, the velocity is calculated from the displacement of each pair of matching keypoints over the frame interval, which is 1/12 s according to the above data configuration. The velocity vector of each match is shown by a yellow arrow in Figure 6c. To quantify the background velocity from the velocity vectors of the matching points, all of the velocity vectors are represented in a coordinate system, as shown in Figure 6d. The largest 1/5 and the smallest 1/5 of the values of the vector components V_x and V_y are deemed invalid, since they may be affected by matching at the edge of the image. The invalid values are shown by pink circle markers in Figure 6d, while the remaining valid values are shown by blue dot markers. The final velocity of the current frame, represented by the red plus sign marker in Figure 6d, is obtained by averaging all of the valid velocity values.
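The averaging step amounts to a trimmed mean of the per-match displacements. The sketch below assumes that V_x and V_y are trimmed independently, that velocities are expressed in pixels per frame (the unit used for the threshold later in the text), and that V_y follows image rows; the function names are hypothetical.

```python
def trimmed_mean(values, trim_fraction=0.2):
    """Mean of the values after dropping the largest and smallest trim_fraction of them."""
    ordered = sorted(values)
    k = int(len(ordered) * trim_fraction)  # number of values dropped at each end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


def background_velocity(matches):
    """Background velocity (v_x, v_y) in pixels/frame from matched keypoint positions.

    matches: list of ((row_prev, col_prev), (row_curr, col_curr)) pairs.
    """
    if not matches:
        return (0.0, 0.0)
    vx = [curr[1] - prev[1] for prev, curr in matches]  # column displacement
    vy = [curr[0] - prev[0] for prev, curr in matches]  # row displacement
    return (trimmed_mean(vx), trimmed_mean(vy))
```

Dividing the result by the frame interval of 1/12 s would give the velocity in pixels per second instead.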

Continuous Gestures Segmentation
One focus of this study is the recognition of a series of continuous HMT gestures. To recognize and classify the gestures, it is necessary to distinguish between the effective gestures and the ineffective gestures in the process of continuous hand motion. Two special rules are defined to mark the start point and the end point of an effective gesture: the user should perform a fist bobbing before starting an effective gesture and maintain a motionless state for more than half a second after performing it. During algorithm debugging, we first obtained a total of 1000 gesture samples from five subjects and saved them as the dataset for the gesture segmentation tests. The acquisition of the 1000-gesture dataset is described in detail in Section 4. The specific algorithm flow is as follows.
Because the test subjects' movements are not perfectly steady, the nominally motionless hand is not completely static but quivers slightly. To define the nominal motionless state, we set a maximum velocity threshold V_t. If the mean velocity over the last 0.5 s before a frame is smaller than V_t, the hand is considered to be in a motionless state in this frame. Half a second is enough for this judgment, in line with the rule for the state after an effective gesture. To find the optimal V_t for gesture segmentation, we conducted segmentation accuracy tests on the 1000-gesture dataset described above with five V_t values from one pixel/image to five pixels/image. The experimental results are shown in Table 1. The best segmentation accuracy is obtained when V_t is three pixels/image; thus, the threshold is set to three pixels/image to define the motionless state.
Figure 7 presents the algorithm flow for segmenting a triangular trajectory as an example. From the results of the hand region segmentation, we obtain the variation curve of the height H(i) of the hand region, which reflects the motion of the fist bobbing. The velocity curve, which reflects the hand motion trajectory, is obtained by the SURF keypoint matching. The start rule of an effective gesture is defined as follows: when a descending edge followed by a rising edge appears in the H(i) curve, as shown in interval S1 in Figure 7, a fist bobbing has occurred. This indicates that the subsequent green interval-1 may contain the start point of an effective gesture. If, within interval-1, the velocity curve changes so much that the hand can no longer be considered motionless, the effective gesture starts, as shown by the pink interval-2. The intervals S3, S4, and S5 correspond to the three straight-line segments of the complete triangular trajectory gesture and together make up interval-2. A motionless interval, S2, is allowed between interval S1 and interval S3; it starts after the fist bobbing and continues until the effective gesture begins.
We set the time-window width of interval-1 to two seconds, a value optimized experimentally. The start signals of the 1000-gesture dataset were tested with seven time-window widths of interval-1 from 1000 ms to 4000 ms in steps of 500 ms, as shown in Table 2, and the best segmentation accuracy is obtained when the time-window width is 2000 ms (2 s). If the motionless interval S2 is longer than the time-window width of interval-1, the effective gesture interval-2 will not appear. That is to say, gestures that start more than two seconds after the fist bobbing are considered invalid.
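The motionless-state test can be sketched as a rolling mean over the last 0.5 s of the velocity curve. The sketch assumes 12 frames per second (from the 1/12 s frame interval), uses the magnitude of (V_x, V_y) as the per-frame velocity, and marks frames without a full 0.5 s of history as not motionless; these choices and the function name are assumptions.

```python
import math

FPS = 12  # the frame interval stated in the text is 1/12 s


def motionless_flags(velocities, v_t=3.0, window_s=0.5, fps=FPS):
    """Flag each frame as motionless when the mean speed of the last window_s is below v_t.

    velocities: list of (v_x, v_y) in pixels/frame; returns a list of bool, one per frame.
    """
    n = max(1, round(window_s * fps))  # frames in the window (6 at 12 frames/s)
    speeds = [math.hypot(vx, vy) for vx, vy in velocities]
    flags = []
    for i in range(len(speeds)):
        if i + 1 < n:
            flags.append(False)  # not enough history yet (assumed behaviour)
        else:
            window = speeds[i + 1 - n:i + 1]
            flags.append(sum(window) / n < v_t)
    return flags
```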

Table 2. Segmentation accuracy for different time-window widths of interval-1.
Interval-1 (ms)            1000  1500  2000  2500  3000  3500  4000
Segmentation accuracy (%)  90.4  99.0  99.2  98.9  98.5  96.5  58.2

The above rule defines the start point of the effective gesture. The following rule determines the end point: when an effective gesture is completed, a motionless state lasting longer than a certain period (more than half a second) is taken as the end signal, as shown in interval S6. Considering that individual differences may affect the duration of the motionless state, we calculated the gesture segmentation accuracies of the 1000-gesture dataset under seven time-window widths of S6 (100 ms, 500 ms, 900 ms, 1000 ms, 1100 ms, 1500 ms, and 1900 ms), as shown in Table 3. The best segmentation accuracy is obtained when the time-window width of S6 is 1000 ms (one second); hence, the time-window width of S6 is set to one second for the segmentation of continuous gestures. If the motionless interval is shorter than one second, it is not judged as an end signal of the gesture. With the start point and the end point of an effective gesture defined, the continuous gestures can be segmented to obtain the complete velocity curve of each single gesture.
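The end rule can be sketched as a search for the first sufficiently long motionless run. The sketch assumes a per-frame motionless flag is already available, takes one second as 12 frames, and reports the first frame of the run as the end of the gesture; the function name and this choice of end index are assumptions.

```python
def find_gesture_end(motionless, start, hold_frames=12):
    """Return the first frame of the first motionless run lasting at least hold_frames.

    motionless: list of bool, one per frame; the search begins at index start.
    Returns None when no run is long enough.
    """
    run_start = None
    for i in range(start, len(motionless)):
        if motionless[i]:
            if run_start is None:
                run_start = i  # a new motionless run begins here
            if i - run_start + 1 >= hold_frames:
                return run_start
        else:
            run_start = None  # run interrupted by motion
    return None
```

Short pauses inside a gesture reset the run, so they do not end the gesture unless they reach the full window width.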

Classification
After obtaining the data for a single gesture, it is necessary to identify the type of the gesture. Since different gestures have different durations and movement speeds, the velocity data of each separated gesture are normalized before classification by linear interpolation and resampling to 30 sampling points. After normalization, a dynamic time warping (DTW) [36] algorithm is used to measure the similarity between different gestures. Three different methods of cross-validation are then used to evaluate the classification, and the k-Nearest Neighbor (kNN) algorithm [37] classifies the input gestures and determines the gesture type. The above algorithms are all implemented in MATLAB.
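The classification chain (resampling to 30 points, DTW distance, nearest-neighbour decision) can be sketched as follows. This is an illustrative Python version rather than the authors' MATLAB code; it assumes the Euclidean distance as the local cost inside DTW and k = 1 for the kNN step, since neither value is stated in the text, and the function names are hypothetical.

```python
import math


def resample(seq, n=30):
    """Linearly interpolate a sequence of (v_x, v_y) samples to n points (n >= 2)."""
    if len(seq) == 1:
        return [tuple(seq[0])] * n
    out = []
    for k in range(n):
        t = k * (len(seq) - 1) / (n - 1)  # fractional index into the original sequence
        i = min(int(t), len(seq) - 2)
        f = t - i
        out.append(tuple(a + f * (b - a) for a, b in zip(seq[i], seq[i + 1])))
    return out


def dtw_distance(a, b):
    """Dynamic time warping distance between two sequences of 2-D points."""
    inf = float("inf")
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d = math.dist(a[i - 1], b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[len(a)][len(b)]


def classify_1nn(sample, templates):
    """Return the label of the template closest to the sample under DTW.

    templates: list of (label, sequence) pairs.
    """
    s = resample(sample)
    return min(templates, key=lambda t: dtw_distance(s, resample(t[1])))[0]
```

Resampling every gesture to the same length before DTW keeps the cost matrix at a fixed 30 × 30 size, so the comparison time does not depend on how long the gesture took to perform.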

Principle for Navigation of a Nursing-Care Assistant Robot
According to the hardware of the nursing-care assistant robot described above, the chassis can move in all directions through different movement combinations of its four Mecanum wheels. We defined six kinds of movements of the moving chassis: forward, backward, left, right, clockwise rotation, and counterclockwise rotation. These six movements accomplish the basic actions of omnidirectional movement and the flexible navigation of the robot at home.
To interact with the nursing-care assistant robot through HMT gestures, a one-to-one correspondence between gesture commands and the movements of the robot needs to be defined. We defined 10 HMT gestures, as shown in Figure 8, for the interaction between the human and the nursing-care assistant robot. Six of them (gestures 1-6) are used for the six kinds of chassis movements mentioned above. In addition, two gestures (gestures 7 and 8) are defined for acceleration and deceleration, respectively, and gesture 10 is defined for stop. According to the introduction of the control system of the nursing-care assistant robot, the control of the mobile chassis and that of the dual manipulator are mutually independent. To realize the overall control of the nursing-care assistant robot, we defined gesture 9 for switching the control interface to the dual manipulator.
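The command scheme can be illustrated as a small dispatcher. The pairing of gestures 1-6 with specific chassis movements is given in Figure 8 and is not reproduced in the text, so the order used below is an assumption, as are the class name, the integer speed levels, and the pass-through behaviour in manipulator mode.

```python
# Hypothetical pairing: the text only states that gestures 1-6 cover these six movements.
CHASSIS_MOVES = {
    1: "forward",
    2: "backward",
    3: "left",
    4: "right",
    5: "rotate_clockwise",
    6: "rotate_counterclockwise",
}


class GestureController:
    """Translate recognized gesture numbers (1-10) into robot commands."""

    def __init__(self, max_level=3):
        self.mode = "chassis"       # gesture 9 toggles between chassis and manipulator
        self.level = 1              # assumed discrete speed level
        self.max_level = max_level

    def handle(self, gesture):
        if gesture == 9:
            self.mode = "manipulator" if self.mode == "chassis" else "chassis"
            return ("switch", self.mode)
        if self.mode == "manipulator":
            return ("manipulator", gesture)  # manipulator commands are not specified here
        if gesture in CHASSIS_MOVES:
            return ("move", CHASSIS_MOVES[gesture], self.level)
        if gesture == 7:
            self.level = min(self.max_level, self.level + 1)
            return ("speed", self.level)
        if gesture == 8:
            self.level = max(1, self.level - 1)
            return ("speed", self.level)
        if gesture == 10:
            return ("stop",)
        raise ValueError(f"unknown gesture: {gesture}")
```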

Dataset
Before the experiment, approval was obtained from the Ethics Committee of the 117th Hospital of People's Liberation Army of China (PLA), and all of the subjects signed a consent form. In this study, the performance of the HMT gesture recognition system was tested and verified with five subjects in two postures (sitting and standing); then, we applied the system to the nursing-care assistant robot by controlling the robot's movement with the 10 predefined gestures using the wrist-worn camera.
To verify the performance of the gesture recognition system, five healthy, able-bodied subjects (four males and one female, aged 20-30) took part in the experiment. The subjects performed the 10 predefined gestures continuously in one round as a gesture combination. Each subject repeated the gesture combination 10 times in the sitting condition and 10 times in the standing condition, so 200 gesture samples were collected from each subject. In total, 500 gesture samples were obtained in each condition, for 1000 gesture samples overall. During the experiment, the participant put the wearable wrist-worn camera on their right wrist, and the camera was positioned to ensure a suitable proportion of the hand area in the image. The participant then performed the 10 predefined HMT gestures one by one following the predefined rules: first the fist bobbing, then the single effective HMT gesture, and finally a motionless state held for more than half a second. The test subjects were instructed in these rules and performed the gesture combination once as practice, which was not included in the analyzed samples. After all of the HMT gestures were finished, the original video data were collected and processed to extract the velocity data of each effective single HMT gesture. Then, a DTW algorithm was used to measure the similarity of the different gestures, and three different methods of cross-validation were used to classify the gestures. The recognition accuracies under the different cross-validation methods were obtained for the performance verification.
After the experiments for performance verification, we applied the HMT gesture recognition system to interact with the nursing-care assistant robot. Similarly, the participant put on the wearable camera to control the nursing-care assistant robot in its man-in-seat interaction mode and remote interaction mode, respectively, using the 10 predefined gestures. Unlike in the verification experiments, HMT gesture recognition in the application runs in real time and is based on a specified training set chosen with a representative cross-validation method. All data processing and algorithm verification in this experiment were carried out in MATLAB R2018a on a computer with 8 GB of memory and a 2.6 GHz CPU.
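For the leave-one-subject-out (LOSO) protocol, the 1000 samples are split by subject so that each fold tests on one subject's 200 gestures and trains on the other 800. The sketch below assumes samples are stored as (subject, label, data) tuples and that the reported figure is the mean of the per-subject accuracies; the function names and the `classify` callback are hypothetical.

```python
def loso_splits(samples):
    """Yield (held_out_subject, train, test) for every subject in turn.

    samples: list of (subject_id, label, data) tuples.
    """
    subjects = sorted({s[0] for s in samples})
    for held_out in subjects:
        train = [s for s in samples if s[0] != held_out]
        test = [s for s in samples if s[0] == held_out]
        yield held_out, train, test


def loso_accuracy(samples, classify):
    """Mean per-subject accuracy; classify(data, train) must return a predicted label."""
    accuracies = []
    for _, train, test in loso_splits(samples):
        correct = sum(classify(data, train) == label for _, label, data in test)
        accuracies.append(correct / len(test))
    return sum(accuracies) / len(accuracies)
```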

Results of the Continuous Gesture Segmentation
According to the segmentation algorithm of continuous gestures defined in Section 3, the 10 consecutive gestures of each group were segmented to extract the effective single gestures. In the experiment, 94 of the 100 sets of 10 consecutive gestures from the five subjects were correctly segmented into the 10 corresponding gestures, with start and end points in line with the expectations of the algorithm. Among the six groups with incorrect segmentation, three were caused by missing start or end points due to non-standard gestures (including a too-small fist bobbing before the single effective gesture, and the absence of a motionless state after finishing it). The remaining three groups of segmentation errors were caused by a fist-bobbing motion during the effective HMT gesture, which led to an extra, wrong segmentation. Overall, 992 of the 1000 gestures were segmented correctly, a segmentation accuracy of 99.20%. The effective gestures that were not correctly segmented were processed manually by specifying the start and end points.

Results of the Background Velocity
Based on the above results of continuous gesture segmentation and the calculation of background velocity from the matching of the SURF keypoints, a group of the velocity curves of the 10 predefined gestures in this experiment is shown in Figure 9. The velocity curves of V_x and V_y are mapped from the background velocity to the gesture trajectory velocity. As seen from the figure, the simple gestures 1 to 4 take less time to make, with an effective gesture duration of about one second, whereas the complex gestures 5 to 10 take longer, at least twice the simple duration and up to about three seconds. To reduce the influence of the different gesture durations on classification, all of the hand gesture data were normalized by the method mentioned in Section 3.

Results of the HMT Gesture Recognition
After obtaining the effective velocity data of the corresponding gestures, the 1000 gestures were classified with three different cross-validation methods, based on the distance between the velocity data of the gestures calculated by a DTW algorithm. To meet the data requirements of the DTW algorithm, the original velocity data were normalized and resampled to 30 sampling points by linear interpolation before classification.
Three cross-validation methods were used. Leave-one-subject-out (LOSO) cross-validation selects the sample data of one subject as the test set and the data of the remaining subjects as the training set; the other two methods are referred to as LPO and LOOWS.
The accuracies of gesture recognition using the different cross-validation methods are shown in Table 4. The mean recognition accuracy with LOSO cross-validation was up to 97.34%, which verifies the system's performance on an unknown subject. The LPO cross-validation achieved a mean accuracy of 96.55%, which is lower than LOSO and reflects the variations between different subjects and the diversity of our data. The LOOWS method achieved a mean accuracy of 98%, which is higher than LOSO and indicates that users can easily add their characteristic gestures to the system and have them recognized efficiently. As shown in Table 4, the HMT gesture recognition accuracies in the standing condition are slightly higher than those in the sitting condition. Besides the random influence of the external environment, the features of the gesture in the standing condition are more evident than those in the sitting condition, since the action space in the standing condition is much larger than that in the sitting condition.
To cater for the interactive application of the two interaction modes with the nursing-care assistant robot, the gesture velocity data collected in the standing and sitting conditions are analyzed separately based on LOSO cross-validation. The confusion matrices of the recognition results under the two conditions are shown in Figure 10. In addition, the precision, recall, and F-measure of the classification of the 10 gestures are calculated from the confusion matrices to further verify the performance of the gesture recognition, as shown in Figure 11. The mean value of the F-measure under the sitting condition is up to 0.984; under the standing condition, it is 0.963. The higher F-measure value in sitting conditions indicates better classification performance than in standing conditions.
The overall recognition results of the sitting and standing conditions based on LOSO cross-validation are shown clearly by the confusion matrices in Figure 10. Among the confused gestures, we analyzed the causes of the most frequent misclassifications. Gesture 7 was misclassified as gesture 4 in 4.67% of cases under the sitting condition and 3.11% under the standing condition, because the horizontal motion was not detected. The missed detection was caused by an incorrect gesture end due to unexpected pauses during a single gesture. For similar reasons, gesture 8 was misclassified as gesture 4 in 4.67% of cases, and gesture 10 was misrecognized as gesture 2 in 2.67% of cases. In another situation, gesture 8 was misclassified as gesture 2 in 2.67% of cases, which was caused by the indistinct longitudinal motion in gesture 8. For example, some subjects started the gesture immediately after an incomplete fist bobbing, resulting in insufficient longitudinal motion.
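The per-gesture precision, recall, and F-measure can be derived from a confusion matrix as in the following sketch. It assumes rows index the true gesture and columns the predicted gesture, and it uses the balanced F-measure (F1); the function name is hypothetical and the example matrix in the usage note is illustrative, not data from this study.

```python
def per_class_metrics(confusion):
    """Return a list of (precision, recall, f_measure), one tuple per class.

    confusion[i][j]: number of samples of true class i predicted as class j.
    """
    n = len(confusion)
    metrics = []
    for k in range(n):
        tp = confusion[k][k]
        predicted = sum(confusion[i][k] for i in range(n))  # column sum
        actual = sum(confusion[k])                          # row sum
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        denom = precision + recall
        f_measure = 2 * precision * recall / denom if denom else 0.0
        metrics.append((precision, recall, f_measure))
    return metrics
```

For a two-class toy matrix `[[8, 2], [0, 10]]`, class 0 has a precision of 1.0 and a recall of 0.8, while class 1 has a recall of 1.0 and a precision of 10/12.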

Interaction with Nursing-Care Assistant Robot
HMT gesture recognition was applied to the intelligent interaction with the nursing-care assistant robot at home. Based on the control system and interaction mode of the nursing-care assistant robot proposed in Section 2, we extended the HMT gesture recognition system to a real-time HMT gesture recognition system, which is more efficient for the interaction application with the robot. Unlike the algorithm framework proposed for the theoretical verification in Section 3, the framework of the real-time system uses specified training templates based on the LOSO cross-validation, which is more practical and faster during the interaction process.
The interaction application with the nursing-care assistant robot was carried out under the two interaction modes of the robot. As shown in Figure 12a, under the man-in-seat interaction mode, the user wore the wearable camera on his right wrist and sat on the seat part in front of the nursing-care assistant robot. Then, the user made the corresponding predefined gestures to guide the robot. Because the robot's own motion adds to the background velocity when the user interacts with it in the man-in-seat mode, in theory there is a deviation between the experimental results in sitting conditions and the actual control interaction process. However, the subsequent application shows that this additional velocity hardly affected the recognition accuracy, which illustrates the robustness of the proposed HMT gesture recognition system.
The HMT gesture recognition in standing conditions corresponds to the remote interaction mode of the nursing-care assistant robot, shown in Figure 12b. The user remotely guides the robot to a specified condition, and then switches the command mode to the control of the dual manipulator by performing the gesture command. After that, the manipulator is remotely operated to complete the action of assisting in grabbing the specified objects. During the interaction, the original video data of the gesture collected by the wearable camera are transmitted to the host computer through Wi-Fi for processing and recognition, and the recognition result is passed to the robot's control system as the predefined control command corresponding to the recognized HMT gesture.

Discussion and Conclusions
In this paper, an innovative HMT gesture recognition system based on background velocity features using a wearable wrist-worn camera was proposed and applied to intelligent interaction with a nursing-care assistant robot. The environment image data were collected by the wearable wrist-worn camera during the user's hand motion. The velocity of the HMT gesture was reflected by the background velocity, which was calculated from the displacement of the matching SURF points between adjacent frames. In addition, we defined a reliable rule to segment the continuous gestures by detecting the motion of fist bobbing and the background velocity, and the accuracy of the gesture segmentation reached 99.2% on the 1000 effective gestures that were obtained.

Figure 1. The flow-process diagram of the proposed hand motion trajectory (HMT) gesture recognition system based on the wearable wrist-worn camera for application in the smart infrastructure.

Figure 2. (a) Concept design of the wearable wrist-worn camera with an emphasis on wearability and user's comfort; (b) Integrated implementation of the wearable wrist-worn camera.

Figure 3. Concept design of the nursing-care assistant robot consisting of an omnidirectional mobile chassis, lift adjusting mechanism, dual manipulator, and the seat part. The communication protocol of the robot control system is shown as well.

Figure 4. The algorithm flow diagram of the continuous HMT gesture recognition.

Figure 5. Processing of a video image for hand region segmentation: (a) Original red-green-blue (RGB) image; (b) Image after compressing and cutting pixels; (c) RGB image transformed into the L*a*b* color space; (d) Superpixel result of the simple linear iterative clustering (SLIC) algorithm; (e) Seed pixels of the foreground and background for the lazy snapping algorithm; (f) The result of hand region segmentation using the lazy snapping algorithm.

The largest 1/5 and the smallest 1/5 of the values of the vector components V_x and V_y are deemed invalid, since they may be affected by the matching at the edge of the image. The invalid values are shown by the pink circle markers in Figure 6d, while the remaining valid values are shown by the blue dot markers. The final velocity values of the current

Figure 6. Speeded-Up Robust Features (SURF) keypoints were extracted, and the average displacement of matching keypoints between adjacent frames was used to characterize the velocity of the background. (a) The extracted SURF keypoints of the previous image frame; (b) The extracted SURF keypoints of the current image frame; (c) The matching keypoints between adjacent image frames; (d) Determination of the effective values of the current frame's velocity.
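
The velocity estimate summarized in Figure 6 can be illustrated with a minimal sketch. It assumes the matched keypoint coordinates are already available as two arrays (the SURF extraction and matching steps are not reproduced here), and it assumes that the final per-frame velocity is the mean of the values that remain after the largest 1/5 and the smallest 1/5 of each component are discarded. The function name `background_velocity` and the parameter `trim_fraction` are hypothetical and do not come from the original implementation.

```python
import numpy as np


def background_velocity(prev_pts, curr_pts, trim_fraction=0.2):
    """Estimate the background velocity (V_x, V_y) in pixels per frame.

    prev_pts, curr_pts: (n, 2) coordinates of matched keypoints in the
    previous and the current frame (row i of both arrays is one match).
    trim_fraction: share of the largest and of the smallest values that is
    treated as invalid for each component (1/5 in the text).
    """
    if not 0.0 <= trim_fraction < 0.5:
        raise ValueError("trim_fraction must be in [0, 0.5)")
    prev_pts = np.asarray(prev_pts, dtype=float)
    curr_pts = np.asarray(curr_pts, dtype=float)
    displacement = curr_pts - prev_pts          # shape (n, 2)
    n = displacement.shape[0]
    if n == 0:
        return 0.0, 0.0                         # assumption: no matches, no motion
    k = int(n * trim_fraction)                  # values dropped at each end
    velocity = []
    for component in displacement.T:            # first V_x, then V_y
        ordered = np.sort(component)
        valid = ordered[k:n - k]                # keep the central values only
        velocity.append(float(valid.mean()))    # assumed: mean of valid values
    return velocity[0], velocity[1]


if __name__ == "__main__":
    prev = np.zeros((10, 2))
    curr = np.array([[1, 2]] * 6 + [[100, 2], [100, 2], [-100, 2], [-100, 2]])
    print(background_velocity(prev, curr))
```

Trimming each component independently keeps the estimate robust against a few mismatched keypoints without requiring a separate outlier model.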

Figure 7. Description of the continuous gesture segmentation principle based on the process of a triangular HMT gesture.

Figure 8. Ten predefined gestures for the interaction between the human and the nursing-care assistant robot.

Figure 9. Sample velocity curves of the 10 predefined HMT gestures.

4.2.3. Results of the HMT Gesture Recognition

After obtaining the effective velocity data of the corresponding gestures, 1000 gestures were classified with three different cross-validation methods, based on the distance between the velocity data of the corresponding gestures calculated by a DTW algorithm. In order to meet the data requirements of the DTW algorithm, the original velocity data were normalized and resampled to 30 sampling points by linear interpolation before classification.

The three cross-validation methods are as follows: (1) leave-one-subject-out (LOSO) cross-validation selects the sample data of one subject as the test set and the sample data of the other subjects as the training set; (2) leave-other-subject-out (LPO) cross-validation selects the sample data of one subject as the training set and the sample data of the other subjects as the test set; (3) leave-one-group-out within one subject (LOOWS) cross-validation selects one group of samples from one subject as the test set and the other groups of samples of this subject as the training set. The type of each test gesture was determined according to the shortest distance to the training set based on the DTW algorithm, and the type to which the test gesture was assigned most often was taken as the recognition result.

The accuracies of gesture recognition using the different cross-validation methods are shown in Table 4. The mean recognition accuracy with the LOSO cross-validation was up to 97.34%, which verifies the system's performance for an unknown subject. The LPO cross-validation achieved a mean accuracy of 96.55%, which is lower than LOSO and reflects the variations between different subjects and the diversity of our data. The LOOWS method achieved a mean accuracy of 98%, which is higher than LOSO and indicates that users can easily add their characteristic gestures to the system, and that these gestures can be recognized efficiently. As shown in Table 4, the HMT gesture recognition accuracies in the standing condition are slightly higher than those in the sitting condition. Besides the random influence of the external environment, the features of the gesture in the standing condition are more evident than those in the sitting condition, since the action space in the standing condition is much larger than that in the sitting condition.
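
The classification procedure described above can be sketched as follows. This is an illustrative reading rather than the original implementation, and it rests on several assumptions: each gesture is a sequence of (V_x, V_y) pairs, normalization divides by the peak absolute value (the normalization method is not specified in the text), DTW uses the Euclidean distance between samples, and the final label is a majority vote over the nearest template of each training subject. All function names are hypothetical.

```python
from collections import Counter

import numpy as np


def resample(curve, n_points=30):
    """Linearly interpolate a (T, 2) velocity curve to n_points samples and
    scale it by its peak absolute value (assumed normalization)."""
    curve = np.asarray(curve, dtype=float)
    old_t = np.linspace(0.0, 1.0, len(curve))
    new_t = np.linspace(0.0, 1.0, n_points)
    out = np.column_stack(
        [np.interp(new_t, old_t, curve[:, d]) for d in range(curve.shape[1])]
    )
    peak = np.abs(out).max()
    return out / peak if peak > 0 else out


def dtw_distance(a, b):
    """Classic dynamic-programming DTW with Euclidean local cost."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]


def loso_accuracy(curves, labels, subjects):
    """Leave-one-subject-out accuracy, returned as {held-out subject: accuracy}."""
    curves = [resample(c) for c in curves]
    subject_ids = sorted(set(subjects))
    accuracies = {}
    for held_out in subject_ids:
        correct = total = 0
        for i, curve in enumerate(curves):
            if subjects[i] != held_out:
                continue
            votes = []
            for s in subject_ids:
                if s == held_out:
                    continue
                # Nearest template of training subject s under DTW.
                idx = [j for j in range(len(curves)) if subjects[j] == s]
                best = min(idx, key=lambda j: dtw_distance(curve, curves[j]))
                votes.append(labels[best])
            # Assumed voting rule: the most frequently assigned type wins.
            predicted = Counter(votes).most_common(1)[0][0]
            correct += int(predicted == labels[i])
            total += 1
        accuracies[held_out] = correct / total
    return accuracies
```

The LPO and LOOWS variants differ only in how the training and test indices are chosen, so the same `resample` and `dtw_distance` helpers could be reused for them.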

Figure 10. (a) Confusion matrix of gesture recognition in sitting conditions; (b) Confusion matrix of gesture recognition in standing conditions.

Figure 11. (a) The precision, recall, and F-measure of gesture classification in sitting conditions; (b) The precision, recall, and F-measure of gesture classification in standing conditions.
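
For reference, the per-class metrics reported in Figure 11 can be derived from a confusion matrix such as those in Figure 10. The sketch below assumes that rows are the true gesture classes and columns are the predicted classes, and that the F-measure is the balanced F1 score; neither convention is stated in the caption. The function name `per_class_metrics` is hypothetical.

```python
import numpy as np


def per_class_metrics(confusion):
    """Return (precision, recall, f_measure) arrays, one entry per class.

    confusion[i, j] counts samples of true class i predicted as class j
    (assumed orientation).
    """
    confusion = np.asarray(confusion, dtype=float)
    true_pos = np.diag(confusion)
    predicted = confusion.sum(axis=0)   # column sums: samples predicted per class
    actual = confusion.sum(axis=1)      # row sums: samples that belong to each class
    zeros = np.zeros_like(true_pos)
    # Classes with an empty row or column get 0 instead of a division error.
    precision = np.divide(true_pos, predicted, out=zeros.copy(), where=predicted > 0)
    recall = np.divide(true_pos, actual, out=zeros.copy(), where=actual > 0)
    denom = precision + recall
    f_measure = np.divide(2 * precision * recall, denom, out=zeros.copy(), where=denom > 0)
    return precision, recall, f_measure


if __name__ == "__main__":
    # Toy two-class matrix, not data from the paper.
    print(per_class_metrics([[8, 2], [1, 9]]))
```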

Figure 12. The implementation of the interaction with the nursing-care assistant robot: (a) Man-in-seat interaction mode of the nursing-care assistant robot; (b) Remote interaction mode of the nursing-care assistant robot.

Table 1. Segmentation accuracy of continuous gestures with different values of V_t.

Table 2. Segmentation accuracy with different time-window widths of interval-1.

Table 3. Segmentation accuracy with different time-window widths of interval S6.

Table 4. Gesture recognition accuracies using three cross-validation methods with five subjects. LOSO: leave-one-subject-out, LPO: leave-other-subject-out, LOOWS: leave-one-group-out within one subject.