The overall hierarchical design of the proposed system is depicted in
Figure 1. The control architecture is divided into two major systems: the low-level local control system (LCS), responsible for the internal control of the prosthetic hand, and the high-level sensor-based control system (SCS), which fuses multi-modal information to generate the commands to drive the prosthetic hand. The SCS is divided into three major modules: muscle activation subsystem (MASS), computer vision subsystem (CVSS), and grasp prediction subsystem (GPSS).
2.1.2. Computer Vision Subsystem
The CVSS applies a simple and computationally inexpensive algorithm to process the images taken by a single low-cost Logitech C525 HD camera fitted with a clip-on 180° fish-eye lens in an eye-to-hand configuration, fixed to a pair of glasses worn by the user. The CVSS is designed around basic processing elements so that it can be implemented on most commercial integrated processing units with minimal memory and computational requirements. The subsystem begins by taking a single snapshot of the operator's point of view. Images are captured at a low pixel resolution of 640 × 480, and the tilt and rotation angles are adjusted to align the image centre with the user's line of sight; the camera setup does not occlude the user's vision. The captured image is enhanced using contrast correction, converted to grey scale, and then processed for edge detection. While more complex techniques are available [34,35], their computational intensity and the challenges of implementing them as embedded code on board the Bebionic hand dictated the use of leaner algorithms. In this paper, we used lightweight image processing approaches to simplify the computation and maximize practicality and translation to the clinic. For this reason, we intentionally avoided an advanced image processing module, in order to evaluate whether basic algorithms can provide enough information to fuse image-based features with mechanomyography for predicting the intention of the user. Accordingly, in the image processing module, edge detection was performed by applying the Sobel and Canny edge operators. Subsequently, the image was smoothed by a 5 × 5 Gaussian filter given by (1), with σ = 3, to merge nearby pixels into enclosed regions.
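For illustration, a minimal OpenCV sketch of this pre-processing chain is given below; the contrast-correction method (histogram equalisation), the Canny thresholds, and the camera index are assumptions made for the example rather than details of the embedded implementation.

```python
import cv2
import numpy as np

def preprocess_frame(frame_bgr):
    """Contrast correction, grey-scale conversion, edge detection, and smoothing.

    Illustrative OpenCV sketch of the described pipeline; the exact
    contrast-correction method and edge thresholds are assumptions.
    """
    grey = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    grey = cv2.equalizeHist(grey)                  # assumed contrast correction

    # Edge detection with the Sobel and Canny operators
    sobel_x = cv2.Sobel(grey, cv2.CV_64F, 1, 0, ksize=3)
    sobel_y = cv2.Sobel(grey, cv2.CV_64F, 0, 1, ksize=3)
    sobel_mag = cv2.convertScaleAbs(np.hypot(sobel_x, sobel_y))
    canny = cv2.Canny(grey, 50, 150)               # thresholds are assumptions
    edges = cv2.max(sobel_mag, canny)

    # 5 x 5 Gaussian smoothing with sigma = 3 to merge nearby edge pixels
    return cv2.GaussianBlur(edges, (5, 5), 3)

# Example usage with a 640 x 480 capture from the head-mounted camera
cap = cv2.VideoCapture(0)                          # camera index is an assumption
cap.set(cv2.CAP_PROP_FRAME_WIDTH, 640)
cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 480)
ok, frame = cap.read()
if ok:
    edge_map = preprocess_frame(frame)
```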
A binary threshold filter was then applied to enhance the detected edges. This processing yields numerous blobs, which are filtered on a number of properties to isolate the object of interest; the filter parameters were fine-tuned through several experiments with various objects and viewing angles, taking the context of the task into account. The filter removes blobs whose pixel density A falls below Amin = 10 or above Amax = 5000, or whose height and width exceed Hcam = 480 and Wcam = 640 pixels, respectively. Blobs whose Euclidean distance EDist from their centre of gravity to the centre of the image exceeds the distance threshold Ecen = 125 were also removed. Since the camera was head-mounted, the subject's gaze was assumed to be in line with the centre of the camera; blobs with EDist > 125 were therefore considered to be outside the subject's visual interest. The remaining blobs were given a score BScore according to Equation (2), and the highest-scoring blob was identified as the object of interest.
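The blob filtering and selection stage can be sketched as follows. Since Equation (2) is not reproduced here, the scoring rule in the sketch (blob area weighted by inverse distance to the image centre) is a purely illustrative stand-in for BScore.

```python
import cv2
import numpy as np

# Thresholds from the text
A_MIN, A_MAX = 10, 5000        # blob pixel-density (area) limits
H_CAM, W_CAM = 480, 640        # maximum blob height / width in pixels
E_CEN = 125                    # maximum distance from blob centroid to image centre

def select_object_blob(edge_map, threshold=50):
    """Binarise the edge map, then filter and score blobs to pick the object of
    interest. The score below is a hypothetical stand-in for Equation (2)."""
    _, binary = cv2.threshold(edge_map, threshold, 255, cv2.THRESH_BINARY)
    n, labels, stats, centroids = cv2.connectedComponentsWithStats(binary)

    img_centre = np.array([binary.shape[1] / 2, binary.shape[0] / 2])
    best_label, best_score = None, -np.inf
    for i in range(1, n):                              # label 0 is the background
        area = stats[i, cv2.CC_STAT_AREA]
        w, h = stats[i, cv2.CC_STAT_WIDTH], stats[i, cv2.CC_STAT_HEIGHT]
        e_dist = np.linalg.norm(centroids[i] - img_centre)
        if not (A_MIN <= area <= A_MAX) or w > W_CAM or h > H_CAM or e_dist > E_CEN:
            continue                                   # blob fails the property filter
        score = area / (1.0 + e_dist)                  # hypothetical B_Score
        if score > best_score:
            best_label, best_score = i, score
    return best_label, labels
```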
Properties of the resulting region of interest were calculated from its filled convex hull and used as the feature space for object classification; these properties are listed in Table 1. During the training phase, each object template was generated by collecting its properties from 20 images taken at different viewpoints. During real-time object recognition, a K-nearest neighbour (KNN) classifier was used to identify the object of interest by comparing the extracted features to those collected in the corresponding object template. This classifier will be referred to as the object classifier (OC) throughout the rest of this paper. The detected object specified a set of possible grasp patterns unique to that object (see Figure 2), which were then used by the GPSS to determine the most suitable grasp.
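A hedged sketch of the object classifier (OC) is shown below. The specific convex-hull properties of Table 1 are not reproduced, so the features computed here (area, perimeter, aspect ratio, extent, circularity) and the value of K are illustrative assumptions.

```python
import cv2
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def hull_features(blob_mask):
    """Shape features of the blob's filled convex hull. The exact property set
    (Table 1) is not reproduced; these are illustrative examples."""
    contours, _ = cv2.findContours(blob_mask.astype(np.uint8),
                                   cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    hull = cv2.convexHull(max(contours, key=cv2.contourArea))
    area = cv2.contourArea(hull)
    perimeter = cv2.arcLength(hull, True)
    x, y, w, h = cv2.boundingRect(hull)
    return np.array([area,
                     perimeter,
                     w / h,                               # aspect ratio
                     area / (w * h),                      # extent
                     4 * np.pi * area / perimeter ** 2])  # circularity

# Object classifier (OC): trained on ~20 images per object from different viewpoints.
# X_train holds the hull feature vectors and y_train the object labels.
oc = KNeighborsClassifier(n_neighbors=3)   # K is an assumption
# oc.fit(X_train, y_train)
# detected_object = oc.predict(hull_features(blob_mask).reshape(1, -1))[0]
```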
2.1.3. Grasp Prediction Subsystem
After the object of interest was detected by the CVSS, the corresponding set of object-specific grasp patterns were given to the GPSS. This subsystem had the role of data monitoring, data fusion, and command generation. As the participant generated an intention to reach for an object, the GPSS was initiated using the signal produced by the MASS. Thus, the GPSS started recording motion data using a pair of custom IMUs to estimate the kinematics of motion (position and orientation). The IMUs were affixed to the wrist and the biceps to capture kinematic features of the forearm and upper arm, respectively.
Orientation was estimated using the gradient descent algorithm developed by Madgwick et al. [36]. This algorithm is computationally less demanding than Kalman-based orientation filters and can operate at lower sampling rates, enabling its use in low-cost, wearable IMU systems capable of running wirelessly over extended periods of time. Orientation estimates were captured in the form of quaternions.
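A minimal sketch of a single Madgwick-style gradient-descent update (IMU variant, gyroscope and accelerometer only) is given below; the gain β and the use of NumPy are illustrative choices, and the published algorithm [36] also provides a MARG variant that incorporates magnetometer data.

```python
import numpy as np

def madgwick_imu_update(q, gyro, accel, beta=0.1, dt=0.001):
    """One gradient-descent orientation update (Madgwick-style, IMU variant).

    q     : current orientation quaternion [w, x, y, z]
    gyro  : angular rate in rad/s, [gx, gy, gz]
    accel : accelerometer reading (normalised internally)
    beta  : algorithm gain (illustrative value)
    dt    : sample period in seconds (1 kHz here)
    """
    q0, q1, q2, q3 = q
    accel = np.asarray(accel, dtype=float)
    ax, ay, az = accel / np.linalg.norm(accel)
    gx, gy, gz = gyro

    # Objective function: difference between measured and estimated gravity direction
    f = np.array([2*(q1*q3 - q0*q2) - ax,
                  2*(q0*q1 + q2*q3) - ay,
                  2*(0.5 - q1**2 - q2**2) - az])
    # Jacobian of the objective function
    J = np.array([[-2*q2,  2*q3, -2*q0, 2*q1],
                  [ 2*q1,  2*q0,  2*q3, 2*q2],
                  [ 0.0,  -4*q1, -4*q2, 0.0]])
    grad = J.T @ f
    grad /= np.linalg.norm(grad)

    # Quaternion rate from the gyroscope, corrected by the gradient step
    q_dot = 0.5 * np.array([-q1*gx - q2*gy - q3*gz,
                             q0*gx + q2*gz - q3*gy,
                             q0*gy - q1*gz + q3*gx,
                             q0*gz + q1*gy - q2*gx]) - beta * grad

    q = np.asarray(q, dtype=float) + q_dot * dt
    return q / np.linalg.norm(q)
```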
Displacement was measured with respect to the resting position (origin) and the object depth and angle locations, according to the workspace layout. During the reaching phase, kinematic information was collected and timestamped at a frequency of 1 kHz using the IMUs.
We initially focused on using distribution features of the quaternion displacement trajectory during different grasp patterns. Although this approach proved relatively successful, varying the position of the object reduced the classifier accuracy considerably due to the high variability in the arm's movement trajectory, which impeded the ability to discriminate between different grasps. As a result, individual vector components representing the change in displacement of a single point were also analysed, by considering the X, Y, and Z quaternion components as an imaginary vector (the imaginary quaternion vector part) in 3-dimensional space with its origin at [0,0,0] (the resting position). Thus, rather than looking only at the distribution features of the whole displacement profile, single-point features were also considered and tracked. A total of 84 kinematic features were extracted from displacement, velocity, and acceleration.
A list of the 14 most informative types of kinematic features, selected using the scoring method described in the next section (Remark—Feature Selection for Grasp Classification), is shown in Table 2. The maximum value along the positive vector axis is denoted Max(+), and the maximum value along the negative vector axis is denoted Max(−). These 14 feature types were calculated for the six segment motions (described below), resulting in the 84 kinematic features. The segment motions correspond to the forearm and upper arm Cartesian space in the X, Y, and Z directions, as illustrated in Figure 3, and are represented by the terms FAx, FAy, and FAz for the forearm and UAx, UAy, and UAz for the upper arm (e.g., PoG:Disp-FAx refers to the final displacement of the forearm in the x direction at the point of grasp).
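As an illustration of how the 84-feature space is assembled, the sketch below computes velocity and acceleration by numerical differentiation of the displacement trace and extracts a few of the feature types of Table 2 (Max(+), Max(−), and the point-of-grasp value) for each of the six segment motions; the remaining feature types, not reproduced here, follow the same pattern.

```python
import numpy as np

FS = 1000.0          # IMU sampling rate (1 kHz)
SEGMENTS = ["FAx", "FAy", "FAz", "UAx", "UAy", "UAz"]

def segment_features(disp):
    """Example kinematic features for one segment motion (e.g., FAx).

    `disp` is the displacement trace over the reaching window. Only a subset of
    the 14 feature types listed in Table 2 is reproduced here."""
    vel = np.gradient(disp) * FS          # velocity by numerical differentiation
    acc = np.gradient(vel) * FS           # acceleration
    feats = {}
    for name, sig in (("Disp", disp), ("Vel", vel), ("Acc", acc)):
        feats[f"Max(+):{name}"] = sig.max()     # maximum along the positive axis
        feats[f"Max(-):{name}"] = sig.min()     # maximum along the negative axis
    feats["PoG:Disp"] = disp[-1]                # value at the point of grasp
    return feats

def build_feature_vector(motion):
    """Concatenate per-segment features into one vector (84 features in total
    when all 14 feature types of Table 2 are used)."""
    vector = {}
    for seg in SEGMENTS:
        for key, value in segment_features(motion[seg]).items():
            vector[f"{key}-{seg}"] = value      # e.g., "PoG:Disp-FAx"
    return vector
```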
The object detected by the CVSS determines the set of possible grasp patterns to be established by the GPSS. After a 3-s window of data recording during reach, inherent physiological features (estimated from the IMU data) were extracted and compared against a pre-existing subject- and object-specific grasp template to classify the intended grasp using a second KNN classifier, referred to as the grasp classifier (GC) in the rest of this paper. Once the intended grasp was identified based on the augmented feature space (fusion of muscle activity, contextual information, and kinematic data), the appropriate output command corresponding to the selected grasp pattern was generated and sent to the LCS for execution.
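A hedged sketch of how the grasp classifier (GC) can be restricted to the object-specific grasp set is shown below; organising the templates as one KNN model per object, and the value of K, are structural assumptions rather than details of the original implementation.

```python
from sklearn.neighbors import KNeighborsClassifier

class GraspPredictor:
    """Grasp classifier (GC): one KNN per detected object, trained on that
    subject's object-specific grasp templates (illustrative structure)."""

    def __init__(self, n_neighbors=3):          # K is an assumption
        self.n_neighbors = n_neighbors
        self.models = {}                        # object label -> fitted KNN

    def fit(self, templates):
        # templates: {object: (X, y)} where X holds the 84-D kinematic feature
        # vectors recorded during training reaches and y the intended grasp labels.
        for obj, (X, y) in templates.items():
            knn = KNeighborsClassifier(n_neighbors=self.n_neighbors)
            self.models[obj] = knn.fit(X, y)

    def predict(self, detected_object, feature_vector):
        # Classify the reach against the grasp set of the detected object only;
        # the selected grasp pattern is then passed to the LCS for execution.
        return self.models[detected_object].predict([feature_vector])[0]
```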
Remark—Feature Selection for Grasp Classification: As mentioned, the grasp classifier works on the fusion of multi-modal information, including the kinematics of motion during the reaching phase. However, an individual can choose many different trajectories when reaching to grasp an object. These trajectories cause variations in the forearm and upper arm orientation, velocity, and acceleration. Furthermore, the reaching paths may vary depending on the location of the object, increasing the variability and redundancy in the feature space. In order to augment the input data, several different reaching trajectories and object locations were considered in the training phase. For this, the input space (which includes MMG data, contextual vision-based information, and kinematics of motion) was recorded under multiple variations of (a) elevation/declination, (b) angular rotation, and (c) depth relative to the participant. Subsequently, in order to reduce the computational cost, the minimum number of features required for classification was determined using a sequential feature selection method. This process creates candidate feature subsets by adding features one at a time and evaluating each subset with leave-one-out cross-validation, using a criterion defined by the KNN classifier (which uses a Euclidean distance metric to measure feature performance). The criterion returned by the classifier was summed over the folds and divided by the number of observations to give the mean criterion value, i.e., the error rate, and features were added one by one so as to minimize this value. The error rate was estimated as the average error over all folded partitions according to (3), where E is the true error rate, K = N is the number of folds (with N the total number of samples), and Ei is the error rate for fold i.
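A scikit-learn stand-in for this selection procedure is sketched below; the library's SequentialFeatureSelector with a Euclidean-distance KNN and leave-one-out cross-validation approximates the described process, although the exact criterion bookkeeping of the original implementation is not reproduced.

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def forward_select_features(X, y, n_features=10):
    """Forward sequential feature selection with leave-one-out cross-validation,
    using a Euclidean-distance KNN as the selection criterion.

    X : (n_samples, 84) array of kinematic/fused features; y : grasp labels.
    """
    knn = KNeighborsClassifier(n_neighbors=3, metric="euclidean")  # K is an assumption
    sfs = SequentialFeatureSelector(knn,
                                    n_features_to_select=n_features,
                                    direction="forward",
                                    scoring="accuracy",
                                    cv=LeaveOneOut())
    sfs.fit(X, y)
    selected = np.flatnonzero(sfs.get_support())

    # Mean LOO error of the selected subset: E = (1/K) * sum_i E_i, with K = N folds
    acc = cross_val_score(knn, X[:, selected], y, cv=LeaveOneOut())
    return selected, 1.0 - acc.mean()
```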
The top 10 features were selected, ordered by ascending error rate, and scored according to the order in which they were selected through forward sequential feature selection. Specifically, each selected feature was issued a score S given by (4), where R is the position of the feature in the ordered selection based on the error rate, and SC (equal to 1) is the maximum initial score. The mean scores were then calculated across all participants to identify the most important features during the grasp.
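Since Equation (4) is not reproduced here, the scoring sketch below uses a hypothetical linear decay with the selection rank R as a stand-in for S; only the averaging across participants follows directly from the text.

```python
import numpy as np

S_C = 1.0          # maximum initial score

def score_selected_features(ordered_features, n_top=10):
    """Assign a score to each selected feature based on its selection rank R.
    The linear decay below is a purely illustrative stand-in for Equation (4)."""
    return {feat: S_C - (rank / n_top)           # hypothetical form of the score S
            for rank, feat in enumerate(ordered_features[:n_top])}

def mean_scores_across_participants(per_participant_scores, all_features):
    """Average per-participant scores to rank the most informative grasp features."""
    return {feat: np.mean([scores.get(feat, 0.0)
                           for scores in per_participant_scores])
            for feat in all_features}
```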