Hybrid FPGA–CPU-Based Architecture for Object Recognition in Visual Servoing of Arm Prosthesis

The present paper proposes an implementation of a hybrid hardware–software system for the visual servoing of prosthetic arms. We focus on the most critical vision analysis part of the system. The prosthetic system comprises a glass-worn eye tracker and a video camera, and the task is to recognize the object to grasp. The lightweight architecture for gaze-driven object recognition has to be implemented as a wearable device with low power consumption (less than 5.6 W). The algorithmic chain comprises gaze fixations estimation and filtering, generation of candidates, and recognition, with two backbone convolutional neural networks (CNN). The time-consuming parts of the system, such as SIFT (Scale Invariant Feature Transform) detector and the backbone CNN feature extractor, are implemented in FPGA, and a new reduction layer is introduced in the object-recognition CNN to reduce the computational burden. The proposed implementation is compatible with the real-time control of the prosthetic arm.


Introduction and State-of-the-Art
One of the problems assistive robotics addresses is the production of upper limb prostheses for amputees. Despite great progress in upper limb bionic prostheses, allowing for object-of-interest reaching and grasping, the key remaining issues relate to their control by the operator. To overcome the limitations of traditional control solely based on the electromyographic (EMG) activity of the remaining muscles, promising alternatives consider hybrid systems combining noninvasive motion capture and vision control [1,2]. They include camera vision modules that allow for recognition of the subject's intention to grasp an object and assist visual control of prosthetic arms for object reaching and grasping [3].
The computer vision algorithms which are implemented in these systems comprise the latest object recognition approaches, such as deep neural network (DNN) classifiers and regressors [4]. In our previous work [5] we proposed an FPGA-implemented SIFT detector for matching of views in a multi-camera visual prosthesis servoing system. Despite the fact that the visual servoing of robotic arms has been a highly researched subject [6], the application to arm neuroprostheses implies supplementary constraints. The whole control device has to be lightweight and worn by the subject. Hence, it is necessary first to minimize the equipment and second to propose efficient lightweight solutions for visual scene analysis by the camera worn by the subject.
Real-time performance is also a mandatory requirement for our target application [2,7]. As the fastest visuomotor response to a perturbation takes about 90 ms [8], and feedback delays of 100 ms or more are known to deteriorate the performance of online feedback control [9], computation time should remain as low as possible, and below 100 ms.
In this work, we propose a hybrid hardware/software (HW/SW) architecture for the analysis of a visual scene for the visual servoing of a neuroprosthetic arm using a glass-worn camera. The visual task here is to recognize the object the subject intends to grasp and localize it in the egocentric visual scene.

State-of-the-Art Hybrid Solutions in Robotic Vision
As the core block for object recognition in our system is a convolutional neural network (CNN), we further present a brief state-of-the-art review of lightweight CNNs for object detection.

State-of-the-Art Lightweight CNNs for Object Detection
In recent years, the most popular object detection algorithms in computer vision have been deep convolutional neural networks, such as faster regions with CNN (Faster R-CNN) [10], you only look once (YOLO) [11], and single shot detector (SSD) [12]. These detectors are based on deep residual networks (Resnet) [13], very deep convolutional networks (VGGnet) [14], Alexnet [15], MobileNet [16], and GoogleNet [17]. Resnet [13] was proposed by He et al. and uses residual blocks, which are illustrated in Figure 1.
Denoting the desired underlying mapping as H(x), we let the stacked nonlinear layers fit another mapping, F(x) := H(x) − x. The original mapping is then recast into F(x) + x. It is easier to optimize the residual mapping than to optimize the original mapping. F(x) + x can be realized by feedforward neural networks with shortcut connections, as illustrated in Figure 1. Shortcut connections can skip one or more layers. In Resnet [13], the outputs of the shortcut connections are simply added to the outputs of the stacked layers. The computational cost of Resnet [13] is high, which makes real-time implementation difficult. However, there are methods that can accelerate the computation.
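The residual formulation can be illustrated with a minimal pure-Python sketch. The two-layer residual function, its weights, and the vector size are illustrative assumptions and do not reproduce the Resnet [13] configuration.

```python
def relu(v):
    """Element-wise rectified linear unit."""
    return [max(0.0, a) for a in v]


def matvec(w, v):
    """Multiply matrix w (list of rows) by vector v."""
    return [sum(wij * vj for wij, vj in zip(row, v)) for row in w]


def residual_block(x, w1, w2):
    """Return relu(F(x) + x), where F is two linear layers with a ReLU in between."""
    f = matvec(w2, relu(matvec(w1, x)))             # residual mapping F(x)
    return relu([fi + xi for fi, xi in zip(f, x)])  # shortcut connection adds x


zero = [[0.0, 0.0], [0.0, 0.0]]
# With zero weights F(x) = 0, so the block passes a non-negative x through unchanged.
print(residual_block([1.0, 2.0], zero, zero))
```

With all weights at zero the block acts as the identity on non-negative inputs, which is the intuition behind the claim that the residual mapping is easier to optimize than the original one.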
VGGNet [14] is a simple deep convolutional neural network, where deep refers to the number of layers. The VGG-16 consists of 13 convolutional layers and 3 fully connected layers. The convolutional layers are simple because they use only 3 × 3 filters and pooling layers. This architecture has become popular in image classification problems.
Faster R-CNN [10] was proposed by Ren et al. This architecture has gained popularity among object detection algorithms. Faster R-CNN [10] is composed of the following four parts:
• a feature extraction module, which can be a VGGnet [14], MobileNet [16], or Resnet [13];
• a region proposal module to generate the bounding boxes around the object;
• a classification layer to detect the class of the object (for example, cat, dog, etc.);
• a regression layer to make the prediction more precise.
The computational speed of the network depends on the feature extraction module and the size of the region proposal module.
Both SSD [12] and YOLO [11] are single-stage detectors. They are significantly faster than two-stage detectors (region-based methods), such as Faster R-CNN [10]. However, in cases when the objects have little variability, neither interclass nor intraclass, Faster R-CNN [10] is a well-suited network. In our problem, we are interested in naturally cluttered home environments, such as kitchens, where the subject intends to grasp an object. The vision analysis system we propose has to be designed to recognise objects to grasp in videos similar to those of the grasping-in-the-wild (GITW) dataset [18]. This dataset was recorded in natural environments by several healthy volunteers, and we made it publicly available on the CNRS NAKALA platform. The objects here, seen from the glass-mounted camera, are quite small: their surface represents merely 10% of the whole video frame. Hence, Faster R-CNN [10] is a better choice than SSD [12] and YOLO [11], because it achieves a higher mean average precision (mAP) for small objects, as reported by Huang et al. [19].
The original Faster R-CNN [10] uses VGGnet [14] as a feature extractor. However, the mAP is higher when Resnet [13] is used as a backbone [20]. When the object is small, the mAP of the backbone with Resnet [13] is higher than the backbone with MobileNet [16], as reported in [19].
There are several possible ways to accelerate an algorithm [21]. In our case, FPGA was chosen in the interest of developing a lightweight and portable device [22].
Due to the large computing and memory bandwidth requirements, deep learning neural networks are trained on high-performance workstations, computing clusters, or GPUs using floating-point numbers. The memory access pattern of the inference step of a trained network is different, offering more data reuse and requiring smaller memory bandwidth, which makes FPGAs a versatile platform for acceleration. Computing with floating-point numbers is a resource-intensive process for FPGAs in terms of digital signal processing (DSP) slices and logic resource usage. The memory bandwidth required to load 32-bit floating-point state values and weights can still be high compared with the capabilities of low-power FPGA devices. Additionally, a significant amount of memory is required for buffering state values and partial results in the on-chip memory of the FPGA. One possible solution would consist of using the industry-standard 16-bit bfloat floating-point representation, which can improve the inference speed of an FPGA. Observations show [26] that the values of weights, state values, and partial results during the computation usually fall in a relatively small range, and the 8-bit exponent range of the bfloat type is practically never used. If the range of the values during the computation is known in advance, then fixed-point numbers can be used. One of the major application areas of FPGAs is signal processing; therefore, the DSP slices are designed for fast, fixed-point multiply-accumulate (MAC) or multiply-add (MADD) operations, which can be utilized during neural network inference.
Converting a neural network model trained with floating-point numbers to a fixed-point FPGA-based implementation usually requires an additional step called quantization. Here, a small training set is used to determine the fixed-point weights and optimize the position of the radix point in each stage of the computation. The common bit width for quantization is 16 or 8 bits, at which the accuracy of the network is only slightly reduced. In some cases, even a binary representation is possible [26], eliminating all multiplications from the computation, which makes the FPGA implementation very efficient while the accuracy is decreased slightly.
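The choice of the radix point position can be sketched in a few lines of Python. This is a minimal illustration of symmetric fixed-point quantization under the assumption that the value range is known from calibration data; the function names and the example weights are hypothetical, and the sketch is not the procedure used by Vitis AI [23] or FINN [26].

```python
import math


def choose_frac_bits(values, total_bits=8):
    """Pick the number of fractional bits so that max |value| fits a signed word."""
    max_abs = max(abs(v) for v in values)
    int_bits = max(0, math.floor(math.log2(max_abs)) + 1) if max_abs > 0 else 0
    return total_bits - 1 - int_bits  # one bit is reserved for the sign


def quantize(values, frac_bits, total_bits=8):
    """Round to the nearest fixed-point code and saturate to the signed range."""
    scale = 2.0 ** frac_bits
    lo = -(1 << (total_bits - 1))
    hi = (1 << (total_bits - 1)) - 1
    return [min(hi, max(lo, round(v * scale))) for v in values]


def dequantize(codes, frac_bits):
    """Map fixed-point codes back to real values."""
    scale = 2.0 ** frac_bits
    return [c / scale for c in codes]


weights = [0.5, -0.25, 0.75, -1.5]   # hypothetical calibration values
frac_bits = choose_frac_bits(weights)
codes = quantize(weights, frac_bits)
print(frac_bits, codes, dequantize(codes, frac_bits))
```

The integer codes can then be processed by fixed-point MAC units, with the radix point position tracked separately for each stage of the computation.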
For latency-sensitive applications, this fixed-point model can be implemented on a streaming architecture, such as FINN [26], where layers of the network are connected directly on the FPGA. Using this structure, loading and storing state values can be avoided. In an ideal case, when the number of weights is small enough, they can be stored in the on-chip memories, further reducing the memory bandwidth requirements of the system. This also results in lower dissipated power due to the high energy requirement of off-chip data movement. Another approach used in Vitis AI [23] and Apache TVM VTA [24] is to divide the computation into a series of matrix-matrix multiplications and create a customized ISA (instruction set architecture) to execute these operations efficiently. The resulting system might have higher memory bandwidth requirements and longer latency, but can be easily reprogrammed to infer a different network during different steps of an image processing application.
Apache TVM VTA [24] is an open, generic, and customizable deep learning accelerator with a complete TVM-based compiler stack. It is an end-to-end hardware-software deep learning system stack that combines TVM and VTA. It contains the hardware design, drivers, a just-in-time (JIT) runtime, and an optimizing compiler stack based on TVM.
The main advantages of quantization are reduced complexity of the circuit, efficient use of dedicated hardware resources, reduced on-chip memory requirements, reduced off-chip memory bandwidth, and smaller power dissipation. Thus, for a lightweight body-worn device, Vitis AI [23] is a good choice, because it can accelerate the network with minimal accuracy loss.
The remainder of the paper is organized as follows. In Section 2, we present the system overview for object detection in egocentric camera view, previously developed in [4], which we further adapt. In Section 3, we propose a hybridization of the solution for the FPGA-CPU board to be incorporated into a body-worn device for prosthetic control. In Section 4, we present our results, measuring the execution time, while comparing it on different platforms. Section 5 concludes our work and outlines its perspectives.

System Overview for Object Detection
In this section, we present a system overview for object detection in egocentric video, explain each module, and propose our adaptation of a gaze-driven CNN for object recognition to meet the real-time constraints of our hybrid solution.

System Overview
The vision analysis part, which is the most critical in the whole chain of prosthesis servoing, is presented in Figure 2. The underlying hypothesis for the functioning of vision-guided neuroprostheses is that the upper limb amputee wearing the neuroprosthesis first looks at the object they wish to grasp. The subject is wearing a Tobii glasses device, which acquires an ego-visual scene and records gaze fixations of the subject in its coordinate system (see the left-most block in Figure 2). The recorded gaze fixations allow for roughly localizing the object of interest in video frames. Nevertheless, visual saccades to the distractors in a visual scene, microsaccades, and initial scene exploration before the subject finds the object make these measurements noisy. Hence, two blocks of the system, gaze point alignment and gaze point noise reduction, serve to estimate the position of the gaze fixation on the object in the current ego-video frame. The gaze point alignment module aims to estimate and compensate for the ego-motion between the past frames and the current frame. For more details, see Section 2.2. The goal of the gaze point noise reduction module is to reduce the noise in the current frame. This noise can result from head motion, or from the user being distracted and looking at another object for a moment. For more details, see Section 2.3. Then, the video frame is cropped around the estimated gaze point to limit the area of the object search. Finally, different object proposal bounding boxes (BBs) at different scales are generated around the point for object localization. The gaze point-centred image and the set of BB coordinates are then submitted to the gaze-driven CNN (see the right-most block in Figure 2). The gaze-driven CNN is pre-trained on the taxonomy of objects to detect. It outputs the best score for the object class and the best-scored bounding box.
When the object is localized in a video frame, its 3D position for prosthesis servoing can be estimated from the eye tracker depth measures of the gaze fixation and the coordinates of the centre of the best-scored bounding box.
The resolution of the Tobii first-person view camera is full HD (1920 × 1080 p), with a frame rate of 25 frames per second (fps). The real-time requirement for the system in our case means that each processing step of the localization of the object of interest in the current video frame of the glasses-mounted camera has to take less than 40 ms (the video acquisition period), and the latency of the whole system should be lower than 100 ms to leave time for the mechanical servoing of the prosthetic arm [7]. In this work, we do not consider depth estimation, which is a simple regression from eye tracker gaze fixation measures; our focus is on object detection. In the following passages, we present each system block in detail.

Gaze Point Alignment
The Tobii glass camera and eye tracker system output the coordinates of gaze fixations in each video frame of the first-person integrated camera.
Even if the subject is looking at the same object to grasp during the object reaching, the projected gaze points will vary between two consecutive frames because of the body and ocular movements. Furthermore, saccades provoked by distractors can divert the gaze from the object. Hence, the first step consists of the estimation of a gaze fixation in the current (reference) video frame using all the past recorded gaze fixations. It is necessary to estimate and compensate for the ego-motion between the past frames and the current frame to collect all gaze points in the same reference frame. We show an illustration of such a collection in Figure 3, where the lighter the gaze fixation point, the more distant it is from the current timestamp. Motion compensation from the past frames to the current frame is realized by a sequential homography transformation computed between consecutive frames.
Suppose a video sequence is given with N frames and a list of gaze points, $g_n = (g_{x_n}, g_{y_n})$, $n = 1, \ldots, N$. The system operates as follows: for each pair of consecutive frames, it extracts the characteristic keypoints and local features. In our case, the keypoint extractor is the scale invariant feature transform (SIFT) [27]. A fast library for approximate nearest neighbours (FLANN)-based matcher [28] is used to find the good matches between the SIFT descriptors of the two frames.
The final step is to estimate the homography transformation matrix, $H_n$, $n = 1, \ldots, N$, where N is the number of the current frame, based on the good matches. Then, the gaze fixations can be projected from all frames into the current frame by a composition of the homographies $H_n$. In this projection, we use a sliding window of duration $\Delta t = 10$ frames, which corresponds to a 400 ms time interval and to the scene apprehension time of the subjects in our experiments. Therefore, for the current frame, N, the collected gaze points are $\hat{g}_{N,n}$, $n = N - \Delta t, \ldots, N$.
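The composition of consecutive homographies can be sketched as follows. The sketch assumes that the homography at index k maps frame k to frame k + 1 and that the matrices have already been estimated (for instance from the SIFT matches); the helper names and the pure-translation example are illustrative only.

```python
def matmul3(a, b):
    """Product of two 3 x 3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def project(h, point):
    """Apply homography h to a 2D point using homogeneous coordinates."""
    x, y = point
    u = h[0][0] * x + h[0][1] * y + h[0][2]
    v = h[1][0] * x + h[1][1] * y + h[1][2]
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return (u / w, v / w)


def align_gaze_points(gaze_points, homographies):
    """Project every gaze point of the window into the last (current) frame.

    gaze_points[k] is the fixation in frame k; homographies[k] maps frame k
    to frame k + 1, so there is one homography fewer than gaze points.
    """
    to_current = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    aligned = [tuple(gaze_points[-1])]
    for k in range(len(homographies) - 1, -1, -1):
        to_current = matmul3(to_current, homographies[k])  # frame k -> current
        aligned.append(project(to_current, gaze_points[k]))
    aligned.reverse()
    return aligned


def translation(tx, ty):
    return [[1.0, 0.0, tx], [0.0, 1.0, ty], [0.0, 0.0, 1.0]]


points = [(10.0, 10.0), (20.0, 20.0), (30.0, 30.0)]
print(align_gaze_points(points, [translation(5.0, 0.0), translation(0.0, 3.0)]))
```

In the system described here, the window would contain the current frame and the ∆t = 10 preceding frames.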

Noise Reduction
The goal of this module is to reduce the noise of the gaze fixations projected into the current frame.
The list of the aligned gaze fixations, $\hat{g}_{N,n}$, $n = N - \Delta t, \ldots, N$, is the input of the kernel density estimator (KDE) with Gaussian kernel [29], which predicts the most probable location of the gaze fixation in the current frame. The KDE estimates the density at a point $x$ as described in the following equation:
$$\rho_K(x) = \sum_{i=1}^{L_N} K(x - \hat{g}_{N,i}, h)$$
where a kernel, K(x, h), is a positive function that is controlled by the bandwidth parameter, h, and $\hat{g}_{N,i}$, $i = 1, \ldots, L_N$, enumerates the aligned gaze fixations. In our case, the bandwidth parameter, h, of the Gaussian kernel was set to 1, the default value. $L_N$ is the number of gaze points projected in the current frame N. The maximum of the estimated density surface is considered as a predictor of the gaze fixation point in the current frame. The search for the maximum is realized inside a bounding box which encompasses all projected gaze fixations $\hat{g}_{N,n}$, $n = N - \Delta t, \ldots, N$, using a full search with pixel accuracy. An example of an estimated gaze point in a frame is presented in Figure 4 (see the bright disk of the largest diameter).
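The full search over the density surface can be sketched in pure Python. The sketch assumes an unnormalized isotropic Gaussian kernel (normalization does not change the location of the maximum); the function names and the example points are hypothetical.

```python
import math


def gaussian_kernel(dx, dy, h):
    """Unnormalized isotropic Gaussian kernel with bandwidth h."""
    return math.exp(-(dx * dx + dy * dy) / (2.0 * h * h))


def estimate_gaze_point(points, h=1.0):
    """Return the pixel with maximal kernel density inside the bounding box
    that encompasses all projected gaze points (full search, pixel accuracy)."""
    x_min = math.floor(min(p[0] for p in points))
    x_max = math.ceil(max(p[0] for p in points))
    y_min = math.floor(min(p[1] for p in points))
    y_max = math.ceil(max(p[1] for p in points))
    best_density, best_pixel = -1.0, None
    for y in range(y_min, y_max + 1):
        for x in range(x_min, x_max + 1):
            density = sum(gaussian_kernel(x - px, y - py, h) for px, py in points)
            if density > best_density:
                best_density, best_pixel = density, (x, y)
    return best_pixel


# Five fixations clustered around (100, 100) and one distant outlier.
points = [(100, 100), (101, 100), (100, 101), (99, 100), (100, 99), (140, 90)]
print(estimate_gaze_point(points))
```

The cost of the search grows with the area of the bounding box, so a single distant fixation enlarges the search area considerably.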

Gaze-Driven Object Recognition CNN
This module recognizes the object location and type (e.g., bowl, pan, etc.) in a first-person video frame. A limited number of bounding boxes of different scales is generated around the estimated gaze fixation point to localize the object. The module's input is thus the estimated gaze fixation point $\hat{g}_n$, the cropped frame around the estimated gaze fixation, and the possible bounding boxes of the object generated around $\hat{g}_n$ (see the second block in Figure 2).
In the current work, 9 bounding boxes (BB) have been generated with different scale and shape factors. The size of a cropped frame is 300 × 300 px [4]. For the size of BB, we have considered the width and the height between 67 and 223, in accordance with the frame resolution and the typical object sizes in egocentric visual scenes.
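One way to obtain nine candidate boxes with different scale and shape factors is sketched below. The choice of three side lengths (67, 145, and 223 px) combined for width and height is an assumption made for illustration; the exact set of boxes used in [4] may differ.

```python
def generate_boxes(center, sizes=(67, 145, 223), crop=300):
    """Return (x0, y0, x1, y1) boxes centred on the gaze point, clipped to the crop.

    Every (width, height) pair from `sizes` gives one box, so three side
    lengths yield nine boxes of different scale and aspect ratio.
    """
    cx, cy = center
    boxes = []
    for w in sizes:
        for h in sizes:
            x0 = max(0.0, cx - w / 2.0)
            y0 = max(0.0, cy - h / 2.0)
            x1 = min(float(crop), cx + w / 2.0)
            y1 = min(float(crop), cy + h / 2.0)
            boxes.append((x0, y0, x1, y1))
    return boxes


# The gaze point is at the centre of the 300 x 300 px crop.
for box in generate_boxes((150.0, 150.0)):
    print(box)
```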
Recognition of the object is carried out by a CNN classifier applied to each of the generated bounding boxes. The BB with the maximum score is thus considered as the object location. Figure 5 shows the structure of the gaze-driven CNN. The backbone is a Resnet50 in the first four layers, see the left-most block in Figure 5. These layers serve as feature extractors from the input image. The input of the backbone is a cropped video frame of size 300 px × 300 px × 3. The output is a 1024 × 19 × 19 feature tensor. Not all feature channels are equally important for object classification when using the backbone. To select the most important ones, and to reduce the computational burden of the remaining part of the network, we introduce a reduction layer (RL). It reduces the number of channels in the input tensor to a given channel number CH (in our case, CH can be: 32, 64, 96, 128, 256, 512, 1024).
The input of the RL is the backbone output tensor of dimension 1024 × 19 × 19. The RL applies a 2D convolution [30] over an input signal composed of several input planes. Assume that the input is $(M, C_{in}, H, W)$ and the output is $(M, C_{out}, H_{out}, W_{out})$; then the RL can be precisely described as follows:
$$\mathrm{out}(M_i, C_{out_j}) = \mathrm{bias}(C_{out_j}) + \sum_{k=0}^{C_{in}-1} \mathrm{weight}(C_{out_j}, k) \star \mathrm{input}(M_i, k)$$
where $\star$ is the valid 2D cross-correlation operator, M is the batch size, C denotes the number of channels, H is the height of input planes in pixels, and W is the width in pixels. The bounding boxes generated around the estimated gaze fixation point and the feature tensor with the reduced number of channels (CH × 19 × 19) are the inputs of the Faster R-CNN module [31] (ROI heads). The module predicts the object type and location as a 17 × 9 tensor, as we have 9 BBs (see Figure 6) and work with a 17-class taxonomy comprising 16 object classes and a rejection class, as in [4]. This tensor contains the probability of each bounding box for each class. Equation (2) is the output tensor of the ROI heads (Faster R-CNN [31]), where Ci are the categories and B are the bounding boxes.
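The channel reduction can be sketched in pure Python. The kernel size of the reduction layer is not stated here; the sketch assumes a 1 × 1 kernel, which keeps the 19 × 19 spatial size under a valid cross-correlation. The tensor values, weights, and function names are hypothetical.

```python
def reduction_layer(inp, weight, bias):
    """1 x 1 convolution: inp is [C_in][H][W], weight is [C_out][C_in], bias is [C_out]."""
    c_in, height, width = len(inp), len(inp[0]), len(inp[0][0])
    out = []
    for j, row in enumerate(weight):
        plane = [[bias[j] + sum(row[k] * inp[k][y][x] for k in range(c_in))
                  for x in range(width)]
                 for y in range(height)]
        out.append(plane)
    return out


def mac_count(c_in, c_out, height, width):
    """Multiply-accumulate operations of a 1 x 1 convolution."""
    return c_in * c_out * height * width


# Toy example: reduce 3 channels of a 2 x 2 feature map to 2 channels.
inp = [[[1.0, 1.0], [1.0, 1.0]],
       [[2.0, 2.0], [2.0, 2.0]],
       [[3.0, 3.0], [3.0, 3.0]]]
weight = [[1.0, 0.0, 0.0], [0.5, 0.5, 0.0]]
bias = [0.0, 1.0]
print(reduction_layer(inp, weight, bias))
print(mac_count(1024, 128, 19, 19))  # cost for CH = 128 under the 1 x 1 assumption
```

Under this assumption the cost of the layer grows linearly with CH, which is consistent with the higher computation times reported for larger channel numbers.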
The class scores of the bounding boxes are aggregated, as in [4], by multiple instance learning (MIL) [32]. The input of the MIL aggregation is the output tensor of the Faster R-CNN [31]. The module predicts the class of the frame. The frame-level score, ŷ(f, c), is calculated as shown in Equation (3).
Here, f is the frame, c is the class, b is the bounding box, and y(b, c) is the score of the bounding box. γ is an open parameter.
MIL aggregation will produce the vector of the frame-level scores for the object categories. This vector can be finally transformed into the vector of object probabilities using a simple softmax operator: p(f, c) = softmax (ŷ(f, c)).
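The aggregation and the final softmax can be sketched as follows. Since Equation (3) is not reproduced here, the sketch assumes log-sum-exp pooling, a common MIL aggregator in which γ interpolates between the mean (small γ) and the maximum (large γ) of the box scores; this choice, the value of γ, and the scores are assumptions for illustration.

```python
import math


def lse_pool(scores, gamma):
    """Log-sum-exp pooling of the box scores of one class (assumed aggregator)."""
    m = max(scores)  # subtracting the maximum keeps the exponentials stable
    mean_exp = sum(math.exp(gamma * (s - m)) for s in scores) / len(scores)
    return m + math.log(mean_exp) / gamma


def softmax(values):
    """Numerically stable softmax over a list of scores."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def frame_probabilities(box_scores, gamma=5.0):
    """box_scores[c][b] is the score of bounding box b for class c."""
    frame_scores = [lse_pool(row, gamma) for row in box_scores]
    return softmax(frame_scores)


# Toy example with three classes and two bounding boxes.
probs = frame_probabilities([[0.1, 0.2], [3.0, 0.5], [0.0, 0.0]])
print(probs, probs.index(max(probs)))
```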

System Hybridization
To propose a hybridization of the system that is compatible with real-time performance, we have conducted thorough time measurements on different processors to identify the most time-critical modules. The bottleneck is the Scale Invariant Feature Transform (SIFT) detector, which is required in our system for the geometric alignment of gaze points (see Figure 2). The main steps of SIFT are the following: scale-space extrema detection, keypoint localization, orientation assignment, and descriptor generation. For hardware acceleration, we have chosen the Xilinx Zynq UltraScale+ ZCU102 [33] FPGA board, as it supports parallel execution and its energy consumption is very low. In our previous work [5], we proposed a SIFT detector on FPGA. Instead of the Taylor expansion in the keypoint localization step, it uses a non-maximum suppression method to filter out the keypoints which are too close to each other. We use this implementation in the present work.
The other complex module is the CNN for object recognition. Nevertheless, CNN is pre-trained offline for a given set of object categories. The spatial regularity of the CNN inference makes it ideal for FPGA implementation, and hundreds of papers have been published in this area in recent years. The proposed solutions can be divided into two classes: streaming architectures and parametrizable blocks.
The structure of the streaming architectures closely follows the data flow of the given network by connecting templated processing blocks in a pipeline. The input and output of the blocks are data streams (FIFO interfaces), and each operation in the network (e.g., convolution, pooling, nonlinear response, etc.) has a dedicated block for FPGA implementation [34].
The usual template parameters in the case of a convolution block are the number of input and output layers and the size of the convolution window. The input image is fed into the system in a row-wise order, which makes it possible to connect the network directly to a camera input. The latency of the resulting system is low because the convolution blocks can start processing as soon as the first rows required for the computation are available.
The main drawback of the streaming architecture is that all the weights for the computation must be stored on-chip, which is not possible for large networks. In addition, the computation load of the layers is very different. Therefore, different design optimization strategies must be used for each layer, which makes the design process complicated.
Another approach is to use a compiler to break down the entire CNN computation into a series of tensor operations and create parametrizable hardware blocks to efficiently execute them [24,35]. The fundamental building block of these architectures is a matrix-matrix multiplication block, which is usually extended by an additional functional unit to efficiently carry out other operations, such as max pooling and nonlinear transformation. The matrix-matrix multiplication is usually carried out by a systolic array of multiply-accumulate (MAC) units. A critical part of the system is the compiler, which is also responsible for the optimal scheduling of the tensor operations. The input image, network weights, and partial results are stored in off-chip memory, so the network size is not limited by the size of the FPGA device. On the other hand, the latency of the CNN computation is higher in this case because the entire image frame must be captured and stored in the memory before processing is started. The performance of the system might also be limited by the available off-chip memory bandwidth.
Taking into account the real-time constraints and also dissipation power, we implement a hybrid solution both for the preliminary processing steps before feeding gaze-driven CNN and the CNN as well. Referring to Figure 2, the hybridization of the preliminary steps is given in Table 1.

Table 1. Hybridization of the preliminary processing steps.
Module | CPU | FPGA
KDE estimation | X | -
As for the gaze-driven CNN implementation, in accordance with the time measurements for real-time compatibility and with the simplification of the R-CNN input by the channel number reduction we proposed (see Section 2.4), only the ResNet backbone is implemented on FPGA, as depicted in Figure 5. The details of all modules from the input of the CNN to the final aggregation of decisions by MIL are given in Table 2 below.

Table 2. Hybridization of the gaze-driven CNN modules.
Module | CPU | FPGA
Resnet50 | - | X
Reduction layer | X | -
Faster R-CNN | X | -
MIL aggregation | X | -
The reference software implementation of the system was executed on a four-core Intel i5 7300HQ [36] laptop CPU running at 2.5 GHz. This software system is also compiled for the four-core ARM Cortex A53 [37] processor system (PS) of the Xilinx Zynq UltraScale+ XCZU9EG device on the ZCU102 development board. Based on these measurements, the system was partitioned between the PS and the programmable logic (PL) parts of the device. Specialized accelerator circuits were designed for the modules of the proposed system, which cannot be executed fast enough on the ARM Cortex A53 processors. A traditional register-transfer-level (RTL)-based design of a digital circuit is time consuming; therefore, the Xilinx Vitis HLS system was used to create the FPGA-based circuits from a high-level C/C++ description.
We give our measures justifying these choices and the overall results in the next section.

Results
In this section, we discuss the measured computing time of the different steps of the proposed algorithm.

Dataset
The GITW [18] dataset contains egocentric videos recorded by a camera on the eye tracker glasses. It includes the gaze points of where the person was looking at each moment. The videos were recorded in the wild, in real kitchens, by different subjects, and every video was recorded by a subject who grasped a kitchen object.
The acquisition device used was Tobii Glasses 2 (eye tracker) with an egocentric scene camera. The Tobii Glasses video resolution is HD (1280 pixels × 720 pixels), and the video frame rate is 25 fps. There are 16 different kitchen objects in the videos, including bowl, plate, wash liquid, vinegar bottle, milk bottle, oil bottle, glass, lid, saucepan, frying pan, and mug. Different subjects recorded the dataset in five different kitchens. The videos are short, around 10 s long. The GITW [18] dataset contains 404 videos overall. The dataset is freely available for research.
We carried out the time measurements on a subset of the GITW dataset containing fifteen videos of "grasping a bowl" actions, recorded by four different subjects. The kitchen environments are of different complexity, from a scene with just a few objects, such as in the BowlPlace1 videos, to a highly cluttered scene, such as in BowlPlace4. The bowl class had a strong intraclass variance: the bowls differ in colour and material, and one of them is even transparent. The lighting conditions and the visibility also differ. Moreover, we sometimes obtained strong blurring effects due to the motion of the camera, which was worn on the person's body.

Geometric Alignment Measurements
For the completeness of the time measurements of the whole system, we present here the result of our previous work [22]. The time measurements of the geometric alignment module are given in Table 3. Version 4.5.5 of the OpenCV [38] library was used during this experiment. The geometric alignment consists of a SIFT [27] keypoint extractor, a FLANN matcher [28], and a homography estimator. In the first part of Table 3, we give measurements on the embedded ZCU102 board. The left-most column of Table 3 contains the name of the video file. The SIFT points have been detected in the mask, centred on the estimated gaze fixation point in each frame. The radius of the mask was chosen to encompass approximately 100 points. The second column contains the mean mask radius with standard deviation. For the geometric alignment by homography, we detected keypoints in two video frames: the current and previous reference frames. In the next columns, we give time figures on ARM A53 processors for keypoint (KP) computation on one frame, the matching time, and the homography computation time.
In Table 3, the second column contains the number of detected SIFT points with the corresponding mask radius. We also present it as the mean and standard deviation on the whole video. The time figures are given for general purpose Intel processors.
The matcher, the homography estimator, and the gaze projection on ZCU102 are fast enough for real-time processing, as illustrated in Table 3. The worst-case scenario was 0.024 s, for the FLANN matcher [28], which means that the frame rate does not fall below 40 fps. This speed is enough for controlling a robotic arm.
However, the SIFT keypoint extractor was slower than the required processing time. While the worst-case scenario on the Intel i5 7300HQ CPU took 0.072 s, which is around 13.81 fps, on the ARM A53, it took 0.866 s, which is around 1.15 fps. For real-time processing, a rate of at least 10 fps is required.

Kernel Density Estimation
Table 4 illustrates a comparison of the estimated time of KDE computation between the Intel i5-7300HQ and the Xilinx ZCU102 ARM Cortex A53. The second column contains the number of gaze points available during the gaze point estimation for a frame. The Intel i5-7300HQ [36] computes the KDE at 80 fps on average, and the ARM A53 [37] computes the KDE at 7.9 fps on average. In some critical cases, when the scattering of the subject's gaze fixations is too strong, the computation time exceeds the real-time budget and reaches 3.9 s per frame (see the "Lid" sequence). Evidently, in such a case of highly cluttered scenes and problems of ocular movements, our system shows its limits. The problem is caused by outlier gaze fixation points, which fall far away from the majority, increasing the KDE search area. The solution might be to use a simple clustering algorithm to find the outlier gaze fixation points and discard them. Since only the last 10 gaze fixation points are used, we think this clustering can be carried out in a short time.
However, if the projected gaze fixations in the current frame are sufficiently close (within a radius of approximately 10 pixels, which is the "normal case"), the ARM A53 [37] can compute the KDE in real-time.
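The outlier rejection suggested above could be sketched as follows. This is not the authors' implementation: the sketch replaces a clustering algorithm with a simple distance test around the coordinate-wise median, and the 10-pixel radius is borrowed from the "normal case" as an assumption.

```python
import math
import statistics


def discard_outliers(points, radius=10.0):
    """Keep the gaze points that lie within `radius` pixels of the median point."""
    mx = statistics.median(p[0] for p in points)
    my = statistics.median(p[1] for p in points)
    kept = [p for p in points if math.hypot(p[0] - mx, p[1] - my) <= radius]
    return kept or list(points)  # fall back to all points if none survive


points = [(100, 100), (101, 100), (100, 101), (99, 100), (100, 99), (140, 90)]
print(discard_outliers(points))
```

Removing the distant fixation shrinks the bounding box of the KDE search from 42 × 12 to 3 × 3 pixels in this toy example.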

Bounding Box Generation Time Measurements
The bounding box generation is fast on the Intel i5 7300HQ CPU. On average, one frame is processed in 0.42351 ± 0.01991 milliseconds, which corresponds to about 2360 fps. The embedded ARM A53 processor is also fast enough to generate bounding boxes in real-time. The average computation time was 2.659 ± 0.027 milliseconds, which is more than 376 fps.
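The conversion between per-frame processing time and frame rate used throughout this section can be checked with a few lines of Python. The 40 ms per-step budget follows from the 25 fps acquisition rate; the helper names are hypothetical.

```python
FRAME_RATE_FPS = 25.0
STEP_BUDGET_MS = 1000.0 / FRAME_RATE_FPS  # 40 ms per processing step


def to_fps(time_ms):
    """Frame rate that corresponds to a per-frame processing time in milliseconds."""
    return 1000.0 / time_ms


def within_step_budget(time_ms):
    """True if a processing step fits into one video acquisition period."""
    return time_ms <= STEP_BUDGET_MS


# Average bounding box generation times reported above, in milliseconds.
for name, time_ms in [("Intel i5 7300HQ", 0.42351), ("ARM A53", 2.659)]:
    print(name, round(to_fps(time_ms), 1), within_step_budget(time_ms))
```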

Gaze-Driven Object-Recognition CNN Time Measurements
Here, all measurements were taken with PyTorch 1.6 [39]. The measurements in Table 5 show that the most time-consuming part of the CNN is the Resnet50 backbone. In every case, the backbone processes a frame in about 0.09 s on the Intel i5 7300HQ CPU, which corresponds to 11 fps. On the ARM A53 processor (see Table 6, second column), this time is even higher: about 1.8 s, which gives roughly 0.55 fps. This is below the required computational speed. A higher number of channels increases the computational complexity of the reduction layer and of the Region of Interest (ROI) heads, as shown in Tables 5 and 6. Nevertheless, with a reasonable number of channels after the reduction, not exceeding 128, these blocks run in real-time, with 82 fps for channel reduction and 25 fps for ROI heads.

Table 5. Measurements of the gaze-driven object-recognition CNN on the Intel i5 7300HQ CPU. The first column contains the remaining number of channels after the reduction layer. The other columns show the elapsed computation time in milliseconds.

The slowest part of the system was thus the backbone; therefore, it was implemented in FPGA. The accelerated Resnet50 CNN on the ZCU102 can process an image in 0.02686 s, which is 37.23 fps. This is high enough for real-time processing. Table 6 reports the corresponding measurements on the ARM A53 CPU.

Table 6. Measurements of the gaze-driven object-recognition CNN on the ARM A53 CPU. The first column contains the remaining number of channels after the reduction layer. The other columns show the elapsed computation time in milliseconds.
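To give an intuition of why the number of channels drives the cost of the blocks that follow the backbone, the sketch below counts multiply-accumulate (MAC) operations. It rests on several assumptions that are not stated in this section: the reduction layer is modeled as a 1×1 convolution, the layer consuming its output as a 3×3 convolution with 512 output channels, and the feature map size of 38×50 is hypothetical. Only the 1024 backbone channels and the reduced channel counts come from the text.

```python
def conv_macs(height, width, c_in, c_out, kernel=1):
    """MAC count of a stride-1 convolution whose output keeps the input size."""
    return height * width * kernel * kernel * c_in * c_out


# Hypothetical feature-map size; 1024 backbone channels come from the text.
H, W, C_BACKBONE = 38, 50, 1024

for c_red in (1024, 512, 256, 128):
    # Assumed 1x1 reduction layer: 1024 -> c_red channels.
    reduction = conv_macs(H, W, C_BACKBONE, c_red)
    # Assumed 3x3 layer that consumes the reduced feature map.
    consumer = conv_macs(H, W, c_red, 512, kernel=3)
    print(f"{c_red:5d} channels: reduction {reduction:.3e} MACs, "
          f"consumer {consumer:.3e} MACs")

# Cost ratio of the consuming layer between 1024 and 128 input channels.
ratio = conv_macs(H, W, 1024, 512, 3) / conv_macs(H, W, 128, 512, 3)
print(ratio)
```

Under these assumptions, the cost of both layers grows linearly with the number of retained channels, which is consistent with the trend reported in Tables 5 and 6.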

Gaze-Driven Faster RCNN Accuracy
As Table 7 and Figure 7 show, the current architecture performs sufficiently well on our real-world data. Reducing the number of channels to 128 does not degrade the classification accuracy much compared with the initial 1024 feature channels of the backbone, as Table 7 shows. The average accuracy and loss are computed per class of objects.
Table 8 compares our method with state-of-the-art object recognition methods. The state-of-the-art methods, such as lightweight YOLO V3 [20] and SSD MobileNet V2 [12], are trained on the COCO and VOC datasets, whereas we have a specific and very cluttered kitchen environment. For this reason, we do not think that these object detectors are suitable in our case. In terms of computational time [40], implemented on the same architecture, they are faster: 13.2 fps for YOLO V3 [20] and 78.8 fps for SSD MobileNet V2 [12]. In our work, we take advantage of the real-time availability of gaze fixations, which can drive object localization. However, the current CPU implementation of the KDE makes the system slower: we obtain 12.64 fps for object recognition and localization. The bottleneck is the KDE computation, which we are now improving. Nevertheless, our current computation times are compatible with real-time prosthesis control.
Table 9 reports the average computation time of the system in milliseconds. The first row contains the module names, the second row the Intel i5 7300HQ [36] CPU results, the third row the ARM A53 [37] embedded CPU results, and the fourth row the hybrid (ZCU102 [33] and ARM A53 [37]) results.
The total computation time is 182.782 ms on the Intel i5 7300HQ, which is 5.471 fps. The ARM A53 [37] embedded CPU is the slowest, needing 2868.066 ms per frame, which is 0.349 fps. The hybrid embedded solution processes a frame in 236.507 ms, which is 4.228 fps. The hybrid embedded solution is thus nearly as fast as the Intel i5 7300HQ [36], while its power consumption of 5.6 W is far below the 45 W of the Intel i5 7300HQ [36] CPU.
The measurements show that the current experimental setup, with the whole chain of modules executed sequentially, is not yet suitable for real-time processing. However, by pipelining the modules, real-time processing speed is achievable at the cost of some delay.
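The frame rates above follow directly from the total frame times, and the same figures give a rough energy cost per frame. The sketch below reproduces this arithmetic under the assumption that each platform draws its quoted power (45 W and 5.6 W) constantly while processing, which is a simplification. The ARM A53-only configuration is omitted because its power draw is not given here.

```python
def fps(frame_time_ms):
    """Frames per second for a given per-frame time in milliseconds."""
    return 1000.0 / frame_time_ms


def energy_per_frame_j(frame_time_ms, power_w):
    """Energy per frame in joules, assuming constant power draw."""
    return power_w * frame_time_ms / 1000.0


# (total frame time in ms, power in W), taken from the text.
platforms = {
    "Intel i5 7300HQ": (182.782, 45.0),
    "Hybrid ZCU102 + ARM A53": (236.507, 5.6),
}

for name, (t_ms, p_w) in platforms.items():
    print(f"{name}: {fps(t_ms):.3f} fps, "
          f"{energy_per_frame_j(t_ms, p_w):.2f} J per frame")
```

Under this constant-power assumption, the hybrid solution spends several times less energy per processed frame than the laptop CPU, even though its frame time is slightly longer.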

Conclusions and Perspectives
In this paper, we have proposed a hybrid implementation of a visual analysis part for visual servoing of a prosthetic arm. The system was partitioned between the FPGA fabric and the ARM Cortex A53 processors of the Xilinx ZCU102 development board, based on the computing performance measurements of the building blocks. As a reference, the computing time of each image processing step was also measured on a laptop microprocessor and its power dissipation was estimated.
The measurements show that the gaze point alignment steps are fast enough on the ARM Cortex A53 [37] embedded CPU, except for the SIFT [27] point extraction step. Therefore, the SIFT [27] detection module is implemented on the programmable logic part of the Xilinx ZCU102 [33] FPGA board.
In some cases, we find that the variance of the KDE computing time in our current setup is very high and slows down processing. In these scenes, most of the gaze points are located over the object to grasp, except one or two, which are scattered around the image due to the saccadic movement of the eye. To overcome this problem, we plan to apply outlier filtering by clustering before the KDE computation.
The gaze-driven CNN is built from four modules: Resnet50 [13], the reduction layer, Faster R-CNN [10], and multiple instance learning (MIL) aggregation. Resnet50 [13] was accelerated on FPGA because its measured speed on the ARM Cortex A53 processor was only 0.55 fps; the FPGA implementation raised it to 37.23 fps. The Faster R-CNN is also slow, providing only 3.5 fps when the number of input channels is 1024. We thus proposed a new reduction layer between the Resnet50 [13] and the Faster R-CNN [10] to reduce the number of input channels of the latter block. The frame rate increases to 25 fps when the reduction layer reduces the number of input channels of the Faster R-CNN to 128. The experiments show that the accuracy with only 128 channels is still high enough for the bounding box computation.
The experimental setup with the whole chain of modules is not suitable for real-time processing (236.507 ms on average, or approximately 4 fps). However, the throughput can be improved by pipelining the system so that each stage processes a different frame, because each block can finish processing an image within 40 ms. The drawback of pipelining is increased latency. The latency of our current system is around 250 ms, which is higher than the latency allowed by the control of the robotic arm (∼100 ms) and is mainly caused by the KDE block. In the future, the KDE search algorithm will be optimized.
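The difference between throughput and latency in a pipelined chain can be made concrete with the following sketch. The stage times are hypothetical values chosen only to respect the 40 ms per-block bound quoted above; they are not the measured per-module times. The model is idealized: stages run concurrently on different frames with no transfer overhead.

```python
def sequential_fps(stage_ms):
    """Frame rate when one frame passes through all stages before the next starts."""
    return 1000.0 / sum(stage_ms)


def pipelined_fps(stage_ms):
    """Frame rate when stages overlap: limited by the slowest stage."""
    return 1000.0 / max(stage_ms)


def latency_ms(stage_ms):
    """Minimum delay between a frame entering and leaving the chain."""
    return sum(stage_ms)


# Hypothetical per-stage times in milliseconds, each at most 40 ms.
stages = [35.0, 24.0, 40.0, 3.0, 27.0, 40.0, 38.0]

print(f"sequential: {sequential_fps(stages):.2f} fps")
print(f"pipelined:  {pipelined_fps(stages):.2f} fps")
print(f"latency:    {latency_ms(stages):.1f} ms")
```

Pipelining raises the frame rate to the reciprocal of the slowest stage time, but the latency remains at least the sum of all stage times. Meeting a latency budget of about 100 ms therefore requires shortening the stages themselves, which is why optimizing the KDE block matters.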
The power consumption and processing speed for the different architectures show that the embedded system, accelerated with FPGA, is a feasible solution for creating a wearable device.

Data Availability Statement:
Publicly available datasets were analyzed in this study. These data can be found here: Grasping-in-the-wild (GITW) dataset at NAKALA CNRS server https://www.labri.fr/projet/AIV/graspinginthewild.php, accessed on 30 December 2021.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: