A Cost-Effective Person-Following System for Assistive Unmanned Vehicles with Deep Learning at the Edge

The vital statistics of the last century highlight a sharp increase in the average age of the world population, with a consequent growth in the number of older people. Service robotics applications have the potential to provide systems and tools that support autonomous, self-sufficient older adults in their homes in everyday life, thereby avoiding the need for third parties to monitor them. In this context, we propose a cost-effective modular solution to detect and follow a person in an indoor, domestic environment. We exploited the latest advancements in deep learning optimization techniques, and we compared different neural network accelerators to provide a robust and flexible person-following system at the edge. Our proposed cost-effective and power-efficient solution is fully integrable with pre-existing navigation stacks and lays the foundations for the development of fully autonomous and self-contained service robotics applications.


I. INTRODUCTION
Person-following is a well-known problem in robotic autonomous navigation that consists of the ability to detect and follow a target person with a mobile platform. This task can be achieved with a variety of sensing and moving systems and plays a fundamental role in a variety of applications in domestic, industrial, underwater and aerial scenarios [1]. Due to the sharp increase in life expectancy in the last century, the world population has seen a progressive increase in the number of older people [2]. This trend offers an excellent opportunity for developing new service robotics applications that provide continuous assistance to autonomous elders in everyday life. These robotic platforms should be able to identify the target person and follow them to offer their support. Person-following thus assumes a vital role as an enabling technology for a variety of different applications. In this context, it is essential to develop a system that is robust to different domestic scenarios and efficient enough to be implemented on low-power devices, without the need for external computing devices. Moreover, the ability to run a person-following algorithm entirely onboard makes the system less prone to security and privacy issues, avoiding unnecessary transmission of sensitive information, such as domestic camera streams.
Generally speaking, a person-following system is composed of a sensing device, a detection algorithm able to provide an estimate of the target position and a following algorithm to control the robot's movements. Indoor robotic platforms use a variety of perception devices, divided into exteroceptive, such as cameras, LiDARs and ultrasonic sensors, and proprioceptive, such as inertial measurement units (IMUs), gyroscopes, accelerometers and encoders. Different solutions to the detection problem can be found in the literature, depending on the sensors used, on the application scenario and on the type of approach [1]. Recent developments in deep learning techniques [3] for computer vision gave a significant boost to the ability to efficiently extract meaning from visual information and inspired several solutions for the person-following problem.
Inspired by these approaches, we employed the popular deep learning object detection algorithm YOLOv3-tiny [4], suitably re-trained for the specific task, to detect the target person from an RGB-D frame and compute his location with respect to the robot reference frame. The extracted information was then fed to an efficient control algorithm that generated suitable linear and angular commands for the robot actuators to achieve person-following. Moreover, we tested the proposed approach on several embedded platforms designed explicitly for edge AI, which consists of deploying artificial intelligence algorithms on low-power devices. We compared the obtained results with a particular focus on the trade-off between performance and power consumption. The overall proposed solution represents a cost-effective, low-power pipeline for the person-following problem that can be easily employed at the edge as a primary component in complex service robotics tasks.
II. RELATED WORKS
Related literature is organized as follows. Firstly, several methods for person-following are analyzed, with a focus on the sensing devices used and on the strategies used to detect the target. Then, deep learning techniques for object detection are briefly discussed, with attention paid to recent developments in edge AI implementations.

A. Person Following
The task of recognizing and localizing a person to be followed by a robotic platform has been widely discussed in the literature since the nineties. Islam et al. [1] reviewed and categorized a large number of works focused on achieving person-following in a variety of conditions, such as ground, underwater and aerial scenarios. As regards terrestrial applications, the different methods are mainly classified by the kinds of devices used to sense the environment and by the strategy used to detect the target person.
Most ground applications use a simple unicycle model that controls the robot 2D motion in polar coordinates, with a linear velocity on the xy plane and an angular velocity about the z-axis [5]. The chosen detection system should, therefore, be able to find the target position and distance from the robot. Several systems use laser range finder (LRF) measurements that directly provide a set of distances, which are clustered and interpreted to extract relevant features. People are usually localized by means of leg [6,7,8,9,10,11] or torso identification [12,13,14]. However, these methods mainly rely on static features extracted from 2D point clouds, which frequently leads to poor detection quality. Visual sensors are much more informative since they allow one to sense the entire body of the target, but simple RGB cameras are not enough since a distance measure is also needed. The two main categories of visual sensors able to capture depth information are stereo and RGB-D cameras. Several works [15,16,17,18,19,20] use the first approach to approximate the distance information by triangulation methods applied to two or more RGB views of the same scene. However, the most used visual sensors for person detection are RGB-D cameras [21,22,23,24,25,26,27,28,29,30,31,32], which are able to get both RGB images and depth maps by exploiting infrared light. Several methods employ sensor fusion techniques to merge information from different kinds of sensing systems. For example, Alvarez et al. [33] used both images to detect the human torso and lasers to track the legs; Susperregi et al. [34] used an RGB-D camera, lasers and a thermal sensor; and Wang et al. [35] used a monocular camera with an ultrasonic sensor. Hu et al. [36] adopted a human walking model using a combination of RGB-D data, LRF leg tracking and robot odometry, together with a sonar sensor for obstacle avoidance during navigation. Koide et al. [14] used LRF data to detect people in the scene, and then cameras to identify them and extract relevant features. Cosgun et al. [8], on the other hand, manually selected the target from an RGB-D view of the environment, and then tracked it with LRF leg identification. Merging data from multiple sensors increases detection accuracy, but at the price of a large increase in the system's complexity and cost. Furthermore, the presence of multiple sources of data requires hardware with high computational power to enable real-time processing. Since our focus was on developing an embedded, cost-effective, low-power system, we selected a low-cost RGB-D camera as the only sensing device.
Focusing on vision-based methods, different strategies can be adopted to detect the person in the environment. Mi et al. [26], Ren et al. [25] and Chi et al. [29] all adopted the Microsoft Kinect SDK, which directly provides the skeleton position. Satake et al. [16,17] used manually designed templates to extract relevant features and find the target location. Munaro et al. [23], Brookshire [15] and Basso et al. [22], instead, adopted the histograms of oriented gradients (HOG) method for human detection originally proposed by Dalal et al. [37]. More recently, machine learning techniques have been used to solve the person-following task. Chen et al. [19] used an online AdaBoost classifier initialized on a manually selected bounding box of the target person. Chen [31], instead, used an upper-body detector based on an SVM to get the human position and extract relevant features used during the tracking phase. Subsequently, deep learning models have been employed to further boost detection accuracy. Chen et al. [18] proposed a CNN-based classifier trained on a manually selected target with an online learning procedure. Masuzawa et al. [28] adopted the YOLO method [38] to identify the person due to its high precision and recall rates. Wang et al. [20] also employed YOLO as a person detector, but only to predict the initial position of the target, since they were not able to run the algorithm in real time due to hardware limitations. Jiang et al. [30] jointly used a DCNN-based detector and a PN classifier based on random forests to enhance person localization and tracking. Finally, Yang et al. [32] used a DNN to identify a bounding box image to be scored against the pre-registered user image.
As in [20,28], we propose a person-following approach based on the YOLO network, but our methodology differs from theirs. We use a newer and smaller version of YOLO (YOLOv3-tiny), and the re-training and optimization of the network involve:
- Eliminating the tracking stage and any related additional filter, thanks to the continuous detection of the target, thereby reducing the computational complexity of the solution;
- Running the detection at the edge, so that it can be easily realized on a neural accelerator board, without adding an expensive onboard computer (low-cost).

B. Deep Learning for Real-Time Object Detection
Object detection is a field of computer vision that deals with localizing and labeling objects inside an image. Before the recent huge developments in deep learning techniques, object detection was classically performed with machine learning methods such as the cascade classifier based on Haar-like features [39] or coupled with feature extraction algorithms like the histograms of oriented gradients (HOG) [37,40]. With the recent developments in deep learning, considerable improvements in both the accuracy and efficiency of object detection algorithms have been achieved. Current techniques are split into region proposal methods and single-shot detectors. The former first identify areas inside the image that most likely contain objects of interest, and then feed them to a second stage that predicts label and bounding box dimensions. In this category, we find algorithms such as R-CNN [41], fast R-CNN [42] and faster R-CNN [43]. Single-shot detectors, on the other hand, treat the detection task as a regression problem and directly perform both localization and labeling in a single stage. This makes them generally faster than region proposal techniques, but slightly less accurate. The best-known single-shot detectors are SSD [44] and YOLO [38], with its evolutions YOLOv2 [45], YOLOv3 [4] and YOLOv4 [46]. Lightweight versions of these methods, such as YOLOv3-tiny, have been specifically developed to be implemented in low-power real-time systems and are therefore the most suitable for service robotics applications.
Recently, several advancements have been made in edge AI, where deep neural networks are deployed on low-power real-time embedded systems [47]. This field of research has principally flourished thanks to the release of hardware platforms specifically designed to accelerate deep neural network inference. NVIDIA released boards with onboard GPUs, such as Jetson TX2, AGX Xavier and Nano. Intel produced two generations of USB hardware accelerators called Neural Compute Stick (NCS), and recently Google released its own Coral board and USB accelerator, able to boost inference performance using Tensor Processing Unit (TPU) chips. Several works in the literature apply optimization techniques to object detection algorithms to deploy them on embedded devices with hardware acceleration [48,49,50,51,52].
In our work, we fine-tuned a pre-trained YOLOv3-tiny network for the person detection task, and we propose a cost-effective person-following system that can generate suitable velocity commands for the robotic actuators, based on RGB-D images. The proposed methodology was extensively tested with several edge AI devices in order to compare performance and power consumption for the different possible configurations. Finally, a potential implementation of the proposed system was integrated and tested with a standard robotic platform.
The rest of the paper is organized as follows. Section III presents the dataset used in the re-training and the hardware setup. Section IV discusses the proposed methodology with an extensive description of the detection mechanism and the control algorithm. Finally, Section V presents the experimental discussion, the performance comparison on the considered hardware platforms and the final complete implementation.

III. MATERIALS AND DATA
The network adopted was pre-trained with the COCO dataset, which contains 80 classes of objects with their respective bounding boxes and labels. Subsequently, a technique named transfer learning [53] was adopted to re-train and fine-tune the network using a smaller dataset composed of the person class only. In this way, a custom version of the YOLOv3-tiny network was produced and optimized for accurate and efficient detection. That network was tested and compared with the original model on standard evaluation metrics. The deep learning network was evaluated on different edge AI devices by assessing the performance of each of them in terms of inference speed and power consumption. Finally, a specific hardware solution was selected to assemble a robotic platform and test it in a real environment.

A. Data Description
The images of people used to create the person dataset were extracted from the OIDv4 [54] dataset, which is divided into training, validation and testing subsets. During the re-training phase, we used 6001 images, divided into a training set of 5401 images and a test set of 600 images. The images were resized to 416 × 416 during training, in order to be more coherent with the pre-trained input dimension of the original network and with the native resolution, 480 × 640, of the depth camera.

B. Hardware Description
The main requirement to fulfill is a real-time response in each step that makes up this robotic application: person detection, data processing, control of the robot and navigation in the indoor environment. All these operations must be as fast as possible to avoid losing the person to follow, which is highly probable if the person moves out of the robot's view.
As regards the embedded implementation of the neural network, different platforms were evaluated; they are shown in Figure 1 and summarized in Table I: a Raspberry Pi 3 B+ with Intel NCS, a Raspberry Pi 3 B+ with Movidius NCS2, a Coral USB Accelerator, an NVIDIA Jetson AGX Xavier developer kit and an NVIDIA Jetson Nano.
The Neural Compute Sticks (NCS) are dedicated USB hardware accelerators specifically used to perform deep neural network inference. Both the first and the second generations of the NCS have been tested: the first has a Myriad 2 Vision Processing Unit (VPU), while the second has a Myriad X VPU and reaches eight times the performance of the previous version. These two components require a USB 3.0 or 2.0 interface, so they can be easily used with cheap single-board computers such as a Raspberry Pi board.
The Coral USB Accelerator, released in 2019, is an onboard Edge TPU coprocessor able to reach high-performance machine learning inference, with a limited power cost, for TensorFlow Lite models. The device can work at two clock frequencies: maximum or reduced. The maximum frequency is twice the reduced one, so using it increases the inference speed, with a consequent increase in power consumption.
The NVIDIA Jetson AGX Xavier, released in 2018, is a System-on-Module able to guarantee high performance and power efficiency. The board contains DRAM, CPU, PMIC, flash memory storage and a dedicated GPU for hardware acceleration, so it has been specifically designed to rapidly perform different neural network operations. The kit is also supplied with several software libraries, such as NVIDIA JetPack, DeepStream SDK, CUDA, cuDNN and TensorRT. It is possible to set different power mode configurations, which also select the number of CPU cores used: 10 W (2 cores), 15 W (4 cores), 30 W (2, 4, 6 or 8 cores).
The NVIDIA Jetson Nano is a lightweight, powerful computer explicitly designed for AI in order to run multiple neural networks in parallel for image processing. The board mounts a 128-core NVIDIA Maxwell GPU, a quad-core ARM Cortex-A57 MPCore CPU and 4 GB of LPDDR4 memory, and reaches a peak performance of 472 GFLOPs. It can work in two power modes, 5 W or 10 W, and does not have Tensor Cores to support inference acceleration.
The complete hardware selected for testing in the test environment is an upgrade of the TurtleBot3 Waffle Pi from ROBOTIS, a standard robotic platform supported by ROS and widely used by developers. This robot uses the Jetson Xavier developer kit as its onboard PC, and it has been equipped with an additional camera sensor, an Intel RealSense Depth Camera D435i. This low-cost depth camera, ideal for navigation or object recognition applications, is composed of an RGB module and two infrared cameras separated by a wide IR projector.
IV. PROPOSED METHODOLOGY
The goal of this research was to develop an autonomous, real-time, person-following assistive system able to promote the aging-in-place of independent elderly people.
The workflow of our solution is as follows. First, the person is detected and localized in the environment, using the RGB-D information from the camera and a neural network specifically designed for fast and accurate real-time object detection, YOLOv3-tiny. Subsequently, the information about the position of the person with respect to the robot reference frame is fed to a tailored control algorithm, which uses a linear trend for the linear velocities and a parabolic trend for the angular velocities. In this way, the robot can follow the person while remaining at a certain safe distance from him, thereby avoiding both hitting and losing the target. The pseudo-code of the overall algorithm is reported in Algorithm 1.
The proposed algorithm, in the practical implementation, is integrated with the open-source Robot Operating System (ROS) to send the control commands to the actuators of the robot. The result is an autonomous, cost-effective person-following system with deep learning at the edge, easily integrable in different unmanned vehicles.

A. Person Localization
By using a re-trained version of YOLOv3-tiny for object detection and the RGB-D camera chosen for this application, it is possible to detect and localize a person in space interactively. In fact, once the network is optimized, the precision and recall values allow continuous detection of the target without the use of a tracking algorithm or an additional filter to support the control implementation. The network alone is therefore sufficient to realize real-time person-following, while reducing the computational cost and the power consumption. These considerations are supported both by the average precision AP50 obtained from the re-trained network and by the specific use case taken into account: a self-sufficient older person in his or her home environment. As the target to follow is an elderly person, his or her walking speed is low, so a tracker is superfluous and the control is smooth, which is further helped by the reduced speed of the robot.
Object detection is an important area of research, concerned with the processing of images and videos to detect and recognize objects. You only look once (YOLO) [4,38,45] is an object detection method commonly used in real-time image processing applications. This model, based on a feed-forward convolutional neural network, is considered an evolution of the single-shot multibox detector (SSD) concept, with the idea of predicting both the bounding boxes and the class detection probabilities by analyzing the image only once. Its architecture is based on a single neural network trained end-to-end to increase the accuracy and to reduce the number of false positives predicted on the background.
The operations done by the network can be divided into four steps:
1) The input image is processed with a grid of cells as a reference frame.
2) Each grid cell generates bounding boxes and predicts their confidence rate. The confidence rate depends on the accuracy of the network during the detection.
3) Each grid cell has a probability score for each class. The number of classes depends on the dataset used during the training process of the network.
4) The total number of bounding boxes is reduced by setting a minimum confidence rate and using the non-maximum suppression (NMS) algorithm to obtain the final predictions, which are used to generate the final output: the input image with the bounding boxes over the detected objects, together with their classes and confidence percentages.
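Step 4 can be illustrated with a minimal sketch of confidence filtering followed by greedy NMS. The function names, the box format (x_min, y_min, x_max, y_max) and the two threshold values are assumptions chosen for illustration; they are not taken from the YOLOv3-tiny implementation used in this work.

```python
from typing import List, Sequence, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: Sequence[Box], scores: Sequence[float],
        conf_thresh: float = 0.5, iou_thresh: float = 0.45) -> List[int]:
    """Return the indices of the boxes kept, highest score first.

    The threshold defaults are hypothetical values, not the paper's.
    """
    # Drop low-confidence boxes, then visit the rest by decreasing score.
    order = sorted((i for i, s in enumerate(scores) if s >= conf_thresh),
                   key=lambda i: scores[i], reverse=True)
    kept: List[int] = []
    for i in order:
        # Keep a box only if it does not overlap too much with a kept one.
        if all(iou(boxes[i], boxes[j]) <= iou_thresh for j in kept):
            kept.append(i)
    return kept
```

With a single class, as in the re-trained network, one pass of this procedure over all candidate boxes is enough; a multi-class detector would typically repeat it per class.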
The YOLO architecture has improved incrementally across its versions, starting from YOLOv2, which includes many features to increase the performance, up to YOLOv3 and YOLOv4, the last two versions of the model, which notably improve the capability of the network to detect objects. In our proposed methodology, we suggest a re-trained and optimized version of YOLOv3-tiny, which is the lightweight version of YOLOv3 with a reduced number of trainable parameters. Table II shows the structure of the modified YOLOv3-tiny architecture for the person class only.

1) Person Detection and Localization Implementation:
The RGB camera frames are the input data that feed the re-trained YOLOv3-tiny network for the class person. As already introduced, the full input image is divided into functional regions represented as a grid of cells. In each region, bounding boxes are weighted by the predicted probabilities, and the predictions are the result of a single network evaluation.
In order to localize the position of the person in the video frame, it is sufficient to use the bounding box information provided by the network. As depicted in Figure 2, the bounding box of the detected person directly provides the coordinates of its corner, x, y, with respect to the R0 reference frame.
By using the coordinates of the bounding box center with respect to R0, it is possible to calculate the coordinates (x_p, y_p) of the midpoint of the detected person with respect to the new reference frame R1, whose origin lies at the image center (x_c, y_c). Here x_c and y_c are respectively 319 and 239 pixels due to the image resolution, 640 × 480, of the camera taken as reference in this study. The x_p and y_p coordinates are necessary to locate the person in the 2D space, and x_p in particular is fundamental to develop the angular control algorithm, which adjusts the rotation of the robot.
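The change of reference frame can be sketched as follows. This is only an illustration under stated assumptions: R0 is assumed to have its origin in the top-left corner of the image with x growing rightwards and y growing downwards, the bounding box is assumed to be given by its two opposite corners, and the signs are chosen so that x_p is positive when the person is on the left of the image center, in agreement with the convention used later for the angular control. The sign of y_p and all function names are hypothetical.

```python
# Image center for a 640 x 480 frame, as stated in the text.
X_C, Y_C = 319, 239


def bbox_center(x_min, y_min, x_max, y_max):
    """Center of a bounding box expressed in the image frame R0 (pixels)."""
    return (x_min + x_max) / 2.0, (y_min + y_max) / 2.0


def to_centered_frame(x, y, x_c=X_C, y_c=Y_C):
    """Express a pixel of R0 in the frame R1 centered on the image.

    Assumed convention: x_p > 0 to the left of the center and
    y_p > 0 above it.
    """
    return x_c - x, y_c - y


# Usage: a person whose box spans columns 60-140 lies left of the center.
cx, cy = bbox_center(60, 100, 140, 400)
x_p, y_p = to_centered_frame(cx, cy)
```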
Once the pixel coordinates of the center of the bounding box detecting the person are known, it is possible to obtain the distance between the robot and the person by simply extracting this information from the corresponding pixel of the depth camera matrix. The dimension of that matrix is equal to the resolution of the acquired RGB frame, and for each pixel position, there is a value in millimeters representing the distance between the camera and what it sees. Since the limits of the depth computation of the camera taken as reference in this study are 0.105 m and 8 m, the values of the depth matrix range from 0 to 8000. The extracted depth value represents the missing coordinate needed to localize a person in the 3D environment. The z coordinate is strictly necessary to realize the linear velocity control algorithm, which regulates the forward or backward movement of the robot.
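A minimal sketch of the depth lookup is given here, assuming the depth matrix is available as a row-major array of millimeter values indexed as depth[row][column]. Returning None for readings outside the 0.105 m to 8 m range is an assumption made for this example, as is the function name.

```python
def distance_at(depth_mm, x, y, min_mm=105, max_mm=8000):
    """Distance in meters at pixel (x, y), or None if it is not usable.

    depth_mm is assumed to be indexed as depth_mm[row][column], with
    values in millimeters. Readings outside the camera range quoted in
    the text (0.105 m to 8 m) are treated as invalid (assumption).
    """
    row, col = int(round(y)), int(round(x))
    if not (0 <= row < len(depth_mm) and 0 <= col < len(depth_mm[0])):
        return None  # the requested pixel lies outside the frame
    value = depth_mm[row][col]
    if value < min_mm or value > max_mm:
        return None  # no valid depth at this pixel
    return value / 1000.0


# Usage on a tiny 2 x 2 "depth image".
depth = [[0, 1500],
         [9000, 105]]
d_m = distance_at(depth, 1, 0)
```

Reading a single pixel keeps the computation cheap, which matches the approach described above; a median over a small window around the center would be a possible variation when the depth map is noisy.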
2) Detection Situation Rules: Three possible detection situations are taken into account by checking the output value, N_detection, of the network:
1) Nothing detected: If nothing is detected, the robot stops. If, instead, the robot loses the person it was following, it continues to move in the direction of the last detection with the previous velocity commands for a preset time t. After that, if still nothing is detected, the robot stops.
2) One person detected: The robot follows the movement of the person while remaining at a certain safe distance from him.
3) More than one person simultaneously detected: The robot stops for a preset time and then restarts the normal detection operation.
There could be many other solutions to implement, for example, a person-tracking algorithm to follow one of the people detected [55,56,57]. In particular, a tracking algorithm or an additional filter will have to be considered if the use case changes, for example, if this approach is used in an office environment. However, for this specific application, it has been decided to stop the robot directly because it has been assumed that the person using it is self-sufficient and lives in the house alone. Therefore, if the person receives visits, it would be unpleasant and unnecessary to have a robot following him inside the house.
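The three rules can be summarized in a small decision function. The sketch below is a hedged illustration rather than the implementation used on the robot: the state fields, the timeout of 2 s after losing the target and the pause of 5 s with several people in view are hypothetical values, since the text only states that both times are preset.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Command = Tuple[float, float]  # (linear velocity, angular velocity)
STOP: Command = (0.0, 0.0)


@dataclass
class FollowerState:
    last_cmd: Command = STOP            # last command issued while following
    last_seen: Optional[float] = None   # time of the last single detection
    paused_until: float = 0.0           # end of the multi-person pause


def decide(n_detection: int, now: float, state: FollowerState,
           follow_cmd: Command, lost_timeout: float = 2.0,
           pause: float = 5.0) -> Command:
    """Apply the detection rules; follow_cmd is used only when n_detection == 1."""
    if n_detection > 1:
        # Rule 3: several people in view, stop for a preset time.
        state.paused_until = now + pause
        state.last_seen = None
        state.last_cmd = STOP
        return STOP
    if now < state.paused_until:
        return STOP
    if n_detection == 1:
        # Rule 2: follow the detected person.
        state.last_seen = now
        state.last_cmd = follow_cmd
        return follow_cmd
    # Rule 1: nothing detected. Keep the previous command for a short
    # time if a person was being followed, otherwise stop.
    if state.last_seen is not None and now - state.last_seen <= lost_timeout:
        return state.last_cmd
    state.last_seen = None
    return STOP
```

Passing the current time explicitly keeps the function free of side effects on the clock, so the rules can be checked without a robot or a camera.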

B. Person-Following Control Algorithm
The linear and the angular velocity are regulated using different functions, so, in order to obtain a correct control algorithm, it is necessary to combine them simultaneously.
1) Angular Velocity Control: The angular velocity is a function of the horizontal coordinate, x_p, of the center of the bounding box detecting the person, computed with respect to the reference frame located in the center of the camera frame. It follows a parabolic trend in order to make the movements of the robot smoother and more natural. In this reference frame, the dx value is positive if the person is on the left side of the frame and negative if the person is on the right side. Figure 3 shows a graphical representation of how the dx value is obtained. The controller function takes the dx value, measured in pixels, as input and gives as output the angular velocity, v_angularθ, which grows with the square of dx and is bounded by max_vel and min_vel. The terms max_vel and min_vel, which are equal and opposite values, are the upper and lower limits of the angular velocity of the robotic platform. The value of 320 pixels is the maximum number of pixels on each side (left and right) of the video frame, because the resolution of the image received from our camera is 640 × 480.
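One parabolic law consistent with this description is sketched here. It is an assumed form, not necessarily the exact formula of the controller: the velocity is taken as sign(dx) · max_vel · (dx / 320)^2, so that it is zero when the person is centered and reaches max_vel or min_vel = -max_vel at the image borders. The default max_vel of 1.0 rad/s is a hypothetical value.

```python
import math

HALF_WIDTH_PX = 320  # pixels on each side of the center of a 640 x 480 frame


def angular_velocity(dx: float, max_vel: float = 1.0) -> float:
    """Assumed parabolic controller: sign(dx) * max_vel * (dx / 320) ** 2.

    dx > 0 means the person is on the left of the image center, which
    gives a positive (counterclockwise) command under the usual ROS
    convention. max_vel is a hypothetical limit in rad/s.
    """
    # Clamp the normalized offset so the output never exceeds the limits.
    ratio = max(-1.0, min(1.0, dx / HALF_WIDTH_PX))
    return math.copysign(max_vel * ratio * ratio, ratio)
```

Compared with a proportional law, the quadratic term produces very small corrections when the person is close to the image center, which is what makes the motion smoother.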
2) Linear Velocity Control: The linear velocity depends linearly on the distance between the robot and the detected person, d_m, measured in meters. In particular, it is a function of the value obtained from the depth matrix of the camera at the center point of the bounding box detecting the person. This control can be represented by a linear trend divided into three regions. In the first region, the distance is greater than the set upper limit m_v_uplim, so here the robot moves straight on following a linear proportional trend until it reaches its maximum speed, saturating at that value. In the second region, there is a stop condition: the robot is at the safety distance from the person and remains there to avoid losing the person. The zero value is also assigned when the distance is 0 m, in order to handle a special case in the code: the depth value obtained from our camera has a lower limit of 0.105 m, so nothing should be detected at a shorter distance. The final region lies between the lower limit of the camera, 0.105 m, and the distance lower limit m_v_lowlim. In this condition, the robot goes back following a linear proportional trend until it reaches its maximum negative speed, saturating at that value. The linear velocity v_linearx is thus a piecewise-linear function of d_m with coefficients m1, q1, m2 and q2, which were found using the equation of the straight line passing through two points.
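The three regions can be sketched as a piecewise-linear function. All numerical values are hypothetical (distance limits of 0.8 m and 1.2 m, saturation at 3 m, maximum speed of 0.2 m/s), and the two points chosen for each straight line are an assumption: the forward line is taken through (upper limit, 0) and (saturation distance, maximum speed), the backward line through (lower limit, 0) and (0.105 m, negative maximum speed).

```python
def line_through(p1, p2):
    """Slope m and intercept q of the straight line through two points."""
    (x1, y1), (x2, y2) = p1, p2
    m = (y2 - y1) / (x2 - x1)
    return m, y1 - m * x1


def linear_velocity(d_m, low_lim=0.8, up_lim=1.2, d_sat=3.0,
                    v_max=0.2, cam_min=0.105):
    """Assumed piecewise-linear controller; every default is hypothetical."""
    if d_m < cam_min:
        return 0.0  # includes the 0 m reading: no valid depth, stay still
    if d_m > up_lim:
        # Person too far: move forward, saturating at v_max.
        m1, q1 = line_through((up_lim, 0.0), (d_sat, v_max))
        return min(v_max, m1 * d_m + q1)
    if d_m < low_lim:
        # Person too close: move backward, saturating at -v_max.
        m2, q2 = line_through((low_lim, 0.0), (cam_min, -v_max))
        return max(-v_max, m2 * d_m + q2)
    return 0.0  # safety band between the two limits: stop
```

The band between the two limits acts as a dead zone, which prevents the robot from oscillating back and forth when the person stands still at the safety distance.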

V. EXPERIMENTAL DISCUSSION AND RESULTS
In this section, we first discuss some technical details of the re-training procedure of the tiny version of YOLOv3. Then, experimental evaluations are discussed for both the model and its deployment on the selected embedded devices. Finally, we present our platform implementation, which represents one of the possible practical configurations to realize the person-following solution presented in this work.

A. Person Detector Training and Optimization
In order to obtain a lightweight and efficient network for the detection of the target, we modified the original model to classify and localize the class person only. Using OIDv4 [54], we collected a set X of 6001 training samples, reserving 600 of them for testing. Making use of transfer learning [58], we started our training from a pre-trained backbone, from layer 0 to 15 in Table II. That greatly speeds up the training, drastically reducing the number of samples required to achieve a high level of accuracy. We trained for 20 epochs with a linear learning rate decay and an initial value of η = 0.0001. We adopted momentum optimization [59] with β = 0.9 and a batch size of 32. The training procedure lasted approximately one hour on a workstation with an NVIDIA RTX 2080 Ti and 64 GB of DDR4 SDRAM.
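The training hyperparameters above can be made concrete with a short sketch of a linearly decaying learning rate combined with the classical momentum update. Two points are assumptions of this example: the learning rate is taken to decay to zero at the last epoch (the final value is not stated), and the update is written in the form v ← βv − ηg, w ← w + v, which is one common formulation of momentum.

```python
def linear_lr(epoch, total_epochs=20, lr0=1e-4):
    """Learning rate decaying linearly from lr0 to 0 (assumed end value)."""
    return lr0 * (1.0 - epoch / total_epochs)


def momentum_step(w, grad, velocity, lr, beta=0.9):
    """One momentum update for a scalar weight.

    Uses the form v <- beta * v - lr * grad, w <- w + v.
    """
    velocity = beta * velocity - lr * grad
    return w + velocity, velocity


# Usage: two updates of a scalar weight during the first epoch.
w, v = 1.0, 0.0
for _ in range(2):
    w, v = momentum_step(w, grad=2.0, velocity=v, lr=linear_lr(0))
```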
The effectiveness of the re-training procedure can be observed in Table III. The re-trained version of YOLOv3-tiny gains more than 30% of average precision (AP) at an intersection over union (IoU) of 0.5. Moreover, the resulting single-class network is 23% faster, in terms of inference latency, than the multi-class counterpart. That is due to the reduced number of feature maps in the final detection section of the network, from layer 15 to 23 of Table II. Finally, we optimized the resulting re-trained model with two different libraries: TensorRT and TensorFlow Lite. Optimization is a fundamental process that aims at reducing latency, inference cost, memory and storage footprint, and at improving accelerator compatibility. That is mainly achieved with two distinct techniques: model pruning and quantization. The first one simplifies the topological structure by removing unnecessary parts of the architecture, or favors a sparser model by introducing zeros into the parameter tensors. Quantization, on the other hand, reduces the precision of the numbers used to represent model parameters from float32 to float16 or down to int8. That can be accomplished after the training procedure (post-training quantization) or during the training procedure (quantization-aware training), by adding fake quantization nodes inside the network to make it robust to quantization noise. However, optimizations can potentially change the model accuracy, so any operation must be carefully evaluated.
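To give an intuition of what reducing the precision to int8 means, the sketch below applies a simple affine (scale and zero-point) quantization to a list of float weights and maps them back. It is a didactic example under stated assumptions and does not reproduce the internal procedure of TensorFlow Lite or TensorRT; the function names are hypothetical.

```python
def quantize_int8(values):
    """Affine quantization of floats to the int8 range [-128, 127].

    The range is extended to include 0.0 so that zero stays exactly
    representable. Returns (quantized values, scale, zero point).
    """
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / 255.0 or 1.0  # avoid a zero scale for all-zero input
    zero_point = round(-128 - lo / scale)
    q = [max(-128, min(127, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point


def dequantize(q, scale, zero_point):
    """Map int8 values back to approximate floats."""
    return [(x - zero_point) * scale for x in q]


# Usage: the round trip introduces an error of at most one step.
weights = [-1.0, 0.0, 0.5, 1.0]
q, scale, zp = quantize_int8(weights)
restored = dequantize(q, scale, zp)
```

The rounding error visible in the restored values is the quantization noise mentioned above, and it is the source of the accuracy loss that must be evaluated after this kind of optimization.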
In order not to affect the accuracy of our YOLOv3-tiny implementation, we applied basic pruning optimizations with the TensorRT library, removing unnecessary operations and setting irrelevant weights to zero. Indeed, person detection is a critical step in our solution, and it requires a certain level of performance to be maintained. Nevertheless, in order to also test the performance of the custom TPU ASIC of the Coral Accelerator, we produced a full integer model with the TensorFlow Lite optimizer to be compatible with the hardware of the device. When only applying model pruning, we obtained an insignificant accuracy loss; with 8-bit precision, the model loses 22% of its original AP50. In particular, darker scenes with partially occluded and small targets are no longer precisely detected. However, latency and inference costs are significantly reduced using this extreme optimization procedure.

B. Inference with Edge AI Accelerators
After the training and optimization procedures, the re-trained model was deployed on the different edge device configurations presented in Section III. We tested the performance in terms of absorbed power and frame rate in order to outline different hardware solutions for our proposed cost-effective person-following system. A single-board computer, Raspberry Pi 3B+, was used for all configurations that require a host device.
Firstly, we measured the power consumption of the different solutions at an idle condition, and then we executed the model for approximately five minutes to reach steady-state behavior. We directly measured the current absorbed from the power source, thereby obtaining the power consumption of the entire system.
Since the Jetson boards allow the user to select different working power conditions, we tested all of them. The results are presented in Table IV. The second version of the Intel Movidius Neural Stick achieves a higher frame rate with less power consumption. However, both Jetson Nano running modes reach higher performance at the expense of higher current absorption. Jetson AGX Xavier, in turn, achieves a much higher frame rate in all running modes, but at a different level of power consumption altogether. Finally, full integer quantization greatly reduces the latency of the model, which runs at more than 30 fps with only 7 W. However, as previously stated, the accuracy loss in this last case could compromise the correct functioning of the entire system in certain types of application.
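A compact way to compare configurations is to combine frame rate and absorbed power into an energy-per-frame figure. The sketch below uses the 30 fps and 7 W values quoted above for the fully quantized model; the supply voltage and current used to illustrate the power computation are hypothetical, as are the function names.

```python
def power_watts(voltage_v, current_a):
    """Electrical power drawn from the source, P = V * I."""
    return voltage_v * current_a


def energy_per_frame_j(power_w, fps):
    """Energy spent on each processed frame, in joules."""
    return power_w / fps


def frames_per_joule(power_w, fps):
    """Frames processed per joule of energy (higher is better)."""
    return fps / power_w


# 30 fps at 7 W, as quoted for the fully quantized model.
print(energy_per_frame_j(7.0, 30.0))
print(frames_per_joule(7.0, 30.0))

# Hypothetical measurement: 1.4 A absorbed from a 5 V supply.
print(power_watts(5.0, 1.4))
```

Under these figures, the quantized configuration spends roughly 0.23 J per frame; the same computation can be repeated for each row of Table IV.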

C. Platform Implementation
We tested the proposed cost-effective person-following system in a real environment with the configuration presented in Table V; the details of its hardware components have already been introduced in Section III-B. Figure 4 shows the assembled robot adopted for this application. The tests show robot behavior consistent with expectations. The person detection algorithm is fast and reaches a high level of performance. The network improvement obtained from the re-training is considerable (30.09%), and this translates into effective real-time operation and accurate control of the movements of the robot that follows the person.
By testing the network on the Jetson Xavier board, we have obtained a high frame rate (30+ fps at the maximum power of [...]). During the experiments, conducted in the test environment, we noticed that using all the outputs of the network to modify the control velocity can be counterproductive, because it causes a continuous variation of the inputs of the robot control and, consequently, an irregular movement of the robot. Conversely, when the robot command velocity is updated at a rate of 2 Hz, the tests give the best results: the robot still follows the target in real time, but with a notable increase in the smoothness of its movement. It is essential to underline that the adopted control has been explicitly designed for the use case taken into account, considering an elderly person as the target and the speed limits of the considered robotic platform: ±0.26 m/s for linear velocity and ±1.8 rad/s for angular velocity. These limits are reasonable for the considered indoor application, since walking sessions are usually short and performed at very limited speed with frequent pauses. However, the control rule can be easily adapted to higher-performance prototypes if a higher linear speed limit is needed.
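The 2 Hz update of the command velocity, together with the platform speed limits, can be sketched as a small throttle that holds the last command between updates. This is a minimal sketch under stated assumptions, not the implementation used on the robot: the class name, method signature and time source are hypothetical, while the 0.5 s period and the ±0.26 m/s and ±1.8 rad/s limits come from the text.

```python
MAX_LINEAR = 0.26    # m/s, platform limit quoted in the text
MAX_ANGULAR = 1.8    # rad/s, platform limit quoted in the text
UPDATE_PERIOD = 0.5  # s, corresponding to a 2 Hz update rate


def clamp(value, limit):
    """Restrict value to the symmetric interval [-limit, limit]."""
    return max(-limit, min(limit, value))


class CommandThrottle:
    """Hold the last velocity command and refresh it at most every period_s seconds."""

    def __init__(self, period_s=UPDATE_PERIOD):
        self.period_s = period_s
        self.last_update = None
        self.command = (0.0, 0.0)  # (linear, angular)

    def update(self, now_s, linear, angular):
        """Return the command to send at time now_s for the requested velocities."""
        due = self.last_update is None or now_s - self.last_update >= self.period_s
        if due:
            self.command = (clamp(linear, MAX_LINEAR), clamp(angular, MAX_ANGULAR))
            self.last_update = now_s
        return self.command


throttle = CommandThrottle()
first = throttle.update(0.0, 0.5, 0.1)    # accepted and clamped
held = throttle.update(0.2, 0.1, 0.0)     # too early: previous command is held
second = throttle.update(0.5, 0.1, -3.0)  # accepted after 0.5 s and clamped
print(first, held, second)
```

Detections arriving between two updates are simply ignored by the controller in this sketch; averaging them over the period would be an alternative design choice.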
During the test phase, the presented control of the robot has been refined, in particular considering the characteristics of the platform:
- The optimal distance limits that define the three areas of the linear velocity control are m_v_lowlim = 1.7 m and m_v_uplim = 1.9 m. In the range between these two values the robot is in the safe distance zone, so it can only rotate because the linear velocity stays at zero, in order to avoid generating any dangerous situation for the target person.
- The best linear increment is computed during both the forward and backward movements of the robot.
Thus, the final values of m1, q1, m2 and q2 (presented in Section IV) obtained for our platform implementation are reported in Table VI. Moreover, Figure 5(a) shows the resulting angular and linear velocity behaviors, taking into account the max_vel and min_vel limits of the robotic platform.
It is important to underline that the initial linear velocity control had been designed with different slopes and without singularities. However, the test phase highlighted a problem: when restarting its motion, the robot proceeded so slowly that it was unable to follow the target correctly without losing it. That is the reason for the introduction of the step singularities, visible in Figure 5(b), which, thanks to the small linear velocity of both the robot and the target (the older person), yield a balanced movement that is not jerky. The values of m1, q1, m2 and q2 identified after the test phase as the best choice for the linear control algorithm, taking into account the robot adopted in our case study, are those reported in Table VI. The overall system has been tested in several environments with different light conditions and target velocities, verifying the correctness and completeness of the system functionality. The final result meets the demands of an accurate real-time application: the robot moves safely and consistently with the movements of the person, limiting the chances of losing the target.
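The three-zone linear velocity rule with step singularities can be sketched as a piecewise linear function of the measured distance. The 1.7 m and 1.9 m limits and the ±0.26 m/s saturation come from the text; the coefficient values below are hypothetical placeholders, not the values of Table VI, and the assignment of (m1, q1) to the forward branch and (m2, q2) to the backward branch is an assumption.

```python
D_LOW = 1.7        # m, lower limit of the safe distance zone (from the text)
D_UP = 1.9         # m, upper limit of the safe distance zone (from the text)
MAX_LINEAR = 0.26  # m/s, platform limit (from the text)

# Hypothetical coefficients, chosen only to illustrate the shape of the rule.
M1, Q1 = 0.2, -0.28  # assumed forward branch, used when distance > D_UP
M2, Q2 = 0.2, -0.44  # assumed backward branch, used when distance < D_LOW


def linear_velocity(distance_m):
    """Linear velocity command for a given distance to the target person."""
    if D_LOW <= distance_m <= D_UP:
        return 0.0  # safe distance zone: the robot may only rotate
    if distance_m > D_UP:
        v = M1 * distance_m + Q1  # target too far: move forward
    else:
        v = M2 * distance_m + Q2  # target too close: move backward
    return max(-MAX_LINEAR, min(MAX_LINEAR, v))


for d in (0.1, 1.5, 1.8, 1.91, 2.0, 5.0):
    print(d, linear_velocity(d))
```

With these placeholder coefficients the command jumps from zero to about 0.10 m/s just outside the safe zone instead of growing from zero, which reproduces the step behavior introduced to avoid an excessively slow restart.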
We can conclude that we have realized not a simple object tracker, but a person-following method that focuses on cost-effectiveness, since it cuts unnecessary computations to obtain a low-cost, functional system. Besides, the detection is performed at the edge, so the network is optimized to run on neural accelerators. In this way, it is not necessary to have an expensive computer onboard the robot, which would imply both a higher price and additional power consumption. The reduction in computational cost and power consumption lets us use different types of hardware, presented in Table I, with their respective performances reported in Table IV.

VI. CONCLUSIONS
We proposed a cost-effective person-following system for the assistance of self-sufficient older adults that exploits the latest advancements in deep learning optimization techniques and edge AI devices to bring inference directly onto the robotic platform with high performance and limited power consumption. We tested different embedded device configurations, and we presented a possible practical implementation of the suggested system. The discussed solution is easily replaceable and fully integrable into pre-existing navigation stacks. Future research may integrate the person-following method with concrete applications and monitoring tools for self-sufficient older adults.