An Autonomous Robotic System for Object Retrieval and Delivery: Enhancing Independence for Users Living with Disability and Older Adults

Li, Jincheng; Lin, Chenghao; Mazen, Amna; Bazzi, Youssef A.

doi:10.3390/robotics15020041


¹ Department of Electrical & Computer Engineering and Computer Science, University of Detroit Mercy, Detroit, MI 48221, USA

² Department of Applied Computing, College of Computing, Michigan Technological University, Houghton, MI 49931, USA

³ Department of Manufacturing and Mechanical Engineering Technology, College of Engineering, Michigan Technological University, Houghton, MI 49931, USA

* Author to whom correspondence should be addressed.

† These authors contributed equally to this work.

Robotics 2026, 15(2), 41; https://doi.org/10.3390/robotics15020041

Submission received: 20 November 2025 / Revised: 20 January 2026 / Accepted: 2 February 2026 / Published: 12 February 2026

(This article belongs to the Special Issue AI-Powered Robotic Systems: Learning, Perception and Decision-Making)


Abstract

As the global population ages, there is a growing need for assistive technologies to help older adults maintain their independence. This work presents a cost-effective autonomous socially assistive robot designed for object retrieval and delivery, enhancing accessibility in home environments. The system is built on the Robot Operating System (ROS) framework and integrates three key components: the Pioneer P3-DX mobile robot for autonomous navigation, the ReactorX-200 robotic arm for pick-and-place operations, and the Kinect v2 RGB-D camera for object detection and localization. Users interact with the robot through natural language processing by issuing voice commands to retrieve various objects. Microsoft Azure-powered speech recognition processes these commands to extract keywords and then localize requested objects on a predefined building map. Pioneer P3-DX, equipped with a Hokuyo LiDAR, enables autonomous navigation and obstacle avoidance, while Kinect v2, integrated with the YOLOv8 algorithm, facilitates object recognition and localization. The robot retrieves and delivers the user’s requested objects while following the shortest available path. Experimental evaluations in a home environment demonstrate the system’s effectiveness in identifying and retrieving requested objects. The subsystems achieve a success rate of 85–95% across more than 50 runs, highlighting their strong performance. The proposed approach provides a proof of concept for future advancements in assistive robotics, demonstrating the seamless integration of advanced technologies into a cost-effective and user-friendly platform.

Keywords:

autonomous navigation; mobile manipulator; socially assistive robots; user independence; voice recognition; YOLOv8

1. Introduction

With the global population aging rapidly, the proportion of older adults (OAs) in many countries is steadily increasing. The United Nations reports that the number of OAs worldwide will reach 2.1 billion by 2050 [1], increasing the demand for solutions that support independent living for older adults [2,3]. In addition to OAs, many users living with disability have difficulty moving, carrying, and handling objects and rely heavily on caregivers for assistance. This raises the need for robotic systems that enhance the independence of users living with disability and older adults.

Socially assistive robots (SARs) have emerged as a promising solution to support independent living for older adults by providing companionship, assistance with daily tasks, and health monitoring [4]. Recent studies highlight the significant potential of SARs in eldercare, driving substantial global investments aimed at their integration into daily life [5,6]. Nanavati et al. [7] provided a systematic review of physically assistive robots that emphasizes trends toward higher autonomy, improved interaction interfaces, and the need for evaluations with end users in real-world settings. Jung and Shin [8] investigated the intention of people with physical disabilities to use care robots. This investigation concluded that the majority of participants expressed willingness to adopt such systems. Complementing this, a scoping review on humanoid robots assisting activities of daily living for people with physical disabilities reports generally positive user perceptions, while underscoring limited technical readiness and personalization for home deployment [9]. To effectively assist users, SARs require several key functionalities, including a voice recognition algorithm for natural interaction, an object detection algorithm to identify and locate the requested items, a grasping algorithm to retrieve these items, and an autonomous navigation algorithm to enable seamless movement within the user’s environment. Researchers have extensively explored these individual capabilities to enhance the overall performance and usability of SARs.

For speech recognition techniques, Rendyansyah et al. [10] used Mel-frequency cepstral coefficients combined with artificial neural networks and deep neural networks to control the movements of a 4-DOF robot. Similarly, Li et al. [11] integrated a deep learning-based speaker separation model with an automatic speech recognition system, allowing robots to interpret spoken commands while accurately filtering out background noise. Object detection methods enable SARs to precisely detect and locate target objects, even in complex environments. Gupta et al. [12] proposed a geocentric embedding for depth images to improve object detection and instance segmentation in RGB-D images, achieving significant gains over existing methods. Finally, SARs require navigation algorithms that are efficient in dynamic environments. Traditional planners such as the Dynamic Window Approach (DWA) [13] have been adapted to account for human presence. Recent studies have increasingly focused on integrating social norms directly into path planning. Approaches such as Social Force Models [14] and deep learning techniques, like Socially Aware Navigation with Graph Neural Networks [15], have shown promising results.

Although these single-function technologies have achieved significant progress, a fully integrated system combining these functionalities has been lacking. In this work, we present a unified modular assistive framework that seamlessly integrates multiple algorithms to address human functioning limitations due to a disabling health condition or age-related functional decline.

1.1. Research Questions

The goal of this work is to design, implement, and evaluate an autonomous socially assistive robot capable of retrieving and delivering objects to support older adults and individuals experiencing functioning limitations. The following research questions were defined:

RQ1: How can voice-based human–robot interaction be leveraged to enable intuitive and accessible communication for object retrieval tasks in home environments?

RQ2: Which robot navigation algorithm is most effective for enabling reliable autonomous mobility in cluttered and dynamic household settings?

RQ3: How can deep learning-based object detection and AprilTag-assisted localization provide robot perception for identifying and retrieving user-requested items?

RQ4: How can an integrated mobile manipulator, combining a Pioneer P3-DX base and a ReactorX-200 robotic arm, be coordinated to achieve grasping and delivery of objects to the user?

1.2. Research Variables

The research variables are categorized into navigation stack parameters and manipulator joints.

Navigation Stack Parameters: Several parameters within the move_base navigation stack were manipulated to adapt the system to the Pioneer P3-DX mobile robot equipped with a 270-degree LiDAR. These parameters include costmap-related variables such as robot footprint and inflation radius, as well as global and local planner parameters (e.g., lethal_cost, neutral_cost, cost_factor, path_distance_bias, goal_distance_bias, and occdist_scale). The description of these parameters is summarized in Table 1.

Manipulator Joints: The joint variables of the ReactorX-200 manipulator, along with their default safe operating limits as defined in the firmware, are summarized in Table 2. These limits constrain the inverse kinematics solution space and ensure that all computed joint configurations remain within mechanically feasible and collision-safe ranges during grasp execution.

1.3. Contribution of the Study

The main contribution of this work is the integration of subsystems to develop an end-to-end object retrieval solution that helps older adults and users with disabilities. The proposed system consists of the following key components:

Voice Recognition: A speech recognition module interprets user commands to identify requested objects.
Environment Mapping: The system generates a detailed map of the user’s home environment and a list that assigns spatial locations to the identified objects.
Path Optimization: An optimization algorithm minimizes the robot’s travel distance while navigating the mapped environment.
Navigation Stack Implementation: A global planner uses the generated map to compute a global path to each target object’s location, while a local planner uses real-time LiDAR data to dynamically avoid obstacles.
Object Detection and Localization: A YOLO-based deep learning model is used for robust object detection and localization within the environment. AprilTag markers are employed to accurately align the robotic arm by transforming object coordinates from the camera frame to the robot’s coordinate frame, ensuring precise grasping.
Robotic Arm Motion Planning: A motion planning algorithm enables the robotic arm to grasp requested objects, store them in an attached holder, and deliver them to the user once collection is complete.

By integrating these algorithms, the proposed system advances autonomous robotic assistance, promoting greater accessibility and independence for individuals with disabilities.

The remainder of this paper is organized as follows: Section 2 describes the hardware design and system integration of the autonomous robotic platform, including the incorporation of navigation, perception, and manipulation. Section 3 presents experimental evaluations conducted in a real-world, home-like environment to assess the system’s performance across object detection, grasping, and delivery tasks. Finally, Section 4 summarizes key findings and discusses directions for future enhancements.

2. Methodology

The proposed robot assistant system is designed to assist older adults and people with disabilities in performing their daily living activities. Figure 1 illustrates the physical setup of the proposed system. The system comprises four main components: a Pioneer P3-DX mobile robot (MobileRobots Inc., Amherst, NH, USA), a ReactorX-200 arm (Trossen Robotics, Downers Grove, IL, USA), sensors including a Hokuyo UST-10LX LiDAR (Osaka, Japan) and RGB-D Kinect v2 camera, and an Intel NUC for processing. The camera integrated with the YOLOv8 algorithm is employed for object recognition and localization within the robot’s surroundings. The Pioneer P3-DX mobile robot, equipped with a Hokuyo LiDAR, is utilized for navigation and obstacle avoidance, enabling it to reach the user-requested objects. The ReactorX-200 robotic arm is then used to pick and place the identified target objects.

In the proposed system, the user requests the robot to retrieve various objects through voice commands. The system processes these commands by extracting keywords and localizing the requested objects within the predefined locations on the building map. For example, the keyword “pen” is associated with the “office” location in the building map. Once all commands are received, the system employs an optimization navigation algorithm to efficiently sequence the navigation to these locations while minimizing time and avoiding revisiting the same location.

Figure 2 illustrates the system workflow of our proposed home assistant system. The system operates in two main modes: navigation and grasping. In navigation mode, the Pioneer P3-DX robot moves from one location to another on the map to retrieve the requested objects. Once the robot has reached the object’s designated location (e.g., a kitchen) and the camera has detected the requested object in the environment, the system switches to grasping mode. In this mode, the trained YOLOv8 model is used to classify objects in the robot’s environment and accurately determine the location of the requested objects with respect to the robotic arm frame. The object’s location detected by the camera is sent to the ReactorX-200 arm, which then picks it up and places it in a designated item holder attached to the Pioneer P3-DX robot. After successfully retrieving the object, the system reactivates the navigation mode to proceed to the next requested object location. This process is repeated for all requested objects. Once all objects are collected, the robot returns to the starting point where the user is waiting. The upcoming subsections provide a detailed discussion of the algorithms implemented in this system.
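The alternation between the two modes can be sketched as a simple loop. This is only an illustration of the workflow described above, not the authors' implementation: the callables `navigate`, `detect`, and `grasp` are hypothetical stand-ins for the navigation stack, the YOLOv8 detector, and the arm controller, and the `"start"` location name is assumed.

```python
from typing import Callable, Dict, List


def run_retrieval(
    requests: List[str],
    object_location: Dict[str, str],
    navigate: Callable[[str], bool],
    detect: Callable[[str], bool],
    grasp: Callable[[str], bool],
    home: str = "start",
) -> List[str]:
    """Alternate between navigation and grasping modes for each request.

    Returns the objects that ended up in the item holder.
    """
    collected: List[str] = []
    for obj in requests:
        location = object_location[obj]
        # Navigation mode: drive to the semantic location of the object.
        if not navigate(location):
            continue
        # Grasping mode starts only once the camera has seen the object.
        if detect(obj) and grasp(obj):
            collected.append(obj)
    # After the last request, return to where the user is waiting.
    navigate(home)
    return collected
```

A failed navigation, detection, or grasp simply skips that object here; a real system would likely retry or report the failure to the user.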

2.1. Voice Recognition

The proposed system is launched when the user gives verbal instructions specifying the objects to retrieve. Microsoft® Azure Speech-to-Text [17], a robust cloud-based platform that transcribes spoken commands into text with high accuracy and real-time processing, is used to enable seamless interaction between the user and the proposed system. To refine the transcribed text, non-essential components such as articles (“a,” “an,” “the”) and prepositions are removed, leaving the keywords that identify the requested objects.
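The filtering step can be illustrated on an already transcribed sentence. The sketch below makes no calls to the Azure service; it assumes the transcript is available as a plain string, and both the stop-word list and the object vocabulary are hypothetical choices made for this example.

```python
import re
from typing import List, Set

# Hypothetical vocabulary and stop words, chosen only for illustration.
KNOWN_OBJECTS: Set[str] = {"pen", "screwdriver", "card", "marker"}
STOP_WORDS: Set[str] = {
    "a", "an", "the", "to", "for", "from", "in", "on", "of", "at", "with",
}


def extract_keywords(transcript: str) -> List[str]:
    """Return the known object names mentioned in a transcript, in order."""
    tokens = re.findall(r"[a-z]+", transcript.lower())
    # Drop articles and prepositions before matching against the vocabulary.
    content = [t for t in tokens if t not in STOP_WORDS]
    keywords: List[str] = []
    for token in content:
        if token in KNOWN_OBJECTS and token not in keywords:
            keywords.append(token)
    return keywords
```

For example, `extract_keywords("Bring me a pen and the screwdriver.")` yields `["pen", "screwdriver"]`. Plural forms and synonyms are not handled in this sketch.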

2.2. Navigation and Path Planning

After extracting the keywords from the user commands, the system correlates the requested objects with their predefined locations on the map. For example, object keywords such as apple, pen, and comb are mapped to corresponding semantic locations such as the kitchen, office, and bedroom. We also maintain a list that stores the coordinate positions of each location within the map. Consequently, when the user requests an apple, the robot navigates to the kitchen’s coordinate positions in the map and initiates a local exploration routine until the Kinect camera, in conjunction with the YOLO detection algorithm, identifies and localizes the target object.
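A minimal sketch of this two-level lookup is given below. The object-to-room pairs follow the example in the text, whereas the coordinates are invented placeholders, not values from the authors' map. Grouping the requests by location reflects the stated aim of not revisiting a location.

```python
from typing import Dict, List, Tuple

# Object-to-room pairs as in the example above.
OBJECT_TO_LOCATION: Dict[str, str] = {
    "apple": "kitchen",
    "pen": "office",
    "comb": "bedroom",
}

# Hypothetical map-frame goals (x in m, y in m, yaw in rad).
LOCATION_TO_GOAL: Dict[str, Tuple[float, float, float]] = {
    "kitchen": (4.2, 1.5, 0.0),
    "office": (9.8, -2.0, 1.57),
    "bedroom": (1.0, 6.3, 3.14),
}


def group_by_location(objects: List[str]) -> Dict[str, List[str]]:
    """Group requested objects by room so that each room is visited once."""
    grouped: Dict[str, List[str]] = {}
    for obj in objects:
        location = OBJECT_TO_LOCATION.get(obj)
        if location is None:
            continue  # object not present in the predefined list
        items = grouped.setdefault(location, [])
        if obj not in items:
            items.append(obj)
    return grouped
```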

The building map was constructed using the ROS Simultaneous Localization and Mapping (SLAM) package, v. 2.6.10 [18], which allows the robot to build an environment map while determining its location autonomously. Using the building map and the LiDAR live data, the move_base package [19] enabled the robot to navigate the locations and retrieve the requested objects. The move_base package [19] provides a ROS interface for autonomous navigation by integrating global and local planners with obstacle avoidance mechanisms.

In this work, we integrated the A* algorithm [20] as a global path planner and the Dynamic Window Approach (DWA) [13] as a local path planner. The A* algorithm efficiently computes the robot’s optimal global path based on the pre-constructed map, while the DWA refines the generated global path accounting for the robot’s kinematic and dynamic constraints and real-time LiDAR data. This dual-layered approach equips the navigation module to dynamically adapt to sudden environmental changes, ensuring reliability in real-world scenarios. To optimize the retrieval of multiple objects along the shortest route, we applied the Traveling Salesman Problem (TSP) formulation [21] to determine the optimal ordering of object locations. The TSP seeks the shortest possible route that visits each city exactly once and returns to the origin city [22]. The move_base package receives these sorted target locations and uses the global and local planners to produce the robot’s linear and angular velocities.
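For a handful of rooms, the ordering problem can be solved by exhaustive search. The sketch below is one possible way to do this and is not taken from the paper: it uses straight-line distances between goal coordinates as the cost, whereas a deployed system might prefer planned path lengths, and the function names are hypothetical.

```python
from itertools import permutations
from math import hypot
from typing import Dict, Sequence, Tuple

Point = Tuple[float, float]


def tour_length(start: Point, stops: Dict[str, Point], order: Sequence[str]) -> float:
    """Length of the closed tour start -> stops in the given order -> start."""
    path = [start] + [stops[name] for name in order] + [start]
    return sum(hypot(b[0] - a[0], b[1] - a[1]) for a, b in zip(path, path[1:]))


def best_visit_order(start: Point, stops: Dict[str, Point]) -> Tuple[str, ...]:
    """Brute-force TSP: try every ordering and keep the shortest tour."""
    names = sorted(stops)
    return min(permutations(names), key=lambda order: tour_length(start, stops, order))
```

The search grows factorially with the number of locations, which is acceptable for the few rooms of a home but would need a heuristic for larger maps.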

2.3. Grasping and Trajectory Planning

Upon reaching the object’s designated location (e.g., an office), the Pioneer P3-DX robot performs a local exploration routine, moving forward and rotating in place until the target object is detected using the YOLOv8 algorithm [23], which identifies and classifies objects in the robot’s environment. The object’s relative position is estimated using the Kinect depth camera. If the object lies beyond the manipulator’s maximum reach of 550 mm, the mobile robot autonomously repositions itself so that the object is within the arm’s reach, ensuring successful retrieval. At this point, the navigation mode is deactivated, and the system seamlessly transitions to the grasping mode for object retrieval.
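The reach test can be written as a short geometric check. The sketch below assumes the object position is expressed in the arm base frame in metres and that the base can only shorten the horizontal distance; the safety margin and the function name are hypothetical, while the 550 mm reach comes from the text.

```python
from math import hypot, sqrt
from typing import Optional, Tuple

MAX_REACH_M = 0.550  # manipulator reach stated in the text


def advance_needed(obj_xyz: Tuple[float, float, float], margin_m: float = 0.05) -> Optional[float]:
    """Horizontal distance the base should close so the object is reachable.

    Returns 0.0 if the object is already within reach, and None if its height
    alone puts it out of reach, in which case driving closer cannot help.
    """
    x, y, z = obj_xyz
    usable = MAX_REACH_M - margin_m
    if abs(z) >= usable:
        return None
    # Largest horizontal offset that keeps the 3D distance within reach.
    max_horizontal = sqrt(usable ** 2 - z ** 2)
    return max(0.0, hypot(x, y) - max_horizontal)
```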

After the requested object is detected and localized using the YOLOv8 algorithm, its coordinates are transformed into the manipulator’s end-effector frame. To accurately compute the transformation matrix between the Kinect camera frame and the ReactorX-200 manipulator frame, we incorporate hand–eye calibration [24]. Hand–eye calibration establishes the spatial relationship between the robot’s end effector (the ‘hand’) and the camera (the ‘eye’). The calibration process involves capturing multiple images from various viewpoints to ensure a robust estimation of the transformation matrix. In this work, we use an AprilTag system [25], a fiducial marker system consisting of easily detectable 2D markers, to perform the hand–eye calibration, as illustrated in Figure 3. The AprilTag is detected within the camera’s field of view during the calibration process, as shown in Figure 3a. The system uses the detected pose of these tags relative to the camera to calculate the transformation between the camera frame and the arm frame, as illustrated in Figure 3b.
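The role of the tag can be illustrated with homogeneous transforms. The sketch below assumes that the tag pose is known in the arm base frame (`t_base_tag`) and observed in the camera frame (`t_cam_tag`), so that the camera-to-base transform is `t_base_tag · inverse(t_cam_tag)`. This is a simplified single-view illustration with hypothetical names, not the authors' calibration routine.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def mat_mul(a: Matrix, b: Matrix) -> Matrix:
    """Product of two 4x4 homogeneous transforms."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)] for i in range(4)]


def invert_rigid(t: Matrix) -> Matrix:
    """Inverse of a rigid transform: rotation R^T, translation -R^T p."""
    rt = [[t[j][i] for j in range(3)] for i in range(3)]
    p = [t[i][3] for i in range(3)]
    pt = [-sum(rt[i][k] * p[k] for k in range(3)) for i in range(3)]
    return [rt[i] + [pt[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]


def transform_point(t: Matrix, point: Sequence[float]) -> Tuple[float, ...]:
    """Apply a homogeneous transform to a 3D point."""
    x, y, z = point
    return tuple(t[i][0] * x + t[i][1] * y + t[i][2] * z + t[i][3] for i in range(3))


def camera_to_base(t_base_tag: Matrix, t_cam_tag: Matrix) -> Matrix:
    """Camera-to-base transform from one tag seen in both frames."""
    return mat_mul(t_base_tag, invert_rigid(t_cam_tag))
```

Once `camera_to_base` has been computed, any point detected in the camera frame can be mapped into the arm frame with `transform_point`.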

The hand–eye calibration is performed once and subsequently reused in the grasping mode. In this mode, the Rapidly Exploring Random Tree (RRT-Connect) algorithm [26] is employed to generate an obstacle-free trajectory, enabling the manipulator to grasp the object and place it into the item holder mounted on the mobile robot. The trajectory guides the manipulator from its current configuration to the target configuration as a sequence of XYZ Cartesian points. Each point along the generated trajectory must be converted into joint-space positions using inverse kinematics. We applied the Denavit–Hartenberg (D-H) method [27], which is widely employed in robotics to facilitate forward and inverse kinematics. It models the manipulator’s kinematics by representing transformations between adjacent links through rotation and translation matrices.

For the ReactorX-200 manipulator, the joint axes, angles, and link lengths are defined according to the D-H parameters, as illustrated in Figure 4. The inverse kinematic transformation matrix, Equation (1), takes the end-effector XYZ Cartesian position as input and computes the corresponding five joint angles. This transformation matrix consists of two components: the rotation matrix (R) represents the rotational orientation of the end effector, and the translation matrix (P) describes the positional displacement in the 3D space. Equation (2) shows the details of the rotation and translational matrices where the symbols s and c denote sine and cosine functions, respectively, with subscripts indicating the corresponding joint angles. For instance, c_{234} = cos(θ_{2} + θ_{3} + θ_{4}) describes the cosine of the cumulative rotation of three consecutive joints θ_{2}, θ_{3}, θ_{4}.

\begin{pmatrix} θ_{1} \\ θ_{2} \\ θ_{3} \\ θ_{4} \\ θ_{5} \end{pmatrix} = \begin{pmatrix} R & P \\ 0 & 1 \end{pmatrix} \begin{pmatrix} X \\ Y \\ Z \end{pmatrix}

(1)

R = \begin{pmatrix} s_{1} s_{5} + c_{1} c_{5} c_{234} & c_{1} c_{5} - c_{1} c_{5} c_{234} & - c_{1} s_{234} \\ - c_{1} s_{5} + c_{1} c_{5} c_{234} & - c_{1} c_{5} - s_{1} s_{5} c_{234} & - s_{1} s_{234} \\ - c_{5} s_{234} & - s_{5} s_{234} & - c_{234} \end{pmatrix}

P = \begin{pmatrix} c_{1} (l_{2} c_{2} + l_{3} c_{23} - l_{4} s_{234}) \\ s_{1} (l_{2} c_{2} + l_{3} c_{23} - l_{4} s_{234}) \\ l_{1} - l_{2} s_{2} - l_{3} s_{23} - l_{4} c_{234} \end{pmatrix}

(2)
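As a worked illustration of the translation component P in Equation (2), the sketch below evaluates the end-effector position for a given set of joint angles. The link lengths passed in are hypothetical placeholders rather than the ReactorX-200 dimensions; θ_{5} does not appear because P in Equation (2) does not depend on it.

```python
from math import cos, sin
from typing import Sequence, Tuple


def end_effector_position(
    theta: Sequence[float], links: Sequence[float]
) -> Tuple[float, float, float]:
    """Evaluate P from Equation (2).

    theta: joint angles (theta1..theta4) in radians.
    links: link lengths (l1..l4) in metres.
    """
    t1, t2, t3, t4 = theta[:4]
    l1, l2, l3, l4 = links
    # Common radial term shared by the x and y components.
    radial = l2 * cos(t2) + l3 * cos(t2 + t3) - l4 * sin(t2 + t3 + t4)
    x = cos(t1) * radial
    y = sin(t1) * radial
    z = l1 - l2 * sin(t2) - l3 * sin(t2 + t3) - l4 * cos(t2 + t3 + t4)
    return (x, y, z)
```

With all angles at zero, the expression reduces to (l_{2} + l_{3}, 0, l_{1} - l_{4}), which gives a quick sanity check on the formula.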

The following subsections describe the experimental design used to evaluate the performance of each component of the proposed robotic system: speech recognition, robot navigation, object detection, and manipulator grasping.

2.4. Experimental Design

All experiments were conducted under controlled laboratory conditions to ensure repeatability and consistency. More than 50 trials were conducted for every subsystem. All experiments were conducted exclusively by the five authors of this paper. No external human participants were involved in the experimental evaluation.

2.5. Metrics

Different evaluation metrics were employed to assess the performance of each component of the proposed system, including Azure speech recognition, mobile robot navigation, YOLO-based object detection, and manipulator grasping.

2.5.1. Azure Voice Recognition

The performance of the Azure voice recognition module was evaluated using a success rate metric that measures its ability to identify the requested object from spoken commands correctly. The authors issued natural-language voice commands, incorporating variations in phrasing and pronunciation to reflect realistic user interaction. The evaluation was conducted using full natural-language sentences (e.g., “bring me a pen” or “I want a screwdriver”), where the system was required to extract and correctly recognize the target keyword corresponding to one of the four experimental objects: pen, screwdriver, card, and marker. A trial was considered successful if the system correctly extracted the intended target object keyword from the spoken command.

2.5.2. Mobile Robot Navigation Algorithm

Navigation experiments assessed the mobile robot’s ability to reach predefined goal poses using the move_base framework autonomously. Each trial required the robot to reach the target location within positional and orientation tolerances of 0.05 m for xy_goal_tolerance and 2 degrees for yaw_goal_tolerance, without collision.
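The success criterion can be expressed as a small check on the final pose. The sketch below is an offline illustration using the tolerances quoted above; it is not a call into move_base, and the function names are hypothetical. Poses are (x, y, yaw) tuples in metres and radians.

```python
from math import atan2, cos, hypot, radians, sin
from typing import Tuple

XY_GOAL_TOLERANCE_M = 0.05
YAW_GOAL_TOLERANCE_RAD = radians(2.0)

Pose = Tuple[float, float, float]


def angle_diff(a: float, b: float) -> float:
    """Smallest signed difference a - b, wrapped to [-pi, pi]."""
    return atan2(sin(a - b), cos(a - b))


def goal_reached(pose: Pose, goal: Pose) -> bool:
    """True if the pose lies within both the position and yaw tolerances."""
    position_error = hypot(goal[0] - pose[0], goal[1] - pose[1])
    yaw_error = abs(angle_diff(goal[2], pose[2]))
    return position_error <= XY_GOAL_TOLERANCE_M and yaw_error <= YAW_GOAL_TOLERANCE_RAD
```

Wrapping the yaw difference matters near ±180 degrees, where a naive subtraction would report an error of almost a full turn.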

2.5.3. Object Detection Using YOLO

To assess the performance of the proposed YOLO framework, several evaluation metrics were employed, including the confusion matrix, accuracy, precision, recall, and F1-score. The confusion matrix provides a detailed breakdown of prediction outcomes across all classes by reporting true positives (TP), false positives (FP), false negatives (FN), and true negatives (TN). Overall accuracy measures the proportion of correctly classified detections relative to the total number of detections, as shown in Equation (3). Precision, Equation (4), quantifies the reliability of the predicted class labels. Recall, Equation (5), measures the model’s ability to correctly detect instances of each class. Finally, the F1-score represents the harmonic mean of precision and recall, as indicated in Equation (6).

Accuracy = \frac{TP + TN}{TP + TN + FP + FN}

(3)

Precision = \frac{TP}{TP + FP}

(4)

Recall = \frac{TP}{TP + FN}

(5)

F1\text{-score} = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}

(6)
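Equations (3) to (6) translate directly into code. The sketch below is a plain implementation with guards against division by zero; the function name is hypothetical and the counts in the usage note are illustrative, not reported results.

```python
from typing import Dict


def detection_metrics(tp: int, fp: int, fn: int, tn: int = 0) -> Dict[str, float]:
    """Accuracy, precision, recall and F1-score from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```

For a class with 16 correct detections, no false positives, and one missed instance, `detection_metrics(16, 0, 1)` gives a precision of 1.0, a recall of 16/17, and an F1-score of 32/33.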

2.5.4. Manipulator Grasping

Grasping experiments evaluated the ReactorX-200 manipulator’s ability to successfully grasp detected objects and place them into the designated storage holder. Each grasping trial was initiated only after successful completion of speech recognition, navigation, and object detection, ensuring end-to-end system validation. The manipulator grasping performance was evaluated based on the successful generation of a collision-free joint-space trajectory that reaches the specified end-effector Cartesian goal while respecting the kinematic and joint-limit constraints of the RX-200 manipulator mentioned in Table 2.
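The joint-limit part of this criterion can be checked waypoint by waypoint. In the sketch below, the joint names and the numeric limits are assumptions made for illustration; the actual limits are those listed in Table 2.

```python
from math import pi
from typing import Dict, List, Optional, Tuple

# Hypothetical joint names and limits in radians; see Table 2 for real values.
JOINT_LIMITS: Dict[str, Tuple[float, float]] = {
    "waist": (-pi, pi),
    "shoulder": (-1.8, 1.9),
    "elbow": (-1.8, 1.6),
    "wrist_angle": (-1.7, 2.1),
    "wrist_rotate": (-pi, pi),
}


def first_violation(
    trajectory: List[Dict[str, float]],
    limits: Dict[str, Tuple[float, float]] = JOINT_LIMITS,
) -> Optional[Tuple[int, str]]:
    """Return (waypoint index, joint name) of the first limit violation.

    Returns None when every waypoint respects every joint limit.
    """
    for index, waypoint in enumerate(trajectory):
        for joint, (low, high) in limits.items():
            if not low <= waypoint[joint] <= high:
                return index, joint
    return None
```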

3. Results

This section provides detailed implementation insights about environment map creation, robot navigation and object grasping and retrieval.

3.1. Experiment Environment

The first floor of the Engineering Building at the University of Detroit Mercy (UDM) was utilized to simulate a home environment for testing the proposed system. The ROS SLAM package [18] based on the Hokuyo LiDAR sensor was used to generate a map of the simulated environment. The simulated environment included typical household spaces, such as an office, a kitchen, and corridors, as illustrated in Figure 5. The furniture was arranged to reflect the layout of a standard home, providing a realistic testing environment to evaluate the robot’s capabilities in object interaction, navigation, and task execution.

3.2. Robot’s Navigation Stack

Using the created map, the ‘move_base’ navigation stack [19] facilitates the movement of the mobile robot from one location to another while avoiding obstacles. The performance of the navigation stack relies on two categories of parameters: costmap parameters and planner parameters. The key costmap parameters include the robot’s footprint (size of the robot in the costmap) and inflation_radius (the extent to which obstacles are inflated in the costmap, which depends on the robot size). Global planner parameters such as lethal_cost, neutral_cost, and cost_factor influence whether the planner generates paths that pass through the center of obstacle-free regions rather than skirting their edges. In contrast, the local planner parameters path_distance_bias, goal_distance_bias, and occdist_scale govern the robot’s immediate path-following behavior.

Specific adjustments were made to these parameters to suit the Pioneer P3-DX robot equipped with a 270-degree LiDAR. Table 3 lists the specific parameter values used. For instance, reverse driving was disabled because of the LiDAR’s 90-degree blind zone, and the path_distance_bias parameter was reduced to ensure smoother obstacle avoidance. Figure 6 illustrates the navigation process from the kitchen to the main corridor on the predefined map, visualized in RViz. This figure demonstrates the integration of costmaps and planners, highlighting the system’s ability to navigate efficiently within a dynamic environment.
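The effect of lowering path_distance_bias can be seen in the way the DWA local planner scores candidate trajectories. The sketch below assumes the weighted-sum cost documented for the ROS dwa_local_planner package; all numeric values, including the biases, are illustrative and are not the values of Table 3.

```python
def dwa_cost(
    path_dist_m: float,
    goal_dist_m: float,
    max_obstacle_cost: float,
    path_distance_bias: float,
    goal_distance_bias: float,
    occdist_scale: float,
) -> float:
    """Weighted sum used to rank one candidate trajectory (lower is better).

    path_dist_m: distance from the trajectory endpoint to the global path.
    goal_dist_m: distance from the trajectory endpoint to the local goal.
    max_obstacle_cost: largest costmap cost along the trajectory (0 to 254).
    """
    return (
        path_distance_bias * path_dist_m
        + goal_distance_bias * goal_dist_m
        + occdist_scale * max_obstacle_cost
    )


# Two hypothetical candidates: one hugs the global path close to an obstacle,
# the other swings 0.2 m wide through free space.
hug = dict(path_dist_m=0.0, goal_dist_m=1.0, max_obstacle_cost=200.0)
wide = dict(path_dist_m=0.2, goal_dist_m=1.0, max_obstacle_cost=0.0)

high_bias = dict(path_distance_bias=32.0, goal_distance_bias=24.0, occdist_scale=0.01)
low_bias = dict(path_distance_bias=5.0, goal_distance_bias=24.0, occdist_scale=0.01)
```

With the higher bias, the path-hugging candidate scores lower and is chosen; with the reduced bias, the wider candidate wins, which is consistent with the smoother avoidance behavior described above.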

Once the robot reaches the location of the requested object, the grasping mode is activated, and YOLOv8 is used for object detection, classification, and localization. In this work, we collected a dataset to fine-tune YOLOv8 to detect user-requested objects. The upcoming subsection will discuss in detail the dataset and the YOLO performance optimization.

3.3. Object Detection and Localization

In this work, a custom dataset comprising 247 images captured from various perspectives was collected and annotated. Each image contains one or more user-requested objects, with the training set including four object categories: pens, markers, cards, and screwdrivers. This is just a sample of objects that the user can ask the robot to retrieve. Given the manipulator’s 150 g payload constraint, we restricted our experiments to user-requested items whose mass falls within the allowable load capacity. The dataset can be extended to include other objects based on the user’s needs. These items were deliberately selected for their elongated shapes, making them suitable for secure grasping within the width constraints of the ReactorX-200 end effector.

The dataset was divided into 182 images for training, 17 images for validation, and 48 images for testing purposes. We employed various data augmentation techniques to enhance the robustness of the training dataset and improve the model’s generalization capability. First, random rotations were applied to simulate varying viewing angles, enabling the model to recognize objects from different perspectives. Gaussian blurring was introduced to replicate slightly out-of-focus imagery, improving the model’s tolerance to suboptimal image quality. Additionally, random noise was added to account for real-world imperfections and sensor variability. Finally, brightness adjustments were made to reflect diverse lighting conditions. These augmentations not only increased the diversity of the training data but also significantly improved the model’s adaptability to real-world environments.
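Two of these augmentations, brightness adjustment and additive noise, are simple enough to sketch on a tiny grayscale image stored as nested lists. This is a toy illustration with hypothetical function names; a real pipeline would operate on full images with an image-processing library and would also cover rotation and blurring.

```python
import random
from typing import List

Image = List[List[int]]  # grayscale pixels in the range 0 to 255


def _clamp(value: float) -> int:
    """Round to the nearest integer and clip to the valid pixel range."""
    return min(255, max(0, round(value)))


def adjust_brightness(image: Image, factor: float) -> Image:
    """Scale every pixel; factor > 1 brightens, factor < 1 darkens."""
    return [[_clamp(pixel * factor) for pixel in row] for row in image]


def add_gaussian_noise(image: Image, sigma: float, rng: random.Random) -> Image:
    """Add zero-mean Gaussian noise with standard deviation sigma."""
    return [[_clamp(pixel + rng.gauss(0.0, sigma)) for pixel in row] for row in image]
```

Passing a seeded `random.Random` instance keeps the augmentation reproducible between training runs.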

In this work, we tested two pre-trained YOLO object detection models, YOLOv8n and YOLOv8m, during the training phase. According to the official Ultralytics benchmarks [28], YOLOv8m contains 25.9M parameters, approximately 8× more than YOLOv8n (3.2M), and requires 78.9B FLOPs, compared to 8.7B FLOPs for YOLOv8n. In terms of inference speed, YOLOv8n runs at approximately 80 ms per frame on CPU (ONNX), whereas YOLOv8m requires around 235 ms, making it nearly 3× slower. Training time scales similarly with model size, and YOLOv8m requires approximately 7–9× longer training time than YOLOv8n under identical settings. Given the limited computational resources of our host device (Intel NUC) and the additional processing demands of point cloud data, YOLOv8n imposed significantly less load on the CPU while maintaining competitive accuracy in object detection tasks. As a result, YOLOv8n was selected to satisfy real-time constraints on an edge computing platform in this work.
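The ratios quoted above follow from simple arithmetic on the figures given in the text, as the short check below shows. The dictionary layout and function names are only for illustration; the equivalent frame rates are derived values, not measurements.

```python
# Figures as quoted in the text (parameters in millions, FLOPs in billions,
# CPU inference time in milliseconds per frame).
SPECS = {
    "YOLOv8n": {"params_m": 3.2, "flops_b": 8.7, "cpu_ms": 80.0},
    "YOLOv8m": {"params_m": 25.9, "flops_b": 78.9, "cpu_ms": 235.0},
}


def ratio(key: str) -> float:
    """How many times larger the YOLOv8m value is than the YOLOv8n value."""
    return SPECS["YOLOv8m"][key] / SPECS["YOLOv8n"][key]


def frames_per_second(model: str) -> float:
    """Frame rate implied by the per-frame CPU latency."""
    return 1000.0 / SPECS[model]["cpu_ms"]
```

This gives roughly 8.1 times the parameters, 9.1 times the FLOPs, and 2.9 times the latency for YOLOv8m, or about 12.5 frames per second for YOLOv8n against about 4.3 for YOLOv8m on CPU.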

Figure 7 illustrates the training and validation losses while training the YOLOv8n model for 170 epochs using our collected custom dataset. The training results demonstrate that the model became more capable of accurate target localization and classification. Losses on both the training and validation sets decreased steadily as the number of iterations increased, except for the val/box_loss curve. The val/box_loss curve fluctuated because of the limited size and variability of the validation dataset, as well as the sensitivity of bounding-box regression to object localization uncertainty. Despite short-term oscillations, the overall trend shows a gradual downward trajectory, indicating convergence rather than instability. The training process was performed on an NVIDIA GeForce RTX 3080 Ti laptop GPU with 16 GB of memory.

Figure 8 illustrates the model’s accuracy across different confidence levels for various object categories. The model achieves an optimal overall F1-score of 0.99 at a confidence threshold of 0.358, highlighting a strong performance across all classes.
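An operating threshold of this kind is typically found by sweeping the confidence threshold and keeping the value that maximizes the F1-score. The sketch below shows the idea on a list of scored detections; it is not the Ultralytics implementation, the function names are hypothetical, and the data in the usage note are made up.

```python
from typing import List, Sequence, Tuple

# Each detection: (confidence, whether it matches a ground-truth object).
Scored = List[Tuple[float, bool]]


def f1_at_threshold(scored: Scored, n_positives: int, threshold: float) -> float:
    """F1-score when only detections at or above the threshold are kept."""
    kept = [is_true for confidence, is_true in scored if confidence >= threshold]
    tp = sum(kept)
    fp = len(kept) - tp
    fn = n_positives - tp
    denominator = 2 * tp + fp + fn
    # 2TP / (2TP + FP + FN) equals the harmonic mean of precision and recall.
    return 2 * tp / denominator if denominator else 0.0


def best_threshold(scored: Scored, n_positives: int, thresholds: Sequence[float]) -> float:
    """Threshold from the candidate list that maximizes the F1-score."""
    return max(thresholds, key=lambda t: f1_at_threshold(scored, n_positives, t))
```

For instance, with six scored detections of which four are correct, a sweep over a few candidate thresholds returns the one that balances missed objects against false alarms.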

Figure 9 illustrates the confusion matrix of the trained YOLOv8n model for object detection on the validation and testing datasets. In this matrix, the rows represent predicted categories, and the columns represent true categories. The model perfectly predicted the categories of “card,” “pen,” and “marker,” with 16 correct predictions for each and no misclassifications. For “screwdriver,” the model also performed almost perfectly, correctly identifying it 16 times with only one instance misclassified as “background.”

3.4. Object Grasping and Retrieval

In the evaluated grasping scenarios, several assumptions were made. Instead of a standard table, we used a box with a height lower than the mobile robot (24 cm), as illustrated in Figure 10. Additionally, the test objects were placed in an upright orientation to ensure full visibility for detection by the Kinect camera. Figure 10 shows a sample of real-time testing of YOLOv8n after training. This figure indicates the good performance of the trained model in detecting and localizing objects in real time, even against a cluttered background.

Figure 11 shows the pipeline for object localization using the trained YOLOv8n model, followed by grasping with the ReactorX-200 robotic arm, with a pen as an example. In Figure 11a, the pen is detected in real time by the Kinect camera and highlighted with a bounding box displaying the item ID, name, and confidence score. The point cloud data from the RGB-D camera and the detected target positions are visualized within the coordinate frame of the robotic arm’s base in Figure 11b. The alignment between the detected target positions and the corresponding point cloud of the object confirms the successful recognition and localization of the target. Subsequently, the detected coordinates of the pen are transformed from the camera frame to the robotic arm’s coordinate system to enable precise manipulation. To account for the physical constraints introduced by the Pioneer P3-DX robot’s chassis and the item holder, the chassis is modeled as a static virtual obstacle within the planning environment (Figure 11c). Finally, Figure 11d illustrates the successful grasping of the pen by the ReactorX-200 arm using the transformed coordinates.

To evaluate the system’s reliability, we conducted multiple pick-and-place trials across varied scenarios. The arm consistently handled all four test objects, with only occasional failures. In the failure cases, the arm initiated the grasp but deviated slightly to the left or right, which resulted in unsuccessful pickups. Further analysis showed that these failures stemmed primarily from sensor noise and from cumulative positioning drift caused by minor wheel slip during extended multi-object retrieval and delivery sequences, which increased the probability of misalignment and pickup errors over time.

4. Discussion

This section discusses the key findings of the study in relation to the research questions defined in Section 1 and highlights the practical implications and limitations of the proposed autonomous robotic system.

4.1. Discussion of Key Findings

RQ1: Voice-based human–robot interaction. The experimental results demonstrate that voice-based interaction provides an intuitive and accessible interface for initiating object retrieval tasks in indoor environments. The integration of Azure speech recognition enabled reliable extraction of task-relevant keywords, including object names and locations, without requiring the user to interact with traditional graphical interfaces. The observed success rate ranged from 85% to 95%, with occasional failures caused by pronunciation ambiguity or transient network latency. This interaction paradigm is particularly suitable for older adults and individuals with functional limitations, as it reduces physical effort while maintaining task flexibility.
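
The keyword extraction step can be illustrated with a small sketch that operates on an already transcribed command. It does not call the Azure service and is not the authors' code. The object vocabulary matches the four test objects, the two locations are taken from the Figure 6 caption, and the function name and matching rule are assumptions made for illustration.

```python
import re

# Object names are the four test objects used in the paper.
KNOWN_OBJECTS = {"card", "pen", "marker", "screwdriver"}
# Locations named in the Figure 6 caption; a real map would define more.
KNOWN_LOCATIONS = {"kitchen", "corridor"}


def extract_keywords(transcript):
    """Return (objects, locations) mentioned in a transcribed voice command."""
    words = re.findall(r"[a-z]+", transcript.lower())
    objects = [w for w in words if w in KNOWN_OBJECTS]
    locations = [w for w in words if w in KNOWN_LOCATIONS]
    return objects, locations


command = "Please bring me the pen and the marker from the kitchen."
print(extract_keywords(command))
```

Exact word matching of this kind would miss plurals and synonyms, which is one place where pronunciation or phrasing variations could lead to the occasional failures noted above.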

RQ2: Mobile robot navigation algorithm. The results confirm that SLAM-based environment mapping combined with the move_base navigation stack enables robust autonomous mobility in cluttered and dynamic household-like environments. By tuning costmap and planner parameters to the Pioneer P3-DX platform, the robot successfully navigated between predefined semantic locations while avoiding obstacles. The success rate ranged from 90% to 95% under mostly static indoor conditions, with failures typically occurring due to local planner oscillations or temporary localization drift. The navigation framework proved effective in maintaining positional accuracy and smooth motion, supporting reliable transitions between navigation and manipulation phases.
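
A navigation run can be scored as a success by comparing the final pose with the goal pose. The sketch below is a minimal illustration that uses the positional (0.05 m) and orientation (2°) tolerances reported in Section 5 as default thresholds. The function names and the pose representation are assumptions, and this is not the authors' evaluation script.

```python
import math


def angle_diff(a, b):
    """Smallest signed difference a - b, wrapped to [-pi, pi)."""
    return (a - b + math.pi) % (2.0 * math.pi) - math.pi


def goal_reached(pose, goal, xy_tol=0.05, yaw_tol=math.radians(2.0)):
    """pose and goal are (x, y, yaw) tuples in metres and radians."""
    dist = math.hypot(pose[0] - goal[0], pose[1] - goal[1])
    yaw_err = abs(angle_diff(pose[2], goal[2]))
    return dist <= xy_tol and yaw_err <= yaw_tol


# Hypothetical final pose: 3 cm from the goal, 1 degree off across the +/-180 wrap.
final_pose = (1.03, 2.00, math.radians(179.5))
goal_pose = (1.00, 2.00, math.radians(-179.5))
print(goal_reached(final_pose, goal_pose))
```

Wrapping the yaw difference matters because headings of 179.5° and −179.5° are only 1° apart, even though their raw difference is 359°.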

RQ3: Object detection and localization. The deep learning-based perception module achieved high accuracy in detecting and classifying user-requested objects across varying viewpoints and lighting conditions. The use of YOLOv8n provided a favorable balance between computational efficiency and detection performance, making it suitable for deployment on resource-constrained platforms. Across validation and test sets, detection performance was consistently high, with overall accuracy and F1-scores in the 85–95% range. Occasional performance degradation was observed under challenging lighting conditions or partial occlusion, which reflects realistic operational constraints. The integration of object detection with RGB-D point cloud data enabled object localization, an essential step for manipulation tasks. However, we observed slight lateral shifts of the manipulator relative to the target object during execution in some trials; these are discussed in Section 4.2.

RQ4: Coordinated mobile manipulation and object delivery. The coordinated integration of the Pioneer P3-DX mobile base and the ReactorX-200 robotic arm enabled the system to execute complete object retrieval and delivery tasks autonomously. Modeling the mobile robot chassis as a virtual obstacle ensured collision-free manipulation, while inverse kinematics and motion planning allowed the arm to grasp and transport objects securely. The results demonstrate that a tightly integrated mobile manipulator can achieve reliable end-to-end task execution in a home-like environment.

Overall, the findings validate the feasibility of combining voice interaction, autonomous navigation, perception, and manipulation into a unified robotic framework for assistive applications. Rather than introducing new algorithms, the contribution of this work lies in the system-level integration and experimental validation of an end-to-end assistive robotic workflow.

4.2. Limitations and Future Work

Despite the promising results, several limitations of the proposed system should be acknowledged. First, the experiments were conducted in a controlled indoor environment with predefined semantic locations, which may not fully capture the variability of real residential settings. Second, the set of detectable and manipulable objects was limited to lightweight items due to the ReactorX-200 manipulator’s payload constraints.

In some grasping trials, object detection performance degraded under challenging lighting conditions or partial occlusion. In addition, the manipulator exhibited slight lateral shifts relative to the target object during execution. Further analysis suggests that these deviations are primarily caused by sensor noise and residual calibration errors between the RGB-D camera and the robotic arm. This highlights the need for more accurate camera calibration to improve grasping precision.
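
To give a sense of scale for how a residual rotational calibration error can become a lateral shift at the gripper, the sketch below rotates a target that lies straight ahead of the arm by a small yaw error. The distance and error values are hypothetical and were not measured in this study.

```python
import math


def lateral_shift(distance_m, yaw_error_deg):
    """Sideways displacement of a target straight ahead at distance_m when the
    camera-to-arm rotation is off by yaw_error_deg about the vertical axis."""
    return distance_m * math.sin(math.radians(yaw_error_deg))


# HYPOTHETICAL values: target 0.30 m from the arm base, yaw errors in degrees.
for err_deg in (0.5, 1.0, 2.0):
    shift_mm = 1000.0 * lateral_shift(0.30, err_deg)
    print(f"{err_deg:.1f} deg -> {shift_mm:.1f} mm")
```

With these illustrative numbers, a 2° error displaces the target by roughly 1 cm, which is comparable to the width of a thin object such as a pen.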

5. Conclusions

This paper presents a complete end-to-end robotic system designed to assist with object retrieval and delivery, aimed at enhancing the independence of older adults and users living with disability. The proposed system integrates ROS-based navigation, Azure-powered speech recognition, and YOLOv8n-based object detection to support intuitive human–robot interaction in domestic environments. A custom dataset of 247 annotated RGB images was collected, augmented, and used to train a lightweight YOLOv8n model, achieving 85–95% accuracy and F1-score across validation and test sets while maintaining real-time inference capability on resource-constrained hardware. The speech recognition module demonstrated a success rate of 85–95% over approximately 50 trials, reliably extracting target object keywords despite variations in phrasing and pronunciation. Autonomous navigation experiments achieved a 90–95% success rate, meeting strict positional (0.05 m) and orientation (2°) tolerances without collision in indoor environments. Object localization and grasp execution were enabled through extrinsic calibration between the Kinect v2 camera and the robotic arm, supporting consistent end-to-end object retrieval. Overall, the quantitative results confirm the system’s effectiveness in autonomous navigation, obstacle avoidance, and object retrieval within realistic domestic scenarios.

Future work will focus on refining camera calibration procedures and incorporating online calibration or visual servoing techniques to reduce alignment errors during manipulation. Addressing these limitations will be critical for transitioning the proposed system from a laboratory setting to real-world assistive deployment scenarios. In addition, we will actively incorporate end-user participation to guide system refinement and ensure that the research agenda remains aligned with real-world needs. We plan to engage users, caregivers, and domain experts as informants to identify high-priority objects and tasks on which the robot should be trained, and to provide structured usability and UX feedback throughout iterative development. Furthermore, we plan to integrate door-opening capabilities into the manipulator, further enhancing the system’s reliability and applicability in real-world environments.

Author Contributions

Conceptualization, A.M. and Y.A.B.; methodology, A.M., J.L. and C.L.; software, J.L. and C.L.; validation, A.M., J.L. and C.L.; formal analysis, A.M., J.L. and C.L.; investigation, A.M. and Y.A.B.; resources, A.M. and Y.A.B.; data curation, A.M., J.L. and C.L.; writing—original draft preparation, A.M., J.L. and C.L.; writing—review and editing, A.M., J.L. and C.L.; visualization, A.M., J.L. and C.L.; supervision, A.M. and Y.A.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

No new data were created or analyzed in this study. Data sharing is not applicable to this article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. United Nations. Decade of Healthy Ageing: Baseline Report. 2020. Available online: https://cdn.who.int/media/docs/default-source/decade-of-healthy-ageing/decade-proposal-final-apr2020-en.pdf (accessed on 1 February 2026).
2. Spillman, B.C.; Lubitz, J. The effect of longevity on spending for acute and long-term care. N. Engl. J. Med. 2000, 342, 1409–1415.
3. Alemayehu, B.; Warner, K.E. The lifetime distribution of health care costs. Health Serv. Res. 2004, 39, 627–642.
4. Asgharian, P.; Panchea, A.M.; Ferland, F. A review on the use of mobile service robots in elderly care. Robotics 2022, 11, 127.
5. Fracasso, F.; Buchweitz, L.; Theil, A.; Cesta, A.; Korn, O. Social robots acceptance and marketability in Italy and Germany. Int. J. Soc. Robot. 2022, 14, 1463–1480.
6. Dolic, Z.; Castro, R.; Moarcas, A. Robots in Healthcare: A Solution or a Problem? European Parliamentary Research Service: Luxembourg, 2019.
7. Nanavati, A.; Ranganeni, V.; Cakmak, M. Physically assistive robots: A systematic review of mobile and manipulator robots that physically assist people with disabilities. Annu. Rev. Control. Robot. Auton. Syst. 2023, 7, 123–147.
8. Jung, S.H.; Shin, Y.S. Factors associated with intention to use care robots among people with physical disabilities. Nurs. Outlook 2024, 72, 102145.
9. Sørensen, L.; Johannesen, D.T.; Johnsen, H.M. Humanoid robots for assisting people with physical disabilities in activities of daily living: A scoping review. Assist. Technol. 2024, 37, 203–219.
10. Rendyansyah, R.; Prasetyo, A.P.P.; Sembiring, S. Voice command recognition for movement control of a 4-DoF robot arm. J. Tek. Elektro 2022, 14, 118–124.
11. Li, S.-A.; Liu, Y.-Y.; Chen, Y.-C.; Feng, H.-M.; Shen, P.-K.; Wu, Y.-C. Voice interaction recognition design in real-life scenario mobile robot applications. Appl. Sci. 2023, 13, 3359.
12. Gupta, S.; Girshick, R.; Arbeláez, P.; Malik, J. Learning rich features from RGB-D images for object detection and segmentation. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; pp. 345–360.
13. Fox, D.; Burgard, W.; Thrun, S. The dynamic window approach to collision avoidance. IEEE Robot. Autom. Mag. 1997, 4, 23–33.
14. Luber, M.; Spinello, L.; Silva, J.; Arras, K.O. Socially-aware robot navigation: A learning approach. In Proceedings of the 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, Vilamoura-Algarve, Portugal, 7–12 October 2012; pp. 902–907.
15. Manso, L.J.; Jorvekar, R.R.; Faria, D.R.; Bustos, P.; Bachiller, P. Graph neural networks for human-aware social navigation. In Advances in Physical Agents II (WAF 2020); Springer: Cham, Switzerland, 2021; pp. 167–179.
16. Trossen Robotics. Interbotix RX-200 Robot Arm Specifications. Available online: https://docs.trossenrobotics.com/interbotix_xsarms_docs/specifications/rx200.html (accessed on 17 January 2025).
17. Satapathi, A.; Mishra, A. Build a desktop application for speech-to-text conversation using Azure cognitive services. In Developing Cloud-Native Solutions with Microsoft Azure and .NET; Springer: Berkeley, CA, USA, 2022; pp. 219–230.
18. Durrant-Whyte, H.; Bailey, T. Simultaneous localization and mapping: Part I. IEEE Robot. Autom. Mag. 2006, 13, 99–110.
19. Quigley, M.; Gerkey, B.; Smart, W.D. Programming Robots with ROS; O’Reilly Media: Sebastopol, CA, USA, 2015.
20. Hart, P.E.; Nilsson, N.J.; Raphael, B. A formal basis for the heuristic determination of minimum cost paths. IEEE Trans. Syst. Sci. Cybern. 1968, 4, 100–107.
21. Biggs, N. The Traveling Salesman Problem: A Guided Tour of Combinatorial Optimization; Wiley Online Library: Hoboken, NJ, USA, 1986.
22. Voudouris, C.; Tsang, E. Guided local search and its application to the traveling salesman problem. Eur. J. Oper. Res. 1999, 113, 469–499.
23. Sohan, M.; Sai Ram, T.; Reddy, R.; Venkata, C. A review on YOLOv8 and its advancements. In Proceedings of the International Conference on Data Intelligence and Cognitive Informatics, Tirunelveli, India, 18–20 November 2024; pp. 529–545.
24. Jiang, J.; Luo, X.; Luo, Q.; Qiao, L.; Li, M. An overview of hand-eye calibration. Int. J. Adv. Manuf. Technol. 2022, 119, 77–97.
25. Wang, J.; Olson, E. AprilTag 2: Efficient and robust fiducial detection. In Proceedings of the 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Daejeon, Republic of Korea, 9–14 October 2016; pp. 4193–4198.
26. Kuffner, J.J.; LaValle, S.M. RRT-Connect: An Efficient Approach to Single-Query Path Planning. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), San Francisco, CA, USA, 24–28 April 2000; IEEE: New York, NY, USA, 2000; Volume 2, pp. 995–1001.
27. Niku, S.B. Introduction to Robotics: Analysis, Control, Applications; John Wiley & Sons: Hoboken, NJ, USA, 2020.
28. Ultralytics. YOLOv8 Models: Performance Metrics. Available online: https://docs.ultralytics.com/models/yolov8/#performance-metrics (accessed on 1 February 2026).

Figure 1. Assistant robot system components.

Figure 2. System Workflow of a Home Assistant Robot: Integrating Microsoft Azure Speech-to-Text for Object Requests, Move_base Navigation Stack Planning (A* Global and DWA Local Planners), YOLOv8 for Object Detection, and the DH Method for Robotic Arm Control.

Figure 3. Hand–eye calibration using an AprilTag. (a) Real ReactorX-200 robotic arm with the attached AprilTag. (b) Transformation of the detected object’s pose from the camera frame to the robot base frame. The coordinate axes follow the standard convention: blue for the x-axis, red for the y-axis, and green for the z-axis.

Figure 4. ReactorX-200 D–H parameter representation of the robotic manipulator. Joint axes are labeled as $θ_{1}$–$θ_{5}$, indicating the corresponding revolute joint variables, and link lengths are denoted by $L_{1}$–$L_{4}$.

Figure 5. Map of the simulated home environment at UDM.

Figure 6. Robot navigation path (black curved line) from the kitchen to the main corridor within a pre-created map using the move_base navigation stack. The colored background represents the costmap, where darker regions indicate higher obstacle costs and lighter regions denote free space. Cyan points correspond to sensor-based obstacle detections, while black contours outline the mapped walls and static obstacles. The red robot footprint indicates the robot’s current pose and the red arrow shows its orientation.

Figure 7. Training and validation losses while training YOLOv8n on our collected custom dataset.

Figure 8. F1-Confidence Curve of Training YOLOv8n on Our Custom Dataset.

Figure 9. Confusion Matrix on Validation and Testing Datasets.

Figure 10. Sample of Real-time Testing of YOLOv8n Model.

Figure 11. Overview of the pen localization and grasping pipeline, including detection, coordinate transformation, obstacle modeling, and final grasping by the ReactorX 200 arm. (a) Pen detection using a trained YOLOv8n model. (b) Coordinate transformation of the detected pen from the camera frame to the robot arm frame. (c) Modeling the Pioneer P3-DX robot’s chassis as a static obstacle for safe grasping. (d) The ReactorX 200 arm successfully grasping the pen.

Table 1. Move_base Package Parameters’ Description.

Parameter	Description
lethal_cost	Cost assigned to impassable areas
neutral_cost	Baseline cost for traversable regions
cost_factor	Weighting factor applied to obstacle costs
path_distance_bias	Weight influencing path smoothness and adherence
goal_distance_bias	Weight influencing convergence toward the goal
occdist_scale	Weight applied to obstacle proximity cost
xy_goal_tolerance	Positional tolerance in the x and y directions
yaw_goal_tolerance	Orientation tolerance in yaw angle

Table 2. Joint Variables and Default Limits for the ReactorX-200 Manipulator [16].

Joint Variable	Min	Max	Description
$θ_{1}$ (Waist)	$-180°$	$180°$	Base rotation joint controlling horizontal orientation of the manipulator.
$θ_{2}$ (Shoulder)	$-108°$	$113°$	Shoulder joint governing vertical lifting and lowering of the arm.
$θ_{3}$ (Elbow)	$-108°$	$93°$	Elbow joint controlling arm extension and retraction.
$θ_{4}$ (Wrist Angle)	$-100°$	$123°$	Wrist pitch joint enabling angular adjustment of the end effector.
$θ_{5}$ (Wrist Rotate)	$-180°$	$180°$	Wrist rotation joint controlling end-effector orientation.
Gripper opening	30 mm	74 mm	Linear opening range of the gripper used for grasp execution.
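
The limits in Table 2 can be used to reject an inverse-kinematics solution before it is sent to the arm. The sketch below is a minimal illustration with the limit values copied from the table. The joint keys and the function name are hypothetical and do not correspond to a specific driver API.

```python
# Joint limits in degrees, copied from Table 2. Key names are illustrative.
JOINT_LIMITS_DEG = {
    "waist": (-180.0, 180.0),
    "shoulder": (-108.0, 113.0),
    "elbow": (-108.0, 93.0),
    "wrist_angle": (-100.0, 123.0),
    "wrist_rotate": (-180.0, 180.0),
}


def limit_violations(joint_angles_deg):
    """Return the names of joints whose requested angle is outside Table 2."""
    violations = []
    for name, angle in joint_angles_deg.items():
        low, high = JOINT_LIMITS_DEG[name]
        if not low <= angle <= high:
            violations.append(name)
    return violations


# Hypothetical IK solution with a shoulder angle beyond its upper limit.
candidate = {"waist": 0.0, "shoulder": 120.0, "elbow": -50.0,
             "wrist_angle": 0.0, "wrist_rotate": 90.0}
print(limit_violations(candidate))
```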

Table 3. Move_base Package Parameters’ assigned values.

Parameter	Value	Description
lethal_cost	253	Impassable areas cost
neutral_cost	60	Moderate area cost
cost_factor	0.7	Cost weighting parameter
path_distance_bias	32.0	Path distance weight
goal_distance_bias	20.0	Goal distance weight
occdist_scale	0.02	Obstacle cost weight
xy_goal_tolerance	0.2	Goal tolerance in x, y
yaw_goal_tolerance	0.2	Goal tolerance in yaw
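
Figure 2 names DWA as the local planner. In the ROS dwa_local_planner documentation, each candidate trajectory is scored as a weighted sum of its distance to the global path, its distance to the local goal, and the maximum obstacle cost along it, and the lowest score wins. The sketch below applies that weighted sum with the three local-planner weights from Table 3. The two candidate trajectories are hypothetical, and the sketch is an illustration rather than the planner's actual code.

```python
# Local-planner weights from Table 3.
PATH_DISTANCE_BIAS = 32.0
GOAL_DISTANCE_BIAS = 20.0
OCCDIST_SCALE = 0.02


def trajectory_cost(path_dist_m, goal_dist_m, max_obstacle_cost):
    """Weighted cost of one candidate trajectory (lower is better).

    path_dist_m: distance from the trajectory endpoint to the global path
    goal_dist_m: distance from the trajectory endpoint to the local goal
    max_obstacle_cost: highest costmap value along the trajectory (0-254)
    """
    return (PATH_DISTANCE_BIAS * path_dist_m
            + GOAL_DISTANCE_BIAS * goal_dist_m
            + OCCDIST_SCALE * max_obstacle_cost)


# Two HYPOTHETICAL candidates: one hugs the path but passes near an obstacle,
# the other leaves the path slightly to stay in free space.
hugging = trajectory_cost(0.02, 0.80, 200)
detour = trajectory_cost(0.10, 0.85, 0)
print(hugging, detour)
```

With these illustrative inputs the detour scores slightly lower, which shows how the obstacle weight can outweigh a small deviation from the global path.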
