1. Introduction
With the global population aging rapidly, the proportion of older adults (OAs) in many countries is steadily increasing. The United Nations projects that the number of OAs worldwide will reach 2.1 billion by 2050 [1], increasing the demand for solutions that support independent living for older adults [2,3]. In addition to OAs, many users living with disability have difficulty moving, carrying, and handling objects and rely heavily on caregivers for assistance. This raises the need for robotic systems that enhance the independence of users living with disability and older adults.
Socially assistive robots (SARs) have emerged as a promising solution to support independent living for older adults by providing companionship, assistance with daily tasks, and health monitoring [
4]. Recent studies highlight the significant potential of SARs in eldercare, driving substantial global investments aimed at their integration into daily life [
5,
6]. Nanavati et al. [
7] provided a systematic review of physically assistive robots that emphasizes trends toward higher autonomy, improved interaction interfaces, and the need for evaluations with end users in real-world settings. Jung and Shin [
8] investigated the intention of people with physical disabilities to use care robots. This investigation concluded that the majority of participants expressed willingness to adopt such systems. Complementing this, a scoping review on humanoid robots assisting activities of daily living for people with physical disabilities reports generally positive user perceptions, while underscoring limited technical readiness and personalization for home deployment [
9]. To effectively assist users, SARs require several key functionalities, including a voice recognition algorithm for natural interaction, an object detection algorithm to identify and locate the requested items, a grasping algorithm to retrieve these items, and an autonomous navigation algorithm to enable seamless movement within the user’s environment. Researchers have extensively explored these individual capabilities to enhance the overall performance and usability of SARs.
For speech recognition techniques, Rendyansyah et al. [
10] used Mel-frequency cepstral coefficients combined with artificial neural networks and deep neural networks to control the movements of a 4-DOF robot. Similarly, Li et al. [
11] integrated a deep learning-based speaker separation model with an automatic speech recognition system, allowing robots to interpret spoken commands while accurately filtering out background noise. Object detection methods enable SARs to precisely detect and locate target objects, even in complex environments. Gupta et al. [
12] proposed a geocentric embedding for depth images to improve object detection and instance segmentation in RGB-D images, achieving significant gains over existing methods. Finally, SARs require navigation algorithms that are efficient in dynamic environments. Traditional planners such as the Dynamic Window Approach (DWA) [
13] have been adapted to account for human presence. Recent studies have increasingly focused on integrating social norms directly into path planning. Approaches such as Social Force Models [
14] and deep learning techniques, like Socially Aware Navigation with Graph Neural Networks [
15], have shown promising results.
Although these single-function technologies have achieved significant progress, a fully integrated system combining these functionalities is still lacking. In this work, we present a unified modular assistive framework that integrates multiple algorithms to address human functioning limitations due to a disabling health condition or age-related functional decline.
1.1. Research Questions
The goal of this work is to design, implement, and evaluate an autonomous socially assistive robot capable of retrieving and delivering objects to support older adults and individuals experiencing functioning limitations. The following research questions were defined:
RQ1: How can voice-based human–robot interaction be leveraged to enable intuitive and accessible communication for object retrieval tasks in home environments?
RQ2: Which robot navigation algorithm is most effective for enabling reliable autonomous mobility in cluttered and dynamic household settings?
RQ3: How can deep learning-based object detection and AprilTag-assisted localization provide robot perception for identifying and retrieving user-requested items?
RQ4: How can an integrated mobile manipulator, combining a Pioneer P3-DX base and a ReactorX-200 robotic arm, be coordinated to achieve grasping and delivery of objects to the user?
1.2. Research Variables
The research variables are categorized into navigation stack parameters and manipulator joints.
Navigation Stack Parameters: Several parameters within the move_base navigation stack were manipulated to adapt the system to the Pioneer P3-DX mobile robot equipped with a 270-degree LiDAR. These parameters include costmap-related variables such as robot footprint and inflation radius, as well as global and local planner parameters (e.g., lethal_cost, neutral_cost, cost_factor, path_distance_bias, goal_distance_bias, and occdist_scale). The description of these parameters is summarized in Table 1.
Manipulator Joints: The joint variables of the ReactorX-200 manipulator, along with their default safe operating limits as defined in the firmware, are summarized in Table 2. These limits constrain the inverse kinematics solution space and ensure that all computed joint configurations remain within mechanically feasible and collision-safe ranges during grasp execution.
1.3. Contribution of the Study
The main contribution of this work is the integration of subsystems to develop an end-to-end object retrieval solution that helps older adults and users with disabilities. The proposed system consists of the following key components:
Voice Recognition: A speech recognition module interprets user commands to identify requested objects.
Environment Mapping: The system generates a detailed map of the user’s home environment and a list that assigns spatial locations to the identified objects.
Path Optimization: An optimization algorithm minimizes the robot’s travel distance while navigating the mapped environment.
Navigation Stack Implementation: A global planner uses the generated map to compute a global path to the target objects’ locations, while a local planner uses real-time LiDAR data to dynamically avoid obstacles.
Object Detection and Localization: A YOLO-based deep learning model is used for robust object detection and localization within the environment. AprilTag markers are employed to accurately align the robotic arm by transforming object coordinates from the camera frame to the robot’s coordinate frame, ensuring precise grasping.
Robotic Arm Motion Planning: A motion planning algorithm enables the robotic arm to grasp requested objects, store them in an attached holder, and deliver them to the user once collection is complete.
By integrating these algorithms, the proposed system advances autonomous robotic assistance, promoting greater accessibility and independence for individuals with disabilities.
The remainder of this paper is organized as follows:
Section 2 describes the hardware design and system integration of the autonomous robotic platform, including the incorporation of navigation, perception, and manipulation.
Section 3 presents experimental evaluations conducted in a real-world, home-like environment to assess the system’s performance across object detection, grasping, and delivery tasks. Finally,
Section 4 summarizes key findings and discusses directions for future enhancements.
2. Methodology
The proposed robot assistant system is designed to assist older adults and people with disabilities in performing their daily living activities.
Figure 1 illustrates the physical setup of the proposed system. The system comprises four main components: a Pioneer P3-DX mobile robot (MobileRobots Inc., Amherst, NH, USA), a ReactorX-200 arm (Trossen Robotics, Downers Grove, IL, USA), sensors including a Hokuyo UST-10LX LiDAR (Osaka, Japan) and RGB-D Kinect v2 camera, and an Intel NUC for processing. The camera integrated with the YOLOv8 algorithm is employed for object recognition and localization within the robot’s surroundings. The Pioneer P3-DX mobile robot, equipped with a Hokuyo LiDAR, is utilized for navigation and obstacle avoidance, enabling it to reach the user-requested objects. The ReactorX-200 robotic arm is then used to pick and place the identified target objects.
In the proposed system, the user requests the robot to retrieve various objects through voice commands. The system processes these commands by extracting keywords and localizing the requested objects within the predefined locations on the building map. For example, the keyword “pen” is associated with the “office” location in the building map. Once all commands are received, the system employs an optimization navigation algorithm to efficiently sequence the navigation to these locations while minimizing time and avoiding revisiting the same location.
Figure 2 illustrates the system workflow of our proposed home assistant system. The system operates in two main modes: navigation and grasping. In navigation mode, the Pioneer P3-DX robot moves from one location to another on the map to retrieve the requested objects. When the robot reaches the object’s designated location (e.g., a kitchen) and the camera detects the requested object in the environment, the system switches to grasping mode. In this mode, the trained YOLOv8 model is used to classify objects in the robot’s environment and to locate the requested objects with respect to the robotic arm frame. The object’s location detected by the camera is sent to the ReactorX-200 arm, which then picks it up and places it in a designated item holder attached to the Pioneer P3-DX robot. After successfully retrieving the object, the system reactivates the navigation mode to proceed to the next requested object location. This process is repeated for all requested objects. Once all objects are collected, the robot returns to the starting point where the user is waiting. The upcoming subsections provide a detailed discussion of the algorithms implemented in this system.
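The alternation between the two modes can be sketched as a simple loop. This is only an illustrative simulation under stated assumptions: the function name `run_retrieval`, the `Mode` enum, and the example stops are hypothetical and are not taken from the authors' implementation, and real navigation, detection, and grasping are replaced by log entries.

```python
from enum import Enum


class Mode(Enum):
    NAVIGATION = "navigation"
    GRASPING = "grasping"


def run_retrieval(ordered_stops, start="start"):
    """Simulate the two-mode workflow and return (log, holder).

    ordered_stops: list of (location, object_name) pairs, already sequenced.
    """
    log = []
    holder = []
    for location, obj in ordered_stops:
        # Navigation mode: drive to the semantic location of the next object.
        log.append((Mode.NAVIGATION, f"go to {location}"))
        # Grasping mode: detect, pick, and store the object in the item holder.
        log.append((Mode.GRASPING, f"pick {obj}"))
        holder.append(obj)
    # All objects collected: navigate back to the waiting user.
    log.append((Mode.NAVIGATION, f"return to {start}"))
    return log, holder


log, holder = run_retrieval([("office", "pen"), ("kitchen", "apple")])
for mode, action in log:
    print(mode.value, "->", action)
```

Keeping the mode switch explicit in this way makes it easy to see that the navigation mode is re-entered after every grasp and once more for the final return trip.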
2.1. Voice Recognition
The proposed system is launched when the user verbally specifies the objects to retrieve. Microsoft® Azure Speech-to-Text [17], a cloud-based platform that transcribes spoken commands into text in real time, is used to enable interaction between the user and the proposed system. To refine the extracted text, non-essential components such as articles (“a,” “an,” “the”) and prepositions are removed, so that only the keywords needed to identify the requested objects remain.
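The keyword-filtering step applied to a transcript can be sketched as follows. This is a minimal illustration that operates on already-transcribed text and does not call the Azure service; the stop-word list, the object vocabulary, and the function name `extract_objects` are assumptions made for the example.

```python
import re

# Hypothetical vocabulary of retrievable objects.
KNOWN_OBJECTS = {"pen", "screwdriver", "card", "marker"}
# Small illustrative stop-word list (articles and a few prepositions).
STOP_WORDS = {"a", "an", "the", "to", "from", "in", "on", "of", "for", "with"}


def extract_objects(transcript):
    """Return requested object keywords in the order they are spoken."""
    words = re.findall(r"[a-z]+", transcript.lower())
    content = [w for w in words if w not in STOP_WORDS]
    found = []
    for w in content:
        # Accept a naive plural form such as "pens" -> "pen".
        candidate = w[:-1] if w.endswith("s") and w[:-1] in KNOWN_OBJECTS else w
        if candidate in KNOWN_OBJECTS and candidate not in found:
            found.append(candidate)
    return found


print(extract_objects("I want a screwdriver and the markers"))
```

Matching against a closed vocabulary keeps the step robust to variations in phrasing, at the cost of ignoring any object that has not been registered in advance.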
2.2. Navigation and Path Planning
After extracting the keywords from the user commands, the system correlates the requested objects with their predefined locations on the map. For example, object keywords such as apple, pen, and comb are mapped to corresponding semantic locations such as the kitchen, office, and bedroom. We also maintain a list that stores the coordinate positions of each location within the map. Consequently, when the user requests an apple, the robot navigates to the kitchen’s coordinate positions in the map and initiates a local exploration routine until the Kinect camera, in conjunction with the YOLO detection algorithm, identifies and localizes the target object.
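The two lookups described here (object to semantic location, and location to map coordinates) can be sketched with plain dictionaries. The coordinates below are placeholders, since the actual map positions are not given in the text, and `goals_for` is a hypothetical helper name.

```python
# Hypothetical lookup tables; the real map coordinates are not stated in the paper.
OBJECT_TO_LOCATION = {"apple": "kitchen", "pen": "office", "comb": "bedroom"}
LOCATION_TO_XY = {"kitchen": (4.0, 1.5), "office": (-2.0, 3.0), "bedroom": (6.5, -1.0)}


def goals_for(objects):
    """Map requested objects to unique (location, (x, y)) navigation goals."""
    goals = []
    seen = set()
    for obj in objects:
        location = OBJECT_TO_LOCATION.get(obj)
        if location is None or location in seen:
            continue  # unknown object, or location already scheduled
        seen.add(location)
        goals.append((location, LOCATION_TO_XY[location]))
    return goals


print(goals_for(["apple", "pen", "apple"]))
```

Deduplicating by location rather than by object reflects the stated aim of not revisiting the same place when several requested items are stored together.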
The building map was constructed using the ROS Simultaneous Localization and Mapping (SLAM) package, v. 2.6.10 [18], which allows the robot to build an environment map while determining its location autonomously. Using the building map and the live LiDAR data, the move_base package [19], which provides a ROS interface for autonomous navigation by integrating global and local planners with obstacle avoidance mechanisms, enabled the robot to navigate to the locations and retrieve the requested objects.
In this work, we integrated the A* algorithm [
20] as a global path planner and the Dynamic Window Approach (DWA) [
13] as a local path planner. The A* algorithm efficiently computes the robot’s optimal global path based on the pre-constructed map, while the DWA refines the generated global path accounting for the robot’s kinematic and dynamic constraints and real-time LiDAR data. This dual-layered approach equips the navigation module to dynamically adapt to sudden environmental changes, ensuring reliability in real-world scenarios. To optimize the retrieval of multiple objects along the shortest route, we applied the Traveling Salesman Problem (TSP) formulation [
21] to determine the optimal ordering of object locations. The TSP seeks the shortest possible route that visits each city exactly once and returns to the origin city [
22]. The move_base package receives these sorted target locations and uses the global and local planners to produce the robot’s linear and angular velocities.
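For the handful of locations in a single request, the TSP ordering can be found by exhaustive search. The sketch below is one possible way to do this and is not necessarily the solver the authors used; it assumes straight-line (Euclidean) distances between goals, whereas a real deployment would more likely use planned path lengths on the map. The name `shortest_tour` and the example coordinates are hypothetical.

```python
from itertools import permutations
from math import dist


def shortest_tour(start, goals):
    """Return (order, length) of the shortest closed tour from start through all goals.

    goals: dict mapping location name -> (x, y).
    Brute force over all orderings, so only practical for a few goals.
    """
    best_order, best_length = (), float("inf")
    for order in permutations(goals):
        points = [start] + [goals[name] for name in order] + [start]
        length = sum(dist(p, q) for p, q in zip(points, points[1:]))
        if length < best_length:
            best_order, best_length = order, length
    return list(best_order), best_length


# Three goals on the corners of a 4 m x 3 m rectangle, with the start at the fourth corner.
order, length = shortest_tour((0.0, 0.0), {"a": (0.0, 3.0), "b": (4.0, 3.0), "c": (4.0, 0.0)})
print(order, length)
```

Here each object location plays the role of a "city" and the user's position is the origin. The factorial growth of the search is acceptable only because a household request involves few stops.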
2.3. Grasping and Trajectory Planning
Upon reaching the object’s designated location (e.g., an office), the Pioneer P3-DX robot performs a local exploration routine, moving forward and rotating in place until the target object is detected using the YOLOv8 algorithm [23], which identifies and classifies objects in the robot’s environment. The object’s relative position is estimated using the Kinect depth camera. If the object lies beyond the manipulator’s maximum reach of 550 mm, the mobile robot autonomously repositions itself so that the object is within the arm’s reach, ensuring successful retrieval. At this point, the navigation mode is deactivated, and the system transitions to the grasping mode for object retrieval.
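The reach check can be illustrated with a short calculation. This sketch assumes the object position is already expressed in the arm base frame and measures only planar distance; the safety `margin` and the function name `advance_needed` are hypothetical.

```python
from math import hypot

MAX_REACH_M = 0.550  # manipulator reach stated in the text (550 mm)


def advance_needed(obj_x, obj_y, margin=0.05):
    """Distance (m) the base should move toward the object so it is within reach.

    (obj_x, obj_y): object position in the arm base frame, in metres.
    margin: hypothetical safety margin kept inside the reach limit.
    """
    distance = hypot(obj_x, obj_y)
    target = MAX_REACH_M - margin
    return max(0.0, distance - target)


print(advance_needed(1.0, 0.0))  # object 1 m ahead: the base must close part of the gap
print(advance_needed(0.2, 0.0))  # object already within reach: no motion needed
```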
After the requested object is detected and localized using the YOLOv8 algorithm, its coordinates are transformed into the manipulator’s end-effector frame. To accurately compute the transformation matrix between the Kinect camera frame and the ReactorX-200 manipulator frame, we incorporate hand–eye calibration [
24]. Hand–eye calibration establishes the spatial relationship between the robot’s end effector (the ‘hand’) and the camera (the ‘eye’). The calibration process involves capturing multiple images from various viewpoints to ensure a robust estimation of the transformation matrix. In this work, we use an AprilTag system [
25], a fiducial marker system consisting of easily detectable 2D markers, to perform the hand–eye calibration, as illustrated in
Figure 3. The AprilTag is detected within the camera’s field of view during the calibration process, as shown in
Figure 3a. The system uses the detected pose of these tags relative to the camera to calculate the transformation between the camera frame and the arm frame, as illustrated in
Figure 3b.
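The role of the tag in relating the two frames can be sketched with homogeneous transforms. This is a simplified single-tag illustration, not a full hand–eye calibration over multiple viewpoints: it assumes the tag's pose in the arm frame is known and that the camera reports the tag's pose in its own frame. All numeric poses and helper names are hypothetical.

```python
def matmul(a, b):
    """Multiply two 4x4 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)] for i in range(4)]


def invert_rigid(t):
    """Invert a rigid transform [R p; 0 1] as [R^T, -R^T p; 0 1]."""
    r_t = [[t[j][i] for j in range(3)] for i in range(3)]
    p = [t[i][3] for i in range(3)]
    new_p = [-sum(r_t[i][k] * p[k] for k in range(3)) for i in range(3)]
    return [r_t[i] + [new_p[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]


def transform_point(t, point):
    """Apply a 4x4 homogeneous transform to a 3D point."""
    x, y, z = point
    return tuple(t[i][0] * x + t[i][1] * y + t[i][2] * z + t[i][3] for i in range(3))


# Hypothetical poses of the same AprilTag: known in the arm frame, detected in the camera frame.
T_arm_tag = [[1, 0, 0, 0.30], [0, 1, 0, 0.00], [0, 0, 1, 0.00], [0, 0, 0, 1]]
T_cam_tag = [[0, -1, 0, 0.10], [1, 0, 0, 0.00], [0, 0, 1, 0.50], [0, 0, 0, 1]]

# Camera pose in the arm frame: T_arm_cam = T_arm_tag * inv(T_cam_tag).
T_arm_cam = matmul(T_arm_tag, invert_rigid(T_cam_tag))

# Sanity check: the tag origin as seen by the camera maps back to the tag origin in the arm frame.
tag_in_arm = transform_point(T_arm_cam, (0.10, 0.00, 0.50))
print(tag_in_arm)
```

Once `T_arm_cam` is available, any object position detected in the camera frame can be passed through `transform_point` to obtain a grasp target in the arm frame.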
The hand–eye calibration is performed once and subsequently reused in the grasping mode. In this mode, the RRT-Connect algorithm [26], a bidirectional variant of the Rapidly Exploring Random Tree (RRT), is employed to generate an obstacle-free trajectory that guides the manipulator from its current configuration to the target configuration as a sequence of Cartesian points, enabling it to grasp the object and place it into the item holder mounted on the mobile robot. Each point along the generated trajectory must be converted into joint-space positions using inverse kinematics. We applied the Denavit–Hartenberg (D-H) method [27], which is widely employed in robotics to facilitate forward and inverse kinematics. It models the manipulator’s kinematics by representing transformations between adjacent links through rotation and translation matrices.
For the ReactorX-200 manipulator, the joint axes, angles, and link lengths are defined according to the D-H parameters, as illustrated in Figure 4. The inverse kinematic transformation matrix, Equation (1), takes the end-effector XYZ Cartesian position as input and computes the corresponding five joint angles. This transformation matrix consists of two components: the rotation matrix ($R$) represents the rotational orientation of the end effector, and the translation matrix ($P$) describes the positional displacement in 3D space. Equation (2) shows the details of the rotation and translation matrices, where the symbols $s$ and $c$ denote sine and cosine functions, respectively, with subscripts indicating the corresponding joint angles. For instance, $c_{ijk}$ denotes $\cos(\theta_i + \theta_j + \theta_k)$, the cosine of the cumulative rotation of three consecutive joints $i$, $j$, and $k$.
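The link-to-link transform of the standard D-H convention, and the reason cumulative-angle terms appear in the matrices, can be sketched as follows. The link lengths and angles are illustrative only and are not the ReactorX-200 parameters; `dh_matrix` and `chain` are hypothetical helper names.

```python
from math import cos, sin, isclose


def dh_matrix(theta, d, a, alpha):
    """Standard Denavit-Hartenberg transform between adjacent link frames."""
    ct, st, ca, sa = cos(theta), sin(theta), cos(alpha), sin(alpha)
    return [
        [ct, -st * ca, st * sa, a * ct],
        [st, ct * ca, -ct * sa, a * st],
        [0.0, sa, ca, d],
        [0.0, 0.0, 0.0, 1.0],
    ]


def chain(transforms):
    """Multiply a sequence of 4x4 transforms from base to tip."""
    result = [[float(i == j) for j in range(4)] for i in range(4)]
    for t in transforms:
        result = [[sum(result[i][k] * t[k][j] for k in range(4)) for j in range(4)]
                  for i in range(4)]
    return result


# Three consecutive parallel revolute joints with illustrative link lengths (alpha = 0).
angles = (0.3, -0.5, 0.9)
T = chain(dh_matrix(q, 0.0, 0.2, 0.0) for q in angles)

# The rotation block depends only on the summed angle, which is why terms such as
# the cosine of a sum of three joint angles appear in the combined matrix.
print(isclose(T[0][0], cos(sum(angles)), abs_tol=1e-9),
      isclose(T[1][0], sin(sum(angles)), abs_tol=1e-9))
```

Forward kinematics is the product of such matrices along the arm; inverse kinematics solves the same relations for the joint angles given a desired end-effector position.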
The following subsections describe the experimental design used to evaluate the performance of each component of the proposed robotic system: speech recognition, robot navigation, object detection, and manipulator grasping.
2.4. Experimental Design
All experiments were conducted under controlled laboratory conditions to ensure repeatability and consistency, with more than 50 trials for every subsystem. The five authors of this paper performed all experiments; no external human participants were involved in the experimental evaluation.
2.5. Metrics
Different evaluation metrics were employed to assess the performance of each component of the proposed system, including Azure speech recognition, mobile robot navigation, YOLO-based object detection, and manipulator grasping.
2.5.1. Azure Voice Recognition
The performance of the Azure voice recognition module was evaluated using a success rate metric that measures its ability to identify the requested object from spoken commands correctly. The authors issued natural-language voice commands, incorporating variations in phrasing and pronunciation to reflect realistic user interaction. The evaluation was conducted using full natural-language sentences (e.g., “bring me a pen” or “I want a screwdriver”), where the system was required to extract and correctly recognize the target keyword corresponding to one of the four experimental objects: pen, screwdriver, card, and marker. A trial was considered successful if the system correctly extracted the intended target object keyword from the spoken command.
2.5.2. Mobile Robot Navigation Algorithm
Navigation experiments assessed the mobile robot’s ability to autonomously reach predefined goal poses using the move_base framework. Each trial required the robot to reach the target location without collision, within a positional tolerance of 0.05 m (xy_goal_tolerance) and an orientation tolerance of 2 degrees (yaw_goal_tolerance).
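The success criterion for a navigation trial can be expressed as a small check on the final pose. This is a sketch of the tolerance test only (collision checking is omitted), and `goal_reached` is a hypothetical helper, not a move_base function.

```python
from math import atan2, cos, hypot, radians, sin

XY_GOAL_TOLERANCE = 0.05            # metres
YAW_GOAL_TOLERANCE = radians(2.0)   # 2 degrees, in radians


def goal_reached(pose, goal):
    """pose, goal: (x, y, yaw) with yaw in radians."""
    dx, dy = goal[0] - pose[0], goal[1] - pose[1]
    # Wrap the yaw error into [-pi, pi] so headings on either side of 0 compare correctly.
    dyaw = atan2(sin(goal[2] - pose[2]), cos(goal[2] - pose[2]))
    return hypot(dx, dy) <= XY_GOAL_TOLERANCE and abs(dyaw) <= YAW_GOAL_TOLERANCE


print(goal_reached((1.00, 2.00, 0.0), (1.03, 2.00, radians(1.0))))
print(goal_reached((0.00, 0.00, 0.0), (0.10, 0.00, 0.0)))
```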
2.5.3. Object Detection Using YOLO
To assess the performance of the proposed YOLO framework, several evaluation metrics were employed, including the confusion matrix, accuracy, precision, recall, and F1-score. The confusion matrix provides a detailed breakdown of prediction outcomes across all classes by reporting true positives ($TP$), false positives ($FP$), false negatives ($FN$), and true negatives ($TN$). Overall accuracy measures the proportion of correctly classified detections relative to the total number of detections, as shown in Equation (3). Precision, Equation (4), quantifies the reliability of the predicted class labels. Recall, Equation (5), measures the model’s ability to correctly detect instances of each class. Finally, the F1-score represents the harmonic mean of precision and recall, as indicated in Equation (6).
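Under the standard definitions of these metrics, accuracy is (TP + TN) / (TP + FP + FN + TN), precision is TP / (TP + FP), recall is TP / (TP + FN), and the F1-score is 2 × precision × recall / (precision + recall). The sketch below computes them from confusion-matrix counts; it assumes these standard forms match the paper's equations, and the counts in the example are made up for illustration rather than taken from the reported results.

```python
def detection_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Illustrative counts only.
print(detection_metrics(tp=8, fp=2, fn=2, tn=8))
```

The guards against empty denominators matter for rare classes, where a model may produce no positive predictions at all.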
2.5.4. Manipulator Grasping
Grasping experiments evaluated the ReactorX-200 manipulator’s ability to successfully grasp detected objects and place them into the designated storage holder. Each grasping trial was initiated only after successful completion of speech recognition, navigation, and object detection, ensuring end-to-end system validation. The manipulator grasping performance was evaluated based on the successful generation of a collision-free joint-space trajectory that reaches the specified end-effector Cartesian goal while respecting the kinematic and joint-limit constraints of the ReactorX-200 manipulator listed in Table 2.
3. Results
This section provides detailed implementation insights about environment map creation, robot navigation and object grasping and retrieval.
3.1. Experiment Environment
The first floor of the Engineering Building at the University of Detroit Mercy (UDM) was utilized to simulate a home environment for testing the proposed system. The ROS SLAM package [
18] based on the Hokuyo LiDAR sensor was used to generate a map of the simulated environment. The simulated environment included typical household spaces, such as an office, a kitchen, and corridors, as illustrated in
Figure 5. The furniture was arranged to reflect the layout of a standard home, providing a realistic testing environment to evaluate the robot’s capabilities in object interaction, navigation, and task execution.
3.2. Robot’s Navigation Stack
Using the created map, the ‘move_base’ navigation stack [
19] facilitates the movement of the mobile robot from one location to another while avoiding obstacles. The performance of the navigation stack relies on two categories of parameters: costmap parameters and planner parameters. The key costmap parameters include the robot’s
footprint (size of the robot in the costmap) and
inflation_radius (the extent to which obstacles are inflated in the costmap, which depends on the robot size). Global planner parameters such as
lethal_cost,
neutral_cost, and
cost_factor influence whether the planner generates paths that pass through the center of obstacle-free regions rather than skirting their edges. In contrast, the local planner parameters
path_distance_bias,
goal_distance_bias, and
occdist_scale govern the robot’s immediate path-following behavior.
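The way the three local planner parameters trade off against each other can be sketched with the weighted-sum trajectory cost described in the ROS dwa_local_planner documentation: path_distance_bias times the distance from the trajectory endpoint to the global path, plus goal_distance_bias times the distance from the endpoint to the local goal, plus occdist_scale times the maximum obstacle cost along the trajectory. The weights used as defaults below and the two candidate trajectories are illustrative and are not the tuned values of this work.

```python
def dwa_cost(path_dist, goal_dist, max_obstacle_cost,
             path_distance_bias=32.0, goal_distance_bias=24.0, occdist_scale=0.01):
    """Weighted-sum cost of one candidate trajectory (lower is better)."""
    return (path_distance_bias * path_dist
            + goal_distance_bias * goal_dist
            + occdist_scale * max_obstacle_cost)


def pick_trajectory(candidates, **weights):
    """candidates: dict name -> (path_dist, goal_dist, max_obstacle_cost)."""
    return min(candidates, key=lambda name: dwa_cost(*candidates[name], **weights))


# Hypothetical candidates: one hugs the global path near an obstacle, one swings wide.
candidates = {
    "hug_path": (0.02, 1.00, 200),
    "wide_arc": (0.30, 0.95, 10),
}
print(pick_trajectory(candidates))                          # strong path adherence
print(pick_trajectory(candidates, path_distance_bias=5.0))  # weaker path adherence
```

With a high path_distance_bias the planner prefers the trajectory that stays on the global path even when it passes close to an obstacle; lowering the weight lets the wider, lower-obstacle-cost arc win, which is the effect a reduced path_distance_bias is meant to achieve.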
Specific adjustments were made to these parameters to suit the Pioneer P3-DX robot equipped with a 270-degree LiDAR. Table 3 lists the specific parameter values used. For instance, reverse driving was disabled because of the LiDAR’s 90-degree blind zone, and the path_distance_bias parameter was reduced to ensure smoother obstacle avoidance. Figure 6 illustrates the navigation process from the kitchen to the main corridor on the predefined map, visualized in RViz. This figure demonstrates the integration of costmaps and planners, highlighting the system’s ability to navigate efficiently within a dynamic environment.
Once the robot reaches the location of the requested object, the grasping mode is activated, and YOLOv8 is used for object detection, classification, and localization. In this work, we collected a dataset to fine-tune YOLOv8 to detect user-requested objects. The upcoming subsection will discuss in detail the dataset and the YOLO performance optimization.
3.3. Object Detection and Localization
In this work, a custom dataset comprising 247 images captured from various perspectives was collected and annotated. Each image contains one or more user-requested objects, with the training set including four object categories: pens, markers, cards, and screwdrivers. This is just a sample of objects that the user can ask the robot to retrieve. Given the manipulator’s 150 g payload constraint, we restricted our experiments to user-requested items whose mass falls within the allowable load capacity. The dataset can be extended to include other objects based on the user’s needs. These items were deliberately selected for their elongated shapes, making them suitable for secure grasping within the width constraints of the ReactorX-200 end effector.
The dataset was divided into 182 images for training, 17 images for validation, and 48 images for testing purposes. We employed various data augmentation techniques to enhance the robustness of the training dataset and improve the model’s generalization capability. First, random rotations were applied to simulate varying viewing angles, enabling the model to recognize objects from different perspectives. Gaussian blurring was introduced to replicate slightly out-of-focus imagery, improving the model’s tolerance to suboptimal image quality. Additionally, random noise was added to account for real-world imperfections and sensor variability. Finally, brightness adjustments were made to reflect diverse lighting conditions. These augmentations not only increased the diversity of the training data but also significantly improved the model’s adaptability to real-world environments.
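Two of the augmentations listed above, brightness adjustment and additive noise, can be sketched on a tiny grayscale image stored as nested lists. This is a dependency-free illustration only; the authors' augmentation tooling is not stated, and in practice an image library would be used. The function names and parameter values are hypothetical.

```python
import random


def adjust_brightness(image, factor):
    """Scale pixel intensities and clamp to the 8-bit range [0, 255]."""
    return [[min(255, max(0, round(p * factor))) for p in row] for row in image]


def add_gaussian_noise(image, sigma, rng):
    """Add zero-mean Gaussian noise to every pixel and clamp to [0, 255]."""
    return [[min(255, max(0, round(p + rng.gauss(0.0, sigma)))) for p in row]
            for row in image]


image = [[100, 200], [0, 255]]
rng = random.Random(0)  # fixed seed so the augmentation is reproducible

brighter = adjust_brightness(image, 1.5)
noisy = add_gaussian_noise(image, sigma=10.0, rng=rng)
print(brighter)
print(noisy)
```

The split sizes reported above are internally consistent: 182 + 17 + 48 = 247 images.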
In this work, we tested two pre-trained YOLO object detection models, YOLOv8n and YOLOv8m, during the training phase. According to the official Ultralytics benchmarks [28], YOLOv8m contains 25.9M parameters, approximately 8× as many as YOLOv8n (3.2M), and requires 78.9B FLOPs, compared to 8.7B FLOPs for YOLOv8n. In terms of inference speed, YOLOv8n runs at approximately 80 ms per frame on CPU (ONNX), whereas YOLOv8m requires around 235 ms, making it nearly 3× slower. Training time scales similarly with model size, and YOLOv8m requires approximately 7–9× longer training time than YOLOv8n under identical settings. Given the limited computational resources of our host device (Intel NUC) and the additional processing demands of point cloud data, YOLOv8n imposed significantly less load on the CPU while maintaining competitive accuracy in object detection tasks. As a result, YOLOv8n was selected to satisfy real-time constraints on an edge computing platform in this work.
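The ratios quoted in this comparison can be checked with a few lines of arithmetic. The figures are those stated in the text; the frames-per-second values are simply the reciprocal of the quoted CPU latencies and ignore any other processing on the host.

```python
# Figures quoted in the text for YOLOv8n and YOLOv8m.
specs = {
    "yolov8n": {"params_m": 3.2, "flops_b": 8.7, "cpu_ms": 80.0},
    "yolov8m": {"params_m": 25.9, "flops_b": 78.9, "cpu_ms": 235.0},
}

# Ratio of YOLOv8m to YOLOv8n for each quantity.
ratios = {key: specs["yolov8m"][key] / specs["yolov8n"][key] for key in specs["yolov8n"]}
# Upper bound on detection frame rate implied by the CPU latency alone.
max_fps = {name: 1000.0 / s["cpu_ms"] for name, s in specs.items()}

for key, value in ratios.items():
    print(f"{key}: {value:.1f}x")
for name, fps in max_fps.items():
    print(f"{name}: about {fps:.1f} frames per second on CPU")
```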
Figure 7 illustrates the training and validation losses while training the YOLOv8n model for 170 epochs using our collected custom dataset. The training results demonstrate that the model became more capable of accurate target localization and classification. Losses in both the training and validation sets decreased steadily as the number of iterations increased, except for the val/box_loss curve, which fluctuated because of the limited size and variability of the validation dataset and the sensitivity of bounding-box regression to object localization uncertainty. Despite these short-term oscillations, its overall trend is gradually downward, indicating convergence rather than instability. The training process was performed on an NVIDIA GeForce RTX 3080 Ti laptop GPU with 16 GB of memory.
Figure 8 illustrates the model’s accuracy across different confidence levels for various object categories. The model achieves an optimal overall F1-score of
at a confidence threshold of
, highlighting a strong performance across all classes.
Figure 9 illustrates the confusion matrix of the trained YOLOv8n model for object detection on the validation and testing datasets. In this matrix, the rows represent predicted categories, and the columns represent true categories. The model perfectly predicted the categories of “card,” “pen,” and “marker,” with 16 correct predictions for each and no misclassifications. For “screwdriver,” the model also performed almost perfectly, correctly identifying it 16 times with only one instance misclassified as “background.”
3.4. Object Grasping and Retrieval
In the evaluated grasping scenarios, several assumptions were made. Instead of a standard table, we used a box with a height lower than the mobile robot (24 cm), as illustrated in
Figure 10. Additionally, the test objects were placed in an upright orientation to ensure full visibility for detection by the Kinect camera.
Figure 10 shows a sample of real-time testing of YOLOv8n after training. This figure indicates the good performance of the trained model in detecting and localizing objects in real time, even in a cluttered background.
Figure 11 shows a pipeline for object localization using the trained YOLOv8n model, followed by grasping with the ReactorX-200 robotic arm, using a pen as an example. In Figure 11a, the pen is detected in real time by the Kinect camera and highlighted with a bounding box displaying the item ID, name, and confidence score. The point cloud data from the RGB-D camera and the detected target positions are visualized within the coordinate frame of the robotic arm’s base in Figure 11b. The alignment between the detected target positions and the corresponding point cloud of the object confirms the successful recognition and localization of the target. Subsequently, the detected coordinates of the pen are transformed from the camera frame to the robotic arm’s coordinate system to enable precise manipulation. To account for the physical constraints introduced by the Pioneer P3-DX robot’s chassis and the item holder, the chassis is modeled as a static virtual obstacle within the planning environment (Figure 11c). Finally, Figure 11d illustrates the successful grasping of the pen by the ReactorX-200 arm using the transformed coordinates.
To evaluate the system’s reliability, we conducted multiple pick-and-place trials across varied scenarios. The arm consistently handled all four test objects, with only occasional failures. In the failure cases, the arm initiated the grasp but deviated slightly to the right or left, which resulted in unsuccessful pickups. Further analysis showed that these failures primarily stemmed from sensor noise and cumulative positioning drift caused by minor wheel slip during extended multi-object retrieval and delivery sequences, which increased the probability of misalignment and pickup errors over time.
4. Discussion
This section discusses the key findings of the study in relation to the research questions defined in
Section 1 and highlights the practical implications and limitations of the proposed autonomous robotic system.
4.1. Discussion of Key Findings
RQ1: Voice-based human–robot interaction. The experimental results demonstrate that voice-based interaction provides an intuitive and accessible interface for initiating object retrieval tasks in indoor environments. The integration of Azure speech recognition enabled reliable extraction of task-relevant keywords, including object names and locations, without requiring the user to interact with traditional graphical interfaces. The observed success rate ranged from 85% to 95%, with occasional failures caused by pronunciation ambiguity or transient network latency. This interaction paradigm is particularly suitable for older adults and individuals with functional limitations, as it reduces physical effort while maintaining task flexibility.
RQ2: Mobile robot navigation algorithm. The results confirm that SLAM-based environment mapping combined with the move_base navigation stack enables robust autonomous mobility in cluttered and dynamic household-like environments. By tuning costmap and planner parameters to the Pioneer P3-DX platform, the robot successfully navigated between predefined semantic locations while avoiding obstacles. The success rate ranged from 90% to 95% under mostly static indoor conditions, with failures typically occurring due to local planner oscillations or temporary localization drift. The navigation framework proved effective in maintaining positional accuracy and smooth motion, supporting reliable transitions between navigation and manipulation phases.
RQ3: Object detection and localization. The deep learning-based perception module achieved high accuracy in detecting and classifying user-requested objects across varying viewpoints and lighting conditions. The use of YOLOv8n provided a favorable balance between computational efficiency and detection performance, making it suitable for deployment on resource-constrained platforms. Across validation and test sets, detection performance was consistently high, with overall accuracy and F1-scores in the 85–95% range. Occasional performance degradation was observed under challenging lighting conditions or partial occlusion, which reflects realistic operational constraints. The integration of object detection with RGB-D point cloud data enabled object localization, an essential step for manipulation tasks. However, we observed slight lateral shifts of the manipulator relative to the target object during execution in some trials, which we will discuss in the limitations subsection.
RQ4: Coordinated mobile manipulation and object delivery. The coordinated integration of the Pioneer P3-DX mobile base and the ReactorX-200 robotic arm enabled the system to execute complete object retrieval and delivery tasks autonomously. Modeling the mobile robot chassis as a virtual obstacle ensured collision-free manipulation, while inverse kinematics and motion planning allowed the arm to grasp and transport objects securely. The results demonstrate that a tightly integrated mobile manipulator can achieve reliable end-to-end task execution in a home-like environment.
Overall, the findings validate the feasibility of combining voice interaction, autonomous navigation, perception, and manipulation into a unified robotic framework for assistive applications. Rather than introducing new algorithms, the contribution of this work lies in the system-level integration and experimental validation of an end-to-end assistive robotic workflow.
4.2. Limitations and Future Work
Despite the promising results, several limitations of the proposed system should be acknowledged. First, the experiments were conducted in a controlled indoor environment with predefined semantic locations, which may not fully capture the variability of real residential settings. Second, the set of detectable and manipulable objects was limited to lightweight items due to the ReactorX-200 manipulator’s payload constraints.
In some grasping trials, object detection performance degraded under challenging lighting conditions or partial occlusion. In addition, the manipulator exhibited slight lateral shifts relative to the target object during execution. Further analysis suggests that these deviations are primarily caused by sensor noise and residual calibration errors between the RGB-D camera and the robotic arm. This highlights the need for more precise camera-to-arm calibration to improve grasping precision.
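To give a sense of scale for how a residual calibration error can translate into a lateral shift, the sketch below uses a simple geometric approximation: a point straight ahead of the camera at distance d, rotated by a small yaw error θ in the extrinsic transform, is displaced sideways by roughly d·sin(θ). The distance and angle used here are illustrative values, not measurements from the experiments, and the function name is hypothetical.

```python
import math


def lateral_shift(distance_m, yaw_error_deg):
    """Sideways displacement (m) of a point straight ahead at distance_m
    when the camera-to-arm rotation is off by yaw_error_deg about the vertical axis."""
    return distance_m * math.sin(math.radians(yaw_error_deg))


# Illustrative values only: target 0.5 m from the camera, 1 degree of yaw error.
shift_mm = lateral_shift(0.5, 1.0) * 1000.0
print(f"{shift_mm:.1f} mm")  # prints "8.7 mm"
```

Under these assumed values, a one-degree rotational error alone accounts for a shift of nearly 9 mm, which is plausibly significant for a small gripper grasping lightweight objects.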
5. Conclusions
This paper presents a complete end-to-end robotic system designed to assist with object retrieval and delivery, aimed at enhancing the independence of older adults and users living with disability. The proposed system integrates ROS-based navigation, Azure-powered speech recognition, and YOLOv8n-based object detection to support intuitive human–robot interaction in domestic environments. A custom dataset of 247 annotated RGB images was collected, augmented, and used to train a lightweight YOLOv8n model, achieving 85–95% accuracy and F1-score across validation and test sets while maintaining real-time inference capability on resource-constrained hardware. The speech recognition module demonstrated a success rate of 85–95% over approximately 50 trials, reliably extracting target object keywords despite variations in phrasing and pronunciation. Autonomous navigation experiments achieved a 90–95% success rate, meeting strict positional (0.05 m) and orientation (2°) tolerances without collision in indoor environments. Object localization and grasp execution were enabled through extrinsic calibration between the Azure Kinect camera and the robotic arm, supporting consistent end-to-end object retrieval. Overall, the quantitative results confirm the system’s effectiveness in autonomous navigation, obstacle avoidance, and object retrieval within realistic domestic scenarios.
Future work will focus on refining camera calibration procedures and incorporating online calibration or visual servoing techniques to reduce alignment errors during manipulation. Addressing these limitations will be critical for transitioning the proposed system from a laboratory setting to real-world assistive deployment scenarios. In addition, we will actively incorporate end-user participation to guide system refinement and ensure that the research agenda remains aligned with real-world needs. We plan to engage users, caregivers, and domain experts as informants to identify high-priority objects and tasks on which the robot should be trained, and to provide structured usability and UX feedback throughout iterative development. Furthermore, we plan to integrate door-opening capabilities into the manipulator, extending the system's applicability in real-world environments.