1. Introduction
The robotic arm system has been widely investigated and developed for various research purposes. In [1,2], the position and orientation of any point on a robotic arm can be derived using kinematic equations. Furthermore, the desired speed or force at any location on the arm can be obtained through the Jacobian matrix. Based on these foundational theories, several practical applications have emerged. For example, in [3], multiple mobile manipulator robots can cooperatively grasp and transport an object along a desired path, while [4] demonstrates a mobile manipulator robot that follows a planned path while pushing an object.
Recent reviews highlight that mobile manipulators, with various levels of autonomy, are advancing rapidly, especially for deployment in hazardous, industrial, and logistics applications [5,6]. Mobile manipulators today are increasingly deployed as single- or dual-entity systems, enabling both fully autonomous and semi-autonomous operation under challenging conditions [5]. Market analyses confirm that the delivery robot sector is expanding rapidly, with projections for exponential growth in both Asia and the US, and increasing integration into last-mile logistics and urban environments [6,7].
With the integration of cameras, robotic arms have greatly contributed to industrial automation, reducing the need for humans to perform dangerous or monotonous tasks. For instance, the system in [8] can both wipe a whiteboard and perform peg-hole insertion. In [9], fuzzy logic is employed to design a shape-sorting controller for a vision-enabled robotic manipulator. In [10], trajectory tracking for robotic manipulators is achieved using a nonlinear backstepping controller with velocity observers. Despite these advancements, the delivery of goods still requires considerable human effort. If a mobile robot equipped with a robotic arm could automatically fetch and deliver items to specified locations, it would reduce workplace injuries. Such systems are known as intelligent mobile robots. The intelligent mobile robot described in [11] can autonomously open doors. In [12], a dual-arm mobile manipulator can grasp a book from a table and return it to a bookshelf. The system in [13] features autonomous grasping, allowing the robot to approach and capture a target even if the object moves. In [14], the robot can pick items from warehouse shelves, and the mechanism in [15] can grasp objects at varying heights.
A 2022 survey by Wu et al. [16] provides a comprehensive overview of learning-based control for robotic visual servoing, including recent advances in deep learning and reinforcement learning techniques. Likewise, Amaya-Mejía et al. (2024) surveyed modern approaches to visual servoing in autonomous manipulation in challenging contexts such as on-orbit servicing [17], confirming the trend toward multi-modal visual control for robust, adaptive robotic manipulation.
The primary design objective of the intelligent mobile robot in this study is to grasp a target object with arbitrary orientation, even if the vehicle does not stop at a precise position. Afterward, the system transports the target object to another location and places it appropriately.
The proposed mobile robot system utilizes a single camera, one robotic arm, and a vehicle platform. Unlike previous configurations [15], the camera in this system is mounted on the end-effector of the manipulator. The camera detects the target object and determines the positions of both the end-effector and the mobile robot. Detection is achieved using HSV color space methods. To achieve accurate positioning, this study employs a visual servoing approach [18,19,20], which can calculate the necessary velocity for the camera to reach the desired position.
The visual servoing method implemented here is called “Homography-Based 2D Visual Servoing” (HBVS) [21]. Leveraging properties of the homography matrix [22,23], the HBVS approach computes the relative translation and rotation between two coordinate frames. Given a set of desired feature points in the image, if the camera can observe the current feature points, the HBVS algorithm determines how to move the camera so that the actual and desired points coincide. The camera can be mounted either on the manipulator or the mobile platform. For example, refs. [24,25] place the camera on a mobile robot to achieve target localization with HBVS. The system in [13] employs a camera on the end-effector, as in the current study. Additionally, autonomous underwater vehicles have utilized HBVS for station-keeping [26] and localization [27].
Recent developments in homography-based visual servoing confirm the strong interest in its application to underactuated as well as fully actuated robotic systems, including UAVs and mobile manipulators [17,28,29]. For example, Huang et al. (2023) introduced a robust HBVS method for quadrotor UAVs, while geometry-based extensions and deep vision-based adaptations are emerging [28,29]. Challenges remain in integrating efficient visual feedback and improving the robustness and efficiency of HBVS for tasks such as autonomous picking in logistics, surgery, and space environments [5,30].
While a variety of mobile manipulation systems have been explored in previous studies, many existing approaches rely on multiple cameras, extensive sensor arrays, or address navigation and manipulation as separate challenges. Integrated solutions utilizing a single camera for both mobile base control and precise object manipulation, particularly with validation on real robots in practical delivery scenarios, appear to be relatively limited. Specifically, there seems to be a gap in the literature regarding systems that achieve autonomous navigation and dexterous pick-and-place tasks with minimal hardware and robust experimental evaluation.
To the best of our knowledge, this work represents an initial attempt to demonstrate a complete object delivery cycle using only a single end-effector-mounted camera for both vehicle guidance and arm control, accompanied by comprehensive experimental validation.
Unlike prior approaches that mount the camera on the mobile base or in the surrounding environment, our system positions the camera directly on the end-effector of the robotic arm. This configuration enables more precise and adaptive visual feedback during grasping, significantly enhancing the robot’s ability to localize and grasp objects from various positions and orientations. The camera-on-end-effector design ensures that relevant feature points remain in view throughout manipulation, resulting in higher grasp success rates, particularly when vehicle stopping accuracy is limited.
The main contributions of this work are summarized as follows:
We present a unified mobile robot system that integrates a single camera for both vehicle and manipulator visual servoing.
We develop and empirically validate a homography-based control strategy for object delivery, encompassing both navigation and pick-and-place, with minimal hardware requirements.
We offer detailed experimental results in real-world settings to illustrate the system’s effectiveness and reliability.
The remainder of this paper is organized as follows. The system construction is described in Section 2. Section 3 provides details on the robotic arm used in this study. The proposed control designs for both the arm and mobile base are presented in Section 4. Section 5 discusses experimental results, and Section 6 concludes the paper and suggests directions for future research.
2. System Construction
The system used in this study includes three main components: the robotic arm system, the vehicle system, and the camera system, as illustrated in Figure 1. This section describes the construction of these components and provides a detailed explanation of the control objectives.
In this work, the vehicle system is referred to as the Eddie robot. The movement of the Eddie robot is controlled by applying different voltages at specific time intervals to each motor, allowing the robot to move forward, backward, turn left, and turn right. Commands are sent from the computer to the control panel, enabling the Eddie robot to perform different movements based on the situation.
The robotic arm utilized in this system is a 6-degree-of-freedom (6-DOF) manipulator. The reference configuration is shown in Figure 2. For each frame $i$, link $i$ connects frame $i$ at Motor $i$ to frame $i+1$ at Motor $i+1$. This reference configuration is used to develop the kinematic equations [13,15], which are essential for controlling the robotic arm. The motors employed in the robotic arm are AX-12+ servomotors. In this study, goal positions are sent to each motor to precisely control the movement of the robotic arm. Communication between the computer and the motors is handled via USB2DYNAMIXEL, which allows different programming languages to be used to send commands to the motors.
The camera used in the system is a Logitech C170, featuring a diagonal field of view of 58° and a focal length of 2.3 mm. The input image size is 160 columns by 120 rows of pixels. As shown in Figure 2, the camera is mounted on the robotic arm’s end-effector and continuously captures images during operation. These images are used both for vehicle localization, as the system approaches the target platform, and for end-effector localization during the object grasping phase. Image information is processed using the Homography-Based Visual Servoing (HBVS) algorithm [21] to determine the required velocity of the camera, thereby guiding the end-effector closer to the desired position through matrix transformations.
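As an illustration of the HSV-based detection mentioned earlier, the sketch below thresholds a camera frame in HSV space with OpenCV and returns the centroid of the largest color blob; the HSV bounds and the helper name detect_target_centroid are placeholders, not the exact implementation used in this system.

```python
# Minimal sketch of HSV-based target detection on the 160x120 camera image.
# The HSV bounds are placeholders and would need tuning for the actual object color.
import cv2
import numpy as np

def detect_target_centroid(frame_bgr, lower_hsv=(0, 120, 70), upper_hsv=(10, 255, 255)):
    """Return the pixel centroid of the largest color blob, or None if nothing is found."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    mask = cv2.inRange(hsv, np.array(lower_hsv), np.array(upper_hsv))
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None
    largest = max(contours, key=cv2.contourArea)
    m = cv2.moments(largest)
    if m["m00"] == 0:
        return None
    return (m["m10"] / m["m00"], m["m01"] / m["m00"])  # (u, v) centroid in pixels
```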
3. Robotic Arm Analysis
The reference configuration of the robotic arm is illustrated in Figure 2 and Figure 3. The transformation from frame $i$ to frame $j$ can be expressed as follows:

$$T_i^j = \begin{bmatrix} R_i^j & p_i^j \\ \mathbf{0}^{T} & 1 \end{bmatrix}, \tag{1}$$

$$T_1^j = T_1^2\, T_2^3 \cdots T_{j-1}^{j}, \tag{2}$$

where $R_i^j$ is a $3 \times 3$ rotation matrix, $p_i^j$ is a $3 \times 1$ translation vector, and $T_i^{i+1}$ denotes the transformation for each link.
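For readers who prefer code, the following NumPy sketch builds a per-link homogeneous transformation and composes a chain as in (1) and (2); for brevity it assumes every joint rotates about its local z-axis with a fixed link offset, which is a simplification of the actual 6-DOF geometry.

```python
# Illustrative construction and composition of homogeneous transformations, as in (1) and (2).
# Single z-axis joint rotations and the 5 cm link offsets are assumptions for illustration.
import numpy as np

def link_transform(theta: float, link_offset: np.ndarray) -> np.ndarray:
    """Build T_i^{i+1} from a joint angle about z and a 3x1 translation, as in (1)."""
    c, s = np.cos(theta), np.sin(theta)
    T = np.eye(4)
    T[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    T[:3, 3] = link_offset
    return T

# Composing the chain as in (2): T_1^j = T_1^2 T_2^3 ... T_{j-1}^j
joint_angles = [0.1, -0.3, 0.5]                 # example joint values (radians)
offsets = [np.array([0.0, 0.0, 0.05])] * 3      # assumed link offsets
T = np.eye(4)
for th, off in zip(joint_angles, offsets):
    T = T @ link_transform(th, off)
print(T[:3, 3])  # position of the last frame expressed in frame 1
```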
The manipulator Jacobian describes the relationship between the velocities of the joints and the velocity of the end-effector with respect to the base frame. This relationship enables the use of the Jacobian matrix to control the robotic arm by specifying joint velocities, thereby allowing the end-effector to follow the desired trajectory. The Jacobian is defined as follows:

$$v = J(q)\,\dot{q}, \tag{3}$$

where $v$ is the velocity of the end-effector with respect to the first frame (a $6 \times 1$ vector), $J(q)$ is the $6 \times 6$ Jacobian matrix, and $\dot{q}$ is the joint velocity vector (also $6 \times 1$).

The terms $v$ and $J(q)$ can be written as follows:

$$v = \begin{bmatrix} \dot{p}_1^{E} \\ \omega_1^{E} \end{bmatrix}, \tag{4}$$

$$J(q) = \begin{bmatrix} J_{v} \\ J_{\omega} \end{bmatrix}, \tag{5}$$

where $J_{v}$ and $J_{\omega}$ are $3 \times 6$ matrices.

The rotational part of the Jacobian, $J_{\omega}$, is given by the following:

$$J_{\omega} = \begin{bmatrix} z_1^{1} & z_1^{2} & \cdots & z_1^{6} \end{bmatrix}, \tag{6}$$

and $J_{v}$ can be written as follows:

$$J_{v} = \begin{bmatrix} z_1^{1} \times \left(p_1^{E} - p_1^{1}\right) & \cdots & z_1^{6} \times \left(p_1^{E} - p_1^{6}\right) \end{bmatrix}, \tag{7}$$

where $p_1^{b}$ is the position of the origin of frame $b$ expressed in frame 1, $z_a^{b}$ denotes the $z$-axis vector of frame $b$ expressed in frame $a$, and $E$ denotes the end-effector frame.
The manipulator Jacobian defined by (6) and (7) can be calculated by determining the parameters $z_1^{i}$, $p_1^{i}$, and $p_1^{E}$. These parameters can be extracted from the transformations $T_1^{i}$ as shown in (2), while the individual link rotations and translations are given by (1).
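The column-wise construction of the Jacobian in (6) and (7) can be sketched as follows; the simplified all-z-axis joint model and the link offsets are illustrative assumptions, not the arm’s true kinematic parameters.

```python
# Sketch of assembling the geometric Jacobian: for joint i, the angular part is z_1^i (Eq. (6))
# and the linear part is z_1^i x (p_1^E - p_1^i) (Eq. (7)). Geometry below is assumed.
import numpy as np

def link_transform(theta, offset):
    c, s = np.cos(theta), np.sin(theta)
    T = np.eye(4)
    T[:3, :3] = [[c, -s, 0], [s, c, 0], [0, 0, 1]]
    T[:3, 3] = offset
    return T

def geometric_jacobian(joint_angles, offsets):
    """Return the 6xN Jacobian stacking the linear part J_v on top of the angular part J_w."""
    n = len(joint_angles)
    transforms = [np.eye(4)]                          # T_1^1 = identity (base frame)
    for th, off in zip(joint_angles, offsets):
        transforms.append(transforms[-1] @ link_transform(th, off))
    p_end = transforms[-1][:3, 3]                     # p_1^E, end-effector position
    J = np.zeros((6, n))
    for i in range(n):
        z_i = transforms[i][:3, 2]                    # z_1^i: third column of R_1^i
        p_i = transforms[i][:3, 3]                    # p_1^i
        J[:3, i] = np.cross(z_i, p_end - p_i)         # linear part, Eq. (7)
        J[3:, i] = z_i                                # angular part, Eq. (6)
    return J

J = geometric_jacobian([0.1, -0.2, 0.3, 0.0, 0.4, -0.1],
                       [np.array([0, 0, 0.05])] * 6)  # assumed 6-DOF geometry
print(J.shape)  # (6, 6)
```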
4. Control System Design
During the vehicle localization process, the robotic arm is held in a fixed configuration so that the camera is parallel to the direction of the vehicle. Homography-based visual servoing is then applied: if the calculated translation velocity in the x-direction is greater than zero, the vehicle turns right to bring the x-direction velocity below zero; conversely, if the calculated x-direction velocity is less than zero, the vehicle turns left to increase it. These turning maneuvers are combined with forward movement, allowing the vehicle to approach its goal position. The vehicle stops in front of the grasping platform when the calculated translation velocity in the z-direction falls below a fixed threshold.
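A minimal sketch of this steering decision is given below; the stop threshold and the command names are assumptions used only to illustrate the logic.

```python
# Sketch of the vehicle-steering decision during localization (threshold and command
# names are assumed, not the values used in the actual system).
Z_STOP_THRESHOLD = 0.05  # stop when the computed z-direction velocity falls below this

def vehicle_step(v_x: float, v_z: float) -> str:
    """Map the HBVS camera-velocity output to a vehicle command."""
    if v_z < Z_STOP_THRESHOLD:
        return "stop"           # close enough to the grasping platform
    if v_x > 0:
        return "forward_right"  # steer right while moving forward
    return "forward_left"       # steer left while moving forward
```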
Once the vehicle has stopped at the grasping position, it begins searching for the target object. To locate the object placed on the grasping platform, the robotic arm gradually changes its configuration until the camera detects the target. If the object is detected on either side of the system, the arm will rotate approximately 10° in that direction and search again. The search process ends when either the target is found directly in front of it, or the camera fails to see the target after all arm movements are exhausted.
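The search routine can be summarized as in the following sketch, where detect_target and rotate_arm are hypothetical helpers standing in for the HSV detector and the arm controller.

```python
# Sketch of the target-search routine: the arm sweeps in ~10 degree steps until the
# detector reports the object directly in front, or the sweep is exhausted.
SEARCH_STEP_DEG = 10

def search_for_target(detect_target, rotate_arm, max_steps: int = 6) -> bool:
    for _ in range(max_steps):
        side = detect_target()          # returns "left", "right", "center", or None
        if side == "center":
            return True                 # target found directly in front
        if side in ("left", "right"):
            rotate_arm(SEARCH_STEP_DEG if side == "right" else -SEARCH_STEP_DEG)
        else:
            rotate_arm(SEARCH_STEP_DEG)  # keep sweeping if nothing is visible
    return False                         # search exhausted without finding the target
```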
Visual servoing is a technique that uses image data to control robots. The approach used in this study is homography-based 2D visual servoing (HBVS), which relies solely on image information for robot control. The control law for HBVS is as follows:

$$\begin{bmatrix} v \\ \omega \end{bmatrix} = -\operatorname{diag}(\lambda_1, \ldots, \lambda_6) \begin{bmatrix} e_{v} \\ e_{\omega} \end{bmatrix}, \tag{8}$$

where $v$ and $\omega$ are the commanded translational and rotational camera velocities, $e_{v}$ and $e_{\omega}$ are the task-function errors defined below, and the six gains $\lambda_1, \ldots, \lambda_6$ are adjusted adaptively according to the error in each direction, as defined in (9)–(14). The six parameters are determined by trial and error. For example, if the camera moves too quickly and the object leaves the image frame, the corresponding gain is decreased; if the camera moves too slowly, it is increased. These values are selected to keep the visual target within the image plane during manipulator movement.
While the selection of the adaptive parameters in the above equations is based on practical experimentation and iterative tuning, this heuristic approach is common in applied visual servoing and robotics literature, especially when system dynamics, sensor noise, and mechanical uncertainties can hardly be modeled exactly. To date, a rigorous theoretical justification (e.g., through Lyapunov or input-to-state stability analysis) remains an open challenge for image-based and homography-based visual servoing subject to hardware constraints and quantization. Our empirical parameter choices ensure responsiveness while avoiding overshoot or loss of the object within the image frame.
In practice, larger gain values improve convergence speed but can reduce robustness, potentially causing the target to leave the field of view under disturbances or delays. Conversely, smaller gain values enhance stability and robustness against noise, but at the expense of slower response and potentially longer task times. Our tuning was guided by repeated experiments seeking a balance: the gains are chosen large enough for efficient task completion, but not so large as to trigger instability or target loss during real robot runs. We acknowledge that formal analysis and parameter optimization are potential future research directions.
The HBVS task function is defined as follows:

$$e_{v} = (\mathbf{H} - \mathbf{I})\, m^{*}, \tag{15}$$

$$[e_{\omega}]_{\times} = \mathbf{H} - \mathbf{H}^{T}, \tag{16}$$

where $m^{*}$ is defined as the centroid of the feature points in the reference image and $[\,\cdot\,]_{\times}$ denotes the skew-symmetric matrix associated with a vector. The vectors $e_{v}$ and $e_{\omega}$ represent, respectively, the translation and rotation errors of the camera.
Importantly, HBVS requires that the determinant of the homography matrix $\mathbf{H}$ equals 1. Therefore, the system first normalizes the homography matrix by dividing it by the cube root of its determinant before computing the error functions (15) and (16). After calculating the errors, the system checks whether they satisfy the convergence conditions, namely that the translation errors and the $z$-axis rotation error each fall below fixed thresholds. These criteria are designed so that the gripper can descend successfully and grasp the object. If they are met, the localization process stops; otherwise, the system continues adjusting the gripper. Additionally, if the end-effector's x-position is less than 18 cm from the origin of the base frame (frame 1) in Figure 2 and the computed velocity in x is negative, the system determines that the arm is too close and commands the vehicle to move backward slightly.
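The normalization and error computation described above can be summarized in the following sketch; the convergence thresholds and the reference centroid are placeholders, since the exact values used in the experiments are not reproduced here.

```python
# Sketch of the HBVS error computation: normalize H so that det(H) = 1, evaluate the
# task-function errors of (15) and (16), and test simple convergence thresholds.
import numpy as np

def hbvs_errors(H: np.ndarray, m_star: np.ndarray):
    """Return (e_v, e_w) for a 3x3 homography H and reference centroid m_star (homogeneous)."""
    H = H / np.cbrt(np.linalg.det(H))        # enforce det(H) = 1
    e_v = (H - np.eye(3)) @ m_star           # translation error, Eq. (15)
    S = H - H.T                              # skew-symmetric part encodes the rotation error, Eq. (16)
    e_w = np.array([S[2, 1], S[0, 2], S[1, 0]])
    return e_v, e_w

def converged(e_v, e_w, tol_t=0.01, tol_rz=0.01):
    """Check the translation errors and the z-axis rotation error against fixed thresholds."""
    return bool(np.all(np.abs(e_v) < tol_t) and abs(e_w[2]) < tol_rz)
```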
In the grasping process, the system defines the parameter $N$ as the number of iterations used for updating the Jacobian matrix in (3). A larger value of $N$ results in a more precise grasping process, while a smaller value allows for faster execution. The system also defines the parameter $\Delta x$, which represents the desired translation and rotation increments of the end-effector divided by $N$, where the desired overall displacement is selected heuristically. In particular, $\Delta x$ is set so that the end-effector descends step by step toward the target object.
Next, the system inverts the Jacobian matrix in (3) to calculate the joint increments $\Delta q = J^{-1} \Delta x$. The joint rotation angles used in the Jacobian matrix are then updated by adding the computed increments to the previous values. These calculations are repeated $N = 100$ times. After 100 iterations, the final joint angles are computed by summing all increments and are then used to control the robotic arm, guiding the gripper downward to grasp the object.
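A compact sketch of this iterative update is shown below; the jacobian(q) helper is assumed to return the current manipulator Jacobian from (3)–(7), and no joint-limit handling is included for brevity.

```python
# Sketch of the iterative grasp-approach update: the desired end-effector displacement is split
# into N steps, each converted to joint increments through the inverse of the Jacobian in (3).
import numpy as np

def grasp_descent(q0, total_displacement, jacobian, N=100):
    """Accumulate joint increments dq = J(q)^-1 * dx over N iterations and return the final angles."""
    q = np.array(q0, dtype=float)
    dx = np.asarray(total_displacement, dtype=float) / N   # per-iteration increment (Delta x)
    for _ in range(N):
        J = jacobian(q)
        dq = np.linalg.solve(J, dx)   # invert the Jacobian relation in (3)
        q = q + dq                    # updated angles feed the next Jacobian evaluation
    return q
```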
If, during these updates, any joint angle exceeds its allowable range, the system checks whether the x-position relative to the origin of the base frame (frame 1) in Figure 2 is greater than 25 cm. If it is, the vehicle moves forward; otherwise, it moves backward. The system then restarts the search, localization, and grasping procedures, allowing it to try different configurations for the robotic arm.
After the object is grasped, the arm is returned to its initial vehicle-control configuration. The system then performs object recognition and rotates left until the target sign appears in the right half of the image, at which point it stops turning and moves toward the sign. The full control procedure is shown in Figure 4.
5. Practical Experiments
All experiments were conducted in a corridor of a laboratory building. The floor surface was flat and smooth without any obstacles. Experiments took place during the daytime under clear weather conditions, using only ambient sunlight for illumination. Background clutter was minimal, and there were no moving objects in the scene during the tests.
To showcase the system’s robustness in the presence of disturbances and varying environmental conditions, we conducted three experiments designed to evaluate its performance under such challenges:
Case 1: Disturbance at placing platform; grasping platform ahead; object without orientation; placing platform ahead.
Case 2: Disturbance at grasping platform; grasping platform left; object with orientation; placing platform ahead.
Case 3: Disturbance at grasping platform; grasping platform ahead; object without orientation; placing platform elsewhere.
Figure 5, Figure 6 and Figure 7 illustrate that the system was able to accurately identify the correct sign at the placing platform and successfully complete the delivery task in each of the three scenarios. The error function curves for the visual servoing system, shown in Figure 8, Figure 9 and Figure 10, indicate that both the translational errors and the z-axis rotation error converge closely to zero in all cases. In addition, the initial and final images of HBVS, displayed in Figure 11, Figure 12 and Figure 13, demonstrate that the actual and target feature points nearly overlap for each case.
Table 1 presents the success rates for each stage of the experiment, from gripper localization to object grasping. Both vehicle control and object placement performance were influenced by environmental conditions. Out of 20 trials, the system achieved an overall success rate of 80%, with performance largely determined by the effectiveness of the grasping phase. Due to constraints in personnel and available time, the experiment was limited to 20 trials, and baseline methods were not directly compared. As a result, these success rates should be considered preliminary; more extensive statistical analysis and baseline comparisons will be provided in future work.
The accuracy of the grasping process is influenced by the performance of the HBVS algorithm, the quality of the motors, and insufficient rigidity in the mechanical design of the robotic arm. Greater motor accuracy and higher image resolution would improve the precision of the HBVS process. The performance of the HBVS algorithm is constrained by the maximum and minimum speeds of each robotic arm motor. If the parameters in (9)–(14) are not properly defined, the robotic arm may fail to reach the desired speed. As a result, the target object may move outside the camera’s field of view, or the HBVS may not converge.
Furthermore, reducing weight and torsion in the limbs and joints, or redesigning the robotic arm using stronger materials, may reduce excessive motion and shaking, thereby significantly improving the success rate.
Notably, the computation time for each HBVS and Jacobian update cycle was not individually recorded during our experiments. Nevertheless, the implemented system responded in real-time with no perceptible delay between visual feedback and robot action, demonstrating acceptable practical performance for the presented use case. We recognize that detailed computational profiling could strengthen quantitative evaluation, and we plan to include such measurements in future work.
6. Conclusions and Future Works
This study demonstrates the application of an intelligent mobile robot equipped with a single camera located on the gripper, capable of autonomously transporting objects from one location to another. Experimental results show that the implemented homography-based visual servoing (HBVS) strategy successfully localizes both the vehicle and the gripper. Moreover, the Jacobian matrix approach proved effective for visual servoing as well as grasping procedures. This method allows the robot to accurately position its gripper above a target object and to execute pick-and-place operations.
The main contributions of this research include the application of HBVS for localizing both the gripper and the vehicle, as well as the design of a comprehensive control procedure tailored for pick-and-place tasks performed by the intelligent mobile robot. It should be noted, however, that HBVS is not ideally suited for moving the camera across larger distances. When the camera is far from the target, a higher gain value is needed to avoid excessively slow motion. Conversely, when the camera approaches the target, keeping the gain unchanged can result in excessive speed, which may cause the target to fall outside the camera’s field of view. Therefore, the control parameters are adaptively adjusted based on the error in each direction to ensure that the camera’s motion remains sufficiently rapid without losing the target.
This work also introduces a general control framework for object delivery systems. While HBVS was employed in this study, it may be substituted with alternative visual servoing algorithms, such as image-based visual servoing. Accuracy and robustness of grasping could be further improved through advanced path planning techniques, which help minimize unwanted gripper motion and vibration. Vehicle navigation can also benefit from more sophisticated path planning, obstacle avoidance, or enhanced image processing, enabling operation in more complex environments.
Notably, since significant motion and vibration were observed in the robotic arm during experiments, the present system does not yet achieve precise object placement—a limitation we aim to address in future work. Additionally, comparing our approach with other visual servoing methods will be an important avenue for future investigation. Finally, we recognize that our evaluation was limited to a single laboratory setting and a moderate variety of objects and scenes. To fully validate generalization ability, future studies will expand to a broader range of object types, diverse environments, and detailed analyses of success rates across different scenarios.