Article

Multimodal Control of Manipulators: Coupling Kinematics and Vision for Self-Driving Laboratory Operations

1 Department of Electronics Systems, Aalborg University, 9220 Aalborg, Denmark
2 Nawe Robotics, Calicut 673601, India
3 Extreme Robotics Laboratory, School of Metallurgy & Materials, University of Birmingham, Birmingham B15 2TT, UK
* Author to whom correspondence should be addressed.
Robotics 2026, 15(1), 17; https://doi.org/10.3390/robotics15010017
Submission received: 30 November 2025 / Revised: 5 January 2026 / Accepted: 7 January 2026 / Published: 9 January 2026
(This article belongs to the Special Issue Visual Servoing-Based Robotic Manipulation)

Abstract

Autonomous experimental platforms increasingly rely on robust, vision-guided robotic manipulation to support reliable and repeatable laboratory operations. This work presents a modular motion-execution subsystem designed for integration into self-driving laboratory (SDL) workflows, focusing on the coupling of real-time visual perception with smooth and stable manipulator control. The framework enables autonomous detection, tracking, and interaction with textured objects through a hybrid scheme that couples advanced motion planning algorithms with real-time visual feedback. Kinematic analysis of the manipulator is performed using screw theory formulations, which provide a rigorous foundation for deriving forward kinematics and the space Jacobian. These formulations are further employed to compute inverse kinematic solutions via the Damped Least Squares (DLS) method, ensuring stable and continuous joint trajectories even in the presence of redundancy and singularities. Motion trajectories toward target objects are generated using the RRT* algorithm, offering optimal path planning under dynamic constraints. Object pose estimation is achieved through a vision workflow integrating feature-driven detection and homography-guided depth analysis, enabling adaptive tracking and dynamic grasping of textured objects. The manipulator’s performance is quantitatively evaluated using smoothness metrics, RMSE pose errors, and joint motion profiles, including velocity continuity, acceleration, jerk, and snap. Simulation results demonstrate that the proposed subsystem delivers stable, smooth, and reproducible motion execution, establishing a validated baseline for the manipulation layer of next-generation SDL architectures.

1. Introduction

Self-driving laboratories (SDLs) are transforming the landscape of autonomous experimentation by integrating robotics, computer vision, and intelligent planning to accelerate scientific workflows. Within these environments, dexterous manipulators equipped with sophisticated grippers play a pivotal role in executing precise and adaptive tasks, often outperforming human operators in terms of reliability and efficiency [1,2,3]. A key challenge in such dynamic settings lies in enabling robotic systems to interact with textured objects in real time, requiring seamless coordination between vision-based perception and motion planning. This coordination must not only ensure accurate object tracking but also generate smooth and feasible joint trajectories for safe and efficient manipulation.
This work presents a foundational control framework designed to address this challenge by coupling real-time object tracking with smooth and stable motion execution. The system integrates a hybrid vision pipeline based on feature detection and homography-driven pose estimation with Jacobian-based motion planning for a 7-DOF manipulator. While the vision module is validated using a highly textured, opaque object to ensure reliable tracking, the primary contribution lies in the rigorous quantitative analysis of joint motion dynamics. Specifically, we evaluate velocity continuity, acceleration, jerk, snap, and smoothness cost functions to benchmark trajectory quality and mechanical stability. Rather than claiming architectural novelty, this study focuses on validating the control pipeline’s ability to generate smooth, feasible trajectories in response to dynamic visual input. Depth information is leveraged to interpret object orientation in 3D space, guiding the manipulator’s end-effector toward dynamically changing targets. Trajectories are generated using the RRT* algorithm, while inverse kinematic solutions are computed using Damped Least Squares (DLS), chosen for its potential to optimize joint smoothness and precision [4]. By establishing a modular and reproducible baseline, this framework lays the groundwork for future extensions involving more complex vision modalities and real-world SDL objects such as transparent glassware, reflective surfaces, and low-texture microplates.
While SDLs ultimately encompass capabilities such as autonomous hypothesis generation, adaptive decision-making, and closed-loop experimental optimization, the present work does not attempt to implement these higher-level functions. Instead, we focus on a foundational subsystem that enables such autonomy: reliable, vision-guided motion execution. The study develops and rigorously evaluates a modular control pipeline that integrates real-time object tracking with smooth and stable manipulator motion. Since SDL platforms depend on precise, repeatable, and dynamically responsive robotic manipulation to carry out experimental actions, this subsystem forms a necessary building block for future SDL architectures. Through quantitative analysis of joint-level smoothness, stability, and tracking performance, the work establishes a reproducible baseline that can be incorporated into more advanced frameworks involving planning, reasoning, and adaptive experimentation. Although robotic manipulation is central to SDL operation, existing frameworks seldom combine real-time feature-based tracking, homography-guided depth estimation, and smooth Jacobian-based motion execution within a unified pipeline. Perception and motion planning are often evaluated in isolation, leaving a gap in integrated, quantitatively validated approaches for dynamic object tracking and smooth trajectory generation. This motivates our contribution: a modular, reproducible baseline for vision-guided motion execution that supports the manipulation layer of SDL environments without overstating system-level autonomy.
The remainder of this paper is organized as follows. Section 2 reviews previous work related to kinematic analysis, motion planning schemes, and vision algorithms. Kinematic modeling and workspace analyses of the manipulator, the motion planning schemes, and the vision algorithm are presented in Section 3. Experimental results, including simulation and comparison studies of the proposed motion schemes, are presented in Section 4.

2. Related Work

In the context of SDLs, mobile manipulators that combine a dexterous arm with a mobile base are proving essential for automating complex experimental workflows. These integrated platforms offer both spatial mobility and fine-grained manipulation capabilities, allowing them to navigate dynamic lab environments and interact with diverse instruments and materials. Their deployment in chemical research settings introduces distinct challenges and opportunities, particularly due to the delicate and potentially hazardous nature of lab operations. Tasks such as handling fragile glassware, precisely dispensing reagents, and interfacing with analytical equipment require high levels of accuracy, repeatability, and safety, making mobile manipulators a cornerstone of autonomous scientific discovery.
Screw theory formulations are widely used for the kinematic modeling of robotic systems with a high number of degrees of freedom (DOF) [5,6]. This approach has proved to be particularly flexible for modeling complex systems with coupled and offset joints [7,8]. Liu et al. [9] introduced a kinematic modeling approach for a 6-DOF industrial robot utilizing screw theory formulations. They employed a Particle Swarm Optimization (PSO)-based algorithm to minimize synthesis errors while considering kinematic and dynamic constraints. Another method for kinematic modeling of a redundant manipulator, combining screw theory with the Newton-Raphson method, was presented by Ge et al. [10]. They derived forward kinematic equations based on screw theory formulations and obtained joint solutions using the Newton-Raphson method. Screw theory-based kinematic modeling and motion analysis of a fixed-base dual-arm robot were demonstrated by Sulaiman et al. [11]. They utilized screw theory-derived kinematic equations to plot the robot’s workspace. Additionally, Sulaiman et al. [12] derived kinematic equations for a 10-DOF dual-arm robot with a wheelbase using screw theory. These equations were employed to evaluate singularities and dexterous regions within the robot’s workspace. An iterative method for determining forward kinematic equations using screw theory formulations was demonstrated by Medrano et al. [13]. They applied this approach to model a 6-DOF manipulator and conducted simulation studies to demonstrate the advantages of their method.
Recent advancements in vision-based grasping have significantly enhanced the capabilities of robotic manipulators in dynamic and unstructured environments [14,15]. Hélénon et al. [16] introduced a plug-and-play vision-based grasping framework that leverages Quality-Diversity (QD) algorithms to generate diverse sets of open-loop grasping trajectories. Their system integrates multiple vision modules for 6-DoF object detection and tracking, allowing trajectory generalization across different manipulators such as the Franka Research 3 and UR5 arms. This modular approach improves adaptability and reproducibility in robotic grasping tasks. Kushwaha et al. [17] proposed a vision-based intelligent grasping system using sparse neural networks to reduce computational overhead while maintaining high grasp accuracy. Their Sparse-GRConvNet and Sparse-GINNet architectures utilize the Edge-PopUp algorithm to identify high-quality grasp poses in real time. Extensive experiments on benchmark datasets and a cobot validated the models’ effectiveness in manipulating unfamiliar objects with minimal network parameters. Wang et al. [18] developed a trajectory planning method for manipulator grasping under visual occlusion using monocular vision and multi-layer neural networks. Their approach combines Gaussian sampling with Hopfield neural networks to optimize grasp paths in cluttered environments. The proposed method achieved a 99.5% identification accuracy and demonstrated significant improvements in motion smoothness and efficiency.
Zhang et al. [19] presented a comprehensive survey of robotic grasping techniques, tracing developments from classical analytical methods to modern deep learning-based approaches. Their work highlights the evolution of grasp synthesis and the integration of vision algorithms in robotic manipulation. Similarly, Newbury et al. [20] reviewed deep learning approaches to grasp synthesis, emphasizing the role of convolutional neural networks and transformer models in improving grasp reliability. Du et al. [21] provided a detailed review of vision-based robotic grasping, covering object localization, pose estimation, and grasp inference for parallel grippers. Their analysis underscores the importance of combining RGB-D data with machine learning models to enhance grasp precision in real-world scenarios. These findings align with the growing trend of integrating vision and motion planning for dexterous manipulation. In addition to grasp synthesis, trajectory planning remains a critical component of robotic manipulation. Zhang et al. [22] proposed a time-optimal trajectory planning strategy that incorporates dynamic constraints and input shaping algorithms to improve motion speed and smoothness. Finally, Ortenzi et al. [23] developed an iterative method for determining joint solutions of redundant manipulators performing telemanipulation tasks. Their approach avoids singularities and joint limits, enabling smooth and reliable motion execution. These contributions collectively demonstrate the importance of integrating vision-based perception with robust motion planning schemes to enhance the adaptability, precision, and safety of robotic manipulators in complex environments.
Existing research in SDLs has made notable progress in robotic manipulation, motion planning, and vision-based perception, yet several key limitations persist. Many frameworks lack real-time adaptability in dynamic lab environments, particularly when interacting with textured or moving objects. Vision systems are often limited to static object detection or rely on depth sensors, without integrating feature-based tracking and homography-driven depth estimation for textured surfaces. Furthermore, while screw theory has been widely applied for kinematic modeling, its use in deriving both forward and inverse kinematics with stability guarantees under redundancy and singularities is still underexplored. Combining screw theory with the Damped Least Squares (DLS) method addresses this gap by enabling smooth and continuous joint trajectories. In motion planning, although algorithms like RRT* are known for optimality, their coupling with real-time visual feedback for adaptive trajectory generation remains limited in the current literature. The proposed framework bridges this gap by combining RRT*-based planning with a vision pipeline that supports dynamic pose estimation and grasping. Additionally, few studies offer a unified simulation-based evaluation of these components using quantitative metrics such as RMSE pose errors, velocity continuity, and higher-order motion profiles. By addressing these gaps, our work contributes a cohesive and scalable solution for autonomous experimentation in next-generation SDLs.

3. Methodologies

We performed kinematic analysis to model the motion constraints of the tracking system and assess the robot’s operational reach. This included solving inverse kinematics to determine joint configurations that enable the system to accurately follow an object’s path, thereby supporting effective grasping and manipulation. We implemented a structured planar pose estimation algorithm for object tracking, which unfolds through several key stages. The process begins with extracting and matching features to reliably identify distinct characteristics of the target object. Afterward, the system performs homography estimation and applies a perspective transformation to establish spatial correspondence and achieve proper alignment. Next, directional vectors are derived across the object’s surface to determine its orientation. Depth information is then incorporated to compute the object’s planar pose, enabling accurate interaction and reliable grasping. For object detection, the pipeline begins by extracting distinctive visual features—such as edges, corners, blobs, and ridges—from images of planar objects using the SIFT algorithm [24]. These features are then matched with those captured by the camera to determine object presence. Once converted into feature vectors, matches are evaluated to confirm detection. To address the computational demands of high-dimensional feature matching, we utilized the FLANN-based K-d Nearest Neighbor Search [25], which offers efficient performance for real-time applications. Homography estimation relies on the matched features, though some may be incorrect and introduce noise. To overcome this, we applied the RANSAC algorithm [26], which filters out false matches by iteratively selecting inliers from minimal subsets of data. This approach improves both speed and accuracy, making it ideal for dynamic environments. 
Perspective transformation is then used to estimate corresponding points in the test image, allowing us to derive basis vectors on the object’s surface. Depth information is subsequently incorporated to calculate the surface normal, enabling accurate 3D pose estimation of the planar object.

3.1. Approach Overview

This work employs a KUKA LBR iiwa 14 manipulator and Robotiq Hand-E gripper to evaluate the developed vision-based motion planning scheme, as shown in Figure 1. The kinematic equations governing the manipulator are derived using the screw theory approach. These equations facilitate the analysis of the manipulator’s Cartesian workspace, which is essential for motion planning. Subsequently, motion planning is conducted to identify an optimal set of joint solutions for navigating trajectories within the manipulator and gripper workspaces. Trajectories are generated using the RRT* algorithm, while joint solutions are determined using the Damped Least Squares (DLS) method. Although the simulation environments used in this study are obstacle-free, the RRT* algorithm was deliberately chosen over simpler interpolation methods to ensure generalization to more complex, cluttered environments typical of real-world SDLs. RRT* offers asymptotic optimality and the ability to handle high-dimensional configuration spaces with dynamic constraints, making it well-suited for future extensions involving obstacle avoidance, constrained motion, and workspace reconfiguration. By integrating RRT* into the current framework, we establish a scalable and modular planning backbone that can be readily adapted to more realistic scenarios without requiring fundamental changes to the motion planning architecture.
The kinematic equations obtained through the screw theory method enable the derivation of a set of feasible solutions, and a unique set of configurations is selected for traversing a given trajectory using the DLS method. The DLS method was selected due to its superior behavior compared to other inverse solution methods such as the Jacobian Transpose and Pseudo-Inverse methods [4]; that study demonstrated that DLS consistently achieved superior numerical stability, smoother joint trajectories, and better handling of near-singular configurations. To ensure smooth and optimal trajectories, an objective function based on the accelerations of Cartesian motions is incorporated alongside the trajectory function, and the smoothness of trajectory motions is assessed using this function. In addition to the smoothness function, other characteristics of the joint motions, such as velocities, accelerations, jerk, and snap values, are evaluated. To further validate the performance of the motion planning schemes, errors along the X, Y, and Z directions, along with orientation errors of trajectory waypoints, are compared. Simulation studies are conducted to evaluate the efficacy of each method in determining joint solutions for trajectories traversing various locations within the workspace. Building on this framework, the system performs object detection by extracting distinctive features and matching them to known patterns, enabling reliable identification and localization of targets in the scene. After an object has been recognized, pose estimation computes its exact position and orientation, providing the spatial information required for effective manipulation. In the final stage, adaptive tracking enables the robot to dynamically refine its motion strategy, mitigating singularities and ensuring stable interaction with the object. This sequential pipeline supports robust and flexible autonomous handling across complex and dynamic scenarios.
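As a point of reference for the planning stage, the RRT* loop (sample, extend, choose lowest-cost parent, rewire) can be sketched in a planar toy setting. All names and parameters below are ours, and the sketch omits the joint-limit and constraint handling used in the actual framework.

```python
import numpy as np

def rrt_star(start, goal, n_iter=1500, step=0.1, radius=0.25, goal_tol=0.15, seed=0):
    """Minimal RRT* in an obstacle-free 2D unit square (illustrative sketch only)."""
    rng = np.random.default_rng(seed)
    nodes = [np.asarray(start, dtype=float)]
    parent, cost = [0], [0.0]
    for _ in range(n_iter):
        sample = rng.random(2)
        pts = np.array(nodes)
        dists = np.linalg.norm(pts - sample, axis=1)
        near = int(np.argmin(dists))
        d = dists[near]
        if d < 1e-9:
            continue
        new = nodes[near] + (sample - nodes[near]) / d * min(step, d)
        # choose the lowest-cost parent among neighbours within the rewiring radius
        nbr_d = np.linalg.norm(pts - new, axis=1)
        nbrs = np.where(nbr_d < radius)[0]
        best = min(nbrs, key=lambda i: cost[i] + nbr_d[i])
        nodes.append(new)
        parent.append(int(best))
        cost.append(cost[best] + nbr_d[best])
        # rewire: route neighbours through the new node when it shortens their path
        for i in nbrs:
            c_new = cost[-1] + nbr_d[i]
            if c_new < cost[i]:
                parent[i], cost[i] = len(nodes) - 1, c_new
    goal = np.asarray(goal, dtype=float)
    reach = np.where(np.linalg.norm(np.array(nodes) - goal, axis=1) < goal_tol)[0]
    if len(reach) == 0:
        return None
    i = int(min(reach, key=lambda i: cost[i]))
    path = [goal]
    for _ in range(len(nodes)):  # guard against pathological parent loops
        path.append(nodes[i])
        if i == 0:
            break
        i = parent[i]
    return path[::-1]
```

The rewiring step is what distinguishes RRT* from plain RRT and gives it asymptotic optimality as the number of samples grows.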

3.2. Kinematic Modeling of the Manipulator

The kinematic modeling of the robot establishes a relationship between the angular variations of the manipulator’s joints and the pose of the gripper. As stated before, the screw theory approach [27] has been utilized for this purpose. The corresponding kinematic frames of the manipulator are shown in Figure 2. In this approach, the pose of the end effector is described using two inertial reference frames: one attached to the robot’s base and the other fixed to the end-effector joint. The motion of each rigid body is expressed using screw theory, where displacement is modeled as a screw motion with an associated pitch. We begin by outlining the kinematic formulation for the robot, and then proceed to develop the corresponding model for the hand. As shown in Figure 2, each rotary joint of the robot is assigned a screw axis with zero pitch. $A_i$ and $d_{ij}$ represent the manipulator joints and links, respectively. Link lengths of the KUKA iiwa manipulator are given in Table 1. Let $S_i$ represent the screw axis of joint $i$; the $4 \times 4$ transformation matrix $e^{[S]\theta}$ obtained using screw theory is given in (1).
$$ e^{[S]\theta} = \begin{bmatrix} e^{[\hat{\omega}]\theta} & \left( I\theta + (1-\cos\theta)[\hat{\omega}] + (\theta-\sin\theta)[\hat{\omega}]^{2} \right) v \\ 0 & 1 \end{bmatrix} \tag{1} $$
where $[\hat{\omega}]$ represents the $3 \times 3$ skew-symmetric matrix of the angular velocity vector $\omega$, $v$ is the $3 \times 1$ linear velocity vector, and $I$ is the $3 \times 3$ identity matrix. The transformation matrix for the 7-DOF manipulator, $T(\theta)_m$, is given in (2).
$$ T(\theta)_m = e^{[S_1]\theta_1}\, e^{[S_2]\theta_2}\, e^{[S_3]\theta_3}\, e^{[S_4]\theta_4}\, e^{[S_5]\theta_5}\, e^{[S_6]\theta_6}\, e^{[S_7]\theta_7}\, N_m \tag{2} $$
where $N_m$ is a $4 \times 4$ transformation matrix defining the initial pose of the end-effector frame with respect to the base frame.
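Equations (1) and (2) can be sketched directly in code. This is a minimal numerical sketch with generic screw axes: the iiwa-specific axes and the Table 1 link lengths would be substituted in practice.

```python
import numpy as np

def skew(w):
    """3x3 skew-symmetric matrix [w] of a 3-vector w."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def exp_screw(S, theta):
    """Closed-form exponential of a screw axis S = (omega, v), per Eq. (1)."""
    w, v = np.asarray(S, float)[:3], np.asarray(S, float)[3:]
    W = skew(w)
    T = np.eye(4)
    if np.linalg.norm(w) < 1e-12:  # prismatic (pure-translation) case
        T[:3, 3] = v * theta
        return T
    # Rodrigues rotation and the translation factor G(theta) of Eq. (1)
    R = np.eye(3) + np.sin(theta) * W + (1 - np.cos(theta)) * W @ W
    G = np.eye(3) * theta + (1 - np.cos(theta)) * W + (theta - np.sin(theta)) * W @ W
    T[:3, :3] = R
    T[:3, 3] = G @ v
    return T

def fk_poe(screws, thetas, N):
    """Product of exponentials, Eq. (2): T = e^[S1]th1 ... e^[S7]th7 N."""
    T = np.eye(4)
    for S, th in zip(screws, thetas):
        T = T @ exp_screw(S, th)
    return T @ N
```

For a zero-pitch rotary joint, the screw axis is $(\omega, -\omega \times q)$ with $q$ a point on the joint axis, so a point on the axis remains fixed under the exponential map.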

3.3. Workspace Analysis

An analysis of the workspace was undertaken to identify singularities and unoccupied areas within the manipulator’s operational space. Identifying distinct regions within the workspace facilitated the optimization of motion planning for various tasks. The combined Cartesian workspace of the manipulator and hand was determined using Equation (2), considering joint limits. Figure 3 illustrates views of the workspace of the manipulator, showing all feasible positions of the manipulator and hand; this visualization accounted for joint limitations. Figure 3a–c show the 3D, sectioned, and top views of the Cartesian workspace, respectively. Notably, the maximum reach within the workspace was found to be 0.78 m. Moreover, the manipulator’s dexterous workspace volume, including the gripper, was calculated to be 1.93 cubic meters. The obtained workspace serves as a basis for identifying the reachable areas of the manipulator during trajectory planning. By leveraging this information, any unreachable spaces traversed by the trajectory were excluded to prevent issues during the motion planning scheme.
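Workspace boundaries of this kind are commonly obtained by Monte Carlo sampling of the joint space under joint limits. The planar 2-link sketch below illustrates the procedure; it is a toy stand-in, whereas the paper evaluates the full 7-DOF model of Eq. (2).

```python
import numpy as np

def planar_fk(theta1, theta2, l1=0.4, l2=0.4):
    """End-effector position of a planar 2-link arm (toy workspace model)."""
    x = l1 * np.cos(theta1) + l2 * np.cos(theta1 + theta2)
    y = l1 * np.sin(theta1) + l2 * np.sin(theta1 + theta2)
    return x, y

def sample_workspace(n=20000, limits=(-np.pi, np.pi), seed=0):
    """Monte Carlo workspace estimate: sample joints within limits, collect
    the resulting end-effector positions and the maximum reach."""
    rng = np.random.default_rng(seed)
    th = rng.uniform(limits[0], limits[1], size=(n, 2))
    x, y = planar_fk(th[:, 0], th[:, 1])
    reach = np.sqrt(x**2 + y**2)
    return x, y, reach.max()
```

The same pattern scales to the 7-DOF case by replacing `planar_fk` with the product-of-exponentials forward kinematics and sampling all seven joints within their limits.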

3.4. Inverse Kinematics Techniques—Damped Least Squares (DLS) Strategy

Inverse kinematics (IK) algorithms are essential for determining joint parameters that guide a multi-link robotic system to a specified target pose. These methods compute joint angles that reposition the end-effector from its current pose P to the desired target T. The positional error, which the solver drives to zero, is defined in (3):
$$ e = T - P \tag{3} $$
To minimize this error, the joint configuration θ is iteratively updated using (4):
$$ e = J_s(\theta)\, \Delta\theta \tag{4} $$
The Jacobian matrix $J_s(\theta)$, derived via screw theory as outlined in [28], is central to the IK strategies compared in [4], including the DLS approach adopted below. However, solving (4) may not always be feasible, especially when the manipulator possesses redundant degrees of freedom. In singular configurations, the Jacobian becomes non-invertible, leading to an infinite number of solutions. To address this, alternative Jacobian formulations, as described in [29], can be employed. These alternatives improve performance by mitigating oscillations and overshoot but may occasionally produce non-feasible or suboptimal solutions in terms of energy efficiency and computational time.
The DLS method seeks to minimize both the Cartesian error and the joint velocity norm. It is particularly effective near singularities and for underactuated systems. The objective function, with weighting matrices $W_1$ and $W_2$, is given in (5):
$$ \min_{\dot{\theta}} \; \left\| V_s - J_s \dot{\theta} \right\|_{W_1}^{2} + \left\| \dot{\theta} \right\|_{W_2}^{2} \tag{5} $$
The solution is obtained as given in (6):
$$ \dot{\theta} = J_s^{+} V_s \tag{6} $$
where the damped inverse J s + is defined as given in (7):
$$ J_s^{+} = \left( J_s^{T} W_1 J_s + W_2 \right)^{-1} J_s^{T} W_1 \tag{7} $$
Here, $W_1 = I$ and $W_2 = \alpha I$, with $\alpha$ being a damping coefficient that balances stability and responsiveness. Although (6) expresses the solution using pseudoinverse notation, the actual computation employs the DLS formulation given in (7). The damping term $W_2 = \alpha I$ ensures that the matrix $J_s^{T} W_1 J_s + W_2$ remains positive definite and therefore invertible, even when the manipulator approaches singular configurations. This avoids the numerical instability associated with the standard pseudoinverse and guarantees smooth, bounded joint velocities throughout the motion.
The weighting matrices $W_1$ and $W_2$ are positive definite and hence non-singular. $W_1$ can be chosen to prioritize components of the task vector; in redundancy resolution approaches it may alternatively be chosen to give the cost function energy units, producing consistent results regardless of units or scale. The formulation in (7) handles a rank-deficient Jacobian well and avoids generating high joint velocities near singular regions. It blends joint velocity damping with minimization of the trajectory-following error, trading off the feasibility of the joint velocities against the precision with which the intended end-effector trajectory is followed. When $\alpha = 0$, the method reduces to the pseudoinverse approach, while larger values of $\alpha$ reduce the joint velocity norm at the cost of some deviation from the intended end-effector trajectory. Since the DLS formulation must weigh the feasibility of the inverse kinematic solution against its precision, the damping factor $\alpha$ is crucial, and its value must be set appropriately to guarantee feasible joint velocities in all configurations.
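With $W_1 = I$ and $W_2 = \alpha I$, the update implied by (5)-(7) reduces to $\Delta\theta = (J^{T}J + \alpha I)^{-1} J^{T} e$. The sketch below applies it to a planar 2-link arm purely for illustration; the paper uses the 7-DOF space Jacobian, and all names here are ours.

```python
import numpy as np

def fk2(theta, l1=0.4, l2=0.3):
    """End-effector position of a planar 2-link arm (toy stand-in for the 7-DOF chain)."""
    x = l1 * np.cos(theta[0]) + l2 * np.cos(theta[0] + theta[1])
    y = l1 * np.sin(theta[0]) + l2 * np.sin(theta[0] + theta[1])
    return np.array([x, y])

def jacobian2(theta, l1=0.4, l2=0.3):
    """Analytic Jacobian of the 2-link arm."""
    s1, c1 = np.sin(theta[0]), np.cos(theta[0])
    s12, c12 = np.sin(theta[0] + theta[1]), np.cos(theta[0] + theta[1])
    return np.array([[-l1 * s1 - l2 * s12, -l2 * s12],
                     [ l1 * c1 + l2 * c12,  l2 * c12]])

def dls_step(theta, target, alpha=0.01):
    """One DLS update, Eq. (7) with W1 = I and W2 = alpha*I:
    dtheta = (J^T J + alpha I)^-1 J^T e. The damping keeps the normal
    matrix invertible even near singular configurations."""
    e = target - fk2(theta)
    J = jacobian2(theta)
    dtheta = np.linalg.solve(J.T @ J + alpha * np.eye(2), J.T @ e)
    return theta + dtheta
```

Iterating `dls_step` drives the Cartesian error to zero for reachable targets while the damping term bounds the joint velocities.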

3.5. Vision-Based Grasping Framework

To enable reliable object manipulation, we implemented a vision-guided grasping method [30] that integrates perceptual input with adaptive planning strategies. The architecture comprises two core modules:
  • Planar object localization and pose inference
  • Grasp synthesis based on dynamic pose updates

3.5.1. Planar Object Localization and Pose Inference

Robust pose estimation is a prerequisite for effective manipulation. Our method follows a four-stage pipeline:
  • Extraction of visual features and descriptor matching
  • Homography-based transformation and perspective alignment
  • Derivation of object-centric coordinate axes
  • Pose refinement using depth-enhanced feedback
The pose estimation pipeline fuses 2D homography with RGB-D measurements to recover the full 6-DoF pose of planar, textured objects. First, the RGB image is processed with SIFT to extract keypoints and descriptors that are robust to viewpoint and scale changes. Descriptor matching against a pre-registered template is performed using FLANN, and RANSAC is subsequently applied to reject outliers and compute a homography that maps template coordinates to the current image plane. The homography is used purely for 2D localization: it projects the four template corners to pixel coordinates in the current frame, yielding high-confidence 2D points that anchor the object’s footprint. Depth fusion is then performed at these localized pixels. The RGB and depth frames are hardware-aligned using the camera intrinsics and distortion parameters obtained from calibration, and for each corner pixel the corresponding depth value is sampled from the synchronized depth image. These values are back-projected into 3D camera coordinates via the pinhole model, producing a set of 3D corner points that lift the 2D homography correspondences into metric space. If any corner depth is invalid (missing or beyond sensor range), nearest-neighbor interpolation over the local depth patch or planar fitting on inliers is used to recover a consistent 3D estimate; frames with insufficient valid corners are discarded to preserve reliability. The object pose is constructed from these 3D corners.
The centroid defines the translation, while orientation is obtained by fitting a local orthonormal frame to the planar surface: the surface normal is computed from cross products of non-collinear corner edges, and in-plane axes are derived by orthogonalizing one edge direction against the normal. The resulting rotation matrix is refined by enforcing orthonormality. For numerical stability, a minimal set of geometric constraints is applied to reject degenerate configurations (nearly collinear corners or extremely acute aspect ratios), and the final pose is expressed either as Euler angles or a quaternion depending on downstream controller requirements. To mitigate measurement noise and ensure smooth control inputs, the estimated pose is temporally filtered before publication.
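The depth-fusion step described above (pinhole back-projection of the homography-localized corners, then an orthonormal frame fitted to the plane) can be sketched as follows. Function names and the corner ordering are our assumptions.

```python
import numpy as np

def backproject(u, v, z, fx, fy, cx, cy):
    """Pinhole back-projection of pixel (u, v) with depth z into camera coordinates."""
    return np.array([(u - cx) * z / fx, (v - cy) * z / fy, z])

def pose_from_corners(P):
    """Build a pose (R, t) from four 3D corner points of a planar object.

    P: 4x3 array, corners ordered around the rectangle. The translation is
    the centroid; the rotation fits an orthonormal frame to the plane
    (x along one edge, z along the surface normal), re-orthogonalized as
    described in the text.
    """
    P = np.asarray(P, dtype=float)
    t = P.mean(axis=0)
    ex = P[1] - P[0]                 # in-plane edge direction
    ey_raw = P[3] - P[0]             # second, non-collinear edge
    n = np.cross(ex, ey_raw)         # surface normal from the cross product
    if np.linalg.norm(n) < 1e-9:
        raise ValueError("degenerate (collinear) corners")
    z = n / np.linalg.norm(n)
    x = ex / np.linalg.norm(ex)
    x = x - (x @ z) * z              # orthogonalize the edge against the normal
    x = x / np.linalg.norm(x)
    y = np.cross(z, x)
    R = np.column_stack([x, y, z])   # orthonormal by construction
    return R, t
```

A production version would also apply the degeneracy checks and temporal filtering mentioned in the text before publishing the pose.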
All poses are transformed from the camera frame to the robot base frame using the fixed extrinsic calibration between the sensor and the manipulator, ensuring consistency with the kinematic chain. The final output is published as a message comprising the filtered 6-DoF pose and timestamp, which the motion stack consumes in two phases: RRT* for the initial approach to the vicinity of the object, followed by DLS-based visual servoing that resolves incremental pose errors in real time without global re-planning. This design makes the roles of homography and depth explicit and complementary: homography provides robust 2D corner localization under texture and perspective changes, while depth supplies the metric scale needed to recover 3D structure. Their fusion yields stable, accurate 6-DoF poses suitable for dynamic manipulation, with clearly defined fallbacks, filtering, and frame transforms to maintain smooth end-effector trajectories. Planar object detection is carried out with the SIFT method, which extracts distinctive keypoints and represents their surrounding appearance through local descriptors. To associate features from the input image with those of a reference template, descriptor matching is performed using FLANN for floating-point descriptors or Hamming-based metrics when working with binary descriptors. Using the established keypoint correspondences, a homography matrix H is then estimated to represent the planar mapping between the two views, as expressed in (8).
$$ \begin{bmatrix} c_i \\ d_i \\ e_i \end{bmatrix} = H \begin{bmatrix} x_i \\ y_i \\ 1 \end{bmatrix} \tag{8} $$
The transformed coordinates $(x_i', y_i')$ are obtained using (9):
$$ x_i' = \frac{c_i}{e_i}, \qquad y_i' = \frac{d_i}{e_i} \tag{9} $$
RANSAC is then applied to filter out incorrect matches and produce a stable estimation. To establish the object’s local coordinate frame, three reference points are chosen according to (10)
$$ O_c = \left( \tfrac{w_i}{2},\, \tfrac{h_i}{2} \right), \qquad O_x = \left( w_i,\, \tfrac{h_i}{2} \right), \qquad O_y = \left( \tfrac{w_i}{2},\, 0 \right) \tag{10} $$
where $w_i$ and $h_i$ correspond to the object’s width and height. These points are then projected into 3D space using RGB-D information from the RealSense camera (Intel Corporation, Santa Clara, CA, USA), generating the directional vectors defined in (11):
\[ \mathbf{i}_i = \frac{\mathbf{x}_i}{\|\mathbf{x}_i\|}, \quad \mathbf{j}_i = \frac{\mathbf{y}_i}{\|\mathbf{y}_i\|}, \quad \mathbf{k}_i = \frac{\mathbf{x}_i \times \mathbf{y}_i}{\|\mathbf{x}_i \times \mathbf{y}_i\|} \]
The resulting orthonormal set \( (\mathbf{i}, \mathbf{j}, \mathbf{k}) \) specifies the object's orientation. Finally, the Euler angles \( (\phi_i, \theta_i, \psi_i) \) are computed; these angles encode the object's spatial orientation and are used for planning the grasping strategy.
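The detection-to-frame pipeline described above can be sketched in Python with OpenCV and NumPy. This is a minimal illustration rather than the authors' implementation: `match_and_localize` and `build_object_frame` are hypothetical helper names, and the depth back-projection that turns the projected pixels into the 3-D points \( O_c, O_x, O_y \) is assumed to be handled elsewhere (e.g., by the RealSense SDK).

```python
import numpy as np

def build_object_frame(Oc, Ox, Oy):
    """Orthonormal object frame from the three back-projected reference
    points of Eqs. (10)-(11): i along the width, j along the height,
    k normal to the object plane."""
    Oc = np.asarray(Oc, float)
    x_vec = np.asarray(Ox, float) - Oc
    y_vec = np.asarray(Oy, float) - Oc
    i = x_vec / np.linalg.norm(x_vec)
    j = y_vec / np.linalg.norm(y_vec)
    k = np.cross(x_vec, y_vec)
    return i, j, k / np.linalg.norm(k)

def match_and_localize(template, frame, min_inliers=10):
    """SIFT keypoints, FLANN matching with Lowe's ratio test, and a
    RANSAC-filtered homography (Eq. 8); returns H and the three
    reference points of Eq. (10) projected into the camera image."""
    import cv2  # imported lazily; the frame math above is pure NumPy
    sift = cv2.SIFT_create()
    kp_t, des_t = sift.detectAndCompute(template, None)
    kp_f, des_f = sift.detectAndCompute(frame, None)
    flann = cv2.FlannBasedMatcher({"algorithm": 1, "trees": 5}, {"checks": 50})
    pairs = flann.knnMatch(des_t, des_f, k=2)
    good = [p[0] for p in pairs
            if len(p) == 2 and p[0].distance < 0.7 * p[1].distance]
    if len(good) < min_inliers:
        return None
    src = np.float32([kp_t[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([kp_f[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    if H is None:
        return None
    h, w = template.shape[:2]
    refs = np.float32([[w / 2, h / 2], [w, h / 2], [w / 2, 0]])  # Eq. (10)
    return H, cv2.perspectiveTransform(refs.reshape(-1, 1, 2), H).reshape(-1, 2)
```

Once the three reference points are back-projected with depth, `build_object_frame` yields the rotation columns from which the Euler angles follow.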

3.5.2. Motion Evaluation Metrics

In this section, we describe the motion planning evaluation metrics adopted for analysing the motion smoothness. The evaluation metrics encompassed both motion smoothness and trajectory tracking performance. Motion smoothness was assessed using velocity, acceleration, jerk, and snap profiles, while tracking accuracy was quantified through Root Mean Square Error (RMSE) of the end-effector trajectory. To analyze dynamic consistency, velocity continuity and higher-order motion profiles were computed by evaluating the maximum differences between successive joint values across the trajectory.
After finalizing the trajectory waypoints, the joint angles of the manipulator and gripper were provided to achieve the desired end-effector motions. Smoothness of joint motions was evaluated based on the rate of change of acceleration over time. The smoothness function S_func, derived from the third derivative of position, serves as an effective metric for smoothness. Since smoothness is inversely proportional to the rate of change of acceleration, the reciprocal of S_func was used as a measure of smoothness. The performance of the above-mentioned methods with respect to the x, y, and z coordinates over a time period t was evaluated using the reciprocal smoothness function [31] given in (12).
\[ S_{func} = \left[ \frac{1}{2} \int_{t_1}^{t_2} \left( \left(\frac{d^3x}{dt^3}\right)^2 + \left(\frac{d^3y}{dt^3}\right)^2 + \left(\frac{d^3z}{dt^3}\right)^2 \right) dt \times \frac{(t_2 - t_1)^5}{l^2} \right]^{-1} \]
where the trajectory length l is given in (13):
\[ l = \sum_{i=1}^{n-1} \sqrt{\Delta x_i^2 + \Delta y_i^2 + \Delta z_i^2} \]
As the rate of change of acceleration increases, smoothness decreases. Therefore, smoother motions correspond to lower rates of change in acceleration. Additionally, positional errors at the waypoints of the end-effector were analyzed to assess the accuracy of the inverse joint solutions. To provide a comprehensive measure of overall error, this work introduces an RMSE metric that combines both positional and orientation errors. To effectively represent these combined errors as a single measure, positional errors (x, y, z) and orientation errors were calculated separately, normalized, and then integrated into a unified metric. For position, the RMSE can be computed using (14):
\[ RMSE_{pos} = \sqrt{ \frac{1}{N} \sum_{i=1}^{N} \left( (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 + (z_i - \hat{z}_i)^2 \right) } \]
For orientation errors, since there are three angles (roll, pitch and yaw), the RMSE for each angle is computed separately as given in (15)–(17):
\[ RMSE_{roll} = \sqrt{ \frac{1}{N} \sum_{i=1}^{N} (\theta_{roll,i} - \hat{\theta}_{roll,i})^2 } \]
\[ RMSE_{pitch} = \sqrt{ \frac{1}{N} \sum_{i=1}^{N} (\theta_{pitch,i} - \hat{\theta}_{pitch,i})^2 } \]
\[ RMSE_{yaw} = \sqrt{ \frac{1}{N} \sum_{i=1}^{N} (\theta_{yaw,i} - \hat{\theta}_{yaw,i})^2 } \]
To represent the overall orientation error as a single metric in radians, we combined the individual RMSE values for roll, pitch, and yaw into a single orientation RMSE using (18):
\[ RMSE_{orient} = \sqrt{ \frac{1}{3} \left( RMSE_{roll}^2 + RMSE_{pitch}^2 + RMSE_{yaw}^2 \right) } \]
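The combined metric of Eqs. (14)-(18) reduces to a few lines of NumPy. The helper below is an illustrative sketch with an assumed array convention of one [x, y, z, roll, pitch, yaw] row per waypoint.

```python
import numpy as np

def pose_rmse(actual, desired):
    """RMSE metrics of Eqs. (14)-(18).
    actual, desired: (N, 6) arrays of [x, y, z, roll, pitch, yaw]."""
    err = np.asarray(actual, float) - np.asarray(desired, float)
    # Positional RMSE over the combined x, y, z error (Eq. 14)
    rmse_pos = np.sqrt(np.mean(np.sum(err[:, :3] ** 2, axis=1)))
    # Per-angle RMSEs for roll, pitch, yaw (Eqs. 15-17)
    rmse_rpy = np.sqrt(np.mean(err[:, 3:] ** 2, axis=0))
    # Combined orientation RMSE in radians (Eq. 18)
    rmse_orient = np.sqrt(np.mean(rmse_rpy ** 2))
    return rmse_pos, rmse_rpy, rmse_orient
```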

4. Results and Discussion

The robotic manipulation pipeline in this study starts with a kinematic formulation, where inverse kinematics is solved and the workspace is analyzed to identify valid end-effector configurations. This step establishes the robot's ability to access and interact with objects throughout its reachable area. Afterward, object detection is carried out using feature extraction and matching methods, allowing the system to recognize and determine the position of target items within the scene. Once detected, pose estimation is employed to determine the precise position and orientation of the object, providing critical spatial information for manipulation. The final stage involves tracking the textured object, in which the robot dynamically adjusts its position and orientation to follow the object. This step-by-step pipeline enables the robot to operate autonomously with both reliability and adaptability, even in dynamic or unstructured environments. The object detection and pose estimation method was integrated into the Robot Operating System (ROS) and implemented using OpenCV 4.11.0 on an Ubuntu 20.04 machine. The experiments were carried out on a system equipped with a 3.0 GHz Intel Core i7-7400 processor and 16 GB of RAM. To assess the performance of the algorithm, a simulation environment was developed using Rviz and Gazebo, where a mobile manipulator interacts with a textured book cover, as shown in Figure 4. The complete simulation setup, including both the Gazebo world and the Rviz visualization, is presented in Figure 4a,b.
The motion strategies were evaluated based on the smoothness of joint motions using a smoothness function and the error range relative to the desired end-effector trajectory. Additionally, the velocity, acceleration, jerk, and snap values of the motions were determined to analyse the joint motions. The proposed method aimed to gradually adjust joint positions and orientations from a stable state. However, because the number of iterations is not fixed, the time required to find an inverse kinematics (IK) solution for a given end-effector pose was variable. The duration of a single iteration, however, depends only on the dimensionality of J and θ and is unaffected by whether the full algorithm runs to completion. To address this variability, a maximum time limit for the algorithm was enforced by setting an upper bound on the number of iterations. In Jacobian-based inverse kinematics solvers, the dimensionality of J in 3-dimensional space is typically either 3 or 6. A 3-dimensional J encodes only the positional information for the end-effector, while a 6-dimensional J is often preferred as it includes both positional and orientation information. In this work, a 6-dimensional J was chosen to account for both positional and orientation components. The system was deemed repeatable if a given goal vector g consistently produced the same pose vector. However, achieving repeatability in redundant systems requires special measures, as this consistency is not guaranteed inherently. An alternative approach involves resetting the system to a predefined default pose, ensuring repeatable solutions; however, this method may introduce sharp discontinuities in the solution trajectory. For every inverse solution technique employed in this work, the error matrix e was assigned values as described in (19):
\[ \mathbf{e} = \begin{bmatrix} X_{target} - X_{endeffector} \\ Y_{target} - Y_{endeffector} \\ Z_{target} - Z_{endeffector} \\ \alpha_{target} - \alpha_{endeffector} \\ \beta_{target} - \beta_{endeffector} \\ \gamma_{target} - \gamma_{endeffector} \end{bmatrix} = \begin{bmatrix} 0.01 \\ 0.01 \\ 0.01 \\ 0.01 \\ 0.01 \\ 0.01 \end{bmatrix} \]
Δe was determined such that it moves e closer to g. The starting iteration assumes Δe = g − e. The stopping conditions were implemented to improve the performance of the method and reduce computational effort during the iterations. In this work, the stopping criteria were as follows:
  • Finding a solution within the error limits given in (20).
  • Convergence to a local minimum.
  • Non-convergence after the allotted time.
  • Maximum iterations reached.
If the solution converges to a local minimum, pose vectors can be randomized to avoid recurrence in future iterations. An allotted time can be specified for each step to prevent the method from exceeding a predetermined duration. Additionally, the maximum number of steps can be set to limit computational time. The iteration step size, β , was determined using the desired e and g as given in (20):
\[ \Delta \mathbf{e} = \beta (\mathbf{g} - \mathbf{e}), \qquad 0 \leq \beta \leq 1 \]
The step size was limited by scaling it with β. β was determined after computing Δθ to obtain a better approximation. A common approach limits joint rotation increments to a maximum of 5 degrees per iteration. Initially, Δθ was computed without β and then checked to ensure that no Δθ value exceeded the threshold; the scaling factor β_h was calculated using Equation (21):
\[ \beta_h = \frac{Threshold}{\max\left( Threshold,\; \max_i |\Delta \theta_i| \right)} \]
After finalizing β , the Δ θ values were updated during iterations using (22):
\[ \boldsymbol{\theta} = \boldsymbol{\theta} + \beta \, \Delta \boldsymbol{\theta} \]
The damping constant of the DLS method was determined by multiplying the magnitude of the desired end-effector change by a small constant that was set at each iteration. This small constant was taken as 1/100 of the total length of the segments from base to end effector. It was included to prevent error oscillations that arise when the desired change is too small to damp, a problem that occurs when successive target positions are very close together. The inclusion of α prevents the Jacobian matrix from converging to a singular configuration at any time. An orthogonal set of vectors equal in length to the desired change vector e produces a good damping effect; however, this comes at the cost of increased computational time for calculating the joint values. Each iteration attempted to reduce the error toward the desired e value, and the iteration stopped once the stopping criteria specified above were met.
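One update of the iterative scheme described above might look as follows in NumPy. This is a hedged sketch rather than the authors' code: `dls_step` is a hypothetical name, the damping term is scaled by the magnitude of the requested end-effector change as described in the text, and the 5-degree clamp implements Eqs. (21)-(22).

```python
import numpy as np

def dls_step(J, e_vec, damping, max_step_deg=5.0):
    """One damped-least-squares joint update. The damping scales with
    the magnitude of the requested end-effector change, and joint
    increments are clamped to ~5 degrees per iteration via beta_h."""
    lam = damping * np.linalg.norm(e_vec)       # damping proportional to ||Δe||
    m = J.shape[0]
    # Δθ = Jᵀ (J Jᵀ + λ² I)⁻¹ Δe  — the standard DLS solution
    dtheta = J.T @ np.linalg.solve(J @ J.T + lam ** 2 * np.eye(m), e_vec)
    thr = np.deg2rad(max_step_deg)
    beta_h = thr / max(thr, np.max(np.abs(dtheta)))   # Eq. (21)
    return beta_h * dtheta                            # applied as θ ← θ + βΔθ, Eq. (22)
```

In an IK loop, the returned increment is added to the joint vector until one of the stopping criteria listed above is met.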
A RealSense camera mounted on the mobile platform (as shown in Figure 4) was used to determine the initial pose of the textured book cover. After the pose was obtained, the RRT* motion planner generated a trajectory that enabled the manipulator to track the book's motion while keeping the end effector in a grasp-ready configuration. The DLS approach was applied to compute the joint angles required to follow this path. The outputs of the vision module are displayed in Figure 5a,b. Figure 5a presents the detected bounding box around the object, while Figure 5b shows the resulting pose estimation. These visual outcomes demonstrate that the algorithm successfully identifies and orients the target, supplying the necessary information for the subsequent manipulation stage. The mobile manipulator was placed in front of a book as shown in Figure 4. The manipulator follows a trajectory computed via the Rapidly exploring Random Tree Star (RRT*) algorithm, with corresponding joint configurations derived using the Damped Least Squares (DLS) inverse kinematics method. The end-effector pose is defined such that the manipulator's gripper approaches the book frontally, maintaining a face-to-face orientation at a distance of 15 cm along the x-axis, as shown in Figure 6 and Figure 7. Figure 6 and Figure 7 illustrate the manipulator's progression through its initial, intermediate, and final configurations as it approaches the target pose, visualized in Rviz and Gazebo environments, respectively. To further validate the system's manipulation capabilities, we performed tracking experiments following the motions of the book. During tracking, the camera mounted on the manipulator was used for detecting the updated poses of the book. Figure 8 and Figure 9 present snapshots of the manipulator's motion in Rviz and Gazebo as it responds to variations in the book's pose.
These images illustrate how the robot updates its motion plan and adjusts both its joints and end-effector orientation to reach the target. Although the mobile base remains fixed, the arm continuously adapts its configuration using the pose information supplied by the vision module. The colored coordinate axes shown in each frame correspond to the estimated pose of the book, emphasizing the system’s capability to track and align with the object during the approach. Collectively, these visual results demonstrate the robustness of the combined perception and planning pipeline in enabling accurate and responsive interaction with the target.
The RRT* planner is used only during the initial approach, where it computes an optimal, collision-free path from the robot’s current configuration to a region near the target. After the end-effector arrives at this location, the system switches to a real-time tracking mode. In this stage, continuous pose updates from the vision module (running at roughly 13 FPS) are fed into a DLS-based inverse kinematics controller, which applies small corrective motions. Rather than generating new RRT* plans at every update, the DLS controller functions within a visual-servoing loop, compensating for the instantaneous pose error between the end-effector and the newly estimated target pose. By separating long-range global planning (handled by RRT*) from short-range local tracking (handled by DLS servoing), the system achieves both computational efficiency and smooth, responsive motion. The experimental results confirmed that the proposed framework can smoothly transition from visual pose estimation and trajectory planning to direct physical interaction, demonstrating strong performance in both tracking and manipulation scenarios. All planning and execution were performed using the ROS MoveIt environment, which provided stable and adaptable path optimization for accurate object handling. To obtain quantitative performance insights, we conducted a series of trials using a textured book cover placed at multiple positions within the camera’s field of view, under varying environmental conditions. The system achieved a tracking accuracy of 96.7% as given in Table 2, indicating consistent pose estimation until occlusion occurred due to end-effector interference. Pose estimation error remained within ±0.63 cm, confirming the system’s suitability for precision grasping tasks. 
Maximum pose estimation error (±0.63 cm) was computed by comparing the output of the vision pipeline to the ground-truth object pose provided by the Gazebo simulation environment. The detection module achieved an average processing latency of 75 ms per frame, corresponding to a real-time performance of approximately 13.30 FPS. The system attained a precision of 97.1% and a recall of 96.5%, demonstrating reliable object identification and localization even under challenging lighting and background conditions. These results indicate that the approach is well-suited for dynamic manipulation tasks in semi-structured environments and generalizes effectively across different object types and camera viewpoints. Overall, the demonstrated performance highlights its potential for deployment in practical applications such as automated sorting, assistive robotic systems, and mobile manipulation.
Table 3 presents a comparative evaluation of the proposed motion scheme against several established pose-estimation and position-error detection methods. The results clearly demonstrate the superior performance of the proposed framework in both recall and positional accuracy. In terms of pose-estimation recall, the proposed scheme achieves a value of 96 %, substantially outperforming all competing approaches. Classical feature-matching and voting-based techniques adopted in [32] such as Geometric Consistency (GC) [33], Hough Transform (HG) [34], Search of Inliers (SI) [35], Spectral Technique (ST) [36], NNSR [37], and RANSAC [26] exhibit significantly lower recall values ranging from 0 % to 31 %. These methods are known to be sensitive to feature noise, viewpoint variations, and partial occlusions, which limits their robustness in dynamic laboratory environments. The near-perfect recall of the proposed method highlights its reliability in consistently detecting and tracking the target object under varying visual conditions. A similar trend is observed in the position-error comparison. Existing approaches such as Dong and Rodriguez et al. [38] report mean errors as high as 1.90 mm, while Li et al. [39] achieve 0.14 mm. More recent high-precision methods such as Zhao et al. [40] report errors in the range of 0.068–0.102 mm. In contrast, the proposed scheme attains a lower mean position error of 0.09 mm, placing it within the same high-accuracy regime while maintaining a simpler and more computationally efficient pipeline. This improvement can be attributed to the integration of homography-based pose estimation with the DLS inverse kinematics strategy, which enhances both spatial accuracy and motion stability. Overall, the results confirm that the proposed vision-guided motion framework not only surpasses traditional feature-based pose-estimation techniques in recall but also delivers competitive or superior positional accuracy relative to state-of-the-art methods. 
These findings underscore the suitability of the proposed approach for precise, reliable, and adaptive manipulation tasks in autonomous SDL environments.

4.1. Evaluation of Trajectory Motions

Maximum errors and RMSE of Cartesian motions obtained using the DLS method are shown in Table 4. The maximum translational errors range from 0.74 mm to 1.11 mm, while rotational errors peak at 1.75 deg. The RMSE values indicate consistent performance, with translational deviations remaining below 1 mm and rotational deviations under 1.10 deg. These results demonstrate the DLS method's effectiveness in maintaining low pose estimation errors across the trajectory. Velocity Continuity (VC), Acceleration Profile (AP), jerk, snap, smoothness, and RMSE errors of the end effector were evaluated. Maximum values of AP, jerk, and snap were calculated to analyse the behaviour of the motions. VC, AP, jerk, and snap values of the manipulator joints are given in Table 5. Table 5 presents the dynamic motion metrics of the manipulator joints, including velocity continuity (VC), acceleration profile (AP), jerk, and snap, evaluated across all seven joints (θ_1 to θ_7). The VC values range from 0.08 deg/s to 0.30 deg/s, indicating smooth transitions in joint velocities without abrupt discontinuities. The acceleration profiles span from 0.75 deg/s² to 1.9 deg/s², reflecting the rate of change of velocity during motion execution. Jerk values, which quantify the variation in acceleration, remain below 0.41 deg/s³, suggesting well-regulated dynamic behavior. Similarly, snap values, representing the fourth derivative of position, are maintained below 0.46 deg/s⁴, confirming the overall smoothness and stability of the joint trajectories. These metrics collectively demonstrate that the motion planning and control strategies employed yield dynamically consistent and mechanically safe joint movements suitable for precise and compliant manipulation tasks.
DLS performed better in terms of VC, AP, jerk, and snap values. Motions obtained using the Jacobian transpose (JT) method exhibited lower VC, unstable AP, and higher jerk and snap values. Table 6 presents the smoothness values associated with the individual joint motions of the manipulator, denoted θ_1 through θ_7. The smoothness metric serves as a quantitative indicator of trajectory continuity, where lower values correspond to reduced dynamic fluctuations in joint motion, specifically in velocity, acceleration, jerk, and snap, thereby reflecting smoother and more mechanically stable trajectories. These values quantify the continuity and fluidity of joint trajectories during execution. The results indicate that most joints exhibit smooth motion profiles, with values ranging from 0.78 to 1.57. Notably, joint θ_7 shows a higher smoothness value of 2.14, suggesting increased variability or dynamic complexity in its motion. Overall, the smoothness metrics reflect well-conditioned joint behavior, contributing to stable and precise end-effector control throughout the manipulation task.

4.2. Discussion

The DLS method determined feasible joint motions with optimum accuracy and feasibility. It prevented the occurrence of high joint velocities and provided smooth motion even near singular regions by employing an appropriate damping factor. The performance of the DLS-based inverse kinematics solver is influenced by the choice of the damping factor α. In our implementation, the system exhibits low to moderate sensitivity to variations in α, primarily because the trajectories are smooth and the manipulator does not operate near severe singularities during the evaluated tasks.
  • Small damping values (e.g., α < 0.01 ) yield highly accurate joint updates but may amplify noise in the Jacobian pseudo-inverse, causing oscillations when the manipulator approaches singular configurations.
  • Moderate damping values (e.g., 0.01 α 0.1 ) provide a good balance between accuracy and stability. In this range, the system maintains smooth joint trajectories, low RMSE pose errors, and stable velocity/acceleration profiles.
  • Large damping values (e.g., α > 0.1 ) overly suppress the Jacobian, leading to slower convergence and slightly increased tracking error, but the system remains stable and does not exhibit discontinuities.
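The three regimes above can be reproduced numerically with a toy near-singular Jacobian: the damped pseudo-inverse Jᵀ(JJᵀ + α²I)⁻¹ maps each singular value σ to a gain σ/(σ² + α²), which stays bounded as σ → 0. The values below are illustrative, not measurements from the manipulator.

```python
import numpy as np

def damped_pinv(J, alpha):
    """Damped pseudo-inverse Jᵀ(JJᵀ + α²I)⁻¹: each singular value σ of J
    contributes a gain σ/(σ² + α²), bounded even when σ is nearly zero."""
    m = J.shape[0]
    return J.T @ np.linalg.inv(J @ J.T + alpha ** 2 * np.eye(m))

# A near-singular toy Jacobian: one healthy singular value, one tiny one
J = np.diag([1.0, 1e-4])
for alpha in (0.001, 0.05, 0.5):
    gain = np.linalg.norm(damped_pinv(J, alpha), 2)  # worst-case joint gain
    print(f"alpha={alpha}: max gain {gain:.2f}")
```

With α = 0.001 the tiny singular value produces a gain near 100 (noise amplification and oscillation), the moderate setting keeps the gain near 1, and α = 0.5 suppresses even the healthy direction, which corresponds to the slower convergence noted above.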
When the target object is located near the boundary of the manipulator's workspace, the proposed framework handles this situation through a combination of workspace analysis, trajectory planning, and the stabilizing properties of the DLS inverse kinematics method. First, the full Cartesian workspace of the manipulator and gripper is precomputed during the modeling stage. This allows the system to determine whether a target pose is fully reachable or lies close to the workspace limits. When the object is near the boundary, the RRT* planner automatically generates trajectories that remain within feasible regions and avoids joint-limit violations. Second, the DLS inverse kinematics formulation provides additional robustness near workspace boundaries. As the manipulator approaches stretched or near-singular configurations, the damping term prevents numerical instability and ensures smooth joint updates. Finally, if the target pose lies outside the reachable workspace, the system does not attempt to force an infeasible configuration. Instead, it selects the closest reachable pose and plans a safe trajectory to that point, ensuring stable behavior without abrupt or unsafe joint motions. While the experiment utilized a textured book cover as a proxy object, the demonstrated capabilities, particularly robust tracking of textured planar surfaces and smooth, redundancy-aware motion planning, are directly transferable to high-value tasks in an SDL environment.
The system maintains stable tracking under partial visibility (at least 60 % of the textured object visible) through a combination of redundant feature extraction, FLANN-based matching, and RANSAC homography estimation. RANSAC effectively rejects outlier correspondences introduced by occlusion, enabling reliable homography computation even when only a subset of the object is visible. Depth information is further integrated to stabilize the surface normal when visual features are sparse. The current vision pipeline, which relies on feature-based detection and homography-driven pose estimation, cannot generalize to more challenging SDL objects such as transparent glassware, reflective instrument panels, or low-texture plastic microplates. These object classes often lack sufficient visual features for reliable SIFT-based tracking and require alternative sensing modalities (e.g., depth fusion, polarization imaging, or thermal vision) or learning-based approaches (e.g., deep neural networks trained on domain-specific datasets) to ensure robust perception. However, the primary objective of this study is to establish a foundational motion and control framework that is modular and sensor-oriented. The architecture is designed to accommodate future upgrades to the perception stack without requiring changes to the underlying kinematic modeling or trajectory optimization methods. In this regard, the current implementation serves as a proof-of-concept that validates the motion planning and control pipeline under ideal visual conditions, providing a baseline for future extensions.
Although this work is motivated by the broader vision of SDLs, it does not attempt to implement the full spectrum of SDL autonomy, such as hypothesis generation, experimental reasoning, or closed-loop optimization. Instead, the system presented here is to be understood as a manipulation-level subsystem that enables those higher-level capabilities. By focusing on the integration of real-time feature-based tracking, homography-guided depth estimation, and smooth Jacobian-based motion execution, we address a foundational layer that many SDL frameworks implicitly assume but rarely evaluate in a unified manner. The quantitative analysis of smoothness, stability, and tracking accuracy demonstrates that reliable, reproducible motion execution can be achieved even under dynamic visual input, an essential prerequisite for any platform that aims to automate experimental procedures. Positioning the contribution in this way ensures that the claims remain aligned with the demonstrated capabilities while highlighting the subsystem's importance for future SDL architectures that incorporate planning, reasoning, and adaptive experimentation.

5. Conclusions

This study introduced a multimodal control framework that integrates vision-guided object tracking with Jacobian-based motion planning for autonomous manipulation in SDL environments. The proposed system combined feature-based detection and homography-driven pose estimation with RRT*-based trajectory generation and inverse kinematics computed via the Damped Least Squares (DLS) method. Kinematic modeling using screw theory enabled robust workspace analysis and stable joint solutions, even in the presence of redundancy and near-singular configurations. Quantitative evaluations demonstrated the effectiveness of the framework. The manipulator achieved a maximum Cartesian pose error of 1.11 mm and a root mean square error (RMSE) of 0.97 mm across translational axes. Rotational RMSE remained below 1.10 deg, confirming precise orientation tracking. Joint motion profiles showed velocity continuity values ranging from 0.08 deg/s to 0.30 deg/s, with acceleration profiles peaking at 1.9 deg/s². Higher-order metrics such as jerk and snap were maintained below 0.41 deg/s³ and 0.46 deg/s⁴, respectively, indicating smooth and dynamically stable trajectories. Additionally, joint smoothness values ranged from 0.78 to 2.14, with most joints exhibiting consistent motion behavior. Despite these promising results, certain limitations persist, particularly in handling edge-of-workspace scenarios and highly nonlinear object motions. Future work will focus on enhancing the linear kinematic model with second-order approximations to improve accuracy near workspace boundaries. Real-time feedback from tactile and force-torque sensors will be incorporated to enable compliant manipulation and safe human-robot interaction. Incorporating force–torque sensors would significantly enhance the current framework by enabling contact-aware and compliant manipulation, which is essential for handling delicate laboratory objects.
Force feedback would complement the vision pipeline by providing stability during grasping, alignment, and interaction tasks where visual cues may be unreliable. This integration would extend the system from purely vision-guided motion to a more robust multimodal control strategy suitable for real-world SDL operations. Furthermore, extending the framework to support multi-arm coordination and mobile base integration will enhance scalability and operational reach. Finally, benchmarking the system across diverse laboratory tasks and deploying it in real-world experimental settings will be essential for validating its generalizability and impact on autonomous scientific discovery.
Performance comparisons further underscore the advantages of the proposed approach. In pose-estimation recall, the proposed scheme achieved a value of 0.96, substantially surpassing classical methods such as GC (0.24), HG (0.31), ST (0.01), NNSR (0.00), and RANSAC (0.00). This near-perfect recall highlights the robustness of the vision pipeline in consistently detecting and tracking textured planar objects under varying visual conditions. A similar trend is observed in the position-error evaluation, where the proposed method attained a mean error of 0.09 mm, significantly outperforming Dong and Rodriguez et al. (1.90 mm) and Li et al. (0.14 mm), while remaining competitive with recent high-precision approaches such as Zhao et al. (0.068–0.102 mm). Collectively, these results reinforce the precision, reliability, and stability of the integrated perception-and-motion framework.
This work demonstrated foundational capabilities that support the operational demands of SDL workflows, without attempting to implement full SDL autonomy. By combining reliable tracking of textured planar objects with smooth, redundancy-aware motion execution, the system establishes a manipulation-level baseline that can underpin tasks such as handling labeled reagent bottles, interacting with instrument interfaces, and managing microplate logistics. Future work will extend the perception module to accommodate laboratory objects by incorporating multi-modal sensing and deep learning-based detection and pose estimation. We also plan to evaluate the framework in real laboratory settings involving diverse geometries, material properties, and environmental conditions. Finally, integrating obstacle-rich or constrained environments will allow the advantages of RRT* to be demonstrated more explicitly. Together, these developments will further validate the scalability and robustness of the proposed subsystem within practical autonomous experimentation pipelines.

Author Contributions

Conceptualization, methodology, software implementation, and manuscript preparation were carried out by S.S. A.H. contributed to the simulation work. S.B. and N.M. provided supervision, critical review, and guidance throughout the research and manuscript refinement process. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Pioneer Center for Accelerating P2X Materials Discovery, CAPeX.

Data Availability Statement

The data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

Amarnath Harikumar is employed by Nawe Robotics, India and Naresh Marturi is employed by KUKA Robotics UK Ltd., UK. The authors declare that the research was conducted without any commercial or financial relationships that could be construed as potential conflicts of interest.

References

  1. Dobrokvashina, A.; Sulaiman, S.; Gamberov, T.; Hsia, K.H.; Magid, E. New Features Implementation for Servosila Engineer Model in Gazebo Simulator for ROS Noetic. In Proceedings of the International Conference on Artificial Life and Robotics, Oita, Japan, 9–12 February 2023; Volume 28, pp. 153–156. [Google Scholar]
  2. Seifrid, M.; Pollice, R.; Aguilar-Granda, A.; Morgan Chan, Z.; Hotta, K.; Ser, C.T.; Vestfrid, J.; Wu, T.C.; Aspuru-Guzik, A. Autonomous chemical experiments: Challenges and perspectives on establishing a self-driving lab. Acc. Chem. Res. 2022, 55, 2454–2466. [Google Scholar] [CrossRef]
  3. Sultanov, R.; Sulaiman, S.; Tsoy, T.; Chebotareva, E. Virtual collaborative cells modeling for UR3 and UR5 robots in Gazebo simulator. In Proceedings of the 2023 International Conference on Artificial Life and Robotics, Oita, Japan, 9–12 February 2023; pp. 149–152. [Google Scholar]
  4. Sulaiman, S.; Harikumar, A.; Marturi, N.; Bøgh, S. Comparative Analysis of Jacobian-Based Motion Planning Methods for Redundant Manipulators. In Proceedings of the 2025 IEEE International Conference on Advanced Robotics and its Social Impacts (ARSO), Osaka, Japan, 17–19 July 2025; pp. 259–264. [Google Scholar]
  5. Liao, Z.; Jiang, G.; Zhao, F.; Mei, X.; Yue, Y. A novel solution of inverse kinematic for 6R robot manipulator with offset joint based on screw theory. Int. J. Adv. Robot. Syst. 2020, 17, 1729881420925645. [Google Scholar] [CrossRef]
  6. Gandhi, N.; Nagababu, G.; Vistapalli, J. Modeling and Kinematic Analysis of a Robotic Manipulator for Street Cleaning Applications Using Screw Theory. In Advances in Industrial Machines and Mechanisms: Select Proceedings of IPROMM 2020; Springer: Singapore, 2021; pp. 609–618. [Google Scholar]
  7. Wang, Y.; Liang, X.; Gong, K.; Liao, Y. Kinematical research of free-floating space-robot system at position level based on screw theory. Int. J. Aerosp. Eng. 2019, 2019, 6857106. [Google Scholar] [CrossRef]
  8. Sun, P.; Li, Y.; Chen, K.; Zhu, W.; Zhong, Q.; Chen, B. Generalized kinematics analysis of hybrid mechanisms based on screw theory and Lie groups/Lie algebras. Chin. J. Mech. Eng. 2021, 34, 98. [Google Scholar] [CrossRef]
  9. Liu, Z.; Xu, J.; Cheng, Q.; Zhao, Y.; Pei, Y.; Yang, C. Trajectory planning with minimum synthesis error for industrial robots using screw theory. Int. J. Precis. Eng. Manuf. 2018, 19, 183–193. [Google Scholar] [CrossRef]
  10. Ge, D. Kinematics modeling of redundant manipulator based on screw theory and Newton-Raphson method. J. Phys. Conf. Ser. 2022, 2246, 012068. [Google Scholar] [CrossRef]
  11. Shifa, S.; Sudheer, A.P. Modelling of torso and dual arms for a humanoid robot with fixed base by using screw theory for dexterous applications. IOP Conf. Ser. Mater. Sci. Eng. 2021, 1132, 012036. [Google Scholar]
  12. Shifa, S.; Sudheer, A.P. Modeling of a wheeled humanoid robot and hybrid algorithm-based path planning of wheel base for the dynamic obstacles avoidance. Ind. Robot. Int. J. Robot. Res. Appl. 2022, 49, 1058–1076. [Google Scholar] [CrossRef]
  13. Medrano-Hermosillo, J.A.; Lozoya-Ponce, R.; Ramírez-Quintana, J.; Baray-Arana, R. Forward Kinematics Analysis of 6-DoF Articulated Robot using Screw Theory and Geometric Algebra. In Proceedings of the 2022 XXIV Robotics Mexican Congress (COMRob), Mineral de la Reforma, Mexico, 9–11 November 2022; pp. 1–6. [Google Scholar]
  14. Wang, X.; Guo, S.; Xu, Z.; Zhang, Z.; Sun, Z.; Xu, Y. A robotic teleoperation system enhanced by augmented reality for natural human–robot interaction. Cyborg Bionic Syst. 2024, 5, 0098. [Google Scholar] [CrossRef] [PubMed]
  15. Meng, C.; Zhang, T.; Zhao, D.; Lam, T.L. Fast and comfortable robot-to-human handover for mobile cooperation robot system. Cyborg Bionic Syst. 2024, 5, 0120. [Google Scholar] [CrossRef] [PubMed]
  16. Hélénon, F.; Huber, J.; Amar, F.B.; Doncieux, S. Toward a Plug-and-Play Vision-Based Grasping Module for Robotics. arXiv 2023, arXiv:2310.04349. [Google Scholar]
  17. Kushwaha, V.; Shukla, P.; Nandi, G. Vision-based intelligent robot grasping using sparse neural network. Int. J. Intell. Robot. Appl. 2025, 9, 1214–1227. [Google Scholar] [CrossRef]
  18. Wang, H.; Zhao, Q.; Li, H.; Zhao, R. Polynomial-based smooth trajectory planning for fruit-picking robot manipulator. Inf. Process. Agric. 2022, 9, 112–122. [Google Scholar] [CrossRef]
  19. Zhang, H.; Tang, J.; Sun, S.; Lan, X. Robotic grasping from classical to modern: A survey. arXiv 2022, arXiv:2202.03631. [Google Scholar] [CrossRef]
  20. Newbury, R.; Gu, M.; Chumbley, L.; Mousavian, A.; Eppner, C.; Leitner, J.; Bohg, J.; Morales, A.; Asfour, T.; Kragic, D.; et al. Deep learning approaches to grasp synthesis: A review. IEEE Trans. Robot. 2023, 39, 3994–4015. [Google Scholar] [CrossRef]
  21. Du, G.; Wang, K.; Lian, S.; Zhao, K. Vision-based robotic grasping from object localization, object pose estimation to grasp estimation for parallel grippers: A review. Artif. Intell. Rev. 2021, 54, 1677–1734. [Google Scholar] [CrossRef]
  22. Zhang, S.; Lyu, B.; Li, Y.; Sui, L.; Yang, J. Trajectory planning of a 6R manipulator around obstacles by using SRRT* Algorithm. In Proceedings of the 2nd International Conference on Computing and Data Science, Online, 28–29 January 2021; pp. 1–4. [Google Scholar]
  23. Ortenzi, V.; Marturi, N.; Rajasekaran, V.; Adjigble, M.; Stolkin, R. Singularity-robust inverse kinematics solver for tele-manipulation. In Proceedings of the 2019 IEEE 15th International Conference on Automation Science and Engineering (CASE), Vancouver, BC, Canada, 22–26 August 2019; pp. 1821–1828. [Google Scholar]
  24. Wu, J.; Cui, Z.; Sheng, V.S.; Zhao, P.; Su, D.; Gong, S. A Comparative Study of SIFT and its Variants. Meas. Sci. Rev. 2013, 13, 122–131. [Google Scholar] [CrossRef]
  25. Muja, M.; Lowe, D.G. Fast approximate nearest neighbors with automatic algorithm configuration. In Proceedings of the International Conference on Computer Vision Theory and Applications (VISAPP), Lisboa, Portugal, 5–8 February 2009; Volume 1, pp. 331–340. [Google Scholar]
  26. Fischler, M.A.; Bolles, R.C. Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography. Commun. ACM 1981, 24, 381–395. [Google Scholar] [CrossRef]
  27. Pardos-Gotor, J. Screw Theory in Robotics: An Illustrated and Practicable Introduction to Modern Mechanics; CRC Press: Boca Raton, FL, USA, 2021. [Google Scholar]
  28. Lynch, K.M.; Park, F.C. Modern Robotics; Cambridge University Press: Cambridge, UK, 2017. [Google Scholar]
  29. Buss, S.R. Introduction to inverse kinematics with Jacobian transpose, pseudoinverse and damped least squares methods. IEEE J. Robot. Autom. 2004, 17, 16. [Google Scholar]
  30. Paul, S.K.; Chowdhury, M.T.; Nicolescu, M.; Nicolescu, M.; Feil-Seifer, D. Object detection and pose estimation from RGB and depth data for real-time, adaptive robotic grasping. In Advances in Computer Vision and Computational Biology: Proceedings from IPCV’20, HIMS’20, BIOCOMP’20, and BIOENG’20; Springer: Cham, Switzerland, 2021; pp. 121–142. [Google Scholar]
  31. Kato, N.; Iuchi, T.; Murabayashi, K.; Tanaka, T. Comparison of Smoothness, Movement Speed and Trajectory during Reaching Movements in Real and Virtual Spaces Using a Head-Mounted Display. Life 2023, 13, 1618. [Google Scholar] [CrossRef] [PubMed]
  32. Hietanen, A.; Latokartano, J.; Foi, A.; Pieters, R.; Kyrki, V.; Lanz, M.; Kämäräinen, J.K. Benchmarking pose estimation for robot manipulation. Robot. Auton. Syst. 2021, 143, 103810. [Google Scholar] [CrossRef]
  33. Chen, H.; Bhanu, B. 3D free-form object recognition in range images using local surface patches. Pattern Recognit. Lett. 2007, 28, 1252–1262. [Google Scholar] [CrossRef]
  34. Tombari, F.; Di Stefano, L. Object recognition in 3D scenes with occlusions and clutter by Hough voting. In Proceedings of the 2010 Fourth Pacific-Rim Symposium on Image and Video Technology, Singapore, 14–17 November 2010; pp. 349–355. [Google Scholar]
  35. Glent Buch, A.; Yang, Y.; Kruger, N.; Gordon Petersen, H. In search of inliers: 3d correspondence by local and global voting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 2067–2074. [Google Scholar]
  36. Leordeanu, M.; Hebert, M. A spectral technique for correspondence problems using pairwise constraints. In Proceedings of the Tenth IEEE International Conference on Computer Vision (ICCV’05) Volume 1, Beijing, China, 17–21 October 2005; Volume 2, pp. 1482–1489. [Google Scholar]
  37. Hough, P.V. Method and Means for Recognizing Complex Patterns. US Patent 3,069,654, 18 December 1962. [Google Scholar]
  38. Dong, S.; Rodriguez, A. Tactile-Based Insertion for Dense Box-Packing. In Proceedings of the 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Macau, China, 3–8 November 2019; pp. 7953–7960. [Google Scholar] [CrossRef]
  39. Li, R.; Platt, R.; Yuan, W.; ten Pas, A.; Roscup, N.; Srinivasan, M.A.; Adelson, E. Localization and manipulation of small parts using GelSight tactile sensing. In Proceedings of the 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems, Chicago, IL, USA, 14–18 September 2014; pp. 3988–3993. [Google Scholar] [CrossRef]
  40. Zhao, D.; Sun, F.; Wang, Z.; Zhou, Q. A novel accurate positioning method for object pose estimation in robotic manipulation based on vision and tactile sensors. Int. J. Adv. Manuf. Technol. 2021, 116, 2999–3010. [Google Scholar] [CrossRef]
Figure 1. Methodology adopted in this work.
Figure 2. Illustration of various frames and screw axes of the used KUKA iiwa manipulator.
Figure 3. Views of the combined workspace of the manipulator with hand: (a) 3D; (b) sectioned; (c) top.
Figure 4. Simulation environment: (a) Gazebo; (b) Rviz.
Figure 5. Vision algorithm outputs: (a) bounding box; (b) object pose.
Figure 6. Manipulator motions to reach in front of a book in Rviz: (a) initial; (b) intermediate; (c) final.
Figure 7. Manipulator motions to reach in front of a book in Gazebo: (a) initial; (b) intermediate; (c) final.
Figure 8. Manipulator motions to track a book in Rviz: (a) initial; (b,c) intermediate; (d) final.
Figure 9. Manipulator motions to track a book in Gazebo: (a) initial; (b,c) intermediate; (d) final.
Table 1. Manufacturer-provided link dimensions for the KUKA iiwa 14 manipulator.

| Link   | Dimension (mm) |
|--------|----------------|
| d_bc   | 340            |
| d_cd   | 740            |
| d_de   | 400            |
| d_eg   | 126            |
| d_gf   | 126            |
Table 2. Performance Metrics for Object Tracking and Grasping.

| Metric               | Value      | Description                                                                                          |
|----------------------|------------|------------------------------------------------------------------------------------------------------|
| Tracking accuracy    | 96.7%      | Proportion of frames with accurate pose estimation prior to occlusion caused by end-effector constraints. |
| Pose estimation error| ±0.63 cm   | Maximum deviation between predicted and ground-truth object poses.                                   |
| Detection latency    | 75 ms      | Average per-frame processing time for object detection and pose update, supporting real-time operation. |
| Detection precision  | 97.1%      | Fraction of accurately detected objects among all detections.                                        |
| Detection recall     | 96.5%      | Fraction of actual objects successfully detected.                                                    |
| Runtime performance  | ∼13.30 FPS | Average frame rate during continuous tracking and grasping.                                          |
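The precision and recall figures in Table 2 follow the standard detection definitions (correct detections over all detections, and over all ground-truth objects, respectively). A minimal sketch, using a hypothetical helper name since the paper reports only the aggregate percentages:

```python
def detection_metrics(true_pos: int, false_pos: int, false_neg: int):
    """Standard detection precision and recall from raw counts.

    Illustrative helper (not from the paper's codebase): precision is
    the fraction of detections that are correct; recall is the fraction
    of ground-truth objects that were detected.
    """
    precision = true_pos / (true_pos + false_pos)  # correct / all detections
    recall = true_pos / (true_pos + false_neg)     # correct / all ground-truth objects
    return precision, recall


# Example: 9 correct detections, 1 false alarm, 1 missed object
p, r = detection_metrics(9, 1, 1)  # → (0.9, 0.9)
```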
Table 3. Comparison of Pose Estimation Recall and Position Error.

| Method                                | Recall (%) | Mean Position Error (mm) |
|---------------------------------------|------------|--------------------------|
| Proposed motion scheme                | 96.0       | 0.09                     |
| Geometric consistency (GC) [32]       | 24.0       | –                        |
| Hough transform (HG) [32]             | 31.0       | –                        |
| Search of inliers (SI) [32]           | 0.0        | –                        |
| Spectral technique (ST) [32]          | 1.0        | –                        |
| Hough voting method (NNSR) [32]       | 0.0        | –                        |
| Random sample consensus (RANSAC) [32] | 0.0        | –                        |
| Dong and Rodriguez [38]               | –          | 1.90                     |
| Li et al. [39]                        | –          | 0.14                     |
| Zhao et al. [40]                      | –          | 0.068–0.102              |
Table 4. Maximum error and RMSE of Cartesian motions obtained using the DLS method.

| Error                   | x    | y    | z    | α    | β    | γ    |
|-------------------------|------|------|------|------|------|------|
| Maximum error (mm/deg)  | 1.06 | 1.11 | 0.74 | 1.75 | 0.94 | 1.05 |
| RMSE (mm/deg)           | 0.97 | 0.81 | 0.66 | 1.05 | 0.71 | 0.88 |
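The per-axis maximum error and RMSE reported in Table 4 can be computed directly from the commanded and executed Cartesian traces. A minimal sketch under assumed conventions (function and argument names are hypothetical; the paper's evaluation script is not published):

```python
import numpy as np

def pose_error_stats(desired, actual):
    """Per-axis maximum absolute error and RMSE for Cartesian pose traces.

    desired, actual: (N, 6) arrays of [x, y, z, alpha, beta, gamma]
    samples (mm for position, deg for orientation). Returns two
    length-6 arrays: max |error| and root-mean-square error per axis.
    """
    err = np.asarray(desired, dtype=float) - np.asarray(actual, dtype=float)
    max_err = np.abs(err).max(axis=0)            # worst-case deviation per axis
    rmse = np.sqrt((err ** 2).mean(axis=0))      # RMS deviation per axis
    return max_err, rmse
```

For example, a trace that deviates by 2 mm on every axis in one of two samples yields a maximum error of 2 and an RMSE of √2 per axis.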
Table 5. Velocity continuity (VC, deg/s), acceleration profile (AP, deg/s²), jerk (deg/s³) and snap (deg/s⁴) values of manipulator joints.

| Metric | θ1   | θ2   | θ3   | θ4   | θ5   | θ6   | θ7   |
|--------|------|------|------|------|------|------|------|
| VC     | 0.29 | 0.24 | 0.11 | 0.30 | 0.26 | 0.08 | 0.24 |
| AP     | 1.49 | 1.08 | 0.88 | 1.90 | 1.58 | 0.75 | 1.88 |
| Jerk   | 0.39 | 0.33 | 0.28 | 0.41 | 0.30 | 0.19 | 0.37 |
| Snap   | 0.46 | 0.40 | 0.37 | 0.44 | 0.32 | 0.24 | 0.43 |
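The derivative chain underlying Table 5 (velocity → acceleration → jerk → snap) can be recovered from a sampled joint-angle trace by successive finite differencing. A sketch under stated assumptions: the function name is hypothetical, and the paper does not specify how its per-joint values were aggregated (e.g., peak vs. RMS):

```python
import numpy as np

def joint_derivative_profiles(theta, dt):
    """Velocity, acceleration, jerk, and snap of a joint-angle trace
    via successive central finite differences.

    theta: (N,) joint angles in deg, sampled at a fixed time step dt (s).
    Returns four (N,) arrays in deg/s, deg/s^2, deg/s^3, and deg/s^4.
    """
    vel = np.gradient(theta, dt)    # first derivative: velocity
    acc = np.gradient(vel, dt)      # second derivative: acceleration
    jerk = np.gradient(acc, dt)     # third derivative: jerk
    snap = np.gradient(jerk, dt)    # fourth derivative: snap
    return vel, acc, jerk, snap
```

A constant-velocity ramp is a quick sanity check: velocity should be flat and all higher derivatives near zero.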
Table 6. Smoothness values of manipulator motions.

| Joint | Value |
|-------|-------|
| θ1    | 0.78  |
| θ2    | 1.57  |
| θ3    | 0.81  |
| θ4    | 1.03  |
| θ5    | 0.86  |
| θ6    | 0.79  |
| θ7    | 2.14  |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Sulaiman, S.; Harikumar, A.; Bøgh, S.; Marturi, N. Multimodal Control of Manipulators: Coupling Kinematics and Vision for Self-Driving Laboratory Operations. Robotics 2026, 15, 17. https://doi.org/10.3390/robotics15010017

