Article

Dual-Arm Robotic Textile Unfolding with Depth-Corrected Perception and Fold Resolution

Department of Engineering Sciences, University of Agder, 4609 Kristiansand, Norway
* Author to whom correspondence should be addressed.
These authors were affiliated with the University of Agder while this work was conducted.
Robotics 2026, 15(4), 78; https://doi.org/10.3390/robotics15040078
Submission received: 10 March 2026 / Revised: 29 March 2026 / Accepted: 1 April 2026 / Published: 8 April 2026
(This article belongs to the Section Sensors and Control in Robotics)

Abstract

Reliable textile recycling requires automated unfolding to expose hidden hard components such as zippers, buttons, and metal fasteners, which otherwise risk damaging machinery and compromising downstream processes. This paper presents the design and implementation of an automated textile unfolding system based on a dual-arm robotic manipulation framework. The system uses two Interbotix WidowX 250s 6-DoF robotic arms and an Intel RealSense L515 LiDAR camera for visual perception. The unfolding process consists of three stages: initial dual-arm stretching to reduce major folds, refinement through a second stretch targeting the lower region, and a machine-learning stage that employs a YOLOv11 framework trained on depth-encoded textile images, followed by a depth-gradient-based estimator for fold direction. The system applies an extremity-based grasping strategy that selects the leftmost and rightmost textile points from a custom error-corrected depth map, enabling robust grasp point selection, and estimates fold direction from depth gradients around the detected fold. The most confident fold region is selected, an unfolding direction is determined using depth ranking, and the textile is manipulated until a flat state is confirmed through depth uniformity. Experiments show that depth correction significantly reduces spatial error in the robot frame, and that segmentation and extremity detection achieve high accuracy across varied fold configurations. The YOLOv11n-based model reaches 98.8% classification accuracy, and fold direction is estimated correctly in 87% of test cases. By enabling robust, largely autonomous textile unfolding, the system demonstrates a practical approach that could support safer and more efficient automated textile recycling workflows.

1. Introduction

Over recent decades, the global textile and fashion industry has witnessed unprecedented growth, driven largely by fast fashion trends, shorter product life cycles, and increased consumer demands for affordable clothing [1]. This expansion has led to increased clothing consumption and, in turn, a surge in textile waste. In Europe, discarded clothing and home textiles from consumers account for approximately 85% of total textile waste [2]. Despite growing awareness and some improvement efforts, at least half of household textile waste still ends up in landfills, incinerators, or is exported to countries outside of Europe, where the final outcome of the waste is often unknown [3]. Today, less than 1% of textiles are recycled, due to limitations in collection, sorting, and preprocessing [2]. In response to these challenges, the EU’s Waste Framework Directive requires all member states to establish separate collection systems for textile waste by 2025 [2]. In line with this directive, the Norwegian government has introduced regulations that prohibit textiles from being disposed of with residual waste. Starting 1 January 2025, textiles must be sorted for separate recycling [4].
However, textile sorting in Norway presents specific challenges. Manual sorting is labour-intensive and costly, which is why a large portion of textile sorting currently occurs outside of Europe [3]. This practice not only contributes to higher transport-related emissions but also hinders the development and expansion of domestic recycling systems [2,5]. A suggested solution is to automate the textile sorting process using robotic systems equipped with advanced imaging, machine learning, and intelligent decision-making algorithms that can identify, sort, and preprocess textiles, thereby increasing the flexibility of the recycling process and making it suitable for a large variety of textiles in high volumes [6].
Textiles arriving at recycling facilities are typically in highly irregular and complex conditions: crumpled, folded, partially tangled, and often obscured. Before these textiles can be reliably processed further (e.g., classified or cut), they must first be systematically unfolded to expose essential internal features clearly [2]. Despite its critical importance, automated unfolding has received comparatively little attention, with some existing robotic sorting approaches relying heavily on manual sorting methods [7]. Automating textile unfolding presents distinct and significant technical challenges due to the inherently deformable, flexible, and unpredictable nature of textiles. Unlike rigid objects, textiles vary widely in material type, patterns, fabric properties, size, thickness, and geometric configurations, all of which complicate robotic manipulation [8,9]. A central challenge in robotic textile manipulation is accurately perceiving the three-dimensional configuration of the fabric and identifying reliable grasp points. Irregularities such as folds, wrinkles, and self-occlusions introduce significant uncertainty, making visual perception and manipulation planning particularly difficult [10]. To address these issues, advanced perception systems have been developed that integrate interactive sensing with occlusion-aware models, enabling more accurate estimation of the textile’s full state under complex and variable geometries [10]. Automating the unfolding of textiles also necessitates precise, coordinated control of multiple robotic arms. Dual-arm robotic manipulation introduces complexities such as workspace coordination, collision avoidance, and real-time decision-making in dynamic environments. Ambidextrous dual-arm robotic systems require sophisticated coordination and control to synchronise the movements of both arms and hands seamlessly, particularly when handling deformable objects under changing contact conditions [11].
Reliable unfolding of textiles is essential for maintaining the efficiency and safety of downstream recycling operations. Unfolding is a critical prerequisite for exposing hidden hard components such as zippers, buttons, and metal fasteners, which are often embedded within folds or layers of fabric. If left undetected, these elements can severely damage industrial cutting equipment and, in some cases, even pose fire hazards due to metal-induced sparks during mechanical processing. These risks can lead to costly downtime, increased maintenance demands, and significant safety concerns [12,13]. In summary, the urgent need for automated textile unfolding arises from the unique physical complexity of textiles and their disruptive impact on recycling workflows. Successfully addressing these challenges with advanced robotics, occlusion-aware perception, and adaptive control systems is essential for improving the safety, efficiency, and scalability of textile recycling. These technologies not only boost material recovery but also advance Europe’s and the global community’s transition toward more sustainable and economically resilient waste management systems [14,15,16].

1.1. Contributions

This paper addresses the technical challenges associated with robotic manipulation of textiles, specifically focusing on the design, implementation, and validation of a robotic system capable of autonomously unfolding textile materials. The primary aim is to create a robust and reliable system that effectively integrates advanced visual perception, dual-arm robotic manipulation, and AI-driven fold detection. By developing this integrated approach, the research directly supports the pre-processing phase in textile recycling. The key contributions of this research are detailed as follows:
  • Modular Robotic Manipulation and Perception Framework: Developed a robotic manipulation framework utilising two six-degree-of-freedom (6-DOF) robotic arms, explicitly configured to handle the complexities associated with textile manipulation. In addition, the robotic control system is modular and scalable, ensuring the integration of additional robotic components, sensory inputs, and AI/ML algorithms with ease. This makes it applicable not only to unfolding but also potentially to broader textile processing tasks, including sorting, classification, and disassembly.
  • Extremity-Based Grasping and Unfolding Algorithm: Unlike traditional highest–lowest grasping methods, this work designs and validates a textile grasping method based on identifying textile extremities (leftmost and rightmost points).
  • Integrated Visual Perception Pipeline with Depth Correction: A visual perception and calibration pipeline is developed for precise depth sensing and 3D localization of textiles. The system uses LiDAR-based depth sensing with custom depth-correction models and algorithms, reducing perception errors and enhancing the reliability of grasp detection and grasping.
  • Extensive Experimental Validation and Performance Evaluation: Detailed and systematic experimental assessments are performed to demonstrate the practicality and robustness of the robotic system. First, comprehensive assessments of visual sensing accuracy, including camera depth measurement validation, intrinsic calibration comparisons, pose estimation methods, and spatial error correction, are conducted. Second, the robotic manipulation framework is evaluated through gripper design comparisons and systematic testing of grasp reliability. Finally, the end-to-end system integration with the proposed AI/ML model is validated through multiple real-world use cases.
To clarify novelty, the contribution of this work is not any single component in isolation, but the way depth error modelling, staged dual-arm manipulation, and fold-aware perception are combined into a single recycling-oriented pipeline. In particular, we explicitly correct spatially varying depth bias for a LiDAR-based camera using a workspace-fitted correction surface, and we use this corrected height map directly for grasp point projection and fold reasoning. Building on this, we adopt a three-stage policy that separates coarse tension-driven unfolding from local fold resolution, where a learned detector provides the fold location used for grasp targeting and for local direction estimation. This structure was chosen to reduce problem complexity stage-by-stage and to keep the perception–action loop robust under limited sensing and constrained hardware.

1.2. Structure of the Paper

This paper is organized into eight sections. Section 2 reviews existing research in robotic textile manipulation, including grasping strategies, visual perception techniques, and machine learning approaches for handling deformable objects. The proposed system overview is presented in Section 3. Section 4 presents the system methodology, covering visual sensing, dual-arm manipulation, and the integration of object detection within a ROS2-based software framework. Experiments are presented in Section 5. Section 6 presents the results from visual perception, robotic control, and fold detection, along with an overall assessment of unfolding performance. Section 7 discusses key findings, system limitations, and opportunities for future improvements in the context of textile recycling. Section 8 concludes the paper by summarizing the main contributions and practical outcomes.

2. Related Work

This section provides an overview of existing work on robotic manipulation of textiles and related deformable objects. The review focuses on four aspects that are directly relevant to the proposed system: perception for deformable textiles, dual-arm unfolding strategies, grasp planning, and machine-learning-based methods for detecting folds and garment features.
A number of recent surveys summarize the broader field of deformable object manipulation and its applications to cloth [9,17,18,19]. These works highlight three recurring themes: the difficulty of perceiving and representing highly deformable materials, the importance of suitable control and coordination strategies (often using two arms), and the growing role of learning-based methods. The present paper builds on these insights and narrows the focus to textile unfolding in a recycling context.
To perform automatic tasks, a robot must first perceive the state of its environment. In the case of textile unfolding, the robot must be able to detect the textile, distinguish it from the background, and estimate its three-dimensional configuration sufficiently well to select grasp points and plan safe motions. A large body of work relies on RGB-D cameras to acquire colour and depth information, which can be converted into partial 3D point clouds to capture the spatial geometry of visible cloth surfaces [9,10]. Depth information is typically fused with RGB data to segment the textile from the background and to identify edges, corners, or salient regions.
Desingh et al. review perception pipelines for deformable objects and emphasise that depth noise, self-occlusion and complex geometries make state estimation particularly challenging [20]. Similar issues occur in textile-specific systems where RGB-D or structured-light scanners are used to perceive garments. For instance, early work on garment manipulation at ARM Lab combines depth sensing with geometric reasoning to locate graspable regions on garments [8]. In an industrial setting, Andronas et al. demonstrate a robotic cell for handling deformable gaskets and seals, where the perception system must cope with non-rigid shapes and variable lighting conditions [21].
More specialised setups combine depth with additional sensing modalities. Proesmans et al. use structured depth data together with an infrared tactile sensor to perceive the shape of hanging cloth and to detect contact along its edges [22]. Xue et al. propose a unified framework for folding and unfolding rectangular cloth that relies on RGB-D sensing to estimate the cloth pose on a table [23]. In most of these systems, depth measurements are used more or less as provided by the camera, with only standard intrinsic and extrinsic calibration.
Control-oriented approaches have also addressed flexible sheet handling on a table using vision-driven feedback. For example, Zacharia et al. [24] present a robotic system based on fuzzy visual servoing for manipulating flexible sheets lying on a work surface, illustrating how control design and visual feedback can compensate for deformation and partial occlusion. While our work does not implement fuzzy servoing, this study provides a useful reference point for table-top deformable manipulation where control and sensing must operate under changing geometry.
In contrast, the system developed in this paper explicitly models spatial depth errors for a LiDAR-based camera [25,26]. A correction surface is fitted over the workspace and applied to every depth frame before any grasp planning takes place. This depth-corrected representation is then used for textile segmentation and extremity detection. The goal is not to reconstruct a full cloth model, but to obtain a consistent and metrically accurate height map that is reliable enough for grasp point projection and collision-free motion planning.
Dual-arm manipulation frameworks are widely used in textile unfolding because two grippers can more easily stretch and reorient cloth than a single arm [9,17]. More recent work on compliant interaction (e.g., variable admittance control under environmental constraints) further motivates conservative, tension-preserving manipulation policies when interaction forces and contact conditions are uncertain [27]. Several authors have proposed geometric strategies that exploit garment shape. Triantafyllou et al. introduced a geometric approach in which garment features are detected and used to plan dual-arm motions that unfold the cloth on a table [28]. In later work they consider type-specific strategies that tailor the unfolding sequence to the identified garment class [29].
Other systems focus on pulling along edges. Gabas et al. presented a dual-edge classifier that labels which edges of a towel are suitable for grasping and pulling, allowing the robot to gradually unfold the cloth while maintaining tension between the two grippers [30]. Kuribayashi et al. proposed a dual-arm system for rectangular cloth that combines perception with pre-defined motion primitives to automate both unfolding and folding tasks [31]. Proesmans et al. demonstrated UnfoldIR, where tactile sensing and dual-arm edge tracing are used to unfold cloth in the air [22]. At the other extreme, dynamic manipulation strategies such as FlingBot use fast flinging motions to unfold cloth in a more exploratory manner [32].
The system in this paper follows the same basic intuition that two arms are advantageous, but the unfolding policy is organised differently. Rather than explicitly detecting corners or classifying edges, we use extremities of the segmented contour as generic grasp points. These extremities are used in two stretching stages designed to reduce large folds while maintaining tension: an initial stretch from leftmost and rightmost points and a second stretch focused on the lower region of the textile. Only after this “macro” unfolding stage is complete do we invoke a learning-based fold detector to address remaining local folds. This design makes the pipeline less dependent on garment category and more directly aligned with the goal of obtaining a flat configuration suitable for recycling.
Direct side-by-side robotic benchmarks against alternative unfolding systems would be valuable, but they require access to comparable dual-arm hardware and repeatable test conditions. In the present work, we therefore (i) provide an explicit segmentation comparison against a strong modern baseline (SAM2) within the same recorded scenes, and (ii) position the overall pipeline relative to representative prior systems in terms of table-top assumptions, the use of depth correction, staged manipulation policies, and whether learning is used in the perception-to-action loop.
Representative examples include the control-oriented system by Zacharia et al. [24] and the broader survey perspective provided by Jiménez [33]. Table 1 summarises this qualitative positioning and highlights how the proposed system differs from representative prior approaches in terms of table-top assumptions, dual-arm operation, depth error modelling, and the use of learning in the perception-to-action loop.
Grasp planning is essential for textile unfolding. Early work often considered simple rectangular items such as towels and relied on corner detection and multiple views to identify stable grasp points [8,34]. These methods typically detect corners or peaks in the cloth silhouette and then choose grasps that are known to lead to good unfolding or folding configurations. Template-based approaches extend this idea by matching the segmented outline of a garment to a set of canonical shapes and then selecting grasps based on the recognised type [28,29].
Jiménez [33] provides an overview of visual grasp point localisation and cloth state recognition, highlighting how segmentation quality and feature reliability strongly constrain downstream manipulation. This perspective is directly relevant to our choice of a depth-corrected perception pipeline, where grasp targets are derived from stable geometric cues (contour extremities) rather than relying on garment templates or category-specific keypoints.
More recent methods treat grasp planning as part of a general deformable-object manipulation pipeline. Kaltsas et al. review a range of strategies, from purely geometric heuristics to learning-based affordance maps that encode how likely a grasp at a given pixel is to be useful [17]. Qian et al. use a deep network to segment cloth versus background and specific regions within the cloth, and then derive robust grasps from these segmented regions [35]. Other approaches rely on height maps or surface curvature derived from depth data and select grasps in regions that are expected to have a strong effect on the global cloth state [9]. Learning-based methods such as CeDiRNet represent grasp points through a combination of centre location and pulling direction, trained on large datasets of simulated towel configurations [36].
In this work, grasp planning is intentionally simple and deterministic. After segmentation, the largest contour is assumed to correspond to the textile, and the leftmost and rightmost points on this contour are extracted in image coordinates. These extremities are shifted slightly inward towards the contour centroid to avoid fragile edge-only grasps and then projected into the robot frame using the corrected depth model. The same procedure is reused in the lower portion of the cloth for the refinement stage. Compared to methods in [28,35,36], the extremity-based strategy does not require garment classification or extensive training data. It trades some optimality for robustness and speed, which is acceptable in a recycling context where textile type and exact pose vary considerably.
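As an illustration, the extremity-based grasp selection described above can be sketched in a few lines of NumPy. This is a minimal sketch under our own assumptions: the function name, the use of a binary segmentation mask, and the inward-shift fraction are illustrative, not taken from the implementation.

```python
import numpy as np

def extremity_grasp_points(mask, inward_frac=0.1):
    """Pick the leftmost and rightmost textile pixels from a binary mask
    and shift each slightly towards the centroid to avoid fragile
    edge-only grasps. Illustrative sketch; inward_frac is an assumption."""
    ys, xs = np.nonzero(mask)
    if xs.size == 0:
        raise ValueError("empty mask: no textile detected")
    # Extremities in image coordinates (x = column, y = row).
    left = np.array([xs.min(), ys[xs.argmin()]], dtype=float)
    right = np.array([xs.max(), ys[xs.argmax()]], dtype=float)
    centroid = np.array([xs.mean(), ys.mean()])
    # Shift inward towards the contour centroid.
    left += inward_frac * (centroid - left)
    right += inward_frac * (centroid - right)
    return left, right

# Toy example: a rectangular patch of "textile" pixels.
mask = np.zeros((10, 20), dtype=bool)
mask[2:8, 3:17] = True
l, r = extremity_grasp_points(mask)
```

In the actual pipeline, the resulting image-plane points would then be projected into the robot frame using the corrected depth map; the same routine could be re-applied to the lower portion of the mask for the refinement stage.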
Detecting and unfolding folds is one of the main challenges in textile manipulation, and machine learning has become a natural tool for this task. Several works use deep networks to predict semantic keypoints or garment parts from RGB images, such as sleeve ends, collars or hems, which are then used as proxies for folds and graspable regions [17,29,37]. Other approaches train detectors on fabric defects or wrinkles. Li et al. propose a YOLOv5-based system to detect wrinkles and corner points on fabrics for quality inspection [38], while Hassan et al. use enhanced convolutional neural networks to detect small defects on textile surfaces in a human–robot collaboration setting [39].
There is also a growing interest in learning richer state representations for cloth. Desingh et al. discuss methods that learn latent embeddings of cloth configuration from images and use them for prediction and control [20]. Nahavandi et al. review how machine learning is being integrated into robotic manipulation more broadly, emphasising detection-style networks and affordance maps as particularly suitable for tasks where structure is local in the image, such as folds and wrinkles [18].
The system developed in this work follows this general trend but targets a specific perception-to-action link. A YOLOv11-based model is trained on depth-coded images to detect V-shaped folds on the textile surface. The detector produces bounding boxes with confidence scores, and the highest-confidence fold region is selected as a candidate for manipulation. Around this region, a depth-gradient-based method evaluates a set of compass-aligned directions and ranks them according to how well they follow the local fold geometry. The robot then executes a grasp-and-pull along the selected direction. In contrast to methods that use learning only to label folds or keypoints [37,40], this combination of detection and geometric direction estimation is tightly integrated into a three-stage unfolding policy whose end goal is a flat, recycling-ready textile. Taken together, these design choices position the proposed system as a recycling-oriented textile unfolding pipeline that combines depth-error-aware perception, staged dual-arm manipulation, and fold-specific perception-to-action coupling within a single framework.
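To make the direction-ranking idea concrete, the following sketch scores eight compass-aligned directions around a detected fold centre by the mean height drop when stepping outward over the corrected height map. The scoring rule here (preferring the direction of the largest drop) is a plausible proxy we introduce for illustration; the system's exact ranking criterion may differ.

```python
import numpy as np

# Eight compass-aligned unit steps (dx, dy) in image coordinates.
COMPASS = {
    "E": (1, 0), "NE": (1, -1), "N": (0, -1), "NW": (-1, -1),
    "W": (-1, 0), "SW": (-1, 1), "S": (0, 1), "SE": (1, 1),
}

def rank_pull_directions(height, cx, cy, reach=5):
    """Score each compass direction by the mean height drop when stepping
    away from the fold centre (cx, cy); a larger drop suggests flatter
    cloth to pull towards. Hypothetical criterion for illustration."""
    h0 = height[cy, cx]
    H, W = height.shape
    scores = {}
    for name, (dx, dy) in COMPASS.items():
        drops = []
        for step in range(1, reach + 1):
            x, y = cx + step * dx, cy + step * dy
            if 0 <= x < W and 0 <= y < H:
                drops.append(h0 - height[y, x])
        scores[name] = np.mean(drops) if drops else -np.inf
    # Directions sorted best-first.
    return sorted(scores, key=scores.get, reverse=True)

# Toy height map: a raised north-south fold centred on column 10.
height = np.zeros((21, 21))
height[:, 8:13] = 0.02
ranking = rank_pull_directions(height, cx=10, cy=10)
```

In this toy case the directions running along the fold (north and south) see no height drop and are ranked last, while directions leading off the fold score highest, which matches the intuition of pulling perpendicular to, rather than along, the fold.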

3. System Overview

The system is designed to support automated preprocessing in textile recycling, where reliable unfolding is needed to expose embedded hard components before cutting. To achieve this, the setup combines three main subsystems: (i) an RGB–LiDAR perception module for spatial awareness and textile segmentation, (ii) a dual-arm robotic manipulation framework for grasping and lifting, and (iii) an AI module based on a YOLOv11 model for detecting and resolving residual folds. An overview of the dual-arm setup and workspace is shown in Figure 1. Typical textile configurations during unfolding are illustrated in Figure 2, and the high-level control flow of the unfolding process is summarised in Figure 3.

3.1. Hardware Setup

The physical setup consists of two Interbotix WidowX 250s 6-DOF robotic arms positioned on either side of a central workbench, as shown in Figure 1. The arms were chosen for their compatibility with the ROS 2 (Open Robotics, San Jose, CA, USA) framework and the Interbotix ROS packages (Trossen Robotics, Downers Grove, IL, USA), which allow straightforward integration with Python-based perception and control modules. Each arm is mounted on a 3D-printed platform that holds an AprilTag, used to estimate the pose of the robot bases relative to the camera and to maintain a consistent coordinate system throughout experiments.
A centrally mounted Intel RealSense L515 LiDAR camera (Intel Corporation, Santa Clara, CA, USA) provides synchronised RGB and depth data from an overhead viewpoint. From this position, the camera captures high-resolution depth maps of the entire workspace, which are used for surface analysis, textile segmentation, grasp point estimation, and fold detection. The L515 was selected due to its compact form factor, millimetre-scale depth accuracy at the operating range used in the experiments, and robust performance under typical indoor lighting conditions. Its LiDAR-based depth sensing and tight alignment between colour and depth channels make it well suited for tasks that require precise 3D localisation on textile surfaces.
The system is controlled via a ROS 2-based software interface implemented in Python 3.8 (Python Software Foundation, Wilmington, DE, USA). The software is organised into several concurrently running components responsible for camera handling, perception, coordinate frame management, robot control, state machine execution, and AI model inference. When the system is launched, it automatically initialises the depth camera, both robot arms, AprilTag-based pose tracking, and the internal process controller. A simple graphical interface provides real-time feedback, including camera images, detected contours, grasp points, and current system state.
Before use, the system performs calibration steps to improve spatial accuracy. The factory intrinsic parameters of the RealSense L515 can be used directly, but checkerboard-based calibration is available when higher precision is required. Extrinsic calibration between the camera and the robot frames is carried out using AprilTags, and a depth error model is constructed by sampling depth measurements across the table and comparing them with ground-truth positions in the robot coordinate frame. The resulting correction surface is applied to all depth data used for grasp planning and fold analysis, improving geometric consistency and reducing systematic height errors.

3.2. Three-Stage Unfolding Pipeline

The unfolding procedure is structured into three stages that gradually transition the textile from a crumpled state to a fully unfolded configuration. Figure 2 shows example textile states at different points in this process.
Stage 1: Initial Stretching. Starting from an initial crumpled configuration, the system segments the textile from the background using depth and colour information. The leftmost and rightmost extremities of the textile contour are then detected in the camera image and projected into the robot frame using the corrected depth model. These extremities are used as grip points for the left and right robotic arms. Both arms lift and move apart to apply tension across the textile, reducing major folds and transforming it into a more organised, semi-folded state.
Stage 2: Refinement. After the initial stretch, the system analyses the lower region of the textile to identify the remaining large folds. In practice, this is done by restricting the search to the bottom portion of the segmented area (for example, the lowest 20% of the textile in image coordinates) and again detecting extremities within this region. New grasp points are selected and a second stretch is performed. This refinement step further improves flatness and prepares the textile for the final, more targeted unfolding stage.
Stage 3: AI-Guided Final Unfolding. Once large folds have been reduced, the remaining local folds are addressed using an AI-based method. A YOLOv11 model trained on depth-coded textile images is used to detect fold regions. The fold with the highest confidence score is selected, and a centre point within its bounding box is extracted as the new grasp location. Around this point, depth gradients are evaluated along a set of compass-aligned directions to estimate the most promising direction in which to pull. The robot then performs a controlled grasp-and-pull motion along the estimated direction. Perception and action can be repeated until the textile appears flat according to a depth-uniformity criterion over the workspace.
Complementing the textile states shown in Figure 2, Figure 3 provides a high-level overview of the control logic implemented in the system. Textile unfolding begins by initialising the camera, including the integrated LiDAR sensor (Intel RealSense L515 [41]), and the robotic manipulators [42]. The image stream is then processed to detect the textile and estimate its boundary, while the depth stream provides 3D localisation of the target points. Once these targets are available in the robot coordinate frame, the robots execute the unfolding operations.

3.3. System Scope and Assumptions

The system is implemented and evaluated under a set of controlled assumptions that reflect a realistic, but simplified, textile recycling scenario. The main operating conditions are:
  • Each textile is placed individually and approximately centrally within a clearly defined workspace. No overlapping or tangling of multiple textiles is allowed during an unfolding run, and no other objects are present on the table.
  • The robotic arms operate within predefined joint limits and a restricted workspace, with software-enforced safety margins to avoid collisions with the table, the camera mount, and each other.
  • The perception pipeline relies on stable indoor lighting. Strong shadows, glare, or rapidly changing illumination are not explicitly handled in the current implementation.
  • The entire workspace remains within the field of view of the overhead camera and is kept clear of obstacles, which simplifies segmentation and reduces the risk of false detections.
These assumptions make it possible to evaluate the unfolding pipeline in a controlled and repeatable way and to focus on the core challenges of depth-corrected perception, extremity-based dual-arm manipulation, and AI-guided fold resolution without additional complications from cluttered scenes or highly variable environmental conditions.

4. Methods

The textile unfolding pipeline combines calibrated depth perception, 2D contour analysis, dual-arm manipulation, and a fold-detection model. In broad terms, the camera provides a depth-corrected view of the workspace, image processing isolates the textile and proposes grasp points at its extremities, and a dual-arm controller executes a sequence of stretching and fold-resolution motions guided by a YOLOv11n (Ultralytics, Frederick, MD, USA) model on depth-encoded images.

4.1. Visual Perception and Calibration

Visual perception is used to understand the layout of textiles in the workspace. The system must estimate depth reliably, align the camera view with the robots’ coordinate frames, identify the textile area, and detect surface features such as folds and creases. To achieve this, we combine calibrated depth data, basic 3D geometry, homogeneous transformations between coordinate frames, and standard image-processing operations.
The Intel RealSense L515 LiDAR camera is mounted above the workspace such that the entire textile area lies within the depth field of view. Figure 4 illustrates the main coordinate frames used in the system: the camera frame and two robot base frames.
Initially, the factory-provided intrinsic parameters from the RealSense SDK (Intel Corporation, Santa Clara, CA, USA) were used, which provide a reasonable starting point for both RGB and depth streams [43]. For higher-accuracy experiments, we optionally refined the intrinsics using a checkerboard-based calibration procedure implemented in [44].
To control the arms in 3D, image pixels must be mapped to the robot coordinate frames. We estimate the camera pose relative to each arm using AprilTags attached near the base of each robot, detected with the apriltag Python package (version 0.0.16). AprilTag detections provide 6D poses in the camera frame [45], which are then refined using the perspective-n-point algorithm [46].
Figure 5 summarises this mapping: a pixel with depth is projected into the camera frame using the intrinsic matrix, then transformed into the world and robot frames using calibrated extrinsics. Initial tests showed a small but systematic height bias across the table, caused by slight camera tilt and depth noise.
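As an illustration, the mapping summarised in Figure 5 amounts to standard pinhole back-projection followed by a homogeneous transform. The following is a minimal sketch, not the system's actual implementation; the function name and matrix conventions are our own assumptions.

```python
import numpy as np

def pixel_to_robot(u, v, depth_m, K, T_robot_cam):
    """Project a pixel with measured depth into the robot base frame.

    u, v        : pixel coordinates in the depth-aligned image
    depth_m     : depth at (u, v) in metres
    K           : 3x3 camera intrinsic matrix
    T_robot_cam : 4x4 homogeneous camera-to-robot transform
                  (obtained from the AprilTag-based extrinsic calibration)
    """
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    # Back-project into the camera frame using the pinhole model.
    x_cam = (u - cx) * depth_m / fx
    y_cam = (v - cy) * depth_m / fy
    p_cam = np.array([x_cam, y_cam, depth_m, 1.0])
    # Transform the homogeneous point into the robot base frame.
    return (T_robot_cam @ p_cam)[:3]
```

In the real pipeline, the corrected depth value (Section 4.1) is used in place of the raw sensor reading before this projection.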
To reduce this systematic height bias, a spatial depth error model was used. Depth samples are collected on a flat, known-height reference surface within the same workspace used for manipulation, drawn from a central region of the camera field of view to avoid boundary artefacts, and mapped into the robot frame using the calibrated camera-to-robot transformation. The deviation of each sample from the ideal reference plane serves as the training target for a regression model that combines global polynomial terms with local radial basis function (RBF) features [47]. In practice, the fitted correction surface is installation-specific: if the camera height, tilt, or workspace geometry changes noticeably, it should be re-estimated. Because the procedure only requires a flat reference surface, it remains practical to reproduce in similar installations and can be repeated whenever the setup is altered. The correction process is summarised in Figure 6, which outlines the pipeline from raw depth data to corrected height estimation.
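The combination of global polynomial terms and local RBF features can be sketched as a plain least-squares fit. This is an illustrative approximation of the approach (the best-performing model reported later also includes gradient-boosted trees); the function names, RBF width, and centre layout are assumptions.

```python
import numpy as np

def _features(XY, centres, gamma=5.0):
    """Global quadratic terms plus Gaussian RBF bumps at the given centres."""
    x, y = XY[:, 0], XY[:, 1]
    poly = np.stack([np.ones_like(x), x, y, x * y, x**2, y**2], axis=1)
    # Squared distance from every sample to every RBF centre.
    d2 = ((XY[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    return np.hstack([poly, np.exp(-gamma * d2)])

def fit_height_correction(XY, residual_z, centres):
    """Fit the residual height error z_err = f(x, y) over the table plane.

    XY         : (N, 2) planar positions in the robot frame
    residual_z : (N,) measured deviation from the flat reference plane
    centres    : (M, 2) RBF centres spread over the workspace
    Returns a callable that predicts the correction at query positions.
    """
    A = _features(XY, centres)
    w, *_ = np.linalg.lstsq(A, residual_z, rcond=None)
    return lambda q: _features(np.atleast_2d(np.asarray(q, float)), centres) @ w
```

The returned correction is subtracted from (or added to) the raw height estimate before grasp planning, depending on the sign convention chosen for the residual.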

4.2. Textile Segmentation and Extremity-Based Grasping

Reliable segmentation is essential for grasp planning. The system first isolates the textile from the background using static background subtraction and, when helpful, HSV-based colour filtering. It then extracts contours, filters them by size and intensity change, and derives grasp points from the outer shape of the dominant contour.
Before the unfolding sequence starts, the system records an RGB image of the empty workspace as a background reference (Figure 7 left). When a textile is placed on the table (Figure 7 right), a new image is captured and preprocessed in the same way. Both images are converted to grayscale and blurred using a Gaussian filter to suppress sensor noise and small lighting variations (Figure 8 left). The absolute difference between the blurred reference and current images is thresholded to produce a binary mask of changed pixels. Morphological closing is then applied to fill small gaps and remove noise. External contours are extracted from the cleaned mask, and each contour is evaluated based on bounding-box dimensions, area, and the change in mean intensity between the two images. Small or low-contrast contours are discarded, and the largest remaining contour (Figure 8 right) is assumed to be the textile.
Once the textile contour is available, grasp points are chosen based on its extremities. Following common practice in deformable-object manipulation, the leftmost and rightmost points provide good mechanical leverage and mirror how humans often grasp soft materials [48]. The system identifies the contour points with minimum and maximum x-coordinate and then moves each point slightly inward (10% of the distance toward the contour centroid) to avoid unstable boundary regions. For the second unfolding stage, only extremities within the lower portion of the contour (approximately the bottom 20% of the textile mask) are considered, which encourages motions that pull remaining folds out toward the edge of the table. The final 2D grasp points are lifted into 3D using the calibrated depth map and the camera-robot transformations. This produces reachability-compatible targets for both arms. A custom depth map visualization (Figure 9) is used during development to verify that the selected extremities lie on valid textile regions.
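A minimal sketch of the extremity selection follows, assuming the contour is given as an (N, 2) point array; the 10% inward offset and the 20% lower band match the description above, while the function signature is our own.

```python
import numpy as np

def extremity_grasp_points(contour_xy, inward_ratio=0.10, lower_band=None):
    """Pick leftmost/rightmost contour points, nudged toward the centroid.

    contour_xy   : (N, 2) array of contour points in image coordinates
    inward_ratio : fraction of the distance to the centroid to move inward,
                   avoiding unstable boundary regions
    lower_band   : if set (e.g. 0.20), only consider points in the lowest
                   fraction of the contour's bounding box (Stage 2 behaviour)
    """
    pts = np.asarray(contour_xy, dtype=float)
    if lower_band is not None:
        y_min, y_max = pts[:, 1].min(), pts[:, 1].max()
        cutoff = y_max - lower_band * (y_max - y_min)
        pts = pts[pts[:, 1] >= cutoff]  # image y grows downward
    centroid = pts.mean(axis=0)
    left = pts[pts[:, 0].argmin()]
    right = pts[pts[:, 0].argmax()]
    nudge = lambda p: p + inward_ratio * (centroid - p)
    return nudge(left), nudge(right)
```

The resulting 2D points are then lifted into 3D with the corrected depth map and camera-robot transforms, as described above.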

4.3. Dual-Arm Manipulation and Motion Strategy

Robotic manipulation provides the physical actions needed to unfold the textile. The system uses two Interbotix WidowX 250s (Trossen Robotics, Downers Grove, IL, USA) arranged on either side of the table, each with a parallel gripper. The dual-arm setup allows fabrics to be lifted, tensioned, and repositioned based on the grasp points provided by the perception module. The unfolding is organised around a small set of motion primitives that are composed into each stage:
  • Forward–back–forward drag. For large folds, the arms grasp the left and right extremities, lift the textile slightly, and then perform a forward–back–forward dragging motion along the table surface (Figure 10 left), stretching the fabric and flattening major wrinkles.
  • Dynamic jerking for flips. When folds need to be flipped over an edge, the arms execute a rapid jerk motion that combines a short upward pull with a controlled horizontal displacement (Figure 10 right).
  • Linearity-aware stretching. The system monitors the line between the two grasp points; if that line is close to straight and near-horizontal in the workspace frame, the textile is treated as sufficiently tensioned (Figure 11).
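The linearity-aware tension check in the last primitive can be sketched as a chord-deviation test on points sampled along the fabric edge between the grippers; the tolerances and function name here are illustrative, not the values used in the system.

```python
import numpy as np

def is_taut(edge_points, max_deviation=0.01, max_tilt_deg=5.0):
    """Heuristic tension check: the sampled edge between the two grippers
    should be close to a straight, near-horizontal line.

    edge_points : (N, 2) points (horizontal position, height) sampled along
                  the fabric edge between the grasp points, in metres
    """
    p = np.asarray(edge_points, dtype=float)
    start, end = p[0], p[-1]
    # Tilt of the chord connecting the two grasp points.
    tilt = np.degrees(np.arctan2(abs(end[1] - start[1]),
                                 abs(end[0] - start[0]) + 1e-9))
    # Perpendicular deviation of intermediate points from the chord
    # (a sagging fabric edge produces large deviations).
    chord = end - start
    chord = chord / (np.linalg.norm(chord) + 1e-9)
    rel = p - start
    deviation = np.abs(rel[:, 0] * chord[1] - rel[:, 1] * chord[0])
    return bool(tilt <= max_tilt_deg and deviation.max() <= max_deviation)
```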
Arm trajectories are planned in Cartesian space with simple collision-avoidance constraints to steer the end-effectors around the camera stand and robot bases [49]. For each grasp pair, the system verifies that both targets lie inside the reachable region of the corresponding arm pair (Figure 11 right) and, if necessary, adjusts the grasp radius inward to preserve feasibility.
The control software is implemented in ROS 2 and is organised into modular nodes. A camera node streams RGB–D data; a frame module maintains the transformations between camera, world, and robot frames; a tag module tracks AprilTags and updates extrinsics; and a process controller implements the unfolding stages and communicates with the ML model.
Figure 12 and Figure 13 show the flow diagrams used for the automatic and manual modes. Each state corresponds to a perception or manipulation action (e.g., segment textile, grasp extremities, evaluate flatness), and transitions are triggered by sensor feedback or operator input.

4.4. Fold Detection and Direction Estimation

The final stage focuses on detecting residual folds and selecting an appropriate direction in which to pull. This is handled by a YOLOv11n model trained on depth-encoded images of textiles, followed by a depth-gradient-based direction estimator. A dedicated dataset of colour-encoded depth maps is collected by capturing the L515 depth image and mapping height values to a false-colour representation, which preserves geometric information while visually distinguishing folds and ridges. Each sample is annotated with a bounding box around a visible fold and a corresponding fold type or configuration (one such example is shown in Figure 14). YOLOv11n is used as the base model for training. Model performance is evaluated primarily on classification accuracy, mAP@0.5, and mAP@0.5:0.95.
In addition, to improve robustness, geometric augmentations that preserve colour fidelity, such as shear, scaling, translation, and mild perspective distortion, are applied. Since the dataset already contains manually flipped and rotated variants, additional flip- and rotation-based augmentations were not used. A small hyperparameter search was used to identify an effective configuration with moderate shear and scale, which improved prediction accuracy and mAP without introducing artefacts in the depth encoding.
The detector outputs bounding boxes around candidate folds in the depth-encoded image. At inference time, the system selects the box with the highest confidence score and extracts the corresponding region from the original depth map. The centre point of the box is used as a reference. Around this point, depth samples are taken along eight compass-aligned directions (N, NE, E, SE, S, SW, W, NW). For each direction, we compute a simple gradient-based score that reflects how strongly the surface drops or rises away from the fold centre. The direction with the largest magnitude drop is interpreted as the most promising unfolding direction. In other words, the method assumes that the most suitable unfolding direction is the one along which the local surface descends most clearly away from the detected fold centre, providing a simple geometric cue for the pulling action. This discrete compass mapping is illustrated in Figure 15.
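The eight-direction depth-gradient search can be sketched as follows, assuming a metric depth map from the overhead camera where larger values are farther away (i.e., lower on the table); the step size, sample count, and scoring are illustrative simplifications of the method.

```python
import numpy as np

# Eight compass-aligned unit offsets (dx, dy) in image coordinates.
COMPASS = {"N": (0, -1), "NE": (1, -1), "E": (1, 0), "SE": (1, 1),
           "S": (0, 1), "SW": (-1, 1), "W": (-1, 0), "NW": (-1, -1)}

def unfold_direction(depth, centre, step=5, n_samples=4):
    """Score each compass direction by how strongly the surface drops away
    from the fold centre; return the direction with the steepest descent.

    depth  : 2D depth map in metres (larger value = farther from camera)
    centre : (col, row) of the detected fold's bounding-box centre
    """
    c, r = centre
    best, best_score = None, -np.inf
    for name, (dx, dy) in COMPASS.items():
        drops = []
        for k in range(1, n_samples + 1):
            rr, cc = r + k * step * dy, c + k * step * dx
            if 0 <= rr < depth.shape[0] and 0 <= cc < depth.shape[1]:
                # Positive drop: the sampled point lies lower than the
                # (raised) fold centre, i.e. its depth value is larger.
                drops.append(depth[rr, cc] - depth[r, c])
        score = np.mean(drops) if drops else -np.inf
        if score > best_score:
            best, best_score = name, score
    return best
```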
The resulting direction is transformed into the robot frame and used to define a short pulling motion for the grasped fold region. If the post-action depth map appears uniform within a tolerance threshold, the textile is considered flat; otherwise, the system iterates with a new fold candidate or terminates after a fixed number of attempts.
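A minimal version of the depth-uniformity flatness test might look like the following; the use of the peak-to-peak range and the 5 mm tolerance are assumptions for illustration, not the exact criterion implemented in the system.

```python
import numpy as np

def is_flat(depth, textile_mask, tolerance=0.005):
    """Consider the textile flat if depth over its mask is uniform
    within a tolerance (here: peak-to-peak range in metres)."""
    values = depth[textile_mask.astype(bool)]
    return float(values.max() - values.min()) <= tolerance
```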

5. Experiments

This section describes how the experiments were conducted to evaluate each part of the unfolding pipeline: visual perception and calibration, segmentation and grasp-point detection, gripper and trajectory behaviour, YOLO-based fold detection, and the full three-stage unfolding process.

5.1. Perception Experiments

The perception experiments focus on evaluating depth accuracy, intrinsic and extrinsic calibration, and the depth error correction model. To measure the raw depth accuracy of the Intel RealSense L515, the camera was placed in its normal operating position above the workspace, and a ruler was positioned on the table at different distances within the working range for textiles (20–35 cm). For each distance, the corresponding depth reported by the camera was recorded and compared to the ground-truth ruler measurement.
To analyze how the depth error behaves across the field of view, a fixed region in the centre of the camera frame was selected. This region was used both for calibration validation and for sampling depth values when fitting the spatial error model. For model evaluation, the sampled depth points were separated into fitting and evaluation subsets to ensure that reported errors reflect generalisation to unseen workspace locations rather than interpolation of the same samples. The same workspace plane was used throughout all perception experiments to provide a consistent reference surface.
Intrinsic calibration experiments were carried out using checkerboard targets placed at different positions and orientations across the workspace. For extrinsic calibration, AprilTags mounted on the robot bases were observed from multiple viewpoints, and the resulting camera-to-robot transforms were evaluated in terms of spatial error along the z-axis. The same sampled region illustrated in Figure 7 left was used when analysing how the depth correction model affected the transformed 3D points.

5.2. Segmentation and Grasp-Point Experiments

The segmentation and grasp-point experiments evaluate how reliably the system can detect the textile, generate a usable mask, and extract grasp points from the contour. Textiles were placed on the workspace in different initial states, including crumpled, semi-folded, and more complex multi-fold configurations. For each configuration, a set of images was collected under the same camera pose and lighting conditions as in the final system. The RGB and depth frames from these scenes were then processed using the custom depth-based method explained in Section 4.2. Figure 16 shows an example textile scene on the workspace before segmentation, while Figure 17 shows the corresponding segmentation outputs from the custom depth-based method.
For grasp-point evaluation, the binary masks produced by the segmentation pipeline were analysed to extract leftmost and rightmost extremities. These points were then slightly offset inward to avoid grasping exactly at the textile edge. A fixed number of images (e.g., 20 per fold configuration) were used to evaluate how often the detected grasp points lay on the textile region and were suitable for grasping.

5.3. Gripper and Trajectory Tests

The gripper and trajectory experiments assess how the choice of gripper fingers and motion strategy influences textile grasp reliability and unfolding quality. A single textile was placed in a consistent configuration, and the gripper attempted 30 grasps. The outcomes (successful or failed grip) were recorded to compute the success rate.
Trajectory tests focused on the unfolding strategies described in the methods section. Two main types of motion were evaluated: a sequential drag motion designed to gently pull the textile across the table, and a more dynamic jerk-like motion originally intended for flipping. During testing, the jerking strategy was found to push the hardware close to its torque limits, so emphasis was placed on low-force trajectories that maintain tension while avoiding collisions with the robot bases.
An example of an invalid unfolding trajectory is shown in Figure 18, where the arms must route around the robot bases during placement, causing a loss of tension and a less accurate unfolding result.

5.4. Depth-Map Leveraged YOLO-Based Fold Detection

The machine learning experiments are designed to evaluate the YOLOv11-based fold detection model and the depth-based fold direction estimation. The dataset consists of depth-coded images of textiles placed in realistic configurations that the system may encounter. An initial set of 141 images was captured using the L515 in the final system configuration. These images were then augmented using horizontal and vertical flips and rotations of 90°, 180°, and 270°, yielding a total of 1691 images. Sample images from the fold detection dataset, showing textiles in different folded configurations, are shown in Figure 14. The annotated dataset was then split into training, validation, and test sets using a 70–20–10 ratio. To avoid split leakage, augmented variants derived from the same original capture were kept within the same subset (train, validation, or test) as their source image, so that evaluation reflects generalisation to unseen captures rather than transformed copies of the same scene. YOLOv11n is trained for 30 epochs with a batch size of 8 and optimised using the Adam optimizer. Model performance is evaluated primarily on classification accuracy, mAP@0.5, and mAP@0.5:0.95.
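The leakage-free split described above can be sketched as a group-aware partition, where each original capture and all of its augmented variants land in a single subset; the data layout and function name are illustrative assumptions.

```python
import random

def group_split(image_ids, ratios=(0.7, 0.2, 0.1), seed=0):
    """Split augmented images into train/val/test without leakage:
    all variants of one original capture stay in the same subset.

    image_ids : list of (original_capture_id, filename) pairs
    """
    groups = sorted({gid for gid, _ in image_ids})
    rng = random.Random(seed)
    rng.shuffle(groups)
    n = len(groups)
    n_train = int(ratios[0] * n)
    n_val = int(ratios[1] * n)
    train_g = set(groups[:n_train])
    val_g = set(groups[n_train:n_train + n_val])
    split = {"train": [], "val": [], "test": []}
    for gid, fname in image_ids:
        key = "train" if gid in train_g else "val" if gid in val_g else "test"
        split[key].append(fname)
    return split
```

Because the ratio is applied to capture groups rather than individual files, the per-image proportions only approximate 70–20–10 when groups have unequal numbers of variants.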

5.5. System-Level Unfolding Tests

Finally, system-level tests were carried out to assess the entire three-stage pipeline. For each stage, 50 unfolding attempts were executed using textiles placed in varied but realistic initial conditions. Stage 1 evaluated the dual-arm stretching of a crumpled textile; Stage 2 focused on refinement stretches targeting the lower part of the textile; Stage 3 started from partially unfolded states with one or more visible V-folds and assessed the YOLO-guided fold resolution and direction estimation.
For each attempt, success or failure was logged according to stage-specific criteria. In Stage 1, a trial was considered successful if the textile was grasped at the detected extremities, lifted, stretched, and returned to the workspace in a more organised semi-folded configuration without a major loss of tension. In Stage 2, success required that the lower-region refinement reduced the remaining large folds without introducing new major self-occlusions. In Stage 3, success required that the detected local fold was unfolded in the estimated direction and that the textile became flatter according to the depth-uniformity criterion used in the system. Timing information was also recorded for each stage, together with qualitative notes describing the dominant failure modes, such as missed grasps, loss of tension, incorrect fold detection, or inaccurate direction estimation.

6. Results

This section presents the quantitative and qualitative results from the experiments described above. We first report on depth and calibration performance, then segmentation and grasp-point detection, gripper and trajectory behaviour, YOLO-based fold detection, and finally the overall unfolding performance of the three-stage pipeline.

6.1. Calibration and Depth Correction

The depth accuracy tests confirmed that the Intel RealSense L515 tends to overestimate distances by roughly 1 cm within the operating range, but in a largely consistent way. After applying the fixed offset and learned spatial error model, this systematic bias is significantly reduced.
The depth error correction model was evaluated using standard metrics such as mean absolute error (MAE), root mean square error (RMSE), and maximum error. Since the residual depth bias varies across the workspace, we compared a small set of regression models that predict the height residual as a function of planar position on the table. We included simple baselines such as a linear model (plane fit) and a second-order polynomial surface (Poly2), which capture global trends in the error field [50]. To reduce sensitivity to noise while keeping the model simple, we also tested Ridge regression (a linear model with ℓ2 regularisation) [51]. For smooth local variations, we evaluated an RBF regressor, which represents the spatial error using distance-based kernels [52]. In addition, we tested tree-based models that can capture non-linear structure without a fixed functional form, including Random Forests [53], gradient-boosted trees (XGBoost) [54,55], and LightGBM [56]. Finally, for the variants labelled as Scaled, the same models were trained with standardised inputs (zero mean, unit variance), which can improve numerical conditioning for linear methods and make comparisons more consistent across features [57]. The combined XGB + Poly2 + RBF model achieved the lowest MAE while keeping prediction time suitable for real-time use. The MAE comparison across models is shown in Figure 19.
To better understand how the final model behaves across the workspace, predicted and true z-values were compared along representative lines in the x- and y-directions (Figure 20). The fitted model successfully captured both the global slope and the local wave-like variations of the depth error surface, leading to a substantial reduction in residual error. Overall, the corrected depth values reduce spatial error in the robot frame to the millimetre range (Figure 21), which is sufficient for stable grasp planning and fold localisation.

6.2. Segmentation and Extremity Detection

For segmentation, we compared the custom depth-based approach to SAM2 [58] across different fold counts and textile configurations. The custom method achieved perfect detection for one and two folds and remained competitive for more complex configurations, while SAM2 required significantly more computation time and tended to miss folds when they were closely spaced. Contour quality was evaluated qualitatively by looking at cases with loose, overly tight, and accurate contours shown in Figure 22, Figure 23, Figure 24 and Figure 25. Loose contours tended to underfit the textile, excluding some parts of the fabric, while tight contours hugged internal wrinkles and could fragment the mask. The target behaviour was a contour that tightly follows the outer boundary without reacting to small internal creases. Extremity-based grasp-point detection was evaluated over 100 trials covering variation in textile rotation, fold state, and position on the table. In all cases, the detected leftmost and rightmost points fell on the segmented textile region, yielding 100% validity for image-space grasp target selection under the tested conditions. Physical grasp execution success is reported separately in the gripper comparison and system-level unfolding tests.

6.3. Gripper and Trajectory Performance

Over 30 grasp attempts, the gripper achieved 28 successful grasps (93.33%). Trajectory experiments further highlighted the importance of maintaining tension and staying within a safe region of the workspace. Trajectories that forced the arms to move behind their bases or around obstacles often led to loss of tension and partial re-folding of the textile (Figure 18). In contrast, carefully constrained trajectories that stayed within the valid region and preserved a roughly straight line between grasp points produced more reliable unfolding results.

6.4. YOLO-Based Fold Detection and Direction Estimation

The YOLOv11n model trained on the augmented dataset achieved a classification accuracy of approximately 98.8% on the test set. This is further validated by measuring the pixel error between the predicted fold centre points and the ground-truth centres. The majority of predictions fall within a small pixel error range relative to the image resolution, which is sufficient for the subsequent depth-based direction estimation. The pixel error distribution is shown in Figure 26.
Using the detected fold location, the depth-based direction estimation procedure achieved correct direction predictions in roughly 87% of the test cases, which is adequate for the downstream manipulation policy given that unsuccessful attempts can often be recovered by re-detecting and re-trying a fold.

6.5. System-Level Unfolding Performance

The full three-stage unfolding pipeline was evaluated in 50 independent runs per stage. For Stage 1, the dual-arm stretch from an initially crumpled textile achieved a success rate of about 86%, with most failures caused by minor grasp slips or insufficient tension during the stretch. Stage 2, which refines the lower part of the textile, achieved a higher success rate of around 94%, reflecting the simpler geometry once the main folds have been reduced.
Stage 3 combines YOLO-based fold detection with depth-based direction estimation and targeted extremity grasps. In this stage, the system achieved a success rate of approximately 78%. The remaining failures were mainly due to incorrect fold direction estimates in ambiguous depth configurations or cases where the fold could not be fully resolved with a single manipulation.
Across all stages, the average execution time per attempt remained within a practical range for integration into an automated textile sorting line. The overall results demonstrate that the system can reliably flatten a wide range of textile configurations with limited human intervention, while still leaving room for further improvement in fold direction estimation and robust handling of more complex multi-fold garments.

7. Discussion

The discussion is structured around four core aspects of the system: visual perception and depth correction, extremity-based grasping, AI-based fold detection and direction estimation, and system-level performance. Finally, key limitations are summarised.

7.1. Benefits and Limitations of Extremity-Based Grasping

The extremity-based grasping method—selecting leftmost and rightmost textile points in the segmented contour for Stage 1 and restricting the search to the lowest 20% of the textile for Stage 2—proved to be both simple and effective in practice. Segmentation was successful for most textile configurations, and the resulting extremity detection provided stable grasp points across crumpled, semi-folded, and multi-fold setups. However, the experiments also highlight limitations. The method relies heavily on clean contour extraction, which in turn assumes a reasonably flat background, controlled lighting, and a textile that is fully within the camera’s field of view. When textiles are highly wrinkled, self-occluded, or extend beyond the workspace boundaries, the detected extremities may not correspond to physically convenient grasp points or may lie too close to the robot bases. In these scenarios, Stage 1 can leave more residual folds than desired, and Stage 2 has to compensate by focusing on the lower region of the textile. These situations suggest that extremity-based grasping is well suited for moderately complex, approximately rectangular items, but will need to be complemented by more advanced strategies for highly irregular or layered garments.
The current policy is intentionally heuristic to prioritise robustness and interpretability in a recycling-oriented setting; replacing fixed rules with optimisation-based planning or learned decision-making (e.g., grasp affordances and adaptive trajectories) is a natural next step and is included in future work.

7.2. YOLO-Based Fold Detection: Strengths and Failure Modes

The YOLOv11n-based fold detection module achieves high classification accuracy and reliable fold localisation in depth-coded images. Training convergence was stable, and geometric augmentations such as shear and scaling improved generalization without degrading the underlying depth encoding. In practice, minor bounding box misalignments did not negatively affect unfolding, because the fold center and direction estimation tolerate small shifts. Fold direction estimation, combined with the YOLO detections, achieved around 87% accuracy. This confirms that the depth-gradient-based direction search, using compass-aligned candidate directions and average heights, is generally reliable for the fold types represented in the dataset. When folds are clearly visible and isolated, the method consistently identifies a direction that leads to successful unfolding in Stage 3. However, misclassifications and missed detections tend to occur in ambiguous or low-contrast folds, for instance when folds are shallow or when multiple folds overlap. In these cases, the depth differences can be close to the noise level, and the resulting detection boxes may drift or split. Direction estimation struggles particularly when a detected fold lies under or near another protruding region with a higher average height; in such cases, the algorithm may follow the wrong gradient and choose an unfolding direction that does not correspond to the actual fold geometry.

7.3. System-Level Performance and Industrial Relevance

The system-level evaluation not only shows that the three-stage unfolding pipeline is capable of robust operation across a range of textile states but also reveals where performance plateaus. Stage 1, which performs the initial dual-arm stretch from a crumpled starting configuration, achieved an 86% success rate with an average execution time of approximately 34 s. Most failures were related to pickup issues rather than mis-detected grasp points, indicating that the perception pipeline is stable while some grasps fail due to fabric slippage or arm reach limitations. Stage 2, which focuses on the lower portion of the textile and refines the unfold, reached a higher success rate of 94% with an average time of approximately 35 s. Failures in this stage were relatively few and were mainly due to imperfect tension during stretching. The absence of detection or placement errors suggests that once the textile geometry is simpler, the extremity-based strategy is very reliable. Stage 3, driven by YOLO-based fold detection and directional unfolding, achieved a 78% success rate with an average time close to 49 s. Most failures were due to incorrect fold detection or ambiguous direction estimation (most probable cause explained above), with a smaller number caused by execution issues such as poor grasping of the fold region. This stage is also the slowest, since it may require several perception-action cycles before the textile is fully unfolded. From an industrial perspective, success rates in the 80–90% range and cycle times between 30 and 50 s suggest that the approach is promising for semi-structured environments or hybrid workflows in which human operators handle edge cases, rather than for immediate deployment in fully autonomous high-throughput recycling lines.

8. Conclusions

This work presents the development of an autonomous robotic system for unfolding textiles as a critical pre-processing step in automated textile sorting and recycling. The system integrates dual-arm manipulation, depth-based visual perception, and machine-learning-based fold detection to address the challenges posed by highly variable, flexible, and deformable textile items.
The system leverages two Interbotix WidowX 250 arms placed on either side of a shared workspace and an Intel RealSense L515 depth camera mounted overhead. Using calibrated depth data, the system segments the textile, identifies suitable grasp points, and analyses the surface geometry. A YOLO-based model detects residual folds and guides the final unfolding actions. The unfolding pipeline is organised into three stages: an initial dual-arm stretch based on leftmost and rightmost extremities, a second stretch targeting extremities in the lower portion of the textile, and a fold-driven stage that iteratively resolves remaining small folds using AI-based detection and depth-gradient direction estimation. The experimental results show that the visual perception pipeline can achieve millimetre-level spatial accuracy after depth correction, that the segmentation and extremity-based grasping strategy is robust across several fold configurations, and that the YOLOv11n model reaches high classification accuracy with reliable bounding box localisation. Fold direction estimation attains around 87% accuracy, and the three unfolding stages achieve success rates of approximately 86%, 94%, and 78%, respectively. These findings indicate that the system can reliably transform crumpled textiles into a mostly flat state and expose potential hard components such as zippers and buttons, supporting safer and more efficient downstream recycling operations. Overall, this work demonstrates that combining calibrated depth sensing, extremity-based dual-arm manipulation, and fold-aware machine learning in a structured, multi-stage pipeline is a promising approach to automated textile unfolding. The system provides a practical foundation that can be extended toward more general textile handling and may, with further development, be integrated into larger automated recycling workflows.
Although the current system successfully demonstrates autonomous textile unfolding in a controlled setting, several directions for future work remain. Camera calibration and depth accuracy could be improved further through more extensive checkerboard-based calibration and closed-loop validation. The AprilTag-based pose estimation pipeline may also benefit from additional tuning of detection parameters and refinement settings. On the manipulation side, the trajectory planner could be made more adaptive by replacing fixed motion primitives with a dynamic, constraint-aware strategy. Grasping could also be improved through richer local surface modelling and 6D pose-aware alignment, especially for elevated or multi-layered folds. Finally, both the AI model and the hardware can be refined further by expanding the dataset, exploring more advanced detection or segmentation models, and improving gripper contact surfaces and robot reach. Taken together, these directions point toward a more robust, adaptable, and scalable textile unfolding system that can move beyond controlled laboratory conditions and contribute more directly to future automated textile recycling lines.

Author Contributions

Conceptualization, T.E.B., J.J., A.J. and I.T.; methodology, T.E.B., J.J., A.J. and I.T.; software, T.E.B. and J.J.; validation, T.E.B., J.J., A.J. and I.T.; formal analysis, T.E.B. and J.J.; investigation, T.E.B. and J.J.; resources, A.J. and I.T.; data curation, T.E.B. and J.J.; writing—original draft preparation, T.E.B. and J.J.; writing—review and editing, A.J. and I.T.; visualization, T.E.B. and J.J.; supervision, A.J. and I.T.; project administration, A.J. and I.T.; funding acquisition, A.J. and I.T. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Regional Research Fund Agder through project number 341372 and forms part of the research project “Spectral imaging-based machine vision for intelligent automated sorting and disassembly of textile waste (ISORTx)”.

Data Availability Statement

The project repository is available at https://github.com/Joakimjoh/MAS500 (accessed on 1 October 2025). The data used for the results are available from the corresponding author upon request.

Acknowledgments

During the preparation of this manuscript/study, the authors used ChatGPT 5.2 for the purposes of grammar correction and structural enhancement. The authors have reviewed and edited the output and take full responsibility for the content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Juanga-Labayen, J.P.; Labayen, I.V.; Yuan, Q. A Review on Textile Recycling Practices and Challenges. Textiles 2022, 2, 174–188. [Google Scholar] [CrossRef]
  2. McKinsey & Company. Scaling Textile Recycling in Europe—Turning Waste into Value (Full Report PDF); McKinsey & Company: New York, NY, USA, 2023; Available online: https://www.mckinsey.com/industries/retail/our-insights/scaling-textile-recycling-in-europe-turning-waste-into-value (accessed on 5 March 2026).
  3. European Environment Agency. Management of Used and Waste Textiles in Europe’s Circular Economy; European Environment Agency: Copenhagen, Denmark, 2024. [Google Scholar] [CrossRef]
  4. Klima- og miljødepartementet. Nye Krav Til Kildesortering av Tekstiler; Klima- og miljødepartementet: Oslo, Norway, 2024; Available online: https://www.regjeringen.no/no/aktuelt/nye-krav-til-kildesortering-av-tekstiler/id3040775/ (accessed on 5 March 2026).
  5. Niinimäki, K.; Peters, G.; Dahlbo, H.; Perry, P.; Rissanen, T.; Gwilt, A. The Environmental Price of Fast Fashion. Nat. Rev. Earth Environ. 2020, 1, 189–200. [Google Scholar] [CrossRef]
  6. NORCE. ISORTx: Spectral Imaging-Based Machine Vision for Intelligent Automated Sorting and Disassembly of Textile Waste, 2025. Project Page. Available online: https://www.norceresearch.no/en/projects/isortx-spectral-imaging-based-machine-vision-for-intelligent-automated-sorting-and-disassembly-of-textile-waste (accessed on 5 March 2026).
  7. UK Fashion and Textile Association. Textile Waste: Challenges and Opportunities; UK Fashion and Textile Association: London, UK, 2025. [Google Scholar]
  8. Zhu, J. Automated Manipulation of Deformable Objects: Textile Handling and Unfolding; ARM Lab, University of Michigan: Ann Arbor, MI, USA, 2024. [Google Scholar]
  9. Gu, F.; Zhou, Y.; Wang, Z.; Jiang, S.; He, B. A Survey on Robotic Manipulation of Deformable Objects: Recent Advances, Open Challenges and New Frontiers. arXiv 2023, arXiv:2312.10419. [Google Scholar] [CrossRef]
  10. Huang, Z.; Lin, X.; Held, D. Mesh-based Dynamics with Occlusion Reasoning for Cloth Manipulation. In Proceedings of the Robotics: Science and Systems (RSS), New York, NY, USA, 27 June–1 July 2022. [Google Scholar]
  11. Dastider, A.; Fang, H.; Lin, M. APEX: Ambidextrous Dual-Arm Robotic Manipulation Using Collision-Free Generative Diffusion Models. arXiv 2024, arXiv:2404.02284. [Google Scholar] [CrossRef]
  12. Ecologic Institute. Study on the Recycling of Textile Waste; Ecologic Institute: Berlin, Germany, 2022; Available online: https://www.ecologic.eu/18392 (accessed on 5 March 2026).
  13. U.S. Government Accountability Office. Science & Tech Spotlight: Textile Recycling Technologies; U.S. Government Accountability Office: Washington, DC, USA, 2024.
  14. European Environment Agency. Digital Technologies Will Deliver More Circular Economy Benefits; European Environment Agency: Copenhagen, Denmark, 2024; Available online: https://www.eea.europa.eu/en/analysis/publications/digital-technologies-will-deliver-more-efficient-waste-management-in-europe (accessed on 5 March 2025).
  15. SYSTEMIQ. The Textile Recycling Breakthrough: Why Policy Must Lead the Scale-Up of Polyester Recycling in Europe; SYSTEMIQ: Amsterdam, The Netherlands, 2022; Available online: https://www.systemiq.earth/textile-recycling/ (accessed on 5 March 2025).
  16. Sandin, G.; Lidfeldt, M.; Nellström, M. Exploring the Environmental Impact of Textile Recycling in Europe: A Consequential Life Cycle Assessment. Sustainability 2025, 17, 1931. [Google Scholar] [CrossRef]
  17. Kaltsas, P.I.; Koustoumpardis, P.N.; Nikolakopoulos, P.G. A Review of Sensors Used on Fabric-Handling Robots. Machines 2022, 10, 101. [Google Scholar] [CrossRef]
  18. Nahavandi, S.; Alizadehsani, R.; Nahavandi, D.; Lim, C.P.; Kelly, K.; Bello, F. Machine Learning Meets Advanced Robotic Manipulation. Inf. Fusion 2024, 105, 102221. [Google Scholar] [CrossRef]
  19. Longhini, A.; Wang, Y.; Garcia-Camacho, I.; Blanco-Mulero, D.; Moletta, M.; Welle, M.; Alenyà, G.; Yin, H.; Erickson, Z.; Held, D.; et al. Unfolding the Literature: A Review of Robotic Cloth Manipulation. Annu. Rev. Control. Robot. Auton. Syst. 2025, 8, 295–322. [Google Scholar] [CrossRef]
  20. Desingh, K. Perception for General-purpose Robot Manipulation. In Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence (AAAI-23), Washington, DC, USA, 7–14 February 2023; Association for the Advancement of Artificial Intelligence: Washington, DC, USA, 2023. [Google Scholar]
  21. Andronas, D.; Arkouli, Z.; Zacharaki, N.; Michalos, G.; Sardelis, A.; Papanikolopoulos, G.; Makris, S. On the Perception and Handling of Deformable Objects—A Robotic Cell for White Goods Industry. Robot. Comput.-Integr. Manuf. 2022, 77, 102358. [Google Scholar] [CrossRef]
  22. Proesmans, R.; Verleysen, A.; Wyffels, F. UnfoldIR: Tactile Robotic Unfolding of Cloth. IEEE Robot. Autom. Lett. 2023, 8, 4426–4432. [Google Scholar] [CrossRef]
  23. Xue, H.; Li, Y.; Xu, W.; Li, H.; Zheng, D.; Lu, C. UniFolding: Towards Sample-Efficient, Scalable, and Generalizable Robotic Garment Folding. arXiv 2023, arXiv:2311.01267. [Google Scholar] [CrossRef]
  24. Zacharia, P.T.; Aspragathos, N.A.; Mariolis, I.G.; Dermatas, E. A robotic system based on fuzzy visual servoing for handling flexible sheets lying on a table. Ind. Robot. Int. J. 2009, 36, 489–496. [Google Scholar] [CrossRef]
  25. Servi, M.; Mussi, E.; Profili, A.; Furferi, R.; Volpe, Y.; Governi, L.; Buonamici, F. Metrological Characterization and Comparison of D415, D455, L515 RealSense Devices in the Close Range. Sensors 2021, 21, 7770. [Google Scholar] [CrossRef] [PubMed]
  26. Zhang, C.; Huang, T.; Zhao, Q. A New Model of RGB-D Camera Calibration Based on 3D Control Field. Sensors 2019, 19, 5082. [Google Scholar] [CrossRef]
  27. Xing, H.; Liu, Y.; Chen, J.; Li, W.; Ding, L.; Tavakoli, M. Variable admittance control for door opening with a wheeled mobile manipulator considering ground obstacles. Intell. Serv. Robot. 2026, 19, 41. [Google Scholar] [CrossRef]
  28. Triantafyllou, D.; Mariolis, I.; Kargakos, A.; Malassiotis, S.; Aspragathos, N.A. A Geometric Approach to Robotic Unfolding of Garments. Robot. Auton. Syst. 2016, 75, 233–243. [Google Scholar] [CrossRef]
  29. Triantafyllou, D.; Koustoumpardis, P.N.; Aspragathos, N. Type-Independent Hierarchical Analysis for the Recognition of Folded Garments’ Configuration. Intell. Serv. Robot. 2021, 14, 427–444. [Google Scholar] [CrossRef]
  30. Gabas, A.; Kita, Y.; Yoshida, E. Dual Edge Classifier for Robust Cloth Unfolding. ROBOMECH J. 2021, 8, 15. [Google Scholar] [CrossRef]
  31. Kuribayashi, Y.; Yoshioka, Y.; Onda, K.; Yamazaki, T.; Wu, T.; Arnold, S.; Takase, Y.; Yamazaki, K. A Dual-Arm Manipulation System for Unfolding and Folding Rectangular Cloth. In Proceedings of the 2023 IEEE International Conference on Mechatronics and Automation (ICMA), Harbin, China, 6–9 August 2023. [Google Scholar] [CrossRef]
  32. Ha, H.; Song, S. FlingBot: The Unreasonable Effectiveness of Dynamic Manipulation for Cloth Unfolding. In Proceedings of the Conference on Robot Learning (CoRL), London, UK, 8–11 November 2021. [Google Scholar]
  33. Jiménez, P. Visual Grasp Point Localization, Classification and State Recognition in Robotic Manipulation of Cloth: An Overview. Robot. Auton. Syst. 2017, 92, 107–125. [Google Scholar] [CrossRef]
  34. Maitin-Shepard, J.; Cusumano-Towner, M.; Lei, J.; Abbeel, P. Cloth Grasp Point Detection Based on Multiple-View Geometric Cues with Application to Robotic Towel Folding. In Proceedings of the 2010 IEEE International Conference on Robotics and Automation (ICRA), Anchorage, AK, USA, 3–7 May 2010. [Google Scholar] [CrossRef]
  35. Qian, J.; Weng, T.; Zhang, L.; Okorn, B.; Held, D. Cloth Region Segmentation for Robust Grasp Selection. In Proceedings of the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, NV, USA, 24 October 2020–24 January 2021. [Google Scholar] [CrossRef]
  36. Tabernik, D.; Muhovič, J.; Urbas, M.; Skočaj, D. Center Direction Network for Grasping Point Localization on Cloths. IEEE Robot. Autom. Lett. 2024, 9, 8913–8920. [Google Scholar] [CrossRef]
  37. Lips, T.; De Gusseme, V.L.; Wyffels, F. Learning Keypoints for Robotic Cloth Manipulation Using Synthetic Data. IEEE Robot. Autom. Lett. 2024, 9, 6528–6535. [Google Scholar] [CrossRef]
  38. Li, C.; Fu, T.; Li, F.; Song, R. Design and Implementation of Fabric Wrinkle Detection System Based on YOLOv5 Algorithm. Cobot 2024, 3, 5. [Google Scholar] [CrossRef]
  39. Hassan, S.A.; Beliatis, M.J.; Radziwon, A.; Menciassi, A.; Oddo, C.M. Textile Fabric Defect Detection Using Enhanced Deep Convolutional Neural Networks with Safe Human–Robot Collaborative Interaction. Electronics 2024, 13, 4314. [Google Scholar] [CrossRef]
  40. He, C.; Meng, L.; Sun, Z.; Wang, J.; Meng, M.Q.H. FabricFolding: Learning Efficient Fabric Folding without Expert Demonstrations. arXiv 2023, arXiv:2303.06587. [Google Scholar] [CrossRef]
  41. Intel. Intel RealSense LiDAR Camera L515: Specifications; Intel: Santa Clara, CA, USA, 2019; Available online: https://www.intel.com/content/www/us/en/products/sku/201775/intel-realsense-lidar-camera-l515/specifications.html (accessed on 12 February 2026).
  42. Trossen Robotics. WidowX-250 6DOF Robot Arm Specifications; Trossen Robotics: Downers Grove, IL, USA, 2026; Available online: https://docs.trossenrobotics.com/interbotix_xsarms_docs/specifications/wx250s.html?highlight=dof (accessed on 12 February 2026).
  43. Intel RealSense Documentation. Camera Calibration Tools and Guides. 2024. Available online: https://dev.realsenseai.com/docs/calibration (accessed on 12 February 2026).
  44. Zhang, Z. A Flexible New Technique for Camera Calibration. IEEE Trans. Pattern Anal. Mach. Intell. 2000, 22, 1330–1334. [Google Scholar] [CrossRef]
  45. Olson, E. AprilTag: A robust and flexible visual fiducial system. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Shanghai, China, 9–13 May 2011. [Google Scholar] [CrossRef]
  46. Lepetit, V.; Moreno-Noguer, F.; Fua, P. EPnP: An Accurate O(n) Solution to the PnP Problem. Int. J. Comput. Vis. 2009, 81, 155–166. [Google Scholar] [CrossRef]
  47. Neto, J.P. Radial Basis Functions. 2013. Available online: https://www.di.fc.ul.pt/~jpn/r/rbf/rbf.html (accessed on 30 May 2025).
  48. Miller, A.; Allen, P.K. Automatic grasp planning using shape primitives. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Taipei, Taiwan, 14–19 September 2003. [Google Scholar]
  49. Vahrenkamp, N.; Asfour, T.; Dillmann, R. Efficient Inverse Kinematics Computation Based on Reachability Analysis. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Algarve, Portugal, 7–12 October 2012. [Google Scholar] [CrossRef]
  50. Hastie, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed.; Springer: Berlin/Heidelberg, Germany, 2009. [Google Scholar] [CrossRef]
  51. Hoerl, A.E.; Kennard, R.W. Ridge Regression: Biased Estimation for Nonorthogonal Problems. Technometrics 1970, 12, 55–67. [Google Scholar] [CrossRef]
  52. Buhmann, M.D. Radial Basis Functions: Theory and Implementations; Cambridge University Press: Cambridge, UK, 2003. [Google Scholar] [CrossRef]
  53. Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32. [Google Scholar] [CrossRef]
  54. Friedman, J.H. Greedy Function Approximation: A Gradient Boosting Machine. Ann. Stat. 2001, 29, 1189–1232. [Google Scholar] [CrossRef]
  55. Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), San Francisco, CA, USA, 13–17 August 2016. [Google Scholar] [CrossRef]
  56. Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Ye, Q.; Liu, T.Y. LightGBM: A Highly Efficient Gradient Boosting Decision Tree. In Proceedings of the Advances in Neural Information Processing Systems 30 (NeurIPS 2017), Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
  57. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830. [Google Scholar]
  58. Ravi, N.; Gabeur, V.; Hu, Y.T.; Hu, R.; Ryali, C.; Ma, T.; Khedr, H.; Rädle, R.; Rolland, C.; Gustafson, L.; et al. SAM 2: Segment Anything in Images and Videos. arXiv 2024, arXiv:2408.00714. [Google Scholar] [CrossRef]
Figure 1. Overview of the dual-arm robotic unfolding system and workspace layout.
Figure 2. Textile configurations during the unfolding process: (left) crumpled, (centre) semi-folded, and (right) fully unfolded.
Figure 3. Overview of the textile unfolding process, showing the sequence of sensing, grasp selection, motion execution, and fold resolution.
Figure 4. Coordinate frames of the overhead camera, the two robot bases, and the workspace table. The red, green, and blue axes represent the X, Y, and Z axes, respectively.
Figure 5. Flowchart illustrating the transformation from image pixels and raw depth to robot-frame 3D coordinates. Different colours are used to distinguish standard flowchart elements such as start/end, process steps, and decision nodes.
Figure 6. Flowchart showing the depth correction process. It illustrates the steps of sampling depth data, projecting points into the robot frame, fitting a spatial error model, and applying the correction surface to improve height accuracy. Different colours are used to distinguish standard flowchart elements such as start/end, process steps, and decision nodes.
Figure 7. Static background subtraction: (left) empty workspace reference, green square indicates the region selected for depth correction analysis; (right) workspace with a textile placed.
Figure 8. Segmentation pipeline: (left) blurred grayscale image after preprocessing; (right) detected textile contour, shown as a green outline around the largest segmented region corresponding to the textile.
Figure 9. Depth map of a segmented textile with extremity-based grasp points projected into 3D.
Figure 10. Examples of dual-arm motion primitives: forward–back–forward drag (left) and dynamic jerking motion for flipping folds (right).
Figure 11. Linearity analysis of grasp points (left), where the green line represents a fold, and reachable workspace of both arms and AprilTags near the robot bases (right), where different colours (red and blue) distinguish the workspace regions of the left and right manipulators respectively.
Figure 12. Flow diagram governing automatic unfolding operation. Different colours are used to distinguish standard flowchart elements such as start/end, process steps, and decision nodes.
Figure 13. Flow diagram governing manual or semi-automatic unfolding operation. Different colours are used to distinguish standard flowchart elements such as start/end, process steps, and decision nodes.
Figure 14. Example depth-encoded images from the fold dataset, including rotated and augmented training samples (top), and an annotated sample showing a fold region enclosed by a bounding box (bottom). The different colours represent the false-colour depth encoding used to visualise height variations on the textile surface.
Figure 15. Fold centre and local sampling points used for depth-gradient analysis, leftmost and rightmost extremities of the textile contour (left) and the eight compass-based candidate unfolding directions (right). The green frame indicates the selected local fold region around the detected centre point. In the left panel, the coloured points mark the left point (red), right point (blue), edge point (yellow), and centre point (magenta). In the right panel, the dashed black lines represent the compass-based candidate directions evaluated during fold direction estimation.
Figure 16. Photo of the fabric on the table before segmentation.
Figure 17. Custom method: RGB image with fold areas highlighted using depth data.
Figure 18. Example of an invalid unfolding trajectory. The textile is initially stretched, but as the arms move to lay it flat, they must route around the robot bases, causing a loss of tension and a less accurate unfolding.
Figure 19. Mean Absolute Error (MAE) comparison of different spatial error correction models.
Figure 20. Z prediction comparison along the x-axis (at fixed y). The model tracks both large-scale and local variations in the depth error.
Figure 21. Z prediction comparison along the y-axis (at fixed x). The learned model closely follows the wave-like spatial error pattern.
Figure 22. Loose contour detection for textile. The green contour lines indicate the detected outer boundary of the segmented textile region.
Figure 23. Overly tight contour detection for textile. The green contour lines indicate the detected outer boundary of the segmented textile region.
Figure 24. Accurate contour detection under ideal conditions. The green contour lines indicate the detected outer boundary of the segmented textile region.
Figure 25. Accurate contour detection with a nearby object present.
Figure 26. Pixel error in bounding box centre predictions.
Table 1. Qualitative positioning of the proposed approach relative to representative deformable manipulation systems.
Work | Table-Top Cloth | Dual-Arm | Depth Err. Model. | Learn. in Loop
Zacharia et al. (2009) [24] | ✓ | × | × | ×
Jiménez (2017) [33] | – | – | – | –
This work | ✓ | ✓ | ✓ | ✓
Note: ✓ indicates that the feature is explicitly used in the corresponding work; × indicates that it is not used; – indicates that the feature is not applicable or is not explicitly reported.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Båserud, T.E.; Johansen, J.; Jha, A.; Tyapin, I. Dual-Arm Robotic Textile Unfolding with Depth-Corrected Perception and Fold Resolution. Robotics 2026, 15, 78. https://doi.org/10.3390/robotics15040078

