Author Contributions
Conceptualization, Y.L. (Yaxin Liu), Y.H. and M.Z.; methodology, Y.H., Y.L. (Yan Liu) and M.Z.; software, Y.H.; validation, Y.L. (Yaxin Liu), Y.H., Y.L. (Yan Liu) and M.Z.; formal analysis, Y.L. (Yaxin Liu), Y.H., Y.L. (Yan Liu) and M.Z.; investigation, Y.H.; resources, Y.L. (Yaxin Liu) and M.Z.; data curation, Y.L. (Yaxin Liu) and Y.H.; writing—original draft preparation, Y.L. (Yan Liu) and Y.H.; writing—review and editing, Y.L. (Yaxin Liu) and Y.H.; visualization, Y.H.; supervision, M.Z.; project administration, M.Z.; funding acquisition, Y.L. (Yaxin Liu). All authors have read and agreed to the published version of the manuscript.
Figure 1.
Task-Oriented Grasping Framework. The arrows indicate the workflow direction, while the dashed lines denote the components involving human-in-the-loop interaction.
Figure 2.
Coarse-stage target object localization framework.
Figure 3.
Instruction parsing prompt template.
Figure 4.
Fine-stage task-relevant part grounding framework. The numerical labels on the spoon indicate the indexed part-level segmentation results, which are used as visual prompts for VLM-based task-relevant region selection.
Figure 5.
Prompt template for semantic reasoning between task instructions and candidate part regions in the fine stage.
Figure 6.
Demonstration of Semantic-SAM limitations on textureless and homogeneous objects. Subfigure (a) shows the segmentation result produced by Semantic-SAM, while subfigure (b) shows the result obtained using the proposed GHS method.
Figure 7.
HITL correction process triggered when the automated pipeline fails. The robot avatar represents the proposed system, while the human avatar represents the user. Triggered by negative feedback (“N”), the user provides keyboard-based corrections, including error confirmation, segmentation method selection, PCA parameter M specification, final mask ID selection, and command correction.
Figure 8.
The transformation process from HITL interaction to structured information.
Figure 9.
Illustration of task affordance exemplar.
Figure 10.
Hierarchical Task-Case storage architecture for the human experience database.
Figure 11.
Example of structured storage for an experience exemplar in the human experience database. The experience database adopts a three-level directory structure, where the green box indicates the selected second-level folder and its third-level contents, and the red box highlights the corresponding .pkl file storing the structured experience information.
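A minimal sketch of how one exemplar could be written to and read from such a three-level layout. The level names (`task`, `case`), the file name `experience.pkl`, and the record fields are hypothetical and chosen only for illustration; the caption confirms only that a three-level directory structure and a .pkl file are used.

```python
import pickle
import tempfile
from pathlib import Path


def save_exemplar(root, task, case, record):
    """Write one record to <root>/<task>/<case>/experience.pkl (assumed layout)."""
    case_dir = Path(root) / task / case
    case_dir.mkdir(parents=True, exist_ok=True)
    path = case_dir / "experience.pkl"  # hypothetical file name
    with path.open("wb") as f:
        pickle.dump(record, f)
    return path


def load_exemplar(path):
    """Read a record back; only unpickle files from a trusted source."""
    with Path(path).open("rb") as f:
        return pickle.load(f)


# Usage with a throwaway directory and made-up field names.
with tempfile.TemporaryDirectory() as tmp:
    record = {"instruction": "Use a brush to brush some oil", "mask_id": 3}
    saved = save_exemplar(tmp, "brush_oil", "case_001", record)
    restored = load_exemplar(saved)
    print(saved.relative_to(tmp), restored["mask_id"])
```

Keeping one record per case folder means a retrieval step can list the task folder first and open only the cases it needs.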
Figure 12.
Retrieval-driven dynamic reasoning strategy framework. Based on the results of multi-level experience retrieval, the proposed strategy uses the Geometry Activation Gate (GAG) to dynamically schedule the dual-track perception strategy, selecting either the default Semantic-SAM branch or the GHS-based geometry branch. In the reasoning stage, the retrieval outcome further guides a dual-path reasoning strategy, switching between exemplar-alignment reasoning with retrieved task-affordance experience and default semantic reasoning. The numbers in the visual prompt denote indexed candidate part-level masks used as visual prompts for VLM-based task-relevant region selection rather than semantic category labels.
Figure 13.
Construction process of prompt templates based on dynamic assembly strategy.
Figure 14.
Exemplar-alignment reasoning prompt.
Figure 15.
Overview of the WMRA platform and experimental setup. (a) Overall structure of the WMRA system, consisting of an electric wheelchair base and a Kinova Gen2 6-DOF three-finger robotic arm; (b) experimental setup of the WMRA platform; (c) representative experimental scenarios; (d) installation configuration of the Intel RealSense D435i depth camera.
Figure 16.
Representative human experience samples for complex affordance reasoning and textureless-object tasks.
Figure 17.
Qualitative comparison of task-relevant region predictions between the baseline and the proposed method with human experience enhancement. (a) Input RGB images. (b) Retrieved task affordance exemplars via multi-level experience retrieval. The results without human experience enhancement (baseline) are highlighted in purple, while those with human experience enhancement (proposed method) are highlighted in yellow. (d) and (e) Fine-stage segmentation results of the baseline and the proposed method, respectively. (c) and (f) Corresponding task-relevant region masks.
Figure 18.
Representative failure cases of the proposed system in task-relevant region grounding. (a) Incorrect background selection caused by background regions being included in the fine-stage candidate masks; (b) coarse-stage target object localization failure, resulting in incorrect or incomplete fine-stage reasoning regions; (c) multiple plausible task-relevant regions, where the current system outputs only a single mask and may ignore other feasible manipulation regions; (d) unstable part-level candidates generated by Semantic-SAM, leading to inaccurate task-relevant region grounding.
Figure 19.
Qualitative segmentation results of GroundedSAM under occlusion, varying illumination, and complex backgrounds. Each pair shows the input RGB image with the target object label as the text prompt, and the corresponding target object segmentation result.
Figure 20.
Task-wise and overall task-relevant region segmentation success rates with 95% confidence intervals across different methods.
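The captions do not state how the 95% confidence intervals were obtained. As an illustration only, the sketch below computes a Wilson score interval for a binomial success rate; the choice of the Wilson method and the trial count of 20 per task are assumptions, not details confirmed by the source.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1.0 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the extremes.
    return max(0.0, centre - half), min(1.0, centre + half)


# Usage: 18 successes out of an assumed 20 trials (a 90% success rate).
low, high = wilson_interval(18, 20)
print(f"90% success rate, 95% CI: [{low:.3f}, {high:.3f}]")
```

With so few trials per task the interval is wide, so small differences between methods on a single task should be read with care.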
Figure 21.
Task-wise and overall task-oriented grasp success rates with 95% confidence intervals across different methods.
Figure 22.
Comparison of valid grasp regions after task-relevant mask filtering.
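One simple way to picture mask filtering is to keep only the grasp candidates whose image-plane position falls inside the task-relevant mask. The sketch below assumes each candidate has already been projected to a pixel; the dictionary keys `pixel` and `score` are hypothetical and do not reflect the output format of any particular grasp detector.

```python
def filter_grasps_by_mask(grasps, mask):
    """Keep grasps whose pixel (u=column, v=row) lies on a non-zero mask cell."""
    height = len(mask)
    width = len(mask[0]) if height else 0
    kept = []
    for grasp in grasps:
        u, v = grasp["pixel"]
        if 0 <= v < height and 0 <= u < width and mask[v][u]:
            kept.append(grasp)
    return kept


# Usage: a 3x3 binary mask whose lower-right 2x2 block is task-relevant.
mask = [
    [0, 0, 0],
    [0, 1, 1],
    [0, 1, 1],
]
candidates = [
    {"pixel": (1, 1), "score": 0.90},  # inside the mask
    {"pixel": (0, 0), "score": 0.95},  # outside the mask
    {"pixel": (5, 5), "score": 0.80},  # outside the image bounds
]
valid = filter_grasps_by_mask(candidates, mask)
best = max(valid, key=lambda g: g["score"]) if valid else None
print(len(valid), best)
```

Note that the highest-scoring candidate overall is discarded here because it lies off the task-relevant region, which is the intended effect of the filter.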
Figure 23.
Representative real-world task-oriented grasping results of our method on simple tasks.
Figure 24.
Representative real-world task-oriented grasping results of our method on more complex tasks, demonstrating its capability in handling challenging affordance reasoning scenarios.
Table 1.
Segmentation success rate comparison of different methods.
| Method | Segmentation Success Rate (%) |
|---|---|
| w/o Coarse | 53.04 |
| Baseline | 68.02 |
| Copa | 26.32 |
| AffordGrasp | 47.77 |
Table 2.
Segmentation success rate (%) under different retrieval parameter settings.
| | = 0.15 | | | = 0.2 | | |
|---|---|---|---|---|---|---|
| | = 0.6 | = 0.7 | = 0.8 | = 0.6 | = 0.7 | = 0.8 |
| 35 | 72.47 | 70.04 | 72.06 | 69.64 | 72.06 | 72.87 |
| 40 | 72.06 | 73.68 | 71.26 | 70.04 | 70.85 | 71.66 |
| 45 | 74.09 | 74.09 | 71.66 | 70.04 | 72.87 | 69.23 |
| 50 | 73.28 | 71.26 | 72.87 | 70.45 | 69.23 | 72.47 |
Table 3.
mIoU under different retrieval parameter settings.
| | = 0.15 | | | = 0.2 | | |
|---|---|---|---|---|---|---|
| | = 0.6 | = 0.7 | = 0.8 | = 0.6 | = 0.7 | = 0.8 |
| 35 | 65.97 | 63.8 | 66 | 64.58 | 66.61 | 67.11 |
| 40 | 65.91 | 67.74 | 66.21 | 65.07 | 65.83 | 65.84 |
| 45 | 67.52 | 67.47 | 66.09 | 64.87 | 66.87 | 63.79 |
| 50 | 66.96 | 65.61 | 67.09 | 65.93 | 64.02 | 66.84 |
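For reference, mIoU is conventionally the mean of the per-sample intersection-over-union between predicted and ground-truth masks. The sketch below shows that standard definition on small binary masks; treating an empty prediction paired with an empty ground truth as IoU 1.0 is an assumption, and the exact evaluation protocol behind the table is not specified here.

```python
def iou(pred, gt):
    """Intersection-over-union of two equally sized binary masks (nested lists)."""
    intersection = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            if p and g:
                intersection += 1
            if p or g:
                union += 1
    # Assumed convention: two empty masks count as a perfect match.
    return intersection / union if union else 1.0


def mean_iou(pairs):
    """Average IoU over a non-empty list of (prediction, ground truth) pairs."""
    return sum(iou(pred, gt) for pred, gt in pairs) / len(pairs)


# Usage: one half-overlapping pair and one exact match.
pairs = [
    ([[1, 1], [0, 0]], [[1, 0], [0, 0]]),  # intersection 1, union 2
    ([[0, 1], [0, 1]], [[0, 1], [0, 1]]),  # identical masks
]
print(f"mIoU = {100 * mean_iou(pairs):.2f}%")
```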
Table 4.
Segmentation success rate under different GHS threshold settings.
| Method | Segmentation Success Rate (%) |
|---|---|
| Ours ( = 0) | 68.42 |
| Ours ( = 27) | 73.28 |
| Ours ( = 29) | 70.45 |
| Ours ( = 30) | 73.68 |
| Ours ( = 32) | 76.11 |
| Ours ( = 34) | 72.87 |
| Ours ( = 36) | 76.92 |
| Ours ( = 38) | 76.92 |
| Ours ( = 40) | 77.73 |
Table 5.
Ablation study summary of segmentation success rate.
| Variant | Coarse | Fine | Exp (Semantic) | Exp (Geometry) | SSR (%) |
|---|---|---|---|---|---|
| w/o Coarse | - | ✓ | - | - | 53.04 |
| w/o Experience | ✓ | ✓ | - | - | 68.02 |
| w/o GHS | ✓ | ✓ | ✓ | - | 68.42 |
| Ours (Full) | ✓ | ✓ | ✓ | ✓ | 77.73 |
Table 6.
Target object segmentation success rate (%) across different methods.
| Human Instruction | Baseline | Ours | Copa | AffordGrasp |
|---|---|---|---|---|
| Cut with the knife | 90 | 90 | 100 | 65 |
| Grasp the fork to poke | 100 | 100 | 90 | 20 |
| Use the spoon to scoop | 95 | 90 | 100 | 100 |
| Grasp the hammer to beat nail | 100 | 95 | 85 | 10 |
| Comb with a comb | 80 | 100 | 85 | 65 |
| Use a brush to brush some oil | 100 | 90 | 100 | 55 |
| Sweep the hammer across the table | 95 | 100 | 90 | 0 |
| Hang the cup on the cup holder | 95 | 95 | 95 | 60 |
| Avg. | 94.375 | 95 | 93.125 | 46.875 |
Table 7.
Task-wise and average task-relevant region segmentation success rate (%) across different methods.
| Human Instruction | Baseline | Ours | Copa | AffordGrasp |
|---|---|---|---|---|
| Cut with the knife | 85 | 85 | 25 | 60 |
| Grasp the fork to poke | 80 | 85 | 50 | 20 |
| Use the spoon to scoop | 85 | 85 | 55 | 75 |
| Grasp the hammer to beat nail | 85 | 80 | 60 | 10 |
| Comb with a comb | 40 | 100 | 25 | 0 |
| Use a brush to brush some oil | 0 | 90 | 25 | 55 |
| Sweep the hammer across the table | 20 | 100 | 20 | 0 |
| Hang the cup on the cup holder | 5 | 95 | 10 | 0 |
| Avg. | 50 | 90 | 33.75 | 27.50 |
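The "Avg." row is the unweighted mean of the eight task-wise rates in each column. The short check below reproduces it from the values in the table above; the variable names are illustrative only.

```python
# Task-wise success rates (%) copied from the table, in row order.
task_rates = {
    "Baseline": [85, 80, 85, 85, 40, 0, 20, 5],
    "Ours": [85, 85, 85, 80, 100, 90, 100, 95],
    "Copa": [25, 50, 55, 60, 25, 25, 20, 10],
    "AffordGrasp": [60, 20, 75, 10, 0, 55, 0, 0],
}

# Unweighted mean over tasks, one value per method.
averages = {method: sum(rates) / len(rates) for method, rates in task_rates.items()}

for method, avg in averages.items():
    print(f"{method}: {avg:.2f}")
```

Because every task contributes equally, the average is pulled down sharply by the four tasks on which the comparison methods score at or near zero.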
Table 8.
Task-wise and average grasp success rate (%) across different methods.
| Human Instruction | Baseline | Ours | Copa | AffordGrasp | AnyGrasp |
|---|---|---|---|---|---|
| Cut with the knife | 70 | 75 | 30 | 50 | 70 |
| Grasp the fork to poke | 75 | 70 | 65 | 20 | 45 |
| Use the spoon to scoop | 70 | 85 | 60 | 80 | 45 |
| Grasp the hammer to beat nail | 90 | 70 | 50 | 25 | 60 |
| Comb with a comb | 20 | 80 | 35 | 10 | 45 |
| Use a brush to brush some oil | 0 | 80 | 35 | 25 | 25 |
| Sweep the hammer across the table | 10 | 75 | 40 | 0 | 50 |
| Hang the cup on the cup holder | 20 | 65 | 10 | 5 | 30 |
| Avg. | 44.375 | 75 | 40.625 | 26.875 | 46.25 |